## Shared reference point

This notebook tests the potential upside of a similarity system built on shared reference points, as described in [these blog posts](https://softwaredoug.com/blog/2023/03/12/reconstruct-dot-product-from-other-dot-products.html).
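
To make the idea concrete, here is a minimal sketch (not taken from the linked posts) of the one case where the reconstruction is exact. When the reference points form a complete orthonormal basis, the dot product of two vectors equals the sum, over all references, of the products of their dot products with each reference. The names `refs_demo`, `u`, and `v` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = 8

# Orthonormal reference points: Q from the QR decomposition of a random
# square matrix (an assumption made purely for illustration).
basis, _ = np.linalg.qr(rng.normal(size=(dims, dims)))
refs_demo = basis.T  # one reference vector per row

u = rng.normal(size=dims)
v = rng.normal(size=dims)

# Dot each vector with every reference, multiply the scores pairwise, and sum.
reconstructed = np.sum((refs_demo @ u) * (refs_demo @ v))
exact = np.dot(u, v)

print(reconstructed, exact)
```

With fewer references than dimensions, or references that are only roughly orthogonal, the sum becomes an approximation rather than an identity, which is the situation the rest of this notebook explores.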

In [1]:
import numpy as np

def load_sentences():
    # From
    # https://www.kaggle.com/datasets/softwaredoug/wikipedia-sentences-all-minilm-l6-v2
    with open('wikisent2_all.npz', 'rb') as f:
        vects = np.load(f)
        vects = vects['arr_0']
        # vects = np.stack(vects)
        # Compute the norms once and reuse them for both bounds
        norms = np.linalg.norm(vects, axis=1)
        all_normed = (norms > 0.99) & (norms < 1.01)
        assert all_normed.all(), "Something is wrong - vectors are not normalized!"

    with open('wikisent2.txt', 'rt') as f:
        sentences = f.readlines()

    return sentences, vects

sentences, vects = load_sentences()

# Shrink by 50% for the RAM savings
sentences = sentences[::2]
vects = vects[::2]
vects.shape, len(sentences)

((3935913, 384), 3935913)

## Generating random vectors

Generate random unit vectors to serve as reference points for our sentences.

In [2]:
def random_vector(num_dims=768):
    """ Sample a unit vector from a sphere in N dimensions.
    It's actually important this is gaussian
    https://stackoverflow.com/questions/59954810/generate-random-points-on-10-dimensional-unit-sphere
    IE Don't do this
        projection = np.random.random_sample(size=num_dims)
        projection /= np.linalg.norm(projection)
    """
    projection = np.random.normal(size=num_dims)
    projection /= np.linalg.norm(projection)
    return projection

random_vector(num_dims=vects.shape[1])

array([ 0.00760355,  0.00564115,  0.00307477,  0.00794051,  0.01702631,
       -0.02395061, -0.03561788,  0.08518245,  0.00295837,  0.00715736,
        0.04918901, -0.01853059, -0.06036055,  0.04139638,  0.06934402,
        0.02232178, -0.05855475, -0.06564847,  0.01759007, -0.07336281,
        0.08830979,  0.00411016, -0.03485923, -0.03820476,  0.00479814,
        0.01169138, -0.00220678,  0.06106189, -0.05743143,  0.02980933,
        0.02837479,  0.09437317,  0.10529041, -0.00140369, -0.01190975,
       -0.00450068, -0.04502719, -0.09858508,  0.00428388, -0.09976617,
       -0.00809057,  0.00092912,  0.04912673, -0.00480625,  0.03038182,
        0.02902806, -0.04036919,  0.00361974,  0.02427557,  0.00165492,
        0.04295609, -0.07705088, -0.03549943,  0.03061947,  0.01605988,
       -0.010294  ,  0.04588686, -0.00726947, -0.02004883, -0.09639053,
       -0.0604207 , -0.03903797, -0.03630089, -0.03908148,  0.05315883,
        0.00076439, -0.10764345, -0.04615793,  0.06200037, -0.01
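
The docstring's warning can be checked numerically. The sketch below compares the average dot product between pairs of vectors from each sampler (the helper names `gaussian_unit`, `uniform_unit`, and `mean_pairwise_dot` are hypothetical, not part of the notebook). Vectors built from uniform samples on [0, 1) have only non-negative components, so they crowd into a single orthant and point in similar directions, whereas normalized Gaussian samples spread evenly over the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims, num_samples = 384, 1000

def gaussian_unit(n, d):
    """Rows are normalized Gaussian samples (uniform on the sphere)."""
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def uniform_unit(n, d):
    """Rows are normalized uniform [0, 1) samples (the 'don't do this' case)."""
    x = rng.random(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mean_pairwise_dot(x):
    """Average dot product over all pairs of distinct rows."""
    sims = x @ x.T
    off_diag = sims[~np.eye(len(x), dtype=bool)]
    return off_diag.mean()

gauss_mean = mean_pairwise_dot(gaussian_unit(num_samples, num_dims))
unif_mean = mean_pairwise_dot(uniform_unit(num_samples, num_dims))
print(gauss_mean, unif_mean)
```

Under these assumptions the Gaussian mean should sit near 0 and the uniform mean near 0.75, since for uniform samples on [0, 1) the expected product of two independent components is 1/4 and the expected square of one component is 1/3.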

## Generate N reference points

Can we reconstruct the dot product of two encoded Wikipedia sentences (`vects`) from their relationship to intermediate reference point vectors? Here we generate 1,000 random reference points to test that.

In [13]:
num_projections = 1000

refs = np.zeros((num_projections, vects.shape[1]), dtype=np.float32)

for ref_ord in range(0, num_projections):
    refs[ref_ord] = random_vector(num_dims=vects.shape[1])
    
refs

array([[-0.02580631,  0.01371638, -0.02826618, ...,  0.08710439,
        -0.04491221,  0.03845857],
       [ 0.13163552, -0.03846619,  0.01430145, ..., -0.02832336,
         0.01199875, -0.07175732],
       [ 0.03133534,  0.02555911, -0.02665497, ...,  0.07151171,
        -0.00102358,  0.09846599],
       ...,
       [-0.0186919 , -0.01071524,  0.02203381, ...,  0.02008144,
         0.07727219, -0.00406071],
       [ 0.062606  , -0.01165775,  0.02850094, ..., -0.14749484,
         0.0262267 , -0.0266069 ],
       [-0.00778042,  0.04405979,  0.03943392, ...,  0.00355453,
        -0.04206917, -0.04399458]], dtype=float32)
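
As an alternative to the loop, a reference matrix of the same kind can be drawn in one vectorized call. The sketch below is a variant under stated assumptions: it uses `np.random.default_rng` with a fixed seed rather than the notebook's `random_vector`, and the name `refs_demo` is made up. It also checks how close to orthogonal the references are. For random unit vectors in d dimensions, pairwise dot products have mean 0 and standard deviation 1/sqrt(d).

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: an assumption, for reproducibility
num_projections, num_dims = 1000, 384  # 384 matches the sentence vectors here

# Draw all references at once, then normalize each row to unit length.
refs_demo = rng.normal(size=(num_projections, num_dims)).astype(np.float32)
refs_demo /= np.linalg.norm(refs_demo, axis=1, keepdims=True)

# Dot products between every pair of distinct references.
gram = refs_demo @ refs_demo.T
off_diag = gram[~np.eye(num_projections, dtype=bool)]

# Observed spread next to the theoretical 1 / sqrt(d).
print(off_diag.std(), 1 / np.sqrt(num_dims))
```

A spread of about 0.05 means the references are only approximately orthogonal, so any reconstruction built on them carries some noise.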

## Generate a ground truth 

Given a query, let's look at its ground truth so we can check recall when using just the reference points.

In [14]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

query = "mary had a little lamb"

def search_ground_truth(query, at=10):
    query_vector = model.encode(query)
    nn = np.dot(vects, query_vector)
    # argpartition finds the top `at` scores but leaves them unordered
    top_n = np.argpartition(-nn, at)[:at]
    # Order those candidates by score, highest first
    top_n = top_n[np.argsort(-nn[top_n])]
    return list(zip(top_n, nn[top_n]))

gt_ords = set()
for vect_ord, score in search_ground_truth(query):
    gt_ords.add(vect_ord)
    print(vect_ord, score, sentences[vect_ord])

1996387 0.700519 "Mary Had a Little Lamb", who wrote the novel under the name of Sara J. Hale.

1997224 0.6153624 Mary then went into labor.

1418816 0.61314523 It begins with the melody of the popular children's song "Mary Had a Little Lamb" and then cuts into the main riff, punctuated with a high trumpet trill.

1887627 0.5563892 Lamb and is wife, Sara, have one son.

775918 0.54064065 For the more domestic and intimate iconic representations of Mary with the infant Jesus on her lap, see Madonna and Child.

1393341 0.5362675 In this variant he shows the Christ Child in Mary's lap.

611431 0.5288173 Did Jesus Have a Dog?

3447108 0.52027595 The songs exclusive to this release are "Call Me Claus," "Mary Had a Little Lamb," and "'Zat You, Santa Claus?".

2991842 0.5185638 The final two lines detail the former's lamb feast, which resuscitates it.

3120137 0.5133202 The "Lady" is the Virgin Mary.



## Every sentence to every reference

Our 'index' is the dot product of every sentence with every reference point.

In [15]:
index = np.dot(vects, refs.T)
index.shape

(3935913, 1000)

## Most similar reference points

Find the 200 reference points most similar to our query.

In [16]:
query_vector = model.encode(query)

query_ords = 200

dotted = np.dot(refs, query_vector)
best_ref_ords = np.argsort(-dotted)[:query_ords]
best_ref_ords

array([ 78, 573, 694, 473, 580, 625, 945, 353, 190, 361, 965, 704, 185,
       326, 734, 156, 829, 564, 278, 794, 671, 253, 342, 976, 408, 521,
       842, 152, 178, 465, 374, 992, 243, 275, 475, 635, 636,   5, 780,
       314, 517, 520,  54, 948, 677, 352, 562, 423, 659, 876, 674, 552,
       695, 387, 405, 522, 400, 343, 401, 134, 280, 246, 878, 706, 295,
        14, 954, 582,  17, 499, 617, 907, 634, 225,  41, 274, 160, 109,
        77, 284,  39, 942, 144, 944, 175,   0,  34, 810, 538, 306, 568,
        67,  66, 157, 549, 894,   7, 858, 439, 494, 898, 488, 115, 608,
       479, 168, 449,  15, 737, 136, 777, 199, 828, 528, 729, 228, 241,
       188, 445, 904, 363, 203, 108, 825, 526, 482, 658, 657, 554, 772,
       357,  90, 269,  57, 434,   8, 715, 392,  31, 261, 571, 512, 768,
       953, 678, 334,  58, 796,  10, 266, 579, 153, 698, 435, 739,  84,
       586, 775, 500, 883, 836, 652, 341, 701, 927, 289, 711,  60, 999,
       626,  29, 957,  99, 922, 708, 184, 394, 133,  20, 373, 89

In [17]:
best_ref_dotted = dotted[best_ref_ords]
best_ref_dotted

array([0.15665406, 0.14806001, 0.14440572, 0.1398163 , 0.13917927,
       0.12176482, 0.12141026, 0.11831698, 0.11796774, 0.11738214,
       0.11579491, 0.11318826, 0.11149937, 0.11138718, 0.11067988,
       0.10883573, 0.10764483, 0.10531989, 0.10447416, 0.103785  ,
       0.10335778, 0.10333985, 0.10209368, 0.10177754, 0.1015517 ,
       0.10016114, 0.10009142, 0.09980313, 0.0990949 , 0.09804557,
       0.09727894, 0.09713944, 0.0966715 , 0.09591183, 0.09546015,
       0.09545984, 0.0948897 , 0.09486027, 0.09418116, 0.09417465,
       0.09311642, 0.09292433, 0.0926735 , 0.0925833 , 0.0920528 ,
       0.09146754, 0.08990863, 0.08815593, 0.08677662, 0.08645993,
       0.08341567, 0.08198702, 0.08167663, 0.08113563, 0.07996854,
       0.07991424, 0.07966755, 0.07879308, 0.07738514, 0.07689548,
       0.07688595, 0.07655959, 0.07613555, 0.07608298, 0.07551803,
       0.07508269, 0.07493024, 0.07453327, 0.07352698, 0.07284524,
       0.0727124 , 0.07237699, 0.07210187, 0.07173758, 0.07168

## Combine with query vector dotted with ref

We simply multiply each sentence's dot product with a reference point by the query's dot product with that same reference point.

In [18]:
every_dotted = index[:, best_ref_ords] * best_ref_dotted
every_dotted[0]

array([-1.43851514e-03,  7.85502233e-03, -4.77811648e-03, -1.14727404e-03,
        1.15502216e-02, -1.51472576e-02, -1.28390053e-02,  6.00535981e-03,
        5.04331011e-03,  9.99870594e-04,  9.01861116e-04,  2.44721723e-05,
       -1.41904503e-03,  1.07748900e-03,  3.02147539e-03,  4.87114117e-03,
        4.76205675e-03,  3.76953464e-03, -8.64543719e-04, -7.12581386e-04,
        7.20937224e-03, -7.14131398e-04, -3.55101074e-03,  9.44371743e-04,
       -6.87509729e-03,  3.92995396e-04, -1.88368373e-03, -5.48755331e-03,
       -8.31113255e-04,  1.79437315e-03,  1.27115415e-03, -9.22204112e-04,
       -3.97980586e-03, -8.08526855e-03,  8.70212039e-04,  6.34620246e-03,
       -3.27110454e-03, -4.48477734e-03, -3.41251725e-03,  3.69645143e-03,
       -3.78966884e-04,  1.08029284e-02, -2.72582099e-03, -6.10991847e-03,
       -9.86705441e-03,  2.43428236e-04, -1.33861776e-03,  3.68477195e-03,
        7.11842673e-04,  1.83926499e-03, -5.29453624e-04,  1.08456115e-04,
        7.04920338e-03, -

## Sum to combine each ref for each vector

With 200 refs spread over hundreds of dimensions, they're most likely close to orthogonal, so let's just sum them as a first pass. This is now just a sum across the columns of each row.


In [19]:
vects_scored = np.sum(every_dotted, axis=1)
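
To see why a plain sum can act as a similarity estimate, here is a small synthetic sketch, independent of the notebook's data (`refs_demo`, `u`, `v`, and `unit_rows` are made-up names). For references drawn uniformly from the unit sphere in d dimensions, each product (u·r)(v·r) has expected value (u·v)/d, so summing over n references and rescaling by d/n approximates the true dot product. The notebook instead keeps only the references most similar to the query, so this rescaling is an assumption of the sketch and does not carry over directly to the notebook's scores.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims, num_refs = 64, 20000

def unit_rows(x):
    """Normalize along the last axis (works for one vector or a matrix)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

refs_demo = unit_rows(rng.normal(size=(num_refs, num_dims)))

# Two unit vectors constructed to be fairly similar to each other.
u = unit_rows(rng.normal(size=num_dims))
v = unit_rows(u + 0.5 * unit_rows(rng.normal(size=num_dims)))

true_dot = np.dot(u, v)

# Sum the per-reference products, then rescale by d / n.
estimate = (num_dims / num_refs) * np.sum((refs_demo @ u) * (refs_demo @ v))
print(true_dot, estimate)
```

For ranking purposes the rescaling does not matter, because every sentence is multiplied by the same constant; it only matters when comparing the summed scores to true dot products.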

## Compare to ground truth for recall

In [20]:
best_vect_ords = np.argsort(-vects_scored)[:10]
# Use a new name so the query-to-ref `dotted` from In [16] is not overwritten
best_scores = vects_scored[best_vect_ords]

list(zip(best_vect_ords, best_scores))

[(1996387, 0.7717469),
 (1418816, 0.681876),
 (1997224, 0.65832865),
 (775918, 0.6508553),
 (2306229, 0.61892146),
 (1436739, 0.5988822),
 (1887625, 0.59862524),
 (1383224, 0.59599453),
 (1642415, 0.5866407),
 (2991842, 0.5858491)]

In [21]:
set(best_vect_ords) & gt_ords

{775918, 1418816, 1996387, 1997224, 2991842}
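
The overlap can be turned into a recall figure. A minimal sketch, using the ids printed in the two result lists above (the helper name `recall_at_k` is hypothetical):

```python
def recall_at_k(approx_ids, truth_ids):
    """Fraction of ground-truth ids that the approximate search recovered."""
    truth = set(truth_ids)
    return len(truth & set(approx_ids)) / len(truth)

# Ids copied from the ground-truth output and from the reference-point output.
truth_ids = [1996387, 1997224, 1418816, 1887627, 775918,
             1393341, 611431, 3447108, 2991842, 3120137]
approx_ids = [1996387, 1418816, 1997224, 775918, 2306229,
              1436739, 1887625, 1383224, 1642415, 2991842]

print(recall_at_k(approx_ids, truth_ids))  # 0.5
```

In the notebook itself the equivalent call would be `recall_at_k(best_vect_ords, gt_ords)`. Five of the ten ground-truth neighbors are recovered for this query.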