From the shared reference point [proof of concept](http://localhost:8888/lab/tree/vector_search/Shared%20Reference%20Proof%20of%20Concept.ipynb) we saw that we got OK-ish recall using a few thousand reference points and querying with 100. We would like to:

1. Increase number of reference points
2. Decrease the number needed at query time

In a system built on this approach, that combination would increase recall with the least impact on query-time performance.

This notebook performs a grid search over the possible values.

## Load sentences

Reminder: we use MiniLM-encoded sentences from Wikipedia, sampled down by 50% due to memory constraints.

In [None]:
import numpy as np

def load_sentences():
    # From
    # https://www.kaggle.com/datasets/softwaredoug/wikipedia-sentences-all-minilm-l6-v2
    with open('wikisent2_all.npz', 'rb') as f:
        wiki_vects = np.load(f)
        wiki_vects = wiki_vects['arr_0']
        # vects = np.stack(vects)
        all_normed = (np.linalg.norm(wiki_vects, axis=1) > 0.99) & (np.linalg.norm(wiki_vects, axis=1) < 1.01)
        assert all_normed.all(), "Something is wrong - vectors are not normalized!"

    with open('wikisent2.txt', 'rt') as f:
        wiki_sentences = f.readlines()

    return wiki_sentences, wiki_vects

sentences, vects = load_sentences()
del sentences  # don't really care about the sentence text here

# Shrink by 50% for the RAM savings
# sentences = sentences[::2]
vects = vects[::2]
vects.shape

## Build index

As per the proof of concept:

- Function to generate random vectors
- Build index of reference points with dot products back to main vectors
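
The two steps above can be sketched on toy data. This is only an illustration under assumed names (`toy_vects`, `toy_refs`, `toy_index` and `unit_rows` are hypothetical, and the 8 dimensions are a stand-in for the real embedding size): the index is just the matrix of dot products between every vector and every reference point.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(rows):
    """Normalize each row to unit length."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

toy_vects = unit_rows(rng.normal(size=(1000, 8)))  # stand-in for the corpus vectors
toy_refs = unit_rows(rng.normal(size=(5, 8)))      # random reference points

# One row per vector, one column per reference point
toy_index = toy_vects @ toy_refs.T

print(toy_index.shape)
```

Each entry `toy_index[i, j]` is the dot product of vector `i` with ref `j`, which is why memory grows with the number of index-time refs.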

In [None]:
def random_vector(num_dims=768):
    """ Sample a unit vector from a sphere in N dimensions.
    It's actually important this is gaussian
    https://stackoverflow.com/questions/59954810/generate-random-points-on-10-dimensional-unit-sphere
    IE Don't do this
        projection = np.random.random_sample(size=num_dims)
        projection /= np.linalg.norm(projection)
    """
    projection = np.random.normal(size=num_dims)
    projection /= np.linalg.norm(projection)
    return projection

random_vector(num_dims=vects.shape[1])
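
To see why the Gaussian matters, here is a small sketch comparing the two sampling approaches from the docstring. The helper names (`unit_rows`, `mean_pairwise_dot`) and the choice of 200 samples are assumptions for illustration. Uniform samples in [0, 1) have only non-negative components, so after normalizing they all sit in one corner of the sphere and point in similar directions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims = 384  # assumed here to match the MiniLM embedding size

def unit_rows(samples):
    """Normalize each row to unit length."""
    return samples / np.linalg.norm(samples, axis=1, keepdims=True)

gaussian = unit_rows(rng.normal(size=(200, num_dims)))
uniform = unit_rows(rng.random(size=(200, num_dims)))  # components in [0, 1)

def mean_pairwise_dot(rows):
    """Average dot product over all distinct pairs of rows."""
    dots = rows @ rows.T
    off_diagonal = ~np.eye(len(rows), dtype=bool)
    return dots[off_diagonal].mean()

gaussian_mean = mean_pairwise_dot(gaussian)
uniform_mean = mean_pairwise_dot(uniform)
print(gaussian_mean, uniform_mean)
```

The Gaussian-based vectors should come out close to orthogonal on average, while the uniform-based ones should be strongly correlated with each other.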

In [None]:
def build_index(vects, num_refs=1000, refs_factory=random_vector):

    refs = np.zeros((num_refs, vects.shape[1]), dtype=np.float32)

    for ref_ord in range(0, num_refs):
        refs[ref_ord] = refs_factory(num_dims=vects.shape[1])
    # Memory gets sucked up here :)
    index = np.dot(vects, refs.T)

    return refs, index

## Search ground truth

Here's the ground truth for the search: a brute-force dot product against every vector, using the same MiniLM model that encoded the vectors.

In [8]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

query = "mary had a little lamb"

def search_ground_truth(vects, query, at=10):
    query_vector = model.encode(query)
    nn = np.dot(vects, query_vector)
    top_n = np.argpartition(-nn, at)[:at]
    top_n = top_n[nn[top_n].argsort()[::-1]]
    return sorted(zip(top_n, nn[top_n]),
                  key=lambda scored: scored[1],
                  reverse=True)

gt_ords = set()
for vect_ord, score in search_ground_truth(vects, query):
    gt_ords.add(vect_ord)
    print(vect_ord, score)

1996387 0.700519
1997224 0.6153624
1418816 0.61314523
1887627 0.5563892
775918 0.54064065
1393341 0.5362675
611431 0.5288173
3447108 0.52027595
2991842 0.51856375
3120137 0.5133202
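
As a sanity check on the top-N selection used in `search_ground_truth`, this sketch compares `np.argpartition` followed by a small sort against a full `np.argsort`. The `scores` array is a random stand-in for `np.dot(vects, query_vector)`.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)  # stand-in for np.dot(vects, query_vector)
at = 10

# argpartition puts the `at` best scores first, in no particular order
top_unsorted = np.argpartition(-scores, at)[:at]
# so only those `at` entries need a real sort
top_sorted = top_unsorted[np.argsort(-scores[top_unsorted])]

# Reference answer: sort everything, then take the first `at`
full_sort = np.argsort(-scores)[:at]

print(top_sorted)
print(full_sort)
```

Both approaches give the same ordinals; the partition just avoids sorting the whole array.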


## (Inefficient) search function

We use the most accurate (though least efficient) form of the reference point search: it looks at every vector's dot product with each of the selected reference points.

In [24]:
from project import project
import math


def best_refs(refs, query_vector, num_refs=200):
    dotted = np.dot(refs, query_vector)
    best_ref_ords = np.argsort(-dotted)[:num_refs]
    ref_ords_span = []
    sins = []
    
    # Compute sin(theta), where theta is the angle between this ref and the span of the refs
    # inserted thus far, so we know how much added information this ref contains
    # see
    # https://softwaredoug.com/blog/2023/03/12/reconstruct-dot-product-from-other-dot-products.html
    # Generally when vectors are near orthogonal, the sins are near 1.0, and it doesn't matter.
    # but it DOES matter if the ref vectors are NOT random
    for idx, ref_ord in enumerate(best_ref_ords):
        if len(ref_ords_span) == 0:
            sins.append(1.0)
        else:
            proj = project(refs[ref_ord], refs[ref_ords_span])
            dot = np.dot(proj, refs[ref_ord])
            if dot > 1.0 and dot < 1.0001:
                dot = 1.0
            if dot < 0.0 and dot > -0.0001:
                dot = 0.0
            assert (dot <= 1.0 and dot >= 0.0), f"Dot product out of range - {dot}"
            angle = math.acos(dot)
            sin_theta = math.sin(angle)
            sins.append(sin_theta)
        ref_ords_span.append(ref_ord)
        
    return best_ref_ords, dotted[best_ref_ords], sins

def search(index, refs, query, num_refs=200, use_sins=True, debug=False):

    query_vector = model.encode(query)    
    best_ref_ords, dotted, sins = best_refs(refs, query_vector, num_refs=num_refs)
    # print(dotted, best_ref_ords, sins)
    
    every_dotted = index[:, best_ref_ords] * dotted
    if use_sins:
        every_dotted *= sins
        
    vects_scored = np.sum(every_dotted, axis=1)
        
    best_vect_ords = np.argsort(-vects_scored)[:10]
    scored = vects_scored[best_vect_ords]
    if debug:
        return list(zip(best_vect_ords, scored)), best_ref_ords, dotted, sins
    else:
        return list(zip(best_vect_ords, scored))

refs, index = build_index(vects, num_refs=100)
search(index, refs, query="mary had a litle lamb", num_refs=10)

[(2179728, 0.07834275),
 (498610, 0.06783752),
 (426544, 0.06624307),
 (1996890, 0.0654427),
 (1428519, 0.06461048),
 (1012713, 0.06454661),
 (3883316, 0.06427097),
 (3492758, 0.064007975),
 (901803, 0.063778706),
 (2938803, 0.06312483)]
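
The scoring in `search` sums, over the chosen refs, the product of each vector's dot product with the ref and the query's dot product with the ref. A minimal sketch of why that approximates the true dot product, under the assumption (not true of random refs in general) that the refs form a full orthonormal basis: in that special case the sum over all refs equals the exact dot product, and using only the best few refs gives a partial sum. The QR-based basis and the 8 dimensions are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims = 8

# Hypothetical refs: an orthonormal basis taken from a QR decomposition
basis, _ = np.linalg.qr(rng.normal(size=(num_dims, num_dims)))
refs = basis.T  # one unit-length ref per row, all mutually orthogonal

vect = rng.normal(size=num_dims)
vect /= np.linalg.norm(vect)
query_vector = rng.normal(size=num_dims)
query_vector /= np.linalg.norm(query_vector)

vect_to_refs = refs @ vect           # what one row of `index` holds
query_to_refs = refs @ query_vector  # what `dotted` holds in search()

exact = np.dot(vect, query_vector)
full = np.sum(vect_to_refs * query_to_refs)

# Using only the 3 refs closest to the query gives a partial sum
best = np.argsort(-query_to_refs)[:3]
partial = np.sum(vect_to_refs[best] * query_to_refs[best])

print(exact, full, partial)
```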

## Search over sample of queries

Using a handful of queries, let's do a search varying:

* `num_query_refs` - the query time refs to score against the query's vector
* `num_index_refs` - the number of index time refs to use when constructing the index

### Generate ground truths

Get a ground truth for each test query so we can compute recall against it.

In [10]:
from collections import defaultdict

test_queries = ["what is a cat", "where is spain", "what is the capital of spain", 
"who framed roger rabbit", "free willy", "bed bath and beyond", "hats and stuff", "bed bath beyond",
"do you even paginate bro?", "mary had a little lamb"]

ground_truths = defaultdict(set)
for query in test_queries:
    for vect_ord, score in search_ground_truth(vects, query):
        ground_truths[query].add(vect_ord)
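
Recall here means the fraction of the ground-truth top 10 that the approximate search also returns. A minimal sketch of that calculation, with a hypothetical helper `recall_at_k` and made-up ordinals:

```python
def recall_at_k(result_ords, ground_truth_ords, k=10):
    """Fraction of the k ground-truth neighbors that appear in the results."""
    found = set(result_ords) & set(ground_truth_ords)
    return len(found) / k

# Toy ordinals, not taken from the real index
toy_results = [3, 8, 15, 21, 42]
toy_ground_truth = {3, 7, 9, 15, 42}

toy_recall = recall_at_k(toy_results, toy_ground_truth, k=5)
print(toy_recall)
```

Three of the five ground-truth ordinals are found, so the recall is 0.6.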


## Grid search - observing that query refs, but not index size, matter

The number of index-time refs does not seem to matter beyond a point (~500), while the number of query-time refs clearly does. This suggests that having more refs to pick from, and so finding 'better' ones for a query, does not reconstruct the dot product more efficiently. Further, the 250-500 range where index size starts to matter eerily corresponds to the dimensionality of 384.
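
One possible (unverified) reading of the link to dimensionality is linear algebra: once there are as many random refs as dimensions, additional refs are linear combinations of the earlier ones. The sketch below only demonstrates that rank fact, using 16 dimensions as a small stand-in for 384; it does not show that this is what drives the recall numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims = 16  # small stand-in for the 384 dimensions

ranks = {}
for num_refs in [4, 8, 16, 32, 64]:
    refs = rng.normal(size=(num_refs, num_dims))
    refs /= np.linalg.norm(refs, axis=1, keepdims=True)
    # Rank = number of linearly independent refs
    ranks[num_refs] = int(np.linalg.matrix_rank(refs))

print(ranks)
```

The rank grows with the number of refs until it reaches `num_dims`, then stays there.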

In [12]:
from collections import defaultdict
import pandas as pd
from statistics import mean


def grid_search(refs_factory=random_vector):

    num_search_rounds = 10

    results = []

    search_index_refs = [1500, 1250, 1000, 750, 500, 250]
    for num_index_refs in search_index_refs:
        refs, index = build_index(vects,
                                  num_refs=num_index_refs,
                                  refs_factory=refs_factory)
        for num_query_refs in [10, 20, 30, 100, 200]:
            for use_sins in [True, False]:
                test_results = defaultdict(set)

                recalls = []
                for query in test_queries:
                    query_search_results = search(index, refs, query, 
                                                  num_refs=num_query_refs, use_sins=use_sins)
                    test_results[query] = set([vect_ord for vect_ord, _ in query_search_results])
                    intersection = test_results[query] & ground_truths[query]
                    recalls.append(len(intersection) / 10)

                print(num_index_refs, num_query_refs, use_sins, mean(recalls))
                results.append({'num_index_refs': num_index_refs,
                                'num_query_refs': num_query_refs,
                                'use_sins': use_sins,
                                'mean': mean(recalls), 
                                'max': max(recalls),
                                'min': min(recalls)})

    return results

pd.DataFrame(grid_search())

1500 10 True 0.32
1500 10 False 0.32
1500 20 True 0.48
1500 20 False 0.48
1500 30 True 0.5
1500 30 False 0.5
1500 100 True 0.6
1500 100 False 0.61
1500 200 True 0.65
1500 200 False 0.65
1250 10 True 0.23
1250 10 False 0.23
1250 20 True 0.39
1250 20 False 0.39
1250 30 True 0.47
1250 30 False 0.45999999999999996
1250 100 True 0.62
1250 100 False 0.61
1250 200 True 0.7
1250 200 False 0.68
1000 10 True 0.19
1000 10 False 0.19
1000 20 True 0.36
1000 20 False 0.36
1000 30 True 0.44
1000 30 False 0.44
1000 100 True 0.61
1000 100 False 0.6
1000 200 True 0.68
1000 200 False 0.63
750 10 True 0.21
750 10 False 0.21
750 20 True 0.31
750 20 False 0.31
750 30 True 0.33999999999999997
750 30 False 0.33999999999999997
750 100 True 0.56
750 100 False 0.57
750 200 True 0.64
750 200 False 0.66
500 10 True 0.19
500 10 False 0.19
500 20 True 0.35
500 20 False 0.35
500 30 True 0.42
500 30 False 0.42
500 100 True 0.59
500 100 False 0.6
500 200 True 0.57
500 200 False 0.57
250 10 True 0.16
250 10 False 0.16
2

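| | num_index_refs | num_query_refs | use_sins | mean | max | min |
|---|---|---|---|---|---|---|
| 0 | 1500 | 10 | True | 0.32 | 0.7 | 0.0 |
| 1 | 1500 | 10 | False | 0.32 | 0.7 | 0.0 |
| 2 | 1500 | 20 | True | 0.48 | 1.0 | 0.2 |
| 3 | 1500 | 20 | False | 0.48 | 1.0 | 0.2 |
| 4 | 1500 | 30 | True | 0.5 | 0.9 | 0.2 |
| 5 | 1500 | 30 | False | 0.5 | 0.9 | 0.2 |
| 6 | 1500 | 100 | True | 0.6 | 1.0 | 0.3 |
| 7 | 1500 | 100 | False | 0.61 | 1.0 | 0.3 |
| 8 | 1500 | 200 | True | 0.65 | 1.0 | 0.2 |
| 9 | 1500 | 200 | False | 0.65 | 1.0 | 0.2 |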


## Try building refs from text from the corpus (but outside the index)

As we sampled every other sentence, what happens when we sample sentences not included in the index as our reference points?

In [18]:
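_, vects_sample = load_sentences()
del _
# Keep only the odd rows (the ones skipped when indexing) BEFORE sampling,
# so every reference point really does come from outside the index
vects_sample = vects_sample[1::2]
sample_ords = np.random.choice(len(vects_sample), size=10000, replace=False)
vects_sample = vects_sample[sample_ords]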

def vectors_from_text(num_dims):
    ref_from_vects = np.random.randint(0, len(vects_sample))
    # print(np.array(vects_sample[ref_from_vects]).shape)
    return np.array(vects_sample[ref_from_vects])

results = pd.DataFrame(grid_search(refs_factory=vectors_from_text))
results

1500 10 True 0.060000000000000005
1500 10 False 0.07
1500 20 True 0.1
1500 20 False 0.07
1500 30 True 0.07
1500 30 False 0.07
1500 100 True 0.030000000000000002
1500 100 False 0.02
1500 200 True 0.01


KeyboardInterrupt: 

## We observe pretty poor performance, why?

Intentionally using 'better' / 'closer' reference vectors does not help; recall is far worse than with random refs. This goes against the intuition that closer refs would capture 'most' of the dot product information sooner. Why is this?

Let's debug one case.

In [63]:
refs, index = build_index(vects,
                          num_refs=500,
                          refs_factory=vectors_from_text)

query = 'mary had a little lamb'
query_search_results, best_ref_ords, refs_dotted, _ = search(index, refs, query, num_refs=10, use_sins=True, debug=True)

In [64]:
query_search_results

[(2523985, 0.7074686),
 (1044716, 0.68161774),
 (2474909, 0.6744265),
 (928745, 0.6611039),
 (1144829, 0.6544002),
 (2523986, 0.65337944),
 (2474895, 0.6525349),
 (2515250, 0.649726),
 (1041612, 0.64874685),
 (1044637, 0.646878)]

In [65]:
ground_truths['mary had a little lamb']

{611431,
 775918,
 1393341,
 1418816,
 1887627,
 1996387,
 1997224,
 2991842,
 3120137,
 3447108}

## Examine the debug output

These are the best refs for the query vector, along with their dot products with the query vector. We also confirm that the dot product returned for the top ref matches a direct computation.

In [66]:
best_ref_ords

array([327, 115, 229,  56, 462,  77, 363, 236, 452, 471])

In [70]:
refs_dotted

array([0.28926173, 0.28270322, 0.2686716 , 0.24036624, 0.22792214,
       0.22556499, 0.21199271, 0.2117053 , 0.205437  , 0.20000759],
      dtype=float32)

In [69]:
query_vector = model.encode(query)
np.dot(refs, query_vector)[327]

0.28926173

## Reconstruct the refs dotted with index

Here the columns are the refs, ordered by similarity to the query, and the rows are the indexed vectors.

In [30]:
every_dotted = index[:, best_ref_ords] * refs_dotted
every_dotted

array([[-0.00285271,  0.01095413,  0.00565327, ...,  0.00184634,
         0.00163272, -0.02223657],
       [ 0.00059355,  0.04823959,  0.07170182, ...,  0.02003475,
         0.03177835,  0.03683331],
       [-0.01185192,  0.00865418,  0.03142527, ..., -0.00641909,
         0.02141719, -0.01012504],
       ...,
       [-0.02139655,  0.00270634,  0.01438915, ..., -0.00806189,
         0.02504144,  0.01950342],
       [ 0.0306883 ,  0.03082296,  0.04067402, ...,  0.05346516,
         0.02163279,  0.04414181],
       [-0.03700887, -0.00142959,  0.02253393, ...,  0.00854083,
         0.00781648,  0.02723801]], dtype=float32)

In [32]:
every_dotted.shape

(3935913, 10)
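
The shape follows from NumPy broadcasting: selecting columns gives one row per indexed vector and one column per chosen ref, and the per-ref query dot products are multiplied across each row. A small sketch with made-up numbers (3 vectors, 4 refs, 2 refs chosen):

```python
import numpy as np

index = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]])  # 3 vectors x 4 refs
best_ref_ords = np.array([2, 0])          # refs chosen for this query
refs_dotted = np.array([0.5, 0.25])       # query's dot product with each chosen ref

# Column j of the selection is scaled by refs_dotted[j]
every_dotted = index[:, best_ref_ords] * refs_dotted

# Summing across refs gives one score per indexed vector
vects_scored = every_dotted.sum(axis=1)

print(every_dotted.shape)
print(vects_scored)
```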

## Expected rank / dot of ground truth vs output

We observe something surprising:

- The ground truth result most similar to our query ranks lower against most of our refs (it is more dissimilar to them) than the top result we actually returned
- As the refs are not random but sampled from the corpus, similarity of a ref to our query seems to come with dissimilarity to the indexed vector we want

So when the reference vectors are NOT random, and your observer (query) is close to a landmark (ref), it may mean the closest thing to that landmark is NOT the thing closest to the query?

In [73]:
def ref_dotted(idx):
    ref_0_ords = np.argsort(-every_dotted[:, idx])
    ref_0_dotted = every_dotted[ref_0_ords, idx]
    return ref_0_ords, ref_0_dotted

top_result = 1903917
expected_top_result = 611431

print(np.dot(query_vector, vects[top_result]))
print(np.dot(query_vector, vects[expected_top_result]))


output = []
for i in range(0, 10):
    ords, dotted = ref_dotted(i)
    rank_top_result = np.where(ords == top_result)[0][0]
    rank_expected_top_result = np.where(ords == expected_top_result)[0][0]
    output.append({'ref': i, 
                   'query_to_ref': refs_dotted[i],
                   'ref_to_output': dotted[rank_top_result],
                   'ref_to_ground_truth': dotted[rank_expected_top_result],
                   'top_rank_output': rank_top_result,
                   'top_rank_expected': rank_expected_top_result})
output = pd.DataFrame(output)
output

0.34889573
0.5288173


| | ref | query_to_ref | ref_to_output | ref_to_ground_truth | top_rank_output | top_rank_expected |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.289262 | 0.066397 | 0.040628 | 32432 | 286418 |
| 1 | 1 | 0.282703 | 0.029907 | 0.033404 | 794120 | 663651 |
| 2 | 2 | 0.268672 | 0.032889 | 0.019059 | 769336 | 1481044 |
| 3 | 3 | 0.240366 | 0.014134 | 0.001334 | 1705350 | 2795982 |
| 4 | 4 | 0.227922 | 0.071331 | 0.053631 | 4948 | 27071 |
| 5 | 5 | 0.225565 | 0.074768 | 0.055362 | 2817 | 22408 |
| 6 | 6 | 0.211993 | 0.03479 | 0.028854 | 433181 | 589428 |
| 7 | 7 | 0.211705 | 0.019519 | 0.014662 | 957668 | 1315977 |
| 8 | 8 | 0.205437 | 0.022211 | 0.035527 | 933253 | 359575 |
| 9 | 9 | 0.200008 | 0.032051 | 0.009996 | 461450 | 1793365 |


In [75]:
output.mean()

ref                         4.500000
query_to_ref                0.236363
ref_to_output               0.039800
ref_to_ground_truth         0.029246
top_rank_output        609455.500000
top_rank_expected      933491.900000
dtype: float64

## Can we recreate this situation?

Here's a little baby 6-dimensional index

In [100]:
near_vects = np.array([[1   ,0.75  ,0  ,0   ,0   ,0.1],
                       [0.9 ,1  ,0  ,0   ,0   ,0.05],
                       [1   ,0.9,0  ,0   ,0.05,0 ],
                       [0.95,0.1,0.1,0.01,0.04,0]])

near_vects /= np.linalg.norm(near_vects, axis=0)

vects_to_index = near_vects[::2]
refs = near_vects[1::2]

vects_to_index

array([[0.5189993 , 0.48589766, 0.        , 0.        , 0.        ,
        0.89442719],
       [0.5189993 , 0.58307719, 0.        , 0.        , 0.78086881,
        0.        ]])

## Query, closest to the 2nd indexed vector

In [106]:
query_vector = np.array([0.9   ,0.9  ,0.01  ,0.01   ,0.002   ,0.001])
query_vector /= np.linalg.norm(query_vector)
query_vector

array([0.70706205, 0.70706205, 0.00785624, 0.00785624, 0.00157125,
       0.00078562])

In [105]:
np.dot(vects_to_index, query_vector)

array([0.71122718, 0.7804634 ])

## Pass through refs, get the closest ref

It's closest to the 0th ref... but that ref is closest to the WRONG answer! 

In [107]:
np.dot(refs, query_vector)

array([0.7886993 , 0.41111848])

In [108]:
np.dot(vects_to_index, refs[0])

array([0.95721963, 0.6201787 ])

In [109]:
np.dot(vects_to_index, refs[1])

array([0.28737179, 0.78147258])

## Theory, things nearby, create 'unexpected' situations

Consider the following query Q, ref R, and vectors V1 and V2:

```
|--------------------------------------|
|                R                     |
|                   V2                 |
|            Q                         |
|               V1                     |
|                                      |
|--------------------------------------|
```

From this 'close' ref's point of view, `V2` is the nearer vector, despite `V1` being closer to the query.

Yet consider when the query AND ref are nearby (as per our ranking of refs), with the ref being used to examine distant points:


```
|--------------------------------------|
|                R                     |
|                             V2       |
|            Q             V1          |
|                                      |
|                                      |
|--------------------------------------|
```

Here the query and ref are close to each other relative to `V1` and `V2`, and therefore what's near R is also near Q.
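
The two pictures can be checked numerically. This is only a sketch: the 2D coordinates are hypothetical, chosen to resemble the diagrams, and Euclidean distance in the plane stands in for similarity between the real high-dimensional vectors.

```python
import numpy as np

def nearest(point, candidates):
    """Name of the candidate with the smallest Euclidean distance to point."""
    return min(candidates, key=lambda name: np.linalg.norm(candidates[name] - point))

# Hypothetical positions for the query and the ref
Q = np.array([1.0, 2.0])
R = np.array([4.0, 4.0])

# First picture: V1 and V2 sit right next to Q and R
nearby = {'V1': np.array([2.0, 1.0]), 'V2': np.array([5.0, 3.0])}
# Second picture: V1 and V2 are far from both Q and R
distant = {'V1': np.array([20.0, 2.0]), 'V2': np.array([22.0, 4.0])}

print('nearby: ', nearest(Q, nearby), nearest(R, nearby))
print('distant:', nearest(Q, distant), nearest(R, distant))
```

With the nearby vectors, the query and the ref disagree about which one is closest; with the distant vectors, they agree.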
