# Vector similarity search through shared reference points

It was noted in [this blog post](https://softwaredoug.com/blog/2023/03/02/shared-dot-product.html) that if we know `u.A` and `v.A` we can estimate `u.v`. As an exercise, can we use that to prototype a vector similarity search?

Why would this be useful?

* We can compress a large vector space down to a few thousand reference vectors, called `refs` here
* We can index a set of vectors, `v`, by finding the vectors most similar to each of these `refs` and storing each vector's id and dot product `v.refs`
* We might put that index in a traditional search system, and let traditional text retrieval's similarity scoring approximate the cosine similarity between dense vectors


## Import wikipedia sentences and vectors

We take every 10th sentence of a Wikipedia sentence dump (~8m sentences in total) and encode each one with MiniLM:

```
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2')
    model.encode(sentence)
```

In [3]:
import numpy as np

with open('wikisent10.npz', 'rb') as f:
    vects = np.load(f)
    vects = np.stack(vects)
    all_normed = (np.linalg.norm(vects, axis=1) > 0.99) & (np.linalg.norm(vects, axis=1) < 1.01)
    assert all_normed.all(), "Something is wrong - vectors are not normalized!"

with open('wikisent10.txt', 'rt') as f:
    sentences = f.readlines()
    
vects.shape

(787183, 384)
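
The assertion in the loading cell expects unit-length vectors, which is what lets a plain dot product stand in for cosine similarity. If an encoder returned unnormalized vectors, a minimal sketch of normalizing them row-wise with NumPy might look like this (`raw` is a made-up stand-in array, not the Wikipedia data):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 384))   # hypothetical unnormalized embeddings

# Divide each row by its own L2 norm; keepdims lets the division broadcast per row
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

norms = np.linalg.norm(unit, axis=1)
print(norms)
```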

## Ground truth similarity

Take a dot product similarity to the query vector as the ground truth.

In [176]:
query = 100
query_vector = vects[query]
query_sentence = sentences[query]
query_sentence

'10 in the UK Singles chart, however it was a bigger hit for Amazulu in 1986 from their album Amazulu.\n'

In [177]:
nn = np.dot(vects, query_vector)
top_n = np.argpartition(-nn, 10)[:10]
top_n = top_n[nn[top_n].argsort()[::-1]]

for idx in top_n:
    print(idx, nn[idx], sentences[idx])

100 1.0 10 in the UK Singles chart, however it was a bigger hit for Amazulu in 1986 from their album Amazulu.

349218 0.6210802 It was released as the album's fourth single in March 1986 and reached #15 on the Australian singles chart, becoming the band's tenth Top 20 hit in their country.

545896 0.5978068 The album reached #18 in the UK Albums Chart and spawned two singles, both of which made the top 40 on the UK Singles Chart.

324576 0.5959153 It reached number 16 in the UK Singles Chart in 1983, the band's biggest singles chart success prior to 1985.

685786 0.5938797 The single was released on 13 March 1985 and entered the top 10 in Germany on 13 May 1985, after spending three weeks within the top-5, the single reached the top eventually going gold and selling well over 250,000 units in Germany alone.

349159 0.5915458 It was released as a single in the UK in August 1985 where it reached number 74 in the singles charts and remained in the charts for 1 week.

761042 0.58592045 Upo
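
`np.argpartition` only guarantees that the 10 best entries land in the first 10 slots, not that they come out ordered, which is why the cell re-sorts them afterwards. A small self-contained check of that two-step pattern against a full sort, using made-up scores rather than the real `nn`:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.random(1000)   # stand-in for the dot products in `nn`
k = 10

# Step 1: partition so the k largest scores occupy the first k slots (unordered)
top_k = np.argpartition(-scores, k)[:k]
# Step 2: order just those k indices by descending score
top_k = top_k[scores[top_k].argsort()[::-1]]

# A full descending sort should pick the same indices in the same order
full_sort = np.argsort(-scores)[:k]
print(np.array_equal(top_k, full_sort))
```

The partition avoids sorting the whole array, which matters more as the number of vectors grows.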

## Select a set of random vectors as reference points

We have to sample the reference points *randomly*. Otherwise the similarity estimate below (a sum of similarities) won't work, because we'd be summing correlated similarities.

In [178]:
num_vectors = len(vects)

def centroid():
    """ Sample a unit vector from a sphere in N dimensions.
    It's actually important this is gaussian
    https://stackoverflow.com/questions/59954810/generate-random-points-on-10-dimensional-unit-sphere
    IE Don't do this
        projection = np.random.random_sample(size=num_dims)
        projection /= np.linalg.norm(projection)
    """
    num_dims = len(vects[0])
    projection = np.random.normal(size=num_dims)
    projection /= np.linalg.norm(projection)
    return projection   

centroid()

array([ 4.04418222e-02,  4.29905835e-02,  2.94846971e-02,  6.37413950e-02,
        4.16853180e-03,  6.76388274e-02,  3.63467822e-02, -2.78905778e-02,
        1.07696273e-01, -7.86593509e-02,  1.11071982e-01,  4.89706509e-02,
       -1.09267697e-04,  2.96135333e-02,  2.32052021e-02, -4.47172151e-02,
        6.61178519e-02, -1.61450268e-02, -1.38930192e-02, -6.20297679e-02,
        6.61775297e-02, -2.21872601e-02, -3.58267370e-02,  3.20256973e-02,
       -7.85287797e-02,  2.67882422e-02, -9.60284139e-02, -5.52966925e-02,
        2.39018031e-02, -2.68202409e-02, -2.17792422e-02, -2.64472702e-02,
        2.72490978e-02,  7.17630921e-02, -2.70215090e-02,  8.57689512e-02,
        3.80243500e-02, -4.40758526e-02,  4.69071434e-03,  1.19036679e-02,
       -1.88135365e-02,  4.26781438e-03,  1.50710805e-02, -7.08824250e-02,
        4.15535225e-03, -1.82117029e-03, -4.75601741e-02,  3.18162279e-02,
       -3.98048573e-02,  8.94637326e-02,  4.84469477e-02, -1.35898666e-02,
        2.52745515e-02,  
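
To see why the docstring warns against `np.random.random_sample`, here is a small sketch comparing the two sampling schemes. The helper names (`normalize`, `mean_pairwise_dot`) and the sample size are illustrative choices, not part of the notebook. Uniform samples in `[0, 1)` have only non-negative components, so after normalization they all point into the same corner of the space and are strongly correlated with one another, while Gaussian samples are spread evenly over the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims, n = 384, 500

def normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

gaussian = normalize(rng.normal(size=(n, num_dims)))
uniform = normalize(rng.random(size=(n, num_dims)))   # all components >= 0

def mean_pairwise_dot(x):
    """Average dot product between distinct rows of x."""
    dots = x @ x.T
    off_diag = dots[~np.eye(len(x), dtype=bool)]   # drop self-similarity
    return off_diag.mean()

gaussian_mean = mean_pairwise_dot(gaussian)
uniform_mean = mean_pairwise_dot(uniform)
print(gaussian_mean, uniform_mean)
```

The Gaussian references should average close to zero similarity with each other, while the uniform ones should be far from zero, which is exactly the kind of correlated reference set the summation later on cannot tolerate.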

## Most similar vectors to centroid

Get the vectors most similar to a given vector, keeping only those whose dot product is at least `floor`. Passing the query vector, as we do here, should reproduce the ground-truth top 10 above.

In [179]:
def most_similar(centroid, floor):

    nn = np.dot(vects, centroid)
    idx_above_thresh = np.argwhere(nn >= floor)[:, 0]

    return sorted(zip(idx_above_thresh, nn[idx_above_thresh]),
                key=lambda vect: vect[1],
                reverse=True)

nn_above_thresh = most_similar(query_vector, 0.001)
nn_above_thresh[:10]

[(100, 1.0),
 (349218, 0.6210802),
 (545896, 0.5978068),
 (324576, 0.5959153),
 (685786, 0.5938797),
 (349159, 0.5915458),
 (761042, 0.58592045),
 (546282, 0.58320194),
 (685464, 0.5829248),
 (689124, 0.58184844)]

## Create a compressed index based on shared reference points

As mentioned in [this blog article](https://softwaredoug.com/blog/2023/03/02/shared-dot-product.html) we can use shared reference points between query and vector to estimate their similarity. Below we store

- A table of these reference vectors (`refs`) that can stand in for the full vector space
- A mapping of these `refs` -> a set of indexed vectors.

In [None]:
ref_neighbors = {}   # reference pts -> neighbors
from time import perf_counter

num_refs = 2000
refs = np.zeros( (num_refs, vects.shape[1]) )

all_indexed_vectors = set()

start = perf_counter()

for ref_ord in range(0, num_refs):
    specificity = 0.10
    
    center = centroid()    
    top_n = most_similar(center, specificity)
    
    if ref_ord % 100 == 0:
        print(ref_ord, len(all_indexed_vectors) / len(vects), perf_counter() - start)
        
    refs[ref_ord, :] = center
    idx = []
    for vector_ord, dot_prod in top_n:
        all_indexed_vectors.add(vector_ord)
        idx.append((vector_ord, dot_prod))
    ref_neighbors[ref_ord] = idx
    if vector_ord == query:
        print('Q', ref_ord, vector_ord)


0 0.0 0.3291675839573145
100 0.9029526806346173 43.776729874894954
200 0.9919116647590205 88.12234558397904
300 0.9993965824973354 132.4025780420052
400 0.9999682411840702 176.65073737490457
500 0.9999911075315396 220.59495983389206
600 0.9999974592947256 264.75591770897154
700 1.0 308.87807195889764
800 1.0 352.69318049994763
900 1.0 396.41712533391546
1000 1.0 440.5252767499769
1100 1.0 484.54106754192617
1200 1.0 528.2496463749558
1300 1.0 572.71075762494
1400 1.0 616.7152932919562
1500 1.0 661.9096008338965
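
As a rough sanity check on the coverage column printed above, here is a back-of-the-envelope sketch. It rests on two assumptions that the real data only approximately satisfies: the dot product between a random unit reference and a unit vector in `num_dims` dimensions is roughly normal with standard deviation `1/sqrt(num_dims)`, and each reference picks up vectors independently of the others.

```python
import math

num_dims = 384
specificity = 0.10

# Approximate P(dot >= specificity) with a normal tail, std = 1/sqrt(num_dims)
z = specificity * math.sqrt(num_dims)
p_hit = 0.5 * math.erfc(z / math.sqrt(2))

def expected_coverage(num_refs):
    """Chance a vector is picked up by at least one of num_refs independent refs."""
    return 1 - (1 - p_hit) ** num_refs

for num_refs in (100, 200, 700):
    print(num_refs, round(expected_coverage(num_refs), 4))
```

Under these assumptions each reference indexes only a few percent of the vectors, yet a few hundred references are already expected to cover nearly everything, which is in the same ballpark as the printed progress. Real embeddings are not evenly spread over the sphere, so the agreement is only approximate.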


In [None]:
nn = np.dot(refs, query_vector)
nn[nn > 0]

# Search time!

Now when we search we go through the following steps, using just our reference points.

## Similarity to reference points

Compute similarity to `refs` from above...

In [None]:
# Query vect -> refs similarity
nn = np.dot(refs, query_vector)

top_n_ref_points = np.argpartition(-nn, 10)[:10]
scored = nn[top_n_ref_points]

scored, top_n_ref_points

## Using reference points `A`, estimate `q.v`

We have a query vector `q` that is similar to a set of reference points `A`. Can we estimate `q.v`? We expect `q.v` to [approach `q.A*v.A`, which we implement below](https://softwaredoug.com/blog/2023/03/02/shared-dot-product.html).

In [170]:
candidates = {}
cutoff = 0.0
for ref_ord, ref_score in zip(top_n_ref_points, scored):
    ref = refs[ref_ord]

    for vect_id, score in ref_neighbors[ref_ord]:
        # print(vect_id, score, score*ref_score)
        combined = score * ref_score
        if combined > cutoff:
            try:
                candidates[vect_id].append(combined)
            except KeyError:
                candidates[vect_id] = [combined]
            
list(candidates.items())[:10]

[(723905, [0.03571343463128332, 0.012700629623260861]),
 (503548, [0.03522915437314725]),
 (613207, [0.03332362985738621]),
 (743038, [0.03329816556012523]),
 (770521, [0.032652819867609234]),
 (665689, [0.03220477492385285]),
 (252842, [0.03208397976780289]),
 (755218, [0.031997149848823646]),
 (2046, [0.03154499853027196]),
 (552387, [0.03150694500555567])]

## Sum the shared candidates

Should we sum the shared reference points?

Out of N reference points `A0...AN` we observe `u.A0...u.AN` and `v.A0...v.AN`. We assume `u.v` correlates with the sum of their products, `u.A0*v.A0 + u.A1*v.A1 + ... + u.AN*v.AN`.

Note this only holds because we generate the reference points *randomly*. Introducing bias into the reference points would make many terms of the summation heavily correlated: if `A0` and `A1` were very similar to each other, we would effectively be double counting. For example, if `A0` and `A1` both occurred towards the center of the data, we would be biased towards more general results.
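
The idealized version of this summation can be checked numerically. The sketch below is not the notebook's index: it uses *all* of a large set of random references rather than only the top few, and the names `demo_refs` and `random_unit` are made up for illustration. For uniformly random unit references, the expected value of `(u.A)(v.A)` is `u.v / num_dims`, so rescaling the sum by `num_dims / num_refs` should give back an estimate of `u.v`.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims, num_refs = 64, 20000

def random_unit(n):
    """Sample n unit vectors uniformly from the sphere via Gaussian draws."""
    x = rng.normal(size=(n, num_dims))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Two unit vectors with a known, moderately high similarity
u = random_unit(1)[0]
v = u + 0.8 * random_unit(1)[0]
v /= np.linalg.norm(v)

demo_refs = random_unit(num_refs)   # random reference points A0...AN
raw_sum = np.sum((demo_refs @ u) * (demo_refs @ v))

# Rescale the raw sum to undo the 1/num_dims factor per term
estimate = raw_sum * num_dims / num_refs
true_dot = np.dot(u, v)
print(true_dot, estimate)
```

The scale factor is the same for every candidate, so leaving it out, as the cells here do, changes the magnitude of the scores but not the ranking.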

In [171]:
summed_candidates = {}
for vect_id, scored in candidates.items():
    summed_candidates[vect_id] = sum(scored)

In [175]:
results = summed_candidates.items()
results = sorted(results,
                 key=lambda scored: scored[1],
                 reverse=True)[:10]
# 21340
print("ZF -- ", query, sentences[query])
rank = -1
for idx, result in enumerate(results):
    print(idx, result, sentences[result[0]])

ZF --  100 10 in the UK Singles chart, however it was a bigger hit for Amazulu in 1986 from their album Amazulu.

0 (100, 0.16053842822469955) 10 in the UK Singles chart, however it was a bigger hit for Amazulu in 1986 from their album Amazulu.

1 (707142, 0.1110197564947739) The track was released as a single, on EMI and reached 71 in the UK Singles Chart, spending just two weeks in the listing.

2 (544532, 0.10054923103439038) The album debuted at #2 on the RIANZ albums chart, and after seven weeks within the top 10 would finally reach the #1 position.

3 (323191, 0.09353954548607514) It peaked at number 10 and remained on the charts for 12 weeks, and it was also included on the Ice Age film.

4 (690165, 0.09333054109079071) The song was originally released in February 1980, reaching #56 in the UK charts, before being re-released to top ten success in August of the same year.

5 (545629, 0.09237140430525709) The album made its first appearance on Billboard magazine's album chart in t

## Putting search together

Let's put the code above into one function.

In [195]:
def _search(query_vector):
    
    # query vect -> refs similarity
    nn = np.dot(refs, query_vector)

    top_n_ref_points = np.argpartition(-nn, 30)[:30]
    scored = nn[top_n_ref_points]

    # Candidates via our index
    candidates = {}
    cutoff = 0.0
    for ref_ord, ref_score in zip(top_n_ref_points, scored):
        ref = refs[ref_ord]

        for vect_id, score in ref_neighbors[ref_ord]:
            # print(vect_id, score, score*ref_score)
            combined = score * ref_score
            if combined > cutoff:
                try:
                    candidates[vect_id].append(combined)
                except KeyError:
                    candidates[vect_id] = [combined]
                    
    summed_candidates = {}
    for vect_id, scored in candidates.items():
        summed_candidates[vect_id] = sum(scored)
        
    results = summed_candidates.items()
    return sorted(results,
                  key=lambda scored: scored[1],
                  reverse=True)[:10]



from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

def search(query):
    query_vector = model.encode(query)
    return _search(query_vector)

def search_ground_truth(query):
    query_vector = model.encode(query)
    nn = np.dot(vects, query_vector)
    top_n = np.argpartition(-nn, 10)[:10]
    top_n = top_n[nn[top_n].argsort()[::-1]]
    return sorted(zip(top_n, nn[top_n]),
                  key=lambda scored: scored[1],
                  reverse=True)[:10]

results = search("ai chatbots")
rank = -1
for idx, result in enumerate(results):
    print(idx, result, sentences[result[0]])

0 (478462, 0.14556285696774182) Robot9000 (r9k) is an open-source chat moderation script developed in 2008 by Randall Munroe.

1 (566992, 0.13481037399967608) The chat show consisted of various games and quizzes presented towards celebrities who were guests on the episode, and began airing from 20 November 2013.

2 (107215, 0.13402052882769422) Conversation theory is a cybernetic and dialectic framework that offers a scientific theory to explain how interactions lead to "construction of knowledge", or "knowing": wishing to preserve both the dynamic/kinetic quality, and the necessity for there to be a "knower".

3 (107210, 0.13322484355276576) Conversation analysis (CA) is an approach to the study of social interaction, embracing both verbal and non-verbal conduct, in situations of everyday life.

4 (169547, 0.13167264686562835) Google Talk (also known as Google Chat) is an instant messaging service that provides both text and voice communication.

5 (489317, 0.13032269999154028) Semant

In [196]:
results = search_ground_truth("ai chatbots")
rank = -1
for idx, result in enumerate(results):
    print(idx, result, sentences[result[0]])

0 (194492, 0.6189599) He is currently CEO and co-founder of marketing-tech company Botworx.ai an artificial intelligence (AI) and natural language processing company that aims to help businesses interact with their customers via AI-powered chatbots.

1 (478462, 0.57323676) Robot9000 (r9k) is an open-source chat moderation script developed in 2008 by Randall Munroe.

2 (701694, 0.565655) The term "ChatterBot" was originally coined by Michael Mauldin (creator of the first Verbot, Julia) in 1994 to describe these conversational programs.

3 (786907, 0.5553341) Zo is an artificial intelligence English-language chatbot developed by Microsoft.

4 (331359, 0.529726) Its stated aim is to "simulate natural human chat in an interesting, entertaining and humorous manner".

5 (730360, 0.51735985) This game is one of many simple games created by Google that are AI based as part of a project known as 'A.I. Experiments'.

6 (64447, 0.5083422) A web chat is a system that allows users to communicate in
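
The candidate accumulation in `_search` builds a Python list per candidate and then sums it. One possible alternative, sketched here on toy data with a made-up `toy_ref_neighbors` index and made-up scores, accumulates directly into a dense score array with `np.add.at`. Unlike `_search`, this sketch does not apply the `cutoff` filter.

```python
import numpy as np

num_vectors = 8
# Toy index: ref id -> (vector ids, dot products with that ref); hypothetical values
toy_ref_neighbors = {
    0: (np.array([1, 3, 5]), np.array([0.30, 0.20, 0.15])),
    1: (np.array([3, 4]), np.array([0.25, 0.12])),
}
# Toy query-to-ref similarities for the refs selected at query time
query_ref_scores = {0: 0.2, 1: 0.1}

scores = np.zeros(num_vectors)
for ref_ord, ref_score in query_ref_scores.items():
    vect_ids, dots = toy_ref_neighbors[ref_ord]
    # Add q.A * v.A for every vector indexed under this ref
    np.add.at(scores, vect_ids, dots * ref_score)

top = np.argsort(-scores)[:3]
print(top, scores[top])
```

The trade-off is memory: the dense array has one slot per indexed vector, whereas the dictionary only holds the candidates actually touched.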

## Next steps

This is just a toy prototype of course, and would need to be evaluated for recall.
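
One simple way to start on that evaluation is recall@k against the brute-force ground truth. The helper below is a hypothetical sketch, not something defined in the notebook. It takes two lists of `(id, score)` pairs in the shape that `search` and `search_ground_truth` return, and is demonstrated on made-up ids.

```python
def recall_at_k(approx_results, exact_results, k=10):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    approx_ids = {vect_id for vect_id, _ in approx_results[:k]}
    exact_ids = {vect_id for vect_id, _ in exact_results[:k]}
    if not exact_ids:
        return 0.0
    return len(approx_ids & exact_ids) / len(exact_ids)

# Toy (id, score) lists; in the notebook these would come from
# search(query) and search_ground_truth(query)
approx = [(7, 0.15), (2, 0.13), (9, 0.12), (4, 0.11)]
exact = [(2, 0.62), (7, 0.57), (5, 0.56), (1, 0.55)]
print(recall_at_k(approx, exact, k=4))
```

Averaging this over a set of queries would give a single number to track while varying the number of reference points.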

* Consider how you'd treat the reference points in a traditional search index (like Solr, Elasticsearch etc)
* Benchmark with more data (9m -> 90m wikipedia sentences)
* Study how many reference points are needed to get good recall
* Test whether increasing the number of reference points actually increases recall
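
As a starting point for the first item, the `refs -> vectors` mapping can be inverted into one record per vector, which is closer to how a document-oriented search index is organized. This is only a sketch on toy data: the `ref_N` field naming is made up, and how such records would map onto a particular engine's field types and scoring is left open.

```python
from collections import defaultdict

# Toy version of the notebook's ref -> [(vector id, dot product)] mapping
toy_ref_neighbors = {
    0: [(10, 0.31), (11, 0.12)],
    1: [(11, 0.22)],
    2: [(10, 0.15), (12, 0.40)],
}

# Invert into one "document" per vector: field name -> weight
docs = defaultdict(dict)
for ref_ord, neighbors in toy_ref_neighbors.items():
    for vect_id, dot_prod in neighbors:
        docs[vect_id][f"ref_{ref_ord}"] = dot_prod

for vect_id, fields in sorted(docs.items()):
    print(vect_id, fields)
```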