<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `Introduction to Semantic Search and Vector Databases` `1`

This is lesson `2` of 3 in the educational series on `Semantic Search and Vector Databases`. This notebook is intended `to teach the basic concepts of vector databases`.

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* spaCy
* Vector databases
* Semantic search
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 


**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Regular Expressions (`re`, character classes)

```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Data cleaning with `Pandas`
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Create a traditional keyword search engine in Python using Best Matching 25 (BM25)
2. Explain what a vector database is
3. Create a basic vector database with Annoy
4. Query a vector database with Annoy
```
___

# Required Python Libraries
* spacy for tokenization and sentence chunking
* srsly for loading json data
* annoy for creating a vector database
* txtai (for outside of class work)
* sentence-transformers for loading a model and encoding texts
* pandas for working with data

## Install Required Libraries

In [100]:
### Install Libraries ###

# Using !pip installs
!pip install spacy scikit-learn srsly annoy txtai sentence-transformers pandas rank-bm25

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2


In [101]:
import srsly
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm
import torch
import pandas as pd
from annoy import AnnoyIndex
import spacy
from rank_bm25 import BM25Okapi

# Required Data

For this lesson, we are working with a small sample from the Founders Online archive. If you would like to work with a larger sample, I have provided the scripts necessary for using the `metadata.json` provided by Founders Online to download a sample from their website.

# Introduction

In this notebook, we will build on the concepts we learned in the first notebook. There, we learned about the fundamental concepts behind vector databases and semantic searching: vectorization itself, and how to perform it with various types of machine learning models. We learned that transformers are best suited for this particular problem because they produce dynamic vectors, rather than static vectors. Dynamic vectors allow a given token's vector to change depending on context.

In this notebook, we will use this knowledge to create our first vector database and even query it semantically! It is important to note that while the work in this notebook produces good results, this is not what you would do should you wish to build a formal project. A formal project will often use a cloud-based solution and a proper server. We will cover these steps in the next notebook.

# Loading the Data

To create a vector database, you must start with data. We will be working with data from [Founders Online](https://founders.archives.gov/). This data is available in the `../data/processed/` folder. I have provided samples of 10, 100, and 1,000 documents. For this notebook, we will be working with the sample of 1,000. It is important to note that I have seeded the random sample. This means that the data you work with each time will be the same. To get different data, change the seed in the `download_data.py` script, located in `./src/data/`.
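
To make the idea of a seeded sample concrete, here is a minimal sketch using only the standard library. It is not the code in `download_data.py`; the helper name `seeded_sample`, the seed value, and the stand-in document list are assumptions chosen for illustration.

```python
import random

def seeded_sample(items, k, seed=42):
    """Return k items chosen at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; leaves the global random state alone
    return rng.sample(items, k)

# Stand-in for a full archive of documents (hypothetical).
documents = [f"doc_{i}" for i in range(10_000)]

first = seeded_sample(documents, 5)
second = seeded_sample(documents, 5)

# The same seed yields the same sample on every run.
print(first == second)
```

Changing the `seed` argument is what produces a different, but still repeatable, sample.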

This data is a collection of writings of the Founders. The data is useful for doing social network analysis as many of the writings are letters. It's also useful for mapping writings across time and space as many of the writings are dated and contain information about specific locations. For our purposes, however, we will be working with the main content of the letters to create a vector database. 

To get started, let's load the data. We will be using `srsly`, which I'm including in this tutorial as a way to introduce students to the library. You can also use the standard `json` package here. `srsly` bundles fast readers and writers for several formats, and its `read_jsonl()` function returns a generator. This is useful when you start working with larger datasets (as you typically do with vector databases) because the entire dataset is not loaded into memory at once. The `read_json()` function we use below reads the whole file in one go. We still wrap the result in `list()` so that we are certain to have a plain list, which is a bit easier to use for our purposes.


In [20]:
data = list(srsly.read_json("../data/processed/sample_1000_42.json"))
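
For readers who prefer the standard library, the same kind of load can be sketched with `json`. The records and the temporary file below are made up so that the example runs without the lesson's data file; only the `json.load` call mirrors what the cell above does.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in records, not taken from the Founders Online sample.
records = [
    {"title": "Letter A", "content": "Dear Sir, I have received your letter."},
    {"title": "Letter B", "content": "Gentlemen, the post arrived this morning."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.json"
    sample_path.write_text(json.dumps(records), encoding="utf-8")

    # json.load parses the whole file at once and returns the top-level object.
    with sample_path.open(encoding="utf-8") as f:
        loaded = json.load(f)

print(type(loaded).__name__, len(loaded))
```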

Now that we have loaded our data, let's take a brief look at it by examining the first item.

In [21]:
data[0]

{'title': 'Thomas Jefferson to Joseph Milligan, 22 December 1815',
 'permalink': 'https://founders.archives.gov/documents/Jefferson/03-09-02-0174',
 'project': 'Jefferson Papers',
 'authors': ['Jefferson, Thomas'],
 'recipients': ['Milligan, Joseph'],
 'date-from': '1815-12-22',
 'date-to': '1815-12-22',
 'content': 'Monticello Dec. 22. 15.\nDear Sir\nOn my return here from Bedford a few days ago, I found the Hutton and Requisite tables, bound to my mind. by this mail I send you an Ovid’s metamorphoses almost entirely worne out & defaced, yet of sovaluable and rareaneditionthat I wish you to put it into as good a state of repair as it is susceptible of. by the next mail I will forward a Cornelius Nepos to be bound. be so good as to procure and forward to me by stage the underwritten books.I salute you with friendship & esteem\nTh: Jefferson\nAinsworth’sLat. & Eng. dict. abridged. to be bound[. . .]\nthe Lat. & Eng in one, & the Eng. & Lat.[. . .]\nOvid’s metamorphoses. the Delphin edn 

Notice that we have some important metadata here including the title, permalink (link to the website where this particular entry appears), project, authors, recipients, date-from, date-to, and content. Everything here is as presented in the original metadata.json file with the exception of `content`. I have added this after pulling the data from the website. We will learn about how we can make these extra attributes more useful in the next notebook. For now, let's focus on the `content` attribute.

In [89]:
print(data[0]["content"])

Monticello Dec. 22. 15.
Dear Sir
On my return here from Bedford a few days ago, I found the Hutton and Requisite tables, bound to my mind. by this mail I send you an Ovid’s metamorphoses almost entirely worne out & defaced, yet of sovaluable and rareaneditionthat I wish you to put it into as good a state of repair as it is susceptible of. by the next mail I will forward a Cornelius Nepos to be bound. be so good as to procure and forward to me by stage the underwritten books.I salute you with friendship & esteem
Th: Jefferson
Ainsworth’sLat. & Eng. dict. abridged. to be bound[. . .]
the Lat. & Eng in one, & the Eng. & Lat.[. . .]
Ovid’s metamorphoses. the Delphin edn in 8vo
Cornelius Nepos. the Delphin edn if to be had; if not some other good one.
Virgil. the Delphin edn lately printed in Phil. with English notes.
Mair’s Tyro’s dictionary.
I observe a mrRichardsonadvertises in the National Intelligencer the Scientific dialogues: if the edition be compleat comprehending theChemical part,

This gives us a better sense of this letter. As we can see this is a raw-string representation of the data. The odd formatting is due to how the data is rendered on the main page, likely to capture the structure of the original document. If you want to verify what the original document looks like, use the link below. Here is what it looks like as of the writing of this notebook.

![founders](../assets/founders.png)

In [90]:
print(data[0]["permalink"])

https://founders.archives.gov/documents/Jefferson/03-09-02-0174


# Vectorizing Documents

Now that we have all our documents, it comes time to vectorize them, or convert them into a sequence of vectors. This is where we pass the texts to a machine learning model and capture the output vector for each of them. To do this, though, we need a model loaded. We will be using the `sentence-transformers` library. It makes this process as simple as possible with only two lines of code. First, we will need to load the model.

In [91]:
# Initialize the sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')  # This is a standard, efficient model



Now that we have loaded our model, we can (optionally) choose the device it runs on. The code below will put it onto your GPU, if available. If you don't know whether this is enabled on your device, then it likely is not. The steps to activate `cuda` are very specific and require you to install certain packages in a certain way. If you do not have `cuda`, then the default will be the `cpu`. With 1,000 documents, this will not be an issue.

In [25]:
# Set device (use GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

Now that we have our model loaded, we need to get just the texts from all our data. We can do that with a list comprehension or with more verbose code. I'll provide both here. Note, both approaches are perfectly fine.

In [92]:
# texts = [item["content"] for item in data]

texts = []
for item in data:
    texts.append(item["content"])

Let's take a look at the new data we created.

In [93]:
texts[:2]

['Monticello Dec. 22. 15.\nDear Sir\nOn my return here from Bedford a few days ago, I found the Hutton and Requisite tables, bound to my mind. by this mail I send you an Ovid’s metamorphoses almost entirely worne out & defaced, yet of sovaluable and rareaneditionthat I wish you to put it into as good a state of repair as it is susceptible of. by the next mail I will forward a Cornelius Nepos to be bound. be so good as to procure and forward to me by stage the underwritten books.I salute you with friendship & esteem\nTh: Jefferson\nAinsworth’sLat. & Eng. dict. abridged. to be bound[. . .]\nthe Lat. & Eng in one, & the Eng. & Lat.[. . .]\nOvid’s metamorphoses. the Delphin edn in 8vo\nCornelius Nepos. the Delphin edn if to be had; if not some other good one.\nVirgil. the Delphin edn lately printed in Phil. with English notes.\nMair’s Tyro’s dictionary.\nI observe a mrRichardsonadvertises in the National Intelligencer the Scientific dialogues: if the edition be compleat comprehending theCh

As we can see `texts` now corresponds to a list of each of our texts. We should have 1,000 of these.

In [94]:
len(texts)

1000

Now that we have all our texts, let's encode them! We can do that with a single line. I like to set `show_progress_bar` to `True`. This allows me to see how long things take on my machine.

In [95]:
embeddings = model.encode(texts, show_progress_bar=True)

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

With our embeddings now created, let's convert our original dataset into a `Pandas` DataFrame. This will make it just a bit easier to visualize and work with our data.

In [35]:
df = pd.DataFrame(data)
df

| | title | permalink | project | authors | recipients | date-from | date-to | content |
|---|---|---|---|---|---|---|---|---|
| 0 | Thomas Jefferson to Joseph Milligan, 22 Decemb... | https://founders.archives.gov/documents/Jeffer... | Jefferson Papers | [Jefferson, Thomas] | [Milligan, Joseph] | 1815-12-22 | 1815-12-22 | Monticello Dec. 22. 15.\nDear Sir\nOn my retur... |
| 1 | To Alexander Hamilton from James McHenry, 3 Ma... | https://founders.archives.gov/documents/Hamilt... | Hamilton Papers | [McHenry, James] | [Hamilton, Alexander] | 1791-05-03 | 1791-05-03 | [Baltimore] 3 May 1791.\nMy dear Sir.\nI did n... |
| 2 | John Adams to John Quincy Adams and Thomas Boy... | https://founders.archives.gov/documents/Adams/... | Adams Papers | [Adams, John] | [Adams, John Quincy, Adams, Thomas Boylston] | 1794-09-14 | 1794-09-14 | Quincy Septr.14. 1794\nMy dear Sons\nI once mo... |
| 3 | From George Washington to Major General Horati... | https://founders.archives.gov/documents/Washin... | Washington Papers | [Washington, George] | [Gates, Horatio] | 1776-12-23 | 1776-12-23 | Head Quarters [Bucks County, Pa.] 23d Decr 177... |
| 4 | [Diary entry: 5 July 1795] | https://founders.archives.gov/documents/Washin... | Washington Papers | [Washington, George] | [] | 1795-07-05 | 1795-07-05 | Could not find the main content |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | From John Adams to Boston Patriot, 4 November ... | https://founders.archives.gov/documents/Adams/... | Adams Papers | [Adams, John] | [Boston Patriot] | 1809-11-04 | 1809-11-04 | Quincy, November 4, 1809.\nSirs,\nIn my last l... |
| 996 | From John Adams to United States Senate, 14 Ma... | https://founders.archives.gov/documents/Adams/... | Adams Papers | [Adams, John] | [United States Senate] | 1798-03-14 | 1798-03-14 | United States March 14th 1798:\nGentlemen of t... |
| 997 | To Benjamin Franklin from William Henly, [Apri... | https://founders.archives.gov/documents/Frankl... | Franklin Papers | [Henly, William] | [Franklin, Benjamin] | 1772-04-01 | 1772-04-30 | Sunday Eve. [April?, 1772]\nDear Sir:\nI have ... |
| 998 | From George Washington to Major General Alexan... | https://founders.archives.gov/documents/Washin... | Washington Papers | [Washington, George] | [McDougall, Alexander] | 1779-05-20 | 1779-05-20 | Head Quarters Middle Brook May 20th 1779\nDr S... |


Now that we have loaded our data into a DataFrame, let's add the embeddings to it. To do that, we can wrap our embeddings array in `list()`, so that each row of the new column holds one embedding vector.

In [96]:
df["embedding"] = list(embeddings)

In [97]:
df

| | title | permalink | project | authors | recipients | date-from | date-to | content | embedding |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Thomas Jefferson to Joseph Milligan, 22 Decemb... | https://founders.archives.gov/documents/Jeffer... | Jefferson Papers | [Jefferson, Thomas] | [Milligan, Joseph] | 1815-12-22 | 1815-12-22 | Monticello Dec. 22. 15.\nDear Sir\nOn my retur... | [-0.11585674, -0.031811118, 0.054534256, -0.04... |
| 1 | To Alexander Hamilton from James McHenry, 3 Ma... | https://founders.archives.gov/documents/Hamilt... | Hamilton Papers | [McHenry, James] | [Hamilton, Alexander] | 1791-05-03 | 1791-05-03 | [Baltimore] 3 May 1791.\nMy dear Sir.\nI did n... | [-0.077039875, 0.06889965, 0.05614029, -0.0018... |
| 2 | John Adams to John Quincy Adams and Thomas Boy... | https://founders.archives.gov/documents/Adams/... | Adams Papers | [Adams, John] | [Adams, John Quincy, Adams, Thomas Boylston] | 1794-09-14 | 1794-09-14 | Quincy Septr.14. 1794\nMy dear Sons\nI once mo... | [-0.13977431, 0.04176549, 0.06941472, -0.06869... |
| 3 | From George Washington to Major General Horati... | https://founders.archives.gov/documents/Washin... | Washington Papers | [Washington, George] | [Gates, Horatio] | 1776-12-23 | 1776-12-23 | Head Quarters [Bucks County, Pa.] 23d Decr 177... | [0.006098511, 0.048442684, 0.046325486, -0.071... |
| 4 | [Diary entry: 5 July 1795] | https://founders.archives.gov/documents/Washin... | Washington Papers | [Washington, George] | [] | 1795-07-05 | 1795-07-05 | Could not find the main content | [0.041819364, -0.009509875, -0.019032704, -0.0... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | From John Adams to Boston Patriot, 4 November ... | https://founders.archives.gov/documents/Adams/... | Adams Papers | [Adams, John] | [Boston Patriot] | 1809-11-04 | 1809-11-04 | Quincy, November 4, 1809.\nSirs,\nIn my last l... | [-0.026789177, -0.004801429, 0.06341488, -0.11... |
| 996 | From John Adams to United States Senate, 14 Ma... | https://founders.archives.gov/documents/Adams/... | Adams Papers | [Adams, John] | [United States Senate] | 1798-03-14 | 1798-03-14 | United States March 14th 1798:\nGentlemen of t... | [-0.08672993, 0.036208663, 0.036546204, -0.038... |
| 997 | To Benjamin Franklin from William Henly, [Apri... | https://founders.archives.gov/documents/Frankl... | Franklin Papers | [Henly, William] | [Franklin, Benjamin] | 1772-04-01 | 1772-04-30 | Sunday Eve. [April?, 1772]\nDear Sir:\nI have ... | [-0.06445475, 0.060675066, 0.07255046, 0.07447... |
| 998 | From George Washington to Major General Alexan... | https://founders.archives.gov/documents/Washin... | Washington Papers | [Washington, George] | [McDougall, Alexander] | 1779-05-20 | 1779-05-20 | Head Quarters Middle Brook May 20th 1779\nDr S... | [0.0034233045, 0.039196797, 0.052478287, -0.07... |


At this stage, we now have all the data necessary to create a vector database! Before we do that, let's learn a little bit about vector databases.

# What are Vector Databases?

Vector databases are specialized storage and retrieval systems designed to handle high-dimensional vector data efficiently. Unlike traditional databases that work with structured data like numbers and text, vector databases are optimized for managing and querying vector embeddings - numerical representations of data points in a multi-dimensional space.

At their core, vector databases address the challenge of similarity search in large datasets. They excel at finding the most similar items to a given query, which is crucial for applications like recommendation systems, image recognition, and natural language processing. For instance, in a vector database storing product information, you could easily find similar products based on various attributes, all encoded as vectors.

The key advantage of vector databases lies in their ability to perform fast approximate nearest neighbor (ANN) searches. Traditional databases might struggle with the "curse of dimensionality" when dealing with high-dimensional data, but vector databases employ specialized indexing techniques to maintain efficiency. This makes them particularly useful for AI and machine learning applications, where data is often represented in high-dimensional vector spaces.

For those new to the concept, you can think of a vector database as a system that organizes information in a way that mirrors how our brains associate related concepts. Just as we can quickly recall words or images that are similar to a given prompt, vector databases can rapidly retrieve data points that are "close" to each other in a mathematical sense. This capability opens up exciting possibilities for creating more intelligent and intuitive data-driven applications across various domains.
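
To ground the idea of "closeness", here is a minimal sketch of the exact (brute-force) search that an approximate nearest neighbor index is designed to speed up. The three-dimensional vectors are invented for illustration, and the helper names `cosine_similarity` and `nearest_neighbors` are my own, not part of any library used in this lesson.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Exact search: score every stored vector, then keep the k best."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (made up for illustration).
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.8, 0.6, 0.0]

for idx, score in nearest_neighbors(query, vectors):
    print(idx, round(score, 3))
```

This approach compares the query against every stored vector, so its cost grows with the size of the collection. Vector databases avoid that full scan by using an index, usually at the price of results that are approximate rather than exact.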

# BM25

BM25 (Best Matching 25) is a ranking function used in information retrieval systems, particularly in search engines. It's an advanced form of TF-IDF (Term Frequency-Inverse Document Frequency) that provides a way to rank documents based on their relevance to a given search query.

BM25 improves upon simpler ranking methods by incorporating document length normalization. This means it can account for the fact that longer documents are more likely to contain a given term simply due to their length, rather than because of relevance.

The algorithm calculates a score for each document based on the query terms it contains. It considers both how often a term appears in a document (term frequency) and how rare the term is across all documents (inverse document frequency). However, it also applies a saturation function to term frequency, so that each additional occurrence of a term adds less to the score and heavily repeated terms do not dominate it.

For those new to information retrieval, BM25 can be thought of as a way of determining which documents in a collection are most relevant to a user's search query. It's widely used in practice due to its effectiveness and relatively simple implementation.
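
The pieces described above (term frequency, inverse document frequency, length normalization, and saturation) can be sketched in plain Python. This is a textbook-style version written for illustration. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are assumptions, and the `rank_bm25` library used later in this notebook may differ in details such as how it smooths the inverse document frequency.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a larger weight (the "+ 1" keeps the weight positive).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: documents longer than average are penalized via b.
            length_norm = 1 - b + b * len(doc) / avg_len
            # Saturation: extra occurrences add less and less, controlled by k1.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "the war office sent orders".split(),
    "the harvest was good this year".split(),
    "war war war and more war".split(),
]
print(bm25_scores(["war"], docs))
```

In this toy corpus, the third document mentions "war" four times but does not score four times higher than the first, which is the saturation effect at work.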

Now, let's look at how we can implement BM25 in Python using a DataFrame's content field.

To do that, we first need to tokenize our text. To keep things simple for this notebook, we will just be using `split()`. Normally, you would use something more reliable, like spaCy's tokenizer and perform some basic data cleaning, such as removing punctuation.
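
As a rough illustration of that kind of cleaning, here is a minimal sketch that uses the standard `re` module rather than spaCy. The helper `simple_tokenize` is hypothetical and is not used elsewhere in this notebook; the cells that follow continue with plain `split()`.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9'’]+", text.lower())

sample = (
    "On my return here from Bedford a few days ago, I found the Hutton "
    "and Requisite tables, bound to my mind."
)

# split() keeps punctuation attached to words; the regex version drops it.
print(sample.split()[-3:])
print(simple_tokenize(sample)[-3:])
```

Lowercasing means that `War` and `war` become the same token, which matters for keyword search because matching is otherwise case-sensitive.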

In [120]:
# tokenized_docs = [doc.split() for doc in df["content"]]

tokenized_docs = []
for doc in df["content"]:
    split_doc = doc.split()
    tokenized_docs.append(split_doc)

In [122]:
print(tokenized_docs[:1])

[['Monticello', 'Dec.', '22.', '15.', 'Dear', 'Sir', 'On', 'my', 'return', 'here', 'from', 'Bedford', 'a', 'few', 'days', 'ago,', 'I', 'found', 'the', 'Hutton', 'and', 'Requisite', 'tables,', 'bound', 'to', 'my', 'mind.', 'by', 'this', 'mail', 'I', 'send', 'you', 'an', 'Ovid’s', 'metamorphoses', 'almost', 'entirely', 'worne', 'out', '&', 'defaced,', 'yet', 'of', 'sovaluable', 'and', 'rareaneditionthat', 'I', 'wish', 'you', 'to', 'put', 'it', 'into', 'as', 'good', 'a', 'state', 'of', 'repair', 'as', 'it', 'is', 'susceptible', 'of.', 'by', 'the', 'next', 'mail', 'I', 'will', 'forward', 'a', 'Cornelius', 'Nepos', 'to', 'be', 'bound.', 'be', 'so', 'good', 'as', 'to', 'procure', 'and', 'forward', 'to', 'me', 'by', 'stage', 'the', 'underwritten', 'books.I', 'salute', 'you', 'with', 'friendship', '&', 'esteem', 'Th:', 'Jefferson', 'Ainsworth’sLat.', '&', 'Eng.', 'dict.', 'abridged.', 'to', 'be', 'bound[.', '.', '.]', 'the', 'Lat.', '&', 'Eng', 'in', 'one,', '&', 'the', 'Eng.', '&', 'Lat.[.', 

As we can see, our documents have now been naively tokenized. We can now pass this list of lists to `BM25Okapi` directly with a single line.

In [103]:
bm25_index = BM25Okapi(tokenized_docs)

This has now created a BM25 index for us! At this stage, we have essentially created a search engine index. To query it, we need to create a query, tokenize the query, and then use the `get_scores()` method to retrieve the results. It's also best practice to sort these based on the scores. Finally, we can retrieve the results from the original DataFrame. In the cell below, we will have all the code necessary to perform these operations. I've done this so that you can more easily test this out with multiple queries. To understand what's happening in the code, though, let's break down each step here.

## Query Definition and Tokenization

```python
query = "war"
tokenized_query = query.split()
```

- `query = "war"`: This line defines the search query. In this case, we're searching for documents related to "war".
- `tokenized_query = query.split()`: This line tokenizes the query string. The `split()` method without arguments splits the string on whitespace, creating a list of individual words. For our simple query, this results in `["war"]`. For multi-word queries, it would separate each word.

## Scoring Documents

```python
doc_scores = bm25_index.get_scores(tokenized_query)
```

- This line uses the pre-computed BM25 index (`bm25_index`) to score each document in the corpus based on the tokenized query.
- `get_scores()` method calculates a relevance score for each document in relation to the query.
- The result `doc_scores` is a NumPy array of floating-point numbers, where each number represents the relevance score of the corresponding document in the original corpus.

## Ranking Documents

```python
ranked_docs = sorted(enumerate(doc_scores), key=lambda x: x[1], reverse=True)[:5]
```

- `enumerate(doc_scores)`: This creates pairs of (index, score) for each document.
- `sorted(...)`: This sorts these pairs based on a specific key.
- `key=lambda x: x[1]`: This lambda function tells `sorted()` to use the score (the second item in each pair) as the sorting key.
- `reverse=True`: This sorts in descending order, so highest scores come first.
- `[:5]`: This slices the result to get only the top 5 results.
- The final `ranked_docs` is a list of tuples, where each tuple contains the document index and its score, sorted by score in descending order.
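
When only a handful of top results is needed, the standard library's `heapq.nlargest` is an alternative to sorting every score. The sketch below uses made-up scores rather than real BM25 output, and it is offered as an option, not as what the notebook's cell does.

```python
import heapq

doc_scores = [0.0, 2.31, 0.0, 5.76, 1.08, 0.0, 3.4]  # made-up scores

# Keep only the top 3 (index, score) pairs without sorting the whole list.
top_docs = heapq.nlargest(3, enumerate(doc_scores), key=lambda pair: pair[1])
print(top_docs)
```

For 1,000 documents the difference is negligible, but for very large collections this avoids a full sort.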

## Printing Results

```python
print(f"Search results for query: '{query}'")
print("------------------------------", "\n")
for idx, score in ranked_docs:
    print(f"Score: {score:.4f} - {df['content'][idx]}")
    print("-------------", "\n")
```

- The first `print()` statement displays the search query.
- The second `print()` creates a visual separator.
- The `for` loop iterates over the `ranked_docs`:
  - `idx` is the index of the document in the original DataFrame.
  - `score` is the BM25 relevance score for that document.
- Inside the loop:
  - We print the score formatted to 4 decimal places.
  - We retrieve and print the content of the document using `df['content'][idx]`.
  - We print another separator between results.


Spend a few minutes testing out different queries. What do you notice when you query `warfare`? What about `War` capitalized?

In [127]:
query = "War"
tokenized_query = query.split()
doc_scores = bm25_index.get_scores(tokenized_query)

# Sort documents by score
ranked_docs = sorted(enumerate(doc_scores), key=lambda x: x[1], reverse=True)[:5]

# Print results
print(f"Search results for query: '{query}'")
print("------------------------------", "\n")
for idx, score in ranked_docs:
    print(f"Score: {score:.4f} - {df['content'][idx]}")
    print("-------------", "\n")

Search results for query: 'War'
------------------------------ 

Score: 5.7643 - Havre de Grace June 17th 1800
Sir
My particular Situation will I trust plead my apology for this indirect channel of approach—Will you oblige me by directing the Secretary of War to suspend any operation upon my Letter of Resignation, addressed to Major General Pinckney, untill the arrival of Brigadier General Wilkinson, who is, I am informed, shortly expected in this quarter, or untill the state of my case shall have been candidly submitted to your observation—The Major General who has been apprised of my situation, writes me that my Letter is forwarded to the Secretary of War “to meet your final determination and pleasure”
With sentiments of real personal respect I have the honor to be, Sir—your most Obedt Servt
Campbell Smith
------------- 

Score: 5.3324 - 17 August 1813, “War Office.” “D. Parker has the honor to inform the President of the United States that nothing of moment has been received at the 

# Creating a Vector Database

Now that we have seen how to create a traditional keyword search index and even learned to query it, let's compare these results to a vector database. To create our database, we will be using `Annoy` from Spotify. Annoy is an approximate nearest neighbor library that lets us store millions of vectors in an index and then retrieve the closest ones very efficiently. Its core is written in C++, with Python bindings. To use Annoy, we first need to know the dimensions of our vectors. We can use `.shape` and examine index 1 to get our vector dimensions. Note, these will change from model to model.

In [46]:
vector_dim = embeddings.shape[1]
vector_dim

384

As we can see, we have a dimension of 384. Now that we know that, we can create our index, which we will populate with vectors. When creating an `AnnoyIndex`, we need to specify two things: the number of dimensions in each vector and the way in which we want to measure similarity. We have a few options here. We are using `angular`.

```
AnnoyIndex(f, metric) returns a new index that's read-write and stores vector of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot". - Annoy docs
```
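
The Annoy documentation describes the `angular` metric as the Euclidean distance between normalized vectors, which works out to `sqrt(2 * (1 - cos(u, v)))`. The sketch below computes that quantity in plain Python to show how it behaves; `angular_distance` is a helper name chosen here and is not part of Annoy.

```python
import math

def angular_distance(u, v):
    """Annoy-style angular distance: sqrt(2 * (1 - cosine similarity))."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny floating-point overshoots above 1.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

print(angular_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(angular_distance([1.0, 0.0], [0.0, 3.0]))   # perpendicular
print(angular_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The distance depends only on the direction of the vectors, not their length, and it ranges from 0 (same direction) to 2 (opposite directions).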

In [128]:
annoy_index = AnnoyIndex(vector_dim, 'angular')

Now that we have created our index, it's time to populate it with data. When we do this, we need to use the `add_item()` method. This will take two arguments: the index number and the embedding itself. We can use `enumerate()` to create our `i` variable that will tick up by one each time we loop over our data. This is the equivalent of doing `i=i+1` inside our loop.

In [129]:
# Add items to the index
for i, embedding in enumerate(embeddings):
    annoy_index.add_item(i, embedding)

At this stage, our index has the data but it is not yet built. To build it, we can use the `build()` method. This will take 1 argument, the number of trees we want to use. In theory, more trees give more accurate results, but they also make the index larger and slower to build, and there's a point of diminishing returns. Unless you are working with very complex data, `10` is usually a good starting number.

In [130]:
annoy_index.build(10)

True

Now that we have our index built, we can query it. Again, I'll write all the code in a single cell and explain it row by row in markdown.

## Query Definition and Encoding

```python
query = "warfare"
query_vector = model.encode(query)
```

- `query = "warfare"`: This line defines the search query. We're searching for documents related to "warfare".
- `query_vector = model.encode(query)`: This line uses the pre-trained `SentenceTransformer` model we loaded earlier to encode the query into a vector. The `encode()` method transforms the text query into a high-dimensional numerical vector that represents its semantic meaning.

## Searching the Annoy Index

```python
similar_item_ids = annoy_index.get_nns_by_vector(query_vector, 5)
```

- `annoy_index`: This is a pre-built Annoy (Approximate Nearest Neighbors Oh Yeah) index, which allows for efficient similarity search in vector space.
- `get_nns_by_vector(query_vector, 5)`: This method searches the Annoy index for the 5 nearest neighbors to our query vector.
  - The first argument is our encoded query vector.
  - The second argument (5) specifies that we want the top 5 most similar items.
- `similar_item_ids`: This variable stores the indices of the 5 most similar items found in the index.
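
Annoy's `get_nns_by_vector` also accepts `include_distances=True`, in which case it returns the ids together with their distances. With the `angular` metric those distances follow `sqrt(2 * (1 - cos))`, so they can be converted back into cosine similarities. The ids and distances below are invented placeholders, not output from this notebook's index, and `angular_to_cosine` is a hypothetical helper.

```python
def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0

# Hypothetical values shaped like the result of a call such as
# annoy_index.get_nns_by_vector(query_vector, 3, include_distances=True)
ids = [412, 87, 903]
distances = [0.95, 1.02, 1.10]

for doc_id, distance in zip(ids, distances):
    print(doc_id, round(angular_to_cosine(distance), 3))
```

A smaller angular distance corresponds to a higher cosine similarity, which can be easier to interpret when comparing results.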

## Retrieving Results from DataFrame

```python
df.iloc[similar_item_ids]
```

- This line uses the indices returned by Annoy to fetch the corresponding rows from our DataFrame `df`.
- `iloc[]` is used for integer-location based indexing.
- The result is a new DataFrame containing only the rows that match our search results.

## Printing Results

```python
print(f"Search results for query: '{query}'")
print("------------------------------", "\n")
for i, result in enumerate(df.iloc[similar_item_ids].content.tolist()):
    print(f"Result {i}")
    print(result.replace("\n", "\n\n"))
    print("--------")
```

- The first `print()` statement displays the search query.
- The second `print()` creates a visual separator.
- `df.iloc[similar_item_ids].content.tolist()`: This gets the 'content' column from our result rows and converts it to a list.
- The `for` loop iterates over this list of content:
  - `enumerate()` is used to get both the index and the content of each result.
  - `i` is the index (0-4 for our 5 results).
  - `result` is the content of each matching document.
- Inside the loop:
  - We print the result number.
  - We print the content, replacing single newlines with double newlines for better readability.
  - We print a separator between results.

In [131]:
query = "warfare"
query_vector = model.encode(query)
similar_item_ids = annoy_index.get_nns_by_vector(query_vector, 5)
df.iloc[similar_item_ids]
print(f"Search results for query: '{query}'")
print("------------------------------", "\n")
for i, result in enumerate(df.iloc[similar_item_ids].content.tolist()):
    print(f"Result {i}")
    print(result.replace("\n", "\n\n"))
    print("--------")

Search results for query: 'warfare'
------------------------------ 

Result 0
In Council16th. Novem: 1782.

Gentlemen

I have your favor by the last post, and think with you that it is problematical whether the British quit Charles Town or not, tho’ on the 25th. of last Month they had made such advances towards it that hopes are to be entertain’d of their being embarked before the countermanding orders arrive. If this should be the case, & they still entertain hopes of conquest in America may they not call on us, if they should, we were never less prepared, some demon or other certainly possessed us when we disposed of the ships that were prepared to bring over the arms & ammunition from France, both which are now so much wanted that if you do not think it improper you will do your Country a great Service by again pressing the Chevilier to use his Interest to have them brought over in a Frigate. The Scarcity of musket powder is so great in the State owing to our losses by the Enemy, an

Notice that querying for `warfare` returns results that are semantically related to warfare. This is because we are searching the dataset by vectors rather than by keywords, so the word `warfare` does not have to appear in a document for that document to be returned. Unfortunately, this can make the results (especially longer documents) difficult to interpret: what specifically in the text produced the match? A single vector can also dilute a document. Maybe the document discusses warfare, but only in one short passage, while the rest of it discusses farming. How do we handle problems like this? The answer comes down to chunking our data.

# Chunking

We have many ways to chunk our data: character-level, token-level, sentence-level, paragraph-level, etc. For many semantic search applications, sentence-level chunking works well. This is because sentences function as discrete syntactic units in a text. Unlike paragraphs, sentences are marked in a consistent way across texts in a given language. In English, a sentence usually ends with `.` (or with `?` or `!`). In chunking, we take an input document and break it into smaller pieces.
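
Punctuation is a useful signal, but it is not a perfect one. The sketch below is a deliberately naive splitter built on a regular expression. It is not how spaCy is implemented; it only illustrates why abbreviations and dates, which are common in historical letters, can produce very short "sentences". The function name `naive_sentence_split` is hypothetical, and `sample_text` is a made-up line modeled on the opening of a letter.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustration only)."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]

# Made-up sample in the style of a letter heading followed by a sentence
sample_text = "Monticello Dec. 22. 15. Dear Sir, by this mail I send you a book."

for piece in naive_sentence_split(sample_text):
    print(repr(piece))
```

Every period in the date ends a "sentence" here, so the heading is split into several fragments. Keep this in mind when a chunk in the results looks oddly short.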

We can perform this process as a sliding window, either with or without overlap. Overlap is the number of sentences that each chunk shares with the previous one. Let's assume we have 9 sentences in our document, labeled from sentence1 to sentence9. We'll demonstrate how chunking works with different levels of overlap.

## No Overlap

With no overlap and a chunk size of 3, our chunks would look like this:

chunk1: [sentence1, sentence2, sentence3]
chunk2: [sentence4, sentence5, sentence6]
chunk3: [sentence7, sentence8, sentence9]

## Overlap of 1

Now, let's see how it looks with an overlap of 1:

chunk1: [sentence1, sentence2, sentence3]
chunk2: [sentence3, sentence4, sentence5]
chunk3: [sentence5, sentence6, sentence7]
chunk4: [sentence7, sentence8, sentence9]

## Overlap of 2

Finally, here's how it would look with an overlap of 2:

chunk1: [sentence1, sentence2, sentence3]
chunk2: [sentence2, sentence3, sentence4]
chunk3: [sentence3, sentence4, sentence5]
chunk4: [sentence4, sentence5, sentence6]
chunk5: [sentence5, sentence6, sentence7]
chunk6: [sentence6, sentence7, sentence8]
chunk7: [sentence7, sentence8, sentence9]

In this last example with an overlap of 2, each chunk (except the first) includes the last two sentences of the previous chunk along with one new sentence. This creates a sliding window effect where each chunk shares a significant amount of content with its neighboring chunks, potentially preserving more context and continuity in the chunked data.
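
The three layouts above can be reproduced with a short sliding-window function. This is a minimal sketch rather than the chunking code used in this notebook: the name `sliding_chunks` and its `overlap` parameter are hypothetical, and it works on a plain list of strings instead of spaCy sentences.

```python
def sliding_chunks(sentences, chunk_size=3, overlap=0):
    """Group sentences into chunks of chunk_size that share `overlap` sentences."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + chunk_size])
        # Stop once the window has reached the end of the document
        if start + chunk_size >= len(sentences):
            break
    return chunks

example_sentences = [f"sentence{n}" for n in range(1, 10)]

for overlap in (0, 1, 2):
    print(f"overlap={overlap}")
    for number, chunk in enumerate(sliding_chunks(example_sentences, 3, overlap), start=1):
        print(f"  chunk{number}: {chunk}")
```

With 9 sentences and a chunk size of 3, this yields 3, 4, and 7 chunks for overlaps of 0, 1, and 2, matching the listings above.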

The choice of chunk size and overlap depends on the specific requirements of your application. A larger chunk size with some overlap can help maintain context across chunks, which can be beneficial for tasks that require understanding broader context. However, overlap also increases the number of chunks, which can impact processing time and storage requirements.

To chunk our data into sentences, we'll first need to separate the text into sentences. To do that, we will use spaCy.
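
To get a feel for this trade-off, the sketch below estimates how many chunks a sliding window produces for a single document. The helper `chunk_count` is hypothetical, and the 100-sentence document is an arbitrary example rather than a figure from our dataset. It assumes the overlap is smaller than the chunk size.

```python
import math

def chunk_count(num_sentences, chunk_size, overlap):
    """Number of sliding-window chunks for one document (assumes overlap < chunk_size)."""
    if num_sentences <= 0:
        return 0
    if num_sentences <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((num_sentences - overlap) / step)

# Arbitrary example: one document of 100 sentences, chunk size 3
for overlap in (0, 1, 2):
    print(f"overlap={overlap}: {chunk_count(100, 3, overlap)} chunks")
```

Each chunk needs its own embedding and its own entry in the index, so moving from no overlap to an overlap of 2 roughly triples the work for this example.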

In [77]:
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


<spacy.pipeline.sentencizer.Sentencizer at 0x3c6322800>

With this pipeline we can now chunk our data. In the cell below, we will create a new dataset from our original one while preserving the original document index. This is so that we can map our chunks back to the original data. We also preserve the subindex of each chunk within its document. This means that for any given row in our chunked data, we know both the document it belongs to and its position within that document.

In [132]:
# Function to chunk text into groups of 3 sentences
def chunk_text(text, chunk_size=3):
    doc = nlp(text)
    sentences = list(doc.sents)
    chunks = []
    for i in range(0, len(sentences), chunk_size):
        chunk = sentences[i:i+chunk_size]
        chunks.append(" ".join([sent.text for sent in chunk]))
    return chunks

# Create chunks
chunked_data = []
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Chunking texts"):
    chunks = chunk_text(row['content'])
    for chunk_idx, chunk in enumerate(chunks):
        chunk_data = row.to_dict()
        chunk_data['content'] = chunk
        chunk_data['document_index'] = idx
        chunk_data['chunk_index'] = chunk_idx
        chunked_data.append(chunk_data)

Chunking texts: 100%|██████████| 1000/1000 [00:01<00:00, 800.75it/s]


Once chunked, we can then create a new dataset with our chunked data.

In [79]:
# Create new DataFrame with chunks
chunked_df = pd.DataFrame(chunked_data)
chunked_df

Unnamed: 0,title,permalink,project,authors,recipients,date-from,date-to,content,embedding,document_index,chunk_index
0,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,Monticello Dec. 22. 15. \nDear Sir\nOn my retu...,"[-0.11585661, -0.031811137, 0.05453429, -0.047...",0,0
1,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,by this mail I send you an Ovid’s metamorphose...,"[-0.11585661, -0.031811137, 0.05453429, -0.047...",0,1
2,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,I salute you with friendship & esteem\nTh: Jef...,"[-0.11585661, -0.031811137, 0.05453429, -0.047...",0,2
3,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,abridged. to be bound[. . .] \nthe Lat. &,"[-0.11585661, -0.031811137, 0.05453429, -0.047...",0,3
4,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,"Eng in one, & the Eng. & Lat.[. . .] \nOvid’s ...","[-0.11585661, -0.031811137, 0.05453429, -0.047...",0,4
...,...,...,...,...,...,...,...,...,...,...,...
4562,"To Benjamin Franklin from William Henly, [Apri...",https://founders.archives.gov/documents/Frankl...,Franklin Papers,"[Henly, William]","[Franklin, Benjamin]",1772-04-01,1772-04-30,Believe me Dear Sir to be with real regard you...,"[-0.064454764, 0.060675174, 0.072550446, 0.074...",997,6
4563,From George Washington to Major General Alexan...,https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]","[McDougall, Alexander]",1779-05-20,1779-05-20,Head Quarters Middle Brook May 20th 1779\nDr S...,"[0.0034233108, 0.03919683, 0.05247831, -0.0776...",998,0
4564,"From Thomas Jefferson to João, Prince Regent o...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[João, Prince Regent of Portugal]",1801-10-12,1801-10-12,"To our Great and Good Friend, His Royal Highne...","[-0.062964484, 0.11988714, 0.085458025, -0.058...",999,0
4565,"From Thomas Jefferson to João, Prince Regent o...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[João, Prince Regent of Portugal]",1801-10-12,1801-10-12,Whilst under this mournful visitation we mingl...,"[-0.062964484, 0.11988714, 0.085458025, -0.058...",999,1


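Unfortunately, though, the embeddings are identical for every chunk of the same document, because each chunk inherited the embedding of its parent document. This isn't good. We want a new embedding for each chunk. Let's go ahead and repeat the steps from earlier in the notebook. This will take slightly longer, as we now have roughly 4.5 times as many items to encode (4,566 chunks instead of 1,000 documents).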

In [133]:
# Vectorize the chunks
chunks = chunked_df['content'].tolist()
embeddings = model.encode(chunks, show_progress_bar=True)

Batches:   0%|          | 0/143 [00:00<?, ?it/s]

Now, let's go ahead and replace the original embeddings with our new ones!

In [134]:
# Add embeddings to the chunked DataFrame
chunked_df['embedding'] = list(embeddings)

We will again repeat the same steps as above, this time building a new Annoy index from the chunk embeddings.

In [135]:
annoy_index = AnnoyIndex(vector_dim, 'angular')
# Add items to the index
for i, embedding in enumerate(embeddings):
    annoy_index.add_item(i, embedding)
annoy_index.build(10)

True

Let's again query our new index.

In [138]:
query = "warfare"
query_vector = model.encode(query)
similar_item_ids = annoy_index.get_nns_by_vector(query_vector, 5)
chunked_df.iloc[similar_item_ids]
print(f"Search results for query: '{query}'")
print("------------------------------", "\n")
for i, result in enumerate(chunked_df.iloc[similar_item_ids].content.tolist()):
    print(f"Result {i}")
    print(result.replace("\n", "\n\n"))
    print("--------")

Search results for query: 'warfare'
------------------------------ 

Result 0
an alliance offensive & defensive is concluded, & wh. embarks her in the war of course agnst. Engld.; &
--------
Result 1
The Weapons employed in this War are law Suits, doors and locks and bolts. To neglect to employ these weapons is to forfeit those blessings. Nations in like manner can exist <on> with all the proragatives of nations only by War.
--------
Result 2
Wars, among them. Captives. 

A Right to destroy them, if necessary to secure themselves.
--------
Result 3
Id.

1. to enter into war & peace with foreign powers 2 to enter into alliances with foreign powers and with one another, Not prejudicial to their engagements to the Empire Code d’Hum—3 to make laws, levy taxes, raisetroops, to determine on life & death. Savage.
--------
Result 4
Painfull. Humanity, common Justice, and eternal Morality. 

Conquest and Rights of War.
--------


Notice that our results are much more targeted. Each result is now a single chunk of up to three sentences rather than an entire document, so it is easier to see which passage matched the query.
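
Because each chunk carries its `document_index` and `chunk_index`, a short hit can be traced back to its letter and shown with its neighboring chunks. The sketch below is one hypothetical way to do this with a list of dictionaries shaped like `chunked_data`. The function `chunk_with_context`, the `window` parameter, and the toy records are assumptions made for illustration; the field names `content`, `document_index`, and `chunk_index` are the ones created earlier.

```python
def chunk_with_context(chunk_records, position, window=1):
    """Return the hit chunk joined with up to `window` neighbors from the same document."""
    hit = chunk_records[position]
    neighbors = [
        record for record in chunk_records
        if record["document_index"] == hit["document_index"]
        and abs(record["chunk_index"] - hit["chunk_index"]) <= window
    ]
    # Keep the chunks in their original reading order
    neighbors.sort(key=lambda record: record["chunk_index"])
    return " ".join(record["content"] for record in neighbors)

# Made-up records that mimic the structure of chunked_data
toy_records = [
    {"document_index": 0, "chunk_index": 0, "content": "First chunk of letter A."},
    {"document_index": 0, "chunk_index": 1, "content": "Second chunk of letter A."},
    {"document_index": 0, "chunk_index": 2, "content": "Third chunk of letter A."},
    {"document_index": 1, "chunk_index": 0, "content": "Only chunk of letter B."},
]

print(chunk_with_context(toy_records, 2))
print(chunk_with_context(toy_records, 3))
```

In this notebook the ids returned by Annoy correspond to positions in the list of chunks, since the embeddings were added to the index in that order, so an id from `similar_item_ids` could be passed as `position` with `chunked_data` as the records.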