# A Vector Database of Classic English Literature

In this notebook I'll show how to create a vector database of passages from classic authors. We will start by downloading a copy of three books by Charles Dickens from <a href="https://www.gutenberg.org">Project Gutenberg</a> and breaking the text down into paragraph-long passages using regex.

In [1]:
import urllib.request
import regex

url_list = ['https://www.gutenberg.org/cache/epub/98/pg98.txt',
            'https://www.gutenberg.org/cache/epub/24022/pg24022.txt',
            'https://www.gutenberg.org/cache/epub/1400/pg1400.txt']

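def read_book(url):
    """Download a Project Gutenberg text and split it into paragraph-long passages.

    Returns a list of dicts with 'author', 'title' and 'text' keys, or None on failure.
    """
    try:
        # The context manager closes the connection even if reading or decoding fails.
        with urllib.request.urlopen(url) as response:
            content = response.read().decode("utf-8")
    except Exception as e:
        print(e)
        return None

    # The file header contains lines such as "Title: ..." and "Author: ...".
    book_title = regex.findall(r"Title: (.*?)[\r\n]", content)
    author = regex.findall(r"Author: (.*?)[\r\n]", content)

    if len(book_title) == 0:
        print("Could not find name of book.")
        book_title = None
    else:
        book_title = book_title[0]
        print("Book title: ", book_title)

    if len(author) == 0:
        print("Could not find name of author.")
        author = None
    else:
        author = author[0]
        print("author: ", author)

    # Split on asterisk-delimited markers such as "*** START OF ... ***";
    # the book text follows the first marker.
    content = regex.split(r'[\*]+[^\*]+[\*]+', content)
    if len(content) < 2:
        print("Error: could not read book.")
        return None
    # A run of three or more line-break characters marks a paragraph boundary.
    blocks = regex.split(r'[\n\r]{3,}', content[1])

    samples = []
    for block in blocks:
        # Collapse line breaks and runs of whitespace into single spaces.
        block = regex.sub(r'\s+', ' ', block)
        if len(block) > 0:
            samples.append({'author': author, 'title': book_title, 'text': block})

    return samples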
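
To see how the two splitting steps behave, here is a minimal sketch on a made-up string that mimics the layout of a Project Gutenberg file. It uses the standard-library `re` module, which accepts the same patterns as the function above; the toy text and the name `split_passages` are invented for illustration.

```python
import re

def split_passages(raw):
    """Return the paragraph-long passages that follow the first *** marker."""
    # Split on asterisk-delimited markers such as "*** START ... ***".
    parts = re.split(r'[\*]+[^\*]+[\*]+', raw)
    if len(parts) < 2:
        return []
    # A run of three or more line-break characters marks a paragraph boundary.
    blocks = re.split(r'[\n\r]{3,}', parts[1])
    # Collapse whitespace runs into single spaces and drop empty blocks.
    passages = [re.sub(r'\s+', ' ', block) for block in blocks]
    return [p for p in passages if len(p) > 0]

# Invented text with Windows-style line endings and hard-wrapped lines.
toy_text = (
    "Title: A Toy Book\r\n"
    "*** START OF THE TOY BOOK ***\r\n\r\n"
    "First paragraph,\r\nwrapped over two lines.\r\n\r\n"
    "Second paragraph.\r\n\r\n"
    "*** END OF THE TOY BOOK ***\r\n"
)

passages = split_passages(toy_text)
print(passages)
# ['First paragraph, wrapped over two lines.', 'Second paragraph.']
```

A single `\r\n` inside a paragraph is only two characters, so it does not trigger a split and is turned into a space instead.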

In [2]:
import time

samples = []

for url in url_list:
    new_samples = read_book(url)
    if new_samples is None:
        print("No text added.")
    else:
        samples.extend(new_samples)
        print(f"Added {len(new_samples)} new passages.")
    time.sleep(15)  # pause between downloads to avoid overloading the server

Book title:  A Tale of Two Cities
author:  Charles Dickens
Added 2236 new passages.
Book title:  A Christmas Carol
author:  Charles Dickens
Added 794 new passages.
Book title:  Great Expectations
author:  Charles Dickens
Added 3915 new passages.


In [3]:
samples[100]['text']

'Yet even when his eyes were opened on the mist and rain, on the moving patch of light from the lamps, and the hedge at the roadside retreating by jerks, the night shadows outside the coach would fall into the train of the night shadows within. The real Banking-house by Temple Bar, the real business of the past day, the real strong rooms, the real express sent after him, and the real message returned, would all be there. Out of the midst of them, the ghostly face would rise, and he would accost it again.'

We are now ready to create a vector database. Using a Hugging Face sentence-transformers model, we will convert each text passage into a vector embedding. We'll be using all-MiniLM-L6-v2, which maps each passage to a 384-dimensional vector.

In [4]:
texts = [sample['text'] for sample in samples]

In [5]:
#!pip install sentence_transformers

In [6]:
from sentence_transformers import SentenceTransformer

sentence_transformer_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



In [7]:
embeddings = sentence_transformer_model.encode(texts)
embeddings.shape


To facilitate the vector search, we will create a search index for the embeddings using FAISS (Facebook Artificial Intelligence Similarity Search).

In [None]:
#!pip install faiss-cpu

In [None]:
import faiss
import numpy as np

index_L2 = faiss.IndexFlatL2(embeddings.shape[-1])
index_L2.add(embeddings)
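
`IndexFlatL2` performs an exhaustive search: it compares the query with every stored vector and ranks them by squared Euclidean (L2) distance. The following is a minimal sketch of that idea in plain Python on made-up 2D vectors. The function `brute_force_l2_search` and the toy data are hypothetical, and FAISS itself relies on optimized native code rather than a Python loop like this one.

```python
def brute_force_l2_search(vectors, query, k):
    """Return (squared_distance, index) pairs for the k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query.
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, idx))
    # Smallest distance first; ties are broken by the lower index.
    scored.sort()
    return scored[:k]

# Invented 2D "embeddings"; real embeddings have many more dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
nearest = brute_force_l2_search(toy_vectors, [0.9, 0.1], k=2)
print([idx for _, idx in nearest])
# [1, 0]
```

The cost of this approach grows linearly with the number of stored vectors, which is acceptable for a few thousand passages.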

In [None]:
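def vector_search(query_text, k):
    # Encode the query with the same model that produced the passage embeddings.
    query_vector = sentence_transformer_model.encode(query_text)
    # FAISS expects a 2D array of queries and returns one row of results per query.
    distances, sorted_ids = index_L2.search(np.array([query_vector]), k)
    # FAISS pads the ids with -1 when fewer than k results are available.
    results = [{'author': samples[i]['author'],
                'title': samples[i]['title'],
                'text': samples[i]['text']} for i in sorted_ids[0] if i >= 0]
    return results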

In [None]:
res = vector_search("He felt sad.", 10)
for i in range(len(res)):
    record = res[i]
    print(f"Result {i}:\n\tauthor: {record['author']}\n\ttitle: {record['title']}\n\ttext: {record['text']}")

In [None]:
res = vector_search("He eats something he does not like in the least.", 10)
for i in range(len(res)):
    record = res[i]
    print(f"Result {i}:\n\tauthor: {record['author']}\n\ttitle: {record['title']}\n\ttext: {record['text']}")

In [None]:
res = vector_search("I had an interesting dream last night.", 10)
for i in range(len(res)):
    record = res[i]
    print(f"Result {i}:\n\tauthor: {record['author']}\n\ttitle: {record['title']}\n\ttext: {record['text']}")

We can now see the utility of performing a vector search on text data. The top result for the query "He felt sad." was a passage in which Dickens poetically describes someone who felt sad without ever using the words "he", "felt" or "sad." In fact, the word "sad" does not occur in any of the search results, yet the passages still convey sadness, expressed in Dickens' celebrated literary style. The other queries behave similarly: for "He eats something he does not like in the least.", the returned passages do not contain the word "eat" or "ate" but still describe this situation, and for "I had an interesting dream last night.", some of the search results do not contain the word "dream" but describe night-time fantasies in the first person.
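
One way to check an observation like this is to compare the words of the query with the words of each returned passage. Below is a small sketch of such a check. The helper `shared_words` is hypothetical, and the example passage is invented rather than taken from the search results.

```python
import re

def shared_words(query, passage):
    """Return the set of lowercase words that appear in both query and passage."""
    def tokenize(text):
        # Whole words only, so "the" does not count as a match for "he".
        return set(re.findall(r"[a-z']+", text.lower()))
    return tokenize(query) & tokenize(passage)

query = "He felt sad."
passage = "A heavy gloom settled upon the old man's heart."  # invented example
overlap = shared_words(query, passage)
print(overlap)  # an empty set means no query word occurs in the passage
```

Applying the helper to each `record['text']` returned by `vector_search` would show which results a plain keyword search could also have found.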

In [None]:
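# Give each passage an integer id equal to its position in the list,
# which is also its position in the FAISS index.
for i, sample in enumerate(samples):
    sample['id'] = i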

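We can now save the downloaded passages as a Spark DataFrame and write it out in CSV format. Note that Spark creates a directory named books.csv containing the output as a part file, rather than a single file with that name; repartition(1) ensures there is only one part file.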

In [None]:
#!pip install pyspark

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SaveBooks").getOrCreate()

df = spark.createDataFrame(samples)
df.show()

In [None]:
df.repartition(1).write.csv("books.csv", header=True)
spark.stop()
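
If a single CSV file is preferred to Spark's output directory, the standard-library `csv` module is one alternative. The sketch below writes a couple of made-up records shaped like the entries of `samples` and reads them back. The file name `books_single.csv` and the helper `write_samples_csv` are assumptions for illustration and are not part of this notebook's pipeline.

```python
import csv
import os
import tempfile

def write_samples_csv(records, path):
    """Write a list of passage dicts to one CSV file with a header row."""
    fieldnames = ['id', 'author', 'title', 'text']
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

# Invented records with the same keys as the entries of `samples`.
toy_samples = [
    {'id': 0, 'author': 'Author A', 'title': 'Book A', 'text': 'A passage, with a comma.'},
    {'id': 1, 'author': 'Author A', 'title': 'Book A', 'text': 'A "quoted" passage.'},
]

out_path = os.path.join(tempfile.mkdtemp(), 'books_single.csv')
write_samples_csv(toy_samples, out_path)

# Read the file back to confirm that commas and quotes survive the round trip.
with open(out_path, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['text'])
# A passage, with a comma.
```

Values read back with `csv.DictReader` are strings, so the `id` column would need converting to `int` before being used to index into the list of passages.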

We will also save the embeddings and the search index.

In [None]:
np.save('embeddings.npy', embeddings)

In [None]:
faiss.write_index(index_L2, "index_L2.index")

In the [next notebook](https://github.com/tommyliphysics/tommyli-ml/blob/main/literature_vdb/notebooks/read_update.ipynb) we will look at how to load the text, embeddings and search index to perform search and add more books to the database.