# A Vector Database of Classic English Literature

In the [previous notebook](https://github.com/tommyliphysics/tommyli-ml/blob/main/literature_vdb/notebooks/create.ipynb) we created a vector database of English texts downloaded from [Project Gutenberg](https://www.gutenberg.org). In this notebook we load that database, add new texts to it and run a vector search. To do this we create a class called LiteratureSearch with the methods add(), save(), search() and close().

In [1]:
!pip install pyspark sentence_transformers faiss-cpu

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Collecting sentence_transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-

In [2]:
from pyspark.sql import SparkSession
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import faiss

import gc

import urllib.request
import regex

def read_book(url):
    """Download a Project Gutenberg text and split it into blocks of text.

    Returns (samples, book_title, author), or (None, None, None) on failure.
    """
    try:
        response = urllib.request.urlopen(url)
    except Exception as e:
        print(e)
        return None, None, None

    # the with-block closes the connection even if reading or decoding fails
    with response:
        content = response.read().decode("utf-8")

    # each metadata value runs up to the first carriage return or newline
    book_title = regex.findall(r"Title: (.*?)[\r\n]", content)
    author = regex.findall(r"Author: (.*?)[\r\n]", content)

    if len(book_title) == 0:
        print("Could not find name of book.")
        book_title = None
    else:
        book_title = book_title[0]
        print("Book title: ", book_title)

    if len(author) == 0:
        print("Could not find name of author.")
        author = None
    else:
        author = author[0]
        print("author: ", author)

    # keep only the text between the Project Gutenberg START and END markers
    if len(regex.findall(r'[\*]+[\s]+START OF THE PROJECT GUTENBERG EBOOK', content)) > 0:
        content = regex.split(r'[\*]+[\s]+START OF THE PROJECT GUTENBERG EBOOK[^\*]+[\*]+', content)[1]
        if len(regex.findall(r'[\*]+[\s]+END OF THE PROJECT GUTENBERG EBOOK', content)) > 0:
            content = regex.split(r'[\*]+[\s]+END OF THE PROJECT GUTENBERG EBOOK[^\*]+[\*]+', content)[0]

    # blocks are separated by three or more consecutive line-break characters
    blocks = regex.split(r'[\n\r]{3,}', content)

    samples = []
    for block in blocks:
        # collapse line breaks and runs of whitespace into single spaces
        block = regex.sub(r'[\s]+', ' ', block)
        if len(block) > 0:
            # author and book_title are plain strings (or None) at this point,
            # so they are stored whole rather than indexed
            samples.append({'author': author, 'title': book_title, 'text': block})

    return samples, book_title, author

class LiteratureSearch:
    def __init__(self, books_fn, embeddings_fn, index_fn):
        # load the text as a Spark DataFrame
        self.spark = SparkSession.builder.appName("Read").getOrCreate()
        self.df = self.spark.read.csv(books_fn, header=True, inferSchema=True)
        self.df.show()

        # get the distinct (author, title) pairs in a single Spark job
        pairs = self.df.select('author', 'title').distinct().collect()
        self.author_list = [{'author': row['author'], 'title': row['title']} for row in pairs]

        print("Found titles:")
        for row in self.author_list:
            print(f"{row['title']}\tby {row['author']}")
        self.author_list = pd.DataFrame(self.author_list, columns=['author', 'title'])

        # load the sentence transformer model
        self.sentence_transformer_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
        # load embeddings
        self.embeddings = np.load(embeddings_fn)
        # load search index
        self.index_L2 = faiss.read_index(index_fn)

        # next free id, assuming the existing ids run from 0 to count - 1
        self.max_id = self.df.count()

    def add(self, url):
        result = read_book(url)
        # read_book signals failure by returning None in place of the samples
        if result is None or not result[0]:
            print(f"Error: could not read any text from {url}.")
            return -1
        new_samples, title, author = result

        # refuse to add a book whose (author, title) pair is already present
        known = self.author_list
        if ((known['author'] == author) & (known['title'] == title)).any():
            print(f"Error: {title} by {author} already exists in the collection.")
            return -1

        for sample in new_samples:
            sample['id'] = self.max_id
            self.max_id += 1
        new_samples_df = self.spark.createDataFrame(new_samples)
        # match columns by name rather than by position
        self.df = self.df.unionByName(new_samples_df)
        print(f"Added {len(new_samples)} new samples to the pyspark dataframe.")
        # record the new book so that adding it again in this session is rejected
        self.author_list = pd.concat(
            [known, pd.DataFrame([{'author': author, 'title': title}])], ignore_index=True)

        new_texts = [sample['text'] for sample in new_samples]
        new_embeddings = self.sentence_transformer_model.encode(new_texts)
        self.embeddings = np.concatenate((self.embeddings, new_embeddings))
        print(f"Added new embeddings of shape {new_embeddings.shape}. New shape: {self.embeddings.shape}")
        self.index_L2.add(new_embeddings)
        print(f"Added to index. New index size: {self.index_L2.ntotal}")

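    def save(self, books_fn, embeddings_fn, index_fn):
        # Spark writes books_fn as a directory containing a single CSV part file
        self.df.repartition(1).write.mode('overwrite').csv(books_fn, header=True)
        # write the embeddings and index to the paths passed in
        np.save(embeddings_fn, self.embeddings)
        faiss.write_index(self.index_L2, index_fn)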

    def search(self, query_text, k):
        query_vector = self.sentence_transformer_model.encode(query_text)
        distances, sorted_ids = self.index_L2.search(np.array([query_vector]), k)

        sorted_ids = sorted_ids[0].tolist()
        results = self.df.filter(self.df.id.isin(sorted_ids)).toPandas()
        results['result'] = results['id'].apply(lambda x: sorted_ids.index(x)+1)
        return results.sort_values(by='result').to_dict(orient='records')

    def close(self):
        del self.index_L2
        del self.embeddings
        self.spark.stop()
        print("Stopped Spark session.")
        gc.collect()



Let's create an instance of LiteratureSearch, which loads the text as a Spark DataFrame together with the embeddings and the FAISS search index.

In [3]:
litsearch = LiteratureSearch('books.csv', 'embeddings.npy', 'index_L2.index')



+---------------+---+--------------------+--------------------+
|         author| id|                text|               title|
+---------------+---+--------------------+--------------------+
|Charles Dickens|  0|A TALE OF TWO CITIES|A Tale of Two Cities|
|Charles Dickens|  1|A STORY OF THE FR...|A Tale of Two Cities|
|Charles Dickens|  2|  By Charles Dickens|A Tale of Two Cities|
|Charles Dickens|  3|            CONTENTS|A Tale of Two Cities|
|Charles Dickens|  4|Book the First--R...|A Tale of Two Cities|
|Charles Dickens|  5|CHAPTER I The Per...|A Tale of Two Cities|
|Charles Dickens|  6|Book the Second--...|A Tale of Two Cities|
|Charles Dickens|  7|CHAPTER I Five Ye...|A Tale of Two Cities|
|Charles Dickens|  8|Book the Third--t...|A Tale of Two Cities|
|Charles Dickens|  9|CHAPTER I In Secr...|A Tale of Two Cities|
|Charles Dickens| 10|Book the First--R...|A Tale of Two Cities|
|Charles Dickens| 11|CHAPTER I. The Pe...|A Tale of Two Cities|
|Charles Dickens| 12|It was the best o..


Let's add a few more books to the database:

In [4]:
litsearch.add('https://www.gutenberg.org/cache/epub/730/pg730.txt')

Book title:  Oliver Twist
author:  Charles Dickens
Added 3893 new samples to the pyspark dataframe.
Added new embeddings of shape (3893, 384). New shape: (10838, 384)
Added to index. New index size: 10838


In [5]:
litsearch.add('https://www.gutenberg.org/cache/epub/766/pg766.txt')

Book title:  David Copperfield
author:  Charles Dickens
Added 7192 new samples to the pyspark dataframe.
Added new embeddings of shape (7192, 384). New shape: (18030, 384)
Added to index. New index size: 18030


We can now save the database to file.

In [6]:
litsearch.save('books.csv', 'embeddings.npy', 'index_L2.index')

Let's now close this instance, load a new one and check that our new books are in the database.

In [7]:
litsearch.close()

Stopped Spark session.


In [8]:
litsearch_new = LiteratureSearch('books.csv', 'embeddings.npy', 'index_L2.index')

+---------------+---+--------------------+--------------------+
|         author| id|                text|               title|
+---------------+---+--------------------+--------------------+
|Charles Dickens|  0|A TALE OF TWO CITIES|A Tale of Two Cities|
|Charles Dickens|  1|A STORY OF THE FR...|A Tale of Two Cities|
|Charles Dickens|  2|  By Charles Dickens|A Tale of Two Cities|
|Charles Dickens|  3|            CONTENTS|A Tale of Two Cities|
|Charles Dickens|  4|Book the First--R...|A Tale of Two Cities|
|Charles Dickens|  5|CHAPTER I The Per...|A Tale of Two Cities|
|Charles Dickens|  6|Book the Second--...|A Tale of Two Cities|
|Charles Dickens|  7|CHAPTER I Five Ye...|A Tale of Two Cities|
|Charles Dickens|  8|Book the Third--t...|A Tale of Two Cities|
|Charles Dickens|  9|CHAPTER I In Secr...|A Tale of Two Cities|
|Charles Dickens| 10|Book the First--R...|A Tale of Two Cities|
|Charles Dickens| 11|CHAPTER I. The Pe...|A Tale of Two Cities|
|Charles Dickens| 12|It was the best o..

We see that we have successfully updated the database. Let's now perform a vector search.

In [9]:
litsearch_new.search("Let him cook!", 10)

[{'author': 'Charles Dickens',
  'id': 11486,
  'text': '‘Can you cook this young gentleman’s breakfast for him, if you please?’ said the Master at Salem House.',
  'title': 'David Copperfield',
  'result': 1},
 {'author': 'Charles Dickens',
  'id': 3247,
  'text': 'Must they! Let them not hope to taste it!',
  'title': 'Great Expectations',
  'result': 2},
 {'author': 'Charles Dickens',
  'id': 11435,
  'text': '‘What have we got here?’ he said, putting a fork into my dish. ‘Not chops?’',
  'title': 'David Copperfield',
  'result': 3},
 {'author': 'Charles Dickens',
  'id': 14122,
  'text': 'What with the novelty of this cookery, the excellence of it, the bustle of it, the frequent starting up to look after it, the frequent sitting down to dispose of it as the crisp slices came off the gridiron hot and hot, the being so busy, so flushed with the fire, so amused, and in the midst of such a tempting noise and savour, we reduced the leg of mutton to the bone. My own appetite came back miraculously. I am ashamed to record it, but I really believe I forgot Dora 
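
The ranking step inside search() can be illustrated in isolation. FAISS returns ids ordered from nearest to farthest, but the Spark filter does not guarantee any row order, so the rank is recovered from each id's position in the id list. The sketch below uses plain Python lists and made-up rows (the ids and texts are hypothetical) in place of the FAISS index and the Spark DataFrame.

```python
# Hypothetical ids, ordered nearest-first as an index search would return them.
sorted_ids = [42, 7, 19]

# Hypothetical rows as a filter on those ids might return them:
# the same records, but in storage order rather than rank order.
rows = [
    {"id": 7, "text": "second closest"},
    {"id": 19, "text": "third closest"},
    {"id": 42, "text": "closest"},
]

# Map each id to its 1-based rank once, instead of calling list.index per row.
rank_of = {doc_id: rank for rank, doc_id in enumerate(sorted_ids, start=1)}

for row in rows:
    row["result"] = rank_of[row["id"]]

# Sorting by the rank restores the nearest-first order.
ranked = sorted(rows, key=lambda row: row["result"])
print([row["text"] for row in ranked])
```

The dictionary lookup gives the same ranks as sorted_ids.index(x) + 1 and avoids rescanning the list for every row, which only starts to matter for large k.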

In this series of notebooks we've seen how to create a vector database using a combination of PySpark, a sentence transformer and FAISS. Another option is to use the vector search capabilities built into cloud database services, e.g. Amazon OpenSearch Service or MongoDB Atlas. The procedure there is similar to the one described here, but with a few differences: rather than storing the text and embeddings in separate files, one can store each embedding alongside its text as attributes within the same collection. The search index is then created by the cloud service, and a vector search is performed via its API.
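
As a rough illustration of the single-collection layout, the sketch below keeps each passage's text and embedding in one record and runs a brute-force L2 search over the records in plain Python. It is an in-memory stand-in and not the API of any cloud service: the Record class, the vector_search function, the authors, titles and toy 3-dimensional embeddings are all assumptions made for the example. A managed service would typically replace the linear scan with its own search index.

```python
from dataclasses import dataclass
from typing import List
import math


@dataclass
class Record:
    """One document in a hypothetical collection: text and embedding together."""
    id: int
    author: str
    title: str
    text: str
    embedding: List[float]


def vector_search(collection: List[Record], query: List[float], k: int) -> List[Record]:
    """Return the k records closest to the query by Euclidean (L2) distance."""
    by_distance = sorted(collection, key=lambda rec: math.dist(rec.embedding, query))
    return by_distance[:k]


# Toy 3-dimensional embeddings; the model used in this notebook produces 384.
collection = [
    Record(0, "Author A", "Book X", "a passage about cooking", [0.9, 0.1, 0.0]),
    Record(1, "Author A", "Book X", "a passage about sailing", [0.0, 0.8, 0.2]),
    Record(2, "Author B", "Book Y", "a passage about dinner", [0.8, 0.2, 0.1]),
]

hits = vector_search(collection, query=[1.0, 0.0, 0.0], k=2)
for rank, rec in enumerate(hits, start=1):
    print(rank, rec.title, rec.text)
```

Because the text travels with its embedding, a hit can be returned directly, with no separate lookup by id as in the search() method above.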