# A Vector Database of Classic English Literature

In the [previous notebook](https://github.com/tommyliphysics/tommyli-ml/blob/main/literature_vdb/notebooks/create.ipynb) we created a vector database of English texts downloaded from [Project Gutenberg](https://www.gutenberg.org). In this notebook we load that database, run vector searches against it, and add new texts. To do this we define a class called LiteratureSearch with the methods add(), save() and search().

In [1]:
#!pip install pyspark

In [21]:
from pyspark.sql import SparkSession
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import faiss

import urllib.request
import regex

def read_book(url):
    """Download a Project Gutenberg text and split it into paragraph samples.

    Returns a list of dicts with the keys 'author', 'title' and 'text',
    or an empty list if the download fails.
    """
    try:
        # The context manager closes the connection even if decoding fails.
        with urllib.request.urlopen(url) as response:
            content = response.read().decode("utf-8")
    except Exception as e:
        print(e)
        return []

    # The file header contains lines of the form "Title: ..." and "Author: ...".
    book_title = regex.findall(r"Title: (.*?)[\n\r]", content)
    author = regex.findall(r"Author: (.*?)[\n\r]", content)

    if len(book_title) == 0:
        print("Could not find name of book.")
        book_title = None
    else:
        book_title = book_title[0]
        print("Book title: ", book_title)

    if len(author) == 0:
        print("Could not find name of author.")
        author = None
    else:
        author = author[0]
        print("author: ", author)

    # Keep only the text between the Project Gutenberg START and END markers.
    if len(regex.findall(r'[\*]+[\s]+START OF THE PROJECT GUTENBERG EBOOK', content)) > 0:
        content = regex.split(r'[\*]+[\s]+START OF THE PROJECT GUTENBERG EBOOK[^\*]+[\*]+', content)[1]
        if len(regex.findall(r'[\*]+[\s]+END OF THE PROJECT GUTENBERG EBOOK', content)) > 0:
            content = regex.split(r'[\*]+[\s]+END OF THE PROJECT GUTENBERG EBOOK[^\*]+[\*]+', content)[0]

    # Three or more consecutive line-break characters separate two blocks.
    blocks = regex.split(r'[\n\r]{3,}', content)

    samples = []
    for block in blocks:
        # Collapse line breaks and runs of whitespace into single spaces.
        block = regex.sub(r'[\s]+', ' ', block)
        if len(block) > 0:
            # A missing author or title is stored as None.
            samples.append({'author': author, 'title': book_title, 'text': block})

    return samples

class LiteratureSearch:
    def __init__(self, books_fn, embeddings_fn, index_fn):
        self.spark = SparkSession.builder.appName("Read").getOrCreate()
        self.df = self.spark.read.csv(books_fn, header=True, inferSchema=True)
        self.df.show()
        self.sentence_transformer_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
        self.embeddings = np.load(embeddings_fn)
        self.index_L2 = faiss.read_index(index_fn)
        self.max_id = self.df.count()

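    def add(self, url):
        new_samples = read_book(url)
        if not new_samples:
            # Nothing was downloaded or parsed, so leave the database unchanged.
            print("No samples to add.")
            return
        # Continue the id sequence so that each id matches its position in the index.
        for sample in new_samples:
            sample['id'] = self.max_id
            self.max_id += 1
        new_samples_df = self.spark.createDataFrame(new_samples)
        # Match columns by name rather than by position.
        self.df = self.df.unionByName(new_samples_df)
        print(f"Added {len(new_samples)} new samples to the pyspark dataframe.")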

        new_texts = [sample['text'] for sample in new_samples]
        new_embeddings = self.sentence_transformer_model.encode(new_texts)
        self.embeddings = np.concatenate((self.embeddings, new_embeddings))
        print(f"Added new embeddings of shape {new_embeddings.shape}. New shape: {self.embeddings.shape}")
        self.index_L2.add(new_embeddings)
        print(f"Added to index. New index size: {self.index_L2.ntotal}")

    def save(self, books_fn, embeddings_fn, index_fn):
        # Write all three artifacts to the paths passed in by the caller.
        self.df.repartition(1).write.mode('overwrite').csv(books_fn, header=True)
        np.save(embeddings_fn, self.embeddings)
        faiss.write_index(self.index_L2, index_fn)

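    def search(self, query_text, k):
        # Embed the query and retrieve the ids of the k nearest stored vectors.
        query_vector = self.sentence_transformer_model.encode(query_text)
        distances, sorted_ids = self.index_L2.search(np.array([query_vector]), k)

        # FAISS pads the result with -1 when the index holds fewer than k vectors.
        sorted_ids = [i for i in sorted_ids[0].tolist() if i != -1]
        results = self.df.filter(self.df.id.isin(sorted_ids)).toPandas()
        # The filter does not preserve the ranking, so restore it (1 = closest).
        results['result'] = results['id'].apply(lambda x: sorted_ids.index(x)+1)
        return results.sort_values(by='result').to_dict(orient='records')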

    def __del__(self):
        self.spark.stop()
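
To see how read_book turns raw text into samples, the block-splitting step can be tried on a short string. The sketch below is only an illustration, not the notebook's code: it uses the standard-library re module in place of regex, and the helper name split_paragraphs and the sample text are made up for the example. It assumes CRLF line endings, where a single blank line amounts to four line-break characters and therefore matches `[\n\r]{3,}`.

```python
import re

def split_paragraphs(content):
    """Split text into blocks the way read_book does (hypothetical helper)."""
    # Three or more consecutive line-break characters mark a block boundary.
    blocks = re.split(r'[\n\r]{3,}', content)
    samples = []
    for block in blocks:
        # Collapse line breaks and repeated whitespace into single spaces.
        block = re.sub(r'\s+', ' ', block)
        if len(block) > 0:
            samples.append(block)
    return samples

# Made-up sample with CRLF line endings: two paragraphs separated by a blank line.
crlf_text = ("It was the best of times,\r\nit was the worst of times."
             "\r\n\r\nThere were a king and a queen.")
paragraphs = split_paragraphs(crlf_text)
print(paragraphs)

# With bare LF endings a blank line is only two characters, so nothing is split.
lf_text = crlf_text.replace("\r\n", "\n")
print(len(split_paragraphs(lf_text)))
```

The second call shows that the same pattern would not split a text with LF-only line endings on single blank lines, so the pattern may need adjusting for sources other than the one used here.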

Let's create an instance of LiteratureSearch, which loads the texts into a Spark DataFrame together with the embeddings and the search index.

In [6]:
litsearch = LiteratureSearch('books.csv', 'embeddings.npy', 'index_L2.index')

+---------------+---+--------------------+--------------------+
|         author| id|                text|               title|
+---------------+---+--------------------+--------------------+
|Charles Dickens|  0|A TALE OF TWO CITIES|A Tale of Two Cities|
|Charles Dickens|  1|A STORY OF THE FR...|A Tale of Two Cities|
|Charles Dickens|  2|  By Charles Dickens|A Tale of Two Cities|
|Charles Dickens|  3|            CONTENTS|A Tale of Two Cities|
|Charles Dickens|  4|Book the First--R...|A Tale of Two Cities|
|Charles Dickens|  5|CHAPTER I The Per...|A Tale of Two Cities|
|Charles Dickens|  6|Book the Second--...|A Tale of Two Cities|
|Charles Dickens|  7|CHAPTER I Five Ye...|A Tale of Two Cities|
|Charles Dickens|  8|Book the Third--t...|A Tale of Two Cities|
|Charles Dickens|  9|CHAPTER I In Secr...|A Tale of Two Cities|
|Charles Dickens| 10|Book the First--R...|A Tale of Two Cities|
|Charles Dickens| 11|CHAPTER I. The Pe...|A Tale of Two Cities|
|Charles Dickens| 12|It was the best o..



We can now perform a search for a particular text:

In [7]:
litsearch.search("But, happy Sissy’s happy children loving her; all children loving her", 10)

[{'author': 'Charles Dickens',
  'id': 20304,
  'text': 'But, happy Sissy’s happy children loving her; all children loving her; she, grown learned in childish lore; thinking no innocent and pretty fancy ever to be despised; trying hard to know her humbler fellow-creatures, and to beautify their lives of machinery and reality with those imaginative graces and delights, without which the heart of infancy will wither up, the sturdiest physical manhood will be morally stark death, and the plainest national prosperity figures can show, will be the Writing on the Wall,—she holding this course as part of no fantastic vow, or bond, or brotherhood, or sisterhood, or pledge, or covenant, or fancy dress, or fancy fair; but simply as a duty to be done,—did Louisa see these things of herself? These things were to be.',
  'title': 'Hard Times',
  'result': 1},
 {'author': 'Charles Dickens',
  'id': 15755,
  'text': 'I never was so happy. I never was so pleased as when I saw those two sit down togeth
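
The search returns the stored passages whose embeddings lie closest to the embedding of the query. As a rough illustration of what search() does, the sketch below runs an exact nearest-neighbour search by squared L2 distance over a few made-up 3-dimensional vectors, then restores the ranking after an unordered lookup, as the result field does. It is plain Python rather than FAISS, and toy_vectors, toy_rows and nearest_ids are hypothetical names used only here.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_ids(vectors, query, k):
    """Return the ids (list positions) of the k vectors closest to query."""
    order = sorted(range(len(vectors)), key=lambda i: squared_l2(vectors[i], query))
    return order[:k]

# Made-up stand-ins for the embeddings and the text table.
toy_vectors = [
    [0.0, 0.0, 1.0],  # id 0
    [1.0, 0.0, 0.0],  # id 1
    [0.9, 0.1, 0.0],  # id 2
    [0.0, 1.0, 0.0],  # id 3
]
toy_rows = [{'id': i, 'text': f'passage {i}'} for i in range(len(toy_vectors))]

query = [0.9, 0.2, 0.0]
sorted_ids = nearest_ids(toy_vectors, query, k=2)

# A filter by id returns rows in storage order, not in order of distance,
results = [dict(row) for row in toy_rows if row['id'] in sorted_ids]
# so each row is tagged with its rank (1 = closest) and the rows are re-sorted.
for row in results:
    row['result'] = sorted_ids.index(row['id']) + 1
results.sort(key=lambda row: row['result'])
print(results)
```

Here id 2 is nearer to the query than id 1, so it comes first in the final list even though it is stored later.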

In [8]:
del(litsearch)

In [22]:
litsearch_new = LiteratureSearch('books.csv', 'embeddings.npy', 'index_L2.index')
litsearch_new.add('https://www.gutenberg.org/cache/epub/1023/pg1023.txt')
litsearch_new.save('books.csv', 'embeddings.npy', 'index_L2.index')
del(litsearch_new)

+---------------+---+--------------------+--------------------+
|         author| id|                text|               title|
+---------------+---+--------------------+--------------------+
|Charles Dickens|  0|A TALE OF TWO CITIES|A Tale of Two Cities|
|Charles Dickens|  1|A STORY OF THE FR...|A Tale of Two Cities|
|Charles Dickens|  2|  By Charles Dickens|A Tale of Two Cities|
|Charles Dickens|  3|            CONTENTS|A Tale of Two Cities|
|Charles Dickens|  4|Book the First--R...|A Tale of Two Cities|
|Charles Dickens|  5|CHAPTER I The Per...|A Tale of Two Cities|
|Charles Dickens|  6|Book the Second--...|A Tale of Two Cities|
|Charles Dickens|  7|CHAPTER I Five Ye...|A Tale of Two Cities|
|Charles Dickens|  8|Book the Third--t...|A Tale of Two Cities|
|Charles Dickens|  9|CHAPTER I In Secr...|A Tale of Two Cities|
|Charles Dickens| 10|Book the First--R...|A Tale of Two Cities|
|Charles Dickens| 11|CHAPTER I. The Pe...|A Tale of Two Cities|
|Charles Dickens| 12|It was the best o..
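
add() relies on each passage's id matching its position in the embeddings array and in the search index, which is why new ids continue from the current row count. A small sanity check along these lines could be run before save(). The function below is a hypothetical helper, not part of the notebook. It takes plain Python values so that it runs without Spark or FAISS, and the commented call showing how a LiteratureSearch instance might be passed in is an untested assumption.

```python
def check_alignment(ids, n_embeddings, n_index):
    """Return a list of problems; an empty list means everything lines up."""
    problems = []
    if len(ids) != n_embeddings:
        problems.append(f"{len(ids)} rows but {n_embeddings} embeddings")
    if n_embeddings != n_index:
        problems.append(f"{n_embeddings} embeddings but {n_index} index entries")
    if sorted(ids) != list(range(len(ids))):
        problems.append("ids are not exactly 0..n-1")
    return problems

# For a LiteratureSearch instance named ls, this might be called as
# (assumption, untested here):
# check_alignment([r.id for r in ls.df.select('id').collect()],
#                 ls.embeddings.shape[0], ls.index_L2.ntotal)

print(check_alignment([0, 1, 2, 3], 4, 4))  # consistent toy values
print(check_alignment([0, 1, 2, 2], 4, 3))  # duplicate id and a short index
```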

In [23]:
litsearch_new_new = LiteratureSearch('books.csv', 'embeddings.npy', 'index_L2.index')

+---------------+---+--------------------+--------------------+
|         author| id|                text|               title|
+---------------+---+--------------------+--------------------+
|Charles Dickens|  0|A TALE OF TWO CITIES|A Tale of Two Cities|
|Charles Dickens|  1|A STORY OF THE FR...|A Tale of Two Cities|
|Charles Dickens|  2|  By Charles Dickens|A Tale of Two Cities|
|Charles Dickens|  3|            CONTENTS|A Tale of Two Cities|
|Charles Dickens|  4|Book the First--R...|A Tale of Two Cities|
|Charles Dickens|  5|CHAPTER I The Per...|A Tale of Two Cities|
|Charles Dickens|  6|Book the Second--...|A Tale of Two Cities|
|Charles Dickens|  7|CHAPTER I Five Ye...|A Tale of Two Cities|
|Charles Dickens|  8|Book the Third--t...|A Tale of Two Cities|
|Charles Dickens|  9|CHAPTER I In Secr...|A Tale of Two Cities|
|Charles Dickens| 10|Book the First--R...|A Tale of Two Cities|
|Charles Dickens| 11|CHAPTER I. The Pe...|A Tale of Two Cities|
|Charles Dickens| 12|It was the best o..

In [24]:
litsearch_new_new.search("But, happy Sissy’s happy children loving her; all children loving her", 10)

[{'author': 'Charles Dickens',
  'id': 20304,
  'text': 'But, happy Sissy’s happy children loving her; all children loving her; she, grown learned in childish lore; thinking no innocent and pretty fancy ever to be despised; trying hard to know her humbler fellow-creatures, and to beautify their lives of machinery and reality with those imaginative graces and delights, without which the heart of infancy will wither up, the sturdiest physical manhood will be morally stark death, and the plainest national prosperity figures can show, will be the Writing on the Wall,—she holding this course as part of no fantastic vow, or bond, or brotherhood, or sisterhood, or pledge, or covenant, or fancy dress, or fancy fair; but simply as a duty to be done,—did Louisa see these things of herself? These things were to be.',
  'title': 'Hard Times',
  'result': 1},
 {'author': 'Charles Dickens',
  'id': 25712,
  'text': '“Now, be happy, child, under better circumstances. Be beloved and happy!”',
  'title

In [25]:
del(litsearch_new_new)