# A Vector Database of Classic English Literature

In this notebook I'll show how to create a vector database of passages from classic authors. We will start by downloading a copy of three books by Charles Dickens from <a href="https://www.gutenberg.org">Project Gutenberg</a> and breaking the text down into paragraph-long passages using regex.

In [1]:
import urllib.request
import regex

url_list = ['https://www.gutenberg.org/cache/epub/98/pg98.txt',
            'https://www.gutenberg.org/cache/epub/24022/pg24022.txt',
            'https://www.gutenberg.org/cache/epub/1400/pg1400.txt']

def read_book(url):
    """Download a Project Gutenberg text and split it into passages.

    Returns a list of {'author', 'title', 'text'} dicts, or None on failure.
    """
    try:
        # The context manager closes the connection on every exit path.
        with urllib.request.urlopen(url) as response:
            content = response.read().decode("utf-8")
    except Exception as e:
        print(e)
        return None

    book_title = regex.findall(r"Title: (.*?)[\n\r]", content)
    author = regex.findall(r"Author: (.*?)[\n\r]", content)

    if len(book_title) == 0:
        print("Could not find name of book.")
        book_title = [None]
    else:
        print("Book title: ", book_title[0])

    if len(author) == 0:
        print("Could not find name of author.")
        author = [None]
    else:
        print("author: ", author[0])

    # Split on asterisk-delimited markers such as Gutenberg's
    # "*** START OF ... ***" line; content[1] is the text after the first marker.
    content = regex.split(r'[\*]+[^\*]+[\*]+', content)
    if len(content) < 2:
        print("Error: could not read book.")
        return None
    # Three or more consecutive line-break characters separate passages.
    blocks = regex.split(r'[\n\r]{3,}', content[1])

    samples = []
    for block in blocks:
        # Collapse line breaks and runs of whitespace into single spaces.
        block = regex.sub(r'\s+', ' ', block)
        if len(block) > 0:
            samples.append({'author': author[0], 'title': book_title[0], 'text': block})

    return samples
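
To make the splitting step concrete, here is a minimal sketch on a toy string. It uses the standard-library `re` module as a stand-in for `regex` (the two behave the same for these patterns), and `toy_text` is a hypothetical miniature "book" that assumes CRLF line endings. With LF-only line endings a single blank line is just two line-break characters, so the `{3,}` pattern would need two blank lines to split.

```python
import re  # stdlib stand-in for the third-party `regex` module used above

# Hypothetical miniature "book": two paragraphs separated by one blank line,
# assuming CRLF ("\r\n") line endings.
toy_text = (
    "It was the best of times,\r\n"
    "it was the worst of times.\r\n"
    "\r\n"
    "There were a king with a large jaw\r\n"
    "and a queen with a plain face.\r\n"
)

# A blank line between paragraphs is "\r\n\r\n": four line-break characters
# in a row, so {3,} matches it, while a single "\r\n" inside a paragraph
# (two characters) does not.
blocks = re.split(r'[\n\r]{3,}', toy_text)

# Collapse remaining line breaks and whitespace runs into single spaces.
passages = [re.sub(r'\s+', ' ', block) for block in blocks if block]

for passage in passages:
    print(repr(passage))
```

Each wrapped paragraph comes out as one single-line passage, which is the form stored in the `text` field.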

In [2]:
import time

samples = []

for url in url_list:
    new_samples = read_book(url)
    if new_samples is None:
        print("No text added.")
    else:
        samples.extend(new_samples)
        print(f"Added {len(new_samples)} new passages.")
    # Pause between requests.
    time.sleep(15)

Book title:  A Tale of Two Cities
author:  Charles Dickens
Added 2236 new passages.
Book title:  A Christmas Carol
author:  Charles Dickens
Added 794 new passages.
Book title:  Great Expectations
author:  Charles Dickens
Added 3915 new passages.


In [3]:
samples[100]['text']

'Yet even when his eyes were opened on the mist and rain, on the moving patch of light from the lamps, and the hedge at the roadside retreating by jerks, the night shadows outside the coach would fall into the train of the night shadows within. The real Banking-house by Temple Bar, the real business of the past day, the real strong rooms, the real express sent after him, and the real message returned, would all be there. Out of the midst of them, the ghostly face would rise, and he would accost it again.'

We are now ready to create a vector database. Using a Hugging Face sentence-transformer model, all-MiniLM-L6-v2, we will convert each text passage into a 384-dimensional vector embedding.

In [4]:
texts = [sample['text'] for sample in samples]

In [5]:
pip install sentence_transformers

In [6]:
from sentence_transformers import SentenceTransformer

sentence_transformer_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [7]:
embeddings = sentence_transformer_model.encode(texts)
embeddings.shape

(6945, 384)

To facilitate the vector search, we will create a search index for the embeddings using FAISS (Facebook AI Similarity Search). We use `IndexFlatL2`, which performs an exact, brute-force search by Euclidean (L2) distance.

In [8]:
pip install faiss-cpu

In [9]:
import faiss
import numpy as np

index_L2 = faiss.IndexFlatL2(embeddings.shape[-1])
index_L2.add(embeddings)
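
Conceptually, a flat L2 index compares the query against every stored vector and keeps the closest ones. The following is a pure-Python sketch of that idea, not FAISS's actual implementation. The names `squared_l2`, `brute_force_search` and `toy_vectors` are hypothetical, and the 2-dimensional vectors stand in for the real 384-dimensional embeddings. Like `IndexFlatL2`, it reports squared Euclidean distances.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k):
    """Return (distances, ids) of the k stored vectors closest to query."""
    # Score every stored vector, then sort by distance (ties broken by id).
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Hypothetical 2-dimensional "embeddings".
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

distances, ids = brute_force_search(toy_vectors, [0.9, 0.1], k=2)
print(ids, distances)
```

The returned ids are positions in the list of stored vectors, which is why the index can be mapped straight back to entries of `samples`.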

In [10]:
def vector_search(query_text, k):
    # Encode the query and look up its k nearest neighbours in the index.
    query_vector = sentence_transformer_model.encode(query_text)
    distances, sorted_ids = index_L2.search(np.array([query_vector]), k)
    # sorted_ids has shape (1, k); map each hit back to its passage record.
    results = [{'author': samples[i]['author'],
                'title': samples[i]['title'],
                'text': samples[i]['text']} for i in sorted_ids[0]]
    return results

In [11]:
res = vector_search("He felt sad.", 10)
for i in range(len(res)):
    record = res[i]
    print(f"Result {i}:\n\tauthor: {record['author']}\n\ttitle: {record['title']}\n\ttext: {record['text']}")

Result 0:
	author: Charles Dickens
	title: A Tale of Two Cities
	text: Looking gently at him again, she was surprised and saddened to see that there were tears in his eyes. There were tears in his voice too, as he answered:
Result 1:
	author: Charles Dickens
	title: A Tale of Two Cities
	text: His cry was so like a cry of actual pain, that it rang in Charles Darnay’s ears long after he had ceased. He motioned with the hand he had extended, and it seemed to be an appeal to Darnay to pause. The latter so received it, and remained silent.
Result 2:
	author: Charles Dickens
	title: A Christmas Carol
	text: Then, with a rapidity of transition very foreign to his usual character, he said, in pity for his former self, 'Poor boy!' and cried again.
Result 3:
	author: Charles Dickens
	title: A Tale of Two Cities
	text: He shook his head.
Result 4:
	author: Charles Dickens
	title: Great Expectations
	text: I felt his hand tremble as it held mine, and he turned his face away as he lay in the botto

In [12]:
res = vector_search("He eats something he does not like in the least.", 10)
for i in range(len(res)):
    record = res[i]
    print(f"Result {i}:\n\tauthor: {record['author']}\n\ttitle: {record['title']}\n\ttext: {record['text']}")

Result 0:
	author: Charles Dickens
	title: Great Expectations
	text: He ate in a ravenous way that was very disagreeable, and all his actions were uncouth, noisy, and greedy. Some of his teeth had failed him since I saw him eat on the marshes, and as he turned his food in his mouth, and turned his head sideways to bring his strongest fangs to bear upon it, he looked terribly like a hungry old dog. If I had begun with any appetite, he would have taken it away, and I should have sat much as I did,—repelled from him by an insurmountable aversion, and gloomily looking at the cloth.
Result 1:
	author: Charles Dickens
	title: A Christmas Carol
	text: 'I don't mind going if a lunch is provided,' observed the gentleman with the excrescence on his nose. 'But I must be fed if I make one.'
Result 2:
	author: Charles Dickens
	title: Great Expectations
	text: “You don’t eat ’em,” returned Mr. Pumblechook, sighing and nodding his head several times, as if he might have expected that, and as if absti

In [13]:
res = vector_search("I had an interesting dream last night.", 10)
for i in range(len(res)):
    record = res[i]
    print(f"Result {i}:\n\tauthor: {record['author']}\n\ttitle: {record['title']}\n\ttext: {record['text']}")

Result 0:
	author: Charles Dickens
	title: Great Expectations
	text: For a reason that I had, I felt as if my eyes would start out of my head. I acknowledged his attention incoherently, and began to think this was a dream.
Result 1:
	author: Charles Dickens
	title: Great Expectations
	text: As he was fast making jam of his fruit by wrestling with the door while the paper-bags were under his arms, I begged him to allow me to hold them. He relinquished them with an agreeable smile, and combated with the door as if it were a wild beast. It yielded so suddenly at last, that he staggered back upon me, and I staggered back upon the opposite door, and we both laughed. But still I felt as if my eyes must start out of my head, and as if this must be a dream.
Result 2:
	author: Charles Dickens
	title: Great Expectations
	text: With this project formed, we went to bed. I had the wildest dreams concerning him, and woke unrefreshed; I woke, too, to recover the fear which I had lost in the night, of

We can now see the utility of performing a vector search on text data. The top result for the query "He felt sad." is a passage in which Dickens conveys sadness through the tears in a character's eyes and voice; it contains neither "felt" nor the standalone word "sad" (only "saddened"). Other results shown above, such as the passage describing a "cry of actual pain", contain no form of the word at all, yet they still communicate the feeling in Dickens' celebrated literary style. It is similar with the other queries. For "He eats something he does not like in the least.", the results include a passage about being fed lunch that contains neither "eat" nor "ate", alongside passages that describe disagreeable eating directly. For "I had an interesting dream last night.", some of the search results do not contain the word "dream" but describe night-time fantasies in the first person.
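
For contrast, here is a minimal sketch of what a plain keyword search would find in three of the passages returned for "He felt sad." The helper `keyword_search` is hypothetical and written only for this comparison; it matches whole words, ignoring case.

```python
import re

# Three of the passages returned above for the query "He felt sad."
passages = [
    "Looking gently at him again, she was surprised and saddened to see that "
    "there were tears in his eyes. There were tears in his voice too, as he answered:",
    "Then, with a rapidity of transition very foreign to his usual character, "
    "he said, in pity for his former self, 'Poor boy!' and cried again.",
    "He shook his head.",
]

def keyword_search(word, texts):
    """Return the texts that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return [text for text in texts if pattern.search(text)]

print(len(keyword_search("sad", passages)))   # prints 0
print(len(keyword_search("felt", passages)))  # prints 0
```

Neither "sad" nor "felt" appears as a whole word in any of these passages, so a keyword filter on either word would miss all three, while the vector search ranks them near the top.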

In [14]:
for i in range(len(samples)):
    samples[i]['id'] = i

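We can now save the downloaded books as a Spark DataFrame and write it out as CSV. Note that Spark writes a directory named `books.csv` containing the data as a part file, rather than a single .csv file.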

In [15]:
pip install pyspark

In [16]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SaveBooks").getOrCreate()

df = spark.createDataFrame(samples)
df.show()

+---------------+---+--------------------+--------------------+
|         author| id|                text|               title|
+---------------+---+--------------------+--------------------+
|Charles Dickens|  0|A TALE OF TWO CITIES|A Tale of Two Cities|
|Charles Dickens|  1|A STORY OF THE FR...|A Tale of Two Cities|
|Charles Dickens|  2|  By Charles Dickens|A Tale of Two Cities|
|Charles Dickens|  3|            CONTENTS|A Tale of Two Cities|
|Charles Dickens|  4| Book the First--...|A Tale of Two Cities|
|Charles Dickens|  5| CHAPTER I The Pe...|A Tale of Two Cities|
|Charles Dickens|  6| Book the Second-...|A Tale of Two Cities|
|Charles Dickens|  7| CHAPTER I Five Y...|A Tale of Two Cities|
|Charles Dickens|  8| Book the Third--...|A Tale of Two Cities|
|Charles Dickens|  9| CHAPTER I In Sec...|A Tale of Two Cities|
|Charles Dickens| 10|Book the First--R...|A Tale of Two Cities|
|Charles Dickens| 11|CHAPTER I. The Pe...|A Tale of Two Cities|
|Charles Dickens| 12|It was the best o..

In [17]:
df.repartition(1).write.csv("books.csv", header=True)
spark.stop()
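
If a single plain file is preferred over Spark's output directory, the same records could also be written with Python's standard `csv` module. This is an alternative sketch, not what the notebook does: `toy_samples` and the file name `books_single.csv` are hypothetical, and the records only mimic the keys of `samples`.

```python
import csv
import os
import tempfile

# Hypothetical records with the same keys as `samples` above.
toy_samples = [
    {'id': 0, 'author': 'Charles Dickens', 'title': 'A Tale of Two Cities',
     'text': 'He shook his head.'},
    {'id': 1, 'author': 'Charles Dickens', 'title': 'A Christmas Carol',
     'text': "He said, in pity for his former self, 'Poor boy!' and cried again."},
]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'books_single.csv')

with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'author', 'title', 'text'])
    writer.writeheader()
    writer.writerows(toy_samples)

# Read the file back; csv quoting keeps commas inside passages intact.
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(rows[1]['text'])
```

Values read back this way are all strings, so the `id` column would need converting to integers before it is used as an index.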

We will also save the embeddings and the search index.

In [18]:
np.save('embeddings.npy', embeddings)

In [19]:
faiss.write_index(index_L2, "index_L2.index")

In the [next notebook](https://github.com/tommyliphysics/tommyli-ml/blob/main/literature_vdb/notebooks/read_update.ipynb) we will look at how to load the text, embeddings and search index to perform search and add more books to the database.