# RAG Prototype

This notebook contains a RAG prototype that reads text files of book descriptions, creates a vector embedding for each description, and stores the embeddings in a vector database. It performs the following steps:

- read book descriptions from the text files stored under data/
- use fastembed to create the vector embedding for each book description
- store the embeddings and the book metadata in a local weaviate database


Install and import libraries

In [16]:
%%capture
%pip install fastembed weaviate-client ipython

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [17]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [18]:
import os
import json
from IPython.display import JSON

from fastembed import TextEmbedding

import weaviate
from weaviate.classes.data import DataObject

from helper import suppress_output

Set variables

In [19]:
COLLECTION_NAME = "Books"  # capitalize the first letter of collection names
BOOK_DESCRIPTION_FOLDER = "data/"
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"

Instantiate Embedded Weaviate client

In [20]:
with suppress_output():
    client = weaviate.connect_to_embedded(
        persistence_data_path="tmp/weaviate",
    )
print("Started new embedded Weaviate instance.")
print(f"Client is ready: {client.is_ready()}")

WeaviateStartUpError: Embedded DB did not start because processes are already listening on ports http:8079 and grpc:50050use weaviate.connect_to_local(port=8079, grpc_port=50050) to connect to the existing instance
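
The error above is raised when something is already listening on the default embedded ports, typically an embedded instance left over from an earlier run of this cell. The following is a minimal sketch of one way to handle that, following the hint in the error message: try to start the embedded instance, and fall back to `weaviate.connect_to_local` if startup fails. The helper name `connect_with_fallback` is hypothetical, and the sketch assumes that the process occupying the ports really is a Weaviate instance.

```python
def connect_with_fallback(data_path="tmp/weaviate", http_port=8079, grpc_port=50050):
    """Hypothetical helper: start embedded Weaviate, or reuse a running instance."""
    # Imported inside the function so the helper can be defined before
    # weaviate-client is installed.
    import weaviate
    from weaviate.exceptions import WeaviateStartUpError

    try:
        return weaviate.connect_to_embedded(persistence_data_path=data_path)
    except WeaviateStartUpError:
        # Assumption: the ports are taken by an earlier embedded instance,
        # so connect to it instead of starting a second one.
        return weaviate.connect_to_local(port=http_port, grpc_port=grpc_port)
```

In the notebook this could replace the direct `connect_to_embedded` call, for example `client = connect_with_fallback()`.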

Create the collection

In [None]:
existing_collections = client.collections.list_all()
existing_collection_names = existing_collections.keys()

if COLLECTION_NAME not in existing_collection_names:
    print(f"Collection {COLLECTION_NAME} does not exist yet. Creating it...")
    collection = client.collections.create(name=COLLECTION_NAME)
    print(f"Collection {COLLECTION_NAME} created successfully.")
else:
    print(f"Collection {COLLECTION_NAME} already exists. No action taken.")
    collection = client.collections.get(COLLECTION_NAME)

Collection Books already exists. No action taken.


Extract text from local files

In [None]:
# list the book description files
book_description_files = [
    f for f in os.listdir(BOOK_DESCRIPTION_FOLDER)
    if f.endswith('.txt')
]

print(f"The following files with book descriptions were found: {book_description_files}")

The following files with book descriptions were found: ['book_descriptions_2.txt', 'book_descriptions_1.txt']


In [None]:
list_of_book_data = []

for book_description_file in book_description_files:
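    with open(os.path.join(BOOK_DESCRIPTION_FOLDER, book_description_file), "r", encoding="utf-8") as f:
        # skip blank lines so the ":::" split below never sees an empty record
        book_descriptions = [line for line in f if line.strip()]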

    titles = [
        book_description.split(":::")[1].strip()
        for book_description in book_descriptions
    ]

    authors = [
        book_description.split(":::")[2].strip()
        for book_description in book_descriptions
    ]
    book_description_text = [
        book_description.split(":::")[3].strip()
        for book_description in book_descriptions
    ]

    book_descriptions = [
        {
            "title": title,
            "author": author,
            "description": description,
        }
        for title, author, description in zip(
            titles, authors, book_description_text
        )
    ]

    list_of_book_data.append(book_descriptions)
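
The cell above splits every line three times, once per field. As a sketch of an alternative, each line can be split once and validated. This assumes every record has the form `<field 0> ::: <title> ::: <author> ::: <description>`; the notebook only shows that fields 1 to 3 are used, so the meaning of field 0 is unknown here. The function name `parse_book_line` and the sample record are made up for illustration.

```python
def parse_book_line(line, sep=":::"):
    """Split one record into title, author and description (fields 1 to 3)."""
    # maxsplit=3 keeps any later separator inside the description intact.
    fields = [field.strip() for field in line.split(sep, 3)]
    if len(fields) < 4:
        raise ValueError(f"expected 4 fields separated by {sep!r}, got {len(fields)}")
    _, title, author, description = fields
    return {"title": title, "author": author, "description": description}


# Made-up sample record, not taken from the data/ files.
sample = "1 ::: Example Title ::: Example Author ::: An example description."
print(parse_book_line(sample))
```

Unlike indexing into `split(":::")`, this version fails with a clear message on malformed lines and does not truncate a description that happens to contain the separator.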


In [None]:
JSON(json.dumps(list_of_book_data))




Create vector embeddings from descriptions

In [None]:
embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  

list_of_description_embeddings = []

for book_data in list_of_book_data:
    book_descriptions = [book["description"] for book in book_data]
    # embed all descriptions of one file in a single batched call
    description_embeddings = list(embedding_model.embed(book_descriptions))
    list_of_description_embeddings.append(description_embeddings)

Fetching 5 files: 100%|██████████| 5/5 [00:02<00:00,  2.02it/s]


Load embeddings to Weaviate

In the Books collection in Weaviate, we create one item per data object (book). Each item has two attributes:

- vector: the vector embedding of the book's text description
- properties: a Python dictionary that contains the book metadata (title, author, and text description)
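
Before inserting, it can help to confirm that every book has the three metadata fields and that every vector has the expected length. The sketch below is one way to do that in plain Python. The helper name `check_items` is hypothetical, and the default of 384 dimensions is an assumption based on the `BAAI/bge-small-en-v1.5` model used in this notebook; adjust it if a different model is configured.

```python
REQUIRED_KEYS = {"title", "author", "description"}


def check_items(books, embeddings, expected_dim=384):
    """Return (properties, vector) pairs after basic consistency checks."""
    if len(books) != len(embeddings):
        raise ValueError("books and embeddings must have the same length")
    pairs = []
    for book, vector in zip(books, embeddings):
        missing = REQUIRED_KEYS - set(book)
        if missing:
            raise ValueError(f"book is missing keys: {sorted(missing)}")
        if len(vector) != expected_dim:
            raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
        # Copy into plain Python containers: one dict and one list per item.
        pairs.append((dict(book), list(vector)))
    return pairs
```

Each returned pair maps directly onto the `properties` and `vector` arguments of a data object.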

In [None]:
for book_data_list, emb_list in zip(list_of_book_data, list_of_description_embeddings):
    items = []
    
    for book_data, emb in zip(book_data_list, emb_list):
        item = DataObject(
            properties={
                "title": book_data["title"],
                "author": book_data["author"],
                "description": book_data["description"],
            },
            vector=emb
        )
        items.append(item)
    
    collection.data.insert_many(items)

Query for a book recommendation using semantic search

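- embed the query string with the same embedding model, then call query.near_vector on the Weaviate Books collection to retrieve the book whose description vector is closest to the query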

In [None]:
query_str = "A philosophical book"

embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
collection = client.collections.get(COLLECTION_NAME)

query_emb = list(embedding_model.embed([query_str]))[0]

results = collection.query.near_vector(
    near_vector=query_emb,
    limit=1,
)
for result in results.objects:
    print(f"You should read: {result.properties['title']} by {result.properties['author']}")
    print("Description:")
    print(result.properties["description"])
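
Conceptually, `near_vector` returns the objects whose stored vectors are closest to the query vector. The sketch below reproduces that ranking on toy data, assuming the collection uses cosine distance (Weaviate's default when no other metric is configured, as is the case for the collection created above). The three-dimensional vectors and the titles are made up, and the exhaustive scan is only an illustration; it is not how Weaviate's index searches internally.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, stored, limit=1):
    """Return the `limit` titles whose vectors are closest to `query`."""
    ranked = sorted(stored, key=lambda title: cosine_distance(query, stored[title]))
    return ranked[:limit]


# Made-up toy vectors standing in for description embeddings.
toy_vectors = {
    "Book A": [1.0, 0.0, 0.0],
    "Book B": [0.0, 1.0, 0.0],
    "Book C": [0.7, 0.7, 0.0],
}
print(nearest([0.9, 0.1, 0.0], toy_vectors))  # ['Book A']
```

Raising `limit` returns further candidates in order of increasing distance, which mirrors the `limit` argument of the query above.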