In this question, we will create a simple retrieval-augmented generation (RAG) database and use it to query a collection of documents.

First, we will install Sentence Transformers, which we will use to compute the document embeddings. 

In [95]:
!pip install sentence_transformers
import os
import re
import numpy as np


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Next, write code to load the documents in the rag_docs/ folder into memory. Store all of the documents in a list of strings.

In [96]:
# Define the folder path
folder_path = "rag_docs/"

# Initialize the list to store the file contents
docs_as_strings = []

# Loop over the .txt files in the folder
for filename in os.listdir(folder_path):
    pass

# Print the list to verify the results
print(F"Length: {len(docs_as_strings)}")


Length: 6
["Organization and administration\nGovernance\n\nWeber Arch\nNorthwestern is privately owned and governed by an appointed Board of Trustees, which is composed of 70 members and, as of 2022, is chaired by Peter Barris '74.[63] The board delegates its power to an elected president who serves as the chief executive officer of the university.[64] Northwestern has had seventeen presidents in its history (excluding interim presidents). The current president, legal scholar Michael H. Schill, succeeded Morton O. Schapiro in fall 2022.[65] The president maintains a staff of vice presidents, directors, and other assistants for administrative, financial, faculty, and student matters.[66] Kathleen Haggerty assumed the role of provost for the university on September 1, 2020.[67]\n\nStudents are formally involved in the university's administration through the Associated Student Government, elected representatives of the undergraduate students, and the Graduate Student Association, which re
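One possible way to fill in the loading step is sketched below. The helper name load_docs is hypothetical, and the sketch assumes the files are UTF-8 encoded plain text; neither detail comes from the assignment itself.

```python
import os

def load_docs(folder_path):
    """Read every .txt file in folder_path and return the contents as a list of strings."""
    docs = []
    # Sort the filenames so the document order is the same on every run.
    for filename in sorted(os.listdir(folder_path)):
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a .txt file
        # Assumption: the files are UTF-8 encoded.
        with open(os.path.join(folder_path, filename), encoding="utf-8") as f:
            docs.append(f.read())
    return docs
```

In the notebook this would be called as docs_as_strings = load_docs(folder_path).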

Next, split the documents into chunks. Each chunk should contain sentences_per_chunk consecutive sentences of the original document. The function should take a document and return a list of chunks. For example, if the document is 12 sentences long and sentences_per_chunk is 4, then we should return 3 chunks of 4 sentences each. If the number of sentences is not a multiple of sentences_per_chunk, the last chunk will be shorter. You can assume all sentences end in '.' or '?' or '!'.

In [97]:
def split_into_chunks(text, sentences_per_chunk=25):
    # Split the text into sentences using regular expressions.
    pass
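The chunking rule described above could be implemented roughly as follows. This is a sketch under the stated assumption that every sentence ends in '.', '?' or '!'; the specific regular expression, which splits only where the punctuation is followed by whitespace, is my own choice and not part of the assignment.

```python
import re

def split_into_chunks(text, sentences_per_chunk=25):
    # Split after '.', '?' or '!' when followed by whitespace, keeping the punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Group consecutive sentences; the last chunk may be shorter than the others.
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```

This simple rule also splits after abbreviations and initials such as "H." and does not split where a period is followed directly by a citation marker such as "[63]", so the sentence count is only approximate on real text.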

Convert all of the documents into chunks and collect them in a single flat list.

In [98]:
all_chunks = []
for one_doc in docs_as_strings:
    this_doc_chunks = split_into_chunks(one_doc)
    all_chunks.extend(this_doc_chunks)

print(F"Length of chunks: {len(all_chunks)}")
for c in all_chunks:
    print(F"Len of chunk: {len(c)}")
    

["Organization and administration\nGovernance\n\nWeber Arch\nNorthwestern is privately owned and governed by an appointed Board of Trustees, which is composed of 70 members and, as of 2022, is chaired by Peter Barris '74.[63] The board delegates its power to an elected president who serves as the chief executive officer of the university.[64] Northwestern has had seventeen presidents in its history (excluding interim presidents) . The current president, legal scholar Michael H . Schill, succeeded Morton O . Schapiro in fall 2022.[65] The president maintains a staff of vice presidents, directors, and other assistants for administrative, financial, faculty, and student matters.[66] Kathleen Haggerty assumed the role of provost for the university on September 1, 2020.[67]\n\nStudents are formally involved in the university's administration through the Associated Student Government, elected representatives of the undergraduate students, and the Graduate Student Associatio

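Next, we use Sentence Transformers to compute the embeddings. The example below embeds three sentences with the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors.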

In [99]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['This framework generates embeddings for each input sentence',
            'Sentences are passed as a list of string.',
            'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding dim:", len(embedding))
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding dim: 384

Sentence: Sentences are passed as a list of string.
Embedding dim: 384

Sentence: The quick brown fox jumps over the lazy dog.
Embedding dim: 384



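Write a function that takes a sentence or a chunk (any string) as input and returns its Sentence Transformers embedding. The call in the next cell passes a list of chunks with batched=True, so in that mode the function should return one embedding per chunk.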

In [100]:
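def compute_embeddings_st(input_sentence, batched=False):
    # With batched=True, input_sentence is a list of strings rather than a single string.
    pass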
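A minimal sketch of one way to write this function is shown below. The optional model argument and the module-level cache are assumptions added so that the model is loaded only once and so that a different encoder can be swapped in; they are not part of the assignment's signature. The only library call used is SentenceTransformer.encode, which accepts a list of strings and returns a NumPy array by default.

```python
import numpy as np

_model = None  # cached so the model is loaded only once

def compute_embeddings_st(input_sentence, batched=False, model=None):
    """Return one embedding for a string, or an array of embeddings if batched=True."""
    global _model
    if model is None:
        if _model is None:
            # Imported lazily so the function can also be used with a custom model.
            from sentence_transformers import SentenceTransformer
            _model = SentenceTransformer("all-MiniLM-L6-v2")
        model = _model
    if batched:
        # input_sentence is a list of strings: result has shape (n, dim).
        return np.asarray(model.encode(list(input_sentence)))
    # Single string: encode as a one-element batch and return the only row.
    return np.asarray(model.encode([input_sentence]))[0]
```

With the 14 chunks from this notebook and batched=True, the result would have the shape (14, 384) shown in the next cell's output.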



In [101]:

all_embeded_docs = compute_embeddings_st(all_chunks, batched=True)

print(all_embeded_docs.shape)

(14, 384)


Define a class that acts as a document vector database.

The class takes two inputs: doc_db, the list of original chunks, and doc_vectors, the embeddings computed for those chunks with Sentence Transformers. Row i of doc_vectors must be the embedding of doc_db[i].

It implements one method, doc_search, which takes a user query and returns the closest chunk.

First, embed the query with compute_embeddings_st.
Then compare that embedding to every row of doc_vectors.

Finally, return the chunk whose embedding is closest to the query embedding, that is, the one for which the difference vector between the two embeddings has the smallest L2 norm.

In [102]:
class DocumentVectorDB:
    def __init__(self, doc_db, doc_vectors):
        self.doc_db = doc_db
        self.doc_vectors = doc_vectors
    
    def doc_search(self, query):
        pass

ragDB = DocumentVectorDB(doc_db=all_chunks, doc_vectors=all_embeded_docs)
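The nearest-neighbor search described above might look like the following sketch. The embed_fn constructor argument is a hypothetical addition that keeps the example self-contained; in the notebook it would be compute_embeddings_st, or doc_search could call that function directly.

```python
import numpy as np

class DocumentVectorDB:
    def __init__(self, doc_db, doc_vectors, embed_fn):
        self.doc_db = doc_db
        self.doc_vectors = np.asarray(doc_vectors, dtype=float)
        # Hypothetical hook: any function mapping a string to a 1-D vector.
        self.embed_fn = embed_fn

    def doc_search(self, query):
        query_vec = np.asarray(self.embed_fn(query), dtype=float)
        # L2 norm of (doc_vector - query_vector) for every row at once.
        distances = np.linalg.norm(self.doc_vectors - query_vec, axis=1)
        # Return the chunk with the smallest distance to the query.
        return self.doc_db[int(np.argmin(distances))]
```

The subtraction relies on NumPy broadcasting: the query vector of shape (dim,) is subtracted from each row of the (n, dim) matrix, so no explicit loop over documents is needed.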


Try to query your ragDB with various questions about Northwestern. Does it work? Please show some example queries below. 

In [None]:
the_query = "Tell me about Northwestern budget challenges in 2023."
doc = ragDB.doc_search(the_query)
print(doc)

Now try it with your own questions!

Note that, to use RAG with an LLM, you would first use the query to retrieve the relevant documents and then give those documents to the LLM as part of its input.
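As a rough illustration of that last step, the retrieved chunks can be pasted into the prompt ahead of the question. The function name build_rag_prompt and the prompt wording are hypothetical, and the sketch stops short of calling any particular LLM.

```python
def build_rag_prompt(query, retrieved_docs):
    """Combine retrieved chunks and the user question into one prompt string."""
    # Label each retrieved chunk so it is clearly separated from the others.
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(retrieved_docs)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
```

With the database above, the prompt for a query would be built from the chunk returned by ragDB.doc_search and then sent to whichever LLM you are using.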