<a href="https://colab.research.google.com/github/stat-junda/Stat-359-Modern-Deep-Learning/blob/main/%E2%80%9CAssignment_2_Q2_solution%22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this question, we will create a simple RAG database and use it to query a collection of documents.

First, we will install Sentence Transformers, which we will use to compute the document embeddings.

In [2]:
!pip install sentence_transformers
import os
import re
import numpy as np


Collecting sentence_transformers
  Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-2.3.1


Next, write code to load the RAG documents into memory. Store all of the documents in a list, one string per document.

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
# Define the folder path
folder_path = "/content/drive/MyDrive/Colab Notebooks/rag_docs"

# Initialize the list to store the file contents
docs_as_strings = []

# Loop over the .txt files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):  # Ensure we are reading only .txt files
        file_path = os.path.join(folder_path, filename)  # Get the full path of the file
        with open(file_path, 'r', encoding='utf-8') as file:  # Open the file
            content = file.read()  # Read the file content
            docs_as_strings.append(content)  # Add the content to the list

# Print the number of loaded documents to verify the result
print(f"Length: {len(docs_as_strings)}")


Length: 6
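
One detail worth noting about the loading step: `os.listdir` returns entries in arbitrary order, so the order of the documents (and of the chunks built from them) can differ between runs or machines. The following is a minimal sketch of a reproducible loader, assuming the same folder layout as above; the function name `load_txt_documents` is hypothetical and not part of the assignment.

```python
from pathlib import Path


def load_txt_documents(folder_path):
    """Read every .txt file in folder_path and return the contents as a list of strings."""
    folder = Path(folder_path)
    # sorted() makes the document order reproducible across runs
    txt_files = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix == ".txt"
    )
    return [p.read_text(encoding="utf-8") for p in txt_files]
```

Calling `load_txt_documents(folder_path)` would then play the same role as the loop above, with files read in alphabetical order.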



Next, split the documents into chunks. Each chunk should contain sentences_per_chunk consecutive sentences of the original document; the last chunk may be shorter when the number of sentences is not an exact multiple. The function should take a document and return a list of chunks. For example, if the document is 12 sentences long and sentences_per_chunk is 4, then we should return 3 chunks of 4 sentences each. You can assume all sentences end in '.', '?', or '!'.

In [16]:
def split_into_chunks(text, sentences_per_chunk=25):
    # Split the text into sentences with a regular expression.
    # The lookbehind splits on whitespace that follows '.', '?' or '!', so the punctuation stays attached to its sentence.
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())

    # Initialize an empty list to hold the chunks
    chunks = []

    # Loop through the sentences and group them into chunks of `sentences_per_chunk` sentences each.
    for i in range(0, len(sentences), sentences_per_chunk):
        # Join the sentences to form a chunk and add it to the list of chunks.
        chunk = ' '.join(sentences[i:i+sentences_per_chunk])
        chunks.append(chunk)

    return chunks
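
As a quick sanity check of the 12-sentence example from the task description, here is a small self-contained sketch. It restates the same splitting logic under a hypothetical name, `chunk_sentences`, so that it runs on its own; the toy document is made up for illustration.

```python
import re


def chunk_sentences(text, sentences_per_chunk):
    # Split on whitespace that follows '.', '?' or '!'
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())
    return [
        ' '.join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


# Toy document with exactly 12 sentences
toy_doc = ' '.join(f"This is sentence {n}." for n in range(1, 13))

toy_chunks = chunk_sentences(toy_doc, sentences_per_chunk=4)
print(len(toy_chunks))  # 3

# With 5 sentences per chunk, the last chunk holds the 2 leftover sentences
uneven_chunks = chunk_sentences(toy_doc, sentences_per_chunk=5)
uneven_counts = [len(re.findall(r'[.?!]', chunk)) for chunk in uneven_chunks]
print(uneven_counts)  # [5, 5, 2]
```

Because the pattern treats every '.', '?' or '!' followed by whitespace as a sentence end, abbreviations such as "Prof. Henry" are also split. That is acceptable under the assumption stated in the task, but it means real documents may yield slightly more "sentences" than a human reader would count.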


Convert all of the documents into chunks and collect them in one flat list.

In [17]:
all_chunks = []
for one_doc in docs_as_strings:
    this_doc_chunks = split_into_chunks(one_doc)
    all_chunks.extend(this_doc_chunks)

print(F"Length of chunks: {len(all_chunks)}")
for c in all_chunks:
    print(F"Len of chunk: {len(c)}")


Length of chunks: 15
Len of chunk: 2854
Len of chunk: 768
Len of chunk: 2105
Len of chunk: 6196
Len of chunk: 4937
Len of chunk: 856
Len of chunk: 5384
Len of chunk: 6944
Len of chunk: 6833
Len of chunk: 7509
Len of chunk: 4932
Len of chunk: 4165
Len of chunk: 6197
Len of chunk: 3825
Len of chunk: 3793
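
Most of these chunks are several thousand characters long. The model card for all-MiniLM-L6-v2 (the model used in the next step) states that input longer than 256 word pieces is truncated by default, so the tail of a long chunk may not contribute to its embedding. The sketch below is a rough, conservative check rather than an exact token count: it assumes each whitespace-separated word produces at least one word piece, so any chunk with more than `max_tokens` words is certain to exceed the limit. The function name `flag_long_chunks` is hypothetical.

```python
def flag_long_chunks(chunks, max_tokens=256):
    """Return the indices of chunks that have more words than max_tokens.

    Word count is only a lower bound on the word-piece count, so chunks
    that are not flagged may still be truncated by the model.
    """
    flagged = []
    for index, chunk in enumerate(chunks):
        word_count = len(chunk.split())
        if word_count > max_tokens:
            flagged.append(index)
    return flagged
```

If `flag_long_chunks(all_chunks)` returns many indices, a smaller sentences_per_chunk value may be worth trying.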


Next, we use Sentence Transformers to compute the embeddings. The cell below loads the all-MiniLM-L6-v2 model and checks it on three example sentences.

In [18]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['This framework generates embeddings for each input sentence',
            'Sentences are passed as a list of string.',
            'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding dim:", len(embedding))
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding dim: 384

Sentence: Sentences are passed as a list of string.
Embedding dim: 384

Sentence: The quick brown fox jumps over the lazy dog.
Embedding dim: 384



This function should take a string, or a list of strings such as sentences or chunks, as input and return the Sentence Transformers embeddings as a NumPy array.

In [50]:
def compute_embeddings_st(input_sentences):
    embeddings = model.encode(input_sentences)
    embeddings_array = np.array(embeddings)
    return embeddings_array
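
The shape of the result depends on the input: as far as the Sentence Transformers documentation describes it, encoding a single string gives a 1-D vector, while encoding a list gives a 2-D array with one row per string. The following sketch shows one way to always get a 2-D array. The name `embed_as_matrix` and its `encode_fn` parameter are hypothetical; `encode_fn` stands for any callable that maps a list of strings to embeddings, such as `model.encode`.

```python
import numpy as np


def embed_as_matrix(texts, encode_fn):
    """Return a 2-D array with one row per input string."""
    if isinstance(texts, str):
        texts = [texts]  # wrap a single string so the result is still 2-D
    embeddings = np.asarray(encode_fn(list(texts)))
    # atleast_2d is a no-op for arrays that are already 2-D
    return np.atleast_2d(embeddings)
```

With this wrapper, `embed_as_matrix("one query", model.encode)` would have shape (1, 384) for this model, which keeps later array operations uniform.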

In [51]:
all_embeded_docs = compute_embeddings_st(all_chunks)

print(all_embeded_docs.shape)

(15, 384)


Define a class to act as a document vector db.

This class takes as input doc_db, the list of original chunks, and doc_vectors, the embeddings computed for those chunks by Sentence Transformers.

It implements one method, doc_search, which takes a user query and returns the closest document.

You will first need to embed the query with compute_embeddings_st.
You will then need to compare that embedding to every element in doc_vectors.

Finally, return the document whose distance vector from the query has the smallest L2 norm, that is, it is the closest to the query.
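
To make the distance rule concrete, here is a toy example with made-up 2-D vectors (the real embeddings have 384 dimensions). The document names and numbers are purely illustrative.

```python
import numpy as np

# Hypothetical 2-D "embeddings" for three documents
toy_docs = ["doc A", "doc B", "doc C"]
toy_vectors = np.array([[0.0, 1.0],
                        [1.0, 0.0],
                        [0.6, 0.8]])
toy_query = np.array([0.5, 0.9])

# One L2 distance per document: the norm of (document vector - query vector)
distances = np.linalg.norm(toy_vectors - toy_query, axis=1)

# The closest document is the one with the smallest distance
closest_index = int(np.argmin(distances))
print(toy_docs[closest_index])  # doc C
```

Here the difference vectors are (-0.5, 0.1), (0.5, -0.9) and (0.1, -0.1), so "doc C" has the smallest norm and is returned.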

In [52]:
class DocumentVectorDB:
    def __init__(self, doc_db, doc_vectors):
        self.doc_db = doc_db
        self.doc_vectors = doc_vectors

    def doc_search(self, query):
        query_embedding = compute_embeddings_st([query])[0]
        min_distance = np.inf
        closest_doc = None
        for doc, embedding in zip(self.doc_db, self.doc_vectors):
            distance = np.linalg.norm(query_embedding - embedding)
            if distance < min_distance:
                min_distance = distance
                closest_doc = doc

        return closest_doc

ragDB = DocumentVectorDB(doc_db=all_chunks, doc_vectors=all_embeded_docs)
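
The loop in doc_search computes one distance at a time and returns a single chunk. As a possible variation, the sketch below computes all distances in one NumPy call and returns the k nearest chunks together with their distances. The class name `TopKVectorDB`, the `embed_fn` parameter, and the toy embedder are all assumptions made for illustration; in the notebook, `embed_fn` would be `compute_embeddings_st`.

```python
import numpy as np


class TopKVectorDB:
    """Hypothetical variant of DocumentVectorDB that returns the k nearest chunks."""

    def __init__(self, doc_db, doc_vectors, embed_fn):
        self.doc_db = list(doc_db)
        self.doc_vectors = np.asarray(doc_vectors)
        # embed_fn maps a list of strings to a 2-D array of embeddings
        self.embed_fn = embed_fn

    def doc_search(self, query, k=3):
        query_embedding = np.asarray(self.embed_fn([query]))[0]
        # Broadcasting subtracts the query from every row at once
        distances = np.linalg.norm(self.doc_vectors - query_embedding, axis=1)
        nearest = np.argsort(distances)[:k]
        return [(self.doc_db[i], float(distances[i])) for i in nearest]


# Stand-in embedder so the sketch runs without downloading a model:
# each string maps to [length, number of spaces]
def toy_embed(texts):
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=float)


toy_chunks = ["a", "bb bb", "ccc ccc ccc"]
toy_db = TopKVectorDB(toy_chunks, toy_embed(toy_chunks), toy_embed)
results = toy_db.doc_search("dd dd", k=2)
print(results[0])  # ('bb bb', 0.0)
```

Returning several chunks with their distances makes it easier to judge whether the best match is clearly better than the runners-up.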


Try to query your ragDB with various questions about Northwestern. Does it work? Please show some example queries below.

In [57]:
the_query = "Tell me about Northwestern budget challenges in 2023."
doc = ragDB.doc_search(the_query)
print("The closest document to the query is:")
print(doc)

The closest document to the query is:
Provost Kathleen Hagerty and Senior Associate Vice President for Finance Mandy Distel presented the Northwestern University’s 2022 Financial Report to the Northwestern University Faculty Senate at Wednesday’s meeting. Northwestern Northwestern University's net assets decreased from more than $16 billion last fiscal year to $15.4 billion, according to Executive Vice President Craig Johnson’s letter in the report. However, the Northwestern University ended the fiscal year with positive operating performance of $138.7 million across its schools. Following the report, senators discussed the Northwestern University’s employment policies, salaries and other items related to the budget. Hagerty said there are currently about 500 vacancies in staff positions, down from a peak of about 800. According to Senator and RTVF Prof. Kyle Henry, pre-COVID-19 vacancies were usually between 200 and 250 positions. He said one position in the RTVF department turned ove

Now you try with your own questions!

In [58]:
the_query = "Tell me about the landmark buildings at Northwestern University's Chicago campus."
doc = ragDB.doc_search(the_query)
print("The closest document to the query is:")
print(doc)

The closest document to the query is:
Chicago

The Montgomery Ward Memorial Building (1927) at Northwestern's Feinberg School of Medicine in Chicago, America's first academic skyscraper[46]
Northwestern's Chicago campus is located in the city's Streeterville neighborhood near Lake Michigan. The Chicago campus is home to the nationally ranked Northwestern Memorial Hospital, the medical school, the law school, the part-time MBA program, and the School of Professional Studies. Medill's one-year graduate program rents a floor on Wacker Drive, across the river from Streeterville and separate from the rest of the campus. Northwestern's professional schools and a number of its affiliated hospitals are located approximately four blocks east of the Chicago station on the CTA Red Line. The Chicago campus is also served by CTA bus routes. Founded or affiliated at varying points in the university's history, the professional schools originally were scattered throughout Chicago.[47] In connection wi

Note that, to use RAG with an LLM, you would first use the vector database to retrieve the documents most relevant to the query, and then give those documents to the LLM as part of its input.
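
As a rough illustration of that last step, the sketch below assembles a prompt from a question and the retrieved chunks. The function name `build_rag_prompt` and the prompt wording are assumptions, not part of the assignment, and the call to an actual LLM is left out because it depends on which model or API you use.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine the retrieved chunks and the user question into one prompt string."""
    # Label each chunk so the model can tell the documents apart
    context = "\n\n".join(
        f"[Document {i}]\n{chunk}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

For example, `build_rag_prompt(the_query, [ragDB.doc_search(the_query)])` would produce a prompt containing the closest chunk followed by the question.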