# Raw RAG - Level 1: Basic

In this notebook, we walk through a minimal retrieval-augmented generation (RAG) pipeline in five steps:

1. Read the input data and split it into chunks
2. Create embeddings for the chunks
3. Store the embeddings in a vector index (FAISS)
4. Retrieve the chunks most similar to the query
5. Generate an answer from the query and the retrieved context



In [15]:
# Install only the libraries this notebook needs (openai, faiss-cpu, numpy) to keep dependencies minimal.
%pip install openai faiss-cpu numpy

Note: you may need to restart the kernel to use updated packages.


In [1]:
# Split long text into chunks small enough for the embedding model
import re


def text_splitter(
    text: str, char_limit: int = 500, eos_characters: str = ".\n"
) -> list:
    """
    Splits a given text into paragraphs based on a character limit and end-of-sentence (EOS) characters.
    A sentence longer than the character limit is split on word boundaries; only its final chunk keeps the EOS character.

    Args:
      text (str): The input text to be split into paragraphs.
      char_limit (int, optional): The maximum character limit for each paragraph. Defaults to 500.
      eos_characters (str, optional): The characters that mark the end of a sentence. Defaults to ".\n".

    Returns:
      list[str]: A list of paragraphs, where each paragraph is a string.
    """

    paragraphs = []
    current_paragraph = ""

    # Use regex to split the text, keeping the EOS characters
    pattern = f"([{re.escape(eos_characters)}])"
    sentences = re.split(pattern, text)
    # Group the sentences with their EOS characters
    sentences = ["".join(sentences[i : i + 2]) for i in range(0, len(sentences), 2)]

    def split_long_sentence(sentence, limit):
        words = sentence.split()
        chunks = []
        current_chunk = ""
        for word in words:
            if len(current_chunk) + len(word) + 1 <= limit:
                current_chunk += " " + word if current_chunk else word
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = word
        if current_chunk:
            chunks.append(current_chunk.strip())
        return chunks

    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:  # Only process non-empty sentences
            eos_char = sentence[-1] if sentence[-1] in eos_characters else ""
            sentence_without_eos = sentence[:-1] if eos_char else sentence

            if len(sentence_without_eos) > char_limit:
                # Split the long sentence
                sentence_chunks = split_long_sentence(sentence_without_eos, char_limit)
                for chunk in sentence_chunks[:-1]:  # All chunks except the last one
                    if current_paragraph:
                        paragraphs.append(current_paragraph.strip())
                    current_paragraph = chunk
                # Handle the last chunk
                last_chunk = sentence_chunks[-1]
                if (
                    len(current_paragraph) + len(last_chunk) + len(eos_char)
                    <= char_limit
                ):
                    current_paragraph += (
                        " " + last_chunk + eos_char
                        if current_paragraph
                        else last_chunk + eos_char
                    )
                else:
                    if current_paragraph:
                        paragraphs.append(current_paragraph.strip())
                    current_paragraph = last_chunk + eos_char
            else:
                if len(current_paragraph) + len(sentence) <= char_limit:
                    current_paragraph += (
                        " " + sentence if current_paragraph else sentence
                    )
                else:
                    if current_paragraph:
                        paragraphs.append(current_paragraph.strip())
                    current_paragraph = sentence

    if current_paragraph:
        paragraphs.append(current_paragraph.strip())

    return paragraphs
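
The splitter relies on `re.split` with a capturing group, which keeps each delimiter in the result so it can be glued back onto the sentence before it. The short sketch below isolates that step on a made-up sample string (`sample` is an illustrative value, not part of the story text):

```python
import re

eos_characters = ".\n"
# The capturing group (...) makes re.split return the delimiters too
pattern = f"([{re.escape(eos_characters)}])"

sample = "First sentence. Second one\nThird"
parts = re.split(pattern, sample)
# parts alternates between text and the delimiter that followed it

# Re-attach each delimiter to the text that precedes it
pairs = ["".join(parts[i : i + 2]) for i in range(0, len(parts), 2)]

print(parts)
print(pairs)
```

Each element of `pairs` is one sentence with its own end-of-sentence character, which is what the main loop of `text_splitter` iterates over.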

In [7]:
# Load the text from a file. Original text: "The Lottery" by Shirley Jackson https://www.newyorker.com/magazine/1948/06/26/the-lottery 

file_path = 'docs/the_lottery_text.txt'

# Read as UTF-8 explicitly, since the text contains curly quotes
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

print(text[:1000])

“The Lottery [abridged]” (1948)--- By Shirley Jackson

The people of the village began to gather in the square, between the post office and the bank, around ten o'clock; in some towns there were so many people that the lottery took two days and had to be started on June 2nd.  But in this village, where there were only about three hundred people, the whole lottery took less than two hours, so it could begin at ten o'clock in the morning and still be through in time to allow the villagers to get home for noon dinner. 
The children assembled first, of course. Bobby Martin had already stuffed his pockets full of stones, and the other boys soon followed his example, selecting the smoothest and roundest stones; Bobby and Harry Jones and Dickie Delacroix-- the villagers pronounced this name "Dellacroy"--eventually made a great pile of stones in one corner of the square and guarded it against the raids of the other boys. 
The lottery was conducted--as were the square dances, the teen club, the

In [9]:
paragraphs = text_splitter(text, char_limit=500)

print("Total number of paragraphs:", len(paragraphs))

print("\nFirst 5 paragraphs:")
for paragraph in paragraphs[:5]:
    print(paragraph)
    print("\n---\n")

Total number of paragraphs: 23

First 5 paragraphs:
“The Lottery [abridged]” (1948)--- By Shirley Jackson The people of the village began to gather in the square, between the post office and the bank, around ten o'clock; in some towns there were so many people that the lottery took two days and had to be started on June 2nd.

---

But in this village, where there were only about three hundred people, the whole lottery took less than two hours, so it could begin at ten o'clock in the morning and still be through in time to allow the villagers to get home for noon dinner. The children assembled first, of course.

---

Bobby Martin had already stuffed his pockets full of stones, and the other boys soon followed his example, selecting the smoothest and roundest stones; Bobby and Harry Jones and Dickie Delacroix-- the villagers pronounced this name "Dellacroy"--eventually made a great pile of stones in one corner of the square and guarded it against the raids of the other boys. The lottery 

In [14]:
import os

# Set the OpenAI API key (replace the placeholder with your own key and keep real keys out of shared notebooks)

os.environ['OPENAI_API_KEY'] = 'sk-***'

In [21]:
from openai import OpenAI
import numpy as np
import faiss

client = OpenAI()

In [25]:
def get_embedding(text: str, model: str = "text-embedding-3-small"):
    """
    Get embedding for a given text using OpenAI's API.

    Parameters:
    text (str): The input text for which the embedding needs to be generated.
    model (str): The name of the model to use for generating the embedding. Defaults to "text-embedding-3-small", a small, low-cost embedding model.

    Returns:
    list: The embedding vector for the input text.
    """
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

In [26]:
# Embed every paragraph (one API call per paragraph)
embeddings = [get_embedding(paragraph) for paragraph in paragraphs]
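
Embedding one paragraph per request is fine for 23 chunks, but the embeddings endpoint also accepts a list of inputs, so larger corpora can be sent in batches. The following is a minimal sketch under stated assumptions: `batched`, `embed_in_batches`, and `fake_embed_batch` are hypothetical helpers introduced here, and the stand-in embedder exists only so the example runs without network access.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Call embed_batch once per batch and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch: Sequence[str]) -> List[List[float]]:
    # Hypothetical stand-in: "embeds" each text as [character count, word count]
    return [[float(len(t)), float(len(t.split()))] for t in batch]


# With the real client, embed_batch could look like this (assumption, not run here):
# def openai_embed_batch(batch):
#     response = client.embeddings.create(input=list(batch), model="text-embedding-3-small")
#     return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]

demo_vectors = embed_in_batches(["a b", "cde", "f g h i"], fake_embed_batch, batch_size=2)
print(demo_vectors)
```

The batch size of 100 is an arbitrary illustrative default; the practical limit depends on the provider's request limits.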

In [27]:
# Initialize FAISS index
dimension = len(embeddings[0])  # Dimension of the embedding
index = faiss.IndexFlatL2(dimension)

In [28]:
# Add embeddings to the FAISS index
index.add(np.array(embeddings).astype("float32"))
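
`IndexFlatL2` performs an exact (brute-force) search and reports squared Euclidean distances. To make that concrete, here is a NumPy-only sketch of the same idea on toy 2-D vectors; `brute_force_l2_search` and the toy arrays are illustrative names, not part of FAISS.

```python
import numpy as np


def brute_force_l2_search(vectors: np.ndarray, queries: np.ndarray, k: int):
    """Return (squared distances, indices) of the k nearest vectors per query."""
    # Pairwise differences, shape (num_queries, num_vectors, dimension)
    diffs = queries[:, None, :] - vectors[None, :, :]
    squared = np.sum(diffs**2, axis=2)
    # Sort each row by distance and keep the k closest
    nearest = np.argsort(squared, axis=1)[:, :k]
    return np.take_along_axis(squared, nearest, axis=1), nearest


toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")

toy_distances, toy_indices = brute_force_l2_search(toy_vectors, toy_query, k=2)
print(toy_indices, toy_distances)
```

This materializes every pairwise difference, so it only suits small collections; it is meant to show what the index computes, not to replace it.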

In [29]:
# Perform a search
query = "When was it written?"
query_embedding = get_embedding(query)

k = 2  # Number of nearest neighbors to retrieve
distances, indices = index.search(np.array([query_embedding]).astype("float32"), k)

In [32]:
# Retrieve and print results
print(f"Query: {query}")
print("Nearest neighbors:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {paragraphs[idx]} (Distance: {distances[0][i]})")
    
# Combine only the retrieved paragraphs into a single context
context = " ".join(paragraphs[idx] for idx in indices[0])

Query: When was it written?
Nearest neighbors:
1. “The Lottery [abridged]” (1948)--- By Shirley Jackson The people of the village began to gather in the square, between the post office and the bank, around ten o'clock; in some towns there were so many people that the lottery took two days and had to be started on June 2nd. (Distance: 1.5568045377731323)
2. Summers said, and Bill Hutchinson reached into the box and felt around, bringing his hand out at last with the slip of paper in it. The crowd was quiet. A girl whispered, "I hope it's not Nancy," and the sound of the whisper reached the edges of the crowd. "It's not the way it used to be. " Old Man Warner said clearly. "People ain't the way they used to be. " "All right," Mr. Summers said. "Open the papers. Harry, you open little Dave's. " Mr. (Distance: 1.602304458618164)
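
The distances above are squared L2 distances. If the embedding vectors have unit length (an assumption here; OpenAI documents its embeddings as normalized), a squared distance d² maps to cosine similarity as cos = 1 - d²/2, because ‖a - b‖² = 2 - 2·(a·b) for unit vectors. A small sketch applying this to the two reported distances:

```python
def squared_l2_to_cosine(squared_distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - squared_distance / 2.0


# Sanity check on two unit vectors: a = (1, 0), b = (0.6, 0.8)
a, b = (1.0, 0.0), (0.6, 0.8)
dot = a[0] * b[0] + a[1] * b[1]
squared = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
check = squared_l2_to_cosine(squared)  # should agree with dot

# Distances reported by the search above
reported = [1.5568045377731323, 1.602304458618164]
similarities = [squared_l2_to_cosine(d) for d in reported]
print(similarities)
```

Both similarities come out low (roughly 0.22 and 0.20), which suggests that a short query like "When was it written?" is only weakly related to any single chunk, even the one that contains the answer.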


In [34]:
# Send the retrieved paragraphs along with the query to the OpenAI API to generate a complete answer

full_query = f"""Use the context below to answer the question that follows. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{context}
\"\"\"

Question: {query}"""

response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You answer questions for the user.",
        },
        {"role": "user", "content": full_query},
    ],
    model="gpt-4-turbo",
    temperature=0,
)

print(response.choices[0].message.content)

1948
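
The retrieval and prompt-building steps can also be wrapped in small helpers so that only the retrieved chunks reach the model. This is a minimal sketch: `build_context` and `build_prompt` are hypothetical helpers, and the toy paragraphs stand in for the real `paragraphs` and `indices[0]` values.

```python
from typing import Sequence


def build_context(paragraphs: Sequence[str], retrieved_indices: Sequence[int]) -> str:
    """Join the retrieved paragraphs, skipping duplicates and negative indices."""
    selected = []
    seen = set()
    for idx in retrieved_indices:
        idx = int(idx)  # FAISS returns NumPy integers
        if idx < 0 or idx in seen:  # FAISS pads with -1 when it finds fewer than k results
            continue
        seen.add(idx)
        selected.append(paragraphs[idx])
    return "\n\n".join(selected)


def build_prompt(context: str, question: str) -> str:
    """Wrap the context and question in the same prompt layout used in this notebook."""
    return (
        "Use the context below to answer the question that follows. "
        'If the answer cannot be found, write "I don\'t know."\n\n'
        f'Article:\n"""\n{context}\n"""\n\n'
        f"Question: {question}"
    )


toy_paragraphs = ["Alpha chunk.", "Beta chunk.", "Gamma chunk."]
toy_prompt = build_prompt(build_context(toy_paragraphs, [2, 0, -1]), "Which chunks?")
print(toy_prompt)
```

In the notebook itself, the call would be `build_prompt(build_context(paragraphs, indices[0]), query)`, with the result passed as the user message.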
