In [1]:
!pip install chromadb
!pip install langchain
!pip install openai
!pip install requests




In [2]:
import chromadb

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [4]:
from openai import OpenAI

book_metadata =  {
        "title": "The Great Gatsby",
        "author" : "F. Scott Fitzgerald",
        "source_url": "https://www.gutenberg.org/cache/epub/64317/pg64317.txt",
        "filename": "pg64317.txt"
  }

#### Checkpoint 1/3

First, initialize the OpenAI client by calling `OpenAI()` and assign it to `client_openai`.

Then, call the client's `.chat.completions.create()` method with the following arguments:

- for `model`, pass `"gpt-4"`
- for `messages`, pass a list containing two dicts:
    - one dict with `"role"` set to `"system"` and `"content"` set to `"You are a helpful assistant connected to a database for document search."`
    - one dict with `"role"` set to `"user"` and `"content"` set to `prompt`

This `get_completion` helper function will make it easy for us to prompt the model throughout the lesson.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [5]:
## YOUR SOLUTION HERE ##
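# `config` is assumed to be a local module that exposes RAG_API_KEY.
# OpenAI() accepts keyword arguments only, so the key is passed as api_key=...
import config
client_openai = OpenAI(api_key=config.RAG_API_KEY)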

def get_completion(prompt):
    ## YOUR SOLUTION HERE ##
    response = client_openai.chat.completions.create(
        model= "gpt-4",
        messages=[
            {"role":"system", "content": "You're a helpful assistant who retrieves information from external sources and presents them to the user."},
            {"role": "user", "content": prompt},
        ]
    )
    return response.choices[0].message.content
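
The helper above can only be exercised with a live API key. As a minimal sketch of the same call shape, the variant below takes the client as a parameter so it can be tried offline. `build_messages`, `get_completion_with`, and `fake_client` are hypothetical names introduced for illustration, and the stand-in client simply echoes the user message back; nothing here calls the OpenAI API.

```python
from types import SimpleNamespace

SYSTEM_PROMPT = "You are a helpful assistant connected to a database for document search."


def build_messages(prompt, system_prompt=SYSTEM_PROMPT):
    """Return the two-message list passed to chat.completions.create()."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]


def get_completion_with(client, prompt, model="gpt-4"):
    """Send the prompt through any client exposing chat.completions.create()."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(prompt),
    )
    return response.choices[0].message.content


# Hypothetical stand-in for the real client: echoes the user message back.
def _fake_create(model, messages):
    reply = SimpleNamespace(content=f"echo: {messages[-1]['content']}")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

print(get_completion_with(fake_client, "hello"))
```

Passing `client_openai` in place of `fake_client` should behave like `get_completion`, apart from the different system prompt.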


#### Chunking the Document

In the next cell, we'll initialize a recursive text splitter and create chunks of our text with it.

Be sure to execute this cell before moving on to the next checkpoint.

In [6]:
import os
import requests

In [7]:
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
great_gatsby = r.text

In [8]:
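# Trim what appears to be the Project Gutenberg header and footer; the offsets are specific to this file
great_gatsby = great_gatsby[1433:277912]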
print(great_gatsby[:500])

as Parke d’Invilliers


                                  I

In my younger and more vulnerable years my father gave me some advice
that I’ve been turning over in my mind ever since.

“Whenever you feel like criticizing anyone,” he told me, “just
remember that all the people in this world haven’t had the advantages
that you’ve had.”

He didn’t say any more, but we’ve always been unusually communicative
in a reserved way, and I understood that he meant a great deal more
than that. In


In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=[". ", "? ", "! "],
    chunk_size=2000,
    chunk_overlap=300,
)
# with open("themind.txt", "r") as file:
#     content = file.read()
# chunks = text_splitter.create_documents([content])

chunks_gatsby = text_splitter.create_documents([great_gatsby])
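
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that cuts fixed-size windows with a shared overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the separators first and then merges pieces up to the size limit); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each window starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next. The overlap is what lets a sentence cut at a chunk boundary still appear whole in one of the two neighbors.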



In [10]:
print(f" 'The Great Gatsby' - First Chunk:\n{chunks_gatsby[0].page_content}\n")

 'The Great Gatsby' - First Chunk:
as Parke d’Invilliers


                                  I

In my younger and more vulnerable years my father gave me some advice
that I’ve been turning over in my mind ever since.

“Whenever you feel like criticizing anyone,” he told me, “just
remember that all the people in this world haven’t had the advantages
that you’ve had.”

He didn’t say any more, but we’ve always been unusually communicative
in a reserved way, and I understood that he meant a great deal more
than that. In consequence, I’m inclined to reserve all judgements, a
habit that has opened up many curious natures to me and also made me
the victim of not a few veteran bores. The abnormal mind is quick to
detect and attach itself to this quality when it appears in a normal
person, and so it came about that in college I was unjustly accused of
being a politician, because I was privy to the secret griefs of wild,
unknown men. Most of the confidences were unsought—freq

#### Checkpoint 2/3
Now we will create a new persistent Chroma collection.

Use `chromadb.PersistentClient()` to initialize a database that is saved to disk and persists throughout the lesson. Pass it the path where the data should be stored, `"./advanced"`.

Then use the Chroma client's `.get_or_create_collection()` method. Pass the method the `name` `"advanced"` and indicate we'll use cosine similarity with `metadata={"hnsw:space": "cosine"}`.

Fill in the missing sections of the cell.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [11]:
## YOUR SOLUTION HERE ##
client_chroma = chromadb.PersistentClient(path="./advanced")
collection = client_chroma.get_or_create_collection(name = "advanced", metadata = {"hnsw:space": "cosine"})

print(f"ChromaDB collection {client_chroma.list_collections()}")

ChromaDB collection ['advanced']
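
For reference, the sketch below shows the quantity the `"cosine"` space is based on. To my understanding, Chroma reports a cosine *distance*, that is `1 - cosine similarity`, so smaller values mean closer matches. `cosine_distance` is a hypothetical pure-Python helper written for illustration, not Chroma's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction  -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal      -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite        -> 2.0
```

Because only the direction of the vectors matters, two passages of different length can still score as close when their embeddings point the same way.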


#### Uploading the chunks

Now that our collection is initialized, we can upload the chunks we made earlier to it.

We enumerate through the list of chunks, accessing the chunk's text in `chunk.page_content`, then add a chunk index to its metadata and upload the document, its id, and its metadata to our Chroma collection.

Be sure to execute this cell before moving on to the next one.

In [12]:
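N = 150  # upload only the first N chunks
for idx, chunk in enumerate(chunks_gatsby[:N]):
    doc_text = chunk.page_content
    # Build a fresh metadata dict per chunk instead of mutating the shared book_metadata
    chunk_metadata = {**book_metadata, "chunk_idx": idx}
    collection.add(
        documents=[doc_text],
        ids=[f"{book_metadata['title']}_{idx}"],
        metadatas=[chunk_metadata]
    )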

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 29.6MiB/s]
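
The loop above calls `collection.add()` once per chunk. Since `add()` accepts parallel lists, the same upload can be prepared as one batch. The sketch below only builds the three lists, so it runs without a database; `build_batch` is a hypothetical helper, and the demo chunks are plain strings rather than LangChain documents.

```python
def build_batch(texts, base_metadata, limit=None):
    """Return (documents, ids, metadatas) lists for a single collection.add() call."""
    documents, ids, metadatas = [], [], []
    for idx, text in enumerate(texts[:limit]):
        documents.append(text)
        ids.append(f"{base_metadata['title']}_{idx}")
        # Copy the shared metadata so every chunk keeps its own chunk_idx
        metadatas.append({**base_metadata, "chunk_idx": idx})
    return documents, ids, metadatas


demo_texts = ["first chunk", "second chunk", "third chunk"]
demo_metadata = {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald"}
documents, ids, metadatas = build_batch(demo_texts, demo_metadata, limit=2)
print(ids)

# With the notebook's objects, the upload would then be one call (assumed usage):
# texts = [chunk.page_content for chunk in chunks_gatsby]
# documents, ids, metadatas = build_batch(texts, book_metadata, limit=150)
# collection.add(documents=documents, ids=ids, metadatas=metadatas)
```

Batching should mainly reduce per-call overhead; the stored records are the same as in the loop version.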


#### Formatting the search results
Now we'll define a helper function that takes a user query and returns a well-formatted, pseudo-XML string of the search results. This will make it easier to experiment throughout the lesson.

Don't forget to execute this cell before moving on.

In [13]:
def populate_rag_query(query, n_results=1):
    search_results = collection.query(query_texts=[query], n_results=n_results)
    result_str = ""
    for idx, result in enumerate(search_results["documents"][0]):
        metadata = search_results["metadatas"][0][idx]
        formatted_result = f"""<SEARCH RESULT>
        <DOCUMENT>{result}</DOCUMENT>
        <METADATA>
        <TITLE>{metadata['title']}</TITLE>
        <AUTHOR>{metadata['author']}</AUTHOR>
        <CHUNK_IDX>{metadata['chunk_idx']}</CHUNK_IDX>
        <URL>{metadata['source_url']}</URL>
        </METADATA>
        </SEARCH RESULT>"""
        result_str += formatted_result
    return result_str
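
The `[0]` indexing in `populate_rag_query` is there because `collection.query()` returns one inner list per query text. The sketch below uses a hand-written dictionary in that shape to show how the parallel lists line up. The values in `mock_results` are made up for illustration, and `flatten_first_query` is a hypothetical helper.

```python
# Hand-written stand-in for a collection.query() result; all values are illustrative.
mock_results = {
    "ids": [["The Great Gatsby_3", "The Great Gatsby_7"]],
    "documents": [["chunk three text", "chunk seven text"]],
    "metadatas": [[{"chunk_idx": 3}, {"chunk_idx": 7}]],
    "distances": [[0.21, 0.34]],
}


def flatten_first_query(results):
    """Pair up the parallel lists returned for the first (here: only) query text."""
    rows = zip(
        results["ids"][0],
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )
    return [
        {"id": id_, "document": doc, "chunk_idx": meta["chunk_idx"], "distance": dist}
        for id_, doc, meta, dist in rows
    ]


hits = flatten_first_query(mock_results)
for hit in hits:
    print(hit["id"], hit["chunk_idx"], hit["distance"])
```

With a single query text, index `0` selects the only inner list; position `i` in each inner list then refers to the same search hit.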

#### Checkpoint 3/3

Finally, create the RAG prompt to send to the LLM.

Write out the instructions to the model in your own words in the `<INSTRUCTIONS>` section.

Consider how you might guide the model to:
 - Use the search results effectively
 - Handle cases where information isn't available
 - Provide credibility to its answers by citing sources

Note we've included an `<EXAMPLE CITATION>` that can show the model how its cited sources should look.

To wrap it up, pass the correct variables in the `<USER QUERY>` and `<SEARCH RESULTS>` sections to finish the function.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.


In [14]:
def make_rag_prompt(query, results):
    return f"""<INSTRUCTIONS>
   <EXAMPLE CITATION>
   Answer to the user query in your own words, drawn from the search results.
   - "Direct quote from source material backing up the claim" - [Source: Title, Author, Chunk: chunk index, Link: url]
   </EXAMPLE CITATION>
   </INSTRUCTIONS>

    <USER QUERY>
    {query}
    </USER QUERY>

    <SEARCH RESULTS>
    {results}
    </SEARCH RESULTS>

    Your answer:"""
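
The `<INSTRUCTIONS>` section above contains only the example citation. As one possible wording (not the course's reference answer), the sketch below adds written instructions covering the three points from the checkpoint: use the search results, admit when the answer is missing, and cite sources. `RAG_INSTRUCTIONS` and `make_rag_prompt_with_instructions` are hypothetical names.

```python
# One possible wording, not the course's reference answer.
RAG_INSTRUCTIONS = (
    "Answer the user query using only the information in the search results.\n"
    "If the search results do not contain the answer, say so instead of guessing.\n"
    "Back up each claim with a direct quote, cited in the format of the example citation."
)


def make_rag_prompt_with_instructions(query, results, instructions=RAG_INSTRUCTIONS):
    """Variant of make_rag_prompt that places written instructions before the example."""
    return f"""<INSTRUCTIONS>
{instructions}
<EXAMPLE CITATION>
Answer to the user query in your own words, drawn from the search results.
- "Direct quote from source material backing up the claim" - [Source: Title, Author, Chunk: chunk index, Link: url]
</EXAMPLE CITATION>
</INSTRUCTIONS>

<USER QUERY>
{query}
</USER QUERY>

<SEARCH RESULTS>
{results}
</SEARCH RESULTS>

Your answer:"""


demo_prompt = make_rag_prompt_with_instructions(
    "Who is Nick?", "<SEARCH RESULT>...</SEARCH RESULT>"
)
print(demo_prompt)
```

Keeping the instructions in a separate constant makes it easy to try different wordings without touching the prompt layout.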

In [15]:
# .get() retrieves records from a collection, here by id

collection.get(ids=['The Great Gatsby_10'])

{'ids': ['The Great Gatsby_10'],
 'embeddings': None,
 'documents': ['. Her grey sun-strained eyes looked\r\nback at me with polite reciprocal curiosity out of a wan, charming,\r\ndiscontented face. It occurred to me now that I had seen her, or a\r\npicture of her, somewhere before.\r\n\r\n“You live in West Egg,” she remarked contemptuously. “I know somebody\r\nthere.”\r\n\r\n“I don’t know a single—”\r\n\r\n“You must know Gatsby.”\r\n\r\n“Gatsby?” demanded Daisy. “What Gatsby?”\r\n\r\nBefore I could reply that he was my neighbour dinner was announced;\r\nwedging his tense arm imperatively under mine, Tom Buchanan compelled\r\nme from the room as though he were moving a checker to another square.\r\n\r\nSlenderly, languidly, their hands set lightly on their hips, the two\r\nyoung women preceded us out on to a rosy-coloured porch, open toward\r\nthe sunset, where four candles flickered on the table in the\r\ndiminished wind.\r\n\r\n“Why candles?” objected Daisy, frowning. She snapped the

In [16]:
collection.get(where={"chunk_idx": {"$eq": 151}})
# No uploaded chunk has this index (only 0 through 149 were added), so every field comes back empty

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [17]:
collection.get(where={"chunk_idx": {"$eq": 10}})


{'ids': ['The Great Gatsby_10'],
 'embeddings': None,
 'documents': ['. Her grey sun-strained eyes looked\r\nback at me with polite reciprocal curiosity out of a wan, charming,\r\ndiscontented face. It occurred to me now that I had seen her, or a\r\npicture of her, somewhere before.\r\n\r\n“You live in West Egg,” she remarked contemptuously. “I know somebody\r\nthere.”\r\n\r\n“I don’t know a single—”\r\n\r\n“You must know Gatsby.”\r\n\r\n“Gatsby?” demanded Daisy. “What Gatsby?”\r\n\r\nBefore I could reply that he was my neighbour dinner was announced;\r\nwedging his tense arm imperatively under mine, Tom Buchanan compelled\r\nme from the room as though he were moving a checker to another square.\r\n\r\nSlenderly, languidly, their hands set lightly on their hips, the two\r\nyoung women preceded us out on to a rosy-coloured porch, open toward\r\nthe sunset, where four candles flickered on the table in the\r\ndiminished wind.\r\n\r\n“Why candles?” objected Daisy, frowning. She snapped the

In [21]:
def get_next_and_previous_chunks(chunk_idx):
    # Filter on the chunk_idx metadata field via the where argument
    previous_chunk = collection.get(where={"chunk_idx": {"$eq": chunk_idx - 1}})
    next_chunk = collection.get(where={"chunk_idx": {"$eq": chunk_idx + 1}})

    return previous_chunk, next_chunk
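
The function above issues two `get()` calls. Chroma's `where` filters also support an `$in` operator, so a chunk and its neighbors could be requested in one call. The sketch below only builds the filter and sorts a result, so it runs without a database. `neighbor_filter` and `sort_by_chunk_idx` are hypothetical helpers, and the sort is there because I would not rely on `get()` returning records in index order.

```python
def neighbor_filter(chunk_idx, window=1):
    """Build a `where` filter matching chunk_idx and its neighbors within `window`."""
    wanted = [i for i in range(chunk_idx - window, chunk_idx + window + 1) if i >= 0]
    return {"chunk_idx": {"$in": wanted}}


def sort_by_chunk_idx(result):
    """Order (document, metadata) pairs from a get() result by chunk index."""
    pairs = zip(result["documents"], result["metadatas"])
    return sorted(pairs, key=lambda pair: pair[1]["chunk_idx"])


print(neighbor_filter(10))  # {'chunk_idx': {'$in': [9, 10, 11]}}
print(neighbor_filter(0))   # {'chunk_idx': {'$in': [0, 1]}}

# Hand-written stand-in for an unordered get() result
mock_get = {
    "documents": ["eleven", "nine", "ten"],
    "metadatas": [{"chunk_idx": 11}, {"chunk_idx": 9}, {"chunk_idx": 10}],
}
print([doc for doc, _ in sort_by_chunk_idx(mock_get)])  # ['nine', 'ten', 'eleven']

# Assumed usage with the notebook's collection:
# window = collection.get(where=neighbor_filter(10))
```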

In [24]:
def expanded_search_results(original_chunk):
    original_chunk_idx = original_chunk["metadatas"][0]["chunk_idx"]
    previous_chunk, next_chunk = get_next_and_previous_chunks(original_chunk_idx)
    result_str = ""
    for chunk in [previous_chunk, original_chunk, next_chunk]:
        if len(chunk["metadatas"])>0:
            metadata = chunk["metadatas"][0]
            formatted_result = f"""<SEARCH RESULT>
            <DOCUMENT>{chunk["documents"][0]}</DOCUMENT>
            <METADATA>
            <TITLE>{metadata["title"]}</TITLE>
            <AUTHOR>{metadata["author"]}</AUTHOR>
            <CHUNK_IDX>{metadata["chunk_idx"]}</CHUNK_IDX>
            <URL>{metadata["source_url"]}</URL>
            </METADATA>
            </SEARCH RESULT>"""
            result_str += formatted_result
    return result_str

original_demo_chunk = collection.get(where={"chunk_idx": {"$eq": 10}})
expanded_results = expanded_search_results(original_demo_chunk)
print(expanded_results)

<SEARCH RESULT>
            <DOCUMENT>. Tomorrow!” Then she added
irrelevantly: “You ought to see the baby.”

“I’d like to.”

“She’s asleep. She’s three years old. Haven’t you ever seen her?”

“Never.”

“Well, you ought to see her. She’s—”

Tom Buchanan, who had been hovering restlessly about the room, stopped
and rested his hand on my shoulder.

“What you doing, Nick?”

“I’m a bond man.”

“Who with?”

I told him.

“Never heard of them,” he remarked decisively.

This annoyed me.

“You will,” I answered shortly. “You will if you stay in the East.”

“Oh, I’ll stay in the East, don’t you worry,” he said, glancing at
Daisy and then back at me, as if he were alert for something
more. “I’d be a God damned fool to live anywhere else.”

At this point Miss Baker said: “Absolutely!” with such suddenness that
I started—it was the first word she had uttered since I came into the
room. Evidently it surprised her as much as it did me, for she yawned
and with a seri

In [26]:
def make_decoupled_rag_prompt(query, n_results=1):
    search_results = collection.query(query_texts=[query], n_results=n_results)
    total_result_str = ""
    for doc, metadata in zip(search_results['documents'][0], search_results['metadatas'][0]):
        # Wrap each hit in the same shape collection.get() returns so it can be expanded
        chunk = {
            'documents': [doc],
            'metadatas': [metadata]
        }
        total_result_str += expanded_search_results(chunk)
    # Build the prompt once, after all results are collected; this also keeps
    # the return value defined when the query returns no results
    return make_rag_prompt(query, total_result_str)

# given a search result, we can get everything back in a formatted string
prompt = make_decoupled_rag_prompt("Describe The Buchanans' House")
rag_completion = get_completion(prompt)
print(rag_completion)

The text does not provide a direct description of the Buchanans' house. The narrative mentions a veranda, a dining room, and a driveway filled with hot gravel. The house is located across the bay from Gatsby's place and has a garage that Tom has converted into a stable. They also have an upper window from where Daisy calls out. However, specific details about the house's structure or appearance aren't provided in the excerpts. Please note these descriptions come from the novel "The Great Gatsby" by F. Scott Fitzgerald. - [Source: The Great Gatsby, F. Scott Fitzgerald, Chunk: 104-106, Link: https://www.gutenberg.org/cache/epub/64317/pg64317.txt]
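
One thing to watch when `n_results` is greater than 1: hits that sit next to each other expand to overlapping windows, so the same chunk can end up in the prompt more than once. A minimal sketch of collecting each index only once is shown below. `expanded_indices` is a hypothetical helper that works on chunk indices alone, keeping the order in which the hits were ranked.

```python
def expanded_indices(hit_indices, window=1, max_idx=None):
    """Return each hit plus its neighbors once, in first-seen order."""
    seen = set()
    ordered = []
    for hit in hit_indices:
        for idx in range(hit - window, hit + window + 1):
            # Skip indices that fall outside the uploaded range
            if idx < 0 or (max_idx is not None and idx > max_idx):
                continue
            if idx not in seen:
                seen.add(idx)
                ordered.append(idx)
    return ordered


print(expanded_indices([10, 11, 0]))          # [9, 10, 11, 12, 0, 1]
print(expanded_indices([149], max_idx=149))   # [148, 149]
```

Each index in the result could then be fetched and formatted as in `expanded_search_results`, which should keep the prompt shorter without losing context.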
