In [1]:
!pip install chromadb
!pip install langchain
!pip install openai
!pip install requests

In [2]:
import chromadb

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [4]:
from openai import OpenAI

book_metadata =  {
        "title": "The Great Gatsby",
        "author" : "F. Scott Fitzgerald",
        "source_url": "https://www.gutenberg.org/cache/epub/64317/pg64317.txt",
        "filename": "pg64317.txt"
  }

#### Checkpoint 1/3

First, initialize the OpenAI client by calling `OpenAI()` and assign it to `client_openai`.

Then, call the client's `.chat.completions.create()` method and pass it the following values:

- for `model`, pass `"gpt-4"`
- for `messages`, pass a list containing two dicts:
    - one dict with `"role"` set to `"system"` and `"content"` set to `"You are a helpful assistant connected to a database for document search."`
    - one dict with `"role"` set to `"user"` and `"content"` set to `prompt`

This `get_completion` helper function will make it easy for us to prompt the model throughout the lesson.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.
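
As a minimal sketch of the `messages` structure described above, the snippet below builds the two-message list without calling the API. The helper name `build_messages` and the sample question are hypothetical, introduced only for illustration; the system text is the one given in the instructions.

```python
SYSTEM_PROMPT = "You are a helpful assistant connected to a database for document search."

def build_messages(prompt):
    """Return the chat messages list: one system message, then the user's prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]

# Hypothetical sample question, used only to show the resulting structure.
messages = build_messages("Who narrates The Great Gatsby?")
for message in messages:
    print(message["role"], "->", message["content"])
```

A list shaped like this is what gets passed as `messages` to `.chat.completions.create()`.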

In [None]:
import config

In [None]:
## YOUR SOLUTION HERE ##
# The OpenAI client takes its key as a keyword argument.
client_openai = OpenAI(api_key=config.RAG_API_KEY)

def get_completion(prompt):
    ## YOUR SOLUTION HERE ##
    response = client_openai.chat.completions.create(
        model= "gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant connected to a database for document search."},
            {"role": "user", "content": prompt},
        ]
    )
    return response.choices[0].message.content


#### Chunking the Document

In the next cell, we'll initialize a recursive text splitter and create chunks of our text with it.

Be sure to execute this cell before moving on to the next checkpoint.
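
The splitter in the cells below is configured with a `chunk_size` and a `chunk_overlap`. To make those two settings concrete, here is a simplified sketch of overlapping chunking in plain Python. It is a fixed-width sliding window, not the separator-aware algorithm that `RecursiveCharacterTextSplitter` uses, and the function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters of toy text
demo_chunks = split_with_overlap(sample, chunk_size=12, chunk_overlap=4)
for chunk in demo_chunks:
    print(repr(chunk))
```

Each window starts 8 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary from being lost to both chunks.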

In [7]:
import os
import requests

In [8]:
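# Download the plain-text edition from Project Gutenberg and fail early on HTTP errors.
r = requests.get('https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
r.raise_for_status()
great_gatsby = r.text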

In [9]:
# Keep only a hand-picked character range of the download (offsets are specific to this file).
great_gatsby = great_gatsby[1433:277912]
print(great_gatsby[:500])

as Parke d’Invilliers


                                  I

In my younger and more vulnerable years my father gave me some advice
that I’ve been turning over in my mind ever since.

“Whenever you feel like criticizing anyone,” he told me, “just
remember that all the people in this world haven’t had the advantages
that you’ve had.”

He didn’t say any more, but we’ve always been unusually communicative
in a reserved way, and I understood that he meant a great deal more
than that. In


In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=[". ", "? ", "! "],
    chunk_size=2000,
    chunk_overlap=300,
)
# with open("themind.txt", "r") as file:
#     content = file.read()
# chunks = text_splitter.create_documents([content])

chunks_gatsby = text_splitter.create_documents([great_gatsby])



In [11]:
print(f" 'The Great Gatsby' - First Chunk:\n{chunks_gatsby[0].page_content}\n")

 'The Great Gatsby' - First Chunk:
as Parke d’Invilliers


                                  I

In my younger and more vulnerable years my father gave me some advice
that I’ve been turning over in my mind ever since.

“Whenever you feel like criticizing anyone,” he told me, “just
remember that all the people in this world haven’t had the advantages
that you’ve had.”

He didn’t say any more, but we’ve always been unusually communicative
in a reserved way, and I understood that he meant a great deal more
than that. In consequence, I’m inclined to reserve all judgements, a
habit that has opened up many curious natures to me and also made me
the victim of not a few veteran bores. The abnormal mind is quick to
detect and attach itself to this quality when it appears in a normal
person, and so it came about that in college I was unjustly accused of
being a politician, because I was privy to the secret griefs of wild,
unknown men. Most of the confidences were unsought—freq

#### Checkpoint 2/3
Now we will create a new persistent Chroma collection.

Use `chromadb.PersistentClient()` to initialize a database that will persist throughout the lesson. Pass it the path where the database should be stored, `"./advanced"`.

Then use the Chroma client's `.get_or_create_collection()` method. Pass it the `name` `"advanced"` and select cosine similarity with `metadata={"hnsw:space": "cosine"}`.

Fill in the missing sections of the cell.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.
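
For intuition about the `"hnsw:space": "cosine"` setting, the sketch below computes cosine distance in plain Python, taking the distance to be 1 minus the cosine similarity (the convention assumed here for a cosine space). The function name `cosine_distance` and the toy vectors are illustrative only; in the lesson, Chroma computes this internally on embedding vectors.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way: distance close to 0, regardless of length.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Vectors at a right angle: distance close to 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because only the angle between vectors matters, two passages about the same topic can score as close even when one is much longer than the other.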

In [12]:
## YOUR SOLUTION HERE ##
client_chroma = chromadb.PersistentClient(path="./advanced")
collection = client_chroma.get_or_create_collection(name = "advanced", metadata = {"hnsw:space": "cosine"})

print(f"ChromaDB collection {client_chroma.list_collections()}")

ChromaDB collection ['advanced']


#### Uploading the chunks

Now that our collection is initialized, we can upload the chunks we made earlier.

We enumerate the list of chunks, read each chunk's text from `chunk.page_content`, add the chunk index to its metadata, and upload the document, its id, and its metadata to our Chroma collection.

Be sure to execute this cell before moving on to the next one.
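
The following sketch shows how the ids and per-chunk metadata line up, without touching Chroma. The chunk texts are stand-ins, and the metadata dict is a shortened copy of the one defined earlier; copying it for each chunk is a choice made here so that every entry keeps its own `chunk_idx`.

```python
book_metadata = {
    "title": "The Great Gatsby",
    "author": "F. Scott Fitzgerald",
}

# Stand-in chunk texts; in the notebook these come from chunk.page_content.
chunk_texts = ["first chunk", "second chunk", "third chunk"]

# One id per chunk, built from the title and the chunk's position.
ids = [f"{book_metadata['title']}_{idx}" for idx in range(len(chunk_texts))]

# One metadata dict per chunk; {**book_metadata, ...} copies the shared fields.
metadatas = [{**book_metadata, "chunk_idx": idx} for idx in range(len(chunk_texts))]

for chunk_id, metadata in zip(ids, metadatas):
    print(chunk_id, metadata["chunk_idx"])
```

Parallel lists like these match the shape of the `documents`, `ids`, and `metadatas` arguments that `collection.add()` accepts.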

In [14]:
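N = 100  # upload only the first N chunks
for idx, chunk in enumerate(chunks_gatsby[:N]):
    doc_text = chunk.page_content
    # Copy the shared metadata so each chunk carries its own chunk_idx.
    chunk_metadata = {**book_metadata, "chunk_idx": idx}
    collection.add(
        documents=[doc_text],
        ids=[f"{book_metadata['title']}_{idx}"],
        metadatas=[chunk_metadata]
    )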



#### Formatting the search results
Now we'll define a helper function that takes a user query and returns a well-formatted, pseudo-XML string of the search results. This will make it easier to experiment throughout the lesson.

Don't forget to execute this cell before moving on.
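
The helper below indexes into the query results with `[0]`. As a hedged illustration of why, this sketch uses a hand-written stand-in for the dict that `collection.query()` returns: each field holds one inner list per query text. The values and the trimmed-down tag layout are illustrative only.

```python
# Hand-written stand-in for a query result; the values are illustrative.
search_results = {
    "ids": [["The Great Gatsby_0"]],
    "documents": [["In my younger and more vulnerable years my father gave me some advice"]],
    "metadatas": [[{"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "chunk_idx": 0}]],
}

# The outer list has one entry per query text, so [0] selects the first query's results.
documents = search_results["documents"][0]
metadatas = search_results["metadatas"][0]

formatted = ""
for document, metadata in zip(documents, metadatas):
    formatted += (
        "<SEARCH RESULT>\n"
        f"<DOCUMENT>{document}</DOCUMENT>\n"
        f"<TITLE>{metadata['title']}</TITLE>\n"
        f"<CHUNK_IDX>{metadata['chunk_idx']}</CHUNK_IDX>\n"
        "</SEARCH RESULT>\n"
    )
print(formatted)
```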

In [15]:
def populate_rag_query(query, n_results=1):
    search_results = collection.query(query_texts=[query], n_results=n_results)
    result_str = ""
    for idx, result in enumerate(search_results["documents"][0]):
        metadata = search_results["metadatas"][0][idx]
        formatted_result = f"""<SEARCH RESULT>
        <DOCUMENT>{result}</DOCUMENT>
        <METADATA>
        <TITLE>{metadata['title']}</TITLE>
        <AUTHOR>{metadata['author']}</AUTHOR>
        <CHUNK_IDX>{metadata['chunk_idx']}</CHUNK_IDX>
        <URL>{metadata['source_url']}</URL>
        </METADATA>
        </SEARCH RESULT>"""
        result_str += formatted_result
    return result_str

#### Checkpoint 3/3

Finally, create the RAG prompt to send to the LLM.

Write out the instructions to the model in your own words in the `<INSTRUCTIONS>` section.

Consider how you might guide the model to:
 - Use the search results effectively
 - Handle cases where information isn't available
 - Provide credibility to its answers by citing sources

Note that we've included an `<EXAMPLE CITATION>` that shows the model how its cited sources should look.

To wrap it up, pass the correct variables in the `<USER QUERY>` and `<SEARCH RESULTS>` sections to finish the function.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.


In [17]:
def make_rag_prompt(query, results):
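    return f"""<INSTRUCTIONS>
    Answer the user query using only the information in the search results.
    If the search results do not contain the answer, say so instead of guessing.
    Back up each claim with a direct quote and cite its source in the format shown below.
    <EXAMPLE CITATION>
    Answer to the user query in your own words, drawn from the search results.
    - "Direct quote from source material backing up the claim" - [Source: Title, Author, Chunk: chunk index, Link: url]
    </EXAMPLE CITATION>
    </INSTRUCTIONS>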

    <USER QUERY>
    {query}
    </USER QUERY>

    <SEARCH RESULTS>
    {results}
    </SEARCH RESULTS>

    Your answer:"""
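
To show how the pieces of the lesson chain together, here is a self-contained sketch of the search, prompt, and completion steps. Every function in it is a hypothetical stand-in so the block runs without a database or an API key: in the notebook, `populate_rag_query`, `make_rag_prompt`, and `get_completion` play the roles of `fake_search`, `build_prompt`, and `fake_completion`.

```python
def fake_search(query, n_results=1):
    """Stand-in for the search helper: returns one fixed pseudo-XML result."""
    return (
        "<SEARCH RESULT><DOCUMENT>In my younger and more vulnerable years "
        "my father gave me some advice</DOCUMENT></SEARCH RESULT>"
    )

def fake_completion(prompt):
    """Stand-in for the model call: reports the prompt size instead of contacting an API."""
    return f"(model reply to a {len(prompt)}-character prompt)"

def build_prompt(query, results):
    """Simplified prompt builder with the same query and results sections."""
    return (
        f"<USER QUERY>\n{query}\n</USER QUERY>\n\n"
        f"<SEARCH RESULTS>\n{results}\n</SEARCH RESULTS>\n\n"
        "Your answer:"
    )

def answer_with_rag(query, search=fake_search, complete=fake_completion):
    """Retrieve context, build the prompt, and return both the prompt and the reply."""
    results = search(query, n_results=1)
    prompt = build_prompt(query, results)
    return prompt, complete(prompt)

prompt, reply = answer_with_rag("What advice did the narrator's father give him?")
print(prompt)
print(reply)
```

Passing the search and completion functions as arguments is a design choice of this sketch: it lets the prompt be inspected before any real model call is made.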