<a href="https://colab.research.google.com/github/quanticedu/sample-rag-app/blob/lesson-4-complete/SUNLight_Lesson_4_Complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Lesson 3: Vector Database


##Getting Started
If you're new to Google Colab, download and review the [Getting Started with Colab](https://uploads.smart.ly/assets/49f329a834468c6f6e9010cbf337a2753b22d35c245e49fc00d4b89e4ceb10fa/original/49f329a834468c6f6e9010cbf337a2753b22d35c245e49fc00d4b89e4ceb10fa.pdf) guide.

Your code and data will run in the `/content` directory. Create a subdirectory in `/content` called `context_data` and upload the [context documents for the course](https://uploads.smart.ly/assets/b10a588ae693ff74daaf04058ce6254b05efd193f289f0a1cc01f9c934ee3d13/original/b10a588ae693ff74daaf04058ce6254b05efd193f289f0a1cc01f9c934ee3d13.zip) into `context_data`.

You'll also need an API key from Hugging Face. Visit their [signup page](https://huggingface.co/join), enter your email and a password, then complete your profile. Once you have an account and are signed in, go to [Settings | Access Tokens](https://huggingface.co/settings/tokens) and select "New token." Write tokens allow you to post to Hugging Face, which you won't be doing here, so you only need a read-type token.

Once you have your token, enter it below and run the cell by clicking the play button on its left. Note that shell commands in a notebook cell, such as the `pip` commands below, must be prefixed with a bang (`!`).

In [None]:
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = "your-token-here"
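
If you'd rather not leave the token in a notebook that might be shared, one option is to prompt for it instead. The sketch below is only an assumption about how you might do that: `set_hf_token` is a hypothetical helper, not part of the course code, and it stores the value under the same environment variable name used above.

```python
import os
from getpass import getpass


def set_hf_token(read_secret=getpass, env=os.environ):
    """Prompt for a token and store it under the name this lesson uses.

    read_secret and env are parameters only so the helper can be tried
    out without a real prompt; both names are hypothetical.
    """
    token = read_secret("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No token entered.")
    env["HUGGINGFACEHUB_API_TOKEN"] = token
    return token
```

Calling `set_hf_token()` in a cell shows an input box that hides what you type.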

LangChain touches all aspects of this app, so let's go ahead and install it now.

In [None]:
!pip install langchain==0.1.13 langchain-community==0.0.29 langchain-core==0.1.36

##Loading Context Documents
The first step in building the vector database is to load the context documents. Load them into a variable named `context_data`.

In [None]:
!pip install pypdf==4.1.0
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("./context_data")
context_data = loader.load()

Now let's verify that the documents loaded by printing the content of each page. Scroll to the end of a line to see what metadata the document loader includes.

In [None]:
for page in context_data:
  print(page)
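
Printing every page can be a lot to scroll through. As a lighter check, here is a minimal sketch that counts pages per source file. It assumes each page's metadata has a `source` key, and it uses stand-in page objects with made-up file names so it runs on its own. In the notebook you would pass `context_data` instead.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(pages):
    """Count how many loaded pages came from each source file."""
    return Counter(page.metadata["source"] for page in pages)


# Stand-in pages (hypothetical file names), shaped like the loader's output.
sample_pages = [
    SimpleNamespace(page_content="...", metadata={"source": "context_data/a.pdf", "page": 0}),
    SimpleNamespace(page_content="...", metadata={"source": "context_data/a.pdf", "page": 1}),
    SimpleNamespace(page_content="...", metadata={"source": "context_data/b.pdf", "page": 0}),
]

for source, count in pages_per_source(sample_pages).items():
    print(f"{source}: {count} page(s)")
```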

##Chunking
Now it's time to split the documents into chunks that will work with the LLM's context window. Store them in a variable named `chunks`.

In [None]:
!pip install langchain-text-splitters==0.0.1
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False
)
chunks = text_splitter.split_documents(context_data)
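
To build some intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines). It just slides a fixed-size window over the text so you can see how consecutive chunks share characters. The function name is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and less than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it, which is the job overlap does: text that straddles a boundary still appears whole in at least one chunk.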

Verify it worked by exploring how the documents were chunked.

In [None]:
print(f"Total Document Chunks: {len(chunks)}\n")
print(chunks[0].metadata)
print(chunks[0].page_content)

print("Length of each chunk:")

for num, chunk in enumerate(chunks):
  print(f"Chunk {num} (from page {chunk.metadata['page'] + 1}): {len(chunk.page_content)} characters")

##Embedding

Now it's time to set up the embedding function. Assign it to a variable named `embedding_function`.

In [None]:
!pip install sentence_transformers==2.6.1
from langchain_community.embeddings import HuggingFaceEmbeddings
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Make sure your model works by finding the embedding for a test sentence. For `all-MiniLM-L6-v2`, the embedding length should be 384.

In [None]:
embedding = embedding_function.embed_query("This is a test sentence.")
print(f"Embedding length: {len(embedding)}")
print(f"{embedding[:3]}, ... , {embedding[-3:]}")
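
Embeddings are useful because texts with similar meanings end up with vectors that point in similar directions. A common way to measure that is cosine similarity. The sketch below uses tiny made-up vectors in place of real embeddings, so the numbers are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
vacation = [0.9, 0.1, 0.0]
holiday = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(vacation, holiday))  # close to 1: similar direction
print(cosine_similarity(vacation, invoice))  # close to 0: unrelated direction
```

To try it on real embeddings, you could pass two results of `embedding_function.embed_query(...)` to the same function.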

##Persisting

Now it's time to create the vector store and persist it to disk. Assign it to a variable named `chromadb`.

In [None]:
!pip install chromadb==0.4.24
from langchain_community.vectorstores import Chroma
chromadb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_function,
    persist_directory='./chromadb'
)
chromadb.persist()

Now test it by executing a similarity search.

In [None]:
retrieved_chunks = chromadb.similarity_search("Two people who take a vacation together.")
print(f"Query retrieved {len(retrieved_chunks)} chunks.")
for chunk in retrieved_chunks:
  print(f"Chunk content: {chunk.page_content}")
  print(f"Chunk metadata: {chunk.metadata}")
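
Conceptually, a similarity search embeds the query, scores every stored chunk against it, and returns the best few (LangChain's `similarity_search` returns four by default). Here is a minimal in-memory sketch of that idea. It assumes cosine similarity and uses made-up vectors and texts. The real vector store uses an index and may use a different distance metric, so treat this as a model of the behavior, not of Chroma's implementation.

```python
import math


def cosine(a, b):
    # Assumes both vectors are non-zero and the same length.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, store, k=4):
    """Return the k (text, score) pairs whose vectors best match the query."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunks paired with toy embeddings.
store = [
    ("Two friends book a beach trip.", [0.9, 0.1, 0.0]),
    ("A couple tours Italy by train.", [0.8, 0.3, 0.1]),
    ("Quarterly tax filing deadlines.", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]

for text, score in top_k(query, store, k=2):
    print(f"{score:.3f}  {text}")
```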

#Lesson 4: LangChain and Language Models

##Using the LangChain Model I/O Module
Start by installing the packages we'll need.

In [None]:
!pip install huggingface_hub==0.20.3 transformers==4.38.2

###Getting the LLM
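Now we'll create the LLM object. `HuggingFaceHub` sends requests to a model hosted on Hugging Face, authenticating with the token you set earlier.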

In [None]:
from langchain_community.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

Let's invoke the LLM with a prompt it should be able to handle.

In [None]:
response = llm.invoke("List Tawfiq al-Hakim's plays by title as a comma-separated list.")
print(response)

###Setting up a Prompt Template
We'll now build a simple prompt template to make our interface with the LLM a bit more generic.

In [None]:
from langchain.prompts import PromptTemplate
prompt = PromptTemplate.from_template("List {playwright}'s plays by title as a comma-separated list.")

Let's test it out!

In [None]:
print(prompt)
response = llm.invoke(prompt.format(playwright="Jez Butterworth"))
print(response)

###Output Parsers
While we're exploring the Model I/O module, let's take a quick look at how the output parser in the Quickstart works.

In [None]:
from langchain.output_parsers import CommaSeparatedListOutputParser
output_parser = CommaSeparatedListOutputParser()
response = output_parser.parse(llm.invoke(prompt.format(playwright="Jez Butterworth")))
print(response)
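
What the parser does is simple enough to sketch in plain Python. The function below is a rough equivalent written for illustration, not the library's source, and the sample string is hand-typed instead of real model output.

```python
def parse_comma_separated(text):
    """Split on commas, trim whitespace, and drop empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


raw = " Jerusalem, Mojo , The River,\n"
print(parse_comma_separated(raw))  # ['Jerusalem', 'Mojo', 'The River']
```

One limitation to keep in mind: a title that itself contains a comma would be split in two, so this kind of parser works best when the prompt asks for short, simple items.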

## LangChain Expression Language (LCEL)
The "Chain" in "LangChain" refers to the ability to chain several actions into one invocation. This replaces your nested calls to `output_parser.parse()`, `llm.invoke()`, and `prompt.format()`. Try to build a chain for what you have here.

In [None]:
chain = prompt | llm | output_parser
response = chain.invoke({"playwright" : "Tawfiq al-Hakim"})
print(response)
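
If the `|` syntax looks like magic, it may help to see that Python lets any class define what `|` means. The toy class below is a sketch of the idea only. `Step` is a hypothetical name, the "model" is a canned reply, and LangChain's real runnables do much more (batching, streaming, async).

```python
class Step:
    """Wraps a one-argument function so steps can be joined with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


fill_prompt = Step(lambda d: f"List {d['playwright']}'s plays by title as a comma-separated list.")
fake_llm = Step(lambda prompt_text: "Play A, Play B, Play C")  # canned reply, not a real model
split_list = Step(lambda text: [t.strip() for t in text.split(",")])

toy_chain = fill_prompt | fake_llm | split_list
print(toy_chain.invoke({"playwright": "Tawfiq al-Hakim"}))  # ['Play A', 'Play B', 'Play C']
```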

#Lesson 5: RAG Using LangChain

##Build a Prompt Template
We'll start with a prompt template that combines the context and original question and provides instructions to the model on how to use both.
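
As a starting point, here is a sketch of what such a template might look like, written with plain string formatting so it runs on its own. The wording of the instructions is an assumption, not the course's official prompt. The only fixed parts are the two placeholders, `{context}` and `{question}`. In the notebook you could pass the same template string to `PromptTemplate.from_template`, as in Lesson 4.

```python
# Hypothetical wording; adjust the instructions to suit your app.
RAG_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_rag_prompt(context, question):
    """Fill the template's two placeholders."""
    return RAG_TEMPLATE.format(context=context, question=question)


filled = build_rag_prompt(
    "Chunk one text.\n\nChunk two text.",
    "List Jez Butterworth's plays.",
)
print(filled)
```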

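To get the context, we'll use a *retriever*. It takes a string as the input query and returns a `list` of `Document` objects. One way to create it, assuming the `chromadb` vector store from Lesson 3 is still in memory, is `retriever = chromadb.as_retriever()`.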

Run it to see what it outputs.

In [None]:
retriever.get_relevant_documents("List Jez Butterworth's plays.")

The final form we're going for is `chain.invoke(user_question)`. We'll need the `user_question` for two things in this prompt: the question itself and finding the context from the vector database. Doing multiple things to one input is the job of a `RunnableParallel`. Let's create one named `context_and_question` that sends the input to the retriever under the key `context_docs` and passes it through unchanged under the key `question`.
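
Here is a plain-Python model of that fan-out so you can see the shape of the result before building the real thing. The retriever is a stand-in that returns a canned list, and `run_parallel` is a hypothetical helper that calls its branches one after another. In LangChain, the equivalent would likely be `RunnableParallel(context_docs=retriever, question=RunnablePassthrough())`, with both classes imported from `langchain_core.runnables`. That line is a guess at the intended cell, not the official solution.

```python
def run_parallel(branches, value):
    """Feed the same input to every branch and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


def fake_retriever(question):
    # Stand-in for the real retriever; returns canned "documents".
    return [f"(document matching: {question})"]


def passthrough(value):
    # Returns its input unchanged, like RunnablePassthrough.
    return value


branches = {"context_docs": fake_retriever, "question": passthrough}
output = run_parallel(branches, "List Jez Butterworth's plays.")
print(output)
```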

Let's see what that looks like.

In [None]:
context_and_question.invoke("List Jez Butterworth's plays.")

To use the context docs in a prompt, we're going to need to convert them to a string. We'll use a `RunnablePassthrough` to assign that string to the `context` key the prompt needs, and we'll name the resulting step `convert_context`. Note that the `question` key from `context_and_question` gets passed through unchanged.

In [None]:
def convert_context_docs(to_convert):
    # Take the page_content attribute of each Document object
    # and join them into one string, separated by two newlines.
    return "\n\n".join(doc.page_content for doc in to_convert["context_docs"])
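
To see what "assign" means here, the sketch below models it in plain Python: copy the incoming dictionary and add one computed key. `assign` and `convert_context_sketch` are hypothetical names, and the documents are stand-ins. In LangChain the matching call would likely be `convert_context = RunnablePassthrough.assign(context=convert_context_docs)`, which is an assumption about the intended cell.

```python
from types import SimpleNamespace


def convert_context_docs(to_convert):
    # Same logic as the notebook's helper, repeated so this sketch runs alone.
    return "\n\n".join(doc.page_content for doc in to_convert["context_docs"])


def assign(**new_keys):
    """Return a step that copies its input dict and adds computed keys."""
    def step(data):
        return {**data, **{name: func(data) for name, func in new_keys.items()}}
    return step


convert_context_sketch = assign(context=convert_context_docs)

docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]
result = convert_context_sketch(
    {"context_docs": docs, "question": "List Jez Butterworth's plays."}
)
print(result["context"])
print(result["question"])
```

The existing keys survive and `context` is added beside them, which is why the prompt can still find `question` afterward.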



Let's see how all this works with our prompt.

In [None]:
complete_prompt_chain = context_and_question | convert_context | prompt
complete_prompt_chain.invoke("List Jez Butterworth's plays.")

Now we'll build the final chain for our app by piping the completed prompt into the LLM, and we'll name it `chain`.

And run it to see what results we get.

In [None]:
result = chain.invoke("List Jez Butterworth's plays.")
print(result)

Now we'll build a chain that passes the source citations, which were in the metadata field of the `list` of `Document` objects returned from the retriever. We'll use `RunnableParallel` to pass the `list` to the end of the chain while also passing it to a chain that builds the prompt and invokes the model.
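
The data flow is easier to follow in a small plain-Python model first. Everything below is a stand-in: the retriever returns two made-up documents, the "model" returns a canned string, and the function names are hypothetical. What it does mirror is the shape of the result, a dictionary with `context_docs` and `answer` keys.

```python
from types import SimpleNamespace


def fake_retriever(question):
    # Canned documents with made-up metadata.
    return [
        SimpleNamespace(page_content="Chunk A text.", metadata={"source": "a.pdf", "page": 0}),
        SimpleNamespace(page_content="Chunk B text.", metadata={"source": "b.pdf", "page": 3}),
    ]


def fake_llm(prompt_text):
    # Not a real model; just reports how long the prompt was.
    return f"(canned answer to a {len(prompt_text)}-character prompt)"


def answer_from(data):
    """Build the prompt from the docs and question, then call the model."""
    context = "\n\n".join(doc.page_content for doc in data["context_docs"])
    prompt_text = f"Context:\n{context}\n\nQuestion: {data['question']}"
    return fake_llm(prompt_text)


def chain_with_sources_sketch(question):
    data = {"context_docs": fake_retriever(question), "question": question}
    # Keep the retrieved docs in the output and add the model's answer beside them.
    return {"context_docs": data["context_docs"], "answer": answer_from(data)}


sketch_result = chain_with_sources_sketch("List Jez Butterworth's plays.")
print("\n".join(repr(doc.metadata) for doc in sketch_result["context_docs"]))
print(sketch_result["answer"])
```

Retrieval happens once, and the same list feeds both the prompt and the final output, so the citations always match the context the model saw.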

Now run it to see what we got.

In [None]:
result = chain_with_sources.invoke("List Jez Butterworth's plays.")
print("The docs used in this answer:")
print("\n".join(repr(doc.metadata) for doc in result["context_docs"]))
print("\nThe answer:")
print(result["answer"])