<a href="https://colab.research.google.com/github/quanticedu/sample-rag-app/blob/main/SUNLight_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Lesson 3: Vector Database


##Getting Started
If you're new to Google Colab, download and review the [Getting Started with Colab](https://uploads.smart.ly/assets/49f329a834468c6f6e9010cbf337a2753b22d35c245e49fc00d4b89e4ceb10fa/original/49f329a834468c6f6e9010cbf337a2753b22d35c245e49fc00d4b89e4ceb10fa.pdf) guide.

Your code runs in the `/content` directory, which is also where your data is stored. Create a subdirectory of `/content` called `context_data` and upload the [context documents for the course](https://uploads.smart.ly/assets/b10a588ae693ff74daaf04058ce6254b05efd193f289f0a1cc01f9c934ee3d13/original/b10a588ae693ff74daaf04058ce6254b05efd193f289f0a1cc01f9c934ee3d13.zip) into it.

You'll also need an API key from Hugging Face. Visit their [signup page](https://huggingface.co/join), enter your email and a password, then complete your profile. Once you have an account and are signed in, go to [Settings | Access Tokens](https://huggingface.co/settings/tokens) and select "New token." Write tokens allow you to post to Hugging Face, which you won't be doing here, so you only need a read-type token.

Once you have your token, enter it below and run the code in the cell by clicking the play button on its left.

In [None]:
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = "<your token>"
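
If you'd rather not leave the token in a saved notebook, one option is to prompt for it at run time. The sketch below assumes Python's standard `getpass` module; `set_hf_token` is a hypothetical helper name for this notebook, not part of Hugging Face or LangChain.

```python
import os
from getpass import getpass


def set_hf_token(read_secret=getpass):
    """Prompt for a Hugging Face token and store it in the environment.

    set_hf_token is a hypothetical helper for this notebook. read_secret
    defaults to getpass, which hides what you type.
    """
    token = read_secret("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No token entered.")
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = token


# In the notebook, call it once and paste the token when prompted:
# set_hf_token()
```

The call is left commented out so the cell does not block waiting for input until you choose to run it.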

LangChain touches all aspects of this app, so let's go ahead and install it now.

In [None]:
!pip install langchain

##Loading Context Documents
The first step in building the vector database is to load the context documents. Load them into a variable named `context_data`.

Now let's verify that the documents loaded by printing the content of each page. Scroll to the end of a line to see what metadata the document loader includes.

In [None]:
for page in context_data:
  print(page)

##Chunking
Now it's time to split the documents into chunks small enough to fit in the LLM's context window. Store them in a variable named `chunks`.

Verify it worked by exploring how the documents were chunked.

In [None]:
print(f"Total Document Chunks: {len(chunks)}\n")
print(chunks[0].metadata)
print(chunks[0].page_content)

print("Length of each chunk:")

for num, chunk in enumerate(chunks):
  print(f"Chunk {num} (from page {chunk.metadata['page'] + 1}): {len(chunk.page_content)} characters")
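
To see what chunk size and overlap mean, here is a plain-Python sketch of fixed-width splitting. It is an illustration only: `split_with_overlap` and `demo_chunks` are hypothetical names, the output is a list of strings rather than the document objects the cell above expects, and library splitters (for example LangChain's `RecursiveCharacterTextSplitter`) prefer to break on separators such as paragraphs instead of cutting at fixed positions.

```python
def split_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in one of the two chunks more often.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return pieces


sample = "word " * 100  # 500 characters of stand-in text
demo_chunks = split_with_overlap(sample, chunk_size=200, overlap=50)
for num, piece in enumerate(demo_chunks):
    print(f"Chunk {num}: {len(piece)} characters")
```

A larger overlap repeats more text across chunks, which costs storage but reduces the chance that a relevant passage is split in two.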

##Embedding

Now it's time to set up the embedding function. Assign it to a variable named `embedding_function`.

Make sure your model works by finding the embedding for a test sentence.

In [None]:
embedding = embedding_function.embed_query("This is a test sentence.")
print(f"Embedding length: {len(embedding)}")
print(f"{embedding[:3]}, ... , {embedding[-3:]}")
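
The length check above matters because an embedding model maps every input, short or long, to a vector of the same size. The toy below shows that property with a hashed bag-of-words. It is a stand-in for illustration, not a trained model, and `toy_embedding` is a hypothetical name; its vectors carry none of the semantic information a real embedding does.

```python
import hashlib
import math


def toy_embedding(text, dimensions=8):
    """Map text to a unit-length vector with a fixed number of dimensions.

    Each word increments one slot chosen by a hash of the word. This only
    demonstrates the fixed-size property; it is not a real embedding model.
    """
    vector = [0.0] * dimensions
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dimensions] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


short = toy_embedding("This is a test sentence.")
longer = toy_embedding("This is a much longer test sentence with many more words in it.")
print(f"Embedding lengths: {len(short)} and {len(longer)}")
```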

##Persisting

Now it's time to create the vector store. Assign it to a variable named `chromadb`.

Now test it by executing a similarity search.

In [None]:
retrieved_chunks = chromadb.similarity_search("Two people who take a vacation together.")
print(f"Query retrieved {len(retrieved_chunks)} chunks.")
for chunk in retrieved_chunks:
  print(f"Chunk content: {chunk.page_content}")
  print(f"Chunk metadata: {chunk.metadata}")
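
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below ranks hand-made vectors by cosine similarity. The vectors, texts, and function names are all made up for illustration, and the metric a real vector store uses depends on how it is configured; cosine similarity is shown here only as a common choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made 3-dimensional vectors standing in for real embeddings.
toy_store = [
    ("chunk about travel", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about a road trip", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of the query text

for text, score in top_k(query, toy_store):
    print(f"{score:.3f}  {text}")
```

A production store avoids comparing the query against every vector by using an approximate nearest-neighbor index, but the ranking idea is the same.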

#Lesson 4: LangChain and Language Models

##Using the LangChain Model I/O Module
Start by installing the packages we'll need.

In [None]:
!pip install <packages>

###Getting the LLM
Now instantiate the LLM and assign it to a variable named `llm`.

Let's invoke the LLM with a prompt it should be able to handle.

In [None]:
response = llm.invoke("List Tawfiq al-Hakim's plays by title as a comma-separated list.")
print(response)

###Setting up a Prompt Template
We'll now build a simple prompt template to make our interface with the LLM a bit more generic. Assign it to a variable named `prompt`, and give it an input variable named `playwright`.

Let's test it out!

In [None]:
print(prompt)
response = llm.invoke(prompt.format(playwright="Jez Butterworth"))
print(response)
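
At its core, a prompt template is a string with named placeholders that are filled in at call time. The sketch below shows the idea with Python's built-in `str.format`. The template wording is an assumption based on the earlier al-Hakim prompt, and `render` is a hypothetical helper, not a LangChain function.

```python
# Assumed wording, modeled on the prompt used earlier in this lesson.
template = "List {playwright}'s plays by title as a comma-separated list."


def render(template_text, **variables):
    """Fill the named placeholders in template_text."""
    return template_text.format(**variables)


for name in ["Tawfiq al-Hakim", "Jez Butterworth"]:
    print(render(template, playwright=name))
```

Separating the fixed wording from the variable part means the same prompt can be reused for any playwright without editing the string by hand.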

###Output Parsers
While we're exploring the Model I/O module, let's take a quick look at how the output parser in the Quickstart works.

In [None]:
from langchain.output_parsers import CommaSeparatedListOutputParser
output_parser = CommaSeparatedListOutputParser()
response = output_parser.parse(llm.invoke(prompt.format(playwright="Jez Butterworth")))
print(response)
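
The parser turns the model's raw text into a Python list. A minimal sketch of that idea follows. `parse_comma_separated` is a hypothetical function and the reply string is hand-written rather than real model output; the library's parser may treat quoting and other edge cases differently.

```python
def parse_comma_separated(text):
    """Split text on commas, trimming whitespace and dropping empty items."""
    return [item.strip() for item in text.strip().split(",") if item.strip()]


raw_reply = " First Play, Second Play ,Third Play\n"  # hand-written stand-in
print(parse_comma_separated(raw_reply))
```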

## LangChain Expression Language (LCEL)
The "Chain" in "LangChain" refers to the ability to chain several actions into one invocation. A chain replaces the nested calls to `output_parser.parse()`, `llm.invoke()`, and `prompt.format()` in the cell above. Try to build a chain for what you have here.
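
To see how piping steps together can replace nested calls, here is a plain-Python sketch of the composition idea. It is not LangChain's implementation: `Step` is a hypothetical class, and the prompt, LLM, and parser are stand-ins that never call a model. In LCEL, the real components are joined with the same `|` operator.

```python
class Step:
    """Wrap a one-argument function so steps can be joined with |."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # self | other: run self first, then feed its result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for the prompt, the LLM, and the output parser.
fake_prompt = Step(lambda inputs: f"List {inputs['playwright']}'s plays as a comma-separated list.")
fake_llm = Step(lambda text: "Play One, Play Two, Play Three")  # canned reply, no model call
fake_parser = Step(lambda text: [item.strip() for item in text.split(",")])

demo_chain = fake_prompt | fake_llm | fake_parser
print(demo_chain.invoke({"playwright": "Jez Butterworth"}))
```

Reading left to right, the chain states the order in which data flows, whereas the nested form has to be read from the innermost call outward.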