<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_06_5_embed_db.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 6: Retrieval-Augmented Generation (RAG)**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 6 Material

* Part 6.1: Introduction to Retrieval-Augmented Generation (RAG) [[Video]](https://www.youtube.com/watch?v=qA52K0K181Q) [[Notebook]](t81_559_class_06_1_rag.ipynb)
* Part 6.2: Introduction to ChromaDB [[Video]](https://www.youtube.com/watch?v=R53lo4sevLQ) [[Notebook]](t81_559_class_06_2_chromadb.ipynb)
* Part 6.3: Understanding Embeddings [[Video]](https://www.youtube.com/watch?v=Tq82Gl2ZZNM) [[Notebook]](t81_559_class_06_3_embeddings.ipynb)
* Part 6.4: Question Answering Over Documents [[Video]](https://www.youtube.com/watch?v=hCwL_lW-gP0) [[Notebook]](t81_559_class_06_4_qa.ipynb)
* **Part 6.5: Embedding Databases** [[Video]](https://www.youtube.com/watch?v=BG2gT4uYxhM) [[Notebook]](t81_559_class_06_5_embed_db.ipynb)


# Google CoLab Instructions

The following code ensures that Google CoLab is running and maps Google Drive if needed.

In [1]:
import os

try:
    from google.colab import drive, userdata
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai chromadb langchain_community sentence-transformers langchainhub

Note: using Google CoLab

# 6.5: Embedding Databases


In this section, we will run ChromaDB as a server so that a RAG-enabled LLM can access it. We begin by building an embedding database and saving it to a local directory. Next, we show how that saved database can be reloaded from disk, which lets the system recover after a restart without recomputing the embeddings. Finally, we access the database through an HTTP client, so that other processes can query the server remotely and retrieve embeddings as needed.

To illustrate the capabilities of the server, we will utilize a sample dataset I created, containing synthetic data of employee biographies from five fictional companies. This dataset is crafted to demonstrate how RAG can effectively retrieve and utilize specific information that a foundation model would not inherently possess. By doing so, it provides a clear example of how RAG can be applied in a real-world corporate setting.

Here is an example of such a generated biography:

> Elena Martinez is a seasoned Robotics Engineer at FutureTech, a leading innovator in artificial intelligence and robotics based in Silicon Valley. With a Master's degree in Mechanical Engineering from MIT and over a decade of experience, Elena has been pivotal in the development of autonomous robotic systems designed to enhance urban mobility and accessibility. Her groundbreaking work includes the creation of the first AI-powered robotic assistant that can seamlessly interact with urban environments to aid the elderly and disabled. A passionate advocate for women in STEM, Elena also leads FutureTech's outreach program, aiming to inspire the next generation of female engineers through workshops and mentorships. Her contributions have not only propelled FutureTech to new heights but have also set new standards in robotics applications for social good.


This biography showcases the type of detailed, company-specific information that RAG can retrieve and incorporate into its responses, enhancing the relevance and accuracy of the generated content. Through the integration of such tailored data, RAG not only enriches the output of LLMs but also ensures that the responses are aligned with the unique context and needs of the organization.

This sample data is stored at the following URLs; each file holds the people from one of the companies.

* https://data.heatonresearch.com/data/t81-559/bios/DD.txt
* https://data.heatonresearch.com/data/t81-559/bios/FT.txt
* https://data.heatonresearch.com/data/t81-559/bios/GS.txt
* https://data.heatonresearch.com/data/t81-559/bios/NGS.txt
* https://data.heatonresearch.com/data/t81-559/bios/TI.txt

The following code downloads these documents, splits them into chunks, and prepares the embedding function that will be used to store them in a ChromaDB database. We have seen this code previously.

## Load the ChromaDB Database



In [2]:
# Only the imports this notebook actually uses are kept here
import requests
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI

urls = [
    "https://data.heatonresearch.com/data/t81-559/bios/DD.txt",
    "https://data.heatonresearch.com/data/t81-559/bios/FT.txt",
    "https://data.heatonresearch.com/data/t81-559/bios/GS.txt",
    "https://data.heatonresearch.com/data/t81-559/bios/NGS.txt",
    "https://data.heatonresearch.com/data/t81-559/bios/TI.txt"
]

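def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters; consecutive windows share overlap characters."""
    chunks = []
    # Each window starts (chunk_size - overlap) characters after the previous one
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks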

chunk_size = 900
overlap = 300

documents = []

for url in urls:
    print(f"Reading: {url}")
    response = requests.get(url)
    response.raise_for_status()  # Ensure we notice bad responses
    content = response.text
    chunks = chunk_text(content, chunk_size, overlap)
    for chunk in chunks:
        document = Document(page_content=chunk)
        documents.append(document)

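from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_community.vectorstores import Chroma

# The chunks built above are at most 900 characters long, so this pass
# (limit of 1000) is expected to leave most of them unchanged.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Local sentence-transformers model used to embed each chunk
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# The database itself is created in the next cell, with a persist directory.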

Reading: https://data.heatonresearch.com/data/t81-559/bios/DD.txt
Reading: https://data.heatonresearch.com/data/t81-559/bios/FT.txt
Reading: https://data.heatonresearch.com/data/t81-559/bios/GS.txt
Reading: https://data.heatonresearch.com/data/t81-559/bios/NGS.txt
Reading: https://data.heatonresearch.com/data/t81-559/bios/TI.txt
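To make the effect of `chunk_size` and `overlap` concrete, the sketch below applies the same sliding-window logic as `chunk_text` to a short synthetic string. The sample text and the small sizes are illustrative assumptions, not the course data, and the guard against a non-positive step is an addition that is not part of the cell above.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size windows that share `overlap` characters."""
    step = chunk_size - overlap
    if step <= 0:
        # Assumption: reject settings that would never advance the window.
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters of stand-in text
chunks = chunk_text(sample, chunk_size=9, overlap=3)

# Each window starts 6 characters after the previous one.
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

With the notebook's settings of 900 and 300, a new window starts every 600 characters, so each chunk repeats the last 300 characters of the chunk before it.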



Next, we create a ChromaDB database from these documents, this time specifying a directory in which to persist the data.

In [3]:
db = Chroma.from_documents(docs, embedding_function, persist_directory="/content/chroma_db")

We can see that this is a regular Chroma object.

In [4]:
type(db)

The following code shows the structure of the embedding database, as stored by ChromaDB.

In [5]:
!ls -al /content/chroma_db

total 5904
drwxr-xr-x 3 root root    4096 Jun 26 16:31 .
drwxr-xr-x 1 root root    4096 Jun 26 16:30 ..
-rw-r--r-- 1 root root 6029312 Jun 26 16:31 chroma.sqlite3
drwxr-xr-x 2 root root    4096 Jun 26 16:31 e7f2b890-35c6-4a12-b5ff-a12a08fe90c8


## Reload Database from Disk

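We can now reload this data into a new Chroma object directly from the persisted directory. This technique is much more efficient than rebuilding the database with the `from_documents` method used earlier, because the stored embeddings are read from disk rather than recomputed.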

In [6]:
db2 = Chroma(persist_directory="/content/chroma_db", embedding_function=embedding_function)
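One way to take advantage of the persisted directory is to build the database only when no saved copy exists and to reload it otherwise. The sketch below illustrates that pattern with stand-in callables so it runs without ChromaDB installed. The helper name `load_or_build`, the stand-in functions, and the use of `chroma.sqlite3` as the marker for a saved database are assumptions made for illustration.

```python
import os
import tempfile

def load_or_build(persist_directory, build, load):
    """Call `load` when a saved database exists, otherwise call `build`."""
    # Assumption: the sqlite file seen in the directory listing marks a saved database.
    marker = os.path.join(persist_directory, "chroma.sqlite3")
    if os.path.exists(marker):
        return load(persist_directory)
    return build(persist_directory)

# Stand-in callables (hypothetical) that mimic building and reloading.
def fake_build(path):
    os.makedirs(path, exist_ok=True)
    open(os.path.join(path, "chroma.sqlite3"), "w").close()
    return "built"

def fake_load(path):
    return "loaded"

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "chroma_db")
    first = load_or_build(target, fake_build, fake_load)   # nothing saved yet
    second = load_or_build(target, fake_build, fake_load)  # saved copy found

print(first, second)
```

In this notebook, `build` would wrap the `Chroma.from_documents(...)` call with `persist_directory` set, and `load` would wrap the `Chroma(persist_directory=..., embedding_function=...)` constructor.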


The following code, which we discussed previously, demonstrates that the reloaded database is operational.

In [7]:
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

MODEL = 'gpt-4o-mini'

llm = ChatOpenAI(
        model=MODEL,
        temperature=0.2,
        n=1
    )

rag_prompt = hub.pull("rlm/rag-prompt")

def format_documents(documents):
    return "\n\n".join(doc.page_content for doc in documents)

retriever = db2.as_retriever()

qa_chain = (
    {"context": retriever | format_documents, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
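The `context` entry of the chain is simply the retrieved chunks joined by blank lines. The sketch below runs `format_documents` on stand-in objects to show the string that reaches the prompt. `FakeDocument` and its sample text are hypothetical; in the chain, the retriever supplies LangChain `Document` objects instead.

```python
from dataclasses import dataclass

@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing only the page_content attribute used here."""
    page_content: str

def format_documents(documents):
    # Join the text of every retrieved chunk, separated by a blank line
    return "\n\n".join(doc.page_content for doc in documents)

retrieved = [
    FakeDocument("First retrieved chunk."),
    FakeDocument("Second retrieved chunk."),
]
context = format_documents(retrieved)
print(context)
```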

We perform a query that accesses the RAG-enabled data.

In [8]:
qa_chain.invoke("What company does Elena Martinez work for?")

'Elena Martinez works for Digital Dynamics.'

## Run Database as Server

You can use the following command to start a ChromaDB server that uses the previously created directory.

```
chroma run --host 127.0.0.1 --path /content/chroma_db &
```
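Because the trailing `&` starts the server in the background, a client may try to connect before the server is ready. The sketch below is one possible way to wait for it: it polls a TCP port until a connection succeeds. The helper name `wait_for_port` and its timeout values are assumptions, and the demonstration uses a throwaway local listener rather than a real Chroma server.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; pause briefly and try again
            time.sleep(interval)
    return False

# Demonstration against a temporary listener on a free local port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

ready = wait_for_port("127.0.0.1", demo_port, timeout=2.0)
print(ready)
listener.close()
```

For the server started above, the call would be `wait_for_port("127.0.0.1", 8000)`, where 8000 is the port that the client cell in this notebook connects to.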

We can now construct a client to communicate with this database server.

In [9]:
import chromadb
from langchain_community.vectorstores import Chroma

# Connect to the server started above; host and port must match it
client = chromadb.HttpClient(host='127.0.0.1', port=8000)
db3 = Chroma(client=client, embedding_function=embedding_function)

The following code, which we discussed previously, demonstrates that the database accessed through the HTTP client is operational.

In [10]:
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

MODEL = 'gpt-4o-mini'

llm = ChatOpenAI(
        model=MODEL,
        temperature=0.2,
        n=1
    )

rag_prompt = hub.pull("rlm/rag-prompt")

def format_documents(documents):
    return "\n\n".join(doc.page_content for doc in documents)

retriever = db3.as_retriever()

qa_chain = (
    {"context": retriever | format_documents, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

We perform a query that accesses the RAG-enabled data.

In [11]:
qa_chain.invoke("What company does Elena Martinez work for?")

'Samantha Clarke works for Digital Dynamics.'