<a href="https://colab.research.google.com/github/quantumhome/DataAnalysisCaseStudy/blob/master/RAG_Application_AJ_MG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Objective**
  * **We will build a conversational AI that answers questions based on a given data source (PDF, text, image, JSON)**

* **`Open Source Model`: Deepseek, Mixtral, Zephyr, Dolly, Llama, Phi (HuggingFace, Unsloth, replicate)**

* **`Proprietary Models`: OpenAI, Google Gemini & PaLM, Microsoft**


### **RAG Application**
* **Indexing**
  * **Load the data: Document Loader**
  * **Split the data: Text Splitter**
  * **Embed the data: Embedding Model**
  * **Save the data into a DB: VectorDB (`Chroma` and Pinecone)**
<hr>
* **Retrieval**
  * **Setup LLM: ChatGPT (4o-mini, GPT-4)**
  * **Prompt Engineering (to keep the model's answers grounded in the retrieved context)**
  * **Connect & Chain these all together: Chain**
  * **Utilize the LLM: Test**
<hr>
  * **Interface for displaying the results: Gradio**

# **Step 1 - Requirement Phase**

* **Data Source: `plain text file`**
* **Framework: `Langchain`**

In [None]:
!pip install langchain langchain_community langchain_chroma

Collecting langchain_community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting langchain_chroma
  Downloading langchain_chroma-1.0.0-py3-none-any.whl.metadata (1.9 kB)
INFO: pip is looking at multiple versions of langchain-community to determine which version is compatible with other requirements. This could take a while.
Collecting langchain_community
  Downloading langchain_community-0.4-py3-none-any.whl.metadata (3.0 kB)
  Downloading langchain_community-0.3.31-py3-none-any.whl.metadata (3.0 kB)
Collecting requests<3,>=2 (from langchain)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting chromadb<2.0.0,>=1.0.20 (from langchain_chroma)
  Downloading chromadb-1.3.5-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of la

### **Importing the dependencies**

In [None]:
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers.string import StrOutputParser

# **Step 2 - Document Processing**

### **1. Taking a plain text file**

**Link: https://drive.google.com/file/d/1z5FTeCvkrfHnMrSfbtlvHH1CpYKJ6udR/view?usp=sharing**

In [None]:
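# Read the whole speech into one string; utf-8 keeps the curly quotes and dashes intact
with open('/content/2024_state_of_the_union (1).txt', encoding='utf-8') as f:
  files=f.read()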

In [None]:
files

'March 07, 2024\nRemarks of President Joe Biden — State of the Union Address As Prepared for Delivery\nHome\nBriefing Room\nSpeeches and Remarks\nThe United States Capitol\n\n###\n\nGood evening. \n\nMr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. \n\nIn January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. \n\nHe said, “I address you at a moment unprecedented in the history of the Union.” \n\nHitler was on the march. War was raging in Europe. \n\nPresident Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   \n\nFreedom and democracy were under assault in the world. \n\nTonight I come to the same chamber to address the nation. \n\nNow it is we who face an unprecedented moment in the history of the Union. \n\nAnd yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. \n\nNot since President L

### **2. Split the data**


In [None]:
# Overlap example: "this is the first chunk" / "chunk, this is the second chunk" (the word "chunk" is shared)

In [None]:
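textsplitter=CharacterTextSplitter(
    chunk_size=1000,      # maximum length of each chunk
    chunk_overlap=200,    # amount of text shared between neighbouring chunks
    length_function=len   # measure length in characters
)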
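To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplification and not how `CharacterTextSplitter` is implemented: the real splitter first splits on a separator (`"\n\n"` by default) and then merges the pieces up to `chunk_size`. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.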

### **3. Split the text into document chunks**

In [None]:
texts=textsplitter.create_documents([files])

### **Output**

In [None]:
len(texts)

48

In [None]:
texts[0]

Document(metadata={}, page_content='March 07, 2024\nRemarks of President Joe Biden — State of the Union Address As Prepared for Delivery\nHome\nBriefing Room\nSpeeches and Remarks\nThe United States Capitol\n\n###\n\nGood evening. \n\nMr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. \n\nIn January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. \n\nHe said, “I address you at a moment unprecedented in the history of the Union.” \n\nHitler was on the march. War was raging in Europe. \n\nPresident Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   \n\nFreedom and democracy were under assault in the world. \n\nTonight I come to the same chamber to address the nation. \n\nNow it is we who face an unprecedented moment in the history of the Union. \n\nAnd yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary momen

In [None]:
texts[47]

Document(metadata={}, page_content='Tonight you’ve heard mine. \n\nI see a future where we defend democracy not diminish it. \n\nI see a future where we restore the right to choose and protect other freedoms not take them away. \n\nI see a future where the middle class finally has a fair shot and the wealthy finally have to pay their fair share in taxes. \n\nI see a future where we save the planet from the climate crisis and our country from gun violence. \n\nAbove all, I see a future for all Americans! \n\nI see a country for all Americans! \n\nAnd I will always be a president for all Americans! \n\nBecause I believe in America! \n\nI believe in you the American people. \n\nYou’re the reason I’ve never been more optimistic about our future! \n\nSo let’s build that future together! \n\nLet’s remember who we are! \n\nWe are the United States of America. \n\nThere is nothing beyond our capacity when we act together! \n\nMay God bless you all. \n\nMay God protect our troops.\n\n###')

# **Step 3 - Embed the data using Embedding Model**

In [None]:
# Embedding (encoder) model: all-MiniLM-L6-V2

### **Create the embeddings**

In [None]:
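# Import from langchain_community (installed above); the langchain.embeddings path is deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings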

In [None]:
embedding_model=HuggingFaceEmbeddings(model_name='all-MiniLM-L6-V2')


### **Database Formation**

In [None]:
vector_db=Chroma(
    collection_name='Ajinkya',
    embedding_function=embedding_model
)

In [None]:
vector_db

<langchain_chroma.vectorstores.Chroma at 0x7fcab273c710>

### **Load the documents in the DB**

In [None]:
storage_id=vector_db.add_documents(texts)

In [None]:
len(storage_id) == len(texts)

True

In [None]:
storage_id[0]

'9b0a8e27-eea5-41ff-bffa-c14f01a61528'

1. Text

   └── Raw input text data (e.g., document, web page, transcript)

2. Split into Chunks

   └── Divide text into manageable chunks (e.g., by sentences or paragraphs)

3. Embedding Model

   └── Use a model (like OpenAI, Sentence-BERT) to convert text chunks into embeddings

4. Vectors

   └── Embeddings are high-dimensional numeric representations of the text

5. Vector Database

   └── Store these vectors in a database optimized for similarity search (e.g., FAISS, Pinecone, Weaviate)

6. Primary IDs

   └── Assign a unique identifier to each vector entry

7. Ensure Uniqueness

   └── Validate that each ID is distinct to avoid collisions or duplication
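The seven steps above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the chunk strings are made-up placeholders, and a dictionary plays the role of the vector database.

```python
import uuid
from collections import Counter


def toy_embed(text: str, vocab: list[str]) -> list[int]:
    """Hypothetical stand-in for an embedding model: counts of each vocab word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


vocab = ["union", "jobs", "democracy", "taxes"]

# Steps 1-2: placeholder chunks (assumed already split)
chunk_texts = [
    "union jobs are coming back",
    "we defend democracy",
    "fair share in taxes",
]

# Steps 3-6: embed each chunk and store it under a freshly generated primary ID
store = {}
ids = []
for chunk in chunk_texts:
    doc_id = str(uuid.uuid4())
    store[doc_id] = {"vector": toy_embed(chunk, vocab), "text": chunk}
    ids.append(doc_id)

# Step 7: every chunk must have received its own distinct ID
print(len(set(ids)) == len(chunk_texts))
```

The final check mirrors the notebook's `len(storage_id) == len(texts)` comparison: one ID per stored chunk, with no collisions.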


### **Similarity Search Using the Vector DB**

In [None]:
res=vector_db.similarity_search(
    query="What did the president say about Ketanji Brown Jackson",
    k=2
)

In [None]:
res[0]

Document(id='121fa7c1-802d-4108-bdd6-51c68f177545', metadata={}, page_content='To take on crimes of domestic violence, I am ramping up federal enforcement of the Violence Against Women Act, that I proudly wrote, so we can finally end the scourge of violence against women in America!  \n\nAnd there’s another kind of violence I want to stop. \n\nWith us tonight is Jasmine, whose 9-year-old sister Jackie was murdered with 21 classmates and teachers at her elementary school in Uvalde, Texas. \n\nSoon after it happened, Jill and I went to Uvalde and spent hours with the families. \n\nWe heard their message, and so should everyone in this chamber do something. \n\nI did do something by establishing the first-ever Office of Gun Violence Prevention in the White House that Vice President Harris is leading. \n\nMeanwhile, my predecessor told the NRA he’s proud he did nothing on guns when he was President. \n\nAfter another school shooting in Iowa he said we should just “get over it.” \n\nI say w

In [None]:
res[1]

Document(id='e6717051-7f05-4e29-bd12-73f9ed74b4e8', metadata={}, page_content='Instead of a town being left behind it’s a community moving forward again! \n\nBecause instead of watching auto jobs of the future go overseas 4,000 union workers with higher wages will be building that future, in Belvidere, here in America! \n\nHere tonight is UAW President, Shawn Fain, a great friend, and a great labor leader. \n\nAnd Dawn Simms, a third generation UAW worker  in Belvidere. \n\nShawn, I was proud to be the first President in American history to walk a picket line. \n\nAnd today Dawn has a job in her hometown providing stability for her family and pride and dignity. \n\nShowing once again, Wall Street didn’t build this country! \n\nThe middle class built this country! And unions built the middle class! \n\nWhen Americans get knocked down, we get back up! \n\nWe keep going! \n\nThat’s America! That’s you, the American people! \n\nIt’s because of you America is coming back!  \n\nIt’s because 
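The visible portions of the two results above do not appear to mention the person named in the query. That is expected behaviour for nearest-neighbour search: it returns the `k` closest chunks whether or not any of them is relevant. The sketch below illustrates this with cosine similarity over made-up 4-dimensional vectors; the vector values and names are hypothetical, and real embeddings from this model have several hundred dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int):
    """Return the k (score, name) pairs with the highest cosine similarity."""
    scored = sorted(((cosine(query_vec, vec), name) for name, vec in vectors.items()),
                    reverse=True)
    return scored[:k]


# Hypothetical chunk vectors; none of them has weight on the 4th "topic" axis
vectors = {
    "gun_violence": [1.0, 0.0, 0.2, 0.0],
    "union_jobs": [0.0, 1.0, 0.1, 0.0],
    "ukraine": [0.1, 0.1, 1.0, 0.0],
}
off_topic_query = [0.1, 0.1, 0.0, 1.0]  # mostly about a topic no chunk covers

hits = top_k(off_topic_query, vectors, k=2)
for score, name in hits:
    print(f"{name}: {score:.3f}")
```

Two results still come back, but with low scores. Inspecting scores (LangChain's Chroma wrapper offers `similarity_search_with_score` for this, which reports a distance where lower means closer) is one way to spot queries the corpus cannot answer.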

# **Step 4 - Setting Up Retrieval**

### **a. Create a retriever**

In [None]:
retriver=vector_db.as_retriever()

### **b. LLM Instance**

In [None]:
# google/flan-t5-large

In [None]:
from transformers import AutoTokenizer,AutoModelForSeq2SeqLM,pipeline

In [None]:
tokenizer=AutoTokenizer.from_pretrained('google/flan-t5-large')


In [None]:
# Add a padding token only if the tokenizer does not define one already
if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token':'[PAD]'})

In [None]:
model=AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-large')


In [None]:
model.resize_token_embeddings(len(tokenizer)) # resize the embedding matrix to match the tokenizer's vocabulary size

Embedding(32100, 1024)

In [None]:
model.config.pad_token_id=tokenizer.pad_token_id

In [None]:
generator=pipeline(
    'text2text-generation',
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=150
)

Device set to use cpu


In [None]:
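# Import from langchain_community (installed above); the langchain.llms path is deprecated
from langchain_community.llms import HuggingFacePipeline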

In [None]:
llm=HuggingFacePipeline(pipeline=generator)



#### **Other Examples**

* **HuggingFaceH4/zephyr-7b-beta**
* **Qwen/Qwen3-235B-A22B**

### **c. Design a Prompt**

In [None]:
template = """Use the context provided to answer the question. If you don't know the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

In [None]:
custom_temp=PromptTemplate(
    template=template
)

In [None]:
custom_temp

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the context provided to answer the question. If you don't know the answer, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:")

**We now have a prompt template, a model, and a vector database.**

* **Next, we connect them into a single chain.**

In [None]:
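rag_chain=(
    # Join the retrieved Documents into one context string; pass the question through unchanged
    {'context':retriver|(lambda docs:'\n\n'.join(d.page_content for d in docs)),'question':RunnablePassthrough()}
    |custom_temp
    |llm
    |StrOutputParser()
)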

In [None]:
# Input question
#      ↓
# Retriever → Format → Context
#      ↓
# {context, question}
#      ↓
# Prompt Template (custom_temp)
#      ↓
# LLM (generate answer)
#      ↓
# StrOutputParser (final output string)
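The data flow in the diagram can be mimicked with ordinary functions, which may help when debugging each stage separately. This is only a sketch: `retrieve` and `fake_llm` are hypothetical stubs with placeholder text, not the real retriever or model, and the template is shortened.

```python
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def retrieve(question: str) -> list[str]:
    """Hypothetical retriever stub: returns canned placeholder chunks."""
    return ["Placeholder chunk one.", "Placeholder chunk two."]


def format_docs(docs: list[str]) -> str:
    """Join the retrieved chunks into a single context string."""
    return "\n\n".join(docs)


def build_prompt(question: str) -> str:
    """Fill the template with the formatted context and the original question."""
    inputs = {"context": format_docs(retrieve(question)), "question": question}
    return TEMPLATE.format(**inputs)


def fake_llm(prompt: str) -> str:
    """Hypothetical model stub: reports the prompt size instead of generating text."""
    return f"  (stub answer for a {len(prompt)}-character prompt)  "


def rag(question: str) -> str:
    """Retrieve, format, prompt, generate, then clean up the output string."""
    return fake_llm(build_prompt(question)).strip()


print(build_prompt("What about Ukraine?"))
print(rag("What about Ukraine?"))
```

Printing the output of `build_prompt` is a quick way to see exactly what text would reach the model, including how long it is.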


# **Step 5 - Test**

In [None]:
query='What did the President say about Ukraine?'
res=rag_chain.invoke(query)
print(res)

Token indices sequence length is longer than the specified maximum sequence length for this model (1302 > 512). Running this sequence through the model will result in indexing errors
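The warning above reports a 1302-token prompt against a 512-token limit. A likely cause is that the retriever returns several chunks of up to 1000 characters each, which together exceed what the model accepts. Common remedies are to retrieve fewer chunks (for example `vector_db.as_retriever(search_kwargs={'k': 1})`), to use a smaller `chunk_size`, or to pick a model with a longer context window. The sketch below shows a fourth option, trimming the context to a token budget. It assumes a crude whitespace word count as the token counter and uses made-up chunks; `fit_to_budget` is a hypothetical helper.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep chunks in ranked order until adding the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept


# Made-up ranked chunks of 200 "tokens" each (whitespace-separated words)
ranked_chunks = ["alpha " * 200, "beta " * 200, "gamma " * 200]

kept = fit_to_budget(ranked_chunks, max_tokens=450)
print(len(kept), "of", len(ranked_chunks), "chunks fit the budget")
```

In the notebook itself, a more faithful counter would be `lambda s: len(tokenizer.encode(s))`, and the budget should leave room for the instruction text and the question as well as the context.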
