## ***`Step 1 : Requirement Phase`***



* Data source - plain text file.

* Framework - LangChain.

In [50]:
!pip install langchain langchain_community langchain_chroma



## ***`Importing the dependencies`***

In [51]:
from langchain_chroma import Chroma  # vector store that saves text as embeddings and retrieves it by similarity
from langchain_core.prompts import PromptTemplate  # prompt string with named variables that are filled in at run time
from langchain_text_splitters import CharacterTextSplitter  # splits long text into character-counted chunks that fit an LLM's context window
from langchain_core.runnables import RunnablePassthrough  # passes its input to the next step unchanged

In [52]:
!pip install langchain_core



In [53]:
from langchain_core.output_parsers import StrOutputParser  # converts the raw LLM output into a plain string

In [54]:
path = '/content/drive/MyDrive/Deep Learning/GenAI/2024_state_of_the_union.txt' # file path

## ***`Handling the file`***

In [55]:
with open(path, encoding='utf-8') as f:  # explicit encoding: the speech contains curly quotes and dashes
  files = f.read()

In [56]:
print(files)  # print the contents of the file

March 07, 2024
Remarks of President Joe Biden — State of the Union Address As Prepared for Delivery
Home
Briefing Room
Speeches and Remarks
The United States Capitol

###

Good evening. 

Mr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. 

In January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. 

He said, “I address you at a moment unprecedented in the history of the Union.” 

Hitler was on the march. War was raging in Europe. 

President Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   

Freedom and democracy were under assault in the world. 

Tonight I come to the same chamber to address the nation. 

Now it is we who face an unprecedented moment in the history of the Union. 

And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. 

Not since President Lincoln and the Civil War have 

## ***`Splitting the data`***


In [57]:
text_splitter = CharacterTextSplitter(
    chunk_size=1000,      # maximum number of characters per chunk
    chunk_overlap=200,    # characters carried over from the end of one chunk into the start of the next
    length_function=len,  # how chunk size is measured; len counts characters
)
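
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of overlapping fixed-size windows. It is a simplification for illustration only: `CharacterTextSplitter` first splits on a separator (by default a blank line) and then merges the pieces up to `chunk_size`, so its real chunks will not match this output. The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap (simplified sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical stand-in for the speech text
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk starts with the last four characters of the previous one, which is the role `chunk_overlap = 200` plays at a larger scale in the cell above.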

## `Split the document into chunks`


In [58]:
texts = text_splitter.create_documents([files])

In [59]:
print(f'The document was split into {len(texts)} chunks')

The document was split into 48 chunks


In [60]:
texts[0]

Document(metadata={}, page_content='March 07, 2024\nRemarks of President Joe Biden — State of the Union Address As Prepared for Delivery\nHome\nBriefing Room\nSpeeches and Remarks\nThe United States Capitol\n\n###\n\nGood evening. \n\nMr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. \n\nIn January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. \n\nHe said, “I address you at a moment unprecedented in the history of the Union.” \n\nHitler was on the march. War was raging in Europe. \n\nPresident Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   \n\nFreedom and democracy were under assault in the world. \n\nTonight I come to the same chamber to address the nation. \n\nNow it is we who face an unprecedented moment in the history of the Union. \n\nAnd yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary momen

## ***`Embed the data using Embedding model`***

In [61]:
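from langchain_community.embeddings import HuggingFaceBgeEmbeddings  # the langchain.embeddings import path is deprecated

In [62]:
# Note: this wrapper is designed for BGE models and adds a BGE-style instruction to queries;
# HuggingFaceEmbeddings is the general-purpose wrapper for sentence-transformers models such as this one.
embedding_model = HuggingFaceBgeEmbeddings(model_name='all-MiniLM-L6-v2')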

## `Database Formation`

In [63]:
# Tell Chroma which embedding model to use for converting text into vectors, so it can store and search by meaning.

vector_database = Chroma(
                          collection_name = 'Rohit',
                          embedding_function = embedding_model)

## `Load the documents into the database`

In [64]:
storage_id = vector_database.add_documents(texts)

In [65]:
len(storage_id)

48

In [66]:
if (len(storage_id) == len(texts)):
  print('You are on the right track')
else:
  print('Need some improvement')

You are on the right track


## `Similarity search using VecDB`

In [67]:
results =   vector_database.similarity_search(
    query = 'What did the president say about Ketanji Brown Jackson',
    k = 2
)
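
As a rough illustration of what a similarity search does, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top `k`. The vectors, the chunk names, and the helper functions are hypothetical; a real embedding model produces vectors with hundreds of dimensions, and the distance metric a given Chroma collection uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy three-dimensional "embeddings" (made up for illustration).
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.05, 0.0], toy_vectors, k=2)
print(nearest)
```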

In [68]:
results[0]

Document(id='fb58defd-d27d-45f7-bced-5dba76a9d245', metadata={}, page_content='Honesty. Decency. Dignity. Equality. \n\nTo respect everyone. To give everyone a fair shot. To give hate no safe harbor.  \n\nNow some other people my age see a different story.  \n\nAn American story of resentment, revenge, and retribution. \n\nThat’s not me. \n\nI was born amid World War II when America stood for freedom in the world. \n\nI grew up in Scranton, Pennsylvania and Claymont, Delaware among working people who built this country. \n\nI watched in horror as two of my heroes, Dr. King and Bobby Kennedy, were assassinated and their legacies inspired me to pursue a career in service. \n\nA public defender, county councilman, elected United States Senator at 29, then Vice President, to our first Black President, now President, with our first woman Vice President. \n\nIn my career I’ve been told I’m too young and I’m too old. \n\nWhether young or old, I’ve always known what endures. \n\nOur North Star

In [69]:
results[1]

Document(id='3622a8cf-2dce-407f-a870-07ad23acc38d', metadata={}, page_content='Honesty. Decency. Dignity. Equality. \n\nTo respect everyone. To give everyone a fair shot. To give hate no safe harbor.  \n\nNow some other people my age see a different story.  \n\nAn American story of resentment, revenge, and retribution. \n\nThat’s not me. \n\nI was born amid World War II when America stood for freedom in the world. \n\nI grew up in Scranton, Pennsylvania and Claymont, Delaware among working people who built this country. \n\nI watched in horror as two of my heroes, Dr. King and Bobby Kennedy, were assassinated and their legacies inspired me to pursue a career in service. \n\nA public defender, county councilman, elected United States Senator at 29, then Vice President, to our first Black President, now President, with our first woman Vice President. \n\nIn my career I’ve been told I’m too young and I’m too old. \n\nWhether young or old, I’ve always known what endures. \n\nOur North Star
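
The two results above contain the same text under different IDs, which suggests the same chunks were stored more than once (one plausible cause is re-running the `add_documents` cell against the same collection). A minimal sketch of one way to avoid this is to derive each ID from the chunk's content, so that adding the same chunk again reuses the same ID. The helper `content_id` and the sample strings are hypothetical, and the commented usage assumes the vector store overwrites an entry whose ID already exists.

```python
import hashlib


def content_id(text: str) -> str:
    """Derive a stable ID from the chunk text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical chunk texts; the first and last are identical on purpose.
chunk_texts = ["Good evening.", "Our North Star.", "Good evening."]
ids = [content_id(text) for text in chunk_texts]

# Identical text maps to the same ID, so duplicates collapse into one entry.
unique = {}
for chunk_id, text in zip(ids, chunk_texts):
    unique.setdefault(chunk_id, text)
print(len(chunk_texts), "chunks ->", len(unique), "unique IDs")

# Possible usage with this notebook's objects (an assumption, not executed here):
# ids = [content_id(doc.page_content) for doc in texts]
# vector_database.add_documents(documents=texts, ids=ids)
```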

In [70]:
for x in results:
  print(f"\n* ID: {x.id}\nContent: {x.page_content}")


* ID: fb58defd-d27d-45f7-bced-5dba76a9d245
Content: Honesty. Decency. Dignity. Equality. 

To respect everyone. To give everyone a fair shot. To give hate no safe harbor.  

Now some other people my age see a different story.  

An American story of resentment, revenge, and retribution. 

That’s not me. 

I was born amid World War II when America stood for freedom in the world. 

I grew up in Scranton, Pennsylvania and Claymont, Delaware among working people who built this country. 

I watched in horror as two of my heroes, Dr. King and Bobby Kennedy, were assassinated and their legacies inspired me to pursue a career in service. 

A public defender, county councilman, elected United States Senator at 29, then Vice President, to our first Black President, now President, with our first woman Vice President. 

In my career I’ve been told I’m too young and I’m too old. 

Whether young or old, I’ve always known what endures. 

Our North Star.

* ID: 3622a8cf-2dce-407f-a870-07ad23acc38d
Conte

## `Setting up the retriever`


In [71]:
retriever = vector_database.as_retriever()

In [72]:
!pip install transformers



In [73]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM


In [74]:
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

In [75]:
# Make sure the tokenizer has a padding token so batched inputs can be padded to the same length.

if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token': '[PAD]'})
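
To show why a padding token is needed, here is a small pure-Python sketch of padding a batch of token-ID sequences to a common length and building the matching attention mask. The token IDs, the `pad_id` value, and the `pad_batch` helper are made up for illustration; in practice the tokenizer does this itself when called with padding enabled.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Two hypothetical tokenized sentences of different lengths.
batch = [[37, 1661, 1], [37, 1661, 13, 8, 907, 1]]
padded_ids, mask = pad_batch(batch, pad_id=0)
print(padded_ids)
print(mask)
```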

In [76]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')

In [77]:
model.resize_token_embeddings(len(tokenizer))  # resize the model's embedding matrix to match the tokenizer's vocabulary size

Embedding(32100, 768)

In [78]:
model.config.pad_token_id = tokenizer.pad_token_id

In [79]:
generator = pipeline(
    'text2text-generation',
    model = model,
    tokenizer = tokenizer,
    max_new_tokens = 150
)


Device set to use cuda:0


In [80]:
from langchain_community.llms import HuggingFacePipeline  # the langchain.llms import path is deprecated

In [81]:
llm = HuggingFacePipeline(pipeline=generator)

## `Design a prompt`

In [82]:
template = """

                Use the context provided to answer the question. If you don't know the answer, say you don't know.

                Context:
                {context}

                Question: {question}
                Answer:

"""

In [83]:
custom_template = PromptTemplate(
    template = template
    )

In [84]:
custom_template

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="\n\n                Use the context provided to answer the question. If you don't know the answer, say you don't know.\n\n                Context:\n                {context}\n\n                Question: {question}\n                Answer:\n\n")
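
The `input_variables` shown in the output come from the `{context}` and `{question}` placeholders in the template string. The sketch below uses only the standard library to find the placeholders and fill them in, which is similar in spirit to what a prompt template does at run time. The template here is a condensed copy of the one above, and the sample context and question are made up.

```python
import string

# Condensed copy of the notebook's template (indentation removed for readability).
template = (
    "Use the context provided to answer the question. "
    "If you don't know the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# Find the named placeholders in the template string.
placeholders = [name for _, name, _, _ in string.Formatter().parse(template) if name]
print(placeholders)

# Fill the placeholders with hypothetical values.
filled = template.format(
    context="Good evening. Mr. Speaker.",
    question="How does the speech open?",
)
print(filled)
```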

## `Connect them`

In [85]:
rag_chain = (
    {
        "context": retriever,
        "question": RunnablePassthrough()
    }
    | custom_template
    | llm
    | StrOutputParser()
)
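
In the chain above, the retriever returns a list of `Document` objects, and that list appears to be placed into `{context}` as its printed representation, including IDs and metadata. A common refinement is to join only the chunk text before it reaches the prompt. The sketch below shows such a helper under stated assumptions: `FakeDocument` is a hypothetical stand-in that models only `page_content`, and `format_docs` is an illustrative name, not something defined in this notebook.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [FakeDocument("Good evening."), FakeDocument("Our North Star.")]
context = format_docs(retrieved)
print(context)

# Possible wiring in the chain (an assumption, not executed here):
# "context": retriever | format_docs
```

With plain text in the context, a small model such as `flan-t5-base` has fewer irrelevant characters competing for its limited input length.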


In [89]:
query = "can you tell me about the president of U.S what did it tell"

answer = rag_chain.invoke(query)

print(answer)

I address you at a moment unprecedented in the history of the Union
