# Natural Language Processing

# Retrieval-Augmented Generation (RAG)

RAG is a technique for augmenting LLM knowledge with additional, often private or real-time, data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, which ends at a specific cutoff date. To build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the model's knowledge with the specific information it needs.

<img src="../figures/RAG-process.png" >

This notebook builds `ChakyBot`, a chatbot designed to help Chaky (the instructor) and Gun (the TA) explain the lessons of the NLP course to students. ChakyBot uses LangChain to retrieve information from documents and answer questions about them. The walkthrough covers four components:

1. Prompt
2. Retrieval
3. Memory
4. Chain

In [1]:
# #langchain library
# !pip install langchain==0.0.350
# #LLM
# !pip install accelerate==0.25.0
# !pip install transformers==4.36.2
# !pip install bitsandbytes==0.41.2
# #Text Embedding
# !pip install sentence-transformers==2.2.2
# !pip install InstructorEmbedding==1.0.1
# #vectorstore
# !pip install pymupdf==1.23.8
# !pip install faiss-gpu==1.7.2
# !pip install faiss-cpu==1.7.4

In [2]:
import os
import torch
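# prefer Apple's MPS backend, then CUDA, and fall back to the CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu')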
device

device(type='mps')

## 1. Prompt

A set of instructions or input provided by a user to guide the model's response, helping it understand the context and generate relevant and coherent language-based output, such as answering questions, completing sentences, or engaging in a conversation.

In [3]:
from langchain import PromptTemplate

prompt_template = """
    I'm your friendly AI bot I will try to answer your questions especially about Usman
    {context}
    Question: {question}
    Answer:
    """.strip()

PROMPT = PromptTemplate.from_template(
    template = prompt_template
)

PROMPT
# PromptTemplate uses str.format-style placeholders:
# each input variable is written in curly brackets, e.g. {context} and {question}

PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly AI bot I will try to answer your questions especially about Usman\n    {context}\n    Question: {question}\n    Answer:")

In [4]:
PROMPT.format(
    context = "Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can effectively generalize and thus perform tasks without explicit instructions.",
    question = "What is Machine Learning"
)

"I'm your friendly AI bot I will try to answer your questions especially about Usman\n    Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can effectively generalize and thus perform tasks without explicit instructions.\n    Question: What is Machine Learning\n    Answer:"

Note: [How to improve prompting (zero-shot, few-shot, chain-of-thought, etc.)](https://github.com/chaklam-silpasuwanchai/Natural-Language-Processing/blob/main/Code/05%20-%20RAG/advance/cot-tot-prompting.ipynb)

## 2. Retrieval

1. `Document loaders`: Load documents from many different sources (HTML, PDF, code).
2. `Document transformers`: Break a large document into smaller, relevant chunks, an essential step that improves retrieval.
3. `Text embedding models`: Create embeddings that capture the semantic meaning of text, so you can quickly and efficiently find other pieces of text that are similar.
4. `Vector stores`: Databases that support efficient storage and search of these embeddings.
5. `Retrievers`: Once the data is in the database, you still need to retrieve it.

### 2.1 Document Loaders 
Use document loaders to load data from a source as `Document` objects. A `Document` is a piece of text with associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.
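
To make the idea of a `Document` concrete, here is a minimal sketch in plain Python rather than LangChain's own class. `ToyDocument` and `load_pages` are hypothetical names, and the assumption that pages are separated by form-feed characters is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """A piece of text plus metadata describing where it came from."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_pages(raw_text: str, source: str) -> list[ToyDocument]:
    """Turn a form-feed separated string into one ToyDocument per page (assumed format)."""
    pages = raw_text.split("\f")
    return [
        ToyDocument(page, {"source": source, "page": i, "total_pages": len(pages)})
        for i, page in enumerate(pages)
    ]


docs = load_pages("first page\fsecond page\fthird page", source="example.txt")
print(len(docs), docs[1].metadata)
```

The metadata keys mirror the `source`, `page`, and `total_pages` fields that appear in the loader output later in this section.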

[PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)

[Download Document](https://web.stanford.edu/~jurafsky/slp3/)

In [5]:
from langchain.document_loaders import PyMuPDFLoader

nlp_docs = 'usman_portfolio.pdf'

loader = PyMuPDFLoader(nlp_docs)
documents = loader.load()

In [6]:
len(documents)

3

In [7]:
documents[1]

Document(page_content='optimization, cloud integrations, and security implementations. One of my key projects \nhas been transitioning over 1,200 Hilton properties to HotelKey’s Front Desk system and \nintegrating it with major online travel agencies (OTAs) like Agoda, Airbnb, and \nBooking.com. Additionally, I have worked on integrating revenue management systems \nsuch as IDeaS, Duetto, and Revenue Analytics, as well as optimizing ETL (Extract, \nTransform, Load) processes to improve data handling eIiciency. Through my work, I have \ncome to appreciate the importance of agile methodologies, teamwork, and scalable \narchitectures in the software industry. \nThe tech industry is one of the most dynamic ﬁelds, constantly evolving with new trends, \nframeworks, and paradigms. My professional experiences have not only enhanced my \ntechnical skills but also shaped my core beliefs regarding the role of technology in \nsociety. I ﬁrmly believe that technology should be developed with the in

### 2.2 Document Transformers

`RecursiveCharacterTextSplitter` is the recommended splitter for generic text. It is parameterized by a list of separator characters (by default `["\n\n", "\n", " ", ""]`) and tries to split on them in order until the chunks are small enough.
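
The "split on separators in order" idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical helper, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is split
    again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
print(recursive_split(sample, chunk_size=8))
```

Paragraph breaks are preferred as cut points, and the splitter falls back to spaces only when a paragraph does not fit in a chunk.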

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 100
)

doc = text_splitter.split_documents(documents)

In [9]:
doc[1]

Document(page_content='shaping industries and making life more eIicient. \nMy academic journey began with a strong foundation in information technology. I pursued \nmy Bachelor of Science in Information Technology at the University of the Punjab, a \nprogram that gave me the technical expertise and analytical skills necessary to succeed \nin the ﬁeld. Not only did I graduate with distinction, but I also achieved a CGPA of \n3.93/4.00, which earned me the honor of being a gold medalist. During my undergraduate \nyears, I immersed myself in programming, databases, and cloud computing while \nactively participating in research and development projects. Excelling academically was', metadata={'source': 'usman_portfolio.pdf', 'file_path': 'usman_portfolio.pdf', 'page': 0, 'total_pages': 3, 'format': 'PDF 1.3', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'macOS Version 14.6.1 (Build 23G93) Quartz PDFContext', 'creationDate': "D:20250316175958Z00'00'", 

In [10]:
len(doc)

15

### 2.3 Text Embedding Models
Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
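
As a small illustration of "most similar in the vector space", the sketch below ranks texts by cosine similarity, one common similarity measure. The three-dimensional vectors are invented for the example; a real embedding model produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up purely for illustration.
vectors = {
    "software engineer": [0.9, 0.1, 0.0],
    "backend developer": [0.8, 0.2, 0.1],
    "gold medalist": [0.1, 0.9, 0.2],
}
query_vector = [1.0, 0.0, 0.0]

ranked = sorted(
    vectors,
    key=lambda text: cosine_similarity(query_vector, vectors[text]),
    reverse=True,
)
print(ranked)
```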

*Note*: Instructor model: [Hugging Face](https://huggingface.co/hkunlp/instructor-base) | [Paper](https://arxiv.org/abs/2212.09741)

In [11]:
import torch
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

model_name = 'hkunlp/instructor-base'

embedding_model = HuggingFaceInstructEmbeddings(
    model_name = model_name,
    model_kwargs = {"device" : device}
)

load INSTRUCTOR_Transformer
max_seq_length  512


### 2.4 Vector Stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.
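
The store-then-search cycle described above can be sketched without any external library. Everything here is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, `ToyVectorStore` is a hypothetical class, and squared L2 distance is used as one possible notion of "most similar".

```python
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def squared_l2(a: Counter, b: Counter) -> int:
    """Squared Euclidean distance between two sparse word-count vectors."""
    return sum((a[word] - b[word]) ** 2 for word in set(a) | set(b))


class ToyVectorStore:
    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add_texts(self, texts: list[str]) -> None:
        # Embed once at indexing time and keep the vector next to the text.
        for text in texts:
            self._items.append((text, toy_embed(text)))

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        # Embed the query, then return the k texts with the smallest distance.
        query_vector = toy_embed(query)
        ranked = sorted(self._items, key=lambda item: squared_l2(query_vector, item[1]))
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "born on september 24 1995",
    "software engineer at hotelkey",
    "master in data science at ait",
])
print(store.similarity_search("where is the software engineer employed", k=1))
```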

In [12]:
#locate vectorstore
vector_path = './vector-store'
if not os.path.exists(vector_path):
    os.makedirs(vector_path)
    print('create path done')

In [13]:
#save vector locally
from langchain.vectorstores import FAISS

vectordb = FAISS.from_documents(
    documents = doc,
    embedding = embedding_model
)

db_file_name = 'nlp_stanford'

vectordb.save_local(
    folder_path = os.path.join(vector_path, db_file_name),
    index_name = 'nlp' # custom index name (the default is 'index')
)

### 2.5 Retrievers
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
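
To illustrate why a retriever is more general than a vector store, the sketch below defines the interface as a `typing.Protocol` and implements it with simple keyword overlap, so no embeddings are involved. `Retriever`, `KeywordRetriever`, and `fetch_sources` are hypothetical names used only for this example.

```python
from typing import Protocol


class Retriever(Protocol):
    """Anything that maps a query string to a list of relevant texts."""

    def get_relevant_documents(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Returns texts that share at least one word with the query."""

    def __init__(self, texts: list[str], k: int = 2) -> None:
        self.texts = list(texts)
        self.k = k

    def get_relevant_documents(self, query: str) -> list[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(text.lower().split())), text) for text in self.texts]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)  # most shared words first
        return [text for _, text in hits[: self.k]]


def fetch_sources(retriever: Retriever, query: str) -> list[str]:
    """Code written against the interface works with any retriever."""
    return retriever.get_relevant_documents(query)


keyword_retriever = KeywordRetriever([
    "gold medalist at the University of the Punjab",
    "Software Engineer at HotelKey",
    "scholarship from the Royal Thai Government",
])
print(fetch_sources(keyword_retriever, "which university"))
```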

In [14]:
# load the vector store from local disk
vector_path = './vector-store'
db_file_name = 'nlp_stanford'

from langchain.vectorstores import FAISS

vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings = embedding_model,
    index_name = 'nlp' # custom index name (the default is 'index')
)

In [15]:
#ready to use
retriever = vectordb.as_retriever()

In [16]:
retriever.get_relevant_documents("What is your name")

[Document(page_content="actively participating in research and development projects. Excelling academically was \na priority, but what truly drove me was the ability to apply theoretical knowledge to \npractical problems. This academic success led me to receive one of the most prestigious \nscholarships, His Majesty the King's Scholarship, awarded by the Royal Thai Government. \nThis enabled me to take my education to the next level by pursuing a Master’s in Data \nScience and Artiﬁcial Intelligence at the Asian Institute of Technology (AIT), Thailand, \nwhere I began my studies in August 2024. The transition from undergraduate studies to a", metadata={'source': 'usman_portfolio.pdf', 'file_path': 'usman_portfolio.pdf', 'page': 0, 'total_pages': 3, 'format': 'PDF 1.3', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'macOS Version 14.6.1 (Build 23G93) Quartz PDFContext', 'creationDate': "D:20250316175958Z00'00'", 'modDate': "D:20250316175958Z00'00'"

In [17]:
retriever.get_relevant_documents("What is job experience")

[Document(page_content='demanding yet fulﬁlling experience. \nWith over two years of professional work experience, I have gained hands-on exposure to \nvarious aspects of software engineering. Currently, I am employed as a Software \nEngineer at HotelKey, a US-based company, where I work remotely, contributing to a \nhighly dynamic and innovative environment. Before this, I served as an Associate \nSoftware Engineer at the same company, progressing in my role due to my contributions \nand ability to handle complex technical challenges. My work primarily revolves around \ndeveloping and maintaining RESTful APIs, cloud-based architectures, and scalable', metadata={'source': 'usman_portfolio.pdf', 'file_path': 'usman_portfolio.pdf', 'page': 0, 'total_pages': 3, 'format': 'PDF 1.3', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'macOS Version 14.6.1 (Build 23G93) Quartz PDFContext', 'creationDate': "D:20250316175958Z00'00'", 'modDate': "D:202503161759

## 3. Memory

One of the core utility classes underpinning most (if not all) memory modules is the `ChatMessageHistory` class. It is a lightweight wrapper that provides convenience methods for saving `HumanMessage` and `AIMessage` objects and then fetching them all.

You may want to use this class directly if you are managing memory outside of a chain.


In [18]:
from langchain.memory import ChatMessageHistory

history = ChatMessageHistory()
history

ChatMessageHistory(messages=[])

In [19]:
history.add_user_message('hi')
history.add_ai_message('Whats up?')
history.add_user_message('How are you')
history.add_ai_message('I\'m quite good. How about you?')

In [20]:
history

ChatMessageHistory(messages=[HumanMessage(content='hi'), AIMessage(content='Whats up?'), HumanMessage(content='How are you'), AIMessage(content="I'm quite good. How about you?")])

### 3.1 Memory types

There are many different types of memory. Each has its own parameters and return types, and each is useful in different scenarios. This section covers two:
- Conversation Buffer
- Conversation Buffer Window

#### What variables get returned from memory

Before going into the chain, various variables are read from memory. These have specific names, which need to align with the variables the chain expects. You can see what these variables are by calling `memory.load_memory_variables({})`. The empty dictionary passed in is just a placeholder for real variables. If the memory type you are using depends on the input variables, you may need to pass some in.

For `ConversationBufferMemory`, `load_memory_variables` returns a single key, `history`. This means that your chain (and likely your prompt) should expect an input named `history`. You can usually control this name through parameters on the memory class. For example, to return the memory variables under the key `chat_history`, pass `memory_key="chat_history"` when constructing the memory.
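
The name alignment described above can be checked mechanically. The sketch below is plain Python, not a LangChain API: `required_variables` and `missing_variables` are hypothetical helpers that compare the placeholders in a prompt against the keys supplied by the user input and by memory.

```python
from string import Formatter


def required_variables(template: str) -> set[str]:
    """Names of the {curly} placeholders in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_variables(template: str, user_inputs: dict, memory_variables: dict) -> set[str]:
    """Placeholders that neither the user input nor memory would fill."""
    provided = set(user_inputs) | set(memory_variables)
    return required_variables(template) - provided


template = "Chat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:"

# Memory that exposes its messages under "history" leaves {chat_history} unfilled.
print(missing_variables(template, {"question": "..."}, {"history": ""}))
# Memory configured with the key "chat_history" satisfies the prompt.
print(missing_variables(template, {"question": "..."}, {"chat_history": ""}))
```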

#### Conversation Buffer
This memory stores messages and then returns them in a variable.

In [21]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: hi\nAI: What's up?\nHuman: How are you?\nAI: I'm quite good. How about you?"}

In [22]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages = True)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': [HumanMessage(content='hi'),
  AIMessage(content="What's up?"),
  HumanMessage(content='How are you?'),
  AIMessage(content="I'm quite good. How about you?")]}

#### Conversation Buffer Window
- It keeps a list of the interactions of the conversation over time.
- It only uses the last `k` interactions.
- It is useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large.

In [23]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=1)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: How are you?\nAI: I'm quite good. How about you?"}
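
The output above keeps only the last exchange. A sliding window of this kind can be sketched with `collections.deque`, which discards the oldest item once `maxlen` is reached. `WindowMemory` is a hypothetical class written for illustration, not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    def __init__(self, k: int, memory_key: str = "history") -> None:
        self.memory_key = memory_key
        self._turns: deque = deque(maxlen=k)  # older turns fall off automatically

    def save_context(self, user_input: str, ai_output: str) -> None:
        self._turns.append((user_input, ai_output))

    def load_memory_variables(self) -> dict[str, str]:
        lines = []
        for human, ai in self._turns:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        return {self.memory_key: "\n".join(lines)}


window = WindowMemory(k=1)
window.save_context("hi", "What's up?")
window.save_context("How are you?", "I'm quite good. How about you?")
print(window.load_memory_variables())
```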

## 4. Chain

Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

An `LLMChain` is a simple chain that adds some functionality around language models.
- It consists of a `PromptTemplate` and a language model (either an LLM or a chat model).
- It formats the prompt template using the input key values provided (and also memory key values, if available).
- It passes the formatted string to the LLM and returns the LLM output.
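
The three steps above can be sketched in a few lines of plain Python. `ToyLLMChain` and `fake_llm` are hypothetical stand-ins: the "model" here is just a function from string to string, so the example runs without downloading any weights.

```python
from typing import Callable


class ToyLLMChain:
    def __init__(self, template: str, llm: Callable[[str], str]) -> None:
        self.template = template
        self.llm = llm

    def run(self, **inputs: str) -> dict[str, str]:
        prompt = self.template.format(**inputs)  # 1. fill in the prompt template
        text = self.llm(prompt)                  # 2. pass the string to the model
        return {**inputs, "text": text}          # 3. return the inputs plus the output


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: reports how long the prompt was."""
    return f"(received {len(prompt)} characters)"


toy_chain = ToyLLMChain("Question: {question}\nAnswer:", fake_llm)
result = toy_chain.run(question="What is RAG?")
print(result)
```

Returning the inputs together with a `text` key mirrors the shape of the `question_generator` output shown later in this section.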

Note: [Download the FastChat-T5 model here](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)

In [24]:
# %cd ./models
# !git clone https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

In [25]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from transformers import BitsAndBytesConfig
from langchain import HuggingFacePipeline
import torch

model_id = './models/fastchat-t5-3b-v1.0/'

tokenizer = AutoTokenizer.from_pretrained(
    model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id

bitsandbyte_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.float16,
    bnb_4bit_use_double_quant = True
)

# model = AutoModelForSeq2SeqLM.from_pretrained(
#     model_id,
#     quantization_config = bitsandbyte_config, #caution Nvidia
#     device_map = 'auto',
#     load_in_8bit = True
# )

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    #quantization_config = bitsandbyte_config, #caution Nvidia
    #device_map = 'auto',
    torch_dtype=torch.float16,
    device_map={"": device}
    #load_in_8bit = True
)

pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens = 256,
    model_kwargs = {
        "temperature" : 0,
        "repetition_penalty": 1.5
    }
)

llm = HuggingFacePipeline(pipeline = pipe)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### [Class ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/_modules/langchain/chains/conversational_retrieval/base.html#ConversationalRetrievalChain)

- `retriever` : Retriever to use to fetch documents.

- `combine_docs_chain` : The chain used to combine any retrieved documents.

- `question_generator`: The chain used to generate a new question for the sake of retrieval. This chain takes in the current question (with variable `question`) and any chat history (with variable `chat_history`) and produces a new standalone question to be used later on.

- `return_source_documents` : Return the retrieved source documents as part of the final result.

- `get_chat_history` : An optional function to get a string of the chat history. If None is provided, will use a default.

- `return_generated_question` : Return the generated question as part of the final result.

- `response_if_no_docs_found` : If specified, the chain will return a fixed response if no docs are found for the question.
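
How these pieces fit together in one conversational turn can be sketched as follows. This is a simplified illustration with stub callables, not the library's code: `conversational_retrieval` and its three function arguments are hypothetical, and passing the standalone question to both the retrieval and the answering step is an assumption of this sketch.

```python
def conversational_retrieval(question, chat_history, condense, retrieve, combine):
    """One turn: condense the question, retrieve documents, combine into an answer."""
    # With no history there is nothing to resolve, so the question is used as is.
    standalone = condense(chat_history, question) if chat_history else question
    docs = retrieve(standalone)
    answer = combine(docs, standalone)
    return {
        "question": question,
        "generated_question": standalone,
        "answer": answer,
        "source_documents": docs,
    }


calls = []  # records which steps ran, for demonstration only


def condense(history, question):
    calls.append("condense")
    return f"{question} (about Usman)"


def retrieve(query):
    calls.append("retrieve")
    return ["Born on September 24, 1995"]


def combine(docs, question):
    calls.append("combine")
    return f"Answer to '{question}' using {len(docs)} document(s)"


first = conversational_retrieval("How old are you?", [], condense, retrieve, combine)
print(calls)
```

On the first turn the history is empty, so only the retrieval and combine steps run; later turns add the condense step in front.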


`question_generator`

In [26]:
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain

In [27]:
CONDENSE_QUESTION_PROMPT

PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')

In [28]:
question_generator = LLMChain(
    llm = llm,
    prompt = CONDENSE_QUESTION_PROMPT,
    verbose = True
)

In [29]:
query = 'Comparing both of them'
chat_history = "Human:What is your age?\nAI:\nHuman:How old are you?\nAI:"

question_generator({'chat_history' : chat_history, "question" : query})

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
Human:What is your age?
AI:
Human:How old are you?
AI:
Follow Up Input: Comparing both of them
Standalone question:

> Finished chain.


{'chat_history': 'Human:What is your age?\nAI:\nHuman:How old are you?\nAI:',
 'question': 'Comparing both of them',
 'text': '<pad> How  old  are  you?\n'}

`combine_docs_chain`

In [30]:
doc_chain = load_qa_chain(
    llm = llm,
    chain_type = 'stuff',
    prompt = PROMPT,
    verbose = True
)
doc_chain

StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly AI bot I will try to answer your questions especially about Usman\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x32da5e8e0>)), document_variable_name='context')
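
The `stuff` strategy places every retrieved document into a single prompt. A minimal sketch of that idea in plain Python, where `stuff_prompt` is a hypothetical helper and the blank-line separator is an assumption chosen to match the formatted prompts printed in this notebook:

```python
def stuff_prompt(template: str, documents: list[str], question: str,
                 separator: str = "\n\n") -> str:
    """Join all documents into one context string and fill the template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


qa_template = "{context}\nQuestion: {question}\nAnswer:"
stuffed = stuff_prompt(
    qa_template,
    ["Born on September 24, 1995.", "Software Engineer at HotelKey."],
    "What is your birth date?",
)
print(stuffed)
```

Because all documents share one prompt, the total length grows with the number and size of the retrieved chunks, which is one reason to keep `chunk_size` modest.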

In [31]:
query = "What is your birth date?"
input_document = retriever.get_relevant_documents(query)

doc_chain({'input_documents':input_document, 'question':query})

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly AI bot I will try to answer your questions especially about Usman
    Usman’s Portfolio 
Technology has always been a transformative force, shaping societies in unprecedented 
ways and redeﬁning how we interact, conduct business, and solve real-world problems. 
My passion for technology and software engineering has been a deﬁning aspect of my 
journey, leading me to where I am today. Born on September 24, 1995, I have spent my 
years constantly evolving with the technological landscape, embracing challenges, and 
continuously learning. With nearly 29 years of life experience, I have always been 
fascinated by how software, artiﬁcial intelligence, and cloud computing contribute to 
shaping industries and making life more eIicient.

where I began my studies in August 2024. The transition from undergraduate studies to a 
master’s program 

{'input_documents': [Document(page_content='Usman’s Portfolio \nTechnology has always been a transformative force, shaping societies in unprecedented \nways and redeﬁning how we interact, conduct business, and solve real-world problems. \nMy passion for technology and software engineering has been a deﬁning aspect of my \njourney, leading me to where I am today. Born on September 24, 1995, I have spent my \nyears constantly evolving with the technological landscape, embracing challenges, and \ncontinuously learning. With nearly 29 years of life experience, I have always been \nfascinated by how software, artiﬁcial intelligence, and cloud computing contribute to \nshaping industries and making life more eIicient.', metadata={'source': 'usman_portfolio.pdf', 'file_path': 'usman_portfolio.pdf', 'page': 0, 'total_pages': 3, 'format': 'PDF 1.3', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'macOS Version 14.6.1 (Build 23G93) Quartz PDFContext', 'creat

In [32]:
memory = ConversationBufferWindowMemory(
    k=3, 
    memory_key = "chat_history",
    return_messages = True,
    output_key = 'answer'
)

chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True,
    memory=memory,
    verbose=True,
    get_chat_history=lambda h : h
)
chain

ConversationalRetrievalChain(memory=ConversationBufferWindowMemory(output_key='answer', return_messages=True, memory_key='chat_history', k=3), verbose=True, combine_docs_chain=StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly AI bot I will try to answer your questions especially about Usman\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x32da5e8e0>)), document_variable_name='context'), question_generator=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:'), llm=HuggingFacePipeline

## 5. Chatbot

In [33]:
prompt_question = "How old are you?"
answer = chain({"question": prompt_question})
answer

> Entering new ConversationalRetrievalChain chain...

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly AI bot I will try to answer your questions especially about Usman
    Usman’s Portfolio 
Technology has always been a transformative force, shaping societies in unprecedented 
ways and redeﬁning how we interact, conduct business, and solve real-world problems. 
My passion for technology and software engineering has been a deﬁning aspect of my 
journey, leading me to where I am today. Born on September 24, 1995, I have spent my 
years constantly evolving with the technological landscape, embracing challenges, and 
continuously learning. With nearly 29 years of life experience, I have always been 
fascinated by how software, artiﬁcial intelligence, and cloud computing contribute to 
shaping industries and making life more eIicient.

where I began my studies in August 2024. Th

{'question': 'How old are you?',
 'chat_history': [],
 'answer': '<pad>  I  am  29  years  old.\n',
 'source_documents': [Document(page_content='Usman’s Portfolio \nTechnology has always been a transformative force, shaping societies in unprecedented \nways and redeﬁning how we interact, conduct business, and solve real-world problems. \nMy passion for technology and software engineering has been a deﬁning aspect of my \njourney, leading me to where I am today. Born on September 24, 1995, I have spent my \nyears constantly evolving with the technological landscape, embracing challenges, and \ncontinuously learning. With nearly 29 years of life experience, I have always been \nfascinated by how software, artiﬁcial intelligence, and cloud computing contribute to \nshaping industries and making life more eIicient.', metadata={'source': 'usman_portfolio.pdf', 'file_path': 'usman_portfolio.pdf', 'page': 0, 'total_pages': 3, 'format': 'PDF 1.3', 'title': '', 'author': '', 'subject': '', 'key
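
The raw `answer` above still carries a `<pad>` token and doubled spaces from the model's decoding. One possible post-processing step before showing the text to a user is sketched below; `clean_answer` is a hypothetical helper, not part of LangChain or transformers.

```python
import re


def clean_answer(raw: str) -> str:
    """Remove the <pad> token and collapse runs of whitespace."""
    text = raw.replace("<pad>", " ")
    return re.sub(r"\s+", " ", text).strip()


print(clean_answer("<pad>  I  am  29  years  old.\n"))
```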

In [34]:
prompt_question = "What is your highest level of education?"
answer = chain({"question": prompt_question})
answer

> Entering new ConversationalRetrievalChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How old are you?'), AIMessage(content='<pad>  I  am  29  years  old.\n')]
Follow Up Input: What is your highest level of education?
Standalone question:

> Finished chain.

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly AI bot I will try to answer your questions especially about Usman
    shaping industries and making life more eIicient. 
My academic journey began with a strong foundation in information technology. I pursued 
my Bachelor of Science in Information Technology at the University of the Punjab, a 
program that gave me the technical exper

{'question': 'What is your highest level of education?',
 'chat_history': [HumanMessage(content='How old are you?'),
  AIMessage(content='<pad>  I  am  29  years  old.\n')],
 'answer': '<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n',
 'source_documents': [Document(page_content='shaping industries and making life more eIicient. \nMy academic journey began with a strong foundation in information technology. I pursued \nmy Bachelor of Science in Information Technology at the University of the Punjab, a \nprogram that gave me the technical expertise and analytical skills necessary to succeed \nin the ﬁeld. Not only did I graduate with distinction, but I also achieved a CGPA of \n3.93/4.00, which earned me the honor of being a gold medalist. During my undergraduate \nyears, I immersed myself in programming, databases, and cloud computing while \nactively participating in research and development projects. Excelling academically was', metadata={'source': 'usman_portfoli

In [35]:
prompt_question = "What major or field of study did you pursue during your education?"
answer = chain({"question": prompt_question})
answer

> Entering new ConversationalRetrievalChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How old are you?'), AIMessage(content='<pad>  I  am  29  years  old.\n'), HumanMessage(content='What is your highest level of education?'), AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n')]
Follow Up Input: What major or field of study did you pursue during your education?
Standalone question:

> Finished chain.

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly AI bot I will try to answer your questions especially about Usman
    shaping industries and making life more eIicient. 
My academic journey began with a

{'question': 'What major or field of study did you pursue during your education?',
 'chat_history': [HumanMessage(content='How old are you?'),
  AIMessage(content='<pad>  I  am  29  years  old.\n'),
  HumanMessage(content='What is your highest level of education?'),
  AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n')],
 'answer': '<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n',
 'source_documents': [Document(page_content='shaping industries and making life more eIicient. \nMy academic journey began with a strong foundation in information technology. I pursued \nmy Bachelor of Science in Information Technology at the University of the Punjab, a \nprogram that gave me the technical expertise and analytical skills necessary to succeed \nin the ﬁeld. Not only did I graduate with distinction, but I also achieved a CGPA of \n3.93/4.00, which earned me the honor of being a gold medalist. During my undergraduate \nyears, I immersed m

In [36]:
prompt_question = "How many years of work experience do you have?"
answer = chain({"question": prompt_question})
answer

> Entering new ConversationalRetrievalChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How old are you?'), AIMessage(content='<pad>  I  am  29  years  old.\n'), HumanMessage(content='What is your highest level of education?'), AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'), HumanMessage(content='What major or field of study did you pursue during your education?'), AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n')]
Follow Up Input: How many years of work experience do you have?
Standalone question:

> Finished chain.

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your

{'question': 'How many years of work experience do you have?',
 'chat_history': [HumanMessage(content='How old are you?'),
  AIMessage(content='<pad>  I  am  29  years  old.\n'),
  HumanMessage(content='What is your highest level of education?'),
  AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'),
  HumanMessage(content='What major or field of study did you pursue during your education?'),
  AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n')],
 'answer': '<pad>  I  have  over  two  years  of  work  experience.\n',
 'source_documents': [Document(page_content='demanding yet fulﬁlling experience. \nWith over two years of professional work experience, I have gained hands-on exposure to \nvarious aspects of software engineering. Currently, I am employed as a Software \nEngineer at HotelKey, a US-based company, where I work remotely, contributing to a \nhighly dynamic and innovative environment. Before this, I serv

In [37]:
prompt_question = "What type of work or industry have you been involved in?"
answer = chain({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What is your highest level of education?'), AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'), HumanMessage(content='What major or field of study did you pursue during your education?'), AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'), HumanMessage(content='How many years of work experience do you have?'), AIMessage(content='<pad>  I  have  over  two  years  of  work  experience.\n')]
Follow Up Input: What type of work or industry have you been involved in?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLM

{'question': 'What type of work or industry have you been involved in?',
 'chat_history': [HumanMessage(content='What is your highest level of education?'),
  AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'),
  HumanMessage(content='What major or field of study did you pursue during your education?'),
  AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'),
  HumanMessage(content='How many years of work experience do you have?'),
  AIMessage(content='<pad>  I  have  over  two  years  of  work  experience.\n')],
 'answer': '<pad>  Software  Engineering\n',
 'source_documents': [Document(page_content='demanding yet fulﬁlling experience. \nWith over two years of professional work experience, I have gained hands-on exposure to \nvarious aspects of software engineering. Currently, I am employed as a Software \nEngineer at HotelKey, a US-based company, where I work remotely, contributing to a \nhighly dynamic and inno
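
In the outputs of these cells, `chat_history` never holds more than three question-answer exchanges: once a fourth exchange is recorded, the oldest one ("How old are you?") is no longer present. That behaviour is consistent with a windowed conversation memory. The block below is a minimal, library-free sketch of that idea and not the notebook's actual memory object; the `WindowMemory` class and the value `k=3` are assumptions chosen to mirror the printed histories.

```python
from collections import deque


class WindowMemory:
    """Hypothetical helper: keep only the k most recent (question, answer) pairs."""

    def __init__(self, k=3):
        # A deque with maxlen silently discards the oldest item once it is full.
        self.exchanges = deque(maxlen=k)

    def save(self, question, answer):
        self.exchanges.append((question, answer))

    def history(self):
        return list(self.exchanges)


memory = WindowMemory(k=3)
memory.save("How old are you?", "I am 29 years old.")
memory.save("What is your highest level of education?",
            "Master's in Data Science and Artificial Intelligence")
memory.save("What major or field of study did you pursue during your education?",
            "Master's in Data Science and Artificial Intelligence")
memory.save("How many years of work experience do you have?",
            "I have over two years of work experience.")

# After the fourth save, the first exchange has been dropped from the window.
first_question = memory.history()[0][0]
print(first_question)
```

A small window keeps the condensing prompt short, at the cost of forgetting earlier turns, which matters for small models with limited context length.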

In [38]:
prompt_question = "Can you describe your current role or job responsibilities?"
answer = chain({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What major or field of study did you pursue during your education?'), AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'), HumanMessage(content='How many years of work experience do you have?'), AIMessage(content='<pad>  I  have  over  two  years  of  work  experience.\n'), HumanMessage(content='What type of work or industry have you been involved in?'), AIMessage(content='<pad>  Software  Engineering\n')]
Follow Up Input: Can you describe your current role or job responsibilities?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...

{'question': 'Can you describe your current role or job responsibilities?',
 'chat_history': [HumanMessage(content='What major or field of study did you pursue during your education?'),
  AIMessage(content='<pad>  Master’s  in  Data  Science  and  Artificial  Intelligence\n'),
  HumanMessage(content='How many years of work experience do you have?'),
  AIMessage(content='<pad>  I  have  over  two  years  of  work  experience.\n'),
  HumanMessage(content='What type of work or industry have you been involved in?'),
  AIMessage(content='<pad>  Software  Engineering\n')],
 'answer': '<pad>   pad>  I  am  a  Software  Engineer  at  HotelKey,  a  US-based  company,  where  I  work  remotely,  contributing  to  a  highly  dynamic  and  innovative  environment.  My  work  primarily  revolves  around  developing  and  maintaining  RESTful  APIs,  cloud-based  architectures,  and  scalable  software  solutions.  I  specialize  in  using  Java,  Python,  C,  and  C++,  along  with  a  range  of  A

In [39]:
prompt_question = "What are your core beliefs regarding the role of technology in shaping society?"
answer = chain({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How many years of work experience do you have?'), AIMessage(content='<pad>  I  have  over  two  years  of  work  experience.\n'), HumanMessage(content='What type of work or industry have you been involved in?'), AIMessage(content='<pad>  Software  Engineering\n'), HumanMessage(content='Can you describe your current role or job responsibilities?'), AIMessage(content='<pad>   pad>  I  am  a  Software  Engineer  at  HotelKey,  a  US-based  company,  where  I  work  remotely,  contributing  to  a  highly  dynamic  and  innovative  environment.  My  work  primarily  revolves  around  developing  and  maintaining  RESTful  APIs,  cloud-based  architectures,  and  scalab

{'question': 'What are your core beliefs regarding the role of technology in shaping society?',
 'chat_history': [HumanMessage(content='How many years of work experience do you have?'),
  AIMessage(content='<pad>  I  have  over  two  years  of  work  experience.\n'),
  HumanMessage(content='What type of work or industry have you been involved in?'),
  AIMessage(content='<pad>  Software  Engineering\n'),
  HumanMessage(content='Can you describe your current role or job responsibilities?'),
  AIMessage(content='<pad>   pad>  I  am  a  Software  Engineer  at  HotelKey,  a  US-based  company,  where  I  work  remotely,  contributing  to  a  highly  dynamic  and  innovative  environment.  My  work  primarily  revolves  around  developing  and  maintaining  RESTful  APIs,  cloud-based  architectures,  and  scalable  software  solutions.  I  specialize  in  using  Java,  Python,  C,  and  C++,  along  with  a  range  of  Amazon  Web  Services  (AWS)  tools  such  as  CloudWatch,  S3,  Athena,

In [40]:
prompt_question = "How do you think cultural values should influence technological advancements?"
answer = chain({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What type of work or industry have you been involved in?'), AIMessage(content='<pad>  Software  Engineering\n'), HumanMessage(content='Can you describe your current role or job responsibilities?'), AIMessage(content='<pad>   pad>  I  am  a  Software  Engineer  at  HotelKey,  a  US-based  company,  where  I  work  remotely,  contributing  to  a  highly  dynamic  and  innovative  environment.  My  work  primarily  revolves  around  developing  and  maintaining  RESTful  APIs,  cloud-based  architectures,  and  scalable  software  solutions.  I  specialize  in  using  Java,  Python,  C,  and  C++,  along  with  a  range  of  Amazon  Web  Services  (AWS)  tools  such 

{'question': 'How do you think cultural values should influence technological advancements?',
 'chat_history': [HumanMessage(content='What type of work or industry have you been involved in?'),
  AIMessage(content='<pad>  Software  Engineering\n'),
  HumanMessage(content='Can you describe your current role or job responsibilities?'),
  AIMessage(content='<pad>   pad>  I  am  a  Software  Engineer  at  HotelKey,  a  US-based  company,  where  I  work  remotely,  contributing  to  a  highly  dynamic  and  innovative  environment.  My  work  primarily  revolves  around  developing  and  maintaining  RESTful  APIs,  cloud-based  architectures,  and  scalable  software  solutions.  I  specialize  in  using  Java,  Python,  C,  and  C++,  along  with  a  range  of  Amazon  Web  Services  (AWS)  tools  such  as  CloudWatch,  S3,  Athena,  DynamoDB,  Lambda,  SNS,  and  SQS.  My  responsibilities  include  working  on  microservices,  database  shaping  industries  and  making  life  more  eIi

In [41]:
prompt_question = "As a master’s student, what is the most challenging aspect of your studies so far?"
answer = chain({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='Can you describe your current role or job responsibilities?'), AIMessage(content='<pad>   pad>  I  am  a  Software  Engineer  at  HotelKey,  a  US-based  company,  where  I  work  remotely,  contributing  to  a  highly  dynamic  and  innovative  environment.  My  work  primarily  revolves  around  developing  and  maintaining  RESTful  APIs,  cloud-based  architectures,  and  scalable  software  solutions.  I  specialize  in  using  Java,  Python,  C,  and  C++,  along  with  a  range  of  Amazon  Web  Services  (AWS)  tools  such  as  CloudWatch,  S3,  Athena,  DynamoDB,  Lambda,  SNS,  and  SQS.  My  responsibilities  include  working  on  microservices,  databa

{'question': 'As a master’s student, what is the most challenging aspect of your studies so far?',
 'chat_history': [HumanMessage(content='Can you describe your current role or job responsibilities?'),
  AIMessage(content='<pad>   pad>  I  am  a  Software  Engineer  at  HotelKey,  a  US-based  company,  where  I  work  remotely,  contributing  to  a  highly  dynamic  and  innovative  environment.  My  work  primarily  revolves  around  developing  and  maintaining  RESTful  APIs,  cloud-based  architectures,  and  scalable  software  solutions.  I  specialize  in  using  Java,  Python,  C,  and  C++,  along  with  a  range  of  Amazon  Web  Services  (AWS)  tools  such  as  CloudWatch,  S3,  Athena,  DynamoDB,  Lambda,  SNS,  and  SQS.  My  responsibilities  include  working  on  microservices,  database  shaping  industries  and  making  life  more  eIicient.\n'),
  HumanMessage(content='What are your core beliefs regarding the role of technology in shaping society?'),
  AIMessage(con

In [42]:
prompt_question = "What specific research interests or academic goals do you hope to achieve during your time as a master’s student?"
answer = chain({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What are your core beliefs regarding the role of technology in shaping society?'), AIMessage(content="<pad>`` css>\n As  an  AI  bot,  I  don't  have  personal  thoughts  or  opinions,  but  I  can  provide  you  with  information  and  insights  based  on  the  text  you  provide.  However,  I  can  only  provide  information  and  insights  based  on  the  text  you  provide.  If  you  have  any  specific  questions  or  topics  you  would  like  me  to  cover,  please  let  me  know  and  I'll  do  my  best  to  assist  you.\n"), HumanMessage(content='How do you think cultural values should influence technological advancements?'), AIMessage(content='<pad>   pad

In [47]:
from langchain.chains import RetrievalQA
import json

In [49]:
# Setting up the question-answering pipeline
qa_system = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
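
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into the prompt's `{context}` slot, and the model is called once on the combined text. The following is a rough plain-Python sketch of that step, not LangChain's implementation; `TEMPLATE` is a flattened copy of the template from the Prompt section, while `stuff_prompt`, the separator, and the placeholder chunks are assumptions for illustration.

```python
# Flattened copy of the notebook's prompt template (indentation removed).
TEMPLATE = (
    "I'm your friendly AI bot I will try to answer your questions especially about Usman\n"
    "{context}\n"
    "Question: {question}\n"
    "Answer:"
)


def stuff_prompt(chunks, question, separator="\n\n"):
    """Hypothetical helper: join all retrieved chunks into one context string."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for whatever the retriever returns.
chunks = [
    "First retrieved chunk (placeholder text).",
    "Second retrieved chunk (placeholder text).",
]

prompt = stuff_prompt(chunks, "How many years of work experience do you have?")
print(prompt)
```

Because every chunk lands in a single prompt, this approach is simple but can exceed the model's context window when many or long chunks are retrieved.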

In [50]:
import re

# Clean raw model output: drop pad tokens and collapse repeated whitespace
def format_response(text):
    text = re.sub(r"<?pad>", "", text)  # covers "<pad>" and the stray "pad>" seen in some answers
    return " ".join(text.split())  # also removes newlines and the doubled spaces

# Run a query through the QA chain and return the cleaned answer
def query_qa_system(user_query):
    response = qa_system.invoke({"query": user_query})  # RetrievalQA reads the "query" key
    return format_response(response["result"])  # the generated answer is under "result"

In [52]:
question_set = [
    "How old are you?",
    "What is your highest level of education?",
    "What major or field of study did you pursue during your education?",
    "How many years of work experience do you have?",
    "What type of work or industry have you been involved in?",
    "Can you describe your current role or job responsibilities?",
    "What are your core beliefs regarding the role of technology in shaping society?",
    "How do you think cultural values should influence technological advancements?",
    "As a master's student, what is the most challenging aspect of your studies so far?",
    "What specific research interests or academic goals do you hope to achieve during your time as a master's student?"
]

In [54]:
# Generate an answer for each question, using the same keys as the saved output
response_collection = [{"question": q, "answer": query_qa_system(q)} for q in question_set]

In [55]:
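# Store the results in a JSON file
with open("qa_responses.json", "w", encoding="utf-8") as output_file:
    # ensure_ascii=False keeps characters such as ’ readable instead of \u2019 escapes
    json.dump(response_collection, output_file, indent=4, ensure_ascii=False)

print("Responses have been successfully stored.")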
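
To check that the saved file round-trips, it can be read back with `json.load`. The sketch below uses a temporary file and a two-item sample so that it runs on its own; the sample records are copied from answers shown in this notebook, and the file name `qa_responses_sample.json` is a hypothetical placeholder rather than the notebook's output file.

```python
import json
import os
import tempfile

# Two records in the same question/answer shape as the stored results.
sample = [
    {"question": "How old are you?", "answer": "I am 29 years old."},
    {"question": "What type of work or industry have you been involved in?",
     "answer": "Software Engineering"},
]

# Write to a temporary location so the real output file is left untouched.
path = os.path.join(tempfile.mkdtemp(), "qa_responses_sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=4, ensure_ascii=False)

# Read the file back and print each pair.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

for item in loaded:
    print(f"Q: {item['question']}")
    print(f"A: {item['answer']}")
```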

In [56]:
response_collection

[
    {
        "question": "How old are you?",
        "answer": "I am 29 years old."
    },
    {
        "question": "What is your highest level of education?",
        "answer": "Master’s in Data Science and Artificial Intelligence"
    },
    {
        "question": "What major or field of study did you pursue during your education?",
        "answer": "Master’s in Data Science and Artificial Intelligence"
    },
    {
        "question": "How many years of work experience do you have?",
        "answer": "I have over two years of work experience."
    },
    {
        "question": "What type of work or industry have you been involved in?",
        "answer": "Software Engineering"
    },
    {
        "question": "Can you describe your current role or job responsibilities?",
        "answer": "I am a Software Engineer at HotelKey, a US-based company, where I work remotely, contributing to a highly dynamic and innovative environment. My work primarily revolves around developing and ma