# Natural Language Processing- Retrieval-Augmented generation (RAG)

RAG is a technique for augmenting LLM knowledge with additional, often private or real-time, data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the knowledge of the model with the specific information it needs.

<img src="./figures/RAG-process.png" >

Introducing `AIT Chatbot`, an innovative chatbot designed to assist students  who want to know the details of AIT. 
Leveraging LangChain technology, Chatbot excels in retrieving information from documents, ensuring a seamless and efficient learning experience for students engaging with the NLP curriculum.

1. Prompt
2. Retrieval
3. Memory
4. Chain

## Import libraries & Environment Setup

In [2]:
import os
import torch
# Set GPU device
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## AIT Data Scrapping 

**Website Content Download and File Organization**

This script automates the process of downloading specific file types from a given website and subsequently organizing them into a structured directory.

**Download Process**

The script initiates by downloading PDF, HTML, and HTM files from the specified domain (`ait.ac.th`) using the `wget` command:


In [None]:
!wget --recursive --no-clobber --html-extension --convert-links --restrict-file-names=windows --domains ait.ac.th --no-parent --accept pdf,html,htm https://ait.ac.th/

Both --no-clobber and --convert-links were specified, only --convert-links will be used.
--2024-04-09 01:07:47--  https://ait.ac.th/
Resolving ait.ac.th (ait.ac.th)... 162.159.137.54, 162.159.136.54
Connecting to ait.ac.th (ait.ac.th)|162.159.137.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘ait.ac.th/index.html’

ait.ac.th/index.htm     [ <=>                ]  69.00K  --.-KB/s    in 0.006s  

2024-04-09 01:07:47 (10.4 MB/s) - ‘ait.ac.th/index.html’ saved [70657]

Loading robots.txt; please ignore errors.
--2024-04-09 01:07:47--  https://ait.ac.th/robots.txt
Reusing existing connection to ait.ac.th:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘ait.ac.th/robots.txt.tmp’

ait.ac.th/robots.tx     [ <=>                ]      67  --.-KB/s    in 0s      

2024-04-09 01:07:47 (10.6 MB/s) - ‘ait.ac.th/robots.txt.tmp’ saved [67]

Removing ait.ac.th/robots.txt.tmp.
--2024-04-09 0

In [None]:
import os

def rename_files_with_path(root_folder):
    # Define the path for the processed files
    processed_folder = os.path.join(root_folder, '0_processed_database')

    # Ensure the processed_folder exists
    if not os.path.exists(processed_folder):
        os.makedirs(processed_folder)
        print(f"Created directory: {processed_folder}")

    for root, dirs, files in os.walk(root_folder, topdown=False):
        for name in files:
            if name.endswith('.html') or name.endswith('.pdf'):
                # Generate the new filename using parent folder names
                new_name = os.path.join(root, name).replace(root_folder, '').replace(os.sep, '_')[1:]
                original_file_path = os.path.join(root, name)
                new_file_path = os.path.join(processed_folder, new_name)  # Change the path to the processed_folder

                # Rename and move the file to processed_folder
                os.rename(original_file_path, new_file_path)
                print(f"Moved and renamed '{original_file_path}' to '{new_file_path}'")

# Specify the root directory of your files here
root_folder = "./ait_database/"
rename_files_with_path(root_folder)


Moved and renamed './ait_database/0_processed_database/eople_arunya-p-s-s-dumunnage_index.html' to './ait_database/0_processed_database/_processed_database_eople_arunya-p-s-s-dumunnage_index.html'
Moved and renamed './ait_database/0_processed_database/eople_cate_meet-our-faculty_index.html' to './ait_database/0_processed_database/_processed_database_eople_cate_meet-our-faculty_index.html'
Moved and renamed './ait_database/0_processed_database/rocessed_database_cademics_calendar_index.html' to './ait_database/0_processed_database/_processed_database_rocessed_database_cademics_calendar_index.html'
Moved and renamed './ait_database/0_processed_database/eople_ms-agnes-pardilla-tayson_index.html' to './ait_database/0_processed_database/_processed_database_eople_ms-agnes-pardilla-tayson_index.html'
Moved and renamed './ait_database/0_processed_database/rocessed_database_cademics_student-opportunities_index.html' to './ait_database/0_processed_database/_processed_database_rocessed_database_ca

## 1. Prompt

A set of instructions or input provided by a user to guide the model's response, helping it understand the context and generate relevant and coherent language-based output, such as answering questions, completing sentences, or engaging in a conversation.

In [4]:
from langchain import PromptTemplate

prompt_template = """
    I'm your friendly NLP chatbot named  AIT-GPT, here to assist students with any questions they have about AIT.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    {context}
    Question: {question}
    Answer:
    """.strip()

PROMPT = PromptTemplate.from_template(
    template = prompt_template
)

PROMPT
#using str.format
#The placeholder is defined using curly brackets: {} {}

PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly NLP chatbot named  AIT-GPT, here to assist students with any questions they have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    {context}\n    Question: {question}\n    Answer:")

In [5]:
PROMPT.format(
    context = "At the Asian Institute of Technology (AIT), machine learning (ML) is explored as a pivotal field within artificial intelligence (AI), focusing on creating and improving algorithms capable of learning from and making predictions on data. This allows systems to perform complex tasks without being explicitly programmed for each one. AIT's research and coursework in ML encompass various applications, from environmental monitoring to smart cities, emphasizing the development of models that can adapt and improve over time.",
    question = "What is Machine Learning?"
)


"I'm your friendly NLP chatbot named  AIT-GPT, here to assist students with any questions they have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    At the Asian Institute of Technology (AIT), machine learning (ML) is explored as a pivotal field within artificial intelligence (AI), focusing on creating and improving algorithms capable of learning from and making predictions on data. This allows systems to perform complex tasks without being explicitly programmed for each one. AIT's research and coursework in ML encompass various applications, from environmental monitoring to smart cities, emphasizing the development of models that can adapt and improve over time.\n    Question: What is Machine Learning?\n    Answer:"

Note : [How to improve prompting (Zero-shot, Few-shot, Chain-of-Thought, etc.](https://github.com/chaklam-silpasuwanchai/Natural-Language-Processing/blob/main/Code/05%20-%20RAG/advance/cot-tot-prompting.ipynb)

## 2. Retrieval

1. `Document loaders` : Load documents from many different sources (HTML, PDF, code).
2. `Document transformers` : One of the essential steps in document retrieval is breaking down a large document into smaller, relevant chunks to enhance the retrieval process.
3. `Text embedding models` : Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of text that are similar.
4. `Vector stores`: there has emerged a need for databases to support efficient storage and searching of these embeddings.
5. `Retrievers` : Once the data is in the database, you still need to retrieve it.

### 2.1 Document Loaders
Use document loaders to load data from a source as Document's. A Document is a piece of text and associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

[PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)

[Download Document](https://web.stanford.edu/~jurafsky/slp3/)

In [6]:
from langchain.document_loaders import PyMuPDFLoader
import os

# Directory containing your PDF (and possibly HTML) documents
docs_folder = './ait_database/0_processed_database/'
# List to hold loaded documents
documents = []

# Iterate over files in the specified directory
for filename in os.listdir(docs_folder):
    print(filename)
    if filename.endswith('.pdf'):  # Check if the file is a PDF
        file_path = os.path.join(docs_folder, filename)  # Full path to the file
        print(f"Loading {filename}...")

        loader = PyMuPDFLoader(file_path)  # Create a loader instance for the current file
        doc_content = loader.load()  # Load the document content
        documents.append(doc_content)  # Append the loaded content to the documents list


_processed_database_p-content_uploads_2021_12_AIT-BROCHURE.pdf
Loading _processed_database_p-content_uploads_2021_12_AIT-BROCHURE.pdf...
_processed_database_p-content_uploads_2021_12_Flexible-Master-Option-2021.pdf
Loading _processed_database_p-content_uploads_2021_12_Flexible-Master-Option-2021.pdf...
Student-Handbook_August-2023-Semester_FINAL_as-of-8-Aug-2023.pdf
Loading Student-Handbook_August-2023-Semester_FINAL_as-of-8-Aug-2023.pdf...


In [7]:
# Just after loading one document
print(type(doc_content))
if isinstance(doc_content, list):
    print("It's a list of something.")
    # Optionally, check the first element to see what's inside the list
    if doc_content:  # Ensure the list is not empty
        print(type(doc_content[0]))
else:
    print("Check the structure of doc_content for details.")


<class 'list'>
It's a list of something.
<class 'langchain_core.documents.base.Document'>


In [8]:
print(f"Type of loaded document content: {type(doc_content)}")

if isinstance(doc_content, list):
    print(f"Document is a list with {len(doc_content)} elements.")
    if doc_content:  # Ensure the list is not empty
        print(f"Type of first element in the list: {type(doc_content[0])}")
elif isinstance(doc_content, dict):
    print("Document is a dictionary.")
    for key, value in doc_content.items():
        print(f"Key: {key}, Type of value: {type(value)}")
        # If you expect nested structures, you can add more checks here
else:
    print("Document content has a different structure or type.")


Type of loaded document content: <class 'list'>
Document is a list with 95 elements.
Type of first element in the list: <class 'langchain_core.documents.base.Document'>


In [9]:
len(documents)

3

In [10]:
documents[1]

[Document(page_content='DEPEARTMENT OF CIVIL AND INFRASTRUCTURE \nENGINEERING\n• Construction Engineering and Infrastructure    \nMangement\n• Geotechnical and Earth Resources Engineering\n• Water Engineering and Management \n• Transportation Engineering\n• Disaster Preparedness, Mitigation and        \n   Management\n• Structural Engineering\nDEPARTMENT OF INDUSTRIAL SYSTEMS ENGINEERING\n• Industrial and Manufacturing Engineering\n• IoT Systems Engineering\nDE\nDEPARTMENT OF INFORMATION AND COMMUNICATIONS \nTECHNOLOGIES\n• Computer Science\n• Information Management\n• Telecommunications\n• Information and Communication Technologies\n• Data Science and AI\n• Remote Sensing and Geographic Information System\nDEPARTMENT OF FOOD, AGRICULTURE AND BIORESOURCES\n•   Agribusiness Management (ABM)\n•   Agricultural Systems and Engineering (ASE)\n•   Food Innovation, Nutrition and Health (FINH)\nDEPARTMENT OF DEVELOPMENT AND SUSTAINABILITY\n• Natural Resources Management (NRM)\n• D\n• Developme

### 2.2 Document Transformers

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from itertools import chain

# Flatten the list of lists into a single list of Document objects
flat_documents = list(chain.from_iterable(documents))

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=100
)

# Split the text of each Document object
# Ensure that `flat_documents` contains Document objects that the text_splitter can process
doc = text_splitter.split_documents(flat_documents)

In [12]:
doc[1]

Document(page_content='Regional Resource Center for Asia \nand Pacific \n--Advancement of Region’s environment \nand sustainable development goals \nthrough research and capacity building\nBelt & Road Research Center \n--Research and studies in sustainability \nissues of the Belt and Road region\nInternet Education and Research \nLaboratory \n--Development, training, and education \nprograms related to internet development \nand IT topics\nAIT Artificial Intelligence\nTechnology Center\n- - R e s e a r c h  a n d  d eve l o p m e n t  t o \nincorporate AI into AIT programs and its \napplication to real-world problems \nYunus Center at AIT\n- - E nte r p r i s e  s o l u t i o n s fo r  S D G s &', metadata={'source': './ait_database/0_processed_database/_processed_database_p-content_uploads_2021_12_AIT-BROCHURE.pdf', 'file_path': './ait_database/0_processed_database/_processed_database_p-content_uploads_2021_12_AIT-BROCHURE.pdf', 'page': 0, 'total_pages': 2, 'format': 'PDF 1.4', 'title

In [13]:
len(doc)

404

### 2.3 Text Embedding Models
Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.

*Note* Instructor Model : [Huggingface](gingface.co/hkunlp/instructor-base) | [Paper](https://arxiv.org/abs/2212.09741)

In [14]:
import torch
from langchain.embeddings import HuggingFaceInstructEmbeddings

model_name = 'hkunlp/instructor-base'

embedding_model = HuggingFaceInstructEmbeddings(
    model_name = model_name,
    model_kwargs = {"device" : device}
)

  from tqdm.autonotebook import trange
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


load INSTRUCTOR_Transformer


  _torch_pytree._register_pytree_node(


max_seq_length  512


### 2.4 Vector Stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.

In [15]:
#locate vectorstore
vector_path = './vector-storage'
if not os.path.exists(vector_path):
    os.makedirs(vector_path)
    print('create path done')

In [16]:
#save vector locally
from langchain.vectorstores import FAISS

vectordb = FAISS.from_documents(
    documents = doc,
    embedding = embedding_model
)

db_file_name = 'nlp_stanford'

vectordb.save_local(
    folder_path = os.path.join(vector_path, db_file_name),
    index_name = 'nlp' #default index
)

### 2.5 retrievers
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.

In [17]:
from langchain.vectorstores import FAISS

vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings = embedding_model,
    index_name = 'nlp' #default index
)

In [18]:
if vectordb is None:
    print("Failed to load vectordb. Check the paths and configurations.")
else:
    print("vectordb loaded successfully.")


vectordb loaded successfully.


In [19]:
#ready to use
retriever = vectordb.as_retriever()

In [20]:
print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>


In [21]:
# Assuming `vectordb` has been successfully loaded and can act as a retriever
# And assuming `retriever` is correctly instantiated from `vectordb`

query = " AIT is a leading international higher learning institute of what?"

# Use the retriever to find documents relevant to your query
relevant_documents = retriever.get_relevant_documents(query)
# Process and print the results
for i, doc in enumerate(relevant_documents):
    print(i+1, doc)  # Depending on your setup, you might need to adjust how results are displayed


1 page_content='a self-contained international community with a cosmopolitan approach to living and \nlearning. \n \nSince 1959, AIT has carried out its mission “to develop highly qualified and committed \nprofessionals who play a leading role in the region’s sustainable development and its \nintegration into the global economy” by supporting technological change and \nsustainable development through higher learning, research, capacity building and \noutreach. \n \nAIT’s renowned degree programs are administered by its School of Engineering and \nTechnology; School of Environment, Resources and Development; and School of \nManagement. Students benefit from challenging academic programs and exciting' metadata={'source': './ait_database/0_processed_database/Student-Handbook_August-2023-Semester_FINAL_as-of-8-Aug-2023.pdf', 'file_path': './ait_database/0_processed_database/Student-Handbook_August-2023-Semester_FINAL_as-of-8-Aug-2023.pdf', 'page': 2, 'total_pages': 95, 'format': 'PDF 1.4',

## 3. Memory

One of the core utility classes underpinning most (if not all) memory modules is the ChatMessageHistory class. This is a super lightweight wrapper that provides convenience methods for saving HumanMessages, AIMessages, and then fetching them all.

You may want to use this class directly if you are managing memory outside of a chain.


In [23]:
from langchain.memory import ChatMessageHistory

history = ChatMessageHistory()
history

ChatMessageHistory(messages=[])

In [24]:
history.add_user_message('hi')
history.add_ai_message('Whats up?')
history.add_user_message('How are you')
history.add_ai_message('I\'m quite good. How about you?')

In [25]:
history

ChatMessageHistory(messages=[HumanMessage(content='hi'), AIMessage(content='Whats up?'), HumanMessage(content='How are you'), AIMessage(content="I'm quite good. How about you?")])


### 3.1 Memory types

There are many different types of memory. Each has their own parameters, their own return types, and is useful in different scenarios.
- Converstaion Buffer
- Converstaion Buffer Window

What variables get returned from memory

Before going into the chain, various variables are read from memory. These have specific names which need to align with the variables the chain expects. You can see what these variables are by calling memory.load_memory_variables({}). Note that the empty dictionary that we pass in is just a placeholder for real variables. If the memory type you are using is dependent upon the input variables, you may need to pass some in.

In this case, you can see that load_memory_variables returns a single key, history. This means that your chain (and likely your prompt) should expect an input named history. You can usually control this variable through parameters on the memory class. For example, if you want the memory variables to be returned in the key chat_history you can do:

#### Converstaion Buffer
This memory allows for storing messages and then extracts the messages in a variable.

In [26]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: hi\nAI: What's up?\nHuman: How are you?\nAI: I'm quite good. How about you?"}

In [27]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages = True)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': [HumanMessage(content='hi'),
  AIMessage(content="What's up?"),
  HumanMessage(content='How are you?'),
  AIMessage(content="I'm quite good. How about you?")]}

#### Conversation Buffer Window
- it keeps a list of the interactions of the conversation over time.
- it only uses the last K interactions.
- it can be useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large.

In [28]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=1)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: How are you?\nAI: I'm quite good. How about you?"}

In [29]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=2)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: hi\nAI: What's up?\nHuman: How are you?\nAI: I'm quite good. How about you?"}

## 4. Chain

Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

An `LLMChain` is a simple chain that adds some functionality around language models.
- it consists of a `PromptTemplate` and a `LM` (either an LLM or chat model).
- it formats the prompt template using the input key values provided (and also memory key values, if available),
- it passes the formatted string to LLM and returns the LLM output.

Note : [Download Fastchat Model Here](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)

Python command that sets an environment variable called TOKENIZERS_PARALLELISM to the string value 'true'. This environment variable is specifically used to control the parallelism behavior of tokenizers in the Hugging Face Transformers library.

In [3]:
os.environ["TOKENIZERS_PARALLELISM"] = 'true'

In [None]:
# %cd ./models/

In [None]:
# !git clone https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

In [None]:
# %cd ..

In [None]:
model_id = './models/fastchat-t5-3b-v1.0/'
print(type(model_id), model_id)

<class 'str'> ./models/fastchat-t5-3b-v1.0/


In [None]:
# Trying to open a file in the model directory to test path validity
try:
    with open('./models/fastchat-t5-3b-v1.0/config.json', 'r') as f:
        print(f.read())
except Exception as e:
    print(e)


{
  "_name_or_path": "flant5_3b_fp16",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
  

In [None]:
# !pip install protobuf

In [None]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from transformers import BitsAndBytesConfig
from langchain import HuggingFacePipeline
import torch

tokenizer = AutoTokenizer.from_pretrained(
    model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id

bitsandbyte_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.float16,
    bnb_4bit_use_double_quant = True
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config = bitsandbyte_config, #caution Nvidia
    device_map = 'auto',
    load_in_8bit = True
)

pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens = 256,
    model_kwargs = {
        "temperature" : 0,
        "repetition_penalty": 1.5
    }
)

llm = HuggingFacePipeline(pipeline = pipe)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### [Class ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/_modules/langchain/chains/conversational_retrieval/base.html#ConversationalRetrievalChain)

- `retriever` : Retriever to use to fetch documents.

- `combine_docs_chain` : The chain used to combine any retrieved documents.

- `question_generator`: The chain used to generate a new question for the sake of retrieval. This chain will take in the current question (with variable question) and any chat history (with variable chat_history) and will produce a new standalone question to be used later on.

- `return_source_documents` : Return the retrieved source documents as part of the final result.

- `get_chat_history` : An optional function to get a string of the chat history. If None is provided, will use a default.

- `return_generated_question` : Return the generated question as part of the final result.

- `response_if_no_docs_found` : If specified, the chain will return a fixed response if no docs are found for the question.


`question_generator`

In [None]:
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain

In [None]:
CONDENSE_QUESTION_PROMPT

PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')

In [None]:
question_generator = LLMChain(
    llm = llm,
    prompt = CONDENSE_QUESTION_PROMPT,
    verbose = True
)

In [None]:
query = query = " AIT is a leading international higher learning institute of what?"
chat_history = "Human: What is sustainable development?\nAI: ]nHuman: What is technological change?]]nAI: "

question_generator({'chat_history' : chat_history, "question" : query})

  warn_deprecated(




[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
Human: What is sustainable development?
AI: ]nHuman: What is technological change?]]nAI: 
Follow Up Input:  AIT is a leading international higher learning institute of what?
Standalone question:[0m

[1m> Finished chain.[0m


{'chat_history': 'Human: What is sustainable development?\nAI: ]nHuman: What is technological change?]]nAI: ',
 'question': ' AIT is a leading international higher learning institute of what?',
 'text': '<pad> What  is  the  name  of  the  leading  international  higher  learning  institute  of  technology?\n'}

`combine_docs_chain`

In [None]:
doc_chain = load_qa_chain(
    llm = llm,
    chain_type = 'stuff',
    prompt = PROMPT,
    verbose = True
)
doc_chain

StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly NLP chatbot named  AIT-GPT, here to assist students with any questions they have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x7ee1e0c47a50>)), document_variable_name='context')

In [None]:
query = "What is the policy on academic misconduct at AIT?"
input_document = retriever.get_relevant_documents(query)

doc_chain({'input_documents':input_document, 'question':query})



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mI'm your friendly NLP chatbot named  AIT-GPT, here to assist students with any questions they have about AIT.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    III. STUDENT CODE OF CONDUCT 
 
 
Students at AIT are expected to meet the highest standards of personal, ethical and 
moral conduct. Good conduct and academic honesty are fundamental to the mission of 
AIT as an institution devoted to the pursuit of excellence in education and research, 
and to the service of the region and society. 
 
Student misconduct includes academic misconduct and also encompasses conduct 
which impairs the reasonable freedom of other persons to pursue their studies or 
research or to participate in the life of the Institute. 
 
It is important that all students are familiar with the rules under which they attend the

Statem

{'input_documents': [Document(page_content='III. STUDENT CODE OF CONDUCT \n \n \nStudents at AIT are expected to meet the highest standards of personal, ethical and \nmoral conduct. Good conduct and academic honesty are fundamental to the mission of \nAIT as an institution devoted to the pursuit of excellence in education and research, \nand to the service of the region and society. \n \nStudent misconduct includes academic misconduct and also encompasses conduct \nwhich impairs the reasonable freedom of other persons to pursue their studies or \nresearch or to participate in the life of the Institute. \n \nIt is important that all students are familiar with the rules under which they attend the', metadata={'source': './ait_database/0_processed_database/Student-Handbook_August-2023-Semester_FINAL_as-of-8-Aug-2023.pdf', 'file_path': './ait_database/0_processed_database/Student-Handbook_August-2023-Semester_FINAL_as-of-8-Aug-2023.pdf', 'page': 6, 'total_pages': 95, 'format': 'PDF 1.4', '

In [46]:
memory = ConversationBufferWindowMemory(
    k=3,
    memory_key = "chat_history",
    return_messages = True,
    output_key = 'answer'
)

chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True,
    memory=memory,
    verbose=True,
    get_chat_history=lambda h : h
)
chain

ConversationalRetrievalChain(memory=ConversationBufferWindowMemory(output_key='answer', return_messages=True, memory_key='chat_history', k=3), verbose=True, combine_docs_chain=StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly NLP chatbot named  AIT-GPT, here to assist students with any questions they have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x7ee1e0c47a50>)), document_variable_name='context'), question_generator=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its orig

## 5. Chatbot

In [47]:
prompt_question =  "What types of academic programs does AIT offer?"
answer = chain({"question":prompt_question})
answer



[1m> Entering new ConversationalRetrievalChain chain...[0m


[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mI'm your friendly NLP chatbot named  AIT-GPT, here to assist students with any questions they have about AIT.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    a self-contained international community with a cosmopolitan approach to living and 
learning. 
 
Since 1959, AIT has carried out its mission “to develop highly qualified and committed 
professionals who play a leading role in the region’s sustainable development and its 
integration into the global economy” by supporting technological change and 
sustainable development through higher learning, research, capacity building and 
outreach. 
 
AIT’s renowned degree programs are administered by its School of Engineering and 
Technology; School of Environment, Resources and Development; and 

{'question': 'What types of academic programs does AIT offer?',
 'chat_history': [],
 'answer': '<pad>  AIT  offers  a  variety  of  academic  programs,  including:\n 1.  Bachelor  of  Science  (BS)  in  Engineering  and  Technology\n 2.  Bachelor  of  Science  (BS)  in  Environment,  Resources  and  Development\n 3.  Bachelor  of  Science  (BS)  in  Management\n 4.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 5.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 6.  Master  of  Science  (MS)  in  Management\n 7.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 8.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 9.  Master  of  Science  (MS)  in  Management\n 10.  Master  of  Science  (MS)  in  Engineering  and  Technology\n',
 'source_documents': [Document(page_content='a self-contained international community with a cosmopolitan approach to living and \nlearning. \n \nSince 1959, AIT has carried out i

In [48]:
prompt_question =  "How does AIT support its students' welfare?"
answer = chain({"question":prompt_question})
answer



[1m> Entering new ConversationalRetrievalChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What types of academic programs does AIT offer?'), AIMessage(content='<pad>  AIT  offers  a  variety  of  academic  programs,  including:\n 1.  Bachelor  of  Science  (BS)  in  Engineering  and  Technology\n 2.  Bachelor  of  Science  (BS)  in  Environment,  Resources  and  Development\n 3.  Bachelor  of  Science  (BS)  in  Management\n 4.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 5.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 6.  Master  of  Science  (MS)  in  Management\n 7.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 8.  Master  of  Science  (MS)  in  Environment,  Resources  and  Developme

{'question': "How does AIT support its students' welfare?",
 'chat_history': [HumanMessage(content='What types of academic programs does AIT offer?'),
  AIMessage(content='<pad>  AIT  offers  a  variety  of  academic  programs,  including:\n 1.  Bachelor  of  Science  (BS)  in  Engineering  and  Technology\n 2.  Bachelor  of  Science  (BS)  in  Environment,  Resources  and  Development\n 3.  Bachelor  of  Science  (BS)  in  Management\n 4.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 5.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 6.  Master  of  Science  (MS)  in  Management\n 7.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 8.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 9.  Master  of  Science  (MS)  in  Management\n 10.  Master  of  Science  (MS)  in  Engineering  and  Technology\n')],
 'answer': "<pad>  AIT  supports  its  students'  welfare  through  the  Student  Welfare  Unit  and  

In [49]:
prompt_question = "What facilities and services does AIT provide to its students?"
answer = chain({"question":prompt_question})
answer



[1m> Entering new ConversationalRetrievalChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What types of academic programs does AIT offer?'), AIMessage(content='<pad>  AIT  offers  a  variety  of  academic  programs,  including:\n 1.  Bachelor  of  Science  (BS)  in  Engineering  and  Technology\n 2.  Bachelor  of  Science  (BS)  in  Environment,  Resources  and  Development\n 3.  Bachelor  of  Science  (BS)  in  Management\n 4.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 5.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 6.  Master  of  Science  (MS)  in  Management\n 7.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 8.  Master  of  Science  (MS)  in  Environment,  Resources  and  Developme

{'question': 'What facilities and services does AIT provide to its students?',
 'chat_history': [HumanMessage(content='What types of academic programs does AIT offer?'),
  AIMessage(content='<pad>  AIT  offers  a  variety  of  academic  programs,  including:\n 1.  Bachelor  of  Science  (BS)  in  Engineering  and  Technology\n 2.  Bachelor  of  Science  (BS)  in  Environment,  Resources  and  Development\n 3.  Bachelor  of  Science  (BS)  in  Management\n 4.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 5.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 6.  Master  of  Science  (MS)  in  Management\n 7.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 8.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 9.  Master  of  Science  (MS)  in  Management\n 10.  Master  of  Science  (MS)  in  Engineering  and  Technology\n'),
  HumanMessage(content="How does AIT support its students' welfare?"),
  AIMessage(c

In [50]:
prompt_question =  "What opportunities does AIT offer for student employment and career counseling?"
answer = chain({"question":prompt_question})
answer



[1m> Entering new ConversationalRetrievalChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What types of academic programs does AIT offer?'), AIMessage(content='<pad>  AIT  offers  a  variety  of  academic  programs,  including:\n 1.  Bachelor  of  Science  (BS)  in  Engineering  and  Technology\n 2.  Bachelor  of  Science  (BS)  in  Environment,  Resources  and  Development\n 3.  Bachelor  of  Science  (BS)  in  Management\n 4.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 5.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 6.  Master  of  Science  (MS)  in  Management\n 7.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 8.  Master  of  Science  (MS)  in  Environment,  Resources  and  Developme

{'question': 'What opportunities does AIT offer for student employment and career counseling?',
 'chat_history': [HumanMessage(content='What types of academic programs does AIT offer?'),
  AIMessage(content='<pad>  AIT  offers  a  variety  of  academic  programs,  including:\n 1.  Bachelor  of  Science  (BS)  in  Engineering  and  Technology\n 2.  Bachelor  of  Science  (BS)  in  Environment,  Resources  and  Development\n 3.  Bachelor  of  Science  (BS)  in  Management\n 4.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 5.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 6.  Master  of  Science  (MS)  in  Management\n 7.  Master  of  Science  (MS)  in  Engineering  and  Technology\n 8.  Master  of  Science  (MS)  in  Environment,  Resources  and  Development\n 9.  Master  of  Science  (MS)  in  Management\n 10.  Master  of  Science  (MS)  in  Engineering  and  Technology\n'),
  HumanMessage(content="How does AIT support its students' welfare?