## Run an interactive chatbot locally using Langchain

This tutorial is heavily inspired by the online course [LangChain Chat with your data](https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/1/introduction) at Deeplearning.ai.
However, I added many of my own findings and investigations for clarity and better understanding.

We will be using a technique called Retrieval Augmented Generation.
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

To achieve that, we will rely heavily on Langchain and many of its libraries.
Last but not least, all models and generated vectors are stored and run locally.

**Disclaimer**: Although all models and files remain local in this tutorial, this is not guaranteed, and the author accepts no responsibility for data leaks resulting from the use of this code. Avoid using sensitive or classified information as part of this tutorial.

### Dependencies
We will be installing dependencies as we go, so no worries on that one.
This notebook was fully tested under WSL2. If you are using another platform, your mileage may vary.
Please keep in mind that this tutorial aims to load the models on the GPU for higher performance, but it is easily adaptable to run on the CPU instead.

![RAG.jpg](./assets/RAG.jpg)

## Document Loading

The ffmpeg library might be needed for the Speech-to-Text models to run. Install it in the host system.

In [None]:
! pip install langchain torch
#! apt install ffmpeg

Basic imports and a helper function to show retrieved documents cleanly

In [None]:
import os
import sys
import torch

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content +"\n" + str(d.metadata) for i, d in enumerate(docs)]))


### Loading PDFs
The first file we load is the Technical Summary of the IPCC Report.

Install dependency to load pdf files

In [None]:
! pip install pypdf 

In [None]:
docs = [] # Will be used to store all our loaded documents

# PDFs
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("docs/pdfs/IPCC_AR6_WGII_TechnicalSummary.pdf")
pages = loader.load()
docs.extend(pages)
print(len(docs))
print(pages[0].page_content[0:500])
print(pages[0].metadata)

### Load Youtube video transcripts

In the section below we are going to show how to download and transcribe a Youtube video using OpenAI's Whisper-Medium model.
The first time you execute this, it will download the model to your system. Transcription will run on your GPU, but you can change to CPU if you want (for compatibility reasons). Keep in mind it will take a few minutes. The transcription will then be loaded into the docs.

**Notice**: Youtube changes its code very often, so the library we are using here (yt_dlp) might not work in the future. Please always make sure you have the latest version installed. If you run into an error, it is safe to continue with the tutorial without this step.

In [None]:
! pip install yt_dlp
! pip install pydub
! pip install transformers
! pip install librosa

In [None]:
# Youtube
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.audio import (
    OpenAIWhisperParserLocal,
)
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url="https://www.youtube.com/watch?v=aywZrzNaKjs"
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParserLocal(device="gpu",lang_model="openai/whisper-medium")
)
pages = loader.load()
docs.extend(pages)
print(len(docs))
print(pages[0].page_content[0:500])
print(pages[0].metadata)

### Loading URLs

In the below section we are showing how to load URLs. As before, the loaded pages are added to the docs.

In [None]:
# URLs
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://python.langchain.com/docs/expression_language/get_started")
pages = loader.load()
docs.extend(pages)
print(len(docs))
print(pages[0].page_content[0:100])
print(pages[0].metadata)

## Document Splitting

In this section we are going to show how to split the documents into chunks that we can use for creating Embeddings later. You can see here that we are using a chunk_size of 1500 characters and an overlap of 150 characters. The overlap repeats the end of one chunk at the start of the next, which helps preserve continuity and context across chunk boundaries later on.
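To make the effect of the overlap concrete, here is a minimal sketch of a fixed-size sliding window splitter in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that class also tries to split on separators such as paragraphs and sentences). The function name `chunk_with_overlap` is made up for this illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous one, which is the same idea as the 150 character overlap used in this notebook.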

In [None]:

# Example
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 3,
    chunk_overlap = 1
)

text1 = 'abcdefghijklmnopqrstuvwxyz'
splits = text_splitter.split_text(text1)
print(splits)

# Our Doc Splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)
print(len(splits))

## Creating Vector stores and Embeddings

Let's take our splits and embed them. Embedding is the process of creating vectors using deep learning. An "embedding" is the output of this process — in other words, the vector that is created by a deep learning model for the purpose of similarity searches by that model.
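As a rough illustration of what "similarity search" over embeddings means, the sketch below ranks a few texts by cosine similarity. The three-dimensional vectors are invented for this example (real BGE-small embeddings have a few hundred dimensions), and `cosine_similarity` is a helper written here, not a Langchain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; a real model would produce these vectors
toy_vectors = {
    "i like dogs": [0.9, 0.1, 0.0],
    "puppies are great": [0.8, 0.2, 0.1],
    "stock markets fell": [0.0, 0.1, 0.9],
}

query = toy_vectors["i like dogs"]
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query, toy_vectors[text]),
    reverse=True,
)
print(ranked)
```

Texts with a related meaning end up with vectors pointing in a similar direction, so they rank higher than the unrelated one.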

We are going to use the BGE-small model. You can try other models as well if you want to experiment.
The first time you run this command it will download the model from Hugging Face.

In [None]:
! pip install sentence-transformers

In [None]:
# Create embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(
                    model_name="BAAI/bge-small-en-v1.5",
                    cache_folder="./models/",
                )

embedding1 = embedding.embed_query("i like dogs")
print(embedding1[0:10])

Next we will store those as vectors using Chromadb. The database is stored in a local folder called 'db'. At every execution, we clean the database and start from an empty one.

**Notice**: In case this step fails in creating the db or importing chromadb, you might need to restart the notebook kernel.

In [None]:
! pip install chromadb

In [None]:
from langchain.vectorstores import Chroma

persist_directory = 'db'

import shutil
# remove old database files if any
try:
    shutil.rmtree(persist_directory)
except Exception as e:
    print(e)
finally:
    os.makedirs(persist_directory)

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()
print(vectordb._collection.count())

### Retrieval Techniques

In this step we are going to use some retrieval techniques to find similar or relevant information based on a question. There are many types of searches. We are going to see 3 of them in this example.

#### Similarity search

Similarity search will return... similar vectors to your question! We configured it here to return 3 results. More information can be found [in Langchain's documentation](https://python.langchain.com/docs/integrations/vectorstores/chroma).

#### Max Marginal Relevance search

This type of search behaves similarly to the Similarity search, but it filters down the results to maximize the "new information" of the overall retrieved documents. In this example we are fetching 4 documents and choosing the 3 that are the most dissimilar to each other, thus avoiding duplicate information that might be stored in different places of the ingested files/sources.
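The selection step can be sketched as a small greedy loop. This is a simplified illustration of the general MMR idea, not Langchain's or Chroma's actual implementation. The 2-D vectors, the labels and the `lambda_mult` value of 0.5 are assumptions made for the example; the candidates play the role of the `fetch_k` documents.

```python
import math


def unit(angle_deg):
    """Toy 2-D 'embedding': a unit vector pointing at the given angle."""
    rad = math.radians(angle_deg)
    return [math.cos(rad), math.sin(rad)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Greedily pick k labels, trading relevance against redundancy."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def mmr_score(label):
            relevance = cosine(query_vec, candidates[label])
            # Similarity to the closest document that was already picked
            redundancy = max(
                (cosine(candidates[label], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


query_vec = unit(0)
candidates = {
    "risks": unit(10),
    "risks_copy": unit(12),   # near-duplicate of "risks"
    "adaptation": unit(-40),  # relevant, but different content
    "unrelated": unit(90),
}

by_similarity = sorted(
    candidates, key=lambda c: cosine(query_vec, candidates[c]), reverse=True
)[:2]
by_mmr = mmr_select(query_vec, candidates, k=2)
print(by_similarity)
print(by_mmr)
```

Plain similarity returns the near-duplicate pair, while the MMR loop skips the duplicate in favor of a chunk that adds new information.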

#### Similarity Score search

Similar to the similarity search, but it also returns the score of each document (lower is better). By default ChromaDb uses [L2 Distance](https://docs.trychroma.com/usage-guide#changing-the-distance-function) to score the documents. When defining a retriever we can also add the "score_threshold" we want in the "search_kwargs". In our case we set it to 0.5. Keep in mind that the retriever applies this threshold to a relevance score normalized between 0 and 1, where higher is better, rather than to the raw distance.
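To see why "lower is better" for a distance score, here is a minimal sketch that ranks a few stored vectors by plain Euclidean (L2) distance to a query and drops the ones that are too far away. The vectors, the names and the `max_distance` cut-off are invented for illustration, and the exact formula used by the vector store may differ.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [0.0, 0.0]
stored = {
    "close": [0.1, 0.0],
    "medium": [0.3, 0.4],
    "far": [3.0, 4.0],
}

# Sort ascending: the smallest distance is the best match
scored = sorted((l2_distance(query, vec), name) for name, vec in stored.items())

max_distance = 1.0  # hypothetical cut-off, chosen only for this example
kept = [name for dist, name in scored if dist <= max_distance]
print(scored)
print(kept)
```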

In [None]:
question = "Why should I use Langchain?"

# Test similarity search
print("\nSIMILARITY SEARCH")
docs_sim = vectordb.similarity_search(question,k=3)
print(len(docs_sim))
pretty_print_docs(docs_sim)

question = "What are some major risks highlighted in the IPCC report?"

# Test similarity search
print("\nSIMILARITY SEARCH")
docs_sim = vectordb.similarity_search(question,k=3)
print(len(docs_sim))
pretty_print_docs(docs_sim)

# Test retrieval via mmr
print("\nMMR SEARCH")
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3, fetch_k=4)
print(len(docs_mmr))
print(docs_mmr)
pretty_print_docs(docs_mmr)

# Test retrieval via similarity score
print("\nSIMILARITY SCORE SEARCH")
docs_sim_score_tuple = vectordb.similarity_search_with_score(question, k=3)
# print(docs_sim_score_tuple)
scores = [d[1] for d in docs_sim_score_tuple]
print(scores) # The lower the better
docs_sim_score = [d[0] for d in docs_sim_score_tuple]
pretty_print_docs(docs_sim_score)

# Create and Test retrievers

print("\nMMR SEARCH")
retriever_mmr = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 3, "fetch_k": 4})
# Use a separate name so the `docs` list of loaded documents is not overwritten
retrieved_mmr = retriever_mmr.get_relevant_documents(question)
pretty_print_docs(retrieved_mmr)
assert retrieved_mmr == docs_mmr

print("\nSIMILARITY SCORE SEARCH")
retriever_score = vectordb.as_retriever(search_type="similarity_score_threshold", search_kwargs={"k": 3, "score_threshold": 0.5})
retrieved_score = retriever_score.get_relevant_documents(question)
pretty_print_docs(retrieved_score)
# This holds only if all k results pass the score threshold
assert retrieved_score == docs_sim_score


#### Retrieval via LLMs (compression)

In this section we will see one more way to retrieve documents from our Vector database and compare it with the MMR search of the previous section and the Similarity score. In order to pull more context out of the ingested documents, we can assign an LLM to "summarize"/"compress" the results. We are going to use the open source Mistral Instruct model as a compressor retriever, and we will use it until the end of the tutorial.

We set the context for the model we are using to the maximum (32768) and temperature to zero for repeatable and accurate compression.
Note that the compression_retriever requires a base_retriever and we are using the retriever_score (Similarity Score search) we created earlier. Also, since we are running an LLM invocation for each retrieved document, this retriever makes chatting with the LLM later on noticeably slower.
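The flow of a compression retriever can be sketched without any model. In the illustration below, a simple keyword filter stands in for the LLM call, so the output quality is not representative. The point is only the structure: one extraction call per retrieved document, and documents with nothing relevant are dropped. All function names here are hypothetical.

```python
import re


def extract_relevant(question, document):
    """Stand-in for the LLM: keep sentences sharing a keyword with the question."""
    keywords = {w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        s for s in sentences
        if keywords & set(re.findall(r"[a-z]+", s.lower()))
    ]
    return " ".join(kept)


def compress(question, documents):
    compressed = []
    for doc in documents:  # one extractor call per retrieved document
        extracted = extract_relevant(question, doc)
        if extracted:  # drop documents that contain nothing relevant
            compressed.append(extracted)
    return compressed


retrieved = [
    "Coastal flooding is a major risk. The report was published in 2022.",
    "The authors thank the reviewers.",
]
compressed = compress("What are the major risks?", retrieved)
print(compressed)
```

Because the real extractor is an LLM call, the cost grows with the number of documents returned by the base retriever.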

With this command you can install the cuda-enabled llama-cpp-python library. This command might differ depending on your system (Windows, Linux, MacOS). This one is needed to run llama.cpp on the GPU on Linux/WSL. More information in the [llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/en/latest/).

**Notice**: In case you have issues installing llama-cpp-python, refer to [this issue](https://github.com/oobabooga/text-generation-webui/issues/1534).

In [None]:
! CMAKE_ARGS='-DLLAMA_CUBLAS=on' pip install --force-reinstall --no-cache-dir llama-cpp-python

In [None]:
! pip install wget

In [None]:
from langchain.llms import LlamaCpp
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
from os.path import expanduser
import wget

# Compression retriever via llm

model_url = "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf?download=true"
model_path = "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
if not os.path.isfile(model_path):
    print("Downloading model")
    wget.download(model_url, model_path)
else:
    print("Model already downloaded")

# Load the quantized Mistral Instruct model (GGUF file) through llama.cpp
compress_llm = LlamaCpp(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    streaming=False,
    n_ctx=32768,
    temperature=0,
)
compressor = LLMChainExtractor.from_llm(compress_llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever_score
)
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

### Q&A

In this section we will start asking questions to our model while providing it with context. We will also set up a prompt template that is instructing our model with what we expect. In the last part we will use a Chat chain to maintain the chat history as part of the context of the model.

#### Loading the Chat Model with the Llama 2 chat wrapper

We are going to reuse the same Mistral 7B Instruct model through LlamaCpp and wrap it with the Llama2Chat wrapper, which was built to augment Llama-2 LLMs with support for the [Llama-2 chat prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2). More examples in the [Langchain Docs](https://python.langchain.com/docs/integrations/chat/llama2_chat).

In [None]:
! pip install langchain_experimental

In [None]:
# Use a chat model to retrieve an answer to a question
model_name = "./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
model_path = expanduser(model_name)

llm_model = LlamaCpp(
    model_path=model_path,
    streaming=False,
    n_ctx=4096,
    temperature=0,
)

from langchain_experimental.chat_models import Llama2Chat

llm = Llama2Chat(llm=llm_model)
llm.invoke("Hello world!")

#### Single answer retrieval

In this step we are showing how to retrieve an answer without chat history. We are using a prompt template that specifies where the {context} and {question} should go. Please note that we are using the compression_retriever we defined earlier as the chain retriever. Feel free to experiment with others too.
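Before looking at the chain, it may help to see what filling the template amounts to. The sketch below uses plain `str.format` on a shortened version of the template. Joining the retrieved chunks with a blank line is an assumption made for this illustration, and `build_prompt` is a made-up helper, not part of Langchain.

```python
template = (
    "[INST] <<SYS>>\n"
    "Use the following pieces of context to answer the question at the end.\n"
    "Context:\n"
    "{context}\n"
    "<</SYS>>\n"
    "Question:\n"
    "{question} [/INST]"
)


def build_prompt(chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


prompt = build_prompt(["Chunk one.", "Chunk two."], "What is in the chunks?")
print(prompt)
```

The model never sees the vector database itself, only this final string with the retrieved text pasted in.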

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

template_no_hist = """[INST] <<SYS>>
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
Context:
{context}
<</SYS>> 
Question:
{question} [/INST] 
"""
A_CHAIN_PROMPT = PromptTemplate.from_template(template_no_hist)

a_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=compression_retriever,
    return_source_documents=True,
    verbose=True,
    chain_type_kwargs={"prompt": A_CHAIN_PROMPT}
)

result = a_chain({"query": question})
print(result["result"])
pretty_print_docs(result["source_documents"])
print(result)



#### Creating the Q&A chat prompt template

In this step we are preparing a prompt template compatible with Llama 2 format that supports chat history. Also we are specifying where the {context}, the {chat_history} and the {question} should go. This template is going to be used every time we are asking something to the Chat LLM.

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """[INST] <<SYS>>
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
Context:
{context}
<</SYS>> 
Chat History: 
{chat_history}
Question: 
{question} [/INST] 
"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
print(QA_CHAIN_PROMPT)

#### Q&A with chat history

For this section we will use the ConversationalRetrievalChain class to create a chain and interact with our Chat model. I highly suggest reading through the [Langchain docs and API](https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#langchain-chains-conversational-retrieval-base-conversationalretrievalchain) since this class has many options.

This chain uses a condense step to summarize the chat history and pass it on to the Chat model to answer follow-up questions. We show here how you could use your own condense prompt; however, Langchain has a default prompt for this which normally should produce good results. Check [this issue](https://github.com/langchain-ai/langchain/issues/4076) for more explanation on the topic.

For the chat history, we keep track of every Q&A step and pass it into the chain. Alternatively you can use the ConversationBufferMemory class and pass it to the chain (here commented out).

Look at the logs for more information on what is happening during this chain execution.
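As a rough mental model of the condense step, the sketch below turns a chat history kept as a list of (question, answer) tuples into the text that fills a condense prompt. The "Human:"/"Assistant:" labels and both helper functions are assumptions made for this illustration; the chain's internal formatting may differ.

```python
def format_history(chat_history):
    """Render (question, answer) tuples as alternating speaker lines."""
    lines = []
    for human, ai in chat_history:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {ai}")
    return "\n".join(lines)


def build_condense_prompt(chat_history, follow_up):
    """Build the text that asks the model for a standalone question."""
    return (
        f"Chat History:\n{format_history(chat_history)}\n"
        f"Follow Up question: {follow_up}\n"
        "Standalone question:"
    )


history = [("What are some major risks?", "Flooding and heat waves.")]
condense_prompt = build_condense_prompt(history, "Can you simplify that?")
print(condense_prompt)
```

The model's completion of this prompt is the standalone question that is then used for retrieval.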

In [None]:
from langchain.chains import ConversationalRetrievalChain
#from langchain.memory import ConversationBufferMemory
#memory = ConversationBufferMemory(memory_key="chat_history", input_key='question', output_key='answer', return_messages=True)
chat_history = []

condense_question_template = """
    [INST] <<SYS>>
    Return text in the original language of the follow up question.
    If the follow up question does not need context, return the exact same text back.
    Rephrase the follow up question based on the chat history only if it needs context.
    <</SYS>> 
    Chat History: {chat_history}
    Follow Up question: {question}
    Standalone question: [/INST] 
"""
condense_question_prompt = PromptTemplate.from_template(condense_question_template)

conv_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=compression_retriever,
    # memory=memory,
    return_generated_question=True,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT},
    verbose=True,
    # response_if_no_docs_found="No context found!",
    rephrase_question=False, # If False, the generated question is used only for retrieval and the original question is passed to the answer step
    # condense_question_prompt=condense_question_prompt
)
result = conv_chain({"question": question, "chat_history": chat_history})
print(result["answer"])
pretty_print_docs(result["source_documents"])
print(result)
chat_history.extend([(question, result["answer"])])

#### Asking Follow-up questions

Here we are asking 2 follow up questions to the Chat LLM to test the chat history and the context. Pay attention to the steps in the verbose logs. Also, at the end, we are showing how to clear the chat history.

In [None]:
fu_question = "Can you re-write your response so a 10-year old kid can understand?" 
result = conv_chain({"question": fu_question, "chat_history": chat_history})
print(result["answer"])
pretty_print_docs(result["source_documents"])
print(result)
chat_history.extend([(fu_question, result["answer"])]) # Save Q&A in chat history

fu_question = "What was my previous question?" 
result = conv_chain({"question": fu_question, "chat_history": chat_history})
print(result["answer"])
pretty_print_docs(result["source_documents"])
print(result)
chat_history = [] # Clears the chat history