# RAG chatbot
Let's now build a RAG chain with memory: a chatbot that answers questions from retrieved documents and keeps track of the earlier turns of the conversation.

---
## 1.&nbsp; Installations and Settings 🛠️

In [None]:
!pip3 install -qqq langchain --progress-bar off
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install -qqq llama-cpp-python --progress-bar off
!pip3 install -qqq sentence_transformers --progress-bar off
!pip3 install -qqq faiss-gpu --progress-bar off

!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for llama-cpp-python (pyproject.toml) ... done
Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf to /root/.cache/huggingface/hub/tmpzvgvot1y
mistral-7b-instruct-v0.1.Q4_K_M.gguf: 100% 4.37G/4.37G [00:44<00:00, 98.2MB/s]
./mistral-7b-instruct-v0.1.Q4_K_M.gguf


In [None]:
# upgrade gdown in case you get the error: 'NoneType' object has no attribute 'groups'
!pip install --upgrade gdown

# download saved vector database for Alice's Adventures in Wonderland
!gdown --folder 1A8A9lhcUXUKRrtCe7rckMlQtgmfLZRQH

Collecting gdown
  Downloading gdown-5.1.0-py3-none-any.whl (17 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.7.3
    Uninstalling gdown-4.7.3:
      Successfully uninstalled gdown-4.7.3
Successfully installed gdown-5.1.0
Retrieving folder contents
Processing file 1h_lk4wTr12FAEaCS3eIJ4xsdcmnuIGmt index.faiss
Processing file 1O0Jz2Lx5cZdpQM7S5uw6Kx9_OLm5DuSQ index.pkl
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1h_lk4wTr12FAEaCS3eIJ4xsdcmnuIGmt
To: /content/faiss_index/index.faiss
100% 421k/421k [00:00<00:00, 2.96MB/s]
Downloading...
From: https://drive.google.com/uc?id=1O0Jz2Lx5cZdpQM7S5uw6Kx9_OLm5DuSQ
To: /content/faiss_index/index.pkl
100% 216k/216k [00:00<00:00, 2.14MB/s]
Download completed


---
## 2.&nbsp; Setting up the chain 🔗
The code in this section introduces two things we haven't seen before:
* the `output_key` parameter in [ConversationBufferMemory](https://api.python.langchain.com/en/latest/memory/langchain.memory.buffer.ConversationBufferMemory.html)
* [ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#)

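`ConversationalRetrievalChain` is LangChain's chain for RAG with memory. It uses the chat history to rephrase each follow-up question into a standalone question, retrieves matching documents for it, and then answers from those documents.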
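
To make the idea of "RAG with memory" concrete, here is a minimal pure-Python sketch of one conversational turn. It is not LangChain's implementation: `conversational_rag_turn`, `StubLLM`, and `stub_retrieve` are hypothetical names invented for this illustration, and the prompts are simplified stand-ins. The sketch assumes the usual pattern of condensing the follow-up question with the chat history, retrieving documents, and answering from them.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (human question, AI answer) pairs


def conversational_rag_turn(
    question: str,
    chat_history: History,
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> Dict[str, object]:
    """One turn of a toy RAG-with-memory loop (illustrative only)."""
    # Step 1: if there is history, ask the LLM to rewrite the follow-up
    # as a standalone question. Without history, use the question as-is.
    if chat_history:
        history_text = "\n".join(f"Human: {q}\nAI: {a}" for q, a in chat_history)
        standalone = llm(
            f"Chat history:\n{history_text}\n"
            f"Follow-up: {question}\nStandalone question:"
        )
    else:
        standalone = question

    # Step 2: retrieve documents for the standalone question.
    docs = retrieve(standalone)

    # Step 3: answer the standalone question from the retrieved context.
    context = "\n\n".join(docs)
    answer = llm(f"Context:\n{context}\nQuestion: {standalone}\nAnswer:")
    return {
        "question": question,
        "standalone_question": standalone,
        "answer": answer,
        "source_documents": docs,
    }


class StubLLM:
    """Hypothetical stand-in for a real model; records every prompt it receives."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if prompt.endswith("Standalone question:"):
            return "What does the Queen of Hearts enjoy doing?"
        return "stub answer"


def stub_retrieve(query: str) -> List[str]:
    """Hypothetical retriever that echoes the query instead of searching an index."""
    return [f"passage about: {query}"]


stub_llm = StubLLM()
history: History = []

first = conversational_rag_turn("Who is the queen?", history, stub_llm, stub_retrieve)
calls_after_first = len(stub_llm.prompts)
history.append((first["question"], first["answer"]))

second = conversational_rag_turn("What does she enjoy doing?", history, stub_llm, stub_retrieve)
calls_after_second = len(stub_llm.prompts)

print(calls_after_first, calls_after_second)
print(second["standalone_question"])
```

In this sketch the first question needs one model call, while a follow-up needs two (one to rephrase, one to answer). That would be consistent with the `llama_print_timings` output in this notebook, where the first question produces one timing block and each follow-up produces two.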

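The `output_key` parameter is needed when you combine `memory` with `return_source_documents=True` in `ConversationalRetrievalChain`. The chain then returns two outputs (`answer` and `source_documents`), and the memory has to be told which one to store as the AI's reply.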
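
The following toy class sketches why the memory needs an explicit key once a chain returns more than one output. `ToyBufferMemory` is a hypothetical stand-in written for this example, not LangChain's `ConversationBufferMemory`; it only assumes that a buffer memory must pick exactly one output value to store per turn.

```python
from typing import Dict, List, Optional, Tuple


class ToyBufferMemory:
    """Hypothetical, simplified buffer memory (not the LangChain class)."""

    def __init__(self, output_key: Optional[str] = None) -> None:
        self.output_key = output_key
        self.messages: List[Tuple[str, str]] = []

    def save_context(self, inputs: Dict[str, str], outputs: Dict[str, object]) -> None:
        if self.output_key is not None:
            # The caller told us which output is the AI's reply.
            key = self.output_key
        elif len(outputs) == 1:
            # Only one output, so there is nothing to disambiguate.
            key = next(iter(outputs))
        else:
            # Several outputs and no hint: refuse to guess.
            raise ValueError(
                f"Got multiple output keys {sorted(outputs)}; set output_key explicitly"
            )
        self.messages.append(("human", inputs["question"]))
        self.messages.append(("ai", str(outputs[key])))


inputs = {"question": "Who is the queen?"}
# With source documents returned, the chain output has two keys.
outputs = {"answer": "The Queen of Hearts.", "source_documents": ["passage 1"]}

try:
    ToyBufferMemory().save_context(inputs, outputs)
    error_without_key = None
except ValueError as exc:
    error_without_key = str(exc)

memory_with_key = ToyBufferMemory(output_key="answer")
memory_with_key.save_context(inputs, outputs)

print(error_without_key)
print(memory_with_key.messages)
```

Setting `output_key='answer'` plays the same role in the real chain: only the answer text is written to the chat history, and the source documents are returned to the caller without being stored.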

In [None]:
from langchain.llms import LlamaCpp
from langchain.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import ConversationBufferMemory

# llm
llm = LlamaCpp(model_path = "/content/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
               max_tokens = 2000,
               temperature = 0.1,
               top_p = 1,
               n_gpu_layers = -1,
               n_ctx = 1024)

# embeddings
embedding_model = "sentence-transformers/all-MiniLM-l6-v2"
embeddings_folder = "/content/"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model,
                                   cache_folder=embeddings_folder)

# load vector Database
# allow_dangerous_deserialization=True is required because index.pkl is a pickle file.
# Pickle files can be modified to execute arbitrary code on your machine when loaded,
# so only load an index that comes from a source you trust.
vector_db = FAISS.load_local("/content/faiss_index", embeddings, allow_dangerous_deserialization=True)

# retriever
retriever = vector_db.as_retriever(search_kwargs={"k": 2})

# memory
memory = ConversationBufferMemory(memory_key='chat_history',
                                  return_messages=True,
                                  output_key='answer')

# prompt
template = """
<s> [INST]
You are a polite and professional question-answering AI assistant. You must provide a helpful response to the user.

In your response, PLEASE ALWAYS:
  (0) Be a detail-oriented reader: read the question and context and understand both before answering
  (1) Start your answer with a friendly tone, and reiterate the question so the user is sure you understood it
  (2) If the context enables you to answer the question, write a detailed, helpful, and easily understandable answer. If you can't find the answer, respond with an explanation, starting with: "I couldn't find the answer in the information I have access to".
  (3) Ensure your answer answers the question, is helpful, professional, and formatted to be easily readable.
[/INST]
[INST]
Answer the following question using the context provided.
The question is surrounded by the tags <q> </q>.
The context is surrounded by the tags <c> </c>.
<q>
{question}
</q>
<c>
{context}
</c>
[/INST]
</s>
[INST]
Helpful Answer:
[/INST]
"""

prompt = PromptTemplate(template=template,
                        input_variables=["context", "question"])

# chain
chain = ConversationalRetrievalChain.from_llm(llm,
                                              retriever=retriever,
                                              memory=memory,
                                              return_source_documents=True,
                                              combine_docs_chain_kwargs={"prompt": prompt})

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /content/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 l

In [None]:
chain.invoke("Who is the queen?")


llama_print_timings:        load time =     248.74 ms
llama_print_timings:      sample time =       8.68 ms /    13 runs   (    0.67 ms per token,  1497.01 tokens per second)
llama_print_timings: prompt eval time =    5283.32 ms /   520 tokens (   10.16 ms per token,    98.42 tokens per second)
llama_print_timings:        eval time =     317.96 ms /    13 runs   (   24.46 ms per token,    40.89 tokens per second)
llama_print_timings:       total time =    5828.58 ms /   533 tokens


{'question': 'Who is the queen?',
 'chat_history': [HumanMessage(content='Who is the queen?'),
  AIMessage(content='The Queen in the story is the Queen of Hearts.')],
 'answer': 'The Queen in the story is the Queen of Hearts.',
 'source_documents': [Document(page_content='“Would you tell me,” said Alice, a little timidly, “why you are\npainting those roses?”\n\nFive and Seven said nothing, but looked at Two. Two began in a low\nvoice, “Why the fact is, you see, Miss, this here ought to have been a\n_red_ rose-tree, and we put a white one in by mistake; and if the Queen\nwas to find it out, we should all have our heads cut off, you know. So\nyou see, Miss, we’re doing our best, afore she comes, to—” At this\nmoment Five, who had been anxiously looking across the garden, called\nout “The Queen! The Queen!” and the three gardeners instantly threw\nthemselves flat upon their faces. There was a sound of many footsteps,\nand Alice looked round, eager to see the Queen.', metadata={'source': '

In [None]:
print(chain.invoke("What does she enjoy doing?")["answer"])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     248.74 ms
llama_print_timings:      sample time =       5.65 ms /    11 runs   (    0.51 ms per token,  1947.94 tokens per second)
llama_print_timings: prompt eval time =     716.32 ms /    75 tokens (    9.55 ms per token,   104.70 tokens per second)
llama_print_timings:        eval time =     235.92 ms /    10 runs   (   23.59 ms per token,    42.39 tokens per second)
llama_print_timings:       total time =    1019.78 ms /    85 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =     248.74 ms
llama_print_timings:      sample time =      48.40 ms /    88 runs   (    0.55 ms per token,  1818.37 tokens per second)
llama_print_timings: prompt eval time =    5382.31 ms /   548 tokens (    9.82 ms per token,   101.82 tokens per second)
llama_print_timings:        eval time =    2132.17 ms /    87 runs   (   24.51 ms per token,    40.80 tokens per second)
llama_print_timings:       to

I'm sorry, but I couldn't find the answer to the question "What does the Queen of Hearts enjoy doing?" in the provided context. The provided context only includes information about the Queen of Hearts' actions during her trial, such as making tarts and stealing them from the Knave of Hearts. However, it does not provide any information about her hobbies or interests outside of her trial.


In [None]:
print(chain.invoke("Whose head does she chop off?")["answer"])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     248.74 ms
llama_print_timings:      sample time =       7.76 ms /    13 runs   (    0.60 ms per token,  1675.69 tokens per second)
llama_print_timings: prompt eval time =    1733.00 ms /   178 tokens (    9.74 ms per token,   102.71 tokens per second)
llama_print_timings:        eval time =     282.99 ms /    12 runs   (   23.58 ms per token,    42.41 tokens per second)
llama_print_timings:       total time =    2130.54 ms /   190 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =     248.74 ms
llama_print_timings:      sample time =      10.50 ms /    19 runs   (    0.55 ms per token,  1810.21 tokens per second)
llama_print_timings: prompt eval time =    5489.87 ms /   556 tokens (    9.87 ms per token,   101.28 tokens per second)
llama_print_timings:        eval time =     445.89 ms /    18 runs   (   24.77 ms per token,    40.37 tokens per second)
llama_print_timings:       to

The Queen of Hearts chops off the head of the Knave of Hearts.
