# Introduction
In this notebook, we pick up where we left off in the [preparation notebook](https://github.com/shaaagri/iat481-nlp-proj/blob/main/LLama2_vanilla_bot.ipynb) and add a vector store with a retriever to the pipeline. This should be enough to lay the groundwork for our goal: a chatbot powered by RAG (Retrieval-Augmented Generation). As a reminder, the specialized knowledge we plan to inject into the chatbot concerns sleep hygiene and related science-backed tips.

# Workflow

1. Setting Up LLama-2 and LangChain
2. Text Embeddings and the Vector Store
3. Creating a RAG pipeline using sample data

# Setting Up LLama-2 and LangChain

The next section merely repeats the code from the preparation notebook. If that notebook has already been run and the kernel has kept its state, this section can be skipped. Otherwise, run the same code here for convenience.

### Prerequisites

In [12]:
# GPU llama-cpp-python
%set_env CMAKE_ARGS="-DLLAMA_CUBLAS=on"
%set_env FORCE_CMAKE=1
!pip install llama-cpp-python --upgrade --verbose
!pip install huggingface_hub
!pip install llama-cpp-python

env: CMAKE_ARGS="-DLLAMA_CUBLAS=on"
env: FORCE_CMAKE=1
Using pip 24.0 from C:\Program Files\Python312\Lib\site-packages\pip (python 3.12)
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [1]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

### Model

In [2]:
model_name_or_path = "TheBloke/Llama-2-7B-chat-GGUF"
model_basename = "llama-2-7b-chat.Q4_K_M.gguf"

Before downloading the model again, which can be time-consuming, check the Hugging Face Hub's cache folder, where the model may already have been stored during previous notebook runs.

In [3]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
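
As a rough illustration of the cache check mentioned above, the sketch below looks for an already downloaded copy of the model file. It assumes the cache layout seen in the model path further down in this notebook (`models--<org>--<repo>/snapshots/<revision>/<filename>`) under `~/.cache/huggingface/hub`. The helper name `find_cached_model` and its `cache_root` parameter are our own and not part of `huggingface_hub`. In practice `hf_hub_download` consults the cache by itself, so this is only a way to see what is already on disk.

```python
from pathlib import Path
from typing import Optional


def find_cached_model(repo_id: str, filename: str,
                      cache_root: Optional[Path] = None) -> Optional[Path]:
    """Return the path of a cached model file, or None if it is not on disk.

    Hypothetical helper: assumes the layout
    <cache_root>/models--<org>--<repo>/snapshots/<revision>/<filename>.
    """
    if cache_root is None:
        # Assumed default location; it can differ if HF_HOME is set.
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"

    repo_dir = cache_root / ("models--" + repo_id.replace("/", "--"))
    # Any snapshot (revision) that contains the file counts as a hit.
    matches = sorted(repo_dir.glob(f"snapshots/*/{filename}"))
    return matches[0] if matches else None


cached = find_cached_model("TheBloke/Llama-2-7B-chat-GGUF",
                           "llama-2-7b-chat.Q4_K_M.gguf")
print(cached if cached else "Not cached yet; hf_hub_download will fetch it.")
```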

### LangChain

In [4]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

In [92]:
# Plain (non-f) string: {question} must remain a placeholder for PromptTemplate to fill in
prompt_template='''[INST]
<<SYS>>
You are helpful, respectful, caring and honest assistant. You do not have expressions or emotions. You are objective and provide everything that is helpful to know given the question, but you are not chatty. Answer as helpfully as you possibly can.
<</SYS>>

USER: {question}

ASSISTANT: 
[/INST]
'''

In [93]:
prompt = PromptTemplate(
    input_variables=["question"],
    template=prompt_template,
)

In [94]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

In [95]:
llm = LlamaCpp(
    # Reuse the path returned by hf_hub_download above instead of hard-coding it
    model_path=model_path,
    
    temperature=1.0,
    max_tokens=1024,
    repeat_penalty=1.02,
    top_p=0.6, # nucleus sampling
    top_k=150,  # sample from k top tokens 
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/Narratic-DEV002/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.
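
To make the sampling parameters above more concrete, here is a small sketch of how temperature, top-k, and top-p (nucleus) filtering can be combined when choosing the next token. It is a simplified pure-Python illustration, not llama.cpp's actual implementation; the function name `filter_candidates` and the toy logits are made up for the example.

```python
import math


def filter_candidates(logits, temperature=1.0, top_k=150, top_p=0.6):
    """Return {token: probability} for tokens that survive top-k and top-p."""
    # Temperature rescales the logits before the softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda pair: pair[1], reverse=True)

    # top-k: keep only the k most likely tokens.
    probs = probs[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Hypothetical logits for four candidate tokens.
toy_logits = {"campus": 2.0, "library": 1.0, "mountain": 0.5, "banana": -3.0}
print(filter_candidates(toy_logits))
```

With these toy logits and `top_p=0.6`, the most likely token alone already covers the nucleus; raising `top_p` lets more candidates through.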

In [11]:
from langchain.globals import set_debug

# debugging on demand
set_debug(True) 

In [87]:
question='Describe the main campus of the Simon Fraser University'

In [96]:
llm.invoke(prompt.format(question=question))

[32;1m[1;3m[llm/start][0m [1m[1:llm:LlamaCpp] Entering LLM run with input:
[0m{
  "prompts": [
    "[INST]\n<<SYS>>\nYou are helpful, respectful, caring and honest assistant. You do not have expressions or emotions. You are objective and provide everything that is helpful to know given the question, but you are not chatty. Answer as helpfully as you possibly can.\n<</SYS>>\n\nUSER: Describe the main campus of the Simon Fraser University\n\nASSISTANT: \n[/INST]"
  ]
}
Thank you for your question! The main campus of Simon Fraser University is located in Burnaby, British Columbia, Canada. The campus is situated on 32 hectares of land and features a variety of buildings, including academic facilities, residence halls, and recreational spaces.
Here are some key features of the main campus:

* The SFU Library, which houses over 2 million books, journals, and other resources, as well as a variety of study spaces and group work areas.
* The Academic Quad, which is home to many of the univ


llama_print_timings:        load time =     762.39 ms
llama_print_timings:      sample time =      77.22 ms /   342 runs   (    0.23 ms per token,  4428.90 tokens per second)
llama_print_timings: prompt eval time =    6248.81 ms /    98 tokens (   63.76 ms per token,    15.68 tokens per second)
llama_print_timings:        eval time =   34430.05 ms /   341 runs   (  100.97 ms per token,     9.90 tokens per second)
llama_print_timings:       total time =   42146.83 ms /   439 tokens


[36;1m[1;3m[llm/end][0m [1m[1:llm:LlamaCpp] [42.15s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "Thank you for your question! The main campus of Simon Fraser University is located in Burnaby, British Columbia, Canada. The campus is situated on 32 hectares of land and features a variety of buildings, including academic facilities, residence halls, and recreational spaces.\nHere are some key features of the main campus:\n\n* The SFU Library, which houses over 2 million books, journals, and other resources, as well as a variety of study spaces and group work areas.\n* The Academic Quad, which is home to many of the university's academic departments, including the Faculty of Arts and Social Sciences, the Faculty of Science, and the Faculty of Applied Sciences.\n* The Student Union Building (SUB), which offers a variety of services and amenities, including food services, a convenience store, and a game room.\n* The SFU Theatre, which hosts a vari

"Thank you for your question! The main campus of Simon Fraser University is located in Burnaby, British Columbia, Canada. The campus is situated on 32 hectares of land and features a variety of buildings, including academic facilities, residence halls, and recreational spaces.\nHere are some key features of the main campus:\n\n* The SFU Library, which houses over 2 million books, journals, and other resources, as well as a variety of study spaces and group work areas.\n* The Academic Quad, which is home to many of the university's academic departments, including the Faculty of Arts and Social Sciences, the Faculty of Science, and the Faculty of Applied Sciences.\n* The Student Union Building (SUB), which offers a variety of services and amenities, including food services, a convenience store, and a game room.\n* The SFU Theatre, which hosts a variety of performances and events throughout the year, including concerts, plays, and dance performances.\n* The Athletics and Recreation Centre

# Text Embeddings and the Vector Store

Our RAG bot relies on extra knowledge that we manually package into the project (in the form of Q&A data), so a crucial step is choosing a text embedding model and a vector store. The embedding model converts our textual Q&A data into vector representations, which are required for the semantic similarity comparison later on; in other words, for matching the user's question, as well as we can, to the appropriate piece of information in the extra knowledge. The vector store keeps these representations and provides access to them as needed. These two components are cornerstones of any RAG project, and both the use cases and the range of available models and vector stores are well documented.
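
The matching step described above can be illustrated with a toy example. The sketch below swaps the real embedding model for a simple bag-of-words count vector (an assumption made purely to keep the example dependency-free) and uses cosine similarity to pick the Q&A entry closest to a user question. The sample strings and the function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical snippets standing in for the packaged Q&A data.
knowledge = [
    "Avoid caffeine in the late afternoon and evening.",
    "Keep the bedroom dark, quiet and cool.",
    "Go to bed and wake up at the same time every day.",
]

user_question = "Should I avoid caffeine in the evening?"
scores = [cosine_similarity(toy_embed(user_question), toy_embed(doc))
          for doc in knowledge]
best_match = knowledge[scores.index(max(scores))]
print(best_match)
```

Word counts only capture literal word overlap, whereas a trained embedding model is meant to place texts with similar meaning close together even when they share few words; the ranking logic stays the same.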

### Choosing the Text Embedding Model

For a long time, there were few alternatives to OpenAI's `ada-002` for producing embeddings, and that model is provided through an API that charges a small fee. However, by April 2024 (the time of writing this notebook) the range has grown considerably: there are now other commercial players (e.g. [Cohere](https://cohere.com/embeddings), [Jina](https://jina.ai/embeddings/) - both offer a free tier), and open-source text embedding models are available as well, such as `SentenceTransformers` on Hugging Face ([link](https://huggingface.co/sentence-transformers)).

As students, we are delighted to be able to use another model free of charge; our only question is whether it performs comparably to ada-002. The good news is that our brief research suggests we should be fine with the open-source Sentence Transformers (which come as a [family of models](https://www.sbert.net/docs/pretrained_models.html), each trading off speed against quality in various ways). Here are the resources we are referring to: [(1)](https://iamnotarobot.substack.com/p/should-you-use-openais-embeddings), [(2)](https://www.reddit.com/r/MachineLearning/comments/11okrni/discussion_compare_openai_and_sentencetransformer/), [(3)](https://supabase.com/blog/fewer-dimensions-are-better-pgvector), [(4)](https://weaviate.io/blog/how-to-choose-a-sentence-transformer-from-hugging-face).

The consensus seems to be that ada-002 is not strictly necessary, as the open-source models match it and sometimes even exceed it in performance. One particular text embedding model that seems to strike a good balance between size, speed, and accuracy is `all-MiniLM-L6-v2`. It also has a slightly larger sibling, `all-MiniLM-L12-v2`, which according to [this table](https://www.sbert.net/docs/pretrained_models.html) is only marginally better than all-MiniLM-L6-v2 while being significantly slower. All in all, we think the all-MiniLM-L6-v2 model is an excellent start, given that our use case is mostly concerned with general-purpose English text. It is also supported by LangChain out of the box.

In [100]:
!pip install sentence-transformers

Defaulting to user installation because normal site-packages is not writeable
Collecting sentence-transformers
  Downloading sentence_transformers-2.6.1-py3-none-any.whl.metadata (11 kB)
Collecting transformers<5.0.0,>=4.32.0 (from sentence-transformers)
  Downloading transformers-4.39.3-py3-none-any.whl.metadata (134 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Downloading torch-2.2.2-cp312-cp312-win_amd64.whl.metadata (26 kB)
Collecting scikit-learn (from sentence-transformers)
  Downloading scikit_learn-1.4.2-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting scipy (from sentence-transformers)
  Downloading scipy-1.13.0-cp312-cp312-win_amd64.whl.metadata (60 kB)



In [103]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

### Choosing the Vector Store

As for the vector database that will store the embeddings, the decision is considerably easier. It comes down to two well-known alternatives: `Pinecone` (managed) and `ChromaDB` (self-hosted). As a reminder, our guiding design principle is to rely on open-source and/or free-tier components for 100% of the pipeline, hence ChromaDB is the obvious choice. To back this up with some literature, we checked, for instance, [this article](https://medium.com/@sakhamurijaikar/which-vector-database-is-right-for-your-generative-ai-application-pinecone-vs-chromadb-1d849dd5e9df), which confirmed our assumption that ChromaDB should be more than enough for what is just a student prototype.

In [104]:
from langchain_community.vectorstores import Chroma
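
To show what a vector store is responsible for, here is a minimal in-memory sketch. It is not ChromaDB and makes no claim about how ChromaDB works internally; the class `TinyVectorStore`, the keyword-based `keyword_embed` function, and the sample texts are all hypothetical. The method names `add_texts` and `similarity_search` loosely mirror LangChain's vector store interface, purely for familiarity.

```python
import math


class TinyVectorStore:
    """In-memory illustration of a vector store; not ChromaDB."""

    def __init__(self, embed):
        self.embed = embed      # function: str -> list[float]
        self.entries = []       # (vector, original text) pairs

    def add_texts(self, texts):
        # Each text is embedded once, at insertion time.
        for text in texts:
            self.entries.append((self.embed(text), text))

    def similarity_search(self, query, k=1):
        # Rank stored texts by cosine similarity to the query vector.
        q = self.embed(query)
        ranked = sorted(self.entries,
                        key=lambda entry: _cosine(q, entry[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


# Hypothetical 3-dimensional "embedding": counts of three hand-picked keywords.
KEYWORDS = ("caffeine", "light", "schedule")


def keyword_embed(text):
    words = text.lower().split()
    return [float(words.count(keyword)) for keyword in KEYWORDS]


store = TinyVectorStore(keyword_embed)
store.add_texts([
    "tip about caffeine intake",
    "tip about evening light exposure",
    "tip about a regular schedule",
])
print(store.similarity_search("does caffeine affect sleep", k=1))
```

The point of the design is that documents are embedded once when they are added, while each query only needs a single embedding call plus a ranking over the stored vectors.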