# Introduction
In this notebook, we pick up where we left off in the [preparation notebook](https://github.com/shaaagri/iat481-nlp-proj/blob/main/LLama2_vanilla_bot.ipynb) and add a vector store with a retriever to the pipeline. This should be enough to lay the groundwork for our goal: a chatbot powered by RAG (Retrieval-Augmented Generation). As a reminder, the specialized knowledge we plan to inject into the chatbot concerns sleep hygiene and related science-backed tips.

# Workflow

1. Setting Up LLama-2 and LangChain
2. Text Embeddings and the Vector Store

# Setting Up LLama-2 and LangChain

This section merely repeats the code from the preparation notebook. If that code has already been run in the current kernel session, re-running this section may not be necessary, since the kernel keeps its state. Otherwise, run the cells below for convenience.

### Prerequisites

In [12]:
# GPU llama-cpp-python
%set_env CMAKE_ARGS="-DLLAMA_CUBLAS=on"
%set_env FORCE_CMAKE=1
!pip install llama-cpp-python --upgrade --verbose
!pip install huggingface_hub
!pip install llama-cpp-python

env: CMAKE_ARGS="-DLLAMA_CUBLAS=on"
env: FORCE_CMAKE=1
Using pip 24.0 from C:\Program Files\Python312\Lib\site-packages\pip (python 3.12)
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [13]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

### Model

In [14]:
model_name_or_path = "TheBloke/Llama-2-7B-chat-GGUF"
model_basename = "llama-2-7b-chat.Q4_K_M.gguf"

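Before downloading the model again, which can be time-consuming, check the Hugging Face Hub cache folder, where it may already be stored from previous notebook runs. Note that `hf_hub_download` itself returns the cached copy when the file is already present, so the call below only triggers a download when needed.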
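The cache check can also be done programmatically. The sketch below assumes the default Hub cache layout (`models--<org>--<name>/snapshots/<revision>/<file>`, the same layout as the model path used later in this notebook) and the `HF_HUB_CACHE` / `HF_HOME` environment variables. `find_cached_model` is a hypothetical helper written for illustration; it is not part of `huggingface_hub`.

```python
import os
from pathlib import Path
from typing import Optional


def find_cached_model(repo_id: str, filename: str,
                      cache_dir: Optional[str] = None) -> Optional[Path]:
    """Return the path of a cached model file, or None if it is not cached.

    Hypothetical helper: it only inspects the folder layout and does not
    contact the Hub.
    """
    if cache_dir is None:
        # Assumed lookup order: HF_HUB_CACHE, then HF_HOME/hub, then the default.
        hf_home = os.environ.get(
            "HF_HOME", os.path.join(Path.home(), ".cache", "huggingface")
        )
        cache_dir = os.environ.get("HF_HUB_CACHE") or os.path.join(hf_home, "hub")

    # "TheBloke/Llama-2-7B-chat-GGUF" -> "models--TheBloke--Llama-2-7B-chat-GGUF"
    repo_dir = Path(cache_dir) / ("models--" + repo_id.replace("/", "--"))
    if not repo_dir.is_dir():
        return None

    # Each downloaded revision lives in its own snapshot folder.
    matches = sorted(repo_dir.glob(f"snapshots/*/{filename}"))
    return matches[-1] if matches else None


cached = find_cached_model(
    "TheBloke/Llama-2-7B-chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf"
)
print(cached if cached else "Model not found in the local cache")
```

If the helper prints a path, that path can be passed to `LlamaCpp` as `model_path` directly.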

In [None]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

### LangChain

In [15]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

In [101]:
# {question} is the input variable that PromptTemplate.from_template picks up
prompt_template='''SYSTEM: You are a helpful, respectful and honest assistant. Answer helpfully and concisely. Answer only the question that is being asked and provide only relevant information.

  USER: {question}

  ASSISTANT:
  '''

In [44]:
prompt = PromptTemplate.from_template(prompt_template)

In [31]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

In [122]:
llm = LlamaCpp(
    # Make sure the model path is correct for your system!
    model_path="/Users/Narratic-DEV002/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf",
    
    temperature=0.6,
    max_tokens=1024,
    repeat_penalty=1.2,
    top_p=0.9,
    top_k=150,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/Narratic-DEV002/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.

In [39]:
question='Where Simon Fraser University is located?'

In [40]:
print(question)

Where Simon Fraser University is located?


In [129]:
llm.invoke(question)

Llama.generate: prefix-match hit



Simon Fraser University (SFU) is a public research university located in Burnaby, British Columbia, Canada. The main campus of SFU is situated on the traditional lands of the Coast Salish peoples, specifically the territories of the Squamish and Tsleil-Waututh First Nations.
Address: Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
The university has a second campus in Surrey, which is located approximately 30 minutes southeast of the main campus by car. This campus houses the Faculty of Health Sciences and other programs.
Address: Simon Fraser University – Surrey Campus, 2455 Old Clayburn Road, Surrey, BC V3S 7V9, Canada


llama_print_timings:        load time =     867.06 ms
llama_print_timings:      sample time =      32.50 ms /   177 runs   (    0.18 ms per token,  5445.65 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   17285.27 ms /   177 runs   (   97.66 ms per token,    10.24 tokens per second)
llama_print_timings:       total time =   18022.78 ms /   178 tokens


'\nSimon Fraser University (SFU) is a public research university located in Burnaby, British Columbia, Canada. The main campus of SFU is situated on the traditional lands of the Coast Salish peoples, specifically the territories of the Squamish and Tsleil-Waututh First Nations.\nAddress: Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada\nThe university has a second campus in Surrey, which is located approximately 30 minutes southeast of the main campus by car. This campus houses the Faculty of Health Sciences and other programs.\nAddress: Simon Fraser University – Surrey Campus, 2455 Old Clayburn Road, Surrey, BC V3S 7V9, Canada'

# Text Embeddings and the Vector Store

Our RAG bot relies on extra knowledge that we manually package into the project in the form of Q&A data, so a crucial step is choosing a text embedding model and a vector store. The embedding model converts our textual Q&A data into vector representations, which are required for the semantic similarity comparison later on: in other words, for matching the user's question, as well as we can, to the appropriate piece of information in the extra knowledge. The vector store holds these representations and provides access to them as needed. These two components are cornerstones of any RAG project, and both the use cases and the range of available models and vector stores are well documented.
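To make the division of labour concrete, here is a minimal, dependency-free sketch of the idea. It is not the setup used in this project: `toy_embed` is a bag-of-words stand-in for a real embedding model, `ToyVectorStore` is a hypothetical in-memory store, and the three sentences are placeholder entries rather than our actual Q&A data.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs in a list."""

    def __init__(self):
        self._entries = []

    def add(self, text):
        self._entries.append((toy_embed(text), text))

    def search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        query_vector = toy_embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vector, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
for entry in [
    "Keep the bedroom cool, dark and quiet.",
    "Avoid caffeine late in the afternoon and evening.",
    "Go to bed and wake up at the same time every day.",
]:
    store.add(entry)

# The caffeine entry shares the most words with the question, so it ranks first.
print(store.search("Should I avoid caffeine in the evening?", k=1))
```

A real embedding model replaces the word counts with dense vectors that capture meaning rather than exact word overlap, and a real vector store replaces the linear scan with an index, but the embed, store, and search steps stay the same.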

### Choosing the Text Embedding Model

For a long time, there was little choice of embedding model besides OpenAI's `ada-002`, which is provided through an API that requires a small fee to use. However, by April 2024 (the time of writing this notebook) the range has grown considerably: there are now not only other commercial players in the market (e.g. [Cohere](https://cohere.com/embeddings), [Jina](https://jina.ai/embeddings/) - both offer a free tier) but also open-source text embedding models, such as the `Sentence Transformers` available on Hugging Face ([link](https://huggingface.co/sentence-transformers)).

As students, we are delighted to be able to use a model free of charge; our only question is whether it performs comparably to ada-002. The good news is that our brief research suggests we should be fine with the open-source Sentence Transformers (which come as a [family of models](https://www.sbert.net/docs/pretrained_models.html), each trading off speed against quality in various ways). These are the resources we are referring to: [(1)](https://iamnotarobot.substack.com/p/should-you-use-openais-embeddings), [(2)](https://www.reddit.com/r/MachineLearning/comments/11okrni/discussion_compare_openai_and_sentencetransformer/), [(3)](https://supabase.com/blog/fewer-dimensions-are-better-pgvector), [(4)](https://weaviate.io/blog/how-to-choose-a-sentence-transformer-from-hugging-face).

The consensus seems to be that ada-002 is not necessary at all, as the open-source models match it and sometimes even exceed it in performance. One text embedding model that seems to strike a good balance between size, speed, and accuracy is `all-MiniLM-L6-v2`. It also has an "older brother", the slightly larger `all-MiniLM-L12-v2`, which according to [this table](https://www.sbert.net/docs/pretrained_models.html) is only marginally better than all-MiniLM-L6-v2 while being significantly slower. All in all, we think all-MiniLM-L6-v2 is an excellent starting point. It is also supported by LangChain out of the box.
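As a rough sketch of what "out of the box" can look like, the snippet below wraps the model with LangChain's `HuggingFaceEmbeddings` class from `langchain_community`. It assumes that the `langchain-community` and `sentence-transformers` packages are installed; `build_embedder` and `MODEL_NAME` are hypothetical names used only for this illustration.

```python
# Assumption: `langchain-community` and `sentence-transformers` are installed.
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def build_embedder(model_name: str = MODEL_NAME):
    """Return a LangChain embedding wrapper for a Sentence Transformers model."""
    # Imported lazily so that defining this helper does not require the packages.
    from langchain_community.embeddings import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(model_name=model_name)


# Usage (downloads the model on first run, so it is left commented out here):
# embedder = build_embedder()
# vector = embedder.embed_query("How can I fall asleep faster?")
# len(vector)  # all-MiniLM-L6-v2 produces 384-dimensional vectors
```

Swapping in `all-MiniLM-L12-v2` later would only require passing a different `model_name`.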

### Choosing the Vector Store