# Introduction
This is a simple retrieval-augmented generation (RAG) implementation that runs its models locally through ollama.

The local (context) dataset is sliced, or "chunked", and each chunk is vectorized using an embedding model served by ollama. The retrieval function takes a query and returns the top N most relevant chunks, ranked by cosine similarity. An `instruction_prompt` acts as a guardrail that focuses ollama's response on the retrieved context.

The user input is used to retrieve the most relevant chunks from the vector database. These chunks are embedded in the `instruction_prompt`, which is passed together with the user input to ollama's `chat()` function to get a response.
***

Areas for improvement:
1. Replace plain cosine-similarity ranking with a more robust, context-aware retrieval method.
2. Move `VECTOR_DB` from an in-memory list to storage that scales.
3. Use a different chunking method instead of treating each line as a chunk, as is done here.
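
As one possible direction for the third item, chunks could be built from overlapping windows of words rather than from single lines. The sketch below is only an illustration: the function name `chunk_words` and the `size` and `overlap` parameters are hypothetical and do not appear in this notebook.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError('size must be positive and 0 <= overlap < size')
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, at the cost of storing some words twice.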

***

Pre-work:

1. Install the ollama CLI:

   ```
   curl -fsSL https://ollama.com/install.sh | sh
   ```

2. Download the embedding model and the language model from Hugging Face:

   ```
   ollama pull hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
   ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
   ```

3. Install the ollama Python package:

   ```
   pip install ollama
   ```

4. Download the dataset:

   ```
   mkdir -p datasets && cd datasets && curl -L -O https://huggingface.co/ngxson/demo_simple_rag_py/resolve/main/cat-facts.txt && cd ..
   ```

# Load dataset

In [4]:
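# Read the dataset; each line holds one fact and becomes one chunk.
# Note that readlines() keeps the trailing newline on every line.
with open('datasets/cat-facts.txt', 'r', encoding='utf-8') as file:
  dataset = file.readlines()
print(f'Loaded {len(dataset)} entries')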

Loaded 150 entries


# Implement the vector database

- The embedding model served by ollama converts each chunk to an embedding vector.
- Each chunk and its corresponding vector are then stored together in a list.
- For simplicity, this notebook treats each line in the dataset as a chunk.
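
Because the store is just a list of (chunk, embedding) pairs, it could be written to disk so that embeddings are not recomputed on every run. The following is a minimal sketch under that assumption; `save_vector_db` and `load_vector_db` are hypothetical helper names, not part of this notebook or of the ollama package.

```python
import json

def save_vector_db(vector_db, path):
    # JSON has no tuple type, so each (chunk, embedding) pair is written as a two-element list
    with open(path, 'w', encoding='utf-8') as f:
        json.dump([[chunk, embedding] for chunk, embedding in vector_db], f)

def load_vector_db(path):
    # Rebuild the list of (chunk, embedding) tuples from the saved two-element lists
    with open(path, 'r', encoding='utf-8') as f:
        return [(chunk, embedding) for chunk, embedding in json.load(f)]
```

A JSON file is only a convenience for small datasets like this one; it does not address the scaling concern listed under areas for improvement.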

In [5]:
import ollama

EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'

# Each element in the VECTOR_DB will be a tuple (chunk, embedding)
# The embedding is a list of floats, for example: [0.1, 0.04, -0.34, 0.21, ...]
VECTOR_DB = []

def add_chunk_to_database(chunk):
  embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
  VECTOR_DB.append((chunk, embedding))

for i, chunk in enumerate(dataset):
  add_chunk_to_database(chunk)
  print(f'Added chunk {i+1}/{len(dataset)} to the database')

Added chunk 1/150 to the database
Added chunk 2/150 to the database
Added chunk 3/150 to the database
...

# The retrieval function
The retrieval function takes a query and returns the top N most relevant chunks according to a chosen similarity measure, which is cosine similarity in this example.
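
For reference, the standard definition of cosine similarity between a query embedding $a$ and a chunk embedding $b$ is given below; the symbols $a$ and $b$ are chosen here to match the argument names of the function in this section.

```latex
\operatorname{cos\_sim}(a, b)
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
```

As a small worked case, for $a = (1, 0)$ and $b = (1, 1)$ the dot product is $1$ and the norms are $1$ and $\sqrt{2}$, so the similarity is $1/\sqrt{2} \approx 0.707$. Values closer to 1 mean the vectors point in more similar directions.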

In [12]:
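def cosine_similarity(a, b):
  # Dot product of the two vectors divided by the product of their lengths
  dot_product = sum(x * y for x, y in zip(a, b))
  norm_a = sum(x ** 2 for x in a) ** 0.5
  norm_b = sum(x ** 2 for x in b) ** 0.5
  if norm_a == 0 or norm_b == 0:
    return 0.0  # a zero vector has no direction; treat it as unrelated
  return dot_product / (norm_a * norm_b)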


def retrieve(query, top_n=3):
  query_embedding = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
  # temporary list to store (chunk, similarity) pairs
  similarities = []
  for chunk, embedding in VECTOR_DB:
    similarity = cosine_similarity(query_embedding, embedding)
    similarities.append((chunk, similarity))
  # sort by similarity in descending order, because higher similarity means more relevant chunks
  similarities.sort(key=lambda x: x[1], reverse=True)
  # finally, return the top N most relevant chunks
  return similarities[:top_n]
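
The ranking step can be tried without a running model by using hand-made vectors. The sketch below is an illustration only: `top_n_chunks` and `toy_db` are hypothetical names, and it swaps the full sort for `heapq.nlargest`, which avoids sorting every entry when only a few results are needed.

```python
import heapq

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x ** 2 for x in a) ** 0.5
    norm_b = sum(x ** 2 for x in b) ** 0.5
    return dot_product / (norm_a * norm_b)

def top_n_chunks(query_embedding, vector_db, top_n=3):
    # Score every stored chunk against the query, then keep the top_n highest scores
    scored = ((chunk, cosine_similarity(query_embedding, emb)) for chunk, emb in vector_db)
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[1])

# Toy two-dimensional "embeddings" standing in for real model output
toy_db = [
    ('chunk about sleep', [1.0, 0.0]),
    ('chunk about diet', [0.0, 1.0]),
    ('chunk about both', [1.0, 1.0]),
]
results = top_n_chunks([1.0, 0.0], toy_db, top_n=2)
for chunk, similarity in results:
    print(f'{similarity:.2f} {chunk}')
```

With this toy data the query vector matches the first chunk exactly, and the mixed chunk ranks second.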


# Generation
Output generation depends on two inputs: the input query and the instruction prompt. The former is what the user entered in the chat; the latter acts as a 'guardrail' that restricts the answer to the context retrieved from the local dataset.
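
The way retrieved chunks are turned into the instruction prompt can be isolated in a small helper. This is a sketch, not the notebook's own code: `build_instruction_prompt` is a hypothetical name, and stripping each chunk is an added assumption meant to avoid blank lines caused by trailing newlines in the dataset.

```python
def build_instruction_prompt(retrieved_knowledge):
    # retrieved_knowledge is a list of (chunk, similarity) pairs; only the chunk text is used
    context_lines = [f' - {chunk.strip()}' for chunk, _similarity in retrieved_knowledge]
    context = '\n'.join(context_lines)
    return (
        'You are a helpful chatbot.\n'
        'Use only the following pieces of context to answer the question. '
        "Don't make up any new information:\n"
        f'{context}\n'
    )
```

Keeping this in a function makes it easy to inspect the exact system message before it is sent to the model.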

In [13]:
# Prepare the input query

# input_query = input('Ask me a question: ')
input_query = 'tell me about the peculiarities of cats'
retrieved_knowledge = retrieve(input_query)

print('Retrieved knowledge:')
for chunk, similarity in retrieved_knowledge:
  print(f' - (similarity: {similarity:.2f}) {chunk}')

line_break = '\n'
instruction_prompt = f'''You are a helpful chatbot.
Use only the following pieces of context to answer the question. Don't make up any new information:
{line_break.join([f' - {chunk}' for chunk, similarity in retrieved_knowledge])}
'''

print(instruction_prompt)

Retrieved knowledge:
 - (similarity: 0.69) The color of the points in Siamese cats is heat related. Cool areas are darker.

 - (similarity: 0.68) A form of AIDS exists in cats.

 - (similarity: 0.68) Two members of the cat family are distinct from all others: the clouded leopard and the cheetah. The clouded leopard does not roar like other big cats, nor does it groom or rest like small cats. The cheetah is unique because it is a running cat; all others are leaping cats. They are leaping cats because they slowly stalk their prey and then leap on it.

You are a helpful chatbot.
Use only the following pieces of context to answer the question. Don't make up any new information:
 - The color of the points in Siamese cats is heat related. Cool areas are darker.

 - A form of AIDS exists in cats.

 - Two members of the cat family are distinct from all others: the clouded leopard and the cheetah. The clouded leopard does not roar like other big cats, nor does it groom or rest like small cats. 

In [15]:
# call Ollama's chat function with the instruction prompt and the input query
# the chat function will return a generator that will yield the chatbot's responses
# one by one, as they are generated
# the stream=True argument is used to enable real-time responses
# the LANGUAGE_MODEL is the model that will be used to generate the responses

LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'

stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

# print the response from the chatbot in real-time
print('Chatbot response:')
for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)

Chatbot response:
The curious case of feline characteristics! As you mentioned, cool areas appear darker in Siamese cats due to a heat-related phenomenon, which is an interesting aspect of their behavior. This unique trait can be attributed to the fact that cooler skin absorbs more light, making them appear darker.

As for two members of the cat family being distinct from all others: the clouded leopard and the cheetah - it's actually the opposite! The cheetah is a running cat because they don't stalk prey like other big cats, but leap on their prey instead. And while the clouded leopard does groom and rest like small cats, there are no members of this family that "roar" or engage in leisurely activities like grooming.

The clouded leopard is actually one of the most agile and flexible cats in the world, with a slender build and incredibly short legs. They're well-suited for their arboreal lifestyle, spending much of their time in trees. The cheetah, on the other hand, has long legs th
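
If the full answer is needed after streaming, for example to log it or compare it against the retrieved context, the streamed pieces can be collected while they are printed. The sketch below assumes each streamed item can be indexed as `chunk['message']['content']`, as in the cell above; `collect_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for a real `ollama.chat(..., stream=True)` call so the example runs without a model.

```python
def collect_stream(stream):
    # Print each piece as it arrives and also keep it for later use
    parts = []
    for chunk in stream:
        piece = chunk['message']['content']
        print(piece, end='', flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return ''.join(parts)

# Stand-in for a real streamed response
fake_stream = [
    {'message': {'content': 'Cats '}},
    {'message': {'content': 'sleep a lot.'}},
]
answer = collect_stream(fake_stream)
```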