<a href="https://colab.research.google.com/github/satya-karthik-r-interview/Quantiphi-round-02/blob/main/01_Q%26A_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Q&A notebook for Quantiphi round 2

## instructions:

*   Run with `GPU` enabled
*   Restart the kernel once the `pip install` cell has completed

### Things taken care of in this notebook
---
✅ Download the pdf from the link above

✅ To make indexing faster, you can pick any 2 chapters from the pdf and treat them as the source [chapters 4 and 5 picked, based on the page numbers]

✅ Use any in-memory vector database if required.

✅ Use any open source HuggingFace model as the LLM Model

#### Output artifacts we need for evaluation:
---

✅ Entire codebase in GitHub with links to access. (N/A)

✅ Please add docstrings wherever necessary.

✅ Additional Colab notebook to run the backend logic and evaluations:

✅ Please add text blocks in your Colab to add scenarios/assumptions etc to make it readable.

❌ Any additional artifacts like system design architecture, assumptions, list of issues you couldn't solve because of time constraints and how you can fix it in future.

#### Additional (bonus):
---
❌ Streamlit/Gradio Frontend to interact with your pipeline

❌ Wrap the entire application inside a docker container

❌ Draft and implement all the necessary APIs using FastAPI or any other python web framework of choice

✅ Produce an alternative way to do RAG without using any library like LangChain, LlamaIndex or Haystack



## Imports

In [1]:
%%capture
%pip install --upgrade pymilvus transformers datasets torch PyPDF2 accelerate bitsandbytes streamlit


<font color="red">Restart the kernel after running the cell above ☝ so the installs take effect.</font>   
Go to `Runtime` >> `Restart session`, or press `Cmd/Ctrl + M .`



In [56]:
from transformers import AutoTokenizer, AutoModel, pipeline, BitsAndBytesConfig, AutoModelForCausalLM
from datasets import Dataset
import torch
import torch.nn.functional as F
import requests
from pymilvus import MilvusClient
from PyPDF2 import PdfReader
from pathlib import Path
import pandas as pd
import nltk
nltk.download('punkt')
from typing import List
from IPython.display import display, Markdown
import streamlit as st
import transformers
transformers.logging.set_verbosity_error()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## setting up huggingface

In [2]:
# log into huggingface hub
from huggingface_hub import notebook_login
notebook_login()
# paste your own Hugging Face access token when prompted; never store a token in the notebook


## prepare the document

In [2]:

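def extract_pdf_text(pdf_url, start_page, end_page, max_words=30):
    """Download a PDF and return the text of a page range as fixed-size word chunks.

    Args:
        pdf_url: URL of the PDF to download.
        start_page: first page to extract (1-indexed, inclusive).
        end_page: last page to extract (1-indexed, inclusive).
        max_words: maximum number of tokens per chunk.

    Returns:
        A list of strings, each holding at most `max_words` tokens.
    """
    # Download the PDF file and fail early on an HTTP error
    response = requests.get(pdf_url)
    response.raise_for_status()
    pdf_file = Path("downloaded_file.pdf")
    pdf_file.write_bytes(response.content)

    def split_text(text, max_words=max_words):
        """Tokenize `text` and join the tokens back into chunks of `max_words`."""
        words = nltk.word_tokenize(text)
        result = []

        for i in range(0, len(words), max_words):
            result.append(' '.join(words[i:i + max_words]))

        return result

    # Extract text from the specified pages
    reader = PdfReader(str(pdf_file))
    text = ""
    for page_num in range(start_page - 1, end_page):
        # guard against pages that yield no text
        text += reader.pages[page_num].extract_text() or ""

    # Split the text into chunks of at most `max_words` tokens each
    # (these are fixed-size chunks, not grammatical sentences)
    sentences = split_text(text)

    # Write the chunks to a text file as a backup
    with open("extracted_text.txt", "w", encoding="utf-8") as file:
        file.write("\n".join(sentences))

    # Remove the downloaded PDF file
    pdf_file.unlink()

    return sentences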
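
The chunking step above can be tried without downloading anything. The block below is a minimal sketch, not the notebook's implementation: `chunk_words` and its `overlap` parameter are hypothetical names, and it splits on whitespace instead of calling `nltk.word_tokenize`. With `overlap=0` it follows the same fixed-size scheme as `split_text`; a small overlap is one possible way to avoid cutting a definition in half at a chunk boundary.

```python
def chunk_words(text, max_words=30, overlap=0):
    """Split text into chunks of at most max_words whitespace-separated words.

    overlap is a hypothetical extra (not in the notebook): the number of
    words each chunk shares with the previous one.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()  # whitespace split; the notebook uses nltk.word_tokenize
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "one two three four five six seven"
print(chunk_words(sample, max_words=3))
print(chunk_words(sample, max_words=3, overlap=1))
```

The first call yields three non-overlapping chunks; the second repeats the last word of each chunk at the start of the next.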

### extract the text data from the pdf

In [3]:
# Example usage
pdf_url = "https://assets.openstax.org/oscms-prodcms/media/documents/ConceptsofBiology-WEB.pdf"
# extracting text for chapter 4 to 5
start_page = 103
end_page = 146
extracted_sentences = extract_pdf_text(pdf_url, start_page, end_page)
print(f"Extracted {len(extracted_sentences)} sentences.")

Extracted 860 sentences.


## sentence encoder for RAG

In [5]:
sentence_MODEL ="sentence-transformers/all-MiniLM-L6-v2"

INFERENCE_BATCH_SIZE = 64  # Batch size of model inference

# Load tokenizer & model from HuggingFace Hub
sentence_tokenizer = AutoTokenizer.from_pretrained(sentence_MODEL)
sentence_model = AutoModel.from_pretrained(sentence_MODEL)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
# build a Hugging Face Dataset from the list of extracted text chunks
rag_dataset = Dataset.from_pandas(pd.DataFrame(extracted_sentences, columns=["text"]))

In [7]:
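def encode_text(batch):
    """Embed batch["text"] using mean pooling followed by L2 normalization.

    Adds a "text_embedding" entry to the batch and returns the batch.
    """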
    # Tokenize sentences
    encoded_input = sentence_tokenizer(
        batch["text"], padding=True, truncation=True, return_tensors="pt"
    )

    # Compute token embeddings
    with torch.no_grad():
        model_output = sentence_model(**encoded_input)

    # Perform pooling
    token_embeddings = model_output[0]
    attention_mask = encoded_input["attention_mask"]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    sentence_embeddings = torch.sum(
        token_embeddings * input_mask_expanded, 1
    ) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Normalize embeddings
    batch["text_embedding"] = torch.nn.functional.normalize(
        sentence_embeddings, p=2, dim=1
    )
    return batch
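
The pooling in `encode_text` is a masked mean over token vectors followed by L2 normalization. The sketch below reproduces that arithmetic on toy numbers with plain Python lists, so it runs without `torch`. The helper names `mean_pool` and `l2_normalize` are illustrative and not part of the notebook.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vector)]
    # same idea as torch.clamp(..., min=1e-9): never divide by zero
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


# two real tokens and one padding token; the padding vector must not affect the result
tokens = [[1.0, 2.0], [3.0, 2.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

Once the vectors have unit length, the dot product of two embeddings equals their cosine similarity, which is why a plain similarity threshold can be applied at retrieval time.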

In [8]:
rag_dataset = rag_dataset.map(encode_text, batched=True, batch_size=INFERENCE_BATCH_SIZE)
rag_data_list = rag_dataset.to_list()

Map:   0%|          | 0/860 [00:00<?, ? examples/s]

In [9]:
# Connection URI. this will write to a file in the current directory
MILVUS_URI = "./milvus.db"
# Collection name
COLLECTION_NAME = "biology_book"
# Embedding dimension; it depends on the model (all-MiniLM-L6-v2 produces 384-dimensional vectors)
DIMENSION = 384

milvus_client = MilvusClient(MILVUS_URI)
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    dimension=DIMENSION,
    auto_id=True,  # Enable auto id
    enable_dynamic_field=True,  # Enable dynamic fields
    vector_field_name="text_embedding",  # Map vector field name and embedding column in dataset
    consistency_level="Strong",
)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 21361acf566e4ede869e04dc6120a125
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created collection: biology_book
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created an index on collection: biology_book


### Insert data into the vector database

In [10]:
%%capture
milvus_client.insert(collection_name=COLLECTION_NAME, data=rag_data_list)

### Retrieve data

In [11]:
def extract_context(results: List[List[dict]], threshold: float) -> str:
    """Join the text of every hit for the first query whose score is above `threshold`.

    If no hit passes the threshold, return an instruction telling the model
    to say that nothing was found.
    """
    results = results[0]
    context = ""
    for r in results:
        if r["distance"] > threshold:
            context += r["entity"]["text"] + "\n"
    if context == "":
        context = "answer by saying couldn't find what you are looking for."
    return context

In [12]:
def get_context(milvus_client, collection_name: str, query: str, threshold: float = 0.4, limit: int = 5, output_fields=None):
    """Embed `query`, search the collection, and return the hits above `threshold` as one string."""
    if output_fields is None:
        output_fields = ["text"]
    batch = {"text": [query]}
    query_embedding = [v.tolist() for v in encode_text(batch)["text_embedding"]]
    search_results = milvus_client.search(
        collection_name=collection_name,
        data=query_embedding,
        limit=limit,
        output_fields=output_fields,
    )
    return extract_context(search_results, threshold)
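
To see what the search and the threshold do together, the retrieval step can be imitated with a brute-force scan over a few toy vectors. This is a sketch under stated assumptions, not a description of how Milvus works internally: `brute_force_search` is a hypothetical helper, the vectors are assumed to be L2-normalized, and a higher score is assumed to mean a better match, which is what the `>` comparison in `extract_context` implies. The return value copies the shape that `extract_context` reads: one list of hits per query, each hit carrying `"distance"` and `"entity"`.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query_vec, rows, limit=5):
    """Score every row against the query and return the best `limit` hits."""
    hits = [
        {
            "distance": dot(query_vec, row["text_embedding"]),
            "entity": {"text": row["text"]},
        }
        for row in rows
    ]
    hits.sort(key=lambda h: h["distance"], reverse=True)
    return [hits[:limit]]  # outer list: one entry per query


# toy rows with made-up 2-dimensional unit vectors
rows = [
    {"text": "about autotrophs", "text_embedding": [1.0, 0.0]},
    {"text": "about the Calvin cycle", "text_embedding": [0.6, 0.8]},
    {"text": "unrelated", "text_embedding": [0.0, 1.0]},
]

results = brute_force_search([1.0, 0.0], rows, limit=2)
kept = [h["entity"]["text"] for h in results[0] if h["distance"] > 0.4]
print(kept)
```

Here `limit` caps how many hits come back, and the threshold then discards weak matches among them, so the context can hold fewer than `limit` chunks or none at all.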

In [13]:
query = "what is autotroph?"
get_context(milvus_client, COLLECTION_NAME, query)

'is an organism that can pr oduc e its o wn f ood . The Gr eek r oots o f the w ordautotroph mean “ self ” ( auto\nphot osynthesis . Solar Dependenc e and F ood P roduc tion Some or ganisms can carr y out phot osynthesis , wher eas others can not . An autotroph\n'

## answer generation

In [21]:
gen_model_name = "nvidia/Llama3-ChatQA-1.5-8B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

gen_tokenizer = AutoTokenizer.from_pretrained(gen_model_name)
gen_tokenizer.pad_token_id = gen_tokenizer.eos_token_id
gen_model = AutoModelForCausalLM.from_pretrained(gen_model_name, quantization_config=bnb_config)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [22]:
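def get_formatted_input(messages, context):
    """Build the prompt for the ChatQA model: system text, retrieved context, then the conversation.

    Note: the instruction is prepended to the first user message in `messages` in place.
    """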
    system = """System: This is a chat between a user and an artificial intelligence assistant.
    The assistant gives helpful, detailed, and polite answers to the user's questions based on the context.
    The assistant should also indicate when the answer cannot be found in the context."""
    instruction = "Please give a full and complete answer for the question."

    for item in messages:
        if item['role'] == "user":
            ## only apply this instruction for the first user turn
            item['content'] = instruction + " " + item['content']
            break

    conversation = '\n\n'.join(["User: " + item["content"] if item["role"] == "user" else "Assistant: " + item["content"] for item in messages]) + "\n\nAssistant:"
    formatted_input = system + "\n\n" + context + "\n\n" + conversation

    return formatted_input
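
Because `get_formatted_input` edits the first user message in place, calling it twice on the same `messages` list would prepend the instruction twice. The block below sketches a variant that leaves the caller's list untouched. `build_prompt` is a hypothetical name, and the system text is written on a single line here, whereas the notebook's triple-quoted string also carries line breaks and indentation.

```python
SYSTEM = (
    "System: This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions "
    "based on the context. The assistant should also indicate when the answer cannot "
    "be found in the context."
)
INSTRUCTION = "Please give a full and complete answer for the question."


def build_prompt(messages, context):
    """Hypothetical non-mutating variant of get_formatted_input."""
    turns = []
    instruction_used = False
    for item in messages:
        content = item["content"]
        if item["role"] == "user":
            if not instruction_used:
                # only the first user turn receives the instruction
                content = INSTRUCTION + " " + content
                instruction_used = True
            turns.append("User: " + content)
        else:
            turns.append("Assistant: " + content)
    conversation = "\n\n".join(turns) + "\n\nAssistant:"
    return SYSTEM + "\n\n" + context + "\n\n" + conversation


messages = [{"role": "user", "content": "what is autotroph?"}]
prompt = build_prompt(messages, "An autotroph is an organism that can produce its own food.")
print(prompt)
```

The prompt ends with `Assistant:` so that the model continues from that point with its answer.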

In [50]:
def generate_answer(milvus_client, collection_name, user_question) -> str:
    """Retrieve context for `user_question`, generate an answer, and return a formatted report."""
    context = get_context(milvus_client, collection_name, user_question)
    messages = [
        {"role": "user", "content": f"{user_question}"}
    ]

    # uncomment the next line to generate an answer without any retrieved context
    # context = ""
    formatted_input = get_formatted_input(messages, context)
    tokenized_prompt = gen_tokenizer(gen_tokenizer.bos_token + formatted_input, return_tensors="pt").to(gen_model.device)

    terminators = [
        gen_tokenizer.eos_token_id,
        gen_tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = gen_model.generate(input_ids=tokenized_prompt.input_ids, attention_mask=tokenized_prompt.attention_mask, max_new_tokens=128, eos_token_id=terminators)

    response = outputs[0][tokenized_prompt.input_ids.shape[-1]:]
    answer = gen_tokenizer.decode(response, skip_special_tokens=True)
    final_answer = f"""
    # Question:
    {user_question}
    =================================================================================================
    # Context:
    {context}
    =================================================================================================
    # Answer:
    {answer}
    """
    return final_answer



## testing the model answers

In [51]:
question = "what is mesophyll?"
print(generate_answer(milvus_client, COLLECTION_NAME, question))


    # Question:
    what is mesophyll?
    # Context:
    Chlor ophyll is r esponsible f or the gr een c olor o f plants . The th ylak oid118 5 • Pho tosynthesis Access f or free at opens
chlor ophyll . 4.From wher e does a het erotroph dir ectly ob tain its ener gy ? a.the sun b.the sun and eating other or ganisms c.eating other or
ee fr om an at om o f the chlor ophyll molecule . Chlor ophyll is ther efore said t o “ donat e ” an electr on ( Figure
see as the c ommon gr een c olor associat ed with plants . Chlor ophyllaabsorbs w avelengths fr om either end o f the visible spectrum ( blue and
gen and h ydrogen ions ar e also f ormed fr om the split ting o f water . The r eplacing o f the electr on enables chlor ophyll

    # Answer:
     Mesophyll is the tissue in the interior of a leaf.
    


In [52]:
question = "what is autotroph?"
print(generate_answer(milvus_client, COLLECTION_NAME, question))


    # Question:
    what is autotroph?
    # Context:
    is an organism that can pr oduc e its o wn f ood . The Gr eek r oots o f the w ordautotroph mean “ self ” ( auto
phot osynthesis . Solar Dependenc e and F ood P roduc tion Some or ganisms can carr y out phot osynthesis , wher eas others can not . An autotroph

    # Answer:
     Autotrophs are organisms that can produce their own food. The Greek roots of the word autotroph mean “self” (auto) and “food” (troph).
    


In [53]:
question = "what is Calvin cycle?"
print(generate_answer(milvus_client, COLLECTION_NAME, question))


    # Question:
    what is Calvin cycle?
    # Context:
    the g as that animals e xhale with each br eath . The Calvin cy cleis the t erm used f or the r eactions o f phot osynthesis that
the Cal vin cy cle •Define carbon fixation •Explain ho w phot osynthesis w orks in the ener gy cycle o f all living or ganisms After the ener gy
eactions function as a cycle . Others cal l it the Cal vin-Benson cy cle t o include the name o f another scientis t involved in its disc overy
forms pr eliminar y reactions o f the Calvin cy cle at night , because opening the s tomata at this time c onser ves w ater due t o
Cal vin cycle . These v ariations incr ease efficiency and help c onser ve water and ener gy . ( credit : Piotr W ojtkowski ) Photosynthesis in P

    # Answer:
     The Calvin cycle is the term used for the reactions of photosynthesis that function as a cycle. Others call it the Calvin-Benson cycle to include the name of another scientist involved in its discovery.
    
