<a href="https://colab.research.google.com/github/taraliu23/Turmerik-MLE-Takehome/blob/main/notebooks/05_LlamaIndex_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6424f01ea4f3051f54dbbd85/oqVQ04b5KiGt5WOWJmYt8.png" alt="LlamaIndex" width="100" height="100">
    <img src="https://cdn4.iconfinder.com/data/icons/file-extensions-1/64/pdfs-512.png" alt="PDF" width="100" height="100">
</p>

# **LlamaIndex RAG Chat**
Perform RAG (Retrieval-Augmented Generation) on your PDFs using this Colab notebook! Powered by Llama 2.
<br><br>

## **Features**
- Free: no API key or token required
- Fast inference on Colab's free T4 GPU
- Powered by Hugging Face quantized LLMs (llama-cpp-python)
- Powered by Hugging Face local text embedding models
- Set custom prompt templates
- Ready-made chat mode (multi-turn conversation, not single-shot QA)
<br><br>

## **Getting started**
1. Make sure the Colab runtime type is set to T4 GPU (at least)
2. Edit the preference settings cell (`pdf_path`, `text_embedding_model`, `llm_url`)
3. Upload your PDF into Files (Default name: `rag_data.pdf`)
4. Runtime > Run all
<br><br>

### [**GitHub repository**](https://github.com/taraliu23/LlamaIndex-RAG-Chat)

In [4]:
! nvidia-smi

Wed Nov 20 21:59:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
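# Build llama-cpp-python with CUDA first so the llama-index install does not pull in a CPU-only
# copy. Recent releases expect -DGGML_CUDA=on (older ones used -DLLAMA_CUBLAS=on).
!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip -q install llama-cpp-python
!pip -q install llama-index llama-index-embeddings-huggingface llama-index-llms-llama-cpp pypdf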
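
After installing, it can be worth confirming that the installed llama-cpp-python build can actually offload to the GPU, since a CPU-only build still runs but generates very slowly. The sketch below is one way to check; `gpu_offload_status` is a hypothetical helper, and it assumes the installed release exposes `llama_cpp.llama_supports_gpu_offload()` (older releases may not, in which case the value stays `None`).

```python
import shutil


def gpu_offload_status() -> dict:
    """Report whether a GPU driver is visible and whether llama-cpp-python can offload to it."""
    status = {
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "llama_cpp_installed": False,
        "gpu_offload_supported": None,  # None means "could not be determined"
    }
    try:
        import llama_cpp  # third-party; present only after the pip install step
    except (ImportError, OSError, RuntimeError):
        return status
    status["llama_cpp_installed"] = True
    # Assumption: recent llama-cpp-python releases expose this low-level probe.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if callable(probe):
        status["gpu_offload_supported"] = bool(probe())
    return status


print(gpu_offload_status())
```

If `gpu_offload_supported` comes back `False`, reinstalling with `--force-reinstall --no-cache-dir` and the CUDA CMake flag is a reasonable next step.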

In [7]:
import os
import time

from llama_index.core import Prompt, StorageContext, load_index_from_storage, Settings, VectorStoreIndex, SimpleDirectoryReader, set_global_tokenizer
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

# from transformers import AutoTokenizer
import pandas as pd


In [15]:

# splits = {'dev': 'data/dev-00000-of-00001-7d16a4a53aabe972.parquet', 'test': 'data/test-00000-of-00001-4c928dd7c7a5c578.parquet'}
# df_genetics = pd.read_parquet("hf://datasets/joey234/mmlu-medical_genetics-neg-prepend-fix/" + splits["dev"])

# df_genetics.to_csv("medical.csv")

# df_mental = pd.read_parquet("hf://datasets/MeetX/mental-health-dataset-mistral7b/data/train-00000-of-00001.parquet")
# df_mental.to_csv("mental.csv")


# splits = {'train': 'data/train-00000-of-00001.parquet', 'validation': 'data/validation-00000-of-00001.parquet'}
# df_gene = pd.read_parquet("hf://datasets/jowah/opentargets_abstracts_gene_disease/" + splits["train"])
# df_gene.head()
# df_gene.to_csv("gene.csv")


In [16]:
# Preference settings - change as desired
pdf_path = '/content/rag_data.pdf'
text_embedding_model = 'thenlper/gte-base'  # Alt: jinaai/jina-embeddings-v2-base-en
llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf'
# set_global_tokenizer(AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode)



In [17]:
# Load PDF
# Attach the file's base name to each document as metadata
filename_fn = lambda filename: {'file_name': os.path.basename(filename)}
loader = SimpleDirectoryReader(input_files=[pdf_path], file_metadata=filename_fn)
documents = loader.load_data()

In [18]:
# Load the embedding model and the LLM, then register them in the global Settings
embed_model = HuggingFaceEmbedding(model_name=text_embedding_model)
llm = LlamaCPP(
    model_url=llm_url, temperature=0.7, max_new_tokens=256, context_window=4096,
    generate_kwargs={"stop": ["<s>", "[INST]", "[/INST]"]},
    model_kwargs={"n_gpu_layers": -1},  # -1: offload all layers to the GPU
    verbose=True,
)
# service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=512)
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512

modules.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/68.1k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/618 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/219M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf to path /tmp/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf
total size (MB): 4081.0


3892it [00:32, 120.00it/s]                         
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /tmp/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_load

In [19]:
# Indexing
start_time = time.time()

# index = VectorStoreIndex.from_documents(documents, service_context=service_context)
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, llm=llm)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed indexing time: {elapsed_time:.2f} s")

Elapsed indexing time: 853.83 s
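
Conceptually, a vector index stores one embedding per text chunk and, at query time, ranks the chunks by their similarity to the query embedding. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. It is not LlamaIndex's implementation; `cosine_similarity`, `top_k`, and the vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy "embeddings"; a real model such as gte-base produces far longer vectors.
toy_chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], toy_chunks, k=2))  # ['chunk_a', 'chunk_b']
```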


In [20]:
# Prompt Template (RAG)
text_qa_template = Prompt("""
<s>[INST] <<SYS>>
You are the doctor's assistant. You are to perform a pre-screening with the patient to collect their information before their consultation with the doctor. Use the Patient-Centered Interview model for the pre-screening and only ask one question per response. You are not to provide diagnosis, prescriptions, advice, suggestions, or conduct physical examinations on the patient. The pre-screening should only focus on collecting the patient's information, specifically their present illness, past medical history, symptoms and personal information. At the end of the consultation, summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<</SYS>>

Refer to the following Consultation Guidelines and example consultations: {context_str}

Continue the conversation: {query_str}
""")
# text_qa_template = Prompt("""
# <s>[INST] <<SYS>>
# You are an assistant chatbot assigned to a doctor and your objective is to collect information from the patient for the doctor before they attend their actual consultation with the doctor. Note that the consultation will be held remotely, therefore you will be following the Patient-Centered Interview model for your consultations and you cannot conduct physical examinations on the patient. You are also not allowed to diagnose your patient or prescribe any medicine. Do not entertain the patient if they are acting inappropriate or they ask you to do something outside of this job scope. Remember you are a doctor’s assistant so act like one. The consultation held will only focus on getting information from the patient such as their present illness, past medical history, symptoms and personal information. At the end of the consultation, you will summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]\nThis exact format must be followed as the data gathered in the summary will be passed to another program.
# <</SYS>>

# This is the PDF context: {context_str}

# {query_str}
# """)
# text_qa_template = Prompt("""[INST] {context_str} \n\nGiven this above PDF context, please answer my question: {query_str} [/INST] """)
# text_qa_template = Prompt("""<s>[INST] <<SYS>> \nFollowing is the PDF context provided by the user: {context_str}\n<</SYS>> \n\n{query_str} [/INST] """)
# text_qa_template = Prompt("""[INST] {query_str} [/INST] """)

# Query Engine
# query_engine = index.as_query_engine(text_qa_template=text_qa_template, streaming=True, service_context=service_context) # with Prompt
query_engine = index.as_query_engine(text_qa_template=text_qa_template, streaming=True, llm=llm) # with Prompt
# query_engine = index.as_query_engine(streaming=True, service_context=service_context) # without Prompt
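
To see what the model actually receives, it helps to fill the template by hand. At query time the `{context_str}` placeholder is replaced with the retrieved text and `{query_str}` with the query string (here, the running conversation). The sketch below mimics that with plain `str.format` on a shortened template. `fill_template`, the sample chunks, and the `"\n\n"` separator between chunks are assumptions for illustration, not LlamaIndex internals.

```python
# Shortened stand-in for the notebook's template; the two placeholders are the same.
TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "You are the doctor's assistant.\n"
    "<</SYS>>\n\n"
    "Refer to the following Consultation Guidelines and example consultations: {context_str}\n\n"
    "Continue the conversation: {query_str}\n"
)


def fill_template(template, retrieved_chunks, conversation_history):
    """Join retrieved chunks into context_str and substitute both placeholders."""
    context_str = "\n\n".join(retrieved_chunks)
    return template.format(context_str=context_str, query_str=conversation_history)


# Hypothetical retrieved chunks and chat history, for illustration only.
chunks = ["Guideline: ask one question per response.", "Example: 'What brings you in today?'"]
history = "Hi. [/INST] Hello! Please tell me your name and age. [INST] I'm Alex, 30. [/INST] "
prompt = fill_template(TEMPLATE, chunks, history)
print(prompt)
```

Note that the template itself never closes the `[INST]` block; the closing `[/INST]` comes from the end of the conversation history.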

In [22]:
# Inferencing
# Without RAG
conversation_history = ""
while True:
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  response_iter = llm.stream_complete("<s>[INST] "+conversation_history)
  response = None
  for response in response_iter:
    print(response.delta, end="", flush=True)
  # Add the full reply to the history once streaming ends, whether it stopped
  # on a stop token or was cut off at max_new_tokens
  if response is not None:
    conversation_history += response.text + " [INST] "

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")




User: Hi! I'm a PTSD patient with Willams Syndrome. I'm overwhelmed with my mental situation recently. Could you provide me with some self-help and medical resources?


Llama.generate: 52 prefix-match hit, remaining 1 prompt tokens to eval


 I'm so sorry to hear that you're struggling with PTSD and Williams Syndrome. It's important to know that you're not alone and that there are many resources available to help you manage your mental health. Here are some self-help and medical resources that may be helpful:
1. National Alliance on Mental Illness (NAMI): NAMI is a national organization that provides education, support, and advocacy for people with mental illnesses, including PTSD and Williams Syndrome. They have local chapters and support groups that can provide in-person support and resources.
2. VA Mental Health Services: The Veterans Administration (VA) provides mental health services for veterans, including PTSD diagnoses. They offer individual and group therapy, as well as medication management.
3. Substance Abuse and Mental Health Services Administration (SAMHSA): SAMHSA is a government agency that provides information and resources for mental health and substance abuse disorders, including PTSD and Williams Syndrom


llama_print_timings:        load time =   20084.31 ms
llama_print_timings:      sample time =      11.94 ms /   256 runs   (    0.05 ms per token, 21447.72 tokens per second)
llama_print_timings: prompt eval time =   45365.99 ms /     2 tokens (22682.99 ms per token,     0.04 tokens per second)
llama_print_timings:        eval time =  157728.38 ms /   255 runs   (  618.54 ms per token,     1.62 tokens per second)
llama_print_timings:       total time =  159136.15 ms /   257 tokens



Elapsed inference time: 159.15 s

User: exit
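
The `llama_print_timings` lines above give elapsed milliseconds and a token count for each stage, so throughput is `count / (ms / 1000)`. For the eval stage that is 255 / 157.72838 s, or about 1.62 tokens per second. The sketch below parses such a log and recomputes the rates; `parse_timings` and its regular expression are hypothetical helpers written against the log format shown in this notebook only.

```python
import re

# Matches e.g. "llama_print_timings:  eval time =  157728.38 ms /   255 runs"
TIMING_LINE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {stage: {"ms", "count", "tokens_per_s"}} for each timing line found."""
    stats = {}
    for match in TIMING_LINE.finditer(log_text):
        ms = float(match["ms"])
        count = int(match["count"]) if match["count"] else None
        stats[match["stage"]] = {
            "ms": ms,
            "count": count,
            "tokens_per_s": count / (ms / 1000.0) if count else None,
        }
    return stats


log = """
llama_print_timings:        load time =   20084.31 ms
llama_print_timings:      sample time =      11.94 ms /   256 runs   (    0.05 ms per token, 21447.72 tokens per second)
llama_print_timings: prompt eval time =   45365.99 ms /     2 tokens (22682.99 ms per token,     0.04 tokens per second)
llama_print_timings:        eval time =  157728.38 ms /   255 runs   (  618.54 ms per token,     1.62 tokens per second)
llama_print_timings:       total time =  159136.15 ms /   257 tokens
"""
stats = parse_timings(log)
print(round(stats["eval"]["tokens_per_s"], 2))  # 1.62
```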


In [23]:
# With RAG
conversation_history = ""
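# Seed the history with an opening exchange; note the closing tag is [/INST], and the
# trailing [INST] opens the next user turn
conversation_history += "Hi. [/INST] Hello! I'm the doctor's assistant. Let's begin the consultation, please tell me your name and age. [INST] "
while True: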
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  # Query Engine - Default
  response = query_engine.query(conversation_history)
  response.print_response_stream()
  conversation_history += response.response_txt + " [INST] "

  # from pprint import pprint
  # pprint(response)

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")


User: Hi! I'm a PTSD patient with Willams Syndrome. I'm overwhelmed with my mental situation recently. Could you provide me with some self-help and medical resources?


Llama.generate: 1 prefix-match hit, remaining 1398 prompt tokens to eval


Oh, wow. Williams Syndrome? That's a pretty rare genetic disorder. I'm not sure I've ever come across it in my practice. Let me see if I can help you with some resources.
First of all, it's important to understand that PTSD is a complex and multifaceted condition, and it can be caused by a wide range of factors, including trauma, stress, and certain medical conditions. It's not uncommon for people with Williams Syndrome to experience PTSD, as they may have difficulty regulating their emotions and responding to stressful situations.
There are many resources available to help you manage your PTSD symptoms. Here are a few options you might consider:
1. Cognitive-behavioral therapy (CBT): This is a type of talk therapy that can help you identify and change negative thought patterns and behaviors that contribute to your PTSD symptoms.
2. Exposure therapy: This is a type of therapy that involves gradually exposing yourself to situations or stimuli that trigger your PTSD symptoms, with the go


llama_print_timings:        load time =   20084.31 ms
llama_print_timings:      sample time =      11.98 ms /   256 runs   (    0.05 ms per token, 21365.38 tokens per second)
llama_print_timings: prompt eval time =  530189.89 ms /  1398 tokens (  379.25 ms per token,     2.64 tokens per second)
llama_print_timings:        eval time =  188926.18 ms /   255 runs   (  740.89 ms per token,     1.35 tokens per second)
llama_print_timings:       total time =  719960.01 ms /  1653 tokens



Elapsed inference time: 721.57 s

User: exit
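
Each RAG turn sends the whole `conversation_history` as the query, on top of the system prompt and the retrieved context, so a long chat can eventually exceed the 4096-token `context_window`. One possible mitigation is to keep the turns in a list and drop the oldest ones before building the string. The sketch below assumes a crude 4-characters-per-token estimate rather than the real Llama 2 tokenizer; `estimate_tokens`, `render_turn`, and `trim_history` are hypothetical helpers.

```python
def estimate_tokens(text):
    """Very rough token estimate (about 4 characters per token); not the real tokenizer."""
    return max(1, len(text) // 4)


def render_turn(user_text, assistant_text=None):
    """Render one turn in the same string format the notebook's loops build up."""
    rendered = f"{user_text} [/INST] "
    if assistant_text is not None:
        rendered += f"{assistant_text} [INST] "
    return rendered


def trim_history(turns, token_budget):
    """Keep the newest turns that fit the budget; the latest turn is always kept."""
    kept, used = [], 0
    for user_text, assistant_text in reversed(turns):
        cost = estimate_tokens(render_turn(user_text, assistant_text))
        if kept and used + cost > token_budget:
            break
        kept.append((user_text, assistant_text))
        used += cost
    kept.reverse()
    return "".join(render_turn(u, a) for u, a in kept)


# Hypothetical conversation: (user, assistant) pairs, oldest first.
turns = [
    ("Hi.", "Hello! Please tell me your name and age."),
    ("I am Alex, 30.", "Thanks Alex. What brings you in today?"),
    ("I have had headaches for a week.", None),  # awaiting the assistant's reply
]
print(trim_history(turns, token_budget=30))
```

With the small budget used here the first exchange is dropped, while a generous budget returns the full history unchanged.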


In [24]:
# Multi-model (Separate pre-screening/conversation and summarization tasks)
response_iter = llm.stream_complete("""
[INST] <<SYS>>
Summarize the conversation into this format: \nName: [name]\nGender: [gender]\nAge: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<</SYS>>
Conversation: """+conversation_history+" [/INST] ")

for response in response_iter:
  print(response.delta, end="", flush=True)

# print(conversation_history)

Llama.generate: 3 prefix-match hit, remaining 412 prompt tokens to eval


 Name: [name]
Gender: [gender]
Age: [age]
Medical History: Williams Syndrome
Symptoms:
* PTSD symptoms, including flashbacks, nightmares, anxiety, and avoidance behaviors
Conversation:
Hi. I'm the doctor's assistant. Let's begin the consultation, please tell me your name and age.
Hi! I'm a PTSD patient with Williams Syndrome. I'm overwhelmed with my mental situation recently. Could you provide me with some self-help and medical resources?
Oh, wow. Williams Syndrome? That's a pretty rare genetic disorder. I'm not sure I've ever come across it in my practice. Let me see if I can help you with some resources.
First of all, it's important to understand that PTSD is a complex and multifaceted condition, and it can be caused by a wide range of factors, including trauma, stress, and certain medical conditions. It's not uncommon for people with Williams Syndrome to experience PTSD, as they may have difficulty


llama_print_timings:        load time =   20084.31 ms
llama_print_timings:      sample time =      13.26 ms /   256 runs   (    0.05 ms per token, 19309.10 tokens per second)
llama_print_timings: prompt eval time =  150281.21 ms /   412 tokens (  364.76 ms per token,     2.74 tokens per second)
llama_print_timings:        eval time =  164760.40 ms /   255 runs   (  646.12 ms per token,     1.55 tokens per second)
llama_print_timings:       total time =  315820.73 ms /   667 tokens
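
Because the summary is meant to follow a fixed `Field: value` layout, it can be turned into a dictionary for downstream use. The sketch below is one way to do that, tolerant of the behavior seen in the output above: unfilled placeholders such as `[name]`, bullet lines under a field, and trailing text that the model echoes after the summary. `parse_summary` and `FIELDS` are hypothetical names, and the field list follows the summarization prompt in this cell.

```python
import re

FIELDS = ("Name", "Gender", "Age", "Medical History", "Symptoms")
PLACEHOLDER = re.compile(r"^\[.*\]$")  # e.g. "[name]" left unfilled by the model


def parse_summary(text):
    """Extract the known fields; bullet lines after a field are appended to it."""
    result = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            current = key
            result[key] = value.strip()
        elif line.startswith(("*", "-")) and current is not None:
            item = line.lstrip("*- ").strip()
            result[current] = f"{result[current]}; {item}".strip("; ")
        else:
            current = None  # unrelated text, such as an echoed conversation
    # Treat empty values and unfilled placeholders as missing.
    return {k: (v if v and not PLACEHOLDER.match(v) else None) for k, v in result.items()}


sample = """ Name: [name]
Gender: [gender]
Age: [age]
Medical History: Williams Syndrome
Symptoms:
* PTSD symptoms, including flashbacks, nightmares, anxiety, and avoidance behaviors
Conversation:
Hi. I'm the doctor's assistant.
"""
print(parse_summary(sample))
```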
