<a href="https://colab.research.google.com/github/vishnusureshperumbavoor/rag_apps/blob/main/rag_llama3_8b_instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from IPython.display import Markdown, display
display(Markdown("# VSP's RAG app Llama3-8b-instruct"))

# VSP's RAG app Llama3-8b-instruct

## Stack
* Framework - LlamaIndex
* Vector database - VectorStoreIndex (in-memory by default)
* Embedding - HuggingFaceEmbedding
* Tokenizer & LLM - Llama 3 8B Instruct

# Prerequisites
1. Request access to Llama 3 on Hugging Face by filling in the form on the model page (https://huggingface.co/meta-llama/Meta-Llama-3-8B)
2. Create a Hugging Face access token and store it in Colab secrets (https://huggingface.co/settings/tokens)
3. Create a folder named data (so that its path is /content/data) and upload the PDF into it
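
Step 2 stores the token, but the notebook still has to hand it to the Hugging Face libraries before the gated model can be downloaded. A minimal sketch, assuming the secret is named HF_TOKEN (the name and the helper `get_hf_token` are illustrative, not part of the original notebook). It reads the secret through Colab's `userdata` API when available and falls back to an environment variable elsewhere.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets, or from the environment outside Colab."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(secret_name)
    try:
        return userdata.get(secret_name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment
        return os.environ.get(secret_name)


token = get_hf_token()
if token:
    # huggingface_hub (used by transformers) picks up HF_TOKEN from the environment
    os.environ["HF_TOKEN"] = token
```

Run this before the cells that download the tokenizer and the model.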

# Install packages

In [None]:
!pip install -q llama-index==0.10.12
!pip install -q llama-index-llms-huggingface
!pip install -q llama-index-embeddings-huggingface
!pip install -q gradio
!pip install -q accelerate

# Loading the document

In [None]:
# Read every file in /content/data (the uploaded PDF) into LlamaIndex Document objects
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("/content/data").load_data()

# Embedding, LLM & tokenizer model initialization

In [None]:
embedding_model = "BAAI/bge-small-en-v1.5"
llm_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer_model = "meta-llama/Meta-Llama-3-8B-Instruct"

# Embedding model initialization

In [None]:
# Uses the llama-index-embeddings-huggingface package installed above
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

Settings.embed_model = HuggingFaceEmbedding(model_name=embedding_model)

# Chunk size and overlap initialization

In [None]:
# Each chunk holds at most 512 tokens (LlamaIndex measures chunk_size in tokens, not characters)
Settings.chunk_size = 512
# Consecutive chunks share 50 tokens: the end of one chunk is repeated at the start of the next,
# so a sentence that straddles a boundary is not cut off from its context
Settings.chunk_overlap = 50
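
To make the overlap concrete, here is a simplified sliding-window sketch. It is not LlamaIndex's splitter (which also tries to respect sentence boundaries); `chunk_with_overlap` is a hypothetical helper, and the "tokens" are just integers with small sizes so the result is easy to read.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into fixed-size windows that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window starts with the last token of the previous one, which is what chunk_overlap = 50 does at a larger scale.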

# Creating Vector database

In [None]:
# Pipeline: split the documents into chunks --> embed each chunk --> store the vectors in the index
from llama_index.core import VectorStoreIndex

vector_store = VectorStoreIndex.from_documents(documents)
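
At query time the index compares the embedded question with the stored chunk vectors and returns the closest ones. A toy sketch of that idea using cosine similarity, assuming two-dimensional vectors made up for illustration (real bge-small embeddings have far more dimensions, and `top_k` is a hypothetical helper, not a LlamaIndex function):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # made-up chunk embeddings
query_vec = [1.0, 0.1]                              # made-up query embedding
print(top_k(query_vec, chunk_vecs))  # [0, 2]
```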

# Tokenizer initialization

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
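
The stopping ids tell generation where an answer ends. Conceptually, everything from the first stop token onward is dropped, as in this sketch. The helper `truncate_at_stop` is hypothetical, and the ids are assumptions for illustration: 128001 matches the eos id printed later in this notebook, and 128009 is the id I would expect for <|eot_id|>, so check both against your tokenizer.

```python
def truncate_at_stop(token_ids, stop_ids):
    """Return the tokens that come before the first stop id (or all of them)."""
    stop = set(stop_ids)
    for i, tok in enumerate(token_ids):
        if tok in stop:
            return token_ids[:i]
    return token_ids


example_stop_ids = [128001, 128009]    # assumed ids, see the note above
generated = [1001, 1002, 128009, 1003]  # made-up token ids
print(truncate_at_stop(generated, example_stop_ids))  # [1001, 1002]
```

In the real pipeline the model simply stops generating at that point instead of truncating afterwards.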


# Prompt template

In [None]:
from llama_index.core import PromptTemplate

system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."

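# Wraps every query that LlamaIndex sends to the LLM, after LlamaIndex has filled in its own internal prompts.
# Note: <|USER|> and <|ASSISTANT|> are not Llama 3 special tokens, so the model reads them as plain text.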
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")
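
For comparison, Llama 3 Instruct models are trained on a chat layout built from header and <|eot_id|> tokens. The sketch below assembles a single-turn prompt in that layout by hand; `format_llama3_prompt` is a hypothetical helper, and whether swapping this layout into the wrapper improves answers is something to test rather than assume. In practice the tokenizer's `apply_chat_template` method is the more robust way to produce it.

```python
def format_llama3_prompt(system_prompt, user_message):
    """Build a single-turn prompt in the Llama 3 Instruct chat layout."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        # The prompt ends with an open assistant header so the model writes the answer next
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


example_prompt = format_llama3_prompt("You are a Q&A assistant.", "What is this PDF about?")
print(example_prompt)
```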

# LLM

In [None]:
import torch
from llama_index.llms.huggingface import HuggingFaceLLM

Settings.llm = HuggingFaceLLM(
    context_window=8192,
    max_new_tokens=256,
    # Greedy decoding: with do_sample=False the temperature setting has no effect, so it is omitted
    generate_kwargs={"do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=tokenizer_model,
    model_name=llm_model,
    device_map="auto",
    stopping_ids=stopping_ids,
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16}
)
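
A note on the generation settings: with do_sample=False the model always picks the highest-scoring token (greedy decoding), so temperature only matters once sampling is switched on. The toy sketch below shows the difference on made-up logits; both helper functions are hypothetical and only illustrate the idea.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the highest logit, which is what do_sample=False selects."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(greedy_pick(logits))                    # 0, regardless of temperature
print(softmax_with_temperature(logits, 0.7))  # sampling distribution at temperature 0.7
print(softmax_with_temperature(logits, 1.0))  # flatter distribution at temperature 1.0
```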






# Query engine

In [None]:
query_engine = vector_store.as_query_engine()
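
The query engine retrieves the most similar chunks and places them in a prompt together with the question before calling the LLM. A rough sketch of that assembly step, assuming a template worded along the lines of LlamaIndex's default question-answering prompt (the exact wording may differ between versions, and `build_rag_prompt` is a hypothetical helper):

```python
def build_rag_prompt(query, retrieved_chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        f"Query: {query}\n"
        "Answer: "
    )


rag_prompt = build_rag_prompt(
    "What is this form used for?",
    ["Adult Medical History Form", "Please list current medications."],  # made-up chunks
)
print(rag_prompt)
```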

# Response check

In [None]:
print(query_engine.query("What is this PDF all about?"))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 This PDF appears to be an Adult Medical History Form, which is a questionnaire used to gather information about a patient's medical history, including their present health concerns, medications, allergies, personal medical history, surgical history, and immunizations. The form is likely used by healthcare professionals to gather information about a patient's medical history before conducting a physical examination or providing treatment.


In [None]:
#while True:
#  query=input()
#  print(query_engine.query(query))

# User Interface (gradio)

In [None]:
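def predict(message, history):
    # Gradio passes the latest user message and the chat history.
    # Only the message is sent to the query engine; the history is ignored.
    response = query_engine.query(message)
    return str(response)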
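
The callback can be checked without loading the model by swapping in a stand-in engine. A minimal sketch, where `FakeQueryEngine`, `make_predict` and `predict_stub` are hypothetical names used only for this test:

```python
class FakeQueryEngine:
    """Stand-in for the real query engine; it just echoes the question."""

    def query(self, text):
        return f"echo: {text}"


def make_predict(engine):
    """Build a Gradio-style chat callback around any object with a query() method."""
    def predict_fn(message, history):
        return str(engine.query(message))
    return predict_fn


predict_stub = make_predict(FakeQueryEngine())
print(predict_stub("hello", []))  # echo: hello
```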

In [None]:
import gradio as gr

gr.ChatInterface(predict).launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://b4f3c9bd4ce341b2c3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


