<a href="https://colab.research.google.com/github/vishnusureshperumbavoor/rag_apps/blob/main/rag_llama3_8b_instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from IPython.display import Markdown, display
display(Markdown("#VSP's RAG app Llama3-8b-instruct"))

#VSP's RAG app Llama3-8b-instruct

# Install packages

In [2]:
!pip install -q pypdf
!pip install -q python-dotenv
!pip install llama-index==0.10.12
!pip install -q gradio
!pip install einops
!pip install accelerate
!pip install llama-index-llms-huggingface llama-index-embeddings-fastembed fastembed
!pip install transformers -U

Collecting tokenizers<0.16,>=0.15 (from fastembed)
  Using cached tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
INFO: pip is looking at multiple versions of transformers[torch] to determine which version is compatible with other requirements. This could take a while.
Collecting transformers[torch]<5.0.0,>=4.37.0 (from llama-index-llms-huggingface)
  Using cached transformers-4.40.0-py3-none-any.whl (9.0 MB)
  Using cached transformers-4.39.3-py3-none-any.whl (8.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.40.1
    Uninstalling transformers-4.40.1:
      Successfully uninstalled transformers-4.40.1
Successfully installed tokenizers-0.15.2 transformers-4.39.3
Collecting transforme

# Prerequisites
1. Get llama3 huggingface access by filling the form (https://huggingface.co/meta-llama/Meta-Llama-3-8B)
2. Get your huggingface token and store it on colab secrets (https://huggingface.co/settings/tokens)
3. Create a folder called data and upload the PDF into it

# Loading the document

In [3]:
# loading the data
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("/content/data").load_data()

# Embedding, LLM & tokenizer model initialization

In [4]:
embedding_model="BAAI/bge-small-en-v1.5"
llm_model="meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer_model="meta-llama/Meta-Llama-3-8B-Instruct"

# Embedding model initialization

In [5]:
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.core import Settings

Settings.embed_model = FastEmbedEmbedding(model_name=embedding_model)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

# Chunk size and overlap initialization

In [6]:
# each chunks will have 512 characters
Settings.chunk_size = 512
# the last 50 characters of chunk1 is stored in the 1st 50 characters of chunk2
# To avoid losing values at the boundary
# Algorithms can understand the relationship between chunks
Settings.chunk_overlap = 50

# Creating Vector database

In [7]:
from llama_index.core import VectorStoreIndex

vector_store = VectorStoreIndex.from_documents(documents)

# Tokenizer initialization

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Prompt template

In [9]:
from llama_index.core import PromptTemplate

system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."

# This will wrap the default prompts that are internal to llama-index into LLM
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

# LLM

In [10]:
import torch
from llama_index.llms.huggingface import HuggingFaceLLM

Settings.llm = HuggingFaceLLM(
    context_window=8192,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=tokenizer_model,
    model_name=llm_model,
    device_map="auto",
    stopping_ids=stopping_ids,
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16}
)




Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Query engine

In [11]:
query_engine = vector_store.as_query_engine()

# Response check

In [12]:
print(query_engine.query("What is this pdf is all about"))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 This PDF appears to be an Adult Medical History Form, which is a questionnaire used to gather information about a patient's medical history, including their present health concerns, medications, allergies, personal medical history, surgical history, and immunizations. The form is likely used by healthcare professionals to gather information about a patient's medical history before conducting a physical examination or providing treatment.


In [None]:
#while True:
#  query=input()
#  print(query_engine.query(query))

# User Interface (gradio)

In [15]:
def predict(input, history):
  response = query_engine.query(input)
  return str(response)

In [16]:
import gradio as gr

gr.ChatInterface(predict).launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://b4f3c9bd4ce341b2c3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


