# Description
This is a Retrieval-Augmented Generation (RAG) Q&A system using the `llama-index` library and the **Meta LLaMA 3-8B Instruct model** from Hugging Face. It loads and indexes documents, retrieves relevant information based on user input, and generates responses. The system detects the language of the input (English or Bengali), processes it accordingly, and provides answers through a Gradio chat interface.

## Installing libraries
Press "Cancel" if Colab asks to restart the session.

In [1]:
!pip install -U transformers
!pip install -q pypdf
!pip install -q python-dotenv
!pip install llama-index==0.10.12
!pip install -q gradio
!pip install einops
!pip install accelerate
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-fastembed
!pip install fastembed
!pip install deep-translator

Collecting transformers
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.46.2
    Uninstalling transformers-4.46.2:
      Successfully uninstalled transformers-4.46.2
Successfully installed transformers-4.46.3
Collecting llama-index==0.10.12
  Downloading llama_index-0.10.12-py3-none-any.whl.metadata (8.8 kB)
Collecting llama-index-agent-openai<0.2.0,>=0.1.4 (from llama-index==0.10.12)
  Downloading llama_index_agent_o

Collecting deep-translator
  Downloading deep_translator-1.11.4-py3-none-any.whl.metadata (30 kB)
Downloading deep_translator-1.11.4-py3-none-any.whl (42 kB)
Installing collected packages: deep-translator
Successfully installed deep-translator-1.11.4


## Importing libraries

In [2]:
import gradio as gr
import logging
import sys
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.core import PromptTemplate
from transformers import AutoTokenizer
import torch
from deep_translator import GoogleTranslator
from huggingface_hub import login
import os

## Loading data
The PDF of "A Game of Thrones" is loaded as the document the system will query. Any other PDF document can be used instead.

In [3]:
from google.colab import drive
drive.mount('/content/drive')
documents = SimpleDirectoryReader("/content/drive/My Drive/Data").load_data()

Mounted at /content/drive


## Configure logging

In [4]:
# force=True replaces any handlers the notebook runtime already installed,
# so each log record is written to stdout exactly once.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)

## Setting up embedding model

In [5]:
embed_model = FastEmbedEmbedding(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
Settings.embed_model = embed_model
Settings.chunk_size = 512

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



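`Settings.chunk_size = 512` caps each indexed chunk at roughly 512 tokens. The idea can be illustrated with a minimal sketch of fixed-size chunking. It assumes whitespace-separated words stand in for tokens, whereas LlamaIndex uses a real tokenizer and sentence-aware splitting. `chunk_words` and its `overlap` parameter are hypothetical names used only for this illustration.

```python
def chunk_words(text, chunk_size=512, overlap=0):
    """Split text into chunks of at most chunk_size words (a stand-in for tokens)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Toy input of 1200 "words"; overlap=50 is an illustrative value.
sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(sample, chunk_size=512, overlap=50)
print([len(c.split()) for c in chunks])
```

Smaller chunks tend to give more precise retrieval hits, while larger chunks carry more context per hit; 512 is the value this notebook uses.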
## Defining system prompt and query wrapper

In [6]:
system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."

query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

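The `<|USER|>` and `<|ASSISTANT|>` markers in the wrapper above are not among the special tokens used by Llama 3 Instruct. For comparison, here is a sketch of the chat layout that model expects, ending each turn with the same `<|eot_id|>` token that the notebook's `stopping_ids` refers to. `build_llama3_prompt` is a hypothetical helper written for this illustration; in practice the tokenizer's `apply_chat_template` method produces the prompt from the model's own template.

```python
def build_llama3_prompt(system_prompt, user_message):
    """Assemble a single-turn prompt in the Llama 3 Instruct chat layout."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        # The prompt ends with an open assistant header so the model writes the reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt("You are a Q&A assistant.", "Who is Jon Snow?")
print(prompt)
```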

## Hugging face model and tokenizer setup

In [13]:
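# Read the Hugging Face token from the HF_TOKEN environment variable.
# (Colab secrets are not environment variables; they are read with google.colab.userdata.get.)
hf_token = os.getenv("HF_TOKEN")

# Authenticate with Hugging Face; if the token is None, login() prompts for one interactively
login(token=hf_token)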


In [14]:
# Proceed with the tokenizer and LLM setup
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

# LLM setup
llm = HuggingFaceLLM(
    context_window=8192,
    max_new_tokens=256,
    # Greedy decoding; a temperature setting has no effect when do_sample is False
    generate_kwargs={"do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    stopping_ids=stopping_ids,
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16}
)

Settings.llm = llm
Settings.chunk_size = 512

Some parameters are on the meta device because they were offloaded to the cpu.

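The offloading warning is consistent with a back-of-the-envelope estimate of the memory needed for the weights alone. The sketch below assumes about 8 billion parameters (rounded from the model name) and ignores activations and the key-value cache; `estimate_weight_gib` is a hypothetical helper, not part of any library.

```python
def estimate_weight_gib(num_params, bytes_per_param):
    """Approximate memory for model weights in GiB (weights only)."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 8e9  # assumption: rounded from "8B" in the model name

fp16 = estimate_weight_gib(NUM_PARAMS, 2)  # torch.float16 uses 2 bytes per value
fp32 = estimate_weight_gib(NUM_PARAMS, 4)  # float32 would use 4 bytes per value
print(f"float16: {fp16:.1f} GiB, float32: {fp32:.1f} GiB")
```

At roughly 15 GiB in float16, the weights alone can fill a GPU with around 16 GB of memory, in which case `device_map="auto"` may place some layers on the CPU.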

## Building index from documents and setting up query engine

In [15]:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

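Conceptually, the index stores one embedding vector per chunk, and a query is answered by embedding the question and passing the most similar chunks to the LLM as context. The following is a simplified sketch of that retrieval step using cosine similarity over toy 3-dimensional vectors. It is not LlamaIndex's implementation; `cosine_similarity`, `top_k`, and `k=2` are illustrative choices.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings standing in for real chunk vectors.
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]
print(top_k(query, toy_chunks, k=2))
```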
## Language detection for input/output and querying the system

In [16]:
from deep_translator import GoogleTranslator

def predict(message, history):
    # Translate the message to English; the source language is auto-detected.
    message_english = GoogleTranslator(source='auto', target='en').translate(message)

    # If translation leaves the text unchanged, treat the message as English.
    if message_english == message:
        return str(query_engine.query(message))

    # Otherwise assume Bengali: query in English, then translate the answer back.
    response_english = query_engine.query(message_english)
    return GoogleTranslator(source='en', target='bn').translate(str(response_english))

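The function above infers the language indirectly, by checking whether translating the text to English changes it. That comparison can misfire if the translator alters an English message slightly, for example in spacing or punctuation. One alternative, sketched here as an assumption rather than as part of the notebook, is to look for characters in the Bengali Unicode block (U+0980 to U+09FF). `contains_bengali` is a hypothetical helper. Note that this block is shared with other languages written in the same script, such as Assamese.

```python
def contains_bengali(text):
    """Return True if any character falls in the Bengali Unicode block."""
    return any("\u0980" <= ch <= "\u09FF" for ch in text)


print(contains_bengali("Who is Jon Snow?"))
print(contains_bengali("জন স্নো কে?"))
```

With such a check, the translation call is needed only for messages that contain Bengali text.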
## Launch the gradio interface

In [17]:
gr.ChatInterface(predict).launch(share=True)



Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://a8391f34eaa2575457.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


