## RAG Using LlamaIndex and HuggingFace

In [None]:
# !pip install llama_index
# !pip install llama-index-embeddings-huggingface
# !pip install llama-index-llms-huggingface

from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex

In [None]:
# Read every PDF in the current folder and, because recursive=True, its subfolders
loader = SimpleDirectoryReader(
    input_dir=".",
    recursive=True,
    required_exts=[".pdf"],
)

# Load the documents
documents = loader.load_data()
documents

In [35]:
# Load the embedding model used to vectorize the document chunks.
# "BAAI/bge-small-en-v1.5" is a small English embedding model that maps text to vectors for similarity-based retrieval.
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

print(embedding_model._model.device)  # Device that the model is running on

cpu
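
To make "similarity-based retrieval" concrete, here is a minimal sketch of cosine similarity, a score commonly used to compare embedding vectors. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration; real `bge-small-en-v1.5` embeddings have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk_a": [1.0, 0.0, 1.0],  # same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "chunk_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank chunk names from most to least similar to the query
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```

A vector index does essentially this at scale: it embeds the question, scores it against the stored chunk vectors, and returns the best matches.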


In [36]:
# Split the documents into chunks (nodes), embed each chunk, and store the vectors in an in-memory index
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embedding_model,
)

# Save the index in the current directory
index.storage_context.persist(persist_dir="./huggingfaceembeddings")


In [37]:
# Viewing the chunks
for doc in index.docstore.docs.values():
    print("Document ID:", doc.ref_doc_id)
    print("Text Chunk:", doc.text)
    print("=" * 50)

Document ID: 162b19ec-44d8-4b39-bf0b-c4513a7fa1b2
Text Chunk: Shraddha Piparia Computational Biologist, Richland, WA | +1-940-297-9424 | spiparia@health.ucsd.edu Professional Experience Postdoctoral Research Associate | 2021-Present | University of California, San Diego  • Conducted research on asthma pharmacogenetics and computational epidemiology, integrating genomic sequencing data with clinical phenotypes • Developed statistical models and machine learning pipelines to identify genetic determinants in respiratory diseases • Collaborated with multidisciplinary teams in a remote setting, contributing to high-impact publications Application Developer | 2013-2016 | Oracle India Private Limited | Telangana, India • Developed enterprise-level applications and ML-based sentiment analysis tools • Enhanced software testing processes through automated test case generation Technical Skills Programming: R, Python, C, SQL, MySQL, PostgreSQL, git/GitHub, docker, AWS ML Framework: scikit-learn, s
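
The index stores chunks rather than whole documents. As a rough illustration of what chunking with overlap means, here is a simplified word-based splitter. It is a stand-in for illustration only, not the splitter LlamaIndex actually uses, and `chunk_words`, `chunk_size`, and `overlap` are hypothetical names.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of `chunk_size` words that share `overlap` words with the previous window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = "one two three four five six seven eight nine ten eleven twelve"
chunks = chunk_words(sample, chunk_size=5, overlap=2)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.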

In [38]:
# 1. RELOAD THE PERSISTED INDEX (reusing the same embedding model)
from llama_index.core import StorageContext, load_index_from_storage

# Load the existing index from a persisted folder
storage_context = StorageContext.from_defaults(persist_dir="./huggingfaceembeddings")
index = load_index_from_storage(storage_context, embed_model=embedding_model)

In [39]:
# Authentication for using a gated HF repo
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read)

In [40]:
# 2. LOADING THE LOCAL LLM (Meta Llama 3.2 3B Instruct)
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM
import torch

device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("cpu")
)

# Load a model and tokenizer from Hugging Face
# "meta-llama/Llama-3.2-3B-Instruct" is an instruction-tuned 3B-parameter Llama model specialized
# for generating coherent answers in response to prompts.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct").to(device)

# Initialize HuggingFaceLLM - a simple LLM wrapper
# huggingface_llm = HuggingFaceLLM(
#     model=model,
#     tokenizer=tokenizer,
# )
# Adding generation params
huggingface_llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    generate_kwargs={
        # To cap output length, pass max_new_tokens=... to HuggingFaceLLM itself, not here
        "temperature": 0.7,      # Creativity level: higher = more varied responses
        "top_p": 0.9,            # Nucleus sampling threshold
        "top_k": 50,             # Consider top-k tokens
        "do_sample": True        # Enable sampling instead of greedy decoding
    }
)

# 3. BUILD A QUERY ENGINE FOR RAG
# Combine the vector store index + local LLM
# RAG (Retrieval-Augmented Generation) means the query engine will first retrieve relevant text chunks
# from the index, then feed them to the local LLM to produce a context-aware answer.
query_engine = index.as_query_engine(llm=huggingface_llm)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
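
The `generate_kwargs` in the cell above control how each next token is drawn. The following is a simplified sketch of how temperature, top-k, and top-p filtering are commonly combined. It is not the `transformers` implementation, and the toy logits and the function name `filter_next_token_probs` are invented for illustration.

```python
import math

def filter_next_token_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return renormalized probabilities of the tokens that survive top-k and top-p filtering."""
    # Temperature: dividing logits by T < 1 sharpens the distribution, T > 1 flattens it
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Softmax (shifted by the max logit for numerical stability)
    max_logit = max(scaled.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # Top-k: keep only the k most likely tokens
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}

# Hypothetical logits for four candidate next tokens
toy_logits = {"resume": 3.0, "article": 2.0, "paper": 1.0, "banana": -2.0}
candidates = filter_next_token_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(candidates)
```

With `do_sample=True`, the next token is then drawn at random from the surviving candidates, which is why repeated runs of the same question can give different answers.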

In [29]:
!pip freeze | grep llama-index
!pip freeze | grep transformers

llama-index==0.12.23
llama-index-agent-openai==0.4.6
llama-index-cli==0.4.1
llama-index-core==0.12.23.post2
llama-index-embeddings-huggingface==0.5.2
llama-index-embeddings-openai==0.3.1
llama-index-indices-managed-llama-cloud==0.6.8
llama-index-llms-huggingface==0.4.2
llama-index-llms-openai==0.3.25
llama-index-multi-modal-llms-openai==0.4.3
llama-index-program-openai==0.3.1
llama-index-question-gen-openai==0.3.0
llama-index-readers-file==0.4.6
llama-index-readers-llama-parse==0.4.0
sentence-transformers==3.4.1
transformers==4.48.3
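
`pip freeze | grep` relies on a Unix shell. As a portable alternative, here is a small sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the package list simply mirrors some of the packages shown above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

packages = [
    "llama-index-core",
    "llama-index-embeddings-huggingface",
    "llama-index-llms-huggingface",
    "transformers",
]
for name in packages:
    print(name, installed_version(name) or "not installed")
```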


In [42]:
# Interactive question loop: type "quit" to exit
while True:
    question = input("Question: ")
    if question.strip().lower() == "quit":
        break
    response = query_engine.query(question)
    print(response)


Question: whats the article about


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 The article is about the academic and professional background of Shraddha Piparia, a Computational Biologist, including her research experience, technical skills, and education. It appears to be her resume or CV.
Question: quit
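
The retrieve-then-generate flow behind `query_engine.query` can be sketched without any model. In this toy version, word overlap stands in for embedding similarity, and the LLM call is replaced by simply building the prompt. The functions `retrieve` and `build_prompt`, the prompt layout, and the sample chunks are all hypothetical.

```python
def retrieve(question, chunks, top_k=2):
    """Rank chunks by how many lowercase words they share with the question."""
    question_words = set(question.lower().split())
    scored = [
        (len(question_words & set(chunk.lower().split())), chunk)
        for chunk in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the best top_k chunks, dropping any with no overlap at all
    return [chunk for score, chunk in scored[:top_k] if score > 0]

def build_prompt(question, context_chunks):
    """Assemble the text a real pipeline would send to the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Made-up chunks standing in for the indexed document
toy_chunks = [
    "The candidate developed machine learning pipelines in Python",
    "The index is persisted to a local folder",
    "Technical skills include Python and SQL",
]
question = "Which skills include Python"
retrieved = retrieve(question, toy_chunks)
prompt = build_prompt(question, retrieved)
print(prompt)
```

In the real pipeline the retriever scores chunks by embedding similarity, and the assembled prompt is passed to the local Llama model, which writes the answer.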
