<a href="https://colab.research.google.com/github/theAbheekMukherjee/RAG-Assignment/blob/main/RAG(Assignment).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG System utilising LlamaIndex
This notebook demonstrates how to construct a complete RAG pipeline with **LlamaIndex**.
We will deploy a small, open-source LLM called **Phi-3**, *developed by Microsoft*, locally on our Colab instance for generation. To improve its answers, we will give the LLM access to brochure data, i.e., the **2025 brochure available as Postgraduate 2025 brochure.pdf on my.wbs**, which will be stored in a Faiss index as the vector store. Finally, we use **[Ollama](https://ollama.com/)** to interact with Phi-3 on our device.

We will start by installing all the packages needed to interact with the LLM (in our case Phi-3) and to perform efficient information retrieval.

In [None]:
# Installing prerequisites for interacting with LLMs, chunking and embedding
!pip install llama-index-core                       # core components of LlamaIndex
!pip install llama-index-readers-file               # file readers, including the PDF reader
!pip install faiss-gpu                              # Faiss is a separate library from LlamaIndex
!pip install llama-index-embeddings-huggingface     # for generating embeddings
!pip install llama-index-vector-stores-faiss        # Faiss vector store integration for LlamaIndex

**Step 1:**
This step loads the brochure PDF from the Colab directory and prepares it for further processing by splitting it into smaller, manageable chunks with the sentence splitter chunking technique.

In [None]:
from llama_index.readers.file import FlatReader, PDFReader
from llama_index.core.node_parser import SentenceSplitter #import sentence splitter for chunking
from pathlib import Path  #for finding the file

Brochure_doc = PDFReader().load_data(Path("/content/Postgraduate 2025 brochure--FINAL-ONLINE.pdf")) #loading the dataset

parser = SentenceSplitter(chunk_size=300, chunk_overlap=100) # chunk size and chunk overlap (in tokens) are tunable hyperparameters
Brochure_doc = parser.get_nodes_from_documents(Brochure_doc)
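
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows in plain Python. It is not how `SentenceSplitter` is implemented: the real splitter counts tokens and tries to respect sentence boundaries, whereas this illustration simply counts whitespace-separated words. The function name `chunk_words` and the sample text are hypothetical.

```python
def chunk_words(text, chunk_size=300, chunk_overlap=100):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical ten-word sample so the overlap is easy to see
sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(sample, chunk_size=4, chunk_overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

Each window repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. A larger overlap gives more context per chunk at the cost of more vectors to store.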

**Step 2:**
Configuring *LlamaIndex* to use a specific pre-trained model from Hugging Face ***("BAAI/bge-small-en-v1.5")*** to generate numerical representations ***(embeddings)*** of the text data, i.e. *Postgraduate 2025 brochure.pdf*.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

# Initialize a HuggingFace embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5") # pre-trained model that maps each chunk to an embedding vector

# Register the embedding model in LlamaIndex's settings
Settings.embed_model = embed_model

**Step 2a:**
Setting up the Faiss index itself. Faiss is a separate library for efficient similarity search.

In [None]:
!pip install faiss-cpu

In [None]:
import faiss # used for similarity search in high-dimensional vector spaces
# create the empty Faiss index
d = 384  # embedding dimension; must match the output size of the embedding model (384 for bge-small-en-v1.5)
faiss_index = faiss.IndexFlatIP(d)  # flat (exact) index that ranks by inner product
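
`IndexFlatIP` ranks stored vectors by their inner product with the query. The inner product equals cosine similarity only when both vectors have unit length, so it is worth knowing whether the embeddings are normalised. The sketch below shows the relationship in plain Python with two made-up 2-dimensional vectors; the helper names are hypothetical, and whether a given embedding model normalises its output is an assumption to check for your setup.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Made-up vectors for illustration only
a = [3.0, 4.0]
b = [4.0, 3.0]
raw_ip = inner_product(a, b)
ip_of_normalised = inner_product(l2_normalize(a), l2_normalize(b))
print(raw_ip, ip_of_normalised, cosine_similarity(a, b))
```

The raw inner product depends on vector length, while the inner product of the normalised vectors matches the cosine similarity. With unit-length embeddings, an inner-product index therefore behaves like a cosine-similarity search.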

**Step 3:**
Integrating Faiss into LlamaIndex. This step wraps the Faiss index created earlier in a LlamaIndex vector store and builds a searchable index over the brochure chunks.

In [None]:
from llama_index.core import (
    load_index_from_storage,
    VectorStoreIndex,
    StorageContext,
)

from llama_index.vector_stores.faiss import FaissVectorStore

# wrap the Faiss index in a LlamaIndex vector store
vector_store = FaissVectorStore(faiss_index=faiss_index)

# set the vector store in the storage context of LlamaIndex
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# embed the chunks and add them to the Faiss-backed index
li_index = VectorStoreIndex(Brochure_doc, storage_context=storage_context)

# save index to disk
li_index.storage_context.persist()

print(f"Number of vectors in the Faiss index: {faiss_index.ntotal}")

**Step 4:**
In the previous steps we used *LlamaIndex* to build a searchable index of the brochure. We now transform the question into a vector using the *same pre-trained model* as the index and compare it with the stored chunk vectors in the embedding space. This lets the index find brochure text that is semantically similar to the Q&A prompt.
We then pick the most relevant chunks for answering the question based on their similarity in the embedding space.

In [None]:
from sentence_transformers import SentenceTransformer

# Load the same BGE embedding model through sentence-transformers
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Q&A prompt stored as a text string
qna_prompt = "Where is Warwick Business School?"

# Convert the Q&A prompt to a vector
rag_embedding = model.encode(qna_prompt, show_progress_bar=True)

**Step 4a:**
Choosing the most relevant sentences that directly address the question by considering their similarity in the embedding space.

In [None]:
import numpy as np

# Retrieve the four nearest neighbours
# search returns (similarity scores, indices); k=4 is the number of neighbours to retrieve (a hyperparameter)
cs_similarity, similar = faiss_index.search(np.array([rag_embedding]), k=4)
similar = similar.flatten().tolist()

# Print the indices of the four most similar passages
print(f'Top results: {similar}')

# Print the matching chunks
for result in similar:
  print(Brochure_doc[result]) # some results are not very good
  print("\n")
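
The search above returns positions in the index, which we use to look up the chunks. This works under the assumption that vectors were added in the same order as the chunks in `Brochure_doc`. As a rough illustration of what a flat inner-product search does, here is a brute-force version in plain Python. The passages, vectors and function name are made up for the example and are not taken from the brochure.

```python
def top_k_inner_product(query, vectors, k):
    """Return (scores, indices) of the k stored vectors with the largest inner product."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    top = scored[:k]
    return [score for score, _ in top], [idx for _, idx in top]


# Hypothetical passages and their (made-up) 2-dimensional embeddings, stored in the same order
passages = ["campus location", "course fees", "entry requirements"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

query = [1.0, 0.0]
scores, indices = top_k_inner_product(query, vectors, k=2)
for score, idx in zip(scores, indices):
    print(f"{score:.2f}  {passages[idx]}")
```

Because position `i` in `vectors` corresponds to position `i` in `passages`, the returned indices can be used directly to fetch the text, which is the same lookup pattern as `Brochure_doc[result]`.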

**Step 5:**
**Putting everything together and building a RAG.**

We'll start by installing ***Ollama***, setting up the model, and finally checking its performance.

**Step 5a:**
Installing *Ollama*.

In [None]:
# Install Ollama v0.1.30
!curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.30#' | sh


**Step 5b:**
Next, we perform some tasks to set up *Ollama* in the background of our Colab (Linux) instance. We don't have to worry too much about this code; it mainly consists of Linux/BASH commands.

In [None]:
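# Set up the model name as a global variable
# NOTE: the text above refers to Phi-3; in the Ollama library the `phi` tag is the earlier Phi-2 model,
# while Phi-3 is published as `phi3` and may require a newer Ollama release than v0.1.30
OLLAMA_MODEL='phi:latest'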

# Add the model to the environment of the operating system
import os
os.environ['OLLAMA_MODEL'] = OLLAMA_MODEL
!echo $OLLAMA_MODEL # print the global variable to check it saved

import subprocess
import time

# Start ollama on the server ("serve") in the background
command = "nohup ollama serve &"

# Use subprocess.Popen to run the command
process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

print("Process ID:", process.pid) # print the process ID
time.sleep(10)  # Wait 10 seconds to allow the server to initialize
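
A fixed sleep may be too short or needlessly long. As an alternative, the sketch below polls the server until it answers. It assumes Ollama listens on its default address `http://127.0.0.1:11434` and replies with HTTP 200 on the root path; the function name `wait_for_server` and its parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll url until it answers with HTTP 200 or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval)
    return False


# Example call (assumed default address), to run after starting `ollama serve`:
# if not wait_for_server(timeout=60):
#     raise RuntimeError("Ollama server did not start in time")
```

The function returns as soon as the server responds, so the notebook only waits as long as it has to, and it reports a failed start instead of continuing silently.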

**Step 5c:**
Now that everything is set up, we can query the model to generate some text about the brochure. Because this is the pre-trained LLM without retrieval, its answer serves as a baseline for comparing the performance of our RAG pipeline later.

In [None]:
# Query the model via the command line
# First time running it will "pull" (import) the model
!ollama run $OLLAMA_MODEL "What is the location of Warwick Business School?"

**Step 6:**
As everything works well, we can now build our RAG. First, we prepare the environment for using LlamaIndex with Ollama.
This allows us to *combine the strengths of LLMs with text indexing and search functionality.*

In [None]:
# Install prerequisites
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama               # integration of Ollama with LlamaIndex
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface

# Uninstall conflicting opentelemetry packages
!pip uninstall -y opentelemetry-api opentelemetry-sdk

# Install compatible opentelemetry versions
!pip install opentelemetry-api==1.20.0 opentelemetry-sdk==1.20.0

# Access to the Chroma vector store for efficient data storage
!pip install llama-index-vector-stores-chroma
!pip install chromadb


# Import required modules from the llama_index library
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama  # Ollama LLM wrapper

# Import ChromaVectorStore and the chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

**Step 6a:**
Creating the Ollama LLM with a request timeout, so that a stalled request raises an error instead of hanging, and registering it with LlamaIndex.

In [None]:
# Use the global variable (OLLAMA_MODEL) as our LLM
OLLAMA_MODEL='phi:latest' # Redefined in this cell as a workaround
llm = Ollama(model=OLLAMA_MODEL, request_timeout=480.0) # Timeout of 8 minutes (480 seconds)

# Register the LLM in LlamaIndex's settings
Settings.llm = llm

**Step 7:**
Creating a reusable ***Prompt Template*** for question answering with Ollama (or any integrated LLM) within LlamaIndex.

In [None]:
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Always answer the question, even if the context is limited."
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
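
To see what the LLM actually receives, the sketch below fills the two placeholders of the user prompt by hand. LlamaIndex does this internally at query time; the exact way it joins the retrieved chunks is an assumption here, and the helper `build_prompt` and the example chunks are hypothetical.

```python
# Same user prompt as in the template above
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)


def build_prompt(retrieved_chunks, question):
    """Fill {context_str} with the retrieved chunks and {query_str} with the question."""
    context = "\n\n".join(retrieved_chunks)  # assumed separator between chunks
    return qa_prompt_str.format(context_str=context, query_str=question)


# Hypothetical retrieved chunks, standing in for real brochure text
example_chunks = ["Chunk A text.", "Chunk B text."]
filled = build_prompt(example_chunks, "Where is Warwick Business School?")
print(filled)
```

Printing the filled prompt is a quick way to check that the retrieved context is relevant before blaming the LLM for a poor answer.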

**Step 8:**
Using the configured LLM and prompt template within a temporary LlamaIndex query engine to answer the question. We can now compare the output of the RAG pipeline with that of the pre-trained LLM on its own.

In [None]:
query_engine = li_index.as_query_engine(
                                    text_qa_template=text_qa_template,
                                    llm=llm,
                                    response_mode="compact")

response = query_engine.query("What is the location of Warwick Business School?")
response.response

Run this code cell if Ollama takes too long to establish a connection.

In [None]:
import subprocess
import time

# Stop any running Ollama processes
!pkill ollama || true
time.sleep(5) # Give it a moment to stop

# Start ollama on the server ("serve") in the background
command = "nohup ollama serve &"

# Use subprocess.Popen to run the command
process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

print("Ollama server process ID:", process.pid) # print the process ID
time.sleep(10)  # Wait for the server to initialize




In [None]:
query_engine = li_index.as_query_engine(
                                    text_qa_template=text_qa_template,
                                    llm=llm,
                                    response_mode="tree_summarize")

response = query_engine.query("What is the location of Warwick Business School?")
response.response

In [None]:
print(
    li_index.as_query_engine(                 #creating a temporary query engine
        text_qa_template=text_qa_template,
        llm=llm,
        response_mode="tree_summarize"        #various types of response modes are available(a hyperparameter)
    ).query("Where is the nearest study library in the campus from Warwick Business School?")
)

**Step 9:**
Checking the outcome under different query modes and with different prompts to verify the model's performance.

We can change the response mode of the query engine to see how the outcome changes.
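
As a rough intuition for the two modes used in this notebook: "compact" packs the retrieved chunks into as few prompts as possible before querying the LLM, while "tree_summarize" queries the LLM over packed blocks and then recursively combines the partial answers until one remains. The sketch below imitates both ideas in plain Python. It is not LlamaIndex's implementation: it budgets by characters rather than tokens, and `fake_llm` is a hypothetical stand-in for a real model.

```python
def pack_chunks(chunks, max_chars):
    """Greedily concatenate chunks so that each packed block stays within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + "\n" + chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def tree_summarize(chunks, llm, max_chars):
    """Answer over packed blocks, then recurse on the answers until one remains."""
    blocks = pack_chunks(chunks, max_chars)
    answers = [llm(block) for block in blocks]
    if not answers:
        return ""
    if len(answers) == 1:
        return answers[0]
    if len(answers) >= len(chunks):
        # Packing made no progress (blocks too long); finish with one final call
        return llm("\n".join(answers))
    return tree_summarize(answers, llm, max_chars)


calls = []


def fake_llm(text):
    """Stand-in for a real model: records the call and returns a short 'answer'."""
    calls.append(text)
    return text[:5]


chunks = ["aaaaaaaa", "bbbbbbbb", "cccccccc", "dddddddd"]
packed = pack_chunks(chunks, max_chars=17)
final = tree_summarize(chunks, fake_llm, max_chars=17)
print(len(packed), final, len(calls))
```

Here the four chunks are packed into two blocks, so a compact-style pass needs two LLM calls instead of four, and the tree needs one more call to merge the two partial answers.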

In [None]:
query_engine = li_index.as_query_engine(
                                    text_qa_template=text_qa_template,
                                    llm=llm,
                                    response_mode="compact")

response = query_engine.query("What are the eligibility criteria to get into Warwick Business School?")
response.response

Now we can see the outcome when selecting the "tree_summarize" response mode.

In [None]:
query_engine = li_index.as_query_engine(
                                    text_qa_template=text_qa_template,
                                    llm=llm,
                                    response_mode="tree_summarize")

response = query_engine.query("What are the eligibility criteria to get into Warwick Business School?")
response.response

In [None]:
query_engine = li_index.as_query_engine(
                                    text_qa_template=text_qa_template,
                                    llm=llm,
                                    response_mode="compact")

response = query_engine.query("How is the MSc Business Analytics course taught?")
response.response

In [None]:
query_engine = li_index.as_query_engine(
                                    text_qa_template=text_qa_template,
                                    llm=llm,
                                    response_mode="tree_summarize")

response = query_engine.query("How is the MSc Business Analytics course taught?")
response.response