# __Project Name__ : __Personalized Research Assistant Using RAG__

>> __Problem Description__ : __In this era of AI, students and researchers still spend a lot of their time working out what a research paper means by consulting various sources.__

>> __Solution__ : __A simple solution is to let a user upload their paper and ask questions about it, e.g., its key findings or a summary of the paper.__

> ### This notebook is the testing file for the project; the actual project can be built as a Python (.py) file together with a separate frontend file.

In [1]:
# Taking the necessary Imports...

# 1. For LLMs (Embeddings and Chats)
import os
from dotenv import load_dotenv

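# Accessing API keys from the .env file
load_dotenv()

hf_api_key = os.getenv("HF_API_KEY")
# os.environ only accepts strings, so fail early with a clear message if the key is missing
if hf_api_key is None:
    raise RuntimeError("HF_API_KEY is not set; add it to your .env file")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_api_key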

from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint, HuggingFaceEndpointEmbeddings

# For prompting 
from langchain_core.prompts import PromptTemplate

# For output parsers
from langchain_core.output_parsers import StrOutputParser

# for document loading (PDF Docs)
from langchain_community.document_loaders import PyPDFLoader

# For splitting the docs
from langchain.text_splitter import RecursiveCharacterTextSplitter

# For the vector store and the retriever
from langchain_community.vectorstores import FAISS

>> Step 1 Indexing (Document Ingestion)

In [50]:
# Load the PDF with PyPDFLoader
pdf_loader = PyPDFLoader('F.pdf')
docs = pdf_loader.load()

In [51]:
# docs holds one Document per PDF page (24 pages for this PDF)
docs

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-02-12T02:08:57+00:00', 'author': '', 'keywords': '', 'moddate': '2025-02-12T02:08:57+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'F.pdf', 'total_pages': 24, 'page': 0, 'page_label': '1'}, page_content='THE FAISS LIBRARY\nMatthijs Douze\nFAIR, Meta\nAlexandr Guzhva\nZilliz\nChengqi Deng\nDeepSeek\nJeff Johnson\nFAIR, Meta\nGergely Szilvasy\nFAIR, Meta\nPierre-Emmanuel Mazar´e\nFAIR, Meta\nMaria Lomeli\nFAIR, Meta\nLucas Hosseini\nSkip Labs\nHerv´e J´egou\nFAIR, Meta\nAbstract\nVector databases typically manage large collections of\nembedding vectors. Currently, AI applications are\ngrowing rapidly, consequently, the number of embed-\ndings that need to be stored and indexed is increas-\ning. The Faiss library is dedicated to vector similarity\nsearch, a

In [52]:
print(len(docs))

24


In [43]:
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-07-31T08:46:52+00:00', 'author': '', 'keywords': '', 'moddate': '2025-07-31T08:46:52+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live 2024) kpathsea version 6.4.0', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'RPR.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Rikin Pithadia +91-6353865443\nRoll No : 23035010453 d.pithadia@op.iitg.ac.in\nB.Sc Hons - Data Science and Artificial Intelligence rikinpithadia98@gmail.com\nIndian Institute Of Technology, Guwahati github.com/rikin-2911\nlinkedin.com/in/rikin-pithadia\nEducation\nDegree/Certificate Institute/Board CGPA/Percentage Year\nB.Sc.(Hons) in Data Science and AI Indian Institute of Technology,\nGuwahati\n7.42 (Current) 2023 - Present\nB.Tech in Mechanical Engineering Government Engineering College,\nGandhinagar\n7.79 (Current) 2023 - Present\nMinor in Internet 

>> Step 2 Indexing (Chunking)

In [53]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", " ", ""]
)

In [54]:
chunks = splitter.split_documents(docs)

In [55]:
len(chunks)

134

In [56]:
chunks[1]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-02-12T02:08:57+00:00', 'author': '', 'keywords': '', 'moddate': '2025-02-12T02:08:57+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'F.pdf', 'total_pages': 24, 'page': 0, 'page_label': '1'}, page_content='We benchmark key features of the library and discuss\na few selected applications to highlight its broad ap-\nplicability.\n1 Introduction\nThe emergence of deep learning has induced a shift in\nhow complex data is stored and searched, noticeably\nby the development of embeddings. Embeddings are\nvector representations, typically produced by a neu-\nral network, that map (embed) the input media item\ninto a vector space, where the locality encodes the se-\nmantics of the input. Embeddings are extracted from\nvarious forms of media: words [59, 10], text [2
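
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification and not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` also tries to break on the listed separators, which this sketch ignores. The function name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return windows


sample = "abcdefghij" * 3  # 30 characters of toy text
pieces = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(len(piece), piece)
```

Each window repeats the last 4 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.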

>> Step 3 Indexing (Embeddings and Vector Stores)

In [57]:
# first we setup our LLM for embedding task
embedding_llm = HuggingFaceEndpointEmbeddings(model="sentence-transformers/all-MiniLM-L6-v2")


# Now we create a vector store for storing our embeddings/vectors
vector_store = FAISS.from_documents(chunks, embedding_llm)
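
Conceptually, the vector store answers one question: which stored vectors lie closest to the query vector? The sketch below shows that idea with a brute-force search over toy 3-dimensional vectors. It uses cosine similarity for illustration only; the FAISS index built above may use a different metric (such as L2 distance) and much higher-dimensional vectors. The names `toy_store` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda key: cosine_similarity(query, store[key]), reverse=True)
    return ranked[:k]


# Toy "embeddings"; real ones come from the embedding model.
toy_store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(nearest)
```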

In [None]:
# Inspect the mapping from FAISS index position to docstore ID (one ID per chunk)
vector_store.index_to_docstore_id

{0: 'e495f8c6-eb2f-4b4d-ace2-02dfc2fbfee3',
 1: '080840da-8c31-427d-9036-cb8e276c3d5f',
 2: 'c7eca0ef-8a14-4098-a419-4007b7535821',
 3: 'b1fd5697-3194-4d35-87ea-0ddd28641f26',
 4: '7a15314d-9415-4942-b154-860353b92bbd'}

In [None]:
vector_store.get_by_ids(['99caca0e-4324-469b-9e06-a9367afd9e0d'])

[]

>> Step 4 Retriever

In [58]:
# Retrieval is straightforward because our data is already in the vector store,
# so we use the vector store itself as the retriever (MMR search, returning k=4 chunks)
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k":4}
)

In [59]:
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEndpointEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000002D6A7438E90>, search_type='mmr', search_kwargs={'k': 4})
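
The retriever uses `search_type="mmr"` (maximal marginal relevance), which trades relevance to the query against redundancy among the chunks already picked. Below is a minimal sketch of the standard MMR selection rule on toy 2-dimensional vectors. It is not LangChain's implementation, and `mmr_select`, `trade_off`, and the toy vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k=2, trade_off=0.5):
    """Pick k candidate keys, balancing relevance to the query against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(key):
            relevance = cosine(query, candidates[key])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[key], candidates[s]) for s in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


toy_vectors = {
    "A": [1.0, 0.4],   # most relevant to the query
    "B": [1.0, 0.3],   # nearly the same direction as A
    "C": [0.2, 1.0],   # less relevant, but different from A
}
query_vector = [1.0, 0.5]

diverse = mmr_select(query_vector, toy_vectors, k=2, trade_off=0.5)
relevance_only = mmr_select(query_vector, toy_vectors, k=2, trade_off=1.0)
print(diverse, relevance_only)
```

With `trade_off=1.0` the rule reduces to plain similarity ranking and returns the two near-identical vectors. With `trade_off=0.5` the second pick goes to the vector that adds something new.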

>> Step 5 Augmentation

In [2]:
# Setting up the LLM for text generation (the variable is still named qwen_llm, but the active repo_id below is google/gemma-2-2b-it)
qwen_llm = HuggingFaceEndpoint(
    #repo_id="Qwen/Qwen3-235B-A22B-Instruct-2507",
    #repo_id="openai/gpt-oss-120b",   # For now this is good, but not for last step
    #repo_id="meta-llama/Llama-3.2-3B-Instruct",
    #repo_id="meta-llama/Llama-3.1-8B-Instruct",
    repo_id="google/gemma-2-2b-it",
    #repo_id="deepseek-ai/DeepSeek-R1-0528",
    #repo_id="perplexity-ai/r1-1776-distill-llama-70b",
    task="text-generation",
    max_new_tokens=500,
    huggingfacehub_api_token=os.getenv("HF_API_KEY")
)

llm = ChatHuggingFace(llm=qwen_llm)

In [61]:
# Constructing prompt for LLM input
prompt = PromptTemplate(
    template="""
        You are a personalized research Assistant for providing meaningful answers.
        Answer in simple meaning and detailed explanation of the topic for user understanding.
        Answer ONLY from the provided context.
        If the context is insufficient, just say you don't know.
            
        {context}
        Question: {question}
    """,
    input_variables=['context', 'question']
)

In [62]:
print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} template="\n        You are a personalized research Assistant for providing meaningful answers.\n        Answer in simple meaning and detailed explanation of the topic for user understanding.\n        Answer ONLY from the provided context.\n        If the context is insufficient, just say you don't know.\n            \n        {context}\n        Question: {question}\n    "
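
The printed template shows that every line carries the indentation of the notebook cell, and that whitespace becomes part of the prompt sent to the model. Whether it affects answer quality depends on the model, but it is easy to remove. The sketch below uses `textwrap.dedent` with plain `str.format` as a stand-in for `PromptTemplate`; the shortened template text and the sample context are illustrative only.

```python
import textwrap

# Stand-in for the notebook's template; plain str.format replaces PromptTemplate here.
raw_template = """
    You are a personalized research assistant.
    Answer ONLY from the provided context.

    {context}
    Question: {question}
"""

# dedent removes the common leading indentation; strip drops the outer blank lines.
clean_template = textwrap.dedent(raw_template).strip()

filled = clean_template.format(
    context="The Faiss library is dedicated to vector similarity search.",
    question="What is Faiss?",
)
print(filled)
```

The cleaned string could then be passed as the `template` argument in place of the indented literal.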


In [63]:
question = "what is skills"
retrived_docs = retriever.invoke(question)

In [None]:
retrived_docs

[Document(id='7a15314d-9415-4942-b154-860353b92bbd', metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-07-31T08:46:52+00:00', 'author': '', 'keywords': '', 'moddate': '2025-07-31T08:46:52+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live 2024) kpathsea version 6.4.0', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'RPR.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='• Electrical and Electronics: Signals and Systems\nCertifications and Achievements\n• Team Leader - Placement Fair Volunteer Section, GEC-Gandhinagar 2025\n• LangChain Essentials for Generative AI, LangChain Web 2025\n• Trustworthy Generative AI, Coursera 2024\n• Machine Learning with Python, Jovian 2024\n• Business Writing - Effective Communication, Coursera 2024'),
 Document(id='080840da-8c31-427d-9036-cb8e276c3d5f', metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '20

In [64]:
context_text = "\n\n".join(doc.page_content for doc in retrived_docs)

In [None]:
context_text

'• Electrical and Electronics: Signals and Systems\nCertifications and Achievements\n• Team Leader - Placement Fair Volunteer Section, GEC-Gandhinagar 2025\n• LangChain Essentials for Generative AI, LangChain Web 2025\n• Trustworthy Generative AI, Coursera 2024\n• Machine Learning with Python, Jovian 2024\n• Business Writing - Effective Communication, Coursera 2024\n\n– Developed a customResNet9 CNN modelusing PyTorchto classify natural scene images into6 categories, trained\non 14,000+ images and obtained90.27% accuracyon validation set(3,000+ images) using advancedimage augmen-\ntation techniques(Resize, Random-Crop, Random-Rotation, Normalization)\n– Automated dataset retrieval viaKaggle APIand deployed the model with aStreamlit web appfor real-time image\nclassification and can be utilized inTerrain Assessment Systems.\n• 2. Generative AI for Text Generation | LSTM, Deepseek-R1, Streamlit Feb 2025 - March 2025\nHack the Spring - Technoverse, GEC Gandhinagar github.com/rikin-2911/Te
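
Joining only `page_content` discards the metadata, so the model cannot say which page a statement came from, and nothing limits how long the context grows. The sketch below shows one possible variant that tags each chunk with its page label and stops at a rough character budget. `Doc` is a hypothetical stand-in for the LangChain `Document` class, and `build_context` and `max_chars` are assumptions, not part of the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(docs: list[Doc], max_chars: int = 4000) -> str:
    """Join chunks with their page labels, stopping once the rough budget is reached."""
    parts = []
    used = 0
    for doc in docs:
        label = doc.metadata.get("page_label", "?")
        block = f"[page {label}]\n{doc.page_content}"
        if used + len(block) > max_chars:
            break  # budget counts block text only, not the separators
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


sample_docs = [
    Doc("The Faiss library is dedicated to vector similarity search.", {"page_label": "1"}),
    Doc("Embeddings are vector representations.", {"page_label": "1"}),
    Doc("x" * 5000, {"page_label": "2"}),  # too large for the budget below
]
context = build_context(sample_docs, max_chars=200)
print(context)
```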

In [65]:
final_prompt = prompt.invoke({'context':context_text, 'question':question})

In [None]:
print(final_prompt)

text="\n        You are a personalized research Assitant for providing meaningful answers.\n        Answer in simple meaning and detailed explanation of the topic for user understanding.\n        Answer ONLY from the provided transcript context.\n        If the context is insufficient, just say you don't know,\n            \n        • Electrical and Electronics: Signals and Systems\nCertifications and Achievements\n• Team Leader - Placement Fair Volunteer Section, GEC-Gandhinagar 2025\n• LangChain Essentials for Generative AI, LangChain Web 2025\n• Trustworthy Generative AI, Coursera 2024\n• Machine Learning with Python, Jovian 2024\n• Business Writing - Effective Communication, Coursera 2024\n\n– Developed a customResNet9 CNN modelusing PyTorchto classify natural scene images into6 categories, trained\non 14,000+ images and obtained90.27% accuracyon validation set(3,000+ images) using advancedimage augmen-\ntation techniques(Resize, Random-Crop, Random-Rotation, Normalization)\n– Auto

In [None]:
final_prompt

StringPromptValue(text="\n        You are a personalized research Assitant for providing meaningful answers.\n        Answer in simple meaning and detailed explanation of the topic for user understanding.\n        Answer ONLY from the provided transcript context.\n        If the context is insufficient, just say you don't know,\n            \n        • Electrical and Electronics: Signals and Systems\nCertifications and Achievements\n• Team Leader - Placement Fair Volunteer Section, GEC-Gandhinagar 2025\n• LangChain Essentials for Generative AI, LangChain Web 2025\n• Trustworthy Generative AI, Coursera 2024\n• Machine Learning with Python, Jovian 2024\n• Business Writing - Effective Communication, Coursera 2024\n\n– Developed a customResNet9 CNN modelusing PyTorchto classify natural scene images into6 categories, trained\non 14,000+ images and obtained90.27% accuracyon validation set(3,000+ images) using advancedimage augmen-\ntation techniques(Resize, Random-Crop, Random-Rotation, Norm

In [66]:
results = llm.invoke(final_prompt)

In [67]:
print(results.content)

This document is about **Faiss**, a powerful index library used in machine learning tasks. To understand skills, let's break down the table of contents and look at some key concepts:

**Skills:**  Are generally abilities, knowledge, and traits that help in performing tasks and accomplishing goals. 
 
**Faiss:**  This name appears in a scientific context, referring to an open-source library. What this library is used for is crucial to understanding its "skills"

Here's the key takeaway to understand the concept of skills from this document:

* **Artificial Intelligence (AI) and Computer Vision:** This document focuses on how AI researchers are processing large amounts of data (pictures, video), using a library made for this task to build great models.
* **Machine Learning (ML):** This is the process where computers are trained to learn from data to model various things. AI that relates to pattern recognition in images or videos, FAISS can be a versatile tool to set up machine learning m
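
The cells above run retrieval, augmentation, and generation one at a time. The sketch below wires the three steps into a single function, using stand-in callables so it runs offline. In the notebook, `retriever.invoke`, `prompt.invoke`, and `llm.invoke` would play these roles. `answer_question`, `fake_retrieve`, and `fake_generate` are hypothetical names.

```python
from typing import Callable


def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, build the prompt, and generate an answer."""
    passages = retrieve(question)
    if not passages:
        # Nothing retrieved: skip the model call instead of letting it guess.
        return "I don't know."
    context = "\n\n".join(passages)
    prompt_text = (
        "Answer ONLY from the context below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt_text)


# Hypothetical stand-ins so the sketch runs without a network connection.
def fake_retrieve(question: str) -> list[str]:
    corpus = ["The Faiss library is dedicated to vector similarity search."]
    return [p for p in corpus if "faiss" in question.lower()]


def fake_generate(prompt_text: str) -> str:
    return f"(model output for a prompt of {len(prompt_text)} characters)"


print(answer_question("What is Faiss?", fake_retrieve, fake_generate))
print(answer_question("what is skills", fake_retrieve, fake_generate))
```

Note that a vector retriever normally returns its `k` nearest chunks even when none of them is relevant, so the empty-result guard alone would not stop an off-topic answer like the one above; a similarity threshold would be a separate design decision.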

In [None]:
context_text = "\n\n".join(doc.page_content for doc in chunks)
summary_prompt=PromptTemplate(
            template="""
                    Summarize the following documents:- \n {context_text}
            """,
            input_variables=['context_text']
        )

In [None]:
print(summary_prompt)

input_variables=['context_text'] input_types={} partial_variables={} template='\n                    Summarize the following documents:- \n {context_text}\n            '
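
Joining all 134 chunks (up to 1000 characters each) into one prompt produces a very long input, which may exceed what the chosen model accepts. A common alternative is to summarize the chunks in batches and then summarize those partial summaries. The sketch below shows that pattern with a stand-in for the LLM call; `map_reduce_summary`, `fake_summarize`, and `batch_size` are assumptions for illustration.

```python
from typing import Callable


def map_reduce_summary(
    texts: list[str],
    summarize: Callable[[str], str],
    batch_size: int = 10,
) -> str:
    """Summarize batches of chunks, then summarize the batch summaries."""
    if not texts:
        return ""
    partial = []
    for start in range(0, len(texts), batch_size):
        batch = "\n\n".join(texts[start:start + batch_size])
        partial.append(summarize(batch))  # "map" step: one call per batch
    if len(partial) == 1:
        return partial[0]
    return summarize("\n\n".join(partial))  # "reduce" step: one final call


calls = []


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: records the input length, keeps 20 characters."""
    calls.append(len(text))
    return text[:20]


toy_chunks = [f"chunk {i} text" for i in range(25)]
summary = map_reduce_summary(toy_chunks, fake_summarize, batch_size=10)
print(len(calls), "calls made")
```

In the notebook, the stand-in would be replaced by a function that fills `summary_prompt` and calls `llm.invoke`, with `batch_size` chosen to fit the model's input limit.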
