<a href="https://colab.research.google.com/github/quantumhome/DataAnalysisCaseStudy/blob/master/Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install langchain langchain-core langchain-community langchain-google-genai pypdf chromadb sentence-transformers



In [None]:
import os
from google.colab import userdata
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains.retrieval_qa.base import RetrievalQA

# Set the Gemini API key as an environment variable.
# Create a key in Google AI Studio and store it as a Colab secret named 'DeveloperKey';
# never paste the key itself into the notebook.
os.environ["GOOGLE_API_KEY"] = userdata.get('DeveloperKey')


# 1. Load and chunk the PDF
pdf_file_path = "/content/sample_data/AUG_2025_PaySlip.pdf"
loader = PyPDFLoader(pdf_file_path)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

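# 2. Create embeddings using a Hugging Face model and store them in ChromaDB
# all-MiniLM-L6-v2 is a small open-source sentence-embedding model (384-dimensional vectors)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Note: if ./chroma_db already exists, the new chunks are added to whatever is stored there.
db = Chroma.from_documents(texts, embeddings, persist_directory="./chroma_db")
# With chromadb >= 0.4 the store is written to persist_directory automatically;
# db.persist() is deprecated and only needed on older versions.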

# 3. Initialize the LLM (Google Gemini)
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.3)

# 4. Build the RAG chain
# 'stuff' pastes every retrieved chunk into a single prompt. That fits here because
# the retriever returns only a few chunks (4 by default), which keeps the prompt small.
qa_chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    retriever=db.as_retriever()
)

# 5. Ask a question or request a summary
query = "Provide a comprehensive summary of the document, focusing on key findings and conclusions."
result = qa_chain.invoke({"query": query})

print(result["result"])


This document contains two distinct types of information:

1.  **Technical/Academic Information:**
    *   It provides a definition of an **attention function**, describing it as a mechanism that maps a query and a set of key-value pairs to an output vector, computed as a weighted sum.
    *   It includes a list of academic references (numbered 25-29) primarily related to computational linguistics and natural language processing, covering topics such as the Penn Treebank, self-training for parsing, decomposable attention models, abstractive summarization, and tree annotation.

2.  **Financial/Salary Information (in INR):**
    *   **Current Month's Gross Earnings:** 130,755.25 INR, comprising components like Basic Salary (41,697.29), HRA (20,849.03), Car Allowance (13,548.39), Compensatory Allowance (35,266.06), and Engagement PB (7,791.48), among others.
    *   **Current Month's Gross Deductions:** 13,748.00 INR, including Medical Premium Recoverable (750.00), Ee PF contribution (5,0
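
The summary above mixes payslip figures with attention-function text and academic references, which do not look like payslip content. One plausible cause, offered here as an assumption rather than a confirmed diagnosis, is that `./chroma_db` still held chunks from an earlier PDF when the cell ran. A minimal sketch of clearing the directory before rebuilding the store follows; the helper name `reset_persist_dir` is hypothetical and uses only the standard library.

```python
import shutil
from pathlib import Path


def reset_persist_dir(path: str) -> bool:
    """Delete a persisted vector-store directory left over from earlier runs.

    Returns True if a directory was removed, False if there was nothing to remove.
    Hypothetical helper: it knows nothing about Chroma, it only deletes files.
    """
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)  # removes the directory and everything inside it
        return True
    return False
```

Calling `reset_persist_dir("./chroma_db")` just before `Chroma.from_documents(...)` would make the store contain only the chunks of the PDF being loaded. Deleting the directory is the bluntest option; it also discards anything that was meant to be kept.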

In [None]:
# Query the same chain again; the retriever pulls the relevant chunks from ChromaDB
query1 = "Provide the basic salary details."
result = qa_chain.invoke({"query": query1})

print(result["result"])

Here are the basic salary details:

*   **Standard Monthly Salary (INR):** 53,859.00
*   **Earnings (INR):** 41,697.29
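
Both queries go through the same two steps: the retriever picks the stored chunks closest to the question, and the `stuff` chain pastes them into one prompt for the LLM. The sketch below imitates that flow in plain Python. The 3-dimensional vectors, the helpers `cosine`, `top_k` and `stuff_prompt`, and the prompt wording are all illustrative assumptions, not the actual LangChain or Chroma implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into a single prompt ('stuff' strategy)."""
    context = "\n\n".join(chunks)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"


# Toy store of (chunk text, made-up embedding); real embeddings have hundreds of dimensions.
store = [
    ("Basic Salary 41,697.29", [0.9, 0.1, 0.0]),
    ("HRA 20,849.03", [0.7, 0.3, 0.0]),
    ("Attention function: maps a query and key-value pairs to an output", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the salary question

chunks = top_k(query_vec, store, k=2)
prompt = stuff_prompt("Provide the basic salary details.", chunks)
print(prompt)
```

With these toy vectors the two salary chunks rank highest and the attention chunk is left out of the prompt. In the real chain the same selection depends on the embedding model and on what is stored in the collection.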
