In [20]:
!pip uninstall -y langchain langchain-openai langchain_community langchain_huggingface langchain_google_genai langchain-core pandas
!pip install -q --upgrade "langchain>=0.0.300" langchain-openai langchain_community langchain_huggingface langchain_google_genai sentence-transformers openai pdfplumber pandas==2.2.2 python-dotenv requests==2.32.4

Found existing installation: langchain 0.3.27
Uninstalling langchain-0.3.27:
  Successfully uninstalled langchain-0.3.27
Found existing installation: langchain-openai 0.3.35
Uninstalling langchain-openai-0.3.35:
  Successfully uninstalled langchain-openai-0.3.35
Found existing installation: langchain-community 0.3.27
Uninstalling langchain-community-0.3.27:
  Successfully uninstalled langchain-community-0.3.27
Found existing installation: langchain-huggingface 0.3.1
Uninstalling langchain-huggingface-0.3.1:
  Successfully uninstalled langchain-huggingface-0.3.1
Found existing installation: langchain-google-genai 2.1.12
Uninstalling langchain-google-genai-2.1.12:
  Successfully uninstalled langchain-google-genai-2.1.12
Found existing installation: langchain-core 0.3.79
Uninstalling langchain-core-0.3.79:
  Successfully uninstalled langchain-core-0.3.79
Found existing installation: pandas 2.3.3
Uninstalling pandas-2.3.3:
  Successfully uninstalled pandas-2.3.3

In [21]:
from google.colab import userdata
import os
OpenAI_API_KEY = os.environ["OPENAI_API_KEY"] = userdata.get('OpenAI_API_KEY')
GEMINI_API_KEY = os.environ["GOOGLE_API_KEY"] = userdata.get('GEMINI_API_KEY')

First, upload your PDF file to the Colab environment using `files.upload()` from `google.colab`. Running the cell displays a button that lets you select and upload the file.

In [22]:
from google.colab import files

uploaded = files.upload()

# Get the name of the uploaded file
for fn in uploaded.keys():
  print(f'User uploaded file "{fn}"')
  pdf_file_path = fn

Saving TCS_interview.pdf to TCS_interview (3).pdf
User uploaded file "TCS_interview (3).pdf"


## Document Loaders

Now that the PDF file is uploaded, we can use `PyPDFLoader` from `langchain_community.document_loaders` to load its content. The loader returns one `Document` per PDF page.

In [23]:
!pip install pypdf



In [24]:
from langchain_community.document_loaders import PyPDFLoader

# Ensure pdf_file_path is defined from the previous upload step
# If you rerun this cell independently, you might need to manually set pdf_file_path = 'your_uploaded_file_name.pdf'

loader = PyPDFLoader(pdf_file_path)
documents = loader.load()
# documents[1] is the second page of the PDF (index 0 is the first page)
print(documents[1].page_content)

TCS Recruitment Process
1.   Interview Process
2.   Interview Rounds
TCS Technical Interview Questions: Freshers and
Experienced
3.   What is Socket Programming? What Are The Benefits And Drawbacks Of Java
Sockets?
4.   What is IPsec? What are its components?
5.   What do you understand about a Subnet Mask?
6.   What is NAT?
7.   What is piggybacking?
8.   What does a database schema imply? What are its types?
9.   What is the diﬀerence between a clustered index and non clustered index ?
10.   What do you understand about round trip time?
11.   What is a Ping?
12.   What do you know about SLIP?
13.   What is Ethernet?
14.   What is the tunnel mode in networking?
15.   Discuss the RSA algorithm in brief.
16.   In a so ware program, what is cyclomatic complexity?
17.   Give an instance where there was a bug that you didn't find in black box testing
but discovered in white box testing.
18.   What is slice splicing in so ware testing? What are its two types?
Page 1 © Copyright by Interview

## Splitting the data (Chunking)

In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# initialise the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap = 300
)

# split the documents (one per PDF page) into overlapping chunks
chunks = text_splitter.split_documents(documents)

# optional: uncomment to inspect chunk contents
# print([chunk.page_content for chunk in chunks])
# print(chunks[1].page_content)
print("Chunks created:", len(chunks))
print("Sample chunk metadata:", chunks[0].metadata)
# first 400 characters of the sixth chunk
print(chunks[5].page_content[:400])

Chunks created: 37
Sample chunk metadata: {'producer': 'Skia/PDF m85', 'creator': 'Chromium', 'creationdate': '2023-11-11T07:45:01+00:00', 'moddate': '2023-11-11T07:45:01+00:00', 'source': 'TCS_interview (3).pdf', 'total_pages': 30, 'page': 0, 'page_label': '1'}
TCS Interview Questions
TCS Recruitment Process
1.   Interview Process
TCS is an excellent location to begin your career as a new employee. It provides a
fantastic workplace as well as a welcoming setting with a good ambiance conducive
to individual and company progress. TCS holds a mass recruiting procedure every
year to find applicants for the position of  So ware Engineer. This article not only
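
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking on a toy string. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch ignores. The helper name `sliding_window_chunks` and the sample string are assumptions made for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # toy input, not from the PDF
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for c in demo_chunks:
    print(c)
```

Each window repeats the last three characters of the previous one, which is the role the 300-character overlap plays for the 2000-character chunks above: text near a boundary appears in both neighbouring chunks.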


## Embedding Model - Hugging Face Hub

Embedding models map text to numeric vectors that aim to capture its "meaning", so that texts with similar meaning get similar vectors.

In [26]:
from langchain_huggingface import HuggingFaceEmbeddings

In [27]:
emb_model_name = "all-MiniLM-L6-v2"  # good free model
embeddings = HuggingFaceEmbeddings(model_name=emb_model_name)

# quick smoke test embedding
test_emb = embeddings.embed_documents([chunks[0].page_content[:200]])
print("Embedding vector length:",len(test_emb[0]))

Embedding vector length: 384
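
One common way to compare two embedding vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors instead of the 384-dimensional output of `all-MiniLM-L6-v2`; the variable names and values are assumptions chosen only to show the idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (real ones have 384 dimensions)
v_network = [0.9, 0.1, 0.0]
v_router = [0.8, 0.2, 0.1]
v_salary = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(v_network, v_router)
sim_far = cosine_similarity(v_network, v_salary)
print(sim_close > sim_far)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0.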


## Store the embedded chunks in a vector DB

In [11]:
!pip install chromadb



## Define Chroma DB as Vector store

In [28]:
# use Chroma as the vector DB
from langchain_community.vectorstores import Chroma
db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
)


In [24]:
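# count the number of chunks embedded and stored in the db
# (the output below comes from an earlier run, as the execution counter shows,
# so it need not match the 37 chunks created above)
print(db._collection.count())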

609


In [19]:
# get the ids (and other stored fields) of the vectors in the db
print(db._collection.get())

{'ids': ['d6169d55-5fbf-4604-99e3-d9f55f557923', '8597a1f4-bcc9-4de1-9268-ea187a63d773', '38163800-ff93-4324-9eb3-1e63e56c91e7', '10c32c4c-ca82-4969-81f2-736e1670830d', 'b4a6471c-7ad6-4de4-9b3b-c5ac45477862', 'd012002b-4afc-4aae-8913-7b95974fb19a', 'f06aaf5c-783d-4b9d-810f-220db902caa7', '5551d667-f923-4d70-87f0-0319b49989e5', '65053e4e-0246-4046-8762-6425725edc3f', '8147b50e-dfb6-4dcd-a14b-9ffce1e5543f', 'aa36b9f4-4e7d-4b53-85ea-f4f8be856dab', '145e6764-04b6-4f2d-a4dc-0b0fa98cf257', '3edd0816-6022-43e7-8f1a-872b09b838b1', '07df0a88-d649-4d27-80c1-a5d31a27b22d', 'bad63aa2-51e2-41a0-9f75-475431886656', '7788cc21-ab67-4619-9104-e59f369e8322', 'e98e8a89-edde-4d3f-bff1-4481ebd4ce70', '0ba5e206-6f87-401b-93f8-83856a2cb348', '6a789f7b-082f-4729-99ef-84015572f315', 'c7bf634f-dca8-4ac6-9373-91c14be7e5e7', '2e41d07c-a4a9-464d-961a-0f55f478a165', '9d484ed3-f821-463e-a4a0-3ca60bc8ba35', '61ae90fe-2452-418c-90b3-737c43316ef0', 'da3a1c12-f1d7-407b-82c5-83a28ba2e727', '95dad15f-4d91-4c12-8f0d-2892ea

In [25]:
# look up the document text and embedding stored under a particular id
print(db._collection.get(ids = ['d6169d55-5fbf-4604-99e3-d9f55f557923'],include=['documents','embeddings']))

{'ids': ['d6169d55-5fbf-4604-99e3-d9f55f557923'], 'embeddings': array([[-1.15898706e-01,  7.42866471e-02, -1.97487622e-02,
         5.56280538e-02,  5.29825240e-02,  3.99922654e-02,
         2.51608714e-02, -3.20511796e-02, -3.92155796e-02,
         1.40811075e-02, -5.67454621e-02,  1.28618367e-02,
        -2.33617648e-02, -2.24867165e-02, -4.82673533e-02,
        -9.96196270e-02,  5.70688397e-02, -7.31768161e-02,
        -3.62602109e-03, -3.77334133e-02, -8.53090137e-02,
        -1.86699964e-02,  6.58945041e-03, -2.85180174e-02,
         8.70462973e-03, -3.68316956e-02,  7.81755545e-04,
         1.24251600e-02,  6.38417825e-02,  5.03268577e-02,
        -7.48619810e-02,  3.58098187e-02, -4.29339632e-02,
         3.07869930e-02,  3.84877212e-02, -2.06308644e-02,
        -1.04079563e-02,  5.96803129e-02, -1.56391170e-02,
         2.86992732e-02,  4.70096897e-03, -5.93037046e-02,
        -9.85627621e-03, -6.59784442e-03, -2.61561256e-02,
        -1.29470021e-01, -5.90064004e-02,  3.753076

# RAG Pipeline (Retrieval Chain)

## 1. The vector DB acts as the "Retriever"

## 2. Augmentation -> (Query + Context) -> ChatPromptTemplate (consists of context & question)

## 3. LLM Creation

In [29]:
# 1. create a retriever that returns the 2 most similar chunks for a query
retriever = db.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k": 2}
)
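
Conceptually, a similarity retriever with `k = 2` embeds the query and returns the two stored chunks whose vectors lie closest to it. The sketch below shows that idea on a toy in-memory list. The vectors are invented, the `top_k` helper is hypothetical, and squared Euclidean distance is an assumption; the real retriever delegates the search to Chroma's index.

```python
def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are nearest to query_vec."""
    def sq_dist(a, b):
        # squared Euclidean distance between two equal-length vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    ranked = sorted(store, key=lambda item: sq_dist(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical store of (text, vector) pairs; the vectors are made up
toy_store = [
    ("What is NAT?", [0.9, 0.1, 0.0]),
    ("What is a Ping?", [0.7, 0.3, 0.1]),
    ("Discuss the RSA algorithm in brief.", [0.1, 0.9, 0.2]),
    ("What is Ethernet?", [0.6, 0.2, 0.3]),
]
query = [0.85, 0.15, 0.05]  # pretend embedding of the user's question

results = top_k(query, toy_store, k=2)
print(results)
```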

In [30]:
# 2. pull the default RAG prompt from the LangChain Hub
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")
print(prompt.messages[0].prompt)
# for reference, the pulled template corresponds roughly to the following
"""
from langchain.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
          You are an assistant for question-answering tasks.
          Use the following pieces of retrieved context to answer
          the question. If you don't know the answer, just say
          that you don't know. Use three sentences maximum
          and keep the answer concise.

          Question: {question}
          Context: {context}
          Answer:
        )
"""

input_variables=['context', 'question'] input_types={} partial_variables={} template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"


"\nfrom langchain.prompts import ChatPromptTemplate\nprompt = ChatPromptTemplate.from_template(\n          You are an assistant for question-answering tasks. \n          Use the following pieces of retrieved context to answer\n          the question. If you don't know the answer, just say\n          that you don't know. Use three sentences maximum \n          and keep the answer concise.\n\n          Question: {question} \n          Context: {context} \n          Answer:\n        )\n"

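The augmentation step amounts to substituting the question and the retrieved context into the template printed above. As a rough sketch, plain `str.format` can stand in for `ChatPromptTemplate`; the context string here is a hypothetical placeholder, not real retriever output.

```python
# Template text copied from the printed rlm/rag-prompt output above
template = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)

# Placeholder context (assumption); in the notebook it comes from the retriever
filled = template.format(
    question="What is NAT?",
    context="<retrieved chunk text goes here>",
)
print(filled)
```

The filled string is what the LLM ultimately receives, which is why the quality of the retrieved context directly limits the quality of the answer.
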
In [31]:
!pip install langchain_google_genai



In [32]:
# import ChatOpenAI (needs a paid OpenAI key) or ChatGoogleGenerativeAI (free tier available)
# from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=GEMINI_API_KEY
)

In [14]:
# 3. LLM creation (alternative): uncomment to use an OpenAI model instead of Gemini
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model = "gpt-4o-mini",api_key = OpenAI_API_KEY)

# RAG Chain

In [33]:
# The retriever returns a list of Document chunks; join their text
# into a single string that fills the prompt's {context} slot.
def format_chunks(chunks):
    return "\n".join(chunk.page_content for chunk in chunks)


In [34]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
rag_chain = (
    {
        # context: retriever output, joined into one string by format_chunks
        "context": retriever | format_chunks,
        # question: the user's input, passed through unchanged
        "question": RunnablePassthrough()
    }
    | prompt             # fill the prompt with question and context
    | llm                # send the filled prompt to the LLM
    | StrOutputParser()  # return the generated output as a plain string
)
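
The `|` pipeline can be read as ordinary function composition: each stage receives the previous stage's output. The sketch below mirrors that data flow with plain Python functions. Every `stub_*` name and return value is a hypothetical stand-in, so it shows only the order of operations, not LangChain's actual implementation.

```python
def stub_retriever(question: str) -> list[str]:
    # Stand-in for `retriever`: always returns the same two placeholder texts
    return ["chunk A text", "chunk B text"]


def stub_format_chunks(chunks: list[str]) -> str:
    # Same joining logic as format_chunks, but on plain strings
    return "\n".join(chunks)


def stub_prompt(inputs: dict) -> str:
    # Stand-in for the prompt template
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def stub_llm(prompt_text: str) -> str:
    # Stand-in for the LLM: echoes the first prompt line instead of generating
    return "stub answer based on: " + prompt_text.splitlines()[0]


def run_chain(question: str) -> str:
    # The dict step: build context and pass the question through unchanged
    inputs = {
        "context": stub_format_chunks(stub_retriever(question)),
        "question": question,
    }
    # prompt -> llm (the stub already returns a string, so no parser is needed)
    return stub_llm(stub_prompt(inputs))


answer = run_chain("What is NAT?")
print(answer)
```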

## Test the RAG Chain using `.invoke()`

In [35]:
rag_chain.invoke("What is NAT?")

'NAT stands for Network Address Translation. It involves modifying the IP headers of packets as they are transported over a traffic routing device. This process is used to remap one IP address space to another.'

In [38]:
rag_chain.invoke("What is salary for freshers in TCS?")

"I don't know the answer to the question. The provided context discusses interview questions for freshers and experienced employees at TCS, but it does not specify salary figures for freshers."