## Set Up Environment - Pip Install

In [57]:
!pip install langchain llama-index pypdf faiss-cpu chromadb

In [58]:
!pip install sentence-transformers langchain-community langchain-huggingface langchain_ollama

In [113]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Load the Resume (PDF) as Data Source

In [114]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("pdfs/Vignesh_R_Resume.pdf")
pages = loader.load()

## Chunk & Embed the Text

In [115]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [116]:
# Step 1: Chunk the content
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = text_splitter.split_documents(pages)
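The `chunk_size` and `chunk_overlap` settings are easier to reason about with a toy example. The sketch below is a simplification, not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraph and line breaks, while this hypothetical `sliding_chunks` helper cuts fixed-width character windows. It only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-width windows that overlap by chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo)
```

With a larger overlap, a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk, at the cost of storing more text.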

In [117]:
# Step 2: Create embeddings
# Alternative (smaller, faster) model: all-MiniLM-L6-v2
embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")

In [118]:
# Step 3: Store in FAISS vector store
vectorstore = FAISS.from_documents(docs, embedding=embeddings)
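Conceptually, the vector store answers "which stored chunks are closest to this query vector?". The sketch below shows that idea with a brute-force search over made-up 3-dimensional vectors, assuming Euclidean distance as the metric. The names `top_k` and `toy_index` are hypothetical; real embeddings have hundreds of dimensions and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k stored vectors closest to the query (Euclidean distance)."""
    scored = sorted((math.dist(query, vec), name) for name, vec in vectors.items())
    return [name for _, name in scored[:k]]


# Made-up vectors standing in for embedded resume chunks
toy_index = {
    "skills": [0.9, 0.1, 0.0],
    "education": [0.1, 0.9, 0.0],
    "projects": [0.7, 0.2, 0.3],
}

nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(nearest)
```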

## Connect to Ollama Model

In [119]:
from langchain_community.llms import Ollama

llm = Ollama(model="gemma3:4b", temperature=0.2, num_ctx=128000)

## Build the RetrievalQA Pipeline

In [120]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
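When no `chain_type` is given, `RetrievalQA.from_chain_type` uses the "stuff" strategy: the retrieved chunks are concatenated into a single prompt together with the question. The sketch below imitates that assembly step in plain Python. The `build_stuff_prompt` function and the prompt wording are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks and the question into one prompt string.

    Hypothetical helper; the instruction text is an assumption for illustration.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What cloud platforms are listed?",
    ["Cloud Platforms: AWS, GCP", "Tools: Airflow, Kafka"],
)
print(prompt)
```

Because every retrieved chunk is pasted into the prompt, the chunk size and the number of retrieved chunks together determine how much of the model's context window each question consumes.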

## Ask Questions about the Resume

In [121]:
%%time
from IPython.display import Markdown, display

questions = [
    "1. What are my key skills and technologies mentioned in the resume?",
    "2. What kind of projects have I worked on?",
    "3. What job roles am I best suited for?",
    "4. Generate a short bio from my resume"
]

for q in questions:
    result = qa_chain.invoke(q)
    display(Markdown(f"### ✅ {q}\n**Answer:**\n{result['result']}"))

### ✅ 1. What are my key skills and technologies mentioned in the resume?
**Answer:**
Here’s a breakdown of V Vignesh’s key skills and technologies as listed in the resume:

**Programming Languages:** Python, SQL

**Data Technologies & Methodologies:**
*   Data Analytics
*   Data Engineering
*   Data Science
*   Machine Learning
*   Natural Language Processing (NLP)
*   Computer Vision
*   Large Language Models (LLM)
*   NLTK
*   Big Data
*   Data Modeling
*   Automation
*   Medallion Architecture

**Cloud Platforms:** AWS, Google Cloud Platform (GCP)

**Data Tools & Platforms:**
*   Airflow
*   Apache Kafka
*   Hadoop
*   MySQL
*   NumPy
*   Pandas
*   Postgres
*   Tableau
*   Looker
*   Power BI
*   Databricks
*   Superset
*   PySpark

**Other Relevant Skills:** Git, Analytics, Customer Journey Analysis

### ✅ 2. What kind of projects have I worked on?
**Answer:**
Based on the provided context, Vinesh has worked on projects including:

*   **Scalable ETL pipelines & Data models:** Architecting and constructing these for seamless data flow.
*   **Sales Retrospective Analytics:** Measuring the effectiveness of sales and marketing campaigns.
*   **Effort Score Calculation:** Developing a heuristic model to measure customer struggles during the customer journey.



### ✅ 3. What job roles am I best suited for?
**Answer:**
Based on the provided context, V Vignesh is best suited for roles in Data Analytics, Data Engineering, and Data Science. Specifically, his experience with ETL pipelines, data modeling, Airflow, and various data technologies (SQL, Python, etc.) aligns well with these fields.

### ✅ 4. Generate a short bio from my resume
**Answer:**
Here’s a short bio based on the provided resume:

“Versatile Data Wizard with experience in Data Analytics, Engineering, and Science. As a Data Analytics Engineer at [Company Name - *not provided in the text*], I architected and constructed scalable ETL pipelines, utilized technologies like Python, SQL, and Airflow, and led a team to automate analytics processes, saving 500 hours of manual effort. I’m passionate about leveraging data to drive insights and solutions.”


CPU times: user 204 ms, sys: 1.85 s, total: 2.05 s
Wall time: 27.7 s


In [122]:
%%time
questions = [
    "Give me the candidate's Name, phone number and location in Key: value format such as Name:<name>\nPhone:<phone>\nLocation:<location>"
]

for q in questions:
    result = qa_chain.invoke(q)
    display(Markdown(f"### ✅ {q}\n\n**Answer:**\n\n```\n{result['result']}\n```"))

### ✅ Give me the candidate's Name, phone number and location in Key: value format such as Name:<name>
Phone:<phone>
Location:<location>

**Answer:**

```
Name:Vignesh R
Phone: +91 86085 77937
Location: Bengaluru, Karnataka, India
```

CPU times: user 37.7 ms, sys: 524 ms, total: 562 ms
Wall time: 2.4 s
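Since the answer comes back in a `Key: value` layout, it can be turned into a dictionary for downstream use. The following is a minimal sketch, assuming the model keeps to one `Key: value` pair per line; the `parse_key_values` helper and the sample text are made up for illustration.

```python
def parse_key_values(answer: str) -> dict[str, str]:
    """Parse lines of the form 'Key: value' into a dict, skipping lines without a colon."""
    fields = {}
    for line in answer.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Placeholder answer text, not taken from the resume
raw_answer = "Name:Jane Doe\nPhone: +00 00000 00000\nLocation: Example City, Country"
parsed = parse_key_values(raw_answer)
print(parsed)
```

Model output is not guaranteed to follow the requested format, so a production pipeline would likely validate the parsed fields before using them.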


In [123]:
%%time
questions = [
    '''Mention tools, technologies, languages, frameworks, libraries, environment only. For eg: 
Mention Jenkins instead of CI/CD tools, Mention AWS instead of Cloud, Mention YOLO instead of CV Libraries, Mention Figma instead of Wireframing tools, Mention Tableau instead of Visualisation tools, Mention Selenium instead of Testing tools'''
]

for q in questions:
    result = qa_chain.invoke(q)
    display(Markdown(f"### ✅ {q}\n\n**Answer:**\n\n```\n{result['result']}\n```"))

### ✅ Mention tools, technologies, languages, frameworks, libraries, environment only. For eg: 
Mention Jenkins instead of CI/CD tools, Mention AWS instead of Cloud, Mention YOLO instead of CV Libraries, Mention Figma instead of Wireframing tools, Mention Tableau instead of Visualisation tools, Mention Selenium instead of Testing tools

**Answer:**

```
Here’s a list of tools, technologies, languages, frameworks, libraries, and environments based on the provided context:

*   YOLO
*   Tableau
*   Python
*   SQL
*   Machine Learning
*   Natural Language Processing (NLP)
*   Computer Vision
*   Large Language Models (LLM)
*   NLTK
*   Big Data
*   Business Analytics
*   AWS
*   Google Cloud Platform (GCP)
*   Airflow
*   Apache Kafka
*   Hadoop
*   MySQL
*   NumPy
*   Pandas
*   Postgres
*   Tableau
*   Looker
*   Power BI
*   Git
*   Databricks
*   PySpark
*   Superset
*   JIRA
*   CI/CD tools (Implied by automation strategies)
*   Cloud (Implied by AWS and GCP)
*   JIRA
```

CPU times: user 61.7 ms, sys: 510 ms, total: 572 ms
Wall time: 7.47 s
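The answer above repeats some entries (Tableau and JIRA each appear twice). One way to tidy such output is a small post-processing step that extracts the bullet items and drops duplicates while keeping their original order. This is a sketch with a hypothetical `unique_bullets` helper and a shortened sample string; it assumes each item sits on its own `*` bullet line.

```python
def unique_bullets(answer: str) -> list[str]:
    """Extract '*' bullet items from the text, dropping case-insensitive duplicates."""
    seen = set()
    items = []
    for line in answer.splitlines():
        stripped = line.strip()
        if not stripped.startswith("*"):
            continue  # skip the intro sentence and blank lines
        item = stripped.lstrip("*").strip()
        key = item.casefold()
        if item and key not in seen:
            seen.add(key)
            items.append(item)
    return items


sample_answer = (
    "Here's a list of tools:\n\n"
    "*   YOLO\n*   Tableau\n*   Python\n*   Tableau\n*   JIRA\n*   JIRA"
)
tools = unique_bullets(sample_answer)
print(tools)
```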


In [124]:
print("\nSources used:\n", result["source_documents"])


Sources used:
 [Document(id='29eee783-b318-4cd2-aedd-6c150b416a0a', metadata={'producer': 'react-pdf', 'creator': 'react-pdf', 'creationdate': '2025-07-17T03:51:31+00:00', 'source': 'pdfs/Vignesh_R_Resume.pdf', 'total_pages': 2, 'page': 1, 'page_label': '2'}, page_content="»  Managed ad-hoc Requests : Handled various data-pull requests from several teams inside the organisations, \nto effectively contribute requesting team with Insights\nEDUCATION\n \n Sri Shakthi Institute of Engineering Technology August 2015 - April 2019\nBachelor's, Computer Engineering GPA: 7.3\nCERTIFICATIONS\n \n AWS Certified Solutions Architect\nJava Full Stack Certification\nJapanese Language"), Document(id='b1097bb0-e1ac-4092-9dab-fb9dfadbe91d', metadata={'producer': 'react-pdf', 'creator': 'react-pdf', 'creationdate': '2025-07-17T03:51:31+00:00', 'source': 'pdfs/Vignesh_R_Resume.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='LinkedIn Bengaluru, KA, India\nData Analytics Engineer March

# -----------------------------------------------------------------------------

## RAG with Qdrant + Hugging Face Embeddings + Ollama

In [125]:
!pip install langchain qdrant-client openai pypdf langchain-community



In [126]:
!docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

zsh:1: command not found: docker


In [127]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama
import os

In [128]:
# --- Step 1: Load and parse multiple PDF files ---
pdf_dir = "./pdfs"
all_docs = []

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        loader = PyPDFLoader(os.path.join(pdf_dir, filename))
        all_docs.extend(loader.load())

In [129]:
# --- Step 2: Chunk the documents ---
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = text_splitter.split_documents(all_docs)

In [130]:
# --- Step 3: Embeddings using Hugging Face model ---
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [131]:
# --- Step 4: Load or create Qdrant Vector Store ---
qdrant = Qdrant.from_documents(
    documents=chunks,
    embedding=embedding_model,
    location="http://localhost:6333",
    collection_name="python_docs"
)

In [132]:
# --- Step 5: Load Ollama (Gemma 3, 4B) ---
llm = Ollama(model="gemma3:4b", temperature=0.2, num_ctx=128000)

In [133]:
# --- Step 6: Retrieval QA Chain ---
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=qdrant.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

In [134]:
# --- Step 7: Ask a question ---
from IPython.display import display, Markdown

# Sample list of questions
questions = [
    "What is the difference between Python 2 and Python 3 mentioned in Core_Python_Programming?",
    "Explain decorators with examples from Introduction_to_Python_Programming.",
    "What scientific libraries are covered in Python_Scientific_Programming?",
    "What are the key principles of designing scalable data systems according to Designing_Data_Intensive_Applications?",
    "How is rate limiting explained in System_Design_Interview?"
]

In [135]:
%%time
# Run each question through the RetrievalQA chain and show the answer with its sources
for q in questions:
    result = qa_chain.invoke(q)  # dict with "result" and "source_documents" keys
    answer = result["result"]
    display(Markdown(f"### 🧠 {q}\n\n**Answer:**\n\n```\n{answer}\n```"))
    for doc in result["source_documents"]:
        print("\n📄 Sources:")
        print("-", doc.metadata.get("source"))

### 🧠 What is the difference between Python 2 and Python 3 mentioned in Core_Python_Programming?

**Answer:**

```
According to the text, there are only two differences between Python 2 and Python 3.
```


📄 Sources:
- ./pdfs/Python_Scientific_Programming.pdf

📄 Sources:
- ./pdfs/Python_Scientific_Programming.pdf

📄 Sources:
- ./pdfs/Core_Python_Programming.pdf


### 🧠 Explain decorators with examples from Introduction_to_Python_Programming.

**Answer:**

```
Based on the provided text, here’s an explanation of decorators, drawing from the context:

Decorators in Python are “overlays” applied to function calls. They are additional calls that are executed when a function or method is declared. The syntax uses an “at-sign” ( @ ) followed by the decorator function name and any optional arguments.

The text doesn't provide a concrete example of how decorators are used, but it explains the underlying concept and syntax. It mentions that decorators were introduced in Python 2.4 and that older versions needed to be replaced with a specific assignment.

I don't know how to provide a full example of decorators with code, as the provided text only describes the concept and syntax.
```


📄 Sources:
- ./pdfs/Core_Python_Programming.pdf

📄 Sources:
- ./pdfs/Core_Python_Programming.pdf

📄 Sources:
- ./pdfs/Core_Python_Programming.pdf


### 🧠 What scientific libraries are covered in Python_Scientific_Programming?

**Answer:**

```
Based on the provided text, the scientific libraries covered in Python are NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn.
```


📄 Sources:
- ./pdfs/Introduction_to_Python_Programming_-_WEB.pdf

📄 Sources:
- ./pdfs/Introduction_to_Python_Programming_-_WEB.pdf

📄 Sources:
- ./pdfs/Introduction_to_Python_Programming_-_WEB.pdf


### 🧠 What are the key principles of designing scalable data systems according to Designing_Data_Intensive_Applications?

**Answer:**

```
According to Designing_Data_Intensive_Applications, the key principles of designing scalable data systems are:

*   **Reliability:** Ensuring the system continues to operate correctly.
*   **Scalability:** The ability to handle increasing amounts of data and user traffic.
*   **Maintainability:** The ease with which the system can be modified and updated.

The text also cautions against premature optimization, suggesting that you shouldn't build for scale if you don't need it, and emphasizes choosing the right tool for the job.
```


📄 Sources:
- ./pdfs/Designing_Data_Intensive_Applications.pdf

📄 Sources:
- ./pdfs/Designing_Data_Intensive_Applications.pdf

📄 Sources:
- ./pdfs/Designing_Data_Intensive_Applications.pdf


### 🧠 How is rate limiting explained in System_Design_Interview?

**Answer:**

```
According to the provided text, rate limiting is explained as focusing on the server-side API rate limiter.
```


📄 Sources:
- ./pdfs/System_Design_Interview.pdf

📄 Sources:
- ./pdfs/System_Design_Interview.pdf

📄 Sources:
- ./pdfs/System_Design_Interview.pdf
CPU times: user 165 ms, sys: 2.12 s, total: 2.28 s
Wall time: 23.2 s
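The source listings above repeat the same file once per retrieved chunk, and for the first question two of the three chunks come from a different book than the one named in the question. Counting chunks per file makes this easier to spot. The sketch below uses a hypothetical `summarize_sources` helper on the paths printed for the first question.

```python
from collections import Counter


def summarize_sources(paths: list[str]) -> list[tuple[str, int]]:
    """Return (path, chunk_count) pairs in order of first appearance."""
    return list(Counter(paths).items())


# Source paths printed for the first question above
first_question_sources = [
    "./pdfs/Python_Scientific_Programming.pdf",
    "./pdfs/Python_Scientific_Programming.pdf",
    "./pdfs/Core_Python_Programming.pdf",
]

summary = summarize_sources(first_question_sources)
for path, count in summary:
    print(f"- {path} ({count} chunk(s))")
```

In the notebook, the list of paths could be built from `doc.metadata.get("source")` for each document in `result["source_documents"]`, as the loop above already does.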
