In [7]:
# Install LangChain library
# LangChain helps us connect Large Language Models (LLMs) with prompts, chains, and tools
# LangChain → Framework to build AI applications using LLMs
!pip install langchain

# Install OpenAI wrapper (optional, future-ready)
# Useful if we switch from Gemini to OpenAI models later
# Wrapper → A layer that connects external models with LangChain
!pip install langchain_openai

# Install Google Gemini integration for LangChain
# Enables us to use Google’s Gemini models inside LangChain
!pip install langchain_google_genai

Collecting langchain_openai
  Downloading langchain_openai-1.1.7-py3-none-any.whl.metadata (2.6 kB)
Downloading langchain_openai-1.1.7-py3-none-any.whl (84 kB)
Installing collected packages: langchain_openai
Successfully installed langchain_openai-1.1.7
Collecting langchain_google_genai
  Downloading langchain_google_genai-4.2.0-py3-none-any.whl.metadata (2.7 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain_google_genai)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Downloading langchain_google_genai-4.2.0-py3-none-any.whl (66 kB)
Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Installing collected packages: filetype, langchain_google_genai
Successfully installed filetype-1.2.0 langchain_google_genai-4.2.0


In [8]:
# Import OS module
# Used to interact with environment variables securely
# os → Python module to manage system-level operations like API keys
import os

In [9]:
# Import userdata from Google Colab
# This allows us to safely access stored secrets
from google.colab import userdata

gem = userdata.get('Gemini')  # Fetch the Gemini API key from Colab secrets

# Store the API key as an environment variable
# The Gemini integration reads GOOGLE_API_KEY from the environment when no key is passed explicitly
os.environ["GOOGLE_API_KEY"] = gem

In [10]:
# Import Gemini text and chat models from LangChain
from langchain_google_genai import GoogleGenerativeAI, ChatGoogleGenerativeAI

In [11]:

model = ChatGoogleGenerativeAI(model="gemini-3-flash-preview")

Note: When using TypedDict for structured output in LangChain, use a chat model. with_structured_output is implemented by chat model classes such as ChatGoogleGenerativeAI, which can rely on the provider's tool-calling or JSON-schema support to return data in a fixed shape. Plain text-completion LLMs generate free-form text and may not follow a strict data structure.

Rule:
TypedDict → Chat Model → Structured Output

In [12]:
from typing import TypedDict

We import TypedDict from the typing module to define the expected structure of a dictionary with specific keys and value types.

It helps:

- Describe a structured data format
- Improve code readability
- Enable static type checking (TypedDict itself does not validate values at runtime)
- Give the chat model a fixed schema to fill in when the class is passed to with_structured_output

In [13]:
# Next, we create our own schema (a template for our data format)

In [14]:
class DataFormat(TypedDict):
    summary: str
    experience: int

# This defines a dictionary schema in which summary is a string and experience is an integer. Passed to with_structured_output, it tells the model which keys and value types to return.

In [15]:
#example
#obj = DataFormat(summary ="Shrutika",experience=1)
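The commented example above can be run on its own to see what a TypedDict does and does not do. The following is a minimal sketch using only the standard library; the variable names `good` and `bad` are illustrative. It suggests that calling a TypedDict class simply builds a plain dict and that a wrong value type is not rejected at runtime (a static type checker such as mypy would flag it instead).

```python
from typing import TypedDict


class DataFormat(TypedDict):
    summary: str
    experience: int


# Calling the class builds an ordinary dict; no validation happens here.
good = DataFormat(summary="Shrutika", experience=1)

# Wrong type for experience: a static checker would complain,
# but Python itself raises no error at runtime.
bad = DataFormat(summary="Shrutika", experience="one")

print(type(good))                      # the runtime type is plain dict
print(bad["experience"])               # the string was stored unchanged
print(list(DataFormat.__annotations__))  # the declared field names
```

This is why the schema is best thought of as a description that tools (type checkers, LangChain) can read, rather than a guard that checks values.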

In [16]:
response = model.invoke("i am having a 10 years of experience in opencv, agentic ai, ml, and dl, And i love to play cricket, skating")

In [17]:
response.content

[{'type': 'text',
  'text': 'That is an incredibly powerful combination of skills! With 10 years in the field, you’ve likely seen the evolution from traditional feature descriptors (like SIFT/SURF in OpenCV) to the current era of Foundation Models and Autonomous Agents.\n\nSince you have such a deep technical background and active hobbies, there is a lot of "cross-pollination" we could talk about. Here are a few ways your world might intersect:\n\n### 1. The Professional Edge: From Vision to Agents\n*   **The Transition:** Having a decade of experience means you likely started with C++/Python OpenCV for image processing and moved into CNNs/Transformers. Now, integrating **Agentic AI** means you’re probably looking at how LLMs can use vision tools to reason about the physical world.\n*   **Agentic Vision:** Are you building systems where an agent uses OpenCV as a "tool" to perform specific inspections or spatial reasoning? That is one of the most exciting frontiers in AI right now.\n\n#

In [18]:
fm = model.with_structured_output(DataFormat)

In [19]:
response = fm.invoke("i am having a 10 years of experience in opencv, agentic ai, ml, and dl, And i love to play cricket, skating")

In [20]:
response

{'summary': 'Experienced professional with 10 years of expertise in OpenCV, Agentic AI, Machine Learning, and Deep Learning, with a personal interest in cricket and skating.',
 'experience': 10}

In [21]:
response["summary"]

'Experienced professional with 10 years of expertise in OpenCV, Agentic AI, Machine Learning, and Deep Learning, with a personal interest in cricket and skating.'

In [22]:
response["experience"]

10

In [23]:
from typing import Optional
# Optional[int] means the value may be an int or None, so the field can be reported as unknown without causing an error. The key itself is still part of the schema.

class DataFormat(TypedDict):
    summary: str
    experience: Optional[int]
    skills: list[str]
    hobbies: list[str]

# This defines a typed dictionary schema where summary is a string, experience may be an integer or None, and skills and hobbies are lists of strings.
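The difference between "the value may be None" and "the key may be missing" is easy to blur, so here is a small standard-library sketch. The class names `RequiredPart` and `ResumeFormat` are hypothetical and not part of the notebook. It uses `total=False` to mark a key as omittable; on Python 3.11+ `typing.NotRequired` is another way to express the same thing.

```python
from typing import Optional, TypedDict


class RequiredPart(TypedDict):
    summary: str
    experience: Optional[int]  # key must be present; value may be None


class ResumeFormat(RequiredPart, total=False):
    hobbies: list[str]  # key may be left out entirely


# TypedDict records which keys are required and which are optional.
print(sorted(ResumeFormat.__required_keys__))
print(sorted(ResumeFormat.__optional_keys__))

# A dict that satisfies the schema: experience is None, hobbies is omitted.
record: ResumeFormat = {"summary": "ML engineer", "experience": None}
print(record)
```

So `Optional[int]` alone keeps `experience` in the required keys; only the allowed values change.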

In [24]:
fm = model.with_structured_output(DataFormat)

In [25]:
response2 = fm.invoke("i am having a experience in opencv, agentic ai, ml, and dl, And i love to play cricket, skating")

In [26]:
response2

{'summary': 'A professional with experience in computer vision and artificial intelligence, specializing in OpenCV, Agentic AI, Machine Learning, and Deep Learning.',
 'experience': 1,
 'skills': ['OpenCV', 'Agentic AI', 'ML', 'DL'],
 'hobbies': ['Cricket', 'Skating']}

In [27]:
from typing import Annotated
# Annotated attaches extra metadata to a type without changing the type itself. Frameworks such as LangChain can read that metadata as guidance, for example a field description that is passed to the LLM.

In [28]:
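class DataFormat1(TypedDict):
  # LangChain's documented convention is Annotated[type, default, description].
  # With only two arguments the string is read as a default value, so ... is used here to mean "no default".
  su: Annotated[str, ..., "give the summary of the resume"]
  ex: Annotated[Optional[int], ..., "if the experience is there you return the experience or else return NA"]

# This defines a structured output schema where su is a string summary of the resume and ex is an optional integer experience field, with Annotated carrying a description of how each field should be generated. Because ex is typed Optional[int], the model can only return an integer or null for it, not the string "NA".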
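To make the role of Annotated more concrete, the sketch below shows how the metadata can be read back with the standard library. The schema name `ResumeFields` and the second description string are illustrative assumptions. It does not show LangChain's internals, only that the metadata travels with the type and that the underlying type stays unchanged.

```python
from typing import Annotated, Optional, TypedDict, get_type_hints


class ResumeFields(TypedDict):  # hypothetical schema for illustration
    su: Annotated[str, ..., "give the summary of the resume"]
    ex: Annotated[Optional[int], ..., "years of experience, if stated"]


# Without include_extras the metadata is stripped and only the type remains.
plain = get_type_hints(ResumeFields)

# With include_extras=True the Annotated wrapper and its metadata are kept.
with_extras = get_type_hints(ResumeFields, include_extras=True)

# Take the last metadata item of each field as its description.
descriptions = {name: hint.__metadata__[-1] for name, hint in with_extras.items()}

print(plain["su"])
print(with_extras["su"].__metadata__)
print(descriptions)
```

A framework that follows the (type, default, description) convention can read the tuple in `__metadata__` in the same way.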

In [29]:
fm3 = model.with_structured_output(DataFormat1)

In [30]:
response2 = fm3.invoke("i am having a experience in opencv, agentic ai, ml, and dl, And i love to play cricket, skating etc")

In [31]:
response2  # Works on newer models; note that ex came back as 0 even though the prompt states no number of years

{'su': 'opencv, agentic ai, ml, dl, cricket, skating', 'ex': 0}

In [None]:
!pip install --upgrade pymupdf

# !pip install --upgrade pymupdf installs or updates PyMuPDF, a library used for extracting and processing text and content from PDF documents.



In [34]:
import pymupdf
doc = pymupdf.open("/content/Data_Analysis_shrutika_kapade_resume_Final.pdf")

# This code imports the PyMuPDF library and opens the specified PDF file (the resume) so that its text can be extracted and analysed.

In [35]:
type(doc)

In [36]:
text = " "
for page in doc:
  text += page.get_text()
# This loop extracts the text from each page of the document and concatenates it into a single string, so the whole resume can be processed at once.

In [37]:
print(text)

 SHRUTIKA KAPADE 
+91 8624929322 | shrutikakapade@gmail.com | LinkedIn | GitHub 
Professional Summary 
Data Scientist skilled in Python, SQL, EDA, Machine Learning, and Data Visualisation with hands-on experience in end-to-end 
analytics workflows. Completed practical projects involving data cleaning, feature engineering, and predictive modelling 
using standard ML techniques. Currently working as an AI Agent Intern and Data Science and GenAI Intern at Innomatics 
Research Labs, contributing to workflow automation, data validation, and process improvement. Focused on applying data-
driven problem-solving to deliver clear insights and support effective business decision-making. 
Education 
Bachelor of Technology (B. Tech) Data Science 
Nov 2022 – Jun 2025 
G.H. Raisoni College of Engineering and Management, Jalgaon. CGPA: 8.71 
Technical Skills 
    Language:  
 
Python (Pandas, NumPy, Scikit-learn), SQL (Window Functions, CTEs, Optimisations).  
    BI Tools:  
 
Power BI (Advanced DAX

In [38]:
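class DataFormat(TypedDict):
  summary: str
  experience: Optional[int]
  skills: list[str]
  # Annotated[type, default, description]: ... means "no default", the string is the field description
  links: Annotated[list[str], ..., "if any links found in the text return me the links as list of string make sure the link should be active"]

This defines a typed dictionary schema for structured output where summary is a text overview, experience is optional, skills is a list of skills, and links is a list of URLs found in the text. The description asks for active links, but the model only sees the extracted text and cannot check whether a URL resolves, so the returned links should be treated as unverified.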
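Since the extracted resume text shows "LinkedIn | GitHub" as bare labels, the model may have no real URL to return for those entries. One way to guard against invented links is to keep only those that literally appear in the source text. The sketch below uses only the standard library; the function name `links_found_in_text`, the regular expression, and the sample strings are assumptions for illustration, and it does not check whether a link is reachable.

```python
import re

# Simple pattern for http/https URLs; deliberately loose, not a full URL parser.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def links_found_in_text(candidate_links, source_text):
    """Keep only the candidate links that occur as URLs in source_text."""
    urls_in_text = {
        url.rstrip(".,;)") for url in URL_PATTERN.findall(source_text)
    }
    return [link for link in candidate_links if link in urls_in_text]


# Hypothetical sample data standing in for the PDF text and the model output.
sample_text = "Portfolio: https://example.com/portfolio. LinkedIn | GitHub"
model_links = ["https://example.com/portfolio", "https://example.com/made-up"]

verified = links_found_in_text(model_links, sample_text)
print(verified)
```

In the notebook this could be applied as `links_found_in_text(final_response["links"], text)` after the model call.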

In [39]:
final_model = model.with_structured_output(DataFormat)

In [40]:
final_response = final_model.invoke(text)

In [41]:
final_response

{'summary': 'Data Scientist skilled in Python, SQL, EDA, Machine Learning, and Data Visualisation with hands-on experience in end-to-end analytics workflows. Completed practical projects involving data cleaning, feature engineering, and predictive modelling using standard ML techniques. Currently working as an AI Agent Intern and Data Science and GenAI Intern at Innomatics Research Labs, contributing to workflow automation, data validation, and process improvement. Focused on applying data-driven problem-solving to deliver clear insights and support effective business decision-making.',
 'experience': 0,
 'skills': ['Python',
  'Pandas',
  'NumPy',
  'Scikit-learn',
  'SQL',
  'Window Functions',
  'CTEs',
  'Optimisations',
  'Power BI',
  'Advanced DAX',
  'Power Query',
  'Excel',
  'Power Pivot',
  'VLOOKUP',
  'Data Quality Assessment',
  'Data Modelling',
  'Star Schema',
  'MySQL',
  'Oracle SQL',
  'Statistical Analysis',
  'EDA',
  'Anomaly Detection',
  'Hypothesis Testing',


### Why do we use Pydantic instead of TypedDict?
- We use Pydantic because it provides runtime data validation, automatic type conversion, and detailed error handling, whereas TypedDict only defines structure and does not validate data at runtime.
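
As a rough illustration of the point above, the sketch below defines a Pydantic model with the same two fields used earlier. The class name `ResumeModel` and the sample values are hypothetical. It assumes the pydantic package is installed (LangChain depends on it) and shows a numeric string being converted to an integer and a non-numeric string being rejected.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class ResumeModel(BaseModel):  # hypothetical model for illustration
    summary: str
    experience: Optional[int] = None


# Automatic type conversion: the string "10" is converted to the int 10.
coerced = ResumeModel(summary="ML engineer", experience="10")
print(coerced.experience)

# Runtime validation: a value that cannot be read as an int raises an error.
try:
    ResumeModel(summary="ML engineer", experience="ten")
    rejected = False
except ValidationError as err:
    rejected = True
    print(err)
```

The same inputs passed to a TypedDict would be stored unchanged, with no conversion and no error.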