# 💼 Resume Reviewer – Job Match Assistant

## 📌 1. Project Overview

This project builds a GenAI-powered assistant that analyzes a resume in the context of a job description and offers actionable feedback. The assistant helps job seekers:

- Understand how well their resume aligns with a job
- Get specific suggestions for edits and improvements
- Automatically generate a personalized cover letter
- View results in a clean JSON format that can be reused or visualized

## 🧠 GenAI Capabilities Demonstrated

This notebook showcases:

1. **Long-context handling** — process resume and job description together  
2. **Few-shot prompting** — guide output using examples  
3. **Structured output (JSON)** — receive parseable, machine-readable results  
4. **Document understanding** — analyze resumes and JDs  
5. **Retrieval-Augmented Generation (RAG)** — surface relevant sections  
6. **Context caching** — avoid repeating work on same inputs

## 🔧 Technologies Used

- Google Gemini API (`google-genai` SDK)
- FAISS for similarity search
- Joblib for caching
- Python & Jupyter


## 🟦 2. Imports & Setup

We will:

- Use only the required libraries for GenAI, RAG, and caching
- Install Gemini SDK if needed
- Load secrets using Kaggle's secure interface


In [1]:
# Remove conflicting JupyterLab packages
!pip uninstall -qy jupyterlab jupyterlab-lsp
# Reinstall compatible JupyterLab (ensure version 3.x)
!pip install -qU 'jupyterlab>=3.1.0,<4.0.0a0'
!pip install -qU jupyterlab-lsp

# Install Gemini SDK, FAISS for vector search, and PDF parser
!pip install -qU google-genai==1.7.0 faiss-cpu pdfminer.six
!pip install PyPDF2 --quiet


In [2]:
# Standard library
import io
import json
import os
import re
import time
from collections import OrderedDict
from pathlib import Path
from urllib.parse import urlparse

# Third-party
import faiss
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from google import genai
from google.genai import errors, types
from joblib import Memory
from kaggle_secrets import UserSecretsClient
from pdfminer.high_level import extract_text
from PyPDF2 import PdfReader


# Load API key
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
client = genai.Client(api_key=GOOGLE_API_KEY)

## 🟧 3. Data Upload + Preprocessing

- Provide the resume and job description as URLs (a Google Drive share link, a direct PDF link, or a web page)
- Extract text from the downloaded PDF, HTML, or plain-text files
- Parse sections like Summary, Skills, Experience, etc.
- Filter out irrelevant boilerplate from job descriptions


In [3]:
def extract_sections_by_heading(text: str) -> dict[str,str]:
    """
    Split `text` at lines that are headings, collecting each heading + its content.
    A heading is a line in ALL‑CAPS (at least 3 letters) or ending in a colon.
    """
    lines = text.splitlines()
    sections = OrderedDict()
    current = None
    buf = []

    heading_pattern = re.compile(r"^([A-Z][A-Z &]{2,}|.+:)$")
    for line in lines:
        if heading_pattern.match(line.strip()):
            # Save previous
            if current:
                sections[current] = "\n".join(buf).strip()
            # Start new
            current = line.strip().rstrip(":")
            buf = []
        else:
            if current:
                buf.append(line)
    # Final flush
    if current:
        sections[current] = "\n".join(buf).strip()

    return dict(sections)
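
To make the heading rule concrete, the sketch below applies the same regular expression to a few sample lines. The helper name `is_heading` and the sample lines are illustrative assumptions, not part of the notebook.

```python
import re

# Same pattern as in extract_sections_by_heading above.
heading_pattern = re.compile(r"^([A-Z][A-Z &]{2,}|.+:)$")

def is_heading(line: str) -> bool:
    """Return True if the stripped line would start a new section."""
    return bool(heading_pattern.match(line.strip()))

# Illustrative sample lines (assumed, not taken from a real resume).
samples = [
    "SUMMARY",            # all caps, 3+ characters
    "TECHNICAL SKILLS",   # all caps with a space
    "Skills:",            # ends in a colon
    "Python, PyTorch",    # ordinary content line
    "AI",                 # all caps but only 2 characters
    "Languages: Python",  # colon in the middle, not at the end
]
for s in samples:
    print(f"{s!r:22} -> {is_heading(s)}")
```

Only the first three samples count as headings. Note that any content line that happens to end in a colon (for example "I worked on:") would also be treated as a heading under this rule.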

In [4]:
def fetch_document(source: str, dest_folder: Path) -> Path:
    """
    1. Google‑Drive share → direct PDF download.
    2. URL ending in .pdf → download.
    3. Otherwise → download HTML.
    Saves under dest_folder; returns local Path.
    """
    dest_folder.mkdir(parents=True, exist_ok=True)

    if "drive.google.com/file/d/" in source:
        fid     = re.search(r"/d/([^/]+)", source).group(1)
        dl_url  = f"https://drive.google.com/uc?export=download&id={fid}"
        filename= f"{fid}.pdf"
        content = requests.get(dl_url).content

    elif source.lower().endswith(".pdf"):
        filename= Path(urlparse(source).path).name
        content = requests.get(source).content

    else:
        parsed = urlparse(source)
        stem   = Path(parsed.path).stem or "page"
        filename= f"{stem}.html"
        content = requests.get(source).content

    local_path = dest_folder / filename
    local_path.write_bytes(content)
    return local_path

def pdf_to_text(pdf_path: Path) -> str:
    reader = PdfReader(str(pdf_path))
    return "\n".join(p.extract_text() or "" for p in reader.pages)

def html_to_text(html_path: Path) -> str:
    raw = html_path.read_text(encoding="utf8")
    return BeautifulSoup(raw, "html.parser").get_text(separator="\n")

def load_and_parse(source: str, workdir: Path) -> str:
    p = fetch_document(source, workdir)
    if p.suffix.lower() == ".pdf":
        return pdf_to_text(p)
    elif p.suffix.lower() in (".html", ".htm"):
        return html_to_text(p)
    else:
        return p.read_text()

# # — Ask user for sources —
# resume_src = input("Resume URL or Kaggle-input path: ").strip()
# jd_src     = input("Job    URL or Kaggle-input path: ").strip()


resume_src = "https://drive.google.com/file/d/1u373zWQWxZWD9yAcnFtw2nmcy62b2ULr/view?usp=sharing"
jd_src     = "https://amazon.jobs/en-gb/jobs/2772291/2025-software-dev-engineer-intern-uk"



workdir    = Path("/kaggle/working/docs")
resume_text= load_and_parse(resume_src, workdir)
jd_text    = load_and_parse(jd_src,    workdir)

# — (Optional) Section extraction —
def extract_sections(text: str) -> dict[str,str]:
    # Try named‑section regex first
    named = {}
    for hdr in ["Summary","Experience","Skills","Education"]:
        pat = rf"{hdr}[:\n](.*?)(?=\n[A-Z][a-z]+:|\Z)"
        m = re.search(pat, text, re.DOTALL)
        named[hdr] = m.group(1).strip() if m else ""

    # If none found, do heading-driven split
    if all(not v for v in named.values()):
        return extract_sections_by_heading(text)

    return {k:v for k,v in named.items() if v}

# Then:
resume_secs = extract_sections(resume_text)
if not resume_secs:
    resume_chunks = [resume_text]
else:
    resume_chunks = list(resume_secs.values())
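
The Google Drive branch of `fetch_document` rewrites a share link into a direct-download URL. A minimal sketch of that rewrite in isolation is shown below; the helper name `drive_direct_url` and the file id `abc123` are hypothetical, and unlike the cell above it returns `None` instead of raising when no file id is present.

```python
import re
from typing import Optional

def drive_direct_url(share_url: str) -> Optional[str]:
    """Turn a Drive share link into a direct-download URL, or None if no file id is found."""
    m = re.search(r"/file/d/([^/]+)", share_url)
    if m is None:
        return None
    return f"https://drive.google.com/uc?export=download&id={m.group(1)}"

# "abc123" is a made-up file id used only for illustration.
print(drive_direct_url("https://drive.google.com/file/d/abc123/view?usp=sharing"))
print(drive_direct_url("https://example.com/resume.pdf"))
```

The first call prints the `uc?export=download` URL for `abc123`; the second prints `None` because the link is not a Drive file link.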


In [5]:
print(resume_secs)

{'SIDDHARTH SHARMA': '+44 7407841707 ⋄U.K.\ns1ddh9rth@gmail.com ⋄LinkedIn ⋄GitHub ⋄Leetcode', 'SUMMARY': 'Aspiring AI/ML Engineer with a Master’s degree from Aston University, specializing in Machine Learning. Proficient\nin Python, with experience in designing and training AI models using PyTorch, Scikit-learn and TensorFlow .\nPassionate about applying advanced algorithms to solve real-world problems and leveraging AI to drive innovation.', 'EDUCATION': 'M.Sc. Artificial Intelligence , Aston University, U.K. Sept 2023 - Oct 2024\nRelevant Coursework: Machine Learning, Deep Learning, Data Mining, Multi Agent Systems, Robotics and Mathe-\nmatics for AI.\nDissertation: Enhancing Multi-Limb Coordination for Humanoid Robots using Reinforcement Learning.\nB.Sc. Computer Application , Aligarh Muslim University, IN Aug 2017 - Oct 2020', 'TECHNICAL SKILLS': '•Proficient: Machine Learning Algorithms, Reinforcement Learning, Data Structures and Algorithms, Docker,\nLarge-Language Models , CI/CD

In [6]:
jd_secs = extract_sections(jd_text)
if not jd_secs:
    jd_chunks = [jd_text]
else:
    jd_chunks = list(jd_secs.values())

In [7]:
print(jd_secs)

{'FAQ': 'Interview tips\nReview application status\nProvisions for disabled candidates\nLegal disclosures and notices\n Amazon is an Equal Opportunity Employer – Minority / Women / Disability / Veteran / Gender Identity / Sexual Orientation / Age.\nPrivacy and Data\nCompany Information\n© 1996-2025, Amazon.com, Inc. or its affiliates', 'DESCRIPTION': 'Do you want to solve business challenges through innovative technology? Do you enjoy working on cutting-edge, scalable services technology in a team environment? Do you like working on industry-defining projects that move the needle? \nAt Amazon, we hire the best minds in technology to innovate and build on behalf of our customers. The intense focus we have on our customers is why we are one of the world’s most beloved brands – customer obsession is part of our company DNA. Our interns write real software and collaborate with a selected group of experienced software development engineers who help interns on projects that matter to our cus
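
The parsed job description above still contains site boilerplate such as the `FAQ` section. One way to filter it out, sketched here under the assumption that boilerplate can be recognised by its heading, is to drop sections whose heading appears in a deny-list. `BOILERPLATE_HEADINGS` and `drop_boilerplate` are hypothetical names, and the list would need tuning per job site.

```python
# Hypothetical deny-list of headings treated as boilerplate; adjust per job site.
BOILERPLATE_HEADINGS = {"FAQ", "PRIVACY AND DATA", "COMPANY INFORMATION"}

def drop_boilerplate(sections: dict[str, str]) -> dict[str, str]:
    """Return only the sections whose heading is not in the deny-list (case-insensitive)."""
    return {
        heading: body
        for heading, body in sections.items()
        if heading.strip().upper() not in BOILERPLATE_HEADINGS
    }

# Toy input shaped like the output of extract_sections.
demo = {
    "FAQ": "Interview tips\nReview application status",
    "DESCRIPTION": "Our interns write real software.",
}
print(list(drop_boilerplate(demo)))
```

In the notebook this could be applied as `jd_secs = drop_boilerplate(jd_secs)` before building `jd_chunks`, so that boilerplate is never embedded or retrieved.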

## 🟨 4. Context Caching Logic

To save time and API quota, we cache the embeddings of the resume and JD chunks on disk with joblib's `Memory`.

Calling `embed_texts` again with the same list of texts returns the stored vectors instead of calling the API a second time.


In [8]:
cache = Memory(location="/kaggle/working/.cache", verbose=0)

@cache.cache
def embed_texts(texts: list[str]) -> np.ndarray:
    for attempt in range(5):
        try:
            resp = client.models.embed_content(
                model    = "models/text-embedding-004",
                contents = texts,
                config   = types.EmbedContentConfig(task_type="semantic_similarity")
            )
            # <-- use .values, not .embedding
            vectors = [e.values for e in resp.embeddings]
            return np.stack(vectors).astype(np.float32)

        except errors.ClientError as e:
            # e.code is the numeric HTTP status; e.status is a string such as "RESOURCE_EXHAUSTED"
            if e.code == 429 and attempt < 4:
                wait = 2 ** attempt
                print(f"Rate‑limited; retrying in {wait}s…")
                time.sleep(wait)
            else:
                raise
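
The retry loop in `embed_texts` is an exponential backoff: it waits 1, 2, 4, then 8 seconds between attempts. The sketch below isolates that pattern without any API calls. `RateLimited`, `call_with_backoff`, and `flaky` are hypothetical names, and the `sleep` parameter is injected only so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an API client."""

def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with waits of 1, 2, 4, ... seconds."""
    for attempt in range(attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(2 ** attempt)
    raise ValueError("attempts must be at least 1")

# Demo: a function that is rate-limited twice, then succeeds.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"

waits: list[float] = []
result = call_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Here the call succeeds on the third attempt after recorded waits of 1 and 2 seconds.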


In [9]:
resume_embs   = embed_texts(resume_chunks)
print("Embeddings shape:", resume_embs.shape)


Embeddings shape: (7, 768)


In [10]:
jd_embs = embed_texts(jd_chunks)
print("JD embeddings shape:", jd_embs.shape) 

JD embeddings shape: (6, 768)
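
joblib's `Memory` stores results on disk keyed by the function's arguments. The standard-library sketch below imitates that idea to show why a repeated call with the same texts costs nothing. It is not how joblib is implemented; `cached_call` and `fake_embed` are hypothetical stand-ins.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable

Vectors = list[list[float]]

def cached_call(texts: list[str], compute: Callable[[list[str]], Vectors], cache_dir: Path) -> Vectors:
    """Return compute(texts), reusing a JSON file keyed by a hash of the inputs."""
    key = hashlib.sha256(json.dumps(texts).encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        # Cache hit: skip the expensive call entirely.
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute(texts)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result

# Fake embedder that records how often it is really invoked.
invocations: list[list[str]] = []

def fake_embed(texts: list[str]) -> Vectors:
    invocations.append(texts)
    return [[float(len(t))] for t in texts]

with tempfile.TemporaryDirectory() as tmp:
    first = cached_call(["a", "bcd"], fake_embed, Path(tmp))
    second = cached_call(["a", "bcd"], fake_embed, Path(tmp))

print(first == second, len(invocations))
```

Both calls return the same vectors, but `fake_embed` runs only once; the second result is read from disk.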


## 🟪 5. RAG: Retrieval-Augmented Generation

- We create FAISS indexes for both resume and JD embeddings
- When a user asks a question (e.g., "what can I improve?"), we:
  - Embed the query
  - Find the most relevant sections from resume + JD
  - Pass those chunks into the prompting pipeline


In [11]:
d     = resume_embs.shape[1]
index = faiss.IndexFlatIP(d)
faiss.normalize_L2(resume_embs)
index.add(resume_embs)



In [12]:
d_j = jd_embs.shape[1]

index_jd = faiss.IndexFlatIP(d_j)
faiss.normalize_L2(jd_embs)
index_jd.add(jd_embs)
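
Both indexes use `IndexFlatIP` (inner product) on L2-normalized vectors, which makes the returned scores cosine similarities. The pure-Python sketch below checks that equivalence on two toy vectors; it stands in for the FAISS calls and uses made-up 2-dimensional data.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: list[float]) -> list[float]:
    """Scale v to unit length (what faiss.normalize_L2 does per row, in place)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else v

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors, assumed for illustration only.
a, b = [3.0, 4.0], [4.0, 3.0]
ip = dot(l2_normalize(a), l2_normalize(b))  # inner product after normalization
print(round(ip, 6), round(cosine(a, b), 6))
```

Both printed values are 0.96, so a score reported by the index can be read as the cosine similarity between the query and the chunk.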


In [13]:
def retrieve_resume_and_jd(query: str, k: int = 3):
    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    # Resume hits (FAISS pads results with index -1 when fewer than k vectors exist)
    Dr, Ir = index.search(q_emb, k)
    resume_hits = [
        ("resume", resume_chunks[i], float(Dr[0,j]))
        for j,i in enumerate(Ir[0]) if i != -1
    ]

    # JD hits
    Dj, Ij = index_jd.search(q_emb, k)
    jd_hits = [
        ("jd", jd_chunks[i], float(Dj[0,j]))
        for j,i in enumerate(Ij[0]) if i != -1
    ]

    # Combine & pick top‑k overall
    all_hits = resume_hits + jd_hits
    all_hits.sort(key=lambda x: x[2], reverse=True)
    return all_hits[:k]



In [14]:
hits = retrieve_resume_and_jd("what skills match this role?")
for src, txt, score in hits:
    print(f"[{src.upper()} • {score:.3f}]\n{txt}\n")


[JD • 0.649]
- Previous technical internship(s) if applicable
- Experience with distributed, multi-tiered systems, algorithms, and relational databases
- Experience in optimization mathematics such as linear programming and nonlinear optimisation
- Ability to effectively articulate technical challenges and solutions
- Adept at handling ambiguous or undefined problems as well as ability to think abstractly
Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice (https://www.amazon.jobs/en/privacy_page) to know more about how we collect, use and transfer the personal data of our candidates.
Amazon is committed to a diverse and inclusive workplace. Amazon is an 

## 🟫 6. Few-shot Prompting

We guide the model with curated examples for:

- Suggested edits (before/after changes)
- Cover letter structure and phrasing

These examples help the model understand the desired format and tone.


In [15]:
edit_examples = [
  {
    "before": "Led team of analysts",
    "after":  "Led a team of 5 analysts to deliver 20% efficiency gains"
  },
]

cover_examples = [
  {
    "role":    "Data Analyst",
    "opening": "Dear Hiring Manager at ${COMPANY},\nI’m excited to apply because…",
    "closing": "Thank you for your time and consideration.\nSincerely,\n${YOUR_NAME}"
  },
]


In [16]:
print(edit_examples)
print(cover_examples)


[{'before': 'Led team of analysts', 'after': 'Led a team of 5 analysts to deliver 20% efficiency gains'}]
[{'role': 'Data Analyst', 'opening': 'Dear Hiring Manager at ${COMPANY},\nI’m excited to apply because…', 'closing': 'Thank you for your time and consideration.\nSincerely,\n${YOUR_NAME}'}]
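
The notebook passes these examples to the model as JSON inside the payload. An alternative, sketched below as an assumption rather than the notebook's method, is to render them as plain-text "shots" ahead of the new input. `render_edit_examples`, `build_edit_prompt`, and the sample bullet are hypothetical.

```python
edit_examples = [
    {
        "before": "Led team of analysts",
        "after": "Led a team of 5 analysts to deliver 20% efficiency gains",
    },
]

def render_edit_examples(examples: list[dict[str, str]]) -> str:
    """Format before/after pairs as numbered few-shot examples."""
    lines = []
    for n, ex in enumerate(examples, start=1):
        lines.append(f"Example {n}")
        lines.append(f"Before: {ex['before']}")
        lines.append(f"After: {ex['after']}")
    return "\n".join(lines)

def build_edit_prompt(examples: list[dict[str, str]], new_bullet: str) -> str:
    """Place the examples first, then the bullet to rewrite, ending on an open 'After:'."""
    shots = render_edit_examples(examples)
    return f"Rewrite the bullet in the same style.\n\n{shots}\n\nBefore: {new_bullet}\nAfter:"

# Hypothetical resume bullet used only for illustration.
print(build_edit_prompt(edit_examples, "Built ML models"))
```

Ending the prompt on an open `After:` line nudges the model to complete the pattern set by the examples.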


## 🟩 7. Structured Output (JSON)

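The final result is a structured JSON object. `match_score` is the mean cosine similarity of the retrieved chunks, so it is a value such as 0.58 rather than a percentage:

```json
{
  "match_score": 0.58,
  "suggestions": [
    {"section": "Summary", "action": "Add quantifiable metrics"},
    ...
  ],
  "cover_letter": "Dear Hiring Manager..."
}
```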
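
Because the model is only instructed, not forced, to follow this shape, it can help to check the parsed object before using it. The sketch below is one possible check; `validate_result` is a hypothetical helper and the sample objects are made up.

```python
import json

EXPECTED_KEYS = {"match_score", "suggestions", "cover_letter"}

def validate_result(obj: object) -> list[str]:
    """Return a list of problems; an empty list means the object has the expected shape."""
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = []
    if set(obj) != EXPECTED_KEYS:
        problems.append(f"unexpected or missing keys: {sorted(set(obj) ^ EXPECTED_KEYS)}")
    score = obj.get("match_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("match_score is not a number")
    suggestions = obj.get("suggestions")
    if not isinstance(suggestions, list) or not all(
        isinstance(s, dict) and {"section", "action"} <= set(s) for s in suggestions
    ):
        problems.append("suggestions is not a list of {section, action} objects")
    if not isinstance(obj.get("cover_letter"), str):
        problems.append("cover_letter is not a string")
    return problems

good = json.loads(
    '{"match_score": 0.58,'
    ' "suggestions": [{"section": "Summary", "action": "Add quantifiable metrics"}],'
    ' "cover_letter": "Dear Hiring Manager..."}'
)
bad = {"match_score": "high", "suggestions": []}

print(validate_result(good))
print(validate_result(bad))
```

The first call returns an empty list; the second reports the missing `cover_letter` key, the non-numeric score, and the non-string cover letter.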


In [18]:
# --- 1) Retrieve context and compute match_score as the mean similarity ---
user_q = "How well does my resume match the job description?"
top_hits     = retrieve_resume_and_jd(user_q, k=5)
scores       = [score for _,_,score in top_hits]
match_score  = sum(scores) / len(scores)
rag_contexts = [txt for _,txt,_ in top_hits]

# --- 2) Build the payload, including our pre‑computed match_score ---
payload = {
  "match_score":      match_score,
  "resume":           resume_secs,
  "job_description":  jd_secs,
  "user_question":    user_q,
  "rag_contexts":     rag_contexts,
  "examples": {
    "edits": edit_examples,
    "cover": cover_examples
  }
}

# --- 3) Strict system prompt (no markdown, must echo match_score) ---
system_prompt = """
You will receive a JSON payload.
Your job is to OUTPUT EXACTLY one valid JSON object with keys:
1) match_score  – use the provided number.
2) suggestions  – a LIST of {section,action} objects.
3) cover_letter – a SINGLE escaped string.

The cover_letter must:
- Be at least 200 words long.
- Be formatted as a professional letter: opening paragraph, 2 body paragraphs, closing paragraph.
- Use '\\n' for all line breaks.
- NOT include any markdown or code fences.

Do NOT output any extra keys or nested objects—only raw JSON.
""".strip()


contents = [
    system_prompt,
    json.dumps(payload, separators=(",",":"))
]

answer = client.models.generate_content(
    model    = "gemini-2.0-flash",
    contents = contents,
    config   = types.GenerateContentConfig(temperature=0.2)
)

raw = answer.text.strip()
m = re.search(r"(\{.*\})", raw, flags=re.DOTALL)
if not m:
    raise ValueError(f"No JSON found in model output:\n{raw}")
json_str = m.group(1)

result = json.loads(json_str)
print(f"🎯 Match score: {result['match_score']:.2f}\n\n📌 Suggestions:")
for s in result["suggestions"]:
    print(f" - [{s['section']}] {s['action']}")
print("\n✉️ Cover Letter:\n")
print(result["cover_letter"].replace("\\n","\n"))


🎯 Match score: 0.58

📌 Suggestions:
 - [TECHNICAL SKILLS] Prioritize skills mentioned in the job description, such as experience with distributed systems and relational databases.
 - [WORK EXPERIENCE] Quantify achievements and align them with the responsibilities outlined in the job description.
 - [PROJECTS] Highlight projects that demonstrate experience with relevant technologies and problem-solving skills.

✉️ Cover Letter:

Dear Amazon Hiring Team,

I am writing to express my keen interest in the Software Development Intern position at Amazon, as advertised on your careers page. With a Master's degree in Artificial Intelligence and a strong foundation in computer science principles, I am confident that my skills and experience align well with the requirements outlined in the job description. My passion for leveraging technology to solve complex business challenges, coupled with my hands-on experience in developing and deploying machine learning models, makes me a strong candidate f
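
The cell above extracts JSON with a greedy regular expression, which spans from the first `{` to the last `}` and can fail if the model adds text containing braces after the object. A more tolerant alternative, sketched here with the standard library's `json.JSONDecoder.raw_decode`, parses the first complete object and ignores what follows. `extract_first_json_object` and the sample model output are hypothetical.

```python
import json

def extract_first_json_object(raw: str) -> dict:
    """Return the first parseable JSON object in raw, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = raw.find("{", start + 1)
    raise ValueError("No JSON object found in model output")

# Made-up model output: fenced JSON followed by a stray remark containing braces.
raw = (
    '```json\n'
    '{"match_score": 0.58, "suggestions": [], "cover_letter": "Dear team,\\nThanks."}\n'
    '```\n'
    'Note: {not json}'
)
result = extract_first_json_object(raw)
print(result["match_score"])
print(result["cover_letter"])
```

The greedy pattern would capture through `{not json}` and make `json.loads` fail on this input, while `raw_decode` stops at the end of the first object. The escaped `\n` in the cover letter is decoded into a real line break by the JSON parser.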