# Identifying Resume Data Structure

## Jake's Resume
Jake's Resume is a very popular resume template for software engineering (SWE) roles, widely used by developers as a base when writing their own resumes.

**Topics**:
- Education: Describes the formal education of the candidate, e.g. Bachelor's Degree, Master's Degree, and others.
    - School/University
    - Location
    - Course Description/Title/Type
    - Start Date - Conclusion Date
- Experience: Describes the formal work experience of the candidate
    - Title of Job
    - Company Name
    - Start Date - End Date
    - Location
    - Relevant Topics
        - Each topic is written as a bullet point, usually following the XYZ or STAR model
- Projects: Describes the personal or college projects that the candidate developed
    - Title of the Project
    - Tech Stack
    - Start Date - End Date
    - Relevant Topics
        - Each topic is written as a bullet point, usually following the XYZ or STAR model
- Technical Skills
    - Languages: Programming Languages used
    - Frameworks
    - Developer Tools
    - Libraries

## FAANGPath Resume
FAANGPath Resume is a very popular resume template for FAANG entry-level and intern roles.

**Topics**:
- Objective: Gives an introduction to the candidate and states the career or job-search objective
- Education: Describes the formal education of the candidate, e.g. Bachelor's Degree, Master's Degree, and others.
    - Course Description/Title/Type
    - School/University
    - Location
    - Start Date - Conclusion Date
    - Relevant Coursework
- Skills
    - Technical Skills
    - Soft Skills
    - Others
- Experience: Describes the formal work experience of the candidate
    - Title of Job
    - Company Name
    - Start Date - End Date
    - Location
    - Relevant Topics
        - Each topic is written as a bullet point, usually following the XYZ or STAR model
- Projects: Describes the personal or college projects that the candidate developed
    - Title of the Project
    - Description of the Project
- Extra-curricular Activities
- Leadership

## Engineering Resume
- Skills
    - [Field]: Skills and others
- Experience: Describes the formal work experiences of the candidate
    - Title of Job
    - Company Name
    - Start Date - End Date
    - Location
    - Relevant Topics
        - Each topic is written as a bullet point, usually following the XYZ or STAR model
- Projects: Describes the personal or college projects that the candidate developed
    - Title of the Project
    - Project Link
    - Relevant Topics
        - Each topic is written as a bullet point, usually following the XYZ or STAR model
- Education: Describes the formal education of the candidate, e.g. Bachelor's Degree, Master's Degree, and others.
    - Course Description/Title/Type
    - School/University
    - Start Date - Conclusion Date
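
The three templates overlap heavily, which is what makes a single resume schema feasible. A minimal sketch of that comparison follows; the section names are transcribed from the lists above, and folding "Technical Skills" into "Skills" is an assumption made only so the templates can be compared.

```python
# Section names per template, transcribed from the lists above.
# Assumption: "Technical Skills" is treated as "Skills" for comparison.
TEMPLATE_SECTIONS = {
    "Jake's Resume": ["Education", "Experience", "Projects", "Skills"],
    "FAANGPath Resume": [
        "Objective", "Education", "Skills", "Experience", "Projects",
        "Extra-curricular Activities", "Leadership",
    ],
    "Engineering Resume": ["Skills", "Experience", "Projects", "Education"],
}


def common_sections(templates: dict[str, list[str]]) -> set[str]:
    """Return the sections that appear in every template."""
    section_sets = [set(sections) for sections in templates.values()]
    return set.intersection(*section_sets)


def optional_sections(templates: dict[str, list[str]]) -> set[str]:
    """Return sections present in at least one template but not in all."""
    everything = set().union(*templates.values())
    return everything - common_sections(templates)


print(sorted(common_sections(TEMPLATE_SECTIONS)))
print(sorted(optional_sections(TEMPLATE_SECTIONS)))
```

The shared sections are natural candidates for the core fields of a schema, while the remaining ones can be modeled as optional.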

In [None]:
# To install the project libraries and set up the environment, run the command below
# %uv pip install -r ./requirements.txt -q

In [3]:
from typing import List, Optional
from pydantic import BaseModel, Field, HttpUrl, EmailStr, ConfigDict

class BaseSchema(BaseModel):
    model_config = ConfigDict(
        populate_by_name=True,
        extra='ignore', 
        str_strip_whitespace=True
    )

class Location(BaseSchema):
    city: Optional[str] = None
    state: Optional[str] = None
    country_code: Optional[str] = Field(None, alias="countryCode") # We must use ISO 3166-1 alpha-2

class Profile(BaseSchema):
    network: str = Field(..., description="LinkedIn, GitHub, Portfolio")
    username: Optional[str] = None
    url: Optional[HttpUrl] = None

class Basics(BaseSchema):
    name: str
    label: Optional[str] = Field(None, description="Target role, e.g.: Software Engineer")
    email: Optional[EmailStr] = None
    phone: Optional[str] = None
    summary: Optional[str] = Field(None, description="Professional summary or Objective")
    location: Optional[Location] = None
    profiles: List[Profile] = Field(default_factory=list)

class Education(BaseSchema):
    institution: str
    area: str
    study_type: str = Field(..., alias="studyType", description="Bachelor, Master, etc.")
    start_date: Optional[str] = Field(None, alias="startDate")
    end_date: Optional[str] = Field(None, alias="endDate")
    score: Optional[str] = Field(None, description="GPA or average, e.g.: 9.5")
    courses: List[str] = Field(default_factory=list, description="Relevant Coursework")

class Work(BaseSchema):
    name: str = Field(..., description="Company name")
    position: str
    url: Optional[HttpUrl] = None
    start_date: Optional[str] = Field(None, alias="startDate")
    end_date: Optional[str] = Field(None, alias="endDate") # String to accept 'Current'
    summary: Optional[str] = None
    highlights: List[str] = Field(default_factory=list, description="Bullets with STAR Method")

class Project(BaseSchema):
    name: str
    description: Optional[str] = None
    highlights: List[str] = Field(default_factory=list)
    keywords: List[str] = Field(default_factory=list, description="Tech stack used in the project")
    url: Optional[HttpUrl] = None
    start_date: Optional[str] = Field(None, alias="startDate")
    end_date: Optional[str] = Field(None, alias="endDate")

class Skill(BaseSchema):
    name: str = Field(..., description="Category: Languages, Frameworks, Soft Skills")
    level: Optional[str] = None # Advanced, Intermediate
    keywords: List[str] = Field(default_factory=list, description="List: Python, AWS, Docker")

class Award(BaseSchema):
    title: str
    date: Optional[str] = None
    awarder: Optional[str] = None
    summary: Optional[str] = None


class ResumeSchema(BaseSchema):
    basics: Basics
    work: List[Work] = Field(default_factory=list)
    education: List[Education] = Field(default_factory=list)
    awards: List[Award] = Field(default_factory=list)
    projects: List[Project] = Field(default_factory=list)
    skills: List[Skill] = Field(default_factory=list)

    
    @property
    def all_hard_skills(self) -> set[str]:
        skill_set = set()
        
        # 1. Get from explicit Skills section
        for skill_group in self.skills:
            for kw in skill_group.keywords:
                skill_set.add(kw.lower())
        
        # 2. Get from Projects keywords
        for proj in self.projects:
            for kw in proj.keywords:
                skill_set.add(kw.lower())
                
        return skill_set

    @property
    def full_experience_text(self) -> str:
        texts = []
        for job in self.work:
            texts.extend(job.highlights)
        for proj in self.projects:
            texts.extend(proj.highlights)
        return " ".join(texts)

In [6]:
import json

# Load resume data
RESUME_PATH = "data/resumes/myResume.json"

with open(RESUME_PATH, "r", encoding="utf-8") as f:
    resume_json = json.load(f)

print(resume_json)

resume_data = ResumeSchema(**resume_json)

{'basics': {'name': 'Ricardo Fernandes', 'label': 'Software Engineer', 'email': 'ricardoviniciusaf@gmail.com', 'location': {'city': 'Maceió', 'state': 'AL', 'countryCode': 'BR'}, 'profiles': [{'network': 'LinkedIn', 'username': 'ricardovini', 'url': 'https://linkedin.com/in/ricardovini'}, {'network': 'GitHub', 'username': 'ricardovinicius', 'url': 'https://github.com/ricardovinicius'}]}, 'work': [{'name': 'Centro de Inovação EDGE', 'position': 'Software Engineer', 'startDate': '2023-07', 'endDate': 'Current', 'highlights': ['Developed a client environment simulation application (FastAPI and Vue.js) used for end-to-end testing of an RPA application', 'Worked on the development of a REST API using FastAPI for monitoring and managing custom RPA bots, ensuring real-time visibility and greater operational control over automations', 'Participated in integrating RPA workflows with RAG pipelines and LLMs, contributing to the automation of multilingual document processing and analysis', 'Ref

# Identifying Job Data Structure

- Job Title
    - Role
    - Specification
    - Level/Seniority
- Work Model (e.g. On-site, Remote, Hybrid, Flexible)
- Location
- Key Responsibilities
- Key Requirements / What we expect of you
    - Tech Stack
    - Experience Time
    - Education/Degree
    - Soft Skills can be here too
- Nice to Have / Preferred Qualifications / Bonus Points
- Soft Skills
- Benefits / What we Offer
- Tech Stack / Keywords - Extracted from Key Responsibilities, Requirements, and Nice to Have
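
To illustrate how the "Level/Seniority" part of a job title could be pulled out, here is a rule-based baseline sketch. It is not the approach this notebook relies on; the keyword map (`SENIORITY_KEYWORDS`) and the function name `detect_seniority` are hypothetical, and the keywords would need to be extended for real postings.

```python
import re

# Hypothetical keyword map (English and Portuguese terms); extend as needed.
SENIORITY_KEYWORDS = {
    "INTERN": ["intern", "internship", "estagiário", "estagiario"],
    "JR": ["junior", "júnior", "jr"],
    "MID": ["mid", "mid-level", "pleno"],
    "SENIOR": ["senior", "sênior", "sr"],
    "STAFF": ["staff"],
    "LEAD": ["lead", "líder", "lider"],
}


def detect_seniority(title: str) -> list[str]:
    """Return the level names whose keywords appear as whole words in the title."""
    lowered = title.casefold()
    found = []
    for level, keywords in SENIORITY_KEYWORDS.items():
        alternatives = "|".join(re.escape(k.casefold()) for k in keywords)
        # Lookarounds keep "intern" from matching inside "international".
        if re.search(rf"(?<!\w)(?:{alternatives})(?!\w)", lowered):
            found.append(level)
    return found


print(detect_seniority("Engenheiro de Software Junior ou Pleno (Java) | Investment Products"))
```

A title can name several levels at once, which is why the function returns a list rather than a single value.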







In [7]:
from typing import List, Optional, Literal
from pydantic import BaseModel, Field
from enum import Enum

class SeniorityLevel(Enum):
    INTERN = 0
    JR = 1
    MID = 2
    SENIOR = 3
    STAFF = 4
    LEAD = 5


class JobWorkplace(BaseModel):
    model: Literal['remote', 'hybrid', 'onsite']
    country: str = "BR" # We must use ISO 3166-1 alpha-2
    state: Optional[str] = None
    city: Optional[str] = None

class SalaryRange(BaseModel):
    currency: str = "BRL"
    min_amount: Optional[float] = None
    max_amount: Optional[float] = None
    frequency: Literal['monthly', 'yearly', 'hourly'] = 'monthly'


class JobPostingSchema(BaseModel):
    # 1. Job Basic Information
    title: str
    company_name: str
    original_url: Optional[str] = None
    
    # 2. Seniority (List, as it can be 'Jr', 'Mid', 'Senior', 'Staff', 'Lead', 'Intern')
    seniority_level: List[SeniorityLevel] = Field(..., description="List of accepted seniority levels")
    
    # 3. Logistics (List, as it can be 'Hybrid' OR 'Remote')
    work_options: List[JobWorkplace] = Field(..., description="Accepted work model options")

    # 4. Hard Requirements (For Mathematical Match)
    # This is where "Tech Stack / Keywords - Extract" fits in
    required_hard_skills: List[str] = Field(default_factory=list, description="Normalized mandatory skills")
    nice_to_have_skills: List[str] = Field(default_factory=list, description="Nice-to-have skills (Bonus Points)")
    
    min_experience_years: int = Field(0, description="Minimum years of experience required")
    degree_required: bool = Field(False, description="Whether a formal higher education degree is required")
    languages: List[str] = Field(default_factory=list, description="e.g., ['Advanced English']")

    # 5. Semantic Context (For LLM processing)
    # Includes "Key Responsibilities", "Soft Skills", and "Description"
    key_responsibilities: List[str] = Field(..., description="Summary focused on responsibilities and challenges")
    soft_skills_context: List[str] = Field(default_factory=list, description="List of soft skills for profile analysis")
    
    # 6. Benefits (For display only, does not affect score)
    benefits_text: Optional[str] = None
    salary: Optional[SalaryRange] = None
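
Because `SalaryRange` carries a `frequency`, amounts may need to be converted to a common period before they are compared with anything else. The sketch below shows one way to do that; `to_monthly` is a hypothetical helper, and the conversion factors (in particular 160 working hours per month) are assumptions, not values taken from this notebook.

```python
from typing import Optional

# Assumed conversion factors; adjust to local conventions.
MONTHS_PER_YEAR = 12
HOURS_PER_MONTH = 160  # assumption: about 40 hours/week over 4 weeks


def to_monthly(amount: Optional[float], frequency: str) -> Optional[float]:
    """Convert a salary amount to a monthly figure, or None if it is unknown."""
    if amount is None:
        return None
    if frequency == "monthly":
        return amount
    if frequency == "yearly":
        return amount / MONTHS_PER_YEAR
    if frequency == "hourly":
        return amount * HOURS_PER_MONTH
    raise ValueError(f"Unsupported frequency: {frequency!r}")


print(to_monthly(120000.0, "yearly"))
```

Currency conversion is left out on purpose; it would need an exchange-rate source that this sketch does not assume.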

Using an LLM-based strategy to parse job postings

In [59]:
import instructor
from dotenv import load_dotenv
from pydantic import BaseModel
from openai import OpenAI
import os

JOB_TEXT_PATH = "data/jobs/texts/btg.txt"

load_dotenv()

client = instructor.from_openai(
    OpenAI(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        api_key=os.getenv("GOOGLE_API_KEY")
    ),
    mode=instructor.Mode.JSON_O1, 
)

with open(JOB_TEXT_PATH, "r", encoding="utf-8") as f:
    job_text = f.read()

def extract_job_data(job_text):
    prompt = """Extract detailed data from this Job Description:\n\n{job_text}"""

    response_data, response_raw = client.chat.completions.create_with_completion(
        model="gemma-3-12b-it",
        response_model=JobPostingSchema,
        messages=[
            {"role": "user", "content": prompt.format(job_text=job_text)},
        ],
        temperature=0.1,
    )

    print(response_data)
    print(f"DEBUG: {response_raw}")

    return response_data

job_posting_data = extract_job_data(job_text)

title='Engenheiro de Software Junior ou Pleno (Java) | Investment Products' company_name='BTG Pactual' original_url=None seniority_level=[<SeniorityLevel.JR: 1>, <SeniorityLevel.MID: 2>] work_options=[JobWorkplace(model='hybrid', country='BR', state='SP', city='São Paulo')] required_hard_skills=['Java', 'Spring Boot', 'Web Services', 'REST', 'SOAP', 'Mensageria', 'Kubernetes', 'AWS', 'Docker', 'Oracle', 'PostgreSQL', 'DynamoDB', 'CI/CD', 'Git', 'Jenkins', 'Github Actions'] nice_to_have_skills=['Node'] min_experience_years=0 degree_required=True languages=[] key_responsibilities=['Referência técnica e atuação junto à equipe de projetos que utiliza em seu trabalho algumas tecnologias como Java, Spring Boot, WebServices REST/SOAP, Mensageria, Kubernetes e AWS.', 'Trabalhar com arquitetura orientada a microsserviços, modelo DevOps e experiência com troubleshooting e análise de logs.', 'Desenvolver e expandir do pipeline de produtos do Portal de Clientes do BTG e soluções para a

In [27]:
import json

SKILLS_JSON_PATH = "data/skills.json"

with open(SKILLS_JSON_PATH, "r") as f:
    skills = json.load(f)
    
def create_bidirectional_graph(data):
    """
    Flattens hierarchical JSON into a lowercase bidirectional map.
    Key (Canonical) -> [Aliases]
    Key (Alias)     -> [Canonical]
    """
    bidirectional_map = {}

    # Iterate through categories (databases, languages, etc.)
    for category, tech_items in data.items():
        for canonical_name, aliases in tech_items.items():
            
            # Normalize to lowercase for consistent lookup
            canonical_key = canonical_name.lower()
            alias_keys = [alias.lower() for alias in aliases]

            # 1. Forward Link: Canonical -> [Aliases]
            # e.g. "javascript" -> ["js", "node", ...]
            if canonical_key not in bidirectional_map:
                bidirectional_map[canonical_key] = []
            
            # Extend list, ensuring no duplicates
            for alias in alias_keys:
                if alias not in bidirectional_map[canonical_key]:
                    bidirectional_map[canonical_key].append(alias)

            # 2. Reverse Link: Alias -> [Canonical]
            # e.g. "js" -> ["javascript"]
            for alias in alias_keys:
                if alias not in bidirectional_map:
                    bidirectional_map[alias] = []
                
                if canonical_key not in bidirectional_map[alias]:
                    bidirectional_map[alias].append(canonical_key)
    
    return bidirectional_map

skills_graph = create_bidirectional_graph(skills)

print(skills_graph)

{'postgresql': ['postgres', 'pgsql', 'postgre', 'postgresql db'], 'postgres': ['postgresql'], 'pgsql': ['postgresql'], 'postgre': ['postgresql'], 'postgresql db': ['postgresql'], 'mysql': ['my-sql', 'mysql db'], 'my-sql': ['mysql'], 'mysql db': ['mysql'], 'mongodb': ['mongo', 'mongo db', 'mongodb atlas'], 'mongo': ['mongodb'], 'mongo db': ['mongodb'], 'mongodb atlas': ['mongodb'], 'microsoft sql server': ['mssql', 'sql server', 'ms sql'], 'mssql': ['microsoft sql server'], 'sql server': ['microsoft sql server'], 'ms sql': ['microsoft sql server'], 'sqlite': ['sqlite3', 'sqlite db'], 'sqlite3': ['sqlite'], 'sqlite db': ['sqlite'], 'redis': ['redis db', 'redis cache'], 'redis db': ['redis'], 'redis cache': ['redis'], 'elasticsearch': ['elastic', 'elk stack', 'es'], 'elastic': ['elasticsearch'], 'elk stack': ['elasticsearch'], 'es': ['elasticsearch'], 'cassandra': ['apache cassandra'], 'apache cassandra': ['cassandra'], 'dynamodb': ['aws dynamodb', 'dynamo'], 'aws dynamodb': ['dynamodb'],
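
One way the bidirectional map could be used is to decide whether two skill strings refer to the same technology. The sketch below is an assumption about usage, not code from this notebook: `skills_equivalent` is a hypothetical helper, and `sample_graph` is a small excerpt shaped like the map printed above.

```python
# Small excerpt shaped like the printed map (canonical <-> aliases).
sample_graph = {
    "postgresql": ["postgres", "pgsql", "postgre", "postgresql db"],
    "postgres": ["postgresql"],
    "pgsql": ["postgresql"],
    "mysql": ["my-sql", "mysql db"],
    "my-sql": ["mysql"],
}


def skills_equivalent(a: str, b: str, graph: dict[str, list[str]]) -> bool:
    """Return True if two skill names are identical or linked in the map."""
    a, b = a.lower().strip(), b.lower().strip()
    if a == b:
        return True
    neighbors_a = set(graph.get(a, []))
    neighbors_b = set(graph.get(b, []))
    # Direct link (canonical <-> alias), or both aliases share a canonical name.
    return b in neighbors_a or a in neighbors_b or bool(neighbors_a & neighbors_b)


print(skills_equivalent("Postgres", "PostgreSQL", sample_graph))
print(skills_equivalent("postgres", "pgsql", sample_graph))
print(skills_equivalent("postgres", "mysql", sample_graph))
```

The shared-neighbor rule also treats two canonical names as equivalent if they happen to list the same alias, so ambiguous short aliases deserve a manual check.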

In [71]:
"""
Semantic Skill Matching Module

Uses sentence-transformers to match job skills against resume skills
using semantic embeddings instead of exact string matching.
"""

from typing import Literal
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer, util
import torch


class SkillMatch(BaseModel):
    """Result of matching a single job skill against resume."""
    skill_name: str                      # Job skill being checked
    canonical_name: str                  # Matched resume skill (if found)
    match_type: Literal["exact", "semantic", "text_evidence", "none"]
    is_matched: bool
    confidence_score: float = 0.0        # 0.0-1.0 similarity score
    evidence_source: str = ""            # Where the match was found


class SemanticSkillMatcher:
    """
    Semantic skill matcher using sentence embeddings.
    
    Example usage:
        matcher = SemanticSkillMatcher()
        
        job_skills = ["Java", "REST", "AWS", "CI/CD", "Mensageria"]
        resume_skills = ["Python", "Java", "Spring Boot", "Docker", "RabbitMQ"]
        resume_text = "Worked on REST APIs using FastAPI..."
        
        matches = matcher.match_skills(job_skills, resume_skills, resume_text)
        for m in matches:
            print(m.model_dump())
    """
    
    def __init__(
        self, 
        model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
        strong_threshold: float = 0.75,
        moderate_threshold: float = 0.60
    ):
        """
        Initialize the semantic skill matcher.
        
        Args:
            model_name: SentenceTransformer model to use
            strong_threshold: Similarity >= this = strong match
            moderate_threshold: Similarity >= this = moderate match
        """
        print(f"Loading embedding model: {model_name}...")
        self.model = SentenceTransformer(model_name)
        self.strong_threshold = strong_threshold
        self.moderate_threshold = moderate_threshold
        print("Model loaded successfully!")
    
    def match_skills(
        self,
        job_skills: list[str],
        resume_skills: list[str],
        resume_full_text: str = ""
    ) -> list[SkillMatch]:
        """
        Match job skills against resume using semantic embeddings.
        
        Args:
            job_skills: Skills required in job posting
            resume_skills: Explicit skills from resume
            resume_full_text: Full resume text for evidence search
        
        Returns:
            List of SkillMatch with confidence scores
        """
        matches = []
        
        # Pre-compute embeddings for all resume skills
        resume_skill_embeddings = (
            self.model.encode(resume_skills, convert_to_tensor=True) 
            if resume_skills else None
        )
        
        # Embed full resume text for fallback evidence search
        resume_text_embedding = (
            self.model.encode(resume_full_text, convert_to_tensor=True)
            if resume_full_text else None
        )
        
        for job_skill in job_skills:
            match = SkillMatch(
                skill_name=job_skill,
                canonical_name="",
                match_type="none",
                is_matched=False,
                confidence_score=0.0,
                evidence_source=""
            )
            
            job_skill_lower = job_skill.lower()
            
            # 1. Check exact match first (fastest)
            for resume_skill in resume_skills:
                if resume_skill.lower() == job_skill_lower:
                    match.canonical_name = resume_skill
                    match.match_type = "exact"
                    match.is_matched = True
                    match.confidence_score = 1.0
                    match.evidence_source = "resume_skills"
                    break
            
            # 2. If no exact match, try semantic matching against resume skills
            if not match.is_matched and resume_skill_embeddings is not None:
                job_skill_embedding = self.model.encode(job_skill, convert_to_tensor=True)
                similarities = util.cos_sim(job_skill_embedding, resume_skill_embeddings)[0]
                
                best_idx = int(torch.argmax(similarities))
                best_score = float(similarities[best_idx])
                
                if best_score >= self.moderate_threshold:
                    match.canonical_name = resume_skills[best_idx]
                    match.match_type = "semantic"
                    match.is_matched = True
                    match.confidence_score = round(best_score, 3)
                    match.evidence_source = "resume_skills"
            
            # 3. Fallback: Check if skill is mentioned in full resume text
            if not match.is_matched and resume_text_embedding is not None:
                job_skill_embedding = self.model.encode(job_skill, convert_to_tensor=True)
                text_similarity = float(
                    util.cos_sim(job_skill_embedding, resume_text_embedding)[0][0]
                )
                
                # Lower threshold for text evidence (less precise)
                # Also check if the skill term appears directly in the text
                skill_in_text = job_skill.lower() in resume_full_text.lower()
                if text_similarity >= self.moderate_threshold * 0.75 or skill_in_text:
                    match.canonical_name = "(found in resume text)" if skill_in_text else "(inferred from experience)"
                    match.match_type = "text_evidence"
                    match.is_matched = True
                    match.confidence_score = round(text_similarity, 3)
                    match.evidence_source = "resume_text"
            
            matches.append(match)
        
        return matches
    
    def compare_skills(
        self, 
        skill1: str, 
        skill2: str
    ) -> float:
        """
        Compare two skills and return their semantic similarity.
        
        Useful for debugging/tuning thresholds.
        """
        emb1 = self.model.encode(skill1, convert_to_tensor=True)
        emb2 = self.model.encode(skill2, convert_to_tensor=True)
        return float(util.cos_sim(emb1, emb2)[0][0])


# Convenience function for quick usage
def match_skills_semantic(
    job_skills: list[str],
    resume_skills: list[str],
    resume_full_text: str,
    matcher: SemanticSkillMatcher
) -> list[SkillMatch]:
    """
    Convenience wrapper for SemanticSkillMatcher.match_skills().
    
    Args:
        job_skills: Skills required in job posting
        resume_skills: Explicit skills from resume
        resume_full_text: Full resume text for evidence search
        matcher: SemanticSkillMatcher instance
    
    Returns:
        List of SkillMatch with confidence scores
    """
    return matcher.match_skills(job_skills, resume_skills, resume_full_text)


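
The matcher above stores both `strong_threshold` and `moderate_threshold`, but `match_skills` only compares scores against `moderate_threshold`. As a rough illustration of what the two values could distinguish, here is a dependency-free sketch of cosine similarity plus a tiering function; `classify_similarity` is a hypothetical helper, and the toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_similarity(
    score: float,
    strong_threshold: float = 0.75,
    moderate_threshold: float = 0.60,
) -> str:
    """Map a similarity score to a tier using the matcher's default thresholds."""
    if score >= strong_threshold:
        return "strong"
    if score >= moderate_threshold:
        return "moderate"
    return "none"


score = cosine_similarity([1.0, 1.0], [1.0, 0.0])  # about 0.707
print(round(score, 3), classify_similarity(score))
```

Exposing the tier alongside the raw score would make it easier to treat moderate matches with more caution than strong ones.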


In [74]:
# Match resume skills with job description skills

resume_skills = [] 

for skill in resume_data.skills:
    resume_skills.extend(skill.keywords)

job_skills = job_posting_data.required_hard_skills + job_posting_data.nice_to_have_skills

matcher = SemanticSkillMatcher()
skills_matches = match_skills_semantic(job_skills, resume_skills, resume_data.full_experience_text, matcher)

for match in skills_matches:
    print(f"{match.skill_name:15} -> {match.canonical_name:25} "
          f"[{match.match_type:13}] score={match.confidence_score:.2f}")

Loading embedding model: paraphrase-multilingual-MiniLM-L12-v2...


Loading weights: 100%|██████████| 199/199 [00:00<00:00, 377.16it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Model loaded successfully!
Java            -> Java                      [exact        ] score=1.00
Spring Boot     -> Spring Boot               [exact        ] score=1.00
Web Services    ->                           [none         ] score=0.00
REST            -> pika                      [semantic     ] score=0.71
SOAP            ->                           [none         ] score=0.00
Mensageria      -> pika                      [semantic     ] score=0.71
Kubernetes      -> Kubernetes                [exact        ] score=1.00
AWS             -> pika                      [semantic     ] score=0.64
Docker          -> Docker                    [exact        ] score=1.00
Oracle          -> SQL                       [semantic     ] score=0.63
PostgreSQL      -> SQL                       [semantic     ] score=0.63
DynamoDB        ->                           [none         ] score=0.00
CI/CD           -> (found in resume text)    [text_evidence] score=0.07
Git             -> Git               
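
In the output above, `CI/CD` is reported as `text_evidence` with a score of 0.07: the plain substring check marks the skill as matched regardless of the embedding similarity. A substring check can also fire inside longer words (for example, "git" inside "digital"). The sketch below shows a stricter whole-token check; `skill_mentioned` is a hypothetical helper, not part of the matcher.

```python
import re


def skill_mentioned(skill: str, text: str) -> bool:
    """True if `skill` occurs in `text` as a standalone token (case-insensitive)."""
    # Lookarounds instead of \b so skills ending in symbols (e.g. "C++") still work.
    pattern = rf"(?<!\w){re.escape(skill)}(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(skill_mentioned("Git", "Built digital tools"))
print(skill_mentioned("CI/CD", "Set up CI/CD pipelines"))
```

Swapping this in for the `in` test would reduce false positives for short skill names such as "Git", "Go", or "REST".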

In [30]:
from pydantic import BaseModel
from typing import List, Optional, Literal

class CandidatePreferences(BaseModel):
    minimum_salary: int # TODO: handle multi-currency
    preferred_locations: List[Location] # TODO: improve this location model
    work_arrangement: List[Literal['remote', 'hybrid', 'onsite']]

candidate_preferences = CandidatePreferences(
    minimum_salary=5000,
    preferred_locations=[Location(
        city="Maceio",
        state="Alagoas",
        country_code="BR"
    )],
    work_arrangement=['remote', 'hybrid', 'onsite']
)
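
The preference above spells the city as `Maceio`, while resume data may carry the accented form. A case-insensitive comparison alone would treat these as different cities. The following sketch strips accents before comparing; `normalize_place` is a hypothetical helper built only on the standard library.

```python
import unicodedata


def normalize_place(name: str) -> str:
    """Lowercase a place name and strip accents so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold().strip()


print(normalize_place("Maceió") == normalize_place("Maceio"))
```

Applying the same normalization to both sides of any city comparison keeps the matching logic independent of how each source spells the name.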

In [31]:
from enum import Enum
from typing import List, Literal, Optional
from datetime import datetime
from pydantic import BaseModel
import dateutil.parser # pip install python-dateutil

# --- Enums ---
class PrerequisiteType(str, Enum):
    DEGREE = "degree"
    YEARS_OF_EXPERIENCE = "years_of_experience"
    SKILLS = "skills"
    LOCATION = "location"
    LANGUAGE = "language"
    SALARY = "salary"

class MatchStatus(str, Enum):
    PASSED = "passed"
    FAILED = "failed"
    WARNING = "warning" # For partial matches or missing info
    NEUTRAL = "neutral" # When info is missing but not mandatory

# --- Output Schema ---
class PrerequisiteMatch(BaseModel):
    category: PrerequisiteType
    is_mandatory: bool
    status: MatchStatus
    rationale: str
    score: float = 0.0 # 0.0 to 1.0 (Useful for ranking)

def calculate_years_of_experience(work_history: List[Work]) -> float:
    total_days = 0
    
    for job in work_history:
        if not job.start_date:
            continue
            
        try:
            start = dateutil.parser.parse(job.start_date)
            
            if job.end_date and job.end_date.lower() != "current":
                end = dateutil.parser.parse(job.end_date)
            else:
                end = datetime.now()
                
            delta = end - start
            total_days += delta.days
        except Exception:
            # If date parsing fails, skip this entry or assume 0
            continue
            
    return round(total_days / 365.25, 1)

def match_prerequisites(
    job: JobPostingSchema, 
    preferences: CandidatePreferences, 
    resume: ResumeSchema
) -> List[PrerequisiteMatch]:
    
    matches = []

    # ---------------------------------------------------------
    # 1. YEARS OF EXPERIENCE
    # ---------------------------------------------------------
    candidate_years = calculate_years_of_experience(resume.work)
    exp_status = MatchStatus.PASSED
    exp_rationale = f"Candidate has {candidate_years} years. Job requires {job.min_experience_years}."
    
    if candidate_years < job.min_experience_years:
        # Allow a small buffer (e.g., 0.5 years) to be 'WARNING' instead of 'FAILED'
        if candidate_years >= (job.min_experience_years - 0.5):
            exp_status = MatchStatus.WARNING
        else:
            exp_status = MatchStatus.FAILED
    
    matches.append(PrerequisiteMatch(
        category=PrerequisiteType.YEARS_OF_EXPERIENCE,
        is_mandatory=True,
        status=exp_status,
        rationale=exp_rationale,
        score=min(candidate_years / job.min_experience_years, 1.0) if job.min_experience_years > 0 else 1.0
    ))

    # ---------------------------------------------------------
    # 2. DEGREE / EDUCATION
    # ---------------------------------------------------------
    # Logic: If job requires degree, check if 'education' list is not empty.
    # Advanced: specific parsing for 'Bachelor', 'Master' in study_type.
    degree_status = MatchStatus.PASSED
    degree_rationale = "Education requirements met."
    
    if job.degree_required:
        if not resume.education:
            degree_status = MatchStatus.FAILED
            degree_rationale = "Job requires a degree, but no education listed."
        else:
            # Check for 'Bachelor' or higher keywords if needed
            has_higher_ed = any(
                term in edu.study_type.lower() 
                for edu in resume.education 
                for term in ['bachelor', 'master', 'phd', 'mba', 'gradu', 'bacharel', 'licenciatura']
            )
            if not has_higher_ed:
                degree_status = MatchStatus.WARNING
                degree_rationale = "Education listed, but no recognized degree type detected."

    matches.append(PrerequisiteMatch(
        category=PrerequisiteType.DEGREE,
        is_mandatory=job.degree_required,
        status=degree_status,
        rationale=degree_rationale,
        score=1.0 if degree_status == MatchStatus.PASSED else 0.0
    ))

    # ---------------------------------------------------------
    # 3. LOCATION & WORK MODEL
    # ---------------------------------------------------------
    # Logic: Intersection of Job Options vs Candidate Preferences + Current Location
    loc_status = MatchStatus.FAILED
    loc_rationale = "No matching work arrangement or location found."
    
    candidate_city = resume.basics.location.city.lower() if resume.basics.location and resume.basics.location.city else ""
    pref_cities = [loc.city.lower() for loc in preferences.preferred_locations if loc.city]
    
    # Check each option provided by the job
    for option in job.work_options:
        # 1. Remote Check
        if option.model == 'remote' and 'remote' in preferences.work_arrangement:
            loc_status = MatchStatus.PASSED
            loc_rationale = "Matches REMOTE preference."
            break
            
        # 2. Hybrid/Onsite Check
        if option.model in ['hybrid', 'onsite'] and option.model in preferences.work_arrangement:
            job_city = option.city.lower() if option.city else ""
            
            # Match if candidate is already there OR wants to be there
            if job_city == candidate_city or job_city in pref_cities:
                loc_status = MatchStatus.PASSED
                loc_rationale = f"Matches {option.model.upper()} in {option.city}."
                break
    
    matches.append(PrerequisiteMatch(
        category=PrerequisiteType.LOCATION,
        is_mandatory=True, # Usually mandatory unless relocation is offered
        status=loc_status,
        rationale=loc_rationale,
        score=1.0 if loc_status == MatchStatus.PASSED else 0.0
    ))

    # ---------------------------------------------------------
    # 4. SALARY
    # ---------------------------------------------------------
    sal_status = MatchStatus.NEUTRAL
    sal_rationale = "Salary not disclosed in job posting."
    
    if job.salary and job.salary.max_amount:
        # Check if Job Max < Candidate Min
        if job.salary.max_amount < preferences.minimum_salary:
            sal_status = MatchStatus.FAILED
            sal_rationale = f"Job max ({job.salary.max_amount}) is below candidate minimum ({preferences.minimum_salary})."
        else:
            sal_status = MatchStatus.PASSED
            sal_rationale = "Salary expectations met."
            
    matches.append(PrerequisiteMatch(
        category=PrerequisiteType.SALARY,
        is_mandatory=False, # Often negotiable
        status=sal_status,
        rationale=sal_rationale,
        score=1.0 if sal_status == MatchStatus.PASSED else 0.0
    ))

    return matches

prerequisites_matches = match_prerequisites(
    job = job_posting_data, 
    preferences = candidate_preferences, 
    resume = resume_data
)
print(prerequisites_matches)

[PrerequisiteMatch(category=<PrerequisiteType.YEARS_OF_EXPERIENCE: 'years_of_experience'>, is_mandatory=True, status=<MatchStatus.FAILED: 'failed'>, rationale='Candidate has 2.6 years. Job requires 5.', score=0.52), PrerequisiteMatch(category=<PrerequisiteType.DEGREE: 'degree'>, is_mandatory=False, status=<MatchStatus.PASSED: 'passed'>, rationale='Education requirements met.', score=1.0), PrerequisiteMatch(category=<PrerequisiteType.LOCATION: 'location'>, is_mandatory=True, status=<MatchStatus.FAILED: 'failed'>, rationale='No matching work arrangement or location found.', score=0.0), PrerequisiteMatch(category=<PrerequisiteType.SALARY: 'salary'>, is_mandatory=False, status=<MatchStatus.NEUTRAL: 'neutral'>, rationale='Salary not disclosed in job posting.', score=0.0)]
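
`calculate_years_of_experience` adds up the duration of every job, so two jobs held at the same time are counted twice. If that is not the intent, the periods can be merged first. The sketch below is one possible variant under that assumption; `merged_experience_years` is a hypothetical helper that takes already-parsed `(start, end)` dates.

```python
from datetime import date


def merged_experience_years(periods: list[tuple[date, date]]) -> float:
    """Sum the time covered by the periods, counting overlapping days once."""
    total_days = 0
    current_start = current_end = None
    for start, end in sorted(periods):
        if current_end is None or start > current_end:
            # Disjoint period: close the previous block and open a new one.
            if current_end is not None:
                total_days += (current_end - current_start).days
            current_start, current_end = start, end
        else:
            # Overlap: extend the current block if this period ends later.
            current_end = max(current_end, end)
    if current_end is not None:
        total_days += (current_end - current_start).days
    return round(total_days / 365.25, 1)


overlapping = [
    (date(2020, 1, 1), date(2022, 1, 1)),
    (date(2021, 1, 1), date(2023, 1, 1)),
]
print(merged_experience_years(overlapping))
```

For the overlapping example, summing the two durations separately would give about 4 years, while the merged span covers 3.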


In [16]:
from typing import List, Literal
from pydantic import BaseModel, Field

class EvaluationPoint(BaseModel):
    category: Literal["Technical Depth", "Leadership", "Culture", "Project Relevance"]
    description: str = Field(..., description="Explain the point clearly")
    evidence: str = Field(..., description="Quote specific part of resume or job to support this")

class SemanticAnalysis(BaseModel):
    # The verdict summarizes the overall impression
    verdict: Literal["Strong Match", "Potential Match", "Weak Match", "Not a Fit"]
    
    # Strengths: where the candidate stands out relative to the role
    strengths: List[EvaluationPoint] = Field(..., description="Aspects where candidate exceeds or perfectly meets requirements")
    
    # Weaknesses: technical or experience gaps
    weaknesses: List[EvaluationPoint] = Field(..., description="Missing skills or insufficient experience depth")
    
    # Points of attention: things to ask in the interview
    interview_questions: List[str] = Field(..., description="Specific questions to probe the red flags/ambiguities")
    
    # Soft skills analysis (inferred from the text)
    soft_skills_analysis: str = Field(..., description="Brief analysis of communication, leadership, and drive based on project descriptions")

In [64]:
def format_resume_for_llm(resume: ResumeSchema) -> str:
    text = f"CANDIDATE: ({resume.basics.label})\n\n"

    if resume.basics.summary:
        text += f"SUMMARY:\n{resume.basics.summary}\n"
    
    text += "WORK HISTORY:\n"
    for work in resume.work:
        text += f"- {work.position} at {work.name} ({work.start_date} to {work.end_date})\n"
        text += f"  Summary: {work.summary}\n"
        text += f"  Highlights: {'; '.join(work.highlights)}\n\n"
        
    text += "PROJECTS:\n"
    for proj in resume.projects:
        text += f"- {proj.name}: {proj.description}\n"
        text += f"  Stack: {', '.join(proj.keywords)}\n"
        
    return text

def analyze_semantic_match(
    job: JobPostingSchema, 
    resume: ResumeSchema, 
    skill_matches: list[SkillMatch],
    prerequisites: list[PrerequisiteMatch],
    client
) -> SemanticAnalysis:
    
    # 1. Prepare resume text
    resume_text = format_resume_for_llm(resume)
    
    # 2. Prepare job context
    job_context = f"""
    JOB TITLE: {job.title}
    SENIORITY: {', '.join([s.name for s in job.seniority_level])}

    KEY RESPONSIBILITIES:
    {chr(10).join(['- ' + r for r in job.key_responsibilities])}

    SOFT SKILLS DESIRED:
    {', '.join(job.soft_skills_context)}
    """

    # 3. Format pre-computed matching data
    matched_skills = [m for m in skill_matches if m.is_matched]
    missing_skills = [m.skill_name for m in skill_matches if not m.is_matched]
    alias_matches = [f"{m.skill_name} → {m.canonical_name}" for m in skill_matches if m.match_type == "alias"]
    
    match_percentage = (len(matched_skills) / len(skill_matches) * 100) if skill_matches else 0
    
    prereq_summary = "\n".join([
        f"- {p.category.value}: {p.status.value} ({p.rationale})" 
        for p in prerequisites
    ])

    # 4. Build the enhanced prompt
    prompt = f"""You are an expert Technical Recruiter with 15+ years of experience screening candidates.
    ---
    ## JOB CONTEXT
    {job_context}

    ---
    ## CANDIDATE RESUME
    {resume_text}

    ---
    ## PRE-COMPUTED MATCHING DATA (Trust these as facts)

    ### Skills Match: {len(matched_skills)}/{len(skill_matches)} ({match_percentage:.0f}%)
    - Missing Required Skills: {', '.join(missing_skills) if missing_skills else 'None'}
    - Skills Matched via Alias: {', '.join(alias_matches) if alias_matches else 'None'}

    ### Prerequisites Check:
    {prereq_summary}

    ---
    ## EVALUATION INSTRUCTIONS

    1. **ROLE ALIGNMENT**: Compare career trajectory. Developer vs Manager = mismatch.
    2. **EVIDENCE ONLY**: Accept concrete actions ("Led", "Built", "Reduced X by Y%"). Reject vague claims.
    3. **USE THE DATA**: The skill matches above are FACTS. Focus your analysis on:
    - Are the missing skills critical blockers or quickly learnable?
    - Does experience DEPTH match the role, not just keywords?
    - What specific concerns should be probed in an interview?

    ### SCORING CRITERIA
    - **Strong Match**: 80%+ skills, matching seniority, proven impact
    - **Potential Match**: 60%+ skills, 1 level gap, transferable experience  
    - **Weak Match**: Major gaps OR career mismatch OR insufficient depth
    - **Not a Fit**: Fundamental misalignment

    Be STRICT. When in doubt, rate DOWN.

    Provide your analysis in the SemanticAnalysis JSON format."""

    # 5. Call LLM
    try:
        response_data, _ = client.chat.completions.create_with_completion(
            model="gemma-3-27b-it", 
            response_model=SemanticAnalysis,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        return response_data
    except Exception as e:
        print(f"Error in semantic analysis: {e}")
        return None
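
To make step 3 of `analyze_semantic_match` concrete, here is a small sketch of how the skills header in the prompt is derived from the match list. `SkillMatchStub` and `skills_summary` are hypothetical names; the stub carries only the attributes the prompt builder reads (`skill_name`, `is_matched`, `match_type`, `canonical_name`), and the `"exact"` match type and the demo skills are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillMatchStub:
    # Hypothetical stand-in for SkillMatch with only the attributes used below
    skill_name: str
    is_matched: bool
    match_type: Optional[str] = None
    canonical_name: Optional[str] = None


def skills_summary(skill_matches: list[SkillMatchStub]) -> str:
    """Render the skills section of the prompt from pre-computed matches."""
    matched = [m for m in skill_matches if m.is_matched]
    missing = [m.skill_name for m in skill_matches if not m.is_matched]
    aliases = [f"{m.skill_name} → {m.canonical_name}" for m in skill_matches if m.match_type == "alias"]
    # Guard against an empty list to avoid ZeroDivisionError
    pct = (len(matched) / len(skill_matches) * 100) if skill_matches else 0
    return "\n".join([
        f"### Skills Match: {len(matched)}/{len(skill_matches)} ({pct:.0f}%)",
        f"- Missing Required Skills: {', '.join(missing) if missing else 'None'}",
        f"- Skills Matched via Alias: {', '.join(aliases) if aliases else 'None'}",
    ])


demo = [
    SkillMatchStub("Java", True, "exact", "Java"),
    SkillMatchStub("K8s", True, "alias", "Kubernetes"),
    SkillMatchStub("SOAP", False),
    SkillMatchStub("DynamoDB", False),
]
print(skills_summary(demo))
```

With two of four skills matched, the header reports 50%, which is below both the 60% and 80% thresholds listed in the scoring criteria.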

In [70]:
semantic_match = analyze_semantic_match(job_posting_data, resume_data, skills_matches, prerequisites_matches, client)

print(semantic_match.model_dump_json(indent=4))

{
    "verdict": "Weak Match",
    "strengths": [
        {
            "category": "Technical Depth",
            "description": "The candidate demonstrates experience with microservices architecture, containerization (Docker), and CI/CD pipelines (GitLab, Kubernetes). This aligns with key responsibilities outlined in the job description.",
            "evidence": "Participated in deploying applications to the client's internal environment, ensuring security and quality standards through containerization (Docker), CI/CD pipeline creation (GitLab), and Kubernetes orchestration"
        },
        {
            "category": "Technical Depth",
            "description": "Experience with messaging queues (RabbitMQ) is a positive, indicating familiarity with asynchronous processing, which is relevant to the 'Mensageria' requirement.",
            "evidence": "Implemented a messaging pipeline with RabbitMQ for asynchronous processing of ML/LLM jobs between microservices, ensuring delivery an

In [84]:
class Engine:
    def __init__(self, ai_client, skills_graph):
        self.client = ai_client
        self.skills_graph = skills_graph
        self.skill_matcher = SemanticSkillMatcher()

    def run(self, job_text: str, resume_json: dict):
        print("⚙️ 1. Starting Pipeline...")
        
        # 1. Parsing
        print("📄 Extracting Job Data...")
        job_data = extract_job_data(job_text)
        
        print("👤 Processing Resume...")
        resume_data: ResumeSchema = ResumeSchema.model_validate(resume_json)
        
        # 2. Logical Filters
        print("🛡️ Checking Prerequisites...")
        prefs = CandidatePreferences(
            minimum_salary=0, 
            preferred_locations=[], 
            work_arrangement=["remote", "hybrid", "onsite"]
        )
        prereqs = match_prerequisites(job_data, prefs, resume_data)

        # 3. Skills Graph
        print("🔗 Checking Skills in Graph...")
        job_skills = job_data.required_hard_skills
        resume_skills_list = list(resume_data.all_hard_skills)
        resume_full_text = resume_data.full_experience_text
        # Pass the local list; the undefined name `resume_skills` would raise NameError
        skill_matches = self.skill_matcher.match_skills(job_skills, resume_skills_list, resume_full_text)
        
        # 4. Semantic AI (now with pre-computed data!)
        print("🧠 Performing Semantic Analysis...")
        semantic_res = analyze_semantic_match(
            job_data, 
            resume_data, 
            skill_matches,      # Pass skill matches
            prereqs,            # Pass prerequisites
            self.client
        )
        
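        # 5. Final Score
        print("📊 Calculating Score...")

        # analyze_semantic_match returns None when the LLM call fails
        if semantic_res is None:
            raise RuntimeError("Semantic analysis failed; cannot compute a final score.")

        verdict_map = {"Strong Match": 95, "Potential Match": 75, "Weak Match": 40, "Not a Fit": 10}
        sem_score = verdict_map.get(semantic_res.verdict, 50)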
        
        tech_score = 100
        if job_skills:
            matches = sum(1 for m in skill_matches if m.is_matched)
            tech_score = (matches / len(job_skills)) * 100
            
        prereq_score = 100 if all(p.status == MatchStatus.PASSED for p in prereqs) else 0
        
        final_score = (prereq_score * 0.2) + (tech_score * 0.3) + (sem_score * 0.5)
        
        return {
            "Total Score": round(final_score, 1),
            "Verdict": semantic_res.verdict,
            "Prereq Score": round(prereq_score, 1),
            "Tech Score": round(tech_score, 1),
            "Semantic Score": sem_score,
            "Strengths": [s.description for s in semantic_res.strengths],
            "Weaknesses": [w.description for w in semantic_res.weaknesses],
            "Interview Questions": semantic_res.interview_questions,
            "Feedback": semantic_res.soft_skills_analysis
        }
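
The weighting in step 5 can be checked by hand. The sketch below reproduces only the arithmetic of `Engine.run` (20% prerequisites, 30% skills, 50% semantic verdict); `VERDICT_SCORES` and `final_score` are hypothetical names, and the input values are illustrative rather than taken from a real run.

```python
VERDICT_SCORES = {"Strong Match": 95, "Potential Match": 75, "Weak Match": 40, "Not a Fit": 10}


def final_score(prereq_score: float, tech_score: float, verdict: str) -> float:
    """Weighted blend: 20% prerequisites, 30% skills, 50% semantic verdict."""
    sem_score = VERDICT_SCORES.get(verdict, 50)  # unknown verdicts fall back to 50
    return round(prereq_score * 0.2 + tech_score * 0.3 + sem_score * 0.5, 1)


# Illustrative inputs: prerequisites not all passed (0), half the skills matched, "Weak Match"
print(final_score(prereq_score=0, tech_score=50, verdict="Weak Match"))  # 0 + 15 + 20 = 35.0

# Best case: all prerequisites passed, every skill matched, "Strong Match"
print(final_score(100, 100, "Strong Match"))  # 20 + 30 + 47.5 = 97.5
```

Two properties follow from the formula as written: the total tops out at 97.5 because "Strong Match" maps to 95, and the prerequisite term is all-or-nothing, so a single `NEUTRAL` result (such as an undisclosed salary) drops it to 0 just as a failure would.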

In [85]:
engine = Engine(client, skills_graph)
result = engine.run(job_text, resume_json)

import pandas as pd
from IPython.display import display, Markdown

print("\n" + "="*50)
print(f"🏆 FINAL SCORE: {result['Total Score']}%")
print(f"⚖️ VERDICT: {result['Verdict']}")
print(f"💻 PREREQ SCORE: {result['Prereq Score']}%")
print(f"💻 TECH SCORE: {result['Tech Score']}%")
print(f"🧠 SEMANTIC SCORE: {result['Semantic Score']}")
print("="*50 + "\n")

display(Markdown("### ✅ Strong Points:\n" + "\n".join([f"- {s}" for s in result['Strengths']])))
display(Markdown("### 🛑 Weak Points:\n" + "\n".join([f"- {w}" for w in result['Weaknesses']])))
display(Markdown("### ❓ Interview Questions:\n" + "\n".join([f"- {q}" for q in result['Interview Questions']])))
display(Markdown(f"### 💬 Soft Skills Feedback:\n{result['Feedback']}"))

Loading embedding model: paraphrase-multilingual-MiniLM-L12-v2...


Loading weights: 100%|██████████| 199/199 [00:00<00:00, 382.69it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Model loaded successfully!
⚙️ 1. Starting Pipeline...
📄 Extracting Job Data...
title='Engenheiro de Software Junior ou Pleno (Java) | Investment Products' company_name='BTG Pactual' original_url=None seniority_level=[<SeniorityLevel.JR: 1>, <SeniorityLevel.MID: 2>] work_options=[JobWorkplace(model='hybrid', country='BR', state='SP', city='São Paulo')] required_hard_skills=['Java', 'Spring Boot', 'Web Services', 'REST', 'SOAP', 'Mensageria', 'Kubernetes', 'AWS', 'Docker', 'Oracle', 'PostgreSQL', 'DynamoDB', 'CI/CD', 'Git', 'Jenkins', 'Github Actions'] nice_to_have_skills=['Node'] min_experience_years=0 degree_required=True languages=[] key_responsibilities=['Referência técnica e atuação junto à equipe de projetos que utiliza em seu trabalho algumas tecnologias como Java, Spring Boot, WebServices REST/SOAP, Mensageria, Kubernetes e AWS.', 'Trabalhar com arquitetura orientada a microsserviços, modelo DevOps e experiência com troubleshooting e análise de logs.', 'Desenvolve

### ✅ Strong Points:
- The candidate demonstrates solid experience with Java and Spring Boot, as evidenced by their work on the 'Odontolog' project and their contributions at Centro de Inovação EDGE.
- Experience with microservices architecture, messaging queues (RabbitMQ), containerization (Docker), and CI/CD pipelines (GitLab) aligns well with the job description's requirements.
- The candidate's focus on testability and code quality (90% test coverage, applying design patterns) is a positive indicator of their engineering practices.

### 🛑 Weak Points:
- The candidate lacks explicit experience with Web Services (SOAP specifically) as indicated by the skills match. While they have REST API experience (FastAPI), SOAP is a specific requirement.
- The candidate has no listed experience with DynamoDB, which is a required skill. This is a potential blocker depending on the importance of DynamoDB in the role.
- The candidate lacks experience with Jenkins. While they use GitLab for CI/CD, familiarity with Jenkins could be beneficial.
- The candidate's location is not a match, indicating a potential issue with work arrangement or location preferences.

### ❓ Interview Questions:
- Can you describe your experience with SOAP web services? If limited, how quickly do you think you could get up to speed?
- What is your understanding of DynamoDB and its use cases? Have you worked with any NoSQL databases?
- Describe your experience with CI/CD pipelines. How does GitLab compare to Jenkins in your experience?
- Can you elaborate on your experience with troubleshooting and analyzing logs in a microservices environment?
- Tell me about a time you had to learn a new technology quickly to solve a problem. What was your approach?
- What are your location preferences and are you open to remote work or relocation?

### 💬 Soft Skills Feedback:
The candidate demonstrates analytical skills through their focus on observability (logging, error tracking) and test coverage. Their work on refactoring and implementing design patterns suggests a desire for code quality and maintainability. The project descriptions don't provide much insight into communication or leadership skills, but their contributions to multiple areas within projects suggest a proactive and engaged team member.