# Resume-Job Matching: Building a Talent Search Engine

## The Challenge

Recruiting teams face a fundamental matching problem:

1. **Skill Terminology Varies**: "React.js" vs "ReactJS" vs "React" are the same skill
2. **Job Titles Are Inconsistent**: "Software Engineer" vs "Developer" vs "Programmer"
3. **Multi-Dimensional Matching**: Skills + experience + location + salary all matter
4. **Volume**: Enterprise applicant tracking systems (ATS) process thousands of resumes per job posting

Manual resume screening is expensive (~$15-20 per resume for recruiter time). Automated matching can pre-filter candidates, letting recruiters focus on qualified applicants.

## What You'll Learn

- Using the **TokenSet** field type for skill list matching
- Using **SchemaIndex** for multi-field candidate scoring
- Building a skills normalization system
- Weighting strategies for required vs. nice-to-have skills
- Creating a complete talent matching pipeline

## Dataset

We'll use a **skills taxonomy** based on:
- O*NET (Occupational Information Network) skill categories
- Common tech industry variations
- Embedded sample data for demonstration

For production use, consider:
- [O*NET Database](https://www.onetcenter.org/database.html) (free, comprehensive)
- [LinkedIn Skills API](https://docs.microsoft.com/en-us/linkedin/) (requires partnership)
- [ESCO Skills Taxonomy](https://esco.ec.europa.eu/) (EU standard)

In [None]:
import fuzzyrust as fr
import re
from collections import defaultdict

print("FuzzyRust loaded for talent matching")

## 1. The Skill Matching Problem

Skills are messy. The same capability can be written many ways:

| Resume Says | Job Requires | Are They the Same? |
|-------------|--------------|--------------------|
| "React.js" | "ReactJS" | Yes |
| "Python 3" | "Python" | Yes (specific version of the same skill) |
| "AWS" | "Amazon Web Services" | Yes (abbreviation) |
| "Machine Learning" | "ML" | Yes |
| "Data Science" | "Machine Learning" | Partial overlap |
| "JavaScript" | "Java" | NO! Common confusion |

Simple string matching fails. We need:
1. **Skill normalization**: Map variants to canonical forms
2. **Set-based matching**: Compare skill sets, not individual strings
3. **Weighted scoring**: Some skills matter more than others

In [None]:
# Demonstrate why simple string matching fails
skill_pairs = [
    ("React.js", "ReactJS"),
    ("Python 3", "Python"),
    ("AWS", "Amazon Web Services"),
    ("Machine Learning", "ML"),
    ("JavaScript", "Java"),  # Should NOT match!
]

print(f"{'Skill 1':<25} {'Skill 2':<25} {'Levenshtein':<12} {'Jaro-Winkler'}")
print("=" * 75)

for s1, s2 in skill_pairs:
    lev = fr.levenshtein_similarity(s1.lower(), s2.lower())
    jw = fr.jaro_winkler_similarity(s1.lower(), s2.lower())
    
    # Flag dangerous false positives
    warning = "  DANGER!" if s1 == "JavaScript" and jw > 0.7 else ""
    print(f"{s1:<25} {s2:<25} {lev:.2%}        {jw:.2%}{warning}")

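**Key Insight**: Jaro-Winkler scores JavaScript and Java at about 88% under the standard formula (a Jaro score of 0.80 plus a prefix bonus, because "java" is the full four-character prefix of "javascript"). That is a dangerous false positive. We need domain-specific skill normalization, not just string similarity.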
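To see where that score comes from, here is a minimal pure-Python sketch of the standard Jaro-Winkler formula (prefix scale 0.1, prefix capped at four characters). It is a reference implementation for illustration, not FuzzyRust's code, and a library's figure may differ slightly if it uses a variant. For "javascript" vs "java" it yields 0.88.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the shared prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


for a, b in [("javascript", "java"), ("react.js", "reactjs")]:
    print(f"{a} vs {b}: {jaro_winkler(a, b):.2f}")
```

The prefix bonus is what makes this metric risky for skill names: any skill that is a prefix of another gets rewarded.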

## 2. Building a Skills Taxonomy

A skills taxonomy maps variations to canonical forms and defines relationships:

```
"react" (canonical)
  ├── "react.js" (alias)
  ├── "reactjs" (alias)
  └── "react js" (alias)

"python" (canonical)
  ├── "python 2" (implies python)
  ├── "python 3" (implies python)
  └── "py" (abbreviation)
```

In [None]:
# Comprehensive tech skills taxonomy
SKILLS_TAXONOMY = {
    # Programming Languages
    'python': ['python', 'python3', 'python 3', 'python2', 'python 2', 'py'],
    'javascript': ['javascript', 'js', 'ecmascript', 'es6', 'es2015', 'es2020'],
    'typescript': ['typescript', 'ts'],
    'java': ['java', 'java8', 'java 8', 'java11', 'java 11', 'java17'],
    'csharp': ['c#', 'csharp', 'c sharp', '.net c#'],
    'cpp': ['c++', 'cpp', 'cplusplus'],
    'c': ['c language', 'c programming'],
    'go': ['go', 'golang', 'go lang'],
    'rust': ['rust', 'rust lang', 'rustlang'],
    'ruby': ['ruby'],  # 'ruby on rails' and 'ror' belong to 'rails' below
    'php': ['php', 'php7', 'php8'],
    'swift': ['swift', 'swift 5', 'swiftui'],
    'kotlin': ['kotlin', 'kt'],
    'scala': ['scala'],
    'r': ['r', 'r programming', 'r language', 'rlang'],
    'sql': ['sql', 'structured query language', 'tsql', 't-sql', 'plsql', 'pl/sql'],
    
    # Frontend Frameworks
    'react': ['react', 'react.js', 'reactjs', 'react js'],
    'angular': ['angular', 'angularjs', 'angular.js', 'angular 2+'],
    'vue': ['vue', 'vue.js', 'vuejs', 'vue 3'],
    'svelte': ['svelte', 'sveltejs'],
    'nextjs': ['next.js', 'nextjs', 'next js'],
    'jquery': ['jquery', 'jquery ui'],
    
    # Backend Frameworks
    'nodejs': ['node.js', 'nodejs', 'node js', 'node'],
    'express': ['express', 'express.js', 'expressjs'],
    'django': ['django', 'django rest', 'drf'],
    'flask': ['flask'],
    'fastapi': ['fastapi', 'fast api'],
    'spring': ['spring', 'spring boot', 'springboot', 'spring framework'],
    'rails': ['rails', 'ruby on rails', 'ror'],
    'laravel': ['laravel'],
    'dotnet': ['.net', 'dotnet', '.net core', 'asp.net', 'aspnet'],
    
    # Cloud Platforms
    'aws': ['aws', 'amazon web services', 'amazon aws'],
    'azure': ['azure', 'microsoft azure', 'ms azure'],
    'gcp': ['gcp', 'google cloud', 'google cloud platform'],
    'heroku': ['heroku'],
    'digitalocean': ['digitalocean', 'digital ocean', 'do'],
    
    # DevOps & Infrastructure
    'docker': ['docker', 'docker compose', 'dockerfile'],
    'kubernetes': ['kubernetes', 'k8s', 'kube'],
    'terraform': ['terraform', 'hashicorp terraform'],  # 'tf' is left to 'tensorflow' below to avoid an alias collision
    'ansible': ['ansible'],
    'jenkins': ['jenkins', 'jenkins ci'],
    'cicd': ['ci/cd', 'cicd', 'ci cd', 'continuous integration', 'continuous deployment'],
    'git': ['git', 'github', 'gitlab', 'bitbucket', 'version control'],
    'linux': ['linux', 'unix', 'ubuntu', 'centos', 'debian', 'rhel'],
    
    # Databases
    'postgresql': ['postgresql', 'postgres', 'psql', 'pg'],
    'mysql': ['mysql', 'mariadb'],
    'mongodb': ['mongodb', 'mongo', 'mongoose'],
    'redis': ['redis'],
    'elasticsearch': ['elasticsearch', 'elastic', 'es', 'elk'],
    'dynamodb': ['dynamodb', 'dynamo db', 'aws dynamodb'],
    'cassandra': ['cassandra', 'apache cassandra'],
    
    # Data Science & ML
    'machine_learning': ['machine learning', 'ml', 'statistical learning'],
    'deep_learning': ['deep learning', 'dl', 'neural networks', 'nn'],
    'tensorflow': ['tensorflow', 'tf', 'keras'],
    'pytorch': ['pytorch', 'torch'],
    'pandas': ['pandas', 'python pandas'],
    'numpy': ['numpy', 'np'],
    'scikit_learn': ['scikit-learn', 'sklearn', 'scikit learn'],
    'nlp': ['nlp', 'natural language processing', 'text analytics'],
    'computer_vision': ['computer vision', 'cv', 'image processing', 'opencv'],
    'data_analysis': ['data analysis', 'data analytics', 'analytics'],
    'data_visualization': ['data visualization', 'tableau', 'power bi', 'looker'],
    
    # Soft Skills (often in job postings)
    'communication': ['communication', 'communication skills', 'written communication', 'verbal communication'],
    'teamwork': ['teamwork', 'team player', 'collaboration', 'collaborative'],
    'leadership': ['leadership', 'team lead', 'tech lead', 'management'],
    'problem_solving': ['problem solving', 'problem-solving', 'analytical thinking', 'critical thinking'],
    'agile': ['agile', 'scrum', 'kanban', 'agile methodology', 'sprint planning'],
}

# Build reverse lookup: variation -> canonical
SKILL_ALIASES = {}
for canonical, variations in SKILLS_TAXONOMY.items():
    for var in variations:
        SKILL_ALIASES[var.lower()] = canonical

print(f"Taxonomy contains {len(SKILLS_TAXONOMY)} canonical skills")
print(f"Total aliases: {len(SKILL_ALIASES)}")
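Because the reverse lookup keeps only one canonical per alias, an alias listed under two canonical skills is silently overwritten by whichever entry comes last. A small check can surface such collisions before they cause mismatches. The sketch below uses a hypothetical `find_alias_collisions` helper and a tiny sample taxonomy, so it runs on its own.

```python
from collections import defaultdict


def find_alias_collisions(taxonomy: dict) -> dict:
    """Return {alias: [canonicals]} for aliases claimed by more than one skill."""
    owners = defaultdict(set)
    for canonical, variations in taxonomy.items():
        for var in variations:
            owners[var.lower()].add(canonical)
    return {alias: sorted(names) for alias, names in owners.items() if len(names) > 1}


# Illustrative sample only; pass the real taxonomy in practice
sample_taxonomy = {
    "terraform": ["terraform", "tf"],
    "tensorflow": ["tensorflow", "tf", "keras"],
    "ruby": ["ruby", "ror"],
    "rails": ["rails", "ror"],
    "redis": ["redis"],
}

collisions = find_alias_collisions(sample_taxonomy)
for alias, names in sorted(collisions.items()):
    print(f"{alias!r} is claimed by: {names}")
```

Running this against the full taxonomy whenever aliases are added keeps ambiguous abbreviations from creeping in.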

In [None]:
def normalize_skill(skill: str) -> str:
    """
    Normalize a skill to its canonical form.
    
    Returns the canonical skill name, or the cleaned input if unknown.
    """
    # Clean the skill
    cleaned = skill.lower().strip()
    
    # Remove version numbers (e.g., "Python 3.9" -> "python")
    cleaned = re.sub(r'\s+\d+\.?\d*', '', cleaned)
    
    # Check direct alias match
    if cleaned in SKILL_ALIASES:
        return SKILL_ALIASES[cleaned]
    
    # Try without punctuation
    no_punct = re.sub(r'[^a-z0-9\s]', '', cleaned)
    if no_punct in SKILL_ALIASES:
        return SKILL_ALIASES[no_punct]
    
    # Return cleaned version if no match
    return cleaned


def normalize_skills(skills: list) -> list:
    """
    Normalize a list of skills, removing duplicates.
    """
    normalized = set()
    for skill in skills:
        norm = normalize_skill(skill)
        if norm:
            normalized.add(norm)
    return sorted(normalized)


# Test normalization
test_skills = [
    "React.js",
    "Python 3.9",
    "Amazon Web Services",
    "K8s",
    "node.js",
    "Machine Learning",
    "scikit-learn",
    "Unknown Skill XYZ",  # Should pass through
]

print("Skill Normalization:")
print("-" * 50)
for skill in test_skills:
    norm = normalize_skill(skill)
    status = "(known)" if norm in SKILLS_TAXONOMY else "(unknown)"
    print(f"{skill:<25} -> {norm:<20} {status}")
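The version-stripping regex is worth testing in isolation, since it has a few edge cases. The sketch below applies the same pattern through a hypothetical `strip_version` helper: versions attached without a space are left alone, a trailing "+" survives, and a three-part version is only partly removed.

```python
import re

# Same pattern as in normalize_skill: whitespace, digits, optional ".digits"
VERSION_SUFFIX = re.compile(r'\s+\d+\.?\d*')


def strip_version(skill: str) -> str:
    """Lowercase, trim, and drop a space-separated version number."""
    return VERSION_SUFFIX.sub('', skill.lower().strip())


examples = ["Python 3.9", "Java 11", "python3", "Angular 2+", "Python 3.9.1", "3D Modeling"]
for ex in examples:
    print(f"{ex!r:<16} -> {strip_version(ex)!r}")
```

Cases such as "python3" are covered by explicit aliases in the taxonomy, and "angular+" is rescued by the punctuation-stripping fallback. Three-part versions such as "Python 3.9.1" may need a broader pattern.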

## 3. Sample Data: Jobs and Candidates

Let's create realistic job postings and candidate profiles.

In [None]:
# Sample job postings
JOB_POSTINGS = [
    {
        'id': 'JOB001',
        'title': 'Senior Full Stack Developer',
        'company': 'TechCorp',
        'location': 'San Francisco, CA',
        'required_skills': ['React', 'Node.js', 'PostgreSQL', 'TypeScript', 'AWS'],
        'nice_to_have': ['GraphQL', 'Docker', 'Kubernetes', 'CI/CD'],
        'experience_years': 5,
        'description': 'Looking for a senior full stack developer to build scalable web applications.'
    },
    {
        'id': 'JOB002',
        'title': 'Machine Learning Engineer',
        'company': 'AI Startup',
        'location': 'Remote',
        'required_skills': ['Python', 'Machine Learning', 'TensorFlow', 'SQL'],
        'nice_to_have': ['PyTorch', 'NLP', 'Computer Vision', 'AWS', 'Docker'],
        'experience_years': 3,
        'description': 'Join our ML team to build cutting-edge AI models for production systems.'
    },
    {
        'id': 'JOB003',
        'title': 'DevOps Engineer',
        'company': 'CloudScale',
        'location': 'New York, NY',
        'required_skills': ['AWS', 'Kubernetes', 'Terraform', 'Linux', 'CI/CD'],
        'nice_to_have': ['Python', 'Go', 'Ansible', 'Prometheus', 'Grafana'],
        'experience_years': 4,
        'description': 'Build and maintain cloud infrastructure for high-traffic applications.'
    },
    {
        'id': 'JOB004',
        'title': 'Frontend Developer',
        'company': 'DesignFirst',
        'location': 'Austin, TX',
        'required_skills': ['JavaScript', 'React', 'CSS', 'HTML'],
        'nice_to_have': ['TypeScript', 'Next.js', 'Testing', 'Accessibility'],
        'experience_years': 2,
        'description': 'Create beautiful, responsive user interfaces for web applications.'
    },
]

# Sample candidate profiles
CANDIDATES = [
    {
        'id': 'CAND001',
        'name': 'Alice Johnson',
        'title': 'Full Stack Software Engineer',
        'location': 'San Francisco, CA',
        'skills': ['React.js', 'Node.js', 'PostgreSQL', 'TypeScript', 'AWS', 'Docker', 'Git'],
        'experience_years': 6,
        'summary': 'Experienced full stack developer with focus on scalable web applications.'
    },
    {
        'id': 'CAND002',
        'name': 'Bob Smith',
        'title': 'Data Scientist',
        'location': 'Remote',
        'skills': ['Python 3', 'Machine Learning', 'TensorFlow', 'PyTorch', 'SQL', 'Pandas', 'Scikit-learn'],
        'experience_years': 4,
        'summary': 'ML engineer specializing in NLP and recommendation systems.'
    },
    {
        'id': 'CAND003',
        'name': 'Carol Davis',
        'title': 'Cloud Infrastructure Engineer',
        'location': 'Seattle, WA',
        'skills': ['Amazon Web Services', 'K8s', 'Terraform', 'Linux', 'CI/CD', 'Python', 'Go'],
        'experience_years': 5,
        'summary': 'DevOps specialist with expertise in cloud-native architectures.'
    },
    {
        'id': 'CAND004',
        'name': 'David Lee',
        'title': 'Junior Web Developer',
        'location': 'Austin, TX',
        'skills': ['JavaScript', 'React', 'HTML', 'CSS', 'Git'],
        'experience_years': 1,
        'summary': 'Recent bootcamp graduate passionate about frontend development.'
    },
    {
        'id': 'CAND005',
        'name': 'Eva Martinez',
        'title': 'Backend Developer',
        'location': 'Los Angeles, CA',
        'skills': ['Java', 'Spring Boot', 'MySQL', 'AWS', 'Docker'],
        'experience_years': 4,
        'summary': 'Backend engineer experienced with enterprise Java applications.'
    },
]

print(f"Loaded {len(JOB_POSTINGS)} job postings")
print(f"Loaded {len(CANDIDATES)} candidates")

## 4. Skill-Based Matching

The core of talent matching is comparing skill sets. We need to:

1. Normalize skills to canonical forms
2. Calculate overlap between candidate skills and job requirements
3. Weight required skills higher than nice-to-have

**Jaccard similarity** is ideal for set comparison:
```
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
```

FuzzyRust's **TokenSet** field type handles this automatically.
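It helps to compare Jaccard with plain coverage (the share of required skills a candidate has), which is what a per-job skill breakdown usually reports. The pure-Python sketch below uses hypothetical `jaccard` and `coverage` helpers. Note that Jaccard penalizes a candidate for having extra skills, while coverage does not.

```python
def jaccard(a, b) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def coverage(candidate, required) -> float:
    """Share of required skills the candidate has."""
    candidate, required = set(candidate), set(required)
    return len(candidate & required) / len(required) if required else 1.0


candidate = ["react", "nodejs", "postgresql", "typescript", "aws", "docker", "git"]
required = ["react", "nodejs", "postgresql", "typescript", "aws"]

print(f"Jaccard:  {jaccard(candidate, required):.3f}")   # 5 shared / 7 in union
print(f"Coverage: {coverage(candidate, required):.3f}")  # 5 of 5 required
```

Which measure fits depends on the question: Jaccard asks "how alike are these two sets?", coverage asks "does the candidate meet the requirements?".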

In [None]:
def calculate_skill_match(candidate_skills: list, required_skills: list, 
                          nice_to_have: list = None) -> dict:
    """
    Calculate how well a candidate's skills match job requirements.
    
    Returns:
        {
            'required_match': % of required skills matched,
            'nice_to_have_match': % of nice-to-have matched,
            'overall_score': weighted score,
            'matched_required': list of matched required skills,
            'missing_required': list of missing required skills,
            'matched_nice_to_have': list of matched nice-to-have
        }
    """
    # Normalize all skills
    cand_norm = set(normalize_skills(candidate_skills))
    req_norm = set(normalize_skills(required_skills))
    nice_norm = set(normalize_skills(nice_to_have or []))
    
    # Calculate matches
    matched_required = cand_norm & req_norm
    missing_required = req_norm - cand_norm
    matched_nice = cand_norm & nice_norm
    
    # Calculate scores
    # With no required skills nothing can be missing, so coverage is 100%
    required_score = len(matched_required) / len(req_norm) if req_norm else 1.0
    nice_score = len(matched_nice) / len(nice_norm) if nice_norm else 0
    
    # Weighted overall score (required 70%, nice-to-have 30%)
    overall = (required_score * 0.7) + (nice_score * 0.3)
    
    return {
        'required_match': required_score,
        'nice_to_have_match': nice_score,
        'overall_score': overall,
        'matched_required': sorted(matched_required),
        'missing_required': sorted(missing_required),
        'matched_nice_to_have': sorted(matched_nice)
    }


# Test skill matching
job = JOB_POSTINGS[0]  # Senior Full Stack Developer
candidate = CANDIDATES[0]  # Alice Johnson

print(f"Job: {job['title']} at {job['company']}")
print(f"Required: {job['required_skills']}")
print(f"Nice-to-have: {job['nice_to_have']}")
print()
print(f"Candidate: {candidate['name']}")
print(f"Skills: {candidate['skills']}")
print()

match = calculate_skill_match(
    candidate['skills'],
    job['required_skills'],
    job['nice_to_have']
)

print("Match Analysis:")
print("-" * 50)
print(f"Required skills match: {match['required_match']:.0%}")
print(f"Nice-to-have match: {match['nice_to_have_match']:.0%}")
print(f"Overall score: {match['overall_score']:.0%}")
print()
print(f"Matched required: {match['matched_required']}")
print(f"Missing required: {match['missing_required']}")
print(f"Matched nice-to-have: {match['matched_nice_to_have']}")
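The 70/30 weighting can be worked through by hand. The sketch below uses a hypothetical `weighted_skill_score` helper with the same formula and compares two profiles. One has every required skill and one of four nice-to-haves, the other has no required skills but every nice-to-have.

```python
def weighted_skill_score(required_match: float, nice_match: float,
                         required_weight: float = 0.7) -> float:
    """Blend required and nice-to-have coverage into one score."""
    return required_weight * required_match + (1 - required_weight) * nice_match


strong_core = weighted_skill_score(5 / 5, 1 / 4)   # 0.7 * 1.0 + 0.3 * 0.25
extras_only = weighted_skill_score(0 / 5, 4 / 4)   # 0.7 * 0.0 + 0.3 * 1.0

print(f"All required, 1 of 4 nice-to-have: {strong_core:.3f}")
print(f"No required, all nice-to-have:     {extras_only:.3f}")
```

The second profile still earns 0.30 without meeting a single requirement, so it may be worth applying a separate minimum on required coverage rather than relying on the blended score alone.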

## 5. Multi-Field Matching with SchemaIndex

Skills aren't everything. A complete talent match should consider:

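- **Job Title**: "Frontend Developer" should ideally match "UI Engineer" (string similarity only catches near-identical titles; true synonyms need a title taxonomy, analogous to the skills taxonomy)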
- **Skills**: The core match (highest weight)
- **Location**: Geographic proximity or remote work
- **Experience**: Years of experience vs. requirements

SchemaIndex lets us combine all these factors with configurable weights.
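To build intuition for what field weights do, the sketch below combines per-field scores with a weighted average. This is an assumption for illustration: the source does not state how SchemaIndex combines fields, and `combine_field_scores` is a hypothetical helper, not part of FuzzyRust.

```python
def combine_field_scores(field_scores: dict, weights: dict) -> float:
    """Weighted average of per-field similarity scores (assumed scheme)."""
    total_weight = sum(weights[name] for name in field_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * score for name, score in field_scores.items())
    return weighted_sum / total_weight


# Weights mirror the ones used for skills, title, and location in this guide
weights = {"skills": 10, "title": 5, "location": 2}
scores = {"skills": 0.8, "title": 0.6, "location": 1.0}  # made-up example scores

combined = combine_field_scores(scores, weights)
print(f"Combined score: {combined:.3f}")  # (8 + 3 + 2) / 17
```

Under this scheme a perfect location match moves the total far less than a small change in skill overlap, which is the intent of giving skills the highest weight.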

In [None]:
class TalentMatcher:
    """
    Multi-field talent matching system using SchemaIndex.
    
    Matches candidates to jobs based on:
    - Skills (TokenSet, highest weight)
    - Title (ShortText, medium weight)
    - Location (ShortText, lower weight)
    """
    
    def __init__(self):
        # Build schema for job matching
        builder = fr.SchemaBuilder()
        
        # Skills as token set (most important)
        builder.add_field(
            "skills",
            "token_set",
            weight=10,
            separator=",",
            required=True
        )
        
        # Title with fuzzy matching
        builder.add_field(
            "title",
            "short_text",
            weight=5,
            algorithm="jaro_winkler",
            normalize="lowercase"
        )
        
        # Location (lower weight, exact match preferred)
        builder.add_field(
            "location",
            "short_text",
            weight=2,
            algorithm="jaro_winkler",
            normalize="lowercase"
        )
        
        schema = builder.build()
        self.index = fr.SchemaIndex(schema)
        self.candidates = {}
    
    def add_candidate(self, candidate: dict):
        """
        Add a candidate to the index.
        """
        # Normalize skills before indexing
        normalized_skills = normalize_skills(candidate['skills'])
        
        record = {
            'skills': ','.join(normalized_skills),
            'title': candidate['title'].lower(),
            'location': candidate['location'].lower()
        }
        
        self.index.add(record, data=candidate['id'])
        self.candidates[candidate['id']] = candidate
    
    def find_candidates_for_job(self, job: dict, min_score: float = 0.3, 
                                 limit: int = 10) -> list:
        """
        Find candidates matching a job posting.
        
        Returns a list of dicts with keys 'candidate', 'overall_score',
        'field_scores', and 'skill_analysis'.
        """
        # Normalize job skills (combine required and nice-to-have)
        all_skills = job.get('required_skills', []) + job.get('nice_to_have', [])
        normalized_skills = normalize_skills(all_skills)
        
        query = {
            'skills': ','.join(normalized_skills),
            'title': job['title'].lower(),
            'location': job.get('location', '').lower()
        }
        
        results = self.index.search(query, min_score=min_score, limit=limit)
        
        # Enrich with candidate data
        matches = []
        for r in results:
            candidate = self.candidates[r.data]
            matches.append({
                'candidate': candidate,
                'overall_score': r.score,
                'field_scores': r.field_scores,
                'skill_analysis': calculate_skill_match(
                    candidate['skills'],
                    job.get('required_skills', []),
                    job.get('nice_to_have', [])
                )
            })
        
        return matches


# Build the matcher and index candidates
talent_matcher = TalentMatcher()

for candidate in CANDIDATES:
    talent_matcher.add_candidate(candidate)

print(f"Indexed {len(talent_matcher.candidates)} candidates")

In [None]:
# Find candidates for each job
print("Talent Matching Results")
print("=" * 70)

for job in JOB_POSTINGS:
    print(f"\nJob: {job['title']} at {job['company']}")
    print(f"Location: {job['location']}")
    print(f"Required: {job['required_skills']}")
    print("-" * 50)
    
    matches = talent_matcher.find_candidates_for_job(job, min_score=0.2)
    
    if matches:
        for i, m in enumerate(matches[:3], 1):  # Top 3
            cand = m['candidate']
            analysis = m['skill_analysis']
            
            print(f"  #{i} {cand['name']} ({cand['title']})")
            print(f"      Overall: {m['overall_score']:.0%}")
            print(f"      Skills: {analysis['required_match']:.0%} required, "
                  f"{analysis['nice_to_have_match']:.0%} nice-to-have")
            if analysis['missing_required']:
                print(f"      Missing: {analysis['missing_required']}")
    else:
        print("  No matching candidates found")

## 6. Advanced: Fuzzy Skill Matching

What if a candidate has skills that don't exactly match our taxonomy?

For example, using inputs from the test cell below:
- "Kubernetes Administration" (contains "kubernetes" plus a role word)
- "Node Development" (contains "node", an alias of Node.js)
- "ML Engineering" (contains the abbreviation "ml")

We can build a fuzzy skill resolver that:
1. First tries exact match against taxonomy
2. Falls back to fuzzy matching if no exact match
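The two-step strategy can be sketched with the standard library alone. The example below substitutes `difflib.get_close_matches` for an n-gram index, so it is a stand-in for the idea rather than the FuzzyRust-based resolver. The `ALIASES` table and `resolve` function are hypothetical, and the cutoff of 0.8 is an arbitrary starting point.

```python
import difflib

# Tiny illustrative alias table: variation -> canonical skill
ALIASES = {
    "react": "react",
    "react.js": "react",
    "reactjs": "react",
    "kubernetes": "kubernetes",
    "k8s": "kubernetes",
    "fastapi": "fastapi",
}


def resolve(skill: str, cutoff: float = 0.8) -> tuple:
    """Return (canonical, match_type): exact first, then fuzzy, else unknown."""
    key = skill.lower().strip()
    if key in ALIASES:
        return ALIASES[key], "exact"
    # Fuzzy fallback: closest known alias at or above the similarity cutoff
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    if close:
        return ALIASES[close[0]], "fuzzy"
    return key, "unknown"


for raw in ["React.js", "Kubernets", "COBOL"]:
    print(f"{raw:<12} -> {resolve(raw)}")
```

A high cutoff keeps the fallback conservative, which matters because a wrong fuzzy match (for example between two similarly named languages) is worse than reporting the skill as unknown.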

In [None]:
class FuzzySkillResolver:
    """
    Resolves skills using both exact taxonomy matching and fuzzy fallback.
    """
    
    def __init__(self, taxonomy: dict):
        self.taxonomy = taxonomy
        
        # Build N-gram index of all skill variations
        self.skill_index = fr.NgramIndex(ngram_size=2)
        self.skill_to_canonical = {}
        
        for canonical, variations in taxonomy.items():
            for var in variations:
                var_lower = var.lower()
                self.skill_index.add(var_lower)
                self.skill_to_canonical[var_lower] = canonical
    
    def resolve(self, skill: str, min_similarity: float = 0.7) -> tuple:
        """
        Resolve a skill to its canonical form.
        
        Returns: (canonical_skill, confidence, match_type)
        """
        skill_lower = skill.lower().strip()
        
        # Try exact match first
        if skill_lower in self.skill_to_canonical:
            return (self.skill_to_canonical[skill_lower], 1.0, 'exact')
        
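        # Remove common role/activity words (whole words only) before matching,
        # so "ML Engineering" becomes "ml" rather than "ml ing"
        cleaned = re.sub(r'\b(developer|development|engineer|engineering|programming|administration)\b', ' ', skill_lower)
        cleaned = ' '.join(cleaned.split())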
        if cleaned in self.skill_to_canonical:
            return (self.skill_to_canonical[cleaned], 0.95, 'cleaned')
        
        # Fuzzy match
        matches = self.skill_index.search(skill_lower, min_similarity=min_similarity, limit=1)
        
        if matches:
            matched_var = matches[0].text
            canonical = self.skill_to_canonical.get(matched_var, matched_var)
            return (canonical, matches[0].score, 'fuzzy')
        
        # No match - return original (unknown skill)
        return (skill_lower, 0.0, 'unknown')
    
    def resolve_all(self, skills: list) -> list:
        """
        Resolve a list of skills.
        
        Returns list of (original, canonical, confidence, match_type).
        """
        results = []
        for skill in skills:
            canonical, conf, match_type = self.resolve(skill)
            results.append((skill, canonical, conf, match_type))
        return results


# Test fuzzy skill resolution
fuzzy_resolver = FuzzySkillResolver(SKILLS_TAXONOMY)

test_skills = [
    "React.js",                  # Exact match
    "Vue 3",                     # Should match Vue
    "Kubernetes Administration", # Should match Kubernetes
    "Fast API",                  # Should match FastAPI
    "Node Development",          # Should match Node.js
    "ML Engineering",            # Should match Machine Learning
    "Obscure Framework XYZ",     # Unknown - should return as-is
]

print("Fuzzy Skill Resolution:")
print("-" * 70)
print(f"{'Input':<30} {'Canonical':<20} {'Confidence':<12} {'Type'}")
print("=" * 70)

for skill in test_skills:
    canonical, conf, match_type = fuzzy_resolver.resolve(skill)
    print(f"{skill:<30} {canonical:<20} {conf:.0%}          {match_type}")

## 7. Complete Talent Matching Pipeline

Let's build a production-ready pipeline that:

1. Normalizes and resolves skills (with fuzzy fallback)
2. Matches on multiple fields with SchemaIndex
3. Produces a detailed scoring breakdown
4. Ranks candidates using multiple factors

In [None]:
class TalentSearchEngine:
    """
    Production-ready talent search engine.
    """
    
    def __init__(self, taxonomy: dict):
        self.skill_resolver = FuzzySkillResolver(taxonomy)
        self.candidates = {}
        
        # Build schema
        builder = fr.SchemaBuilder()
        builder.add_field("skills", "token_set", weight=10, separator=",", required=True)
        builder.add_field("title", "short_text", weight=5, algorithm="jaro_winkler", normalize="lowercase")
        builder.add_field("location", "short_text", weight=2, algorithm="jaro_winkler", normalize="lowercase")
        builder.add_field("summary", "long_text", weight=3, algorithm="ngram", normalize="lowercase")
        
        schema = builder.build()
        self.index = fr.SchemaIndex(schema)
    
    def _resolve_skills(self, skills: list) -> list:
        """Resolve skills to canonical forms."""
        resolved = set()
        for skill in skills:
            canonical, _conf, _match_type = self.skill_resolver.resolve(skill)
            # resolve() returns the lowercased input for unknown skills,
            # so unknown skills are kept rather than dropped
            resolved.add(canonical)
        return sorted(resolved)  # sorted for deterministic output
    
    def add_candidate(self, candidate: dict):
        """Add a candidate to the search index."""
        resolved_skills = self._resolve_skills(candidate['skills'])
        
        record = {
            'skills': ','.join(resolved_skills),
            'title': candidate['title'],
            'location': candidate['location'],
            'summary': candidate.get('summary', '')
        }
        
        self.index.add(record, data=candidate['id'])
        self.candidates[candidate['id']] = {
            **candidate,
            'resolved_skills': resolved_skills
        }
    
    def search(self, job: dict, min_score: float = 0.2, limit: int = 10) -> list:
        """
        Search for candidates matching a job.
        
        Returns detailed match results with breakdown.
        """
        # Resolve job skills
        required = self._resolve_skills(job.get('required_skills', []))
        nice_to_have = self._resolve_skills(job.get('nice_to_have', []))
        all_skills = list(set(required + nice_to_have))
        
        query = {
            'skills': ','.join(all_skills),
            'title': job['title'],
            'location': job.get('location', ''),
            'summary': job.get('description', '')
        }
        
        results = self.index.search(query, min_score=min_score, limit=limit)
        
        matches = []
        for r in results:
            candidate = self.candidates[r.data]
            
            # Detailed skill analysis
            cand_skills = set(candidate['resolved_skills'])
            req_skills = set(required)
            nice_skills = set(nice_to_have)
            
            matched_required = cand_skills & req_skills
            matched_nice = cand_skills & nice_skills
            missing_required = req_skills - cand_skills
            
            # Experience match
            cand_exp = candidate.get('experience_years', 0)
            req_exp = job.get('experience_years', 0)
            exp_match = min(cand_exp / req_exp, 1.0) if req_exp > 0 else 1.0
            
            matches.append({
                'candidate': candidate,
                'overall_score': r.score,
                'field_scores': r.field_scores,
                'skills': {
                    'required_match': len(matched_required) / len(req_skills) if req_skills else 1.0,
                    'nice_to_have_match': len(matched_nice) / len(nice_skills) if nice_skills else 0,
                    'matched_required': sorted(matched_required),
                    'matched_nice_to_have': sorted(matched_nice),
                    'missing_required': sorted(missing_required)
                },
                'experience_match': exp_match,
                'recommendation': self._get_recommendation(r.score, len(matched_required), len(req_skills))
            })
        
        # Sort by overall score (already sorted, but explicit)
        matches.sort(key=lambda x: -x['overall_score'])
        
        return matches
    
    def _get_recommendation(self, score: float, matched: int, required: int) -> str:
        """Generate a hiring recommendation."""
        skill_pct = matched / required if required > 0 else 1.0
        
        if score >= 0.8 and skill_pct >= 0.8:
            return "STRONG MATCH - Recommend interview"
        elif score >= 0.6 and skill_pct >= 0.6:
            return "GOOD MATCH - Consider for interview"
        elif score >= 0.4 and skill_pct >= 0.5:
            return "PARTIAL MATCH - Review manually"
        else:
            return "WEAK MATCH - May not be suitable"


# Initialize and populate
search_engine = TalentSearchEngine(SKILLS_TAXONOMY)

for candidate in CANDIDATES:
    search_engine.add_candidate(candidate)

print(f"Talent Search Engine initialized with {len(search_engine.candidates)} candidates")

In [None]:
# Demo the search engine
job = JOB_POSTINGS[0]  # Senior Full Stack Developer

print(f"\nSearching candidates for: {job['title']} at {job['company']}")
print(f"Location: {job['location']}")
print(f"Required Skills: {job['required_skills']}")
print(f"Nice-to-Have: {job['nice_to_have']}")
print(f"Experience Required: {job['experience_years']}+ years")
print("\n" + "=" * 70)

matches = search_engine.search(job)

for i, m in enumerate(matches, 1):
    cand = m['candidate']
    skills = m['skills']
    
    print(f"\n#{i} {cand['name']}")
    print(f"   Title: {cand['title']}")
    print(f"   Location: {cand['location']}")
    print(f"   Experience: {cand['experience_years']} years")
    print(f"   ")
    print(f"   Overall Score: {m['overall_score']:.0%}")
    print(f"   Required Skills: {skills['required_match']:.0%} ({len(skills['matched_required'])}/{len(skills['matched_required']) + len(skills['missing_required'])})")
    print(f"   Nice-to-Have: {skills['nice_to_have_match']:.0%}")
    print(f"   Experience Match: {m['experience_match']:.0%}")
    print(f"   ")
    if skills['missing_required']:
        print(f"   Missing Required: {skills['missing_required']}")
    print(f"   ")
    print(f"   >> {m['recommendation']}")

## 8. Batch Processing: Matching Many Jobs to Many Candidates

In production, you often need to:
- Find best candidates for multiple jobs
- Find best jobs for a single candidate
- Generate match reports

Let's build utilities for these scenarios.

In [None]:
def generate_match_report(search_engine: TalentSearchEngine, jobs: list) -> dict:
    """
    Generate a comprehensive match report for multiple jobs.
    """
    report = {
        'jobs_analyzed': len(jobs),
        'candidates_in_pool': len(search_engine.candidates),
        'job_matches': []
    }
    
    for job in jobs:
        matches = search_engine.search(job, min_score=0.3, limit=5)
        
        strong = [m for m in matches if 'STRONG' in m['recommendation']]
        good = [m for m in matches if 'GOOD' in m['recommendation']]
        partial = [m for m in matches if 'PARTIAL' in m['recommendation']]
        
        report['job_matches'].append({
            'job_id': job['id'],
            'job_title': job['title'],
            'company': job['company'],
            'total_matches': len(matches),
            'strong_matches': len(strong),
            'good_matches': len(good),
            'partial_matches': len(partial),
            'top_candidate': matches[0]['candidate']['name'] if matches else None,
            'top_score': matches[0]['overall_score'] if matches else 0
        })
    
    return report


# Generate report
report = generate_match_report(search_engine, JOB_POSTINGS)

print("Talent Match Report")
print("=" * 70)
print(f"Jobs Analyzed: {report['jobs_analyzed']}")
print(f"Candidate Pool: {report['candidates_in_pool']}")
print()

print(f"{'Job':<30} {'Strong':<8} {'Good':<8} {'Top Candidate':<20} {'Score'}")
print("-" * 75)

for jm in report['job_matches']:
    print(f"{jm['job_title']:<30} {jm['strong_matches']:<8} {jm['good_matches']:<8} "
          f"{jm['top_candidate'] or 'None':<20} {jm['top_score']:.0%}")
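The report above covers the first scenario (candidates per job). The reverse direction, best jobs for a single candidate, can be sketched without an index by ranking jobs on required-skill coverage. The `rank_jobs_for_candidate` helper below is hypothetical and assumes skills have already been normalized to canonical names.

```python
def rank_jobs_for_candidate(candidate_skills: list, jobs: list, top_n: int = 3) -> list:
    """Return up to top_n (job_id, coverage) pairs, best coverage first."""
    cand = {s.lower() for s in candidate_skills}
    ranked = []
    for job in jobs:
        required = {s.lower() for s in job["required_skills"]}
        coverage = len(cand & required) / len(required) if required else 1.0
        ranked.append((job["id"], coverage))
    # Highest coverage first; ties broken by job id for stable output
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


# Made-up jobs for illustration
sample_jobs = [
    {"id": "JOB-A", "required_skills": ["react", "nodejs", "aws"]},
    {"id": "JOB-B", "required_skills": ["python", "sql"]},
    {"id": "JOB-C", "required_skills": ["react", "css"]},
]

best = rank_jobs_for_candidate(["react", "nodejs", "css"], sample_jobs, top_n=2)
for job_id, cov in best:
    print(f"{job_id}: {cov:.0%} of required skills")
```

For larger job catalogs the same idea applies with jobs indexed instead of candidates.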

## 9. Production Considerations

### Performance

| Candidate Pool Size | Expected Search Time | Recommendation |
|---------------------|---------------------|----------------|
| <1,000 | <10ms | SchemaIndex sufficient |
| 1,000-10,000 | 10-50ms | Add skill pre-filtering |
| 10,000-100,000 | 50-200ms | Use blocking by required skills |
| >100,000 | Database-dependent | Move search to a database: PostgreSQL with fuzzy matching plus application-level filtering |
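"Blocking by required skills" means shrinking the candidate pool with a cheap lookup before running the more expensive fuzzy scoring. One common way to do this is an inverted index from skill to candidate IDs. The sketch below uses hypothetical helper names and assumes normalized skills.

```python
from collections import defaultdict


def build_skill_index(candidates: dict) -> dict:
    """Map each skill to the set of candidate IDs that list it."""
    index = defaultdict(set)
    for cand_id, skills in candidates.items():
        for skill in skills:
            index[skill].add(cand_id)
    return index


def block_by_required(index: dict, required_skills: list, min_hits: int = 1) -> set:
    """Return candidate IDs that have at least min_hits of the required skills."""
    hits = defaultdict(int)
    for skill in set(required_skills):
        for cand_id in index.get(skill, ()):
            hits[cand_id] += 1
    return {cand_id for cand_id, count in hits.items() if count >= min_hits}


# Made-up pool for illustration
pool = {
    "c1": ["aws", "kubernetes", "terraform"],
    "c2": ["react", "css"],
    "c3": ["aws", "python"],
}
skill_index = build_skill_index(pool)

print(block_by_required(skill_index, ["aws", "kubernetes"]))              # any overlap
print(block_by_required(skill_index, ["aws", "kubernetes"], min_hits=2))  # stricter
```

Only the surviving IDs then need full multi-field scoring, so the cost of a search grows with the number of plausible candidates rather than the size of the pool.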

### Skill Taxonomy Maintenance

Your taxonomy will need updates as:
- New technologies emerge ("Bun", "Astro", etc.)
- Terms evolve ("Data Science" → "ML Engineering")
- Industry-specific variations appear

Consider:
- Automated alias detection from job posting text
- Periodic review of "unknown" skills
- Integration with external taxonomies (O*NET, ESCO)
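A periodic review of unknown skills can start from a simple frequency count: the terms that appear most often without a taxonomy entry are the best candidates for new aliases. The sketch below uses a hypothetical `unknown_skill_report` helper and made-up data.

```python
from collections import Counter


def unknown_skill_report(skill_lists: list, known_aliases: set, top_n: int = 10) -> list:
    """Count skills that are not in the alias table, most frequent first."""
    counts = Counter()
    for skills in skill_lists:
        for skill in skills:
            key = skill.lower().strip()
            if key not in known_aliases:
                counts[key] += 1
    return counts.most_common(top_n)


# Made-up resumes and alias table for illustration
resume_skills = [
    ["React", "Bun", "Astro"],
    ["bun", "Python"],
    ["Bun ", "Htmx"],
]
known = {"react", "python"}

report_rows = unknown_skill_report(resume_skills, known)
for skill, count in report_rows:
    print(f"{skill:<10} seen {count}x")
```

Terms that recur across many resumes can be reviewed by a person and then added to the taxonomy as new canonical skills or aliases.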

### Bias Considerations

Automated resume screening can encode bias by:
- Favoring candidates with "standard" skill naming (which disadvantages non-native speakers)
- Weighting prestigious company names in summaries
- Introducing geographic bias through location matching

Mitigations:
- Focus matching on skills, not text style
- Remove or de-weight company names
- Support remote work fairly
- Regular audits of match demographics
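A basic audit compares how often candidates from different groups clear the match threshold. The sketch below computes a pass rate per group from (group, score) pairs. The helper name, the group labels, and the 0.6 threshold are all illustrative assumptions, and a real audit would need carefully sourced group data and statistical care.

```python
from collections import defaultdict


def selection_rates(records: list, threshold: float = 0.6) -> dict:
    """Share of candidates in each group whose score meets the threshold."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for group, score in records:
        totals[group] += 1
        if score >= threshold:
            passed[group] += 1
    return {group: passed[group] / totals[group] for group in totals}


# Made-up match scores grouped by work arrangement
records = [
    ("remote", 0.8),
    ("remote", 0.5),
    ("onsite", 0.9),
    ("onsite", 0.7),
]

rates = selection_rates(records)
for group, rate in sorted(rates.items()):
    print(f"{group:<8} pass rate: {rate:.0%}")
```

Large gaps between groups do not prove bias on their own, but they indicate where the weights or the taxonomy deserve a closer look.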

## Summary

In this guide, we built a complete talent matching system:

1. **Skills Taxonomy**: Canonical forms + aliases for consistent matching
2. **Fuzzy Skill Resolution**: Handle unknown skills gracefully
3. **Multi-Field Matching**: SchemaIndex for title + skills + location + description
4. **Weighted Scoring**: Required skills weighted higher than nice-to-have
5. **Hiring Recommendations**: Automated categorization of match quality

### Key Takeaways

- **TokenSet** field type is ideal for skill list comparison (Jaccard similarity)
- **Skill normalization** is critical - "React.js" and "ReactJS" must match
- **Fuzzy resolution** handles unknown variations gracefully
- **Multi-field matching** considers the whole candidate, not just skills
- **Weighted scoring** lets you prioritize what matters most

### When to Use This Approach

- ATS (Applicant Tracking System) resume filtering
- Internal talent mobility / skill matching
- Freelancer / contractor matching platforms
- Training recommendation systems
- Team composition optimization