# Skill Taxonomy Builder with Embeddings (Improved Extraction)

This notebook helps you:
1. Build a taxonomy structure from raw skills
2. Generate high-quality variations and abbreviations (with improved filtering)
3. Compute and store embeddings efficiently
4. Set up NumPy-based similarity search with deduplication and scoring improvements

## Prerequisites
```bash
pip install sentence-transformers pandas numpy scikit-learn rapidfuzz pyarrow
```

In [42]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import json
from collections import defaultdict
from rapidfuzz import fuzz, process
from sklearn.cluster import AgglomerativeClustering
import re
from pathlib import Path
import gc

In [None]:
# Check if Ollama is running with mistral:7b model
import subprocess
import time

def check_ollama_status():
    """Check if Ollama is running and mistral:7b is available"""
    try:
        # Check if ollama is running
        result = subprocess.run(['ollama', 'list'], capture_output=True, text=True, timeout=5)
        if result.returncode != 0:
            print("❌ Ollama is not running")
            print("   Start it with: ollama serve")
            return False
        
        # Check if mistral:7b is available
        if 'mistral:7b' in result.stdout or 'mistral:latest' in result.stdout:
            print("✓ Ollama is running")
            print("✓ mistral:7b is available")
            return True
        else:
            print("✓ Ollama is running")
            print("❌ mistral:7b not found")
            print("   Install it with: ollama pull mistral:7b")
            print("\nAvailable models:")
            print(result.stdout)
            return False
            
    except subprocess.TimeoutExpired:
        print("❌ Ollama is not responding")
        print("   Start it with: ollama serve")
        return False
    except FileNotFoundError:
        print("❌ Ollama is not installed")
        print("   Install from: https://ollama.ai")
        return False

# Run the check
ollama_ready = check_ollama_status()

if not ollama_ready:
    print("\n⚠️  LLM relevance validation will be skipped without Ollama")
    print("   Skills will still be extracted using semantic similarity")

## Step 1: Load and Build Taxonomy Structure

We'll analyze the full skills list (about 32,000 entries in the run shown here) to create a hierarchical taxonomy using:
- Pattern matching for common categories
- Hierarchical clustering based on semantic similarity
- Parent-child relationships

In [None]:
# Load your skills file
# Adjust the path and format as needed
def load_skills(filepath):
    """
    Load skills from a text file (one skill per line)
    Lines starting with # are treated as comments and skipped
    Returns a pandas DataFrame
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        skills = []
        for line in f:
            line = line.strip()
            # Skip empty lines and comments
            if line and not line.startswith('#'):
                skills.append(line)
    
    df = pd.DataFrame({
        'skill_id': [f'SKILL_{i:05d}' for i in range(len(skills))],
        'canonical_name': skills,
        'normalized_name': [s.lower().strip() for s in skills]
    })
    
    return df

# Load your skills from the data directory
skills_df = load_skills('../data/skills/skills.txt')
print(f"Loaded {len(skills_df)} skills")
skills_df.head()

In [3]:
# Basic taxonomy building using keyword patterns
def extract_category_from_patterns(skill_name):
    """
    Extract likely category based on common patterns
    Customize these patterns based on your domain
    """
    skill_lower = skill_name.lower()
    
    # Programming languages
    prog_langs = ['python', 'java', 'javascript', 'c++', 'ruby', 'go', 'rust', 'php', 'swift']
    if any(lang in skill_lower for lang in prog_langs):
        return 'Programming Languages'
    
    # Data Science & ML
    ds_keywords = ['machine learning', 'data science', 'deep learning', 'neural network', 
                   'tensorflow', 'pytorch', 'scikit-learn', 'nlp', 'computer vision']
    if any(kw in skill_lower for kw in ds_keywords):
        return 'Data Science & AI'
    
    # Cloud & DevOps
    cloud_keywords = ['aws', 'azure', 'gcp', 'docker', 'kubernetes', 'terraform', 'ci/cd', 'devops']
    if any(kw in skill_lower for kw in cloud_keywords):
        return 'Cloud & DevOps'
    
    # Databases
    db_keywords = ['sql', 'database', 'postgresql', 'mongodb', 'redis', 'mysql', 'oracle']
    if any(kw in skill_lower for kw in db_keywords):
        return 'Databases'
    
    # Web Development
    web_keywords = ['html', 'css', 'react', 'angular', 'vue', 'frontend', 'backend', 'web development']
    if any(kw in skill_lower for kw in web_keywords):
        return 'Web Development'
    
    # Add more categories as needed
    
    return 'General'

# Apply pattern-based categorization
skills_df['category'] = skills_df['canonical_name'].apply(extract_category_from_patterns)

print("\nCategory distribution:")
print(skills_df['category'].value_counts())


Category distribution:
category
General                  31069
Programming Languages      665
Databases                  436
Cloud & DevOps             134
Web Development            127
Data Science & AI           37
Name: count, dtype: int64
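
The keyword checks above use a plain substring test (`kw in skill_lower`), which can misfire: `'go'` is a substring of `'google analytics'`, for example. The following is a minimal sketch of a stricter matcher, assuming whole-token matching is the desired behavior. The helper name `matches_keyword` is hypothetical and not part of the notebook, and the category counts above were produced with the original substring test.

```python
import re

def matches_keyword(skill_name, keywords):
    """Return True if any keyword appears as a whole token in skill_name."""
    skill_lower = skill_name.lower()
    for kw in keywords:
        # (?<!\w) and (?!\w) behave like word boundaries but also work for
        # keywords that end in punctuation, such as 'c++' or 'ci/cd'
        pattern = r'(?<!\w)' + re.escape(kw) + r'(?!\w)'
        if re.search(pattern, skill_lower):
            return True
    return False

prog_langs = ['python', 'java', 'javascript', 'c++', 'go', 'rust']

print(matches_keyword('Google Analytics', prog_langs))  # 'go' inside 'google' is not a token
print(matches_keyword('Go Programming', prog_langs))    # 'go' as its own token
print(matches_keyword('C++ Development', prog_langs))   # punctuation-terminated keyword
```

Swapping this into `extract_category_from_patterns` would change the distribution, so it is worth re-running the `value_counts()` check afterwards.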


In [4]:
# Detect parent-child relationships
def find_parent_skills(skill, all_skills, threshold=0.85):
    """
    Find potential parent skills: broader skills whose name is contained in this one
    Example: 'Gigabit Ethernet' is a parent of '10 Gigabit Ethernet'
    Note: `threshold` is currently unused; matching is purely substring/token based
    """
    skill_lower = skill.lower()
    parents = []
    
    for other_skill in all_skills:
        if skill == other_skill:
            continue
            
        other_lower = other_skill.lower()
        
        # Check if skill contains the other (other is more general)
        if other_lower in skill_lower and other_lower != skill_lower:
            # Check token overlap to avoid false positives
            skill_tokens = set(skill_lower.split())
            other_tokens = set(other_lower.split())
            
            if other_tokens.issubset(skill_tokens):
                parents.append(other_skill)
    
    return parents

# Find parent relationships (pairwise O(n^2) comparison; this can take a while for ~32k skills)
print("Finding parent-child relationships...")
all_skill_names = skills_df['canonical_name'].tolist()
skills_df['parent_skills'] = skills_df['canonical_name'].apply(
    lambda x: find_parent_skills(x, all_skill_names)
)

# Show some examples
print("\nSkills with parents:")
skills_with_parents = skills_df[skills_df['parent_skills'].apply(len) > 0]
print(f"Found {len(skills_with_parents)} skills with parent relationships")
skills_with_parents[['canonical_name', 'parent_skills']].head(10)

Finding parent-child relationships...

Skills with parents:
Found 8698 skills with parent relationships


| | canonical_name | parent_skills |
|---|---|---|
| 4 | .NET Framework 1 | [.NET Framework] |
| 5 | .NET Framework 3 | [.NET Framework] |
| 6 | .NET Framework 4 | [.NET Framework] |
| 12 | 10 Gigabit Ethernet | [Ethernet, Gigabit Ethernet] |
| 14 | 10-Hour OSHA Construction Card | [Construction] |
| 16 | 100-Ton Master Captain's License | [Captain's License] |
| 21 | 12 Volt Electricity | [Electricity] |
| 25 | 200-Ton Master Captain's License | [Captain's License] |
| 26 | 2020 Design Software | [Design Software, Design] |
| 27 | 25-Ton Master Captain's License | [Captain's License] |
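
The parent search compares every skill against every other skill. One way to cut that down is sketched below, assuming the same matching rule (substring plus token subset): index skills by token and only test candidates that share at least one token with the skill. The names `build_token_index` and `find_parent_skills_indexed` are hypothetical, and unlike the original the result is returned sorted.

```python
from collections import defaultdict

def build_token_index(all_skills):
    """Map each lowercase token to the set of skill names containing it."""
    index = defaultdict(set)
    for name in all_skills:
        for token in set(name.lower().split()):
            index[token].add(name)
    return index

def find_parent_skills_indexed(skill, token_index):
    """Apply the substring + token-subset rule to candidates sharing a token."""
    skill_lower = skill.lower()
    skill_tokens = set(skill_lower.split())

    # Any valid parent has all of its tokens in skill_tokens,
    # so it must appear under at least one of those tokens.
    candidates = set()
    for token in skill_tokens:
        candidates |= token_index.get(token, set())

    parents = []
    for other in candidates:
        other_lower = other.lower()
        if other_lower == skill_lower:
            continue
        if other_lower in skill_lower and set(other_lower.split()).issubset(skill_tokens):
            parents.append(other)
    return sorted(parents)

skills = ['Ethernet', 'Gigabit Ethernet', '10 Gigabit Ethernet', 'Electricity']
index = build_token_index(skills)
print(find_parent_skills_indexed('10 Gigabit Ethernet', index))
```

The index is built once and reused for every skill, so the per-skill work depends on how many skills share its tokens rather than on the full list size.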


## Step 2: Generate High-Quality Variations and Abbreviations

We'll use a data-driven approach with improved filtering:
- Rule-based abbreviation generation
- Common typo patterns
- Case variations (filtered to avoid false positives)
- Token reordering for multi-word skills

In [5]:
def generate_abbreviations(skill_name):
    """
    Generate likely abbreviations using rules
    """
    abbreviations = set()
    
    # Remove common words that are typically not abbreviated
    stopwords = {'and', 'or', 'the', 'of', 'for', 'with', 'in', 'on', 'at'}
    
    tokens = skill_name.split()
    filtered_tokens = [t for t in tokens if t.lower() not in stopwords]
    
    if len(filtered_tokens) > 1:
        # First letter of each word
        abbr = ''.join([t[0].upper() for t in filtered_tokens])
        abbreviations.add(abbr)
        
        # First letter lowercase version
        abbreviations.add(abbr.lower())
        
        # Common pattern: First word + first letter of others
        if len(filtered_tokens) >= 2:
            first_word = filtered_tokens[0]
            rest_abbr = ''.join([t[0].upper() for t in filtered_tokens[1:]])
            abbreviations.add(f"{first_word}{rest_abbr}")
    
    # Known common abbreviations (add your domain-specific ones)
    known_abbrevs = {
        'machine learning': ['ML', 'ml'],
        'artificial intelligence': ['AI', 'ai'],
        'natural language processing': ['NLP', 'nlp'],
        'computer vision': ['CV', 'cv'],
        'deep learning': ['DL', 'dl'],
        'data science': ['DS', 'ds'],
        'application programming interface': ['API', 'api'],
        'structured query language': ['SQL', 'sql'],
        'continuous integration': ['CI', 'ci'],
        'continuous deployment': ['CD', 'cd'],
    }
    
    skill_lower = skill_name.lower()
    for phrase, abbrevs in known_abbrevs.items():
        if phrase in skill_lower:
            abbreviations.update(abbrevs)
    
    return list(abbreviations)

def generate_common_typos(skill_name):
    """
    Generate common typo patterns
    """
    typos = set()
    skill_lower = skill_name.lower()
    
    # Common character swaps
    swaps = [('ie', 'ei'), ('ph', 'f'), ('tion', 'sion')]
    for old, new in swaps:
        if old in skill_lower:
            typos.add(skill_lower.replace(old, new))
    
    # Double letter removals (programming -> programing)
    for i in range(len(skill_lower) - 1):
        if skill_lower[i] == skill_lower[i+1]:
            typo = skill_lower[:i] + skill_lower[i+1:]
            typos.add(typo)
    
    return list(typos)

def generate_variations(skill_name):
    """
    Generate all variations of a skill with improved filtering
    """
    variations = set()
    tokens = skill_name.split()
    
    # For single-word skills, be more conservative with variations
    if len(tokens) == 1:
        variations.add(skill_name.lower())
        variations.add(skill_name.upper())
        # Only add abbreviations for longer technical terms
        if len(skill_name) > 4:
            variations.update(generate_abbreviations(skill_name))
        variations.discard(skill_name)
        return list(variations)
    
    # For multi-word skills, generate full variations
    variations.add(skill_name)
    variations.add(skill_name.lower())
    variations.add(skill_name.upper())
    variations.add(skill_name.title())
    
    # Abbreviations
    variations.update(generate_abbreviations(skill_name))
    
    # Common typos (limit to avoid explosion)
    typos = generate_common_typos(skill_name)
    variations.update(typos[:5])
    
    # Token reordering for 2-word skills
    if len(tokens) == 2:
        variations.add(f"{tokens[1]} {tokens[0]}")
    
    # Hyphen/underscore variations
    if ' ' in skill_name:
        variations.add(skill_name.replace(' ', '-'))
        variations.add(skill_name.replace(' ', '_'))
    
    # Remove the original to avoid duplication
    variations.discard(skill_name)
    
    # Filter out problematic variations (common words that cause false positives)
    common_words = {'project', 'projects', 'management', 'analysis', 'development', 
                    'design', 'testing', 'planning', 'support', 'systems', 'data',
                    'business', 'technical', 'customer', 'service', 'process'}
    
    filtered_variations = []
    for var in variations:
        var_lower = var.lower()
        # Keep variations that:
        # 1. Are not single common words, OR
        # 2. Have special characters (hyphens, underscores)
        if var_lower not in common_words or ' ' in var or '-' in var or '_' in var:
            filtered_variations.append(var)
    
    return filtered_variations

# Generate variations for all skills
print("Generating variations...")
skills_df['variations'] = skills_df['canonical_name'].apply(generate_variations)

# Show statistics
avg_variations = skills_df['variations'].apply(len).mean()
total_variations = skills_df['variations'].apply(len).sum()
print(f"\nGenerated {total_variations:,} total variations")
print(f"Average {avg_variations:.1f} variations per skill")

# Show examples
print("\nExample variations:")
for idx in skills_df.sample(min(5, len(skills_df))).index:
    skill = skills_df.loc[idx, 'canonical_name']
    skill_variations = skills_df.loc[idx, 'variations']  # avoid shadowing the built-in vars()
    print(f"\n{skill}:")
    print(f"  {skill_variations[:10]}")  # Show first 10

Generating variations...

Generated 231,869 total variations
Average 7.1 variations per skill

Example variations:

Component Design:
  ['Component_Design', 'cd', 'COMPONENT DESIGN', 'Component-Design', 'ComponentD', 'CD', 'component design', 'Design Component']

Cloudera Certified Developer For Hadoop (CCDH):
  ['CCDH(', 'Cloudera Certified Developer For Hadoop (Ccdh)', 'cloudera certified developer for hadoop (ccdh)', 'Cloudera-Certified-Developer-For-Hadoop-(CCDH)', 'cloudera certifeid developer for hadoop (ccdh)', 'CLOUDERA CERTIFIED DEVELOPER FOR HADOOP (CCDH)', 'cloudera certified developer for hadop (ccdh)', 'cloudera certified developer for hadoop (cdh)', 'Cloudera_Certified_Developer_For_Hadoop_(CCDH)', 'ClouderaCDH(']

Registration:
  ['registration', 'REGISTRATION']

Sonatype:
  ['SONATYPE', 'sonatype']

Stormwater Monitoring:
  ['Stormwater_Monitoring', 'SM', 'Stormwater-Monitoring', 'STORMWATER MONITORING', 'stormwater monitoring', 'sm', 'StormwaterM', 'Monitoring Stormwat
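
Entries such as `CCDH(` and `ClouderaCDH(` in the output above come from taking the first character of the token `(CCDH)`, which is a parenthesis. The sketch below shows one possible cleanup, assuming that a trailing parenthesized token is an explicit abbreviation and that initials should come from alphanumeric characters only. The helpers `split_explicit_abbreviation` and `initials` are hypothetical and not part of the notebook.

```python
import re

STOPWORDS = {'and', 'or', 'the', 'of', 'for', 'with', 'in', 'on', 'at'}

def split_explicit_abbreviation(skill_name):
    """Separate a trailing '(ABBR)' from the rest of the name, if present."""
    match = re.search(r'\(([A-Za-z0-9+#.-]+)\)\s*$', skill_name)
    if match:
        return skill_name[:match.start()].strip(), match.group(1)
    return skill_name, None

def initials(name):
    """Acronym from the first alphanumeric character of each non-stopword token."""
    tokens = [re.sub(r'[^A-Za-z0-9]', '', t) for t in name.split()]
    tokens = [t for t in tokens if t and t.lower() not in STOPWORDS]
    # A single token does not yield a useful acronym
    return ''.join(t[0].upper() for t in tokens) if len(tokens) > 1 else None

base, explicit = split_explicit_abbreviation('Cloudera Certified Developer For Hadoop (CCDH)')
print(base)
print(explicit, initials(base))
print(initials('Stormwater Monitoring'))
print(initials('Registration'))
```

Here the explicit abbreviation and the generated initials agree, which is a cheap sanity check that the acronym is a real one rather than a guess.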

## Step 3: Compute and Store Embeddings

We'll use sentence-transformers to generate embeddings for:
- Canonical skill names
- All variations

These will be pre-computed and stored for fast loading.

In [6]:
# Load embedding model
# Options:
# - 'all-MiniLM-L6-v2': Fast, good balance (384 dimensions)
# - 'multi-qa-MiniLM-L6-cos-v1': Better for asymmetric search
# - 'all-mpnet-base-v2': Higher quality, slower (768 dimensions)

print("Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding_dim = model.get_sentence_embedding_dimension()
print(f"Model loaded. Embedding dimension: {embedding_dim}")

Loading embedding model...
Model loaded. Embedding dimension: 384


In [7]:
# Compute embeddings for canonical names
print("Computing embeddings for canonical skill names...")
canonical_names = skills_df['canonical_name'].tolist()
canonical_embeddings = model.encode(
    canonical_names,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Store embeddings in dataframe
skills_df['embedding'] = list(canonical_embeddings)

print(f"Computed {len(canonical_embeddings)} embeddings")
print(f"Embedding shape: {canonical_embeddings.shape}")

Computing embeddings for canonical skill names...


Computed 32468 embeddings
Embedding shape: (32468, 384)


In [8]:
# Create variation-to-skill mapping with embeddings
print("\nComputing embeddings for all variations...")

variation_data = []
for idx, row in skills_df.iterrows():
    skill_id = row['skill_id']
    canonical = row['canonical_name']
    
    # Add canonical name
    variation_data.append({
        'skill_id': skill_id,
        'canonical_name': canonical,
        'variation': canonical,
        'is_canonical': True
    })
    
    # Add all variations
    for var in row['variations']:
        variation_data.append({
            'skill_id': skill_id,
            'canonical_name': canonical,
            'variation': var,
            'is_canonical': False
        })

variations_df = pd.DataFrame(variation_data)
print(f"Total entries (canonical + variations): {len(variations_df):,}")

# Compute embeddings for all variations
all_variations = variations_df['variation'].tolist()
variation_embeddings = model.encode(
    all_variations,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

variations_df['embedding'] = list(variation_embeddings)
print(f"\nComputed {len(variation_embeddings):,} variation embeddings")


Computing embeddings for all variations...
Total entries (canonical + variations): 264,337



Computed 264,337 variation embeddings
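
Some variation strings are likely to repeat across skills (short acronyms such as `CD` or `SM`, for instance), so encoding each distinct string once and mapping the vectors back may save work. This is a sketch under that assumption: `encode_unique` is a hypothetical helper, and `fake_encode` is a stand-in for `model.encode` so the example runs without loading a model.

```python
def encode_unique(texts, encode_fn):
    """Encode each distinct string once, then expand back to the original order."""
    unique_texts = list(dict.fromkeys(texts))   # de-duplicate, keeping first-seen order
    unique_vectors = encode_fn(unique_texts)    # one vector per unique string
    lookup = dict(zip(unique_texts, unique_vectors))
    return [lookup[t] for t in texts], len(unique_texts)

def fake_encode(batch):
    # Stand-in for model.encode: returns a tiny deterministic vector per string
    return [[float(len(s)), float(sum(map(ord, s)) % 97)] for s in batch]

variations = ['CD', 'cd', 'Component Design', 'CD', 'SM', 'cd']
vectors, n_unique = encode_unique(variations, fake_encode)
print(f"{len(variations)} strings, {n_unique} encoded")
```

Whether this is worthwhile depends on how many duplicates the variation table actually contains, which `variations_df['variation'].nunique()` would show.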


In [None]:
# Save the complete taxonomy with embeddings
print("Saving taxonomy...")

# Output directory
output_dir = '../data/skills'

# Save as Parquet (preserves embeddings)
skills_df.to_parquet(f'{output_dir}/skill_taxonomy.parquet', index=False)
variations_df.to_parquet(f'{output_dir}/skill_variations.parquet', index=False)

# Also save as JSON for human readability (without embeddings)
skills_json = []
for idx, row in skills_df.iterrows():
    skills_json.append({
        'skill_id': row['skill_id'],
        'canonical_name': row['canonical_name'],
        'category': row['category'],
        'parent_skills': row['parent_skills'],
        'variations': row['variations'],
        # Embeddings excluded from JSON to keep file size reasonable
    })

with open(f'{output_dir}/skill_taxonomy.json', 'w', encoding='utf-8') as f:
    json.dump(skills_json, f, indent=2, ensure_ascii=False)

print(f"✓ Saved {output_dir}/skill_taxonomy.parquet (with embeddings)")
print(f"✓ Saved {output_dir}/skill_variations.parquet (with embeddings)")
print(f"✓ Saved {output_dir}/skill_taxonomy.json (human-readable, no embeddings)")

## Step 4: Prepare NumPy-based Similarity Search

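Instead of FAISS, which can segfault in some environments, we'll use plain NumPy for similarity search.
Because the embeddings are L2-normalized, cosine similarity is a single dot product, so a brute-force scan over the ~264,000 variation vectors is one matrix-vector multiply per query. This is stable, portable, and fast enough for a taxonomy of this size.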
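
As a minimal sketch of the search itself, the example below runs a top-k cosine lookup on random data. The function names and the `eps` guard against zero-length vectors are assumptions added for illustration; the notebook's own cells divide by the norm directly.

```python
import numpy as np

def l2_normalize(matrix, eps=1e-12):
    """Scale each row to unit length; eps guards against division by zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

def top_k_cosine(index_vectors, query_vector, k):
    """Return (indices, scores) of the k most similar rows, best first."""
    scores = index_vectors @ query_vector        # cosine similarity for unit vectors
    k = min(k, len(scores))
    top = np.argpartition(scores, -k)[-k:]       # unordered top-k without a full sort
    top = top[np.argsort(scores[top])[::-1]]     # order just those k, descending
    return top, scores[top]

rng = np.random.default_rng(0)
index_vectors = l2_normalize(rng.normal(size=(1000, 384)).astype('float32'))
query = index_vectors[42]                        # a vector that is already in the index

indices, scores = top_k_cosine(index_vectors, query, k=5)
print(indices[0], float(scores[0]))              # the query's own row should rank first
```

`np.argpartition` avoids sorting the whole score vector, which matters when the same lookup runs once per n-gram.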

In [None]:
# Normalize and save embeddings for similarity search
print("Preparing embeddings for similarity search...")

# Output directory
output_dir = '../data/skills'

# Normalize canonical embeddings
canonical_embeddings_normalized = canonical_embeddings.astype('float32')
canonical_embeddings_normalized = canonical_embeddings_normalized / np.linalg.norm(
    canonical_embeddings_normalized, axis=1, keepdims=True
)

# Save normalized canonical embeddings
np.save(f'{output_dir}/skill_canonical_embeddings.npy', canonical_embeddings_normalized)
print(f"✓ Saved {len(canonical_embeddings_normalized)} canonical embeddings")

# Normalize variation embeddings
variation_embeddings_normalized = variation_embeddings.astype('float32')
variation_embeddings_normalized = variation_embeddings_normalized / np.linalg.norm(
    variation_embeddings_normalized, axis=1, keepdims=True
)

# Save normalized variation embeddings
np.save(f'{output_dir}/skill_variations_embeddings.npy', variation_embeddings_normalized)
print(f"✓ Saved {len(variation_embeddings_normalized)} variation embeddings")

print("\n✓ All embeddings saved and ready for fast similarity search!")

# Cleanup to free memory
del canonical_embeddings_normalized
del variation_embeddings_normalized
gc.collect()

## Runtime Usage: Improved SkillExtractor Class

This version includes:
- Length-based scoring penalty (prevents partial matches from scoring too high)
- Similar skill deduplication (removes redundant results)
- Better handling of edge cases
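
The first two ideas can be sketched in isolation before reading the full class. This standalone version mirrors the penalty rule (token-count ratio, floored at 0.5) and the greedy deduplication rule (drop a result that is too similar to a higher-ranked one that was kept). `token_jaccard` is a hypothetical stand-in for embedding cosine similarity so the sketch runs without a model.

```python
def length_penalty(similarity, ngram, canonical_skill, floor=0.5):
    """Scale similarity by the token-count ratio, never below `floor`."""
    ngram_len = len(ngram.split())
    skill_len = len(canonical_skill.split())
    if ngram_len >= skill_len:
        return similarity
    return similarity * max(floor, ngram_len / skill_len)

def greedy_dedup(ranked_names, similarity_fn, threshold=0.85):
    """Keep a name only if it is not too similar to any higher-ranked kept name."""
    kept = []
    for name in ranked_names:
        if all(similarity_fn(name, other) <= threshold for other in kept):
            kept.append(name)
    return kept

def token_jaccard(a, b):
    # Stand-in similarity: overlap of lowercase token sets
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

print(length_penalty(0.9, 'project', 'project management'))          # ratio 1/2
print(length_penalty(0.9, 'python', 'python programming language'))  # ratio 1/3, floored at 0.5
print(length_penalty(0.9, 'machine learning', 'machine learning'))   # equal length, no penalty
print(greedy_dedup(['Python', 'python', 'SQL'], token_jaccard))
```

Because of the floor, a one-word n-gram can lose at most half of its raw similarity, however long the skill name is.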

In [None]:
# Try to import ollama for LLM validation
try:
    import ollama
    OLLAMA_AVAILABLE = True
except ImportError:
    OLLAMA_AVAILABLE = False
    print("Note: Ollama not installed. LLM validation will not be available.")

class SkillExtractor:
    """
    Fast skill extraction using pre-built taxonomy and numpy similarity search
    with improved scoring and deduplication
    """
    def __init__(self, 
                 taxonomy_path='../data/skills/skill_taxonomy.parquet',
                 variations_path='../data/skills/skill_variations.parquet',
                 canonical_embeddings_path='../data/skills/skill_canonical_embeddings.npy',
                 variations_embeddings_path='../data/skills/skill_variations_embeddings.npy',
                 model_name='all-MiniLM-L6-v2'):
        
        print("Loading skill extractor...")
        
        # Load taxonomy
        self.skills_df = pd.read_parquet(taxonomy_path)
        self.variations_df = pd.read_parquet(variations_path)
        
        # Load embeddings (numpy arrays)
        self.canonical_embeddings = np.load(canonical_embeddings_path)
        self.variations_embeddings = np.load(variations_embeddings_path)
        
        # Load embedding model
        self.model = SentenceTransformer(model_name)
        
        print(f"✓ Loaded {len(self.skills_df)} skills")
        print(f"✓ Loaded {len(self.variations_df)} variations")
        print(f"✓ Ready for extraction")
    
    def extract_from_text(self, text, threshold=0.5, top_k=10, use_variations=True,
                         deduplicate_similar=True, dedup_threshold=0.85,
                         apply_length_penalty=True,
                         validate_relevance=True, context=None,
                         ollama_model="mistral:7b", relevance_threshold=0.5):
        """
        Extract skills from text using numpy-based similarity search
        
        Args:
            text: Input text to extract skills from
            threshold: Minimum similarity threshold (0-1)
            top_k: Number of top matches to consider per n-gram
            use_variations: Whether to search variations or just canonical names
            deduplicate_similar: Remove skills that are very similar to each other
            dedup_threshold: Similarity threshold for considering skills duplicates
            apply_length_penalty: Penalize matches where ngram is much shorter than skill
            validate_relevance: Use LLM to validate skill relevance (default True)
            context: Context hint for relevance validation (e.g., "software engineer")
            ollama_model: Ollama model for validation (default: mistral:7b)
            relevance_threshold: Minimum relevance score to keep skill
        
        Returns:
            List of detected skills with similarity scores (and relevance_score if validation enabled)
        """
        # Generate n-grams from text
        ngrams = self._generate_ngrams(text, max_n=5)
        
        if not ngrams:
            return []
        
        # Encode n-grams
        ngram_embeddings = self.model.encode(ngrams, convert_to_numpy=True)
        ngram_embeddings = ngram_embeddings / np.linalg.norm(
            ngram_embeddings, axis=1, keepdims=True
        )
        
        # Search for matches
        detected_skills = {}
        
        for i, ngram in enumerate(ngrams):
            query = ngram_embeddings[i:i+1]
            
            if use_variations:
                # Compute cosine similarity with all variations (dot product)
                similarities = np.dot(self.variations_embeddings, query.T).flatten()
                
                # Get top k indices
                if len(similarities) > top_k:
                    top_indices = np.argpartition(similarities, -top_k)[-top_k:]
                    top_indices = top_indices[np.argsort(similarities[top_indices])][::-1]
                else:
                    top_indices = np.argsort(similarities)[::-1]
                
                for idx in top_indices:
                    dist = similarities[idx]
                    if dist >= threshold:
                        match = self.variations_df.iloc[idx]
                        skill_id = match['skill_id']
                        canonical = match['canonical_name']
                        matched_variation = match['variation']
                        
                        # Apply length penalty
                        if apply_length_penalty:
                            adjusted_score = self._apply_length_penalty(
                                dist, ngram, canonical
                            )
                        else:
                            adjusted_score = dist
                        
                        # Only update if this is a better match
                        if skill_id not in detected_skills or adjusted_score > detected_skills[skill_id]['score']:
                            detected_skills[skill_id] = {
                                'canonical_name': canonical,
                                'matched_text': ngram,
                                'matched_variation': matched_variation,
                                'score': float(adjusted_score),
                                'raw_similarity': float(dist)
                            }
            else:
                # Search canonical embeddings
                similarities = np.dot(self.canonical_embeddings, query.T).flatten()
                
                if len(similarities) > top_k:
                    top_indices = np.argpartition(similarities, -top_k)[-top_k:]
                    top_indices = top_indices[np.argsort(similarities[top_indices])][::-1]
                else:
                    top_indices = np.argsort(similarities)[::-1]
                
                for idx in top_indices:
                    dist = similarities[idx]
                    if dist >= threshold:
                        match = self.skills_df.iloc[idx]
                        skill_id = match['skill_id']
                        canonical = match['canonical_name']
                        
                        # Apply length penalty
                        if apply_length_penalty:
                            adjusted_score = self._apply_length_penalty(
                                dist, ngram, canonical
                            )
                        else:
                            adjusted_score = dist
                        
                        if skill_id not in detected_skills or adjusted_score > detected_skills[skill_id]['score']:
                            detected_skills[skill_id] = {
                                'canonical_name': canonical,
                                'matched_text': ngram,
                                'score': float(adjusted_score),
                                'raw_similarity': float(dist)
                            }
        
        # Sort by score
        results = sorted(detected_skills.values(), key=lambda x: x['score'], reverse=True)
        
        # Deduplicate similar skills
        if deduplicate_similar and len(results) > 1:
            results = self._deduplicate_similar_skills(results, dedup_threshold)
        
        # Apply LLM relevance validation if enabled
        if validate_relevance and results and OLLAMA_AVAILABLE:
            results = self._validate_relevance_with_llm(
                text, results, context, ollama_model, relevance_threshold
            )
        
        return results
    
    def _apply_length_penalty(self, similarity, ngram, canonical_skill):
        """
        Apply penalty when matched n-gram is much shorter than the skill name
        """
        ngram_len = len(ngram.split())
        skill_len = len(canonical_skill.split())
        
        if ngram_len < skill_len:
            length_penalty = ngram_len / skill_len
            penalty_factor = max(0.5, length_penalty)
            adjusted_score = similarity * penalty_factor
        else:
            adjusted_score = similarity
        
        return adjusted_score
    
    def _deduplicate_similar_skills(self, results, threshold=0.85):
        """
        Remove redundant skills that are very similar to higher-scoring skills
        """
        if len(results) <= 1:
            return results
        
        skill_names = [r['canonical_name'] for r in results]
        skill_embeddings = self.model.encode(skill_names, convert_to_numpy=True)
        skill_embeddings = skill_embeddings / np.linalg.norm(skill_embeddings, axis=1, keepdims=True)
        
        keep_indices = []
        
        for i in range(len(results)):
            if i == 0:
                keep_indices.append(i)
                continue
            
            should_keep = True
            for j in keep_indices:
                similarity = np.dot(skill_embeddings[i], skill_embeddings[j])
                if similarity > threshold:
                    should_keep = False
                    break
            
            if should_keep:
                keep_indices.append(i)
        
        return [results[i] for i in keep_indices]
    
    def _generate_ngrams(self, text, max_n=5):
        """
        Generate n-grams from text (1 to max_n words)
        """
        text = re.sub(r'[^a-zA-Z0-9\s-]', ' ', text)
        tokens = text.lower().split()
        
        ngrams = []
        for n in range(1, min(max_n + 1, len(tokens) + 1)):
            for i in range(len(tokens) - n + 1):
                ngram = ' '.join(tokens[i:i+n])
                ngrams.append(ngram)
        
        return ngrams
    
    def _validate_relevance_with_llm(self, text, skills, context=None, 
                                     model="mistral:7b", relevance_threshold=0.5):
        """
        Use a local LLM via Ollama to validate skill relevance to the text context.
        """
        if not skills:
            return skills
        
        # Limit to top 30 skills (reduced from 50 to prevent truncation)
        skills_to_validate = skills[:30]
        skill_names = [s['canonical_name'] for s in skills_to_validate]
        
        # Build context description
        if context:
            context_desc = f" about {context}"
        else:
            context_desc = ""
        
        # Create prompt with numbered skills for reliable matching
        skills_list = "\n".join([f"{i}: {name}" for i, name in enumerate(skill_names)])
        
        prompt = f"""Rate each skill's relevance to this text{context_desc}:

{text[:1500]}

Rate from 0.0 (not relevant/incidental) to 1.0 (core requirement).

Skills:
{skills_list}

Return ONLY JSON with skill numbers as keys and scores as values.
Example: {{"0": 0.95, "1": 0.1, "2": 0.85}}

JSON:"""
        
        try:
            # Call Ollama
            response = ollama.generate(
                model=model,
                prompt=prompt,
                options={
                    "temperature": 0,
                    "num_predict": 3500,
                }
            )
            
            response_text = response['response'].strip()
            
            # Parse JSON from response
            try:
                start_idx = response_text.find('{')
                if start_idx != -1:
                    brace_count = 0
                    end_idx = start_idx
                    for i, char in enumerate(response_text[start_idx:], start_idx):
                        if char == '{':
                            brace_count += 1
                        elif char == '}':
                            brace_count -= 1
                            if brace_count == 0:
                                end_idx = i + 1
                                break
                    json_str = response_text[start_idx:end_idx]
                    relevance_scores = json.loads(json_str)
                else:
                    print(f"Warning: No JSON found in LLM response")
                    return skills
            except json.JSONDecodeError as e:
                print(f"Warning: Could not parse JSON from LLM response: {e}")
                return skills
            
            # Apply relevance scores using index-based matching
            validated_skills = []
            unmatched_count = 0
            for i, skill in enumerate(skills_to_validate):
                # json.loads always yields string keys, so look up by str(i)
                raw_score = relevance_scores.get(str(i))
                try:
                    relevance = float(raw_score)
                except (TypeError, ValueError):
                    # Missing or non-numeric rating - assign low score
                    relevance = 0.1
                    unmatched_count += 1
                
                skill['relevance_score'] = relevance
                
                if relevance >= relevance_threshold:
                    validated_skills.append(skill)
            
            if unmatched_count > 0:
                print(f"Note: LLM did not rate {unmatched_count} skills (assigned 0.1)")
            
            validated_skills.sort(key=lambda x: (x.get('relevance_score', 0), x['score']), reverse=True)
            
            return validated_skills
            
        except Exception as e:
            print(f"LLM validation failed: {e}")
            return skills
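
The brace-counting parser above can be thrown off when a string value in the LLM output itself contains `{` or `}`. As a possible alternative, here is a minimal sketch that lets the standard library's `json.JSONDecoder.raw_decode` find the end of the object. The helper name `extract_first_json_object` is hypothetical and not part of `SkillExtractor`.

```python
import json

def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails after it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = text.find("{", start + 1)
    return None

scores = extract_first_json_object('Sure! {"0": 0.95, "1": 0.1} Hope this helps.')
print(scores)
```

Because the decoder understands JSON string escaping, braces inside quoted values do not end the object early.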

In [None]:
# Example usage
extractor = SkillExtractor()

# Test with sample text
sample_text = """
I have 5 years of experience in Python programming and machine learning. 
I've worked extensively with TensorFlow and PyTorch for deep learning projects.
I'm also proficient in SQL databases and cloud platforms like AWS.
"""

# Extract skills with LLM relevance validation (default)
detected = extractor.extract_from_text(
    sample_text, 
    threshold=0.6, 
    dedup_threshold=0.85,
    apply_length_penalty=True,
    context="software engineer resume"
)

print(f"Detected {len(detected)} skills:")
print("-" * 70)
for skill in detected[:15]:
    relevance = skill.get('relevance_score', 'N/A')
    rel_str = f"{relevance:.2f}" if isinstance(relevance, float) else str(relevance)
    print(f"  {skill['canonical_name']:30} | Score: {skill['score']:.2f} | Relevance: {rel_str}")

In [None]:
# Test with marine biologist job description
marine_bio_text = '''
A marine biologist studies marine life and ecosystems to understand and protect them. Job duties include conducting research through fieldwork and laboratory work, collecting and analyzing data, monitoring marine populations, and developing conservation strategies. They also write reports, communicate findings to stakeholders and the public, and may work with government agencies or conservation groups. 

Core responsibilities:
• Conduct research: Study marine organisms, their behavior, life cycles, and interactions with their environment. 
• Fieldwork: Go to marine environments to observe wildlife, collect samples (water, organisms, sediment), and conduct surveys. 
• Laboratory work: Analyze collected samples, conduct experiments, and process data. 
• Data analysis: Interpret findings using statistical software and GIS to understand populations and environments. 
• Conservation and management: Develop and implement programs to protect marine life, restore habitats, and manage ecosystems. 
• Reporting and communication: Write research papers, create reports, and present findings to the public, policymakers, and other scientists. 

Typical activities:
• Monitoring marine animal populations and the effects of human activity. 
• Testing sea creatures for pollutants. 
• Using equipment such as SCUBA gear for fieldwork. 
• Working with government agencies or non-profit organizations on conservation efforts. 
• Writing grant proposals for research funding. 
• Attending conferences to share and learn about scientific advancements. 
'''

# Extract skills with relevance validation
detected = extractor.extract_from_text(
    marine_bio_text, 
    threshold=0.6, 
    dedup_threshold=0.85, 
    apply_length_penalty=True,
    context="marine biologist job description",
    relevance_threshold=0.3
)

print(f"Marine Biologist - Detected {len(detected)} relevant skills:")
print("-" * 70)
for skill in detected[:20]:
    relevance = skill.get('relevance_score', 'N/A')
    rel_str = f"{relevance:.2f}" if isinstance(relevance, float) else str(relevance)
    print(f"  {skill['canonical_name']:30} | Score: {skill['score']:.2f} | Relevance: {rel_str}")

In [None]:
# Test with product manager job description
product_manager_text = '''
We are seeking an experienced Product Manager to lead our product development initiatives. The ideal candidate will drive product strategy, work closely with engineering teams, and deliver exceptional user experiences.

Key Responsibilities:
• Define product vision, strategy, and roadmap aligned with business objectives
• Gather and prioritize product requirements from stakeholders and customers
• Work with UX designers to create intuitive user interfaces and experiences
• Collaborate with engineering teams to deliver features on time and within scope
• Analyze market trends, competitive landscape, and customer feedback
• Define and track key performance indicators (KPIs) and success metrics
• Lead sprint planning, backlog grooming, and agile ceremonies
• Communicate product updates to executives, sales teams, and customers

Requirements:
• 5+ years of product management experience in SaaS or technology companies
• Strong analytical skills with experience in data-driven decision making
• Excellent communication and presentation skills
• Experience with agile methodologies (Scrum, Kanban)
• Proficiency with product management tools (Jira, Confluence, Figma)
• MBA or technical degree preferred

Benefits:
• Competitive salary and equity package
• Health, dental, and vision insurance
• 401(k) matching
• Flexible work arrangements
'''

# Extract skills with relevance validation
detected = extractor.extract_from_text(
    product_manager_text, 
    threshold=0.6, 
    dedup_threshold=0.85, 
    apply_length_penalty=True,
    context="product manager job description",
    relevance_threshold=0.3
)

print(f"Product Manager - Detected {len(detected)} relevant skills:")
print("-" * 70)
for skill in detected[:20]:
    relevance = skill.get('relevance_score', 'N/A')
    rel_str = f"{relevance:.2f}" if isinstance(relevance, float) else str(relevance)
    print(f"  {skill['canonical_name']:30} | Score: {skill['score']:.2f} | Relevance: {rel_str}")

In [None]:
# Quick test - short single-sentence input
test_text = "Looking for a data science expert with machine learning experience"
results = extractor.extract_from_text(
    test_text, 
    context="job requirement",
    relevance_threshold=0.3
)

print(f"Quick test - {len(results)} skills found:")
print("-" * 70)
for skill in results[:5]:
    relevance = skill.get('relevance_score', 'N/A')
    rel_str = f"{relevance:.2f}" if isinstance(relevance, float) else str(relevance)
    print(f"  {skill['canonical_name']:30} | Score: {skill['score']:.2f} | Relevance: {rel_str}")

## Summary and Improvements

### Key Improvements in this version:

1. **Better Variation Generation**
   - Conservative approach for single-word skills
   - Filters out common words that cause false positives
   - Reduces noise from generic terms like "projects", "management", etc.

2. **Length-Based Scoring Penalty**
   - Prevents partial matches from scoring too high
   - E.g., matching "projects" to "Project Management" gets penalized
   - Configurable via `apply_length_penalty` parameter

3. **Similar Skill Deduplication**
   - Removes redundant results (e.g., "Machine Learning" and "ML")
   - Keeps the highest-scoring match from each cluster
   - Configurable via `deduplicate_similar` and `dedup_threshold`

4. **Better Diagnostics**
   - Returns both adjusted score and raw similarity
   - Shows which variation was matched
   - Easier to debug and tune thresholds
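
To make items 2 and 3 concrete, the sketch below shows one way a length penalty and a greedy deduplication pass could work. It is an illustration under stated assumptions, not the notebook's actual implementation: the penalty formula (ratio of shorter to longer string length), the function names, and the toy 2-D embeddings are all hypothetical.

```python
import numpy as np

def length_penalized_score(raw_similarity, matched_text, skill_name):
    """Scale similarity by the length ratio (hypothetical formula)."""
    shorter = min(len(matched_text), len(skill_name))
    longer = max(len(matched_text), len(skill_name), 1)
    return raw_similarity * shorter / longer

def deduplicate(results, embeddings, dedup_threshold=0.85):
    """Greedy pass: keep the best-scoring result, drop near-duplicates of kept ones."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = sorted(range(len(results)), key=lambda i: results[i]["score"], reverse=True)
    kept = []
    for i in order:
        # Cosine similarity against every result already kept
        if all(float(unit[i] @ unit[j]) < dedup_threshold for j in kept):
            kept.append(i)
    return [results[i] for i in kept]

# A partial match such as "projects" vs "Project Management" is scaled down.
print(length_penalized_score(0.9, "projects", "Project Management"))

# Toy embeddings: the first two rows point in almost the same direction.
results = [
    {"canonical_name": "ML", "score": 0.7},
    {"canonical_name": "Machine Learning", "score": 0.9},
    {"canonical_name": "SQL", "score": 0.8},
]
toy_embeddings = np.array([[1.0, 0.05], [1.0, 0.0], [0.0, 1.0]])
print([r["canonical_name"] for r in deduplicate(results, toy_embeddings)])
```

Processing results in descending score order is what guarantees the highest-scoring member of each cluster survives.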

### Tuning Recommendations:

- **For higher precision**: Increase `threshold` (e.g., 0.7) and `dedup_threshold` (e.g., 0.9)
- **For higher recall**: Decrease `threshold` (e.g., 0.4) and set `apply_length_penalty=False`
- **For balanced results**: Use the defaults (`threshold=0.6`, `dedup_threshold=0.85`, `apply_length_penalty=True`)
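
One convenient way to switch between these settings is to keep them as named presets and unpack them into `extract_from_text`. The sketch below only encodes the values listed above. The names `BALANCED`, `PRESETS`, and `settings_for` are hypothetical helpers, not part of the notebook.

```python
# Balanced defaults, as listed in the tuning recommendations.
BALANCED = {"threshold": 0.6, "dedup_threshold": 0.85, "apply_length_penalty": True}

# Each preset lists only the values it overrides.
PRESETS = {
    "balanced": {},
    "precision": {"threshold": 0.7, "dedup_threshold": 0.9},
    "recall": {"threshold": 0.4, "apply_length_penalty": False},
}

def settings_for(goal):
    """Merge a preset's overrides onto the balanced defaults."""
    return {**BALANCED, **PRESETS[goal]}

# Usage (assumes the `extractor` instance created earlier in the notebook):
# detected = extractor.extract_from_text(sample_text, **settings_for("precision"))
print(settings_for("recall"))
```

Treat the numbers as starting points; the right values depend on your text and taxonomy.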

### Next Steps:

1. Test on your actual text data and adjust thresholds
2. Add more common words to the filter list if needed
3. Consider adding exact/fuzzy matching as a complement to semantic search
4. Evaluate on labeled data to measure precision/recall
5. Fine-tune the length penalty formula for your specific use case
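
For step 4, a minimal sketch of set-based precision and recall is shown below. It assumes you have a hand-labeled list of expected skills per document and compares names case-insensitively. The function name and the sample labels are illustrative, not real evaluation data.

```python
def precision_recall(predicted, expected):
    """Compare predicted skill names against labeled ones, ignoring case."""
    pred = {name.lower() for name in predicted}
    gold = {name.lower() for name in expected}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Illustrative labels only; in practice `predicted` would come from
# [s["canonical_name"] for s in extractor.extract_from_text(...)].
predicted = ["Python", "SQL", "Excel"]
expected = ["python", "SQL", "AWS", "TensorFlow"]
precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running this across a labeled set while sweeping `threshold` gives a direct view of the precision/recall trade-off described in the tuning recommendations.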