# Skill Taxonomy Builder with Embeddings (Improved Extraction)

This notebook helps you:
1. Build a taxonomy structure from raw skills
2. Generate high-quality variations and abbreviations (with improved filtering)
3. Compute and store embeddings efficiently
4. Set up NumPy-based similarity search with deduplication and scoring improvements

## Prerequisites
```bash
pip install sentence-transformers pandas numpy scikit-learn rapidfuzz pyarrow
```

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import json
from collections import defaultdict
from rapidfuzz import fuzz, process
from sklearn.cluster import AgglomerativeClustering
import re
from pathlib import Path
import gc

## Step 1: Load and Build Taxonomy Structure

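We'll analyze your 35,000 skills to create a hierarchical taxonomy using:
- Pattern matching for common categories
- Parent-child relationships

Hierarchical clustering based on semantic similarity (`AgglomerativeClustering` is imported above) is a natural extension, but the cells below do not implement it.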
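
Substring tests on short keywords can misfire: `'go'` also occurs inside "MongoDB" and "Google". The sketch below shows one way to do pattern matching on whole tokens instead. The `CATEGORY_KEYWORDS` table and the helper names are hypothetical and shortened for illustration; they are not part of the notebook's code.

```python
import re

# Hypothetical, shortened keyword table; extend it for your domain
CATEGORY_KEYWORDS = {
    "Programming Languages": ["python", "java", "go", "c++"],
    "Databases": ["sql", "mongodb", "postgresql"],
}

def keyword_pattern(keyword):
    # Require a non-alphanumeric character (or the string edge) on both sides
    return re.compile(rf"(?<![a-z0-9]){re.escape(keyword)}(?![a-z0-9])")

COMPILED = {
    category: [keyword_pattern(kw) for kw in keywords]
    for category, keywords in CATEGORY_KEYWORDS.items()
}

def categorize(skill_name):
    """Return the first category with a whole-token keyword hit, else 'General'."""
    skill_lower = skill_name.lower()
    for category, patterns in COMPILED.items():
        if any(p.search(skill_lower) for p in patterns):
            return category
    return "General"

print(categorize("Go"))                # Programming Languages
print(categorize("MongoDB"))           # Databases
print(categorize("Google Analytics"))  # General
```

Whole-token matching trades recall for precision: plurals such as "databases" no longer match the keyword "database", so you may need to list such forms explicitly.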

In [None]:
# Load your skills file
# Adjust the path and format as needed
def load_skills(filepath):
    """
    Load skills from a text file (one skill per line)
    Returns a pandas DataFrame
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        skills = [line.strip() for line in f if line.strip()]
    
    df = pd.DataFrame({
        'skill_id': [f'SKILL_{i:05d}' for i in range(len(skills))],
        'canonical_name': skills,
        'normalized_name': [s.lower().strip() for s in skills]
    })
    
    return df

# Load your skills
skills_df = load_skills('your_skills_file.txt')  # Update this path
print(f"Loaded {len(skills_df)} skills")
skills_df.head()

In [None]:
# Basic taxonomy building using keyword patterns
def extract_category_from_patterns(skill_name):
    """
    Extract likely category based on common patterns
    Customize these patterns based on your domain
    """
    skill_lower = skill_name.lower()
    
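    # Programming languages
    # Match whole tokens: a plain substring test would find 'go' in 'Django' or 'MongoDB'
    prog_langs = ['python', 'java', 'javascript', 'c++', 'ruby', 'go', 'rust', 'php', 'swift']
    if any(re.search(rf'(?<![a-z0-9]){re.escape(lang)}(?![a-z0-9])', skill_lower)
           for lang in prog_langs):
        return 'Programming Languages'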
    
    # Data Science & ML
    ds_keywords = ['machine learning', 'data science', 'deep learning', 'neural network', 
                   'tensorflow', 'pytorch', 'scikit-learn', 'nlp', 'computer vision']
    if any(kw in skill_lower for kw in ds_keywords):
        return 'Data Science & AI'
    
    # Cloud & DevOps
    cloud_keywords = ['aws', 'azure', 'gcp', 'docker', 'kubernetes', 'terraform', 'ci/cd', 'devops']
    if any(kw in skill_lower for kw in cloud_keywords):
        return 'Cloud & DevOps'
    
    # Databases
    db_keywords = ['sql', 'database', 'postgresql', 'mongodb', 'redis', 'mysql', 'oracle']
    if any(kw in skill_lower for kw in db_keywords):
        return 'Databases'
    
    # Web Development
    web_keywords = ['html', 'css', 'react', 'angular', 'vue', 'frontend', 'backend', 'web development']
    if any(kw in skill_lower for kw in web_keywords):
        return 'Web Development'
    
    # Add more categories as needed
    
    return 'General'

# Apply pattern-based categorization
skills_df['category'] = skills_df['canonical_name'].apply(extract_category_from_patterns)

print("\nCategory distribution:")
print(skills_df['category'].value_counts())

In [None]:
# Detect parent-child relationships
def find_parent_skills(skill, all_skills):
    """
    Find potential parent skills (broader skills whose name is contained in this one)
    Example: 'Python' is a parent of 'Python Django'
    """
    skill_lower = skill.lower()
    parents = []
    
    for other_skill in all_skills:
        if skill == other_skill:
            continue
            
        other_lower = other_skill.lower()
        
        # Check if skill contains the other (other is more general)
        if other_lower in skill_lower and other_lower != skill_lower:
            # Check token overlap to avoid false positives
            skill_tokens = set(skill_lower.split())
            other_tokens = set(other_lower.split())
            
            if other_tokens.issubset(skill_tokens):
                parents.append(other_skill)
    
    return parents

# Find parent relationships (an O(n^2) pairwise scan: about 1.2 billion
# comparisons for 35k skills, so this can take a while)
print("Finding parent-child relationships...")
all_skill_names = skills_df['canonical_name'].tolist()
skills_df['parent_skills'] = skills_df['canonical_name'].apply(
    lambda x: find_parent_skills(x, all_skill_names)
)

# Show some examples
print("\nSkills with parents:")
skills_with_parents = skills_df[skills_df['parent_skills'].apply(len) > 0]
print(f"Found {len(skills_with_parents)} skills with parent relationships")
skills_with_parents[['canonical_name', 'parent_skills']].head(10)

## Step 2: Generate High-Quality Variations and Abbreviations

We'll use a data-driven approach with improved filtering:
- Rule-based abbreviation generation
- Common typo patterns
- Case variations (filtered to avoid false positives)
- Token reordering for multi-word skills
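
As a minimal sketch of the rule-based part, the snippet below builds an initialism from the non-stopword tokens and the hyphen/underscore variants of a multi-word skill. The function names `initialism` and `separator_variants` are illustrative and do not appear in the notebook.

```python
STOPWORDS = {"and", "or", "the", "of", "for", "with", "in", "on", "at"}

def initialism(skill_name):
    """First letter of every non-stopword token, or None for single-token names."""
    tokens = [t for t in skill_name.split() if t.lower() not in STOPWORDS]
    if len(tokens) < 2:
        return None
    return "".join(t[0].upper() for t in tokens)

def separator_variants(skill_name):
    """Hyphen and underscore spellings of a multi-word skill."""
    if " " not in skill_name:
        return []
    return [skill_name.replace(" ", sep) for sep in ("-", "_")]

print(initialism("Natural Language Processing"))  # NLP
print(initialism("Internet of Things"))           # IT
print(separator_variants("Machine Learning"))     # ['Machine-Learning', 'Machine_Learning']
```

The second call shows why rule-based output needs review: dropping the stopword yields "IT", whereas the abbreviation people actually write is "IoT". Cases like this belong in a hand-maintained table of known abbreviations.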

In [None]:
def generate_abbreviations(skill_name):
    """
    Generate likely abbreviations using rules
    """
    abbreviations = set()
    
    # Remove common words that are typically not abbreviated
    stopwords = {'and', 'or', 'the', 'of', 'for', 'with', 'in', 'on', 'at'}
    
    tokens = skill_name.split()
    filtered_tokens = [t for t in tokens if t.lower() not in stopwords]
    
    if len(filtered_tokens) > 1:
        # First letter of each word
        abbr = ''.join([t[0].upper() for t in filtered_tokens])
        abbreviations.add(abbr)
        
        # First letter lowercase version
        abbreviations.add(abbr.lower())
        
        # Common pattern: First word + first letter of others
        if len(filtered_tokens) >= 2:
            first_word = filtered_tokens[0]
            rest_abbr = ''.join([t[0].upper() for t in filtered_tokens[1:]])
            abbreviations.add(f"{first_word}{rest_abbr}")
    
    # Known common abbreviations (add your domain-specific ones)
    known_abbrevs = {
        'machine learning': ['ML', 'ml'],
        'artificial intelligence': ['AI', 'ai'],
        'natural language processing': ['NLP', 'nlp'],
        'computer vision': ['CV', 'cv'],
        'deep learning': ['DL', 'dl'],
        'data science': ['DS', 'ds'],
        'application programming interface': ['API', 'api'],
        'structured query language': ['SQL', 'sql'],
        'continuous integration': ['CI', 'ci'],
        'continuous deployment': ['CD', 'cd'],
    }
    
    skill_lower = skill_name.lower()
    for phrase, abbrevs in known_abbrevs.items():
        if phrase in skill_lower:
            abbreviations.update(abbrevs)
    
    return list(abbreviations)

def generate_common_typos(skill_name):
    """
    Generate common typo patterns
    """
    typos = set()
    skill_lower = skill_name.lower()
    
    # Common character swaps
    swaps = [('ie', 'ei'), ('ph', 'f'), ('tion', 'sion')]
    for old, new in swaps:
        if old in skill_lower:
            typos.add(skill_lower.replace(old, new))
    
    # Double letter removals (programming -> programing)
    for i in range(len(skill_lower) - 1):
        if skill_lower[i] == skill_lower[i+1]:
            typo = skill_lower[:i] + skill_lower[i+1:]
            typos.add(typo)
    
    return list(typos)

def generate_variations(skill_name):
    """
    Generate all variations of a skill with improved filtering
    """
    variations = set()
    tokens = skill_name.split()
    
    # For single-word skills, be more conservative with variations
    if len(tokens) == 1:
        variations.add(skill_name.lower())
        variations.add(skill_name.upper())
        # Only add abbreviations for longer technical terms
        if len(skill_name) > 4:
            variations.update(generate_abbreviations(skill_name))
        variations.discard(skill_name)
        return list(variations)
    
    # For multi-word skills, generate full variations
    variations.add(skill_name)
    variations.add(skill_name.lower())
    variations.add(skill_name.upper())
    variations.add(skill_name.title())
    
    # Abbreviations
    variations.update(generate_abbreviations(skill_name))
    
    # Common typos (limit to avoid explosion)
    typos = generate_common_typos(skill_name)
    variations.update(typos[:5])
    
    # Token reordering for 2-word skills
    if len(tokens) == 2:
        variations.add(f"{tokens[1]} {tokens[0]}")
    
    # Hyphen/underscore variations
    if ' ' in skill_name:
        variations.add(skill_name.replace(' ', '-'))
        variations.add(skill_name.replace(' ', '_'))
    
    # Remove the original to avoid duplication
    variations.discard(skill_name)
    
    # Filter out problematic variations (common words that cause false positives)
    common_words = {'project', 'projects', 'management', 'analysis', 'development', 
                    'design', 'testing', 'planning', 'support', 'systems', 'data',
                    'business', 'technical', 'customer', 'service', 'process'}
    
    filtered_variations = []
    for var in variations:
        var_lower = var.lower()
        # Keep variations that:
        # 1. Are not single common words, OR
        # 2. Have special characters (hyphens, underscores)
        if var_lower not in common_words or ' ' in var or '-' in var or '_' in var:
            filtered_variations.append(var)
    
    return filtered_variations

# Generate variations for all skills
print("Generating variations...")
skills_df['variations'] = skills_df['canonical_name'].apply(generate_variations)

# Show statistics
avg_variations = skills_df['variations'].apply(len).mean()
total_variations = skills_df['variations'].apply(len).sum()
print(f"\nGenerated {total_variations:,} total variations")
print(f"Average {avg_variations:.1f} variations per skill")

# Show examples
print("\nExample variations:")
for idx in skills_df.sample(min(5, len(skills_df))).index:
    skill = skills_df.loc[idx, 'canonical_name']
    skill_variations = skills_df.loc[idx, 'variations']  # avoid shadowing the built-in vars()
    print(f"\n{skill}:")
    print(f"  {skill_variations[:10]}")  # Show first 10

## Step 3: Compute and Store Embeddings

We'll use sentence-transformers to generate embeddings for:
- Canonical skill names
- All variations

These will be pre-computed and stored for fast loading.
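
Different skills can produce the same variation string (for example, every skill containing "machine learning" receives "ML"), so it may be worth encoding each distinct string only once. The sketch below illustrates the idea. `encode_unique` is a hypothetical helper, and `fake_encode` is a stand-in for `model.encode` so that the example runs without downloading a model.

```python
def encode_unique(texts, encode_fn):
    """Encode each distinct string once, then map results back to input order."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    vectors = encode_fn(unique_texts)
    lookup = dict(zip(unique_texts, vectors))
    return [lookup[t] for t in texts]

# Stand-in for model.encode (an assumption made for this sketch only):
# returns a toy 2-D vector per string and counts how many strings it saw.
def fake_encode(batch):
    fake_encode.calls += len(batch)
    return [[float(len(s)), float(s.count(" "))] for s in batch]

fake_encode.calls = 0

texts = ["ML", "machine learning", "ML", "ml", "machine learning"]
vectors = encode_unique(texts, fake_encode)

print(len(vectors))       # 5, one vector per input row
print(fake_encode.calls)  # 3, only the distinct strings were encoded
```

With a real model you would pass something like `lambda batch: model.encode(batch, batch_size=32)` as `encode_fn`; whether the saving is significant depends on how many duplicate strings your variations contain.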

In [None]:
# Load embedding model
# Options:
# - 'all-MiniLM-L6-v2': Fast, good balance (384 dimensions)
# - 'multi-qa-MiniLM-L6-cos-v1': Better for asymmetric search
# - 'all-mpnet-base-v2': Higher quality, slower (768 dimensions)

print("Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding_dim = model.get_sentence_embedding_dimension()
print(f"Model loaded. Embedding dimension: {embedding_dim}")

In [None]:
# Compute embeddings for canonical names
print("Computing embeddings for canonical skill names...")
canonical_names = skills_df['canonical_name'].tolist()
canonical_embeddings = model.encode(
    canonical_names,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Store embeddings in dataframe
skills_df['embedding'] = list(canonical_embeddings)

print(f"Computed {len(canonical_embeddings)} embeddings")
print(f"Embedding shape: {canonical_embeddings.shape}")

In [None]:
# Create variation-to-skill mapping with embeddings
print("\nComputing embeddings for all variations...")

variation_data = []
for idx, row in skills_df.iterrows():
    skill_id = row['skill_id']
    canonical = row['canonical_name']
    
    # Add canonical name
    variation_data.append({
        'skill_id': skill_id,
        'canonical_name': canonical,
        'variation': canonical,
        'is_canonical': True
    })
    
    # Add all variations
    for var in row['variations']:
        variation_data.append({
            'skill_id': skill_id,
            'canonical_name': canonical,
            'variation': var,
            'is_canonical': False
        })

variations_df = pd.DataFrame(variation_data)
print(f"Total entries (canonical + variations): {len(variations_df):,}")

# Compute embeddings for all variations
all_variations = variations_df['variation'].tolist()
variation_embeddings = model.encode(
    all_variations,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

variations_df['embedding'] = list(variation_embeddings)
print(f"\nComputed {len(variation_embeddings):,} variation embeddings")

In [None]:
# Save the complete taxonomy with embeddings
print("Saving taxonomy...")

# Save as Parquet (preserves embeddings)
skills_df.to_parquet('skill_taxonomy.parquet', index=False)
variations_df.to_parquet('skill_variations.parquet', index=False)

# Also save as JSON for human readability (without embeddings)
skills_json = []
for idx, row in skills_df.iterrows():
    skills_json.append({
        'skill_id': row['skill_id'],
        'canonical_name': row['canonical_name'],
        'category': row['category'],
        'parent_skills': row['parent_skills'],
        'variations': row['variations'],
        # Embeddings excluded from JSON to keep file size reasonable
    })

with open('skill_taxonomy.json', 'w', encoding='utf-8') as f:
    json.dump(skills_json, f, indent=2, ensure_ascii=False)

print("✓ Saved skill_taxonomy.parquet (with embeddings)")
print("✓ Saved skill_variations.parquet (with embeddings)")
print("✓ Saved skill_taxonomy.json (human-readable, no embeddings)")

## Step 4: Prepare NumPy-based Similarity Search

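Instead of FAISS (which can crash with segmentation faults in some environments), we'll use NumPy for similarity search.
A brute-force dot product over normalized embeddings is stable, portable, and fast enough at this scale (tens of thousands of skills plus their variations).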
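
The reason normalization matters: once every vector has unit length, a plain dot product equals cosine similarity, so search reduces to one matrix-vector product. Here is a small worked check in plain Python (the notebook does the same thing with NumPy arrays); the helper names are illustrative.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # avoid dividing by zero
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity on un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]

# Both lines print approximately 0.96 (24 / 25)
print(cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

The zero-norm guard is an addition in this sketch; the notebook's normalization divides directly by the norm, which would produce NaN for an all-zero embedding.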

In [None]:
# Normalize and save embeddings for similarity search
print("Preparing embeddings for similarity search...")

# Normalize canonical embeddings
canonical_embeddings_normalized = canonical_embeddings.astype('float32')
canonical_embeddings_normalized = canonical_embeddings_normalized / np.linalg.norm(
    canonical_embeddings_normalized, axis=1, keepdims=True
)

# Save normalized canonical embeddings
np.save('skill_canonical_embeddings.npy', canonical_embeddings_normalized)
print(f"✓ Saved {len(canonical_embeddings_normalized)} canonical embeddings")

# Normalize variation embeddings
variation_embeddings_normalized = variation_embeddings.astype('float32')
variation_embeddings_normalized = variation_embeddings_normalized / np.linalg.norm(
    variation_embeddings_normalized, axis=1, keepdims=True
)

# Save normalized variation embeddings
np.save('skill_variations_embeddings.npy', variation_embeddings_normalized)
print(f"✓ Saved {len(variation_embeddings_normalized)} variation embeddings")

print("\n✓ All embeddings saved and ready for fast similarity search!")

# Cleanup to free memory
del canonical_embeddings_normalized
del variation_embeddings_normalized
gc.collect()

## Runtime Usage: Improved SkillExtractor Class

This version includes:
- Length-based scoring penalty (prevents partial matches from scoring too high)
- Similar skill deduplication (removes redundant results)
- Better handling of edge cases
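
To make the length penalty concrete, here is a standalone sketch of the same rule: scale the similarity by the word-count ratio, but never by less than 0.5. The `floor` parameter is an assumption added for illustration; the class hard-codes 0.5.

```python
def length_penalty(similarity, ngram, skill_name, floor=0.5):
    """Scale similarity down when the n-gram has fewer words than the skill name."""
    ngram_len = len(ngram.split())
    skill_len = len(skill_name.split())
    if ngram_len >= skill_len:
        return similarity
    return similarity * max(floor, ngram_len / skill_len)

# 1 word vs 2 words: ratio 1/2
print(length_penalty(0.8, "projects", "Project Management"))           # 0.4
# 1 word vs 3 words: ratio 1/3 is floored to 0.5
print(length_penalty(0.8, "python", "Python Programming Language"))    # 0.4
# Same length: no penalty
print(length_penalty(0.8, "project management", "Project Management")) # 0.8
```

Because of the floor, a one-word match against a two-word skill and against a three-word skill receive the same factor; lower `floor` if you want longer skills to be penalized more.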

In [None]:
class SkillExtractor:
    """
    Fast skill extraction using pre-built taxonomy and numpy similarity search
    with improved scoring and deduplication
    """
    def __init__(self, 
                 taxonomy_path='skill_taxonomy.parquet',
                 variations_path='skill_variations.parquet',
                 canonical_embeddings_path='skill_canonical_embeddings.npy',
                 variations_embeddings_path='skill_variations_embeddings.npy',
                 model_name='all-MiniLM-L6-v2'):
        
        print("Loading skill extractor...")
        
        # Load taxonomy
        self.skills_df = pd.read_parquet(taxonomy_path)
        self.variations_df = pd.read_parquet(variations_path)
        
        # Load embeddings (numpy arrays)
        self.canonical_embeddings = np.load(canonical_embeddings_path)
        self.variations_embeddings = np.load(variations_embeddings_path)
        
        # Load embedding model
        self.model = SentenceTransformer(model_name)
        
        print(f"✓ Loaded {len(self.skills_df)} skills")
        print(f"✓ Loaded {len(self.variations_df)} variations")
        print("✓ Ready for extraction")
    
    def extract_from_text(self, text, threshold=0.5, top_k=10, use_variations=True,
                         deduplicate_similar=True, dedup_threshold=0.85,
                         apply_length_penalty=True):
        """
        Extract skills from text using numpy-based similarity search
        
        Args:
            text: Input text to extract skills from
            threshold: Minimum similarity threshold (0-1)
            top_k: Number of top matches to consider per n-gram
            use_variations: Whether to search variations or just canonical names
            deduplicate_similar: Remove skills that are very similar to each other
            dedup_threshold: Similarity threshold for considering skills duplicates
            apply_length_penalty: Penalize matches where ngram is much shorter than skill
        
        Returns:
            List of detected skills with similarity scores
        """
        # Generate n-grams from text
        ngrams = self._generate_ngrams(text, max_n=5)
        
        if not ngrams:
            return []
        
        # Encode n-grams
        ngram_embeddings = self.model.encode(ngrams, convert_to_numpy=True)
        ngram_embeddings = ngram_embeddings / np.linalg.norm(
            ngram_embeddings, axis=1, keepdims=True
        )
        
        # Search for matches
        detected_skills = {}
        
        for i, ngram in enumerate(ngrams):
            query = ngram_embeddings[i:i+1]
            
            if use_variations:
                # Compute cosine similarity with all variations (dot product)
                similarities = np.dot(self.variations_embeddings, query.T).flatten()
                
                # Get top k indices
                if len(similarities) > top_k:
                    top_indices = np.argpartition(similarities, -top_k)[-top_k:]
                    top_indices = top_indices[np.argsort(similarities[top_indices])][::-1]
                else:
                    top_indices = np.argsort(similarities)[::-1]
                
                for idx in top_indices:
                    dist = similarities[idx]
                    if dist >= threshold:
                        match = self.variations_df.iloc[idx]
                        skill_id = match['skill_id']
                        canonical = match['canonical_name']
                        matched_variation = match['variation']
                        
                        # Apply length penalty
                        if apply_length_penalty:
                            adjusted_score = self._apply_length_penalty(
                                dist, ngram, canonical
                            )
                        else:
                            adjusted_score = dist
                        
                        # Only update if this is a better match
                        if skill_id not in detected_skills or adjusted_score > detected_skills[skill_id]['score']:
                            detected_skills[skill_id] = {
                                'canonical_name': canonical,
                                'matched_text': ngram,
                                'matched_variation': matched_variation,
                                'score': float(adjusted_score),
                                'raw_similarity': float(dist)
                            }
            else:
                # Search canonical embeddings
                similarities = np.dot(self.canonical_embeddings, query.T).flatten()
                
                if len(similarities) > top_k:
                    top_indices = np.argpartition(similarities, -top_k)[-top_k:]
                    top_indices = top_indices[np.argsort(similarities[top_indices])][::-1]
                else:
                    top_indices = np.argsort(similarities)[::-1]
                
                for idx in top_indices:
                    dist = similarities[idx]
                    if dist >= threshold:
                        match = self.skills_df.iloc[idx]
                        skill_id = match['skill_id']
                        canonical = match['canonical_name']
                        
                        # Apply length penalty
                        if apply_length_penalty:
                            adjusted_score = self._apply_length_penalty(
                                dist, ngram, canonical
                            )
                        else:
                            adjusted_score = dist
                        
                        if skill_id not in detected_skills or adjusted_score > detected_skills[skill_id]['score']:
                            detected_skills[skill_id] = {
                                'canonical_name': canonical,
                                'matched_text': ngram,
                                'score': float(adjusted_score),
                                'raw_similarity': float(dist)
                            }
        
        # Sort by score
        results = sorted(detected_skills.values(), key=lambda x: x['score'], reverse=True)
        
        # Deduplicate similar skills
        if deduplicate_similar and len(results) > 1:
            results = self._deduplicate_similar_skills(results, dedup_threshold)
        
        return results
    
    def _apply_length_penalty(self, similarity, ngram, canonical_skill):
        """
        Apply penalty when matched n-gram is much shorter than the skill name
        This prevents partial matches from scoring too high
        """
        ngram_len = len(ngram.split())
        skill_len = len(canonical_skill.split())
        
        if ngram_len < skill_len:
            # Penalize based on length difference
            # E.g., if ngram is 1 word and skill is 3 words, the ratio is 1/3
            length_penalty = ngram_len / skill_len
            # Floor the factor at 0.5 to avoid over-penalization
            # (so the 1/3 ratio above is applied as 0.5)
            penalty_factor = max(0.5, length_penalty)
            adjusted_score = similarity * penalty_factor
        else:
            adjusted_score = similarity
        
        return adjusted_score
    
    def _deduplicate_similar_skills(self, results, threshold=0.85):
        """
        Remove redundant skills that are very similar to higher-scoring skills
        """
        if len(results) <= 1:
            return results
        
        # Get embeddings for all detected skills
        skill_names = [r['canonical_name'] for r in results]
        skill_embeddings = self.model.encode(skill_names, convert_to_numpy=True)
        skill_embeddings = skill_embeddings / np.linalg.norm(skill_embeddings, axis=1, keepdims=True)
        
        # Keep track of which skills to keep
        keep_indices = []
        
        for i in range(len(results)):
            # Always keep the first (highest scoring)
            if i == 0:
                keep_indices.append(i)
                continue
            
            # Check similarity with all higher-scoring kept skills
            should_keep = True
            for j in keep_indices:
                similarity = np.dot(skill_embeddings[i], skill_embeddings[j])
                if similarity > threshold:
                    should_keep = False
                    break
            
            if should_keep:
                keep_indices.append(i)
        
        return [results[i] for i in keep_indices]
    
    def _generate_ngrams(self, text, max_n=5):
        """
        Generate n-grams from text (1 to max_n words)
        """
        # Clean and tokenize; keep '+' and '#' so skills like 'C++' and 'C#' survive
        text = re.sub(r'[^a-zA-Z0-9\s+#-]', ' ', text)
        tokens = text.lower().split()
        
        ngrams = []
        for n in range(1, min(max_n + 1, len(tokens) + 1)):
            for i in range(len(tokens) - n + 1):
                ngram = ' '.join(tokens[i:i+n])
                ngrams.append(ngram)
        
        return ngrams

In [None]:
# Example usage
extractor = SkillExtractor()

# Test with sample text
sample_text = """
I have 5 years of experience in Python programming and machine learning. 
I've worked extensively with TensorFlow and PyTorch for deep learning projects.
I'm also proficient in SQL databases and cloud platforms like AWS.
"""

detected = extractor.extract_from_text(
    sample_text, 
    threshold=0.6, 
    use_variations=True,
    deduplicate_similar=True,
    apply_length_penalty=True
)

print(f"\nDetected {len(detected)} skills:")
for skill in detected[:10]:  # Show top 10
    print(f"  • {skill['canonical_name']} (score: {skill['score']:.3f})")
    print(f"    Matched: '{skill['matched_text']}'")
    if 'matched_variation' in skill:
        print(f"    Via variation: '{skill['matched_variation']}'")
    if 'raw_similarity' in skill:
        print(f"    Raw similarity: {skill['raw_similarity']:.3f}")

## Summary and Improvements

### Key Improvements in this version:

1. **Better Variation Generation**
   - Conservative approach for single-word skills
   - Filters out common words that cause false positives
   - Reduces noise from generic terms like "projects", "management", etc.

2. **Length-Based Scoring Penalty**
   - Prevents partial matches from scoring too high
   - E.g., matching "projects" to "Project Management" gets penalized
   - Configurable via `apply_length_penalty` parameter

3. **Similar Skill Deduplication**
   - Removes redundant results (e.g., "Machine Learning" and "ML")
   - Keeps the highest-scoring match from each cluster
   - Configurable via `deduplicate_similar` and `dedup_threshold`

4. **Better Diagnostics**
   - Returns both adjusted score and raw similarity
   - Shows which variation was matched
   - Easier to debug and tune thresholds
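
The deduplication step can be sketched as a greedy pass over results sorted by score: a result is kept only if it is not too similar to anything already kept. The 2-D vectors below are made-up stand-ins for real embeddings, and `greedy_dedup` is an illustrative name rather than the class method.

```python
def greedy_dedup(results, vectors, threshold=0.85):
    """
    results: dicts sorted by score, highest first
    vectors: unit-length vectors in the same order as results
    """
    kept = []
    for i in range(len(results)):
        is_duplicate = any(
            sum(x * y for x, y in zip(vectors[i], vectors[j])) > threshold
            for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return [results[i] for i in kept]

results = [
    {"canonical_name": "Machine Learning", "score": 0.92},
    {"canonical_name": "ML", "score": 0.88},
    {"canonical_name": "SQL", "score": 0.75},
]
# Made-up, approximately unit-length vectors for illustration only
vectors = [[1.0, 0.0], [0.98, 0.199], [0.0, 1.0]]

deduped = greedy_dedup(results, vectors)
print([r["canonical_name"] for r in deduped])  # ['Machine Learning', 'SQL']
```

Because the pass is greedy, the outcome depends on the sort order: the highest-scoring member of a group of near-duplicates always survives.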

### Tuning Recommendations:

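- **For higher precision**: Increase `threshold` (e.g., 0.7) and `dedup_threshold` (e.g., 0.9; a higher value removes only near-identical skills)
- **For higher recall**: Decrease `threshold` (e.g., 0.4) and disable the length penalty with `apply_length_penalty=False` (the penalty lowers `score` but does not filter results, so this mainly matters if you filter on `score` downstream)
- **For balanced results**: Use the settings from the example above (`threshold=0.6`, `dedup_threshold=0.85`, `apply_length_penalty=True`); note that the function's own default for `threshold` is 0.5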
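
One inexpensive way to explore thresholds is to run extraction once with a low `threshold` and then count how many results would survive stricter cut-offs, using the `raw_similarity` field. The sketch below assumes result dictionaries shaped like the extractor's output; the numbers are invented for illustration.

```python
def sweep_thresholds(results, thresholds):
    """Count results whose raw similarity meets each candidate threshold."""
    return {
        t: sum(1 for r in results if r["raw_similarity"] >= t)
        for t in thresholds
    }

# Invented results, shaped like the extractor's output
results = [
    {"canonical_name": "Python", "raw_similarity": 0.91},
    {"canonical_name": "Machine Learning", "raw_similarity": 0.66},
    {"canonical_name": "Project Management", "raw_similarity": 0.52},
    {"canonical_name": "Planning", "raw_similarity": 0.45},
]

counts = sweep_thresholds(results, [0.4, 0.6, 0.7])
print(counts)  # {0.4: 4, 0.6: 2, 0.7: 1}
```

Treat this as an approximation: the extractor keeps only the best-scoring match per skill and deduplicates afterwards, so rerunning with the stricter threshold can give slightly different results than filtering after the fact.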

### Next Steps:

1. Test on your actual text data and adjust thresholds
2. Add more common words to the filter list if needed
3. Consider adding exact/fuzzy matching as a complement to semantic search
4. Evaluate on labeled data to measure precision/recall
5. Fine-tune the length penalty formula for your specific use case
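
For step 4, a minimal sketch of measuring precision and recall for a single document, assuming you have a hand-labeled list of expected skills. The example data is invented.

```python
def precision_recall(predicted, expected):
    """Set-based precision and recall over canonical skill names."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# Invented example: extractor output vs. hand labels for one document
predicted = ["Python", "Machine Learning", "Project Management"]
expected = ["Python", "Machine Learning", "SQL", "AWS"]

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

In practice you would pass `[s['canonical_name'] for s in detected]` as `predicted` and average the two numbers over many labeled documents before comparing threshold settings.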