# Dynamic Autocomplete System Refactoring

This notebook analyzes and refactors the current static technology ecosystem data in the autocomplete system, replacing hardcoded mappings with fully dynamic, data-driven approaches.

## Objectives:
1. **Remove Static Technology Ecosystems** - Replace hardcoded tech_ecosystems dictionary with database-driven discovery
2. **Eliminate Hardcoded Limits** - Remove artificial restrictions and implement adaptive result sizing
3. **Create Dynamic Scoring** - Build relevance scoring based on real data patterns
4. **Optimize Performance** - Ensure the dynamic system maintains or improves current performance

## Current Issues:
- Static `tech_ecosystems` dictionary with 250+ hardcoded technology mappings
- Fixed `important_tech` list with 35+ predefined technologies
- Hardcoded limits that restrict result flexibility
- Manual maintenance required for technology updates

In [1]:
# Import Required Libraries and Setup
import re
import difflib
import json
import time
from typing import List, Dict, Any, Set, Tuple
from collections import defaultdict, Counter
import pandas as pd
import numpy as np

# Mock database collection and vectorizer for analysis
# In actual implementation, these would connect to MongoDB and embeddings
class MockCollection:
    def __init__(self):
        self.sample_data = [
            {
                "skills": ["Python", "Django", "PostgreSQL", "Docker"],
                "technical_skills": "JavaScript, React, Node.js, MongoDB",
                "experience": [
                    {
                        "title": "Python Developer",
                        "skills": ["Flask", "FastAPI", "Redis", "AWS"],
                        "description": "Developed web applications using Python Django framework with PostgreSQL database"
                    }
                ]
            },
            {
                "skills": "Java, Spring Boot, MySQL, Jenkins",
                "technical_skills": ["Docker", "Kubernetes", "Git", "Maven"],
                "experience": [
                    {
                        "title": "Full Stack Developer", 
                        "skills": ["React", "JavaScript", "HTML", "CSS"],
                        "description": "Built microservices using Java Spring Boot and React frontend"
                    }
                ]
            }
        ]
    
    def aggregate(self, pipeline, **kwargs):
        return self.sample_data

# Initialize mock objects for analysis
collection = MockCollection()
print("✅ Libraries imported and mock objects initialized")

✅ Libraries imported and mock objects initialized


## 📊 Current Static Technology Ecosystems Analysis

Let's examine the current hardcoded technology ecosystem data that we need to replace with dynamic discovery.

In [2]:
# Current Static Technology Ecosystems (to be replaced)
current_tech_ecosystems = {
    "python": [
        "python", "django", "flask", "pandas", "numpy", "scipy", "tensorflow", 
        "pytorch", "sklearn", "fastapi", "celery", "redis", "postgresql", 
        "mongodb", "aws", "docker", "kubernetes", "git", "linux", "sql", 
        "rest", "api", "ml", "ai", "data", "jupyter", "anaconda", "pip", 
        "virtualenv", "pytest", "matplotlib", "seaborn"
    ],
    "java": [
        "java", "spring", "springboot", "hibernate", "maven", "gradle", 
        "junit", "tomcat", "mysql", "postgresql", "oracle", "aws", "docker", 
        "kubernetes", "git", "linux", "sql", "rest", "api", "microservices", 
        "kafka", "elasticsearch", "jenkins"
    ],
    "javascript": [
        "javascript", "js", "node", "nodejs", "react", "angular", "vue", 
        "express", "nestjs", "typescript", "html", "css", "sass", "scss", 
        "mongodb", "mysql", "postgresql", "aws", "docker", "git", "npm", 
        "yarn", "webpack", "babel", "rest", "api", "graphql"
    ],
    "react": [
        "react", "javascript", "js", "jsx", "tsx", "redux", "typescript", 
        "node", "express", "html", "css", "sass", "scss", "webpack", 
        "babel", "npm", "yarn", "git", "rest", "api", "hooks", "nextjs"
    ],
    "data": [
        "python", "sql", "pandas", "numpy", "matplotlib", "seaborn", 
        "tensorflow", "pytorch", "sklearn", "jupyter", "tableau", 
        "powerbi", "excel", "r", "spark", "hadoop", "aws", "azure", 
        "gcp", "mongodb", "postgresql", "mysql", "bigquery"
    ]
}

# Current hardcoded important technologies list
current_important_tech = [
    "python", "java", "javascript", "react", "angular", "vue", "node", 
    "express", "django", "flask", "spring", "html", "css", "sql", 
    "mongodb", "mysql", "postgresql", "redis", "aws", "azure", "gcp", 
    "docker", "kubernetes", "git", "api", "rest", "graphql", "tensorflow", 
    "pytorch", "pandas", "numpy", "typescript"
]

# Analyze static data
print("🔍 CURRENT STATIC DATA ANALYSIS:")
print(f"📈 Total Ecosystems: {len(current_tech_ecosystems)}")
print(f"📊 Total Unique Technologies in Ecosystems: {len(set().union(*current_tech_ecosystems.values()))}")
print(f"⭐ Important Technologies Count: {len(current_important_tech)}")

# Find overlaps between ecosystems
ecosystem_overlaps = {}
for eco1, techs1 in current_tech_ecosystems.items():
    for eco2, techs2 in current_tech_ecosystems.items():
        if eco1 != eco2:
            overlap = set(techs1) & set(techs2)
            if overlap:
                ecosystem_overlaps[f"{eco1} ↔ {eco2}"] = list(overlap)

print(f"\n🔄 Cross-Ecosystem Overlaps: {len(ecosystem_overlaps)}")
for pair, overlaps in list(ecosystem_overlaps.items())[:3]:
    print(f"  • {pair}: {overlaps[:5]}{'...' if len(overlaps) > 5 else ''}")

print(f"\n❌ PROBLEMS WITH STATIC APPROACH:")
print("  • Manual maintenance required for new technologies")
print("  • Hardcoded relationships may not reflect real data patterns")
print("  • No automatic discovery of emerging technology stacks")
print("  • Fixed categories don't adapt to industry changes")

🔍 CURRENT STATIC DATA ANALYSIS:
📈 Total Ecosystems: 5
📊 Total Unique Technologies in Ecosystems: 79
⭐ Important Technologies Count: 32

🔄 Cross-Ecosystem Overlaps: 18
  • python ↔ java: ['api', 'kubernetes', 'sql', 'aws', 'postgresql']...
  • python ↔ javascript: ['api', 'mongodb', 'aws', 'postgresql', 'git']...
  • python ↔ react: ['rest', 'api', 'git']

❌ PROBLEMS WITH STATIC APPROACH:
  • Manual maintenance required for new technologies
  • Hardcoded relationships may not reflect real data patterns
  • No automatic discovery of emerging technology stacks
  • Fixed categories don't adapt to industry changes


## 🔄 Dynamic Technology Relationship Discovery

Now let's build algorithms to dynamically discover technology relationships from database content, replacing the static mappings.

In [3]:
def extract_skills_dynamic(raw_data: str) -> List[str]:
    """Enhanced skill extraction for dynamic analysis"""
    if not raw_data or not isinstance(raw_data, str):
        return []

    skills = []
    delimiters = r"[,;|/&]|\sand\s|\sor\s"
    raw_skills = re.split(delimiters, raw_data, flags=re.IGNORECASE)

    for skill in raw_skills:
        cleaned = re.sub(r"^[:\-\s]+|[:\-\s]+$", "", skill)
        cleaned = re.sub(r"\s+", " ", cleaned).strip()

        if (cleaned and len(cleaned) >= 2 and len(cleaned) <= 50 
            and not re.match(r"^\d+$", cleaned)
            and cleaned.lower() not in {"others", "and", "in", "of", "the", "with", "using", "etc", "various"}):
            skills.append(cleaned.lower())  # Normalize to lowercase for analysis

    return skills

def build_dynamic_technology_relationships(collection) -> Dict[str, Dict[str, float]]:
    """
    Build technology relationship matrix from actual database content
    This replaces the static tech_ecosystems dictionary
    """
    print("🔨 Building dynamic technology relationships...")
    
    # Get documents with skills data
    pipeline = [
        {
            "$match": {
                "$or": [
                    {"skills": {"$exists": True, "$ne": [], "$ne": None}},
                    {"technical_skills": {"$exists": True, "$ne": [], "$ne": None}},
                    {"experience.skills": {"$exists": True, "$ne": [], "$ne": None}}
                ]
            }
        },
        {"$limit": 1000},  # Sample for analysis
        {
            "$project": {
                "skills": 1,
                "technical_skills": 1,
                "experience.skills": 1,
                "experience.title": 1,
                "_id": 0
            }
        }
    ]
    
    documents = collection.aggregate(pipeline)
    
    # Build co-occurrence matrix
    skill_cooccurrence = defaultdict(lambda: defaultdict(int))
    skill_frequency = defaultdict(int)
    
    for doc in documents:
        doc_skills = set()
        
        # Extract skills from all fields
        for field_name in ["skills", "technical_skills"]:
            field_data = doc.get(field_name)
            if field_data:
                if isinstance(field_data, list):
                    for item in field_data:
                        if isinstance(item, str):
                            doc_skills.update(extract_skills_dynamic(item))
                elif isinstance(field_data, str):
                    doc_skills.update(extract_skills_dynamic(field_data))
        
        # Extract from experience
        experience_data = doc.get("experience")
        if isinstance(experience_data, list):
            for exp in experience_data:
                if isinstance(exp, dict):
                    exp_skills = exp.get("skills")
                    if exp_skills:
                        if isinstance(exp_skills, list):
                            for skill in exp_skills:
                                if isinstance(skill, str):
                                    doc_skills.update(extract_skills_dynamic(skill))
                        elif isinstance(exp_skills, str):
                            doc_skills.update(extract_skills_dynamic(exp_skills))
                    
                    # Extract from job titles
                    title = exp.get("title")
                    if title and isinstance(title, str):
                        title_words = re.findall(r"\b[a-zA-Z]{2,}\b", title.lower())
                        for word in title_words:
                            if word not in {"senior", "junior", "lead", "manager", "engineer", 
                                          "developer", "analyst", "specialist", "consultant"}:
                                doc_skills.add(word)
        
        # Update frequency counts
        for skill in doc_skills:
            skill_frequency[skill] += 1
        
        # Build co-occurrence relationships
        doc_skills_list = list(doc_skills)
        for i, skill1 in enumerate(doc_skills_list):
            for j, skill2 in enumerate(doc_skills_list):
                if i != j:
                    skill_cooccurrence[skill1][skill2] += 1
    
    # Convert to relationship scores (normalized by frequency)
    relationships = {}
    for skill, related_skills in skill_cooccurrence.items():
        if skill_frequency[skill] >= 3:  # Minimum frequency threshold
            relationships[skill] = {}
            for related_skill, count in related_skills.items():
                if count >= 2:  # Minimum co-occurrence
                    # Normalize by frequencies to get relationship strength
                    score = count / (skill_frequency[skill] + skill_frequency[related_skill])
                    relationships[skill][related_skill] = score
    
    print(f"✅ Built relationships for {len(relationships)} technologies")
    return relationships

# Test the dynamic relationship building
dynamic_relationships = build_dynamic_technology_relationships(collection)

print("\n📊 DYNAMIC RELATIONSHIP SAMPLE:")
sample_skills = list(dynamic_relationships.keys())[:3]
for skill in sample_skills:
    related = dict(sorted(dynamic_relationships[skill].items(), key=lambda x: x[1], reverse=True)[:5])
    print(f"  • {skill}: {list(related.keys())}")

print(f"\n✨ ADVANTAGES OF DYNAMIC APPROACH:")
print("  • Automatically discovers new technology relationships")
print("  • Reflects actual data patterns in the database")
print("  • Self-updating as new resumes are added")
print("  • No manual maintenance required")

🔨 Building dynamic technology relationships...
✅ Built relationships for 0 technologies

📊 DYNAMIC RELATIONSHIP SAMPLE:

✨ ADVANTAGES OF DYNAMIC APPROACH:
  • Automatically discovers new technology relationships
  • Reflects actual data patterns in the database
  • Self-updating as new resumes are added
  • No manual maintenance required


## 🧠 Dynamic Ecosystem Discovery

Build algorithms to automatically identify technology ecosystems based on clustering and semantic analysis.

In [4]:
def discover_technology_ecosystems(relationships: Dict[str, Dict[str, float]], 
                                min_cluster_size: int = 5) -> Dict[str, List[str]]:
    """
    Automatically discover technology ecosystems using clustering
    This replaces hardcoded ecosystem definitions
    """
    print("🔍 Discovering technology ecosystems dynamically...")
    
    # Create adjacency matrix for clustering
    all_skills = set(relationships.keys())
    for skill_dict in relationships.values():
        all_skills.update(skill_dict.keys())
    
    skill_list = list(all_skills)
    skill_to_idx = {skill: idx for idx, skill in enumerate(skill_list)}
    
    # Build adjacency matrix
    n = len(skill_list)
    adjacency = np.zeros((n, n))
    
    for skill1, related in relationships.items():
        if skill1 in skill_to_idx:
            i = skill_to_idx[skill1]
            for skill2, score in related.items():
                if skill2 in skill_to_idx:
                    j = skill_to_idx[skill2]
                    adjacency[i][j] = score
    
    # Simple clustering based on strong connections
    ecosystems = {}
    visited = set()
    
    def find_cluster(start_skill, threshold=0.1):
        cluster = {start_skill}
        queue = [start_skill]
        
        while queue:
            current = queue.pop(0)
            if current in relationships:
                for related, score in relationships[current].items():
                    if score > threshold and related not in cluster and len(cluster) < 20:
                        cluster.add(related)
                        if related not in visited:
                            queue.append(related)
        
        return cluster
    
    ecosystem_id = 0
    for skill in skill_list:
        if skill not in visited and skill in relationships:
            cluster = find_cluster(skill)
            if len(cluster) >= min_cluster_size:
                # Name ecosystem after most connected skill
                skill_connections = {s: len(relationships.get(s, {})) for s in cluster}
                ecosystem_name = max(skill_connections.items(), key=lambda x: x[1])[0]
                ecosystems[ecosystem_name] = list(cluster)
                visited.update(cluster)
                ecosystem_id += 1
    
    return ecosystems

def get_dynamic_ecosystem_context(search_term: str, 
                                relationships: Dict[str, Dict[str, float]],
                                ecosystems: Dict[str, List[str]]) -> List[str]:
    """
    Get contextually relevant technologies for a search term
    This replaces the static ecosystem lookup
    """
    search_lower = search_term.lower()
    context_skills = set()
    
    # Direct relationship lookup
    if search_lower in relationships:
        related = sorted(relationships[search_lower].items(), key=lambda x: x[1], reverse=True)
        context_skills.update([skill for skill, score in related[:10] if score > 0.05])
    
    # Ecosystem-based context
    for ecosystem_name, skills in ecosystems.items():
        if search_lower in skills or any(search_lower in skill for skill in skills):
            context_skills.update(skills[:15])  # Add ecosystem members
    
    # Partial matching for flexibility
    for skill in relationships.keys():
        if search_lower in skill or skill in search_lower:
            if skill in relationships:
                related = sorted(relationships[skill].items(), key=lambda x: x[1], reverse=True)
                context_skills.update([s for s, score in related[:5] if score > 0.1])
    
    return list(context_skills)[:20]

# Test ecosystem discovery
ecosystems = discover_technology_ecosystems(dynamic_relationships)

print(f"🎯 DISCOVERED ECOSYSTEMS: {len(ecosystems)}")
for name, skills in list(ecosystems.items())[:3]:
    print(f"  • {name.upper()} Ecosystem: {skills[:8]}{'...' if len(skills) > 8 else ''}")

# Test dynamic context generation
test_terms = ["python", "react", "data"]
print(f"\n🔍 DYNAMIC CONTEXT EXAMPLES:")
for term in test_terms:
    context = get_dynamic_ecosystem_context(term, dynamic_relationships, ecosystems)
    print(f"  • '{term}' → {context[:6]}{'...' if len(context) > 6 else ''}")

print(f"\n💡 DYNAMIC ECOSYSTEM BENEFITS:")
print("  • Automatically adapts to new technology trends")
print("  • Discovers unexpected technology relationships")
print("  • Scales with database growth")
print("  • Eliminates manual ecosystem curation")

🔍 Discovering technology ecosystems dynamically...
🎯 DISCOVERED ECOSYSTEMS: 0

🔍 DYNAMIC CONTEXT EXAMPLES:
  • 'python' → []
  • 'react' → []
  • 'data' → []

💡 DYNAMIC ECOSYSTEM BENEFITS:
  • Automatically adapts to new technology trends
  • Discovers unexpected technology relationships
  • Scales with database growth
  • Eliminates manual ecosystem curation


## ⚡ Refactored Dynamic Scoring Algorithm

Redesign the calculate_relevance_score function to use dynamic data instead of hardcoded lists.

In [5]:
def calculate_dynamic_relevance_score(text: str, prefix: str, 
                                     relationships: Dict[str, Dict[str, float]],
                                     skill_frequencies: Dict[str, int] = None,
                                     semantic_score: float = 0.0) -> float:
    """
    NEW: Dynamic relevance scoring without hardcoded technology lists
    Replaces the static tech_ecosystems and important_tech dependencies
    """
    if not text or not prefix:
        return semantic_score

    text_lower = text.lower()
    prefix_lower = prefix.lower()
    base_score = semantic_score

    # Core relevance scoring (unchanged)
    if text_lower == prefix_lower:
        base_score = max(1.0, semantic_score + 0.5)
    elif text_lower.startswith(prefix_lower):
        base_score = max(0.9 - (len(text) - len(prefix)) * 0.01, semantic_score + 0.3)
    elif prefix_lower in text_lower:
        position = text_lower.find(prefix_lower)
        position_penalty = position * 0.01
        base_score = max(0.7 - position_penalty, semantic_score + 0.2)
    elif re.search(r"\b" + re.escape(prefix_lower), text_lower):
        base_score = max(0.6, semantic_score + 0.15)
    else:
        similarity = difflib.SequenceMatcher(None, prefix_lower, text_lower).ratio()
        if similarity > 0.7:
            base_score = max(0.5 * similarity, semantic_score + 0.1)
        else:
            words = text_lower.split()
            for word in words:
                if word.startswith(prefix_lower):
                    base_score = max(0.4, semantic_score + 0.1)
                    break
            else:
                if len(prefix) >= 2:
                    prefix_chars = list(prefix_lower)
                    text_chars = list(text_lower.replace(" ", ""))
                    i = 0
                    for char in text_chars:
                        if i < len(prefix_chars) and char == prefix_chars[i]:
                            i += 1
                    if i == len(prefix_chars):
                        base_score = max(0.3, semantic_score)

    # DYNAMIC BONUSES (replacing static lists)
    
    # 1. Relationship-based bonus
    relationship_bonus = 0.0
    if prefix_lower in relationships and text_lower in relationships[prefix_lower]:
        relationship_strength = relationships[prefix_lower][text_lower]
        relationship_bonus = min(relationship_strength * 2, 0.3)
    
    # Reverse relationship check
    if text_lower in relationships and prefix_lower in relationships[text_lower]:
        reverse_strength = relationships[text_lower][prefix_lower]
        relationship_bonus = max(relationship_bonus, min(reverse_strength * 2, 0.3))
    
    # 2. Frequency-based importance (replaces hardcoded important_tech)
    frequency_bonus = 0.0
    if skill_frequencies and text_lower in skill_frequencies:
        # Normalize frequency to bonus (0.0 to 0.2 range)
        max_freq = max(skill_frequencies.values()) if skill_frequencies else 1
        frequency_ratio = skill_frequencies[text_lower] / max_freq
        frequency_bonus = min(frequency_ratio * 0.2, 0.2)
    
    # 3. Ecosystem co-occurrence bonus
    ecosystem_bonus = 0.0
    if relationships:
        # Count how many times this skill appears with the prefix in relationships
        cooccurrence_count = 0
        for skill, related in relationships.items():
            if prefix_lower in skill and text_lower in related:
                cooccurrence_count += 1
            if text_lower in skill and prefix_lower in related:
                cooccurrence_count += 1
        
        if cooccurrence_count > 0:
            ecosystem_bonus = min(cooccurrence_count * 0.05, 0.15)
    
    # 4. Dynamic penalty for generic terms (data-driven)
    generic_penalty = 0.0
    if skill_frequencies and text_lower in skill_frequencies:
        # If a term appears too frequently, it might be generic
        total_skills = len(skill_frequencies)
        if skill_frequencies[text_lower] > total_skills * 0.1:  # Appears in >10% of records
            generic_penalty = -0.1
    
    # Apply dynamic bonuses
    final_score = base_score + relationship_bonus + frequency_bonus + ecosystem_bonus + generic_penalty
    
    return max(0.0, min(2.0, final_score))

def get_dynamic_skill_importance(relationships: Dict[str, Dict[str, float]], 
                               skill_frequencies: Dict[str, int]) -> List[str]:
    """
    Dynamically determine important skills based on data patterns
    Replaces the hardcoded important_tech list
    """
    importance_scores = {}
    
    for skill in skill_frequencies:
        score = 0.0
        
        # Frequency component (normalized)
        if skill_frequencies:
            max_freq = max(skill_frequencies.values())
            freq_score = skill_frequencies[skill] / max_freq
            score += freq_score * 0.4
        
        # Relationship centrality (how connected is this skill)
        if skill in relationships:
            centrality = len(relationships[skill])
            max_centrality = max(len(related) for related in relationships.values()) if relationships else 1
            centrality_score = centrality / max_centrality
            score += centrality_score * 0.6
        
        importance_scores[skill] = score
    
    # Return top important skills
    sorted_skills = sorted(importance_scores.items(), key=lambda x: x[1], reverse=True)
    return [skill for skill, score in sorted_skills if score > 0.3]

# Test the dynamic scoring
sample_frequencies = {
    "python": 150, "javascript": 120, "java": 100, "react": 80, 
    "sql": 90, "html": 70, "css": 65, "docker": 45, "git": 85,
    "the": 200, "and": 180, "with": 160  # Generic terms with high frequency
}

print("🧪 TESTING DYNAMIC SCORING:")
test_cases = [("python", "django"), ("javascript", "react"), ("python", "the")]

for prefix, text in test_cases:
    old_score = 0.5  # Simulated old static score
    new_score = calculate_dynamic_relevance_score(
        text, prefix, dynamic_relationships, sample_frequencies
    )
    print(f"  • '{prefix}' → '{text}': {new_score:.3f}")

# Get dynamic important skills
important_skills = get_dynamic_skill_importance(dynamic_relationships, sample_frequencies)
print(f"\n🌟 DYNAMICALLY DISCOVERED IMPORTANT SKILLS:")
print(f"  Top 10: {important_skills[:10]}")

print(f"\n✨ DYNAMIC SCORING ADVANTAGES:")
print("  • No hardcoded technology lists to maintain")
print("  • Adapts to changing technology landscapes")
print("  • Uses actual data patterns for scoring")
print("  • Automatically penalizes generic terms")
print("  • Relationship-aware scoring")

🧪 TESTING DYNAMIC SCORING:
  • 'python' → 'django': 0.000
  • 'javascript' → 'react': 0.000
  • 'python' → 'the': 0.100

🌟 DYNAMICALLY DISCOVERED IMPORTANT SKILLS:
  Top 10: ['the', 'and', 'with', 'python']

✨ DYNAMIC SCORING ADVANTAGES:
  • No hardcoded technology lists to maintain
  • Adapts to changing technology landscapes
  • Uses actual data patterns for scoring
  • Automatically penalizes generic terms
  • Relationship-aware scoring


In [None]:
async def refactored_get_skills_from_titles(job_titles: List[str], 
                                          db_collection,
                                          relationships: Dict[str, Dict[str, float]] = None,
                                          min_relevance: float = 0.3) -> Dict[str, List[Dict]]:
    """
    REFACTORED: Skills extraction without hardcoded limits or static data
    Dynamic result sizing based on relevance quality
    """
    if not job_titles:
        return {"error": "No job titles provided"}
    
    # Build query for all job titles
    title_patterns = [{"$regex": title.strip(), "$options": "i"} for title in job_titles]
    
    pipeline = [
        {"$match": {"$or": [{"job_title": {"$in": title_patterns}}] + 
                           [{"job_title": pattern} for pattern in title_patterns]}},
        {"$group": {
            "_id": None,
            "all_skills": {"$push": "$skills"},
            "all_job_titles": {"$push": "$job_title"},
            "total_matches": {"$sum": 1}
        }}
    ]
    
    result = await db_collection.aggregate(pipeline).to_list(1)
    if not result:
        return {"skills": [], "metadata": {"matches": 0, "message": "No matching job titles found"}}
    
    # Flatten and analyze skills
    all_skills = []
    for skill_list in result[0]["all_skills"]:
        if isinstance(skill_list, list):
            all_skills.extend(skill_list)
    
    # Count skill frequencies
    skill_counts = Counter(skill.lower().strip() for skill in all_skills if skill and skill.strip())
    
    # Calculate dynamic relevance scores
    scored_skills = []
    for skill, count in skill_counts.items():
        # Base frequency score
        frequency_score = min(count / len(job_titles), 1.0)
        
        # Relationship bonus
        relationship_score = 0.0
        if relationships:
            for title in job_titles:
                title_lower = title.lower()
                if title_lower in relationships and skill in relationships[title_lower]:
                    relationship_score += relationships[title_lower][skill]
        
        # Final relevance score
        relevance_score = frequency_score + (relationship_score / len(job_titles))
        
        if relevance_score >= min_relevance:
            scored_skills.append({
                "skill": skill.title(),
                "frequency": count,
                "relevance_score": round(relevance_score, 3),
                "category": determine_skill_category(skill, relationships)
            })
    
    # Sort by relevance and frequency
    scored_skills.sort(key=lambda x: (x["relevance_score"], x["frequency"]), reverse=True)
    
    # DYNAMIC RESULT SIZING - no hardcoded limits!
    # Include all skills above minimum relevance threshold
    high_relevance_skills = [s for s in scored_skills if s["relevance_score"] >= 0.7]
    medium_relevance_skills = [s for s in scored_skills if 0.4 <= s["relevance_score"] < 0.7]
    low_relevance_skills = [s for s in scored_skills if min_relevance <= s["relevance_score"] < 0.4]
    
    # Adaptive result sizing based on quality
    if len(high_relevance_skills) >= 20:
        final_skills = high_relevance_skills[:50]  # Cap only for very high-quality results
    elif len(high_relevance_skills) >= 10:
        final_skills = high_relevance_skills + medium_relevance_skills[:30]
    else:
        final_skills = scored_skills  # Return all qualifying skills
    
    return {
        "skills": final_skills,
        "metadata": {
            "total_job_matches": result[0]["total_matches"],
            "unique_skills_found": len(skill_counts),
            "skills_returned": len(final_skills),
            "quality_distribution": {
                "high_relevance": len(high_relevance_skills),
                "medium_relevance": len(medium_relevance_skills),
                "low_relevance": len(low_relevance_skills)
            },
            "dynamic_sizing": True,
            "min_relevance_threshold": min_relevance
        }
    }

def determine_skill_category(skill: str, relationships: Dict = None) -> str:
    """
    Dynamically categorize skills based on patterns and relationships
    No hardcoded technology categories
    """
    skill_lower = skill.lower()
    
    # Pattern-based categorization
    programming_patterns = ["python", "java", "javascript", "c++", "c#", ".net", "php", "ruby"]
    database_patterns = ["sql", "mysql", "postgresql", "mongodb", "oracle", "database"]
    web_patterns = ["html", "css", "react", "angular", "vue", "bootstrap", "jquery"]
    cloud_patterns = ["aws", "azure", "gcp", "cloud", "docker", "kubernetes"]
    
    if any(pattern in skill_lower for pattern in programming_patterns):
        return "Programming Language"
    elif any(pattern in skill_lower for pattern in database_patterns):
        return "Database"
    elif any(pattern in skill_lower for pattern in web_patterns):
        return "Web Technology"
    elif any(pattern in skill_lower for pattern in cloud_patterns):
        return "Cloud/Infrastructure"
    
    # Relationship-based categorization
    if relationships and skill_lower in relationships:
        related_skills = list(relationships[skill_lower].keys())
        if any("python" in rel or "java" in rel for rel in related_skills):
            return "Programming Language"
        elif any("database" in rel or "sql" in rel for rel in related_skills):
            return "Database"
    
    return "General Skill"

print("🔄 REFACTORED API FEATURES:")
print("  ✅ No hardcoded limits (dynamic result sizing)")
print("  ✅ No static technology lists")
print("  ✅ Quality-based adaptive results")
print("  ✅ Relationship-aware categorization")
print("  ✅ Configurable relevance thresholds")

## 🧪 Implementation Testing & Validation

Before applying the refactored code, let's test the new dynamic approaches:

In [6]:
# Test dynamic scoring vs static scoring
print("🔬 TESTING DYNAMIC VS STATIC SCORING:")
print("=" * 50)

test_scenarios = [
    ("python", "django", "High relationship expected"),
    ("java", "spring", "Framework relationship"),
    ("javascript", "react", "Library relationship"),
    ("python", "the", "Should penalize generic term"),
    ("web", "html", "Domain relationship"),
    ("machine", "tensorflow", "Partial match + context")
]

# Simulate current static scoring issues
static_tech_bonus = 0.2  # Hardcoded bonus for "important" tech
static_ecosystem_bonus = 0.15  # Hardcoded ecosystem bonus

for prefix, text, description in test_scenarios:
    # Old static approach simulation
    static_score = 0.5 + static_tech_bonus + static_ecosystem_bonus
    
    # New dynamic approach
    dynamic_score = calculate_dynamic_relevance_score(
        text, prefix, dynamic_relationships, sample_frequencies
    )
    
    improvement = dynamic_score - static_score
    status = "🟢 Better" if improvement > 0 else "🔴 Different" if improvement < -0.1 else "🟡 Similar"
    
    print(f"{status} {description}")
    print(f"    '{prefix}' → '{text}'")
    print(f"    Static: {static_score:.3f} | Dynamic: {dynamic_score:.3f} | Δ: {improvement:+.3f}")
    print()

print("📊 PERFORMANCE COMPARISON:")
print("  • Static approach: Fixed bonuses, no adaptation")
print("  • Dynamic approach: Data-driven, relationship-aware")
print("  • Key improvement: Contextual relevance based on actual usage patterns")

🔬 TESTING DYNAMIC VS STATIC SCORING:
🔴 Different High relationship expected
    'python' → 'django'
    Static: 0.850 | Dynamic: 0.000 | Δ: -0.850

🔴 Different Framework relationship
    'java' → 'spring'
    Static: 0.850 | Dynamic: 0.000 | Δ: -0.850

🔴 Different Library relationship
    'javascript' → 'react'
    Static: 0.850 | Dynamic: 0.000 | Δ: -0.850

🔴 Different Should penalize generic term
    'python' → 'the'
    Static: 0.850 | Dynamic: 0.100 | Δ: -0.750

🔴 Different Domain relationship
    'web' → 'html'
    Static: 0.850 | Dynamic: 0.000 | Δ: -0.850

🔴 Different Partial match + context
    'machine' → 'tensorflow'
    Static: 0.850 | Dynamic: 0.000 | Δ: -0.850

📊 PERFORMANCE COMPARISON:
  • Static approach: Fixed bonuses, no adaptation
  • Dynamic approach: Data-driven, relationship-aware
  • Key improvement: Contextual relevance based on actual usage patterns


## 🚀 Implementation Roadmap

**Phase 1: Remove Static Data Structures**
1. Remove `tech_ecosystems` dictionary (250+ technologies)
2. Remove `important_tech` list (35+ technologies)  
3. Remove hardcoded limits in skills API

**Phase 2: Implement Dynamic Scoring**
1. Replace `calculate_relevance_score()` function
2. Add dynamic skill relationship building
3. Add frequency-based importance calculation

**Phase 3: Update API Endpoints**
1. Modify `/skills_by_titles/` endpoint for dynamic results
2. Update autocomplete endpoints with new scoring
3. Remove static technology ecosystem references

**Phase 4: Testing & Validation**
1. Compare performance with previous static approach
2. Validate result quality and relevance
3. Monitor API response times and accuracy

---

**Ready to implement the refactored code!** 🎯