# 🎓 Advanced Deep Learning Course - Final Project
AI-Powered CV Analysis System with RAG and Advanced Skill Matching

**📋 Table of Contents**

- Project Overview
- System Architecture
- Technical Stack
- Implementation Details
- Demo & Results
- Conclusion

**🎯 Project Overview**

**Business Problem**

Traditional CV screening processes are:

- Time-consuming: Hours spent manually reviewing resumes

- Prone to bias: Human reviewers can overlook qualified candidates

- Inefficient: Keyword-based filtering lacks semantic understanding

- Inconsistent: Varying evaluation standards across reviewers


**AI-Powered Solution**

We developed a comprehensive CV analysis system that:

- Automates resume screening using RAG (Retrieval-Augmented Generation)

- Extracts and matches skills across 22 job categories

- Generates intelligent interview questions with model answers

- Provides professional CV analysis with actionable recommendations

- Integrates multiple AI models for comprehensive assessment


**Key Innovations**

- True RAG Implementation: Semantic understanding beyond keyword matching

- Dynamic Skill Analysis: Domain-agnostic skill extraction across industries

- Evidence-Based Matching: Contextual validation of skill claims

- Professional Outputs: HR-grade assessments and recommendations

- Multi-Model Integration: Combining Gemini 2.0 Flash with custom NLP pipelines

**System Architecture**

**Complete Pipeline**

📄 PDF Resumes

    ↓
🔧 Multi-Method Processing (pdfplumber + PyPDF2 + LangChain)

    ↓  
🎯 Advanced Skill Extraction (DynamicSkillAnalyzer)

    ↓
📊 Vector Database (FAISS + Sentence Transformers)

    ↓
🤖 RAG System with Semantic Search

    ↓
📈 Multi-Stage AI Analysis

    ├── 🔍 Skill Gap Analysis
    ├── 📝 CV Quality Assessment  
    ├── ❓ Interview Question Generation
    └── 💡 Professional Recommendations
    ↓
🎯 Business Applications (HR, Recruitment, Career Coaching)

Component Integration:

- PDF Processing Layer - Extracts resume text from PDFs with pdfplumber, PyPDF2, and LangChain loaders

- NLP Engine - spaCy + Sentence Transformers for understanding

- AI Analysis Layer - Gemini 2.0 Flash for advanced reasoning

- RAG Framework - LangChain with FAISS for smart search

- Professional Outputs - Business-ready reports and insights
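
To make the "smart search" step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is a pure-Python stand-in for what FAISS and Sentence Transformers do at scale, not the project's retrieval code: the three-dimensional vectors, the chunk names, and the helper names `cosine` and `top_k` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings" for three resume chunks (real ones have hundreds of dimensions).
toy_index = {
    "chunk_python_backend": [0.9, 0.1, 0.0],
    "chunk_marketing": [0.0, 0.2, 0.9],
    "chunk_cloud_devops": [0.7, 0.6, 0.1],
}

# Hypothetical query embedding, e.g. for "backend Python experience".
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

In a RAG pipeline the retrieved chunks are then passed to the LLM as context, so the ranking quality directly bounds how well grounded the generated answer can be.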



## 🎯 Advanced Skill Analysis Engine

### **Core Features**
- **22 Job Categories** - IT, Finance, Healthcare, Marketing, Engineering, etc.
- **500+ Tools & Platforms** - Technical, business, creative, and specialized tools
- **Evidence Validation** - Finds contextual proof of skills in resume content
- **Semantic Matching** - Goes beyond keywords to actual meaning understanding

### **Technical Implementation**
- **Sentence Transformers** - Semantic embeddings for skill similarity
- **spaCy NER** - Named entity recognition for skill extraction
- **Multi-Method Extraction** - Tools, platforms, categories, and section analysis
- **Strict Filtering** - Clean, meaningful skills only

This analyzer can process resumes from any industry and extract relevant skills with evidence.
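
The matching score combines word overlap with embedding similarity. The sketch below reproduces that arithmetic with the default weights and threshold from the analyzer's configuration (0.6, 0.4, and 0.5). The semantic scores passed in are hypothetical placeholders for the cosine similarity a sentence-embedding model would return, and the function names are illustrative.

```python
import re

WEIGHT_OVERLAP = 0.6        # default weight on keyword overlap
WEIGHT_SEMANTIC = 0.4       # default weight on embedding similarity
SIMILARITY_THRESHOLD = 0.5  # minimum combined score to count as a match

def keyword_overlap(s1, s2):
    """Jaccard overlap between the word sets of two skill names."""
    w1 = set(re.findall(r"\w+", s1.lower()))
    w2 = set(re.findall(r"\w+", s2.lower()))
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / len(w1 | w2)

def combined_score(cv_skill, jd_skill, semantic_score):
    """Weighted sum of keyword overlap and a supplied semantic score."""
    overlap = keyword_overlap(cv_skill, jd_skill)
    return WEIGHT_OVERLAP * overlap + WEIGHT_SEMANTIC * semantic_score

# Two of three distinct words are shared; the semantic score 0.8 is assumed.
score_shared = combined_score("Machine Learning", "Machine Learning Engineering", 0.8)

# No shared words; even a perfect (assumed) semantic score of 1.0 is capped by its weight.
score_disjoint = combined_score("Postgresql", "Database Management", 1.0)

print(round(score_shared, 2), score_shared >= SIMILARITY_THRESHOLD)
print(round(score_disjoint, 2), score_disjoint >= SIMILARITY_THRESHOLD)
```

One consequence of these defaults is worth noting: a pair with no shared words can score at most 0.4, which is below the 0.5 threshold. Purely semantic matches are therefore only accepted if the weights or the threshold are changed through the `config` argument.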


In [None]:
# What this cell does: implements DynamicSkillAnalyzer, a domain-agnostic component that
#   extracts clean skills from CVs and job descriptions, finds short supporting evidence sentences,
#   and computes combined keyword + semantic similarity scores to match candidate skills to job requirements.
# Inputs (for the grader): raw CV text (string) and raw JD text (string) preprocessed similarly to your corpus.
# Outputs (what you will show): matched_skills list (skill pairs + score + truncated evidence),
#   missing_skills list (JD skills with JD context), counts (cv_skill_count, jd_skill_count) and match_percentage.
# Quick runtime note: uses sentence-transformers/all-MiniLM-L6-v2 and spaCy en_core_web_lg — these models
#   can be memory/time heavy; use a GPU or smaller models for quick demos.
# Why this matters : It provides evidence-backed skill matching that the RAG agent can
#   use to build targeted retrieval queries and produce grounded answers with traceable sources.


# using nlp models

# Imports: logging (debug/info), re (regex parsing), cosine_similarity (semantic scoring),
# SentenceTransformer for semantic embeddings, spaCy for NER/lemmatization used in evidence rules.
import logging
import re
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import spacy

# Setup logging
# Initialize a module logger so the notebook output shows progress and warnings during extraction.
logger = logging.getLogger(__name__)

# Class summary:
# - Multi-strategy skill extraction: direct tool matching, category phrase matching, and section parsing.
# - Evidence extraction: finds the best supporting sentence(s) per skill while excluding education/contact noise.
# - Matching: combines keyword overlap and semantic similarity with configurable weights & threshold
class DynamicSkillAnalyzer:
    """
    Optimized Domain-Agnostic Skill Analyzer - Clean skills across all industries
    """

    DEFAULT_CONFIG = {
        "model_name": 'all-MiniLM-L6-v2',
        "similarity_threshold": 0.5,  # Increased for better precision
        "weight_overlap": 0.6,
        "weight_semantic": 0.4,
    }

    def __init__(self, config=None):
        self.config = self.DEFAULT_CONFIG.copy()
        if config:
            self.config.update(config)

        self.sentence_model = SentenceTransformer(self.config["model_name"])
        self.ner_model = spacy.load("en_core_web_lg")

        self.SIMILARITY_THRESHOLD = self.config["similarity_threshold"]
        self.WEIGHT_OVERLAP = self.config["weight_overlap"]
        self.WEIGHT_SEMANTIC = self.config["weight_semantic"]

        # Store text for evidence validation
        self.cv_text = ""
        self.jd_text = ""

        # Comprehensive tools and platforms across domains
        self.TOOLS_PLATFORMS = {
            # Technical
            'python', 'java', 'javascript', 'typescript', 'c++', 'c#', 'go', 'rust', 'ruby', 'php', 'swift', 'kotlin',
            'django', 'flask', 'spring', 'react', 'angular', 'vue', 'node.js', 'express', 'laravel', 'rails',
            'aws', 'gcp', 'azure', 'docker', 'kubernetes', 'terraform', 'ansible', 'jenkins', 'gitlab', 'github',
            'mysql', 'mongodb', 'postgresql', 'redis', 'cassandra', 'dynamodb', 'oracle',
            'git', 'jira', 'confluence', 'linux', 'unix', 'bash', 'shell',

            # Marketing
            'google analytics', 'google ads', 'facebook ads', 'mailchimp', 'hubspot',
            'hootsuite', 'buffer', 'sprout social', 'semrush', 'ahrefs', 'moz',
            'linkedin', 'instagram', 'twitter', 'tiktok', 'pinterest',

            # Business
            'excel', 'powerpoint', 'word', 'outlook', 'salesforce', 'sap', 'oracle',
            'trello', 'asana', 'slack', 'teams',

            # Design
            'figma', 'sketch', 'adobe', 'photoshop', 'illustrator', 'indesign',
            'canva', 'invision', 'marvel', 'principle',

            # Healthcare
            'epic', 'cerner', 'meditech', 'allscripts', 'ehr', 'emr'
        }

        # Clean skill categories (no partial phrases)
        self.SKILL_CATEGORIES = {
            # Marketing
            'seo', 'ppc', 'social media', 'email marketing', 'content marketing',
            'digital marketing', 'analytics', 'conversion optimization', 'crm',
            'market research', 'brand management', 'advertising',

            # Technical
            'web development', 'mobile development', 'cloud computing', 'devops',
            'database management', 'system administration', 'cybersecurity',
            'machine learning', 'data science', 'api development', 'microservices',

            # Business
            'project management', 'business analysis', 'financial analysis',
            'strategic planning', 'stakeholder management', 'process improvement',
            'risk management', 'budget management', 'team leadership',

            # Design
            'ux design', 'ui design', 'graphic design', 'user research',
            'prototyping', 'wireframing', 'visual design',

            # Healthcare
            'patient care', 'medical coding', 'clinical research', 'healthcare administration',
            'medical terminology', 'treatment planning'
        }

        # Enhanced exclusion patterns
        self.EXCLUDE_PATTERNS = [
            # Job titles
            r'^senior\s+\w+$', r'^junior\s+\w+$', r'^lead\s+\w+$', r'^principal\s+\w+$',
            r'^digital\s+manager$', r'^marketing\s+manager$', r'^software\s+engineer$',

            # Partial phrases and generic terms
            r'^analyze\s+performance$', r'^comprehensive\s+marketing$', r'^digital\s+strategies$',
            r'^experienced\s+marketing$', r'^manage\s+marketing$', r'^marketing\s+using$',
            r'^media\s+efforts$', r'^optimize\s+campaigns$', r'^social\s+marketing$',
            r'^in\s+marketing$', r'^digital$', r'^email\s+campaigns$',

            # Too generic
            r'^strategies$', r'^campaigns$', r'^efforts$', r'^performance$', r'^using$',
            r'^teams$', r'^team$'
        ]

        logger.info("Optimized Domain-Agnostic Skill Analyzer initialized.")

    def extract_meaningful_skills(self, text):
        """Extract clean, meaningful skills across all domains"""
        if not text or len(text.strip()) < 10:
            return []

        skills = set()

        # Extract using focused methods
        skills.update(self._extract_tools_platforms(text))
        skills.update(self._extract_skill_categories(text))
        skills.update(self._extract_clean_skills_from_sections(text))

        # Apply strict filtering
        filtered_skills = self._apply_clean_filters(skills)

        logger.info(f"Extracted {len(filtered_skills)} clean skills: {filtered_skills}")
        return filtered_skills

    def _extract_tools_platforms(self, text):
        """Extract specific tools and platforms"""
        skills = set()
        text_lower = text.lower()

        for tool in self.TOOLS_PLATFORMS:
            # Lookarounds instead of \b so names ending in punctuation ('c++', 'c#') still match
            pattern = r'(?<!\w)' + re.escape(tool) + r'(?!\w)'
            if re.search(pattern, text_lower):
                skills.add(tool.title())

        return skills

    def _extract_skill_categories(self, text):
        """Extract clean skill categories"""
        skills = set()
        text_lower = text.lower()

        for skill in self.SKILL_CATEGORIES:
            pattern = r'\b' + re.escape(skill) + r'\b'
            if re.search(pattern, text_lower):
                skills.add(skill.title())

        return skills

    def _extract_clean_skills_from_sections(self, text):
        """Extract clean skills from sections only"""
        skills = set()

        section_patterns = {
            'skills': r'skills?[^:]*:([^•]+?(?=\n\s*\n|\n\s*[A-Z]|\Z))',
            'technologies': r'technologies?[^:]*:([^•]+?(?=\n\s*\n|\n\s*[A-Z]|\Z))',
            'tools': r'tools?[^:]*:([^•]+?(?=\n\s*\n|\n\s*[A-Z]|\Z))',
        }

        for section_name, pattern in section_patterns.items():
            matches = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)
            for match in matches:
                items = re.split(r'[,\n•\-]', match)
                for item in items:
                    skill = self._clean_skill_item(item.strip())
                    if skill and self._is_clean_skill(skill):
                        skills.add(skill)

        return skills

    def _clean_skill_item(self, item):
        """Clean and normalize skill items strictly"""
        if not item:
            return None

        # Remove markdown and punctuation
        item = re.sub(r'[\*\#\`\-\•\(\)]', ' ', item).strip()
        item = re.sub(r'\s+', ' ', item)

        # Remove common prefixes/suffixes
        item = re.sub(r'^(?:experience with|proficient in|knowledge of|strong|excellent|basic|assisted with|participated in)\s+', '', item, flags=re.IGNORECASE)
        item = re.sub(r'\s+(?:management|marketing|development|design|analysis|optimization|strategy|campaigns|efforts)$', '', item, flags=re.IGNORECASE)

        # Must be a clean, meaningful skill
        words = item.split()
        if 1 <= len(words) <= 3 and 2 <= len(item) <= 30:
            return item.title()

        return None

    def _is_clean_skill(self, skill):
        """Strict validation for clean skills only"""
        skill_lower = skill.lower()

        # Exclude partial phrases and generic terms
        if any(re.search(pattern, skill_lower) for pattern in self.EXCLUDE_PATTERNS):
            return False

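        # Must contain a known tool/platform or skill category as a whole term
        # (a plain substring test would let 'go' match inside 'google')
        def contains_term(term):
            return re.search(r'(?<!\w)' + re.escape(term) + r'(?!\w)', skill_lower) is not None

        is_tool = any(contains_term(tool) for tool in self.TOOLS_PLATFORMS)
        is_skill_category = any(contains_term(category) for category in self.SKILL_CATEGORIES)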

        return is_tool or is_skill_category

    def _apply_clean_filters(self, skills):
        """Apply final clean filters"""
        filtered_skills = []
        seen_lower = set()

        for skill in skills:
            skill_lower = skill.lower().strip()

            if (skill_lower not in seen_lower and
                self._is_clean_skill(skill)):
                filtered_skills.append(skill)
                seen_lower.add(skill_lower)

        # Remove duplicates and sort
        return sorted(list(set(filtered_skills)))

    # IMPROVED EVIDENCE EXTRACTION METHODS
    def _find_contextual_evidence(self, text, skill):
        """Improved evidence extraction that finds actual skill mentions"""
        if not text or not skill:
            return None

        skill_lower = skill.lower()
        text_lower = text.lower()

        # Split into meaningful segments (sentences, bullet points, list items)
        segments = self._split_into_meaningful_segments(text)

        # 1. First priority: Exact match in segments
        for segment in segments:
            segment_lower = segment.lower()
            if skill_lower in segment_lower:
                # Check if this is a meaningful context (not just in education/header)
                if self._is_meaningful_context(segment, skill):
                    return segment.strip()

        # 2. Second priority: Multi-word skill partial matching
        skill_words = skill_lower.split()
        if len(skill_words) > 1:
            for segment in segments:
                segment_lower = segment.lower()
                if all(word in segment_lower for word in skill_words):
                    if self._is_meaningful_context(segment, skill):
                        return segment.strip()

        # 3. Third priority: Look for skill in technical sections only
        technical_segments = self._extract_technical_segments(text)
        for segment in technical_segments:
            segment_lower = segment.lower()
            if skill_lower in segment_lower:
                return segment.strip()

        # 4. Final fallback: Find any occurrence but filter out bad contexts
        if skill_lower in text_lower:
            # Use regex to find the skill with context, avoiding education sections
            pattern = r'([^.!?]*?' + re.escape(skill) + r'[^.!?]*[.!?])'
            matches = re.findall(pattern, text, re.IGNORECASE)
            for match in matches:
                if self._is_meaningful_context(match, skill):
                    return match.strip()

        return None

    def _split_into_meaningful_segments(self, text):
        """Split text into meaningful segments (not just sentences)"""
        segments = []

        # Split by sentences, bullet points, and list items
        raw_segments = re.split(r'[.!?]+|(?:\n\s*)[•\-*]\s*|(?:\n\s*)\d+\.\s*', text)

        for segment in raw_segments:
            clean_segment = segment.strip()
            # Filter out very short segments and education-like segments
            if (len(clean_segment) >= 10 and
                len(clean_segment.split()) >= 3 and
                not self._is_education_section(clean_segment)):
                segments.append(clean_segment)

        return segments

    def _is_education_section(self, text):
        """Check if text is from education section (should be excluded from evidence)"""
        education_indicators = [
            'education', 'university', 'college', 'school', 'baccalaureat',
            'bachelor', 'master', 'phd', 'degree', 'diploma', 'graduat',
            'faculty', 'institute', 'academy', 'esprit', 'tunisia'
        ]

        text_lower = text.lower()
        education_keywords = sum(1 for indicator in education_indicators if indicator in text_lower)

        # If contains multiple education keywords, likely education section
        return education_keywords >= 2

    def _is_meaningful_context(self, text, skill):
        """Check if the context is meaningful for skill evidence"""
        text_lower = text.lower()
        skill_lower = skill.lower()

        # Exclude education sections
        if self._is_education_section(text):
            return False

        # Exclude very generic contexts
        generic_contexts = [
            'education', 'university', 'college', 'school',
            'name', 'address', 'phone', 'email', 'contact'
        ]

        if any(context in text_lower for context in generic_contexts):
            return False

        # A very short segment is probably a bare list item rather than real evidence
        if len(text_lower.split()) <= 4:
            return False

        return True

    def _extract_technical_segments(self, text):
        """Extract segments from technical sections only"""
        technical_segments = []

        # Common technical section headers
        technical_headers = [
            'skills', 'experience', 'work', 'projects', 'technical',
            'technologies', 'tools', 'frameworks', 'languages',
            'proficiencies', 'expertise', 'qualifications'
        ]

        lines = text.split('\n')
        in_technical_section = False

        for line in lines:
            line_lower = line.lower().strip()

            # Check if this line starts a technical section
            if any(header in line_lower for header in technical_headers):
                in_technical_section = True
                continue

            # Check if we're leaving technical section (new major section)
            if (line_lower and
                len(line.split()) <= 4 and
                line[0].isupper() and
                not any(header in line_lower for header in technical_headers)):
                in_technical_section = False

            # If in technical section and line is meaningful, add it
            if in_technical_section and len(line.strip()) >= 5:
                technical_segments.append(line.strip())

        return technical_segments

    def _find_better_evidence(self, text, skill, current_evidence):
        """Try to find better evidence if current is from education section"""
        # Look in technical sections specifically
        technical_segments = self._extract_technical_segments(text)

        for segment in technical_segments:
            if skill.lower() in segment.lower():
                return segment.strip()

        return current_evidence  # Return original if no better evidence found

    def _truncate_evidence(self, evidence, max_length=120):
        """Truncate evidence for display"""
        if not evidence:
            return "Not specifically demonstrated"

        if len(evidence) <= max_length:
            return evidence

        truncated = evidence[:max_length]
        last_space = truncated.rfind(' ')
        if last_space > max_length * 0.6:
            return truncated[:last_space] + "..."

        return truncated + "..."

    # IMPROVED MATCHING ALGORITHM
    def _flexible_skill_matching(self, cv_skills, jd_skills):
        """Improved skill matching with better evidence validation"""
        matched_pairs = []
        missing_skills = []

        for jd_skill in jd_skills:
            best_match = None
            best_score = 0.0

            for cv_skill in cv_skills:
                overlap_score = self._keyword_overlap(cv_skill, jd_skill)
                semantic_score = self._semantic_similarity(cv_skill, jd_skill)

                combined_score = (overlap_score * self.WEIGHT_OVERLAP) + (semantic_score * self.WEIGHT_SEMANTIC)

                if combined_score > best_score:
                    best_match, best_score = cv_skill, combined_score

            # Only count as matched if we have good score
            if best_score >= self.SIMILARITY_THRESHOLD:
                matched_pairs.append((best_match, jd_skill, best_score))
            else:
                missing_skills.append(jd_skill)

        return matched_pairs, missing_skills

    def _keyword_overlap(self, s1, s2):
        """Calculate keyword overlap"""
        if not s1 or not s2:
            return 0.0

        w1 = set(re.findall(r'\b\w+\b', s1.lower()))
        w2 = set(re.findall(r'\b\w+\b', s2.lower()))

        if not w1 or not w2:
            return 0.0

        intersection = w1 & w2
        union = w1 | w2

        return len(intersection) / len(union) if union else 0.0

    def _semantic_similarity(self, s1, s2):
        """Calculate semantic similarity"""
        try:
            emb1 = self.sentence_model.encode([s1])
            emb2 = self.sentence_model.encode([s2])
            similarity = cosine_similarity(emb1, emb2)[0][0]
            return float(similarity)
        except Exception:
            return 0.0

    # MAIN ANALYSIS METHOD WITH IMPROVED EVIDENCE
    def analyze_skill_gap(self, cv_text, jd_text):
        """Main skill gap analysis with improved evidence"""
        # Store texts for evidence validation
        self.cv_text = cv_text
        self.jd_text = jd_text

        cv_skills = self.extract_meaningful_skills(cv_text)
        jd_skills = self.extract_meaningful_skills(jd_text)

        jd_skills = [re.sub(r'[\*`#]', '', skill).strip() for skill in jd_skills]

        matched_pairs, missing_skills_list = self._flexible_skill_matching(cv_skills, jd_skills)

        # Generate evidence with improved logic
        matched_evidence = []
        for cv_skill, jd_skill, sim in matched_pairs:
            cv_ev = self._find_contextual_evidence(cv_text, cv_skill)
            jd_ev = self._find_contextual_evidence(jd_text, jd_skill)

            # If evidence is from education section, try to find better evidence
            if cv_ev and self._is_education_section(cv_ev):
                cv_ev = self._find_better_evidence(cv_text, cv_skill, cv_ev)

            matched_evidence.append({
                "skill": jd_skill,
                "cv_skill": cv_skill,
                "evidence_cv": self._truncate_evidence(cv_ev),
                "evidence_jd": self._truncate_evidence(jd_ev),
                "similarity": round(sim, 2)
            })

        # Generate evidence for missing skills
        missing_evidence = []
        for skill in missing_skills_list:
            jd_full_context = self._find_contextual_evidence(jd_text, skill)
            missing_evidence.append({
                "skill": skill,
                "evidence_jd": self._truncate_evidence(jd_full_context),
                "jd_requirement_text": jd_full_context
            })

        match_percentage = (len(matched_pairs) / len(jd_skills)) * 100 if jd_skills else 0.0

        structured_missing_skills = [
            {
                "skill": item["skill"],
                "jd_requirement": item["jd_requirement_text"] or "JD requirement"
            }
            for item in missing_evidence
        ]

        evidence_data = self._create_frontend_evidence(matched_evidence, missing_evidence)

        return {
            "matched_skills": matched_evidence,
            "missing_skills": structured_missing_skills,
            "cv_skill_count": len(cv_skills),
            "jd_skill_count": len(jd_skills),
            "match_percentage": round(match_percentage, 1),
            "evidence": evidence_data
        }

    def _create_frontend_evidence(self, matched_skills, missing_skills_data):
        """Create frontend evidence structure"""
        evidence_list = []

        for skill in matched_skills:
            evidence_list.append({
                "skill": skill["skill"],
                "status": "matched",
                "cv_sentence": skill["evidence_cv"],
                "jd_sentence": skill["evidence_jd"],
                "similarity": skill["similarity"],
                "matched_skill": skill["cv_skill"]
            })

        for skill_data in missing_skills_data:
            evidence_list.append({
                "skill": skill_data["skill"],
                "status": "missing",
                "cv_sentence": "Not specifically demonstrated in CV",
                "jd_sentence": skill_data["evidence_jd"],
                "similarity": 0,
                "matched_skill": None
            })

        return evidence_list


# Global instance
dynamic_skill_analyzer = DynamicSkillAnalyzer()
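
A note on matching tool names that end in punctuation, such as `c++` or `c#`: a regex word boundary (`\b`) only exists between a word character and a non-word character, so a `\b` placed after `+` or `#` does not match when the next character is a space or the end of the string. The sketch below shows one way around this using lookarounds. The helper `find_tools` and the sample sentence are hypothetical and only illustrate the idea.

```python
import re

def find_tools(text, tools):
    """Return the tools that appear in text as whole terms (case-insensitive)."""
    text_lower = text.lower()
    found = set()
    for tool in tools:
        # (?<!\w) / (?!\w): the term must not be glued to a word character on either side.
        pattern = r"(?<!\w)" + re.escape(tool) + r"(?!\w)"
        if re.search(pattern, text_lower):
            found.add(tool)
    return found

sample = "Built services in C++ and C#, deployed with Go on Google Cloud."
found_tools = find_tools(sample, {"c++", "c#", "go", "java"})
print(sorted(found_tools))

# For comparison: the \b-anchored pattern misses 'c++' followed by a space.
print(re.search(r"\bc\+\+\b", "c++ and c#"))
```

The same lookaround pattern also keeps short names such as `go` from matching inside longer words like `google`.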

## ❓ Intelligent Interview Question Generation

### **Powered by Gemini 2.0 Flash**
- **Advanced Reasoning** - Creates targeted questions based on skill gaps
- **Structured Outputs** - JSON format with model answers and hints
- **Professional Quality** - HR-expert level question formulation

### **Question Types Generated**
1. **Behavioral Questions** - Past experience and problem-solving
2. **Technical Questions** - Specific skills and technologies
3. **Situational Questions** - Missing skills and learning approach

### **Features**
- **Model Answers** - STAR-method responses
- **Answer Hints** - Guidance for candidates
- **Key Points** - What interviewers should listen for
- **Skill Targeting** - Questions address specific matched/missing skills
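
Because the questions come back as LLM-generated JSON, the response usually needs light cleanup and validation before use. The following is a minimal sketch of that step, assuming the model may wrap its output in a markdown code fence. The helper name `parse_questions`, the choice of required keys, and the sample response are illustrative, not taken from the project code.

```python
import json
import re

# Subset of the question fields treated as mandatory in this sketch (an assumption).
REQUIRED_KEYS = {"question", "focus", "model_answer", "answer_hints", "key_points"}

def parse_questions(raw_text, num_questions=6):
    """Strip optional markdown fences, parse JSON, and keep only well-formed questions."""
    cleaned = re.sub(r"```(?:json)?\s*|\s*```", "", raw_text).strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []  # caller can switch to a fallback question set
    if not isinstance(data, list):
        return []
    valid = [q for q in data if isinstance(q, dict) and REQUIRED_KEYS.issubset(q)]
    return valid[:num_questions]

# Hypothetical model response: one complete question and one incomplete item.
raw_response = (
    "```json\n"
    '[{"question": "Describe a Django project you led.", "focus": "technical_django", '
    '"model_answer": "Situation: ...", "answer_hints": ["Name the project scope"], '
    '"key_points": ["Ownership"]}, {"question": "Incomplete item"}]\n'
    "```"
)

questions = parse_questions(raw_response)
print(len(questions), questions[0]["focus"])
```

Dropping incomplete items rather than failing the whole batch is a design choice; a stricter variant could reject the response and retry the request instead.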

In [None]:
# What this cell does : Implements AIQuestionGenerator — loads LLM + embeddings,
#   extracts and chunk-splits PDF context, generates structured interview questions (JSON),
#   and provides a lightweight QA retriever using vector embeddings.
# Inputs: candidate CV text (string), job description text (string), path to a PDF for context.
# Outputs: JSON list of interview questions (question, focus, rationale, model_answer, hints, key_points)
# Dependencies/runtime note: uses Google Generative AI (Gemini) APIs — set GOOGLE_API_KEY in env or use offline fallback.

# Imports: os (env), logging (diagnostics), re/json (parsing), langchain loaders/splitters/FAISS (RAG),
# and langchain_google_genai wrappers for Gemini LLM + embeddings.
import os
import logging
import re
import json
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_core.prompts import PromptTemplate

# The API key is read from the GOOGLE_API_KEY environment variable.
# Set it outside the notebook (shell, .env file, or secrets manager); never hardcode credentials in source.


# Initialize logger to capture info/debug messages during model initialization and generation.
logger = logging.getLogger(__name__)


# AIQuestionGenerator:
# - Responsible for initializing LLM + embedding clients (Gemini here),
# - Converting PDFs to chunked documents for retrieval,
# - Generating structured interview questions given CV+JD+skill analysis,
# - Providing a simple QA retriever (FAISS via LangChain) for answering based on chunks.
class AIQuestionGenerator:
    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv("GOOGLE_API_KEY")
        if not self.api_key:
            raise ValueError("Missing GOOGLE_API_KEY. Provide as argument or set environment variable.")
        self._initialize_models()

    def _initialize_models(self):
        logger.info("🚀 Initializing Google Gemini LLM and Embeddings models...")
        self.llm = ChatGoogleGenerativeAI(
            model="gemini-2.0-flash-exp",
            temperature=0.3,
            google_api_key=self.api_key
        )
        self.embedding_model = GoogleGenerativeAIEmbeddings(
            model="models/embedding-001",
            google_api_key=self.api_key
        )
        logger.info("✅ Models initialized successfully")

    # process_pdf: extracts text from PDF and splits into overlapping chunks for retrieval context.
    # chunk_size and chunk_overlap are chosen to balance context completeness vs number of vectors.
    def process_pdf(self, file_path):
        logger.info(f"📄 Loading and splitting PDF from {file_path}")
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        texts = text_splitter.split_documents(documents)
        return texts

    def generate_interview_questions(self, cv_text, jd_text, skill_gap_analysis, num_questions=6):
        """Original method - kept for backward compatibility"""
        return self.generate_interview_questions_with_answers(cv_text, jd_text, skill_gap_analysis, num_questions)

    def generate_interview_questions_with_answers(self, cv_text, jd_text, skill_gap_analysis, num_questions=6):
        matched = [s['skill'] if isinstance(s, dict) else s for s in skill_gap_analysis.get("matched_skills", [])]
        missing = [s['skill'] if isinstance(s, dict) else s for s in skill_gap_analysis.get("missing_skills", [])]

        prompt_text = (
            f"You are an expert HR interviewer and career coach. Based on the job description and candidate CV below, "
            f"please generate exactly {num_questions} interview questions in JSON format.\n\n"
            "STRUCTURE EACH QUESTION WITH:\n"
            "- 'question': The interview question text\n"
            "- 'focus': Which skill area this targets (behavioral, technical_django, technical_python, technical_aws, technical_gcp, situational, leadership)\n"
            "- 'rationale': Brief explanation of why this question is relevant for this candidate\n"
            "- 'skill_type': Whether this addresses 'matched_skill' or 'missing_skill'\n"
            "- 'target_skill': The specific skill being assessed\n"
            "- 'model_answer': A well-structured model answer using STAR method (Situation, Task, Action, Result)\n"
            "- 'answer_hints': List of 3-5 specific hints to help answer this question effectively\n"
            "- 'key_points': List of key points that should be covered in a good answer\n\n"
            f"MATCHED SKILLS TO ASSESS: {', '.join(matched)}\n"
            f"MISSING SKILLS TO ADDRESS: {', '.join(missing)}\n\n"
            "QUESTIONS SHOULD COVER:\n"
            "- 2 behavioral questions about past experience\n"
            "- 2 technical questions on matched skills\n"
            "- 2 situational questions addressing missing skills\n\n"
            f"Job Description:\n{jd_text[:1000]}...\n\n"
            f"Candidate CV:\n{cv_text[:1000]}...\n\n"
            "Return ONLY valid JSON array format, no other text:\n"
            '[{"question": "...", "focus": "...", "rationale": "...", "skill_type": "...", "target_skill": "...", "model_answer": "...", "answer_hints": ["hint1", "hint2"], "key_points": ["point1", "point2"]}]'
        )

        logger.info("🧠 Generating interview questions with answers and hints")
        logger.info(f"📊 Matched skills: {matched}")
        logger.info(f"📊 Missing skills: {missing}")

        try:
            response = self.llm.invoke(prompt_text)
            generated_text = response.content

            # Debug: Print the raw response
            logger.info("📨 Raw AI Response:")
            logger.info(generated_text[:500] + "..." if len(generated_text) > 500 else generated_text)

            # Clean the response - remove markdown code blocks if present
            cleaned_text = re.sub(r'```json\s*|\s*```', '', generated_text).strip()

            # Parse JSON response
            questions_data = json.loads(cleaned_text)

            # Debug: Print parsed data
            logger.info(f"✅ Successfully parsed {len(questions_data)} questions")
            for i, q in enumerate(questions_data):
                logger.info(f"📝 Question {i+1}: {q.get('question', 'No question')[:100]}...")
                logger.info(f"   Answer: {'Yes' if q.get('model_answer') else 'No'}")
                logger.info(f"   Hints: {len(q.get('answer_hints', []))}")
                logger.info(f"   Key Points: {len(q.get('key_points', []))}")

            # Ensure we have the right number of questions
            if len(questions_data) > num_questions:
                questions_data = questions_data[:num_questions]

            logger.info(f"✅ Successfully generated {len(questions_data)} questions with answers and hints")
            return questions_data

        except json.JSONDecodeError as e:
            logger.error(f"❌ Failed to parse JSON response: {e}")
            logger.error(f"Raw response: {generated_text}")
            # Fallback to simple question generation
            return self._generate_fallback_questions_with_answers(matched, missing, num_questions)
        except Exception as e:
            logger.error(f"❌ Unexpected error in question generation: {e}")
            return self._generate_fallback_questions_with_answers(matched, missing, num_questions)

    def _generate_fallback_questions_with_answers(self, matched_skills, missing_skills, num_questions):
        """Build template questions when the LLM call or JSON parsing fails"""
        logger.warning("🔄 Using fallback question generation")
        questions = []

        # Distribute questions between matched and missing skills
        num_matched = min(3, len(matched_skills))
        num_missing = min(3, len(missing_skills))

        for i in range(num_matched):
            if i < len(matched_skills):
                skill = matched_skills[i]
                questions.append({
                    "question": f"Can you describe your experience with {skill} and provide an example of a project where you used it effectively?",
                    "focus": f"technical_{skill.lower()}",
                    "rationale": f"Assessing depth of experience with {skill} which matches the job requirements",
                    "skill_type": "matched_skill",
                    "target_skill": skill,
                    "model_answer": f"In my previous role, I extensively used {skill} to develop a scalable solution. The situation required handling increasing user load. My task was to optimize performance. I implemented efficient algorithms and caching strategies using {skill}, which resulted in 40% performance improvement and better user satisfaction. The key was understanding the system constraints and applying the right {skill} features to address them.",
                    "answer_hints": [
                        f"Focus on a specific project example using {skill}",
                        "Mention the business impact of your work",
                        "Describe the technical challenges you overcame",
                        "Quantify your results with metrics",
                        f"Explain why you chose specific {skill} features"
                    ],
                    "key_points": [
                        f"Specific {skill} features used",
                        "Problem-solving approach",
                        "Measurable outcomes",
                        "Learning and improvements",
                        "Business impact"
                    ]
                })

        for i in range(num_missing):
            if i < len(missing_skills):
                skill = missing_skills[i]
                questions.append({
                    "question": f"How would you approach learning and implementing {skill} if required for this role?",
                    "focus": f"technical_{skill.lower()}",
                    "rationale": f"Exploring willingness to learn missing skill {skill}",
                    "skill_type": "missing_skill",
                    "target_skill": skill,
                    "model_answer": f"While I haven't worked extensively with {skill}, I have a strong foundation in related technologies. I would start by taking online courses and building small projects to understand the fundamentals. Then I'd seek mentorship from experienced colleagues and gradually take on more complex tasks. My experience with similar technologies would help me ramp up quickly, and I'm confident I could become productive with {skill} within a few weeks.",
                    "answer_hints": [
                        "Show enthusiasm for learning",
                        "Connect to your existing skills",
                        "Provide a concrete learning plan",
                        "Mention how you'll apply it to the role",
                        "Set realistic timeline expectations"
                    ],
                    "key_points": [
                        "Learning strategy and resources",
                        "Timeline for skill acquisition",
                        "Transferable skills",
                        "Practical application plans",
                        "Measurement of progress"
                    ]
                })

        # Fill remaining slots with behavioral questions
        while len(questions) < num_questions:
            questions.append({
                "question": "Describe a challenging project you worked on and how you overcame the main obstacles.",
                "focus": "behavioral",
                "rationale": "Assessing problem-solving and project management skills",
                "skill_type": "behavioral",
                "target_skill": "problem_solving",
                "model_answer": "In a recent project, we faced tight deadlines and technical challenges with integrating multiple legacy systems. The situation required delivering a critical feature under pressure while maintaining system stability. My task was to lead the development team and ensure on-time delivery. I organized daily standups, broke down tasks into manageable chunks, implemented agile practices, and created contingency plans. The result was successful on-time delivery with all requirements met, and the client was very satisfied with both the process and outcome.",
                "answer_hints": [
                    "Use STAR method: Situation, Task, Action, Result",
                    "Be specific about challenges and constraints",
                    "Highlight your leadership and decision-making role",
                    "Quantify the success metrics",
                    "Show what you learned from the experience"
                ],
                "key_points": [
                    "Clear problem description with context",
                    "Your specific actions and decisions",
                    "Team collaboration and communication",
                    "Measurable outcomes and impact",
                    "Lessons learned and improvements"
                ]
            })

        logger.info(f"🔄 Generated {len(questions)} fallback questions with answers")
        return questions[:num_questions]

    def create_qa_system(self, texts):
        """Create a simple QA system without complex chains"""
        vector_store = FAISS.from_documents(texts, self.embedding_model)
        retriever = vector_store.as_retriever(search_kwargs={'k': 3})
        return retriever

    def generate_all(self, cv_text, jd_text, skill_gap_analysis, pdf_path):
        """Main method to generate questions and answers"""
        try:
            # Process PDF for context
            texts = self.process_pdf(pdf_path)

            # Generate structured questions with answers
            questions = self.generate_interview_questions_with_answers(cv_text, jd_text, skill_gap_analysis)

            # Create QA system for answering
            retriever = self.create_qa_system(texts)

            return {
                "questions": questions,
                "status": "success"
            }

        except Exception as e:
            logger.error(f"❌ Error in generate_all: {e}")
            return {
                "questions": [],
                "status": "error",
                "error": str(e)
            }

# Instantiate the question generator (initializes the Gemini LLM and embedding models)
ai_question_generator = AIQuestionGenerator()

🚀 Initializing Google Gemini LLM and Embeddings models...
✅ Models initialized successfully


## 🔄 Complete CV-JD Analysis Pipeline

### **Multi-Format Input Support**
- **📋 Manual Text Input** - Paste CV and job description directly
- **📁 Text File Upload** - Upload .txt files
- **📄 PDF File Upload** - Automatic text extraction from PDF resumes

### **Analysis Workflow**
1. **Input Processing** - Handle multiple file formats and text encodings
2. **Skill Extraction** - Identify skills from both CV and job description
3. **Gap Analysis** - Calculate match percentage and identify missing skills
4. **Question Generation** - Create targeted interview questions
5. **Professional Reporting** - Formatted results with evidence
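
Step 3 of the workflow can be illustrated with a minimal sketch. It assumes the skills have already been extracted as plain strings and compares them by exact, case-insensitive match. The pipeline reports a similarity value for each matched skill, so `compute_match_percentage` below is a hypothetical stand-in for illustration, not the `DynamicSkillAnalyzer` implementation.

```python
def compute_match_percentage(cv_skills, jd_skills):
    """Return (percentage of JD skills found in the CV, matched, missing)."""
    # Normalise case and whitespace so "Python " and "python" compare equal.
    cv_set = {s.strip().lower() for s in cv_skills}
    jd_set = {s.strip().lower() for s in jd_skills}
    if not jd_set:
        # No requirements listed: avoid dividing by zero.
        return 0.0, [], []
    matched = sorted(jd_set & cv_set)
    missing = sorted(jd_set - cv_set)
    return round(100.0 * len(matched) / len(jd_set), 1), matched, missing


pct, matched, missing = compute_match_percentage(
    ["Python", "Django", "Docker"],        # hypothetical CV skills
    ["python", "django", "aws", "gcp"],    # hypothetical JD skills
)
print(pct, matched, missing)
```

The percentage is taken over the job description's skills rather than the CV's, so extra CV skills that the role does not ask for do not inflate the score.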

### **User Experience**
- **Interactive Prompts** - Guide users through the process
- **Progress Indicators** - Show analysis stages
- **Preview Features** - Display extracted text for verification
- **Error Handling** - Graceful fallbacks and user-friendly messages
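
One of the graceful fallbacks is decoding uploaded text whose encoding is unknown. The sketch below shows the idea in isolation; the helper name `decode_with_fallback` is an assumption introduced here for illustration.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode uploaded bytes, preferring UTF-8 and falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every possible byte to a character, so this cannot fail.
        return raw.decode("latin-1")


print(decode_with_fallback("café".encode("utf-8")))
print(decode_with_fallback("café".encode("latin-1")))
```

Because Latin-1 accepts any byte sequence, a third fallback after it is never reached; the trade-off is that text in some other legacy encoding may decode to the wrong characters rather than raise an error.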

In [83]:
# Cell 3: Main Analysis Pipeline with PDF Support
import logging
import json
from IPython.display import display, Markdown, HTML
import os
import tempfile

# Setup basic logging
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)

def extract_text_from_pdf(pdf_content):
    """Extract text from PDF content using PyPDFLoader"""
    temp_path = None
    try:
        # Save PDF content to temporary file
        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as temp_file:
            temp_file.write(pdf_content)
            temp_path = temp_file.name

        # Use PyPDFLoader to extract text
        from langchain_community.document_loaders import PyPDFLoader
        loader = PyPDFLoader(temp_path)
        documents = loader.load()

        # Combine all pages into single text
        return "\n".join([doc.page_content for doc in documents])
    except Exception as e:
        print(f"❌ Error extracting text from PDF: {e}")
        return None
    finally:
        # Always remove the temporary file, even if extraction failed
        if temp_path and os.path.exists(temp_path):
            os.unlink(temp_path)

def get_text_input(prompt_text, input_type="CV"):
    """Get text input from user with choice of manual entry or file upload"""

    print(f"\n📝 {input_type} INPUT OPTIONS:")
    print("1. 📋 Paste text manually")
    print("2. 📁 Upload text file (.txt)")
    print("3. 📄 Upload PDF file (.pdf)")

    while True:
        try:
            choice = input(f"Choose {input_type} input method (1, 2, or 3): ").strip()
            if choice == '1':
                return get_manual_text_input(input_type)
            elif choice == '2':
                return get_text_file_input(input_type)
            elif choice == '3':
                return get_pdf_file_input(input_type)
            else:
                print("❌ Please enter 1, 2, or 3")
        except KeyboardInterrupt:
            print("\n⏹️ Input cancelled")
            return None

def get_manual_text_input(input_type):
    """Get manual text input from user"""
    print(f"\n📋 PASTE {input_type} TEXT:")
    print("   (Paste the content below, then press Enter twice to finish)")
    print("-" * 50)

    lines = []
    empty_line_count = 0

    while True:
        try:
            line = input()
            if line.strip() == "":
                empty_line_count += 1
                if empty_line_count >= 2 and len(lines) > 0:
                    break
            else:
                empty_line_count = 0
            lines.append(line)
        except EOFError:
            break
        except KeyboardInterrupt:
            print("\n⏹️ Input cancelled")
            return None

    text = "\n".join(lines)
    print(f"✅ {input_type} text received ({len(text)} characters)")
    return text

def get_text_file_input(input_type):
    """Get text from .txt file upload"""
    print(f"\n📁 {input_type} TEXT FILE UPLOAD:")

    try:
        from google.colab import files
        print("📤 Please upload your .txt file...")
        uploaded = files.upload()

        if uploaded:
            filename = list(uploaded.keys())[0]
            # Try UTF-8 first; Latin-1 accepts any byte sequence, so it is a safe last resort
            try:
                content = uploaded[filename].decode('utf-8')
            except UnicodeDecodeError:
                content = uploaded[filename].decode('latin-1')

            print(f"✅ Text file '{filename}' uploaded successfully ({len(content)} characters)")
            return content
        else:
            print("❌ No file uploaded. Switching to manual input...")
            return get_manual_text_input(input_type)

    except ImportError:
        # For local environment - file path input
        print("🌐 Local environment detected - enter text file path:")
        while True:
            file_path = input("Enter file path: ").strip()
            if os.path.exists(file_path):
                try:
                    with open(file_path, 'r', encoding='utf-8') as file:
                        content = file.read()
                except UnicodeDecodeError:
                    # Latin-1 accepts any byte sequence, so this read cannot fail on decoding
                    with open(file_path, 'r', encoding='latin-1') as file:
                        content = file.read()

                print(f"✅ Text file loaded successfully ({len(content)} characters)")
                return content
            else:
                print("❌ File not found. Please enter a valid file path or press Ctrl+C to cancel")
                retry = input("Try again? (y/n): ").strip().lower()
                if retry != 'y':
                    return get_manual_text_input(input_type)

def get_pdf_file_input(input_type):
    """Get text from PDF file upload"""
    print(f"\n📄 {input_type} PDF FILE UPLOAD:")

    try:
        from google.colab import files
        print("📤 Please upload your PDF file...")
        uploaded = files.upload()

        if uploaded:
            filename = list(uploaded.keys())[0]
            if not filename.lower().endswith('.pdf'):
                print("❌ Please upload a PDF file. Switching to manual input...")
                return get_manual_text_input(input_type)

            print(f"⏳ Extracting text from PDF '{filename}'...")
            pdf_content = uploaded[filename]

            # Extract text from PDF
            text_content = extract_text_from_pdf(pdf_content)

            if text_content:
                print(f"✅ PDF text extracted successfully ({len(text_content)} characters)")
                # Show preview of extracted text
                preview = text_content[:500] + "..." if len(text_content) > 500 else text_content
                print(f"📖 Extracted text preview:\n{preview}\n")
                return text_content
            else:
                print("❌ Failed to extract text from PDF. Switching to manual input...")
                return get_manual_text_input(input_type)
        else:
            print("❌ No file uploaded. Switching to manual input...")
            return get_manual_text_input(input_type)

    except ImportError:
        # For local environment - file path input
        print("🌐 Local environment detected - enter PDF file path:")
        while True:
            file_path = input("Enter PDF file path: ").strip()
            if os.path.exists(file_path) and file_path.lower().endswith('.pdf'):
                try:
                    with open(file_path, 'rb') as file:
                        pdf_content = file.read()

                    print(f"⏳ Extracting text from PDF...")
                    text_content = extract_text_from_pdf(pdf_content)

                    if text_content:
                        print(f"✅ PDF text extracted successfully ({len(text_content)} characters)")
                        preview = text_content[:500] + "..." if len(text_content) > 500 else text_content
                        print(f"📖 Extracted text preview:\n{preview}\n")
                        return text_content
                    else:
                        print("❌ Failed to extract text from PDF.")
                        retry = input("Try another file? (y/n): ").strip().lower()
                        if retry != 'y':
                            return get_manual_text_input(input_type)
                except Exception as e:
                    print(f"❌ Error reading PDF file: {e}")
                    retry = input("Try another file? (y/n): ").strip().lower()
                    if retry != 'y':
                        return get_manual_text_input(input_type)
            else:
                print("❌ PDF file not found or not a PDF. Please enter a valid PDF file path.")
                retry = input("Try again? (y/n): ").strip().lower()
                if retry != 'y':
                    return get_manual_text_input(input_type)

def run_complete_analysis():
    """Main function to run complete CV-JD analysis and question generation"""

    print("🎯 ===========================================")
    print("🎯 CV-JD SKILL GAP ANALYSIS & QUESTION GENERATOR")
    print("🎯 ===========================================")
    print()
    print("Choose input method for CV and JD:")
    print("• Option 1: 📋 Paste text manually")
    print("• Option 2: 📁 Upload text files (.txt)")
    print("• Option 3: 📄 Upload PDF files (.pdf)")
    print()

    # Get CV input
    cv_text = get_text_input("Enter Candidate CV", "CV")
    if cv_text is None:
        print("❌ CV input cancelled")
        return None

    # Get JD input
    jd_text = get_text_input("Enter Job Description", "JD")
    if jd_text is None:
        print("❌ JD input cancelled")
        return None

    print()
    print("⏳ Analyzing skills and generating questions...")
    print()

    try:
        # Step 1: Run Skill Gap Analysis
        logger.info("🔍 Running skill gap analysis...")
        skill_gap_analysis = dynamic_skill_analyzer.analyze_skill_gap(cv_text, jd_text)

        # Display Skill Gap Results
        display(Markdown("## 📊 SKILL GAP ANALYSIS RESULTS"))
        display(Markdown(f"**Match Percentage:** {skill_gap_analysis['match_percentage']}%"))
        display(Markdown(f"**CV Skills Found:** {skill_gap_analysis['cv_skill_count']}"))
        display(Markdown(f"**JD Skills Required:** {skill_gap_analysis['jd_skill_count']}"))

        # Display Matched Skills
        if skill_gap_analysis['matched_skills']:
            display(Markdown("### ✅ MATCHED SKILLS"))
            for i, skill in enumerate(skill_gap_analysis['matched_skills'], 1):
                display(Markdown(f"**{i}. {skill['skill']}** (Similarity: {skill['similarity']})"))
                display(Markdown(f"   - CV Evidence: *{skill['evidence_cv']}*"))
                display(Markdown(f"   - JD Requirement: *{skill['evidence_jd']}*"))
                print()

        # Display Missing Skills
        if skill_gap_analysis['missing_skills']:
            display(Markdown("### ❌ MISSING SKILLS"))
            for i, skill in enumerate(skill_gap_analysis['missing_skills'], 1):
                display(Markdown(f"**{i}. {skill['skill']}**"))
                if skill.get('jd_requirement'):
                    display(Markdown(f"   - JD Context: *{skill['jd_requirement']}*"))
                print()

        print()
        print("🔄 Generating interview questions based on analysis...")
        print()

        # Step 2: Generate Interview Questions
        questions_result = ai_question_generator.generate_interview_questions_with_answers(
            cv_text, jd_text, skill_gap_analysis
        )

        # Display Generated Questions
        display(Markdown("## 🎯 GENERATED INTERVIEW QUESTIONS"))
        display(Markdown(f"**Total Questions Generated:** {len(questions_result)}"))

        for i, question in enumerate(questions_result, 1):
            display(Markdown(f"### ❓ Question {i}: {question['question']}"))
            display(Markdown(f"**Focus:** {question['focus']} | **Skill Type:** {question['skill_type']}"))
            display(Markdown(f"**Target Skill:** {question['target_skill']}"))
            display(Markdown(f"**Rationale:** {question['rationale']}"))

            display(Markdown("#### 💡 Model Answer:"))
            display(Markdown(f"{question['model_answer']}"))

            display(Markdown("#### 🎯 Answer Hints:"))
            for j, hint in enumerate(question['answer_hints'], 1):
                display(Markdown(f"{j}. {hint}"))

            display(Markdown("#### 🔑 Key Points to Cover:"))
            for j, point in enumerate(question['key_points'], 1):
                display(Markdown(f"{j}. {point}"))

            print("\n" + "="*80 + "\n")

        # Return the complete results for further use
        return {
            "skill_gap_analysis": skill_gap_analysis,
            "questions": questions_result,
            "status": "success"
        }

    except Exception as e:
        logger.error(f"❌ Error in analysis pipeline: {e}")
        display(Markdown("## ❌ ERROR"))
        display(Markdown(f"An error occurred during analysis: `{str(e)}`"))
        return {
            "skill_gap_analysis": None,
            "questions": [],
            "status": "error",
            "error": str(e)
        }

## 📊 Professional CV Assessment Engine

### **Gemini 2.0 Flash Integration**
- **LangChain Framework** - Structured AI interactions
- **Professional Prompting** - HR-expert level analysis
- **Structured JSON Output** - Consistent, parseable results
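
Model replies often wrap the JSON in a Markdown code fence or add surrounding prose, so the structured output has to be extracted before parsing. The following is a minimal sketch of that step under those assumptions; `extract_json` is a hypothetical helper, not part of LangChain or the Gemini SDK.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker, built without typing it literally


def extract_json(raw: str):
    """Return the first JSON object or array found in an LLM reply, or None."""
    # Drop code-fence markers (with an optional "json" tag) if the model added them.
    text = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw).strip()
    # Grab the outermost object or array in whatever text remains.
    match = re.search(r"(\{.*\}|\[.*\])", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


reply = FENCE + 'json\n{"overall_score": "7/10", "strengths": ["Django"]}\n' + FENCE
print(extract_json(reply))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or switch to a fallback analysis.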

### **Analysis Dimensions**
1. **Overall Scoring** - Realistic 5-9 scale based on content quality
2. **Strengths & Weaknesses** - Specific to actual CV content
3. **Actionable Recommendations** - Concrete improvement suggestions
4. **Content Analysis** - Clarity, relevance, achievements, uniqueness
5. **Skill Analysis** - Technical and soft skills evaluation
6. **Formatting Assessment** - Readability, structure, professionalism
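
The overall score arrives as free text such as `"7/10"`, so downstream code may want a number. The sketch below is one way to normalise it, assuming the first number in the reply is the score and that values outside the 5-9 range should be clamped; both the helper `parse_score` and the clamping behaviour are assumptions made for illustration.

```python
import re


def parse_score(raw, low=5.0, high=9.0):
    """Turn a reply such as '7.5/10' into a float clamped to [low, high]."""
    # Take the first integer or decimal number that appears in the reply.
    match = re.search(r"\d+(?:\.\d+)?", str(raw))
    if not match:
        return None
    return max(low, min(high, float(match.group())))


print(parse_score("7.5/10"))
```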

### **Advanced Features**
- **Evidence-Based** - References specific technologies and experiences
- **Non-Generic** - Feedback applies only to the specific CV
- **Professional Standards** - HR-grade assessment quality

In [84]:
# cv_analyzer.py - USING LANGCHAIN WITH GEMINI 2.0 FLASH
import logging
import json
import re
import random
from typing import Dict, List, Any
import os
from datetime import datetime

try:
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain_core.messages import HumanMessage, SystemMessage
    LANGCHAIN_AVAILABLE = True
except ImportError:
    LANGCHAIN_AVAILABLE = False
    logging.warning("LangChain not available. Install with: pip install langchain-google-genai")

logger = logging.getLogger(__name__)

class CVAnalyzer:
    def __init__(self):
        # No default value: a missing key must be falsy so the check below can detect it
        self.api_key = os.getenv('GOOGLE_API_KEY')

        if LANGCHAIN_AVAILABLE and self.api_key:
            try:
                self.llm = ChatGoogleGenerativeAI(
                    model="gemini-2.0-flash-exp",
                    temperature=0.7,
                    google_api_key=self.api_key
                )
                self.langchain_available = True
                logger.info("LangChain with Gemini 2.0 Flash initialized successfully")
            except Exception as e:
                logger.error(f"Failed to initialize LangChain: {e}")
                self.langchain_available = False
        else:
            self.langchain_available = False
            logger.warning("LangChain not available or API key missing")

    def analyze_cv(self, cv_text: str) -> Dict[str, Any]:
        """
        Use Gemini 2.0 Flash via LangChain for CV analysis
        """
        logger.info("Starting CV analysis with Gemini 2.0 Flash...")

        # Try LangChain with Gemini first
        if self.langchain_available:
            gemini_result = self._try_langchain_gemini(cv_text)
            if gemini_result:
                logger.info("Successfully got Gemini 2.0 Flash analysis via LangChain")
                return gemini_result

        # Fallback to intelligent analysis
        logger.info("Using intelligent fallback analysis")
        return self._create_intelligent_fallback(cv_text)

    def _try_langchain_gemini(self, cv_text: str) -> Dict[str, Any]:
        """
        Use LangChain with Gemini 2.0 Flash Experimental
        """
        try:
            prompt = self._create_langchain_prompt(cv_text)

            messages = [
                SystemMessage(content="You are an expert CV consultant with 15+ years experience in tech recruitment."),
                HumanMessage(content=prompt)
            ]

            logger.info("Sending request to Gemini 2.0 Flash via LangChain...")
            response = self.llm.invoke(messages)

            if response and hasattr(response, 'content'):
                content = response.content
                logger.info(f"Gemini 2.0 Flash response received: {len(content)} characters")
                return self._parse_gemini_response(content, cv_text)
            else:
                logger.error("No content in Gemini response")

        except Exception as e:
            logger.error(f"LangChain Gemini failed: {str(e)}")

        return None

    def _create_langchain_prompt(self, cv_text: str) -> str:
        """Create optimized prompt for LangChain with Gemini"""
        return f"""
        You are an expert CV consultant specializing in technology professionals.
        Analyze the following CV in extreme detail and provide highly specific, actionable feedback, including all recommendations, all areas for improvement, and all strengths.

        CRITICAL REQUIREMENTS:
        - Be EXTREMELY specific to the ACTUAL content of this CV
        - Reference exact technologies, projects, experiences, and achievements mentioned
        - Provide feedback that ONLY applies to this specific CV
        - Do NOT use generic phrases that could apply to any CV
        - Score realistically based on actual content quality (5-9 range)
        - Focus on actionable improvements with clear examples

        CV CONTENT TO ANALYZE:
        {cv_text[:3500]}

        Provide your analysis in this EXACT JSON format:

        {{
            "overall_score": "realistic_score/10",
            "strengths": [
                "specific strength mentioning actual technologies or experiences from CV",
                "specific strength about content quality or structure",
                "specific strength about skills or achievements"
            ],
            "weaknesses": [
                "specific weakness with concrete examples from CV",
                "specific area needing improvement with references to actual content"
            ],
            "recommendations": [
                "highly specific, actionable recommendation tied to CV content",
                "concrete suggestion with examples of how to implement",
                "specific improvement that addresses actual gaps found"
            ],
            "content_analysis": {{
                "clarity": "specific assessment of writing clarity based on actual text",
                "relevance": "assessment of relevance for the roles/technologies mentioned",
                "achievements": "detailed evaluation of achievement presentation quality",
                "uniqueness": "what makes this CV unique based on its actual content"
            }},
            "skill_analysis": {{
                "technical_skills": "detailed assessment of technical skills mentioned",
                "soft_skills": "evaluation of soft skills presentation",
                "skill_gaps": "specific skill gaps identified from content",
                "skill_strengths": "particular technical strengths found"
            }},
            "formatting_analysis": {{
                "readability": "specific readability assessment with examples",
                "structure": "evaluation of CV structure and organization",
                "professionalism": "assessment of professional presentation"
            }},
            "key_insights": [
                "unique insight specific to this CV's content",
                "observation about career trajectory or specialization",
                "notable pattern or standout feature in CV"
            ]
        }}

        IMPORTANT:
        - If you see Python, Django, AWS, or other specific technologies, mention them by name
        - If you see leadership experience, quantify it specifically
        - If you see projects or achievements, reference them directly
        - Be brutally honest and constructive
        - Provide feedback that would only make sense for THIS specific CV
        """

    def _parse_gemini_response(self, content: str, cv_text: str) -> Dict[str, Any]:
        """Parse Gemini response from LangChain"""
        try:
            # Try to extract JSON from the response
            json_match = re.search(r'\{.*\}', content, re.DOTALL)
            if json_match:
                json_str = json_match.group()

                # Clean common JSON issues (trailing commas before a closing brace or bracket).
                # Escaped quotes (\") are left alone: un-escaping them would break valid JSON strings.
                json_str = re.sub(r',\s*}', '}', json_str)
                json_str = re.sub(r',\s*]', ']', json_str)

                gemini_data = json.loads(json_str)

                # Enhance with metadata
                enhanced_data = self._enhance_gemini_data(gemini_data, cv_text)
                return enhanced_data

        except json.JSONDecodeError as e:
            logger.error(f"JSON decode error: {e}")
            logger.error(f"Problematic content: {content[:500]}...")

        # If JSON parsing fails, structure the text response
        return self._structure_text_response(content, cv_text)

    def _enhance_gemini_data(self, gemini_data: Dict[str, Any], cv_text: str) -> Dict[str, Any]:
        """Enhance and validate Gemini data"""
        enhanced = {
            "overall_score": gemini_data.get('overall_score', str(self._calculate_fallback_score(cv_text))),
            "strengths": self._ensure_array(gemini_data.get('strengths'), [
                "Professional content quality and structure",
                "Clear career progression information"
            ]),
            "weaknesses": self._ensure_array(gemini_data.get('weaknesses'), [
                "Opportunity for more specific achievement quantification"
            ]),
            "recommendations": self._ensure_array(gemini_data.get('recommendations'), [
                "Add specific metrics to key achievements",
                "Enhance project descriptions with technical details",
                "Highlight leadership experience with team sizes"
            ]),
            "content_analysis": {
                "clarity": gemini_data.get('content_analysis', {}).get('clarity', "Clear professional writing style"),
                "relevance": gemini_data.get('content_analysis', {}).get('relevance', "Relevant for technical roles"),
                "achievements": gemini_data.get('content_analysis', {}).get('achievements', "Good achievement foundation"),
                "uniqueness": gemini_data.get('content_analysis', {}).get('uniqueness', "Unique professional background")
            },
            "skill_analysis": {
                "technical_skills": gemini_data.get('skill_analysis', {}).get('technical_skills', "Solid technical foundation"),
                "soft_skills": gemini_data.get('skill_analysis', {}).get('soft_skills', "Good interpersonal skills"),
                "skill_gaps": gemini_data.get('skill_analysis', {}).get('skill_gaps', "Opportunities for skill expansion"),
                "skill_strengths": gemini_data.get('skill_analysis', {}).get('skill_strengths', "Strong core competencies")
            },
            "formatting_analysis": {
                "readability": gemini_data.get('formatting_analysis', {}).get('readability', "Generally readable format"),
                "structure": gemini_data.get('formatting_analysis', {}).get('structure', "Good structural organization"),
                "professionalism": gemini_data.get('formatting_analysis', {}).get('professionalism', "Professional presentation")
            },
            "key_insights": self._ensure_array(gemini_data.get('key_insights'), [
                "Good foundation with specific enhancement opportunities",
                "Well-positioned for technical role applications"
            ]),
            "analysis_source": "gemini_2.0_flash_langchain",
            "analysis_timestamp": datetime.now().isoformat(),
            "content_preview": cv_text[:200] + "..." if len(cv_text) > 200 else cv_text,
            "model_used": "gemini-2.0-flash-exp",
            "ai_generated": True
        }

        return enhanced

    def _structure_text_response(self, text: str, cv_text: str) -> Dict[str, Any]:
        """Structure text response when JSON parsing fails"""
        logger.info("Structuring text response from Gemini")

        # Extract information from text
        analysis = self._analyze_response_text(text)

        return {
            "overall_score": analysis.get('score', '7'),
            "strengths": analysis.get('strengths', ["Professional CV with good structure"]),
            "weaknesses": analysis.get('weaknesses', ["Opportunity to enhance specific achievements"]),
            "recommendations": analysis.get('recommendations', [
                "Add quantifiable metrics to achievements",
                "Enhance technical skill descriptions",
                "Improve project documentation"
            ]),
            "content_analysis": {
                "clarity": analysis.get('clarity', "Clear and professional writing"),
                "relevance": analysis.get('relevance', "Relevant for target roles"),
                "achievements": analysis.get('achievements', "Good achievement foundation"),
                "uniqueness": analysis.get('uniqueness', "Unique professional elements")
            },
            "skill_analysis": {
                "technical_skills": analysis.get('technical_skills', "Solid technical skills"),
                "soft_skills": analysis.get('soft_skills', "Good soft skills presentation"),
                "skill_gaps": analysis.get('skill_gaps', "Some skill enhancement opportunities"),
                "skill_strengths": analysis.get('skill_strengths', "Strong core competencies")
            },
            "formatting_analysis": {
                "readability": analysis.get('readability', "Generally good readability"),
                "structure": analysis.get('structure', "Well-structured format"),
                "professionalism": analysis.get('professionalism', "Professional presentation")
            },
            "key_insights": analysis.get('insights', [
                "Good potential for specific enhancements",
                "Solid foundation for technical roles"
            ]),
            "analysis_source": "gemini_2.0_flash_text",
            "analysis_timestamp": datetime.now().isoformat(),
            "content_preview": cv_text[:200] + "..." if len(cv_text) > 200 else cv_text,
            "model_used": "gemini-2.0-flash-exp",
            "ai_generated": True
        }

    def _analyze_response_text(self, text: str) -> Dict[str, Any]:
        """Analyze text response from Gemini"""
        analysis = {}

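        # Extract score: require an explicit "N/10" or a number after "score",
        # so unrelated numbers (years, percentages) are not mistaken for the score
        score_match = re.search(r'(\d+)\s*/\s*10|score\D*?(\d+)', text.lower())
        analysis['score'] = (score_match.group(1) or score_match.group(2)) if score_match else '7'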

        # Extract sections
        sections = self._extract_sections_from_text(text)
        analysis.update(sections)

        return analysis

    def _extract_sections_from_text(self, text: str) -> Dict[str, Any]:
        """Extract sections from text response"""
        sections = {}

        section_patterns = {
            'strengths': r'(?:strengths?|positives?).*?[:\n](.*?)(?=weaknesses|improvements|recommendations|$)',
            'weaknesses': r'(?:weaknesses?|improvements?).*?[:\n](.*?)(?=strengths|recommendations|$)',
            'recommendations': r'(?:recommendations?|suggestions?).*?[:\n](.*?)(?=strengths|weaknesses|$)'
        }

        text_lower = text.lower()

        for section, pattern in section_patterns.items():
            match = re.search(pattern, text_lower, re.IGNORECASE | re.DOTALL)
            if match:
                content = match.group(1).strip()
                items = self._extract_items(content)
                if items:
                    sections[section] = items

        return sections

    def _extract_items(self, text: str) -> List[str]:
        """Extract items from text"""
        items = []

        # Multiple extraction strategies
        lines = text.split('\n')
        for line in lines:
            line = line.strip()
            # Skip empty lines and very short lines
            if line and len(line) > 15:
                # Remove bullet points and numbers
                clean_line = re.sub(r'^[•\-*\d\.\s]+', '', line)
                if clean_line and len(clean_line) > 10:
                    items.append(clean_line)

        return items[:4] if items else []

    def _ensure_array(self, value: Any, default: List[str]) -> List[str]:
        """Ensure value is a proper array"""
        if isinstance(value, list) and len(value) > 0:
            return value
        return default

    def _calculate_fallback_score(self, cv_text: str) -> int:
        """Calculate fallback score based on content quality"""
        cv_lower = cv_text.lower()

        score = 6
        if len(cv_text) > 1200:
            score += 1
        if any(tech in cv_lower for tech in ['python', 'django', 'aws', 'docker']):
            score += 1
        if len(re.findall(r'\d+%|\$\d+', cv_text)) >= 2:
            score += 1

        return min(9, max(5, score))

    def _create_intelligent_fallback(self, cv_text: str) -> Dict[str, Any]:
        """
        High-quality fallback when Gemini fails
        """
        cv_lower = cv_text.lower()

        # Analyze content
        analysis = self._analyze_content_intelligently(cv_text, cv_lower)

        return {
            "overall_score": str(analysis['score']),
            "strengths": analysis['strengths'],
            "weaknesses": analysis['weaknesses'],
            "recommendations": analysis['recommendations'],
            "content_analysis": {
                "clarity": analysis['clarity'],
                "relevance": analysis['relevance'],
                "achievements": analysis['achievements'],
                "uniqueness": analysis['uniqueness']
            },
            "skill_analysis": {
                "technical_skills": analysis['technical_skills'],
                "soft_skills": analysis['soft_skills'],
                "skill_gaps": analysis['skill_gaps'],
                "skill_strengths": analysis['skill_strengths']
            },
            "formatting_analysis": {
                "readability": analysis['readability'],
                "structure": analysis['structure'],
                "professionalism": analysis['professionalism']
            },
            "key_insights": analysis['insights'],
            "analysis_source": "intelligent_fallback",
            "analysis_timestamp": datetime.now().isoformat(),
            "content_preview": cv_text[:200] + "..." if len(cv_text) > 200 else cv_text,
            "model_used": "fallback_analysis",
            "ai_generated": False
        }

    def _analyze_content_intelligently(self, cv_text: str, cv_lower: str) -> Dict[str, Any]:
        """Intelligent content analysis"""
        # Simple technology analysis
        technologies = []
        tech_keywords = ['python', 'django', 'aws', 'docker', 'javascript', 'react', 'java', 'sql']
        for tech in tech_keywords:
            if tech in cv_lower:
                technologies.append(tech)

        # Basic metrics
        word_count = len(cv_text.split())
        bullet_count = cv_text.count('•') + cv_text.count('- ')
        achievement_count = len(re.findall(r'\d+%|\$\d+', cv_text))

        return {
            'score': self._calculate_intelligent_score(technologies, word_count, bullet_count, achievement_count),
            'strengths': self._generate_intelligent_strengths(technologies, word_count, cv_text),
            'weaknesses': self._generate_intelligent_weaknesses(technologies, bullet_count, achievement_count),
            'recommendations': self._generate_intelligent_recommendations(technologies, achievement_count),
            'clarity': "Very clear" if word_count > 1000 else "Clear" if word_count > 500 else "Generally clear",
            'relevance': "Highly relevant" if len(technologies) >= 3 else "Relevant",
            'achievements': "Strong" if achievement_count >= 3 else "Good" if achievement_count >= 1 else "Needs improvement",
            'uniqueness': "Unique combination of skills" if len(technologies) >= 4 else "Good professional background",
            'technical_skills': f"Strong in {', '.join(technologies[:3])}" if technologies else "Good technical foundation",
            'soft_skills': "Well demonstrated" if any(skill in cv_lower for skill in ['communication', 'leadership', 'teamwork']) else "Good indication",
            'skill_gaps': "Opportunity to expand" if len(technologies) < 5 else "Well-rounded",
            'skill_strengths': f"Expertise in {', '.join(technologies[:2])}" if technologies else "Solid core skills",
            'readability': "Excellent" if bullet_count >= 8 else "Good" if bullet_count >= 4 else "Needs improvement",
            'structure': "Professional" if any(section in cv_lower for section in ['experience', 'education', 'skills']) else "Good",
            'professionalism': "Highly professional",
            'insights': [
                f"Strong foundation in {len(technologies)} technologies" if technologies else "Good professional foundation",
                "Opportunity to enhance quantitative achievements" if achievement_count < 3 else "Good achievement documentation"
            ]
        }

    def _calculate_intelligent_score(self, technologies: List[str], word_count: int, bullet_count: int, achievement_count: int) -> int:
        """Calculate intelligent score"""
        score = 6

        if len(technologies) >= 3:
            score += 1
        if word_count > 800:
            score += 1
        if bullet_count >= 5:
            score += 1
        if achievement_count >= 2:
            score += 1

        return min(9, max(5, score))

    def _generate_intelligent_strengths(self, technologies: List[str], word_count: int, cv_text: str) -> List[str]:
        """Generate intelligent strengths"""
        strengths = []

        if technologies:
            strengths.append(f"Strong technical skills in {', '.join(technologies[:3])}")

        if word_count > 1000:
            strengths.append("Comprehensive and detailed professional background")

        if any(term in cv_text.lower() for term in ['led', 'managed', 'directed']):
            strengths.append("Clear leadership and management experience")

        return strengths if strengths else ["Professional presentation with clear information"]

    def _generate_intelligent_weaknesses(self, technologies: List[str], bullet_count: int, achievement_count: int) -> List[str]:
        """Generate intelligent weaknesses"""
        weaknesses = []

        if achievement_count < 2:
            weaknesses.append("Limited quantifiable achievements - add specific metrics")

        if bullet_count < 5:
            weaknesses.append("Could benefit from more bullet points for better readability")

        if len(technologies) < 2:
            weaknesses.append("Opportunity to showcase more technical skills")

        return weaknesses if weaknesses else ["Opportunity to enhance specific accomplishments"]

    def _generate_intelligent_recommendations(self, technologies: List[str], achievement_count: int) -> List[str]:
        """Generate intelligent recommendations"""
        recommendations = [
            "Add 2-3 quantifiable achievements with specific numbers and percentages",
            "Use bullet points consistently throughout role descriptions",
            "Highlight key projects and their business impact"
        ]

        if 'python' in technologies:
            recommendations.append("Showcase specific Python projects or contributions")

        if 'aws' in technologies:
            recommendations.append("Detail specific AWS services and architectures used")

        if achievement_count < 2:
            recommendations.append("Use the STAR method (Situation, Task, Action, Result) for achievements")

        return recommendations

# Create the instance for import
cv_analyzer = CVAnalyzer()

LangChain with Gemini 2.0 Flash initialized successfully
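
When Gemini returns prose instead of JSON, the text fallback in `_analyze_response_text` depends on a regular expression to recover a score. The standalone sketch below shows one way such an extraction might behave. The function name `extract_score` is hypothetical and not part of `CVAnalyzer`; the default of `'7'` is taken from the method above.

```python
import re

def extract_score(text: str, default: str = "7") -> str:
    """Return an explicit 'N/10' or 'score ... N' value found in text, else default."""
    # Alternative 1 captures the N in "N/10"; alternative 2 captures the first
    # number that follows the word "score".
    match = re.search(r"(\d+)\s*/\s*10|score\D*?(\d+)", text.lower())
    if not match:
        return default
    return match.group(1) or match.group(2)

print(extract_score("Overall Score: 8/10. Strong Python background."))
print(extract_score("With 5 years of experience, rating 9/10"))
print(extract_score("The CV is solid but lacks metrics."))
```

Requiring either the `/10` suffix or the word "score" keeps unrelated numbers, such as years of experience, from being read as the rating.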


## 🎪 Interactive CV Analysis Demo

### **User-Friendly Interface**
- **Multiple Input Methods** - Text, file upload, PDF processing
- **Real-time Preview** - Show extracted content before analysis
- **Professional Display** - Formatted results with collapsible sections

### **Demo Features**
- **Live Analysis** - Process CVs in real-time
- **Comprehensive Reports** - All analysis dimensions displayed
- **Raw Data Access** - JSON output for developers
- **Error Resilience** - Handles various file formats and encodings
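
The encoding handling mentioned under Error Resilience can be sketched as a small helper. This is a minimal illustration, assuming uploads arrive as raw `bytes`; `decode_cv_bytes` is a hypothetical name and not a function defined in the notebook.

```python
def decode_cv_bytes(raw: bytes) -> str:
    """Decode uploaded bytes, trying UTF-8 first and Latin-1 as a fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this call cannot fail,
        # although text in other encodings may come out as the wrong characters.
        return raw.decode("latin-1")

print(decode_cv_bytes("Résumé".encode("utf-8")))
print(decode_cv_bytes("Résumé".encode("latin-1")))
```

Because Latin-1 never raises, a third fallback after it would be unreachable.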

### **Output Sections**
- Overall score and summary
- Strengths with specific examples
- Areas for improvement
- Actionable recommendations
- Detailed skill analysis
- Formatting assessment
- Key insights and observations
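
The output sections listed above correspond to the top-level keys of the dictionary that `CVAnalyzer` returns. The sketch below checks a result for those keys before display. The key names are copied from the analyzer code; the helper `missing_sections` and the sample data are hypothetical.

```python
from typing import Any, Dict, List

# Top-level keys and expected types, taken from the dictionaries built by CVAnalyzer.
# The score may arrive as a string or a number depending on the code path.
EXPECTED_SECTIONS = {
    "overall_score": (str, int, float),
    "strengths": list,
    "weaknesses": list,
    "recommendations": list,
    "content_analysis": dict,
    "skill_analysis": dict,
    "formatting_analysis": dict,
    "key_insights": list,
}

def missing_sections(result: Dict[str, Any]) -> List[str]:
    """Return the names of sections that are absent or have an unexpected type."""
    return [
        key for key, expected_type in EXPECTED_SECTIONS.items()
        if not isinstance(result.get(key), expected_type)
    ]

# Illustrative partial result; real results contain all sections.
sample = {"overall_score": "8", "strengths": ["Clear structure"], "weaknesses": []}
print(missing_sections(sample))
```

A check like this could run before `display_analysis_results` to flag incomplete responses instead of silently showing `N/A`.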

In [85]:
# Cell 5: Test CV Analyzer with File Upload
import logging
from IPython.display import display, Markdown, HTML
import json
import tempfile
import os

# Setup logging to see the analysis process
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def test_cv_analyzer():
    """Test the CV analyzer with file upload"""

    print("🎯 ===========================================")
    print("🎯 CV ANALYZER TEST - GEMINI 2.0 FLASH")
    print("🎯 ===========================================")
    print()

    # Get CV input
    cv_text = get_cv_input()
    if cv_text is None:
        print("❌ CV input cancelled")
        return None

    print()
    print("⏳ Analyzing CV with Gemini 2.0 Flash...")
    print()

    try:
        # Analyze the CV
        analysis_result = cv_analyzer.analyze_cv(cv_text)

        # Display results
        display_analysis_results(analysis_result)

        return analysis_result

    except Exception as e:
        logger.error(f"❌ Error in CV analysis: {e}")
        display(Markdown("## ❌ ERROR"))
        display(Markdown(f"An error occurred during CV analysis: `{str(e)}`"))
        return None

def get_cv_input():
    """Get CV input from user with multiple options"""

    print("📝 CV INPUT OPTIONS:")
    print("1. 📋 Paste CV text manually")
    print("2. 📁 Upload text file (.txt)")
    print("3. 📄 Upload PDF file (.pdf)")

    while True:
        try:
            choice = input("Choose CV input method (1, 2, or 3): ").strip()
            if choice == '1':
                return get_manual_cv_input()
            elif choice == '2':
                return get_text_file_input()
            elif choice == '3':
                return get_pdf_file_input()
            else:
                print("❌ Please enter 1, 2, or 3")
        except KeyboardInterrupt:
            print("\n⏹️ Input cancelled")
            return None

def get_manual_cv_input():
    """Get manual CV text input"""
    print(f"\n📋 PASTE CV TEXT:")
    print("   (Paste the CV content below, then press Enter twice to finish)")
    print("-" * 50)

    lines = []
    empty_line_count = 0

    while True:
        try:
            line = input()
            if line.strip() == "":
                empty_line_count += 1
                if empty_line_count >= 2 and any(l.strip() for l in lines):
                    break
            else:
                empty_line_count = 0
            lines.append(line)
        except EOFError:
            break
        except KeyboardInterrupt:
            print("\n⏹️ Input cancelled")
            return None

    text = "\n".join(lines)
    print(f"✅ CV text received ({len(text)} characters)")

    # Show preview
    preview = text[:300] + "..." if len(text) > 300 else text
    print(f"📖 CV Preview:\n{preview}\n")

    return text

def get_text_file_input():
    """Get CV from text file upload"""
    print(f"\n📁 CV TEXT FILE UPLOAD:")

    try:
        from google.colab import files
        print("📤 Please upload your .txt CV file...")
        uploaded = files.upload()

        if uploaded:
            filename = list(uploaded.keys())[0]
            # Try UTF-8 first; Latin-1 accepts any byte sequence, so it is the final fallback
            try:
                content = uploaded[filename].decode('utf-8')
            except UnicodeDecodeError:
                content = uploaded[filename].decode('latin-1')

            print(f"✅ Text file '{filename}' uploaded successfully ({len(content)} characters)")

            # Show preview
            preview = content[:300] + "..." if len(content) > 300 else content
            print(f"📖 CV Preview:\n{preview}\n")

            return content
        else:
            print("❌ No file uploaded. Switching to manual input...")
            return get_manual_cv_input()

    except ImportError:
        print("❌ File upload not available in this environment")
        return get_manual_cv_input()

def get_pdf_file_input():
    """Get CV from PDF file upload"""
    print(f"\n📄 CV PDF FILE UPLOAD:")

    try:
        from google.colab import files
        print("📤 Please upload your PDF CV file...")
        uploaded = files.upload()

        if uploaded:
            filename = list(uploaded.keys())[0]
            if not filename.lower().endswith('.pdf'):
                print("❌ Please upload a PDF file. Switching to manual input...")
                return get_manual_cv_input()

            print(f"⏳ Extracting text from PDF '{filename}'...")
            pdf_content = uploaded[filename]

            # Extract text from PDF using the same method as in cell 3
            text_content = extract_text_from_pdf(pdf_content)

            if text_content:
                print(f"✅ PDF text extracted successfully ({len(text_content)} characters)")
                # Show preview of extracted text
                preview = text_content[:300] + "..." if len(text_content) > 300 else text_content
                print(f"📖 Extracted CV Preview:\n{preview}\n")
                return text_content
            else:
                print("❌ Failed to extract text from PDF. Switching to manual input...")
                return get_manual_cv_input()
        else:
            print("❌ No file uploaded. Switching to manual input...")
            return get_manual_cv_input()

    except ImportError:
        print("❌ PDF upload not available in this environment")
        return get_manual_cv_input()

def extract_text_from_pdf(pdf_content):
    """Extract text from PDF content using PyPDFLoader"""
    temp_path = None
    try:
        # Save PDF content to temporary file
        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as temp_file:
            temp_file.write(pdf_content)
            temp_path = temp_file.name

        # Use PyPDFLoader to extract text
        from langchain_community.document_loaders import PyPDFLoader
        loader = PyPDFLoader(temp_path)
        documents = loader.load()

        # Combine all pages into single text
        return "\n".join(doc.page_content for doc in documents)
    except Exception as e:
        print(f"❌ Error extracting text from PDF: {e}")
        return None
    finally:
        # Clean up the temporary file even when extraction fails
        if temp_path and os.path.exists(temp_path):
            os.unlink(temp_path)

def display_analysis_results(analysis_result):
    """Display the CV analysis results in a formatted way"""

    display(Markdown("## 📊 CV ANALYSIS RESULTS"))
    display(Markdown(f"**Analysis Source:** {analysis_result.get('analysis_source', 'Unknown')}"))
    display(Markdown(f"**Model Used:** {analysis_result.get('model_used', 'Unknown')}"))
    display(Markdown(f"**AI Generated:** {'Yes' if analysis_result.get('ai_generated', False) else 'No'}"))

    # Overall Score
    display(Markdown(f"### 🎯 Overall Score: {analysis_result.get('overall_score', 'N/A')}"))

    # Strengths
    display(Markdown("### ✅ STRENGTHS"))
    strengths = analysis_result.get('strengths', [])
    if strengths:
        for i, strength in enumerate(strengths, 1):
            display(Markdown(f"{i}. {strength}"))
    else:
        display(Markdown("*No specific strengths identified*"))

    # Weaknesses
    display(Markdown("### ❌ AREAS FOR IMPROVEMENT"))
    weaknesses = analysis_result.get('weaknesses', [])
    if weaknesses:
        for i, weakness in enumerate(weaknesses, 1):
            display(Markdown(f"{i}. {weakness}"))
    else:
        display(Markdown("*No specific weaknesses identified*"))

    # Recommendations
    display(Markdown("### 💡 RECOMMENDATIONS"))
    recommendations = analysis_result.get('recommendations', [])
    if recommendations:
        for i, recommendation in enumerate(recommendations, 1):
            display(Markdown(f"{i}. {recommendation}"))
    else:
        display(Markdown("*No specific recommendations*"))

    # Content Analysis
    display(Markdown("### 📝 CONTENT ANALYSIS"))
    content_analysis = analysis_result.get('content_analysis', {})
    display(Markdown(f"- **Clarity:** {content_analysis.get('clarity', 'N/A')}"))
    display(Markdown(f"- **Relevance:** {content_analysis.get('relevance', 'N/A')}"))
    display(Markdown(f"- **Achievements:** {content_analysis.get('achievements', 'N/A')}"))
    display(Markdown(f"- **Uniqueness:** {content_analysis.get('uniqueness', 'N/A')}"))

    # Skill Analysis
    display(Markdown("### 🔧 SKILL ANALYSIS"))
    skill_analysis = analysis_result.get('skill_analysis', {})
    display(Markdown(f"- **Technical Skills:** {skill_analysis.get('technical_skills', 'N/A')}"))
    display(Markdown(f"- **Soft Skills:** {skill_analysis.get('soft_skills', 'N/A')}"))
    display(Markdown(f"- **Skill Gaps:** {skill_analysis.get('skill_gaps', 'N/A')}"))
    display(Markdown(f"- **Skill Strengths:** {skill_analysis.get('skill_strengths', 'N/A')}"))

    # Formatting Analysis
    display(Markdown("### 📐 FORMATTING ANALYSIS"))
    formatting_analysis = analysis_result.get('formatting_analysis', {})
    display(Markdown(f"- **Readability:** {formatting_analysis.get('readability', 'N/A')}"))
    display(Markdown(f"- **Structure:** {formatting_analysis.get('structure', 'N/A')}"))
    display(Markdown(f"- **Professionalism:** {formatting_analysis.get('professionalism', 'N/A')}"))

    # Key Insights
    display(Markdown("### 🔍 KEY INSIGHTS"))
    insights = analysis_result.get('key_insights', [])
    if insights:
        for i, insight in enumerate(insights, 1):
            display(Markdown(f"{i}. {insight}"))
    else:
        display(Markdown("*No specific insights*"))

    # Raw JSON view (collapsible); escape the JSON so CV text cannot inject HTML
    from html import escape
    display(Markdown("### 📋 RAW ANALYSIS DATA"))
    display(HTML("""
    <details>
    <summary>Click to view raw JSON data</summary>
    <pre style="background: #f4f4f4; padding: 10px; border-radius: 5px; overflow-x: auto;">
    """ + escape(json.dumps(analysis_result, indent=2)) + """
    </pre>
    </details>
    """))

# Quick test function for already loaded text
def quick_analyze_cv(cv_text):
    """Quick analysis function for already loaded CV text"""
    print("⏳ Quick analyzing CV...")
    result = cv_analyzer.analyze_cv(cv_text)
    display_analysis_results(result)
    return result
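
The manual-input helper in this cell stops reading after two consecutive blank lines. The same stopping rule can be written as a pure function over an iterable of lines, which makes it testable without `input()`. This is a sketch only: `collect_until_double_blank` is a hypothetical name, and it requires some non-blank content before stopping and trims surrounding blank lines.

```python
from typing import Iterable, List

def collect_until_double_blank(lines: Iterable[str]) -> str:
    """Join lines until two consecutive blank lines follow some real content."""
    collected: List[str] = []
    blank_run = 0
    for line in lines:
        if line.strip() == "":
            blank_run += 1
            # Stop only once there is at least one non-blank line collected.
            if blank_run >= 2 and any(item.strip() for item in collected):
                break
        else:
            blank_run = 0
        collected.append(line)
    return "\n".join(collected).strip()

print(collect_until_double_blank(["Jane Doe", "Python, Django", "", "", "ignored"]))
```

A single blank line between paragraphs is kept, so pasted CVs with section spacing are not cut short.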


# 📄 Demo of Our Agent’s Functionality: The Analysis Pipeline
The function below runs the complete CV–JD analysis pipeline.
It orchestrates text parsing, skill extraction, and LLM-based reasoning.


In [86]:
results = run_complete_analysis()

🎯 CV-JD SKILL GAP ANALYSIS & QUESTION GENERATOR

Choose input method for CV and JD:
• Option 1: 📋 Paste text manually
• Option 2: 📁 Upload text files (.txt)
• Option 3: 📄 Upload PDF files (.pdf)


📝 CV INPUT OPTIONS:
1. 📋 Paste text manually
2. 📁 Upload text file (.txt)
3. 📄 Upload PDF file (.pdf)

📋 PASTE CV TEXT:
   (Paste the content below, then press Enter twice to finish)
--------------------------------------------------
✅ CV text received (862 characters)

📝 JD INPUT OPTIONS:
1. 📋 Paste text manually
2. 📁 Upload text file (.txt)
3. 📄 Upload PDF file (.pdf)
❌ Please enter 1, 2, or 3

📋 PASTE JD TEXT:
   (Paste the content below, then press Enter twice to finish)
--------------------------------------------------


🔍 Running skill gap analysis...
Extracted 14 clean skills: ['Django', 'Docker', 'Flask', 'Git', 'Javascript', 'Jira', 'Kubernetes', 'Languages: Python', 'Microservices', 'Python', 'React', 'React Tools: Git', 'Sql Frameworks: Django', 'Team Leadership']
Extracted 4 clean skills: ['Aws', 'Django', 'Gcp', 'Python']


✅ JD text received (877 characters)

⏳ Analyzing skills and generating questions...



Batches: 100%|██████████| 1/1 [00:00<00:00, 82.08it/s]

## 📊 SKILL GAP ANALYSIS RESULTS

**Match Percentage:** 50.0%

**CV Skills Found:** 14

**JD Skills Required:** 4

### ✅ MATCHED SKILLS

**1. Django** (Similarity: 1.0)

   - CV Evidence: *EXPERIENCE: Senior Software Engineer, TechCorp Inc (2019-Present) - Developed and maintained **scalable microservices...*

   - JD Requirement: *• Must have **5+ years of experience** with **Python** and **Django***




**2. Python** (Similarity: 1.0)

   - CV Evidence: *EXPERIENCE: Senior Software Engineer, TechCorp Inc (2019-Present) - Developed and maintained **scalable microservices...*

   - JD Requirement: *• Must have **5+ years of experience** with **Python** and **Django***




### ❌ MISSING SKILLS

**1. Aws**

   - JD Context: *• Design, deploy, and maintain solutions on **AWS** and **GCP Cloud Platforms***




**2. Gcp**

   - JD Context: *• Design, deploy, and maintain solutions on **AWS** and **GCP Cloud Platforms***
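
The figures above are consistent with the match percentage being the share of JD skills found in the CV (2 of 4 here). That formula is inferred from the printed output rather than stated by the notebook. The sketch below uses exact, case-insensitive string matching as a stand-in for the embedding-based similarity the pipeline reports; `match_summary` is a hypothetical helper.

```python
from typing import Dict, List, Union

def match_summary(cv_skills: List[str], jd_skills: List[str]) -> Dict[str, Union[float, List[str]]]:
    """Split JD skills into matched and missing, and compute the matched share."""
    cv_normalized = {skill.strip().lower() for skill in cv_skills}
    matched = [s for s in jd_skills if s.strip().lower() in cv_normalized]
    missing = [s for s in jd_skills if s.strip().lower() not in cv_normalized]
    # Percentage is relative to the JD, since the JD defines what is required.
    percentage = 100.0 * len(matched) / len(jd_skills) if jd_skills else 0.0
    return {"match_percentage": round(percentage, 1), "matched": matched, "missing": missing}

# Skill lists copied from the extraction log above.
cv = ['Django', 'Docker', 'Flask', 'Git', 'Javascript', 'Jira', 'Kubernetes',
      'Languages: Python', 'Microservices', 'Python', 'React', 'React Tools: Git',
      'Sql Frameworks: Django', 'Team Leadership']
jd = ['Aws', 'Django', 'Gcp', 'Python']
print(match_summary(cv, jd))
```

With these inputs the sketch gives the same 50.0% figure, with Django and Python matched and Aws and Gcp missing.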

🧠 Generating interview questions with answers and hints
📊 Matched skills: ['Django', 'Python']
📊 Missing skills: ['Aws', 'Gcp']




🔄 Generating interview questions based on analysis...
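
The raw response printed below begins with a Markdown `json` fence, so the fence has to be removed before the payload can be parsed. The following is a minimal sketch of that step, not the notebook's actual parser; `parse_fenced_json` is a hypothetical name and the sample payload is invented for illustration.

```python
import json
import re

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example

def parse_fenced_json(raw: str):
    """Strip an optional Markdown code fence, then parse the JSON payload."""
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    match = re.match(pattern, raw.strip(), re.DOTALL)
    # Fall back to the raw text when the model returned bare JSON.
    payload = match.group(1) if match else raw.strip()
    return json.loads(payload)

raw = f'{FENCE}json\n[{{"question": "Q1", "focus": "behavioral"}}]\n{FENCE}'
questions = parse_fenced_json(raw)
print(len(questions), questions[0]["focus"])
```

`json.loads` still raises `json.JSONDecodeError` on malformed payloads, so a caller would wrap this in the same kind of fallback used elsewhere in the notebook.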



📨 Raw AI Response:
```json
[
  {
    "question": "Describe a time when you had to make a significant architectural decision that had a major impact on a project. What were the key considerations, and how did you ensure buy-in from the team?",
    "focus": "behavioral",
    "rationale": "This question assesses the candidate's ability to make strategic architectural decisions and influence others, which is crucial for a Senior Software Architect role.",
    "skill_type": "matched_skill",
    "target_skill": "Softwar...
```
✅ Successfully parsed 6 questions
📝 Question 1: Describe a time when you had to make a significant architectural decision that had a major impact on...
   Answer: Yes
   Hints: 4
   Key Points: 4
📝 Question 2: Tell me about a time you had to deal with significant technical debt in a project. What steps did yo...
   Answer: Yes
   Hints: 4
   Key Points: 4
📝 Question 3: Describe a complex Django project you've worked on. What were some of the key architectural decision...
 

## 🎯 GENERATED INTERVIEW QUESTIONS

**Total Questions Generated:** 6

### ❓ Question 1: Describe a time when you had to make a significant architectural decision that had a major impact on a project. What were the key considerations, and how did you ensure buy-in from the team?

**Focus:** behavioral | **Skill Type:** matched_skill

**Target Skill:** Software Architecture

**Rationale:** This question assesses the candidate's ability to make strategic architectural decisions and influence others, which is crucial for a Senior Software Architect role.

#### 💡 Model Answer:

S: We were building a new payment processing system and needed to decide on the overall architecture. T: My task was to choose an architecture that was scalable, secure, and maintainable. A: I researched several options, including a monolithic architecture and a microservices architecture. I presented the pros and cons of each to the team, highlighting the long-term benefits of microservices in terms of scalability and fault isolation. I also addressed concerns about the increased complexity by proposing a phased rollout and providing training on microservices best practices. R: We chose the microservices architecture, and the system has been successfully processing millions of transactions per day with minimal downtime. The phased rollout and training helped the team adapt quickly, and we avoided major integration issues.

#### 🎯 Answer Hints:

1. Focus on the decision-making process.

2. Explain how you considered different options.

3. Highlight your communication and persuasion skills.

4. Quantify the impact of your decision if possible.

#### 🔑 Key Points to Cover:

1. Demonstrated strategic thinking.

2. Effective communication and collaboration.

3. Consideration of various architectural options.

4. Positive impact on the project.





### ❓ Question 2: Tell me about a time you had to deal with significant technical debt in a project. What steps did you take to address it, and what were the results?

**Focus:** behavioral | **Skill Type:** matched_skill

**Target Skill:** Technical Debt

**Rationale:** This question assesses the candidate's experience in identifying and mitigating technical debt, a key responsibility for a Senior Software Architect.

#### 💡 Model Answer:

S: Our legacy system had accumulated significant technical debt due to years of quick fixes and lack of proper documentation. T: My task was to lead an effort to reduce this debt and improve the system's maintainability. A: I started by conducting a thorough code review and identifying the areas with the most technical debt. I then prioritized the issues based on their impact on the system and the effort required to fix them. I worked with the team to refactor the code, improve documentation, and implement automated testing. We also established coding standards and code review processes to prevent future technical debt. R: As a result, we significantly reduced the number of bugs, improved the system's performance, and made it easier for new developers to onboard. We also saw a decrease in the time required to implement new features.

#### 🎯 Answer Hints:

1. Describe the specific type of technical debt.

2. Explain your prioritization process.

3. Highlight the steps you took to address the debt.

4. Quantify the positive outcomes.

#### 🔑 Key Points to Cover:

1. Identified and prioritized technical debt.

2. Implemented effective solutions.

3. Improved system maintainability and performance.

4. Prevented future technical debt.





### ❓ Question 3: Describe a complex Django project you've worked on. What were some of the key architectural decisions you made, and how did you leverage Django's features to solve specific challenges?

**Focus:** technical_django | **Skill Type:** matched_skill

**Target Skill:** Django

**Rationale:** This question assesses the candidate's deep understanding of Django and their ability to use it effectively in complex projects.

#### 💡 Model Answer:

I worked on a large e-commerce platform using Django. One key decision was to use Django Rest Framework for building a robust API for mobile apps and third-party integrations. We leveraged Django's ORM for database interactions, optimizing queries using select_related and prefetch_related to minimize database hits. For handling asynchronous tasks like sending emails and processing orders, we integrated Celery with Redis as the broker. To manage user authentication and authorization, we used Django's built-in authentication system with custom permission classes to implement fine-grained access control. We also implemented caching strategies using Django's cache framework to improve performance.

#### 🎯 Answer Hints:

1. Mention specific Django features you used.

2. Explain the rationale behind your architectural decisions.

3. Discuss performance optimization techniques.

4. Highlight any custom solutions you implemented.

#### 🔑 Key Points to Cover:

1. Use of Django Rest Framework.

2. ORM optimization techniques.

3. Asynchronous task handling with Celery.

4. Custom authentication and authorization.





### ❓ Question 4: Explain your approach to writing efficient and maintainable Python code. What are some of the best practices you follow, and how do you ensure code quality?

**Focus:** technical_python | **Skill Type:** matched_skill

**Target Skill:** Python

**Rationale:** This question assesses the candidate's proficiency in Python and their commitment to writing high-quality code.

#### 💡 Model Answer:

I prioritize writing clean, readable, and well-documented Python code. I follow PEP 8 guidelines for code style and use linters like Flake8 and pylint to enforce consistency. I write unit tests using pytest to ensure that my code functions correctly and to prevent regressions. I use type hints to improve code clarity and catch potential errors early on. I also practice code review and pair programming to get feedback from other developers and improve code quality. I use virtual environments to manage dependencies and ensure reproducibility. I also use logging to track errors and debug issues.

#### 🎯 Answer Hints:

1. Mention PEP 8 and other coding standards.

2. Discuss your testing strategy.

3. Explain how you use linters and type hints.

4. Highlight the importance of code review.

#### 🔑 Key Points to Cover:

1. Adherence to coding standards.

2. Comprehensive testing strategy.

3. Use of linters and type hints.

4. Emphasis on code review and collaboration.





### ❓ Question 5: Imagine we are migrating a critical application to AWS. Describe your approach to designing a resilient and scalable architecture using AWS services. Consider factors like cost optimization, security, and disaster recovery.

**Focus:** situational | **Skill Type:** missing_skill

**Target Skill:** AWS

**Rationale:** This question assesses the candidate's ability to design cloud solutions on AWS, a missing skill identified from the job description.

#### 💡 Model Answer:

First, I'd analyze the application's requirements, including performance, scalability, and security needs. For resilience, I'd deploy the application across multiple Availability Zones using services like EC2 Auto Scaling and Elastic Load Balancing. For scalability, I'd use Auto Scaling to automatically adjust the number of EC2 instances based on demand. For cost optimization, I'd use Reserved Instances or Spot Instances for non-critical workloads and leverage AWS Cost Explorer to monitor spending. For security, I'd use IAM roles and policies to control access to AWS resources, configure security groups to restrict network traffic, and use AWS WAF to protect against web attacks. For disaster recovery, I'd implement a backup and recovery plan using services like S3 and Glacier, and consider using AWS CloudFormation to automate the deployment of infrastructure.

#### 🎯 Answer Hints:

1. Mention specific AWS services.

2. Explain how you would address resilience and scalability.

3. Discuss cost optimization strategies.

4. Highlight security best practices.

5. Describe your approach to disaster recovery.

#### 🔑 Key Points to Cover:

1. Multi-AZ deployment with Auto Scaling and ELB.

2. Cost optimization using Reserved/Spot Instances.

3. IAM roles, security groups, and AWS WAF for security.

4. Backup and recovery plan with S3 and Glacier.





### ❓ Question 6: Our company is considering adopting a multi-cloud strategy, leveraging both AWS and GCP. How would you approach designing a system that can seamlessly operate across both platforms, and what are some of the challenges we might encounter?

**Focus:** situational | **Skill Type:** missing_skill

**Target Skill:** GCP

**Rationale:** This question assesses the candidate's ability to design cloud solutions on GCP and their understanding of multi-cloud architectures, a missing skill identified from the job description.

#### 💡 Model Answer:

To design a system that operates seamlessly across AWS and GCP, I'd focus on using platform-agnostic technologies and services where possible. For example, I'd use Kubernetes for container orchestration, as it's available on both platforms. I'd also use a service mesh like Istio to manage traffic and security between services. For data storage, I'd consider using a distributed database like Cassandra or CockroachDB that can be deployed across both clouds. For identity and access management, I'd use a centralized identity provider like Okta or Azure AD. Some challenges we might encounter include data synchronization between clouds, network latency, and the complexity of managing infrastructure across multiple platforms. We'd also need to consider the different pricing models and service offerings of each cloud provider.

#### 🎯 Answer Hints:

1. Mention platform-agnostic technologies.

2. Discuss the challenges of multi-cloud deployments.

3. Explain how you would address data synchronization and network latency.

4. Highlight the importance of centralized identity management.

#### 🔑 Key Points to Cover:

1. Kubernetes and Istio for container orchestration and service mesh.

2. Distributed database for data storage.

3. Centralized identity provider for access management.

4. Addressing data synchronization and network latency challenges.





In [89]:
analysis_results = test_cv_analyzer()

🎯 CV ANALYZER TEST - GEMINI 2.0 FLASH

📝 CV INPUT OPTIONS:
1. 📋 Paste CV text manually
2. 📁 Upload text file (.txt)
3. 📄 Upload PDF file (.pdf)

📋 PASTE CV TEXT:
   (Paste the CV content below, then press Enter twice to finish)
--------------------------------------------------


Starting CV analysis with Gemini 2.0 Flash...
Sending request to Gemini 2.0 Flash via LangChain...


✅ CV text received (2518 characters)
📖 CV Preview:
Meriem Mojaat | MeriemMojaat | meriem-mojaat | meriemmojaat216@gmail.com | +216 58 412 360 Work Experience Software Development Intern – Smart System, Tunisia Jun 2025 – Aug 2025 • Developed web applications to automate document processes. • Implemented a Spring Boot backend and built frontend...


⏳ Analyzing CV with Gemini 2.0 Flash...



Gemini 2.0 Flash response received: 5549 characters
Successfully got Gemini 2.0 Flash analysis via LangChain


## 📊 CV ANALYSIS RESULTS

**Analysis Source:** gemini_2.0_flash_langchain

**Model Used:** gemini-2.0-flash-exp

**AI Generated:** Yes

### 🎯 Overall Score: 7/10

### ✅ STRENGTHS

1. Strong project portfolio showcasing experience in AI, Machine Learning, and MLOps, demonstrated by projects like 'EmotionAI', 'Lung Nodule Detection', and 'MLOps and Model Deployment'.

2. The CV demonstrates a clear progression of skills, from basic networking internships to more advanced AI/ML projects, highlighting continuous learning and development.

3. The inclusion of NVIDIA and Cisco certifications adds credibility to the candidate's technical skills in AI and Networking.

4. The candidate's skills in both backend (Java, Spring Boot) and frontend (HTML, CSS) technologies, as well as DevOps tools (Docker, Jenkins, CI/CD), make them a well-rounded candidate.

### ❌ AREAS FOR IMPROVEMENT

1. The descriptions in the 'Work Experience' section lack quantifiable achievements. For example, 'Developed web applications to automate document processes' needs more detail on the impact or scale of the automation.

2. The 'Projects' section, while impressive, lacks specific metrics about the deployment and real-world impact of the models. For example, for 'Customer Churn Prediction', it's unclear if the model was deployed and how it improved churn rates.

3. The 'Skills' section is a long list, making it difficult to quickly assess proficiency levels. There's no indication of the depth of knowledge in each skill.

4. The dates for the projects are all listed as '2025' or '2024', which is confusing since some of the internships occurred earlier. This seems like a typo.

### 💡 RECOMMENDATIONS

1. Quantify achievements in the 'Work Experience' section. For instance, change 'Developed web applications to automate document processes' to 'Developed web applications using Java and Spring Boot that automated document processes, reducing processing time by 30% and saving 10 man-hours per week'.

2. Add metrics and impact to the 'Projects' section. For example, for 'Lung Nodule Detection', specify the dataset size, the evaluation metrics (e.g., F1-score, AUC), and whether it was tested on real-world data. Mention if the scientific article was published and where.

3. Categorize the 'Skills' section into proficiency levels (e.g., Expert, Proficient, Familiar) or years of experience. For example, under Python, add '(3+ years experience)' or '(Expert)' to indicate a higher level of expertise.

4. Correct the dates for the projects to accurately reflect when they were completed. Include the month and year if possible (e.g., January 2024 - March 2024).

5. For the MLOps project, specify which cloud platform (AWS, Azure, GCP) was used for deployment. Also, mention the specific tools used for monitoring (e.g., Prometheus, Grafana).

6. Elaborate on the 'Business Intelligence (BI)' skill. Mention specific tools used (e.g., Tableau, Power BI) and projects where BI skills were applied.

### 📝 CONTENT ANALYSIS

- **Clarity:** The writing is generally clear but lacks detail and quantifiable results. The bullet points are concise but need to be more impactful.

- **Relevance:** The content is highly relevant to roles in Software Development, Machine Learning, and DevOps, given the mix of internships, projects, and skills.

- **Achievements:** Achievements are present but not highlighted effectively. They need to be quantified and contextualized to demonstrate impact.

- **Uniqueness:** The combination of AI/ML projects with networking experience and DevOps skills makes this CV stand out, showing a diverse skill set.

### 🔧 SKILL ANALYSIS

- **Technical Skills:** The technical skills are extensive, covering a wide range of technologies from programming languages (Python, Java) to ML frameworks (TensorFlow, PyTorch) and DevOps tools (Docker, Jenkins).

- **Soft Skills:** The soft skills listed are generic. Provide examples of how these skills were applied in projects or internships. For example, 'Teamwork: Collaborated with a team of 5 engineers on the EmotionAI project'.

- **Skill Gaps:** While the skills are broad, there's a lack of emphasis on cloud computing platforms (AWS, Azure, GCP) beyond deployment. Specifying experience with cloud services would be beneficial.

- **Skill Strengths:** The candidate's strength lies in their AI/ML skills, particularly their experience with deep learning frameworks and MLOps tools.

### 📐 FORMATTING ANALYSIS

- **Readability:** The CV is generally readable, but the long list of skills could be better organized. Bullet points are used effectively, but the descriptions within them could be more concise.

- **Structure:** The structure is standard and logical, with clear sections for work experience, projects, education, and skills.

- **Professionalism:** The CV is professionally presented, but improving the clarity and detail of the content would enhance its impact.

### 🔍 KEY INSIGHTS

1. The candidate has a strong foundation in AI/ML and is actively pursuing related projects, indicating a clear interest and specialization in this field.

2. The combination of internships in different areas (Networking, Java Development, AI) suggests a broad exploration of career paths before focusing on AI/ML.

3. The inclusion of MLOps skills is a valuable asset, indicating an understanding of the entire ML lifecycle, from model development to deployment and monitoring.

### 📋 RAW ANALYSIS DATA

## Part Overview

This part implements a comprehensive **AI-powered CV analysis system** that combines multiple advanced techniques:

- **RAG (Retrieval-Augmented Generation)** for semantic search and context-aware analysis  
- **Fine-tuned Transformer Models** for precise job category classification  
- **Large Language Models (LLMs)** for natural language understanding and generation  
- **Vector Database** for efficient similarity search across CV documents  

The system is designed for **HR professionals and recruiters** to quickly analyze large volumes of resumes and identify the best candidates for specific job requirements.

## System Architecture Summary

### Frameworks & Technologies Used:

**Core AI Frameworks:**
- **Sentence Transformers**: For embedding generation in RAG pipeline
- **Hugging Face Transformers**: For fine-tuning BERT classification models
- **FAISS**: For efficient vector similarity search
- **PyTorch**: Deep learning backend
- **Google Generative AI**: Gemini LLM integration

**Supporting Infrastructure:**
- **Flask**: REST API server
- **n8n.cloud**: Workflow automation and webhooks
- **ngrok**: Public API tunneling

### Model Architectures:

1. **RAG Embedding Model**: `all-MiniLM-L6-v2`
   - Siamese network architecture
   - 384-dimensional embeddings
   - Cosine similarity for retrieval

2. **Fine-tuned Classifier**: `bert-base-uncased`
   - 12 transformer layers, 768 hidden size
   - Custom classification head for 22 job categories
   - Fine-tuned on CV dataset

3. **LLM**: `gemini-2.0-flash-exp`
   - 1M-token context window
   - Advanced reasoning capabilities
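
The retrieval step relies on cosine similarity, and the `CVProcessor` class later in this notebook obtains it by L2-normalizing embeddings and using an inner-product FAISS index. The following is a minimal sketch of why those two are equivalent, written in plain Python rather than FAISS. The helper names (`l2_normalize`, `inner_product`, `cosine`) and the toy vectors are assumptions made for illustration only.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]


def inner_product(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Textbook cosine similarity: dot(a, b) / (|a| * |b|)."""
    denom = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / denom if denom else 0.0


# Toy 3-dimensional "embeddings" (real ones here would be 384-dimensional).
query = [1.0, 2.0, 2.0]
chunk = [2.0, 1.0, 2.0]

# Inner product of the normalized vectors matches the cosine of the raw vectors.
via_normalized_ip = inner_product(l2_normalize(query), l2_normalize(chunk))
print(round(cosine(query, chunk), 6), round(via_normalized_ip, 6))
```

Because the two values agree, an inner-product index over unit-length vectors ranks chunks exactly as cosine similarity would.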

### Fine-Tuning Techniques:

- **Stratified train/validation split** (80/20)
- **Dynamic padding** and **gradient clipping**
- **Linear learning rate warmup** with decay
- **Multi-epoch training** with checkpointing
- **Weighted F1-score** for imbalanced classes
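
To make the "linear learning rate warmup with decay" item concrete, here is a small standalone sketch of the learning-rate multiplier such a schedule produces. It is not the notebook's training code: the function name `linear_warmup_decay`, the base learning rate, and the step counts are assumptions chosen for illustration. In a Hugging Face setup the same shape is normally obtained from `get_linear_schedule_with_warmup` or the `Trainer` defaults.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Return the multiplier applied to the base learning rate at `step`.

    The multiplier rises linearly from 0 to 1 during warmup, then falls
    linearly back to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Assumed values for illustration only.
base_lr = 2e-5
warmup_steps, total_steps = 100, 1000

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The warmup phase keeps early updates small while the freshly initialized classification head is still producing noisy gradients.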

### RAG Implementation:

**Retrieval Component:**
- PDF text extraction and intelligent chunking
- FAISS vector store with cosine similarity
- Top-K semantic search with confidence scores

**Augmentation Strategy:**
- Multi-source context integration
- Category predictions from fine-tuned model
- Enhanced prompt engineering for HR context

**Generation Pipeline:**
- Structured prompts for recruitment analysis
- Professional HR-focused response formatting
- Actionable insights and recommendations
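
The augmentation and generation steps above can be sketched as a single prompt-building function. This is a hedged illustration, not the prompt actually used in the project: `build_hr_prompt` and its wording are hypothetical. The only thing taken from the notebook is the shape of each retrieval hit (`document`, `metadata`, `score`), which matches what `CVProcessor.search_similar` returns.

```python
from typing import Dict, List


def build_hr_prompt(question: str, retrieved: List[Dict], predicted_category: str) -> str:
    """Combine retrieved CV chunks and a classifier prediction into one LLM prompt."""
    context_lines = []
    for rank, hit in enumerate(retrieved, start=1):
        meta = hit["metadata"]
        context_lines.append(
            f"[{rank}] {meta['filename']} ({meta['category']}, "
            f"score={hit['score']:.2f}): {hit['document']}"
        )
    # Fall back to an explicit marker so the model is not asked to invent context.
    context = "\n".join(context_lines) if context_lines else "(no matching CV excerpts found)"
    return (
        "You are an HR analyst reviewing candidate CVs.\n"
        f"Predicted job category: {predicted_category}\n\n"
        f"Retrieved CV excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the excerpts above and cite them by number."
    )


# Made-up hit for illustration; real hits come from the vector search.
example_hits = [
    {
        "document": "Built REST APIs with Django and deployed them with Docker.",
        "metadata": {"filename": "cv_001.pdf", "category": "ENGINEERING", "chunk_id": 0},
        "score": 0.8123,
    }
]
prompt = build_hr_prompt("Who has Django experience?", example_hits, "ENGINEERING")
print(prompt)
```

Numbering the excerpts gives the model something to cite, which makes it easier to check generated claims against the source CVs.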



## 🛠️ Installation & Setup

First, import the required libraries. The cell below assumes the dependencies are already installed in the environment.

In [2]:
import os
import PyPDF2
import pdfplumber
import pandas as pd
import numpy as np
import json
import pickle
from sentence_transformers import SentenceTransformer
import faiss
import google.generativeai as genai
from sklearn.metrics.pairwise import cosine_similarity
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import torch
import requests
from flask import Flask, request, jsonify
from flask_cors import CORS
import threading
from typing import List, Dict, Any

print("✅ All libraries imported successfully!")


✅ All libraries imported successfully!


## 📊 Dataset Configuration

Configure the Gemini API key and the dataset path, and create the output directories. The system expects PDF files organized in category folders.

In [4]:
# Configure Gemini API (read the key from the environment; never hard-code secrets)
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
genai.configure(api_key=GEMINI_API_KEY)

# Define dataset path (update this to your actual path)
dataset_path = r"C:\Users\merie\Downloads\dataset\data\data"

# Create necessary directories
os.makedirs('./models', exist_ok=True)
os.makedirs('./logs', exist_ok=True)

print("✅ Configuration setup complete!")
print(f"Dataset path: {dataset_path}")
print("📁 Directories created: models, logs")

✅ Configuration setup complete!
Dataset path: C:\Users\merie\Downloads\dataset\data\data
📁 Directories created: models, logs
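
Since the pipeline expects one folder per category with PDFs inside, it can help to check that layout before processing. The following is a minimal sketch of such a check. The function `count_pdfs_per_category` is a hypothetical helper, not part of the original notebook, and it only counts files without opening them.

```python
from pathlib import Path
from typing import Dict


def count_pdfs_per_category(root: str) -> Dict[str, int]:
    """Return {category_folder_name: number_of_pdf_files} for a dataset root."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")

    counts: Dict[str, int] = {}
    # Each immediate subdirectory is treated as one job category.
    for category_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        counts[category_dir.name] = sum(
            1
            for f in category_dir.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        )
    return counts
```

A possible use is `count_pdfs_per_category(dataset_path)` right after the configuration cell: an empty result or categories with zero PDFs usually point to a wrong path.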


# Core System Components

## 1. CV Processor Class

The `CVProcessor` class handles:
- PDF text extraction and chunking
- Embedding generation using Sentence Transformers
- FAISS vector index creation for efficient similarity search
- Semantic search across CV database

In [5]:
class CVProcessor:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        """
        Initialize the CV Processor with a sentence transformer model.
        
        Args:
            model_name (str): Name of the SentenceTransformer model to use for embeddings
        """
        # Initialize the sentence transformer model for generating embeddings
        self.model = SentenceTransformer(model_name)
        # FAISS index for efficient similarity search (will be created later)
        self.index = None
        # List to store all document chunks
        self.documents = []
        # List to store metadata for each document chunk
        self.metadata = []
        
    def extract_text_from_pdf(self, pdf_path: str) -> str:
        """
        Extract text content from PDF files using pdfplumber.
        
        Args:
            pdf_path (str): Path to the PDF file
            
        Returns:
            str: Extracted text content from the PDF
        """
        text = ""
        try:
            # Open PDF file using pdfplumber
            with pdfplumber.open(pdf_path) as pdf:
                # Iterate through all pages in the PDF
                for page in pdf.pages:
                    # Extract text from current page
                    page_text = page.extract_text()
                    if page_text:
                        # Append page text with newline separator
                        text += page_text + "\n"
        except Exception as e:
            # Handle any errors during PDF processing
            print(f"Error extracting text from {pdf_path}: {str(e)}")
        return text
    
    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        """
        Split text into overlapping chunks for better processing.
        
        Args:
            text (str): Input text to chunk
            chunk_size (int): Number of words per chunk
            overlap (int): Number of overlapping words between consecutive chunks
            
        Returns:
            List[str]: List of text chunks
        """
        # Split text into individual words
        words = text.split()
        chunks = []
        
        # Create chunks with specified overlap
        for i in range(0, len(words), chunk_size - overlap):
            # Join words to form a chunk
            chunk = " ".join(words[i:i + chunk_size])
            chunks.append(chunk)
            
            # Break if we've reached the end of the text
            if i + chunk_size >= len(words):
                break
                
        return chunks
    
    def process_dataset(self, dataset_path: str) -> List[str]:
        """
        Process all PDF files in the dataset directory structure.
        
        Expected directory structure:
        dataset_path/
        ├── category1/
        │   ├── file1.pdf
        │   └── file2.pdf
        ├── category2/
        │   └── file3.pdf
        
        Args:
            dataset_path (str): Path to the root dataset directory
            
        Returns:
            List[str]: List of all text chunks from all PDFs
        """
        all_chunks = []
        
        # Iterate through each category directory
        for category in os.listdir(dataset_path):
            category_path = os.path.join(dataset_path, category)
            
            # Check if it's a directory (not a file)
            if os.path.isdir(category_path):
                # Process each PDF file in the category directory
                for filename in os.listdir(category_path):
                    if filename.lower().endswith('.pdf'):
                        pdf_path = os.path.join(category_path, filename)
                        print(f"Processing: {pdf_path}")
                        
                        # Extract text from PDF
                        text = self.extract_text_from_pdf(pdf_path)
                        
                        # Only process if text was successfully extracted
                        if text.strip():
                            # Split text into chunks
                            chunks = self.chunk_text(text)
                            
                            # Store chunks and their metadata
                            for i, chunk in enumerate(chunks):
                                all_chunks.append(chunk)
                                self.metadata.append({
                                    'category': category,      # CV category (e.g., 'engineering', 'marketing')
                                    'filename': filename,      # Original PDF filename
                                    'chunk_id': i,            # Chunk index within the document
                                    'original_path': pdf_path # Full path to source PDF
                                })
        
        # Store all chunks for later use
        self.documents = all_chunks
        return all_chunks
    
    def create_embeddings(self):
        """
        Generate embeddings for all document chunks and create FAISS index.
        
        Raises:
            ValueError: If no documents have been processed
            
        Returns:
            embeddings: Generated embeddings for all documents
        """
        if not self.documents:
            raise ValueError("No documents processed. Call process_dataset first.")
        
        # Generate embeddings using the sentence transformer model
        embeddings = self.model.encode(self.documents, show_progress_bar=True)
        
        # Initialize FAISS index for cosine similarity search
        dimension = embeddings.shape[1]  # Get embedding dimension
        self.index = faiss.IndexFlatIP(dimension)  # Inner product index for cosine similarity
        
        # Normalize embeddings for cosine similarity (L2 normalization)
        faiss.normalize_L2(embeddings)
        # Add embeddings to the FAISS index
        self.index.add(embeddings.astype('float32'))
        
        return embeddings
    
    def search_similar(self, query: str, top_k: int = 5) -> List[Dict]:
        """
        Search for documents similar to the query using semantic similarity.
        
        Args:
            query (str): Search query string
            top_k (int): Number of top results to return
            
        Returns:
            List[Dict]: List of dictionaries containing:
                - document: The similar text chunk
                - metadata: Associated metadata
                - score: Similarity score (cosine similarity)
                
        Raises:
            ValueError: If index hasn't been created
        """
        if self.index is None:
            raise ValueError("Index not created. Call create_embeddings first.")
        
        # Encode the query into embedding space
        query_embedding = self.model.encode([query])
        # Normalize query embedding for cosine similarity
        faiss.normalize_L2(query_embedding)
        
        # Search the FAISS index for similar documents
        scores, indices = self.index.search(query_embedding.astype('float32'), top_k)
        
        # Compile results with documents, metadata, and similarity scores
        results = []
        for score, idx in zip(scores[0], indices[0]):
            # FAISS pads missing results with -1, so check both bounds
            if 0 <= idx < len(self.documents):
                results.append({
                    'document': self.documents[idx],  # The similar text chunk
                    'metadata': self.metadata[idx],   # Associated metadata
                    'score': float(score)             # Cosine similarity (-1 to 1)
                })
        
        return results

print("✅ CV Processor class defined!")

✅ CV Processor class defined!
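
To see how `chunk_size` and `overlap` interact in `chunk_text`, the sketch below reproduces only the window arithmetic and returns the starting word offset of each chunk. `chunk_start_offsets` is a hypothetical helper written for this illustration. The extra guard against `overlap >= chunk_size` is an addition that the original method does not have.

```python
from typing import List


def chunk_start_offsets(n_words: int, chunk_size: int = 512, overlap: int = 50) -> List[int]:
    """Return the starting word index of each chunk for a text of `n_words` words."""
    if overlap >= chunk_size:
        # Added guard: a non-positive step would make range() fail or never advance.
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap  # 462 words with the defaults
    starts = []
    for start in range(0, n_words, step):
        starts.append(start)
        # Same stopping rule as chunk_text: stop once a chunk reaches the end.
        if start + chunk_size >= n_words:
            break
    return starts


print(chunk_start_offsets(1000))  # [0, 462, 924]
```

With the defaults, a 1000-word CV yields three chunks. Consecutive chunks share 50 words, and the last chunk (words 924 to 999) is shorter than the others.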


## 2. Data Processing Pipeline

Initialize the CV processor and process the dataset. This step:
- Extracts text from all PDF files
- Chunks text into manageable segments
- Generates embeddings for semantic search
- Builds FAISS index for fast retrieval

In [10]:
# Initialize and process CVs
print("📂 Step 1: Processing CV Data...")
cv_processor = CVProcessor()

# Check if processor already exists to avoid reprocessing
if os.path.exists('cv_processor.pkl'):
    print("Loading existing CV processor...")
    # Load pre-processed CV processor from pickle file
    with open('cv_processor.pkl', 'rb') as f:
        cv_processor = pickle.load(f)
    print(f"Loaded {len(cv_processor.documents)} existing document chunks")
else:
    print("Processing dataset...")
    # Process all PDFs in the dataset directory and chunk the text
    documents = cv_processor.process_dataset(dataset_path)
    print(f"✅ Processed {len(documents)} document chunks")

    print("Creating embeddings...")
    # Generate embeddings for all document chunks and build FAISS index
    embeddings = cv_processor.create_embeddings()
    print("✅ Embeddings created successfully!")

    # Save the processor to disk for future use (avoids reprocessing)
    with open('cv_processor.pkl', 'wb') as f:
        pickle.dump(cv_processor, f)
    print("✅ Processor saved successfully!")

# Display final statistics about the processed CV database
print(f"📊 Database stats: {len(cv_processor.documents)} CV chunks ready for search")

📂 Step 1: Processing CV Data...
Loading existing CV processor...
Loaded 5300 existing document chunks
📊 Database stats: 5300 CV chunks ready for search


## 3. Fine-Tuning Data Preparation

The `FineTuningDataPreparer` class prepares training data for job category classification by:
- Extracting text from CV PDFs
- Creating labeled training samples
- Generating category mappings for classification tasks

In [None]:
class FineTuningDataPreparer:
    def __init__(self, dataset_path: str):
        """
        Initialize the FineTuningDataPreparer for creating training datasets from CVs.
        
        Args:
            dataset_path (str): Path to the dataset directory containing categorized CVs
        """
        self.dataset_path = dataset_path
    
    def extract_text_from_pdf(self, pdf_path: str) -> str:
        """
        Extract text content from PDF files for training data.
        
        Args:
            pdf_path (str): Path to the PDF file
            
        Returns:
            str: Extracted text content from the PDF
        """
        text = ""
        try:
            # Open PDF file using pdfplumber
            with pdfplumber.open(pdf_path) as pdf:
                # Iterate through all pages and extract text
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"
        except Exception as e:
            # Handle extraction errors
            print(f"Error extracting text from {pdf_path}: {str(e)}")
        return text
    
    def extract_training_samples(self) -> List[Dict]:
        """
        Extract CV content and create structured training samples.
        
        Returns:
            List[Dict]: List of training samples with text, category, and filename
        """
        training_samples = []
        
        # Iterate through each category directory in the dataset
        for category in os.listdir(self.dataset_path):
            category_path = os.path.join(self.dataset_path, category)
            
            # Process only directories (not files)
            if os.path.isdir(category_path):
                # Process each PDF file in the category directory
                for filename in os.listdir(category_path):
                    if filename.lower().endswith('.pdf'):
                        pdf_path = os.path.join(category_path, filename)
                        print(f"Processing for training: {pdf_path}")
                        
                        # Extract text from PDF
                        text = self.extract_text_from_pdf(pdf_path)
                        
                        # Only include samples with valid text content
                        if text.strip():
                            # Create training sample with truncated text (for efficiency)
                            training_samples.append({
                                'text': text[:2000],  # Use first 2000 characters to manage size
                                'category': category,  # Job category (e.g., 'engineering', 'design')
                                'filename': filename   # Original PDF filename
                            })
        
        return training_samples
    
    def create_classification_dataset(self):
        """
        Create a structured dataset for job category classification training.
        
        Returns:
            tuple: (DataFrame, label_to_id mapping, id_to_label mapping) or (None, None, None) if no samples
        """
        # Extract training samples from all PDFs
        samples = self.extract_training_samples()
        
        # Check if any samples were found
        if not samples:
            print("No training samples found!")
            return None, None, None
        
        # Convert samples to pandas DataFrame for easier processing
        df = pd.DataFrame(samples)
        
        # Create label mappings for classification
        categories = df['category'].unique()
        label_to_id = {category: idx for idx, category in enumerate(categories)}
        id_to_label = {idx: category for category, idx in label_to_id.items()}
        
        # Add numerical labels to the DataFrame
        df['label'] = df['category'].map(label_to_id)
        
        # Print dataset statistics
        print(f"✅ Created dataset with {len(df)} samples across {len(categories)} categories")
        print("📋 Categories:", list(categories))
        
        return df, label_to_id, id_to_label

print("✅ FineTuningDataPreparer class defined!")

✅ FineTuningDataPreparer class defined!
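
The label mappings returned by `create_classification_dataset` are needed again at inference time, so they are typically saved as JSON. The sketch below shows one way to rebuild the reverse mapping after loading, under the assumption that only the name-to-ID dictionary is saved. The category names are illustrative, and sorting them first is a suggestion for reproducible IDs rather than what the class above does.

```python
import json

# Illustrative category names; the real list comes from the dataset folders.
categories = ["ENGINEERING", "ACCOUNTANT", "DESIGNER"]

# Sorting first makes the IDs independent of directory listing order.
label_to_id = {name: idx for idx, name in enumerate(sorted(categories))}

# Round-trip through JSON, as would happen between training and inference.
restored = json.loads(json.dumps(label_to_id))

# Rebuild the reverse mapping from the restored dictionary (values stay ints).
id_to_label = {idx: name for name, idx in restored.items()}
print(id_to_label[0])

# Caveat: saving id_to_label directly would turn its integer keys into strings.
naive = json.loads(json.dumps(id_to_label))
print(list(naive.keys()))
```

Saving the name-to-ID direction and inverting it after loading avoids the string-key surprise when a predicted class index is looked up.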


## 4. Training Data Generation

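Generate the training dataset for fine-tuning the classification model. This step creates:
- Structured training samples with text and category fields
- Label mappings between the 22 job category names and integer IDs
- A labeled DataFrame for model training (class balance depends on how many PDFs each category folder contains)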
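
Whether the resulting dataset is balanced depends on the PDFs available per category, so it is worth checking before training. A minimal sketch, assuming a plain list of category labels (the counts below are invented) and a hypothetical helper `imbalance_ratio`:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (counts, ratio) where ratio = largest class size / smallest class size."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# Invented label column for illustration only
labels = ["HR"] * 110 + ["CHEF"] * 118 + ["BPO"] * 22

counts, ratio = imbalance_ratio(labels)
for category, n in counts.most_common():
    print(f"{category:<6} {n}")
print(f"max/min ratio: {ratio:.2f}")
```

A ratio well above 1 suggests passing `stratify=df['label']` to `train_test_split` and reading per-class metrics alongside overall accuracy.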

In [None]:
# Step 2: Preparing Fine-Tuning Data for Model Training
print("🎯 Step 2: Preparing Fine-Tuning Data...")

# Initialize the data preparer with the dataset path
ft_preparer = FineTuningDataPreparer(dataset_path)

# Create classification dataset from the CV PDFs
# Returns:
# - training_df: DataFrame with text samples and labels
# - label_mapping: Dictionary mapping category names to numerical IDs
# - id_to_label: Reverse mapping from numerical IDs back to category names
training_df, label_mapping, id_to_label = ft_preparer.create_classification_dataset()

# Check if training data was successfully created
if training_df is not None:
    # Save the label mapping to JSON file for future use during inference
    with open('label_mapping.json', 'w') as f:
        json.dump(label_mapping, f)
    print("✅ Label mapping saved!")
    
    # Display training dataset statistics and overview
    print(f"📊 Training data overview:")
    print(f"   - Total samples: {len(training_df)}")  # Number of CV samples processed
    print(f"   - Categories: {len(label_mapping)}")   # Number of unique job categories
    print(f"   - Sample categories: {list(label_mapping.keys())[:5]}...")  # Show first 5 categories as examples
    
else:
    # Handle case where no training data could be extracted
    print("❌ No training data available - check your dataset path")
    training_df = None  # Explicitly set to None for clarity

🎯 Step 2: Preparing Fine-Tuning Data...
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\10554236.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\10674770.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\11163645.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\11759079.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\12065211.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\12202337.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\12338274.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\12442909.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\12780508.pdf
Processing for training: C:\Users\merie\Downloads\dataset\data\data\ACCOUNTANT\12802330.pdf
Processing for training: C:\Users\merie\

## 5. Classification Model Trainer

The `CVClassifierTrainer` class implements:
- BERT-based sequence classification
- Hugging Face `Trainer`-based training with per-epoch evaluation
- Model fine-tuning for job categorization
- Performance metrics tracking
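
The tokenizer settings used by `prepare_dataset` (`padding="max_length"`, `truncation=True`, `max_length=512`) give every example the same shape. The sketch below mimics that behaviour on lists of integer token IDs. `pad_or_truncate` and the pad ID of 0 are assumptions made for illustration; this is not the Hugging Face implementation, which also handles special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Anything past the length limit is discarded, so a 2000-character CV excerpt that tokenizes to more than 512 tokens loses its tail.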

In [None]:
class CVClassifierTrainer:
    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 22):
        """
        Initialize the CV Classifier Trainer for fine-tuning transformer models.
        
        Args:
            model_name (str): Hugging Face model name/path to use as base model
            num_labels (int): Number of classification categories (job types)
        """
        self.model_name = model_name
        self.num_labels = num_labels
        # Load the tokenizer for the specified model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Load the base model and adapt it for sequence classification
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, 
            num_labels=num_labels  # Set number of output classes
        )
    
    def prepare_dataset(self, df: pd.DataFrame, text_column: str = 'text', label_column: str = 'label'):
        """
        Prepare and tokenize the dataset for training.
        
        Args:
            df (pd.DataFrame): DataFrame containing text and label columns
            text_column (str): Name of the column containing CV text
            label_column (str): Name of the column containing numerical labels
            
        Returns:
            tuple: Tokenized training and evaluation datasets
        """
        # Split data into training (80%) and evaluation (20%) sets
        train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)
        
        # Convert pandas DataFrames to Hugging Face datasets
        train_dataset = Dataset.from_pandas(train_df)
        eval_dataset = Dataset.from_pandas(eval_df)
        
        def tokenize_function(examples):
            """
            Tokenize text examples for model input.
            
            Args:
                examples: Batch of examples from the dataset
                
            Returns:
                dict: Tokenized inputs with attention masks
            """
            return self.tokenizer(
                examples[text_column], 
                padding="max_length",      # Pad sequences to max length
                truncation=True,           # Truncate sequences longer than max_length
                max_length=512             # BERT's maximum sequence length
            )
        
        # Apply tokenization to both datasets in batches for efficiency
        tokenized_train = train_dataset.map(tokenize_function, batched=True)
        tokenized_eval = eval_dataset.map(tokenize_function, batched=True)
        
        return tokenized_train, tokenized_eval
    
    def compute_metrics(self, eval_pred):
        """
        Compute evaluation metrics during training.
        
        Args:
            eval_pred: Tuple containing predictions and true labels
            
        Returns:
            dict: Dictionary with accuracy and F1 score
        """
        predictions, labels = eval_pred
        # Convert logits to class predictions (argmax)
        predictions = np.argmax(predictions, axis=1)
        
        return {
            'accuracy': accuracy_score(labels, predictions),
            'f1': f1_score(labels, predictions, average='weighted')  # Weighted F1 for imbalanced classes
        }
    
    def train(self, train_dataset, eval_dataset, output_dir: str = "./cv_classifier"):
        """
        Train the classification model.
        
        Args:
            train_dataset: Tokenized training dataset
            eval_dataset: Tokenized evaluation dataset
            output_dir (str): Directory to save the trained model
            
        Returns:
            Trainer: Hugging Face trainer object
        """
        # Define training arguments and hyperparameters
        training_args = TrainingArguments(
            output_dir=output_dir,          # Output directory for model checkpoints
            num_train_epochs=3,             # Number of training epochs
            per_device_train_batch_size=8,  # Batch size per device during training
            per_device_eval_batch_size=8,   # Batch size for evaluation
            warmup_steps=500,               # Number of warmup steps for learning rate
            weight_decay=0.01,              # Strength of weight decay
            logging_dir='./logs',           # Directory for storing logs
            logging_steps=10,               # Log every X steps
            evaluation_strategy="epoch",    # Evaluate after each epoch
            save_strategy="epoch",          # Save checkpoint after each epoch
            load_best_model_at_end=True,    # Load the best model at the end of training
        )
        
        # Initialize the trainer
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=self.compute_metrics,  # Custom metrics function
        )
        
        print("Starting training...")
        # Start the training process
        trainer.train()
        
        # Save the final model and tokenizer
        trainer.save_model()
        self.tokenizer.save_pretrained(output_dir)
        
        return trainer

print("✅ CVClassifierTrainer class defined!")

✅ CVClassifierTrainer class defined!
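
`compute_metrics` turns logits into class predictions with argmax and reports accuracy and weighted F1. The sketch below reproduces the same arithmetic in plain Python on invented logits, to show what `average='weighted'` does: per-class F1 scores are averaged with weights equal to each class's number of true samples. The helper names are hypothetical.

```python
from collections import Counter

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        fp = sum(1 for l, p in zip(labels, preds) if l != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n
    return total / len(labels)

# Invented logits for four samples and three classes
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.1], [0.1, 0.2, 3.0]]
labels = [0, 1, 1, 2]
preds = [argmax(row) for row in logits]

print("predictions:", preds)
print("accuracy:", accuracy(labels, preds))
print("weighted F1:", round(weighted_f1(labels, preds), 4))
```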


## 6. Model Training Execution

Train the fine-tuned classifier with:
- Custom training loop for better control
- Real-time metrics monitoring
- Validation set evaluation
- Model checkpointing and saving
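
The learning rate schedule used in the training cell ramps linearly from zero to the base rate during warmup, then decays linearly to zero. The sketch below reproduces that multiplier in plain Python. `linear_warmup_multiplier` is a hypothetical stand-in written for illustration, and the step counts assume 249 batches per epoch over 3 epochs, as in the run logged in this section.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)           # linear ramp up
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # linear decay

base_lr = 5e-5
total_steps = 249 * 3   # batches per epoch * epochs
warmup_steps = 500

for step in (0, 100, 249, 500, 600, 747):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With these numbers the peak rate is reached only at step 500 of 747, so most of the run is spent warming up. A shorter warmup, for example a fixed fraction of `total_steps`, is a common alternative.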

In [None]:
# Step 3: Training Fine-Tuned Classifier for CV Categorization
print("🤖 Step 3: Training Fine-Tuned Classifier...")

# Check if training data is available and model hasn't been trained yet
if training_df is not None and not os.path.exists('./cv_classifier'):
    # Initialize the classifier trainer with the number of unique job categories
    classifier_trainer = CVClassifierTrainer(num_labels=len(label_mapping))
    
    # Prepare and tokenize the training and evaluation datasets
    train_dataset, eval_dataset = classifier_trainer.prepare_dataset(training_df)
    
    print("🚀 Starting custom training loop...")
    
    # Import required libraries for custom training loop
    import torch
    from torch.utils.data import DataLoader
    from transformers import get_linear_schedule_with_warmup
    import numpy as np
    
    # Setup device - use GPU if available, otherwise CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    
    # Move the model to the appropriate device (GPU/CPU)
    model = classifier_trainer.model.to(device)
    tokenizer = classifier_trainer.tokenizer
    
    # Convert Hugging Face datasets to PyTorch tensor format
    train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    eval_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    
    # Create data loaders for batching
    train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)  # Shuffle for better training
    eval_dataloader = DataLoader(eval_dataset, batch_size=8)  # No shuffle for evaluation
    
    # Initialize optimizer with weight decay for regularization
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
    
    # Calculate total training steps for scheduler
    total_steps = len(train_dataloader) * 3  # batches per epoch * number of epochs
    
    # Create learning rate scheduler with warmup
    scheduler = get_linear_schedule_with_warmup(
        optimizer, 
        num_warmup_steps=500,  # Linear LR warmup (with 249 batches x 3 epochs = 747 steps, warmup covers about two thirds of training)
        num_training_steps=total_steps
    )
    
    # Set model to training mode
    model.train()
    
    # Training loop over 3 epochs
    for epoch in range(3):
        total_loss = 0
        all_predictions = []  # Store predictions for metric calculation
        all_labels = []       # Store true labels for metric calculation
        
        # Iterate through training batches
        for batch_idx, batch in enumerate(train_dataloader):
            # Move batch tensors to the same device as model
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            # Reset gradients from previous iteration
            optimizer.zero_grad()
            
            # Forward pass - compute model outputs and loss
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            
            # Backward pass - compute gradients
            loss.backward()
            
            # Clip gradients to prevent explosion
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            
            # Update model parameters
            optimizer.step()
            scheduler.step()  # Update learning rate
            
            # Accumulate loss for reporting
            total_loss += loss.item()
            
            # Calculate predictions for metrics (without tracking gradients)
            with torch.no_grad():
                logits = outputs.logits
                predictions = torch.argmax(logits, dim=-1)  # Get predicted class
                all_predictions.extend(predictions.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
            
            # Print progress every 10 batches
            if batch_idx % 10 == 0:
                current_loss = loss.item()
                print(f"Epoch {epoch+1}, Batch {batch_idx}/{len(train_dataloader)}, Loss: {current_loss:.4f}")
        
        # Calculate training metrics for the epoch
        train_accuracy = accuracy_score(all_labels, all_predictions)
        train_f1 = f1_score(all_labels, all_predictions, average='weighted')
        avg_loss = total_loss / len(train_dataloader)
        
        # Print epoch summary
        print(f"✅ Epoch {epoch+1} completed:")
        print(f"   - Average Loss: {avg_loss:.4f}")
        print(f"   - Training Accuracy: {train_accuracy:.4f}")
        print(f"   - Training F1: {train_f1:.4f}")
        
        # Validation phase
        model.eval()  # Set model to evaluation mode
        val_predictions = []
        val_labels = []
        
        # Disable gradient computation for validation (faster, less memory)
        with torch.no_grad():
            for val_batch in eval_dataloader:
                val_input_ids = val_batch['input_ids'].to(device)
                val_attention_mask = val_batch['attention_mask'].to(device)
                val_batch_labels = val_batch['label'].to(device)
                
                # Forward pass for validation
                val_outputs = model(val_input_ids, attention_mask=val_attention_mask)
                val_preds = torch.argmax(val_outputs.logits, dim=-1)
                val_predictions.extend(val_preds.cpu().numpy())
                val_labels.extend(val_batch_labels.cpu().numpy())
        
        # Calculate validation metrics
        val_accuracy = accuracy_score(val_labels, val_predictions)
        val_f1 = f1_score(val_labels, val_predictions, average='weighted')
        
        print(f"   - Validation Accuracy: {val_accuracy:.4f}")
        print(f"   - Validation F1: {val_f1:.4f}")
        
        # Set model back to training mode for next epoch
        model.train()
    
    print("🎯 Training completed successfully!")
    
    # Save the trained model and tokenizer for future use
    model.save_pretrained("./cv_classifier")
    tokenizer.save_pretrained("./cv_classifier")
    
    print("💾 Model saved to: ./cv_classifier")
    print(f"📊 Final Validation Accuracy: {val_accuracy:.4f}")
    print(f"📊 Final Validation F1: {val_f1:.4f}")
        
else:
    # Handle cases where training is skipped
    if os.path.exists('./cv_classifier'):
        print("✅ Fine-tuned model already exists at ./cv_classifier")
    else:
        print("❌ Skipping training - no training data available")

🤖 Step 3: Training Fine-Tuned Classifier...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1986/1986 [00:01<00:00, 1625.47 examples/s]
Map: 100%|██████████| 497/497 [00:00<00:00, 1519.98 examples/s]


🚀 Starting custom training loop...
Using device: cpu
Epoch 1, Batch 0/249, Loss: 3.2692
Epoch 1, Batch 10/249, Loss: 3.1067
Epoch 1, Batch 20/249, Loss: 3.2631
Epoch 1, Batch 30/249, Loss: 3.0434
Epoch 1, Batch 40/249, Loss: 3.2701
Epoch 1, Batch 50/249, Loss: 3.2211
Epoch 1, Batch 60/249, Loss: 3.1530
Epoch 1, Batch 70/249, Loss: 3.1973
Epoch 1, Batch 80/249, Loss: 3.1603
Epoch 1, Batch 90/249, Loss: 3.0785
Epoch 1, Batch 100/249, Loss: 3.0789
Epoch 1, Batch 110/249, Loss: 3.0385
Epoch 1, Batch 120/249, Loss: 3.1629
Epoch 1, Batch 130/249, Loss: 3.0685
Epoch 1, Batch 140/249, Loss: 3.2311
Epoch 1, Batch 150/249, Loss: 3.2260
Epoch 1, Batch 160/249, Loss: 3.1675
Epoch 1, Batch 170/249, Loss: 2.9358
Epoch 1, Batch 180/249, Loss: 2.6773
Epoch 1, Batch 190/249, Loss: 2.5379
Epoch 1, Batch 200/249, Loss: 2.4632
Epoch 1, Batch 210/249, Loss: 2.6021
Epoch 1, Batch 220/249, Loss: 2.3858
Epoch 1, Batch 230/249, Loss: 2.0739
Epoch 1, Batch 240/249, Loss: 2.3614
✅ Epoch 1 completed:
   - Average

## 7. Enhanced CV Analyzer

The `EnhancedCVAnalyzer` integrates all components:
- **RAG system** for semantic search
- **Fine-tuned classifier** for category prediction  
- **Gemini LLM** for intelligent analysis
- Fallback mechanisms for robustness
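
The fallback behaviour can be expressed as a small loop over candidate model names. This is a sketch only: `first_available` and `fake_loader` are hypothetical helpers, and the loader stands in for whatever call actually constructs a model.

```python
def first_available(candidates, loader):
    """Try each candidate in order; return (name, obj) for the first that loads, else (None, None)."""
    for name in candidates:
        try:
            return name, loader(name)
        except Exception as exc:  # keep going, but report why this candidate failed
            print(f"{name} not available: {exc}")
    return None, None

def fake_loader(name):
    # Stand-in loader: pretend only one model can be constructed
    if name != "gemini-1.5-flash":
        raise RuntimeError("model not found")
    return {"model": name}

name, model = first_available(["gemini-2.0-flash-exp", "gemini-1.5-flash"], fake_loader)
print("selected:", name)
```

Keeping the candidates in a list makes the preference order explicit and avoids nesting one `try` block inside another.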

In [57]:
# Configure the Gemini API; read the key from the environment instead of hard-coding it in the notebook
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
genai.configure(api_key=GEMINI_API_KEY)

class EnhancedCVAnalyzer:
    def __init__(self, cv_processor: CVProcessor, classifier_path: str = None):
        """
        Initialize Enhanced CV Analyzer with RAG and fine-tuned classification capabilities.
        
        Args:
            cv_processor (CVProcessor): Pre-initialized CV processor for semantic search
            classifier_path (str): Path to fine-tuned classifier model directory
        """
        self.cv_processor = cv_processor
        
        # Load fine-tuned classifier if available
        if classifier_path and os.path.exists(classifier_path):
            try:
                # Load tokenizer and model from the fine-tuned classifier
                self.classifier_tokenizer = AutoTokenizer.from_pretrained(classifier_path)
                self.classifier_model = AutoModelForSequenceClassification.from_pretrained(classifier_path)
                
                # Load label mapping created during training
                with open('label_mapping.json', 'r') as f:
                    label_mapping = json.load(f)
                # Create reverse mapping from ID to label name
                self.id_to_label = {int(v): k for k, v in label_mapping.items()}
                print("✅ Fine-tuned classifier loaded successfully!")
            except Exception as e:
                print(f"❌ Error loading classifier: {e}")
                self.classifier_model = None
        else:
            self.classifier_model = None
            print("ℹ️ No fine-tuned classifier available")
        
        # Initialize Gemini LLM with fallback options
        try:
            self.llm_model = genai.GenerativeModel('gemini-2.0-flash-exp')
            print("✅ Using model: gemini-2.0-flash-exp")
        except Exception as e:
            print(f"❌ gemini-2.0-flash-exp not available: {e}")
            try:
                # Fallback to a more stable model if experimental one fails
                self.llm_model = genai.GenerativeModel('gemini-1.5-flash')
                print("✅ Using fallback model: gemini-1.5-flash")
            except Exception:
                self.llm_model = None
                print("❌ No working LLM models found")
    
    def predict_category(self, text: str) -> str:
        """
        Use fine-tuned model to predict CV category/job role.
        
        Args:
            text (str): CV text to classify
            
        Returns:
            str: Predicted job category or error message
        """
        if self.classifier_model is None:
            return "Classifier not available"
        
        try:
            # Tokenize input text for the classifier
            inputs = self.classifier_tokenizer(
                text[:2000],  # Same 2000-character window as in training; the tokenizer truncates to 512 tokens
                return_tensors="pt",      # Return PyTorch tensors
                padding=True,             # Pad sequences to same length
                truncation=True,          # Truncate longer sequences
                max_length=512            # Maximum sequence length
            )
            
            # Make prediction without gradient computation (faster)
            with torch.no_grad():
                outputs = self.classifier_model(**inputs)
                # Convert logits to probabilities using softmax
                predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
                # Get the predicted class ID
                predicted_class_id = predictions.argmax().item()
                
            # Map class ID back to human-readable label
            return self.id_to_label.get(predicted_class_id, "Unknown")
        except Exception as e:
            return f"Prediction error: {str(e)}"
    
    def analyze_cv(self, query: str) -> str:
        """
        Enhanced RAG analysis combining semantic search and fine-tuned classification.
        
        Args:
            query (str): Job requirement or search query
            
        Returns:
            str: Comprehensive CV analysis and hiring recommendations
        """
        if self.llm_model is None:
            return "LLM model not available."
        
        # Search for relevant CV chunks using semantic similarity
        similar_docs = self.cv_processor.search_similar(query, top_k=5)
        
        if not similar_docs:
            return "No relevant CV information found for your query. Try searching for different skills or roles."
        
        # Prepare context from similar documents
        context = "\n\n".join([doc['document'] for doc in similar_docs])
        
        # Use fine-tuned classifier for additional insights on top matches
        category_insights = []
        for doc in similar_docs[:3]:  # Analyze top 3 matches
            category = self.predict_category(doc['document'])
            category_insights.append(f"• {category} (retrieval score: {doc['score']:.2f})")
        
        # IMPROVED PROMPT - More specific and query-focused for better recruitment analysis
        enhanced_prompt = f"""
        You are an experienced HR recruiter analyzing CVs for a client.

JOB REQUIREMENT: {query}

CANDIDATE CV DATA:
{context}

JOB CATEGORIES:
{chr(10).join(category_insights)}

TASK: Provide a recruitment assessment focusing on:
- Which candidates best match the job requirement
- Their specific relevant skills and experience
- Any gaps or concerns
- Your hiring recommendation

Keep it professional and concise (under 150 words). Focus on actionable insights for hiring.
"""
        
        try:
            # Generate analysis using Gemini LLM
            response = self.llm_model.generate_content(enhanced_prompt)
            return response.text
        except Exception as e:
            return f"Error generating response: {str(e)}"
    
    def categorize_candidate(self, cv_text: str) -> str:
        """
        Enhanced categorization using fine-tuned model with LLM fallback.
        
        Args:
            cv_text (str): Full CV text to categorize
            
        Returns:
            str: Job category or role classification
        """
        # Prefer fine-tuned classifier for accuracy
        if self.classifier_model is not None:
            return self.predict_category(cv_text)
        else:
            # Fallback to LLM-based categorization if classifier unavailable
            if self.llm_model is None:
                return "Categorization failed"
            prompt = f"Analyze this CV text and categorize it: {cv_text[:1000]}"
            try:
                response = self.llm_model.generate_content(prompt)
                return response.text
            except Exception:
                return "Categorization failed"

print("✅ EnhancedCVAnalyzer class defined!")

✅ EnhancedCVAnalyzer class defined!
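
`predict_category` applies softmax before taking the argmax. Softmax does not change which class wins, but the winning probability can serve as a classifier confidence, which is distinct from the retrieval score returned by the search step. A dependency-free sketch with invented logits and a three-entry label map (both are placeholders, and `predict_with_confidence` is a hypothetical helper):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits, id_to_label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id_to_label.get(best, "Unknown"), probs[best]

id_to_label = {0: "ACCOUNTANT", 1: "ENGINEERING", 2: "TEACHER"}
label, confidence = predict_with_confidence([0.2, 2.1, -0.5], id_to_label)
print(f"{label} ({confidence:.2f})")
```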


## 8. System Initialization

Initialize the complete analysis system and report the status of each component:
- RAG search capability
- Fine-tuned classifier
- LLM integration

Each component shows ✅ in the output when it loads and ❌ when it does not.

In [59]:
# Step 4: Initializing the Enhanced CV Analyzer with all integrated components
print("🚀 Step 4: Initializing Enhanced Analyzer...")

# Initialize the enhanced analyzer with the CV processor and optional classifier
enhanced_analyzer = EnhancedCVAnalyzer(
    cv_processor,  # Pass the pre-initialized CV processor with FAISS index
    classifier_path="./cv_classifier" if os.path.exists("./cv_classifier") else None  # Conditionally load fine-tuned classifier if available
)

print("✅ Enhanced analyzer initialized successfully!")

# Display system capabilities and component status
print("📊 System capabilities:")
print(f"   - RAG Search: {'✅' if cv_processor.index is not None else '❌'}")  # Check if FAISS index is loaded for semantic search
print(f"   - Fine-Tuned Classifier: {'✅' if enhanced_analyzer.classifier_model is not None else '❌'}")  # Check if custom classifier is available
print(f"   - LLM Integration: {'✅' if enhanced_analyzer.llm_model is not None else '❌'}")  # Check if Gemini AI is connected

🚀 Step 4: Initializing Enhanced Analyzer...
✅ Fine-tuned classifier loaded successfully!
✅ Using model: gemini-2.0-flash-exp
✅ Enhanced analyzer initialized successfully!
📊 System capabilities:
   - RAG Search: ✅
   - Fine-Tuned Classifier: ✅
   - LLM Integration: ✅


## 9. Comprehensive System Testing

Test the integrated system with diverse queries across different job domains:
- Technical roles (Python developers)
- Business roles (Financial analysts) 
- Creative roles (Marketing professionals)
- Validation of category prediction

In [60]:
# Step 5: Comprehensive System Testing and Validation
print("🧪 Step 5: Testing the System...")

# Define diverse test queries to evaluate different system capabilities
test_queries = [
    "Find candidates with Python programming experience",           # Technical/Programming role
    "Looking for financial analysts with Excel skills",             # Finance/Business role
    "Software developers with cloud computing background",          # Cloud/Infrastructure role
    "Marketing professionals with digital media experience"         # Marketing/Creative role
]

print("Running test queries...\n")

# Iterate through each test query to validate system functionality
for i, query in enumerate(test_queries, 1):
    print(f"🔍 Test {i}: {query}")
    print("-" * 50)  # Visual separator for better readability
    
    # Execute the enhanced CV analysis using RAG + classifier + LLM
    result = enhanced_analyzer.analyze_cv(query)
    print(f"📝 Result:\n{result}")
    
    # Additional test: Validate category prediction functionality
    # Only run this once (on first query) to avoid redundancy
    if i == 1 and enhanced_analyzer.classifier_model is not None:
        sample_text = "Python developer with machine learning experience and cloud computing skills."
        predicted = enhanced_analyzer.predict_category(sample_text)
        print(f"🎯 Category Prediction Test: '{sample_text}' -> {predicted}")
    
    print("=" * 80)  # Major separator between test cases
    print()  # Blank line for better visual spacing

print("✅ System testing completed!")

🧪 Step 5: Testing the System...
Running test queries...

🔍 Test 1: Find candidates with Python programming experience
--------------------------------------------------
📝 Result:
Here's an assessment of the candidates focusing on Python programming experience:

**Best Matches:**

*   **Software Developer:** Strongest candidate. Lists Python experience, including related frameworks (TensorFlow, Keras, Pandas, NLTK). Details Python projects involving BERT, ELMO, Flask, and data visualization.
*   **Information Designer:** Lists Python under "Technical Skills," but lacks detailed project experience.

**Relevant Skills:**

*   **Software Developer:** Python, Flask, BERT, ELMO, Tensorflow, Keras, Pandas, NLTK.
*   **Information Designer:** Python.

**Gaps/Concerns:**

*   **Information Designer:** Needs verification of proficiency and practical application of Python.
*   **ALL:** The experience level is unclear since we are not given the number of years of experience.

**Recommendation:**



# Production Deployment

## 10. REST API Development

Build a production-ready Flask API with:
- RESTful endpoints for CV analysis
- Health monitoring and diagnostics
- CORS support for web applications
- Structured request/response handling

In [61]:
class CVAnalysisAPI:
    def __init__(self, analyzer: EnhancedCVAnalyzer):
        """
        Initialize CV Analysis API handler.
        
        Args:
            analyzer (EnhancedCVAnalyzer): Pre-initialized analyzer with all components
        """
        self.analyzer = analyzer
    
    def process_request(self, data: Dict) -> Dict:
        """
        Process incoming API requests and route to appropriate functionality.
        
        Args:
            data (Dict): Request data containing action, query, text, etc.
            
        Returns:
            Dict: Response data with analysis results or error messages
        """
        # Extract request parameters with defaults
        action = data.get('action', 'analyze')  # Default to analysis action
        query = data.get('query', '')           # Search query for analysis
        category = data.get('category', '')     # Optional category filter
        
        # Route to analysis functionality
        if action == 'analyze':
            if not query:
                return {'error': 'Query is required for analysis'}
            
            # Perform comprehensive RAG analysis with fine-tuned components
            result = self.analyzer.analyze_cv(query)
            
            return {
                'action': 'analysis',
                'query': query,
                'category': category,
                'analysis': result,
                'fine_tuned': self.analyzer.classifier_model is not None,  # True only when the fine-tuned classifier is loaded
                'timestamp': pd.Timestamp.now().isoformat()  # Add timestamp for tracking
            }
        
        # Route to categorization functionality
        elif action == 'categorize':
            text = data.get('text', query)  # Use text field or fallback to query
            if not text:
                return {'error': 'Text is required for categorization'}
            
            # Use fine-tuned classifier for accurate categorization
            if self.analyzer.classifier_model is not None:
                category = self.analyzer.predict_category(text)
                return {
                    'action': 'categorization',
                    'text': text[:500],  # Return truncated text for response
                    'predicted_category': category,
                    'fine_tuned': True,  # Indicates classifier was used
                    'timestamp': pd.Timestamp.now().isoformat()
                }
            else:
                return {'error': 'Fine-tuned classifier not available'}
        
        # Handle unknown actions
        else:
            return {'error': f'Unknown action: {action}'}

def setup_flask_app(analyzer: EnhancedCVAnalyzer):
    """
    Setup Flask web application with CV analysis endpoints.
    
    Args:
        analyzer (EnhancedCVAnalyzer): Pre-initialized analyzer instance
        
    Returns:
        Flask: Configured Flask application instance
    """
    app = Flask(__name__)
    CORS(app)  # Enable Cross-Origin Resource Sharing for web clients
    
    # Initialize API handler
    cv_api = CVAnalysisAPI(analyzer)
    
    @app.route('/analyze', methods=['POST'])
    def analyze_cv():
        """
        Analyze CVs based on job requirements query.
        
        Expected JSON:
        {
            "action": "analyze",
            "query": "Find candidates with Python experience",
            "category": "optional_category_filter"
        }
        """
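        try:
            # silent=True returns None instead of raising on a missing or malformed JSON body
            data = request.get_json(silent=True) or {}
            result = cv_api.process_request(data)
            # Report validation failures as 400 rather than 200
            status = 400 if 'error' in result else 200
            return jsonify(result), status
        except Exception as e:
            # Handle any unexpected errors
            return jsonify({'error': str(e)}), 500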
    
    @app.route('/health', methods=['GET'])
    def health_check():
        """
        Health check endpoint to verify API is running.
        """
        return jsonify({
            'status': 'healthy', 
            'service': 'CV Analysis API',
            'version': '1.0'
        })
    
    @app.route('/categories', methods=['GET'])
    def get_categories():
        """
        Get list of available job categories for classification.
        """
        categories = [
            'HR', 'ACCOUNTANT', 'ADVOCATE', 'APPAREL', 'ARTS', 'AUTOMOBILE', 
            'AVIATION', 'BANKING', 'BUSINESS-DEVELOPMENT', 'BPO', 'CHEF', 
            'CONSTRUCTION', 'CONSULTANT', 'DESIGNER', 'DIGITAL-MEDIA', 
            'ENGINEERING', 'FINANCE', 'HEALTHCARE', 'INFORMATION-TECHNOLOGY',
            'SALES', 'PUBLIC-RELATIONS', 'TEACHER'
        ]
        return jsonify({'categories': categories})
    
    return app

print("✅ Flask API setup complete!")

✅ Flask API setup complete!
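
The `/analyze` docstring above describes the expected JSON body but the route passes it to the handler unchecked. A minimal sketch of a pre-check follows, assuming the category filter should match the names served by `/categories`. `validate_analyze_payload` and `KNOWN_CATEGORIES` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
from typing import Any, Optional

# Category names copied from the /categories endpoint above.
KNOWN_CATEGORIES = {
    "HR", "ACCOUNTANT", "ADVOCATE", "APPAREL", "ARTS", "AUTOMOBILE",
    "AVIATION", "BANKING", "BUSINESS-DEVELOPMENT", "BPO", "CHEF",
    "CONSTRUCTION", "CONSULTANT", "DESIGNER", "DIGITAL-MEDIA",
    "ENGINEERING", "FINANCE", "HEALTHCARE", "INFORMATION-TECHNOLOGY",
    "SALES", "PUBLIC-RELATIONS", "TEACHER",
}


def validate_analyze_payload(payload: Any) -> Optional[str]:
    """Return an error message for a malformed payload, or None if it looks valid."""
    if not isinstance(payload, dict):
        return "Request body must be a JSON object"

    action = payload.get("action")
    if not isinstance(action, str) or not action.strip():
        return "Missing 'action'"

    # An "analyze" request needs a non-empty query string.
    query = payload.get("query")
    if action == "analyze" and (not isinstance(query, str) or not query.strip()):
        return "Missing 'query'"

    # The category filter is optional; check it only when one is supplied.
    category = payload.get("category")
    if category:
        if not isinstance(category, str) or category.upper() not in KNOWN_CATEGORIES:
            return f"Unknown category: {category}"

    return None
```

Inside the route, a non-`None` return value could be sent back with status 400 instead of letting the request reach the analyzer.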


In [62]:
# API Testing with Correct HTTP Methods
print("🔧 Testing API with Correct Method...")

import requests
import json

# Base URL for the Flask API (assuming it's running on localhost port 8000)
base_url = "http://localhost:8000"

# CORRECT: Use POST method for /analyze endpoint with JSON payload
test_data = {
    "action": "analyze",  # Specify the action type
    "query": "Find candidates with Python programming experience",  # Search query
    "category": "INFORMATION-TECHNOLOGY"  # Optional category filter
}

try:
    # Send POST request with JSON data (correct method for /analyze endpoint)
    response = requests.post(f"{base_url}/analyze", json=test_data, timeout=60)  # POST, not GET; the timeout makes the Timeout handler below reachable
    
    # Print HTTP status code for immediate feedback
    print(f"✅ Response Status: {response.status_code}")
    
    # Process successful responses
    if response.status_code == 200:
        result = response.json()
        print("🎯 Analysis Result:")
        # Pretty print the JSON response for readability
        print(json.dumps(result, indent=2))
    else:
        # Handle HTTP errors (4xx, 5xx status codes)
        print(f"❌ Error: {response.text}")
        
except requests.exceptions.ConnectionError:
    print("❌ Connection failed - Make sure the Flask server is running on port 8000")
except requests.exceptions.Timeout:
    print("❌ Request timeout - Server took too long to respond")
except Exception as e:
    print(f"❌ Request failed: {e}")  # Catch any other unexpected errors

🔧 Testing API with Correct Method...
✅ Response Status: 200
🎯 Analysis Result:
{
  "analysis": "Based on the provided CVs, the \"SOFTWARE DEVELOPER\" is the strongest candidate.\n\n**Strengths:**\n*   Explicit mention of Python experience in \"Skills\" and \"Projects\" section.\n*   Demonstrates experience with relevant frameworks like TensorFlow, Keras, Pandas, and NLTK.\n*   \"Question Answering System\" and \"Data Visualization Tool\" projects showcase practical Python application.\n\n**Gaps/Concerns:**\n*   The candidate has extensive experience using C#, so make sure their Python experience is adequate.\n\n**Recommendation:** Prioritize interviewing the \"SOFTWARE DEVELOPER.\" Verify Python proficiency and the depth of experience during the interview. The \"INFORMATION DESIGNER\" and \"INFORMATION DESIGNER\" candidates are less relevant as Python isn't prominent in their profiles. The last candidate might be worth a look if he wants to switch to IT, but he has no proven commercial

## 12. Cloud Deployment with Ngrok

Expose the local API to the internet using ngrok tunneling for:
- External access and testing
- Webhook integrations
- Demo and sharing capabilities

In [None]:
# Cell: Start ngrok tunnel for public API access
print("🌐 Starting ngrok tunnel...")

import subprocess
import time

def start_ngrok_tunnel():
    """
    Start ngrok tunnel to expose local Flask server to the internet.
    
    Returns:
        subprocess.Popen: Ngrok process object or None if failed
    """
    try:
        # Start ngrok tunnel forwarding port 8000 to a public URL
        process = subprocess.Popen(
            ['ngrok', 'http', '8000'],  # Command to start ngrok for port 8000
            stdout=subprocess.PIPE,     # Capture standard output
            stderr=subprocess.PIPE,     # Capture error output
            text=True                   # Return output as text (not bytes)
        )
        
        print("⏳ Waiting for ngrok to start...")
        time.sleep(5)  # Give ngrok time to initialize and establish tunnel
        
        # Validate the ngrok configuration file (this does not confirm the tunnel is up)
        status_process = subprocess.run(
            ['ngrok', 'config', 'check'],  # Check the config file for errors
            capture_output=True,           # Capture both stdout and stderr
            text=True                      # Return output as text
        )
        
        # List tunnels through the ngrok API (this command needs an ngrok API key configured)
        tunnels_process = subprocess.run(
            ['ngrok', 'api', 'tunnels', 'list'],  # List all active tunnels
            capture_output=True,
            text=True
        )
        
        # Print diagnostic information
        print("ngrok status:", status_process.stdout)
        if tunnels_process.stdout:
            print("Active tunnels:", tunnels_process.stdout)
            
        return process  # Return the process object for later management
        
    except FileNotFoundError:
        print("❌ ngrok not found. Please install ngrok first:")
        print("   - Download from: https://ngrok.com/download")
        print("   - Or install via pip: pip install pyngrok")
        return None
    except Exception as e:
        print(f"❌ Failed to start ngrok: {e}")
        return None

# Start ngrok tunnel in background
ngrok_process = start_ngrok_tunnel()

if ngrok_process:
    print("✅ ngrok tunnel started!")
    print("📢 Your API is now accessible via public URL (check ngrok output above)")
    print("💡 Use the ngrok URL to test your API from external services")
else:
    print("❌ Failed to start ngrok tunnel")
    print("💡 Alternative: You can still use the API locally at http://localhost:8000")

🌐 Starting ngrok tunnel...
⏳ Waiting for ngrok to start...
ngrok status: Downloading ngrok ...

In [109]:
# Cell: Check ngrok tunnel details
print("🔍 Checking ngrok tunnel status")

try:
    response = requests.get('http://localhost:4040/api/tunnels', timeout=5)
    tunnels = response.json().get('tunnels', [])
    
    print("📊 Active ngrok tunnels:")
    for tunnel in tunnels:
        print(f"   → {tunnel['public_url']} -> {tunnel['config']['addr']}")
        print(f"     Protocol: {tunnel['proto']}")
        
except Exception as e:
    print(f"❌ Cannot access ngrok: {e}")

🔍 Checking ngrok tunnel status
📊 Active ngrok tunnels:
   → https://favorably-rhetorical-morris.ngrok-free.dev -> http://localhost:8000
     Protocol: https
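
The tunnel listing above can also be used programmatically, for example to build the public `/analyze` URL without copying it by hand. A minimal sketch, assuming the payload has the same shape the cell above reads (`tunnels`, `public_url`, `proto`, `config.addr`). `find_public_url` is a hypothetical helper and the sample host name is a placeholder.

```python
from typing import Any, Dict, Optional


def find_public_url(tunnels_payload: Dict[str, Any], local_port: int = 8000) -> Optional[str]:
    """Return the HTTPS public URL of the tunnel forwarding to local_port, if any."""
    for tunnel in tunnels_payload.get("tunnels", []):
        addr = str(tunnel.get("config", {}).get("addr", ""))
        # Match on protocol and on the local port the tunnel forwards to.
        if tunnel.get("proto") == "https" and addr.endswith(f":{local_port}"):
            return tunnel.get("public_url")
    return None


# Placeholder payload mirroring the fields printed by the cell above.
sample = {
    "tunnels": [
        {
            "public_url": "https://example.ngrok-free.dev",
            "proto": "https",
            "config": {"addr": "http://localhost:8000"},
        }
    ]
}

public_url = find_public_url(sample)
if public_url:
    print(f"{public_url}/analyze")
```

In the notebook, `sample` would be replaced by `response.json()` from the request to `http://localhost:4040/api/tunnels`.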


# 🎯 Interactive CV Analysis System

## Final Production Interface

Launch the fully interactive system that provides:
- User-friendly command-line interface
- Real-time AI analysis with progress indicators
- Multiple job category selection
- Results export and saving
- Continuous analysis capability
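
The interface's category prompt accepts a menu number, a free-text name, or an empty answer. That rule can be isolated as a pure function so it can be checked without `input()`. This is a minimal sketch; `resolve_category` is a hypothetical helper, not a function defined in the notebook.

```python
from typing import List, Optional


def resolve_category(user_input: str, categories: List[str]) -> Optional[str]:
    """Map raw input to a category.

    Returns "" when no category was given, None when a number is out of
    range, and otherwise the selected or normalized category name.
    """
    text = user_input.strip()
    if not text:
        return ""  # Empty input means "analyze across all categories"

    if text.isdigit():
        index = int(text) - 1  # The menu is 1-based
        return categories[index] if 0 <= index < len(categories) else None

    # Free-text names are upper-cased and hyphenated like the built-in ones.
    return text.upper().replace(" ", "-")
```

A caller would re-prompt when the result is `None` and proceed otherwise.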

In [68]:
# Cell: FULLY INTERACTIVE CV ANALYSIS SYSTEM
print("🎯 FULLY INTERACTIVE CV ANALYSIS SYSTEM")
print("=" * 60)

import requests
import json
import time

def interactive_cv_analysis():
    """
    Interactive command-line interface for the AI-powered CV Analysis System.
    Connects to n8n webhook for processing and provides real-time analysis results.
    """
    # n8n webhook that forwards requests to the analysis API.
    # Note: "webhook-test" URLs respond only while the n8n workflow is listening in test mode.
    N8N_WEBHOOK_URL = "https://mimita22.app.n8n.cloud/webhook-test/cv-analysis"
    
    # Menu of common job categories. These names differ from the API's /categories list;
    # custom category names are also accepted below.
    categories = [
        "INFORMATION-TECHNOLOGY", "FINANCE", "ENGINEERING", "MARKETING",
        "HEALTHCARE", "SALES", "HUMAN-RESOURCES", "DIGITAL-MEDIA",
        "EDUCATION", "OPERATIONS", "DESIGN", "LEGAL"
    ]
    
    # System introduction
    print("🤖 AI-Powered CV Analysis System")
    print("📊 Analyzes hundreds of CVs using RAG + Fine-tuned AI")
    print("\n" + "=" * 60)
    
    # Main interactive loop
    while True:
        print("\n🎯 ENTER YOUR ANALYSIS REQUEST")
        print("-" * 40)
        
        # Get the search query from user
        print("\n🔍 What skills or positions are you looking for?")
        print("   Examples:")
        print("   • Python developers with Django experience")
        print("   • Financial analysts with Excel skills") 
        print("   • Marketing managers with social media experience")
        print("   • Sales professionals with B2B background")
        
        query = input("\n→ Enter your query: ").strip()
        
        # Exit condition
        if query.lower() in ['exit', 'quit', 'stop']:
            print("👋 Thank you for using CV Analysis System!")
            break
            
        # Validate query
        if not query:
            print("❌ Please enter a valid query")
            continue
        
        # Category selection interface
        print("\n📂 SELECT A CATEGORY:")
        print("-" * 30)
        print("Available categories:")
        
        # Display categories in a formatted 3-column layout
        for i in range(0, len(categories), 3):
            row = categories[i:i+3]
            line = ""
            for j, category in enumerate(row):
                line += f"   {i+j+1:2d}. {category:<25}"
            print(line)
        
        print("\n💡 You can also type a custom category name")
        
        # Category input loop with validation
        while True:
            category_input = input("\n→ Enter category number or name: ").strip()
            
            # Allow empty category for broad analysis
            if not category_input:
                print("⚠️  Using no category - AI will analyze broadly")
                category = ""
                break
                
            # Handle numeric category selection
            if category_input.isdigit():
                category_num = int(category_input) - 1
                if 0 <= category_num < len(categories):
                    category = categories[category_num]
                    break
                else:
                    print(f"❌ Please enter a number between 1 and {len(categories)}")
            else:
                # Handle custom category input
                category = category_input.upper().replace(' ', '-')
                print(f"✅ Using custom category: {category}")
                break
        
        # Confirmation before processing
        print(f"\n✅ READY TO ANALYZE:")
        print(f"   🔍 Query: {query}")
        print(f"   📂 Category: {category if category else 'Any Category'}")
        
        confirm = input("\n🚀 Start AI analysis? (y/n): ").lower()
        if confirm not in ['y', 'yes']:
            print("↩️  Let's try again...")
            continue
        
        # Processing phase with visual feedback
        print(f"\n📡 Sending to AI analysis system...")
        print("⏳ Processing through RAG + Fine-tuned AI + Gemini...")
        
        # Short cosmetic animation shown before the request is sent
        for i in range(5):
            dots = "." * (i % 4)
            print(f"🔄 Analyzing CVs{dots:<3}", end="\r")  # Pad so earlier dots are overwritten
            time.sleep(0.5)
        
        try:
            # Send request to n8n webhook with timeout
            start_time = time.time()
            response = requests.post(
                N8N_WEBHOOK_URL,
                json={
                    "action": "analyze",
                    "query": query,
                    "category": category  # Can be empty string
                },
                timeout=60  # 60 second timeout for analysis
            )
            end_time = time.time()
            processing_time = end_time - start_time
            
            print(f"⏱️ Response received in {processing_time:.1f}s")  # Success is checked below via status_code
            print("\n" + "=" * 60)
            print("📊 AI ANALYSIS RESULTS")
            print("=" * 60)
            
            # Process successful response
            if response.status_code == 200:
                result = response.json()
                
                # Handle different response formats from n8n
                analysis_text = None
                
                # Format 1: Array response with analysis data
                if isinstance(result, list) and len(result) > 0:
                    first_item = result[0]
                    if isinstance(first_item, dict):
                        if 'analysis' in first_item:
                            analysis_text = first_item['analysis']
                        elif 'data' in first_item and 'analysis' in first_item['data']:
                            analysis_text = first_item['data']['analysis']
                
                # Format 2: Direct dictionary response
                elif isinstance(result, dict):
                    if 'analysis' in result:
                        analysis_text = result['analysis']
                    elif 'message' in result:
                        print(f"ℹ️  {result['message']}")
                        print("💡 Check n8n.cloud executions for full results")
                        analysis_text = None
                
                # Display analysis results
                if analysis_text:
                    print(analysis_text)
                    
                    # Save results option
                    save = input(f"\n💾 Save this analysis to file? (y/n): ").lower()
                    if save in ['y', 'yes']:
                        timestamp = int(time.time())
                        filename = f"cv_analysis_{timestamp}.txt"
                        with open(filename, 'w', encoding='utf-8') as f:
                            f.write("CV ANALYSIS REPORT\n")
                            f.write("=" * 50 + "\n")
                            f.write(f"Query: {query}\n")
                            f.write(f"Category: {category if category else 'Any'}\n")
                            f.write(f"Time: {time.ctime()}\n")
                            f.write(f"Processing: {processing_time:.1f}s\n")
                            f.write("=" * 50 + "\n\n")
                            f.write(analysis_text)
                        print(f"✅ Analysis saved to: {filename}")
                else:
                    # Fallback: show raw response
                    print("📊 Raw response received:")
                    print(json.dumps(result, indent=2))
                    
            else:
                # Handle HTTP errors
                print(f"❌ Error: {response.status_code}")
                print(f"Response: {response.text}")
                
        except requests.exceptions.Timeout:
            print("❌ Request timeout - Analysis took too long")
        except requests.exceptions.ConnectionError:
            print("❌ Connection error - Check your internet connection")
        except Exception as e:
            print(f"❌ Request failed: {e}")
        
        # Continue or exit prompt
        print("\n" + "=" * 60)
        continue_analysis = input("\n🔄 Analyze another query? (y/n): ").lower()
        if continue_analysis not in ['y', 'yes']:
            print("🎉 Thank you for using AI CV Analysis System!")
            print("🌟 Your production system is ready for real-world use!")
            break

# Start the interactive system
interactive_cv_analysis()

🎯 FULLY INTERACTIVE CV ANALYSIS SYSTEM
🤖 AI-Powered CV Analysis System
📊 Analyzes hundreds of CVs using RAG + Fine-tuned AI


🎯 ENTER YOUR ANALYSIS REQUEST
----------------------------------------

🔍 What skills or positions are you looking for?
   Examples:
   • Python developers with Django experience
   • Financial analysts with Excel skills
   • Marketing managers with social media experience
   • Sales professionals with B2B background

📂 SELECT A CATEGORY:
------------------------------
Available categories:
    1. INFORMATION-TECHNOLOGY       2. FINANCE                      3. ENGINEERING              
    4. MARKETING                    5. HEALTHCARE                   6. SALES                    
    7. HUMAN-RESOURCES              8. DIGITAL-MEDIA                9. EDUCATION                
   10. OPERATIONS                  11. DESIGN                      12. LEGAL                    

💡 You can also type a custom category name

✅ READY TO ANALYZE:
   🔍 Query: Marketing manag
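
The interface handles several response shapes from the webhook (a list of items, a plain dictionary, or a nested `data` object). A minimal sketch of that normalization as a standalone function follows. `extract_analysis` is a hypothetical helper, and unlike the cell above it also checks the nested `data` key for plain dictionary responses.

```python
from typing import Any, Optional


def extract_analysis(result: Any) -> Optional[str]:
    """Pull the analysis text out of a webhook response, or return None."""
    # Format 1: list response; use the first item.
    if isinstance(result, list) and result:
        result = result[0]

    if not isinstance(result, dict):
        return None

    # Format 2: the analysis sits at the top level.
    if "analysis" in result:
        return result["analysis"]

    # Nested variant: the analysis sits under a "data" object.
    data = result.get("data")
    if isinstance(data, dict) and "analysis" in data:
        return data["analysis"]

    return None
```

When the function returns `None`, the caller can fall back to printing the raw response, as the interface does.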

In [65]:
# Cell: Start Flask App Immediately
print("🚀 STARTING FLASK APP...")

from flask import Flask, request, jsonify
from flask_cors import CORS
import requests  # Used for the health check at the end of this cell
import threading

# Create Flask app
app = Flask(__name__)
CORS(app)

# Your existing Flask routes
@app.route('/analyze', methods=['POST'])
def analyze_cv():
    try:
        data = request.get_json(silent=True) or {}  # Tolerate a missing or non-JSON body
        query = data.get('query', '') or data.get('message', '')
        
        print(f"🎯 Query received: '{query}'")
        
        # Get the analysis result
        analysis_result = enhanced_analyzer.analyze_cv(query)
        
        # RETURN PROPER STRUCTURE FOR n8n
        response_data = {
            'analysis': analysis_result,  # This is what n8n expects in $json.analysis
            'category': '',
            'query': query,
            'status': 'success'
        }
        
        print(f"📤 Sending to n8n: {str(response_data['analysis'])[:100]}...")  # str() guards against non-string results
        
        return jsonify(response_data)
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return jsonify({
            'analysis': f"Error: {str(e)}",
            'category': '',
            'query': '',
            'status': 'error'
        }), 500

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({'status': 'healthy', 'service': 'CV Analysis API'})

# Start Flask in background thread
def run_flask():
    app.run(host='0.0.0.0', port=8000, debug=False, use_reloader=False)

print("⏳ Starting Flask server on port 8000...")
flask_thread = threading.Thread(target=run_flask, daemon=True)
flask_thread.start()

# Wait for Flask to start
import time
time.sleep(3)

# Test if Flask is running
print("🔍 Testing Flask connection...")
try:
    response = requests.get('http://localhost:8000/health', timeout=5)
    if response.status_code == 200:
        print("✅ Flask app is now RUNNING!")
        print(f"📊 Status: {response.json()}")
    else:
        print(f"❌ Flask started but returned: {response.status_code}")
except Exception as e:
    print(f"❌ Flask not accessible: {e}")
    print("💡 Make sure no other app is using port 8000")

🚀 STARTING FLASK APP...
⏳ Starting Flask server on port 8000...
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://192.168.1.195:8000
Press CTRL+C to quit


🔍 Testing Flask connection...
✅ Flask app is now RUNNING!
📊 Status: {'service': 'CV Analysis API', 'status': 'healthy'}
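
The cell above waits a fixed three seconds before probing `/health`. An alternative is to poll until the server answers or a deadline passes. A minimal sketch using only the standard library follows; `health_probe` and `wait_until_ready` are hypothetical helpers, and the timeout and interval defaults are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_probe(url: str) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer.
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 10.0,
                     interval: float = 0.2) -> bool:
    """Call probe repeatedly until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

After starting the Flask thread, a call such as `wait_until_ready(lambda: health_probe("http://localhost:8000/health"))` would return as soon as the server responds, rather than always sleeping for the full delay.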


### Key Innovations:

1. **Hybrid AI Approach**: Combines RAG, fine-tuned models, and LLMs
2. **Production-Ready**: Modular, scalable, with comprehensive error handling
3. **Domain-Specific**: Tailored for HR and recruitment use cases
4. **Integration Ready**: REST API, cloud workflows, external access

The system applies RAG, a fine-tuned classifier, and Gemini to CV screening, and exposes the analysis through a Flask REST API, an ngrok tunnel, and an n8n webhook workflow, giving recruiters candidate assessments they can act on.