# Resume Parser with Transformer Models
## NavTech Assignment - AI/ML Engineer Role

This notebook builds a resume parser that extracts text from PDF, DOC, and DOCX files and converts it to structured JSON using one of several providers:
- Google Gemini
- OpenAI GPT
- OpenRouter (Free APIs)
- Local Transformer Models (BERT, RoBERTa)

**Output Format**: Structured JSON matching NavTech requirements

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -q python-dotenv pydantic jsonschema
!pip install -q PyPDF2 pdfplumber python-docx docx2txt
!pip install -q transformers torch spacy nltk
!pip install -q openai google-generativeai requests
!pip install -q pandas numpy regex tqdm colorama

# Download spaCy model
!python -m spacy download en_core_web_sm

print("✅ All dependencies installed successfully!")

In [None]:
# Import required libraries
import os
import json
import logging
from pathlib import Path
from typing import Dict, Any, List, Optional
import warnings
warnings.filterwarnings('ignore')

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("📚 Libraries imported successfully!")

## 2. Configuration and Schema Definition

In [None]:
# Define output schema (matching NavTech requirements)
from pydantic import BaseModel, Field
from typing import List

class Address(BaseModel):
    city: str = Field(default="", description="City name")
    state: str = Field(default="", description="State/Province code")
    country: str = Field(default="", description="Country code")

class Skill(BaseModel):
    skill: str = Field(description="Skill name")

class Education(BaseModel):
    name: str = Field(description="Institution name")
    degree: str = Field(description="Degree/qualification")
    from_date: str = Field(default=" ", description="Start date")
    to_date: str = Field(default=" ", description="End date")

class WorkExperience(BaseModel):
    company: str = Field(description="Company name")
    title: str = Field(description="Job title/position")
    description: str = Field(description="Job description")
    from_date: str = Field(default=" ", description="Start date")
    to_date: str = Field(default=" ", description="End date")

class ResumeData(BaseModel):
    first_name: str = Field(default="", description="First name")
    last_name: str = Field(default="", description="Last name")
    email: str = Field(default="", description="Email address")
    phone: str = Field(default="", description="Phone number")
    address: Address = Field(default_factory=Address, description="Address information")
    summary: str = Field(default="", description="Professional summary")
    skills: List[Skill] = Field(default_factory=list, description="List of skills")
    education_history: List[Education] = Field(default_factory=list, description="Education history")
    work_history: List[WorkExperience] = Field(default_factory=list, description="Work experience")

    def to_dict(self) -> dict:
        return {
            "first_name": self.first_name,
            "last_name": self.last_name,
            "email": self.email,
            "phone": self.phone,
            "address": {
                "city": self.address.city,
                "state": self.address.state,
                "country": self.address.country
            },
            "summary": self.summary,
            "skills": [{"skill": skill.skill} for skill in self.skills],
            "education_history": [
                {
                    "name": edu.name,
                    "degree": edu.degree,
                    "from_date": edu.from_date,
                    "to_date": edu.to_date
                }
                for edu in self.education_history
            ],
            "work_history": [
                {
                    "company": work.company,
                    "title": work.title,
                    "description": work.description,
                    "from_date": work.from_date,
                    "to_date": work.to_date
                }
                for work in self.work_history
            ]
        }

print("📋 Schema defined successfully!")
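
The schema above fixes the top-level shape of every parsed resume. As a minimal sketch, assuming the parsed output is available as a plain dictionary, a structural check needs only the standard library. `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import json

# Top-level keys copied from the ResumeData schema
# (assumption: this tuple is kept in sync with the model by hand)
REQUIRED_KEYS = (
    "first_name", "last_name", "email", "phone", "address",
    "summary", "skills", "education_history", "work_history",
)

def missing_keys(record: dict) -> list:
    """Return the schema keys that are absent from a parsed record."""
    return [key for key in REQUIRED_KEYS if key not in record]

# A deliberately incomplete record, as an LLM might return it
sample = json.loads('{"first_name": "Ada", "email": "ada@example.com", "skills": []}')
print(missing_keys(sample))
```

A check like this could run before the dictionary is handed to the pydantic models, so that incomplete replies are logged rather than silently filled with defaults.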

In [None]:
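# API Keys Configuration
# Keys are read from environment variables first; in Google Colab, any key
# that is still empty is then looked up in Colab secrets.

# Option 1: Environment variables (or paste a key here; not recommended for production)
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")

# Option 2: Google Colab secrets (recommended when running in Colab)
try:
    from google.colab import userdata
except ImportError:
    userdata = None  # Not running in Colab

def _colab_secret(name: str) -> str:
    """Return a Colab secret, or "" if it is missing or access is denied."""
    if userdata is None:
        return ""
    try:
        return userdata.get(name) or ""
    except Exception:
        # One missing secret must not block the lookup of the others
        return ""

GEMINI_API_KEY = GEMINI_API_KEY or _colab_secret('GEMINI_API_KEY')
OPENAI_API_KEY = OPENAI_API_KEY or _colab_secret('OPENAI_API_KEY')
OPENROUTER_API_KEY = OPENROUTER_API_KEY or _colab_secret('OPENROUTER_API_KEY')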

# Extraction prompt template
EXTRACTION_PROMPT_TEMPLATE = """
You are an expert resume parser. Extract the following information from the resume text and return it in the exact JSON format specified.

Resume Text:
{resume_text}

Required JSON Format:
{{
    "first_name": "string",
    "last_name": "string",
    "email": "string", 
    "phone": "string",
    "address": {{
        "city": "string",
        "state": "string",
        "country": "string"
    }},
    "summary": "string",
    "skills": [{{"skill": "string"}}],
    "education_history": [{{
        "name": "string",
        "degree": "string", 
        "from_date": "string",
        "to_date": "string"
    }}],
    "work_history": [{{
        "company": "string",
        "title": "string",
        "description": "string",
        "from_date": "string", 
        "to_date": "string"
    }}]
}}

Instructions:
1. Extract all personal information (name, email, phone, address)
2. Create a professional summary from the resume content
3. List all technical and professional skills
4. Extract education history with institutions, degrees, and dates
5. Extract work experience with companies, titles, descriptions, and dates
6. Use " " (space) for missing dates
7. Return ONLY the JSON object, no additional text
8. Ensure all fields are present even if empty

JSON Response:"""

print("🔧 Configuration completed!")
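
The prompt template doubles every literal brace (`{{` and `}}`) because it is filled in with `str.format`, which treats single braces as placeholders. The sketch below shows the rule on a hypothetical miniature template (`MINI_TEMPLATE` is illustrative, not the notebook's template): doubled braces come out as single braces, and braces inside the substituted resume text are left untouched.

```python
# Hypothetical miniature template; EXTRACTION_PROMPT_TEMPLATE follows the same escaping rule
MINI_TEMPLATE = """Resume Text:
{resume_text}

Required JSON Format:
{{"first_name": "string", "skills": [{{"skill": "string"}}]}}
"""

# str.format does not re-parse the substituted value, so braces in the resume are safe
prompt = MINI_TEMPLATE.format(resume_text="Jane Doe {Python, SQL}")
print(prompt)
```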

## 3. File Processing Classes

In [None]:
# File processors for different formats
import PyPDF2
import pdfplumber
import docx
import docx2txt
from abc import ABC, abstractmethod

class BaseFileProcessor(ABC):
    """Abstract base class for file processors"""
    
    def __init__(self):
        self.supported_extensions = []
    
    @abstractmethod
    def extract_text(self, file_path: str) -> str:
        """Extract text from file"""
        pass
    
    def clean_text(self, text: str) -> str:
        """Basic text cleaning"""
        if not text:
            return ""
        
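        # Remove common extraction artifacts
        text = text.replace("\x00", "")  # Null characters
        text = text.replace("\ufffd", "")  # Replacement characters
        
        # Collapse whitespace within each line but keep line breaks:
        # the heuristic extractors later split the text on "\n"
        lines = (" ".join(line.split()) for line in text.splitlines())
        return "\n".join(line for line in lines if line)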

class PDFProcessor(BaseFileProcessor):
    """PDF file processor"""
    
    def __init__(self):
        super().__init__()
        self.supported_extensions = ['.pdf']
    
    def extract_text(self, file_path: str) -> str:
        """Extract text from PDF"""
        # Try pdfplumber first
        text = self._extract_with_pdfplumber(file_path)
        
        # Fallback to PyPDF2
        if not text or len(text.strip()) < 50:
            text = self._extract_with_pypdf2(file_path)
        
        return self.clean_text(text)
    
    def _extract_with_pdfplumber(self, file_path: str) -> str:
        try:
            text_parts = []
            with pdfplumber.open(file_path) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text_parts.append(page_text)
            return "\n".join(text_parts)
        except Exception as e:
            logger.error(f"pdfplumber extraction failed: {e}")
            return ""
    
    def _extract_with_pypdf2(self, file_path: str) -> str:
        try:
            text_parts = []
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                for page in pdf_reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text_parts.append(page_text)
            return "\n".join(text_parts)
        except Exception as e:
            logger.error(f"PyPDF2 extraction failed: {e}")
            return ""

class DOCXProcessor(BaseFileProcessor):
    """DOCX/DOC file processor"""
    
    def __init__(self):
        super().__init__()
        self.supported_extensions = ['.docx', '.doc']
    
    def extract_text(self, file_path: str) -> str:
        """Extract text from Word document"""
        file_ext = Path(file_path).suffix.lower()
        
        if file_ext == '.docx':
            text = self._extract_from_docx(file_path)
        else:
            # Legacy .doc files: docx2txt only reads the DOCX (zip) container,
            # so this is a best-effort attempt that returns "" on failure
            text = self._extract_with_docx2txt(file_path)
        
        return self.clean_text(text)
    
    def _extract_from_docx(self, file_path: str) -> str:
        try:
            text_parts = []
            doc = docx.Document(file_path)
            
            # Extract paragraphs
            for paragraph in doc.paragraphs:
                if paragraph.text.strip():
                    text_parts.append(paragraph.text)
            
            # Extract tables
            for table in doc.tables:
                for row in table.rows:
                    row_text = []
                    for cell in row.cells:
                        if cell.text.strip():
                            row_text.append(cell.text.strip())
                    if row_text:
                        text_parts.append(" | ".join(row_text))
            
            return "\n".join(text_parts)
        except Exception as e:
            logger.error(f"DOCX extraction failed: {e}")
            return self._extract_with_docx2txt(file_path)
    
    def _extract_with_docx2txt(self, file_path: str) -> str:
        try:
            return docx2txt.process(file_path)
        except Exception as e:
            logger.error(f"docx2txt extraction failed: {e}")
            return ""

class FileProcessorFactory:
    """Factory to create appropriate file processor"""
    
    @staticmethod
    def create_processor(file_path: str) -> BaseFileProcessor:
        file_ext = Path(file_path).suffix.lower()
        
        if file_ext == '.pdf':
            return PDFProcessor()
        elif file_ext in ['.docx', '.doc']:
            return DOCXProcessor()
        else:
            raise ValueError(f"Unsupported file format: {file_ext}")

print("📄 File processors defined successfully!")
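
The factory dispatches on the lower-cased file suffix. Here is a standard-library sketch of that lookup, using a hypothetical `processor_name_for` helper that returns class names as strings so it runs without the PDF and DOCX libraries installed:

```python
from pathlib import Path

# Suffix -> processor class name, mirroring FileProcessorFactory
PROCESSORS = {
    ".pdf": "PDFProcessor",
    ".docx": "DOCXProcessor",
    ".doc": "DOCXProcessor",
}

def processor_name_for(file_path: str) -> str:
    """Look up the processor for a path; the suffix comparison is case-insensitive."""
    suffix = Path(file_path).suffix.lower()
    try:
        return PROCESSORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix}") from None

print(processor_name_for("CV_Final.PDF"))
```

With a dictionary in place of the `if`/`elif` chain, supporting another format would be a one-line change.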

## 4. LLM Providers

In [None]:
# Base LLM Provider
import re

class BaseLLMProvider(ABC):
    """Abstract base class for LLM providers"""
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
    
    @abstractmethod
    def extract_resume_data(self, resume_text: str) -> ResumeData:
        """Extract structured data from resume text"""
        pass
    
    @abstractmethod
    def is_available(self) -> bool:
        """Check if the LLM provider is available"""
        pass
    
    def _clean_json_response(self, response_text: str) -> str:
        """Clean and extract JSON from LLM response"""
        response_text = response_text.strip()
        
        # Remove markdown code blocks
        if response_text.startswith("```json"):
            response_text = response_text[7:]
        elif response_text.startswith("```"):
            response_text = response_text[3:]
        
        if response_text.endswith("```"):
            response_text = response_text[:-3]
        
        # Find JSON object boundaries
        start_idx = response_text.find("{")
        end_idx = response_text.rfind("}") + 1
        
        if start_idx != -1 and end_idx > start_idx:
            response_text = response_text[start_idx:end_idx]
        
        return response_text.strip()
    
    def _parse_json_response(self, response_text: str) -> Dict[str, Any]:
        """Parse JSON response from LLM"""
        try:
            cleaned_text = self._clean_json_response(response_text)
            return json.loads(cleaned_text)
        except json.JSONDecodeError as e:
            logger.error(f"Failed to parse JSON response: {e}")
            raise ValueError(f"Invalid JSON response from LLM: {e}")
    
    def _create_resume_data_from_dict(self, data_dict: Dict[str, Any]) -> ResumeData:
        """Create ResumeData object from dictionary"""
        try:
            resume_data = ResumeData(
                first_name=data_dict.get("first_name", ""),
                last_name=data_dict.get("last_name", ""),
                email=data_dict.get("email", ""),
                phone=data_dict.get("phone", ""),
                summary=data_dict.get("summary", "")
            )
            
            # Handle address
            address_data = data_dict.get("address", {})
            if isinstance(address_data, dict):
                resume_data.address = Address(
                    city=address_data.get("city", ""),
                    state=address_data.get("state", ""),
                    country=address_data.get("country", "")
                )
            
            # Handle skills
            skills_data = data_dict.get("skills", [])
            if isinstance(skills_data, list):
                resume_data.skills = [
                    Skill(skill=skill.get("skill", "") if isinstance(skill, dict) else str(skill))
                    for skill in skills_data
                ]
            
            # Handle education
            education_data = data_dict.get("education_history", [])
            if isinstance(education_data, list):
                resume_data.education_history = [
                    Education(
                        name=edu.get("name", ""),
                        degree=edu.get("degree", ""),
                        from_date=edu.get("from_date", " "),
                        to_date=edu.get("to_date", " ")
                    )
                    for edu in education_data if isinstance(edu, dict)
                ]
            
            # Handle work experience
            work_data = data_dict.get("work_history", [])
            if isinstance(work_data, list):
                resume_data.work_history = [
                    WorkExperience(
                        company=work.get("company", ""),
                        title=work.get("title", ""),
                        description=work.get("description", ""),
                        from_date=work.get("from_date", " "),
                        to_date=work.get("to_date", " ")
                    )
                    for work in work_data if isinstance(work, dict)
                ]
            
            return resume_data
        
        except Exception as e:
            logger.error(f"Failed to create ResumeData from dict: {e}")
            raise ValueError(f"Failed to process extracted data: {e}")
    
    def _get_fallback_data(self, resume_text: str) -> ResumeData:
        """Generate fallback data when LLM extraction fails"""
        logger.warning("Using fallback data extraction")
        
        # Basic regex-based extraction
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        emails = re.findall(email_pattern, resume_text)
        email = emails[0] if emails else ""
        
        phone_pattern = r'[\+]?[1-9]?[0-9]{7,15}'
        phones = re.findall(phone_pattern, resume_text)
        phone = phones[0] if phones else ""
        
        return ResumeData(
            first_name="",
            last_name="",
            email=email,
            phone=phone,
            summary="Resume parsing failed. Manual review required."
        )

print("🤖 Base LLM provider defined successfully!")
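
LLM replies often wrap the JSON in a markdown fence or add a sentence before it. The sketch below is a simplified, standalone variant of the brace-slicing step in `_clean_json_response`, not the notebook's method itself. `extract_json_object` is a hypothetical name, and the reply string is made up for illustration.

```python
import json

def extract_json_object(response_text: str) -> dict:
    """Parse the span from the first '{' to the last '}' of an LLM reply."""
    start = response_text.find("{")
    end = response_text.rfind("}") + 1
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in response")
    return json.loads(response_text[start:end])

# Made-up reply with leading prose and a markdown fence around the JSON
reply = 'Here is the result:\n```json\n{"first_name": "Jane", "skills": [{"skill": "SQL"}]}\n```'
data = extract_json_object(reply)
print(data["first_name"], data["skills"])
```

Slicing between the outermost braces makes the separate fence-stripping step unnecessary for replies like this one. It would still fail if the model emitted text containing braces before or after the object.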

In [None]:
# Google Gemini Provider
try:
    import google.generativeai as genai
    
    class GeminiLLMProvider(BaseLLMProvider):
        """Google Gemini LLM provider"""
        
        def __init__(self, api_key: str = None):
            super().__init__({"model": "gemini-1.5-pro", "temperature": 0.1})
            self.api_key = api_key or GEMINI_API_KEY
            self.model = None
            self._initialize_model()
        
        def _initialize_model(self):
            if not self.api_key:
                logger.error("Gemini API key not found")
                return
            
            try:
                genai.configure(api_key=self.api_key)
                self.model = genai.GenerativeModel(model_name="gemini-1.5-pro")
                logger.info("Gemini model initialized successfully")
            except Exception as e:
                logger.error(f"Failed to initialize Gemini model: {e}")
                self.model = None
        
        def is_available(self) -> bool:
            return self.model is not None and bool(self.api_key)
        
        def extract_resume_data(self, resume_text: str) -> ResumeData:
            if not self.is_available():
                logger.error("Gemini model not available")
                return self._get_fallback_data(resume_text)
            
            try:
                prompt = EXTRACTION_PROMPT_TEMPLATE.format(resume_text=resume_text)
                
                response = self.model.generate_content(
                    prompt,
                    generation_config=genai.types.GenerationConfig(
                        temperature=0.1,
                        max_output_tokens=4000
                    )
                )
                
                response_text = response.text
                data_dict = self._parse_json_response(response_text)
                resume_data = self._create_resume_data_from_dict(data_dict)
                
                logger.info("Successfully extracted resume data using Gemini")
                return resume_data
            
            except Exception as e:
                logger.error(f"Gemini extraction failed: {e}")
                return self._get_fallback_data(resume_text)
    
    print("✅ Gemini provider defined successfully!")
    
except ImportError:
    print("⚠️ Google Generative AI not available")
    
    class GeminiLLMProvider(BaseLLMProvider):
        def __init__(self, api_key: str = None):
            super().__init__({})
        
        def is_available(self) -> bool:
            return False
        
        def extract_resume_data(self, resume_text: str) -> ResumeData:
            return self._get_fallback_data(resume_text)

In [None]:
# OpenAI Provider
try:
    import openai
    
    class OpenAILLMProvider(BaseLLMProvider):
        """OpenAI GPT LLM provider"""
        
        def __init__(self, api_key: str = None):
            super().__init__({"model": "gpt-3.5-turbo", "temperature": 0.1})
            self.api_key = api_key or OPENAI_API_KEY
            self.client = None
            self._initialize_client()
        
        def _initialize_client(self):
            if not self.api_key:
                logger.error("OpenAI API key not found")
                return
            
            try:
                self.client = openai.OpenAI(api_key=self.api_key)
                logger.info("OpenAI client initialized successfully")
            except Exception as e:
                logger.error(f"Failed to initialize OpenAI client: {e}")
                self.client = None
        
        def is_available(self) -> bool:
            return self.client is not None and bool(self.api_key)
        
        def extract_resume_data(self, resume_text: str) -> ResumeData:
            if not self.is_available():
                logger.error("OpenAI client not available")
                return self._get_fallback_data(resume_text)
            
            try:
                prompt = EXTRACTION_PROMPT_TEMPLATE.format(resume_text=resume_text)
                
                response = self.client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=[
                        {"role": "system", "content": "You are an expert resume parser. Extract information and return only valid JSON."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=0.1,
                    max_tokens=4000
                )
                
                response_text = response.choices[0].message.content
                data_dict = self._parse_json_response(response_text)
                resume_data = self._create_resume_data_from_dict(data_dict)
                
                logger.info("Successfully extracted resume data using OpenAI")
                return resume_data
            
            except Exception as e:
                logger.error(f"OpenAI extraction failed: {e}")
                return self._get_fallback_data(resume_text)
    
    print("✅ OpenAI provider defined successfully!")
    
except ImportError:
    print("⚠️ OpenAI not available")
    
    class OpenAILLMProvider(BaseLLMProvider):
        def __init__(self, api_key: str = None):
            super().__init__({})
        
        def is_available(self) -> bool:
            return False
        
        def extract_resume_data(self, resume_text: str) -> ResumeData:
            return self._get_fallback_data(resume_text)

In [None]:
# Local Transformer Provider
try:
    from transformers import pipeline
    import torch
    
    class TransformerLLMProvider(BaseLLMProvider):
        """Local Transformer models provider"""
        
        def __init__(self):
            super().__init__({})
            self.ner_pipeline = None
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            self._initialize_models()
        
        def _initialize_models(self):
            try:
                self.ner_pipeline = pipeline(
                    "ner",
                    model="dbmdz/bert-large-cased-finetuned-conll03-english",
                    aggregation_strategy="simple",
                    device=0 if self.device == "cuda" else -1
                )
                logger.info(f"Transformer models initialized on {self.device}")
            except Exception as e:
                logger.error(f"Failed to initialize transformer models: {e}")
                self.ner_pipeline = None
        
        def is_available(self) -> bool:
            return self.ner_pipeline is not None
        
        def extract_resume_data(self, resume_text: str) -> ResumeData:
            if not self.is_available():
                logger.error("Transformer models not available")
                return self._get_fallback_data(resume_text)
            
            try:
                # Extract entities using NER
                entities = self._extract_entities(resume_text)
                
                # Extract structured information
                resume_data = ResumeData()
                
                # Extract personal information
                resume_data.first_name, resume_data.last_name = self._extract_name(resume_text, entities)
                resume_data.email = self._extract_email(resume_text)
                resume_data.phone = self._extract_phone(resume_text)
                resume_data.address = self._extract_address(resume_text, entities)
                
                # Extract professional information
                resume_data.summary = self._extract_summary(resume_text)
                resume_data.skills = self._extract_skills(resume_text)
                resume_data.education_history = self._extract_education(resume_text, entities)
                resume_data.work_history = self._extract_work_experience(resume_text, entities)
                
                logger.info("Successfully extracted resume data using transformers")
                return resume_data
            
            except Exception as e:
                logger.error(f"Transformer extraction failed: {e}")
                return self._get_fallback_data(resume_text)
        
        def _extract_entities(self, text: str) -> List[Dict]:
            try:
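                # Split text into fixed-size chunks so each NER call stays short.
                # Note: max_length counts characters, not tokens, and an entity
                # that straddles a chunk boundary may be cut in two.
                max_length = 512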
                chunks = [text[i:i+max_length] for i in range(0, len(text), max_length)]
                
                all_entities = []
                for chunk in chunks:
                    entities = self.ner_pipeline(chunk)
                    all_entities.extend(entities)
                
                return all_entities
            except Exception as e:
                logger.error(f"Entity extraction failed: {e}")
                return []
        
        def _extract_name(self, text: str, entities: List[Dict]) -> tuple:
            # Look for PERSON entities
            person_entities = [e for e in entities if e.get("entity_group") == "PER"]
            
            if person_entities:
                full_name = person_entities[0]["word"].strip()
                name_parts = full_name.split()
                
                if len(name_parts) >= 2:
                    return name_parts[0], " ".join(name_parts[1:])
                elif len(name_parts) == 1:
                    return name_parts[0], ""
            
            return "", ""
        
        def _extract_email(self, text: str) -> str:
            email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
            emails = re.findall(email_pattern, text)
            return emails[0] if emails else ""
        
        def _extract_phone(self, text: str) -> str:
            patterns = [
                r'\+\d{1,3}[-\.\s]?\d{3,4}[-\.\s]?\d{3,4}[-\.\s]?\d{3,4}',
                r'\(\d{3}\)[-\.\s]?\d{3}[-\.\s]?\d{4}',
                r'\d{3}[-\.\s]?\d{3}[-\.\s]?\d{4}',
                r'\+\d{10,15}'
            ]
            
            for pattern in patterns:
                matches = re.findall(pattern, text)
                if matches:
                    return matches[0]
            
            return ""
        
        def _extract_address(self, text: str, entities: List[Dict]) -> Address:
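            # CoNLL-03 models emit PER, ORG, LOC, and MISC; "GPE" is only
            # relevant if the NER model is swapped for one with that label
            locations = [e for e in entities if e.get("entity_group") in ["LOC", "GPE"]]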
            
            city, state, country = "", "", ""
            
            if locations:
                if len(locations) >= 1:
                    city = locations[-1]["word"]
                if len(locations) >= 2:
                    state = locations[-2]["word"]
                if len(locations) >= 3:
                    country = locations[-3]["word"]
            
            return Address(city=city, state=state, country=country)
        
        def _extract_summary(self, text: str) -> str:
            summary_keywords = ["summary", "profile", "objective", "about"]
            lines = text.split('\n')
            
            for i, line in enumerate(lines):
                if any(keyword in line.lower() for keyword in summary_keywords):
                    summary_lines = []
                    for j in range(i+1, min(i+5, len(lines))):
                        if lines[j].strip() and not lines[j].isupper():
                            summary_lines.append(lines[j].strip())
                        else:
                            break
                    
                    if summary_lines:
                        return " ".join(summary_lines)
            
            # Fallback: take first paragraph
            for line in lines[:10]:
                if len(line.split()) > 10 and not line.isupper():
                    return line.strip()
            
            return "Professional summary not found in resume."
        
        def _extract_skills(self, text: str) -> List[Skill]:
            skill_keywords = [
                "python", "java", "javascript", "typescript", "react", "angular", "vue",
                "node.js", "express", "django", "flask", "spring", "html", "css", "sql",
                "mongodb", "postgresql", "mysql", "aws", "azure", "docker", "kubernetes",
                "git", "linux", "windows", "machine learning", "ai", "data science"
            ]
            
            found_skills = []
            text_lower = text.lower()
            
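            for skill in skill_keywords:
                # Match whole keywords only, so "ai" does not match "email"
                # and "java" does not match "javascript"
                pattern = r'(?<![a-z0-9])' + re.escape(skill) + r'(?![a-z0-9])'
                if re.search(pattern, text_lower):
                    found_skills.append(Skill(skill=skill))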
            
            return found_skills[:20]  # Limit to 20 skills
        
        def _extract_education(self, text: str, entities: List[Dict]) -> List[Education]:
            # Simple education extraction
            education_keywords = ["education", "university", "college", "degree"]
            lines = text.split('\n')
            
            education_list = []
            for line in lines:
                if any(keyword in line.lower() for keyword in education_keywords):
                    if len(line.split()) > 2:  # Skip very short lines such as bare section headings
                        education_list.append(Education(
                            name=line.strip(),
                            degree="",
                            from_date=" ",
                            to_date=" "
                        ))
            
            return education_list[:3]  # Limit to 3 entries
        
        def _extract_work_experience(self, text: str, entities: List[Dict]) -> List[WorkExperience]:
            # Simple work experience extraction
            work_keywords = ["experience", "work", "company", "position"]
            lines = text.split('\n')
            
            work_list = []
            for line in lines:
                if any(keyword in line.lower() for keyword in work_keywords):
                    if len(line.split()) > 3:  # Skip very short lines such as bare section headings
                        work_list.append(WorkExperience(
                            company=line.strip(),
                            title="",
                            description="",
                            from_date=" ",
                            to_date=" "
                        ))
            
            return work_list[:3]  # Limit to 3 entries
    
    print("✅ Transformer provider defined successfully!")
    
except ImportError:
    print("⚠️ Transformers not available")
    
    class TransformerLLMProvider(BaseLLMProvider):
        def __init__(self):
            super().__init__({})
        
        def is_available(self) -> bool:
            return False
        
        def extract_resume_data(self, resume_text: str) -> ResumeData:
            return self._get_fallback_data(resume_text)
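
Fixed-width character slicing can cut a name or company in half at a chunk boundary, which hides it from the NER model. One possible alternative, sketched here with a hypothetical `chunk_on_whitespace` helper, packs whole words into each chunk. It assumes line breaks do not matter for entity recognition, since `str.split()` discards them.

```python
from typing import List

def chunk_on_whitespace(text: str, max_chars: int = 512) -> List[str]:
    """Greedily pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

# Small max_chars so the boundaries are easy to see
chunks = chunk_on_whitespace("John Smith worked at Acme Corp in New York", max_chars=16)
print(chunks)
```

Entities that span several words can still be separated at a boundary (here "Acme" and "Corp"); overlapping chunks would be one way to reduce that, at the cost of deduplicating entities afterwards.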

## 5. Main Resume Parser Class

In [None]:
# Main Resume Parser
class ResumeParser:
    """Main resume parser class"""
    
    def __init__(self):
        self.providers = self._initialize_providers()
    
    def _initialize_providers(self) -> dict:
        """Initialize all LLM providers"""
        providers = {}
        
        try:
            providers['gemini'] = GeminiLLMProvider()
            print(f"Gemini provider: {'✅ Available' if providers['gemini'].is_available() else '❌ Not available'}")
        except Exception as e:
            print(f"❌ Failed to initialize Gemini provider: {e}")
        
        try:
            providers['openai'] = OpenAILLMProvider()
            print(f"OpenAI provider: {'✅ Available' if providers['openai'].is_available() else '❌ Not available'}")
        except Exception as e:
            print(f"❌ Failed to initialize OpenAI provider: {e}")
        
        try:
            providers['transformer'] = TransformerLLMProvider()
            print(f"Transformer provider: {'✅ Available' if providers['transformer'].is_available() else '❌ Not available'}")
        except Exception as e:
            print(f"❌ Failed to initialize Transformer provider: {e}")
        
        return providers
    
    def parse_resume(self, file_path: str, llm_provider: str = "gemini") -> ResumeData:
        """Parse resume file and extract structured data"""
        print(f"🔍 Starting resume parsing: {file_path} with {llm_provider}")
        
        # Extract text from file
        try:
            processor = FileProcessorFactory.create_processor(file_path)
            resume_text = processor.extract_text(file_path)
            print(f"📄 Extracted {len(resume_text)} characters from resume")
        except Exception as e:
            print(f"❌ Failed to extract text from file: {e}")
            raise
        
        # Get LLM provider
        if llm_provider not in self.providers:
            raise ValueError(f"Unknown LLM provider: {llm_provider}")
        
        provider = self.providers[llm_provider]
        if not provider.is_available():
            print(f"⚠️ {llm_provider} provider not available, falling back to transformer")
            provider = self.providers.get('transformer')
            if not provider or not provider.is_available():
                raise ValueError("No LLM providers available")
        
        # Extract structured data
        try:
            resume_data = provider.extract_resume_data(resume_text)
            print("✅ Successfully extracted resume data")
            return resume_data
        except Exception as e:
            print(f"❌ Failed to extract resume data: {e}")
            raise
    
    def get_available_providers(self) -> list:
        """Get list of available LLM providers"""
        available = []
        for name, provider in self.providers.items():
            if provider and provider.is_available():
                available.append(name)
        return available

# Initialize the parser
print("🚀 Initializing Resume Parser...")
resume_parser = ResumeParser()
print(f"\n📋 Available providers: {resume_parser.get_available_providers()}")

## 6. Demo and Usage

In [None]:
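# File Upload Demo (requires Google Colab)
try:
    from google.colab import files
except ImportError:
    files = None  # Not running in Colab; file upload is unavailable

def upload_and_parse_resume(llm_provider="gemini"):
    """Upload and parse a resume file"""
    if files is None:
        print("❌ File upload requires Google Colab; call resume_parser.parse_resume(path) instead")
        return None
    
    print("📁 Please upload your resume file (PDF, DOC, or DOCX)")
    
    # Upload file
    uploaded = files.upload()
    
    if not uploaded:
        print("❌ No file uploaded")
        return None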
    
    # Use the first uploaded file
    filename = next(iter(uploaded))
    print(f"📄 Processing file: {filename}")
    
    try:
        # Parse the resume
        resume_data = resume_parser.parse_resume(filename, llm_provider)
        
        # Convert to JSON and display
        output_json = resume_data.to_dict()
        
        print("\n📋 Extracted Resume Data:")
        print("=" * 50)
        print(json.dumps(output_json, indent=2, ensure_ascii=False))
        
        return output_json
    
    except Exception as e:
        print(f"❌ Error processing resume: {e}")
        return None
    
    finally:
        # Clean up uploaded file
        try:
            os.remove(filename)
        except OSError:
            pass  # File already gone or not removable; nothing to clean up

print("📤 Upload function ready! Call upload_and_parse_resume() to start.")

In [None]:
# Example with sample resume text
sample_resume_text = """
John Doe
Software Engineer
Email: john.doe@email.com
Phone: +1-555-123-4567
Address: San Francisco, CA, USA

SUMMARY
Experienced software engineer with 5+ years in full-stack development. 
Proficient in Python, JavaScript, and cloud technologies.

SKILLS
• Python, JavaScript, TypeScript
• React, Node.js, Django
• AWS, Docker, Kubernetes
• SQL, MongoDB

EDUCATION
Bachelor of Science in Computer Science
Stanford University
2015 - 2019

WORK EXPERIENCE
Senior Software Engineer
Tech Corp Inc.
2021 - Present
Led development of microservices architecture and improved system performance by 40%.
"""

def parse_resume_text(resume_text: str, llm_provider="gemini"):
    """Parse resume from text input"""
    if not resume_text.strip():
        print("❌ Please provide resume text")
        return None
    
    try:
        # Get the provider
        if llm_provider not in resume_parser.providers:
            print(f"❌ Unknown LLM provider: {llm_provider}")
            return None
        
        provider = resume_parser.providers[llm_provider]
        if not provider.is_available():
            print(f"⚠️ {llm_provider} provider not available, falling back to transformer")
            provider = resume_parser.providers.get('transformer')
            if not provider or not provider.is_available():
                print("❌ No LLM providers available")
                return None
            llm_provider = 'transformer'  # Report the provider actually used
        
        print(f"🔍 Processing resume text with {llm_provider}...")
        
        # Extract structured data
        resume_data = provider.extract_resume_data(resume_text)
        
        # Convert to JSON and display
        output_json = resume_data.to_dict()
        
        print("\n📋 Extracted Resume Data:")
        print("=" * 50)
        print(json.dumps(output_json, indent=2, ensure_ascii=False))
        
        return output_json
    
    except Exception as e:
        print(f"❌ Error processing resume: {e}")
        return None

print("🧪 Running example with sample resume...")
available_providers = resume_parser.get_available_providers()
if available_providers:
    provider_to_use = available_providers[0]
    print(f"\n🤖 Using provider: {provider_to_use}")
    result = parse_resume_text(sample_resume_text, provider_to_use)
else:
    print("❌ No providers available for demo")

## 7. Instructions for Use

### Option 1: Upload File
```python
# Upload and parse a resume file
result = upload_and_parse_resume(llm_provider="gemini")
```

### Option 2: Text Input
```python
# Parse resume from text
resume_text = "Your resume text here..."
result = parse_resume_text(resume_text, llm_provider="gemini")
```

### Available LLM Providers:
- **gemini**: Google Gemini (requires API key)
- **openai**: OpenAI GPT (requires API key)
- **transformer**: Local transformer models such as BERT (no API key needed); also used as the fallback when the requested provider is unavailable
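
The fallback rule used by `parse_resume` (use the requested provider if it is available, otherwise the local transformer) can be sketched in isolation. `StubProvider` and `pick_provider` are hypothetical names introduced only for this illustration; they are not part of the notebook, and the stub exposes nothing beyond `is_available()`.

```python
from typing import Dict, Tuple


class StubProvider:
    """Hypothetical stand-in exposing only is_available(), like the notebook's providers."""

    def __init__(self, available: bool):
        self._available = available

    def is_available(self) -> bool:
        return self._available


def pick_provider(providers: Dict[str, StubProvider], requested: str,
                  fallback: str = "transformer") -> Tuple[str, StubProvider]:
    """Return (name, provider), preferring the requested provider over the fallback."""
    if requested not in providers:
        raise ValueError(f"Unknown LLM provider: {requested}")
    if providers[requested].is_available():
        return requested, providers[requested]
    candidate = providers.get(fallback)
    if candidate is None or not candidate.is_available():
        raise ValueError("No LLM providers available")
    return fallback, candidate


providers = {
    "gemini": StubProvider(available=False),      # e.g. no API key configured
    "transformer": StubProvider(available=True),  # local model loaded
}
name, _ = pick_provider(providers, "gemini")
print(name)
```

Returning the provider name together with the provider lets the caller log which provider actually handled the request.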

### Setting API Keys:
1. **Google Colab Secrets** (Recommended):
   - Click the key icon in the left sidebar
   - Add the secrets `GEMINI_API_KEY` and `OPENAI_API_KEY`, and enable notebook access for each

2. **Direct Assignment**:
   ```python
   GEMINI_API_KEY = "your-api-key-here"
   OPENAI_API_KEY = "your-api-key-here"
   ```
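
As a minimal sketch, the two options can be combined into one lookup that tries Colab secrets first and then environment variables. `load_api_key` is a hypothetical helper, not part of the notebook, and how the provider classes consume these variables is not shown in this section, so treat the wiring as an assumption.

```python
import os
from typing import Optional


def load_api_key(name: str) -> Optional[str]:
    """Look up an API key in Colab secrets first, then in environment variables."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        pass
    return os.environ.get(name) or None


GEMINI_API_KEY = load_api_key("GEMINI_API_KEY")
OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")
print("Gemini key set:", GEMINI_API_KEY is not None)
print("OpenAI key set:", OPENAI_API_KEY is not None)
```

Printing only whether a key is set keeps the key itself out of the notebook output.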

### Output Format:
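`ResumeParser.parse_resume` returns a `ResumeData` object. The helpers `upload_and_parse_resume` and `parse_resume_text` convert it with `to_dict()`, print it as JSON, and return the dictionary. The output follows the NavTech requirements and covers: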
- Personal information (name, email, phone, address)
- Professional summary
- Skills list
- Education history
- Work experience
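
A quick sanity check on the returned dictionary might look like the sketch below. The top-level key names in `EXPECTED_KEYS` are assumptions made for illustration; replace them with the keys that `ResumeData.to_dict()` actually emits. `missing_keys` is a hypothetical helper, not part of the notebook.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical key names; adjust to match the real output of ResumeData.to_dict().
EXPECTED_KEYS = ("name", "email", "phone", "address", "summary",
                 "skills", "education", "work_experience")


def missing_keys(output: Dict[str, Any],
                 expected: Iterable[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected top-level keys that are absent from the parsed output."""
    return [key for key in expected if key not in output]


# Deliberately incomplete example output, to show what the check reports.
sample_output = {
    "name": "John Doe",
    "email": "john.doe@email.com",
    "skills": [{"skill": "Python"}],
}
print(missing_keys(sample_output))
```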

---

**NavTech Assignment Completed** ✅

This notebook demonstrates a resume parser that extracts text from PDF, DOC, and DOCX files and produces structured JSON through Gemini, OpenAI, or a local transformer model.