Resume Parser - AI-Driven Recruitment System

An intelligent resume parsing and candidate scoring system that extracts structured data from PDF/DOCX resumes, scores candidates against configurable criteria, and presents results through an interactive dashboard.

Screenshots

Dashboard views of a verified candidate and a rejected candidate.

How It Works

  1. Select a job category from the sidebar (Data Science, Data Engineering, or Data Visualization) - or upload your own custom data dictionary (.xlsx)
  2. Adjust scoring weights using the sidebar sliders - control how much each dimension (skills, degree, experience, certifications) contributes to the overall score
  3. Upload one or more resumes (PDF or DOCX) - they are parsed concurrently with a progress bar
  4. View scored results in the Table View tab - candidates are ranked by overall score, with "Verified" or "Rejected" status based on exclusion lists
  5. Explore breakdowns across sub-tabs: Overall Score, Skill Score, Experience Score, Degree Score, Certification Score
  6. Compare top candidates side-by-side using the Radar Chart tab - a polar overlay showing each candidate's strengths across all four dimensions
  7. Save to database for future reference, or download as CSV for offline analysis
  8. Review past results in the "View Existing Resumes" tab, filtered by job category

Features

  • Multi-format support - parse PDF and DOCX resumes
  • Automated extraction - contact info, education, skills, experience, certifications
  • Configurable scoring - weighted scoring across skill match, degree level, experience, and certifications
  • Data dictionaries - customizable skill taxonomies per job category (Data Science, Data Engineering, Data Visualization)
  • Candidate comparison - radar chart overlay comparing top candidates across all dimensions
  • Batch processing - concurrent parsing of multiple resumes with progress tracking
  • Persistent storage - SQLite backend for saving and reviewing past results
  • CSV export - download scored results for offline analysis
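The batch-processing feature above can be sketched with `concurrent.futures` from the standard library. This is a minimal illustration, not the project's actual implementation: `parse_resume` is a placeholder for the real parser, and the Streamlit app would drive a `st.progress` bar instead of printing.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_resume(path: str) -> dict:
    """Placeholder for the real parser: extract text and fields from one file."""
    return {"file": path, "status": "parsed"}

def parse_batch(paths: list[str], max_workers: int = 4) -> list[dict]:
    """Parse many resumes concurrently, reporting progress as each one finishes."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(parse_resume, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            results.append(future.result())
            print(f"{done}/{len(paths)} parsed")  # the UI would update a progress bar here
    return results
```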

Architecture

```mermaid
flowchart LR
    A[PDF / DOCX Upload] --> B[Text Extraction]
    B --> C[Field Parsing]
    C --> D[Skill Matching]
    C --> E[Degree Detection]
    C --> F[Experience Extraction]
    C --> G[Certification Count]
    D & E & F & G --> H[Weighted Scoring]
    H --> I[Dashboard + Charts]
    H --> J[SQLite Storage]
    H --> K[CSV Export]
```
```mermaid
sequenceDiagram
    participant U as User
    participant S as Streamlit UI
    participant P as Parser
    participant SC as Scorer
    participant DB as SQLite

    U->>S: Upload resumes + select category
    S->>P: Parse each file (concurrent)
    P->>P: Extract text (PDF/DOCX)
    P->>P: Extract email, phone, degrees, skills, experience
    P-->>S: ResumeData objects
    S->>SC: Build results DataFrame
    SC->>SC: Rank-based scoring with weights
    SC-->>S: Scored DataFrame
    S->>U: Display table, charts, radar comparison
    U->>S: Click Save
    S->>DB: Store results (parameterized queries)
```

Scoring

Each candidate is scored across four dimensions, normalized by rank within the batch:

Overall Score = (Skill Score x W1) + (Degree Score x W2) + (Experience Score x W3) + (Certification Score x W4)
| Dimension | What it measures | How |
| --- | --- | --- |
| Skill Score | Match against data dictionary skill taxonomy | NLP n-gram tokenization + fuzzy matching |
| Degree Score | Education level | Fuzzy match against degree classification lists |
| Experience Score | Years of experience | Regex patterns for "X years", date ranges |
| Certification Score | Number of certifications | Keyword frequency count |
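The Experience Score row relies on regex patterns. A rough sketch of the "X years" case (the pattern below is illustrative; the project's actual patterns live in `src/config.py` and also handle date ranges):

```python
import re

# Illustrative pattern: matches phrases like "5 years", "10+ yrs of experience".
YEARS_PATTERN = re.compile(r"(\d+)\+?\s*(?:years?|yrs?)", re.IGNORECASE)

def extract_years(text: str) -> int:
    """Return the largest 'X years' figure found in the resume text, or 0."""
    matches = [int(m) for m in YEARS_PATTERN.findall(text)]
    return max(matches, default=0)
```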

Weights are adjustable via sidebar sliders (must sum to 100%).
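The rank-based normalization and weighted sum can be sketched in pandas roughly as follows. Column names and the exact normalization are assumptions for illustration; `src/scorer.py` holds the real logic.

```python
import pandas as pd

def score_candidates(df: pd.DataFrame, weights: dict[str, float]) -> pd.DataFrame:
    """Rank-normalize each dimension within the batch, then combine with weights.

    `weights` maps dimension columns to fractions summing to 1.0
    (the sidebar sliders enforce a 100% total).
    """
    scored = df.copy()
    for col, w in weights.items():
        # Percentile rank within the batch: the best candidate in a column gets 1.0.
        scored[f"{col}_score"] = scored[col].rank(pct=True) * w
    scored["overall"] = scored[[f"{c}_score" for c in weights]].sum(axis=1)
    return scored.sort_values("overall", ascending=False)
```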

Quick Start

```bash
# Clone
git clone https://github.com/samitmohan/resume-parser.git
cd resume-parser

# Install dependencies
uv sync

# Run
uv run streamlit run app.py
```

Open http://localhost:8501 in your browser.

Project Structure

```
resume-parser/
    app.py                  # Streamlit UI - layout, charts, user interaction
    src/
        __init__.py
        config.py           # Constants, regex patterns, degree lists, paths
        parser.py           # Text extraction + field parsing (PDF/DOCX)
        scorer.py           # Rank-based scoring and DataFrame construction
        database.py         # SQLite operations with parameterized queries
    data_dictionary/        # Excel files defining skill taxonomies per category
        Data Engineering.xlsx
        Data Science.xlsx
        Data Visualization.xlsx
    assets/
        icon_g.png
    temp/                   # Temporary upload directory (gitignored)
    pyproject.toml
```
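The `database.py` layer uses parameterized queries, which is what keeps the SQLite storage safe from SQL injection. A minimal sketch (the table schema here is hypothetical, not the project's actual schema):

```python
import sqlite3

def save_results(db_path: str, rows: list[tuple]) -> None:
    """Insert scored results using '?' placeholders, never string formatting."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(name TEXT, category TEXT, overall REAL)"
        )
        conn.executemany(
            "INSERT INTO results (name, category, overall) VALUES (?, ?, ?)",
            rows,
        )
```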

Tech Stack

  • Python 3.10+
  • Streamlit - interactive web dashboard
  • PyMuPDF (fitz) - PDF text extraction
  • python-docx - DOCX text extraction
  • scikit-learn - CountVectorizer for n-gram tokenization
  • rapidfuzz - fuzzy string matching for degree/skill detection
  • Plotly - bar charts, scatter plots, radar charts
  • pandas - data manipulation and scoring
  • SQLite - persistent result storage

Data Dictionaries

Each .xlsx file in data_dictionary/ contains three sheets:

| Sheet | Purpose |
| --- | --- |
| Skills | Skill segments with inclusion keywords for matching |
| Exclusion Skills | Keywords that trigger automatic rejection |
| Exclusion Company | Company names that trigger rejection |

Upload a custom dictionary via the sidebar to define your own scoring criteria.
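Reading a dictionary workbook might look like the sketch below. The assumption that each sheet's first column holds the keyword list is illustrative; check the bundled `.xlsx` files for the actual layout.

```python
import pandas as pd

SHEETS = ["Skills", "Exclusion Skills", "Exclusion Company"]

def load_dictionary(path: str) -> dict[str, list[str]]:
    """Load the three taxonomy sheets of a data-dictionary workbook."""
    frames = pd.read_excel(path, sheet_name=SHEETS)
    return {name: extract_keywords(df) for name, df in frames.items()}

def extract_keywords(df: pd.DataFrame) -> list[str]:
    """Treat the first column as the keyword list, dropping blanks."""
    return df.iloc[:, 0].dropna().astype(str).str.strip().tolist()
```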
