An intelligent resume parsing and candidate scoring system that extracts structured data from PDF/DOCX resumes, scores candidates against configurable criteria, and presents results through an interactive dashboard.
| Verified Candidate | Rejected Candidate |
|---|---|
| ![]() | ![]() |
- Select a job category from the sidebar (Data Science, Data Engineering, or Data Visualization) - or upload your own custom data dictionary (.xlsx)
- Adjust scoring weights using the sidebar sliders - control how much each dimension (skills, degree, experience, certifications) contributes to the overall score
- Upload one or more resumes (PDF or DOCX) - they are parsed concurrently with a progress bar
- View scored results in the Table View tab - candidates are ranked by overall score, with "Verified" or "Rejected" status based on exclusion lists
- Explore breakdowns across sub-tabs: Overall Score, Skill Score, Experience Score, Degree Score, Certification Score
- Compare top candidates side-by-side using the Radar Chart tab - a polar overlay showing each candidate's strengths across all four dimensions
- Save to database for future reference, or download as CSV for offline analysis
- Review past results in the "View Existing Resumes" tab, filtered by job category
- Multi-format support - parse PDF and DOCX resumes
- Automated extraction - contact info, education, skills, experience, certifications
- Configurable scoring - weighted scoring across skill match, degree level, experience, and certifications
- Data dictionaries - customizable skill taxonomies per job category (Data Science, Data Engineering, Data Visualization)
- Candidate comparison - radar chart overlay comparing top candidates across all dimensions
- Batch processing - concurrent parsing of multiple resumes with progress tracking
- Persistent storage - SQLite backend for saving and reviewing past results
- CSV export - download scored results for offline analysis
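The batch-processing feature above can be sketched with a thread pool. This is a minimal illustration only; `parse_resume` here is a hypothetical stand-in for the real parser in `src/parser.py`, whose internals differ:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_resume(path: str) -> dict:
    """Hypothetical stand-in for the real PDF/DOCX parser."""
    return {"file": path, "skills": ["python"]}

def parse_batch(paths: list[str], max_workers: int = 4) -> list[dict]:
    """Parse resumes concurrently, collecting results as each one finishes."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(parse_resume, p): p for p in paths}
        for future in as_completed(futures):
            results.append(future.result())  # a progress bar would update here
    return results

print(len(parse_batch(["a.pdf", "b.docx"])))  # → 2
```

Collecting with `as_completed` lets the UI advance the progress bar per file rather than waiting for the whole batch.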
```mermaid
flowchart LR
    A[PDF / DOCX Upload] --> B[Text Extraction]
    B --> C[Field Parsing]
    C --> D[Skill Matching]
    C --> E[Degree Detection]
    C --> F[Experience Extraction]
    C --> G[Certification Count]
    D & E & F & G --> H[Weighted Scoring]
    H --> I[Dashboard + Charts]
    H --> J[SQLite Storage]
    H --> K[CSV Export]
```
```mermaid
sequenceDiagram
    participant U as User
    participant S as Streamlit UI
    participant P as Parser
    participant SC as Scorer
    participant DB as SQLite
    U->>S: Upload resumes + select category
    S->>P: Parse each file (concurrent)
    P->>P: Extract text (PDF/DOCX)
    P->>P: Extract email, phone, degrees, skills, experience
    P-->>S: ResumeData objects
    S->>SC: Build results DataFrame
    SC->>SC: Rank-based scoring with weights
    SC-->>S: Scored DataFrame
    S->>U: Display table, charts, radar comparison
    U->>S: Click Save
    S->>DB: Store results (parameterized queries)
```
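The final storage step in the diagram uses parameterized queries, which in Python's stdlib `sqlite3` looks roughly like the sketch below. The table and column names are assumptions for illustration, not the real schema in `src/database.py`:

```python
import sqlite3

def save_results(db_path: str, category: str, rows: list[tuple]) -> int:
    """Insert scored candidates with parameterized queries; returns rows written."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS results (
                   name TEXT, category TEXT, overall_score REAL)"""
        )
        cur = conn.executemany(
            "INSERT INTO results (name, category, overall_score) VALUES (?, ?, ?)",
            [(name, category, score) for name, score in rows],
        )
        return cur.rowcount

print(save_results(":memory:", "Data Science", [("Alice", 0.91)]))  # → 1
```

Binding values with `?` placeholders instead of string formatting is what keeps resume-derived text from being interpreted as SQL.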
Each candidate is scored across four dimensions, normalized by rank within the batch:
```
Overall Score = (Skill Score × W1) + (Degree Score × W2) + (Experience Score × W3) + (Certification Score × W4)
```
| Dimension | What it measures | How |
|---|---|---|
| Skill Score | Match against data dictionary skill taxonomy | NLP n-gram tokenization + fuzzy matching |
| Degree Score | Education level | Fuzzy match against degree classification lists |
| Experience Score | Years of experience | Regex patterns for "X years", date ranges |
| Certification Score | Number of certifications | Keyword frequency count |
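The experience row in the table relies on simple regex patterns. A minimal sketch of the "X years" case follows; the real patterns in `src/config.py` are likely broader (date ranges, month spans, etc.):

```python
import re

# Matches phrases like "5 years", "3+ years of experience", "2.5 yrs"
YEARS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*\+?\s*(?:years?|yrs?)", re.IGNORECASE)

def extract_years(text: str) -> float:
    """Return the largest 'X years' figure mentioned in the resume text."""
    matches = [float(m.group(1)) for m in YEARS_PATTERN.finditer(text)]
    return max(matches, default=0.0)

print(extract_years("6 years of Python, 2 yrs of SQL"))  # → 6.0
```

Taking the maximum is one plausible policy; summing per-role ranges is another, and the real parser may do either.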
Weights are adjustable via sidebar sliders (must sum to 100%).
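Rank normalization plus the weighted sum can be sketched with pandas (already a dependency). The column names and weights below are illustrative, not the app's defaults:

```python
import pandas as pd

WEIGHTS = {"skill": 0.4, "degree": 0.2, "experience": 0.3, "certification": 0.1}

def score_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize each dimension by rank within the batch, then combine with weights."""
    out = df.copy()
    for dim in WEIGHTS:
        out[f"{dim}_score"] = out[dim].rank(pct=True)  # rank-normalized to (0, 1]
    out["overall"] = sum(out[f"{d}_score"] * w for d, w in WEIGHTS.items())
    return out.sort_values("overall", ascending=False)

df = pd.DataFrame({
    "skill": [8, 5], "degree": [2, 3],
    "experience": [6.0, 2.0], "certification": [1, 0],
})
print(score_batch(df)["overall"].tolist())  # → [0.9, 0.6]
```

Because scores are percentile ranks *within the batch*, a candidate's score is relative to the other uploads, not an absolute measure.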
```bash
# Clone
git clone https://github.com/samitmohan/resume-parser.git
cd resume-parser

# Install dependencies
uv sync

# Run
uv run streamlit run app.py
```

Open http://localhost:8501 in your browser.
```text
resume-parser/
├── app.py                      # Streamlit UI - layout, charts, user interaction
├── src/
│   ├── __init__.py
│   ├── config.py               # Constants, regex patterns, degree lists, paths
│   ├── parser.py               # Text extraction + field parsing (PDF/DOCX)
│   ├── scorer.py               # Rank-based scoring and DataFrame construction
│   └── database.py             # SQLite operations with parameterized queries
├── data_dictionary/            # Excel files defining skill taxonomies per category
│   ├── Data Engineering.xlsx
│   ├── Data Science.xlsx
│   └── Data Visualization.xlsx
├── assets/
│   └── icon_g.png
├── temp/                       # Temporary upload directory (gitignored)
└── pyproject.toml
```
- Python 3.10+
- Streamlit - interactive web dashboard
- PyMuPDF (fitz) - PDF text extraction
- python-docx - DOCX text extraction
- scikit-learn - CountVectorizer for n-gram tokenization
- rapidfuzz - fuzzy string matching for degree/skill detection
- Plotly - bar charts, scatter plots, radar charts
- pandas - data manipulation and scoring
- SQLite - persistent result storage
Each .xlsx file in data_dictionary/ contains three sheets:
| Sheet | Purpose |
|---|---|
| Skills | Skill segments with inclusion keywords for matching |
| Exclusion Skills | Keywords that trigger automatic rejection |
| Exclusion Company | Company names that trigger rejection |
Upload a custom dictionary via the sidebar to define your own scoring criteria.

