# Data Science Case Study: Candidate Resume Search Platform

## Business Context & Objective

### Background
You are joining the Business Development (BD) team at Millennium, a global hedge fund that manages assets across multiple investment strategies (fundamental equity, systematic trading, credit, etc.). The BD team is responsible for sourcing junior analyst talent across different:

- **Geographic Markets**: US, Europe, Asia-Pacific
- **Investment Approaches**: Fundamental vs. Systematic/Quantitative strategies
- **Sectors**: Technology, Healthcare, Financial Services, Energy, Industrials, Consumer, Credit, Macro, etc.
- **Experience Levels**: Vary by job requisition

### Goal
Build a **searchable platform** to quickly identify candidates who match the criteria in a job requisition.

### Your Task
1. Parse resume data from PDF/Word documents using **LLMs via API**
2. Save the parsed resume data as **JSON, CSV, etc.** for further analysis
3. Create a **Streamlit web application** where BD users can search and filter candidates using multiple criteria
   (Based on the background provided above, you will come up with relevant filters)
4. Visualize candidate distributions and insights
5. **Design for scalability** to handle large volumes of resumes
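
As a rough illustration of the multi-criteria search in step 3, the sketch below filters parsed candidate records in plain Python. The field names (`region`, `approach`, `sectors`, `years_experience`) and the `CandidateFilter` class are assumptions made for this example, not part of any schema defined in this brief; a real app would map them to whatever the parser actually emits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CandidateFilter:
    """Hypothetical filter set; field names are illustrative only."""
    region: Optional[str] = None        # e.g. "US", "Europe", "Asia-Pacific"
    approach: Optional[str] = None      # e.g. "Fundamental", "Systematic"
    sectors: set = field(default_factory=set)
    min_years: float = 0.0


def matches(candidate: dict, f: CandidateFilter) -> bool:
    """Return True when the candidate satisfies every active criterion."""
    if f.region and candidate.get("region") != f.region:
        return False
    if f.approach and candidate.get("approach") != f.approach:
        return False
    # A candidate matches if they cover at least one requested sector.
    if f.sectors and not f.sectors & set(candidate.get("sectors") or []):
        return False
    # Missing or null experience is treated as zero years.
    return (candidate.get("years_experience") or 0) >= f.min_years


def search(candidates: list, f: CandidateFilter) -> list:
    """Apply the filter to every candidate, preserving input order."""
    return [c for c in candidates if matches(c, f)]


candidates = [
    {"name": "A", "region": "US", "approach": "Fundamental",
     "sectors": ["Technology"], "years_experience": 2},
    {"name": "B", "region": "Europe", "approach": "Systematic",
     "sectors": ["Macro"], "years_experience": 1},
    {"name": "C", "region": "US", "approach": "Systematic",
     "sectors": ["Technology", "Energy"], "years_experience": 3},
]
us_tech = search(candidates, CandidateFilter(region="US", sectors={"Technology"}))
```

Unset criteria are skipped, so an empty `CandidateFilter()` returns every candidate. In a Streamlit app, each attribute would typically be bound to a sidebar widget.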

### Output
1. Code for data parsing and Streamlit in **this Jupyter notebook** for ease of review
2. Include the **link to Streamlit app** in the notebook
3. **JSON/CSV exports** of parsed resume data
4. Discussion of additional features and implementation approach if more time were available

---

## Sample Resume Data

**You have access to 10 made-up resume files representing different candidate profiles.**

---

## Technical Requirements & Evaluation
  
### Evaluation Criteria
1. **Technical Implementation (40%)**: Code quality, data processing, performance
2. **User Experience (30%)**: Interface design, functionality, responsiveness
3. **Business Value (20%)**: Feature completeness, search capabilities, insights
4. **Documentation (10%)**: Code comments, README, presentation

### Time Recommendation
- **Total Estimated Time**: ~9 hours (**tight but achievable with a focused approach**)
- **Recommended Strategy**: Build core MVP in 6-7 hours, then enhance

### Phased Approach (Recommended)
**Phase 1 (3 hours): Core Parsing & Data**
- Set up LLM API integration (OpenAI, Anthropic, or alternatives)
- Parse 2-3 sample resumes using LLM
- Define a structured data format (JSON, etc.)
- Build basic data validation and quality checks
 
**Phase 2 (3 hours): Streamlit**  
- Basic filter interface
- Simple results display
- Core visualizations

**Phase 3 (2-3 hours): Enhancement**
- Parse remaining resumes
- Add advanced filters
- Polish UI/UX
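
The "basic data validation and quality checks" item in Phase 1 could start as a simple completeness score over the parsed record. The sketch below is one possible shape, under stated assumptions: the required and optional field lists and the 2:1 weighting are arbitrary choices for illustration, not a specification from this brief.

```python
# Hypothetical field lists; adjust to the actual parser schema.
REQUIRED_FIELDS = ("name", "education", "experiences")
OPTIONAL_FIELDS = ("top_skills", "location")


def is_missing(value) -> bool:
    """Treat None, blank strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, dict, tuple, set)):
        return len(value) == 0
    return False


def completeness(record: dict, required=REQUIRED_FIELDS, optional=OPTIONAL_FIELDS):
    """Return (score_percent, missing_required, missing_optional).

    Required fields count twice as much as optional ones; this weighting
    is an assumption made for the example.
    """
    missing_required = [k for k in required if is_missing(record.get(k))]
    missing_optional = [k for k in optional if is_missing(record.get(k))]
    total = 2 * len(required) + len(optional)
    earned = total - 2 * len(missing_required) - len(missing_optional)
    score = round(100 * earned / total) if total else 100
    return score, missing_required, missing_optional


record = {
    "name": "Jane Doe",
    "education": [],                      # parsed, but nothing extracted
    "experiences": [{"title": "Analyst"}],
    "top_skills": ["Python"],
}
score, missing_required, missing_optional = completeness(record)
```

Returning the lists of missing fields alongside the score makes it easier to surface actionable gaps in the UI rather than a bare number.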

### Success Criteria (MVP)
- LLM-powered parsing of at least 5-7 resume files with high accuracy
- Working Streamlit app with basic search functionality
- At least 3-4 meaningful filters working
- A few visualizations showing candidate insights
- JSON/CSV output of parsed resume data
- Clean, documented code


In [None]:
from dotenv import load_dotenv
import os
import logging
import json
# Parser utils
from utils.parser import extract_text


from openai import OpenAI
from utils.prompts import RESUME_PARSER_PROMPT, SUMMARY_PROMPT

In [None]:
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

In [None]:

RAW_DIR = "data/resumes/raw"
PROCESSED_DIR = "data/resumes/processed"
SUMMARY_DIR = "data/resumes/summaries"

os.makedirs(PROCESSED_DIR, exist_ok=True)
os.makedirs(SUMMARY_DIR, exist_ok=True)

# Reuse the key loaded from .env above instead of reading it a second time.
client = OpenAI(api_key=openai_api_key)



In [None]:
def summarize_candidate(parsed_data: dict, filename: str = "") -> dict:
    """
    Summarizes a candidate's resume into a structured JSON object.
    Uses parsed resume data to extract key highlights for quick review in the app.

    Args:
        parsed_data (dict): Parsed resume data from parse_resume().
        filename (str): Original filename to help extract candidate name.

    Returns:
        dict: Structured candidate summary.
    """

    
    prompt = SUMMARY_PROMPT.format(parsed_data=json.dumps(parsed_data, indent=2), filename=filename)
    try:
        logger.debug("Generating candidate summary...")

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a data summarization assistant that outputs only valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2,
            response_format={"type": "json_object"}
        )

        content = response.choices[0].message.content
        logger.debug("Summary response preview: %s...", content[:500])

        summary_data = json.loads(content)
        return summary_data

    except json.JSONDecodeError as e:
        logger.error(f"Failed to parse JSON summary: {e}")
        raise
    except Exception as e:
        logger.error(f"LLM summarization failed: {e}")
        raise

In [None]:
def parse_resume(resume_text: str, filename: str = "") -> dict:
    """
    Parses resume text using an LLM prompt optimized for hedge fund BD sourcing.
    Returns structured JSON with normalized, business-relevant fields.

    Args:
        resume_text (str): Raw text content of the resume
        filename (str): Original filename to help extract candidate name

    Returns:
        dict: Structured resume data with normalized fields
    """
    # RESUME_PARSER_PROMPT is already imported at the top of the notebook.
    prompt = RESUME_PARSER_PROMPT.format(resume_text=resume_text, filename=filename)
    logger.debug("Prompt preview: %s...", prompt[:500])

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a data extraction expert that outputs only valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )

        content = response.choices[0].message.content
        logger.debug("Response preview: %s...", content[:500])
        parsed_data = json.loads(content)
        return parsed_data

    except json.JSONDecodeError as e:
        logger.error(f"Failed to parse JSON response: {e}")
        raise
    except Exception as e:
        logger.error(f"LLM request failed: {e}")
        raise


In [None]:
logger.info("Starting resume processing pipeline...")

files = os.listdir(RAW_DIR)
logger.info(f"Processing {len(files)} resume(s)")

for filename in files:
    file_path = os.path.join(RAW_DIR, filename)
    base_name = os.path.splitext(filename)[0]

    parsed_output_path = os.path.join(PROCESSED_DIR, f"{base_name}_parsed.json")
    summary_output_path = os.path.join(SUMMARY_DIR, f"{base_name}_summary.json")

    logger.info(f"Processing: {filename}")

    try:
        text = extract_text(file_path)

        # Parse the full structured data
        parsed_data = parse_resume(text, filename=base_name)
        with open(parsed_output_path, "w", encoding="utf-8") as f:
            json.dump(parsed_data, f, indent=2, ensure_ascii=False)
        logger.info(f"Saved parsed data → {parsed_output_path}")

        # Generate the summarized profile from the parsed data
        summary_data = summarize_candidate(parsed_data, filename=base_name)
        with open(summary_output_path, "w", encoding="utf-8") as f:
            json.dump(summary_data, f, indent=2, ensure_ascii=False)
        logger.info(f"Saved summary → {summary_output_path}")

    except Exception as e:
        logger.error(f"Failed to process {filename}: {e}")


In [None]:
files = os.listdir(RAW_DIR)[:5]  # Process only the first five files while testing
logger.info(f"Processing {len(files)} resume(s)")

for filename in files:
    file_path = os.path.join(RAW_DIR, filename)
    output_filename = filename.rsplit(".", 1)[0] + ".json"
    output_path = os.path.join(PROCESSED_DIR, output_filename)

    if os.path.exists(output_path):
        logger.info(f"Skipping {filename} — already processed.")
        continue

    logger.info(f"Processing: {filename}")

    try:
        text = extract_text(file_path)
        parsed_data = parse_resume(text)

        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(parsed_data, f, indent=2, ensure_ascii=False)

        logger.info(f"Successfully processed: {filename} -> {output_filename}")

    except Exception as e:
        logger.error(f"Failed to process {filename}: {e}")
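
The loops above handle one resume at a time, which is fine for 10 files but slow for large volumes because each file waits on a network round trip. Since LLM calls are I/O-bound, a thread pool is one plausible way to scale. The sketch below uses a hypothetical `process_all` helper; `process_one` stands in for whatever per-file function wraps `extract_text` and `parse_resume`, and the worker count would need tuning against the API's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(filenames, process_one, max_workers=4):
    """Run process_one(filename) concurrently; return (results, failures).

    results maps filename -> return value and failures maps
    filename -> exception, so one bad file does not stop the others.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which filename each future belongs to.
        futures = {pool.submit(process_one, name): name for name in filenames}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                failures[name] = exc
    return results, failures
```

Collecting failures in a dict, rather than only logging them, makes it straightforward to re-run just the files that failed.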


## Loading Parsed Data into a Warehouse to Serve the Front End

In [None]:
# Warehouse pipeline section

import pandas as pd
from utils.db import get_connection, drop_and_create_tables

conn = get_connection()
drop_and_create_tables(conn)
conn.close()

logger.info("Warehouse database initialized at data/db/warehouse.db")

In [None]:
from utils.db import (
    load_json,
    get_connection,
    insert_parsed,
    insert_candidate,
    insert_experience,
    insert_education,
    insert_skill,
    insert_to_fts,
    update_filter_values_for_candidate,
    insert_quality_score,
)
from utils.data_validator import (
    validate_resume_data,
    save_validation_report,
    calculate_completeness_score,
)

files = os.listdir(RAW_DIR)
conn = get_connection()

for filename in files:
    base_name = os.path.splitext(filename)[0]
    parsed_path = os.path.join(PROCESSED_DIR, f"{base_name}_parsed.json")
    summary_path = os.path.join(SUMMARY_DIR, f"{base_name}_summary.json")
    resume_path = os.path.join(RAW_DIR, filename)

    logger.info(f"Processing candidate: {base_name}")

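    try:
        # --- Load parsed + summary files ---
        # Loading inside the try block lets a missing or unreadable file skip
        # this candidate instead of aborting the whole run.
        parsed_data = load_json(parsed_path)
        summary_data = load_json(summary_path)
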
        parsed_id = insert_parsed(conn, parsed_data, summary_data.get("name"), resume_path)

        candidate_id = insert_candidate(conn, summary_data, parsed_id, resume_path)
        #insert to respective filter tables
        for exp in parsed_data.get("experiences", []):
            insert_experience(conn, candidate_id, exp)

        for edu in parsed_data.get("education", []):
            insert_education(conn, candidate_id, edu)

        skills = summary_data.get("top_skills", [])
        for skill in skills:
            insert_skill(conn, candidate_id, skill)

        insert_to_fts(conn, candidate_id, parsed_data, summary_data)

        update_filter_values_for_candidate(conn, candidate_id, summary_data, parsed_data)

        completeness_score, completeness_grade, missing_required, missing_optional = calculate_completeness_score(parsed_data, summary_data)
        logger.info(f"Completeness: {completeness_score}% (Grade: {completeness_grade})")

        issues = validate_resume_data(parsed_data, summary_data)
        
        total_issues = len(issues["critical"]) + len(issues["formatting"]) + len(issues["warnings"])
        logger.info(f"Validation: {total_issues} total issues - Critical: {len(issues['critical'])}, Formatting: {len(issues['formatting'])}, Warnings: {len(issues['warnings'])}")
        
        insert_quality_score(
            conn,
            candidate_id,
            completeness_score,
            completeness_grade,
            total_issues,
            issues,
            missing_required,
            missing_optional
        )
        logger.info("Saved quality score to database")

        save_validation_report(
            summary_data.get("name"),
            candidate_id,
            issues,
            parsed_data,
            summary_data,
            completeness_score,
            completeness_grade,
            missing_required,
            missing_optional
        )

        logger.info(f"Successfully inserted candidate: {summary_data.get('name')}")

    except Exception:
        # logger.exception logs at ERROR level and appends the full traceback.
        logger.exception(f"Failed to insert {base_name}")

conn.close()
logger.info("Warehouse ingestion complete.")
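
The ingestion loop issues several inserts per candidate, so a failure partway through could leave partial rows behind unless the helpers in `utils.db` already manage transactions (their implementation is not shown here). One way to guard against that is to wrap each candidate's inserts in a single transaction. The sketch below uses an in-memory SQLite database with a simplified two-table schema; the table and column names are assumptions for this example and are not the warehouse's actual schema.

```python
import sqlite3


def insert_candidate_atomically(conn, name, skills):
    """Insert a candidate and their skills as one unit of work.

    `with conn:` commits if the block succeeds and rolls back if it raises,
    so a failure partway through leaves no partial rows behind.
    """
    with conn:
        cur = conn.execute("INSERT INTO candidates (name) VALUES (?)", (name,))
        candidate_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO skills (candidate_id, skill) VALUES (?, ?)",
            [(candidate_id, s) for s in skills],
        )
    return candidate_id


# Hypothetical, simplified schema used only for this demonstration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE candidates (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
demo_conn.execute("CREATE TABLE skills (candidate_id INTEGER NOT NULL, skill TEXT NOT NULL)")

insert_candidate_atomically(demo_conn, "Jane Doe", ["Python", "SQL"])
try:
    # The None skill violates NOT NULL, so the whole insert is rolled back,
    # including the candidate row and the valid "Excel" skill.
    insert_candidate_atomically(demo_conn, "John Roe", ["Excel", None])
except sqlite3.IntegrityError:
    pass
```

After this runs, only the first candidate and her two skills remain in the database; the second candidate's rows were discarded together.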

In [None]:
import sqlite3

DB_PATH = 'data/db/warehouse.db'

# Helper function to display table data in a readable format
def show_table_preview(conn, table_name, limit=5):
    """Logs a preview of a table, using pandas for readability."""
    logger.info(f"\nTable: {table_name}")
    try:
        df = pd.read_sql_query(f"SELECT * FROM {table_name} LIMIT {limit}", conn)
        if df.empty:
            logger.info("No rows found.")
        else:
            logger.info(df.to_string(index=False))
    except Exception as e:
        logger.info(f"Failed to query {table_name}: {e}")

def inspect_warehouse():
    if not os.path.exists(DB_PATH):
        logger.info(f"Database not found at {DB_PATH}")
        return

    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

    logger.info("🔍 Inspecting warehouse database...\n")

    # Get all table names
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = [row[0] for row in cursor.fetchall()]

    if not tables:
        logger.info("No tables found in warehouse.")
        conn.close()
        return

    for table in tables:
        show_table_preview(conn, table)

    conn.close()
    logger.info("\nInspection complete.")


inspect_warehouse()