*Job Description*
**Load the CSV and check its columns**

In [1]:
import pandas as pd

CSV_PATH = "data/job_description.csv"
df_jd = pd.read_csv(CSV_PATH, encoding_errors="ignore")

required_cols = {"Role", "JobDescription"}
missing = required_cols - set(df_jd.columns)
assert not missing, f"CSV is missing columns: {missing}"

print("Loaded rows:", len(df_jd))
print(df_jd.head(3))


Loaded rows: 21
                 Role                                     JobDescription
0   Software Engineer  Hiring a strong Software Engineer to design, b...
1   Backend Developer  Looking for a Backend Developer to own APIs, b...
2  Frontend Developer  Seeking a Frontend Developer who ships accessi...


**Clean the job descriptions** and save the result in a new column

In [2]:
import re

def clean_text(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r'http\S+|www\S+|\S+@\S+', ' ', s)
    s = re.sub(r'[^a-z0-9\s\.\+\#]', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s
df_jd['CleanedJobDescription'] = df_jd['JobDescription'].apply(clean_text)

print(df_jd.head(3))

                 Role                                     JobDescription  \
0   Software Engineer  Hiring a strong Software Engineer to design, b...   
1   Backend Developer  Looking for a Backend Developer to own APIs, b...   
2  Frontend Developer  Seeking a Frontend Developer who ships accessi...   

                               CleanedJobDescription  
0  hiring a strong software engineer to design bu...  
1  looking for a backend developer to own apis bu...  
2  seeking a frontend developer who ships accessi...  
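
To see what each substitution in `clean_text` does, here is a minimal sketch that runs the same function on a made-up string (the sample text is an assumption, not a row from the CSV). URLs and e-mail addresses are dropped first, then every character outside `a-z`, `0-9`, whitespace, `.`, `+` and `#` becomes a space, so tokens such as `c#`, `c++` and `node.js` survive while `CI/CD` splits into two words.

```python
import re

def clean_text(s: str) -> str:
    s = str(s).lower()
    # Drop URLs and e-mail addresses (including any punctuation glued to them)
    s = re.sub(r'http\S+|www\S+|\S+@\S+', ' ', s)
    # Keep only letters, digits, whitespace and the characters . + #
    s = re.sub(r'[^a-z0-9\s\.\+\#]', ' ', s)
    # Collapse runs of whitespace
    s = re.sub(r'\s+', ' ', s).strip()
    return s

# Hypothetical sample text, for illustration only
sample = "Contact: jane@corp.io, see https://corp.io/jobs. Needs C#, C++ & Node.js (CI/CD)!"
cleaned = clean_text(sample)
print(cleaned)  # contact see needs c# c++ node.js ci cd
```

Because hyphens are replaced by spaces, a term like `scikit-learn` ends up as `scikit learn` in the cleaned text, which matters for any later keyword lookup.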


**Read Resume from the PDF file**

In [3]:
from pypdf import PdfReader

pdf_path = 'data/frontend_strong_real.pdf'

def read_pdf_text(pdf_path: str) -> str:
    try:
        reader = PdfReader(pdf_path)
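        chunks = []
        for page in reader.pages:
            # Fall back to an empty string if a page yields no text
            txt = page.extract_text() or ""
            chunks.append(txt)
        # Join once, after the loop, so a PDF with no pages still returns ""
        fully_text = "\n".join(chunks).strip()
        return fully_text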
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return ""
    
resume_text = read_pdf_text(pdf_path)
print(resume_text)

John Doe
Frontend Engineer — 4+ Years Experience
Email: john.doe@email.com  Portfolio: johndoe.dev
LinkedIn: linkedin.com/in/johndoe  GitHub: github.com/johndoe
Summary
Experienced Frontend Engineer specializing in scalable, accessible, and performance■driven
apps. Strong collaboration with product, design, and backend teams.
Key Skills

React, Next.js, TypeScript, Redux Toolkit, React Query

Tailwind CSS, CSS-in-JS, Component Design Systems

REST, GraphQL, WebSockets

Jest, React Testing Library, Cypress

Accessibility (WCAG), Core Web Vitals, SEO

Git, CI/CD, Docker
Work Experience

Frontend Engineer | SaaS Company (2021–Present)

Improved performance scores by 40% through optimization and caching

Developed reusable UI components and dark mode support

Implemented SSR/ISR in Next.js for SEO improvement to 95+ score
Projects

Internal design system for 3 product lines

Real■time admin dashboard with charts and websocket updates
Education
B.Sc. Computer Science — ABC Uni

**Clean the resume text**

In [4]:
# Reuse clean_text from In [2] so the resume and the job descriptions
# are normalized in exactly the same way.

resume_clean = clean_text(resume_text)
print(resume_clean)

john doe frontend engineer 4+ years experience email portfolio johndoe.dev linkedin linkedin.com in johndoe github github.com johndoe summary experienced frontend engineer specializing in scalable accessible and performance driven apps. strong collaboration with product design and backend teams. key skills react next.js typescript redux toolkit react query tailwind css css in js component design systems rest graphql websockets jest react testing library cypress accessibility wcag core web vitals seo git ci cd docker work experience frontend engineer saas company 2021 present improved performance scores by 40 through optimization and caching developed reusable ui components and dark mode support implemented ssr isr in next.js for seo improvement to 95+ score projects internal design system for 3 product lines real time admin dashboard with charts and websocket updates education b.sc. computer science abc university


**Detect the candidate's role from user input**

In [5]:
from rapidfuzz import process, fuzz

ALLOWED_ROLES = sorted(df_jd["Role"].dropna().unique().tolist())

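def normalize_role_input(user_role: str, choices=ALLOWED_ROLES, threshold=80):
    # extractOne returns (choice, score, index), or None when nothing can be matched
    best = process.extractOne(user_role, choices, scorer=fuzz.WRatio)
    if best is None:
        return None, 0
    cand, score, _ = best
    # Accept the closest role only if it clears the similarity threshold (0-100)
    return (cand, score) if score >= threshold else (None, score)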

user_input_role = "Frontend Developer"
matched_role, score = normalize_role_input(user_input_role)

if not matched_role:
    raise ValueError(f"Role '{user_input_role}' not recognized (best score={score}). Try one of: {ALLOWED_ROLES[:10]} ...")

print(f"User input role: {user_input_role}")


User input role: Frontend Developer
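
The thresholding idea behind `normalize_role_input` can be tried without the CSV or rapidfuzz. The sketch below is a stand-in that swaps in the standard library's `difflib`, so its similarity ratios (0 to 1) are not comparable to `fuzz.WRatio` scores (0 to 100). The role list, the `match_role` helper and the `cutoff` value are illustrative assumptions.

```python
from difflib import get_close_matches

# Hypothetical role list; the notebook builds ALLOWED_ROLES from the CSV instead
ROLES = ["Backend Developer", "Frontend Developer", "Software Engineer"]

def match_role(user_role, choices=ROLES, cutoff=0.8):
    """Return the closest known role, or None if nothing is similar enough."""
    # Compare case-insensitively, but return the original spelling
    lowered = {c.lower(): c for c in choices}
    hits = get_close_matches(user_role.strip().lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(match_role("frontend develper"))  # a typo still resolves to a known role
print(match_role("Astronaut"))          # unrelated input is rejected
```

A stricter cutoff rejects more typos, while a looser one risks mapping an unrelated title onto the nearest role, which is the same trade-off the `threshold` parameter controls.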


**Embeddings + Cosine similarity → % Match**

In [6]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hyperparameters for stability
BASELINE_COS = 0.20     # subtract the baseline similarity that even unrelated texts tend to reach
MIN_WORDS = 50          # if the resume is too short or gibberish → score 0%
NEGATIVE_PENALTY = 0.5  # scale the score down when off-field words appear

NEGATIVE_WORDS = {
    "accounting","nurse","warehouse","driver","logistics",
    "cashier","factory","mechanical","chefs","culinary"
}

model = SentenceTransformer("all-MiniLM-L6-v2")

# === Validation: resume text too short ===
if len(resume_clean.split()) < MIN_WORDS:
    # Avoid using 'return' at top-level in a notebook cell.
    perc_match = np.float32(0.0)
    print(f"Role: {matched_role}")
    print(f"Percent match: {perc_match:.2f}% (resume too short/irrelevant)")
else:
    # Filter role JD
    sub = df_jd[df_jd["Role"].str.lower() == matched_role.lower()].copy()
    assert len(sub) > 0, f"No JD found for role: {matched_role}"

    # Embeddings
    res_emb = model.encode([resume_clean], normalize_embeddings=True)
    jd_emb  = model.encode(sub["CleanedJobDescription"].tolist(), batch_size=64, normalize_embeddings=True)

    # Raw similarity scores
    sims = cosine_similarity(res_emb, jd_emb)[0]  # [N JDs]
    # Remove baseline cosine which inflates random text similarity
    sims = sims - BASELINE_COS
    sims = np.clip(sims, 0, 1)

    # Convert to %
    perc = sims * 100
    sub["Match_%"] = perc

    # === Penalty for wrong-field text ===
    # Match whole words so that e.g. "satisfactory" does not trigger "factory"
    has_negative = any(re.search(rf"\b{re.escape(w)}\b", resume_clean) for w in NEGATIVE_WORDS)
    if has_negative:
        sub["Match_%"] = sub["Match_%"] * NEGATIVE_PENALTY

    # Take best match
    perc_match = sub["Match_%"].max()

    print(f"Role: {matched_role}")
    print(f"Percent match: {perc_match:.2f}%")



Role: Frontend Developer
Percent match: 53.52%
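
The percentage above is the cosine similarity shifted down by `BASELINE_COS` and clipped to [0, 1]. As a minimal sketch of that arithmetic in plain Python (the function names are illustrative, not part of the notebook): a reported 53.52% corresponds to a raw cosine of about 0.735, and because nothing is rescaled after the shift, the highest reachable score is 80%. The second function is a hypothetical variant that divides by `1 - baseline` so that a cosine of 1.0 maps to 100%.

```python
def cosine_to_percent(cos_sim: float, baseline: float = 0.20) -> float:
    """Shift by the baseline, clip to [0, 1], convert to percent."""
    shifted = cos_sim - baseline
    clipped = min(1.0, max(0.0, shifted))
    return clipped * 100.0

def cosine_to_percent_rescaled(cos_sim: float, baseline: float = 0.20) -> float:
    """Hypothetical variant: rescale so a perfect cosine of 1.0 gives 100%."""
    shifted = max(0.0, cos_sim - baseline)
    return min(1.0, shifted / (1.0 - baseline)) * 100.0

for cos in (0.10, 0.20, 0.7352, 1.00):
    print(f"cos={cos:.4f}  shifted={cosine_to_percent(cos):6.2f}%  "
          f"rescaled={cosine_to_percent_rescaled(cos):6.2f}%")
```

Whether to rescale is a presentation choice: the shifted score ranks job descriptions identically, but a ceiling of 80% may be surprising to someone reading the number as a plain percentage.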


**Skills overlap**

In [7]:
SKILLS = {
  "python","pandas","numpy","scikit-learn","tensorflow","pytorch","sql","tableau","power bi",
  "react","node","docker","kubernetes","aws","azure","gcp","java","spring","c#","dotnet",".net",
  "javascript","typescript","html","css","django","flask","spark","airflow","hadoop","kafka",
  "terraform","git","ci","cd","jenkins","linux","bash","rest","graphql","firebase","swift","kotlin"
}

def extract_skills(text: str):
    found = set()
    for sk in SKILLS:
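        # \b never fires next to "#" or a leading ".", so "c#" and " .net" would be missed.
        # Guard only the sides of the skill that start or end with a letter or digit.
        left = r"(?<![a-z0-9])" if sk[0].isalnum() else ""
        right = r"(?![a-z0-9])" if sk[-1].isalnum() else ""
        if re.search(rf"{left}{re.escape(sk)}{right}", text):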
            found.add(sk)
    return found

resume_sk = extract_skills(resume_clean)
sub["JD_skills_set"] = sub["CleanedJobDescription"].apply(extract_skills)
sub["SkillOverlap_%"] = sub["JD_skills_set"].apply(
    lambda s: (len(s & resume_sk) / max(1, len(s))) * 100.0
)

# Blend the scores: 0.7 semantic + 0.3 skill overlap (the weights are adjustable)
sub["Final_%"] = 0.7*sub["Match_%"] + 0.3*sub["SkillOverlap_%"]
print(sub.sort_values("Final_%", ascending=False).head(3)[["Role","Match_%","SkillOverlap_%","Final_%"]])


                 Role    Match_%  SkillOverlap_%    Final_%
2  Frontend Developer  53.522961           100.0  67.466072
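
The blended score can be checked by hand: 0.7 × 53.522961 + 0.3 × 100.0 ≈ 67.466. The sketch below reproduces the overlap and blend arithmetic on small made-up inputs; the helper names, the skill list and the sample texts are assumptions for illustration. It also shows one way to match skills that begin or end with punctuation, since a pattern like `\bc#\b` does not match `c#` followed by a space (`\b` needs a word character on one side).

```python
import re

def find_skills(text: str, skills: set) -> set:
    """Return the skills that occur in text as standalone tokens."""
    found = set()
    for sk in skills:
        # Guard only the sides of the skill that are letters or digits
        left = r"(?<![a-z0-9])" if sk[0].isalnum() else ""
        right = r"(?![a-z0-9])" if sk[-1].isalnum() else ""
        if re.search(rf"{left}{re.escape(sk)}{right}", text):
            found.add(sk)
    return found

def overlap_percent(jd_skills: set, resume_skills: set) -> float:
    """Share of the JD's skills that also appear in the resume, in percent."""
    return len(jd_skills & resume_skills) / max(1, len(jd_skills)) * 100.0

def blend(match_pct: float, overlap_pct: float, w_semantic: float = 0.7) -> float:
    """Weighted mix of the semantic score and the skill overlap."""
    return w_semantic * match_pct + (1.0 - w_semantic) * overlap_pct

# Hypothetical inputs, already lowercased as clean_text would produce
skills = {"react", "css", "c#", ".net", "java", "typescript"}
jd_sk = find_skills("react typescript css and c# with .net", skills)
cv_sk = find_skills("react typescript javascript css", skills)

print(sorted(jd_sk))                  # "java" is absent from the JD
print(sorted(cv_sk))                  # "javascript" does not count as "java"
print(overlap_percent(jd_sk, cv_sk))  # 3 of the 5 JD skills are covered
print(round(blend(53.522961, 100.0), 3))
```

Note that the overlap is measured against the job description's skills only, so extra skills on the resume neither help nor hurt the score.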
