*Job Description*
**Load the CSV and check the required columns**

In [50]:
import pandas as pd

CSV_PATH = "data/job_description.csv"
df_jd = pd.read_csv(CSV_PATH)

required_cols = {"Role", "JobDescription"}
missing = required_cols - set(df_jd.columns)
assert not missing, f"CSV is missing columns: {missing}"

print("Loaded rows:", len(df_jd))
print(df_jd.head(3))


Loaded rows: 66
                            Role  \
0  Business Intelligence Analyst   
1                    QA Engineer   
2       Test Automation Engineer   

                                      JobDescription  
0  A Business Intelligence Analyst transforms dat...  
1  A QA Engineer ensures the quality and reliabil...  
2  A Test Automation Engineer builds and maintain...  


**Clean the job descriptions and save them in a new column**

In [51]:
import re

def clean_text(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r'http\S+|www\S+|\S+@\S+', ' ', s)
    s = re.sub(r'[^a-z0-9\s\.\+\#]', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s
df_jd['CleanedJobDescription'] = df_jd['JobDescription'].apply(clean_text)

print(df_jd.head(3))

                            Role  \
0  Business Intelligence Analyst   
1                    QA Engineer   
2       Test Automation Engineer   

                                      JobDescription  \
0  A Business Intelligence Analyst transforms dat...   
1  A QA Engineer ensures the quality and reliabil...   
2  A Test Automation Engineer builds and maintain...   

                               CleanedJobDescription  
0  a business intelligence analyst transforms dat...  
1  a qa engineer ensures the quality and reliabil...  
2  a test automation engineer builds and maintain...  
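
To make the effect of the three substitutions concrete, here is a minimal sketch that runs the same `clean_text` on a made-up string (the sample text, address, and URL are illustrative assumptions, not rows from the CSV). It shows that URLs and e-mail addresses are dropped, that `.`, `+`, and `#` survive so tokens such as `c++`, `c#`, and `.net` stay intact, and that hyphens become spaces.

```python
import re

def clean_text(s: str) -> str:
    s = str(s).lower()
    # Drop URLs and e-mail addresses
    s = re.sub(r'http\S+|www\S+|\S+@\S+', ' ', s)
    # Keep letters, digits, whitespace, and the characters . + #
    s = re.sub(r'[^a-z0-9\s\.\+\#]', ' ', s)
    # Collapse runs of whitespace
    s = re.sub(r'\s+', ' ', s).strip()
    return s

# Hypothetical input, chosen to exercise every rule
sample = "Contact hr@example.com or visit https://example.com/jobs. Skills: C++, C#, .NET & Scikit-learn!"
cleaned = clean_text(sample)
print(cleaned)
# contact or visit skills c++ c# .net scikit learn
```

Note that `scikit-learn` becomes `scikit learn` after cleaning, which matters for any later keyword lookup that still uses the hyphenated spelling.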


**Read Resume from the PDF file**

In [52]:
from pypdf import PdfReader

pdf_path = 'data/sample_resume.pdf'

def read_pdf_text(pdf_path: str) -> str:
    """Return the text of all pages joined by newlines, or "" on failure."""
    try:
        reader = PdfReader(pdf_path)
        # Fall back to "" when a page yields no text
        chunks = [page.extract_text() or "" for page in reader.pages]
        # Join once, after the loop, so a PDF with zero pages still returns ""
        return "\n".join(chunks).strip()
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return ""
    
resume_text = read_pdf_text(pdf_path)
print(resume_text)

Supawan Kongsapcharoen 
Location: Chiang Rai, Thailand 
Phone: +66 95-581-0440 | Email: supawankongsapcharoen@gmail.com 
PROFILE SUMMARY 
Analytical and detail-oriented Software Engineering student (Year 3) passionate about Data Analysis and 
Machine Learning. Skilled in SQL, Python, and data visualization, with experience as a Teaching Assistant 
in Database Systems and several academic projects involving data cleaning, analytics, and predictive 
modeling. Seeking an internship opportunity to apply analytical thinking and problem-solving skills in real-
world datasets. 
TECHNICAL SKILLS 
Languages: Python, SQL, JavaScript (basic) 
Libraries & Tools: Pandas, NumPy, Matplotlib, Scikit-learn 
Databases: MySQL, MongoDB 
Version Control: Git, GitHub 
Soft Skills: Analytical Thinking, Communication, Team Collaboration, Problem Solving 
PROJECT EXPERIENCE 
Machine Learning Resume Classifier - Python, Scikit-learn, TF-IDF | 2025 
• Built a machine learning model to classify resumes into job c

**Clean the resume text**

In [53]:
# Reuse clean_text from the job-description cleaning cell above so the
# resume and the job descriptions are normalized in exactly the same way.

resume_clean = clean_text(resume_text)
print(resume_clean)

supawan kongsapcharoen location chiang rai thailand phone +66 95 581 0440 email profile summary analytical and detail oriented software engineering student year 3 passionate about data analysis and machine learning. skilled in sql python and data visualization with experience as a teaching assistant in database systems and several academic projects involving data cleaning analytics and predictive modeling. seeking an internship opportunity to apply analytical thinking and problem solving skills in real world datasets. technical skills languages python sql javascript basic libraries tools pandas numpy matplotlib scikit learn databases mysql mongodb version control git github soft skills analytical thinking communication team collaboration problem solving project experience machine learning resume classifier python scikit learn tf idf 2025 built a machine learning model to classify resumes into job categories e.g. data science hr . processed 4 000+ text samples using data cleaning and fe

In [None]:
#  --- Extract Resume Information ---
import re
import spacy
import pdfplumber

# Load spaCy's English model for Named Entity Recognition.
# Try to load the model; if it is missing, download it automatically;
# if that also fails, fall back to a blank model.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    try:
        print("spaCy model 'en_core_web_sm' not found. Attempting to download...")
        from spacy.cli import download
        download("en_core_web_sm")
        nlp = spacy.load("en_core_web_sm")
        print("Downloaded and loaded 'en_core_web_sm'.")
    except Exception as e:
        print("Failed to download/load 'en_core_web_sm':", e)
        print("Falling back to blank English model (limited NER).")
        nlp = spacy.blank("en")

def extract_resume_info(pdf_path: str):
    """
    Extract candidate details (name, email, phone, skills, experience)
    from resume PDF.
    """
    # Read the text from the PDF file
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for a page without extractable text
            text += (page.extract_text() or "") + "\n"

    # --- Email ---
    email = re.findall(r'\S+@\S+', text)
    email = email[0] if email else None

    # --- Phone ---
    phone = re.findall(r'(\+?\d[\d\s\-]{8,}\d)', text)
    phone = phone[0] if phone else None

    # --- Name (NER) ---
    doc = nlp(text)
    name = None
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name = ent.text
            break

    # --- Skills (keyword matching + cleaning text) ---
    skill_keywords = [
        "Python", "Machine Learning", "Deep Learning", "SQL",
        "React", "Java", "AWS", "Docker", "TensorFlow", "Data Analysis",
        "NLP", "Flask", "FastAPI", "Excel"
    ]
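    # Match whole words only, so "Java" is not detected inside "JavaScript"
    skills = [s for s in skill_keywords
              if re.search(rf"\b{re.escape(s)}\b", text, flags=re.IGNORECASE)]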

    # --- Experience (e.g., '3 years of experience') ---
    exp = re.findall(r'(\d+[\+]? years? of experience)', text, flags=re.IGNORECASE)
    exp = exp[0] if exp else None

    extracted = {
        "name": name,
        "email": email,
        "phone": phone,
        "skills": skills,
        "experience": exp,
        "raw_text": text  # kept for use in other pipeline steps
    }

    return extracted


# Test with the sample resume
pdf_path = "data/sample_resume.pdf"
resume_data = extract_resume_info(pdf_path)

print("Extracted Resume Info:")
for key, value in resume_data.items():
    if key != "raw_text":
        print(f"{key}: {value}")
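
The cell above has no recorded output, so here is a small sketch of what its three regular expressions capture. It applies the same patterns to a made-up line of text (the name, phone number, and address are illustrative assumptions) and needs neither spaCy nor a PDF.

```python
import re

# Hypothetical resume line; none of these values come from the sample PDF
sample = ("Jane Doe | Phone: +1 555-010-0199 | Email: jane.doe@example.com | "
          "3+ years of experience in SQL")

# Same patterns as in extract_resume_info
emails = re.findall(r'\S+@\S+', sample)
phones = re.findall(r'(\+?\d[\d\s\-]{8,}\d)', sample)
exps = re.findall(r'(\d+[\+]? years? of experience)', sample, flags=re.IGNORECASE)

email = emails[0] if emails else None
phone = phones[0] if phones else None
exp = exps[0] if exps else None

print(email)  # jane.doe@example.com
print(phone)  # +1 555-010-0199
print(exp)    # 3+ years of experience
```

The phone pattern requires at least ten digit-like characters in a row, so short numbers such as a year or a page count are not picked up.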

**Detect the candidate's role from user input**

In [54]:
from rapidfuzz import process, fuzz

ALLOWED_ROLES = sorted(df_jd["Role"].dropna().unique().tolist())

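def normalize_role_input(user_role: str, choices=ALLOWED_ROLES, threshold=80):
    # extractOne returns None when there is nothing to match against
    best = process.extractOne(user_role, choices, scorer=fuzz.WRatio)
    if best is None:
        return None, 0
    cand, score, _ = best
    return (cand, score) if score >= threshold else (None, score)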

user_input_role = "Dta Anyst"
matched_role, score = normalize_role_input(user_input_role)

if not matched_role:
    raise ValueError(f"Role '{user_input_role}' not recognized (best score={score}). Try one of: {ALLOWED_ROLES[:10]} ...")

print(f"User input role: {user_input_role}")
print(f"Matched role: {matched_role} with confidence: {score}%")


User input role: Dta Anyst
Matched role: Data Analyst with confidence: 85.71428571428572%
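
For readers without `rapidfuzz` installed, the same idea (pick the closest allowed role and reject it below a threshold) can be sketched with the standard library. This swaps in `difflib`, whose similarity ratio is a different scorer from `fuzz.WRatio`, so scores will not always agree. The role list and the helper name `match_role_stdlib` are assumptions made for illustration.

```python
import difflib

# Hypothetical role list; in the notebook this comes from df_jd["Role"]
ROLES = ["Business Intelligence Analyst", "Data Analyst",
         "QA Engineer", "Test Automation Engineer"]

def match_role_stdlib(user_role, choices=ROLES, cutoff=0.8):
    """Return the closest role by difflib ratio, or None below the cutoff."""
    # Compare case-insensitively, then map back to the original spelling
    lowered = {c.lower(): c for c in choices}
    hits = difflib.get_close_matches(user_role.lower(), list(lowered),
                                     n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(match_role_stdlib("Dta Anyst"))   # Data Analyst
print(match_role_stdlib("Astronaut"))   # None
```

The cutoff plays the same role as `threshold=80` above, expressed on a 0 to 1 scale instead of 0 to 100.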


**Embeddings + Cosine similarity → % Match**

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

# keep only the job descriptions for the matched role
sub = df_jd[df_jd["Role"].str.lower() == matched_role.lower()].copy()
assert len(sub) > 0, f"No JD found for role: {matched_role}"

# embeddings
res_emb = model.encode([resume_clean], normalize_embeddings=True)
jd_emb  = model.encode(sub["CleanedJobDescription"].tolist(), batch_size=64, normalize_embeddings=True)

# cosine similarity -> percentage match
sims = cosine_similarity(res_emb, jd_emb)[0]             # shape: [n_jds_for_role]
perc = np.clip(sims, 0, 1) * 100.0
sub["Match_%"] = perc

perc_match = sub["Match_%"].max()

print(f"Role: {matched_role}")
print(f"Percent match: {perc_match:.2f}%")



Role: Data Analyst
Percent match: 41.61%
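
The conversion from cosine similarity to a percentage can be checked by hand on tiny vectors. The sketch below is pure Python and uses made-up 2-D vectors in place of real sentence embeddings; the helper names `normalize` and `percent_match` are assumptions. It mirrors the cell above: normalize both vectors, take the dot product, clip to [0, 1], and multiply by 100.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def percent_match(a, b):
    """Cosine similarity of a and b, clipped to [0, 1] and scaled to percent."""
    a, b = normalize(a), normalize(b)
    cos = sum(x * y for x, y in zip(a, b))   # dot product of unit vectors
    return min(max(cos, 0.0), 1.0) * 100.0

same = percent_match([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = percent_match([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = percent_match([1.0, 0.0], [-1.0, 0.0])   # negative cosine, clipped to 0
diagonal = percent_match([1.0, 1.0], [1.0, 0.0])    # 45 degrees apart

print(same, orthogonal, opposite, round(diagonal, 2))
```

Clipping means a negative cosine is reported as 0% rather than as a negative match, which is the behavior of `np.clip(sims, 0, 1)` in the cell above.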


**Cache JD embeddings to avoid recomputation**

In [61]:
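import pickle

# jd_emb only covers the rows in `sub` (the matched role), so cache the roles
# and texts from `sub` as well to keep all three entries aligned row by row.
with open("jd_embeddings_cache.pkl", "wb") as f:
    pickle.dump({"roles": sub["Role"].tolist(),
                 "jd_texts": sub["CleanedJobDescription"].tolist(),
                 "embeddings": jd_emb}, f)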
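
A cache only saves work if a later run reads it back. The following is a minimal sketch of that read path, assuming the same dictionary layout as above; the helper `load_cache` and the toy payload (plain lists instead of embedding arrays) are illustrative assumptions. Keep in mind that `pickle` should only be used on files you created yourself.

```python
import os
import pickle
import tempfile

def load_cache(path):
    """Return the cached dict, or None when the cache file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a toy payload written to a temporary directory
tmp_dir = tempfile.mkdtemp()
cache_path = os.path.join(tmp_dir, "jd_embeddings_cache.pkl")

missing = load_cache(cache_path)          # nothing written yet
payload = {"roles": ["Data Analyst"],
           "jd_texts": ["a data analyst collects cleans and interprets data"],
           "embeddings": [[0.1, 0.2, 0.3]]}
with open(cache_path, "wb") as f:
    pickle.dump(payload, f)
restored = load_cache(cache_path)

print(missing, restored["roles"])
```

When `load_cache` returns `None`, the caller would encode the job descriptions and write the cache; otherwise it can reuse `restored["embeddings"]` directly.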


**Skills overlap**

In [None]:
SKILLS = {
  "python","pandas","numpy","scikit-learn","tensorflow","pytorch","sql","tableau","power bi",
  "react","node","docker","kubernetes","aws","azure","gcp","java","spring","c#","dotnet",".net",
  "javascript","typescript","html","css","django","flask","spark","airflow","hadoop","kafka",
  "terraform","git","ci","cd","jenkins","linux","bash","rest","graphql","firebase","swift","kotlin"
}

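def extract_skills(text: str):
    found = set()
    for sk in SKILLS:
        # clean_text turns hyphens into spaces, so look for "scikit learn"
        pat = re.escape(sk.replace("-", " "))
        # \b fails next to "#" and ".", so use alphanumeric lookarounds
        if re.search(rf"(?<![a-z0-9]){pat}(?![a-z0-9])", text):
            found.add(sk)
    return found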

resume_sk = extract_skills(resume_clean)
sub["JD_skills_set"] = sub["CleanedJobDescription"].apply(extract_skills)
sub["SkillOverlap_%"] = sub["JD_skills_set"].apply(
    lambda s: (len(s & resume_sk) / max(1, len(s))) * 100.0
)

# Blend the scores: 0.7 semantic + 0.3 skill overlap (the weights are adjustable)
sub["Final_%"] = 0.7*sub["Match_%"] + 0.3*sub["SkillOverlap_%"]
print(sub.sort_values("Final_%", ascending=False).head(3)[["Role","Match_%","SkillOverlap_%","Final_%","JobDescription"]])


            Role    Match_%  SkillOverlap_%    Final_%  \
10  Data Analyst  41.606106            50.0  44.124273   

                                       JobDescription  
10  A Data Analyst collects, cleans, and interpret...  
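
The arithmetic behind `SkillOverlap_%` and `Final_%` can be reproduced with plain sets. In the sketch below the two skill sets are hypothetical (the notebook does not print the actual sets), and the helper names are assumptions; only the semantic score 41.606106 and the 0.7/0.3 weights are taken from the cells above.

```python
def skill_overlap_pct(jd_skills, resume_skills):
    """Share of the JD's skills that also appear in the resume, in percent."""
    return len(jd_skills & resume_skills) / max(1, len(jd_skills)) * 100.0

def final_score(match_pct, overlap_pct, w_semantic=0.7, w_skill=0.3):
    """Weighted blend of the semantic match and the skill overlap."""
    return w_semantic * match_pct + w_skill * overlap_pct

# Hypothetical skill sets, chosen so that half of the JD skills are covered
jd_skills = {"sql", "python", "tableau", "power bi"}
resume_skills = {"python", "sql", "pandas", "numpy", "git", "javascript"}

overlap = skill_overlap_pct(jd_skills, resume_skills)
final = final_score(41.606106, overlap)

print(overlap)           # 50.0
print(round(final, 4))   # 44.1243
```

The denominator is the size of the JD skill set, so the overlap measures how much of the job's requirements the resume covers, not how much of the resume is relevant. `max(1, ...)` keeps the score at 0 instead of dividing by zero when a JD lists no known skills.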
