**Installed packages:**

pypdf - For PDF file reading

sentence-transformers - For text embeddings

pandas - For data manipulation

numpy - For numerical operations

scikit-learn - For machine learning utilities

In [9]:
import re
import pandas as pd
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

**Read the PDF file**

In [20]:
pdf_path = 'data/sample_resume.pdf'  # Path to the PDF file

def read_pdf_text(pdf_path: str) -> str:
    try:
        reader = PdfReader(pdf_path)
        chunks = []
        for page in reader.pages:
            txt = page.extract_text() or ""
            chunks.append(txt)
        # Join once, after every page has been collected
        full_text = "\n".join(chunks).strip()
        return full_text
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return ""
    
resume_text = read_pdf_text(pdf_path)
print(resume_text)

Supawan Kongsapcharoen 
Location: Chiang Rai, Thailand 
Phone: +66 95-581-0440 | Email: supawankongsapcharoen@gmail.com 
PROFILE SUMMARY 
Analytical and detail-oriented Software Engineering student (Year 3) passionate about Data Analysis and 
Machine Learning. Skilled in SQL, Python, and data visualization, with experience as a Teaching Assistant 
in Database Systems and several academic projects involving data cleaning, analytics, and predictive 
modeling. Seeking an internship opportunity to apply analytical thinking and problem-solving skills in real-
world datasets. 
TECHNICAL SKILLS 
Languages: Python, SQL, JavaScript (basic) 
Libraries & Tools: Pandas, NumPy, Matplotlib, Scikit-learn 
Databases: MySQL, MongoDB 
Version Control: Git, GitHub 
Soft Skills: Analytical Thinking, Communication, Team Collaboration, Problem Solving 
PROJECT EXPERIENCE 
Machine Learning Resume Classifier - Python, Scikit-learn, TF-IDF | 2025 
• Built a machine learning model to classify resumes into job c

**Clean the resume text**

In [21]:
def clean_text(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r'http\S+|www\S+|\S+@\S+', ' ', s)
    s = re.sub(r'[^a-z0-9\s]', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

resume_text = clean_text(resume_text)
print(resume_text)

supawan kongsapcharoen location chiang rai thailand phone 66 95 581 0440 email profile summary analytical and detail oriented software engineering student year 3 passionate about data analysis and machine learning skilled in sql python and data visualization with experience as a teaching assistant in database systems and several academic projects involving data cleaning analytics and predictive modeling seeking an internship opportunity to apply analytical thinking and problem solving skills in real world datasets technical skills languages python sql javascript basic libraries tools pandas numpy matplotlib scikit learn databases mysql mongodb version control git github soft skills analytical thinking communication team collaboration problem solving project experience machine learning resume classifier python scikit learn tf idf 2025 built a machine learning model to classify resumes into job categories e g data science hr processed 4 000 text samples using data cleaning and feature ex
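
To see what each regex in `clean_text` does, the sketch below runs it on a made-up sentence. It also shows a side effect visible in the output above and in the job descriptions: symbols are stripped, so `C#` becomes `c` and `.NET` becomes `net`. The `PROTECTED` mapping and `clean_text_keep_terms` are hypothetical additions, not part of the original notebook, and only illustrate one possible way to keep such terms.

```python
import re

def clean_text(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r'http\S+|www\S+|\S+@\S+', ' ', s)  # drop URLs and e-mail addresses
    s = re.sub(r'[^a-z0-9\s]', ' ', s)             # drop punctuation and symbols
    s = re.sub(r'\s+', ' ', s).strip()             # collapse runs of whitespace
    return s

# Hypothetical mapping: rewrite symbol-heavy skill names before cleaning
PROTECTED = {"c#": "csharp", ".net": "dotnet", "c++": "cpp"}

def clean_text_keep_terms(s: str) -> str:
    s = str(s).lower()
    for term, token in PROTECTED.items():
        s = s.replace(term, token)
    return clean_text(s)

sample = "Senior C#/.NET dev, email: a.b@x.com, see https://example.com!"
plain = clean_text(sample)
kept = clean_text_keep_terms(sample)
print(plain)  # senior c net dev email see
print(kept)   # senior csharp dotnet dev email see
```

Whether protecting these terms improves matching depends on the embedding model, so this is worth comparing rather than assuming.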

**Merge job description columns** for use with NLP (TF-IDF/Sentence-BERT)

In [22]:
df = pd.read_csv('data/job_dataset.csv')  # Load job descriptions dataset

def build_jd_text(df: pd.DataFrame) -> pd.DataFrame:
    title_cols = ['Title']
    desc_cols  = ['Responsibilities']
    skill_cols = ['Skills']

    def safe_get(row, cols):
        # Skip missing columns and NaN cells so the string "nan" never leaks into the text
        return " ".join(str(row[c]) for c in cols if c in df.columns and pd.notna(row[c]))

    jd_texts = []
    for _, row in df.iterrows():
        parts = []
        if title_cols: parts.append(safe_get(row, title_cols))
        if desc_cols:  parts.append(safe_get(row, desc_cols))
        if skill_cols: parts.append(safe_get(row, skill_cols))
        jd_texts.append(" ".join(parts))
    df['jd_text_raw'] = jd_texts
    df['jd_text'] = df['jd_text_raw'].fillna("").apply(clean_text)
    return df

df = build_jd_text(df)
print(df[['jd_text_raw', 'jd_text']].head())

                                         jd_text_raw  \
0  .NET Developer Assist in coding and debugging ...   
1  .NET Developer Write simple C# programs under ...   
2  .NET Developer Contribute to development of sm...   
3  .NET Developer Support in software design docu...   
4  .NET Developer Learn to design and build ASP.N...   

                                             jd_text  
0  net developer assist in coding and debugging a...  
1  net developer write simple c programs under gu...  
2  net developer contribute to development of sma...  
3  net developer support in software design docum...  
4  net developer learn to design and build asp ne...  
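
The row-by-row loop in `build_jd_text` can also be written with column operations. The sketch below is one possible alternative, shown on a tiny made-up DataFrame (the rows are invented; only the column names `Title`, `Responsibilities`, and `Skills` come from the dataset). It fills missing cells before joining, so a missing value never turns into the literal text `nan`.

```python
import pandas as pd

# Toy data for illustration only; the real notebook loads data/job_dataset.csv
jobs = pd.DataFrame({
    "Title": ["Data Analyst", "Data Engineer"],
    "Responsibilities": ["Build dashboards", None],  # one missing cell
    "Skills": ["SQL; Python", "Spark"],
})

cols = ["Title", "Responsibilities", "Skills"]
present = [c for c in cols if c in jobs.columns]  # tolerate absent columns

jobs["jd_text_raw"] = (
    jobs[present]
    .fillna("")                 # missing cells become empty strings
    .astype(str)
    .agg(" ".join, axis=1)      # join the columns of each row
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

merged = jobs["jd_text_raw"].tolist()
print(merged)
```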


**Load Job description dataset**

In [34]:
df_jd = pd.read_csv("data/job_dataset.csv")
print(df_jd.columns.tolist())

df_jd = build_jd_text(df_jd)
print(df_jd.columns.tolist())

print("-"*80)

jd_role_col = 'Title'
print(df_jd[[jd_role_col, 'jd_text']].head(3))

['JobID', 'Title', 'ExperienceLevel', 'YearsOfExperience', 'Skills', 'Responsibilities', 'Keywords']
['JobID', 'Title', 'ExperienceLevel', 'YearsOfExperience', 'Skills', 'Responsibilities', 'Keywords', 'jd_text_raw', 'jd_text']
--------------------------------------------------------------------------------
            Title                                            jd_text
0  .NET Developer  net developer assist in coding and debugging a...
1  .NET Developer  net developer write simple c programs under gu...
2  .NET Developer  net developer contribute to development of sma...
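
The merged `jd_text` column can also feed a lexical baseline such as TF-IDF, which the earlier heading mentions but the notebook does not show. The sketch below is an assumption about how that baseline might look, using scikit-learn's `TfidfVectorizer` on three invented job texts and an invented resume string rather than the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented, already-cleaned texts for illustration only
jd_texts = [
    "data analyst sql python dashboards",
    "net developer c asp net web api",
    "machine learning engineer python scikit learn",
]
resume = "python sql data analysis machine learning"

vectorizer = TfidfVectorizer()
jd_matrix = vectorizer.fit_transform(jd_texts)  # vocabulary is learned from the job texts
resume_vec = vectorizer.transform([resume])     # resume is projected onto that vocabulary

tfidf_scores = cosine_similarity(resume_vec, jd_matrix)[0]
for text, score in zip(jd_texts, tfidf_scores):
    print(f"{score:.3f}  {text}")
```

TF-IDF only rewards shared words, so the second text scores zero here, while sentence embeddings can still give partial credit to related wording.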


**Create Embeddings and Calculate Similarity**

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
res_emb = model.encode([resume_text], normalize_embeddings=True)
jd_emb  = model.encode(df_jd['jd_text'].tolist(), batch_size=64, normalize_embeddings=True)

# Calculate cosine similarity
S = cosine_similarity(res_emb, jd_emb)[0]
df_jd['match_percent_semantic'] = S * 100

print("Sample matches (first 10 rows):")
print(df_jd[[jd_role_col, 'match_percent_semantic']].head(10))


Sample matches (first 10 rows):
            Title  match_percent_semantic
0  .NET Developer               45.426300
1  .NET Developer               48.971169
2  .NET Developer               46.751400
3  .NET Developer               43.796604
4  .NET Developer               48.460060
5  .NET Developer               36.956669
6  .NET Developer               37.137115
7  .NET Developer               42.378448
8  .NET Developer               45.590660
9  .NET Developer               42.485508
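
Because the embeddings are created with `normalize_embeddings=True`, every vector has unit length, and cosine similarity reduces to a plain dot product. The sketch below checks this with NumPy on random vectors that stand in for real embeddings (the shapes and the `l2_normalize` helper are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
resume_vec = rng.normal(size=(1, 8))  # stand-in for one resume embedding
jd_vecs = rng.normal(size=(5, 8))     # stand-in for five job embeddings

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Dot product of unit vectors
dot_scores = (l2_normalize(resume_vec) @ l2_normalize(jd_vecs).T)[0]

# Full cosine formula on the raw vectors: a.b / (|a| |b|)
cos_scores = (resume_vec @ jd_vecs.T)[0] / (
    np.linalg.norm(resume_vec) * np.linalg.norm(jd_vecs, axis=1)
)

print(np.allclose(dot_scores, cos_scores))  # True
```

Cosine similarity ranges from -1 to 1, so multiplying by 100 gives a score rather than a true percentage, and it can be negative in principle.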


**Summarize % match by role**

In [38]:
role_group = (df_jd
              .groupby(jd_role_col)['match_percent_semantic']
              .agg(['max','mean','count'])
              .sort_values('max', ascending=False)
              .reset_index()
              .rename(columns={'max':'best_match_%','mean':'avg_match_%','count':'#_jds'}))

print(role_group.head(10))

                                  Title  best_match_%  avg_match_%  #_jds
0          Data Scientist - Entry Level     68.477226    68.477226      1
1  Data Analyst / Data Scientist Intern     68.438301    68.438301      1
2                Data Science Associate     67.419937    67.419937      1
3      Junior Machine Learning Engineer     66.868622    63.006714      2
4            Entry Level Data Scientist     65.772842    65.772842      1
5            Data Analyst - Experienced     64.956085    60.334721     11
6                   Data Science Intern     64.428955    64.428955      1
7              Principal Data Scientist     64.385292    64.385292      1
8          Data Scientist - Experienced     63.590973    63.590973      1
9                         Data Engineer     62.546886    50.600952     20
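
Several of the top roles above have only one job description, so their best and average scores are identical and the ranking by `max` favors single postings. One possible refinement, sketched below on invented scores, is to keep only roles with a minimum number of postings and rank them by average. The `min_jds` threshold is a hypothetical parameter, not something the notebook defines.

```python
import pandas as pd

# Invented scores for illustration only
scores = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "C", "C"],
    "match_percent_semantic": [60.0, 50.0, 40.0, 68.0, 58.0, 46.0],
})

summary = (
    scores.groupby("Title")["match_percent_semantic"]
    .agg(best="max", avg="mean", n="count")  # named aggregation
    .reset_index()
)

min_jds = 2  # hypothetical threshold: ignore roles with fewer postings
reliable = summary[summary["n"] >= min_jds].sort_values("avg", ascending=False)
print(reliable)
```

Role `B` has the single highest score but is dropped because it rests on one posting; `C` then ranks above `A` on average.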
