![](image.jpg)

As a Data Analyst at a leading global HR consultancy, your mission is to delve into an extensive database of resumes to identify suitable candidates for tech-focused roles. This task involves using regular expressions to extract key data points and applying data preprocessing techniques to organize this information effectively.

## Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |
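
Before extracting anything, it can help to confirm that the loaded data matches the schema above. The sketch below is illustrative only: it uses a hypothetical two-row in-memory stand-in for `resumes.csv` rather than the real file, and the `Int64` conversion is an optional suggestion, not a step the project requires.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for resumes.csv (not real data)
sample_csv = io.StringIO(
    "ID,Resume_str,Category\n"
    "1.0,DATA ANALYST Summary Skilled in sql and excel,HR\n"
    "2.0,LEAD TEACHER Summary Classroom management,TEACHER\n"
)
sample = pd.read_csv(sample_csv)

# Check that every column named in the summary table is present
expected_columns = ["ID", "Resume_str", "Category"]
missing = [col for col in expected_columns if col not in sample.columns]
print("Missing columns:", missing)

# ID is read as a float; a nullable integer dtype is one way to make it
# read like an identifier again (assumes every ID is a whole number)
id_as_int = sample["ID"].astype("Int64")
print(sample.dtypes)
```

Running the same checks on the real `resumes` DataFrame would flag a renamed or missing column before any regular expression is applied.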

## Let's Get Started!

This project applies regular expressions and data preprocessing to a real-world HR challenge: finding the resumes that fit tech-focused roles, so that tech talent is matched with the right job. Let's begin!


In [1]:
import pandas as pd
import re

# Load the resume dataset from a CSV file into a DataFrame
resumes = pd.read_csv('resumes.csv')
resumes.sample(3)

|      | ID         | Resume_str                               | Category      |
|------|------------|------------------------------------------|---------------|
| 1284 | 11270462.0 | SOCIAL MEDIA MANAGER Summary ...         | DIGITAL-MEDIA |
| 397  | 11336022.0 | LEAD TEACHER Summary To secu...          | TEACHER       |
| 801  | 19037403.0 | PROFESSIONAL FITNESS TRAINER, GROUP I... | FITNESS       |
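
The sample rows above open with an upper-case job title followed by a heading such as "Summary". As a minimal sketch of how a regular expression can isolate that title, the block below applies the same pattern as the notebook's `job_title_regex` to a few hypothetical resume openings. The strings and the `extract_title` helper are made up for illustration.

```python
import re

# Same pattern as the notebook's job-title regex: a leading run of upper-case
# letters, whitespace, periods, commas, and hyphens, ending at a word boundary
job_title_regex = r"^[A-Z\s\.\,\-]+\b"

# Hypothetical resume openings modelled on the sample rows (not real data)
openings = [
    "SOCIAL MEDIA MANAGER       Summary     Creative marketer ...",
    "E-LEARNING DESIGNER   Professional Summary ...",
    "summary of qualifications ...",
]

def extract_title(text):
    """Return the leading upper-case title, or "" when the text does not start with one."""
    match = re.search(job_title_regex, text)
    return match.group(0).strip() if match else ""

titles = [extract_title(opening) for opening in openings]
print(titles)  # ['SOCIAL MEDIA MANAGER', 'E-LEARNING DESIGNER', '']
```

The greedy character class first swallows the capital "S" of "Summary", but the trailing `\b` cannot match between "S" and "u", so the engine backtracks to the boundary just before that word. That is why the title comes out clean once `.strip()` removes the trailing spaces.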


In [2]:
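# `re` is already imported in the first cell.

# Job title: the leading run of upper-case letters, whitespace, periods,
# commas, and hyphens at the very start of the resume
job_title_regex = r"^[A-Z\s\.\,\-]+\b"
# Tech skills: whole-word, case-sensitive match on lower-case skill names only
tech_skills_regex = r"\b(python|sql|r|excel)\b"
# Education: whole-word match on the degree level
education_regex = r"\b(PhD|Master|Bachelor)\b"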

def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, stripped, or "" if there is none."""
    match = re.search(pattern, text)
    return match.group(0).strip() if match is not None else ""

# Extract one job title, one tech skill, and one education level per resume
resumes['job_title'] = [first_match(job_title_regex, text) for text in resumes['Resume_str']]
resumes['tech_skills'] = [first_match(tech_skills_regex, text) for text in resumes['Resume_str']]
resumes['education'] = [first_match(education_regex, text) for text in resumes['Resume_str']]

# Keep only resumes where all three fields were extracted
has_all_fields = (
    (resumes['job_title'] != "")
    & (resumes['tech_skills'] != "")
    & (resumes['education'] != "")
)
resumes_filtered = resumes[has_all_fields]

# Take an explicit copy so that later changes do not trigger SettingWithCopyWarning
candidates_df = resumes_filtered[['ID', 'job_title', 'tech_skills', 'education']].copy()
candidates_df.columns = candidates_df.columns.str.lower()
candidates_df = candidates_df.dropna()

candidates_df.head()

|     | id         | job_title                   | tech_skills | education |
|-----|------------|-----------------------------|-------------|-----------|
| 70  | 20993320.0 | HR COORDINATOR              | r           | Master    |
| 71  | 14640322.0 | HR GENERALIST               | excel       | Bachelor  |
| 80  | 25724495.0 | REGIONAL HR MANAGER         | excel       | Bachelor  |
| 114 | 11919526.0 | E-LEARNING DESIGNER         | r           | Master    |
| 213 | 93301686.0 | LEAD INSTRUCTIONAL DESIGNER | excel       | Bachelor  |
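
The output above holds a single skill per resume because `re.search` stops at the first match, and the skill pattern only matches lower-case spellings. One possible extension, which is not part of the original analysis, is to collect every distinct skill case-insensitively with `re.findall`. The `all_skills` helper and the snippet below are hypothetical.

```python
import re

# Assumption: same skill list as the notebook, but matched case-insensitively
tech_skills_regex = re.compile(r"\b(python|sql|r|excel)\b", flags=re.IGNORECASE)

def all_skills(text):
    """Return the distinct skills in `text`, lower-cased, in order of first appearance."""
    found = (skill.lower() for skill in tech_skills_regex.findall(text))
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(found))

# Hypothetical resume snippet (not taken from resumes.csv)
snippet = "Skills: Python, SQL, Excel and R. Built SQL reports."
skills = all_skills(snippet)
print(skills)  # ['python', 'sql', 'excel', 'r']
```

A standalone "R" remains an ambiguous token: with case-insensitive matching it would also fire on strings such as "R&D", so results for that skill are worth spot-checking by hand.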
