
##Case Study: Job Resume Improvement

Our goal is to extract common data science skills from the downloaded job postings.
Then we’ll compare these skills to our resume to determine which skills are
missing. 

We will do so as follows:

1. Parse all text from the downloaded HTML files.
2. Explore the parsed output to learn how job skills are described in online
postings. We’ll pay particular attention to whether certain HTML tags are
more associated with skill descriptions.
3. Attempt to filter any irrelevant job postings from our dataset.
4. Cluster job skills based on text similarity.
5. Visualize the clusters using word clouds.
6. Adjust clustering parameters, if necessary, to improve the visualized output.
7. Compare the clustered skills to our resume to uncover missing skills.

##Setup

In [None]:
!pip install bs4

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [14]:
import glob
import time
import numpy as np
import pandas as pd

from collections import defaultdict
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer

from bs4 import BeautifulSoup as bs

import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [None]:
%%shell

wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-4--job-resume-improvement/job_postings.zip
wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-4--job-resume-improvement/resume.txt
wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-4--job-resume-improvement/table_of_contents.txt

unzip -qq job_postings.zip
rm -rf job_postings.zip

##Extracting skills from job postings

Let's explore HTML files in the job_postings directory.

In [5]:
# Loading HTML files
html_contents = []
for file_name in sorted(glob.glob("job_postings/*.html")):
  with open(file_name, "r") as f:
    html_contents.append(f.read())
print(f"We have loaded {len(html_contents)} HTML files.")

We have loaded 1458 HTML files.


In [6]:
# Parsing HTML files
soup_objects = []
for html in html_contents:
  soup = bs(html)
  assert soup.title is not None
  assert soup.body is not None
  soup_objects.append(soup)

In [7]:
# Checking title and body texts for duplicates
html_dict = {"Title": [], "Body": []}

for soup in soup_objects:
  title = soup.find("title").text
  body = soup.find("body").text
  html_dict["Title"].append(title)
  html_dict["Body"].append(body)

df_jobs = pd.DataFrame(html_dict)
summary = df_jobs.describe()
summary

| | Title | Body |
|---|---|---|
| count | 1458 | 1458 |
| unique | 1364 | 1458 |
| top | Data Scientist - New York, NY | Data Scientist - Beavercreek, OH\nData Scienti... |
| freq | 13 | 1 |
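
The summary above shows 1,364 unique titles among 1,458 postings, while all 1,458 bodies are unique, so no posting is an exact duplicate even though some titles repeat. The following minimal sketch shows how the same check can be read off `describe()` programmatically. The toy DataFrame and the variable names are hypothetical stand-ins, not part of the job postings data.

```python
import pandas as pd

# Hypothetical toy data: two postings share a title, but all bodies differ
toy = pd.DataFrame({
    "Title": [
        "Data Scientist - New York, NY",
        "Data Scientist - New York, NY",
        "Data Analyst - Austin, TX",
    ],
    "Body": ["Posting A text", "Posting B text", "Posting C text"],
})

summary = toy.describe()

# Number of titles that repeat an earlier title
duplicate_titles = summary.loc["count", "Title"] - summary.loc["unique", "Title"]

# True when every body text is distinct
bodies_all_unique = summary.loc["unique", "Body"] == summary.loc["count", "Body"]

print(f"Repeated titles: {duplicate_titles}")
print(f"All bodies unique: {bodies_all_unique}")
```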


###Exploring the HTML

We start our exploration by rendering the HTML.

In [9]:
# Rendering the HTML of the first job posting
assert len(set(html_contents)) == len(html_contents)

display(HTML(html_contents[0]))

In [10]:
# Rendering the HTML of the second job posting
display(HTML(html_contents[1]))

In [11]:
# Extracting bullets from the HTML
df_jobs["Bullets"] = [[bullet.text.strip() for bullet in soup.find_all("li")] for soup in soup_objects]

In [13]:
# Measuring the percent of bulleted postings
bulleted_post_count = 0
for bullet_list in df_jobs.Bullets:
  if bullet_list:
    bulleted_post_count += 1

percent_bulleted = 100 * bulleted_post_count / df_jobs.shape[0]
print(f"{percent_bulleted:.2f}% of the postings contain bullets")

90.53% of the postings contain bullets
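
To make the bullet extraction concrete, here is a small sketch that applies the same `find_all("li")` logic to two hypothetical HTML strings. The toy pages and the explicit `"html.parser"` argument are assumptions made for illustration; they are not taken from the downloaded postings.

```python
from bs4 import BeautifulSoup

# Hypothetical toy pages: one with a bulleted list, one without
toy_pages = [
    "<html><title>A</title><body><p>Intro</p>"
    "<ul><li> Python </li><li>SQL</li></ul></body></html>",
    "<html><title>B</title><body><p>No list here</p></body></html>",
]

# Collect the stripped text of every <li> tag in each page
toy_bullets = [
    [li.text.strip() for li in BeautifulSoup(page, "html.parser").find_all("li")]
    for page in toy_pages
]

# A page counts as bulleted when its bullet list is non-empty
percent = 100 * sum(bool(bullets) for bullets in toy_bullets) / len(toy_bullets)
print(toy_bullets)
print(f"{percent:.2f}% of the toy pages contain bullets")
```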


Do all (or most) of these bullets focus on
skills? 

We currently don’t know. However, we can better gauge the contents of the bullet
points by printing the top-ranked words in their text. 

We can rank these words by occurrence count; alternatively, we can rank them by
term frequency-inverse document frequency (TFIDF) values rather than raw counts.

As we know, TFIDF rankings are less likely to be dominated by irrelevant words, since TFIDF down-weights words that appear in most documents.

In [15]:
# Examining the top-ranked words in the HTML bullets
def rank_words(text_list):
  vectorizer = TfidfVectorizer(stop_words="english")
  tfidf_matrix = vectorizer.fit_transform(text_list).toarray()
  # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
  df = pd.DataFrame({"Words": vectorizer.get_feature_names_out(), "Summed TFIDF": tfidf_matrix.sum(axis=0)})
  sorted_df = df.sort_values("Summed TFIDF", ascending=False)
  return sorted_df

In [16]:
all_bullets = []
for bullet_list in df_jobs.Bullets:
  all_bullets.extend(bullet_list)

sorted_df = rank_words(all_bullets)
print(sorted_df[:5].to_string(index=False))

     Words  Summed TFIDF
experience    878.030398
      data    842.978780
    skills    440.780236
      work    371.684232
   ability    370.969638


In [17]:
# Examining the top-ranked words in the non-bullet text of the HTML bodies
non_bullets = []
for soup in soup_objects:
  body = soup.body
  for tag in body.find_all("li"):
    tag.decompose()
  non_bullets.append(body.text)

sorted_df = rank_words(non_bullets)
print(sorted_df[:5].to_string(index=False))

     Words  Summed TFIDF
      data     99.111312
      team     39.175041
      work     38.928948
experience     36.820836
  business     36.140488
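
Note that `decompose()` edits the parse tree in place, so after the cell above runs, the `<li>` tags are gone from `soup_objects` as well. If the original soups are needed later, one option is to strip the bullets from a copy. This is a minimal sketch under that assumption, using a hypothetical toy page; it relies on Beautiful Soup's support for `copy.copy()` on tags.

```python
import copy
from bs4 import BeautifulSoup

# Hypothetical toy page with one paragraph and one bullet
soup = BeautifulSoup(
    "<html><body><p>We need a data scientist.</p>"
    "<ul><li>Python</li></ul></body></html>",
    "html.parser",
)

# Work on a copy of the body so the original parse tree stays intact
body_copy = copy.copy(soup.body)
for tag in body_copy.find_all("li"):
    tag.decompose()

non_bullet_text = body_copy.text
print(non_bullet_text)
print(len(soup.find_all("li")))  # bullets still present in the original soup
```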


In [18]:
# Checking titles for references to data science positions
# Non-capturing group avoids the pandas match-group warning in str.contains
regex = r"Data Scien(?:ce|tist)"
df_non_ds_jobs = df_jobs[~df_jobs.Title.str.contains(regex, case=False)]

percent_non_ds = 100 * df_non_ds_jobs.shape[0] / df_jobs.shape[0]
print(f"{percent_non_ds:.2f}% of the job posting titles do not mention a data science position. Below is a sample of such titles:\n")

for title in df_non_ds_jobs.Title[:10]:
  print(title)

64.33% of the job posting titles do not mention a data science position. Below is a sample of such titles:

Patient Care Assistant / PCA - Med/Surg (Fayette, AL) - Fayette, AL
Data Manager / Analyst - Oakland, CA
Scientific Programmer - Berkeley, CA
JD Digits - AI Lab Research Intern - Mountain View, CA
Operations and Technology Summer 2020 Internship-West Coast - Universal City, CA
Data and Reporting Analyst - Olympia, WA 98501
Senior Manager Advanced Analytics - Walmart Media Group - San Bruno, CA
Data Specialist, Product Support Operations - Sunnyvale, CA
Deep Learning Engineer - Westlake, TX
Research Intern, 2020 - San Francisco, CA 94105
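
The title filter can be checked on a handful of titles before trusting the percentage. In the sketch below, the toy titles are hypothetical, and the pattern uses a non-capturing group, which matches the same titles as a capturing one.

```python
import pandas as pd

# Hypothetical toy titles
titles = pd.Series([
    "Data Scientist - New York, NY",
    "Senior DATA SCIENCE Manager - Austin, TX",
    "Deep Learning Engineer - Westlake, TX",
])

regex = r"Data Scien(?:ce|tist)"

# case=False makes the match case-insensitive
is_ds = titles.str.contains(regex, case=False)
percent_non_ds = 100 * (~is_ds).sum() / len(titles)

print(is_ds.tolist())
print(f"{percent_non_ds:.2f}% of the toy titles do not mention data science")
```

A title-only filter is coarse: a title such as "Deep Learning Engineer" is flagged as non-data-science even though the posting may be relevant, which is why title matching alone is not enough to decide relevance.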


In [19]:
# Sampling bullets from a non-data science job
bullets = df_non_ds_jobs.Bullets.iloc[0]
for i, bullet in enumerate(bullets):
  print(f"{i}: {bullet.strip()}")

0: Provides all personal care services in accordance with the plan of treatment assigned by the registered nurse
1: Accurately documents care provided
2: Applies safety principles and proper body mechanics to the performance of specific techniques of personal and supportive care, such as ambulation of patients, transferring patients, assisting with normal range of motions and positioning
3: Participates in economical utilization of supplies and ensures that equipment and nursing units are maintained in a clean, safe manner
4: Routinely follows and adheres to all policies and procedures
5: Assists in performance improvement (PI) activities by serving on PI teams as warranted, assisting with PI measures and supporting and implementing changes necessary for improvement
6: Maintains performance, patient and employee satisfaction and financial standards as outlined in the performance evaluation
7: Performs compliance requirements as outlined in the Employee Handbook
8: Must adhere to the DC

We are data scientists; our primary objective isn’t patient care (index 0) or nursing equipment maintenance (index 3).

We need to delete these skills from our dataset,
but how? 

One approach is to use text similarity.

In essence, we should evaluate the relevance of each job posting relative to both our
resume and the book material (the downloaded table_of_contents.txt); this would allow us
to filter out the extraneous postings and retain only the most relevant jobs.
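
One way to picture this relevance scoring is cosine similarity between TFIDF vectors. The sketch below is only an illustration under stated assumptions: the toy postings and the `reference_text` string are hypothetical stand-ins for the real postings and for the combined resume and book material, and the actual filtering steps may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy postings
postings = [
    "Build machine learning models in Python with pandas",
    "Provide patient care and document treatment plans",
    "Analyze data with statistics and Python",
]

# Hypothetical stand-in for the resume plus book material
reference_text = "Python pandas machine learning statistics clustering data"

# Vectorize postings and reference together so they share one vocabulary
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(postings + [reference_text])
posting_vectors, reference_vector = tfidf[:-1], tfidf[-1:]

# One similarity score per posting, relative to the reference text
similarities = cosine_similarity(posting_vectors, reference_vector).ravel()

# Indices of postings from most to least relevant
ranked = np.argsort(similarities)[::-1]
for index in ranked:
    print(f"{similarities[index]:.3f}  {postings[index]}")
```

The patient care posting shares no non-stop-word terms with the reference text, so its similarity is zero and it lands at the bottom of the ranking.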

We should accomplish our goal as follows:

1. Obtain relevant job postings that partially match our existing skill set.
2. Examine which bullet points in these postings are missing from our existing
skill set.

With this strategy in mind, we’ll now filter the jobs by relevance.

##Filtering jobs by relevance