<a href="https://colab.research.google.com/github/rahiakela/data-science-research-and-practice/blob/main/data-science-bookcamp/case-study-4--job-resume-improvement/05_case_study_job_resume_improvement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Case Study: Job Resume Improvement

Our goal is to extract common data science skills from the downloaded job postings.
Then we’ll compare these skills to our resume to determine which skills are
missing. 

We will do so as follows:

1. Parse all text from the downloaded HTML files.
2. Explore the parsed output to learn how job skills are described in online
postings. We’ll pay particular attention to whether certain HTML tags are
more associated with skill descriptions.
3. Attempt to filter any irrelevant job postings from our dataset.
4. Cluster job skills based on text similarity.
5. Visualize the clusters using word clouds.
6. Adjust clustering parameters, if necessary, to improve the visualized output.
7. Compare the clustered skills to our resume to uncover missing skills.
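
Step 2 asks which HTML tags carry the skill descriptions. As a minimal sketch of that idea, the snippet below groups the text of a toy posting by its enclosing tag. The HTML string and the `TagTextCollector` class are invented for illustration, and the sketch uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without the downloaded files. It is not the method used in the rest of this notebook.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Toy posting, invented for illustration; it is not from job_postings/.
toy_html = """
<html><head><title>Data Scientist - Example City</title></head>
<body>
<p>We are hiring a data scientist.</p>
<ul>
<li>Experience with Python</li>
<li>Knowledge of statistics</li>
</ul>
</body></html>
"""

class TagTextCollector(HTMLParser):
    """Group non-empty text fragments by the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                   # stack of currently open tags
        self.text_by_tag = defaultdict(list)  # tag name -> text fragments

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.text_by_tag[self.open_tags[-1]].append(text)

collector = TagTextCollector()
collector.feed(toy_html)
for tag, fragments in collector.text_by_tag.items():
    print(tag, fragments)
```

On real postings, comparing the text collected under each tag is one way to judge whether a particular tag tends to hold skill descriptions.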

##Setup

In [None]:
!pip install bs4

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import glob
import time
import numpy as np
import pandas as pd

from collections import defaultdict
from collections import Counter

from bs4 import BeautifulSoup as bs

import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [None]:
%%shell

wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-4--job-resume-improvement/job_postings.zip
wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-4--job-resume-improvement/resume.txt
wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-4--job-resume-improvement/table_of_contents.txt

unzip -qq job_postings.zip
rm -rf job_postings.zip

##Extracting skills from job postings

Let's explore the HTML files in the job_postings directory.

In [5]:
# Loading HTML files
html_contents = []
for file_name in sorted(glob.glob("job_postings/*.html")):
  with open(file_name, "r") as f:
    html_contents.append(f.read())
print(f"We have loaded {len(html_contents)} HTML files.")

We have loaded 1458 HTML files.


In [6]:
# Parsing HTML files
soup_objects = []
for html in html_contents:
  soup = bs(html)
  assert soup.title is not None
  assert soup.body is not None
  soup_objects.append(soup)

In [9]:
# Checking title and body texts for duplicates
html_dict = {"Title": [], "Body": []}

for soup in soup_objects:
  title = soup.find("title").text
  body = soup.find("body").text
  html_dict["Title"].append(title)
  html_dict["Body"].append(body)

df_jobs = pd.DataFrame(html_dict)
summary = df_jobs.describe()
summary

| | Title | Body |
|---|---|---|
| count | 1458 | 1458 |
| unique | 1364 | 1458 |
| top | Data Scientist - New York, NY | Data Scientist - Beavercreek, OH\nData Scienti... |
| freq | 13 | 1 |
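
In the summary, the Body column has as many unique values as rows (1458), so no two postings share identical body text, while the Title column has only 1364 unique values, so some titles repeat. The same checks can be made directly. The sketch below runs them on a small invented DataFrame standing in for `df_jobs`, so the values shown are not from the real dataset.

```python
import pandas as pd

# Toy stand-in for df_jobs; the titles and bodies are invented.
toy_jobs = pd.DataFrame({
    "Title": ["Data Scientist - A", "Data Scientist - A", "Analyst - B"],
    "Body": ["posting one", "posting two", "posting three"],
})

# Titles that occur more than once, with their counts.
title_counts = toy_jobs["Title"].value_counts()
repeated_titles = title_counts[title_counts > 1]
print(repeated_titles)

# True when no two postings share the same body text.
print(toy_jobs["Body"].is_unique)

# Number of rows that duplicate an earlier row on both columns.
num_duplicate_rows = int(toy_jobs.duplicated().sum())
print(num_duplicate_rows)
```

Replacing `toy_jobs` with `df_jobs` would apply the same checks to the loaded postings.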


###Exploring the HTML

Let’s add two consecutive paragraphs to our HTML.