# Step 1: Extracting Raw Text from HTML {-}

**Objective:**
Extract relevant sections of the job posting webpages for similarity comparison with our resume and analysis of skills missing from our resume.

**Workflow:**

1. Examine some HTML pages manually by opening them in a browser and inspecting the HTML elements (e.g. ctrl+shift+i in many browsers or right click on the page and choose “inspect”). Determine which sections should be extracted and stored.
2. Load all webpages into Python as text in a list using any method.
3. Store the webpage sections identified in step 1 into a pandas DataFrame with sensible column names.
    - Examine some of the data in the DataFrame to make sure the data extraction worked as we expect.
    - Save the DataFrame to disk so we can load it at a later time for future parts of the project.

## 1. Inspecting HTML pages {-}
After opening some job postings in a browser and inspecting the HTML of the page (with ctrl+shift+i or right-clicking the page and choosing 'inspect'), we can see that typically, required skills are contained in bullet points.  Sometimes other information is contained in bullet points as well.  With more advanced webpage parsing (e.g. searching for the 'Required Skills' string or something similar) we could potentially limit our bullet points to only those with required skills.  However, we won't take those extra steps here since that level of data cleaning can be an entire project on its own.

The bullet points with skill requirements are 'ul' HTML tags with 'li' tags inside them.

We can also see that the title of the posting is contained within a 'title' HTML tag, and the entire text of the post is contained within the 'body' tag.  So our plan is to store the text of the 'title', 'body', and 'li' tags in separate pandas DataFrame columns.

The HTML file used here was 0a3e1fcd0264cf9a_fccid.html, which came up first in the file browser when the files were sorted by name.

![inspecting a job posting](images/inspect_a_page.png)
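The more advanced parsing mentioned above (keeping only bullets that sit under a skills-like heading) could look roughly like the sketch below.  It is not used in the rest of this step.  The sample HTML, the `SKILL_KEYWORDS` tuple, and the `skill_bullets` helper are all hypothetical, and the sketch assumes each 'ul' is preceded by a 'p' tag that acts as its heading, which will not hold for every posting.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature posting, modeled on the structure seen in the inspector.
sample_html = """
<html><head><title>Data Analyst - Springfield</title></head>
<body>
<p>Minimum qualifications:</p>
<ul><li>Experience with SQL.</li><li>Bachelor's degree.</li></ul>
<p>Benefits:</p>
<ul><li>Free lunch.</li></ul>
</body></html>
"""

# Assumed keywords; real postings phrase these headings in many different ways.
SKILL_KEYWORDS = ("qualification", "requirement", "skill")


def skill_bullets(html):
    """Return the text of 'li' items whose list follows a skills-like paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for ul in soup.find_all("ul"):
        # nearest preceding 'p' sibling, treated here as the heading of the list
        heading = ul.find_previous_sibling("p")
        if heading is None:
            continue
        if any(word in heading.get_text().lower() for word in SKILL_KEYWORDS):
            kept.extend(li.get_text().strip() for li in ul.find_all("li"))
    return kept


print(skill_bullets(sample_html))
```

Note that `find_previous_sibling` returns the closest earlier 'p' even when other tags sit in between, so a real pipeline would likely need stricter rules.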

## 2. Load webpages into Python {-}
Next, we get the list of files to open and check how many there are.  Then we read the files into a list and look at some of the data to make sure it looks ok.

Usually it's best to group your library imports in a cell or two at the beginning of your notebook to keep it organized, with the imports grouped as per PEP 8 (standard library first, then third-party packages): https://www.python.org/dev/peps/pep-0008/#imports.  Here, though, we'll import the libraries needed for each section (step) at the top of that section.

In [1]:
import glob

import pandas as pd
from bs4 import BeautifulSoup as bs

In [2]:
# get a list of the files in the html directory
files = glob.glob('html_job_postings/*.html')

In [3]:
len(files)

1337

In [4]:
# load all HTML pages as text into a list -- one entry per HTML page
html_content = []
for file in files:
    with open(file, 'r') as f:
        html_content.append(f.read())
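The loop above relies on the platform's default text encoding and on whatever file order `glob` happens to return.  As a possible variant, the sketch below reads the files in sorted order with an explicit encoding.  The `load_html_files` helper is hypothetical, and UTF-8 is an assumption about how the pages were saved.

```python
from pathlib import Path


def load_html_files(directory, encoding="utf-8"):
    """Read every .html file in `directory` into a list of strings, sorted by file name."""
    pages = []
    for path in sorted(Path(directory).glob("*.html")):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
        pages.append(path.read_text(encoding=encoding, errors="replace"))
    return pages
```

It would be called as `html_content = load_html_files('html_job_postings')`.  Sorting makes the order of the list reproducible between runs, at the cost of a different row order than the outputs shown in this notebook.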

In [5]:
# inspect the first entry to make sure it looks ok
html_content[0]

'<html><head><title>Operations Insights and Impact Analyst, YouTube - San Bruno, CA</title></head>\n<body><h2>Operations Insights and Impact Analyst, YouTube - San Bruno, CA</h2>\n<p>Minimum qualifications:</p><ul>\n<li>\nBachelor\'s degree in Data Science, Economics, Statistics, Business or Finance, or equivalent practical experience.</li>\n<li>4 years of experience in operations analytics or business analytics.</li>\n<li>Experience in using relational databases for SQL queries, database definition and schema design.</li>\n<li>Experience with building and presenting operational dashboards.</li>\n</ul><br/>\n<p>Preferred qualifications:</p><ul>\n<li>\nExperience with running operations management processes (sales and operations planning, demand/supply planning).</li>\n<li>Experience in strategy and consulting.</li>\n<li>Experience in marketing analytics, ROI, financial modeling and statistical analysis.</li>\n<li>Developed business fundamentals acumen and knowledge of video advertising

## 3. Store webpage sections into a pandas DataFrame {-}
Now we are going to parse the HTML into sections -- title, body, and bullet points ('li' tags, which typically hold required skills) -- and store these in a DataFrame.  The sections are first collected in a dictionary with lists as values, which is then converted to a DataFrame.

First, we prototype the code to extract the content from a single page.  Then we'll incorporate it in a function which takes in all HTML pages and returns a DataFrame.

In [6]:
# We use a dictionary to store data, which can easily be converted to a pandas DataFrame
html_dict = {}
for key in ['title', 'body', 'bullets']:
    html_dict[key] = []

# use the first page for prototyping the code
first_page = html_content[0]
# converting the text to a beautifulsoup object makes it easily searchable
soup = bs(first_page, 'lxml')
title = soup.find('title').text
body = soup.find('body').text
bullets = soup.find_all('li')
html_dict['title'].append(title)
html_dict['body'].append(body)
# remove extra leading and trailing whitespace with strip()
html_dict['bullets'].append([b.text.strip() for b in bullets])

df = pd.DataFrame(data=html_dict)

In [7]:
df.head()

| | title | body | bullets |
|---|---|---|---|
| 0 | Operations Insights and Impact Analyst, YouTub... | Operations Insights and Impact Analyst, YouTub... | [Bachelor's degree in Data Science, Economics,... |


It's best to always write some documentation for your functions, as shown below in the triple-quoted docstring.

In [8]:
def get_html_content(html_pages):
    """
    Extracts title, body, and list items (bullets) from HTML job postings.
    Returns a pandas DataFrame with separate columns for title, body, and bullet items.
    """
    html_dict = {}
    for key in ['title', 'body', 'bullets']:
        html_dict[key] = []

    # iterate over the function argument, not the global html_content list
    for html in html_pages:
        soup = bs(html, 'lxml')
        title = soup.find('title').text
        body = soup.find('body').text
        bullets = soup.find_all('li')
        html_dict['title'].append(title)
        html_dict['body'].append(body)
        # remove extra leading and trailing whitespace with strip()
        html_dict['bullets'].append([b.text.strip() for b in bullets])
    
    df = pd.DataFrame(html_dict)
    
    return df
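The function above assumes every page has both a 'title' and a 'body' tag.  `soup.find` returns `None` when a tag is absent, so calling `.text` on the result would raise an `AttributeError` for such a page.  If that ever happens with other data, a small guard along these lines might help.  The `tag_text` helper and the sample page are hypothetical, and returning an empty string for a missing tag is just one possible choice.

```python
from bs4 import BeautifulSoup


def tag_text(soup, name):
    """Return the text of the first `name` tag, or an empty string if the tag is absent."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else ""


# Hypothetical page with no 'title' tag.
page = "<html><body><ul><li> SQL </li></ul></body></html>"
soup = BeautifulSoup(page, "html.parser")

print(repr(tag_text(soup, "title")))
print(repr(tag_text(soup, "body")))
```

Inside the loop, `soup.find('title').text` would then become `tag_text(soup, 'title')`.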

In [9]:
df = get_html_content(html_content)

In [10]:
df.head()

| | title | body | bullets |
|---|---|---|---|
| 0 | Operations Insights and Impact Analyst, YouTub... | Operations Insights and Impact Analyst, YouTub... | [Bachelor's degree in Data Science, Economics,... |
| 1 | Mathematical and Statistical Scientist* - Char... | Mathematical and Statistical Scientist* - Char... | [BS in Statistics, Mathematics, or a related f... |
| 2 | Data Science Lead - Chicago, IL | Data Science Lead - Chicago, IL\nWho we are:\n... | [Lead the development, implementation and opti... |
| 3 | Software Developer 3 - Redwood City, CA 94065 | Software Developer 3 - Redwood City, CA 94065\... | [3+ years of Java, Go, or Python frontend and/... |
| 4 | Data Scientist - Orange, CA | Data Scientist - Orange, CA\nRole Summary / Pu... | [PhD in Industrial Engineering, computer scien... |


In [11]:
df.shape

(1337, 3)

Using pandas DataFrames gives us several convenience methods, such as drop_duplicates.  However, drop_duplicates needs hashable values and lists are not hashable, so we first have to convert each list in the bullets column to a tuple.
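The limitation can be reproduced on a toy DataFrame.  The frame below is made up for illustration and only mirrors the shape of the real data.

```python
import pandas as pd

# Toy frame with two identical rows; the 'bullets' values are lists.
toy = pd.DataFrame({
    "title": ["Data Scientist", "Data Scientist"],
    "bullets": [["Python", "SQL"], ["Python", "SQL"]],
})

try:
    toy.drop_duplicates()
except TypeError as err:
    # lists cannot be hashed, so pandas cannot compare the rows
    print("drop_duplicates failed:", err)

# Tuples are hashable, so the same rows can now be compared and deduplicated.
toy["bullets"] = toy["bullets"].apply(tuple)
deduped = toy.drop_duplicates()
print(deduped.shape)
```

After the conversion only one of the two identical rows remains.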

In [12]:
df['bullets'] = df['bullets'].apply(tuple)

In [13]:
df.drop_duplicates(inplace=True)
df.shape

(1328, 3)

As we can see, there were nine duplicate rows (1337 - 1328), which we have now removed.  Now we have our parsed text data ready for more analysis!

The last step is to save the DataFrame to disk for future steps.  There are [several methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for storing data, but we will keep it simple with a pickle file (even though it may not have the best performance for many use cases).  Using a CSV file would normally work, but since our 'bullets' column holds tuples (originally lists), which a CSV stores as plain strings, we would need some [extra steps](https://stackoverflow.com/q/23111990/4549682) to get the original sequences back when reading the CSV.

In [14]:
df.to_pickle('step1_df.pk')
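For comparison, here is a minimal sketch of the CSV route mentioned above, which is one way those extra steps could be handled.  It uses a made-up toy frame and an in-memory buffer in place of a real file, and relies on `ast.literal_eval` to turn the stored string back into a tuple.

```python
import ast
import io

import pandas as pd

# Toy frame standing in for the real one; 'bullets' holds tuples of strings.
toy = pd.DataFrame({
    "title": ["Data Scientist"],
    "bullets": [("Python", "SQL")],
})

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without the converter, 'bullets' would be read back as a plain string.
restored = pd.read_csv(buffer, converters={"bullets": ast.literal_eval})
print(restored.loc[0, "bullets"])
```

The pickle file avoids this conversion entirely, and can be loaded in a later step with `pd.read_pickle('step1_df.pk')`.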