# Exploring Monster.com Job Postings

Monster.com is one of the largest job sites in the United States, providing a huge clearinghouse where employers and prospective employees can find one another.

In this notebook we will explore a sample of all Monster.com job postings in the United States. We will probe the basic dataset attributes and, with luck, learn a thing or two about what a job listing looks like in the modern day. This exploratory data analysis notebook is recommended for beginners and those interested in probing this dataset further. Feel free to fork this notebook and/or copy the code here and explore further on your own!

![](https://media.wired.com/photos/593288b126780e6c04d2c7ec/master/w_999,c_limit/Palm_Monster.jpg)

In [None]:
import numpy as np
import pandas as pd

In [None]:
jobs = pd.read_csv("../input/monster_com-job_sample.csv")

## Munging the data

Many of the fields are not interesting (some as a function of the filtering applied to this dataset) and can be dropped. Some of the remaining fields are very sparse.

In [None]:
# Clean up the fields a bit
jobs = (jobs.drop(["job_board", "has_expired", "country", "country_code", "uniq_id"],
                  axis='columns'))

The `location` field often contains a copy of the `job_description`, for some reason; these rows need to be filtered out. The `salary` field varies a lot in formatting, in addition to being left blank most of the time.

In [None]:
jobs = jobs[jobs['location'].str.len() < 40]

def map_yearly_salary_range(val):
    if pd.isnull(val):
        return np.nan
    elif "/year" in val:
        part = val.split("/year")[0].replace("$", " ").strip()
        if "-" in part:
            mn, mx = part.split("-")[0:2]
            try:
                mn = float(mn.replace(",", "").strip())
                mx = float(mx.replace(",", "").strip())
            except ValueError:
                return np.nan
            return mn, mx
        
def map_hourly_salary_range(val):
    if pd.isnull(val):
        return np.nan
    elif "/hour" in val:
        part = val.split("/hour")[0].replace("$", " ").strip()
        if "-" in part:
            mn, mx = part.split("-")[0:2]
            try:
                mn = float(mn.replace(",", "").strip())
                mx = float(mx.replace(",", "").strip())
            except ValueError:
                return np.nan
            return mn, mx
        
jobs = jobs.assign(yearly_salary_range=jobs['salary'].map(map_yearly_salary_range),
                   hourly_salary_range=jobs['salary'].map(map_hourly_salary_range))

print("We found {0} yearly and {1} hourly salaries in the dataset.".format(
    jobs['yearly_salary_range'].notnull().sum(), jobs['hourly_salary_range'].notnull().sum()
))
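
The two mapping functions above differ only in the unit they look for. As a rough sketch of the same idea, the hypothetical helper `parse_salary_range` below handles both units with one regular expression. The helper name, the pattern, and the sample strings are assumptions made for illustration, not part of the dataset or the original code. Unlike the functions above, it returns `NaN` for every string it cannot parse.

```python
import re

import numpy as np

# Hypothetical pattern: "<min> - <max> /year" or "/hour", with optional "$" signs.
_RANGE = re.compile(
    r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*-\s*\$?\s*(\d[\d,]*(?:\.\d+)?)\s*\$?\s*/\s*(year|hour)"
)


def parse_salary_range(val, unit):
    """Return (min, max) as floats if `val` holds a range for `unit`, else NaN."""
    if not isinstance(val, str):
        return np.nan  # missing values stay missing
    match = _RANGE.search(val)
    if match is None or match.group(3) != unit:
        return np.nan
    low, high = (float(g.replace(",", "")) for g in match.groups()[:2])
    return low, high


# Made-up sample strings, for illustration only
print(parse_salary_range("$45,000.00 - $60,000.00 /year", "year"))
print(parse_salary_range("15.00 - 20.00 $ /hour", "hour"))
print(parse_salary_range("Competitive salary", "year"))
```

A single parser keeps the yearly and hourly logic from drifting apart, at the cost of a denser regular expression.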

`job_type` ought to be distilled into full-time, part-time, and other.

In [None]:
jobs['job_type'] = jobs['job_type'].map(
    lambda j: j if pd.isnull(j) else 'Full Time' if 'Full Time' in j else 'Part Time' if 'Part Time' in j else 'Other'
)
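
The chained conditional in the lambda above can be hard to read. A minimal sketch of the same mapping as a named function follows, run on a few invented `job_type` strings. The function name `simplify_job_type` and the sample values are assumptions for illustration.

```python
import pandas as pd


def simplify_job_type(value):
    """Collapse a raw job_type string into 'Full Time', 'Part Time', or 'Other'."""
    if pd.isnull(value):
        return value  # keep missing values missing
    if "Full Time" in value:
        return "Full Time"
    if "Part Time" in value:
        return "Part Time"
    return "Other"


# Made-up sample values, for illustration only
sample = pd.Series(["Full Time Employee", "Part Time, Temporary", "Per Diem", None])
simplified = sample.map(simplify_job_type)
print(simplified.tolist())
```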

## Types and sectors of jobs listed on Monster.com

In [None]:
# Creating the plot.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

f, axarr = plt.subplots(1, 2, figsize=(12, 5))
f.subplots_adjust(hspace=1)
plt.suptitle('Monster Top 10 Jobs by...', fontsize=18)

bar_kwargs = {'fontsize': 14, 'color': 'darkgray'}

jobs['organization'].value_counts().head(10).plot.bar(ax=axarr[0], **bar_kwargs,
                                                      title='Industry Sector')
jobs['sector'].value_counts().head(10).plot.bar(ax=axarr[1], **bar_kwargs,
                                                         title='Job Type')

sns.despine()

for n in [0, 1]:
    axarr[n].title.set_fontsize(16)
    axarr[n].set_xticklabels(axarr[n].get_xticklabels(), 
                             rotation=45, ha='right', fontsize=14)

Healthcare Services seems to make heavy use of Monster.com, as do Retail and a few others. Experienced non-managerial roles seem to have an imposing presence on the platform, but that seems quite aspirational...

## Median salaries

Splitting between yearly and hourly pay rates, what salaries are on offer? We use the midpoint of each advertised range as a stand-in for that posting's median salary.

In [None]:
f, axarr = plt.subplots(2, 1, figsize=(12, 8))
f.subplots_adjust(hspace=1)

bar_kwargs = {'fontsize': 14, 'color': 'darkgray'}

jobs = jobs.assign(
    median_yearly_salary = jobs['yearly_salary_range'].map(
        lambda r: (r[0] + r[1]) / 2 if pd.notnull(r) else r
    ),
    median_hourly_salary = jobs['hourly_salary_range'].map(
        lambda r: (r[0] + r[1]) / 2 if pd.notnull(r) else r
    )
)

# Drop missing values and trim extreme outliers before estimating densities
yearly = jobs['median_yearly_salary'].dropna()
hourly = jobs['median_hourly_salary'].dropna()

sns.kdeplot(yearly[yearly < 200000], ax=axarr[0])
sns.kdeplot(hourly[hourly < 100], ax=axarr[1])

axarr[0].set_title("Median Yearly Salary Offered", fontsize=16)
axarr[1].set_title("Median Hourly Salary Offered", fontsize=16)

sns.despine()

The salaries on offer on Monster.com probably sit below the overall US salary distribution, in my estimation. All of those `Experienced (Non-Manager)` jobs seem rather aspirational...

To dig in further, I would suggest you try breaking this data down by sector!
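
As a starting point, here is a minimal sketch of such a breakdown. It runs on a small invented frame rather than the real `jobs` data, so the numbers mean nothing. Only the column names `sector` and `median_yearly_salary` are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for `jobs`; the values are invented for illustration.
toy = pd.DataFrame({
    "sector": ["IT", "IT", "Retail", "Retail", "Retail"],
    "median_yearly_salary": [90000.0, 110000.0, 30000.0, None, 40000.0],
})

# Count the postings with a usable salary and take the median per sector.
by_sector = (
    toy.dropna(subset=["median_yearly_salary"])
       .groupby("sector")["median_yearly_salary"]
       .agg(["count", "median"])
       .sort_values("median", ascending=False)
)
print(by_sector)
```

Keeping the `count` column next to the median matters here: most postings have no salary, so some sectors may rest on only a handful of rows.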

## Summarizing job descriptions

Of course the most important field in the dataset is the job description. But the job descriptions tend to be very long, including a peppy, mostly useless bit about what the company does and why it's awesome. We can do better by slimming the document down to the most important sentences.

There are a bunch of different text summarization schemes for doing this sort of thing, but let's do a simple and fun one: sentence ranking. In this scheme, we weigh each sentence by how often its non-stopwords recur: the more times a word appears in the overall document, the more it contributes to the score of any sentence containing it (stopwords are words like "by", "if", "when"). We sum these word scores to get the sentence's overall interestingness, and pick the `n` sentences with the highest score. Simple! (Note that the implementation used here does not divide by sentence length, so longer sentences tend to score higher.)

The following block of code implements this scheme. It uses the `nltk` library; you can see the original source, with documentation, [here](https://glowingpython.blogspot.com/2014/09/text-summarization-with-nltk.html).

In [None]:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class FrequencySummarizer:
  def __init__(self, min_cut=0.1, max_cut=0.9):
    """
     Initialize the text summarizer.
     Words that have a frequency term lower than min_cut 
     or higher than max_cut will be ignored.
    """
    self._min_cut = min_cut
    self._max_cut = max_cut 
    self._stopwords = set(stopwords.words('english') + list(punctuation))

  def _compute_frequencies(self, word_sent):
    """ 
      Compute the frequency of each word.
      Input: 
       word_sent, a list of sentences already tokenized.
      Output: 
       freq, a dictionary where freq[w] is the frequency of w.
    """
    freq = defaultdict(int)
    for s in word_sent:
      for word in s:
        if word not in self._stopwords:
          freq[word] += 1
    # frequency normalization and filtering
    m = float(max(freq.values()))
    for w in list(freq.keys()):
      freq[w] = freq[w]/m
      if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
        del freq[w]
    return freq

  def summarize(self, text, n):
    """
      Return a list of n sentences 
      which represent the summary of text.
    """
    sents = sent_tokenize(text)
    
    # Not enough sentences to build a summary of the requested length
    if n > len(sents):
        return []
        
    word_sent = [word_tokenize(s.lower()) for s in sents]
    self._freq = self._compute_frequencies(word_sent)
    ranking = defaultdict(int)
    for i,sent in enumerate(word_sent):
      for w in sent:
        if w in self._freq:
          ranking[i] += self._freq[w]
    sents_idx = self._rank(ranking, n)    
    return [sents[j] for j in sents_idx]

  def _rank(self, ranking, n):
    """ return the first n sentences with highest ranking """
    return nlargest(n, ranking, key=ranking.get)
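
To see the scoring at work without downloading any NLTK data, here is a minimal pure-Python sketch of the same ranking idea. It is a simplification, not a drop-in replacement: the stopword list is a tiny hand-picked assumption, sentences are split with a naive regular expression, and the `min_cut`/`max_cut` filtering is left out. The function name `rank_sentences` and the sample text are invented for illustration.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny hand-picked stopword list, an assumption for this sketch only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in",
             "we", "for", "with", "you", "will", "our"}


def rank_sentences(text, n):
    """Return the n sentences whose non-stopwords are most frequent in `text`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words if w not in STOPWORDS)
    if not counts:
        return []
    top = max(counts.values())  # normalize so the most frequent word scores 1.0
    scores = {
        i: sum(counts[w] / top for w in words if w not in STOPWORDS)
        for i, words in enumerate(tokenized)
    }
    best = nlargest(n, scores, key=scores.get)
    return [sentences[i] for i in best]


text = (
    "We are a fast-growing company. "
    "The nurse will assist patients and support other nurse staff. "
    "Each nurse must hold a valid license."
)
print(rank_sentences(text, 1))
```

In this sample the word "nurse" dominates the counts, so the sentence that mentions it twice is ranked first.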

Unfortunately the corpus of descriptions is pretty badly formatted, and the sentences in these documents are obnoxiously long. Still, applying this technique can turn up something interesting about the kinds of words that matter in a job description.

In [None]:
" ".join(FrequencySummarizer().summarize(
    jobs.iloc[0].job_description.replace(".", ". ").replace("•", " "), 2)
)

In [None]:
stops = set(stopwords.words('english') + list(punctuation))

In [None]:
summary_words = jobs.head(1000).job_description.map(lambda desc: set(
    word_tokenize(
        " ".join(
            FrequencySummarizer().summarize(desc.replace(".", ". ").replace("•", " "), 2)
        )
    )
) - stops
                                                     )
import itertools
# Flatten the per-posting word sets into one long Series of words
summary = pd.Series(list(itertools.chain.from_iterable(summary_words.values)))

In [None]:
import seaborn as sns
sns.set_style("white")

(summary
     .value_counts(ascending=False)
     .head(12)
     .drop(['Qualifications', 'meet', 'including'])
     .plot.bar(fontsize=16, figsize=(14, 6)))
plt.gcf().suptitle('Most Frequent Important Words in Job Descriptions', fontsize=20)
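
One detail worth spelling out: each summary is reduced to a *set* of words before counting, so a word is counted at most once per posting. The bar heights are therefore the number of postings whose summary contains the word, not raw word counts. A small sketch with invented word sets shows this.

```python
from collections import Counter
from itertools import chain

# Invented word sets, standing in for the per-posting summary vocabularies.
summary_word_sets = [
    {"experience", "team", "customer"},
    {"experience", "skills"},
    {"team", "experience"},
]

# Each posting contributes a word at most once, so these are document frequencies.
doc_freq = Counter(chain.from_iterable(summary_word_sets))
print(doc_freq.most_common(2))
```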

## Further inquiry

That's all, folks!

Try applying more cleaning and more (and more sophisticated!) NLP techniques to the job descriptions in this dataset. What else can you learn about the US job market by seeing what comes up on Monster.com?