## Prediction of Salary using Job Description

In this project, we build a model that predicts the salaries of job listings on Indeed from their descriptions. Indeed is one of the largest search engines for job listings; we scrape a subset of its listings to train and evaluate the model.

In [None]:
# Install required packages using pip
# !pip install tqdm
# !pip install furl

Import all the required packages.

In [1]:
import pandas as pd
import numpy as np
import urllib.request
import requests
import re
import sys
from bs4 import BeautifulSoup as Soup
from tqdm import tqdm
from furl import furl
import nltk
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score



### Scraping

The helper functions below scrape job listings from Indeed. For a given job title, they query the listings for each salary range in turn. Because scraping is time-consuming, it can be run either here or with the shell script __scraping_script.sh__. The scraped listings are stored in the jobs.csv file.
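
To make the query construction concrete, the sketch below builds the same kind of search URL using only the standard library. It uses `urllib.parse.urlencode` in place of `furl`, and `build_search_url` is a hypothetical helper name that does not appear in the notebook; the parameters (`q`, `l`, `radius`, `sort`, `start`) are the ones the helpers in the next cell set.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = 'https://www.indeed.com/jobs'


def build_search_url(job_title, salary_label, city, page):
    # Hypothetical helper: mirrors the query parameters set with furl
    params = {
        'q': '{} {}'.format(job_title, salary_label),
        'l': city,
        'radius': 100,
        'sort': 'date',
        # Each results page is addressed by a 'start' offset in steps of 10
        'start': (page - 1) * 10,
    }
    return BASE_URL + '?' + urlencode(params)


url = build_search_url('Data Scientist', '$125000', 'New York, NY', 3)
print(url)

# Parse the query string back to check what was encoded
query = parse_qs(urlparse(url).query)
print(query['q'], query['start'])
```

Page 3 maps to `start=20`, matching the `(page-1) * 10` offset computed in `get_jobs_with_salary`.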

In [2]:
JOBS_FILE_PATH = 'jobs.csv'
BASE_URL = 'https://www.indeed.com/jobs'
VIEW_JOB_URL = "https://www.indeed.com/viewjob"
city = 'New York, NY'


def scrape_jobs_using_soup(url, df):
    """
        This function scrapes the jobs from Indeed using BeautifulSoup
    """
    target = Soup(urllib.request.urlopen(url), "lxml")
    target_elements = target.findAll('div', attrs={'class':
                                                   re.compile('row result')})

    for elem in target_elements:
        try:
            # Get the job key, company name, title and location
            job_key = elem.attrs['data-jk']
            company_name = elem.find('span', {'class':
                                              re.compile(
                                                  'company')}).text.strip()
            job_title = elem.find('a', {'class':
                                        'turnstileLink'}).attrs['title']
            job_location = elem.find('span', {'class':
                                              'location'}).text.strip()
        except (KeyError, AttributeError):
            # Skip results that are missing any of the expected fields
            continue

        try:
            if (df['job_key'] == job_key).any():
                # Check if job key already exists in the csv file
                # to avoid duplicate entries
                continue
        except KeyError:
            pass

        f = furl(VIEW_JOB_URL)
        f.args['jk'] = job_key
        job_link = f.url

        targetDesc = Soup(urllib.request.urlopen(job_link), "lxml")
        job_summary = targetDesc.find('span', {'id':
                                               'job_summary'}).text.strip()

        entry = {
            'company_name': company_name,
            'job_title': job_title,
            'job_location': job_location,
            'job_key': job_key,
            'job_link': job_link,
            'job_summary': job_summary
        }

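        # Note: returning inside the loop means that only the first new
        # listing on each results page is kept
        return entry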


def get_jobs_with_salary(job_title, salary_label, df):
    """
        This function fetches jobs with the specified job title and salary
    """
    f = furl(BASE_URL)
    f.args['q'] = '{} {}'.format(job_title, salary_label)
    f.args['l'] = city
    f.args['radius'] = 100
    f.args['sort'] = 'date'

    for page in tqdm(range(1, 101)):
        # Indeed only allows job ads in the first 100 pages (1101 jobs)
        # to be fetched
        page = (page-1) * 10
        f.args['start'] = page
        url = f.url

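        entry = scrape_jobs_using_soup(url, df)
        if entry is None:
            # No new listing was found on this page
            continue
        entry['salary_label'] = salary_label

        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        df = pd.concat([df, pd.DataFrame([entry])], ignore_index=True)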

    return df


def get_jobs_with_title(job_title):
    """
        This function fetches job ads with the specified title from Indeed
    """
    try:
        # If there is a csv file with job descriptions populated already,
        # read the file using Pandas
        df = pd.read_csv(JOBS_FILE_PATH)
    except OSError:
        # If not, initialize a new dataframe to store the jobs
        df = pd.DataFrame()

    salary_labels = np.arange(55000, 125001, 10000)
    salary_labels = ['$' + str(label) for label in salary_labels][::-1]

    for label in salary_labels:
        df = get_jobs_with_salary(job_title, label, df)
        df.to_csv(JOBS_FILE_PATH, index=False)

    return df

In [3]:
df = get_jobs_with_title('Data Scientist')
# df.to_csv('jobs.csv', index=False)

100%|██████████| 5/5 [00:03<00:00,  1.65it/s]
100%|██████████| 5/5 [00:09<00:00,  1.90s/it]
100%|██████████| 5/5 [00:02<00:00,  2.05it/s]
100%|██████████| 5/5 [00:06<00:00,  1.30s/it]
100%|██████████| 5/5 [00:07<00:00,  1.45s/it]
100%|██████████| 5/5 [00:04<00:00,  1.21it/s]
100%|██████████| 5/5 [00:05<00:00,  1.00s/it]
100%|██████████| 5/5 [00:04<00:00,  1.20it/s]


In [4]:
df = pd.read_csv('jobs.csv')
df.head()

Unnamed: 0,company_name,job_key,job_link,job_location,job_summary,job_title,salary_label
0,Marsh & McLennan Companies,198968a6da733e07,https://www.indeed.com/viewjob?jk=198968a6da73...,"New York, NY",.\nLocation: Flexible; New York City preferred...,Head of Enterprise Data Architecture,$125000
1,CyberCoders,9e3296afd9d25841,https://www.indeed.com/viewjob?jk=9e3296afd9d2...,"New York, NY 10001",Data Engineer\nIf you are a Data Engineer with...,Data Engineer,$125000
2,Darwin Recruitment,33872ff1482137b7,https://www.indeed.com/viewjob?jk=33872ff14821...,"Manhattan, NY",Darwin Recruitment are currently partnered wit...,Senior Data Scientist,$125000
3,Neuberger Berman,d796a3c2b53d8cf0,https://www.indeed.com/viewjob?jk=d796a3c2b53d...,"New York, NY",Summary:\nSeeking a highly motivated individua...,Data Scientist,$125000
4,Aaptiv,82be444cca1377ae,https://www.indeed.com/viewjob?jk=82be444cca13...,"New York, NY",About Aaptiv\n\nAaptiv is the fastest growing ...,Senior Backend Engineer - Search,$125000


### Data Preprocessing

Before modeling, we need to preprocess and clean the data. Cleaning involves removing stop words, non-alphanumeric characters and unnecessary whitespace. This is done with regular expressions and the NLTK library, which provides a corpus of English stop words.

In [5]:
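# Build the stop-word set once instead of reloading it for every word.
# NLTK's stop-word corpus and tokenizer data must be downloaded beforehand
# (for example with nltk.download('stopwords')).
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))


def clean(job_summary):
    # Replace non-alphanumeric characters with spaces
    js = re.sub(r'[^a-zA-Z\d]', ' ', job_summary)
    # Collapse repeated spaces
    js = re.sub(' +', ' ', js).strip()
    # Tokenize and drop the stop words
    words = nltk.word_tokenize(js.lower())
    filtered_words = [word for word in words if word not in STOP_WORDS]
    return ' '.join(filtered_words)


def clean_column(df, column_name):
    # Clean every value of the column in place.
    # DataFrame.set_value was removed in pandas 1.0; use .at instead.
    for index, column in tqdm(enumerate(df[column_name])):
        df.at[index, column_name] = clean(column)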
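
As a quick illustration of what the cleaning step does to a listing, here is a dependency-free sketch. It is a simplification under stated assumptions: `clean_demo` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked one rather than NLTK's full English list, and `str.split` stands in for `nltk.word_tokenize`.

```python
import re

# Tiny illustrative stop-word list (an assumption); the notebook uses
# NLTK's full English list instead.
DEMO_STOP_WORDS = {'about', 'is', 'the', 'a', 'an', 'and', 'of'}


def clean_demo(text):
    # Replace every non-alphanumeric character with a space
    text = re.sub(r'[^a-zA-Z\d]', ' ', text)
    # Lowercase and split on whitespace (stand-in for nltk.word_tokenize)
    words = text.lower().split()
    # Keep only the words that are not stop words
    return ' '.join(w for w in words if w not in DEMO_STOP_WORDS)


sample = 'About Aaptiv\n\nAaptiv is the fastest growing mobile fitness'
print(clean_demo(sample))
# aaptiv aaptiv fastest growing mobile fitness
```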

In [6]:
le = LabelEncoder()
le.fit(df['salary_label'])

df['salary_label_original'] = df['salary_label']
df['salary_label'] = le.transform(df['salary_label'])
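
One detail worth checking at this step: `LabelEncoder` numbers the classes in sorted order, and strings sort lexicographically, so `'$125000'` comes before `'$55000'`. The sketch below reproduces that ordering in plain Python and contrasts it with a hypothetical alternative mapping (`by_salary`, not part of the notebook) that numbers the labels by salary amount. The difference matters whenever the encoded labels are later compared numerically.

```python
labels = ['$' + str(s) for s in range(55000, 125001, 10000)]

# Codes in plain string order, which is how string classes get numbered
lexicographic = {label: code for code, label in enumerate(sorted(labels))}


def salary_amount(label):
    # '$125000' -> 125000
    return int(label.lstrip('$'))


# Hypothetical alternative: number the labels by the salary amount
by_salary = {label: code
             for code, label in enumerate(sorted(labels, key=salary_amount))}

print(lexicographic['$125000'], lexicographic['$55000'])  # 2 3
print(by_salary['$125000'], by_salary['$55000'])          # 7 0
```

The lexicographic code for `$125000` is 2, which agrees with the `salary_label` column in the output of the next cell.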

In [7]:
clean_column(df, 'job_summary')
clean_column(df, 'job_title')
df.to_csv('jobs.csv', index=False)
df.head()

40it [00:03, 12.68it/s]
40it [00:00, 1162.32it/s]


Unnamed: 0,company_name,job_key,job_link,job_location,job_summary,job_title,salary_label,salary_label_original
0,Marsh & McLennan Companies,198968a6da733e07,https://www.indeed.com/viewjob?jk=198968a6da73...,"New York, NY",location flexible new york city preferred succ...,head enterprise data architecture,2,$125000
1,CyberCoders,9e3296afd9d25841,https://www.indeed.com/viewjob?jk=9e3296afd9d2...,"New York, NY 10001",data engineer data engineer experience please ...,data engineer,2,$125000
2,Darwin Recruitment,33872ff1482137b7,https://www.indeed.com/viewjob?jk=33872ff14821...,"Manhattan, NY",darwin recruitment currently partnered global ...,senior data scientist,2,$125000
3,Neuberger Berman,d796a3c2b53d8cf0,https://www.indeed.com/viewjob?jk=d796a3c2b53d...,"New York, NY",summary seeking highly motivated individual st...,data scientist,2,$125000
4,Aaptiv,82be444cca1377ae,https://www.indeed.com/viewjob?jk=82be444cca13...,"New York, NY",aaptiv aaptiv fastest growing mobile fitness p...,senior backend engineer search,2,$125000


We now need to split the dataset into train and test for modeling. Before proceeding, let's load in a dataset with more jobs.

In [8]:
df = pd.read_csv('jobs_final.csv')
print('The dataframe contains {} rows'.format(len(df)))

The dataframe contains 10404 rows


In [9]:
df = df.loc[:, df.columns.isin(['job_summary', 'job_title', 'salary_label'])]
df.head()

Unnamed: 0,job_summary,job_title,salary_label
0,person deep understanding big data lead team b...,data scientist,8
1,would like take experience data scientist move...,principal data scientist northern new jersey,8
2,leading trading firm searching data scientist ...,quantitative researcher data scientist,8
3,role purpose elsevier adaptive learning develo...,data scientist,8
4,new york city ny 150 000 180 000 base salary p...,director data scientist modeler,8


Split the dataset into a training set (80%) and a test set (20%).

In [10]:
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

train.head()

Unnamed: 0,job_summary,job_title,salary_label
9578,codes software applications adhere designs sup...,full stack developer python angular,2
1783,digital revolution changing everything everywh...,digital strategy senior manager insurance,8
4328,beyond creative agency based new york london s...,product manager,6
8743,achieve success world global adaptation produc...,sr developer craft,3
4682,cherryroad looking java cloud developer partic...,java developer,6


Since a job title is likely to be informative about the salary, we concatenate each title with its job description.

In [11]:
def concat(title, summary):
    job_details = title + ' ' + summary
    return job_details


def get_job_details(df):
    # Refer to the columns by name rather than by tuple position, so the
    # result does not depend on the column order of the dataframe
    return [concat(title, summary)
            for title, summary in zip(df['job_title'], df['job_summary'])]

### Modeling

In [12]:
def model(train, test):
    """
        This function trains our model and returns predictions
        for test data
    """
    train_jd = get_job_details(train)
    test_jd = get_job_details(test)

    # Convert text into matrix of token counts
    counter = CountVectorizer()
    counter.fit(train_jd)

    counts_train = counter.transform(train_jd)
    counts_test = counter.transform(test_jd)

    mnb_classifier = MultinomialNB()
    knn_classifier = KNeighborsClassifier()
    lreg_classifier = LogisticRegression()
    dt_classifier = DecisionTreeClassifier()

    predictors = [
        ('mnb', mnb_classifier),
        ('knn', knn_classifier),
        ('lreg', lreg_classifier),
        ('dt', dt_classifier)
    ]
    voting_classifier = VotingClassifier(predictors)

    # Train the model
    voting_classifier.fit(counts_train, train['salary_label'])

    # Predict the labels for test data
    predicted = voting_classifier.predict(counts_test)

    return predicted
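
The token-count step inside `model` can be pictured with a small plain-Python sketch. It assumes simple whitespace tokenization, which is cruder than `CountVectorizer`'s default tokenizer, and `fit_vocabulary` and `to_counts` are hypothetical names. The point it illustrates is that the vocabulary is learned from the training text only, so words that occur only in the test text are ignored.

```python
from collections import Counter


def fit_vocabulary(documents):
    # Map each distinct training token to a column index
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_counts(documents, vocabulary):
    # One row per document, one column per vocabulary token
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train_docs = ['data scientist python data', 'java developer']
test_docs = ['python developer scala']

vocab = fit_vocabulary(train_docs)
print(vocab)
print(to_counts(train_docs, vocab))
# 'scala' never appeared in training, so it has no column
print(to_counts(test_docs, vocab))
```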

We use an ensemble machine learning model to predict the salaries, built with the voting classifier from sklearn. Multinomial Naive Bayes, K-Nearest Neighbors, Logistic Regression and Decision Tree classifiers are the predictors for our voting classifier.
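
With its default `voting='hard'` setting, `VotingClassifier` predicts the class that most of its predictors vote for. The following is a minimal sketch of that majority vote, not scikit-learn's implementation: `hard_vote` is a hypothetical function, the per-classifier predictions are made up for illustration, and breaking ties toward the smallest label is an assumption intended to mirror the library's behaviour.

```python
from collections import Counter


def hard_vote(predictions):
    # predictions: one list of class labels per classifier, aligned by sample
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Break ties by taking the smallest class label
        voted.append(min(label for label, n in counts.items() if n == top))
    return voted


# Hypothetical predictions from four classifiers for three listings
mnb = [8, 4, 2]
knn = [8, 4, 6]
lreg = [7, 4, 6]
dt = [8, 3, 2]

print(hard_vote([mnb, knn, lreg, dt]))  # [8, 4, 2]
```

The third listing shows a two-against-two tie, which an ensemble with an even number of predictors can run into.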

In [13]:
predicted = model(train, test)

### Results

In [14]:
# Copy the column so that later assignments do not modify a slice of test
result = test[['salary_label']].copy()
result['predicted'] = predicted
# Rename the columns
result.columns = ['actual', 'predicted']

result['actual'] = pd.to_numeric(result['actual'], errors='coerce')
result['predicted'] = pd.to_numeric(result['predicted'], errors='coerce')

result.head()


Unnamed: 0,actual,predicted
2,8,8
3,8,8
5,8,8
17,8,8
18,8,4


Calculate the accuracy score of our model.

In [15]:
accuracy_score(result['actual'], result['predicted'])

0.37097549255165785

We get an accuracy of 37.1%, which is fairly low. However, accuracy is not always the best indicator of how good a model is. Since the labels are salary ranges \$10k apart, we can also check how often the prediction is off by at most one range. This reading assumes that consecutive encoded labels correspond to adjacent salary ranges.

In [16]:
result['difference'] = (result['actual'] - result['predicted']).abs()
result.head()

Unnamed: 0,actual,predicted,difference
2,8,8,0
3,8,8,0
5,8,8,0
17,8,8,0
18,8,4,4


In [17]:
len(result[result['difference'] <= 1]) / len(result)

0.6746756367131187

The model predicts within one range of the actual label 67.47% of the time.

In [18]:
len(result[result['difference'] <= 2]) / len(result)

0.8279673234022105

Also, 82.8% of the predictions are within two ranges of the actual label.
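
The exact accuracy and the two tolerance-based figures above are the same measurement at tolerances 0, 1 and 2. A small sketch of that metric follows, using a hypothetical helper `within_k_accuracy` and only the five rows shown in the result table, so its numbers are illustrative and not the full-dataset results.

```python
def within_k_accuracy(actual, predicted, k):
    # Fraction of predictions at most k label steps from the actual label
    hits = sum(abs(a - p) <= k for a, p in zip(actual, predicted))
    return hits / len(actual)


# The five rows shown in the result table
actual = [8, 8, 8, 8, 8]
predicted = [8, 8, 8, 8, 4]

print(within_k_accuracy(actual, predicted, 0))  # 0.8
print(within_k_accuracy(actual, predicted, 1))  # 0.8
print(within_k_accuracy(actual, predicted, 4))  # 1.0
```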

### Future work

We could improve the predictions with better preprocessing of the data, a larger training set and other machine learning algorithms.