# Job Salary Prediction

This Kaggle competition challenges us to predict job salaries from job ads. The data provided includes fields such as the job title, location, category and contract type.

Let's jump right in:
1. Load the libraries required for this task.
2. Read in the dataset.
3. See what the data looks like.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import re
import matplotlib.pyplot as plt
%matplotlib inline

import math

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read in the train_rev1 datafile downloaded from kaggle
df = pd.read_csv('Train_rev1.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
Id                    244768 non-null int64
Title                 244767 non-null object
FullDescription       244768 non-null object
LocationRaw           244768 non-null object
LocationNormalized    244768 non-null object
ContractType          65442 non-null object
ContractTime          180863 non-null object
Company               212338 non-null object
Category              244768 non-null object
SalaryRaw             244768 non-null object
SalaryNormalized      244768 non-null int64
SourceName            244767 non-null object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


The data has $244,768$ rows and 12 columns, most of which are text columns, as signified by the `object` data type. This means that we will have a large number of 0/1 indicator variables once the categorical columns are encoded.

In the interest of computation time (since I am doing this analysis on my laptop), I will randomly select 2500 rows to do the analysis on.

In [4]:
# randomly sample 2500 rows from the data
import random
random.seed(1) # so that results are reproducible

# get a random sample of 2500 rows from the row indices
indices = df.index.values.tolist()
random_2500 = random.sample(indices, 2500)

# subset the imported data on the selected 2500 indices
train = df.loc[random_2500, :]
train = train.reset_index(drop = True)
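
The same kind of reproducible subsample can also be drawn with pandas' built-in `DataFrame.sample`. The sketch below uses a small hypothetical frame (`toy_df`) rather than the real file, and it is only an alternative: with the same seed it will not pick the same rows as `random.sample` does above.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset (the real file has 244,768 rows)
toy_df = pd.DataFrame({"Id": range(100), "SalaryNormalized": range(20000, 20100)})

# Draw 10 rows without replacement; random_state makes the draw repeatable
toy_sample = toy_df.sample(n=10, random_state=1).reset_index(drop=True)

# A second draw with the same seed returns exactly the same rows
toy_sample_again = toy_df.sample(n=10, random_state=1).reset_index(drop=True)

print(toy_sample.shape)
print(toy_sample.equals(toy_sample_again))
```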

In [5]:
# some problems with the way FullDescription has been encoded
def convert_utf8(s):
    return str(s)

train['FullDescription'] = train['FullDescription'].map(convert_utf8)

In [6]:
train.head(2)

| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68234454 | Bar & Leisure Supervisor | Genting Casinos UK is looking for an experienc... | Leicester Leicestershire East Midlands | Leicester | NaN | NaN | Genting UK | Hospitality & Catering Jobs | Up to 17,000 per annum | 17000 | caterer.com |
| 1 | 70762305 | Principal Development Engineer | Principal Development Engineer RF Microwave /... | Scotland | Scotland | NaN | permanent | ATA Recruitment | Engineering Jobs | 38000 - 45000/annum + 38k - 45k (DOE) + Pensio... | 41500 | cv-library.co.uk |


## Data Preprocessing

### Clean job descriptions

Text data is particularly tricky to use because it contains many words, numbers, links and symbols that are of no value to the prediction problem at hand. We need to clean the `FullDescription` column manually so that it is ready for our analysis. In particular, we carry out the following steps:

**Step 1:** *Make a corpus of all the descriptions provided to us.*

In [7]:
# make a corpus of all the words in the job description
corpus = ". ".join(train['FullDescription'].tolist())

# tokenize the corpus to get individual words
tokens = word_tokenize(corpus)

**Step 2:** *From this corpus, we will pick up anomalies in the descriptions - urls, numbers, etc. that are of no use to us in terms of predictions.*


In [8]:
# find all urls in the data
weblinks = [w for w in tokens if ".co.uk" in w] + [w for w in tokens if ".com" in w] + [w for w in tokens if "www" in w]
weblinks = list(set(weblinks)) # remove duplicates from weblinks

# Tokens that contain digits (figures, reference codes, etc.) are of no use to us.
def find_numbers(s):
    # return the token if it contains at least one digit, otherwise NaN
    matches = re.findall(r'.*[0-9]+.*', s)
    return matches[0] if matches else np.nan

numbers = pd.Series(tokens).map(find_numbers)
numbers = numbers[~numbers.isnull()]

# We also notice a lot of words with '*' characters in them. These are sometimes salary figures
# that have been hidden to keep the prediction problem meaningful. Other times it's just useless strings.
# We need to remove these.
def find_stars(s):
    # return the token if it contains at least one '*', otherwise 0
    # (the extraction pattern now matches the test pattern; it previously ended in ',*' instead of '.*')
    matches = re.findall(r'.*\*+.*', s)
    return matches[0] if matches else 0

star_words = pd.Series(tokens).map(find_stars)
star_words = star_words[star_words != 0].tolist()

How many urls, numbers and star_words are found in the data?

In [9]:
print("Urls found:",len(weblinks))
print("Numbers found:", len(numbers))
print("Star words found:", len(star_words))

Urls found: 1098
Numbers found: 3814
Star words found: 7180


**Step 3:** *Going back to our dataset, we will clean the descriptions by removing these anomalous strings from the job descriptions.*

In [10]:
from string import punctuation

# remove urls, star words, numbers and punctuation
def remove_anomalous_string(s):
    # weblinks, star_words and numbers are the collections built above;
    # they are only read here, so no `global` declaration is needed
    for link in weblinks:
        s = s.replace(link, "")

    for star_word in star_words:
        s = s.replace(star_word, "")

    for number in numbers:
        s = s.replace(number, "")

    for char in punctuation:
        s = s.replace(char, "")

    return s

train['Clean_Full_Descriptions'] = train['FullDescription'].map(remove_anomalous_string).map(lambda x: x.lower())
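
The `replace` loops above do one pass over each description per anomalous token, which gets slow as the token lists grow. A pattern-based variant is sketched below. The pattern `ANOMALY_RE` and the helper `clean_description` are hypothetical names, and the definition of "anomalous" (any whitespace-delimited token containing `www.`, `.com`, `.co.uk`, a `*` or a digit) is an assumption, so the output will not necessarily match the cleaned column produced above.

```python
import re
from string import punctuation

# Assumed definition of an anomalous token: contains a URL marker, a '*' or a digit
ANOMALY_RE = re.compile(r"\S*(?:www\.|\.com|\.co\.uk|\*|\d)\S*")
PUNCT_TABLE = str.maketrans("", "", punctuation)

def clean_description(text):
    """Drop URL-like, star-masked and numeric tokens, strip punctuation, lowercase."""
    text = ANOMALY_RE.sub(" ", text)          # remove whole anomalous tokens
    text = text.translate(PUNCT_TABLE)        # delete punctuation characters
    return " ".join(text.lower().split())     # lowercase and collapse whitespace

example = "Salary: **** per annum, 5 days a week. Apply at www.example.co.uk today!"
print(clean_description(example))
```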

**Step 4:** *Finally, let's remove stopwords from our descriptions.*

In [11]:
# store english stopwords in a list
from nltk.corpus import stopwords
en_stopwords = stopwords.words('english')

# define a function to remove stopwords from descriptions
def remove_stopwords(s):
    # en_stopwords is only read here, so no `global` declaration is needed
    s = word_tokenize(s)
    s = " ".join([w for w in s if w not in en_stopwords])
    return s

# Create a new column of descriptions with no stopwords
train['Clean_Full_Descriptions_no_stop'] = train['Clean_Full_Descriptions'].map(remove_stopwords)
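
Checking membership in a list scans the list for every word, whereas a `set` gives constant-time lookups on average. A minimal sketch of that idea follows. It assumes a tiny, hypothetical stopword set (`toy_stopwords`) instead of NLTK's list, and it splits on whitespace rather than calling `word_tokenize`, which is a reasonable shortcut only because punctuation has already been removed.

```python
# Hypothetical, tiny stopword set; the notebook uses NLTK's full English list
toy_stopwords = {"a", "an", "the", "is", "for", "to", "and", "of", "in"}

def remove_stopwords_fast(text, stopwords=toy_stopwords):
    """Keep whitespace-separated tokens that are not in the stopword set."""
    return " ".join(w for w in text.split() if w not in stopwords)

print(remove_stopwords_fast("the client is looking for an experienced engineer"))
```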

The `Clean_Full_Descriptions_no_stop` column now holds the full descriptions without punctuation, numbers, star words, URLs or stopwords!

Let's lemmatize the text so that different inflected forms of the same word are reduced to a single form.

In [12]:
# lemmatization
from nltk.corpus import wordnet

from nltk.stem import WordNetLemmatizer
word_lemm = WordNetLemmatizer()

def convert_to_valid_pos(x):
    
    x = x[0].upper() # extract first character of the POS tag
    
    # define mapping for the tag to correct tag.
    tag_dict = {"J": wordnet.ADJ,
               "N": wordnet.NOUN,
               "R": wordnet.ADV,
               "V": wordnet.VERB}
    
    return tag_dict.get(x, wordnet.NOUN)

def lemmatize_text(s):
    pos_tagged_text = nltk.pos_tag(word_tokenize(s))
    
    lemm_list = []

    for (word, tag) in pos_tagged_text:
        lemm_list.append(word_lemm.lemmatize(word, pos = convert_to_valid_pos(tag)))


    lemm_text = " ".join(lemm_list)
    return lemm_text

train['Clean_Full_Descriptions_no_stop'] = train['Clean_Full_Descriptions_no_stop'].map(lemmatize_text)

### Get a list of expensive cities in the UK

`LocationNormalized` has the locations for the jobs. If a city is expensive to live in, I presume that salaries on average would be higher there. This could be an important predictor.

In [13]:
exp_cities = ['London', 'Oxford', 'Brighton', 'Cambridge', 'Bristol', 'Portsmouth', 
              'Reading', 'Edinburgh', 'Leicester', 'York', 'Exeter']

train['Exp_Location'] = np.where(train['LocationNormalized'].map(lambda x: x in exp_cities), 1, 0)
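
The same indicator can be built with `Series.isin`, which avoids the Python-level `lambda`. The sketch below uses a shortened, illustrative city list and a hypothetical toy frame.

```python
import pandas as pd

exp_cities_demo = ["London", "Oxford", "Cambridge"]  # shortened, illustrative list
toy_locations = pd.DataFrame({"LocationNormalized": ["London", "Leeds", "Oxford", "UK"]})

# isin returns a boolean Series; astype(int) turns it into a 0/1 indicator
toy_locations["Exp_Location"] = (
    toy_locations["LocationNormalized"].isin(exp_cities_demo).astype(int)
)

print(toy_locations["Exp_Location"].tolist())
```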

### Preparing the target Variable

`SalaryNormalized` holds the salary for each job ad. We create a new binary variable from it that takes the value $1$ if the salary is greater than or equal to the $75^{th}$ percentile and $0$ otherwise.

In [14]:
# get the 75th percentile value of salary!
sal_perc_75 = np.percentile(train['SalaryNormalized'], 75)

# make a new target variable that captures whether salary is high (1) or low (0)
train['Salary_Target'] = np.where(train['SalaryNormalized'] >= sal_perc_75, 1, 0)
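
As a small worked example of this thresholding, the sketch below applies the same two calls to eight made-up salaries. By construction roughly a quarter of the rows end up in the high-salary class, which is why the classes are imbalanced later on.

```python
import numpy as np

# Made-up salaries, already sorted for readability
toy_salaries = np.array([18000, 22000, 25000, 30000, 41500, 60000, 75000, 90000])

# 75th percentile with NumPy's default linear interpolation
threshold = np.percentile(toy_salaries, 75)

# 1 = at or above the threshold (high salary), 0 = below it
toy_target = np.where(toy_salaries >= threshold, 1, 0)

print(threshold)
print(toy_target.tolist())
print(toy_target.mean())
```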

### Encoding Categorical Variables

Most columns in our dataframe are of the `object` (string) data type. This means that we will have to convert them to dummy variables to proceed!

Let's first check for missing values in the data!

In [15]:
train.isnull().sum()[train.isnull().sum()>0]

ContractType    1851
ContractTime     653
Company          328
dtype: int64

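There are missing values in the variables shown above! Rather than imputing the most frequent category, I will encode these columns as they are. Note that `pd.get_dummies` with its default `dummy_na=False` does not create a separate column for missing values: a row with a missing value simply gets 0 in every dummy column of that variable. Passing `dummy_na=True` would add an explicit indicator column for the 'NA' values.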
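
How `pd.get_dummies` treats missing values can be checked on a toy column. The frame below is hypothetical; the point is only to compare the default call with `dummy_na=True`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ContractType": ["full_time", np.nan, "part_time", np.nan]})

default_dummies = pd.get_dummies(toy)                  # dummy_na=False is the default
with_na_dummies = pd.get_dummies(toy, dummy_na=True)   # adds an explicit NaN indicator column

print(default_dummies.columns.tolist())
print(with_na_dummies.columns.tolist())

# With the default, rows whose value is missing are all zeros
print(default_dummies.astype(int).sum(axis=1).tolist())
```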

In [16]:
# Subset the columns required
columns_required = ['ContractType', 'ContractTime', 'Company', 'Category', 'SourceName', 'Exp_Location', 'Salary_Target']

train_b1 = train.loc[:, columns_required]

# Convert the categorical variables to dummy variables
train_b1 = pd.get_dummies(train_b1)

# Let's separate the predictors from the target variable
columns_selected = train_b1.columns.values.tolist()
target_variable = ['Salary_Target']

predictors = list(set(columns_selected) - set(target_variable))

### Modelling

#### Approach 1: Predict using variables other than job descriptions

**Model:** *Bernoulli Naive Bayes*

In [20]:
# setup the model
from sklearn.naive_bayes import BernoulliNB

X = np.array(train_b1.loc[:,predictors])
y = np.array(train_b1.loc[:,target_variable[0]])

# create test train splits 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

model = BernoulliNB()

# Fit the model and predict the output on the test data
model.fit(X_train, y_train)

# Predicted output
predicted = model.predict(X_test)

# Accuracy
from sklearn import metrics

print("Model Accuracy is:", metrics.accuracy_score(y_test, predicted))
print("Area under the ROC curve:", metrics.roc_auc_score(y_test, predicted))
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, predicted))

Model Accuracy is: 0.78
Area under the ROC curve: 0.589823330907166
Confusion Matrix:
 [[363  18]
 [ 92  27]]


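The prediction accuracy is 78% using the categorical variables alone, which is barely above the 76.2% (381 of 500 test rows) that we would get by always predicting the low-salary class. Since the classes are imbalanced, the AUROC score is more informative, and at 0.59 it is close to that of a random classifier (0.5). Note that `roc_auc_score` is given hard class predictions here rather than predicted probabilities, so the reported value equals the balanced accuracy. Let's see what we get if we use the job descriptions!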
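
As an arithmetic check, the figures above can be recovered from the confusion matrix alone. When `roc_auc_score` receives 0/1 predictions, the ROC curve has a single interior point and its area reduces to the mean of the two class recalls. The sketch below uses only the four counts printed above; the variable names are illustrative. An AUROC computed from predicted probabilities (for example `model.predict_proba(X_test)[:, 1]`) could differ, and it is not computed here.

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes
tn, fp = 363, 18
fn, tp = 92, 27

tpr = tp / (tp + fn)   # recall on the high-salary class
tnr = tn / (tn + fp)   # recall on the low-salary class
balanced_accuracy = (tpr + tnr) / 2

# Accuracy of always predicting the majority (low-salary) class
majority_baseline = (tn + fp) / (tn + fp + fn + tp)

print(round(balanced_accuracy, 4))
print(round(majority_baseline, 3))
```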

#### Approach 2: Predict using the job descriptions



**Model 1:** *Bernoulli Naive Bayes* <br>
Bernoulli Naive Bayes is best suited to short documents since it only considers the presence or absence of terms. Let's test it anyway on the data we have.

In [21]:
# Use the cleaned, lemmatized descriptions as predictors and the salary class as target
X = np.array(train.loc[:, 'Clean_Full_Descriptions_no_stop'])
y = np.array(train.loc[:, 'Salary_Target'])

# split into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

# Convert the arrays into a presence/absence matrix
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

X_train_bern = np.where(X_train_counts.todense() > 0 , 1, 0)
X_test_bern = np.where(X_test_counts.todense() > 0, 1, 0)

# Fit the model
from sklearn.naive_bayes import BernoulliNB
nb_bern_model = BernoulliNB().fit(X_train_bern, y_train)
predicted = nb_bern_model.predict(X_test_bern)

# print the accuracies
print("Model Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Area under the ROC curve:", metrics.roc_auc_score(y_test, predicted))
print("Model Confusion Matrix:\n", metrics.confusion_matrix(y_test, predicted))

Model Accuracy: 0.786
Area under the ROC curve: 0.6110964070667636
Model Confusion Matrix:
 [[360  21]
 [ 86  33]]
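
The presence/absence matrix can also be produced directly by `CountVectorizer(binary=True)`, which keeps the result sparse instead of going through `.todense()`. The sketch below uses two made-up documents. Separately, `BernoulliNB` binarizes its input at `binarize=0.0` by default, so raw counts would be thresholded in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["manage manage team", "apply job team"]

# binary=True records presence/absence directly and returns a sparse matrix
binary_vectorizer = CountVectorizer(binary=True)
toy_presence = binary_vectorizer.fit_transform(toy_docs)

# Columns follow the alphabetical order of the vocabulary
print(sorted(binary_vectorizer.vocabulary_))
print(toy_presence.toarray().tolist())
```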


**Model 2:** *Multinomial Naive Bayes*

In [22]:
# let's fit the model on the word counts of the lemmatized descriptions
from sklearn.naive_bayes import MultinomialNB
nb_mult_model = MultinomialNB().fit(X_train_counts, y_train)
predicted = nb_mult_model.predict(X_test_counts)

print("Model Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Area under the ROC curve:", metrics.roc_auc_score(y_test, predicted))
print("Model Confusion Matrix:\n", metrics.confusion_matrix(y_test, predicted))

Model Accuracy: 0.834
Area under the ROC curve: 0.7697236374864906
Model Confusion Matrix:
 [[340  41]
 [ 42  77]]


**Model 3:** *Logistic Regression*

In [23]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(class_weight = "balanced")

log_reg.fit(X_train_counts, y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

In [24]:
predicted = log_reg.predict(X_test_counts)

print("Model Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Area under the ROC curve:", metrics.roc_auc_score(y_test, predicted))
print("Model Confusion Matrix:\n", metrics.confusion_matrix(y_test, predicted))

Model Accuracy: 0.834
Area under the ROC curve: 0.7552769139151724
Model Confusion Matrix:
 [[345  36]
 [ 47  72]]


**Model 4:** *Support Vector Machines*

In [25]:
from sklearn.svm import SVC
svm_clf = SVC(kernel = "rbf", gamma = "auto", C = 100, random_state = 42, class_weight = "balanced")

svm_clf.fit(X_train_counts, y_train)

SVC(C=100, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)

In [26]:
predicted = svm_clf.predict(X_test_counts)

print("Model Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Area under the ROC curve:", metrics.roc_auc_score(y_test, predicted))
print("Model Confusion Matrix:\n", metrics.confusion_matrix(y_test, predicted))

Model Accuracy: 0.838
Area under the ROC curve: 0.7694589646882375
Model Confusion Matrix:
 [[343  38]
 [ 43  76]]


The SVM has the highest accuracy of the four models (83.8%). Its AUROC score (0.7695) is just under that of Multinomial Naive Bayes (0.7697).

#### Words that indicate high/low salary

Mutual information measures how much information a particular token carries about the class. Essentially, it answers the question: 'Knowing that this token appears in the document, how much can we say about which class the document comes from?' Let's see which words have the most predictive power for the high and low salary classes.

In [27]:
# extract the column names for the columns in our training dataset.
column_names = [x for (x,y) in sorted(count_vectorizer.vocabulary_.items(), key = lambda x:x[1])]

# probability of high salary
p_1 = np.mean(y_train)

# probability of low salary
p_0 = 1 - p_1

# create an array of feature vectors
feature_vectors = np.array(X_train_bern)

# probability of word appearance
word_probabilities = np.mean(feature_vectors, axis = 0)

# probability of seeing these words for class= 1 and class = 0 respectively
p_x_1 = np.mean(feature_vectors[y_train==1, :], axis = 0)
p_x_0 = np.mean(feature_vectors[y_train==0, :], axis = 0)

# words that are good indicators of high salary (class = 1)
high_indicators = p_x_1 * (np.log2(p_x_1) - np.log2(word_probabilities) - np.log2(p_1))

high_indicators_series = pd.Series(high_indicators, index = column_names)

# words that are good indicators of low salary (class = 0)
low_indicators = p_x_0 * (np.log2(p_x_0) - np.log2(word_probabilities) - np.log2(p_0))

low_indicators_series = pd.Series(low_indicators, index = column_names)
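
For comparison, the textbook contribution of "word present, class $c$" to mutual information is $P(x{=}1, c)\,\log_2 \frac{P(x{=}1, c)}{P(x{=}1)\,P(c)}$, which weights by the joint probability. The scores above weight by the class-conditional probability $P(x{=}1 \mid c)$ instead, so they are not the same quantity. A minimal sketch of the textbook term on made-up data follows; `presence_mi_term` and the toy arrays are hypothetical names, and this is not a re-run of the analysis above.

```python
import numpy as np

# Toy presence/absence matrix: rows are documents, columns are words (made-up data)
toy_words = ["work", "lead", "apply"]
toy_X = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [1, 0, 1],
                  [1, 0, 1]])
toy_y = np.array([1, 1, 0, 0])  # 1 = high salary, 0 = low salary

def presence_mi_term(X, y, cls):
    """P(x=1, c) * log2(P(x=1, c) / (P(x=1) * P(c))) for each word."""
    p_c = np.mean(y == cls)                           # class prior
    p_x = X.mean(axis=0)                              # P(word present)
    p_xc = (X * (y == cls)[:, None]).mean(axis=0)     # joint P(word present, class)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = p_xc * np.log2(p_xc / (p_x * p_c))
    return np.nan_to_num(term)                        # treat 0 * log(0) as 0

high_terms = dict(zip(toy_words, presence_mi_term(toy_X, toy_y, 1).tolist()))
low_terms = dict(zip(toy_words, presence_mi_term(toy_X, toy_y, 0).tolist()))
print(high_terms)
print(low_terms)
```

In this toy data, *work* appears in every document and therefore scores 0 for both classes, while *lead* and *apply* each score 0.5 for the class they occur in.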

#### Words indicative of low salary
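The ten terms with the highest scores for the low-salary class are listed below (only the terms are shown, not the scores). Because the score is weighted by how often a word appears within the class, words that are common in every ad tend to rank highly.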

In [28]:
low_indicators_series[[i for i in low_indicators_series.index]].\
sort_values(ascending = False)[:10].index

Index(['work', 'experience', 'role', 'client', 'team', 'look', 'please', 'job',
       'apply', 'skill'],
      dtype='object')

#### Words indicative of high salary
The ten terms with the highest scores for the high-salary class are listed below (again, only the terms are shown).

In [29]:
high_indicators_series[[i for i in high_indicators_series.index]].\
sort_values(ascending = False)[:10].index

Index(['experience', 'work', 'role', 'business', 'team', 'lead', 'opportunity',
       'management', 'client', 'project'],
      dtype='object')

#### Approach 3: Predict using all variables

**Model:** *Bernoulli Naive Bayes*

In [30]:
# convert text data to dataframe
X = np.array(train.loc[:, 'Clean_Full_Descriptions_no_stop'])

count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(X)

column_names = [x for (x,y) in sorted(count_vectorizer.vocabulary_.items(), key = lambda x:x[1])]
X_counts_to_occurence = np.where(X_counts.todense() > 0, 1, 0)

text_data = pd.DataFrame(X_counts_to_occurence, columns = column_names)

# train_b1 has the encoded categorical data we used earlier
# Let's separate the predictors from the target variable
columns_selected = train_b1.columns.values.tolist() + text_data.columns.values.tolist()

target_variable = ['Salary_Target']

predictors = list(set(columns_selected) - set(target_variable))

full_data = pd.concat([train_b1, text_data], axis = 1)

X = np.array(full_data.loc[:,predictors])
y = np.array(full_data.loc[:,target_variable[0]])

# create test train splits 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

model = BernoulliNB()

# Fit the model and predict the output on the test data
model.fit(X_train, y_train)

# Predicted output
predicted = model.predict(X_test)

# Accuracy
from sklearn import metrics

print("Model Accuracy is:", metrics.accuracy_score(y_test, predicted))
print("Area under the ROC curve:", metrics.roc_auc_score(y_test, predicted))
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, predicted))

Model Accuracy is: 0.774
Area under the ROC curve: 0.6147797701757868
Confusion Matrix:
 [[350  31]
 [ 82  37]]
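
Building `full_data` from `.todense()` output materialises every zero in the word matrix. A sparse alternative is sketched below with `scipy.sparse.hstack`. The two toy matrices are hypothetical stand-ins for the dummy-encoded categorical block and the word presence block, and this is only one possible way to combine them.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: 2 documents, 2 categorical dummies, 3 word indicators
toy_categorical = sparse.csr_matrix(np.array([[1, 0], [0, 1]]))
toy_text = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))

# Stack the two feature blocks side by side without densifying them
combined = sparse.hstack([toy_categorical, toy_text]).tocsr()

print(combined.shape)
print(combined.toarray().tolist())
```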


In [31]:
# Logistic Regression on the same combined predictors
model = LogisticRegression(C = 0.1, class_weight = "balanced")

# Fit the model and predict the output on the test data
model.fit(X_train, y_train)

# Predicted output
predicted = model.predict(X_test)

# Accuracy
from sklearn import metrics

print("Model Accuracy is:", metrics.accuracy_score(y_test, predicted))
print("Area under the ROC curve:", metrics.roc_auc_score(y_test, predicted))
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, predicted))

Model Accuracy is: 0.832
Area under the ROC curve: 0.7741899909570128
Confusion Matrix:
 [[337  44]
 [ 40  79]]


The accuracy of the model is 83.2%. The AUROC score is 0.77, which is the highest we have obtained so far!

### Conclusions

We find that job descriptions are very effective in helping predict job salaries. Using just the job descriptions, I achieved an accuracy of 83.8% with an AUROC score of 0.769 (SVM). This was clearly better than the 78.0% accuracy and AUROC score of 0.59 obtained with predictors other than job descriptions. I was surprised to find that the final model, which used all variables, performed only slightly better on AUROC (0.774) and no better on accuracy (83.2%)!

With mutual information, the results were interesting: <br>
Both salary classes share terms like *experience*, *role* and *work* <br>
Higher salaried jobs mention terms like *business*, *management*, *lead*, etc. <br>
Lower salaried jobs mention terms like *job*, *skill*, *apply*, etc.