## Job Salary Prediction using Neural Networks

This Kaggle competition challenges us to predict job salaries based on job ads. The data provided includes the job title, location, category, contract type, etc.

Let's jump right in:
1. Load the libraries required for this task.
2. Read in the dataset.
3. See what the data looks like.

The other Jupyter notebook contains the code for predicting job salaries using simple methods such as Naive Bayes and logistic regression, plus a more advanced method, SVM. In this notebook, we will look at solving the problem using a neural network. This time we will use a much larger sample of the data so that we can train our network effectively.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import re
import matplotlib.pyplot as plt
%matplotlib inline

import math

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read in the train_rev1 datafile downloaded from kaggle
df = pd.read_csv('Train_rev1.csv')

The data has $244,768$ rows with 12 columns, most of which are text columns, as signified by the `object` data type. This means that we will have a large number of binary (0/1) variables once the categorical columns are encoded.

To be able to use neural networks effectively, I will randomly select 80,000 rows to do the analysis on.

In [3]:
# randomly sample k rows from the data
import random
random.seed(1) # so that results are reproducible

# get a random sample of k rows from the row indices
indices = df.index.values.tolist()
random_k = random.sample(indices, 80000)

# subset the imported data on the 80000 selected indices
train = df.loc[random_k, :]
train = train.reset_index(drop = True)
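
As an aside, pandas can do the same kind of sampling in one call. The sketch below uses a small toy frame (the `toy_df` name and its values are made up for illustration) and `DataFrame.sample`. It will not pick the same rows as `random.sample` with seed 1, so treat it as an alternative, not a drop-in replacement for the cell above.

```python
import pandas as pd

# Toy frame standing in for the Kaggle data (hypothetical values)
toy_df = pd.DataFrame({
    "Id": range(10),
    "SalaryNormalized": range(20000, 30000, 1000),
})

# Draw k rows without replacement; random_state makes the draw reproducible
k = 4
toy_sample = toy_df.sample(n=k, random_state=1).reset_index(drop=True)
print(toy_sample)
```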

In [4]:
# some problems with the way FullDescription has been encoded
def convert_utf8(s):
    return str(s)

train['FullDescription'] = train['FullDescription'].map(convert_utf8)

## Data Preprocessing

### Clean job descriptions

**Step 1:** Let's lemmatize the text so that different inflected forms of the same word collapse into one.

In [5]:
# lemmatization
from nltk.corpus import wordnet

from nltk.stem import WordNetLemmatizer
word_lemm = WordNetLemmatizer()

def convert_to_valid_pos(x):
    
    x = x[0].upper() # extract first character of the POS tag
    
    # define mapping for the tag to correct tag.
    tag_dict = {"J": wordnet.ADJ,
               "N": wordnet.NOUN,
               "R": wordnet.ADV,
               "V": wordnet.VERB}
    
    return tag_dict.get(x, wordnet.NOUN)

def lemmatize_text(s):
    pos_tagged_text = nltk.pos_tag(word_tokenize(s))
    
    lemm_list = []

    for (word, tag) in pos_tagged_text:
        lemm_list.append(word_lemm.lemmatize(word, pos = convert_to_valid_pos(tag)))


    lemm_text = " ".join(lemm_list)
    return lemm_text

train['Full_Description_Lemm'] = train['FullDescription'].map(lemmatize_text)

Using text data is particularly tricky because it contains a large number of words, numbers, links, symbols, etc. that are of no value to the prediction problem at hand. We need to manually clean the lemmatized `FullDescription` text so that it is ready for our analysis. In particular, we will remove URLs, numbers, words with stars (hidden characters) and stopwords!

In [6]:
from string import punctuation
from nltk.corpus import stopwords # store english stopwords in a list
en_stopwords = stopwords.words('english')

def remove_anomalies(s):
    
    tokens = word_tokenize(s)
    
    # urls
    weblinks = [w for w in tokens if ".co.uk" in w] + [w for w in tokens if ".com" in w] + [w for w in tokens if "www" in w]
    weblinks = list(set(weblinks)) # remove duplicates from weblinks
    
    # numbers
    numbers = []
    for x in tokens:
        if len(re.findall('.*[0-9]+.*', x)) > 0:
            numbers.append(re.findall('.*[0-9]+.*', x)[0])
        else:
            numbers.append(np.nan)
    
    numbers = pd.Series(numbers)
    numbers = numbers[~numbers.isnull()].tolist()
    
    # stars
    stars = []
    for x in tokens:
        if len(re.findall(r'.*[\*]+.*', x)) > 0:
            # keep the whole token so that it can be matched against tokens later
            stars.append(re.findall(r'.*[\*]+.*', x)[0])
        else:
            stars.append(0)
        
    stars = pd.Series(stars)   
    stars = stars[stars != 0].tolist()
    
    # stopwords (en_stopwords is only read here, so no global statement is needed)
    
    answer = " ".join([w for w in tokens if (w not in en_stopwords) and (w not in weblinks) and (w not in numbers) and (w not in stars)])
    
    for l in punctuation:
        answer = answer.replace(l, "")
    
    return answer

train['Clean_Full_Descriptions_no_stop'] = train['Full_Description_Lemm'].map(remove_anomalies)

The `Clean_Full_Descriptions_no_stop` column holds the full descriptions without punctuation, numbers, starred words, URLs or stopwords!
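
To make the filtering rules concrete, here is a minimal, dependency-free sketch of the same idea. It is not the function used above: it splits on whitespace instead of calling `word_tokenize`, and `STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list.

```python
import re
from string import punctuation

# Hypothetical stand-in for NLTK's English stopword list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "at"}

URL_RE = re.compile(r"www|\.com|\.co\.uk")   # crude URL markers, as in the cell above
DIGIT_RE = re.compile(r"[0-9]")              # any token containing a digit

def clean_description(text, stopwords=STOPWORDS):
    """Drop stopwords, URL-like tokens, tokens with digits and starred tokens."""
    kept = []
    for token in text.split():
        if token in stopwords:
            continue
        if URL_RE.search(token) or DIGIT_RE.search(token) or "*" in token:
            continue
        kept.append(token)
    answer = " ".join(kept)
    # strip any remaining punctuation characters
    return answer.translate(str.maketrans("", "", punctuation))

example = "Apply at www.example.co.uk for the **** role paying 30000 and manage budgets."
print(clean_description(example))
```

Checking each token once with a regular expression avoids building the intermediate `weblinks`, `numbers` and `stars` lists, which matters when the function is mapped over tens of thousands of descriptions.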

### Preparing the target variable

`SalaryNormalized` has the salary value for each job description. We need to create a new categorical variable from it that takes the value $1$ if the salary is greater than or equal to the $75^{th}$ percentile and $0$ otherwise.

In [7]:
# get the 75th percentile value of salary!
sal_perc_75 = np.percentile(train['SalaryNormalized'], 75)

# make a new target variable that captures whether salary is high (1) or low (0)
train['Salary_Target'] = np.where(train['SalaryNormalized'] >= sal_perc_75, 1, 0)

My modelling approach changes this time. I realised that creating a categorical variable from the 75th percentile of the salary results in an imbalanced dataset, with roughly one positive example for every three negatives. To counter the negative effects of this, I will use a neural network to predict the actual salary for each job description and then convert the predictions to categorical outputs based on whether they are at or above the 75th percentile of the salaries in `train`.
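
The imbalance is easy to see with a few lines of NumPy. The sketch below uses synthetic log-normal salaries (an assumption made purely for illustration, not the Kaggle data) and shows that a model which always predicts "low" would already be right about three times out of four.

```python
import numpy as np

# Synthetic salaries, for illustration only (not the Kaggle data)
rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=10.3, sigma=0.4, size=10_000)

# Same construction as Salary_Target above
threshold = np.percentile(salaries, 75)
target = np.where(salaries >= threshold, 1, 0)

positive_rate = target.mean()            # share of "high salary" rows
baseline_accuracy = 1 - positive_rate    # accuracy of always predicting 0
print(positive_rate, baseline_accuracy)
```

This is why accuracy alone is a weak yardstick here and why the notebook also reports AUROC.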

In [30]:
# Build the feature array (cleaned descriptions) and the two target arrays
X = np.array(train.loc[:, 'Clean_Full_Descriptions_no_stop'])
y_pxy = np.array(train.loc[:, 'SalaryNormalized'])
y_act = np.array(train.loc[:, 'Salary_Target'])

# split into train, validation and test data. The *_pxy arrays hold the continuous salary,
# the *_act arrays hold the categorical output that we want.
# Splitting all three arrays in one call keeps their rows aligned.
from sklearn.model_selection import train_test_split
X_train, X_val, y_train_pxy, y_val_pxy, y_train_act, y_val_act = train_test_split(
    X, y_pxy, y_act, test_size = 0.2, random_state = 42)
X_train, X_test, y_train_pxy, y_test_pxy, y_train_act, y_test_act = train_test_split(
    X_train, y_train_pxy, y_train_act, test_size = 0.2, random_state = 42)

# for training to be faster, I will standardize the salary target as well!
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()

y_train_pxy_scaled = std_scaler.fit_transform(y_train_pxy.reshape(-1, 1))
y_val_pxy_scaled = std_scaler.transform(y_val_pxy.reshape(-1, 1))

# Convert the arrays into a presence/absence matrix
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(max_features = 10000)
X_train_counts = count_vectorizer.fit_transform(X_train)
X_val_counts = count_vectorizer.transform(X_val)
X_test_counts = count_vectorizer.transform(X_test)

X_train_nn = np.where(X_train_counts.todense() > 0 , 1, 0)
X_val_nn = np.where(X_val_counts.todense() > 0, 1, 0)
X_test_nn = np.where(X_test_counts.todense() > 0, 1, 0)
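
A possible shortcut for the presence/absence step, sketched on two toy documents: `CountVectorizer` accepts `binary=True`, which sets every non-zero count to 1 inside the sparse matrix, so the separate `np.where(...)` pass is not needed. The `docs` list and variable names below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the cleaned job descriptions
docs = ["sales manager manage sales team", "python developer"]

# binary=True records presence/absence instead of raw counts
binary_vectorizer = CountVectorizer(binary=True)
presence = binary_vectorizer.fit_transform(docs)   # still a sparse matrix
presence_dense = presence.toarray()                # plain ndarray, not np.matrix

# Reference: the two-step approach used in the cell above
counts = CountVectorizer().fit_transform(docs)
two_step = np.where(counts.toarray() > 0, 1, 0)

print(sorted(binary_vectorizer.vocabulary_))
print(presence_dense)
```

Keeping the matrix sparse for as long as possible also helps with memory, since a dense 51,200 x 10,000 integer array is large.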

In [48]:
from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Dense(128, activation = 'relu', input_shape = (10000, )))
network.add(layers.Dense(128, activation = 'relu'))
network.add(layers.Dense(1))
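
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 10000 input features -> 128 -> 128 -> 1, as in the network above
layer_sizes = [10000, 128, 128, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # parameters in each Dense layer
print(total_params)   # total trainable parameters
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the 10,000-word vocabulary.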

In [49]:
# define the loss and optimizer
network.compile(optimizer = 'rmsprop',
               loss = 'mse',
               metrics = ['mae'])

In [50]:
# train the network
history = network.fit(X_train_nn,
           y_train_pxy_scaled,
           epochs = 10, 
           batch_size = 128,
           validation_data = (X_val_nn, y_val_pxy_scaled))

Train on 51200 samples, validate on 16000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Generating the categorical outputs
From the predicted standardized salaries, we will get back the categorical output and compare to the actual target values to calculate the accuracy! Let's first calculate the validation accuracy and then the test accuracy!
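
The cells below undo the standardization by hand, multiplying by the standard deviation and adding the mean. As a sanity check on that arithmetic, this small sketch with made-up salaries shows that it matches `StandardScaler.inverse_transform`, which could be used instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up salaries, shaped as a single column like y_train_pxy.reshape(-1, 1)
salaries = np.array([20000.0, 25000.0, 32000.0, 60000.0]).reshape(-1, 1)

scaler = StandardScaler()
scaled = scaler.fit_transform(salaries)

# Manual inverse, as in the notebook
by_hand = scaled * np.sqrt(scaler.var_) + scaler.mean_
# Built-in inverse
by_api = scaler.inverse_transform(scaled)

print(np.allclose(by_hand, by_api))
```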

In [51]:
from sklearn.metrics import accuracy_score
predictions = network.predict(X_val_nn)
predictions = predictions*np.sqrt(std_scaler.var_) + std_scaler.mean_

pred_act = (predictions >= sal_perc_75).astype('int').reshape(-1, )
accuracy_score(y_val_act, pred_act)

0.866125

In [52]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val_act, pred_act)

0.8076429132486971

In [53]:
predictions = network.predict(X_test_nn)
predictions = predictions*np.sqrt(std_scaler.var_) + std_scaler.mean_

pred_act = (predictions >= sal_perc_75).astype('int').reshape(-1, )
accuracy_score(y_test_act, pred_act)

0.862265625

In [54]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test_act, pred_act)

0.7990948119669864
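
One caveat on the AUROC cells above: `roc_auc_score` is given the thresholded 0/1 predictions. With hard labels the ROC curve has a single interior point, so the score reduces to the average of the true positive rate and the true negative rate (balanced accuracy). Passing the continuous predicted salaries would score the ranking itself. The sketch below illustrates the difference on synthetic data; the salary distribution and noise level are assumptions made for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic "true" and "predicted" salaries, for illustration only
rng = np.random.default_rng(0)
true_salary = rng.lognormal(mean=10.3, sigma=0.4, size=2000)
pred_salary = true_salary * np.exp(rng.normal(0.0, 0.3, size=2000))

threshold = np.percentile(true_salary, 75)
y_true = (true_salary >= threshold).astype(int)
y_hard = (pred_salary >= threshold).astype(int)

# AUROC from hard labels: equals balanced accuracy
auc_from_labels = roc_auc_score(y_true, y_hard)
# AUROC from continuous scores: evaluates the full ranking
auc_from_scores = roc_auc_score(y_true, pred_salary)

print(auc_from_labels, balanced_accuracy_score(y_true, y_hard))
print(auc_from_scores)
```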

### Conclusions

Using just the job descriptions, the neural network approach achieved a test accuracy of 86.2% with an AUROC score of 0.799. The accuracy is about 2.4 percentage points higher than the 83.8% obtained with SVM. The AUROC score is also higher, which suggests that this model distinguishes between the two classes better!