<font color='darkblue' size=7 style="font-family:garamond;">Leaving job prediction & EDA</font>

Welcome to this notebook, where we perform exploratory data analysis (EDA) and then predict whether a person will leave their job.

<img src="http://adcengineers.com/wp-content/uploads/2016/06/careers.jpg" width="1000px">

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import boxcox
from collections import Counter
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<font color='purple' size=6 style="font-family:garamond;">Gathering data</font>

The first step is to load our dataset into a variable 'X'. We copy the 'target' column into a separate variable 'y', then drop 'target' and the 'enrollee_id' identifier from X. We also rename the 'Business Degree' major to 'Business'.

In [None]:
X = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
y = X['target']

X = X.drop(['enrollee_id', 'target'], axis=1)
# Use .loc so the assignment modifies X itself rather than a temporary copy
X.loc[X['major_discipline'] == 'Business Degree', 'major_discipline'] = 'Business'

In [None]:
X

<font color='darkblue' size=7 style="font-family:garamond;">Categorical feature visualisation</font>

<img src="https://miro.medium.com/max/1000/1*WACiczYwdWTnJ94mnizS4Q.jpeg" width="400px">

<font color='purple' size=6 style="font-family:garamond;">Pie charts</font>

The first part of our EDA is displaying the different categorical features in our dataset. We take three columns: '**experience**', '**company_size**' and '**last_new_job**' and plot them out using pie charts.

As seen below, we visualise 'experience', which tells us how many years the candidate has been working, using a pie chart. More than 17% of the people have been working for more than 20 years, 7.46% have worked for 5 years and 7.32% for 4 years.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
count = Counter(X['experience'])
plt.pie(count.values(), labels=count.keys(), labeldistance=0.75, autopct=lambda p:f'{p:.2f}%',
       explode=[0]*14+[0.12]+[0.06]+[0]*7, shadow=True, rotatelabels=True)
plt.title('Work experience (in years)', fontsize=20)
plt.show()
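The percentages printed on the pie chart can also be read off as plain numbers. The sketch below is a minimal illustration on a made-up toy column (the values are hypothetical, not taken from the dataset); it assumes only pandas and uses `value_counts` with `normalize=True`.

```python
import pandas as pd

# Hypothetical toy column standing in for X['experience'] (values are made up)
experience = pd.Series(['>20', '5', '4', '>20', '3', None, '5', '>20'])

# normalize=True returns fractions; dropna=False keeps missing values as their own group
shares = experience.value_counts(normalize=True, dropna=False) * 100
print(shares.round(2))
```

Applied to a real column, the same call gives the share of each category, including the share of missing values.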

The 'company_size' column gives the range of employees at each candidate's company. Many of our samples are missing here (almost two fifths!), as indicated by the 'nan' slice. Among the recorded samples, the largest group is 50-99, followed by 100-500 (13.42%) and 10,000+ (10.54%).

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
count = Counter(X['company_size'])
plt.pie(count.values(), labels=count.keys(), labeldistance=0.75, autopct=lambda p:f'{p:.2f}%',
       shadow=True, explode=[0.075]+[0]*8, rotatelabels=True)
plt.title('Company size (number of people)', fontsize=20)
plt.show()

The final feature we visualise with a pie chart is 'last_new_job', which tells us how many years passed between the previous job and the current one. Over 40% of people had a gap of one year, 17.17% had a gap of more than four years and 15.14% had a gap of two years.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
count = Counter(X['last_new_job'])
plt.pie(count.values(), labels=count.keys(), labeldistance=0.75, autopct=lambda p:f'{p:.2f}%',
       explode=[0.05]+[0]*6, shadow=True)
plt.title('Number of years between last and current job', fontsize=20)
plt.show()

<font color='purple' size=6 style="font-family:garamond;">Bar graphs</font>

Next, we take a look at six different columns: '**gender**', '**relevent_experience**', '**enrolled_university**', '**education_level**', '**major_discipline**', '**company_type**'.

For the **gender** column, we can see that there are significantly more men than any other gender. All of the non-male samples combined make up less than half the number of men in the dataset.

The number of people who have **relevant experience** is more than twice the number who do not.

The vast majority of people (13817 candidates) are not **enrolled** in any university course, followed by 3757 candidates in a full time course and 1198 in a part time course.

As for the **education level**, 60% are graduates, 23% have a master's and 11% only finished high school.

For the **major discipline**, STEM dominates: 76% of people are in STEM and only 24% are in any other discipline.

More than half of the candidates (51%) work at a Pvt Ltd **company type**, while 32% of the samples are missing and 5% are at a funded startup.

In [None]:
names = [['gender', 'relevent_experience'], ['enrolled_university', 'education_level'], 
 ['major_discipline', 'company_type']]
colours = ['red', 'green', 'blue', 'purple', 'orange', 'skyblue']

fig, axes = plt.subplots(3, 2, figsize=(15, 17))
fig.tight_layout(h_pad=3)

for row_idx, row in enumerate(names):
    for col_idx, name in enumerate(row):
        ax = axes[row_idx][col_idx]
        # Count every category, treating missing values as their own 'NaN' group
        count = X[name].fillna('NaN').value_counts()
        bars = ax.bar(count.index, count.values, color=colours[row_idx * 2 + col_idx])
        # Write each bar's count just above it
        for bar, label in zip(bars, count.values):
            ax.text(bar.get_x() + bar.get_width()/2., bar.get_height(), label, ha='center', 
                     va='bottom', fontsize=15)  
        ax.set_title(name, fontsize=15)
        ax.xaxis.label.set_size(50)
        # Rotate the tick labels on every subplot, not only the last one
        ax.tick_params(axis='x', rotation=90)

<font color='darkblue' size=7 style="font-family:garamond;">Data Transformation</font>

Now we move on to transforming our data so that it can be fed into a model for predictions.

<font color='purple' size=6 style="font-family:garamond;">Missing data</font>

We start by replacing the null values in 'experience', 'company_size' and 'last_new_job' with the placeholder string '-1'. These columns are ordinal (ordered) and are label-encoded in the next step, so missing entries need a category of their own.

In [None]:
X['experience'] = X['experience'].fillna('-1')
X['company_size'] = X['company_size'].fillna('-1')
X['last_new_job'] = X['last_new_job'].fillna('-1')

<font color='purple' size=6 style="font-family:garamond;">Categorical to numerical</font>

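Next, we transform our categorical columns into numerical ones. We use a LabelEncoder for the ordinal (ordered) features and get_dummies (one-hot encoding) for the nominal (unordered) features. Note that LabelEncoder numbers the categories in sorted string order, so the resulting codes do not necessarily follow the natural order of the categories.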

In [None]:
cat_cols = ['gender', 'relevent_experience', 'enrolled_university', 'education_level', 
            'major_discipline', 'experience', 'company_size', 'company_type', 'last_new_job']
ordinal = ['experience', 'company_size', 'last_new_job']

for col in cat_cols:
    if col in ordinal:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])
    else:
        dummies = pd.get_dummies(X[col])
        for d_col in dummies:
            X[col+' '+d_col] = dummies[d_col]
        X = X.drop(col, axis=1)
                           
X['city'] = [int(city[5:]) for city in X['city']]
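If the natural order of an ordinal column matters, one alternative to label encoding is an explicit rank mapping. The sketch below is only an illustration: the category labels in `experience_order` are an assumption about what the 'experience' column contains, and the toy series is made up. It also shows why plain string sorting does not recover the intended order.

```python
import pandas as pd

# Assumed category labels, listed in their intended order (the real column may differ)
experience_order = ['<1'] + [str(n) for n in range(1, 21)] + ['>20']
rank = {label: i for i, label in enumerate(experience_order)}

# Made-up toy column with one missing value
toy = pd.Series(['2', '10', '<1', '>20', None])

# Map each label to its rank; unmapped (missing) entries become -1
encoded = toy.map(rank).fillna(-1).astype(int)
print(encoded.tolist())

# Plain string sorting puts '10' before '2', so it does not follow the numeric order
alphabetical = sorted(toy.dropna())
print(alphabetical)
```

With this mapping a larger code always means more experience, and -1 still marks missing values.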

<font color='purple' size=6 style="font-family:garamond;">Log and Box Cox</font>

Afterwards, we plot the distributions of 'city', 'city_development_index', 'experience', 'company_size', 'last_new_job' and 'training_hours' using histograms and box plots. We also check how they fare when transformed using a log transform and a Box-Cox transform.

In [None]:
for col in X.columns[:6]:
    normal = X[col]
    transforms = [[normal, 'Normal', 'lightblue'], 
                  [(normal+1).transform(np.log), 'Log Transform', 'lightgreen'], 
                  [boxcox(normal+1)[0], 'Box Cox', 'pink']]
    fig, axes = plt.subplots(2, len(transforms), figsize=(20, 12))
    
    for ax in axes[0]:
        transform = transforms[list(axes[0]).index(ax)]
        pd.DataFrame(transform[0]).hist(ax=ax, color=transform[2])
        ax.set_title('')
        ax.set_xlabel(transform[1], fontsize=15)
        
        deciles = pd.Series(transform[0]).quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
        for pos in np.array(deciles).reshape(1, -1)[0]:
            handle = ax.axvline(pos, color='darkblue', linewidth=1)
        ax.legend([handle], ['deciles'])
        
    for ax in axes[1]:
        transform = transforms[list(axes[1]).index(ax)]
        ax.boxplot(transform[0])
        ax.set_title('')
        ax.set_xlabel(transform[1], fontsize=15)
        
    axes[0][int(np.floor(len(transforms)/2))].set_title(col, pad=25, fontsize=30)
    plt.show()

Our conclusions are that 'city' is best left alone, that 'city_development_index', 'experience' and 'training_hours' should be transformed using Box-Cox, and that 'company_size' and 'last_new_job' work best with a log transform.

In [None]:
# 'city' is left untransformed
X['city_development_index'] = boxcox(X['city_development_index']+1)[0]
X['experience'] = boxcox(X['experience']+1)[0]
X['company_size'] = (X['company_size']+1).transform(np.log)
X['last_new_job'] = (X['last_new_job']+1).transform(np.log)
X['training_hours'] = boxcox(X['training_hours']+1)[0]
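A numeric check can complement the visual comparison of the transforms. The sketch below, run on synthetic right-skewed data rather than the notebook's columns, compares the skewness of the raw, log-transformed and Box-Cox-transformed values using `scipy.stats.skew`; the lognormal parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive data standing in for a column such as 'training_hours'
raw = rng.lognormal(mean=3.0, sigma=0.8, size=2000)

logged = np.log(raw + 1)
# boxcox needs strictly positive input and returns (transformed data, fitted lambda)
boxcoxed, lam = boxcox(raw + 1)

for name, values in [('raw', raw), ('log', logged), ('box-cox', boxcoxed)]:
    print(f'{name:8s} skewness = {skew(values):.3f}')
```

A skewness closer to zero suggests a more symmetric distribution, which is one way to decide between the candidates.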

<font color='purple' size=6 style="font-family:garamond;">Binning</font>

Another useful technique is binning, which reduces the number of unique values in a column to a specified number by grouping ranges of values together.

The variables we choose to bin are 'city' and 'training_hours'. The graph below shows not just how their distributions compare, but also how many different bins they have: 20, as shown by the maximum value on the x axis.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(13, 8))
for name in ['city', 'training_hours']:
    col = X[name]
    X[name] = np.digitize(col, np.arange(col.min(), col.max(), (col.max()-col.min())/20))
    X[name].hist(alpha=0.65, legend=True)
    plt.title('Binned Distribution', fontsize=20)
        
plt.show()
# Note: both binned columns are dropped here, so they feed the plot above but not the model
X = X.drop(['city', 'training_hours'], axis=1)
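To make the binning step concrete, here is a small sketch of equal-width binning with `np.digitize` on made-up values. It uses `np.linspace` for the edges, which is a variation on the `np.arange` call in the cell above, and four bins instead of twenty so the result is easy to read.

```python
import numpy as np

# Made-up values to bin
values = np.array([1.0, 4.0, 7.5, 12.0, 20.0, 21.0])
n_bins = 4

# n_bins + 1 evenly spaced edges from min to max; the interior edges separate the bins
edges = np.linspace(values.min(), values.max(), n_bins + 1)
bins = np.digitize(values, edges[1:-1])   # bin ids run from 0 to n_bins - 1

print(edges)
print(bins)
```

Passing only the interior edges keeps the minimum in bin 0 and the maximum in the last bin.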

<font color='purple' size=6 style="font-family:garamond;">Resampling</font>

The target y has many more samples in class '0' than in class '1'. We therefore use SMOTE, which synthesises new minority-class samples, to balance the classes.

In [None]:
count = Counter(y)
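# Take the labels from the counter itself so each bar is guaranteed to match its class
plt.bar([str(int(k)) for k in count], list(count.values()), color='blue')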
plt.title('Distribution of y')
plt.xlabel('Class')
plt.ylabel('Number of samples')
plt.show()

In [None]:
smote = SMOTE()
X, y = smote.fit_resample(X, y)

Now both classes in y have the same number of samples, so the classifier is no longer pushed towards always predicting the majority class.

In [None]:
count = Counter(y)
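# Take the labels from the counter itself so each bar is guaranteed to match its class
plt.bar([str(int(k)) for k in count], list(count.values()), color='blue')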
plt.title('Distribution of y')
plt.xlabel('Class')
plt.ylabel('Number of samples')
plt.show()
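One caveat about ordering: when resampling happens before the train/test split, resampled rows also end up in the test set, which tends to make test scores look better than they would on real data. A common alternative is to split first and resample only the training part. The sketch below illustrates that order on synthetic data, using plain random oversampling with NumPy as a simple stand-in for SMOTE; all names and numbers here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 900 negatives and 100 positives (made-up numbers)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 900 + [1] * 100)

# 1) Split first, so the held-out rows stay untouched
order = rng.permutation(len(y_toy))
test_idx, train_idx = order[:200], order[200:]
X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
X_te, y_te = X_toy[test_idx], y_toy[test_idx]

# 2) Oversample the minority class in the training split only
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_tr[keep], y_tr[keep]

print('train balance:', np.bincount(y_bal))
print('test balance: ', np.bincount(y_te))
```

The training split ends up balanced while the test split keeps the original class ratio, so the test scores reflect the data the model would actually face.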

<font color='purple' size=6 style="font-family:garamond;">Splitting sets and sampling</font>

A standard practice when building a predictive model is to split X and y into train and test sets, which is what we do in the cell below. The test set holds 20% of the data and the train set the remaining 80%.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Then we use a Standard Scaler to scale our X train and test data so that every feature has zero mean and unit variance. The scaler is fitted on the train set only and then applied to both sets.

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
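What the scaler does can be written out by hand. The sketch below reproduces the arithmetic with NumPy on a tiny made-up array: the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# Tiny made-up train and test arrays with two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics, do not refit on test

print(train_scaled)
print(test_scaled)
```

After scaling, both features live on the same scale even though the raw values differed by a factor of 100.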

<font color='purple' size=6 style="font-family:garamond;">Dimensionality reduction</font>

Another useful step is dimensionality reduction, in which we project the data onto a smaller number of components so that the model has less redundant information to deal with.

The correlation between the columns of our data is shown below as a heatmap. No strong correlations jump out, which is a good sign: strongly correlated columns would carry largely the same information, making some of them redundant.

In [None]:
sns.heatmap(X.corr())
plt.title('Correlation of features', fontsize=15, pad=10)
plt.show()

In the next cell, we fit a PCA to see how much of the variance each principal component explains. The first component has an explained variance ratio of 12%, and 25 further components also contribute.

In [None]:
pca = PCA(n_components=29).fit(X_train)
evr = pca.explained_variance_ratio_
plt.bar(range(len(evr)), evr)
plt.title('Explained variance ratio per principal component')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.show()

We then use a PCA to keep the first 26 components and apply it to our X train and test sets.

In [None]:
pca = PCA(n_components=26)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
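One way to choose the number of components is to look at the cumulative explained variance. The sketch below computes the explained variance ratio from the singular values of centred synthetic data, which is the same quantity a fitted PCA reports, and keeps enough components to reach a threshold. The 95% threshold and the synthetic data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 5 observed columns driven by 2 latent factors plus small noise
latent = rng.normal(size=(500, 2))
data = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(500, 5))

# Explained variance ratio from the singular values of the centred data
centred = data - data.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
evr = singular_values**2 / np.sum(singular_values**2)

# Smallest number of components whose cumulative ratio reaches the threshold
cumulative = np.cumsum(evr)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print(evr.round(4))
print('components kept:', n_keep)
```

Because only two latent factors drive the synthetic columns, at most two components are needed here.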

<font color='darkblue' size=7 style="font-family:garamond;">Creating a classifier</font>

The final step is to create a machine learning classifier that predicts the target as accurately as possible.

<img src="http://www.rgitaa.com/wp-content/uploads/2018/12/2.png" width="500">

Here, we test four predictors: XGBoost, Random Forest, Linear SVC and SGD Classifier. We use the model score (accuracy), the cross-validation score and the ROC AUC score to evaluate which model performs best.

In [None]:
classifiers = [['XGBoost', XGBClassifier()], ['Random Forest', RandomForestClassifier()], 
               ['Linear SVC', LinearSVC(dual=False)], ['SGD', SGDClassifier()]]
scores = []
cross_vals = []
roc_aucs = []

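for name, model in classifiers:
    model.fit(X_train, y_train)
    
    # Accuracy on the held-out test set
    score = model.score(X_test, y_test)
    # Cross-validate on the training data so the test set stays untouched
    cross_val = cross_val_score(model, X_train, y_train).mean()
    # ROC AUC needs continuous scores, so use probabilities or decision values, not hard labels
    if hasattr(model, 'predict_proba'):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)
    roc_auc = roc_auc_score(y_test, y_score)
    
    scores.append(score)
    cross_vals.append(cross_val)
    roc_aucs.append(roc_auc)
    
    print(name)
    print('model score:    ', score)
    print('cross val score:', cross_val)
    print('ROC AUC score:  ', roc_auc)
    if name != classifiers[-1][0]:
        print('')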
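A note on the ROC AUC metric: it measures how well a model ranks positives above negatives, so it is normally computed from continuous scores (probabilities or decision values). Feeding it hard 0/1 predictions discards the ranking and can give a different number. The sketch below shows this on made-up values with a hand-written pairwise AUC helper; `auc` is a hypothetical function written for this illustration, not scikit-learn's implementation.

```python
import numpy as np

def auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count half)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# Made-up labels and scores
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.55, 0.7, 0.9]
labels = [int(s >= 0.5) for s in scores]   # hard predictions at a 0.5 threshold

auc_scores = auc(y_true, scores)
auc_labels = auc(y_true, labels)
print('AUC from scores:', round(auc_scores, 4))
print('AUC from labels:', round(auc_labels, 4))
```

The two values differ because thresholding turns most pairwise comparisons into ties.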

It seems that the ensemble algorithms, XGBoost and Random Forest, perform the best on our dataset.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 7))
metrics = [scores, cross_vals, roc_aucs]
metric_names = ['model score', 'cross validation score', 'ROC AUC score']
names = ['XGBoost', 'Random Forest', 'Linear SVC', 'SGD']
colours = ['red', 'lightgreen', 'blue']

for index, (metric, ax) in enumerate(zip(metrics, axes.flatten())):
    bars = ax.bar(names, metric, color=colours[index])
    # Label each bar with its score, rounded to two decimals
    for bar, value in zip(bars, metric):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{value:.2f}', ha='center', 
                va='bottom', fontsize=16)
    ax.set_title(metric_names[index], fontsize=20)
plt.show()

<font color='darkblue' size=5 style="font-family:georgia">Thank you for reading my notebook.</font>

<font color='darkblue' size=5 style="font-family:georgia;">If you enjoyed this notebook and found it helpful, please upvote it and give feedback as it would help me make more of these.</font>