In this notebook we will only be working with `aug_train.csv`. The ultimate goal is to predict as accurately as possible whether a particular candidate will be looking for a new job.

Importing relevant libraries

In [None]:
import os
import random
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import tensorflow as tf
import math
from scipy import special,stats #comb, factorial
from keras import backend as K
from scipy.stats import uniform
from matplotlib import pyplot as plt
from sklearn import tree
from scipy import sparse
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest,chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler,LabelEncoder, OneHotEncoder
from sklearn.metrics import classification_report, roc_auc_score, recall_score, make_scorer, plot_confusion_matrix, confusion_matrix, accuracy_score,f1_score

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
df.head()

In [None]:
print(f'Shape of the dataset: {df.shape}')

In [None]:
df.info()

We see that we will mostly be dealing with categorical features. Furthermore, it seems that there are quite a lot of nulls. Let's see where the nulls are:

In [None]:
df.isnull().sum()

Let's have a closer look at the columns with nulls:

In [None]:
ser = df.isnull().sum()
ser = ser[ser > 0]
columns = ser.index.values
df[columns]

We see that all features with nulls are categorical. For each column, we will replace all `NaN` values with `Unknown`. But first let's verify that none of the columns with nulls already contains the value `Unknown` (if some column already had the value `Unknown`, replacing `NaN` with it would merge two different categories).

In [None]:
val = 'Unknown'
taken = False
for col in columns:
    if val in df[col].unique():
        taken = True
if taken:
    print("value `Unknown` is already taken")
else:
    print('There is no column that contains unique value `Unknown`')

Since no column contains the value `Unknown`, we will replace all nulls in these columns with it.

In [None]:
ser = df.isnull().sum()
ser = ser[ser > 0]
columns = ser.index.values

for col in columns:
    # Assign back rather than filling in place: chained `inplace=True` is unreliable in recent pandas
    df[col] = df[col].fillna('Unknown')
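
The check-then-fill pattern can be tried on a toy frame first. The sketch below is only an illustration: `toy`, `placeholder` and `null_cols` are made-up names, and the values are not taken from `aug_train.csv`. It assigns the result of `fillna` back to the columns instead of filling in place.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from aug_train.csv)
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "company_size": ["50-99", "<10", np.nan, "100-500"],
    "training_hours": [36, 47, 83, 52],
})

placeholder = "Unknown"

# Columns that contain at least one null
null_cols = toy.columns[toy.isnull().any()]

# Make sure the placeholder does not collide with an existing category
taken = any((toy[col] == placeholder).any() for col in null_cols)

if not taken:
    # Assign the filled columns back instead of filling in place
    toy[null_cols] = toy[null_cols].fillna(placeholder)

print(toy)
```

If `taken` were `True`, a different placeholder would be needed so that missing values remain distinguishable from a real category.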

Let's check whether we have nulls now:

In [None]:
df.isnull().sum()

No nulls remain.

In [None]:
df.head()

A first look at the data suggests that `enrollee_id` will not be of much use.

In [None]:
df['enrollee_id'].astype(str).describe()

Indeed, every single value in the column is unique. We will remove the column.

In [None]:
df.drop(['enrollee_id'],axis=1,inplace=True)
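
A slightly more direct way to spot identifier-like columns is `Series.is_unique`. The snippet below is a sketch on a made-up frame (`toy` and `id_like` are illustrative names, and the values are invented).

```python
import pandas as pd

# Toy data (hypothetical identifiers and cities)
toy = pd.DataFrame({
    "enrollee_id": [8949, 29725, 11561, 33241],
    "city": ["city_103", "city_40", "city_21", "city_103"],
})

# Columns in which every value is distinct behave like row identifiers
id_like = [col for col in toy.columns if toy[col].is_unique]
print(id_like)

# Drop them without modifying the frame in place
toy = toy.drop(columns=id_like)
print(toy.columns.tolist())
```

Note that a continuous numeric feature can also happen to be unique in every row, so this check is a hint rather than a rule.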

In [None]:
df.head()

Let's have a look at the distribution of our label, namely `target`:

In [None]:
df['target'].value_counts()

In [None]:
df['target'].value_counts(normalize=True)

We see that the label is fairly imbalanced.

Our dataset contains only two numeric features, namely `city_development_index` and `training_hours`. Let's check the statistics summary and the distribution.

In [None]:
cont_feat = ['city_development_index', 'training_hours']
round(df[cont_feat].describe(),2)

In [None]:
cont_features = ['city_development_index', 'training_hours']
WIDTH = 10
LENGTH = 6

fig, ax = plt.subplots(1,len(cont_features),figsize=(WIDTH,LENGTH))
ax = ax.flatten()
for i,feature in enumerate(cont_features):
    ax[i].hist(df[feature],alpha=0.6)
    ax[i].set_title(f'Distribution of a feature `{feature}`')

In [None]:
cont_features = ['city_development_index', 'training_hours']
cat_variable = 'target'
WIDTH = 10
LENGTH = 6

fig, ax = plt.subplots(1,len(cont_features),figsize=(WIDTH,LENGTH))
ax = ax.flatten()
for i,feature in enumerate(cont_features):
    sns.boxplot(x=cat_variable, y=feature, data=df,ax=ax[i])
    ax[i].set_title(f'Cond. dist. of feature `{feature}`')

We see that `training_hours` does not seem to do a good job of distinguishing candidates who will look for a new job, whereas `city_development_index` does give us some insight: generally speaking, the smaller the `city_development_index`, the more likely the candidate is to be looking for a new job.
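
A box plot comparison can be backed up with per-class summary statistics. The following is a minimal sketch on invented numbers (`toy` and `summary` are illustrative names); on the real data one would group `df` in the same way.

```python
import pandas as pd

# Toy data (hypothetical values, not from aug_train.csv)
toy = pd.DataFrame({
    "city_development_index": [0.92, 0.62, 0.91, 0.55, 0.89, 0.62],
    "training_hours": [36, 47, 83, 52, 8, 24],
    "target": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
})

# Median of each continuous feature within each class of the label
summary = toy.groupby("target")[["city_development_index", "training_hours"]].median()
print(summary)
```

A large gap between the two rows for a feature points in the same direction as well-separated boxes in the plot.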

Now let's check the bivariate conditional distribution:

In [None]:
plt.figure(figsize=(10,6))
cont_features = ['city_development_index', 'training_hours']
sns.scatterplot(data=df, x=cont_features[0], y=cont_features[1], hue='target',alpha=0.6)
plt.show()

Besides what we have already mentioned (i.e., a smaller `city_development_index` is associated with a higher chance of a candidate looking for a new job), there doesn't seem to be any significant pattern.

Now let's have a look at the categorical features:

In [None]:
cat_features = ['city', 'gender', 'relevent_experience',
       'enrolled_university', 'education_level', 'major_discipline',
       'experience', 'company_size', 'company_type', 'last_new_job',]

count = np.array([df[feature].unique().size for feature in cat_features])

to_sort = np.argsort(count)[::-1]
cat_features = np.array(cat_features)[to_sort]
count = count[to_sort]

plt.figure(figsize=(11,6))
graph = sns.barplot(x=cat_features, y=count)  # keyword arguments are required by recent seaborn versions
for p in graph.patches:
    graph.annotate(p.get_height(), (p.get_x()+0.4, p.get_height()),
                   ha='center', va='bottom',
                   color= 'black')


plt.title("Number of unique values per feature")
plt.xticks(rotation=45)
plt.ylabel('Count')
plt.xlabel('Feature')
plt.show()

For each feature (besides `city`), let's visualize the distribution of `target` conditional on the value of that feature:

In [None]:
cat_features = ['city', 'gender', 'relevent_experience',
       'enrolled_university', 'education_level', 'major_discipline',
       'experience', 'company_size', 'company_type', 'last_new_job',]

for feature in cat_features[1:]:  # skip `city`
    # Row-normalized crosstab: share of each `target` value within each category
    cs = pd.crosstab(df[feature], df['target'], normalize='index')
    # Sort categories by the share of the first label so that trends are easier to read
    cs = cs.sort_values(by=cs.columns[0])
    # `DataFrame.plot.bar` creates its own figure, so no separate `plt.figure` call is needed
    cs.plot.bar(stacked=True, figsize=(10,6))
    plt.xlabel(feature)
    plt.xticks(rotation=45)
    plt.title(f'Conditional distribution of `target` given `{feature}`')
    plt.show()

A couple of observations can be made here:
1. Gender doesn’t seem to be a good predictor of who wants to switch jobs.
2. Candidates without previous relevant experience are more likely to look for a new job.
3. Those enrolled in a full-time course are more likely to look for a new job (especially compared with candidates who are not enrolled in any course).
4. Those with little or no working experience (i.e., less than 1 year) are the most likely to look for a new job (roughly 50% probability). Furthermore, based on the graph, experience is (roughly) negatively correlated with the proportion of people who look for a new job; in other words, the more experience you have, the less likely you are to be looking for a new job.
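5. `last_new_job` is negatively correlated with the probability of looking for a new job (as the last graph suggests). Per the dataset description, this feature is the number of years between the previous and the current job rather than the number of previous jobs: the larger the value, the less likely it is that the candidate will be looking for a new job.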
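
The stacked bars are just a picture of a row-normalized crosstab, so the same observations can be read off as numbers. Here is a small sketch on made-up rows (`toy` and `rates` are illustrative names; only the column name `relevent_experience` and its two categories follow the dataset).

```python
import pandas as pd

# Toy data (hypothetical rows, not from aug_train.csv)
toy = pd.DataFrame({
    "relevent_experience": [
        "Has relevent experience", "No relevent experience",
        "Has relevent experience", "No relevent experience",
        "Has relevent experience", "No relevent experience",
    ],
    "target": [0, 1, 0, 0, 1, 1],
})

# normalize='index' makes every row sum to 1: P(target | category)
rates = pd.crosstab(toy["relevent_experience"], toy["target"], normalize="index")

# Put the categories with the highest share of target == 1 first
rates = rates.sort_values(by=1, ascending=False)
print(rates)
```

Column `1` of `rates` is the share of candidates in each category who are looking for a new job, which is the quantity compared in the observations above.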

# Feature preprocessing

Preprocessing categorical features

In [None]:
cat_features = ['city', 'gender', 'relevent_experience',
       'enrolled_university', 'education_level', 'major_discipline',
       'experience', 'company_size', 'company_type', 'last_new_job',]


cat_feat_df = df[cat_features].copy()
cat_feat_df = OneHotEncoder().fit_transform(cat_feat_df)
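
To see what `OneHotEncoder` returns, here is a small sketch on a toy frame (`toy`, `encoder` and `encoded` are illustrative names, and the rows are made up). With default settings the result is a SciPy sparse matrix with one column per (feature, category) pair.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Unknown", "Male"],
    "education_level": ["Graduate", "Masters", "Graduate", "Phd"],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy)

print(sparse.issparse(encoded))   # the default output is sparse
print(encoded.shape)              # (rows, total number of categories)
print(encoder.categories_)        # categories learned for each column
print(encoded.toarray())          # dense view, fine for a toy example
```

Each row contains exactly one `1` per original feature, which is why a high-cardinality column such as `city` adds many mostly-empty columns and a sparse representation is convenient.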

Concatenating the matrices that contain the categorical and the continuous features.

In [None]:
cont_feat = ['city_development_index','training_hours']
cont_feat_df = df[cont_feat].copy()
cont_feat_df = sparse.csr_matrix(cont_feat_df.values)

In [None]:
X,y = sparse.hstack((cat_feat_df,cont_feat_df)), df['target']

Split our dataset into training and test sets.

In [None]:
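X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=11)

# Standardize only the two continuous columns (the last two); the one-hot columns are left as they are.
# The scaler is fit on the training set and then reused on the test set.
sc = StandardScaler()

# `.toarray()` returns a plain ndarray (`.todense()` returns `np.matrix`, which recent scikit-learn rejects)
left = X_train[:,:-2]
right = sparse.csr_matrix(sc.fit_transform(X_train[:,-2:].toarray()))
X_train = sparse.hstack((left,right)).tocsr()

left = X_test[:,:-2]
right = sparse.csr_matrix(sc.transform(X_test[:,-2:].toarray()))
X_test = sparse.hstack((left,right)).tocsr()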
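
As an alternative to slicing and re-stacking the matrix by hand, scikit-learn's `ColumnTransformer` can apply the encoder and the scaler to the right columns in one step. This is a different technique from the one used in this notebook, sketched here on made-up data (`toy`, `cat_cols`, `num_cols` and `preprocess` are illustrative names). Because the encoder is fit on the training split only, `handle_unknown="ignore"` is set so that categories seen only in the test split do not raise an error.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Unknown", "Male", "Female", "Male", "Unknown", "Female"],
    "city_development_index": [0.92, 0.62, 0.91, 0.55, 0.89, 0.62, 0.78, 0.80],
    "training_hours": [36, 47, 83, 52, 8, 24, 18, 46],
    "target": [0, 1, 0, 1, 0, 1, 0, 0],
})
cat_cols = ["gender"]
num_cols = ["city_development_index", "training_hours"]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[cat_cols + num_cols], toy["target"], random_state=11
)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", StandardScaler(), num_cols),
])

# Fit on the training split only, then reuse the fitted transformers on the test split
X_tr_prep = preprocess.fit_transform(X_tr)
X_te_prep = preprocess.transform(X_te)

# The output may be sparse or dense depending on its density
train_dense = X_tr_prep.toarray() if sparse.issparse(X_tr_prep) else np.asarray(X_tr_prep)
print(X_tr_prep.shape, X_te_prep.shape)
print(train_dense[:, -2:].mean(axis=0))  # scaled columns have mean close to 0 on the training split
```

The numeric columns come last in the output because the transformers are applied in the order in which they are listed.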

# Evaluate models

Since the target variable is imbalanced (i.e., there are far more entries with label $0$ than with label $1$), we would expect our models to misclassify many entries with label $1$. Hence the metric that we will watch most closely is the F1 score (where the positive label is $1$).
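
To illustrate why accuracy alone can be misleading here, consider a trivial "model" that always predicts the majority class. The labels below are synthetic and the 75/25 split is an assumption chosen for the illustration, not the exact ratio of this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 75 negatives and 25 positives (assumed ratio, for illustration only)
y_true = np.array([0] * 75 + [1] * 25)

# A baseline that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positive is predicted
f1 = f1_score(y_true, y_majority, pos_label=1, zero_division=0)

print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

The baseline reaches 75% accuracy without ever identifying a single candidate of class $1$, while its F1 score for the positive label is 0.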

# Logistic Regression

In [None]:
log_random_state = None
log_clf = LogisticRegression(random_state=log_random_state,max_iter=500).fit(X_train, y_train)
print(classification_report(y_true=y_test, y_pred=log_clf.predict(X_test)))
plot_confusion_matrix(log_clf, X_test, y_test)

As expected, we see that the model misclassifies many of the candidates who are looking for a new job (i.e., entries where `target` is $1$).

# KNN (25 neighbors)

In [None]:
knn_clf = KNeighborsClassifier(n_neighbors=25).fit(X_train,y_train)
print(classification_report(y_true=y_test, y_pred=knn_clf.predict(X_test)))
plot_confusion_matrix(knn_clf, X_test, y_test)

# Random Forest
The hyperparameters are:
1. `max_depth` = 20
2. `n_estimators` = 700
3. `bootstrap` = False

In [None]:
rf_clf = RandomForestClassifier(bootstrap=False, 
                                max_depth=20, 
                                n_estimators=700,
                                random_state=13).fit(X_train, y_train)

print(classification_report(y_true=y_test, y_pred=rf_clf.predict(X_test)))
plot_confusion_matrix(rf_clf, X_test, y_test)

# SVM: Default hyperparameters

In [None]:
svm_clf = SVC(gamma=0.0870736086175949).fit(X_train,y_train)
print(classification_report(y_true=y_test, y_pred=svm_clf.predict(X_test)))
plot_confusion_matrix(svm_clf, X_test, y_test)

We see that even without any hyperparameter tuning, SVM performs noticeably better than the previous models: it gives the highest F1 score so far (with $1$ as the positive label).

In a separate notebook, I tested different sets of hyperparameters. After trying dozens of combinations, I did not find any set that gives a better F1 score (positive label: $1$) than the default one. By the "default" set of hyperparameters, I mean $C=1$, kernel $=$ 'rbf', and gamma $=$ 'scale', which evaluates to $\approx 0.087$ on this training data.
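
For reference, scikit-learn documents `gamma='scale'` as `1 / (n_features * X.var())`. The sketch below checks this on synthetic dense data generated with `make_classification` (the data and the names `X_toy`, `gamma_manual` are illustrative); it does not reproduce the value $0.087$, which depends on the actual training matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data, only for checking the formula
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Documented meaning of gamma='scale'
gamma_manual = 1.0 / (X_toy.shape[1] * X_toy.var())

svm_scale = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_toy, y_toy)
svm_manual = SVC(kernel="rbf", C=1.0, gamma=gamma_manual).fit(X_toy, y_toy)

# Both models should produce the same decision values
same = np.allclose(
    svm_scale.decision_function(X_toy),
    svm_manual.decision_function(X_toy),
)
print(gamma_manual, same)
```

This is why passing the numeric value explicitly, as in the cell above, is equivalent to using the default on the same training data.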

# XGBoost

After running a grid search (in a separate notebook), the best hyperparameters found, i.e., those that maximize the F1 score (where the positive label is $1$), are:

1. `max_depth` = 7

2. `eta` = 0.047895

3. `objective` = 'binary:hinge'


In [None]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train,label=y_train)
param = {'max_depth': 7,
         'eta': 0.047895,
         'objective': 'binary:hinge'}
bst = xgb.train(params=param,dtrain=dtrain, num_boost_round=30)


dtest = xgb.DMatrix(X_test)
print(classification_report(y_true=y_test,y_pred=bst.predict(dtest)))

We see that while precision for label $1$ drops by roughly $3\%$, the recall increases significantly, thus giving us the best f1-score (0.62) out of all models used.
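
The trade-off follows from the definition of the F1 score as the harmonic mean of precision and recall. The numbers in the sketch below are purely hypothetical and are not the results of any model in this notebook; they only show that a small drop in precision can be outweighed by a larger gain in recall.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values, chosen only to illustrate the trade-off
f1_before = f1_from(precision=0.60, recall=0.40)
f1_after = f1_from(precision=0.57, recall=0.60)

print(f"before: {f1_before:.3f}, after: {f1_after:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, raising the weaker metric (here recall) tends to help more than the small loss in the stronger one hurts.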