## Task Details
This dataset is designed to help understand the factors that lead a person to leave their current job. A model (or models) uses a candidate's current credentials, demographics, and experience to predict the probability that the candidate will look for a new job rather than stay with the company.

The data is divided into train and test sets. A sample submission is provided that corresponds to the enrollee_id values of the test set (enrollee_id | target).

## Notes
The dataset is imbalanced.

Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality.

Missing-value imputation can be part of your pipeline as well.
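
As a rough illustration of what such an imputation step could look like, the sketch below fills categorical gaps with an explicit `Missing` level and numeric gaps with the column median. The toy frame, its values, and the helper name `impute_simple` are assumptions made for this example; the notebook itself does not impute and instead encodes NaN as its own label later on.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for aug_train; the values are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", "Male"],
    "company_size": ["50-99", np.nan, np.nan, "<10"],
    "training_hours": [36.0, 47.0, np.nan, 8.0],
})

def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median for numeric columns, 'Missing' level otherwise."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("Missing")
    return out

imputed = impute_simple(toy)
print(imputed)
```

Keeping `Missing` as its own level preserves the information that a value was absent, which may itself be predictive.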

## Features
enrollee_id: Unique ID for candidate

city: City code

city_development_index: Development index of the city (scaled)

gender: Gender of candidate

relevent_experience: Relevant experience of candidate

enrolled_university: Type of university course enrolled in, if any

education_level: Education level of candidate

major_discipline: Major discipline of candidate's education

experience: Candidate's total experience in years

company_size: Number of employees in current employer's company

company_type: Type of current employer

last_new_job: Difference in years between previous job and current job

training_hours: Training hours completed

target: 0 – Not looking for job change, 1 – Looking for a job change

## Importing Libraries

In [None]:
#%% Imports
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import lightgbm as lgb
import shap
%matplotlib inline

from pprint import pprint
from IPython.display import display 
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2 and is unused here, so it is no longer imported
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score, roc_auc_score

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Read in Training Data (aug_train.csv)

In [None]:
# Read aug_train.csv
aug_train = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
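# Initial glance at the data
# info() prints directly and returns None, so it is not wrapped in display();
# show_counts replaces the null_counts argument, which was removed in pandas 2.0
aug_train.info(verbose=True, show_counts=True)
print(aug_train.shape)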

aug_train has 19,158 observations with 13 features and 1 target variable. Several features have missing values, which must be handled properly.
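
To see at a glance which columns are affected, a per-column summary of missing values can help. The sketch below is a minimal version run on a made-up toy frame; `missing_summary` is a hypothetical helper name, and the same call could be applied to `aug_train`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of NaN per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    # keep only columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame with made-up values, standing in for aug_train.
toy = pd.DataFrame({
    "city": ["city_103", "city_40", "city_21", "city_115"],
    "gender": ["Male", np.nan, np.nan, "Female"],
    "company_size": [np.nan, "50-99", np.nan, np.nan],
})
summary = missing_summary(toy)
print(summary)
```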

## EDA

In [None]:
# enrollee_id is a meaningless feature: it is a unique value for each candidate.
# count the total number of unique values in the enrollee_id column
print('Number of Unique Values: ' + str(aug_train['enrollee_id'].nunique()))

In [None]:
# city has 123 unique values and is a categorical variable.
print('Number of Unique Values: ' + str(aug_train['city'].nunique()))
print('Number of NaN Values: ' + str(sum(aug_train['city'].isnull())))
# top 10 cities 
print((aug_train['city'].value_counts()[0:10]))

In [None]:
# city_development_index is a continuous variable

print("Number of Missing Values: ", aug_train['city_development_index'].isna().sum())
display(aug_train['city_development_index'].describe())
boxplot = aug_train.boxplot(column ='city_development_index')

In [None]:
# gender is a categorical variable: Male, Female, Other, or NaN
print("Number of Missing Values: ", aug_train['gender'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
gender_counts = aug_train['gender'].value_counts()
fig = px.pie(values=gender_counts.values, names=gender_counts.index, title='gender', template='plotly_dark')
fig.show()

In [None]:
# relevent_experience is a binary variable with no missing values.
print("Number of Missing Values: ", aug_train['relevent_experience'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
relevent_experience_counts = aug_train['relevent_experience'].value_counts()
fig = px.pie(values=relevent_experience_counts.values, names=relevent_experience_counts.index,
             title='relevent_experience', template='plotly_dark')
fig.show()

In [None]:
# education_level is a categorical variable indicating the education level of the worker; it has 460 missing values

print("Number of Missing Values: ", aug_train['education_level'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
education_level_counts = aug_train['education_level'].value_counts()
fig = px.pie(values=education_level_counts.values, names=education_level_counts.index,
             title='education_level', template='plotly_dark')
fig.show()

In [None]:
# major_discipline is a categorical variable indicating the major discipline of the worker; it has 2813 missing values
print("Number of Missing Values: ", aug_train['major_discipline'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
major_discipline_counts = aug_train['major_discipline'].value_counts()
fig = px.pie(values=major_discipline_counts.values, names=major_discipline_counts.index,
             title='major_discipline', template='plotly_dark')
fig.show()

In [None]:
# experience is an ordinal variable; <1 can be replaced with 0 and >20 with 21

print("Number of Missing Values: ", aug_train['experience'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
experience_counts = aug_train['experience'].value_counts()
fig = px.pie(values=experience_counts.values, names=experience_counts.index,
             title='experience', template='plotly_dark')
fig.show()
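
The replacement mentioned in the comment above (`<1` to 0 and `>20` to 21) is not applied anywhere in this notebook, which label-encodes the raw strings instead. A minimal sketch of how it could be done is shown here; the helper name `experience_to_years` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def experience_to_years(s: pd.Series) -> pd.Series:
    """Hypothetical helper: map '<1' to 0 and '>20' to 21, keep NaN as missing."""
    cleaned = s.replace({"<1": "0", ">20": "21"})
    # nullable Int64 keeps missing entries as <NA> instead of forcing floats
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")

# Toy values mimicking the string format of the experience column.
toy_experience = pd.Series(["<1", "5", ">20", np.nan, "12"])
years = experience_to_years(toy_experience)
print(years)
```

Converting to numbers keeps the natural order of the levels, whereas label encoding of the strings sorts them alphabetically (for example, "10" before "2").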

In [None]:
# company_size is an ordinal categorical variable; it has 5938 missing values

print("Number of Missing Values: ", aug_train['company_size'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
company_size_counts = aug_train['company_size'].value_counts()
fig = px.pie(values=company_size_counts.values, names=company_size_counts.index,
             title='company_size', template='plotly_dark')
fig.show()

In [None]:
# company_type is a categorical variable

print("Number of Missing Values: ", aug_train['company_type'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
company_type_counts = aug_train['company_type'].value_counts()
fig = px.pie(values=company_type_counts.values, names=company_type_counts.index,
             title='company_type', template='plotly_dark')
fig.show()

In [None]:
# last_new_job is a categorical variable

print("Number of Missing Values: ", aug_train['last_new_job'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
last_new_job_counts = aug_train['last_new_job'].value_counts()
fig = px.pie(values=last_new_job_counts.values, names=last_new_job_counts.index,
             title='last_new_job', template='plotly_dark')
fig.show()

In [None]:
# training_hours is a continuous variable

print("Number of Missing Values: ", aug_train['training_hours'].isna().sum())
display(aug_train['training_hours'].describe())
aug_train.boxplot(column ='training_hours')

### Target variable
target is the variable we are trying to predict and calculate probabilities for.

0 – Not looking for job change, 1 – Looking for a job change.

High recall is preferable here, so that employees who are looking for a job change are not missed.
This is an imbalanced classification problem, as seen in the pie chart below.

In [None]:
print("Number of Missing Values: ", aug_train['target'].isna().sum())
# pass the counts explicitly so the chart does not depend on how pandas names the value_counts() result
target_counts = aug_train['target'].value_counts()
fig = px.pie(values=target_counts.values, names=target_counts.index,
             title='target', template='ggplot2')
fig.show()
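
One common way to account for this imbalance is to weight the positive class by the ratio of negatives to positives. The sketch below computes that ratio on a made-up target vector; `positive_class_weight` is a hypothetical helper. LightGBM exposes a `scale_pos_weight` parameter that accepts such a value, but this notebook does not use it, and whether it would improve recall here is untested.

```python
import pandas as pd

def positive_class_weight(y: pd.Series) -> float:
    """Hypothetical helper: ratio of negative (0) to positive (1) examples."""
    n_negative = (y == 0).sum()
    n_positive = (y == 1).sum()
    return float(n_negative / n_positive)

# Toy target with made-up values: 6 negatives, 2 positives.
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0])
weight = positive_class_weight(toy_target)
print(weight)

# Assumption: a value like this could be supplied as params['scale_pos_weight'].
```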

## Testing Data Initial Glance

In [None]:
# Read aug_test.csv
aug_test = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
# Initial glance at the data
# info() prints directly and returns None; show_counts replaces null_counts (removed in pandas 2.0)
aug_test.info(verbose=True, show_counts=True)
print(aug_test.shape)

## Prepare Data for LightGBM
Extract only the features from aug_train and aug_test and row-bind them. We then perform label encoding so that LightGBM can be used.

In [None]:
# Separate aug_train into target and features
y = aug_train['target']
X_aug_train = aug_train.drop('target',axis = 'columns')
# save the index for X_aug_train 
X_aug_train_index = X_aug_train.index.to_list()

# row-bind aug_train features with aug_test features
# this makes it easier to apply label encoding to the entire dataset
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
X_aug_total = pd.concat([X_aug_train, aug_test], ignore_index=True)
X_aug_total.info(verbose=True, show_counts=True)

# save the index for X_aug_test 
X_aug_test_index = np.setdiff1d(X_aug_total.index.to_list() ,X_aug_train_index) 

## MultiColumnLabelEncoder

In [None]:
# MultiColumnLabelEncoder
# Code snippet found on Stack Overflow
# https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
# from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                # convert float NaN --> string NaN
                output[col] = output[col].fillna('NaN')
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # items() replaces iteritems(), which was removed in pandas 2.0
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

# store the categorical feature names as a list
cat_features = X_aug_total.select_dtypes(['object']).columns.to_list()

# use MultiColumnLabelEncoder to apply label encoding to cat_features
# NaN is encoded as its own level; no imputation is used for missing data
X_aug_total_transform = MultiColumnLabelEncoder(columns = cat_features).fit_transform(X_aug_total)

In [None]:
# Before and After LabelEncoding
display(X_aug_total)
display(X_aug_total_transform)
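
The reason for encoding train and test together is that the same string must receive the same integer code in both sets. The sketch below shows this on two made-up frames, using pandas' `category` dtype as a stand-in for the encoder above; it is an alternative technique, not the method used in this notebook. With this approach missing values receive the code -1 rather than their own label.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with made-up values; note that 'city_c' appears only in the test part.
train_part = pd.DataFrame({"city": ["city_a", "city_b"], "gender": ["Male", np.nan]})
test_part = pd.DataFrame({"city": ["city_b", "city_c"], "gender": ["Female", "Male"]})

# Combine first so that the category levels cover both sets.
combined = pd.concat([train_part, test_part], ignore_index=True)
for col in ["city", "gender"]:
    combined[col] = combined[col].astype("category")

# Integer codes per column; NaN is represented by -1.
codes = combined.apply(lambda c: c.cat.codes)
print(codes)
```

Here `city_b` maps to the same code in both halves of the combined frame, which would not be guaranteed if each set were encoded separately.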

## Split X_aug_total_transform
Split X_aug_total_transform back into X_aug_train_transform and X_aug_test_transform by using the index we saved before.

In [None]:
# Split X_aug_total_transform 
X_aug_train_transform = X_aug_total_transform.iloc[X_aug_train_index, :]
X_aug_test_transform = X_aug_total_transform.iloc[X_aug_test_index, :].reset_index(drop = True) 

In [None]:
#After LabelEncoding for aug_train 
display(X_aug_train_transform)

In [None]:
#After LabelEncoding for aug_test 
display(X_aug_test_transform)

## Train-Test Stratified Split

In [None]:
# drop enrollee_id from aug_train, as an identifier carries no predictive information
train_x, valid_x, train_y, valid_y = train_test_split(X_aug_train_transform.drop('enrollee_id',axis = 'columns'), y, test_size=0.2, shuffle=True, stratify=y, random_state=1301)

# Create the LightGBM data containers
# Make sure that cat_features are used
train_data=lgb.Dataset(train_x,label=train_y, categorical_feature = cat_features)
valid_data=lgb.Dataset(valid_x,label=valid_y, categorical_feature = cat_features)

# Select hyperparameters
params = {'objective':'binary',
          'metric' : 'auc',
          'boosting_type' : 'gbdt',
          'colsample_bytree' : 0.9234,
          'num_leaves' : 13,
          'max_depth' : -1,
          'n_estimators' : 200,  # alias of num_iterations; when set here it overrides the num_boost_round passed to lgb.train
          'min_child_samples': 399, 
          'min_child_weight': 0.1,
          'reg_alpha': 2,
          'reg_lambda': 5,
          'subsample': 0.855,
          'verbose' : -1,
          'num_threads' : 4
}

## Run LightGBM on train data

In [None]:
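# Train the model with the selected parameters
# early stopping and logging are passed as callbacks; the early_stopping_rounds and
# verbose_eval arguments were removed from lgb.train in LightGBM 4.0
lgbm = lgb.train(params,
                 train_data,
                 num_boost_round=2500,
                 valid_sets=[valid_data],
                 callbacks=[lgb.early_stopping(stopping_rounds=30),
                            lgb.log_evaluation(period=10)]
                 )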

In [None]:
# Overall AUC over all labelled rows (training and validation combined), so this score is optimistic
y_hat = lgbm.predict(X_aug_train_transform.drop('enrollee_id',axis = 'columns'))
score = roc_auc_score(y, y_hat)
print("Overall AUC: {:.3f}" .format(score))

In [None]:
# ROC curve for the validation data
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
y_probas = lgbm.predict(valid_x) 
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(valid_y, y_probas)
roc_auc = auc(fpr, tpr)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for validation data')
plt.legend(loc="lower right")
plt.show()
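
AUC summarises ranking quality over all thresholds, but the stated goal of high recall depends on the threshold chosen to turn probabilities into labels. The sketch below shows how recall and precision move when the threshold is lowered. The labels and probabilities are made up, and `report_at_threshold` is a hypothetical helper; in this notebook the inputs would be `valid_y` and `y_probas`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predicted probabilities for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.05])

def report_at_threshold(y_true, y_prob, threshold):
    """Hypothetical helper: binarise probabilities and compute basic metrics."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "threshold": threshold,
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

for t in (0.5, 0.3):
    print(report_at_threshold(y_true, y_prob, t))
```

On this toy data, lowering the threshold from 0.5 to 0.3 catches every positive at the cost of more false positives, which is the usual trade-off when recall is prioritised.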

## Feature Importance

In [None]:
# Feature Importance 
lgb.plot_importance(lgbm)

In [None]:
# Feature Importance using shap package 
lgbm.params['objective'] = 'binary'
shap_values = shap.TreeExplainer(lgbm).shap_values(valid_x)
shap.summary_plot(shap_values, valid_x)

Both feature-importance plots show that city contributes heavily to whether an employee is looking to change jobs. The next most important feature is company_size. The shap package is preferred for measuring feature importance because it preserves consistency and accuracy. You can read more about the shap package in the links below.

https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d
https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27
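
The ordering in a SHAP summary plot is typically based on the mean absolute SHAP value per feature. The sketch below reproduces that ranking from a small made-up matrix; `rank_by_mean_abs_shap` is a hypothetical helper, and it assumes a single 2-D array of shape (rows, features). Depending on the shap version, `shap_values` for a binary model may instead be a list with one array per class, in which case one of the arrays would need to be selected first.

```python
import numpy as np
import pandas as pd

def rank_by_mean_abs_shap(shap_matrix: np.ndarray, feature_names: list) -> pd.Series:
    """Hypothetical helper: mean |SHAP| per feature, largest first."""
    importance = np.abs(shap_matrix).mean(axis=0)
    return pd.Series(importance, index=feature_names).sort_values(ascending=False)

# Made-up SHAP values: 3 rows, 3 features.
toy_shap = np.array([
    [0.40, -0.10, 0.02],
    [-0.30, 0.20, -0.01],
    [0.50, -0.15, 0.03],
])
ranking = rank_by_mean_abs_shap(toy_shap, ["city", "company_size", "training_hours"])
print(ranking)
```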

### Predictions for aug_test.csv

In [None]:
# Predictions for aug_test.csv
predict = lgbm.predict(X_aug_test_transform.drop('enrollee_id',axis = 'columns')) 
submission = pd.DataFrame({'enrollee_id':X_aug_test_transform['enrollee_id'],'target':predict})
display(submission)

### Submit Predictions

In [None]:
## Submit Predictions
submission.to_csv('submission.csv',index=False)

## Conclusions

LightGBM is a great ML algorithm that handles categorical features and missing values.

This is a great dataset to work on, and a lot of knowledge can be gained from working with it.

Researching and reading other Kaggle notebooks is essential for becoming a better data scientist.

## Challenges

LightGBM has many parameters, and other methods can be utilized to tune them better. This is my first time using LightGBM, so mistakes might have occurred.

Working with categorical features is difficult, especially with One-Hot Encoding, which leads to a messy dataframe and longer computation time. This is why I opted for Label Encoding and LightGBM.
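
The difference in dataframe width can be seen on a small example. The sketch below compares one-hot encoding with integer codes on a made-up frame; the column values are illustrative, and pandas category codes are used as a stand-in for the label encoder.

```python
import pandas as pd

# Toy frame with made-up values: 4 distinct cities, 3 distinct company types.
toy = pd.DataFrame({
    "city": ["city_a", "city_b", "city_c", "city_d", "city_a"],
    "company_type": ["Pvt Ltd", "NGO", "Pvt Ltd", "Public Sector", "NGO"],
})

# One-hot encoding creates one column per distinct level.
one_hot = pd.get_dummies(toy)
# Integer codes keep one column per original feature.
label_encoded = toy.apply(lambda col: col.astype("category").cat.codes)

print(one_hot.shape)
print(label_encoded.shape)
```

With a high-cardinality column such as city (123 unique values in aug_train), the one-hot frame would grow by over a hundred columns, while the integer-coded frame keeps its original width.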

### Closing Remarks
Please comment and like the notebook if it is of use to you! Have a wonderful year!