# Preamble

This Kaggle notebook follows up on an exploratory analysis conducted in a separate Power BI dashboard. That analysis showed that no single variable on its own indicates whether a candidate will quit after training. **Based on this, if the company wants to predict whether a person will quit after completing their training courses, it needs a more sophisticated approach than inspecting the values of individual fields. This motivates developing and validating a machine learning model that uses only information about the candidate that would be known before they are hired.** If such a model is successful, the company could use it to better predict whether a candidate will quit after completing their training courses.

*Since it was already completed in the Power BI dashboard, this notebook doesn't include any exploratory data visualization. Instead, the notebook jumps right into the necessary data transformations and machine learning model development.*

## Illustrated concepts
- Ordinal encoding
- One-hot encoding
- Hyperparameter grid search for a machine learning model
- Cross-validation 

# Imports, Data Encoding
Data cleaning consists of dropping rows that contain NaN values and encoding the categorical data. The binary field "relevent_experience" is mapped to 0/1. Ordinal encoding is used for "gender", "education_level", "enrolled_university", and "company_size". Fields that mix numerical and text values (e.g. "20" vs. ">20" in "experience") are made strictly numerical by replacing each bounded value with its bound (">20" becomes 20, "<1" becomes 1); in "last_new_job", "never" is coded as 100. The categorical variables "major_discipline" and "company_type" are one-hot encoded. All other variables are dropped: "city" has high cardinality, "training_hours" would only be known after hiring, "target" is the label, and "enrollee_id" is a person-specific identifier that won't help with predictions.

In [None]:
# imports and data loading. remove rows that contain NaN values
import numpy as np 
import pandas as pd 
from xgboost import XGBClassifier
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

hr_temp= pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
hr= hr_temp.dropna()
print(np.shape(hr))
hr.head()

In [None]:
# for categorical fields with only 2 possible values, convert those to be numerical (i.e. value of 0/1)
hr_num= hr.copy()
hr_num.loc[hr['relevent_experience'] == 'No relevent experience', 'relevent_experience'] = 0
hr_num.loc[hr['relevent_experience'] == 'Has relevent experience', 'relevent_experience'] = 1
hr_num['relevent_experience']= hr_num['relevent_experience'].astype(int)

# for categorical fields with a mix of numerical and text values, convert the text values to numbers
hr_num.loc[hr['experience'] == '>20', 'experience']= 20
hr_num.loc[hr['experience'] == '<1', 'experience']= 1
hr_num['experience']= hr_num['experience'].astype(int)

hr_num.loc[hr['last_new_job'] == '>4', 'last_new_job']= 4
hr_num.loc[hr['last_new_job']== 'never', 'last_new_job']= 100
hr_num['last_new_job']= hr_num['last_new_job'].astype(int)

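# use ordinal encoding for gender, education_level, enrolled_university, company_size
# cats lists the categories in the order they should be numbered (0, 1, 2, ...)
def ord_encode(df, col_name, cats):
    encoder= OrdinalEncoder(categories= [cats], dtype= int)
    # fit_transform takes a 2D input and returns a 2D array, so flatten it back to a single column
    df[col_name]= encoder.fit_transform(df[[col_name]]).ravel()
    return df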
hr_num= ord_encode(hr_num, 'gender', ['Male', 'Female', 'Other'])
hr_num= ord_encode(hr_num, 'education_level', ['Graduate', 'Masters', 'Phd'])
hr_num= ord_encode(hr_num, 'enrolled_university', ['no_enrollment', 'Part time course', 'Full time course'])
hr_num= ord_encode(hr_num, 'company_size', ['<10', '10/49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+'])

# use one-hot encoding for the remaining categorical variables: major_discipline and company_type
# first specify which columns are the features and which column is the labels
X= hr_num.drop(['city','enrollee_id', 'training_hours', 'target'], axis= 1)
Y= hr_num['target'].astype(int)

# define function that takes the names of the categorical columns to be one-hot encoded and the original feature matrix
# function returns a strictly numeric feature matrix (i.e. categorical columns are one-hot encoded and the original categorical columns are removed)
def ohe_features(cat_cols, X):
    ohe_transform= ColumnTransformer(transformers= [('cat', OneHotEncoder(), cat_cols)], sparse_threshold= 0)
    # fit once and reuse the result; get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
    encoded= ohe_transform.fit_transform(X)
    ohe_df= pd.DataFrame(encoded, columns= ohe_transform.get_feature_names_out(), index= X.index)
    # keep the non-categorical columns and append the one-hot columns
    return pd.concat([X.drop(cat_cols, axis= 1), ohe_df], axis= 1)

cat_cols= ['major_discipline', 'company_type']
X= ohe_features(cat_cols, X)

# Hyperparameter Grid Search

Most machine learning models have settings ("hyperparameters") that are not learned from the data but are instead chosen by the person implementing the model. To decide which values to assign to these hyperparameters, we run a hyperparameter grid search: we try every combination of a set of candidate values and keep the combination that gives the best cross-validated score. The hyperparameters differ depending on the model. Here we use a decision tree and try different values for the maximum depth of the tree, the class weight, and the minimum number of samples needed to split an internal node.
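
To get a feel for how much work a grid search does, the sketch below counts the combinations for the three decision tree hyperparameters searched in this notebook. It uses scikit-learn's `ParameterGrid` helper only for counting; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for the three decision tree hyperparameters
grid = {
    'max_depth': [None, 2, 5, 10, 15],
    'class_weight': [None, 'balanced'],
    'min_samples_split': [2, 10, 100, 500],
}

combos = list(ParameterGrid(grid))   # every combination of the values above
n_combos = len(combos)               # 5 * 2 * 4 combinations
n_folds = 10                         # 10-fold cross-validation
n_fits = n_combos * n_folds          # models trained during the search

print(n_combos, n_fits)  # 40 400
```

Each extra candidate value multiplies the number of fits, so grids are usually kept small.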

*The hyperparameter grid search should only be run on the training set! For this particular dataset, there are two separate files for the training and test sets, so we don't need to split the data ourselves.*
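
The same principle applies inside cross-validation: any step that is fit to data can be placed in a scikit-learn `Pipeline`, so that each fold fits it on that fold's training portion only. The sketch below is one possible way to do this, not the approach this notebook takes. The toy data and the names `toy_X`, `toy_y`, `pipe`, and `toy_search` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): the label depends only on experience
rng = np.random.default_rng(24)
n = 200
toy_X = pd.DataFrame({
    'experience': rng.integers(0, 21, size=n),
    'company_type': rng.choice(['Pvt Ltd', 'NGO', 'Other'], size=n),
})
toy_y = (toy_X['experience'] < 5).astype(int)

# One-hot encode the categorical column and pass the numeric column through unchanged
pre = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['company_type'])],
    remainder='passthrough',
)
pipe = Pipeline([('encode', pre), ('tree', DecisionTreeClassifier(random_state=24))])

# Pipeline hyperparameters are addressed as <step name>__<parameter name>
param_grid = {'tree__max_depth': [2, 5], 'tree__min_samples_split': [2, 10]}
toy_search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)
```

For encoders with a fixed category list the practical difference is small, but the pattern matters more for steps such as scaling or imputation that learn statistics from the data.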

In [None]:
# define decision tree model, the different hyperparameter settings to test and the grid search
# random_state is fixed on the model itself because it is not a setting we are searching over
dt= DecisionTreeClassifier(random_state= 24)
dt_grid= {'max_depth': [None, 2, 5, 10, 15], 'class_weight': [None, 'balanced'], 'min_samples_split': [2, 10, 100, 500]}
search= GridSearchCV(dt, dt_grid, scoring= 'roc_auc', n_jobs= -1, cv= 10)

# run the grid search and get the best hyperparameter settings (the settings that give the best ROC-AUC score)
search.fit(X, Y)
print('Best hyperparameter settings:', search.best_params_)
print('Corresponding ROC-AUC score (avg of 10-fold CV):', search.best_score_)
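
To help interpret the ROC-AUC value printed above, here is a small sketch on hand-made labels and scores (hypothetical numbers). ROC-AUC is the fraction of (negative, positive) pairs in which the positive example receives the higher score, with ties counted as half.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Every positive is scored above every negative
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# Identical scores carry no information about the class
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

# One of the four (negative, positive) pairs is ranked the wrong way round
partial = roc_auc_score(y_true, [0.1, 0.6, 0.4, 0.9])

print(perfect, uninformative, partial)  # 1.0 0.5 0.75
```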

An ROC-AUC score of 0.76 is not very good: 1 is a perfect score, meaning the model has no trouble distinguishing between the two classes, while 0.5 is what random guessing would achieve. We will try one more model to see if we can improve our cross-validation score before we move on to evaluating the test set. XGBoost is an ensemble method that combines the predictions of many weak learners (e.g. shallow decision trees) to produce the final prediction for an observation. Here we won't run a grid search to determine the best hyperparameter settings; we will just use arbitrary settings and run our cross-validation.

In [None]:
# XGBoost classifier with arbitrary (untuned) settings; reg_alpha is the L1 regularization strength
xgb_model= XGBClassifier(objective= 'binary:logistic', random_state= 24, n_estimators= 100, max_depth= 3, learning_rate= 0.1, reg_alpha= 10)
xgb_cv= cross_validate(xgb_model, X, Y, n_jobs= -1, scoring= 'roc_auc', cv= 10)
print('XGBoost model ROC-AUC score (avg of 10-fold CV):', np.mean(xgb_cv['test_score']))

We can see that the XGBoost model performs slightly better than the decision tree model. From here, we can continue to try out different models until we achieve an ROC-AUC score we are happy with, and then use our final model to make and evaluate predictions on the test set.
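
Trying out further models can be organized as a loop that scores every candidate on the same cross-validation folds. The sketch below is one way to do that under stated assumptions: the data is synthetic (`make_classification`), and scikit-learn's `GradientBoostingClassifier` and `RandomForestClassifier` are stand-ins chosen for illustration, not models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary classification data (hypothetical stand-in for X, Y)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=24,
)

# Fixed folds so that every candidate is scored on exactly the same splits
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=24)

candidates = {
    'decision_tree': DecisionTreeClassifier(max_depth=5, random_state=24),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=24),
    'gradient_boosting': GradientBoostingClassifier(random_state=24),
}

cv_scores = {}
for name, model in candidates.items():
    result = cross_validate(model, X_demo, y_demo, scoring='roc_auc', cv=folds)
    cv_scores[name] = float(np.mean(result['test_score']))

# Print the candidates from best to worst mean ROC-AUC
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

Only the model chosen this way should then be evaluated on the test set, and only once, so that the test score remains an honest estimate.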

# Conclusions
Recognizing that the standalone variables in the HR Analytics dataset are not good indicators of whether a candidate will quit, this notebook aimed to develop and train two machine learning models with the goal of predicting whether a candidate will quit. Although the performance of the models was not outstanding, this notebook provides examples of how to implement ordinal encoding, one-hot encoding, a hyperparameter grid search, and cross-validation of a model. For more examples of my work, including feature selection techniques, check out the links below:

https://www.kaggle.com/valbauman/random-forests-for-thermal-comfort

https://www.kaggle.com/valbauman/feature-selection-for-heart-disease-prediction

(don't get me wrong, I love implementing ML models as much as the next person, but data wrangling and cleaning before implementing an ML model is just as important!)