# HR Analytics Job Prediction

This notebook walks through a workflow that uses several Python-based machine learning models to predict whether a person will leave the company or continue to work there.

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

Given a set of parameters, can we predict whether a person will leave the company or continue to work there?

# 2. Data

https://www.kaggle.com/mfaisalqureshi/hr-analytics-and-job-prediction

## Context

HR data analytics: this dataset contains information about employees who worked in a company.

## Content

This dataset contains the columns Satisfactory Level, Number of Project, Average Monthly Hours, Time Spend Company, Promotion Last 5 Years, Department, and Salary.

## Acknowledgements

You can download, copy, and share this dataset for analysis and for predicting employee behaviour.

## Inspiration

Answering the following questions would be worthwhile:
1. Do Exploratory Data analysis to figure out which variables have a direct and clear impact on employee retention (i.e. whether they leave the company or continue to work)
2. Plot bar charts showing the impact of employee salaries on retention
3. Plot bar charts showing a correlation between department and employee retention
4. Now build a logistic regression model using variables that were narrowed down in step 1
5. Measure the accuracy of the model
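
As a rough illustration of the first three steps, the share of leavers per category can be computed with a group-by before plotting anything. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Kaggle data); only the column names `salary` and `left` match the dataset.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself reads HR_comma_sep.csv instead.
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "medium", "medium", "high"],
    "left":   [1,     1,     0,     1,        0,        0],
})

# The mean of the 0/1 `left` flag per group is the share of employees who left.
leave_rate = toy.groupby("salary")["left"].mean().sort_values(ascending=False)
print(leave_rate)
```

The same pattern should carry over to `Department` or any other categorical column, and the resulting series can be passed straight to a bar chart.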

# 3. Evaluation

We will build a classification model, mainly a logistic regression model (we will also try other models), and score it with classification metrics to check its performance.
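
To make "classification metrics" concrete, the sketch below computes accuracy, precision, recall, and F1 on a handful of made-up labels. `y_true` and `y_pred` are illustrative values only, not results from this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = left the company, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # share of predicted leavers who really left
recall = recall_score(y_true, y_pred)        # share of real leavers the model caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```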

# 4. Features

## Inputs / Features:

* Satisfactory Levels
* Number Project
* Average Monthly Hour
* Time Spent
* Promotion last 5 years
* Salary

## Output / label:
* left


## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Google Drive
# df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML Self-Projects/HR Analytics Job Prediction/HR_comma_sep.csv')
# Local
# df = pd.read_csv('HR_comma_sep.csv')

# Kaggle
df = pd.read_csv('/kaggle/input/hr-analytics-and-job-prediction/HR_comma_sep.csv')
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

## Data Exploration (Exploratory Data Analysis, EDA)

In [None]:
df

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of employees who left vs stayed')
sns.countplot(data=df, x='left');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of salary vs left')
sns.countplot(data=df, x='salary', hue='left');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of Department vs left')
sns.countplot(data=df, x='Department', hue='left');

In [None]:
plt.figure(figsize=(20,20))
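# Correlate numeric columns only; 'Department' and 'salary' are still strings at this point
sns.heatmap(data=df.select_dtypes(include='number').corr(), annot=True);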

In [None]:
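# Correlation of each numeric feature with 'left'; [:-1] drops the self-correlation of 1.0
df.select_dtypes(include='number').corr()['left'].sort_values()[:-1]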

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call is not needed
sns.pairplot(data=df, hue='left');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Satisfaction level vs time spent at company, coloured by left')
sns.scatterplot(data=df, x='satisfaction_level',y='time_spend_company', hue='left',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of Work accident vs left')
sns.countplot(data=df, x='Work_accident', hue='left');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of number of projects vs left')
sns.countplot(data=df, x='number_project', hue='left');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of promotion in last 5 years vs left')
sns.countplot(data=df, x='promotion_last_5years', hue='left');

# 5. Modelling

In [None]:
df = pd.get_dummies(df, drop_first=True)

In [None]:
X = df.drop('left', axis=1)
y = df['left']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Model Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier

## Baseline Modelling

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its test-set accuracy, sorted ascending."""
    np.random.seed(42)

    model_scores = {}

    for name, model in models.items():
        model.fit(X_train, y_train)
        # For classifiers, .score() returns mean accuracy
        model_scores[name] = model.score(X_test, y_test)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')

    return model_scores

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(),
          'XGBRFClassifier': XGBRFClassifier()}

In [None]:
baseline_model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores.sort_values('Score')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores.sort_values('Score').T)
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

We will tune the hyperparameters of the top three models and, as the task requires, also include logistic regression to compare the scores. Their baseline accuracy scores are:

* LogisticRegression: 0.786000
* GradientBoostingClassifier: 0.972444
* DecisionTreeClassifier: 0.973556
* RandomForestClassifier: 0.986889

## Hyperparameter Tuning via Randomized Search CV

As the labels in the dataset are imbalanced, we will score the models using the F1 score.
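
A small illustration of why accuracy can mislead on imbalanced labels: on made-up data where only 10% of samples are positive, a model that always predicts the majority class scores high accuracy yet an F1 of zero. The arrays below are hypothetical and only assume a similar kind of imbalance.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 90 employees stayed (0), 10 left (1).
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

majority_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positives are predicted.
majority_f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"accuracy={majority_accuracy:.2f}, f1={majority_f1:.2f}")
```

Because F1 combines precision and recall on the positive class, it penalises a model that never identifies a leaver, which accuracy alone does not.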

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score

In [None]:
def randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_rs_scores = {}
    model_rs_best_param = {}
    
    for name, model in models.items():
        rs_model = RandomizedSearchCV(model,
                                      param_distributions=params[name],
                                      scoring='f1',
                                      cv=5,
                                      n_iter=20,
                                      n_jobs=-1,
                                      verbose=2)
        rs_model.fit(X_train,y_train)
        y_pred = rs_model.predict(X_test)
        model_rs_scores[name] = f1_score(y_test,y_pred)
        model_rs_best_param[name] = rs_model.best_params_
        
    return model_rs_scores, model_rs_best_param

### RS Model 1

In [None]:
models = {'LogisticRegression' : LogisticRegression(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          }

params = {'LogisticRegression': {'C': [0.001,0.01,0.1,1.0,10,100],
                                 'penalty': ['none', 'l1', 'l2', 'elasticnet'],
                                 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
          'GradientBoostingClassifier' : {'loss': ['deviance', 'exponential'],
                                          'learning_rate': [0.001,0.01,0.1,1.0],
                                          'n_estimators': [20,50,100,200,400],
                                          'criterion': ['friedman_mse', 'mse'],
                                          'max_depth' : [2,3,6,10,20],
                                          'ccp_alpha' : [0.0,0.001,0.01,0.1,1]
                                          },
          'DecisionTreeClassifier' : {'criterion': ['gini', 'entropy'],
                                      'max_depth': [None, 3,5,10,20,50],
                                      'max_leaf_nodes': [None, 3,5,10,20,50],
                                      'ccp_alpha' : [0.0,0.001,0.01,0.1,1]
                                      },
          'RandomForestClassifier': {'n_estimators': [20,50,100,200,400],
                                     'criterion': ['gini', 'entropy'],
                                     'max_depth': [None, 2,10,50,100],
                                     'bootstrap': [True, False],
                                     'oob_score': [True, False],
                                     'ccp_alpha': [0.1,0.01,0.001],
                                     },
          }

In [None]:
model_rs_scores_1, model_rs_best_param_1 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores_1

In [None]:
model_rs_best_param_1

### RS Model 2

In [None]:
params = {'LogisticRegression': {'C': [0.1,0.2,0.4,0.8,1],
                                 'penalty': ['none'],
                                 'solver': ['saga']},
          'GradientBoostingClassifier' : {'loss': ['exponential'],
                                          'learning_rate': [0.01,0.02,0.05],
                                          'n_estimators': [150,200,250,300],
                                          'criterion': ['friedman_mse'],
                                          'max_depth' : [15,20,30,50],
                                          'ccp_alpha' : [0.0]
                                          },
          'DecisionTreeClassifier' : {'criterion': ['gini', 'entropy'],
                                      'max_depth': [40,50,60,70],
                                      'max_leaf_nodes': [30,50,60,100],
                                      'ccp_alpha' : [0.0]
                                      },
          'RandomForestClassifier': {'n_estimators': [10,15,20,25,30],
                                     'criterion': ['entropy'],
                                     'max_depth': [5,10,20,25],
                                     'bootstrap': [True],
                                     'oob_score': [True],
                                     'ccp_alpha': [0.0001,0.001,0.005],
                                     },
          }

In [None]:
model_rs_scores_2, model_rs_best_param_2 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores_2

In [None]:
model_rs_best_param_2

### RS Model 3

In [None]:
params = {'LogisticRegression': {'C': [0.01,0.0001,0.1],
                                 'penalty': ['none'],
                                 'solver': ['saga']},
          'GradientBoostingClassifier' : {'loss': ['exponential'],
                                          'learning_rate': [0.02,0.03,0.04],
                                          'n_estimators': [275,300,400],
                                          'criterion': ['friedman_mse'],
                                          'max_depth' : [11,12,13,14,15],
                                          'ccp_alpha' : [0.0]
                                          },
          'DecisionTreeClassifier' : {'criterion': ['gini'],
                                      'max_depth': [65,70,75,80],
                                      'max_leaf_nodes': [25,30,35,40],
                                      'ccp_alpha' : [0.0]
                                      },
          'RandomForestClassifier': {'n_estimators': [18,20,22,34],
                                     'criterion': ['entropy'],
                                     'max_depth': [18,20,22],
                                     'bootstrap': [True],
                                     'oob_score': [True],
                                     'ccp_alpha': [0.0001,0.001,0.005],
                                     },
          }

In [None]:
model_rs_scores_3, model_rs_best_param_3 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores_3

In [None]:
model_rs_best_param_3

Since GradientBoostingClassifier performs best, we will run a grid search on it to tune its hyperparameters further.

## Grid Search CV

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
def gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_gs_scores = {}
    model_gs_best_param = {}
    
    for name, model in models.items():
        gs_model = GridSearchCV(model,
                                param_grid=params[name],
                                scoring='f1',
                                n_jobs=-1,
                                cv=5,
                                verbose=2)
        
        gs_model.fit(X_train,y_train)
        y_pred = gs_model.predict(X_test)
        model_gs_scores[name] = f1_score(y_test,y_pred)
        model_gs_best_param[name] = gs_model.best_params_

    model_gs_scores = pd.DataFrame(model_gs_scores, index=['F1'])
    model_gs_scores = model_gs_scores.transpose().sort_values('F1')
        
    return model_gs_scores, model_gs_best_param

### GS Model 1

In [None]:
models = {'GradientBoostingClassifier': GradientBoostingClassifier(),
          }

params = {'GradientBoostingClassifier' : {'loss': ['exponential'],
                                          'learning_rate': [0.04,0.05,0.06,0.07],
                                          'n_estimators': [350,400,450,500],
                                          'criterion': ['friedman_mse'],
                                          'max_depth' : [12],
                                          'ccp_alpha' : [0.0]
                                          },
          }

In [None]:
model_gs_scores_1, model_gs_best_param_1 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_1

In [None]:
model_gs_best_param_1

### GS Model 2

In [None]:
params = {'GradientBoostingClassifier' : {'loss': ['exponential'],
                                          'learning_rate': [0.07,0.08,0.09],
                                          'n_estimators': [500, 550, 600],
                                          'criterion': ['friedman_mse'],
                                          'max_depth' : [12],
                                          'ccp_alpha' : [0.0]
                                          },
          }

In [None]:
model_gs_scores_2, model_gs_best_param_2 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_2

In [None]:
model_gs_best_param_2

# 6. Model Evaluation

In [None]:
model = GradientBoostingClassifier(ccp_alpha=0.0, 
                                   criterion='friedman_mse', 
                                   learning_rate=0.08, 
                                   loss= 'exponential',
                                   max_depth=12,
                                   n_estimators=500)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

In [None]:
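# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below (available since 1.0) replace them
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay

## Classification Report

In [None]:
print(classification_report(y_test, y_preds))

## Confusion Matrix

In [None]:
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test);

## ROC Curve

In [None]:
RocCurveDisplay.from_estimator(model, X_test, y_test);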

## Calculate evaluation metrics using cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
def get_cv_score(model, X, y, cv=5):
    """Cross-validate the model once per metric and return the mean scores."""
    # Pass the cv argument through instead of hard-coding the number of folds
    cv_accuracy = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(f'Cross Validation accuracy Scores: {cv_accuracy}')
    print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')

    cv_precision = cross_val_score(model, X, y, cv=cv, scoring='precision')
    print(f'Cross Validation precision Scores: {cv_precision}')
    print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')

    cv_recall = cross_val_score(model, X, y, cv=cv, scoring='recall')
    print(f'Cross Validation recall Scores: {cv_recall}')
    print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')

    cv_f1 = cross_val_score(model, X, y, cv=cv, scoring='f1')
    print(f'Cross Validation f1 Scores: {cv_f1}')
    print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')

    # The f1 entry must use cv_f1, not cv_recall
    cv_metrics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
                               'Precision': cv_precision.mean(),
                               'Recall': cv_recall.mean(),
                               'f1': cv_f1.mean()}, index=[0])

    return cv_metrics

In [None]:
cv_metrics = get_cv_score(model, X, y, cv=5)

In [None]:
cv_metrics

In [None]:
plt.figure(figsize=(20,10))
plt.title('CV Scores')
sns.barplot(data=cv_metrics);

## Feature Importances

In [None]:
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns)

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances.sort_values(0).T);

With the GradientBoostingClassifier model, we obtained the following cross-validated scores:
* Accuracy: 0.990466
* Precision: 0.987063
* Recall: 0.97451
* f1: 0.97451