### HR - Attrition Analytics -  Exploratory Analysis & Predictive Modeling
> Human resources are a critical resource of any organization. Organizations spend a huge amount of time and money to hire <br>
> and nurture their employees. It is a huge loss for a company if employees leave, especially key employees.  <br>
> If HR can predict whether employees are at risk of leaving the company, it can identify the attrition  <br>
> risks, understand them, and provide the necessary support to retain those employees, or do preventive hiring to minimize the  <br>
> impact on the organization.

### DATA ATTRIBUTES

satisfaction_level: Employee satisfaction level <br>
last_evaluation: Last evaluation  <br>
number_project: Number of projects  <br>
average_montly_hours: Average monthly hours <br>
time_spend_company: Time spent at the company <br>
Work_accident: Whether they have had a work accident <br>
promotion_last_5years: Whether they have had a promotion in the last 5 years <br>
department: Department <br>
salary: Salary <br>
left: Whether the employee has left <br>

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the data
hr_df = pd.read_csv( '/kaggle/input/hr-data-for-analytics/HR_comma_sep.csv' )

In [None]:
hr_df.columns

In [None]:
hr_df.info()

In [None]:
#missings
hr_df.isnull().any().sum()

In [None]:
hr_df.describe().T

In [None]:
hr_df.tail()

The summary statistics for Work_accident, left and promotion_last_5years do not make sense, as they are categorical variables.

### PREDICTIVE MODEL: Build a model to predict if an employee will leave the company

In [None]:
# Encoding Categorical Features
numerical_features = ['satisfaction_level', 'last_evaluation', 'number_project',
     'average_montly_hours', 'time_spend_company']

categorical_features = ['Work_accident','promotion_last_5years', 'sales', 'salary']

In [None]:
# A utility function to create dummy variables (the first level is dropped as the baseline)
def create_dummies( df, colname ):
    col_dummies = pd.get_dummies(df[colname], prefix=colname)
    col_dummies.drop(col_dummies.columns[0], axis=1, inplace=True)
    df = pd.concat([df, col_dummies], axis=1)
    df.drop( colname, axis = 1, inplace = True )
    return df

In [None]:
for c_feature in categorical_features:
    hr_df = create_dummies( hr_df, c_feature )

In [None]:
hr_df.head()

In [None]:
#Splitting the data

feature_columns = hr_df.columns.difference( ['left'] )
#feature_columns1 = feature_columns

In [None]:
feature_columns

In [None]:
from sklearn.model_selection import train_test_split


train_X, test_X, train_y, test_y = train_test_split( hr_df[feature_columns],
                                                  hr_df['left'],
                                                  test_size = 0.3,
                                                  random_state = 123 )

In [None]:
# Building Models
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit( train_X, train_y)

In [None]:
logreg.predict(train_X)   # by default, it uses a cut-off of 0.5

In [None]:
list( zip( feature_columns, logreg.coef_[0] ) )

In [None]:
logreg.intercept_

In [None]:
#Predicting the test cases
hr_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': logreg.predict( test_X ) } )

In [None]:

hr_test_pred = hr_test_pred.reset_index()

In [None]:
#Comparing the predictions with actual test data
hr_test_pred.sample( n = 10 )

In [None]:
# Creating a confusion matrix

from sklearn import metrics

# labels must be passed by keyword in recent scikit-learn versions
cm = metrics.confusion_matrix( hr_test_pred.actual,
                            hr_test_pred.predicted, labels = [1,0] )
cm

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [None]:
sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')

In [None]:
score = metrics.accuracy_score( hr_test_pred.actual, hr_test_pred.predicted )
round( float(score), 2 )

Overall test accuracy is 78%, but accuracy is not a good measure here. It looks high because most employees in the data did not leave, and the model predicts most of them as "no left". <br>
The objective of the model is to identify the people who will leave, so that the company can intervene and act.<br>
The model may miss many leavers because the default cut-off of 0.5 labels an employee as leaving only when the predicted probability of leaving exceeds 0.5.

For example, for one employee the model predicts a probability of leaving of only 0.027, which is very low.
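To see why accuracy alone can mislead when most employees stay, here is a minimal sketch using made-up labels (not the HR data). The arrays `actual` and `predicted` are hypothetical. The "model" simply predicts that nobody leaves.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 8 employees stayed (0) and 2 left (1)
actual = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A trivial model that predicts "stays" for everyone
predicted = np.zeros_like(actual)

accuracy = metrics.accuracy_score(actual, predicted)
# Recall for the "left" class: share of actual leavers the model caught
recall_left = metrics.recall_score(actual, predicted, pos_label=1)

print(accuracy)     # 0.8
print(recall_left)  # 0.0
```

Accuracy is 80% even though not a single leaver is identified, which is why recall on the "left" class and the ROC curve are more informative for this problem.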

In [None]:
# How good is the model?
predict_proba_df = pd.DataFrame( logreg.predict_proba( test_X ) )
predict_proba_df.head()

In [None]:
hr_test_pred = pd.concat( [hr_test_pred, predict_proba_df], axis = 1 )

In [None]:
hr_test_pred.columns = ['index', 'actual', 'predicted', 'Left_0', 'Left_1']

In [None]:
auc_score = metrics.roc_auc_score( hr_test_pred.actual, hr_test_pred.Left_1  )
round( float( auc_score ), 2 )

In [None]:
sn.distplot( hr_test_pred[hr_test_pred.actual == 1]["Left_1"], color = 'b' )
sn.distplot( hr_test_pred[hr_test_pred.actual == 0]["Left_1"], color = 'g' )

In [None]:
# Finding the optimal cutoff probability
fpr, tpr, thresholds = metrics.roc_curve( hr_test_pred.actual,
                                     hr_test_pred.Left_1,
                                     drop_intermediate = False )

plt.figure(figsize=(6, 4))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

In [None]:
print(thresholds[0:10])
print(fpr[0:10])
print(tpr[0:10])

In [None]:
tpr[np.abs(tpr - 0.7).argmin()]

In [None]:
cutoff_prob = thresholds[(np.abs(tpr - 0.7)).argmin()]

In [None]:
round( float( cutoff_prob ), 2 )

In [None]:
#Predicting with new cut-off probability
hr_test_pred['new_labels'] = hr_test_pred['Left_1'].map( lambda x: 1 if x >= 0.3 else 0 )

In [None]:
metrics.accuracy_score( hr_test_pred.actual, hr_test_pred['new_labels'])

In [None]:
hr_test_pred[0:10]

In [None]:

cm = metrics.confusion_matrix( hr_test_pred.actual,
                          hr_test_pred.new_labels, labels = [1,0] )
sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')

### Building Decision Tree Model

In [None]:
import sklearn.tree as dt

In [None]:
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

#### Fine Tuning the parameters

In [None]:
train_X.shape

In [None]:
param_grid = {'max_depth': np.arange(2, 12),
             'max_features': np.arange(10,18)}

In [None]:
train_y.shape

In [None]:
tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv = 10,verbose=1,n_jobs=-1)
tree.fit( train_X, train_y )

In [None]:
tree.best_score_

In [None]:
tree.best_estimator_

In [None]:
tree.best_params_

In [None]:
train_pred = tree.predict(train_X)

In [None]:
print(metrics.classification_report(train_y, train_pred))

In [None]:
test_pred = tree.predict(test_X)

In [None]:
print(metrics.classification_report(test_y, test_pred))

### Building Final Decision Tree Model

In [None]:
clf_tree = DecisionTreeClassifier( max_depth = 9, max_features=17)
clf_tree.fit( train_X, train_y )

## Feature Importance

In [None]:
train_X.columns

In [None]:
clf_tree.feature_importances_

In [None]:
list(zip(train_X.columns,clf_tree.feature_importances_ ))

In [None]:
tree_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': clf_tree.predict( test_X ) } )

In [None]:
tree_test_pred.sample( n = 10 )

In [None]:
metrics.accuracy_score( tree_test_pred.actual, tree_test_pred.predicted )

In [None]:
# Rows are true labels and columns are predicted labels, matching the axis labels below
tree_cm = metrics.confusion_matrix( tree_test_pred.actual,
                                 tree_test_pred.predicted,
                                 labels = [1,0] )
sn.heatmap(tree_cm, annot=True,
         fmt='.2f',
         xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label')

In [None]:
metrics.roc_auc_score( tree_test_pred.actual, tree_test_pred.predicted )

### Generate Rules from Decision Trees

#### To create a decision tree visualization graph.
- Install GraphViz (As per the OS and version you are using)
- pip install pydotplus
- Add the path to environmental variables
- Note: The notebook needs a restart.

In [None]:
import os     
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

In [None]:
#!pip install --upgrade pip
#!pip install pydotplus

In [None]:
# Exporting the tree in Graphviz DOT format (note: the file name below uses an .odt extension)
#export_graphviz( clf_tree,
#              out_file = "hr_tree.odt",
#              feature_names = train_X.columns )

In [None]:
# Converting the exported DOT file to a jpg image

#import pydotplus as pdot

#chd_tree_graph = pdot.graphviz.graph_from_dot_file( 'hr_tree.odt' )

In [None]:
#chd_tree_graph.write_jpg( 'hr_tree.jpg' )

In [None]:
# Viewing the image in the notebook (display the image)
#from IPython.display import Image
#Image(filename='hr_tree.jpg')

---
# Model Ensembles


> Ensemble methods combine multiple classifiers (using _model averaging_ or _voting_) which may differ in algorithms, input features, or input samples. Statistical analyses showed that ensemble methods yield better classification performances and are also less prone to overfitting. Different methods, e.g., bagging or boosting, are used to construct the final classification decision based on weighted votes.

## What is ensembling?

**Ensemble learning (or "ensembling")** is the process of combining several predictive models in order to produce a combined model that is more accurate than any individual model.

- **Regression:** take the average of the predictions
- **Classification:** take a vote and use the most common prediction, or take the average of the predicted probabilities

For ensembling to work well, the models must have the following characteristics:

- **Accurate:** they outperform the null model
- **Independent:** their predictions are generated using different processes

**The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.
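The effect of independence can be illustrated with a small simulation. This is a sketch under strong assumptions: five hypothetical binary classifiers that are each correct 70% of the time and make their mistakes fully independently, which real models rarely do. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_models, p_correct = 10_000, 5, 0.7

# correct[i, j] is True when hypothetical model j classifies observation i correctly;
# the models are simulated as fully independent of each other.
correct = rng.random((n_obs, n_models)) < p_correct

# Accuracy of one individual model
single_accuracy = correct[:, 0].mean()
# For a binary problem, the majority vote is right when more than half the models are right
vote_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(round(single_accuracy, 3))
print(round(vote_accuracy, 3))
```

In theory the majority vote here is right about 84% of the time (the binomial probability of at least three correct out of five), compared with 70% for any single model. The gain shrinks as the models' errors become correlated.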

There are two basic **methods for ensembling:**

- Manually ensemble your individual models
- Use a model that ensembles for you

---
Why are we learning about ensembling?

- Very popular method for improving the predictive performance of machine learning models

- Provides a foundation for understanding more sophisticated models

# Bagging

The primary weakness of **decision trees** is that they don't tend to have the best predictive accuracy. This is partially due to **high variance**, meaning that different splits in the training data can lead to very different trees.

**Bagging** is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees. Bagging is short for **bootstrap aggregation**, meaning the aggregation of bootstrap samples.

What is a **bootstrap sample**? A random sample with replacement:
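A minimal sketch of drawing one bootstrap sample with NumPy, using 20 hypothetical observation ids rather than the HR data:

```python
import numpy as np

rng = np.random.default_rng(1)
observations = np.arange(1, 21)  # 20 hypothetical observation ids

# Sample with replacement, same size as the original set
bootstrap = rng.choice(observations, size=observations.size, replace=True)
# Observations that were never drawn are "out-of-bag" for this sample
out_of_bag = np.setdiff1d(observations, bootstrap)

print(np.sort(bootstrap))
print(out_of_bag)
```

Some ids appear more than once in the bootstrap sample, and the ids that never appear form the out-of-bag set.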

---
**How does bagging work (for decision trees)?**

1. Grow B trees using B bootstrap samples from the training data.
2. Train each tree on its bootstrap sample and make predictions.
3. Combine the predictions:
    - Average the predictions for **regression trees**
    - Take a vote for **classification trees**

Notes:

- **Each bootstrap sample** should be the same size as the original training set.
- **B** should be a large enough value that the error seems to have "stabilized".
- The trees are **grown deep** so that they have low bias/high variance.

Bagging increases predictive accuracy by **reducing the variance**, similar to how cross-validation reduces the variance associated with a single train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.
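The three steps above can be written out by hand. The following is a sketch on synthetic data from `make_classification` (an assumption for illustration, not the HR data), with `B = 25` chosen arbitrarily. In practice `BaggingClassifier` does this for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the HR data set)
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
B = 25  # number of bootstrap samples / trees (odd, so votes cannot tie)
votes = np.zeros((B, len(y_te)), dtype=int)

for b in range(B):
    # 1. Draw a bootstrap sample the same size as the training set
    idx = rng.integers(0, len(y_tr), size=len(y_tr))
    # 2. Grow a deep (unpruned) tree on it and predict the test set
    tree_b = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    votes[b] = tree_b.predict(X_te)

# 3. Majority vote over the B trees (labels are 0/1)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_te).mean()
print(bagged_accuracy)
```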

In [None]:
import sklearn.ensemble as en

In [None]:
dir(en)

### Bagged decision trees (with B=100)

In [None]:
from sklearn.ensemble import BaggingClassifier

In [None]:
bagclm = BaggingClassifier(oob_score=True, n_estimators=100, verbose=0, n_jobs=-1)
bagclm.fit(train_X, train_y)

In [None]:
bagclm.predict(train_X)

In [None]:
bagclm.oob_score_

In [None]:
y_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': bagclm.predict( test_X) } )

In [None]:
print(metrics.accuracy_score( y_pred.actual, y_pred.predicted ))
print(metrics.roc_auc_score( y_pred.actual, y_pred.predicted ))

In [None]:
# Rows are true labels and columns are predicted labels, matching the axis labels below
tree_bg = metrics.confusion_matrix( y_pred.actual,
                                 y_pred.predicted,
                                 labels = [1,0] )
sn.heatmap(tree_bg, annot=True,
         fmt='.2f',
         xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label')

----
## Estimating out-of-sample error

For bagged models, out-of-sample error can be estimated without using **train/test split** or **cross-validation**!

On average, each bagged tree uses about **two-thirds** of the observations. For each tree, the **remaining observations** are called "out-of-bag" observations.

How to calculate **"out-of-bag error":**

1. For every observation in the training data, predict its response value using **only** the trees in which that observation was out-of-bag. Average those predictions (for regression) or take a vote (for classification).
2. Compare all predictions to the actual response values in order to compute the out-of-bag error.

When B is sufficiently large, the **out-of-bag error** is an accurate estimate of **out-of-sample error**.
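The two steps can be sketched by hand as follows. This assumes synthetic data from `make_classification` and an arbitrary `B = 50`; the variable names are illustrative. With scikit-learn's `BaggingClassifier(oob_score=True)` the same idea is available as `oob_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the HR data set)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
n = len(y)
rng = np.random.default_rng(0)
B = 50  # number of bagged trees

oob_votes = np.zeros(n)   # class-1 votes from trees where the observation was out-of-bag
oob_counts = np.zeros(n)  # number of trees where the observation was out-of-bag

for b in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap sample indices
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False              # True for out-of-bag observations
    tree_b = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    # Step 1: predict each observation only with trees that did not see it
    oob_votes[oob_mask] += tree_b.predict(X[oob_mask])
    oob_counts[oob_mask] += 1

# Step 2: majority vote per observation (ties go to class 0), then compare with the truth
covered = oob_counts > 0
oob_pred = (oob_votes[covered] / oob_counts[covered] > 0.5).astype(int)
oob_error = (oob_pred != y[covered]).mean()

# Average share of observations left out of each bootstrap sample
oob_fraction = oob_counts.mean() / B
print(round(oob_error, 3), round(oob_fraction, 3))
```

The out-of-bag share comes out close to one-third, consistent with each tree using about two-thirds of the observations.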

In [None]:
pargrid_bagging = {'n_estimators': [20,50,100,200,250,300,350,400]}

gscv_bagging = GridSearchCV(estimator=BaggingClassifier(), 
                        param_grid=pargrid_bagging, 
                        cv=5,
                        verbose=1, n_jobs=-1)

In [None]:
gscv_results = gscv_bagging.fit(train_X, train_y)

In [None]:
gscv_results.best_params_

In [None]:
gscv_results.best_score_

In [None]:
y_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': gscv_results.predict( test_X) } )

In [None]:

print(metrics.accuracy_score( y_pred.actual, gscv_results.predict( test_X)))
print(metrics.roc_auc_score( y_pred.actual, gscv_results.predict( test_X)))

In [None]:
#gscv_results.feature_importances_

## Estimating feature importance

Bagging increases **predictive accuracy**, but decreases **model interpretability** because it's no longer possible to visualize the tree to understand the importance of each feature.

However, we can still obtain an overall summary of **feature importance** from bagged models:

- **Bagged regression trees:** calculate the total amount that **MSE** is decreased due to splits over a given feature, averaged over all trees
- **Bagged classification trees:** calculate the total amount that **Gini index** is decreased due to splits over a given feature, averaged over all trees
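As a rough sketch of the classification case, the Gini-based importances of individually bagged trees can be averaged by hand. The data below is synthetic, and the loop is a hand-rolled stand-in for a bagging ensemble rather than the `BaggingClassifier` fitted earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 6 features, 3 of them informative
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
n = len(y)
rng = np.random.default_rng(0)
B = 50  # number of bagged trees

importances = []
for b in range(B):
    idx = rng.integers(0, n, size=n)  # bootstrap sample indices
    tree_b = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    # Normalized total Gini decrease per feature for this tree
    importances.append(tree_b.feature_importances_)

# Average over all trees to get the ensemble-level importance
bagged_importance = np.mean(importances, axis=0)
print(np.round(bagged_importance, 3))
```

Each tree's importances sum to 1, so the averaged vector does too and can be read as each feature's share of the total impurity reduction.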

# BUILDING RANDOM FOREST MODEL

Random Forests is a **slight variation of bagged trees** that often performs even better:

- Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
- However, when building each tree, each time a split is considered, a **random sample of m features** is chosen as split candidates from the **full set of p features**. The split is only allowed to use **one of those m features**.
    - A new random sample of features is chosen for **every single tree at every single split**.
    - For **classification**, m is typically chosen to be the square root of p.
    - For **regression**, m is typically chosen to be somewhere between p/3 and p.

What's the point?

- Suppose there is **one very strong feature** in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are **highly correlated**.
- Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
- By randomly leaving out candidate features from each split, **Random Forests "decorrelates" the trees**, such that the averaging process can reduce the variance of the resulting model.
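The effect of correlation on averaging can be made precise. As a sketch, assume the B tree predictions $T_1, \dots, T_B$ are identically distributed, each with variance $\sigma^2$, and that every pair has correlation $\rho$ (these symbols are introduced here for illustration only):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \frac{1}{B^2}\left[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Growing more trees shrinks only the second term. The first term, $\rho\,\sigma^2$, stays no matter how large B is, so lowering the correlation between trees is the only way to reduce it.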

### Tuning n_estimators

One important tuning parameter is **n_estimators**, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".

### Tuning max_features

The other important tuning parameter is **max_features**, which is the number of features that should be considered at each split.

## Comparing Random Forests with decision trees

**Advantages of Random Forests:**

- Performance is competitive with the best supervised learning methods
- Provides a more reliable estimate of feature importance
- Allows you to estimate out-of-sample error without using train/test split or cross-validation

**Disadvantages of Random Forests:**

- Less interpretable
- Slower to train
- Slower to predict

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
pargrid_rf = {'n_estimators': [50, 60, 70, 80, 90, 100],
                  'max_features': [5,6,7,8,9,10,11,12]}

#from sklearn.grid_search import GridSearchCV
gscv_rf = GridSearchCV(estimator=RandomForestClassifier(), 
                        param_grid=pargrid_rf, 
                        cv=10,
                        verbose=True, n_jobs=-1)

gscv_results = gscv_rf.fit(train_X, train_y)

In [None]:
gscv_results.best_params_

In [None]:
gscv_rf.best_score_

In [None]:
radm_clf = RandomForestClassifier(oob_score=True,n_estimators=80, max_features=7, n_jobs=-1)
radm_clf.fit( train_X, train_y )

In [None]:
radm_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': radm_clf.predict( test_X ) } )

In [None]:
print(metrics.accuracy_score( radm_test_pred.actual, radm_test_pred.predicted ))
print(metrics.roc_auc_score( radm_test_pred.actual, radm_test_pred.predicted ))

In [None]:
# Rows are true labels and columns are predicted labels, matching the axis labels below
tree_cm = metrics.confusion_matrix( radm_test_pred.actual,
                                 radm_test_pred.predicted,
                                 labels = [1,0] )
sn.heatmap(tree_cm, annot=True,
         fmt='.2f',
         xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label')

### Feature importance from the Random Forest Model

In [None]:
print(radm_clf.feature_importances_)
print(np.argsort(radm_clf.feature_importances_))

In [None]:
indices = np.argsort(radm_clf.feature_importances_)[::-1]

In [None]:
indices = np.argsort(radm_clf.feature_importances_)[::-1]
feature_rank = pd.DataFrame( columns = ['rank', 'feature', 'importance'] )
for f in range(train_X.shape[1]):
  feature_rank.loc[f] = [f+1,
                         train_X.columns[indices[f]],
                         radm_clf.feature_importances_[indices[f]]]
sn.barplot( y = 'feature', x = 'importance', data = feature_rank )

<b> Note: </b>
As per the model, the most important features that influence whether an employee leaves the company, in descending order, are

- satisfaction_level
- number_project
- time_spend_company
- last_evaluation
- average_montly_hours
- Work_accident

### Boosting

#### Ada Boosting

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
pargrid_ada = {'n_estimators': [100, 200,250,300,350,400],
               'learning_rate': [10 ** x for x in range(-1, 3)]}

In [None]:
from sklearn.model_selection import GridSearchCV
gscv_ada = GridSearchCV(estimator=AdaBoostClassifier(), 
                        param_grid=pargrid_ada, 
                        cv=5,
                        verbose=1, n_jobs=-1)

In [None]:
gscv_ada.fit(train_X, train_y)

In [None]:
gscv_ada.best_params_

In [None]:
gscv_ada.best_score_

In [None]:
clf_ada = gscv_ada.best_estimator_

In [None]:
ad=clf_ada.fit(train_X, train_y )

In [None]:
print(metrics.accuracy_score(test_y,ad.predict(test_X)))
print(metrics.roc_auc_score(test_y,ad.predict(test_X)))

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# Run the 10-fold cross-validation once and reuse the scores
ada_cv_scores = pd.Series(cross_val_score(clf_ada, test_X, test_y, cv=10))
print(ada_cv_scores)

print(ada_cv_scores.describe()[['min', 'mean', 'max', 'std']])

#### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
pargrid_gbm = {'n_estimators': [350,400,450,500],
               'learning_rate': [10 ** x for x in range(-3, 1)],
                'max_features': [5,6,7,8,9,10]}

In [None]:
from sklearn.model_selection import GridSearchCV
gscv_gbm = GridSearchCV(estimator=GradientBoostingClassifier(), 
                        param_grid=pargrid_gbm, 
                        cv=5,
                        verbose=True, n_jobs=-1)

In [None]:
gscv_gbm.fit(train_X, train_y)

In [None]:
gscv_gbm.best_params_

In [None]:
gbm = gscv_gbm.best_estimator_

In [None]:
gscv_gbm.best_score_

In [None]:
gbm.fit(train_X, train_y )

In [None]:
print(metrics.accuracy_score(test_y,gbm.predict(test_X)))
print(metrics.roc_auc_score(test_y,gbm.predict(test_X)))

In [None]:
# Run the 10-fold cross-validation once and reuse the scores
gbm_cv_scores = pd.Series(cross_val_score(gbm, test_X, test_y, cv=10))
print(gbm_cv_scores)
print(gbm_cv_scores.describe()[['min', 'mean', 'max']])

#### Extreme Gradient Boosting (XGBoost)

In [None]:
from xgboost import XGBClassifier

In [None]:
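# Note: 'max_features' is a scikit-learn tree parameter, not a native XGBoost one,
# so it may have no effect here; 'colsample_bytree' is the closest XGBoost analogue.
pargrid_xgbm = {'n_estimators': [200, 250, 300, 400, 500],
               'learning_rate': [10 ** x for x in range(-3, 1)],
                'max_features': [5,6,7,8,9,10]}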

In [None]:
#from sklearn.model_selection import GridSearchCV
gscv_xgbm = GridSearchCV(estimator=XGBClassifier(), 
                        param_grid=pargrid_xgbm, 
                        cv=5,
                        verbose=True, n_jobs=-1)

In [None]:
gscv_xgbm.fit(train_X, train_y)

In [None]:
gscv_xgbm.best_params_

In [None]:
xgbm = gscv_xgbm.best_estimator_

In [None]:
gscv_xgbm.best_score_

In [None]:
xgbm.fit(train_X, train_y)

In [None]:
print(metrics.accuracy_score(test_y,xgbm.predict(test_X)))
print(metrics.roc_auc_score(test_y,xgbm.predict(test_X)))

In [None]:
# Run the 10-fold cross-validation once and reuse the scores
xgbm_cv_scores = pd.Series(cross_val_score(xgbm, test_X, test_y, cv=10))
print(xgbm_cv_scores)

print(xgbm_cv_scores.describe()[['min', 'mean', 'max']])

### Heterogeneous Induction Algorithm - Voting Classifier

In [None]:
from sklearn.ensemble import VotingClassifier

In [None]:
voting_clf = VotingClassifier(estimators = [('logreg',logreg), ('radm_clf',radm_clf), ('xgbm',xgbm)], voting = 'hard')
voting_clf.fit(train_X, train_y)

In [None]:
print(metrics.accuracy_score(test_y,voting_clf.predict(test_X)))
print(metrics.roc_auc_score(test_y,voting_clf.predict(test_X)))