# Project description

## Dataset
The dataset has 14,999 records and 10 variables.

## Target
The dataset has a variable named churn. In this exercise, 1 denotes employees who left the company and 0 denotes employees who stayed.
The objective is to build a model that classifies employees as having left or stayed.

## Process
- Analyzing and transforming categorical variables
- Fitting a decision tree
- Understanding overfitting
- Analyzing accuracy metrics: precision, recall, ROC/AUC
- Hyperparameter tuning
- Feature importance

In [46]:
#!pip install matplotlib

In [63]:
# import libraries
import numpy as np
import pandas as pd
# split dataset for validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# metrics
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
# models
from sklearn.tree import DecisionTreeClassifier
# graphical tools
import matplotlib.pyplot as plt
from sklearn.tree import export_graphviz

In [14]:
# read the data file
data = pd.read_csv("./turnover.csv")
# check data info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   satisfaction          14999 non-null  float64
 1   evaluation            14999 non-null  float64
 2   number_of_projects    14999 non-null  int64  
 3   average_montly_hours  14999 non-null  int64  
 4   time_spend_company    14999 non-null  int64  
 5   work_accident         14999 non-null  int64  
 6   churn                 14999 non-null  int64  
 7   promotion             14999 non-null  int64  
 8   department            14999 non-null  object 
 9   salary                14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


There are two non-numeric (object) variables in our dataset: salary and department.

## Analyzing the dataset

Let's check these two and transform them to numerical values to make them more suitable for modeling.

In [6]:
# Print the unique values of the "department" column
print(data.department.unique())

# Print the unique values of the "salary" column
print(data.salary.unique())

['sales' 'accounting' 'hr' 'technical' 'support' 'management' 'IT'
 'product_mng' 'marketing' 'RandD']
['low' 'medium' 'high']


**Encoding categories**

The model needs some help to understand that it is dealing with categories. We will encode the categories of the salary variable, which is ordinal based on the values observed above.
This means that each level is encoded with a number that follows its order: 0 for low, 1 for medium, and 2 for high.

In [15]:
## SALARY variable
# Change the type of the "salary" column to categorical
data.salary = data.salary.astype('category')

# Provide the correct order of categories
data.salary = data.salary.cat.reorder_categories(['low', 'medium', 'high'])

# Encode categories
data.salary = data.salary.cat.codes

In [16]:
## DEPARTMENT variable
# Get dummies and save them inside a new DataFrame
departments = pd.get_dummies(data.department)

# Take a quick look at the first 5 rows of the new DataFrame called departments
departments.head()

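| | IT | RandD | accounting | hr | management | marketing | product_mng | sales | support | technical |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |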


In [17]:
# Drop the "accounting" column to avoid the dummy variable trap
departments = departments.drop("accounting", axis=1)

# Drop the old column "department" as you don't need it anymore
data = data.drop("department", axis=1)

# Join the new DataFrame "departments" to your employee dataset
data = data.join(departments)
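
pandas can also drop one dummy column for you. The sketch below uses a made-up department column and the drop_first option of get_dummies. Note that drop_first removes the first category in sorted order (IT for this dataset's values), not accounting as in the cell above, so treat it as an alternative rather than a reproduction of that step.

```python
import pandas as pd

# Toy stand-in for the "department" column (values are illustrative only)
toy = pd.DataFrame({"department": ["sales", "hr", "IT", "sales", "accounting"]})

# drop_first=True removes the first category in sorted order,
# so k categories become k-1 indicator columns
dummies = pd.get_dummies(toy["department"], drop_first=True, dtype=int)

print(dummies.columns.tolist())
```

Which category is dropped does not matter for a decision tree; it only changes which department acts as the implicit baseline.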

In [18]:
# Check the new dataset
data.head()

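| | satisfaction | evaluation | number_of_projects | average_montly_hours | time_spend_company | work_accident | churn | promotion | salary | IT | RandD | hr | management | marketing | product_mng | sales | support | technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |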


**Percentage of employees who churn**

The churn column indicates whether an employee has left the company or not:
- if the value of this column is 0, the employee is still with the company
- if the value of this column is 1, then the employee has left the company

Let’s calculate the turnover rate:
- first count the number of times the variable churn has the value 1 and the value 0, respectively
- then divide both counts by the total, and multiply the result by 100 to get the percentage of employees who left and stayed

In [19]:
# Use len() function to get the total number of observations and save it as the number of employees
n_employees = len(data)

# Print the number of employees who left/stayed
print(data.churn.value_counts())

# Print the percentage of employees who left/stayed
print(data.churn.value_counts()/n_employees*100)

0    11428
1     3571
Name: churn, dtype: int64
0    76.191746
1    23.808254
Name: churn, dtype: float64
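
The same percentages can be obtained in one step. This is a minimal sketch on a made-up churn column, assuming only the normalize option of value_counts:

```python
import pandas as pd

# Toy target column: 1 = left, 0 = stayed (made-up values, not the real data)
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="churn")

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages
rates = churn.value_counts(normalize=True) * 100
print(rates)
```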


**Separating Target and Features**

In order to make a prediction (in this case, whether an employee would leave or not), one needs to separate the dataset into two components:

- the dependent variable or target which needs to be predicted
- the independent variables or features that will be used to make a prediction

In [20]:
# Set the target and features

# Choose the dependent variable column (churn) and set it as target
target = data.churn

# Drop column churn and set everything else as features
features = data.drop("churn",axis=1)

In [24]:
# Create the splits both for target and for features
# Set the test sample to be 25% of your observations
target_train, target_test, features_train, features_test = train_test_split(target,features,test_size=0.25,random_state=42)
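
Because the classes are imbalanced, it may be worth keeping the same share of leavers in both splits. The sketch below is an optional variation, not what this notebook ran: it uses a synthetic target with roughly the 76/24 split reported above and the stratify argument of train_test_split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 76 zeros and 24 ones (illustrative, not the real data)
y = np.array([0] * 76 + [1] * 24)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Share of class 1 in each split
print(y_train.mean(), y_test.mean())
```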

In [26]:
# Initialize the decision tree as "model", setting random_state for reproducibility
model = DecisionTreeClassifier(random_state=42)

# Apply a decision tree model to fit features to the target
model.fit(features_train, target_train)

DecisionTreeClassifier(random_state=42)

In [28]:
# Apply a decision tree model to fit features to the target in the training set
model.fit(features_train,target_train)

# Check the accuracy score of the prediction for the training set
print('training accuracy: ', model.score(features_train,target_train)*100)

# Check the accuracy score of the prediction for the test set
print('test accuracy: ', model.score(features_test,target_test)*100)

training accuracy:  100.0
test accuracy:  97.22666666666666


The perfect training accuracy suggests that the model is overfitted and might fail on unseen data. This is because the tree has grown very deep, with so few observations in each leaf that it has learned a very large number of rules to classify the target.

In [31]:
# Export the tree to a dot file, which can be previewed at webgraphviz.com
export_graphviz(model,"tree.dot")
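
A quick way to see how large a tree has grown, without Graphviz, is to check its depth and number of leaves and to print its rules as text. The sketch below uses synthetic data from make_classification as a stand-in for the employee features, so the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the employee features (assumption, not the real dataset)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Depth and leaf count give a quick sense of how many rules a tree has learned
print("full tree:", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("small tree:", small_tree.get_depth(), "levels,", small_tree.get_n_leaves(), "leaves")

# export_text prints the decision rules as plain text
print(export_text(small_tree, feature_names=[f"x{i}" for i in range(5)]))
```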

### Pruning the tree
Overfitting is a classic problem in analytics, especially for the decision tree algorithm. Once the tree is fully grown, it may provide highly accurate predictions for the training sample, yet fail to be that accurate on the test set. For that reason, the growth of the decision tree is usually controlled by:

- “Pruning” the tree and setting a limit on the maximum depth it can have.
- Limiting the minimum number of observations in one leaf of the tree.

In [32]:
# Initialize the DecisionTreeClassifier while limiting the depth of the tree to 5
model_depth_5 = DecisionTreeClassifier(max_depth=5, random_state=42)

# Fit the model
model_depth_5.fit(features_train,target_train)

# Print the accuracy of the prediction for the training set
print('training accuracy: ', model_depth_5.score(features_train,target_train)*100)

# Print the accuracy of the prediction for the test set
print('test accuracy: ', model_depth_5.score(features_test,target_test)*100)

training accuracy:  97.71535247577563
test accuracy:  97.06666666666666


This looks like a more reasonable and realistic model. The gap between training and test accuracy is very small.

### Limiting the sample size
Another method to prevent overfitting is to specify the minimum number of observations required in each leaf of the decision tree.

In [33]:
# Initialize the DecisionTreeClassifier while limiting the sample size in leaves to 100
model_sample_100 = DecisionTreeClassifier(min_samples_leaf=100, random_state=42)

# Fit the model
model_sample_100.fit(features_train,target_train)

# Print the accuracy of the prediction (in percentage points) for the training set
print('training accuracy: ', model_sample_100.score(features_train,target_train)*100)

# Print the accuracy of the prediction (in percentage points) for the test set
print('test accuracy: ', model_sample_100.score(features_test,target_test)*100)

training accuracy:  96.57747355320473
test accuracy:  96.13333333333334


### Calculating accuracy metrics: precision
The precision score is an important metric used to measure the accuracy of a classification algorithm. It is calculated as the fraction of True Positives over the sum of True Positives and False Positives, or

$$\text{Precision} = \frac{TP}{TP + FP}$$

- we define True Positives as the number of employees who actually left, and were classified correctly as leaving
- we define False Positives as the number of employees who actually stayed, but were wrongly classified as leaving
- If there are no False Positives, the precision score is equal to 1. If there are no True Positives, the precision score is equal to 0.

In [37]:
# Predict whether employees will churn using the test set
prediction = model.predict(features_test)

# Calculate precision score by comparing target_test with the prediction
print('Precision: ', precision_score(target_test, prediction))

Precision:  0.9240641711229947


### Calculating accuracy metrics: recall
The recall score is another important metric used to measure the accuracy of a classification algorithm. It is calculated as the **fraction of True Positives over the sum of True Positives and False Negatives**.

If there are no False Negatives, the recall score is equal to 1. If there are no True Positives, the recall score is equal to 0.
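
To make the two definitions concrete, the sketch below counts the confusion-matrix cells on ten made-up labels and checks the hand-computed ratios against sklearn's precision_score and recall_score.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = left, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 2) = 0.6
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```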

In [55]:
# Calculate recall score by comparing target_test with the prediction
print('Recall: ', recall_score(target_test, prediction))

Recall:  0.9632107023411371


### Calculating the ROC/AUC score
While the recall score is an important metric for measuring the accuracy of a classification algorithm, it only takes False Negatives into account. Precision, on the other hand, only takes False Positives into account.

The ROC curve considers both kinds of error: it plots recall (the true positive rate) against the false positive rate across classification thresholds. The area under the ROC curve is the AUC score.

In [48]:
# Calculate ROC/AUC score by comparing target_test with the prediction
print('ROC: ', roc_auc_score(target_test, prediction))

ROC:  0.9691623087590718
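
The cell above passes hard 0/1 predictions to roc_auc_score, which evaluates a single point on the ROC curve. A common alternative is to pass the predicted probability of class 1, so that every threshold is used. The sketch below shows both on synthetic data; the dataset and the max_depth value are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (not the employee dataset)
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.76, 0.24], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# AUC from hard 0/1 labels: a single operating point on the ROC curve
auc_labels = roc_auc_score(y_test, clf.predict(X_test))

# AUC from predicted probabilities of class 1: uses every threshold
auc_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(auc_labels, auc_scores)
```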


### Balancing classes
Class imbalance can significantly affect prediction results, as shown by the difference between the recall and accuracy scores. To address it, the classes are usually weighted so that each one contributes equally. Using the class_weight argument in sklearn's DecisionTreeClassifier, one can make the classes "balanced".

In this dataset, the target variable is distributed as follows:
- class 0: 76% stayed
- class 1: 24% left
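
For reference, the "balanced" option weights each class by n_samples / (n_classes * class_count). The sketch below rebuilds a target with the class counts printed earlier for the full dataset and compares sklearn's compute_class_weight with that formula. The models in this notebook are fitted on the training split, so their actual weights will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Target rebuilt from the full-dataset counts: 11428 stayed (0), 3571 left (1)
y = np.array([0] * 11428 + [1] * 3571)

# Weights computed by sklearn for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# Same weights by hand: n_samples / (n_classes * count_per_class)
manual = len(y) / (2 * np.bincount(y))

print(weights, manual)
```

The minority class (employees who left) receives the larger weight, so misclassifying a leaver costs more during training.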

In [49]:
# Initialize the DecisionTreeClassifier 
model_depth_5_b = DecisionTreeClassifier(max_depth=5,class_weight="balanced",random_state=42)

# Fit the model
model_depth_5_b.fit(features_train,target_train)

# Print the accuracy of the prediction (in percentage points) for the test set
print('Accuracy with balanced classes: ', model_depth_5_b.score(features_test,target_test)*100)

Accuracy with balanced classes:  93.70666666666668


Here we trade off some accuracy compared to the previous model's 97%. Let's compare models with unbalanced and balanced weights and the effect on precision (False Positives).

In [56]:
# Previous precision and AUC
print('Unbalanced Precision: ', precision_score(target_test, prediction))
print('Unbalanced AUC: ', roc_auc_score(target_test, prediction), '\n')

# Initialize the model
model_depth_7_b = DecisionTreeClassifier(max_depth=7, class_weight='balanced', random_state=42)
# Fit it to the training component
model_depth_7_b.fit(features_train,target_train)
# Make prediction using test component
prediction_b = model_depth_7_b.predict(features_test)
# Print the precision score for the balanced model
print('Balanced Precision: ', precision_score(target_test, prediction_b))
# Print the ROC/AUC score for the balanced model
print('Balanced AUC: ', roc_auc_score(target_test, prediction_b))

Unbalanced Precision:  0.9240641711229947
Unbalanced AUC:  0.9691623087590718 

Balanced Precision:  0.9598163030998852
Balanced AUC:  0.959863876199084


### Cross-validation using sklearn
Overfitting the dataset is a common problem in analytics. This happens when a model has learned the data too closely: it performs very well on the dataset it was trained on, but fails to generalize beyond it.

While the train/test split helps detect overfitting to the training set, hyperparameter tuning may result in overfitting the test set, since it consists of tuning the model to get the best prediction results on the test set. Therefore, it is recommended to validate the model on different testing sets. K-fold cross-validation allows us to achieve this:

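- it splits the dataset into k folds, using one fold as the testing set and the remaining folds as the training set
- it fits the model, makes predictions and calculates a score (you can specify if you want the accuracy, precision, recall…)
- it repeats the process k times in total, so that each fold serves as the testing set once
- it returns the k scores (10 in the example below), which can then be averaged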
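
The steps above can be sketched by hand. The example below runs a 5-fold loop on synthetic data and compares it with cross_val_score; it assumes that, for a classifier and an integer cv, cross_val_score uses stratified folds without shuffling, which is the documented scikit-learn behavior.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the employee dataset)
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
tree = DecisionTreeClassifier(max_depth=5, random_state=42)

# Manual k-fold: fit on k-1 folds, score on the held-out fold, repeat k times
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(tree).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Same procedure in one call
auto_scores = cross_val_score(tree, X, y, cv=5)

print(np.round(manual_scores, 3), np.round(auto_scores, 3))
print("mean accuracy:", auto_scores.mean())
```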

In [61]:
# Use cross_val_score to print the cross-validation scores for 10 folds - unbalanced model
print(cross_val_score(model,features,target,cv=10))

[0.98533333 0.98533333 0.974      0.96533333 0.96       0.97933333
 0.99       0.99333333 1.         1.        ]


### Setting up GridSearch parameters
A hyperparameter is a setting of the model that is chosen before training rather than learned from the data. For example, max_depth and min_samples_leaf are hyperparameters of DecisionTreeClassifier(). Hyperparameter tuning is the process of testing different values of hyperparameters to find the optimal ones: those that give the best predictions according to your objectives. In sklearn, you can use a grid search to test different combinations of hyperparameters. Even better, GridSearchCV() tests the combinations and runs cross-validation on them in one function!

In [87]:
# Generate values for maximum depth
depth = [i for i in range(8,21,1)]

# Generate values for minimum sample size
samples = [i for i in range(80,300,20)]

# Create the dictionary with parameters to be checked
parameters = dict(max_depth=depth, min_samples_leaf=samples, class_weight=['balanced'])

# initialize param_search using GridSearchCV, the initial model and the parameters above
param_search = GridSearchCV(model, parameters)

# fit the param_search to the training dataset
param_search.fit(features_train, target_train)

# print the best parameters found
print(param_search.best_params_)

{'class_weight': 'balanced', 'max_depth': 8, 'min_samples_leaf': 120}


It looks like the values that give you the best score are a minimum of samples per leaf of 120 and a maximum depth of 8.
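
The search above relies on GridSearchCV's defaults, which means the combinations are ranked by accuracy. If another objective matters more, the scoring and cv arguments can be set explicitly. The sketch below is a variation on synthetic data with a small hypothetical grid, not a rerun of the search above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (not the employee dataset)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.76, 0.24], random_state=42)

# Small illustrative grid (values chosen for the toy data only)
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [20, 50, 100]}

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",  # rank combinations by ROC AUC instead of accuracy
    cv=5,               # number of cross-validation folds
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```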

In [89]:
model_best = DecisionTreeClassifier(max_depth=8, min_samples_leaf=120, class_weight='balanced', random_state=42)
model_best.fit(features_train, target_train)
prediction_tuned = model_best.predict(features_test)
# metrics
print('Accuracy CV: ', cross_val_score(model_best,features,target,cv=10).mean())
print('Precision: ', precision_score(target_test, prediction_tuned))
print('Recall: ', recall_score(target_test, prediction_tuned))
print('AUC: ', roc_auc_score(target_test, prediction_tuned))

Accuracy CV:  0.9483285301311986
Precision:  0.8865096359743041
Recall:  0.9230769230769231
AUC:  0.9429615249804524


### Sorting important features
Among other things, decision trees are very popular because of their interpretability. Many models can provide accurate predictions, but decision trees can also quantify how much each feature contributes to predicting the target. Here, the tree can tell you which features have the strongest and weakest impact on the decision to leave the company. In sklearn, you can get this information from the feature_importances_ attribute.

In [90]:
# Calculate feature importances
feature_importances = model_best.feature_importances_

# Create a list of features
feature_list = list(features)

# Save the results inside a DataFrame using feature_list as an index
relative_importances = pd.DataFrame(index=feature_list, data=feature_importances, columns=["importance"])

# Sort values to learn most important features
relative_importances.sort_values(by="importance", ascending=False)

| | importance |
|---|---|
| satisfaction | 0.503552 |
| time_spend_company | 0.3822 |
| evaluation | 0.086273 |
| number_of_projects | 0.016207 |
| average_montly_hours | 0.010423 |
| technical | 0.000806 |
| salary | 0.000539 |
| promotion | 0.0 |
| work_accident | 0.0 |
| RandD | 0.0 |
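
When reading such a table, it helps to know that sklearn normalizes the importances of a fitted tree so that they sum to 1, which makes each value a share of the total. The sketch below checks this on synthetic data with hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names (not the employee dataset)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)
names = [f"feature_{i}" for i in range(X.shape[1])]

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Importances as a labeled, sorted Series
importances = pd.Series(clf.feature_importances_, index=names).sort_values(ascending=False)

print(importances)
print("sum:", importances.sum())
```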


It seems that satisfaction is by far the most impactful feature on the decision to leave the company or not.

In [91]:
# Selecting important features

# select only features with relative importance higher than 1%
selected_features = relative_importances[relative_importances.importance>0.01]

# create a list from those features
selected_list = selected_features.index

# transform both features_train and features_test components to include only selected features
features_train_selected = features_train[selected_list]
features_test_selected = features_test[selected_list]

# Print results
print(list(features_train_selected))
print(list(features_test_selected))

['satisfaction', 'evaluation', 'number_of_projects', 'average_montly_hours', 'time_spend_company']
['satisfaction', 'evaluation', 'number_of_projects', 'average_montly_hours', 'time_spend_company']


Great! As you can see, only 5 of the 17 original features have been retained: ['satisfaction', 'evaluation', 'number_of_projects', 'average_montly_hours', 'time_spend_company'].

You’ve made sure to keep only these in your training and testing sets.

### Develop and test the best model with fewer variables

In [99]:
# Initialize the best model using the parameters found by the grid search
model_best = DecisionTreeClassifier(max_depth=8, min_samples_leaf=120, class_weight='balanced', random_state=42)

# Fit the model using only selected features from the training set
model_best.fit(features_train_selected, target_train)

# Make prediction based on selected list of features from test set
prediction_best = model_best.predict(features_test_selected)

# Print the general accuracy of the model_best
print('Accuracy: ', model_best.score(features_test_selected, target_test) * 100)
# Print the precision score of the model predictions
print('Precision: ', precision_score(target_test, prediction_best) * 100)
# Print the recall score of the model predictions
print('Recall: ', recall_score(target_test, prediction_best) * 100)
# Print the ROC/AUC score of the model predictions
print('AUC: ', roc_auc_score(target_test, prediction_best) * 100)

Accuracy:  95.33333333333334
Precision:  88.6509635974304
Recall:  92.3076923076923
AUC:  94.29615249804525


We can see that, using only the 5 most important features, the model predicts employee churn with a high level of recall without producing too many false positives.