## Employee Attrition Prediction || Comparing different classification algorithms

According to:
* the **Bureau of Labor Statistics**, 4.9 million monthly separations (employee turnover) occurred between August and December 2016.
* **research conducted by the Center for American Progress**, the median cost of turnover was 21% of an employee’s annual salary.
* **Society for Human Resource Management studies**, every time a business replaces a salaried
employee, it costs 6 to 9 months’ salary on average. For a manager making \$40,000 a year, that's
\$20,000 to \$30,000 in recruiting and training costs.

This notebook is structured as follows:
   1. **Exploratory Data Analysis (EDA)**: an approach to analyzing data sets and summarizing their characteristics. This section outlines the different statistical analyses performed.
   2. **Data balancing**: several techniques are applied to balance the data.
   3. **Data pre-processing**: a common step in machine learning projects, used to handle datasets that contain missing entries, varying degrees of noise and substantial differences in scale between features.
   4. **Feature selection**
   5. **Applying supervised ML algorithms**: different classification algorithms are applied and compared.
   6. **Model interpretability**: the permutation feature importance technique is used to measure how important each feature is for predicting employee attrition.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## 1. Exploratory Data Analysis (EDA) 
First, let's take a quick look at the dataset.

In [None]:
df = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

We can see that there are 34 features (categorical and numerical) and a categorical target column, 'Attrition', with two values: 'Yes' if the employee left or 'No' if they stayed.

However, that was just an overview of the dataset. Now, let's delve deeper into EDA to assess the quality of the features and their predictive power with respect to the target value (label). The data is explored from two different angles: descriptive and correlative.

In [None]:
df_copy = df.copy()
target_map = {'Yes':1, 'No':0}
y = df_copy['Attrition'].apply(lambda value: target_map[value]) # Encode target column
X = df_copy.drop(['Attrition'], axis = 1) # Separate predictor variables from predicted value

# Divide the dataframe into a numerical dataframe and a categorical dataframe
num_df = X.select_dtypes(exclude=object)  
cat_df = X.select_dtypes(include=object)  

### Descriptive Analysis
Descriptive analysis (or univariate analysis) provides an understanding of the characteristics of each attribute of the dataset. It also offers important evidence for feature selection at a later stage. In this section we will check:
* If there are any missing values
* Data type of each attribute
* Data distribution for each attribute

The code below checks whether there is any NaN value in the dataset.

In [None]:
df.isnull().values.any()

Fortunately, there are no missing values in this dataset!
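
A per-column count gives more detail than the single boolean returned by `.any()`. A minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy = pd.DataFrame({
    'Age': [41, 49, np.nan, 33],
    'Department': ['Sales', 'R&D', 'R&D', 'Sales'],
})

has_missing = toy.isnull().values.any()   # single yes/no answer
missing_per_column = toy.isnull().sum()   # number of NaN values in each column
print(has_missing)
print(missing_per_column)
```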

The code below checks the data type of each column.

In [None]:
df.dtypes

So, there are two data types: object and int64. The object columns contain non-numerical data and should be encoded into a numeric format; we will come back to this later.

The code below shows the underlying frequency distribution of each numerical variable.

In [None]:
plt.figure(figsize=(20,20))
for i, col in enumerate(num_df.columns, 1):
    plt.subplot(5, 6, i)
    sns.violinplot(x = df[col])

We can clearly see that 'EmployeeCount' and 'StandardHours' have the same value across the whole dataset, so they carry no information and should be considered for removal during feature selection.
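
Constant columns can also be detected programmatically rather than by eye. A minimal sketch, using a hypothetical toy frame in which two columns are constant by construction:

```python
import pandas as pd

# Hypothetical toy frame: 'EmployeeCount' and 'StandardHours' are constant here by construction
toy = pd.DataFrame({
    'Age': [41, 49, 37, 33],
    'EmployeeCount': [1, 1, 1, 1],
    'StandardHours': [80, 80, 80, 80],
})

# A column with a single distinct value cannot help a classifier separate the classes
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```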

Furthermore, we can observe that there are continuous variables (like Age, HourlyRate and MonthlyRate) and discrete variables (like Education and JobSatisfaction). The variables also have different distributions, most of which are right-skewed with a long tail (like YearsAtCompany and YearsSinceLastPromotion). This is a problem because models like SVM, KNN and logistic regression (LR) are sensitive to the extreme values in the tail region, which behave like outliers, and outliers can adversely affect a model’s performance. Some models, such as tree-based models, are robust to outliers, but relying on them alone would limit the possibility of trying other models. Therefore, a transformation that brings the distributions closer to normal should be performed when preprocessing the data.
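
Skewness can be quantified as well as read off the plots. The sketch below uses synthetic columns (an assumption for illustration, not the notebook's data) to show how pandas' `skew()` separates a roughly symmetric distribution from a right-skewed one:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical synthetic columns: one roughly symmetric, one right-skewed
toy = pd.DataFrame({
    'symmetric': rng.normal(loc=50, scale=10, size=1000),
    'right_skewed': rng.exponential(scale=5, size=1000),
})

# Sample skewness: close to 0 for symmetric data, clearly positive for a long right tail
skewness = toy.skew()
print(skewness)
```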

Now, let's see what the distributions of the categorical variables look like.

In [None]:
plt.figure(figsize=(20,20))
for i, col in enumerate(cat_df.columns, 1):
    plt.subplot(3, 3, i)
    sns.countplot(x = df[col])

We can observe that the 'Over18' column has a single unique value and should also be excluded from the dataset.

### Correlation analysis
Correlation analysis (or bivariate analysis) examines the relationship between two attributes and determines whether the two are correlated. It is divided into two parts:
* Numerical columns versus target
* Categorical columns versus target


The code below plots the distribution of each numerical variable for each attrition class. The distributions of the negative and positive classes are close to each other for all variables, especially for 'PercentSalaryHike' and 'NumCompaniesWorked'.

In [None]:
plt.figure(figsize=(30,30))
for i, col in enumerate(num_df.columns, 1):
    plt.subplot(5, 6, i)
    sns.violinplot(x = df['Attrition'], y= df[col])


The figure below shows the relationship between the categorical attributes and the target. We can observe that every categorical attribute has values with different attrition rates, except for the 'Gender' column, which seems less correlated with attrition. In addition, employees who work overtime are more likely to leave.

In [None]:
plt.figure(figsize=(30,30))
for i, col in enumerate(cat_df.columns, 1):
    plt.subplot(3, 3, i)
    sns.barplot(x = df[col], y= y)
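
The bar heights in such a plot are the mean of the 0/1 target within each category, that is, the attrition rate per category. A sketch of the same computation with `groupby`, on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'OverTime':  ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No'],
    'Attrition': [1, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 target within each group is the attrition rate of that group
attrition_rate = toy.groupby('OverTime')['Attrition'].mean()
print(attrition_rate)
```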

## 2. Data pre-processing
The data preprocessing techniques used in this project are:
1. Data Type Conversion
2. Feature Scaling and Log transformation

To simplify these pre-processing tasks, pipelines are used because of their several benefits:

* **Convenience and encapsulation**: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
* **Joint parameter selection**: You can grid search over parameters of all estimators in the pipeline at once.
* **Safety**: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
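
The safety point can be illustrated with a small sketch. The data is synthetic (`make_classification`) and the classifier (`LogisticRegression`) is an arbitrary choice for illustration, not necessarily one used later in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the real features and target
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler and the classifier are fitted together as a single estimator
model = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# In each fold the scaler is fitted on the training part only,
# so no statistics from the held-out part leak into the model
scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(scores.mean())
```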


### Data Type Conversion
One of the most important preprocessing steps is converting categorical features to a numerical format. Some algorithms, such as logistic regression, K-nearest neighbors, SVM and neural networks, are not able to work with non-numeric data.

The code below declares a pipeline that applies an ordinal encoder.

In [None]:
encoder = Pipeline(steps=[
    ('encoder', OrdinalEncoder())])
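
To make the encoding concrete, here is a minimal sketch of what `OrdinalEncoder` does to a single column. The toy frame is hypothetical and only meant for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({
    'BusinessTravel': ['Travel_Rarely', 'Non-Travel', 'Travel_Frequently', 'Non-Travel']
})

ordinal = OrdinalEncoder()
encoded = ordinal.fit_transform(toy)

# Categories are sorted and each value is replaced by the index of its category
print(ordinal.categories_[0])
print(encoded.ravel())
```

Note that the integer codes imply an order between categories; for purely nominal columns that order is arbitrary.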

### Feature Scaling and Log transformation
Feature scaling is an essential preprocessing step when features have disparate scales. It may help some machine learning classifiers perform better, because large scale gaps among features are generally not favored in the optimization stage of these algorithms. Standardization and normalization are two techniques for handling this problem. The features of this dataset have disparate scales: for example, the daily rate ranges from about 100 to 1500, whereas the job level ranges from 1 to 5. Normalization is applied to adjust the range of the features and reduce the disparity in feature scales in the IBM HR dataset.

The log transformation can be used to make highly skewed distributions less skewed, as in our case, where most numerical features are right-skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

The code below declares a pipeline that applies normalization followed by a log transformation.

In [None]:
scaler_transformer = Pipeline(
    steps=[
        ('scaler', MinMaxScaler()),
        ('transformer', FunctionTransformer(np.log1p, validate=True))
    ]
)
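
The effect of the two steps can be checked on a synthetic right-skewed sample (an assumption for illustration, not a column of the dataset). The sketch applies the min-max formula and `np.log1p` by hand and compares skewness. It also suggests that the order of the steps matters: on values already squeezed into [0, 1], `np.log1p` is close to linear and removes much less skew than on the raw values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample with a long tail
raw = pd.Series(rng.exponential(scale=5, size=1000))

# Min-max normalization: (x - min) / (max - min) maps the values to [0, 1]
normalized = (raw - raw.min()) / (raw.max() - raw.min())

# log1p(x) = log(1 + x) compresses the long right tail
logged = np.log1p(raw)

# Same order as a scaler followed by log1p: normalize first, then take the log
scaled_then_logged = np.log1p(normalized)

print(raw.skew(), logged.skew(), scaled_then_logged.skew())
```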

To join the `encoder` and `scaler_transformer` pipelines, a `ColumnTransformer` object is used.

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
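        # ColumnTransformer expects column names, not DataFrames
        ('cat', encoder, list(cat_df.columns)),             # encode categorical columns
        ('num', scaler_transformer, list(num_df.columns))   # scale and log-transform numerical columns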
    ]
)
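
As a quick check of how a `ColumnTransformer` routes columns, here is a self-contained sketch on a hypothetical toy frame. The names `toy`, `cat_cols`, `num_cols` and `toy_preprocessor` are illustrative and not part of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OrdinalEncoder

# Hypothetical toy frame with one categorical and two numerical columns
toy = pd.DataFrame({
    'OverTime': ['Yes', 'No', 'No', 'Yes'],
    'DailyRate': [100, 800, 1500, 450],
    'JobLevel': [1, 2, 5, 3],
})
cat_cols = ['OverTime']
num_cols = ['DailyRate', 'JobLevel']

toy_preprocessor = ColumnTransformer(transformers=[
    ('cat', OrdinalEncoder(), cat_cols),
    ('num', Pipeline(steps=[
        ('scaler', MinMaxScaler()),
        ('log', FunctionTransformer(np.log1p)),
    ]), num_cols),
])

# Output columns follow the order of the transformers: encoded categoricals first
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
print(transformed)
```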

## 3. Feature selection
Feature selection techniques are often used to further improve a classifier’s predictive capability by keeping only the relevant features. Feature selection is primarily focused on removing non-informative or redundant predictors from the model. We will use the following techniques to perform feature selection:
* Statistical-based method
* Feature importance based method


In [None]:
# Drop columns that will not be used: the single-valued columns found during EDA and 'EmployeeNumber'
X.drop(['EmployeeCount','Over18','StandardHours','EmployeeNumber'], axis = 1, inplace = True) 
num_df = X.select_dtypes(exclude = object)
cat_df = X.select_dtypes(include = object)

### Statistical-based method

In [None]:
sns.set(style="white")
mask = np.zeros_like(num_df.join(y).corr(), dtype=bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(15,10))
cmap = sns.diverging_palette(255, 10, as_cmap=True)
sns.heatmap(num_df.join(y).corr().round(2), mask=mask, annot=True,
            cmap=cmap , vmin=-1, vmax=1, ax=ax)
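
One way to turn a correlation matrix into a ranking is to sort the features by the absolute value of their correlation with the target. This is a sketch on synthetic data (the columns `signal` and `noise` are hypothetical), not a statement of the selection rule used in this notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Hypothetical synthetic data: 'signal' drives the target, 'noise' does not
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
toy = pd.DataFrame({'signal': signal, 'noise': noise, 'Attrition': target})

# Pearson correlation of every feature with the target, ranked by absolute value
corr_with_target = (
    toy.corr()['Attrition']
    .drop('Attrition')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```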

I'm still updating this notebook ...