## Introduction and Agenda

Hi! Today I am working on the IBM Attrition Dataset. I want to show several ways to build prediction models and discuss how feature importance can be calculated for each of them.

**If you like my work, please UPVOTE :)**


**Agenda:**

1. Loading Data
2. Standard EDA
3. Feature Selection

    3.1.     Preparing Likert-scale features
    
    3.2.     Dummy-Encoding
    
    3.3.     Elimination of Multicollinearity
    
4. Oversampling
5. Training of various Models with CV and Parameter Tuning
6. Testing the Models and Feature Importance
7. Short Conclusion


## 1. Loading the Data

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv("../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.tail()

## 2. Standard EDA

We can see that some features don't have any variance and thus no value for our ML models; this will be confirmed by the next code cells.

In [None]:
print("Number of rows:{} \nNumber of columns:{}".format(df.shape[0], df.shape[1]))
df.describe()

In [None]:
# I like to do something like this to get a rough insight into the variation in the information, the type of data (continuous, numerical, categorical), etc.
for i in df:
    print(i)
    print(df[i].unique())
    print("#"*40)

Some columns will be dropped because they lack variation and thus also information.

In [None]:
# Something similar can be automated for numeric columns, e.g.:
    #from sklearn.feature_selection import VarianceThreshold
    #VarianceThreshold(0.1).fit_transform(X)

# Over18, EmployeeCount and StandardHours are constant; EmployeeNumber is only an identifier
df.drop(columns = ["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"], inplace = True)
print("Number of variables after dropping uninformative columns:", df.shape[1])
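
The constant columns can also be found programmatically. The following is a minimal sketch on a toy frame; `constant_columns` is a hypothetical helper that is not part of the original notebook. It relies on `nunique()` rather than `VarianceThreshold`, because the latter expects numeric input while `Over18` is a string column.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    counts = frame.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# Toy stand-in for the attrition data (values are made up).
toy = pd.DataFrame({
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "Age": [41, 49, 37],
})
print(constant_columns(toy))
```

An identifier such as `EmployeeNumber` is not constant, so a check like this would not flag it; it still has to be dropped by hand.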

In [None]:
# no columns with a high ratio of missing values, so no column to drop or to fill in (:
df.isnull().sum()

In [None]:
# separate categorical from numerical features

numerical = df.dtypes[df.dtypes != "object"].index
print("numerical features:\n", numerical)
print("+"*80)
categorical = df.dtypes[df.dtypes== "object"].index
print("categorical features:\n", categorical)

In [None]:
# we have a highly imbalanced dependent variable and skewness will be high as well

df["Attrition"].value_counts().plot(kind = "bar")


I think a very simple visualization of the features can be helpful at times.
In this case, though, some features are very hard to interpret, especially the rates (there is a discussion on Kaggle about their meaning).

* Take "OverTime": it is not clear what exactly it means. I interpret this feature as "Is working overtime a regular part of your job?". If so, the overtime rate seems to be much higher in the attrition group.
* Singles seem to be affected by attrition more often.
* Healthcare Representatives seem to be affected by attrition less often, for unknown reasons.
* Sales employees seem to be affected by attrition more often.
* Employees who travel frequently seem to be affected by attrition more often.
* Other features such as "Gender" and "EducationField" are not conclusive by eye from my graphs, which does not mean that they play no role in the models to come.

In [None]:
# simple visualization of the categorical features
Cat = categorical.tolist()
print(Cat)

for x in Cat:
    if x != "Attrition":
        # count the employees per (Attrition, category) pair and stack the categories in one bar
        df[categorical].groupby(["Attrition", x])[x].count().unstack().plot(kind = "bar", stacked = True)
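
Stacked counts make it hard to compare groups of different size, so it can help to look at the attrition rate within each category instead. A small sketch with made-up rows (the `toy` frame is an assumption, not the real data), using `pd.crosstab` with row-wise normalization:

```python
import pandas as pd

# Made-up rows that only mimic the two columns of interest.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "OverTime":  ["Yes", "Yes", "No", "Yes", "No", "No"],
})

# Share of "No"/"Yes" attrition within each OverTime group (each row sums to 1).
rates = pd.crosstab(toy["OverTime"], toy["Attrition"], normalize="index")
print(rates)
```

On the real data the same call would presumably be `pd.crosstab(df["OverTime"], df["Attrition"], normalize="index")`.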

In [None]:
# The visualization of the numerical features is not as easy as I first thought.
# Some features are in reality categorical (ordinal) features with no true interval or zero (Education etc.), which makes them harder to interpret.
# Other numerical features like "YearsSinceLastPromotion" are not as conclusive as I would have suspected.
# Younger people seem to be affected by attrition more often.
# A greater "DistanceFromHome" is a candidate to influence attrition.
# A higher attrition rate seems to be connected to lower "JobLevel", lower "JobSatisfaction", lower "MonthlyIncome" and so on.
# The lack of stock options and a lower "MonthlyIncome" seem to be good indicators, which is quite reasonable.

import seaborn as sns
import matplotlib.pyplot as plt

Num = numerical.tolist()
order = ["No", "Yes"]   # fix the box order so the median labels line up with the boxes

for j in Num:
    fig, ax = plt.subplots()
    sns.boxplot(x = "Attrition", y = j, data = df, order = order, ax = ax)
    median = df.groupby("Attrition")[j].median().reindex(order).values
    median_labels = [str(np.round(s, 2)) for s in median]
    # write each median at the x-position of its own box
    for pos, (value, label) in enumerate(zip(median, median_labels)):
        ax.text(pos, value, label,
                horizontalalignment='center', size='x-small', color='w', weight='semibold')

## 3. Feature Selection

In this section, I will first map all numerical features which are really categorical back to categorical features.
Second, I will dummy-encode the old and new categorical features.
Third, I create an artificial dataset and upsample it, so that the class imbalance does not distort the picture of the correlation between the target and the other features.
Fourth, I create a heatmap and eliminate features that are highly correlated with each other as a feature selection measure.
Further feature selection is based on the correlation between the features.

https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947eb

### 3.1 Preparing Likert-scale features

In [None]:
Education_dict = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}

EnvironmentSatisfaction_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

JobInvolvement_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

JobSatisfaction_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

PerformanceRating_dict = {1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'}

RelationshipSatisfaction_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

WorkLifeBalance_dict = {1:'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}

df["Education"] = df["Education"].map(Education_dict)
df["EnvironmentSatisfaction"] = df["EnvironmentSatisfaction"].map(EnvironmentSatisfaction_dict)
df["JobInvolvement"] = df["JobInvolvement"].map(JobInvolvement_dict)
df["JobSatisfaction"] = df["JobSatisfaction"].map(JobSatisfaction_dict)
df["PerformanceRating"] = df["PerformanceRating"].map(PerformanceRating_dict)
df["RelationshipSatisfaction"] = df["RelationshipSatisfaction"].map(RelationshipSatisfaction_dict)
df["WorkLifeBalance"] = df["WorkLifeBalance"].map(WorkLifeBalance_dict)

### 3.2 Dummy-Encoding

In [None]:
df = pd.get_dummies(data = df, columns = ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender','JobRole', 'MaritalStatus', 'OverTime', "Education", "EnvironmentSatisfaction", "JobInvolvement", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", "WorkLifeBalance"], drop_first = True)
df.head()

In [None]:

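# creation of the artificial dataset and heatmap

from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy = "minority", random_state = 10)
# fit_resample replaces fit_sample, which was removed from imbalanced-learn
feature_cols = df.columns.drop("Attrition_Yes")
X_art, y_art = sm.fit_resample(df[feature_cols], df["Attrition_Yes"])
# name the columns in the order they were resampled (target last) so the labels stay aligned
artificial_df = pd.concat([pd.DataFrame(X_art, columns = feature_cols), pd.Series(y_art, name = "Attrition_Yes")], axis = 1)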


fig, ax = plt.subplots(figsize = (10,10))
ax = sns.heatmap(artificial_df.corr())
ax.set_title("Correlation after SMOTE")
plt.show()

### 3.3 Elimination of Multicollinearity

Making sense of a heatmap with so many columns is rather difficult.
To make it easier, I will write code that eliminates multicollinearity.
I will use a threshold of 0.7 for the absolute correlation, which indicates that two columns contain quite similar information.

In [None]:
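corr = artificial_df.corr()

print("before Multi-check:", corr.shape)
for col in corr:   # "col" instead of "vars", which shadows a Python builtin
    # flag rows that correlate strongly (|r| > 0.7) with this column, excluding the diagonal
    mask = ((corr[[col]] > 0.7) & (corr[[col]] < 1)) | (corr[[col]] < -0.7)
    corr[mask] = np.nan
# note: both members of a highly correlated pair are flagged and therefore dropped
corr.dropna(inplace = True)
print("after Multi-check:", corr.shape)
# 9 rows were eliminated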
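
A common, less aggressive variant keeps one feature of every correlated pair by inspecting only the upper triangle of the correlation matrix. This is a sketch of that alternative, not the method used in this notebook; `drop_correlated` and the toy columns are hypothetical names.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so that each pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                       # independent noise
})
reduced = drop_correlated(toy)
print(reduced.columns.tolist())
```

In this variant the first column of a pair survives, so the column order decides which feature is kept.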


In [None]:
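# if the column labels were assigned as a nested list such as [df.columns], corr.index is a MultiIndex and df[corr.index] fails;
# get_level_values(0) returns the plain column names in either case
# anyway, now we have a dataframe ready!
df = df[corr.index.get_level_values(0)]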

## 4. Oversampling
Now that the dataset is cleaned of zero-variance features, some "numerical" features are relabeled as the categorical features they really are, the categorical features are dummy-encoded (with n-1 new columns) and multicollinearity is reduced, I want to oversample my dataset in the next part.
It is important to note that oversampling must be used carefully to avoid data leakage!
If you think you found something foul, contact me!

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import (classification_report, recall_score)
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier


X = df.drop(columns = ["Attrition_Yes"])
y = df["Attrition_Yes"]

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state = 7, stratify = y, test_size = 0.15)

# oversample the training part only; fit_resample replaces fit_sample, which was removed from imbalanced-learn
X_trainval_over, y_trainval_over = sm.fit_resample(X_trainval, y_trainval)

print("Size of trainval:{}\nSize of test:{}".format(X_trainval_over.shape, X_test.shape))
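
One way leakage can still creep in is cross-validating on data that is already oversampled, because copies or neighbours of a validation sample may sit in the training folds. The sketch below oversamples inside each fold instead. It runs on synthetic data, and it swaps SMOTE for plain random oversampling to stay dependency-free; `oversample_minority` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Synthetic, imbalanced stand-in for the attrition data (roughly 15% positives).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, weights=[0.85], random_state=0)

rng = np.random.default_rng(0)
fold_recalls = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X_demo, y_demo):
    # Oversample the training fold only; the validation fold stays untouched.
    X_tr, y_tr = oversample_minority(X_demo[train_idx], y_demo[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    fold_recalls.append(recall_score(y_demo[val_idx], model.predict(X_demo[val_idx])))

print("mean recall:", np.mean(fold_recalls))
```

imbalanced-learn also ships its own `Pipeline` that applies samplers during fitting only, which should give the same guarantee together with `cross_val_score`.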


## 5. Training of various Models with CV and Parameter Tuning
I tried various boundaries for the parameters of my algorithms, so I will only show you the intervals of interest.

### Gradient Boosting Classifier
* The baseline model recall is 0.87 and the model is highly overfitting. Let's regularize the model a little bit.

In [None]:
all_scores = []
learning_rate = [0.025, 0.05, 0.1, 0.15, 0.2]
for lr in learning_rate:
    GBM = GradientBoostingClassifier(random_state = 10, learning_rate = lr, max_features = "sqrt")
    score = cross_val_score(GBM, X_trainval_over, y_trainval_over, cv = 5, scoring = "recall").mean()
    all_scores.append(score)
fig = plt.figure()

plt.scatter(x = learning_rate, y = all_scores)
plt.xlabel("learning_rate")
plt.ylabel("Recall-Score")
plt.title("GBM with different learning rate")

In [None]:
all_scores = []
n_estimators = [100, 200, 300, 400, 500]
for est in n_estimators:
    GBM = GradientBoostingClassifier(random_state = 10, n_estimators = est, max_features = "sqrt")
    score = cross_val_score(GBM, X_trainval_over, y_trainval_over, cv = 5, scoring = "recall").mean()
    all_scores.append(score)
fig = plt.figure()

plt.scatter(x = n_estimators, y = all_scores)
plt.xlabel("n_estimators")
plt.ylabel("Recall-Score")
plt.title("GBM with different numbers of estimators")

In [None]:
learning_rate = [0.2, 0.15, 0.1, 0.05, 0.025]
n_estimators = [100, 200, 300, 400, 500]
best_score = 0
for est in n_estimators:
    for lr in learning_rate:
        GBM = GradientBoostingClassifier(random_state = 10, n_estimators = est, learning_rate = lr, max_features = "sqrt")
        score = cross_val_score(GBM, X_trainval_over, y_trainval_over, cv = 5, scoring = "recall").mean()
        if score > best_score:
            best_score = score
            best_parameters = {"n_estimators": est, "learning_rate": lr}
print("best recall:{} with best parameters:{}".format(best_score,best_parameters))
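
The nested loops above can also be expressed with scikit-learn's `GridSearchCV`, which keeps track of the best combination itself. A minimal sketch on synthetic data, with a deliberately tiny grid so that it runs quickly (the grid values are assumptions, not the ones tuned in this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_trainval_over / y_trainval_over.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"n_estimators": [20, 40], "learning_rate": [0.05, 0.2]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=10, max_features="sqrt"),
    param_grid,
    cv=5,
    scoring="recall",
)
search.fit(X_demo, y_demo)
print("best recall:{} with best parameters:{}".format(search.best_score_, search.best_params_))
```

`search.cv_results_` additionally holds the score of every combination, which is handy for plots like the ones above.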

### RandomForest
Partially, I will tune other parameters than with the DecisionTree. The baseline recall with CV is 0.829.
For more info on Parameter Tuning (or in this case Pruning), you can check out : https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d

In [None]:
all_scores = []
estimators = range(400,900,100)
for est in estimators:
    RF = RandomForestClassifier(random_state = 10, n_jobs = -1, n_estimators = est)
    score = cross_val_score(RF, X_trainval_over, y_trainval_over, cv = 5, scoring = "recall").mean()
    all_scores.append(score)
fig = plt.figure()

plt.scatter(x = estimators, y = all_scores)
plt.xlabel("estimators")
plt.ylabel("Recall-Score")
plt.title("RandomForests with different numbers of estimators")

In [None]:
all_scores = []
max_depth = range(1,10,1)
for depth in max_depth:
    RF = RandomForestClassifier(random_state = 10, n_jobs = -1, max_depth = depth)
    scores = cross_val_score(RF, X_trainval_over, y_trainval_over, cv = 5, scoring = "recall").mean()
    all_scores.append(scores)
fig = plt.figure()
plt.scatter (x = max_depth, y = all_scores)
plt.xlabel("max_depth")
plt.ylabel("Recall-Score")
plt.title("RandomForests with different tree depth")

In [None]:
max_depth = range(1,10,1)
estimators = range(400,900,100)
best_score = 0
for depth in max_depth:
    for est in estimators:
        RF = RandomForestClassifier(random_state = 10, max_depth = depth, n_estimators = est, n_jobs = -1)
        scores = cross_val_score(RF, X_trainval_over, y_trainval_over, cv = 5, scoring = "recall").mean()
        if scores > best_score:
            best_score = scores
            best_parameters = {"depth": depth, "estimators": est}
print("best recall:{} with best parameters:{}".format(best_score,best_parameters))

### LogisticRegression
The baseline model recall is 0.796.

In [None]:
# the liblinear solver supports both the L1 and the L2 penalty (the current default solver, lbfgs, only supports L2)

penalty = ["l1", "l2"]
all_score = []
for p in penalty:
    LR = LogisticRegression(penalty = p, solver = "liblinear")
    scores = cross_val_score(LR, X_trainval_over, y_trainval_over,cv = 5, scoring = "recall").mean()
    all_score.append(scores)
fig = plt.figure()
plt.scatter(x = penalty, y = all_score)
plt.xlabel("penalty")
plt.ylabel("Recall-Score")
plt.title("LogisticRegression with L1 and L2 penalty")

In [None]:
C_value = [0.001, 0.01, 0.1, 1, 10, 100]                       
all_score = []
for C in C_value:
    LR = LogisticRegression(C = C, n_jobs = -1)
    scores = cross_val_score(LR, X_trainval_over, y_trainval_over,cv = 5, scoring = "recall").mean()
    all_score.append(scores)
fig = plt.figure()
plt.scatter (x = C_value, y = all_score)
plt.xlabel("C")
plt.ylabel("Recall-Score")
plt.title("LogisticRegression with different regularization Values for C")

In [None]:
C_value = [0.001, 0.01, 0.1, 1, 10, 100] 
penalty = ["l1", "l2"]
best_score = 0

for C in C_value:
    for p in penalty:
        LR = LogisticRegression(C = C, penalty = p, solver = "liblinear")
        scores = cross_val_score(LR, X_trainval_over, y_trainval_over, cv = 5, scoring = "recall").mean()
        if scores > best_score:
            best_score = scores
            best_parameters = {"p": p, "C": C}
print("best recall_score:\n{}".format(best_score))
print("with given parameters:\n{}".format(best_parameters))

### Support Vector Machine
The baseline model recall is 0.792.
SVMs perform much better when the dataset is scaled, so I used the MinMaxScaler to achieve a better result.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_trainval_over)
X_train_scaled = scaler.transform(X_trainval_over)
SVM = SVC(random_state = 10, kernel = "rbf")
score = cross_val_score(SVM, X_train_scaled, y_trainval_over, cv = 5, scoring = "recall").mean()
print("BaseLineScore:", score)
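
Fitting the scaler on the full training set before cross-validation lets every validation fold influence the scaling. Wrapping scaler and model in a pipeline avoids that, since the scaler is then re-fitted on the training part of each fold. A sketch on synthetic data (`X_demo` and `y_demo` are stand-ins, not the attrition data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is re-fitted on the training part of every fold.
scaled_svm = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
scores = cross_val_score(scaled_svm, X_demo, y_demo, cv=5, scoring="recall")
print("BaseLineScore:", scores.mean())
```

With min-max scaling the difference is usually small, but the pipeline also makes it harder to forget scaling the test set later.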

In [None]:
# C controls the regularization: low values of C allow a wide, soft margin, so training points far from the decision boundary still influence it
# vice versa, high values of C penalize misclassified training points heavily, so the boundary is driven by the points close to it
all_scores = []
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C in C_values:
    SVM = SVC(random_state = 10, C = C, kernel = "rbf")
    score = cross_val_score(SVM, X_train_scaled, y_trainval_over, cv = 5, scoring = "recall").mean()
    all_scores.append(score)
fig = plt.figure()
plt.scatter(x = C_values, y = all_scores)
plt.xlabel("Parameter C")
plt.ylabel("Recall_score")
plt.title("SVM with different C values")

In [None]:
all_scores = []
gamma = [0.001, 0.01, 0.1, 1, 10, 100]
for g in gamma:
    SVM = SVC(random_state = 10, gamma = g, kernel = "rbf")
    score = cross_val_score(SVM, X_train_scaled, y_trainval_over, cv = 5, scoring = "recall").mean()
    all_scores.append(score)
fig = plt.figure()
plt.scatter(x = gamma, y = all_scores)
plt.xlabel("Parameter Gamma")
plt.ylabel("Recall_score")
plt.title("SVM with different gamma values")

In [None]:
C_Values = [0.001, 0.01, 0.1, 1, 10, 100]
Gamma = [0.001, 0.01, 0.1, 1, 10, 100]
best_score = 0
for C in C_Values:
    for g in Gamma:
        SVM = SVC(random_state = 10, C = C, gamma=g, kernel = "rbf")
        score = cross_val_score(SVM, X_train_scaled, y_trainval_over, cv = 5, scoring = "recall").mean()
        all_scores.append(score)
        if score > best_score:
            best_score = score
            best_parameters = {"Gamma": g, "C": C}
print("best recall:{} with best parameters:{}".format(best_score,best_parameters))

## 6. Testing the models and Feature Importance

I will now test the models and, more importantly for me, evaluate the feature importance of the various models.
A few things in advance:
The feature importances measured for the different models are not directly comparable, because the method of measurement or the scale differs. We can, for example, compare the feature importance between the GradientBoostingClassifier and the RandomForestClassifier, but not with LogisticRegression.
What I think we can do is compare which features reappear in the top 10 across the models, although this too gives us only limited information.
I will try to explain what feature importance means in each case.

https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-identify-the-most-important-predictor-variables-in-regression-models

### Gradient Boosting Classifier
From what I gather from the docs, the feature importance is calculated in much the same way as for other tree-based ensembles such as forests. Basically, the more the splits on a feature reduce the impurity, the more important the feature is. The aggregated Gini importance (aka weighted impurity decrease) of a feature is averaged over the trees.

https://sklearn.org/modules/ensemble.html

In [None]:
GBM = GradientBoostingClassifier(random_state = 10, n_estimators = 500, learning_rate = 0.2, max_features = "sqrt")
GBM.fit(X_trainval_over, y_trainval_over)
GBM_predict = GBM.predict(X_test)
print(round(recall_score(y_test, GBM_predict), 2))   # recall of the attrition class; average = "micro" would equal accuracy for a binary target
print(classification_report(y_test, GBM_predict))
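
For a binary target, micro-averaged recall coincides with accuracy, whereas the default setting of `recall_score` reports the recall of the positive class only. A small sketch with made-up labels shows the difference:

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: four negatives, two positives, one positive is missed.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

micro = recall_score(y_true, y_pred, average="micro")
binary = recall_score(y_true, y_pred)   # recall of the positive class
accuracy = accuracy_score(y_true, y_pred)
print(micro, binary, accuracy)
```

On imbalanced data the two numbers can differ a lot, so it matters which one is reported.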

In [None]:
feature_importance_GBM = pd.DataFrame(dict(Column = np.array(X.columns), Importance = GBM.feature_importances_)).sort_values(by = "Importance", ascending = False)
feature_importance_GBM
fig = plt.figure(figsize = (14,4))
plt.bar(x = feature_importance_GBM.iloc[:10, 0], height = feature_importance_GBM.iloc[:10, 1])
plt.xticks(rotation = 75)

### RandomForest

A tree uses the Gini impurity as its measure to decide which feature and threshold to use for every split. For every feature, the model sums the decrease in Gini impurity over all nodes that split on this feature, weighted by the share of samples reaching the node. These sums are then averaged over the trees in the forest. Luckily, feature scaling is not necessary for this calculation.

From the docs we get: "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature." Also the formula for the Gini Importance (aka weighted impurity decrease) is:
"`N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)`
`N` is the total number of samples, `N_t` is the number of samples at the current node, `N_t_L` is the number of samples in the left child, and `N_t_R` is the number of samples in the right child."
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier https://stackoverflow.com/questions/49170296/scikit-learn-feature-importance-calculation-in-decision-trees https://towardsdatascience.com/model-based-feature-importance-d4f6fb2ad403
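
To make the quoted formula concrete, here is a small worked example in plain Python. The node counts are hypothetical and `gini` / `weighted_impurity_decrease` are illustrative helpers, not scikit-learn functions.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_impurity_decrease(n_total, parent, left, right):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)."""
    n_t, n_l, n_r = sum(parent), sum(left), sum(right)
    return n_t / n_total * (
        gini(parent) - n_r / n_t * gini(right) - n_l / n_t * gini(left)
    )

# Hypothetical root node: 100 samples, 50 per class, split into 40/10 and 10/40.
decrease = weighted_impurity_decrease(100, parent=[50, 50], left=[40, 10], right=[10, 40])
print(round(decrease, 4))  # 0.18
```

The parent impurity is 0.5 and both children have an impurity of 0.32, so the split reduces the impurity by 0.18. Summing such values over all nodes that split on a feature gives that feature's unnormalized importance in one tree.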

The feature importances of such ensemble algorithms do not tell us anything about the magnitude or the direction of the relationship. All we can tell is that featureX was helpful (or not) in predicting the outcome.
I find several things interesting when we compare the feature importances of the forest and the GBM.
Within the top 10 features we find an intersection of 9 features, which I interpret as some sort of "confirmation" of the consistency of my models.
Two monetary motives are in the intersection (Monthly_Rate, Stockoption), and the forest also takes "DailyRate" as another important feature. Medical and Life_Science are relevant for predicting whether a person is affected by attrition. Without further data, interviews would be needed to conclude the direction of the impact.
The private life of the employees seems to play an important role in both models (Work_Life_Balance, Marital_Status), which is not surprising. Given that, I am a little surprised that a classical attribute for attrition like "Overtime" is not in the top 10 of both models.
Other soft indicators such as "Jobsatisfaction" or "Environmentsatisfaction" seem to play a role too.
 

In [None]:
RF = RandomForestClassifier(random_state = 10, n_jobs = -1, n_estimators = 400, max_depth = 9)
RF.fit(X_trainval_over, y_trainval_over)
RF_predict = RF.predict(X_test)
print(round(recall_score(y_test, RF_predict), 2))   # recall of the attrition class; average = "micro" would equal accuracy for a binary target
print(classification_report(y_test, RF_predict))
#print(RF_predict)

In [None]:
feature_importance_RF = pd.DataFrame(dict(Column = np.array(X.columns), Importance = RF.feature_importances_)).sort_values(by = "Importance", ascending = False)
feature_importance_RF
fig = plt.figure(figsize = (14,4))
plt.bar(x = feature_importance_RF.iloc[:10,0] , height = feature_importance_RF.iloc[:10,1])
plt.xticks(rotation = 75)

### LogisticRegression
Feature importance with logistic regression works differently than with trees and forests.
Here, the feature importance is measured by the regression coefficient. For categorical features it is a little easier to interpret: the effect of "OverTime_Yes" on the target "Attrition" is more than twice as big as that of "JobInvolvement_Low".
For numeric data, it is a little more complicated:
In the model, the coefficient of a numeric predictor shows the effect on the target (in log-odds) of a 1-unit change in that predictor, so coefficients of features measured on different scales are not directly comparable.
Statistically, a positive sign (positive coefficient) for a feature means that, all else being equal, the chances of attrition are higher, e.g. a person working overtime is more likely to leave. In contrast, a person with an educational background in Medical or LifeScience, a good WorkLifeBalance or a high JobSatisfaction seems to be less likely to be affected by attrition.
In my opinion, the easily understandable coefficients (aka feature importance) of a linear model make it a wonderful candidate to choose if comparing the features and researching their impact is important to the task.

For more I highly recommend the work of Tim Bock:
https://www.displayr.com/how-to-interpret-logistic-regression-coefficients/?utm_referrer=https%3A%2F%2Fwww.google.com%2F

Interestingly, the 5 most negative coefficients have a much larger magnitude than the 5 most positive coefficients.
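
A logistic regression coefficient lives on the log-odds scale, so exponentiating it gives the factor by which the odds change for a 1-unit increase of the feature. A small sketch in plain Python; `odds_ratio` is a hypothetical helper and the coefficients are chosen only for illustration, not taken from the fitted model.

```python
import math

def odds_ratio(coefficient: float, delta: float = 1.0) -> float:
    """Multiplicative change in the odds when the feature increases by `delta`."""
    return math.exp(coefficient * delta)

# Hypothetical coefficients, chosen only for illustration.
print(odds_ratio(0.0))            # a zero coefficient leaves the odds unchanged
print(odds_ratio(math.log(2)))    # a coefficient of ln(2) doubles the odds
print(odds_ratio(-math.log(2)))   # the negative counterpart halves them
```

For a dummy feature such as `OverTime_Yes`, the 1-unit change is simply the switch from "No" to "Yes".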

In [None]:
LR = LogisticRegression(C = 0.1, penalty = "l2", n_jobs = -1)
LR.fit(X_trainval_over, y_trainval_over)
LR_predict = LR.predict(X_test)
print(round(recall_score(y_test, LR_predict), 2))   # recall of the attrition class; average = "micro" would equal accuracy for a binary target
print(classification_report(y_test, LR_predict))
#print(LR_predict)

In [None]:
# nice hack for first n rows and last n rows in a graph
feature_importance_LR = pd.DataFrame(dict(Column = np.hstack(np.array([X.columns])), Importance = np.hstack(LR.coef_))).sort_values(by = "Importance", ascending = False)
feature_importance_LR
fig = plt.figure(figsize = (14,4))
plt.bar(x = pd.concat([feature_importance_LR.iloc[:5,0],feature_importance_LR.iloc[-5:,0]]), height = pd.concat([feature_importance_LR.iloc[:5,1],feature_importance_LR.iloc[-5:,1]]))
plt.xticks(rotation = 75)

### Support Vector Machine
I used an SVC with a radial basis function (RBF) kernel, not only because it often gives better results but also to show another way to compute feature importance.

The main idea is to measure the drop in accuracy when shuffling a single column of the test set. Working on a copy leaves the original test set untouched. By shuffling, we detach the affected feature from the target variable and hence destroy its information value. This process is of course repeated for all features. Before all of that, a benchmark score with all features intact is computed.

Interestingly, the calculated feature importances are pretty small. I can only imagine that there is still much multicollinearity left among the features, although I eliminated everything with a correlation > 0.7 or < -0.7. Many features have no importance for the RBF kernel at all.
As with the feature importance from forests or gradient boosters, there is no way to say anything about the direction or magnitude of a feature's effect.
By the way, this is also a nice method for feature selection and thus faster models.

In [None]:
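# fit the scaler on the training data and train the SVM on the scaled features,
# so that training and test data live on the same scale
scaler.fit(X_trainval_over)
X_train_scaled = scaler.transform(X_trainval_over)
X_test_scaled = scaler.transform(X_test)
SVM = SVC(random_state = 10, gamma = 0.001, C = 0.001, kernel = "rbf")
SVM.fit(X_train_scaled, y_trainval_over)
SVM_predict = SVM.predict(X_test_scaled)
#print(SVM_predict)
# average = "micro" equals accuracy for a binary target, which is the metric whose drop we measure
clean_score = round(recall_score(y_test, SVM_predict, average = "micro"), 5)
print("Micro-averaged recall (= accuracy) without shuffling:", clean_score)
print(classification_report(y_test, SVM_predict))

rng = np.random.RandomState(10)   # fixed seed so the shuffling is reproducible
all_scores = []
for i in range(X_test.shape[1]):
    X_test_noisy = X_test_scaled.copy()
    rng.shuffle(X_test_noisy[:, i])   # shuffle one column of the copy in place
    noisy_predict = SVM.predict(X_test_noisy)
    noisy_score = round(recall_score(y_test, noisy_predict, average = "micro"), 5)
    feature_imp = clean_score - noisy_score   # importance = drop in score after shuffling
    all_scores.append(feature_imp)
#print(all_scores)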

In [None]:
feature_importance_SVM = pd.DataFrame(dict(Column = np.array(X.columns), Importance = np.array(all_scores))).sort_values(by = "Importance", ascending = False)
fig = plt.figure(figsize=(16,4))
plt.bar(x = feature_importance_SVM.iloc[:11, 0], height = feature_importance_SVM.iloc[:11, 1])
plt.title("Feature Importance for SVM with RBF Kernel")
plt.xticks(rotation = 75)
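
scikit-learn ships this shuffling approach as `sklearn.inspection.permutation_importance`, which repeats the shuffle several times per feature and reports mean and standard deviation. A sketch on synthetic data (the data and the default SVC settings are assumptions, not the tuned model from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

# Scaling lives inside the pipeline, so shuffled test columns are scaled consistently.
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, scoring="accuracy", n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```

Repeating the shuffle smooths out the randomness of a single permutation, which matters on a test set as small as the one used here.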

## 7. Short Conclusion

I hope you enjoyed some parts of my work.
From what I took from the various models, I think I would stick with the GradientBoostingClassifier.
Not only were its results regarding feature importance very consistent with the RandomForest, but it also had the highest recall on the attrition cases (see the classification_report).
