# INTRODUCTION

Hi, this kernel tries to find the most important features when it comes to employee attrition.

To do so, I will fit a RandomForestClassifier and then build a bar plot of the feature importances.

# IMPORTING MODULES

In [None]:
#Basic module
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Preparation module
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

#Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# DATA EXPLORATION

In [None]:
df= pd.read_csv('/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

In [None]:
#Info
df.info()

In [None]:
#numerical variables summary statistics
df.describe()

In [None]:
#Check for null values in the dataset
df.isnull().sum()

## Null model

This null model tells us that if we always predicted the majority class (the employee stays), we would be right 83.87% of the time. We will compare this result with the predictions from our model.

In [None]:
#Accuracy of the null model: share of employees who did not leave (i.e. always predicting 'No')
random_guess = 1-len(df[df['Attrition']=='Yes'])/df.shape[0]
random_guess
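
The same kind of baseline can also be computed with scikit-learn's `DummyClassifier`. The following is only a minimal sketch on a made-up label vector (the 84/16 split is an assumption chosen to resemble this dataset's imbalance), not the notebook's actual data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 84 "stay" (0) and 16 "leave" (1), for illustration only
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

# The baseline accuracy equals the share of the majority class
majority_share = np.mean(y_toy == 0)
print(baseline_acc, majority_share)
```

Any real model has to beat this number before its accuracy means much.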

#### EDA takeaways

* The dataset has 1470 observations and 35 columns. The small number of rows could be problematic.
* There are 26 columns of type integer and 9 of type object.
* The dataset has no missing data.
* The model's accuracy should be higher than 83% if we want it to be good or acceptable.


# VISUALISATION

In [None]:
#Plot the object features
plt.style.use('ggplot')
#Loop over the categorical variables and plot each one against the attrition variable
for col in df.select_dtypes('object'):
    plt.figure(figsize=(8,6))
    sns.countplot(x=col,hue='Attrition',data=df)

In [None]:
#plot the int features 
for col in df.select_dtypes('int64'):
    plt.figure(figsize=(10,8))
    sns.boxplot(x='Attrition', y=col,data=df)

In [None]:
#Create heat map to see correlated features (numeric columns only, since correlation is undefined for object columns)
plt.figure(figsize=(10,8))
sns.heatmap(df.select_dtypes('number').corr())

# FEATURES SELECTION

In [None]:
#Look for variables with low variance (variance is only defined for the numeric columns)
df.select_dtypes('number').var(axis=0)

From the graphics above, we can already delete some columns that are useless:
* EmployeeCount
* EmployeeNumber
* StandardHours
* Over18
* PerformanceRating
* StockOptionLevel
* JobInvolvement


# Prepare the data

In the next section, we will delete the few columns that have low variance or carry no useful information, and then split the data into categorical and numerical features.
The purpose of this step is to encode and scale the features. Finally, we transform everything back into a dataframe.

In [None]:
#Drop the features judged uninformative (low variance, constants and identifiers)
to_drop = ['StandardHours','EmployeeCount','EmployeeNumber','Over18','PerformanceRating','StockOptionLevel','JobInvolvement']
df.drop(to_drop,axis=1,inplace=True)

In [None]:
#Split X and y
X = df.drop('Attrition',axis=1)
y = df['Attrition'].map({'Yes':1,'No':0}) #Encode the target as 1 (left) and 0 (stayed)

#split categorical , numerical and ordinal features
categorical = list(X.columns[X.dtypes=='object'])
ordinal = ['Education','EnvironmentSatisfaction','JobLevel','JobSatisfaction','WorkLifeBalance','RelationshipSatisfaction']
numerical = list(X.drop(categorical + ordinal,axis=1))

#Transform numerical and categorical features
X_cat = pd.get_dummies(X[categorical]) #Transform categorical into 0 and 1
X_num = StandardScaler().fit_transform(X[numerical])
X_num = pd.DataFrame(X_num,columns=X[numerical].columns) #Transform the array back to a dataframe for future use

#Create the new X object and look at it
X_new = pd.concat([X_num,X_cat],axis=1)
X_new

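So far so good, everything has been transformed with no problem. We now have 42 features.
I only scaled the real numerical features. The ordinal features already carry a rank, so leaving them unscaled should not be a problem for the model. Note, however, that the cell above concatenates only `X_num` and `X_cat`, so the six ordinal columns are not part of `X_new` or of the 42 features.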

## Split the data in train and test set

In [None]:
X_train,X_test,y_train,y_test= train_test_split(X_new,y,test_size=0.40,shuffle=True)

print('X_train shape',X_train.shape)
print('X_test shape',X_test.shape)
print('y_train shape',y_train.shape)
print('y_test shape',y_test.shape)

#I set test_size to 40% because the dataset has very few observations.
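
Because the target is imbalanced, a plain random split can leave the train and test sets with slightly different attrition rates. A possible refinement, sketched here on made-up labels (the 16% positive rate and the `random_state` value are assumptions for illustration), is to pass `stratify` to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 16% positives, for illustration only
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, stratify=y_toy, random_state=0
)

# Both splits keep the same positive rate as the full label vector
print(y_tr.mean(), y_te.mean())
```

Fixing `random_state` also makes the split, and everything computed from it, repeatable between runs.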

## Dimensionality reduction using PCA

In [None]:
# Fit PCA on the train set and find how many components keep 95% of the variance
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d=np.argmax(cumsum>0.95)+1

In [None]:
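#Plot the cumulative explained variance against the number of components
n_components = np.arange(1,len(cumsum)+1) #1-based so the curve lines up with d
plt.plot(n_components,cumsum)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.plot(d,0.95,marker='d')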

In [None]:
#Check the number of components to keep
d

This section of code extracts the number of principal components needed to keep 95% of the variance of the features. In other words, 22 components capture 95% of the variance in the feature set (this says nothing yet about how well attrition is predicted). PCA normally transforms the features and compresses highly correlated ones together, but here I only use it to suggest how many features to keep. The next lines of code use SelectKBest to choose roughly that many of the 42 features (the code sets k=21).
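
As a side note, scikit-learn's `PCA` accepts a float `n_components`, which asks directly for the number of components that reach that share of explained variance. The sketch below uses synthetic data (an assumption, not the HR dataset) to show that this shortcut and the manual cumulative-sum approach agree:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 10 columns with decreasing spread
X_toy = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

# Manual approach: smallest component count whose cumulative ratio exceeds 0.95
cumsum = np.cumsum(PCA().fit(X_toy).explained_variance_ratio_)
d_manual = int(np.argmax(cumsum > 0.95)) + 1

# Shortcut: a float n_components requests that share of variance directly
pca_95 = PCA(n_components=0.95).fit(X_toy)
d_builtin = pca_95.n_components_

print(d_manual, d_builtin)
```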

In [None]:
k=21
#Changing the Train set
selector = SelectKBest(f_classif,k=k)
selector.fit(X_train,y_train)

# Keep only the selected features into a new variable X_train_reduced
col=selector.get_support(indices=True)
X_train_reduced = X_train.iloc[:,col]

#Changing the Test set
#Reuse the selector fitted on the train set: refitting on the test set would leak
#the test labels and could select different columns than the model was trained on
X_test_reduced = X_test.iloc[:,col]
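
Feature selection is part of the model, so it should be learned from the train set only and then applied unchanged to the test set. Here is a minimal sketch of that pattern on synthetic data (the dataset, the column names `f0`..`f11` and `k=5` are all assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix (not the HR data)
X_arr, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0
)

# Fit the selector on the training split only
selector = SelectKBest(f_classif, k=5).fit(X_tr, y_tr)
kept = X_tr.columns[selector.get_support()]

# Apply the same column choice to both splits
X_tr_reduced = X_tr[kept]
X_te_reduced = X_te[kept]
print(list(kept))
```

Selecting by column name rather than by position makes it harder to end up with mismatched train and test columns.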

In [None]:
#Quick look at the new data
X_train_reduced

## Random Forest Classifier

In [None]:
#Create, fit and score the model
rfc = RandomForestClassifier(n_estimators=700,max_depth=10,n_jobs=-1,random_state=123)
rfc_model = rfc.fit(X_train_reduced,y_train)

#5-fold cross-validation on the train set
rfc_scores = cross_val_score(rfc,X_train_reduced,y_train,scoring='accuracy',cv=5)
print('Mean cross-validation accuracy on the train set:',rfc_scores.mean())

In [None]:
#Predict on the test set
y_pred_rfc = rfc_model.predict(X_test_reduced)
print('This is test score: ',accuracy_score(y_test,y_pred_rfc))

#Print the confusion_matrix
print('Confusion matrix:')
print(confusion_matrix(y_test,y_pred_rfc))

In [None]:
print(classification_report(y_test,y_pred_rfc))

Sadly, my model has an accuracy of about 85%, which is pretty close to the null model. I am still happy with its precision score. A few things could still be done to enhance the model's performance:
* First, the target feature is imbalanced, and a SMOTE algorithm could help with that problem.
* Get more data, because 1470 observations is pretty small.
* Remove outliers from the dataset.
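
SMOTE itself lives in the third-party imbalanced-learn package. As a simpler stand-in that needs only scikit-learn, the sketch below uses plain random oversampling of the minority class, which is a different technique from SMOTE (it repeats existing rows instead of synthesising new ones). The toy frame and its 5:1 imbalance are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical training set with a 5:1 class imbalance
train = pd.DataFrame({
    "feature": np.arange(60, dtype=float),
    "Attrition": [0] * 50 + [1] * 10,
})

majority = train[train["Attrition"] == 0]
minority = train[train["Attrition"] == 1]

# Sample the minority class with replacement up to the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

balanced = pd.concat([majority, minority_upsampled])
counts = balanced["Attrition"].value_counts()
print(counts)
```

Any resampling of this kind should be applied to the train set only, so that the test set keeps the real class proportions.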

# Features Importance

In this section we will finally find which features matter most for predicting attrition, ranked by importance.

## Plot the importance of each column in attrition rate

In [None]:
# Helper that plots the feature importances as a sorted horizontal bar chart
def plot_feature_importance(importance,names,model_type): 
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data) 
    
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True) 
    
    plt.figure(figsize=(10,8))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
plot_feature_importance(rfc.feature_importances_,X_train_reduced.columns,'RANDOM FOREST')
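
The `feature_importances_` attribute of a random forest is computed from the training data (impurity decrease). A possible cross-check is scikit-learn's `permutation_importance`, which measures how much the held-out score drops when one column is shuffled. The sketch below runs on synthetic data (the dataset, the column names and the hyperparameters are assumptions), so its ranking says nothing about the HR features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the informative features come first (f0, f1)
X_arr, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Mean drop in held-out accuracy when each column is shuffled in turn
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranked = pd.Series(result.importances_mean, index=X_toy.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

If both rankings put the same features at the top, the bar plot can be read with more confidence.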

# CONCLUSION

#### From the feature importance graph above, the company would be able to dig deeper into the most relevant features or attributes related to attrition among its employees, probably starting with the top 5.



Please upvote if you find this kernel useful. Feel free to comment on my code if you think something could have been done differently.

Thank you.
