## Preface

Hello, I am an aspiring HR Analyst who has been self-learning Python for a few months now. I am hoping that by creating these notebooks for datasets I find on Kaggle, I can improve my Python, data analysis, and machine learning skills.

Any feedback would be greatly appreciated.

Now onto the actual project.

## Introduction

Most, if not all, companies invest significant resources in acquiring and training their employees, so it is important for them to retain the talent they have invested so much in. When employees leave, even more resources need to be poured into their replacements, who carry the same risk of leaving.

Thus, it is important for companies to be able to predict employee attrition in order to develop strategies to reduce the phenomenon.

In this Kaggle notebook, we will do the following:

* **Exploratory Data Analysis** - Exploring the data and how features correlate with one another.
* **Feature Engineering** - Preparing our categorical data for our machine learning model.
* **Machine Learning Model Implementation** - Implementing a Random Forest Classifier model for our data.




In [None]:
#importing the usual libraries for EDA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Exploratory Data Analysis

This is the first step when tackling any dataset: we take a look at our data, explore the relationships between its features, and make some observations.

In [None]:
#loading the data into a DataFrame
df = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [None]:
#taking a peek into the DataFrame
df.head()

In [None]:
#Getting some more information about the dataset
df.describe().transpose()

In [None]:
#checking for null/missing values
df.isnull().sum()

In [None]:
df.info()

#seems that there are some categorical columns in this df, let's explore them

In [None]:
#Exploring the target feature
df['Attrition'].unique()

### Feature Engineering

Before carrying on with EDA, I would like to convert categorical features into numerical ones through one of the many methods for doing so. This would help give a clearer idea of what's going on with the data.

In [None]:
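#let's assign 1s and 0s to the Attrition column
#(assigning the result back avoids calling inplace=True on a column selection, which newer pandas versions may not apply to df)
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})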

In [None]:
#Assigning categorical features to 'categorical_cols'
categorical_cols = []
for col, value in df.items():  #iteritems() was removed in pandas 2.0
    if value.dtype == 'object':
        categorical_cols.append(col)
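The loop above can also be written as a one-liner with `select_dtypes`. Here is a minimal sketch on a made-up toy frame (the values and the `toy` names are my own assumptions, not taken from the dataset). Note that `exclude="number"` keeps every non-numeric column, so it would also pick up boolean or datetime columns if there were any.

```python
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "OverTime": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# Keep only the non-numeric columns and collect their names.
toy_categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(toy_categorical_cols)
```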

In [None]:
#storing these columns in a new dataframe called df_cat
df_cat = df[categorical_cols]
df_cat.head()

In [None]:
#taking a peek at the unique values in each of the categorical columns
for column in categorical_cols:
    print(f"{column} : {df[column].unique()}")
    print("-"*40)

In [None]:
#assigning numerical variables to our categorical data through sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
for column in categorical_cols:
    df[column] = label.fit_transform(df[column])
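One thing to keep in mind: `LabelEncoder` numbers the categories in sorted order, so a nominal feature such as JobRole ends up with an ordering that has no real meaning. Tree-based models often cope with this, but one-hot encoding is a common alternative. Below is a small sketch using `pd.get_dummies` on made-up rows (the toy data is an assumption, and this is not what the rest of the notebook uses).

```python
import pandas as pd

# Made-up rows; in the notebook these columns would come from df.
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
    "MonthlyIncome": [2000, 5000, 4000, 2500],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["MaritalStatus"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is more columns (one per category) in exchange for not implying any order between categories.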

In [None]:
#checking our new DataFrame with numerical values
df.head()

Now, we can carry on with our EDA.

In [None]:
df.hist(figsize=(20, 20));

Looking at the histogram of the Attrition feature, we can quickly notice that it is heavily skewed towards 0, meaning that far fewer people leave the company than stay. Nonetheless, it is still important to learn why those who leave do so, in order to develop strategies to retain them.
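To put a number on the imbalance instead of reading it off the histogram, `value_counts(normalize=True)` gives the share of each class. A quick sketch with made-up labels (in the notebook this would be run on `df['Attrition']`; the series below is an assumption for illustration only):

```python
import pandas as pd

# Made-up labels: 8 employees stayed (0), 2 left (1).
attrition = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Attrition")

# normalize=True returns proportions instead of raw counts.
class_share = attrition.value_counts(normalize=True)
print(class_share)
```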

In [None]:
#plotting some countplots, splitting on Attrition
plt.figure(figsize=(20,20))

plt.subplot(421)
sns.countplot(x='Age',data=df,hue='Attrition')
plt.subplot(422)
sns.countplot(x='OverTime', data=df, hue='Attrition')
plt.subplot(423)
sns.countplot(x='MaritalStatus', data=df, hue='Attrition')
plt.subplot(424)
sns.countplot(x='JobRole', data=df, hue='Attrition')
plt.subplot(425)
sns.countplot(x='JobLevel', data=df, hue='Attrition')
plt.subplot(426)
sns.countplot(x='JobSatisfaction', data=df, hue='Attrition')
plt.subplot(427)
sns.countplot(x='TotalWorkingYears', data=df, hue='Attrition')
plt.subplot(428)
sns.countplot(x='WorkLifeBalance', data=df, hue='Attrition')

plt.show()

Some observations that can be made:

 1. Almost 50% of employees who work overtime end up leaving the company.
 2. JobLevel = 1 has the highest percentage of attrition (Approx. 26%).
 3. Almost 50% of employees with TotalWorkingYears = 1 end up leaving the company.
 4. Laboratory Technicians have the highest percentage of attrition.
 5. Single employees are more likely to leave the company.
 6. Employees with a lower JobSatisfaction level are more likely to leave the company.
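Countplots show counts, so percentages like the ones above are easier to double-check by computing the attrition rate per group directly. Since Attrition is encoded as 0/1, the group mean is the attrition rate. A minimal sketch on made-up rows (the toy values are assumptions, not results from the dataset):

```python
import pandas as pd

# Made-up rows; Attrition is 1 for employees who left.
toy = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Attrition": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Mean of a 0/1 column per group = share of that group that left.
rate_by_overtime = toy.groupby("OverTime")["Attrition"].mean()
print(rate_by_overtime)
```

The same pattern works for any of the plotted features, e.g. grouping by JobLevel or JobRole.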

In [None]:
#let's get rid of the StandardHours, EmployeeCount and Over18 columns, as all rows have the same value.
#EmployeeNumber is dropped as well, since it is only an identifier.

df.drop(['StandardHours','Over18','EmployeeCount','EmployeeNumber'],axis=1,inplace=True)

In [None]:
#this will be quite a large heatmap, but will be worth taking a look at to spot correlated features
plt.figure(figsize=(25,25))
sns.heatmap(df.corr(),annot=True,cmap='coolwarm')

From the heatmap, we remark the following:

- Age is correlated with several features, including NumCompaniesWorked, MonthlyIncome, JobLevel, Education, and other more obvious features such as those relating to seniority.

- Attrition has some negative correlation with the following features: YearsWithCurrManager, YearsInCurrentRole, YearsAtCompany, TotalWorkingYears, StockOptionLevel, MonthlyIncome, JobLevel, JobInvolvement, EnvironmentSatisfaction, and Age. Attrition is also positively correlated with OverTime.

- As expected, JobLevel is almost perfectly correlated with MonthlyIncome. It is also highly correlated with TotalWorkingYears, i.e. work experience.

- JobSatisfaction seems to have no correlation with any of the other features.

- PerformanceRating is highly correlated with PercentSalaryHike, i.e. high performers earn better raises.
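On a 25x25 heatmap it is easy to miss things, so it can help to pull out just the Attrition column of the correlation matrix and sort it. A small sketch on a made-up frame (the toy values are assumptions; in the notebook the same line would be applied to `df`):

```python
import pandas as pd

# Made-up numeric frame with the target and two features.
toy = pd.DataFrame({
    "Attrition":     [1, 0, 1, 0, 0, 1, 0, 0],
    "OverTime":      [1, 0, 1, 0, 1, 1, 0, 0],
    "MonthlyIncome": [2000, 6000, 2500, 7000, 5000, 3000, 6500, 8000],
})

# Correlation of every feature with the target, most negative first.
target_corr = toy.corr()["Attrition"].drop("Attrition").sort_values()
print(target_corr)
```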

In [None]:
#Splitting the dataset
df_final = df.drop('Attrition',axis=1)
y = df['Attrition']

In [None]:
# Scaling the data
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
X = scaler.fit_transform(df_final)
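A caveat on the scaling step: fitting the scaler on the full dataset lets information from the future test rows (their minimum and maximum) leak into training. A common way around this is to split first and fit the scaler on the training rows only. Below is a sketch on random data; the `toy_scaler`, `X_raw` and `y_toy` names are assumptions of mine, and this is not the order used in the rest of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data: 100 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split first...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.30, random_state=71
)

# ...then learn min/max from the training rows only and reuse them on the test rows.
toy_scaler = MinMaxScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)  # test values may fall slightly outside [0, 1]
```

Random forests themselves are not sensitive to feature scaling, but distance-based steps such as SMOTE are, so scaling still matters later on.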

### Machine Learning Model Implementation

Now that we've explored our data and converted the categorical data into numerical data, we can move forward with the implementation of our ML model.

For the purpose of this classification task, I've opted for a Random Forest Classifier, as it combines the predictions of many decision trees in order to create a more accurate model.

In [None]:
#importing the train_test_split function
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=71)
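With an imbalanced target, a plain random split can leave the train and test sets with somewhat different attrition rates. Passing `stratify` to `train_test_split` keeps the class proportions roughly equal in both. A sketch on made-up labels (the 84/16 split and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 84 stayed (0), 16 left (1).
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the share of each class similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=71, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```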

In [None]:
#importing RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
#initializing the RFC object
rfc = RandomForestClassifier(n_estimators=1000)

In [None]:
#fitting the data
rfc.fit(X_train,y_train)

In [None]:
#making the predictions
predictions = rfc.predict(X_test)

In [None]:
#importing some reporting tools
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
print('classification report: ')
print('='*40)
print(classification_report(y_test,predictions))
print('\n')
print('confusion matrix: ')
print('='*40)
print(confusion_matrix(y_test,predictions))

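We can see that our model did an alright job, with an accuracy of about 85%. Keep in mind that the classes are imbalanced, so accuracy alone can look better than the model really is; the recall for Attrition = 1 in the report above is worth checking too.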
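To judge an accuracy score on imbalanced data, it helps to compare it with a baseline that always predicts the majority class. Here is a sketch using scikit-learn's `DummyClassifier` on made-up labels; the 84/16 split is an assumption chosen for illustration, not a figure taken from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 84 stayed (0), 16 left (1). Features are irrelevant to the dummy model.
y_true = np.array([0] * 84 + [1] * 16)
X_dummy = np.zeros((100, 1))

# Always predicts the most frequent class, i.e. "nobody leaves".
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_pred = baseline.predict(X_dummy)

baseline_acc = accuracy_score(y_true, baseline_pred)    # share of majority class
baseline_recall = recall_score(y_true, baseline_pred)   # no leaver is ever caught
print(baseline_acc, baseline_recall)
```

A model is only useful for retention work if it beats this baseline on accuracy and, more importantly, has a recall for the leavers that is well above zero.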

As noted before, there exists a significant imbalance between the count of each of the two attrition values. Let's see if we can improve our model using SMOTE. 

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=71)

In [None]:
sm = SMOTE(random_state=71)
X_train, y_train = sm.fit_resample(X_train, y_train)  #fit_sample was renamed to fit_resample in newer imbalanced-learn versions

In [None]:
rfc = RandomForestClassifier(n_estimators=1000)

In [None]:
rfc.fit(X_train,y_train)

In [None]:
predictions = rfc.predict(X_test)

In [None]:
print('classification report: ')
print('='*40)
print(classification_report(y_test,predictions))
print('\n')
print('confusion matrix: ')
print('='*40)
print(confusion_matrix(y_test,predictions))

Here we observe that using SMOTE actually made our accuracy worse and only slightly improved our True Negative classifications. I am not sure why this is the case, but perhaps a more experienced/knowledgeable individual can point me in the right direction! 
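One possible explanation, offered tentatively: techniques that rebalance the classes usually trade some overall accuracy for better recall on the minority class, so accuracy alone may not show their benefit. Two alternatives that do not need resampling are `class_weight="balanced"` and lowering the decision threshold on `predict_proba`. The sketch below uses synthetic data from `make_classification`; the thresholds, sizes and variable names are all assumptions, and results on the real dataset may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 84% class 0, 16% class 1).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=71
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=71, stratify=y_syn
)

# class_weight="balanced" makes errors on the rare class count more during training.
weighted_rfc = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=71
)
weighted_rfc.fit(X_tr, y_tr)

# Predicted probability of class 1; a lower threshold flags more employees as likely leavers.
leave_proba = weighted_rfc.predict_proba(X_te)[:, 1]
recalls = {}
for threshold in (0.5, 0.3):
    preds = (leave_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
print(recalls)
```

Lowering the threshold can only keep or raise recall for the leavers, at the cost of more false alarms, so the right value depends on how costly each kind of mistake is.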

- 17/Jul/2020 - EDIT_1: Scaled the data. This seemed to help SMOTE improve the model, but not by much.

### Conclusion

In this notebook, we implemented a simple pipeline for predicting employee attrition. We went over some EDA and Feature Engineering, and implemented a straightforward Random Forest Classifier with an 85% accuracy score (though I'm sure it can be improved).

On that note, more features can be derived from the data that might also help improve the model. I will be coming back to this notebook to give it another go as I improve my Python skills.
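As a rough idea of what derived features could look like, here is a sketch that builds two ratios from columns that exist in the dataset. The new column names `IncomePerJobLevel` and `CompanyTenureShare` are hypothetical, the rows are made up, and I have not tested whether these features actually help the model.

```python
import pandas as pd

# Made-up rows using column names from the dataset.
toy = pd.DataFrame({
    "MonthlyIncome": [2000, 6000, 9000],
    "JobLevel": [1, 2, 3],
    "YearsAtCompany": [1, 5, 10],
    "TotalWorkingYears": [2, 10, 10],
})

# Hypothetical feature: income relative to job level.
toy["IncomePerJobLevel"] = toy["MonthlyIncome"] / toy["JobLevel"]

# Hypothetical feature: share of the career spent at this company.
# Zeros are replaced with NaN first to avoid dividing by zero.
toy["CompanyTenureShare"] = toy["YearsAtCompany"] / toy["TotalWorkingYears"].replace(0, float("nan"))
print(toy)
```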