# HR ANALYTICS

The aim of this notebook is to predict employee attrition from the features in the dataset. **Logistic Regression, Decision Tree, Random Forest, Support Vector Machine and Gradient Boosting** classification algorithms are used for prediction, and the ML models are implemented using **Pipelines**. Each feature is analysed, and the features that influence the attrition rate are chosen for prediction.

## Table of contents
1. Data Gathering
2. Feature Selection
    <br>2.1 Categorical features analysis
    <br>2.2 Quantitative features analysis
3. Data Transformation
4. Machine Learning Models Implementation
5. Conclusion

# 1. DATA GATHERING

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

In [None]:
#loading the dataset
retent_df = pd.read_csv('/kaggle/input/hr-analytics/HR_comma_sep.csv')
retent_df.head()

In [None]:
#check for null values
retent_df.isna().sum()

# 2. FEATURE SELECTION

Categorical columns - Work_accident, Department, salary
<br>Quantitative columns - satisfaction_level, last_evaluation, number_project, average_montly_hours (spelled this way in the dataset), time_spend_company, promotion_last_5years

In [None]:
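#analysing the correlation between the numeric features
#(Department and salary hold strings, so they are excluded; numeric_only needs pandas >= 1.5)
retent_df.corr(numeric_only=True)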
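To read the correlation matrix with the target in mind, one option is to pull out only the `left` column and sort it. The sketch below uses a small made-up frame (`toy_df` and its values are hypothetical, not the real HR data) to show the idea.

```python
import pandas as pd

# Hypothetical toy data standing in for retent_df (values are invented)
toy_df = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "time_spend_company": [2, 3, 2, 5, 6, 4],
    "salary": ["low", "high", "medium", "low", "low", "medium"],
    "left": [0, 0, 0, 1, 1, 1],
})

# Correlation of every numeric feature with the target, sorted ascending.
# String columns such as salary are skipped by numeric_only=True.
target_corr = (
    toy_df.corr(numeric_only=True)["left"]
    .drop("left")
    .sort_values()
)
print(target_corr)
```

Applied to `retent_df`, the same expression would rank the numeric features by how strongly (and in which direction) they move with attrition.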

In [None]:
#heatmap of the correlations between the numeric features
sns.heatmap(retent_df.corr(numeric_only=True))

In [None]:
#pairwise plots of the numeric columns (pass the data itself, not its correlation matrix)
sns.pairplot(retent_df)

In [None]:
retent_df.describe()

In [None]:
#checking null values
retent_df.isna().sum()

## 2.1 CATEGORICAL FEATURES ANALYSIS 

In [None]:
pd.crosstab(retent_df.Department, retent_df.left).style.background_gradient(cmap='summer_r')

From the table, it can be seen that the Sales department has the largest number of employees who left the organisation.
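Raw counts favour large departments, so it may help to also look at the share of each department that left. A minimal sketch, assuming a made-up frame (`toy_df` is hypothetical) with the same `Department` and `left` columns:

```python
import pandas as pd

# Hypothetical toy data standing in for retent_df (values are invented)
toy_df = pd.DataFrame({
    "Department": ["sales", "sales", "sales", "sales", "hr", "hr"],
    "left":       [1, 0, 0, 0, 1, 0],
})

# Raw counts per department and outcome
counts = pd.crosstab(toy_df.Department, toy_df.left)

# Row-normalised shares: fraction of each department that stayed (0) or left (1)
rates = pd.crosstab(toy_df.Department, toy_df.left, normalize="index")
print(rates)
```

In this toy example both departments lose one employee, but the attrition rate is twice as high in `hr` because it is smaller.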

In [None]:
pd.crosstab(retent_df.salary, retent_df.left).plot(kind='bar')
plt.title('Salary vs Employee Attrition')

It seems that low-salary employees have left the organisation in the greatest numbers.

In [None]:
pd.crosstab(retent_df.Work_accident, retent_df.left).plot(kind='bar',cmap='copper')
plt.title('Work accident vs Employee retention')
plt.xticks([0,1],['No','Yes'])

Work accident does not appear to influence the employee retention rate.

## 2.2 QUANTITATIVE FEATURES ANALYSIS 

In [None]:
#mean of each numeric feature for employees who stayed (0) and who left (1)
retent_df.groupby('left').mean(numeric_only=True)

From the above table, it can be seen that satisfaction level and promotion rate are lower among employees who have left the organisation. Average monthly working hours and time spent in the company are higher for the employees who left. These 4 features can be chosen for prediction.

In [None]:
#satisfaction level vs employee attrition
sns.violinplot(x='left', y='satisfaction_level',data=retent_df)
plt.title('Satisfaction level vs Employee attrition')
plt.tight_layout()
plt.xticks([0,1],['No','Yes'])

It seems that employees with a lower satisfaction level have left the most.

In [None]:
#promotion vs Employee attrition
pd.crosstab(retent_df.promotion_last_5years, retent_df.left).style.background_gradient(cmap='summer_r')

Employees who received no promotion have left the organisation in greater numbers.

In [None]:
retent_df.columns

In [None]:
sns.boxplot(x='left', y='average_montly_hours', data=retent_df)
plt.title('Average monthly hours vs Employee attrition')
plt.xticks([0,1],['No','Yes'])

Employees who worked more hours per month have left in greater numbers.

In [None]:
#Time spent in the company vs Employee attrition
sns.boxplot(x='left', y='time_spend_company', data=retent_df)
plt.title('Time spent in the company vs Employee attrition')
plt.xticks([0,1],['No','Yes'])

Thus, it is reasonable to conclude that Department, salary, satisfaction level, promotion, average monthly working hours and time spent in the company influence the employee attrition rate. Hence these features are chosen for prediction.

# 3. DATA TRANSFORMATION

In [None]:
#save the 6 features selected from feature analysis in a separate dataframe
df = retent_df[['Department','salary','satisfaction_level','promotion_last_5years','average_montly_hours','time_spend_company']]
df.head()

In [None]:
#transform the data to numerical so that it can be passed to the model
salary_df = pd.get_dummies(df.salary, prefix='salary')
salary_df

In [None]:
#transform department
dept_df = pd.get_dummies(df.Department, prefix='dept')
dept_df.head()

In [None]:
#concatenate salary with the main dataframe
transform_df = pd.concat([df,salary_df], axis='columns')
transform_df.head()

In [None]:
#concatenate Department with the main dataframe
transform_df = pd.concat([transform_df,dept_df], axis='columns')
transform_df.head()

In [None]:
#drop columns Department and salary
transform_df.drop(['Department','salary'], axis=1, inplace=True)
transform_df.head()
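The encode, concatenate and drop steps above can also be done in a single `pd.get_dummies` call by naming the columns to encode. The sketch below uses a small made-up frame (`toy_df` is hypothetical); the resulting column order may differ from `transform_df`, which does not matter to the models as long as training and test data share it.

```python
import pandas as pd

# Hypothetical toy data with the same kinds of columns as df (values are invented)
toy_df = pd.DataFrame({
    "Department": ["sales", "hr", "sales"],
    "salary": ["low", "high", "medium"],
    "satisfaction_level": [0.4, 0.8, 0.6],
})

# One-hot encode both string columns in one step; the originals are dropped
# automatically and the prefixes match those used in the cells above.
encoded_df = pd.get_dummies(
    toy_df,
    columns=["Department", "salary"],
    prefix={"Department": "dept", "salary": "salary"},
)
print(encoded_df.columns.tolist())
```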

# 4. MACHINE LEARNING MODELS IMPLEMENTATION

In [None]:
#split the dataset into training and test data
x_train, x_test, y_train, y_test = train_test_split(transform_df, retent_df.left, test_size=0.2)
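The split above is random on every run, so reported accuracies will vary slightly. If reproducibility matters, one possible variant fixes `random_state` and stratifies on the target so both parts keep the same share of leavers. This is a sketch on made-up data (`X_toy`, `y_toy` and the seed 42 are arbitrary assumptions), not the split used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25% of them in the positive class
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 25 + [0] * 75, name="left")

# random_state makes the split repeatable; stratify keeps the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```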

In [None]:
#create a pipeline for Logistic Regression
pipeline_log_reg = Pipeline([('Logistic Regression',LogisticRegression(solver='lbfgs', max_iter=1000))])

#create a pipeline for decision tree
pipeline_dec_tree = Pipeline([('Decision tree', DecisionTreeClassifier())])

#create a pipeline for Random Forest Classifier
pipeline_random_forest = Pipeline([('Random Forest', RandomForestClassifier())])

#create a pipeline for Support Vector Machine
pipeline_svm = Pipeline([('Support Vector Machine', SVC(C=2.0))])

#create a pipeline for Gradient Boosting Classifier
pipeline_gradient_boost = Pipeline([('Gradient Boosting Classifier', GradientBoostingClassifier(learning_rate=0.1))])
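Each pipeline above holds a single estimator. Pipelines become more useful once preprocessing is added as an earlier step. Logistic regression and SVMs are sensitive to feature scale, and features such as `average_montly_hours` and `satisfaction_level` are likely on very different scales, so a scaling step may be worth trying. A minimal sketch on synthetic data (`X_toy`, `y_toy` and the step names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic mixed scales
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy[:, 0] *= 100

# The scaler is fitted on the training data only, then the SVM sees
# standardised features; both steps run together on fit/predict/score.
pipeline_svm_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(C=2.0)),
])
pipeline_svm_scaled.fit(X_toy, y_toy)
train_accuracy = pipeline_svm_scaled.score(X_toy, y_toy)
print(train_accuracy)
```

Whether scaling actually improves the scores on the HR data would need to be checked by running it.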

In [None]:
#create a list and dictionary for pipelines
pipelines = [pipeline_log_reg, pipeline_dec_tree, pipeline_random_forest, pipeline_svm, pipeline_gradient_boost]
pipelines_dict = {0:'Logistic Regression',
                 1:'Decision Tree',
                 2:'Random Forest',
                 3:'Support Vector Machine',
                 4:'Gradient Boosting Classifier'}

In [None]:
#predict and display the accuracy of models
for i,pipe in enumerate(pipelines):
    pipe.fit(x_train,y_train)
    y_pred = pipe.predict(x_test)
    print("Accuracy of {} is {}".format(pipelines_dict[i], pipe.score(x_test, y_test)))
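Accuracy from a single split can be noisy, and it does not show how well the leavers themselves are detected. One possible complement is cross-validation with more than one metric. The sketch below uses synthetic, imbalanced data (`X_toy`, `y_toy` and the 75/25 class weights are assumptions, not properties of the HR dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data with a minority positive class
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, weights=[0.75, 0.25], random_state=0
)

# 5-fold cross-validation reporting accuracy and recall of the positive class
cv_results = cross_validate(
    RandomForestClassifier(random_state=0),
    X_toy, y_toy, cv=5, scoring=["accuracy", "recall"],
)
mean_accuracy = cv_results["test_accuracy"].mean()
mean_recall = cv_results["test_recall"].mean()
print(mean_accuracy, mean_recall)
```

The same call could be made with any of the pipelines above and `transform_df`, `retent_df.left` in place of the toy data.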

# 5. CONCLUSION

From the analysis and prediction, the test-set accuracies of the models are listed below (values vary slightly between runs because the train/test split is random):
<br>Logistic Regression - 77%
<br>Decision Tree - 97%
<br>Random Forest - 98%
<br>Support Vector Machine - 78%
<br>Gradient Boosting Classifier - 96%