# Exploratory Data Analysis on Employee Attrition Data

# Dataset Description

For this project, I am using the IBM HR Analytics Employee Attrition & Performance dataset, which can be obtained from https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset. The dataset describes employee attrition: the gradual reduction in staff numbers that occurs as employees retire or resign and are not replaced. Attrition can be costly for businesses because the company loses both employee productivity and employee knowledge. This is a fictional dataset created by IBM data scientists. A preview of the dataset is shown below.

In [None]:
import pandas as pd  # Import the pandas library for exploratory data analysis
data = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')  # Load the dataset into a DataFrame
data.head()  # Get the first 5 rows of the dataset

In [None]:
# Get the shape of the dataset
data.shape

There are 1470 rows and 35 columns in this dataset. The columns are attributes such as Age, Attrition, Department, and Education. Several attributes, such as Education, are stored as numeric codes, where each code stands for a label as follows:

Education
1.   'Below College'
2.   'College'
3.   'Bachelor'
4.   'Master'
5.   'Doctor'

EnvironmentSatisfaction
1.   'Low'
2.   'Medium'
3.   'High'
4. 'Very High'

JobInvolvement
1.   'Low'
2.   'Medium'
3.   'High'
4. 'Very High'

JobSatisfaction
1.   'Low'
2.   'Medium'
3.   'High'
4. 'Very High'

PerformanceRating
1.   'Low'
2.   'Good'
3.   'Excellent'
4. 'Outstanding'

RelationshipSatisfaction
1.   'Low'
2.   'Medium'
3.   'High'
4. 'Very High'

WorkLifeBalance
1.   'Bad'
2.   'Good'
3.   'Better'
4. 'Best'
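
The code-to-label tables above can be applied with a dictionary and `Series.map`. The following is only a minimal sketch on a small toy frame (the values and the `...Label` column names are made up for illustration and are not part of the dataset); the same pattern should work on the real `data` frame.

```python
import pandas as pd

# Label mappings copied from the lists above
education_labels = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}
satisfaction_labels = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({'Education': [2, 1, 4, 3, 5], 'JobSatisfaction': [4, 2, 3, 1, 3]})

# Add readable label columns without overwriting the numeric codes
toy['EducationLabel'] = toy['Education'].map(education_labels)
toy['JobSatisfactionLabel'] = toy['JobSatisfaction'].map(satisfaction_labels)
print(toy)
```

Keeping the numeric codes next to the labels means the ordering of the levels is still available for sorting and plotting.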

The data type of each attribute is shown below.

In [None]:
# Get the data types of each column
data.dtypes

# Data Exploration

To explore the dataset further, I first check whether there are any missing values.

In [None]:
# Count every missing value in each column
data.isna().sum()

In [None]:
# Double-check for missing values: True means at least one missing value exists, False means there are none
data.isnull().values.any()

From the output above we can see that there are no missing values in this dataset. Next, we get the descriptive statistics of the numerical attributes in the dataset.

In [None]:
# Get descriptive statistics of the dataset
data.describe()

Now we explore the number of employees in each category of the categorical attributes in this dataset.

In [None]:
# Print every object-typed column with its unique values and value counts
for column in data.columns:
    if data[column].dtype == object:
        print(str(column) + ' : ' + str(data[column].unique()))
        print(data[column].value_counts())
        print("_________________________________________________________________")

The 'Attrition' attribute has two possible values: 'Yes' means the employee left the company, while 'No' means the employee stayed. The number of employees who left and stayed is shown below.

In [None]:
# Count the employees who left ('Yes') and who stayed ('No')
data['Attrition'].value_counts()

In [None]:
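# Visualisation of the 'Attrition' attribute
import seaborn as sns
# Pass the column by name; recent seaborn versions treat a positional argument as `data`
sns.countplot(x='Attrition', data=data)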

The result shows that 237 employees left and 1233 employees stayed. Next, we look at the number of employees who left and stayed, broken down by several other attributes.
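
The two counts can also be expressed as proportions, which makes the class imbalance easier to see. The sketch below rebuilds the column from the counts reported above rather than reading the CSV, so it runs on its own; on the real frame, `data['Attrition'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Rebuild the Attrition column from the reported counts (237 'Yes', 1233 'No')
attrition = pd.Series(['Yes'] * 237 + ['No'] * 1233, name='Attrition')

# normalize=True returns proportions instead of raw counts
proportions = attrition.value_counts(normalize=True)
attrition_rate = proportions['Yes']

print(proportions.round(3))
print(f"Attrition rate: {attrition_rate:.1%}")
```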

First, by age.

In [None]:
#Show the number of employees that left and stayed by age
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='Age', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

From the chart above we can see that the employees who left are mostly in their late 20s.
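
Counts per single year of age can be noisy, so one option is to group ages into bands and compare the share of employees who left in each band. This is a minimal sketch on made-up toy data; the bin edges, the band labels, and the helper columns `AgeGroup` and `Left` are my own assumptions, not something defined by the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'Age': [22, 27, 29, 31, 35, 38, 44, 47, 52, 58],
    'Attrition': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No'],
})

# Hypothetical age bands; pd.cut uses right-closed intervals by default, e.g. (17, 29]
bins = [17, 29, 39, 49, 60]
labels = ['18-29', '30-39', '40-49', '50-60']
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# Encode attrition as 1/0 so that the mean per group is the share who left
toy['Left'] = (toy['Attrition'] == 'Yes').astype(int)
rate_by_group = toy.groupby('AgeGroup', observed=False)['Left'].mean()
print(rate_by_group)
```

Because each band has a different number of employees, the share who left is often easier to compare than the raw count.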

Here is the number of employees that left and stayed by their frequency of business travel.

In [None]:
#Show the number of employees that left and stayed by business travel
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='BusinessTravel', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

In every business travel category, the employees who stayed clearly outnumber those who left, and the split between 'Yes' and 'No' looks roughly similar across categories.
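
Since the categories have very different sizes, it can help to look at row-wise shares as well as raw counts. The following is a sketch on toy data (the rows are invented for illustration) using `pd.crosstab` with `normalize='index'`, which divides each row by its own total.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'BusinessTravel': ['Travel_Rarely'] * 5 + ['Travel_Frequently'] * 3 + ['Non-Travel'] * 2,
    'Attrition': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No'],
})

# Raw counts per category, the same information the count plot shows
counts = pd.crosstab(toy['BusinessTravel'], toy['Attrition'])

# Share of 'Yes' and 'No' within each category (each row sums to 1)
shares = pd.crosstab(toy['BusinessTravel'], toy['Attrition'], normalize='index')

print(counts)
print(shares.round(2))
```

The same two calls can be reused for the other categorical attributes by changing the first argument.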

Here is the number of employees that left and stayed by their education field.

In [None]:
# Show the number of employees that left and stayed by education field
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='EducationField', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

As we can see from the chart above, the largest group of employees who left has a Life Sciences education background.

Now let's see the number of employees who left and stayed by their environment satisfaction level.

In [None]:
# Show the number of employees that left and stayed by environment satisfaction level
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='EnvironmentSatisfaction', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

As explained before, this attribute has several values that are represented by numbers as follows:

1.  'Low'
2.  'Medium'
3.  'High'
4.  'Very High'

The chart above shows that employees left at every environment satisfaction level, and the number who left does not differ greatly between levels.

Now let's see the number of employees who left and stayed by gender.

In [None]:
# Show the number of employees that left and stayed by gender
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='Gender', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

From the chart above we can see that, for both genders, the employees who stayed clearly outnumber those who left, and the split between 'Yes' and 'No' looks roughly similar.

Here is the number of employees who left and stayed by their job level.

In [None]:
# Show the number of employees that left and stayed by job level
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='JobLevel', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

As we can see, most of the employees who left the company had a level 1 job.

Now let's see the number of employees who left and stayed by their job role.

In [None]:
# Show the number of employees that left and stayed by job role
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
# Passing y= draws horizontal bars, so the conflicting orient='v' argument is dropped
sns.countplot(y='JobRole', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

As we can see, Laboratory Technician is the job role with the largest number of employees who left the company.

Now let's see the number of employees who left and stayed by their job satisfaction.

In [None]:
# Show the number of employees that left and stayed by job satisfaction
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='JobSatisfaction', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

The chart above shows that employees left at every job satisfaction level, and the number who left does not differ greatly between levels.

Now let's see the number of employees who left and stayed by their performance rating.

In [None]:
# Show the number of employees that left and stayed by performance rating
import matplotlib.pyplot as plt
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)

#ax = axis
sns.countplot(x='PerformanceRating', hue='Attrition', data = data, palette="colorblind", ax = ax,  edgecolor=sns.color_palette("dark", n_colors = 1));

As explained before, this attribute has several values that are represented by numbers as follows:

1.  'Low'
2.  'Good'
3.  'Excellent'
4.  'Outstanding'

From the chart above we can see that most employees who left had an 'Excellent' performance rating.

# Hypothesis formulation

The analysis above gave us some insight into the data. Since this data can be used to build a model that predicts whether an employee will leave the company, we need to know which attributes have a positive or negative, and a strong or weak, correlation with employee attrition. Here I formulate several hypotheses on this matter.

* First hypothesis

H0 = Age has a positive correlation with employee attrition.

Ha = Age has a negative correlation with employee attrition.

* Second hypothesis

H0 = Job level has a positive correlation with employee attrition.

Ha = Job level has a negative correlation with employee attrition.

* Third hypothesis

H0 = Job role has a positive correlation with employee attrition.

Ha = Job role has a negative correlation with employee attrition.

# Hypothesis Testing and Result

To test the hypotheses, I made a new dataframe consisting of the selected attributes and computed their correlations.

In [None]:
# Make a new dataframe and get its preview
new_data = pd.DataFrame({'Age': data['Age'], 'Job Level': data['JobLevel'], 'Job Role': data['JobRole'], 'Attrition': data['Attrition']})
new_data.head()

As we can see in the new dataframe, the Job Role and Attrition values are not numerical. So I transformed them into numerical values before computing their correlations.

In [None]:
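# Transform non-numeric columns into numerical columns
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder

for column in new_data.columns:
    # Leave columns that are already numeric (Age, Job Level) untouched;
    # comparing a dtype to np.number with == does not detect them reliably
    if is_numeric_dtype(new_data[column]):
        continue
    # Encode string categories as integer labels (classes are sorted alphabetically)
    new_data[column] = LabelEncoder().fit_transform(new_data[column])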
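
To make clear what the label encoding does to a string column, here is a small sketch on toy values. It uses `pd.factorize(sort=True)` as a stand-in for `LabelEncoder` (my substitution, so that the example needs only pandas); both assign integer codes to the sorted unique values, so 'No' becomes 0 and 'Yes' becomes 1. An explicit dictionary gives the same result while keeping the mapping visible.

```python
import pandas as pd

# Toy Attrition values (illustrative only)
attrition = pd.Series(['Yes', 'No', 'No', 'Yes', 'No'])

# Integer codes assigned to the alphabetically sorted unique values
codes, classes = pd.factorize(attrition, sort=True)
print(list(classes))  # the sorted classes; position in this list is the code
print(list(codes))

# Equivalent explicit mapping, which documents the meaning of each code
explicit = attrition.map({'No': 0, 'Yes': 1})
print(explicit.tolist())
```

With this encoding, a positive correlation with Attrition means that higher values of an attribute go together with leaving the company.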

In [None]:
new_data.head()

As you can see, the non-numeric columns are now encoded as integers. Next, we get the correlation table below.

In [None]:
# Get the correlation table
new_data.corr()

We can also visualize the correlations using a heatmap.

In [None]:
# Visualize the correlation
plt.figure(figsize=(7,7))
sns.heatmap(new_data.corr(), annot=True, fmt='.0%')

A correlation coefficient measures how strong the relationship between two variables is. The Pearson coefficient, which `corr()` computes by default, returns a value between -1 and 1, where:

*  1 indicates a perfect positive linear relationship.
*  -1 indicates a perfect negative linear relationship.
*  A result of zero indicates no linear relationship at all.
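
As a worked illustration of the coefficient, the sketch below computes Pearson's r by hand as the sum of products of the centered values divided by the square root of the product of their sums of squares, and compares it with `np.corrcoef`. The arrays and the helper name `pearson_r` are made up for this example.

```python
import numpy as np

# Toy values (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_c = a - a.mean()  # center each variable on its mean
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r = pearson_r(x, y)
print(round(r, 3))
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # agrees with NumPy's implementation
print(pearson_r(x, -x))  # a perfectly inverse relationship gives -1
```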

From the correlation table and heatmap above, we can conclude that:

* Age and Attrition have a negative, weak correlation.
* Job level and Attrition have a negative, weak correlation.
* Job role and Attrition have a positive, weak correlation. Note that job role is a nominal attribute, so the sign of this value depends on the order in which the roles were label-encoded.
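
The result for job role deserves some caution, because the integer codes of a nominal attribute carry no natural order. The toy sketch below (invented rows, with a hypothetical 0/1 column `Left`) shows that reversing the code order flips the sign of the correlation, while the share of leavers per role does not depend on the encoding.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'JobRole': ['Laboratory Technician'] * 2 + ['Manager'] * 2 + ['Sales Executive'] * 2,
    'Left': [1, 1, 0, 1, 0, 0],  # 1 = left the company, 0 = stayed
})

# Two equally arbitrary integer encodings of the same roles
alphabetical = {'Laboratory Technician': 0, 'Manager': 1, 'Sales Executive': 2}
reversed_order = {'Laboratory Technician': 2, 'Manager': 1, 'Sales Executive': 0}

r_alphabetical = toy['JobRole'].map(alphabetical).corr(toy['Left'])
r_reversed = toy['JobRole'].map(reversed_order).corr(toy['Left'])
print(r_alphabetical, r_reversed)  # same magnitude, opposite signs

# An encoding-independent summary: share of employees who left per role
rate_by_role = toy.groupby('JobRole')['Left'].mean()
print(rate_by_role)
```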

For further analysis, I suggest looking for more insights in each variable through visualization and feature engineering.

This dataset is a good dataset for learning about regression models. Because it has no missing values and its attributes are clearly defined, it does not require much data processing.

*Pratiwi Fitriana Haris*

*Coursera IBM Exploratory Data Analysis for Machine Learning*