# IBM_HR_Analytics_Employee_Attrition_Performance 
# Exploratory Data Analysis (EDA)

### The dataset is about employee attrition. This analysis explores whether particular factors or patterns lead to attrition. If they do, employers can take precautions to prevent it, because from the employer's point of view attrition is a loss to the company, both monetary and non-monetary.

### **Importing the packages**

In [None]:
##Importing the packages
#Data processing packages
import numpy as np 
import pandas as pd 

#Visualization packages
import matplotlib.pyplot as plt 
import seaborn as sns 

import warnings
warnings.filterwarnings('ignore')

### **Importing the data**

In [None]:
#Import Employee Attrition data
data=pd.read_csv('../input/employee-atrition/employee_attrition_data.csv')

### **Basic Analysis**

In [None]:
#Find the size of the data Rows x Columns
data.shape

**COMMENTS:** The data consists of 1470 rows and 35 columns

In [None]:
#Display first 5 rows of Employee Attrition data
data.head()

In [None]:
#Find Basic Statistics like count, mean, standard deviation, min, max etc.
data.describe()

**COMMENTS:** 
1. A count of 1470 for every field shown indicates that none of the numerical fields has missing values.
2. The standard deviation (std) is zero for the fields "EmployeeCount" and "StandardHours". This indicates that every value in each of those fields is the same.
3. Minimum (min) and maximum (max) define the range of values for each field.
4. Mean (mean) is the average of all the values in a field. The means vary widely across fields, so we need to scale the data.
5. The 25%, 50% and 75% percentiles indicate the distribution of the data.
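
The zero-standard-deviation observation in point 2 can also be checked directly by counting distinct values per column. The snippet below is a minimal sketch: `constant_columns` is a hypothetical helper name and `toy` is a small made-up frame standing in for the real data, so the values are illustrative only.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    # nunique(dropna=False) counts distinct values, treating NaN as a value too
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]

# Hypothetical toy frame (not the real dataset)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
})

constant_cols = constant_columns(toy)
print(constant_cols)
```

On the real data, calling `constant_columns(data)` should flag the same fields that show `std = 0` in `describe()`, and it would also catch constant text columns that `describe()` skips by default.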

In [None]:
#Find the information about the fields, field datatypes and Null values
data.info()

**Category Columns**

In [None]:
cat_cols = data.columns[data.dtypes=='object']
data_cat = data[cat_cols]
print(cat_cols)
print(cat_cols.shape)
data_cat.head()

In [None]:
#A lambda function is a small anonymous function.
#A lambda function can take any number of arguments, but can only have one expression.
data['Attrition']=data['Attrition'].apply(lambda x : 1 if x=='Yes' else 0)

In [None]:
data.head()

In [None]:
data[data.Attrition == 1].head()

**Numerical Columns**

In [None]:
num_cols = data.columns[data.dtypes!='object']
data_num = data[num_cols]
print(num_cols)
print(num_cols.shape)
data_num.head()

In [None]:
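#Use the numeric columns only; recent pandas versions raise an error when text columns are correlated
data_num.corrwith(data_num.Attrition, axis = 0).abs().sort_values(ascending = False).head(10)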

In [None]:
data_num.corrwith(data_num.Attrition, axis = 0).sort_values(ascending = False).head()

**COMMENTS:**  Attrition has positive correlation with DistanceFromHome and NumCompaniesWorked 

In [None]:
data_num.corrwith(data_num.Attrition, axis = 0).sort_values(ascending = True).head()

**COMMENTS:**  Attrition has negative correlation with JobLevel, YearsInCurrentRole and MonthlyIncome
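
The ranking of features by correlation with the target can be wrapped in a small function that drops the target itself, which otherwise always appears at the top with a correlation of 1. This is a sketch only: `correlation_with_target` is a hypothetical helper and `toy` is invented data, not the real dataset.

```python
import pandas as pd

def correlation_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric feature with a numeric target."""
    # Keep numeric columns only, so text columns cannot break the computation
    numeric = df.select_dtypes(include="number")
    # Drop the target from the feature side; its self-correlation is always 1
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    # Ascending sort: most negative first, most positive last
    return corr.sort_values()

# Hypothetical toy frame (not the real dataset)
toy = pd.DataFrame({
    "Attrition": [1, 1, 0, 0, 0, 1],
    "MonthlyIncome": [2000, 2500, 6000, 7000, 6500, 3000],
    "DistanceFromHome": [20, 18, 3, 5, 2, 4],
    "Department": ["Sales", "Sales", "R&D", "R&D", "HR", "Sales"],
})

ranked = correlation_with_target(toy, "Attrition")
print(ranked)
```

With a 0/1 target, the Pearson correlation computed here is the same quantity as the point-biserial correlation, so the sign tells whether leavers tend to have higher or lower values of the feature.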

In [None]:
sns.countplot(x='TotalWorkingYears', hue='Attrition', data=data)

**COMMENTS:**  Attrition is higher among employees with fewer TotalWorkingYears

In [None]:
sns.countplot(x='DistanceFromHome', hue='Attrition', data=data)

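**COMMENTS:**  The number of leavers is highest at small DistanceFromHome values. These are raw counts, not rates, and the positive correlation between Attrition and DistanceFromHome noted above points the other way.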

In [None]:
data.JobLevel.value_counts().plot.bar()

**COMMENTS:**  Most of the employees belong to JobLevel 1 and 2

In [None]:
sns.countplot(x='JobLevel', hue='Attrition', data=data)

In [None]:
data[data.Attrition==1].TotalWorkingYears.value_counts(normalize=True, sort=False).plot.bar()

**COMMENTS:**  Attrition is higher when the total working years are fewer.

In [None]:
data[data.Attrition==1].DistanceFromHome.value_counts(normalize=True, sort=False).plot.bar()

**COMMENTS:**  Most of the employees who left live close to the office. This plot shows the share of leavers at each distance, not the attrition rate at each distance.
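
Counts of leavers and attrition rates can tell different stories when group sizes differ, so it may help to compute the rate per group explicitly. The sketch below uses a hypothetical helper, `attrition_rate_by`, and an invented `toy` frame in which most employees live close to the office; the numbers are illustrative and say nothing about the real data.

```python
import pandas as pd

def attrition_rate_by(df: pd.DataFrame, column: str, target: str = "Attrition") -> pd.Series:
    """Share of leavers within each value of `column` (target must be 0/1)."""
    # The mean of a 0/1 column is the fraction of ones in each group
    return df.groupby(column)[target].mean()

# Hypothetical toy frame: eight employees live near, two live far
toy = pd.DataFrame({
    "DistanceFromHome": [1, 1, 1, 1, 1, 1, 1, 1, 25, 25],
    "Attrition":        [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Raw number of leavers per distance: the "near" group has more leavers
leaver_counts = toy[toy.Attrition == 1].DistanceFromHome.value_counts()

# Rate of leaving per distance: the "far" group has the higher rate
rates = attrition_rate_by(toy, "DistanceFromHome")

print(leaver_counts)
print(rates)
```

In this toy example the near group has more leavers in absolute terms (2 versus 1) but a lower rate (25% versus 50%), which is the distinction between a count plot and a rate.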

In [None]:
plt.figure(figsize=(20, 20)) ; sns.heatmap(data_num.corr(), annot=True)

In [None]:
g = sns.pairplot(data_num.loc[:,'Age':'DistanceFromHome']); g.fig.set_size_inches(15,15)
#data_num.loc[:,'Age':'DistanceFromHome']

In [None]:
g = sns.pairplot(data_num.loc[:,'Education':'HourlyRate']); g.fig.set_size_inches(15,15)

In [None]:
g = sns.pairplot(data_num.loc[:,'JobInvolvement':'MonthlyRate']); g.fig.set_size_inches(15,15)

In [None]:
g = sns.pairplot(data_num.loc[:,'NumCompaniesWorked':'StandardHours']); g.fig.set_size_inches(15,15)

In [None]:
g = sns.pairplot(data_num.loc[:,'StockOptionLevel':'YearsAtCompany']); g.fig.set_size_inches(15,15)

In [None]:
g = sns.pairplot(data_num.loc[:,'YearsInCurrentRole':'YearsWithCurrManager']); g.fig.set_size_inches(15,15)

In [None]:
g = sns.pairplot(data_num); g.fig.set_size_inches(15,15)

In [None]:
data_num.hist(layout = (9, 3), figsize=(24, 48), color='blue', grid=False, bins=15)

### **Visualizing the impact of Categorical Features on the Target**

In [None]:
#Find attrition size (Values)
data['Attrition'].value_counts()

**COMMENTS:**  237 employees left the company out of total 1470 employees
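
The counts above translate into an overall attrition rate, which is a useful baseline for the per-category rates that follow. As a small worked example, the series below is rebuilt from the reported counts (237 leavers out of 1470) rather than read from the file.

```python
import pandas as pd

# Rebuild the target from the counts reported above: 237 ones, 1233 zeros
attrition = pd.Series([1] * 237 + [0] * (1470 - 237), name="Attrition")

# normalize=True returns shares instead of counts
shares = attrition.value_counts(normalize=True)
attrition_rate = shares[1]

print(f"{attrition_rate:.1%}")  # 16.1%
```

On the real data the same figure comes from `data['Attrition'].value_counts(normalize=True)`. A rate of roughly 16% also means the two classes are imbalanced.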

In [None]:
pd.crosstab(data.BusinessTravel, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Frequent travellers are more likely (25%) to leave the company than non-travellers
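
The percentages quoted in this section come from `normalize='index'`, which divides each row of the cross-tabulation by its row total, so each row sums to 1 and the `1` column is the attrition rate of that category. A minimal sketch on invented data (the category labels and values in `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: four frequent travellers, four non-travellers
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently"] * 4 + ["Non-Travel"] * 4,
    "Attrition": [1, 0, 0, 0, 0, 0, 0, 0],
})

# Row-normalised crosstab: each row holds the shares of stayers (0) and leavers (1)
rates = pd.crosstab(toy.BusinessTravel, toy.Attrition, normalize="index")

print(rates)
```

Here one of the four frequent travellers left, so their row reads 0.75 / 0.25, while the non-travellers row reads 1.0 / 0.0.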

In [None]:
pd.crosstab(data.Department, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees in the "Sales" department are more likely (21%) to leave the company than employees of the other departments

In [None]:
pd.crosstab(data.Education, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  The least educated employees are more likely (18%) to leave the company, and the most highly qualified employees are less likely (10%) to leave.

In [None]:
pd.crosstab(data.EducationField, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees from the "Human Resources" education field are the most likely (26%) to leave the company.  Next in line are those from the "Technical Degree" field.

In [None]:
pd.crosstab(data.EnvironmentSatisfaction, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  The lower the "Environment Satisfaction", the higher the attrition rate (25%)

In [None]:
pd.crosstab(data.Gender, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Male employees have a slightly higher attrition rate (17%) than female employees.

In [None]:
pd.crosstab(data.JobInvolvement, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  The lower the "JobInvolvement", the higher the attrition rate (34%)

In [None]:
pd.crosstab(data.JobLevel, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  The lower the "JobLevel", the higher the attrition rate (26%)

In [None]:
pd.crosstab(data.JobRole, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees with the "Sales Representative" JobRole have a higher attrition rate (40%) than the others.

In [None]:
pd.crosstab(data.JobSatisfaction, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  The lower the "Job Satisfaction", the higher the attrition rate (23%)

In [None]:
pd.crosstab(data.MaritalStatus, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees who are Single have a significantly higher attrition rate (26%)

In [None]:
pd.crosstab(data.OverTime, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees who do "OverTime" have a significantly higher attrition rate (31%)

In [None]:
pd.crosstab(data.PerformanceRating, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Performance Rating has no effect on the Attrition rate.

In [None]:
pd.crosstab(data.RelationshipSatisfaction, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees with lower Relationship Satisfaction are more likely (21%) to leave the company

In [None]:
pd.crosstab(data.StockOptionLevel, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees who are NOT given "Stock Option" are more likely (24%) to leave the company

In [None]:
pd.crosstab(data.WorkLifeBalance, data.Attrition, margins=True, normalize='index').round(2).style.background_gradient(cmap='autumn_r')

**COMMENTS:**  Employees with lower "Work Life Balance" are more likely (31%) to leave the company.

In [None]:
#Attrition was already converted from Yes/No to binary 1/0 earlier in the notebook.
#Convert only if the column still holds text; re-applying the lambda to 1/0 values would set every row to 0.
if data['Attrition'].dtype == 'object':
    data['Attrition'] = data['Attrition'].map({'Yes': 1, 'No': 0})

### **Visualizing the impact of Numerical Features on the Target**

In [None]:
#Comparing the numeric fields against Attrition using boxplots
plt.figure(figsize=(24,12))
plt.subplot(231)  ; sns.boxplot(x='Attrition',y='Age',data=data)
plt.subplot(232)  ; sns.boxplot(x='Attrition',y='DailyRate',data=data)
plt.subplot(233)  ; sns.boxplot(x='Attrition',y='DistanceFromHome',data=data)
plt.subplot(234)  ; sns.boxplot(x='Attrition',y='HourlyRate',data=data)
plt.subplot(235)  ; sns.boxplot(x='Attrition',y='MonthlyIncome',data=data)
plt.subplot(236)  ; sns.boxplot(x='Attrition',y='PercentSalaryHike',data=data)

In [None]:
#Comparing the numeric fields against Attrition using boxplots
plt.figure(figsize=(24,12))
plt.subplot(231)  ; sns.boxplot(x='Attrition',y='MonthlyRate',data=data)
plt.subplot(232)  ; sns.boxplot(x='Attrition',y='NumCompaniesWorked',data=data)
plt.subplot(233)  ; sns.boxplot(x='Attrition',y='TotalWorkingYears',data=data)
plt.subplot(234)  ; sns.boxplot(x='Attrition',y='TrainingTimesLastYear',data=data)
plt.subplot(235)  ; sns.boxplot(x='Attrition',y='YearsAtCompany',data=data)
plt.subplot(236)  ; sns.boxplot(x='Attrition',y='YearsInCurrentRole',data=data)

In [None]:
#Comparing the numeric fields against Attrition using boxplots
plt.figure(figsize=(24,6))
plt.subplot(121)  ; sns.boxplot(x='Attrition',y='YearsSinceLastPromotion',data=data)
plt.subplot(122)  ; sns.boxplot(x='Attrition',y='YearsWithCurrManager',data=data)
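
The centre line of each box in the plots above is the group median, so a table of medians per Attrition group gives a compact numeric companion to the boxplots. This is a sketch under stated assumptions: `median_by_attrition` is a hypothetical helper and `toy` is invented data.

```python
import pandas as pd

def median_by_attrition(df: pd.DataFrame, columns: list, target: str = "Attrition") -> pd.DataFrame:
    """Median of each listed column within each value of the target."""
    # One row per target value (0 = stayed, 1 = left), one column per feature
    return df.groupby(target)[columns].median()

# Hypothetical toy frame (not the real dataset)
toy = pd.DataFrame({
    "Attrition": [0, 0, 0, 1, 1, 1],
    "MonthlyIncome": [5000, 6000, 7000, 2000, 3000, 4000],
    "YearsAtCompany": [8, 10, 12, 1, 2, 3],
})

medians = median_by_attrition(toy, ["MonthlyIncome", "YearsAtCompany"])
print(medians)
```

On the real data the same call could take the list of fields plotted above, for example `median_by_attrition(data, ['Age', 'MonthlyIncome', 'TotalWorkingYears'])`.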

In [None]:
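#Correlation plot to find the interrelationship of the features
#Select the numeric columns first; recent pandas versions raise an error when text columns are passed to corr()
plt.figure(figsize=(20, 20))
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True)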
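
A large annotated heatmap can be hard to read, so one option is to list only the strongly correlated feature pairs. The sketch below is illustrative: `high_corr_pairs` is a hypothetical helper, the `0.7` threshold is an arbitrary assumption, and `toy` is invented data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Feature pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any survive stacking) compare as False and are filtered out here
    strong = pairs[pairs.abs() >= threshold]
    # Order by strength, strongest first
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)

# Hypothetical toy frame (not the real dataset)
toy = pd.DataFrame({
    "JobLevel": [1, 2, 3, 4, 5],
    "MonthlyIncome": [2000, 4100, 6000, 8200, 10000],
    "HourlyRate": [50, 90, 30, 70, 60],
})

strong = high_corr_pairs(toy)
print(strong)
```

Pairs that show up here carry largely overlapping information, which is worth knowing before feeding the features to a model.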

In [None]:
#sns.pairplot(data['BusinessTravel','Gender','Attrition'], hue='Attrition')
#sns.pairplot(data, vars=["Gender", "Attrition"])