# Exploratory Data Analysis (EDA)

### Import required libraries

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

### Set the default seaborn style for our plots
Apply seaborn's default theme to all subsequent plots.

In [None]:
sns.set() 

### Load and View the Employee Attrition Data

In [None]:
df_emp = pd.read_csv('/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv') 

In [None]:
df_emp.head() 

In [None]:
df_emp.tail() 

View the list of fields with their data types

In [None]:
df_emp.info() 

In [None]:
df_emp.describe(include='all') 

In [None]:
df_emp.isna().sum()

Since a few ordinal fields store their ratings as numbers, let us first convert them to object type so that they are treated as categorical.

In [None]:
df_emp.EnvironmentSatisfaction = df_emp.EnvironmentSatisfaction.astype('object')
df_emp.JobSatisfaction = df_emp.JobSatisfaction.astype('object')
df_emp.PerformanceRating = df_emp.PerformanceRating.astype('object')
df_emp.WorkLifeBalance = df_emp.WorkLifeBalance.astype('object')
df_emp.info()
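
An alternative worth considering is an ordered categorical dtype, which keeps the rating order (1 < 2 < 3 < 4) while still treating the field as categorical. The sketch below uses a small made-up frame (`toy`) instead of `df_emp`; the column name matches the dataset, but the values are illustrative and the 1 to 4 scale is an assumption.

```python
import pandas as pd

# Toy data standing in for df_emp; values are illustrative only.
toy = pd.DataFrame({"JobSatisfaction": [1, 4, 2, 3, 4, 1]})

# Ordered categorical: categorical for grouping/plotting, but the order 1 < 2 < 3 < 4 is kept.
rating_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True)
toy["JobSatisfaction"] = toy["JobSatisfaction"].astype(rating_dtype)

print(toy["JobSatisfaction"].dtype)
print(toy["JobSatisfaction"].min(), toy["JobSatisfaction"].max())
```

If this route is taken, later calls to `select_dtypes(include=['object'])` would also need `'category'` in the include list to pick these columns up.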

### Missing Data Treatment

In [None]:
df_emp.isnull().sum()

#### One missing value each in the 'HourlyRate' and 'PerformanceRating' fields

In [None]:
df_emp.PerformanceRating.value_counts()

In [None]:
df_emp.PerformanceRating.mode()

In [None]:
df_emp.PerformanceRating.mode()[0]

In [None]:
# mode() returns a Series, so take its first element to fill with a scalar
df_emp['PerformanceRating'] = df_emp['PerformanceRating'].fillna(df_emp['PerformanceRating'].mode()[0])

In [None]:
df_emp['PerformanceRating'].isna().sum()
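
One detail is easy to miss here: `mode()` returns a Series, and `fillna` aligns a Series argument on index labels rather than broadcasting it. The sketch below shows the difference on a small made-up series (`s`), not on the real data.

```python
import numpy as np
import pandas as pd

# Toy series with a gap at index 2 (illustrative values, not the real data).
s = pd.Series([3, 3, np.nan, 4, 3])

mode_series = s.mode()        # a Series; label 0 holds the most frequent value
mode_value = mode_series[0]   # the scalar most frequent value

# Filling with the Series aligns on index labels, so only label 0 could ever be filled.
filled_with_series = s.fillna(mode_series)
# Filling with the scalar replaces every missing value.
filled_with_scalar = s.fillna(mode_value)

print(filled_with_series.isna().sum())
print(filled_with_scalar.isna().sum())
```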

In [None]:
df_emp['HourlyRate'].describe() 

In [None]:
sns.boxplot(x='HourlyRate',data=df_emp) 

No outliers are observed, and the mean (65.88) and median (66) are very close to each other, so the mean is a reasonable value to fill with.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df_emp['HourlyRate'] = df_emp['HourlyRate'].fillna(df_emp['HourlyRate'].mean())

#### Check if all values are updated and the dataset is clean to proceed

In [None]:
df_emp.info() 

The PerformanceRating field may now show as float64, because a numeric value (the mode) was assigned to the missing entry. Hence we change it back to object type.

Also EmployeeNumber is a unique identifier and hence we will not need it.

In [None]:
df_emp.PerformanceRating = df_emp.PerformanceRating.astype('object') 
df_emp.drop(['EmployeeNumber'], axis = 1, inplace=True) 

In [None]:
df_emp.shape

In [None]:
df_emp.Department.value_counts()

In [None]:
df_emp.Department.value_counts(normalize=True)

In [None]:
df_emp.Attrition.value_counts()

In [None]:
df_emp.Attrition.value_counts(normalize=True)

#### Data Summary:

•	HR Attrition dataset has 1470 observations for 25 features.

•	EmployeeNumber is an ID field which can be ignored. Apart from this there are 12 numeric and 11 categorical variables (Note: Ordinal fields like ‘PerformanceRating’, ‘WorkLifeBalance’ etc. are converted into categorical variables)

•	Data is provided for both male and female employees aged 18 to 60 years.

•	Employees come from 6 different education fields, belong to 3 departments and hold 9 different job roles.

•	961 out of 1470 employees (65%) are from ‘Research and Development’ department.

•	1233 out of 1470 employees are still with the company. Hence, we have only 237 (16%) observations corresponding to employees who have left the company. 

•	Important information on PerformanceRating, JobSatisfaction, WorkLifeBalance etc. is provided; these may be contributing factors for attrition.

•	The 2 missing values observed in the dataset have been treated by imputing the relevant statistic (mode or mean).



### Univariate Analysis

#### Numeric Fields

Get the statistics for the numeric field

In [None]:
df_emp['Age'].describe() 

Set the figure size, create subplots with 2 plots (columns) in 1 row, and draw a histogram (bins adjusted based on the distribution spread) alongside a boxplot.

In [None]:
fig_dims = (10, 5) 
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims) 
sns.histplot(x='Age', data=df_emp, kde=True, ax=axs[1])  # histplot replaces the deprecated distplot
sns.boxplot(x= 'Age', data=df_emp, ax=axs[0]) 

Employee age ranges from 18 to 60 years, with a mean of 36-37 years. The Age distribution is approximately normal, and the boxplot shows no outliers.
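
A numeric check can back up the visual impression of symmetry. The sketch below, which assumes sample skewness is an acceptable rough indicator, compares two made-up series: values near 0 suggest a symmetric distribution, while clearly positive values suggest a right skew.

```python
import pandas as pd

# Made-up samples for illustration only.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

print(symmetric.skew())      # close to 0 for a symmetric sample
print(right_skewed.skew())   # positive for a long right tail
```

On the real data the same check would be `df_emp['Age'].skew()`.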

In [None]:
df_emp['MonthlyIncome'].describe()

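A rug plot marks each individual data point along the axis, showing how the observations are distributed beneath the histogram.

In [None]:
fig_dims = (10, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
# histplot + rugplot replace the deprecated distplot(..., rug=True, kde=False)
sns.histplot(x='MonthlyIncome', data=df_emp, ax=axs[0])
sns.rugplot(x='MonthlyIncome', data=df_emp, ax=axs[0])
sns.boxplot(x='MonthlyIncome', data=df_emp, color='m', ax=axs[1])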

Monthly income ranges from 1009 to 19999. The distribution is right-skewed, with outliers beyond a value of 16581.

In [None]:
df_emp['YearsAtCompany'].describe()

Kernel Density Estimation (KDE) estimates the probability density function of a continuous random variable; kde=True overlays that estimate on the histogram.

In [None]:
fig_dims = (10, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
# histplot replaces the deprecated distplot
sns.histplot(x='YearsAtCompany', data=df_emp, kde=True, ax=axs[0])
sns.boxplot(x='YearsAtCompany', data=df_emp, ax=axs[1])

The data covers employees from recent joiners up to those who have been with the company for 40 years. The average tenure is around 7 years. The distribution is right-skewed: a few employees have been with the company for more than 18 years, which is the upper boxplot fence, since Q3 + 1.5*IQR = 9 + 1.5*(9 - 3) = 9 + 9 = 18.
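
The fence calculation above can be written as a small helper. `upper_fence` is a hypothetical name introduced here for illustration; the quartiles (Q1 = 3, Q3 = 9) are the ones quoted in the text.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper whisker limit of a standard boxplot: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)

# Quartiles of YearsAtCompany as quoted in the text (Q1 = 3, Q3 = 9).
print(upper_fence(3, 9))  # 18.0
```

On the real data, the quartiles could be taken from `df_emp['YearsAtCompany'].quantile([0.25, 0.75])` and passed to the same helper.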

In [None]:
df_emp['PercentSalaryHike'].describe()

In [None]:
fig_dims = (10, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
sns.histplot(x='PercentSalaryHike', data=df_emp, ax=axs[0])  # histplot replaces the deprecated distplot
sns.boxplot(y='PercentSalaryHike', data=df_emp, ax=axs[1])  # y= draws the box vertically, as orient='v' intended

PercentSalaryHike ranges from 11% to 25%, with an average hike of around 14-15%. The distribution is right-skewed with no outliers.

#### Categorical Fields

##### Define Function for univariate analysis of categorical variable

In [None]:
def univariateAnalysis_category(cat_column):
    print("Details of " + cat_column)
    print("----------------------------------------------------------------")
    print(df_emp[cat_column].value_counts())
    sns.countplot(x=cat_column, data=df_emp, palette='pastel')
    plt.show()
    print("       ")

Pick up all categorical fields from the data

In [None]:
df_emp_object = df_emp.select_dtypes(include = ['object']) 
lstcatcolumns = list(df_emp_object.columns.values)
lstcatcolumns

Call the defined function to print countplot for all categorical fields

In [None]:
for x in lstcatcolumns:
    univariateAnalysis_category(x)

In [None]:
g = sns.catplot(y='JobRole', kind='count', aspect=2, data=df_emp)
# aspect is the width-to-height ratio of each facet, so aspect=2 makes the plot twice as wide as it is tall

In [None]:
ax = sns.countplot(x="EducationField", data=df_emp)

ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")


•	882 of the 1470 employees are male, giving a male-to-female ratio of about 60:40.

•	The company offers business travel opportunities. 277 of the 1470 employees (19%) travel frequently.

•	Around 61% (453 + 446) of employees have given a rating of 3 or above (‘High’ or ‘Very High’) for Environment Satisfaction.

•	416 employees (28%) work overtime.

•	All employees have a PerformanceRating of 3 or above (Excellent or Outstanding).


### Bivariate Analysis

#### Numeric & Numeric

Pick up all numeric fields from dataset for analysis

In [None]:
df_emp_numeric = df_emp.select_dtypes(include = ['int64','float64'])
df_emp_numeric.shape

In [None]:
sns.pairplot(df_emp_numeric)

From the plots above, we observe a possible relationship among a few fields, such as Age, Monthly Income and Total Working Years.

In [None]:
fig_dims = (10, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
sns.scatterplot(x='Age', y='MonthlyIncome', data=df_emp,ax= axs[0])
sns.scatterplot(x='TotalWorkingYears', y='MonthlyIncome', data=df_emp,ax= axs[1])

•	The scatterplot of Age against Monthly Income shows that income tends to increase with age. However, the income of employees aged 30-40 spans a wide range, from a minimum of about 2500 to above 15000.

•	The plot of Total Working Years against Monthly Income indicates that most employees with more than 20 years of work experience have a salary above 15000.


Get correlation between numeric fields

In [None]:
corr = df_emp_numeric.corr()
round(corr,2)

Generate a mask for the upper triangle before plotting the heatmap

In [None]:
fig_dims = (10, 6)
fig, ax = plt.subplots(figsize=fig_dims)
# Boolean mask for the strict upper triangle (k=1 keeps the diagonal visible)
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(round(corr, 2), annot=True, fmt='.2f', mask=mask, ax=ax)

•	Positive correlation is observed between several numeric fields. The pairs ‘TotalWorkingYears’ & ‘MonthlyIncome’; ‘YearsInCurrentRole’ & ‘YearsAtCompany’; ‘YearsWithCurrManager’ & ‘YearsAtCompany’; and ‘YearsInCurrentRole’ & ‘YearsWithCurrManager’ each have a correlation above 0.7.
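
Rather than reading pairs off the heatmap, the strongly correlated pairs can also be listed programmatically. The sketch below uses a made-up frame with placeholder column names (`a`, `b`, `c`) and the 0.7 threshold from the note above; it is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names are placeholders, not dataset fields.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],   # nearly proportional to a
    "c": [5, 1, 4, 2, 3],    # only weakly related to a and b
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()  # (row, column) -> correlation
strong = pairs[pairs.abs() > 0.7].sort_values(ascending=False)
print(strong)
```

Applied to `df_emp_numeric.corr()`, the same steps would list the field pairs above the chosen threshold.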

#### Categorical & Categorical

In [None]:
sns.countplot(x='Attrition', hue='Gender', data=df_emp)

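•	In absolute numbers, more male than female employees have left. Since about 60% of the employees are male, this count alone does not show a higher attrition rate among men.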

In [None]:
sns.countplot(y='OverTime', hue='Attrition', data=df_emp)

•	Attrition is higher among employees who work overtime.

In [None]:
#pd.crosstab(df_emp.Attrition, df_emp.EducationField, margins=True, normalize='columns')
pd.crosstab(df_emp.Attrition, df_emp.EducationField)

•	By count, the most attrition is observed among employees whose education field is ‘Life Sciences’ or 'Medical'.

•	However, as a proportion within each education field (the column-normalized crosstab), attrition reaches 26% for the Human Resources education field.
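
The difference between counts and within-group rates can be seen with `normalize='columns'` in `pd.crosstab`. The records below are made up for illustration (the labels `HR` and `Med` are shorthand, not dataset values); they are chosen so that one group has more leavers by count while the other has the higher rate.

```python
import pandas as pd

# Made-up records for illustration; not taken from the attrition dataset.
toy = pd.DataFrame({
    "Attrition":      ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No"],
    "EducationField": ["HR", "HR", "Med", "Med", "Med", "Med", "Med", "Med", "Med", "Med"],
})

counts = pd.crosstab(toy.Attrition, toy.EducationField)
# normalize='columns' divides each column by its total, giving a rate per field.
rates = pd.crosstab(toy.Attrition, toy.EducationField, normalize="columns")

print(counts)
print(rates.round(2))
```

Here `Med` has more leavers by count, but `HR` has the higher attrition rate, which is the same pattern described in the two notes above.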

In [None]:
pd.crosstab(df_emp.Attrition, df_emp.JobSatisfaction, margins=True)

In [None]:
pd.crosstab(df_emp.Attrition, df_emp.JobSatisfaction, normalize='columns')

In [None]:
pd.crosstab(df_emp.Attrition, df_emp.JobSatisfaction, margins=True, normalize=True)

•	Most employees (61%) have given a rating of 3 or 4. Attrition is higher among employees who have given a low rating of 1 or 2.

#### Categorical & Numeric

In [None]:
fig_dims = (12, 5)
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=fig_dims)
sns.boxplot(x='Attrition', y='MonthlyIncome', data=df_emp, ax=axs[0])
sns.boxplot(x='Attrition', y='Age', data=df_emp, ax=axs[1])
sns.boxplot(x='Attrition', y='DistanceFromHome', data=df_emp, ax=axs[2])

•	The median and the range of monthly income are much lower for employees who have left.

•	Employees who leave tend to be younger, with a typical age of around 32.

•	Employees who live farther from work seem more likely to leave.
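
The boxplot readings can be cross-checked numerically with a grouped median. The frame below is made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Made-up values for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "MonthlyIncome": [2500, 6000, 3000, 7000, 5000, 3500],
    "Age": [28, 40, 31, 45, 38, 33],
})

# Median per attrition group, matching the centre line of each box.
summary = toy.groupby("Attrition")[["MonthlyIncome", "Age"]].median()
print(summary)
```

On the real data, the same call on `df_emp` with `['MonthlyIncome', 'Age', 'DistanceFromHome']` would give the medians behind the three boxplots.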

### Multivariate Analysis

In [None]:
fig_dims = (12, 6)
fig, ax = plt.subplots(figsize=fig_dims)  # subplots returns a (figure, axes) pair
sns.scatterplot(x='NumCompaniesWorked', y='TotalWorkingYears', hue='Attrition', data=df_emp, ax=ax)

Attrition is higher among employees who have worked at many companies (3 or more) and have fewer than 20 years of working experience.


In [None]:
fig_dims = (12, 6)
fig, ax = plt.subplots(figsize=fig_dims)  # subplots returns a (figure, axes) pair
sns.boxplot(x='Attrition', y='YearsAtCompany', hue='BusinessTravel', data=df_emp, ax=ax)

•	Employees who left had been with the company for around 2.5 years on average.

### FacetGrid

In [None]:
g = sns.FacetGrid(df_emp, col="JobRole", hue='Attrition', col_wrap=3, height=3)
g = g.map(plt.scatter, "YearsSinceLastPromotion", 'YearsAtCompany')
g.add_legend()  # show which colour corresponds to each Attrition value

•	More attrition is observed for the Sales Executive job role. Attrition has occurred among employees who have been with the company for more than 5-6 years and have not been promoted for more than 2-3 years.

### Summary
•	The company has a fair spread of employees across age groups (18 through 60) and gender (60:40 ratio of male to female).

•	Fields like Monthly Income have outliers, with a maximum value of 19999. We need to check whether outlier treatment will be required while building predictive models.

•	More than 60% of employees seem to be satisfied with their job and working environment.

•	The PerformanceRating parameter has only 2 levels, corresponding to Excellent and Outstanding. This observation should be shared with the HR department.

•	16% of the observations correspond to attrition.

•	Attrition is observed more in male employees and also in employees who work overtime.

•	Distance from home also seems to be a factor in attrition.

•	Higher attrition is observed among Sales Executives, and a delay in promotion seems to be a reason for it.