Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

Education 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor'

EnvironmentSatisfaction 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

JobInvolvement 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

JobSatisfaction 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

PerformanceRating 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding'

RelationshipSatisfaction 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

WorkLifeBalance 1 'Bad' 2 'Good' 3 'Better' 4 'Best'

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
%precision %.2f

In [None]:
import plotly.graph_objs as go

In [None]:
df=pd.read_csv('../input/WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [None]:
df.sample(4)

Questions we could ask ourselves:
Columns and observations: How many columns and observations are there in our dataset?
Missing data: Is there any missing data in our dataset?
Data types: Which data types are we dealing with in this dataset?
Distribution of our data: Is it right-skewed, left-skewed or symmetric? This matters if we run statistical tests or build a model.
Structure of our data: Some datasets are complex to work with; in R the tidyverse package helps with that, and in this notebook pandas plays the same role.
Meaning of our data: What does our data mean? Several features in this dataset are ordinal variables, which are like categorical variables except that the order of the levels matters. Many of them range from 1-4 or 1-5, and the lower the value, the worse the rating. For instance, for Job Satisfaction 1 = "Low" while 4 = "Very High".
Label: What is the label (in other words, the output) in our dataset?
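
Most of these questions can be answered with a few pandas calls. The sketch below uses a small made-up frame (the values are hypothetical, only the column names mirror the real dataset) to show the kind of checks meant here.

```python
import pandas as pd

# Hypothetical toy frame standing in for the HR dataset (values are made up).
toy = pd.DataFrame({
    "Age": [25, 32, 41, 29, 58],
    "JobSatisfaction": [1, 4, 3, 2, 4],              # ordinal, 1 = 'Low' ... 4 = 'Very High'
    "Attrition": ["Yes", "No", "No", "Yes", "No"],   # the label
})

n_rows, n_cols = toy.shape                 # observations and columns
missing_per_column = toy.isna().sum()      # missing values per column
dtypes = toy.dtypes                        # data type of each column
age_skew = toy["Age"].skew()               # > 0 right-skewed, < 0 left-skewed

print(n_rows, n_cols)
print(missing_per_column)
print(dtypes)
print("Age skewness:", age_skew)
```

On the real data the same calls would be applied to df instead of toy.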

In [None]:
df.info()

Summary:
Dataset structure: 1470 observations (rows), 35 features (variables).
Missing data: Luckily for us, there is no missing data, which makes the dataset easier to work with.
Data types: We only have two data types in this dataset: object (strings, i.e. categorical factors) and int64 (integers).
Label: Attrition is the label in our dataset, and we would like to find out why employees are leaving the organization.
Imbalanced dataset: 1233 employees (84% of cases) did not leave the organization while 237 (16% of cases) did, so the dataset is imbalanced: far more people stay than leave.

In [None]:
df.describe(include='all').fillna(' ')

In [None]:
print('Attrition count')
print(df.Attrition.value_counts())
print('Attrition in percentage')
print(df.Attrition.value_counts()*100/len(df))
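
As a side note, value_counts can return the shares directly through its normalize argument. A minimal sketch, assuming a hypothetical series with the same class sizes as this dataset (1233 'No', 237 'Yes'):

```python
import pandas as pd

# Hypothetical series with the same class sizes as the dataset: 1233 'No', 237 'Yes'.
attrition = pd.Series(["No"] * 1233 + ["Yes"] * 237, name="Attrition")

# normalize=True returns fractions instead of counts; scale to percent and round.
share = attrition.value_counts(normalize=True).mul(100).round(1)
print(share)
```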

In [None]:
# plotly.plotly moved to the separate chart_studio package in plotly 4 and is not used below, so it is not imported
import cufflinks as cf
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')


In [None]:
df.Attrition.value_counts().iplot(kind='bar',title='Bar graph of attrition count in both category')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(221)
sns.distplot(df.Age[df.Gender=='Male'],label='Male avg age=36.65',norm_hist=True,color='c',hist=False)
plt.ylabel('Density')
plt.axvline(df.Age[df.Gender=='Male'].mean(),linestyle='dashed',color='c',linewidth=1)
sns.distplot(df.Age[df.Gender=='Female'],label='Female avg age=37.32',norm_hist=True,color='k',hist=False)
plt.axvline(df.Age[df.Gender=='Female'].mean(),linestyle='dashed',color='k',linewidth=2)
plt.subplot(222)
sns.distplot(df.Age,norm_hist=True,label='Age',hist=False,color='b')
plt.axvline(df.Age.mean())

Now let us look at the age distribution in both attrition categories

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(221)
sns.distplot(df.Age[df.Attrition=='Yes'],label='attrition',norm_hist=True,color='c')
#plt.ylabel('Density')
plt.axvline(df.Age[df.Attrition=='Yes'].mean(),linestyle='dashed',color='c',linewidth=1)
plt.subplot(222)
sns.distplot(df.Age[df.Attrition=='No'],label='Age dist of non attrition',norm_hist=True,color='k')
plt.axvline(df.Age[df.Attrition=='No'].mean(),linestyle='dashed',color='k',linewidth=2)
plt.subplot(223)
sns.distplot(df.Age,norm_hist=True,label='Age',hist=False,color='b')
plt.axvline(df.Age.mean())

The first and second plots show that the average age of employees who stayed is higher than the average age of those who left.
Most of the people who left the company are between 25 and 35 years old. The right tail thins out as age increases, so older employees are less likely to change company.

Now I want to see department attrition overall then with respect to male and female

In [None]:
print('Employee dept wise')
print(df.Department.value_counts())
print('....Attrition department wise....')
print(df.Department[df.Attrition=='Yes'].value_counts())
sns.countplot(df.Department[df.Attrition=='Yes'])

Let's calculate the attrition rate for each department

In [None]:
print('Attrition rate for the R & D dept =',100*133/961)
print('Attrition rate for the Sales dept =',100*92/446)
print('Attrition rate for the HR dept =',100*12/63)

You can see that the attrition rate is highest, and approximately equal, in the Sales and HR departments compared to R&D
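
The same rates can be computed without typing the counts by hand. The sketch below rebuilds a toy frame from the counts printed above (961/133, 446/92, 63/12) and lets groupby do the division; the frame itself is an illustration, not the real data.

```python
import pandas as pd

# (employees, leavers) per department, copied from the hard-coded numbers above.
counts = {
    "Research & Development": (961, 133),
    "Sales": (446, 92),
    "Human Resources": (63, 12),
}

rows = []
for dept, (total, left) in counts.items():
    rows += [{"Department": dept, "Attrition": "Yes"}] * left
    rows += [{"Department": dept, "Attrition": "No"}] * (total - left)
toy = pd.DataFrame(rows)

# The mean of a boolean column is the share of True values, i.e. the attrition rate.
rate = (toy["Attrition"].eq("Yes")
        .groupby(toy["Department"])
        .mean()
        .mul(100)
        .round(2)
        .sort_values(ascending=False))
print(rate)
```

On the real frame the equivalent expression would be df.Attrition.eq('Yes').groupby(df.Department).mean().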

In [None]:
#Let's check the summary of daily rate and monthly rate in each attrition category
print('--------Daily and Monthly rate of the Attrition category--------')
print(df[df.Attrition=='Yes'].describe()[['DailyRate','MonthlyRate']])
print('--------Daily and Monthly rate of the Non-Attrition category--------')
print(df[df.Attrition=='No'].describe()[['DailyRate','MonthlyRate']])
#Lets check the distribution
plt.figure(figsize=(20,10))
plt.subplot(121)
sns.distplot(df.DailyRate[df.Attrition=='Yes'],label='Attrition',norm_hist=True,color='c',hist=False)
sns.distplot(df.DailyRate[df.Attrition=='No'],label='No Attrition',norm_hist=True,color='b',hist=False)
plt.title('Distribution of Daily rate')
plt.subplot(122)
sns.distplot(df.MonthlyRate[df.Attrition=='Yes'],label='Attrition',norm_hist=True,color='c',hist=False)
sns.distplot(df.MonthlyRate[df.Attrition=='No'],label='No Attrition',norm_hist=True,color='b',hist=False)
plt.title('Distribution of monthly rate')

We can't conclude much from the distribution of daily rate, but the monthly rate distribution of the attrition group is shifted slightly to the right of the non-attrition group, which suggests that a higher monthly rate is associated with attrition

In [None]:
#Now try to analyze the business travel impact on the attrition rate
sns.countplot(x='BusinessTravel',hue='Attrition',data=df)

The data is not balanced: most employees are in the 'Travel_Rarely' category, so the raw attrition count is also highest there. Percentages within each category are more informative

In [None]:
print('Percentage of attrition in Travel_Rarely',
      100*len(df[(df.BusinessTravel=='Travel_Rarely') & (df.Attrition=='Yes')])
      /len(df[df.BusinessTravel=='Travel_Rarely']))

print('Percentage of attrition in Travel_Frequently',
      100*len(df[(df.BusinessTravel=='Travel_Frequently') & (df.Attrition=='Yes')])
      /len(df[df.BusinessTravel=='Travel_Frequently']))

print('Percentage of attrition in Non-Travel',
      100*len(df[(df.BusinessTravel=='Non-Travel') & (df.Attrition=='Yes')])
      /len(df[df.BusinessTravel=='Non-Travel']))

So the calculations above show that employees who travel frequently have the highest percentage of attrition
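
A row-normalised cross-tabulation gives all of these percentages in one table. This is only a sketch on hypothetical toy data; the category names match the real BusinessTravel column but the rows are made up.

```python
import pandas as pd

# Hypothetical toy data; only the category names match the real column.
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely"] * 4 + ["Travel_Frequently"] * 2 + ["Non-Travel"] * 2,
    "Attrition":      ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
})

# normalize='index' turns each row of the table into shares that sum to 1.
pct = pd.crosstab(toy["BusinessTravel"], toy["Attrition"], normalize="index") * 100
print(pct)
```

The 'Yes' column of the table is the attrition percentage within each travel category.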

In [None]:
#Now plot the distribution of distance from home

print('--------Summary of Distance from home in the Attrition category--------')
print(df[df.Attrition=='Yes'].describe()[['DistanceFromHome']])
print('--------Summary of Distance from home in the Non-Attrition category--------')
print(df[df.Attrition=='No'].describe()[['DistanceFromHome']])
#Lets check the distribution
plt.figure(figsize=(7,7))
sns.distplot(df.DistanceFromHome[df.Attrition=='Yes'],label='Attrition',norm_hist=True,color='c',hist=False)
sns.distplot(df.DistanceFromHome[df.Attrition=='No'],label='No Attrition',norm_hist=True,color='b',hist=False)


As you can see above, the chance of leaving the organization increases with distance from home. The attrition density exceeds the non-attrition density once distance from home is greater than about 10

In [None]:
dict(zip([1,2,3,4,5],['Below_College','College','Bachelor','Master','Doctor']))

In [None]:
def education(value):
    edu=dict(zip([1,2,3,4,5],['Below_College','College','Bachelor','Master','Doctor']))
    if value in edu:
        #print(value,edu[value])
        return edu[value]

In [None]:
#list(map(lambda x:education(x),list(df.Education)))
#df.Education.apply(lambda x:education(x))
df['Education_labels']=list(map(lambda x:education(x),list(df.Education)))

In [None]:
pd.concat([df.Education.apply(lambda x:education(x)),df.EducationField,df.Attrition],axis=1,ignore_index=True).head(2)
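
The helper function and the list(map(...)) call above can also be written with Series.map and a plain dictionary. The sketch below uses a few hypothetical Education codes and additionally wraps the labels in an ordered categorical, which is an optional extra and not something the notebook relies on.

```python
import pandas as pd

education_labels = {1: "Below_College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"}

codes = pd.Series([3, 1, 5, 2, 4, 3], name="Education")  # hypothetical sample codes
labels = codes.map(education_labels)                     # dictionary lookup per value

# An ordered categorical keeps the ranking, so min/max and sorting follow the scale.
ordered = pd.Categorical(labels, categories=list(education_labels.values()), ordered=True)
print(labels.tolist())
print(ordered.min(), ordered.max())
```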

In [None]:
#Education 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor'
#Let us find out the percentage of attrition at each education level

for i in set(list(df.Education_labels)):
    print('Percentage Attrition in {0} degree = {1} %'.format(i,100*len(df[(df.Education_labels==i) & (df.Attrition=='Yes')])/
                                                           len(df[(df.Education_labels==i)])))

#See the distribution of attrition across the education field and education_labels(Master,bachelors
#degree etc)
g=sns.catplot(x='EducationField',hue='Attrition',col='Education_labels',data=df,
            kind='count',height=4,aspect=0.9,col_wrap=3,sharex =True)
g.set_xticklabels(rotation=45)

In [None]:
for i in set(list(df.EducationField)):
    print('Percentage Attrition in {0} field = {1} %'.format(i,100*len(df[(df.EducationField==i) & (df.Attrition=='Yes')])/
                                                           len(df[(df.EducationField==i)])))


This analysis shows a clear pattern: the attrition percentage is ordered Below_College > College > Bachelor > Master > Doctor. So in general we can say that people with lower qualifications tend to leave the organization.
You can also compare the attrition rate across education fields: the highest rates occur in the Technical Degree and Human Resources fields


Let us analyze the EnvironmentSatisfaction 
1 'Low' 2 'Medium' 3 'High' 4 'Very High'

In [None]:
def EnvironmentSatisfaction(value):
    env=dict(zip([1,2,3,4],['Low','Medium','High','Very_High']))
    if value in env:
        return env[value]

In [None]:
df['EnvSatisfaction_labels']=list(map(lambda x:EnvironmentSatisfaction(x),list(df.EnvironmentSatisfaction)))

In [None]:
for i in set(list(df.EnvSatisfaction_labels)):
    print('Percentage Attrition in {0} env satisfaction = {1} %'.format(i,100*len(df[(df.EnvSatisfaction_labels==i) & (df.Attrition=='Yes')])/
                                                           len(df[(df.EnvSatisfaction_labels==i)])))
plt.figure(figsize=(10,5))
plt.subplot(121)
sns.countplot(x='EnvSatisfaction_labels', hue='Attrition',data=df)
#plt.subplot(122)
sns.catplot(x='EnvSatisfaction_labels', hue='Gender',data=df,col='Attrition',kind='count')

It is clear that attrition is highest among employees with low environment satisfaction


Let us analyze the hourly rate against attrition

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(df.HourlyRate[df.Attrition=='Yes'],label='Attrition',color='c',norm_hist=True,hist=False)
sns.distplot(df.HourlyRate[df.Attrition=='No'],label='No Attrition',color='black',hist=False)

We cannot conclude much from the hourly rate. As the black (no attrition) curve shows, many employees with a higher hourly rate stay with the company

In [None]:
#Job role wise 

jr_A=dict(df.JobRole[df.Attrition=='Yes'].value_counts())
plt.figure(figsize=(20,10))
plt.subplot(121)
plt.pie(jr_A.values(),labels=jr_A.keys(),autopct='%1.1f')
plt.title('Job role under attrition')

plt.subplot(122)
ms=dict(df.MaritalStatus[df.Attrition=='Yes'].value_counts())
plt.pie(ms.values(),labels=ms.keys(),autopct='%1.1f')
plt.title('MaritalStatus under attrition')


You can see that single employees left the company more often than married ones, and divorced employees were the least likely to leave

Let us plot the histogram of the monthly income in both the categories

In [None]:
print('Average monthly income of the people who left the company in dollars = ',df.MonthlyIncome[df.Attrition=='Yes'].mean())
print('Median income of the people who left the company in dollars = ',df.MonthlyIncome[df.Attrition=='Yes'].median())
print('Average monthly income of the people who did not leave the company in dollars = ',df.MonthlyIncome[df.Attrition=='No'].mean())
print('Median income of the people who did not leave the company in dollars = ',df.MonthlyIncome[df.Attrition=='No'].median())
#The gap is measured against the income of those who stayed, so that it reads as "percent below"
print('Average income of the people who left the company is {0} percent below that of those who did not leave'.
      format(100*(df.MonthlyIncome[df.Attrition=='No'].mean()-df.MonthlyIncome[df.Attrition=='Yes'].mean())/df.MonthlyIncome[df.Attrition=='No'].mean()))
#g=sns.FacetGrid(col='Attrition',data=df,height=6,aspect=0.9)
#g.map(plt.hist,'MonthlyIncome')
plt.figure(figsize=(20,4))
plt.subplot(131)
sns.countplot(df.NumCompaniesWorked)
plt.title('Employee count with respect to no. of companies worked')

plt.subplot(132)
sns.countplot(df.NumCompaniesWorked[df.Attrition=='Yes'])
plt.title('Attrition')

plt.subplot(133)
sns.countplot(df.NumCompaniesWorked[df.Attrition=='No'])
plt.title('Non-Attrition')

In the non-attrition data we can see that as the number of companies worked increases, employees tend not to switch

Now let us analyze the impact of overtime on attrition

In [None]:
plt.figure(figsize=(13,5))
plt.subplot(131)
sns.countplot(df.OverTime)
plt.title('Overall')
plt.subplot(132)
sns.countplot(df.OverTime[df.Attrition=='Yes'])
plt.title('Attrition')
plt.subplot(133)
sns.countplot(df.OverTime[df.Attrition=='No'])
plt.title('Non-Attrition')

In the overall dataset, more people do not work overtime than do. One interesting thing in the attrition group is that people left the organization irrespective of overtime, as you can see in the middle graph

Let us check the impact of percentage salary hike

In [None]:

plt.figure(figsize=(20,5))
plt.subplot(121)
sns.violinplot(x='PercentSalaryHike',y='Attrition',data=df,hue='Gender',scale='count',inner='quartile',split=True)
plt.subplot(122)
sns.violinplot(x='PercentSalaryHike',y='Attrition',data=df,hue='PerformanceRating',inner='quartile',split=True)

The distribution of percentage salary hike is approximately the same for male and female employees. Most attrition occurred in the 11-15% bucket.
As the percentage salary hike increases, attrition decreases.
The second plot shows that for rating 3 (orange), the average percentage salary hike is lower in the attrition group than in the non-attrition group.
In the rating 4 group, however, people tend to leave even when they get a high salary hike. The distribution of percent salary hike is right-skewed in the non-attrition category, while in the attrition category it is approximately normal (blue color)

Let's analyze working experience:


In [None]:
#g=sns.FacetGrid(col='Department',data=df,height=6,aspect=0.9)
#g.map(plt.hist,'TotalWorkingYears',normed=1)
plt.figure(figsize=(15,7))
sns.violinplot(x='TotalWorkingYears',y='Attrition',hue='Department',data=df,inner='quartile',palette='Set2')

In the HR department, people with less experience tend to leave the organization. In the sales department, people may also leave after 10 years of work experience: the upper quartile (Q3) for both HR and R&D is less than 10, while Q3 for sales is greater than 10



In [None]:
#t=dict(df.groupby(['Attrition','TrainingTimesLastYear'])['TrainingTimesLastYear'].count())
#t.key(('No',0))
t_a=dict(df.TrainingTimesLastYear[df.Attrition=='Yes'].value_counts())
t_a.keys()

In [None]:
plt.figure(figsize=(15,7))
plt.subplot(121)
t_a=dict(df.TrainingTimesLastYear[df.Attrition=='Yes'].value_counts())
plt.pie(t_a.values(),labels=t_a.keys(),autopct='%1.1f%%',colors = ['silver', 'yellowgreen', 'lightcoral', 'lightskyblue'])
plt.title('Attrition')
plt.xlabel('TrainingTimesLastYear')
plt.subplot(122)
t_a=dict(df.TrainingTimesLastYear[df.Attrition=='No'].value_counts())
#sns.countplot(df.TrainingTimesLastYear[df.Attrition=='No'],normed=1)
plt.pie(t_a.values(),labels=t_a.keys(),autopct='%1.1f%%')
plt.title('No Attrition')
plt.xlabel('TrainingTimesLastYear')

The figure shows that the chance of attrition is higher when the number of trainings last year is 0, 2 or 4.

Let's analyze the data in WorkLifeBalance 1 'Bad' 2 'Good' 3 'Better' 4 'Best'

In [None]:
dict_WorkLifeBalance={ 1:'Bad', 2:'Good', 3:'Better' ,4:'Best'}
dict1=dict(df.WorkLifeBalance[df.Attrition=='Yes'].value_counts().sort_index())
dict3=dict((dict_WorkLifeBalance[key],val*100/len(df.Attrition[df.Attrition=='Yes'])) for key,val in dict1.items())
plt.figure(figsize=(15,4))
plt.subplot(121)
sns.barplot(x=list(dict3.keys()),y=list(dict3.values()))
plt.xlabel('WorkLifeBalance')
plt.ylabel('% of people who attrited')
plt.subplot(122)
dict1=dict(df.WorkLifeBalance[df.Attrition=='No'].value_counts().sort_index())
dict3=dict((dict_WorkLifeBalance[key],val*100/len(df.Attrition[df.Attrition=='No'])) for key,val in dict1.items())
sns.barplot(x=list(dict3.keys()),y=list(dict3.values()))
plt.xlabel('WorkLifeBalance')
plt.ylabel('% of people who did not attrite')

You can see that the attrition rate is higher for employees with a bad work-life balance

I have divided the independent variables into four groups:

1. Not useful: EmployeeCount, EmployeeNumber, Over18, StandardHours (EmployeeCount, Over18 and StandardHours have no variability, and EmployeeNumber is only an identifier)

2. Continuous variables:
Age, DailyRate, DistanceFromHome, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager

3. Nominal variables:
BusinessTravel, EducationField, Gender, JobRole, MaritalStatus, OverTime

4. Ordinal variables:
EnvironmentSatisfaction, JobInvolvement, JobLevel, JobSatisfaction, PerformanceRating, RelationshipSatisfaction, WorkLifeBalance, Education

We also need to encode the nominal variables:
BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime


In [None]:
#Drop the addition variables created in exploratory analysis
df.drop(['Education_labels','EnvSatisfaction_labels'],axis=1,inplace=True)
#Lets Drop the not useful predictors
df.drop(['EmployeeCount','EmployeeNumber','Over18','StandardHours'],axis=1,inplace=True)
print('Unique values in Department  ',np.unique(df.Department))
print('Unique values in EducationField  ',np.unique(df.EducationField))
print('Unique values in Gender  ',np.unique(df.Gender))
print('Unique values in JobRole  ',np.unique(df.JobRole))
print('Unique values in MaritalStatus  ',np.unique(df.MaritalStatus))
print('Unique values in OverTime  ',np.unique(df.OverTime))


In [None]:
#Let's use one-hot encoding to transform the categorical features
from sklearn.preprocessing import OneHotEncoder
oneh_enc=OneHotEncoder()
oneh_features=oneh_enc.fit_transform(df[['BusinessTravel','EducationField','Gender','JobRole','MaritalStatus','OverTime']])

In [None]:
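#get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead
oneh_features=pd.DataFrame(oneh_features.toarray(),columns=oneh_enc.get_feature_names())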
oneh_features.sample(3)

In [None]:
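#Drop one level of each two-level variable (OverTime and Gender); the other level carries the same information
oneh_features.drop(['x5_No','x2_Female'],axis=1,inplace=True)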
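
The cell above removes one level of the two-level variables by hand. As an alternative that the notebook does not use, pandas can do the encoding and the dropping in one step with get_dummies and drop_first. A small sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical toy frame with two binary categorical columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["No", "Yes", "No"],
})

# drop_first=True removes the first (alphabetically smallest) level of every column,
# here 'Female' and 'No', which mirrors the manual drop above.
dummies = pd.get_dummies(toy, columns=["Gender", "OverTime"], drop_first=True)
print(dummies.columns.tolist())
```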

In [None]:
l_ind=[]
#Lets perform chi square test of independence
from sklearn.feature_selection import chi2
chisq,pval=chi2(oneh_features,df[['Attrition']])
for i in pval:
    if i<0.05:
        l_ind.append('Y')
    else:
        l_ind.append('N')


In [None]:
data={'chisq':chisq,'pval':pval,'ind':l_ind}
data=pd.DataFrame(data=data,index=[i for i in list(oneh_enc.get_feature_names()) if i not in ['x5_No','x2_Female']])

In [None]:
data.sort_values(by=['chisq','ind'],ascending=False)

In [None]:
#Create a df having only the significant levels
oneh_features=oneh_features[list(data[data.ind=='Y'].index)]

As you can see, at the 5% significance level (95% confidence) the levels with indicator "Y" are related to the response.

**Below is an example showing how to calculate the chi-square statistic and p-value**

In [None]:
#Observed frequency table
Observed=pd.crosstab(df.Attrition,df.BusinessTravel,margins=True)
Observed.index=['No', 'Yes', 'row_total']
Observed.columns=['Non-Travel', 'Travel_Frequently', 'Travel_Rarely', 'col_total']
Observed

In [None]:
#Calculate the expected frequency table: row total * column total / grand total
Expected=pd.DataFrame(np.outer(Observed.iloc[0:2,3],Observed.iloc[2,0:3])/Observed.iloc[2,3])
Expected.index=['No', 'Yes']
Expected.columns=['Non-Travel', 'Travel_Frequently', 'Travel_Rarely']

In [None]:
Expected

In [None]:
#pd.options.display.float_format = '{:,.2f}'.format

from scipy.stats import chi2_contingency
Observed=pd.crosstab(df.Attrition,df.BusinessTravel)
chi2_contingency(observed=Observed)# Displays chi2, p, dof, ex
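
To make the link between the expected table and the test statistic explicit, the sketch below computes the chi-square statistic by hand and compares it with chi2_contingency. The 2x3 table of counts is hypothetical, not the real Attrition by BusinessTravel table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts (rows: stayed / left, columns: three travel categories).
observed = np.array([[30, 40, 130],
                     [10, 25,  15]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2_scipy, p_value, dof_scipy, expected_scipy = chi2_contingency(observed)
print(chi2_manual, chi2_scipy, p_value, dof)
```

The two statistics agree here because the continuity correction in chi2_contingency only applies to tables with one degree of freedom.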

In [None]:
#Collect the chi2 statistic and p-value of each categorical predictor in a dataframe
#As you can see, Gender is independent of attrition, hence we will drop it
l_chisq=[]
l_pval=[]
independent_status=[]
for col in ['BusinessTravel','EducationField','Gender','JobRole','MaritalStatus','OverTime']:
    
    Observed=pd.crosstab(df.Attrition,df[col])
    #Named chi2_stat so that the sklearn chi2 function imported earlier is not overwritten
    chi2_stat, p, dof, ex=chi2_contingency(observed=Observed)
    chi2_stat=round(chi2_stat,4)
    p=round(p,3)
    l_chisq.append(chi2_stat)
    l_pval.append(p)
    if p<0.05:
        independent_status.append('Y')
    else:
        independent_status.append('N')
chisq_dict={'chisq':l_chisq,'pval':l_pval,'indicator':independent_status}
pd.DataFrame(data=chisq_dict,index=['BusinessTravel','EducationField','Gender','JobRole','MaritalStatus','OverTime'])

In [None]:
#Now lets use another feature selection methods for categorical data
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold
#Lets use variance threshold method

from sklearn.feature_selection import VarianceThreshold
selector=VarianceThreshold(threshold=0.2)
selector.fit_transform(oneh_features)


In [None]:
selector.get_support(indices=True)

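Only the job roles Healthcare Representative, Human Resources (x3_Human Resources), Laboratory Technician and Manager have variance > 0.2. Note that the indices returned by get_support refer to the columns of the filtered oneh_features, so they should be mapped back through oneh_features.columns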
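
For a one-hot column the variance is p*(1-p), where p is the share of ones, so a threshold of 0.2 only keeps levels that cover roughly 28% to 72% of the rows. The sketch below checks this on hypothetical columns using plain pandas with the population variance (ddof=0), which is the variance VarianceThreshold works with.

```python
import pandas as pd

# Hypothetical one-hot columns with different shares of ones.
onehot = pd.DataFrame({
    "half":    [1, 0] * 10,            # p = 0.50
    "quarter": [1, 0, 0, 0] * 5,       # p = 0.25
    "rare":    [1] + [0] * 19,         # p = 0.05
})

p = onehot.mean()                       # share of ones per column
variance = onehot.var(ddof=0)           # population variance of each column
kept = variance[variance > 0.2].index.tolist()

print(pd.DataFrame({"p": p, "p*(1-p)": p * (1 - p), "variance": variance}))
print(kept)
```

With a fitted selector, oneh_features.columns[selector.get_support()] would list the retained column names directly.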

In [None]:
#Using mutual information to select the categorical features
from sklearn.feature_selection import mutual_info_classif
mi=mutual_info_classif(oneh_features,df[['Attrition']])
data={'Features':list(oneh_features.columns),'MI val':list(mi)}
print('Chi square columns',oneh_features.columns)
pd.DataFrame(data=data).sort_values(by='MI val',ascending=False)


**Now working on continuous data:**
We have the following features in continuous form:
Age, DailyRate, DistanceFromHome, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike.

**I will treat the variables below as discrete:**
NumCompaniesWorked, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager

Let us check the normality of the data

In [None]:
#['NumCompaniesWorked','StockOptionLevel',
#                             'TrainingTimesLastYear','TotalWorkingYears','YearsAtCompany',
#                             'YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']
sns.distplot(df.Age)

In [None]:
#from statsmodels.graphics.gofplots import qqplot
#qqplot(df.Age)
#df['TotalWorkingYears'].quantile([0, .25, .5, .75, 1.])

In [None]:
#Let us use the Anderson-Darling test
#H0: the sample has a Gaussian distribution.
#H1: the sample does not have a Gaussian distribution.
#Tests whether a data sample has a Gaussian distribution.
#Assumptions
#Observations in each sample are independent and identically distributed (iid).

from scipy.stats import anderson

for col in ['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','PercentSalaryHike']:
    result=anderson(df[col])
    print('Statistics for the {0} variable'.format(col),result)


https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html
If the returned statistic is larger than these critical values then for the corresponding significance level, the null hypothesis that the data come from the chosen distribution can be rejected. The returned statistic is referred to as ‘A2’ in the references.

Here the statistic is larger than all of the critical values, so we reject the null hypothesis in favour of the alternative: none of the continuous variables is normally distributed
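
The comparison described above can also be done in code rather than by reading the printed result. This is a minimal sketch on a synthetic, deliberately skewed sample (not a column of the HR data), using the statistic, critical_values and significance_level fields of the result.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)   # synthetic, clearly non-normal sample

result = anderson(skewed, dist="norm")

# significance_level is given in percent; pick the critical value for 5%.
idx = int(np.argmin(np.abs(result.significance_level - 5.0)))
critical_5pct = result.critical_values[idx]

# Normality is rejected at 5% when the statistic exceeds the critical value.
reject_normality = result.statistic > critical_5pct
print(result.statistic, critical_5pct, reject_normality)
```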

**Shapiro-Wilk Test**
The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

Assumptions

Observations in each sample are independent and identically distributed (iid).

Interpretation

H0: the sample has a Gaussian distribution.
H1: the sample does not have a Gaussian distribution.

In [None]:
from scipy.stats import shapiro
for col in ['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','PercentSalaryHike']:
    result=shapiro(df[col])
    print('Statistics for the {0} variable statistics= {1} and P-val= {2}'.format(col,result[0],result[1]))

The result can be read as (statistic, p-value). All the p-values are very small, so we can reject the null hypothesis of normality for every variable

**Perform the Kolmogorov-Smirnov test for goodness of fit.**
http://lagrange.univ-lyon1.fr/docs/scipy/0.17.1/generated/scipy.stats.kstest.html

This performs a test of the distribution G(x) of an observed random variable against a given distribution F(x). Under the null hypothesis the two distributions are identical, G(x)=F(x). The alternative hypothesis can be either ‘two-sided’ (default), ‘less’ or ‘greater’. The KS test is only valid for continuous distributions.

In the one-sided test, the alternative is that the empirical cumulative distribution function of the random variable is “less” or “greater” than the cumulative distribution function F(x) of the hypothesis, G(x)<=F(x), resp. G(x)>=F(x).

In [None]:
from scipy.stats import kstest
np.random.seed(987654321)
for col in ['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','PercentSalaryHike']:
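    #cdf='norm' alone means the standard normal N(0,1), so pass the sample mean and std as its parameters
    #(estimating them from the same data makes the p-value approximate)
    result=kstest(np.array(df[col]),cdf='norm',args=(df[col].mean(),df[col].std()))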
    print('Statistics for the {0} variable statistics= {1} and P-val= {2}'.format(col,result[0],result[1]))

All the p-values are very small, which suggests that none of the above variables follows a normal distribution
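
One caveat when reading KS p-values: kstest with cdf='norm' and no further arguments compares the data with the standard normal N(0,1), so the location and scale of the data matter. The sketch below illustrates this on a synthetic sample that really is normal (mean 37, std 9, chosen arbitrarily); passing the sample mean and standard deviation through args is one common workaround, although the resulting p-value is then only approximate.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
ages = rng.normal(loc=37, scale=9, size=1000)    # synthetic, genuinely normal sample

# Compared with N(0, 1): rejected even though the sample is normal.
p_raw = kstest(ages, "norm").pvalue

# Compared with N(sample mean, sample std): no evidence against normality.
p_fitted = kstest(ages, "norm", args=(ages.mean(), ages.std())).pvalue
print(p_raw, p_fitted)
```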

https://machinelearningmastery.com/nonparametric-statistical-significance-tests-in-python

https://statistics.laerd.com/spss-tutorials/mann-whitney-u-test-using-spss-statistics.php

https://statistics.laerd.com/spss-tutorials/kruskal-wallis-h-test-using-spss-statistics.php

In [None]:
from scipy.stats import mannwhitneyu
#Under the null hypothesis H0, the distributions of both populations are equal.
#With alternative='less', H1 is that values in the attrition group tend to be smaller than in the non-attrition group.
#mannwhitneyu(np.array(pd.get_dummies(df[['Attrition']],drop_first=True)).ravel(),np.array(df[['Age']]).ravel())
for col in ['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','PercentSalaryHike']:
    
    stat,p=mannwhitneyu(df[col][df.Attrition=='Yes'],df[col][df.Attrition=='No'],alternative='less')
    print('----------------------------{0}----------------------------'.format(col))
    print('Statistics=%.3f,p=%.4f'%(stat,p))
    alpha=0.05
    if p>alpha:
        print('Same dist of feature {0} on both the categories, fail to reject null hypothesis'.format(col))
    else:
        print('Accept the alternate hypothesis i.e. values in attrition population of {0} is less than non attrition population of {1}'.format(col,col))

From the Mann-Whitney U test we learn that Age, DailyRate and MonthlyIncome are significant variables, because their distributions differ between the two categories

We can also use another test, the Kruskal-Wallis H test
https://machinelearningmastery.com/nonparametric-statistical-significance-tests-in-python/

scipy.stats.kruskal(*args, **kwargs)
Compute the Kruskal-Wallis H-test for independent samples

The Kruskal-Wallis H-test tests the null hypothesis that the population median of all of the groups are equal. It is a non-parametric version of ANOVA. The test works on 2 or more independent samples, which may have different sizes. Note that rejecting the null hypothesis does not indicate which of the groups differs. Post-hoc comparisons between groups are required to determine which groups are different.

Fail to Reject H0: All sample distributions are equal.
Reject H0: One or more sample distributions are not equal.



In [None]:
from scipy.stats import kruskal
for col in ['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','PercentSalaryHike']:
    
    #kruskal has no 'alternative' argument: the Kruskal-Wallis test is not one-sided
    stat,p=kruskal(df[col][df.Attrition=='Yes'],df[col][df.Attrition=='No'])
    print('----------------------------{0}----------------------------'.format(col))
    print('Statistics=%.3f,p=%.4f'%(stat,p))
    alpha=0.05
    if p>alpha:
        print('Same dist of feature {0} on both the categories, fail to reject null hypothesis'.format(col))
    else:
        print('Reject the null hypothesis, i.e. the distribution of {0} in the attrition population differs from that in the non-attrition population'.format(col))

From the two tests above you can see that Age, DailyRate and MonthlyIncome are strong predictors. DistanceFromHome is significant in the Kruskal-Wallis test but not in the Mann-Whitney test. Its Mann-Whitney p-value is 0.9988, but that test was one-sided (alternative='less'): a p-value this close to 1 means the difference runs in the opposite direction, i.e. employees who left tend to live farther from work, which the two-sided Kruskal-Wallis test picks up. We treat it as a weak/medium predictor and include it in our model; later on we will drop it and compare the model performance

'HourlyRate','MonthlyRate','PercentSalaryHike' are not significant at the 5% significance level (95% confidence)
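
The behaviour of the one-sided Mann-Whitney p-value for DistanceFromHome can be reproduced on a toy example. In the sketch below the two samples are hypothetical, with the 'left' group shifted towards larger values; the 'less' alternative then gives a p-value close to 1, while the 'greater' and two-sided alternatives are clearly significant.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical samples: 'left' is shifted towards larger values than 'stayed'.
stayed = np.arange(20, dtype=float)
left = stayed + 10.5

p_less = mannwhitneyu(left, stayed, alternative="less").pvalue
p_greater = mannwhitneyu(left, stayed, alternative="greater").pvalue
p_two_sided = mannwhitneyu(left, stayed, alternative="two-sided").pvalue
print(p_less, p_greater, p_two_sided)
```

So a very large p-value under alternative='less' is a hint to rerun the test with alternative='greater' or 'two-sided' before discarding the variable.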