# Understanding the Data

Importing necessary modules

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mp

Importing the data

In [None]:
# general_data
emp_gen = pd.read_csv('../input/hr-analytics-case-study/general_data.csv')
# Employee Survey data
emp_sur = pd.read_csv('../input/hr-analytics-case-study/employee_survey_data.csv')
# Manager Survey Data
emp_man = pd.read_csv('../input/hr-analytics-case-study/manager_survey_data.csv')


Merging the datasets

In [None]:
# Merging datasets for general data and Employee Survey
emp1 = pd.merge(emp_gen,emp_sur,on=['EmployeeID'],how='inner')
# Merging the resultant dataset with Manager survey data
emp = pd.merge(emp1,emp_man,on=['EmployeeID'],how='inner')
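
An inner merge silently drops any EmployeeID that is missing from one of the files. The following is a minimal sketch of how the merge could be checked, using small made-up tables rather than the real CSV files; the names `gen`, `sur`, `man`, `step1`, `check` and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three source tables (values are made up for illustration).
gen = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [31, 45, 28]})
sur = pd.DataFrame({"EmployeeID": [1, 2, 3], "JobSatisfaction": [4, 2, 3]})
man = pd.DataFrame({"EmployeeID": [1, 2, 4], "PerformanceRating": [3, 4, 3]})

# validate="one_to_one" raises an error if EmployeeID is duplicated on either side.
step1 = pd.merge(gen, sur, on="EmployeeID", how="inner", validate="one_to_one")

# An outer merge with indicator=True shows which IDs an inner merge would drop.
check = pd.merge(step1, man, on="EmployeeID", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "EmployeeID"].tolist()
print(unmatched)
```

If `unmatched` is empty for the real data, the inner merges above lose no employees.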


Getting column Names

In [None]:
emp.columns

Understanding sample data

In [None]:
emp.head(10)

Rows and columns in dataset

In [None]:
emp.shape

**Splitting the attributes into categorical and non-categorical data**

**Categorical features :**

Defining a function for bar chart plots of categorical values

In [None]:
def plot_bar(x):
    # value_counts() already returns a Series of employee counts per category
    emp[x].value_counts().plot(kind='bar')
    plt.title('Plot for '+x+' employee counts')
    plt.xlabel(x)
    plt.ylabel('Employee_counts')

In [None]:
plot_bar('BusinessTravel')

In [None]:
plot_bar('Department')

In [None]:
plot_bar('EducationField')

In [None]:
plot_bar('Gender')

In [None]:
plot_bar('JobRole')

In [None]:
plot_bar('MaritalStatus')

In [None]:
plot_bar('JobLevel')

In [None]:
plot_bar('Education')

In [None]:
plot_bar('StockOptionLevel')

In [None]:
plot_bar('EnvironmentSatisfaction')

In [None]:
plot_bar('JobSatisfaction')

In [None]:
plot_bar('WorkLifeBalance')

In [None]:
plot_bar('JobInvolvement')

In [None]:
plot_bar('PerformanceRating')

Non-categorical features :

In [None]:
emp['Age'].value_counts()

In [None]:
emp['DistanceFromHome'].value_counts()

In [None]:
emp['MonthlyIncome'].value_counts()

In [None]:
emp['NumCompaniesWorked'].value_counts()

In [None]:
emp['PercentSalaryHike'].value_counts()

In [None]:
emp['TotalWorkingYears'].value_counts()

In [None]:
emp['TrainingTimesLastYear'].value_counts()

In [None]:
emp['YearsAtCompany'].value_counts()

In [None]:
emp['YearsSinceLastPromotion'].value_counts()

In [None]:
emp['YearsWithCurrManager'].value_counts()

Features with only 1 distinct value :

In [None]:
emp['Over18'].value_counts()

In [None]:
emp['StandardHours'].value_counts()

In [None]:
emp['EmployeeCount'].value_counts()

Features where every value is unique :

In [None]:
len(emp['EmployeeID'].unique())
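
The constant and all-unique columns found by hand above could also be located in one pass. This is a small sketch on a made-up DataFrame (`df`, `constant_cols` and `id_like_cols` are hypothetical names), assuming that a column is "constant" when it has one distinct value and "ID-like" when it has as many distinct values as there are rows.

```python
import pandas as pd

# Made-up sample with one ID column, two constant columns and one ordinary column.
df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [8, 8, 8, 8],
    "Age": [31, 45, 31, 28],
})

# Number of distinct values per column.
n_unique = df.nunique()

# Columns with a single distinct value carry no information for a model.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(df)].index.tolist()

print(constant_cols, id_like_cols)
```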

# Data Cleansing

Checking if any of the attributes have Null values

In [None]:
emp.isna().any()

The attributes below have Null values :
	NumCompaniesWorked, TotalWorkingYears, EnvironmentSatisfaction, JobSatisfaction, WorkLifeBalance


Replacing the Null values with the rounded mean of each column :

In [None]:
pd.set_option('mode.chained_assignment',None)

# Fill missing values with the column mean rounded to the nearest integer.
# Assigning the result back to the column avoids chained indexing, which can
# end up writing to a temporary copy instead of emp.
null_cols = ['NumCompaniesWorked', 'TotalWorkingYears', 'EnvironmentSatisfaction',
             'JobSatisfaction', 'WorkLifeBalance']
for col in null_cols:
    emp[col] = emp[col].fillna(round(emp[col].mean()))
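
For the survey scores, which are ordinal levels, the rounded mean and the median can give different fill values. The sketch below compares the two on a made-up series; it is an alternative shown for illustration, not the method used in this notebook, and `scores`, `mean_fill` and `median_fill` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up satisfaction scores with two missing entries.
scores = pd.Series([1, 3, 4, 4, 4, np.nan, np.nan], name="JobSatisfaction")

# How many values need to be filled.
n_missing = int(scores.isna().sum())

# Fill value used in this notebook: the mean rounded to the nearest integer.
mean_fill = round(scores.mean())

# Alternative fill value: the median, which is less affected by a skewed distribution.
median_fill = scores.median()

filled_mean = scores.fillna(mean_fill)
filled_median = scores.fillna(median_fill)
print(n_missing, mean_fill, median_fill)
```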

Checking the non-categorical data for outliers using box plots

Function which creates boxplots

In [None]:
def box_plot(x):
    f1, p1 = plt.subplots()
    p1.set_title(x)
    p1.boxplot(emp[x])

In [None]:
box_plot('Age')

In [None]:
box_plot('DistanceFromHome')

In [None]:
box_plot('MonthlyIncome')

In [None]:
box_plot('NumCompaniesWorked')

In [None]:
box_plot('PercentSalaryHike')

In [None]:
box_plot('TotalWorkingYears')

In [None]:
box_plot('TrainingTimesLastYear')

In [None]:
box_plot('YearsAtCompany')

In [None]:
box_plot('YearsSinceLastPromotion')

In [None]:
box_plot('YearsWithCurrManager')

There do not seem to be any serious outliers, as every box is of clearly visible height, i.e. no box is squashed flat by extreme values.
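
The visual check can be complemented with a count of the points that fall outside the box plot whiskers. The sketch below assumes the usual 1.5 x IQR rule (the default whisker length in matplotlib box plots) and runs on made-up numbers; `iqr_outlier_count` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values lying more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up tenure values with one extreme entry.
years = pd.Series([1, 2, 2, 3, 3, 4, 5, 5, 6, 40])
print(iqr_outlier_count(years))

# A series without extreme values yields no flagged points.
print(iqr_outlier_count(pd.Series([1, 2, 3, 4, 5])))
```

Applied to the real columns, a call such as `iqr_outlier_count(emp['MonthlyIncome'])` would give a number to set beside each box plot.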

# Data Modelling 

Going forward we will use the emp_ohe DataFrame for modelling. The emp DataFrame will be kept to refer to the actual data after data cleansing.

One-hot encoding the categorical data, dropping the first level of each attribute


In [None]:
cat_cols = ["BusinessTravel", "Department", "EducationField", "Gender", "JobRole",
            "MaritalStatus", "JobLevel", "Education", "StockOptionLevel",
            "EnvironmentSatisfaction", "JobSatisfaction", "WorkLifeBalance",
            "JobInvolvement", "PerformanceRating"]
# Prefix is the column name plus ':' (e.g. 'Gender:_Male'); dtype=int keeps the dummies numeric
emp_ohe = pd.get_dummies(emp, columns=cat_cols, prefix=[c + ":" for c in cat_cols],
                         drop_first=True, dtype=int)

Understanding One hot encoded columns

In [None]:
emp_ohe.columns

Label encoding for Attrition

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder_y = LabelEncoder()
emp_ohe['Attrition'] = label_encoder_y.fit_transform(emp_ohe['Attrition'])

The 4 features below are dropped from modelling because they have either only 1 distinct value or all-unique values

In [None]:
emp_ohe.drop(['Over18', 'EmployeeID', 'StandardHours', 'EmployeeCount'], axis=1, inplace=True)

Finding multicollinear attributes

Defining the function to find the VIF for all features

In [None]:
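import statsmodels.formula.api as smf

def vif_cal(input_data, dependent_col):
    # VIF of a feature = 1 / (1 - R^2), where R^2 comes from regressing
    # that feature on all the other features
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        # y and x are not columns of x_vars, so the formula picks them up from local scope
        rsq = smf.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        # Report only features above the VIF threshold of 5
        if vif > 5:
            print(xvar_names[i], " VIF = ", vif)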
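
To make the VIF definition concrete, here is a minimal sketch that computes VIF = 1 / (1 - R²) directly with NumPy least squares on synthetic data. It is an illustration of the formula, not a replacement for the function above; `vif_table` and the columns `a`, `b`, `c`, `d` are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """Return the VIF of every column, computed as 1 / (1 - R^2)."""
    X = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1 - ss_res / ss_tot
        vifs[name] = np.inf if r_squared >= 1 else 1 / (1 - r_squared)
    return pd.Series(vifs).sort_values(ascending=False)

# Synthetic data: c is almost a + b, d is independent of the rest.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + 0.01 * rng.normal(size=200),
    "d": rng.normal(size=200),
})
vifs = vif_table(demo)
print(vifs)
```

Columns `a`, `b` and `c` explain each other almost perfectly and so get very large VIFs, while `d` stays close to 1.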


Calculating VIF values using that function

In [None]:
vif_cal(input_data=emp_ohe, dependent_col="Attrition")

Dropping the attribute 'EducationField:_Life Sciences', which has the highest VIF, and calculating the VIFs again

In [None]:
emp_ohe.drop(['EducationField:_Life Sciences'],axis =1,inplace=True)
vif_cal(input_data=emp_ohe, dependent_col="Attrition")

Dropping the attribute 'Department:_Sales', which has the highest VIF, and calculating the VIFs again


In [None]:
emp_ohe.drop(['Department:_Sales'],axis =1,inplace=True)
vif_cal(input_data=emp_ohe, dependent_col="Attrition")

Now all the features have VIFs < 5, so the multicollinear features have been removed.

Defining the features used to fit the model

In [None]:
y = emp_ohe['Attrition']
X = emp_ohe[
["Age"]+ ["DistanceFromHome"]+ ["MonthlyIncome"]+["NumCompaniesWorked"]+ ["PercentSalaryHike"]+ 
["TotalWorkingYears"]+["TrainingTimesLastYear"]+ ["YearsAtCompany"]+ ["YearsSinceLastPromotion"]+
["YearsWithCurrManager"]+ ["BusinessTravel:_Travel_Frequently"]+["BusinessTravel:_Travel_Rarely"]+ 
["Department:_Research & Development"]+["EducationField:_Marketing"]+ ["EducationField:_Medical"]+
["EducationField:_Other"]+ ["EducationField:_Technical Degree"]+["Gender:_Male"]+ 
["JobRole:_Human Resources"]+["JobRole:_Laboratory Technician"]+ ["JobRole:_Manager"]+
["JobRole:_Manufacturing Director"]+ ["JobRole:_Research Director"]+["JobRole:_Research Scientist"]+ 
["JobRole:_Sales Executive"]+["JobRole:_Sales Representative"]+ ["MaritalStatus:_Married"]+
["MaritalStatus:_Single"]+ ["JobLevel:_2"]+ ["JobLevel:_3"]+ ["JobLevel:_4"]+["JobLevel:_5"]+ 
["Education:_2"]+ ["Education:_3"]+ ["Education:_4"]+["Education:_5"]+ ["StockOptionLevel:_1"]+ 
["StockOptionLevel:_2"]+["StockOptionLevel:_3"]+ ["EnvironmentSatisfaction:_2.0"]+["EnvironmentSatisfaction:_3.0"]+ 
["EnvironmentSatisfaction:_4.0"]+["JobSatisfaction:_2.0"]+ ["JobSatisfaction:_3.0"]+ ["JobSatisfaction:_4.0"]+["WorkLifeBalance:_2.0"]+ 
["WorkLifeBalance:_3.0"]+ ["WorkLifeBalance:_4.0"]+["JobInvolvement:_2"]+ ["JobInvolvement:_3"]+ ["JobInvolvement:_4"]+["PerformanceRating:_4"]]

Fitting the model and viewing the summary

In [None]:
import statsmodels.api as sm
m1 = sm.Logit(y, X)
# Fit once and reuse the result instead of fitting the model twice
result = m1.fit()
result.summary()

Removing features with a p-value > 0.05 and getting the new summary

In [None]:
X = emp_ohe[
["Age"]+ ["NumCompaniesWorked"]+ ["PercentSalaryHike"]+ 
["TotalWorkingYears"]+["TrainingTimesLastYear"]+ ["YearsAtCompany"]+ ["YearsSinceLastPromotion"]+
["YearsWithCurrManager"]+ ["BusinessTravel:_Travel_Frequently"]+["BusinessTravel:_Travel_Rarely"]+ 
["EducationField:_Marketing"]+ ["EducationField:_Other"]+ ["EducationField:_Technical Degree"]+
["JobRole:_Laboratory Technician"]+ ["JobRole:_Research Director"]+["JobRole:_Research Scientist"]+ 
["JobRole:_Sales Executive"]+ ["MaritalStatus:_Married"]+["MaritalStatus:_Single"]+ ["StockOptionLevel:_1"]+ 
["EnvironmentSatisfaction:_2.0"]+["EnvironmentSatisfaction:_3.0"]+ ["EnvironmentSatisfaction:_4.0"]+["JobSatisfaction:_2.0"]+ 
["JobSatisfaction:_3.0"]+ ["JobSatisfaction:_4.0"]+["WorkLifeBalance:_2.0"]+ ["WorkLifeBalance:_3.0"]+ ["WorkLifeBalance:_4.0"]]

# Model fit with new set of features
m1 = sm.Logit(y, X)
# Fit once and reuse the result instead of fitting the model twice
result = m1.fit()
result.summary()


Repeating the process and getting the new summary

In [None]:
X = emp_ohe[
["Age"]+ ["NumCompaniesWorked"]+ ["PercentSalaryHike"]+ 
["TotalWorkingYears"]+["TrainingTimesLastYear"]+ ["YearsSinceLastPromotion"]+
["YearsWithCurrManager"]+ ["BusinessTravel:_Travel_Frequently"]+["BusinessTravel:_Travel_Rarely"]+ 
["EducationField:_Other"]+ ["JobRole:_Laboratory Technician"]+ 
["JobRole:_Research Director"]+["JobRole:_Research Scientist"]+ 
["JobRole:_Sales Executive"]+ ["MaritalStatus:_Single"]+ 
["EnvironmentSatisfaction:_2.0"]+["EnvironmentSatisfaction:_3.0"]+ ["EnvironmentSatisfaction:_4.0"]+["JobSatisfaction:_2.0"]+ ["JobSatisfaction:_3.0"]+ ["JobSatisfaction:_4.0"]+["WorkLifeBalance:_2.0"]+ ["WorkLifeBalance:_3.0"]+ ["WorkLifeBalance:_4.0"]]

# Model fit with new set of features
m1 = sm.Logit(y, X)
# Fit once and reuse the result instead of fitting the model twice
result = m1.fit()
result.summary()

Repeating the process and getting the new summary

In [None]:
X = emp_ohe[
["Age"]+ ["NumCompaniesWorked"]+ ["PercentSalaryHike"]+ 
["TotalWorkingYears"]+["TrainingTimesLastYear"]+ ["YearsSinceLastPromotion"]+
["YearsWithCurrManager"]+ ["BusinessTravel:_Travel_Frequently"]+["BusinessTravel:_Travel_Rarely"]+ 
["JobRole:_Laboratory Technician"]+ ["JobRole:_Research Director"]+["JobRole:_Research Scientist"]+ 
["JobRole:_Sales Executive"]+ ["MaritalStatus:_Single"]+ ["EnvironmentSatisfaction:_2.0"]+["EnvironmentSatisfaction:_3.0"]+ ["EnvironmentSatisfaction:_4.0"]+["JobSatisfaction:_2.0"]+ ["JobSatisfaction:_3.0"]+ ["JobSatisfaction:_4.0"]+
["WorkLifeBalance:_2.0"]+ ["WorkLifeBalance:_3.0"]+ ["WorkLifeBalance:_4.0"]]

# Model fit with new set of features
m1 = sm.Logit(y, X)
# Fit once and reuse the result instead of fitting the model twice
result = m1.fit()
result.summary()

With all p-values < 0.05 and the multicollinear features removed, the model looks reasonable now.
However, there are still a lot of predictor attributes. We can reduce them using a Chi-square test and keeping the highest-scoring ones. I have redefined X with the 7 selected features.

In [None]:
X = emp_ohe[["MaritalStatus:_Single"]+["JobSatisfaction:_4.0"]+ ["BusinessTravel:_Travel_Frequently"]+ ["YearsSinceLastPromotion"]+ ["EnvironmentSatisfaction:_4.0"]+["YearsWithCurrManager"]+["WorkLifeBalance:_3.0"]]
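
The notebook does not show how the Chi-square ranking was run. One possible way to do it, assuming scikit-learn's `SelectKBest` with the `chi2` score function, is sketched below on synthetic binary features; the data and the names `features`, `target` and `selected` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic data: "related" agrees with the target about 90% of the time,
# the two noise columns are independent of it.
rng = np.random.default_rng(0)
n = 400
target = rng.integers(0, 2, size=n)
features = pd.DataFrame({
    "related": np.where(rng.random(n) < 0.9, target, 1 - target),
    "noise_a": rng.integers(0, 2, size=n),
    "noise_b": rng.integers(0, 2, size=n),
})

# chi2 requires non-negative feature values, which holds for 0/1 dummies.
selector = SelectKBest(score_func=chi2, k=1).fit(features, target)
selected = features.columns[selector.get_support()].tolist()
print(selected)
```

With the real data, `k=7` on the features that survived the p-value filtering would be the analogous call.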

Importing the necessary modules from scikit-learn to split the data, fit the model and find the accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

Splitting data into Train and Test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Defining the function to fit and get the accuracy of Logistic model

In [None]:
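def get_accuracy(x_val, y_val, i):
    # Fits a new model on the data passed in and scores it on that same data
    logistic = LogisticRegression(solver='lbfgs', max_iter=i)
    logistic.fit(x_val, y_val)
    predict1 = logistic.predict(x_val)
    cm = confusion_matrix(y_val, predict1)
    # Accuracy = correctly classified (diagonal of the confusion matrix) / total
    total = cm.sum()
    accuracy = (cm[0, 0] + cm[1, 1]) / total
    print(accuracy)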

Getting accuracy for the training data :

In [None]:
get_accuracy(X_train,y_train,1000)

Accuracy is close to 83.9%. Since accuracy depends on the class balance, it is best read against the share of employees who did not leave.

Getting accuracy for test data :

In [None]:
get_accuracy(X_test,y_test,1000)

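Accuracy on the test data is 84.4%, slightly higher than on the training data. Note that get_accuracy fits a new model on whatever data it is given, so this figure comes from a model fitted on the test split itself rather than from the model trained on the training split.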
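
A held-out estimate would fit the model on the training split only and then score it on the test split. Below is a minimal sketch of that, run on synthetic data and also reporting the majority-class baseline for comparison; `evaluate_holdout` and the `_demo` variables are hypothetical, and the numbers it prints say nothing about the HR data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_holdout(X_train, X_test, y_train, y_test, max_iter=1000):
    """Fit on the training split only, then score both splits."""
    model = LogisticRegression(solver="lbfgs", max_iter=max_iter)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # Accuracy of always predicting the most common class in the test split.
    positive_share = float(np.mean(y_test))
    baseline = max(positive_share, 1 - positive_share)
    return train_acc, test_acc, baseline

# Synthetic, imbalanced data: the positive class is the minority.
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(1000, 3))
prob = 1 / (1 + np.exp(-(X_demo[:, 0] - 2.0)))
y_demo = (rng.random(1000) < prob).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
train_acc, test_acc, baseline = evaluate_holdout(X_tr, X_te, y_tr, y_te)
print(train_acc, test_acc, baseline)
```

A test accuracy that is not clearly above `baseline` suggests the model adds little over always predicting "No".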

# Impact of features on Attrition

Now we need to understand how these 7 final features affect attrition.
To do that we need to look at the actual features rather than the one-hot encoded attributes, so we will check the plots for each one of them. For this we go back to the original emp DataFrame.


Plotting BusinessTravel vs count of employee attrition :

In [None]:
pd.crosstab(emp.BusinessTravel,emp.Attrition).plot(kind='bar')
plt.title('Plotting BusinessTravel vs Attrition')
plt.xlabel('BusinessTravel')
plt.ylabel('Count of  Attrition')

We see that a bar graph of counts alone cannot tell us which category actually affects attrition.
E.g. the graph above shows that Travel_Rarely has high attrition, but the number of employees who travel rarely is also higher.
Instead we need to plot the percentage of attrition for each BusinessTravel type, which gives a clearer idea.

Below is the function that plots the percentage of attrition for each category of a feature

In [None]:
def percent_plot(x):
    # Copy the two columns so the mapping below does not touch emp
    temp = emp[['Attrition', x]].copy()
    temp['Attrition'] = temp['Attrition'].map({'Yes': 1, 'No': 0})
    # Leavers per category and total employees per category
    grouped = temp.groupby(x).sum()
    grouped['Total'] = temp.groupby(x)['Attrition'].count()
    grouped['Percentage_Attrition'] = grouped['Attrition'] * 100 / grouped['Total']
    grouped['Percentage_Attrition'].plot()
    title = 'Plotting ' + x + ' vs % of Attrition'
    plt.title(title)
    plt.xlabel(x)
    plt.ylabel('Percentage of  Attrition')
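
The same percentages can be obtained without an intermediate mapping step. The sketch below uses `pd.crosstab` with `normalize='index'` on a made-up table in which the rarely-travelling group has more leavers in absolute terms but a lower attrition rate; `toy`, `counts` and `rates` are hypothetical names.

```python
import pandas as pd

# Made-up data: 20 rare travellers (4 left) and 6 frequent travellers (3 left).
toy = pd.DataFrame({
    "BusinessTravel": ["Rarely"] * 20 + ["Frequently"] * 6,
    "Attrition": ["Yes"] * 4 + ["No"] * 16 + ["Yes"] * 3 + ["No"] * 3,
})

# Raw counts: the larger group has more leavers.
counts = pd.crosstab(toy["BusinessTravel"], toy["Attrition"])

# Row-normalised shares: each row sums to 1, so "Yes" is the attrition rate.
rates = pd.crosstab(toy["BusinessTravel"], toy["Attrition"], normalize="index")["Yes"] * 100
print(counts)
print(rates)
```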

Now analysing again how BusinessTravel affects attrition

In [None]:
percent_plot('BusinessTravel')

We see that the % of attrition for frequent business travel is higher than for the others, at around 25%.

Plotting for MaritalStatus :

In [None]:
percent_plot('MaritalStatus')

It's clear that a single person has a higher chance of leaving the company.

Plotting for EnvironmentSatisfaction :

In [None]:
percent_plot('EnvironmentSatisfaction')

In the model we had picked EnvironmentSatisfaction 4.0 as one of the predictors. That is because employees with an EnvironmentSatisfaction of 4 are much less likely to leave the organization than the others, while employees with an EnvironmentSatisfaction of 1 are more likely to leave.

Plotting for JobSatisfaction :

In [None]:
percent_plot('JobSatisfaction')

The same holds for JobSatisfaction. Employees with a JobSatisfaction of 1 are more prone to leave the organization.

Plotting for YearsSinceLastPromotion :

In [None]:
percent_plot('YearsSinceLastPromotion')

This plot is not as clear as the others. Some years, such as 8 and 12, are missing from the given data, hence the sudden dips to 0. But we can state that up to 5 years since the last promotion the chance of leaving the organization is lower; beyond that it is much higher.

Plotting for YearsWithCurrManager :

In [None]:
percent_plot('YearsWithCurrManager')

The plot shows an increase for employees who have been with their current manager for 14 years.
This looks suspect, because if we exclude year = 14 the other years show a gradual decrease.
Understanding the data using value counts.


In [None]:
emp['YearsWithCurrManager'].value_counts()

As we see, there is very little data for the later years (after year 15-17). Hence those years can be ignored, and we can conclude that the more years an employee spends with the same manager, the lower the chance of attrition.
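
Reading the attrition rate together with the group size makes such thin groups easy to spot. This is a small sketch on made-up data; the cut-off `MIN_GROUP_SIZE` is a hypothetical value chosen only for the example.

```python
import pandas as pd

# Made-up data: two well-populated groups and one group with only 2 employees.
toy = pd.DataFrame({
    "YearsWithCurrManager": [0] * 10 + [2] * 10 + [14] * 2,
    "Attrition": ["Yes"] * 3 + ["No"] * 7 + ["Yes"] * 2 + ["No"] * 8 + ["Yes", "No"],
})

# Attrition rate and number of employees per group.
toy["left"] = toy["Attrition"].eq("Yes")
summary = toy.groupby("YearsWithCurrManager")["left"].agg(["mean", "size"])
summary["rate"] = summary["mean"] * 100

# Keep only groups large enough for the rate to be meaningful (assumed threshold).
MIN_GROUP_SIZE = 5
reliable = summary[summary["size"] >= MIN_GROUP_SIZE]
print(summary)
print(reliable)
```

Here the group with 14 years shows a 50% rate, but it rests on two employees and is filtered out.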

Plotting for WorkLifeBalance :

In [None]:
percent_plot('WorkLifeBalance')

The lower the work-life balance, the higher the chance of leaving.

# Conclusion

**The organization can look at these 7 factors to understand whether an employee is prone to leave :**

1 . **BusinessTravel**              : Employees who travel frequently are at higher risk of attrition.

2 . **MaritalStatus**               : Employees who are single are at higher risk of attrition.

3 . **EnvironmentSatisfaction**     : The lower the EnvironmentSatisfaction, the higher the risk of attrition.

4 . **JobSatisfaction**             : The lower the JobSatisfaction, the higher the risk of attrition.

5 . **YearsSinceLastPromotion**     : Employees who have not been promoted in the last 5 years have a higher risk of attrition.

6 . **YearsWithCurrManager**        : The more years an employee spends with the same manager, the lower the risk of attrition.

7 . **WorkLifeBalance**             : The lower the WorkLifeBalance, the higher the risk of attrition.


The organization can identify employees based on the above factors and decide on organizational changes that improve the working environment and hence minimize the attrition rate.