## Problem Statement

#### We need to analyze the provided employee_attrition.csv dataset. It contains a variety of information about the employees, such as demographics and time on the job, and records whether each employee stays with or leaves the company (the 'Attrition' attribute, 1 (Yes) / 0 (No)).

##### R Shiny App: https://sankalpsingh.shinyapps.io/HW01/

# Import Libraries

In [None]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
from pandas import DataFrame, Series
import seaborn as sns
import apyori as ap  # apyori's Apriori implementation, called as ap.apriori below
import mlxtend as ml
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

# Reading the Dataset

In [None]:
df=pd.read_csv('employee_attrition.csv')

In [None]:
df.shape

In [None]:
df.head()

# Data Pre-processing & EDA
#### I have followed the CRISP-DM methodology, which starts with data understanding and then moves on to data cleaning and pre-processing. I have addressed the following data issues here:
1) Find missing values and impute them accordingly

2) Remove irrelevant columns 

3) Handling outliers

4) Removing highly correlated columns

5) Discretization of continuous attributes

##### 1) Find missing values and impute them accordingly

In [None]:
df.isnull().sum()

##### As Gender and OverTime are categorical nominal attributes, we replace their missing values with the mode.

In [None]:
sns.countplot(x="Gender", data=df).set_title('Distribution based on Gender for our dataset')

In [None]:
df['Gender'] = df['Gender'].fillna('Male')

In [None]:
df['OverTime'] = df['OverTime'].fillna('No')
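The mode can also be computed rather than typed in by hand. The following is a minimal sketch on a toy frame; the values are hypothetical and are not taken from employee_attrition.csv.

```python
import pandas as pd

# Toy frame with hypothetical values (assumption: not the real dataset).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male", "Male"],
    "OverTime": ["No", None, "No", "Yes", "No"],
})

for col in ["Gender", "OverTime"]:
    # mode() ignores missing values by default; take the first mode if there are ties
    mode_value = toy[col].mode(dropna=True)[0]
    toy[col] = toy[col].fillna(mode_value)

print(toy)
```

Assigning the result back to the column avoids relying on `inplace=True` on a selected column.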

##### Below is a correlation matrix plot - this will help us impute missing values and would also help us later to remove highly correlated columns

In [None]:
plt.figure(figsize=(20,20))
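plt.title("Correlation Matrix");
# numeric_only=True restricts the correlation to numeric columns (required by recent pandas versions)
sns.heatmap(df.corr(numeric_only=True),annot=True);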

###### DistanceFromHome: We use the median value because the data is skewed. We can safely do this because the number of missing values is small, so it won't affect the performance of our model.

In [None]:
df['DistanceFromHome'] = df['DistanceFromHome'].fillna(df['DistanceFromHome'].median())
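To illustrate why the median is preferred for skewed data, here is a small sketch on hypothetical distances (the numbers are made up for the example). A single large value pulls the mean upward, while the median stays close to the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed distances with two missing entries.
dist = pd.Series([1, 2, 2, 3, 4, 5, 29, np.nan, np.nan], dtype="float64")

skewness = dist.skew()        # a positive value indicates a right skew
mean_value = dist.mean()      # pulled upward by the large value 29
median_value = dist.median()  # robust to the large value

filled = dist.fillna(median_value)
print(skewness, mean_value, median_value)
```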

##### We will replace the missing values in JobLevel and PercentSalaryHike with the median, as the plots below show that the data is right-skewed.

In [None]:
sns.distplot(df['JobLevel'],kde = False).set_title('Distribution for JobLevel')

In [None]:
df['JobLevel'] = df['JobLevel'].fillna(df['JobLevel'].median())

In [None]:
sns.distplot(df['PercentSalaryHike'],kde = False).set_title('Distribution for PercentSalaryHike')

In [None]:
df['PercentSalaryHike'] = df['PercentSalaryHike'].fillna(df['PercentSalaryHike'].median())

##### We will replace the missing values in RelationshipSatisfaction with the mean (rounded to the closest integer), as the plot below shows that the data is left-skewed.

In [None]:
sns.distplot(df['RelationshipSatisfaction'],kde = False).set_title('Distribution for RelationshipSatisfaction')

In [None]:
df['RelationshipSatisfaction'].mean()

In [None]:
df['RelationshipSatisfaction'] = df['RelationshipSatisfaction'].fillna(3)

##### As the PerformanceRating attribute contains just 2 unique values, we can simply replace the missing values with the mode, i.e. 3.

In [None]:
sns.distplot(df['PerformanceRating'],kde = False).set_title('Distribution for PerformanceRating')

In [None]:
df['PerformanceRating'] = df['PerformanceRating'].fillna(3)

##### We will replace the missing values in TotalWorkingYears and YearsSinceLastPromotion with the median, as the plots below show that the data is right-skewed.

In [None]:
sns.distplot(df['TotalWorkingYears'],kde = False).set_title('Distribution for TotalWorkingYears')

In [None]:
df['TotalWorkingYears'] = df['TotalWorkingYears'].fillna(df['TotalWorkingYears'].median())

In [None]:
sns.distplot(df['YearsSinceLastPromotion'],kde = False).set_title('Distribution for YearsSinceLastPromotion')

In [None]:
df['YearsSinceLastPromotion'] = df['YearsSinceLastPromotion'].fillna(df['YearsSinceLastPromotion'].median())

In [None]:
print("Total number of missing values in the dataframe after cleaning:",df.isna().sum().sum()) #We check if there are any missing values in our dataframe.

##### 2) Removing irrelevant columns - We remove the columns below because each of them contains either a single unique value or a different value for every row. Such attributes won't help us improve our model's performance.

In [None]:
del df['EmployeeCount']
del df['EmployeeNumber']
del df['StandardHours']
del df['Over18']
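Columns of this kind can be found programmatically with `nunique()`. This is a minimal sketch on a toy frame with hypothetical values; note that the "all unique" test would also flag a continuous column whose values happen to be distinct, so the result should be reviewed before dropping anything.

```python
import pandas as pd

# Toy frame with hypothetical values (assumption: not the real dataset).
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],            # constant column
    "EmployeeNumber": [101, 102, 103, 104],   # one distinct value per row
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

reduced = toy.drop(columns=constant_cols + id_like_cols)
print(constant_cols, id_like_cols, reduced.columns.tolist())
```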

#### 3) Handling outliers

##### From the plot below, we can see that we have an outlier in the DistanceFromHome attribute. It is possible that the person works remotely, which would explain the large distance from home. We will categorize this data point into high_DistanceFromHome later while binning.

In [None]:
sns.boxplot(x="DistanceFromHome", data=df)
plt.title("Box plot DistanceFromHome")

##### From the plot below, we can see that we have an outlier in the TotalWorkingYears attribute. The outlier value seems incorrect, as the total working years is more than 100. We replace this data point with the value 14.

In [None]:
sns.boxplot(y="TotalWorkingYears", data=df)
plt.title('Box Plot TotalWorkingYears')

In [None]:
# use .loc to avoid chained assignment, which may silently modify a copy
df.loc[143, 'TotalWorkingYears'] = 14
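A box plot marks points that lie more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied in code to locate the offending row. The sketch below uses hypothetical values, including an implausible 114 standing in for the outlier.

```python
import pandas as pd

# Hypothetical working-year values; 114 plays the role of the implausible entry.
years = pd.Series([1, 4, 6, 8, 10, 12, 15, 20, 114], name="TotalWorkingYears")

q1, q3 = years.quantile(0.25), years.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows outside the whiskers of the box plot
outliers = years[(years < lower) | (years > upper)]
print(outliers)
```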

#### 4) Removing highly correlated columns

##### Because Monthly Income and Job Level are highly correlated, we drop the Monthly Income column. Job Level is already a categorical variable, so it needs no further preprocessing before it can be used in the association rules model.

In [None]:
df.drop("MonthlyIncome",axis=1,inplace=True)
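Highly correlated pairs can also be listed from the correlation matrix instead of being read off the heatmap. The following sketch uses a toy frame with hypothetical values and an assumed cutoff of 0.9.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (assumption: not the real dataset).
toy = pd.DataFrame({
    "JobLevel": [1, 2, 3, 4, 5],
    "MonthlyIncome": [2000, 4100, 6200, 7900, 10100],
    "HourlyRate": [60, 35, 80, 42, 55],
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff for "highly correlated"
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```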

### 5) Data Transformations: Discretization - Association rules mining only takes categorical attributes as input, so we now discretize all the continuous numerical variables.

##### From the plot below, we can see that Age is roughly normally distributed. So, we will discretize it into 3 labels - low, medium and high age.

In [None]:
sns.distplot(df['Age'],kde = False).set_title('Distribution for Age')

In [None]:
df["Age"] = pd.qcut(df.Age, 3, labels = ['low_age','med_age','high_age'])

##### Similarly, we will discretize all the continuous numerical attributes into three bins - low, medium and high.

In [None]:
sns.distplot(df['DailyRate'],kde = False).set_title('Distribution for DailyRate')

In [None]:
df["DailyRate"] = pd.cut(df.DailyRate, 3, labels = ['low_DailyRate','med_DailyRate','high_DailyRate'])
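Age is binned with `pd.qcut`, while the other attributes use `pd.cut`. The two behave differently: `pd.cut` builds bins of equal width, and `pd.qcut` builds bins with roughly equal counts. A small sketch on hypothetical values shows the effect when one value is far from the rest.

```python
import pandas as pd

# Hypothetical values with one large entry.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["low", "med", "high"]

equal_width = pd.cut(values, 3, labels=labels)   # three bins of equal width over [min, max]
equal_freq = pd.qcut(values, 3, labels=labels)   # three bins holding about the same number of rows

width_counts = equal_width.value_counts().reindex(labels)
freq_counts = equal_freq.value_counts().reindex(labels)
print(width_counts.tolist(), freq_counts.tolist())
```

With equal-width bins, nearly all rows fall into the first bin, which is worth keeping in mind for attributes that contain extreme values.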

In [None]:
sns.distplot(df['DistanceFromHome'],kde = False).set_title('Distribution for DistanceFromHome')

In [None]:
df["DistanceFromHome"] = pd.cut(df.DistanceFromHome, 3, labels = ['low_DistanceFromHome','med_DistanceFromHome','high_DistanceFromHome'])

In [None]:
sns.distplot(df['HourlyRate'],kde = False).set_title('Distribution for HourlyRate')

In [None]:
df["HourlyRate"] = pd.cut(df.HourlyRate, 3, labels = ['low_HourlyRate','med_HourlyRate','high_HourlyRate'])

In [None]:
sns.distplot(df['MonthlyRate'],kde = False).set_title('Distribution for MonthlyRate')

In [None]:
df["MonthlyRate"] = pd.cut(df.MonthlyRate, 3, labels = ['low_MonthlyRate','med_MonthlyRate','high_MonthlyRate'])

##### Discretizing all the other continuous numerical variables similarly.

In [None]:

df["NumCompaniesWorked"] = pd.cut(df.NumCompaniesWorked, 3, labels = ['low_NumCompaniesWorked','med_NumCompaniesWorked','high_NumCompaniesWorked'])

df["PercentSalaryHike"] = pd.cut(df.PercentSalaryHike, 3, labels = ['low_PercentSalaryHike','med_PercentSalaryHike','high_PercentSalaryHike'])

df["TotalWorkingYears"] = pd.cut(df.TotalWorkingYears, 3, labels = ['low_TotalWorkingYears','med_TotalWorkingYears','high_TotalWorkingYears'])

df["TrainingTimesLastYear"] = pd.cut(df.TrainingTimesLastYear, 3, labels = ['low_TrainingTimesLastYear','med_TrainingTimesLastYear','high_TrainingTimesLastYear'])

df["YearsAtCompany"] = pd.cut(df.YearsAtCompany, 3, labels = ['low_YearsAtCompany','med_YearsAtCompany','high_YearsAtCompany'])

df["YearsInCurrentRole"] = pd.cut(df.YearsInCurrentRole, 3, labels = ['low_YearsInCurrentRole','med_YearsInCurrentRole','high_YearsInCurrentRole'])

df["YearsSinceLastPromotion"] = pd.cut(df.YearsSinceLastPromotion, 3, labels = ['low_YearsSinceLastPromotion','med_YearsSinceLastPromotion','high_YearsSinceLastPromotion'])

df["YearsWithCurrManager"] = pd.cut(df.YearsWithCurrManager, 3, labels = ['low_YearsWithCurrManager','med_YearsWithCurrManager','high_YearsWithCurrManager'])

## Some more visualizations and insights

In [None]:
sns.countplot(x="BusinessTravel", data=df).set_title('Distribution for Business Travel')

In [None]:
sns.countplot(x="Department", data=df).set_title('Distribution for Department')

#### The plot below suggests that people with lower environment satisfaction tend to quit their jobs more often.

In [None]:
sns.boxplot(y='Attrition', x='EnvironmentSatisfaction', data=df)
plt.title('Attrition vs Environment Satisfaction')

#### The plot below shows that people in the lower age groups tend to quit their jobs more often than their senior counterparts.

In [None]:
fig,ax=plt.subplots(1,2,figsize=(15,5))
df_plot = df.groupby(["Attrition", "Age"]).size().reset_index().pivot(columns="Attrition", index="Age", values=0)
sns.countplot(x=df["Age"],ax=ax[0]).set_title("Countplot of Age");  # pass the column as x= (positional data arguments are not accepted by recent seaborn)
df_plot.div(df_plot.sum(axis=1), axis=0).plot(kind='bar', stacked=True, ax=ax[1]);
ax[1].set_title("Stacked Proportion Plot of Attrition based on Age");
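The proportions behind the stacked plot can also be obtained with `pd.crosstab` and `normalize="index"`, which divides each row by its total. This is a sketch on hypothetical rows, not the real dataset.

```python
import pandas as pd

# Hypothetical rows (assumption: not the real dataset).
toy = pd.DataFrame({
    "Age": ["low_age"] * 4 + ["high_age"] * 4,
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Each row of the result sums to 1: the share of Yes/No within an age group.
proportions = pd.crosstab(toy["Age"], toy["Attrition"], normalize="index")
print(proportions)
```

Calling `proportions.plot(kind="bar", stacked=True)` on the result would draw the same kind of stacked proportion plot.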

# Modeling - ARules Mining

#### Preparing dataframe for Association rules mining model

In [None]:
# converting the df columns to string
df=df.astype("str")

# converting all the columns to object type to fulfill the type requirement of association rules mining
df=df.astype("object")

#Final dataset to be used for R Shiny App
df.to_csv("ShinyApp_data.csv", index=None)

# Creating a new dataframe for arules mining
df2 = pd.DataFrame({col: str(col)+'=' for col in df}, 
                index=df.index) + df.astype(str)
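Prefixing every value with its column name matters because different columns can share the same value. Without the prefix, a "Yes" from Attrition and a "Yes" from OverTime would be treated as the same item. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame (assumption: not the real dataset).
toy = pd.DataFrame({
    "Attrition": ["Yes", "No"],
    "OverTime": ["Yes", "No"],
}).astype(str)

# A frame of the same shape holding "column=" in every cell
prefix = pd.DataFrame({col: col + "=" for col in toy}, index=toy.index)

items = prefix + toy                     # element-wise string concatenation
transactions = items.values.tolist()     # one list of items per row
print(transactions)
```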


In [None]:
# one transaction per employee: the list of "column=value" items in that row
records = df2.astype(str).values.tolist()
frequent_itemset = ap.apriori(records, min_support=0.5, min_confidence=0.5,
                              min_lift=1,min_length=2)
results = list(frequent_itemset)
len(results)
results[1:5]

#### Below we run a baseline model and list the frequent itemsets in descending order of support. Most of the antecedents in the rules are of length 1.

In [None]:
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
df3 = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df3, min_support=0.5, use_colnames=True)
frequent_itemsets.sort_values(by='support',ascending=False).head(10)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()
rules[(rules['lift']>1) & (rules['confidence'] > 0.5)].head(20)
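The three metrics used for filtering can be computed by hand, which helps when interpreting the rule tables. For a rule A → C: support is the share of transactions containing both A and C, confidence is support(A ∪ C) / support(A), and lift is confidence / support(C). The sketch below uses four hypothetical transactions.

```python
# Four hypothetical transactions (assumption: not the real dataset).
transactions = [
    {"OverTime=Yes", "Attrition=Yes"},
    {"OverTime=Yes", "Attrition=No"},
    {"OverTime=No", "Attrition=No"},
    {"OverTime=No", "Attrition=No"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"OverTime=No"}, {"Attrition=No"}

supp = support(antecedent | consequent)   # both sides occur together
conf = supp / support(antecedent)         # how often the rule holds when the antecedent occurs
lift = conf / support(consequent)         # > 1 means the two co-occur more often than if independent

print(supp, conf, lift)
```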

In [None]:
def SupervisedApriori(data,consequent,min_supp,min_conf,min_lift):
    frequent_itemsets = apriori(data, min_support=min_supp, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_conf)
    # keep only the rules whose lift exceeds min_lift
    rules = rules[rules['lift'] > min_lift]
    # DataFrame.append was removed in pandas 2.0, so collect the matching rules and concatenate them
    matches = [rules[rules['consequents'] == {i}] for i in consequent]
    if not matches:
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)
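The function above keeps only the rules whose consequent is exactly one chosen item. The filtering step can be sketched in isolation on a hand-built rules table; the rows and lift values below are hypothetical and only mimic the `antecedents` / `consequents` columns of frozensets that mlxtend returns.

```python
import pandas as pd

# Hand-built rules table with hypothetical rows and lift values.
rules = pd.DataFrame({
    "antecedents": [frozenset({"OverTime=No"}), frozenset({"Attrition=No"}), frozenset({"OverTime=Yes"})],
    "consequents": [frozenset({"Attrition=No"}), frozenset({"OverTime=No"}), frozenset({"Attrition=Yes"})],
    "lift": [1.2, 1.2, 3.0],
})

target = frozenset({"Attrition=No"})

# Compare each consequent set with the target set, row by row.
mask = rules["consequents"].apply(lambda c: c == target)
selected = rules[mask].reset_index(drop=True)
print(selected)
```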
    


## Attrition=No

#### Below we run the model to mine rules whose consequent is Attrition=No.

In [None]:
attrition_no = SupervisedApriori(df3,consequent = ['Attrition=No'],
min_supp=0.4, min_conf=0.9, min_lift=1)

In [None]:
attrition_no.sort_values("lift",ascending=False).head()

In [None]:
for i in attrition_no.sort_values("lift",ascending=False).head().index:
    print(attrition_no.loc[i]["antecedents"])

## Attrition=Yes

#### Below we run the model to mine rules whose consequent is Attrition=Yes.

In [None]:
attrition_yes = SupervisedApriori(df3,consequent = ['Attrition=Yes'],
min_supp=0.04, min_conf=0.3, min_lift=1)

In [None]:
attrition_yes.sort_values("lift",ascending=False).head()

In [None]:
for i in attrition_yes.sort_values("lift",ascending=False).head().index:
    print(attrition_yes.loc[i]["antecedents"])

# Conclusion

#### Attrition=No: All the rules have a lift greater than 1, which suggests a positive association between the antecedents and the consequent. Confidence is also high (at least 0.9, the minimum we set), so the consequent holds in the large majority of the records that match each antecedent. From the rules, we can draw a few insights: people in the R&D department tend to stay more loyal to their jobs; a lower attrition rate is associated with employees who are not asked to work overtime; a low distance from home and few years under the current manager are also associated with employee retention.

#### Attrition=Yes: The rules have a high lift of about 3 and a confidence above 0.45, so they are unlikely to be due to chance alone. One factor associated with attrition is the absence of stock options: employees who have no stock option tend to leave the company. Other factors associated with attrition are TotalWorkingYears=low, OverTime=Yes and Marital Status=Yes.