## HR Analytics dataset from Kaggle
https://www.kaggle.com/vjchoudhary7/hr-analytics-case-study


Problem Statement
A large company named XYZ employs, at any given point in time, around 4000 employees. However, every year, around 15% of its employees leave the company and need to be replaced from the talent pool available in the job market. The management believes that this level of attrition (employees leaving, either on their own or because they were fired) is bad for the company, for the following reasons:

- The former employees’ projects get delayed, which makes it difficult to meet timelines, resulting in a reputation loss among consumers and partners
- A sizeable department has to be maintained, for the purposes of recruiting new talent
- More often than not, the new employees have to be trained for the job and/or given time to acclimatise themselves to the company

Hence, the management has contracted an HR analytics firm to understand what factors they should focus on, in order to curb attrition. In other words, they want to know what changes they should make to their workplace, in order to get most of their employees to stay. Also, they want to know which of these variables is most important and needs to be addressed right away.

Since you are one of the star analysts at the firm, this project has been given to you.

<b><u>Goal of the case study</u></b><br>
<b>You are required to model the probability of attrition using a logistic regression. The results thus obtained will be used by the management to understand what changes they should make to their workplace, in order to get most of their employees to stay.</b>



### Import required Python packages


In [None]:
# Numerical library
import numpy as np

# Handle tabular data (rows and columns)
import pandas as pd

# Plotting library
import matplotlib.pyplot as plt

# Statistical plots
import seaborn as sns

# Logistic regression model
from sklearn.linear_model import LogisticRegression

# Random train/test splitting
from sklearn.model_selection import train_test_split

# Accuracy measures and confusion matrix
from sklearn import metrics

# Feature scaling utilities from scikit-learn's preprocessing module
from sklearn import preprocessing

# To enable plotting graphs in Jupyter notebook
%matplotlib inline 

In [None]:
# Read data files to Dataframes
gen_df = pd.read_csv('../input/hr-analytics-case-study/general_data.csv')
eos_df = pd.read_csv('../input/hr-analytics-case-study/employee_survey_data.csv')
mos_df = pd.read_csv('../input/hr-analytics-case-study/manager_survey_data.csv')

print('general data shape = ',gen_df.shape)
print('employee survey data shape = ',eos_df.shape)
print('manager survey data shape = ',mos_df.shape)

In [None]:
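# Merge employee survey and manager survey data into the general data on the EmployeeID key.
# (join(on='EmployeeID') would match EmployeeID against the other frame's row index, shifting rows by one.)
gen_df=gen_df.merge(eos_df,on='EmployeeID',how='left')
gen_df=gen_df.merge(mos_df,on='EmployeeID',how='left')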
gen_df.head()
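
Because `join(on=...)` matches the caller's column against the other frame's index rather than its `EmployeeID` column, it is worth checking the alignment on a small example. The sketch below uses made-up toy frames (the values are hypothetical, not taken from the Kaggle files) to contrast `join` with `merge` on the key column.

```python
import pandas as pd

# Toy stand-ins for general_data.csv and employee_survey_data.csv (hypothetical values)
general = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Age': [51, 31, 32]})
survey = pd.DataFrame({'EmployeeID': [1, 2, 3], 'JobSatisfaction': [4.0, 2.0, 1.0]})

# join(on=...) looks up each EmployeeID in the OTHER frame's index (0, 1, 2 here),
# so employee 1 is paired with the survey row stored at index 1
shifted = general.join(survey, on='EmployeeID', rsuffix='_EOS')

# merge(on=...) matches the key column on both sides;
# validate raises if an EmployeeID is duplicated on either side
aligned = general.merge(survey, on='EmployeeID', how='left', validate='one_to_one')

print(shifted[['EmployeeID', 'EmployeeID_EOS', 'JobSatisfaction']])
print(aligned)
```

In the `shifted` frame the `EmployeeID_EOS` column differs from `EmployeeID`, which is a quick visual check for this kind of misalignment.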

In [None]:
gen_df.info()

In [None]:
# check missing values count
missing_values=gen_df.columns[gen_df.isnull().any()]
gen_df[missing_values].isnull().sum()

In [None]:
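# replace missing values in numeric columns with the column mean
# (numeric_only=True skips the object columns, which recent pandas versions refuse to average)
gen_df.fillna(gen_df.mean(numeric_only=True),inplace=True)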

In [None]:
# check again to confirm there are no more missing values
missing_values=gen_df.columns[gen_df.isnull().any()]
gen_df[missing_values].isnull().sum()
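
Mean imputation is one option among several. As a hedged alternative, the sketch below fills numeric gaps with the column median, which is less sensitive to outliers and keeps integer-like ratings on their original scale. The toy frame and its values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in numeric columns (hypothetical values)
toy = pd.DataFrame({
    'NumCompaniesWorked': [1.0, np.nan, 3.0, 4.0],
    'TotalWorkingYears': [10.0, 6.0, np.nan, 8.0],
    'Department': ['Sales', 'Sales', 'Research & Development', 'Sales'],
})

# Restrict the imputation to numeric columns; text columns are left untouched
num_cols = toy.select_dtypes(include='number').columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

print(toy)
```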

In [None]:
gen_df.describe(include='all').T

In [None]:
# Columns EmployeeCount, Over18 and StandardHours each hold a single value, so we drop them
gen_df.drop(['EmployeeCount','Over18','StandardHours'],axis=1,inplace=True)

In [None]:
# Let's get the overall number and percentage of people who left and who stayed
no=gen_df.Attrition.value_counts()['No']
yes=gen_df.Attrition.value_counts()['Yes']
print('Attrition->No: Count=',no,' & Percentage=',((no/len(gen_df))*100).round(2))
print('Attrition->Yes: Count=',yes,' & Percentage=',((yes/len(gen_df))*100).round(2))
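
The same counts and percentages can also be obtained in one small table with `value_counts(normalize=True)`. This is a minimal sketch on a made-up `Attrition` column, not the notebook's data.

```python
import pandas as pd

# Toy target column (hypothetical values, not the Kaggle data)
attrition = pd.Series(['No'] * 5 + ['Yes'], name='Attrition')

# Absolute counts and percentage shares, both indexed by the class label
counts = attrition.value_counts()
shares = attrition.value_counts(normalize=True).mul(100).round(2)

summary = pd.DataFrame({'Count': counts, 'Percentage': shares})
print(summary)
```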

In [None]:
# Add the Attrition column to eos_df and mos_df to relate the survey results to attrition.
# merge() matches on the EmployeeID key; join(on=...) would match against gen_df's row index instead.
eos_df=eos_df.merge(gen_df[['EmployeeID','Attrition']],on='EmployeeID',how='left')
mos_df=mos_df.merge(gen_df[['EmployeeID','Attrition']],on='EmployeeID',how='left')


In [None]:
# create new dataframe of Attrition from EmployeeSurvey
eos_yes_df=eos_df.loc[eos_df['Attrition']=='Yes']

# create new dataframe of Attrition from ManagerSurvey
mos_yes_df=mos_df.loc[mos_df['Attrition']=='Yes']

# Create summary dataframe for ratings
yes_summary=pd.DataFrame({'Ratings':[0,1,2,3,4]})

# Create new columns from EmployeeSurvey and store count in summary dataframe
yes_summary['EnvironmentSatisfaction']=eos_yes_df.groupby(['EnvironmentSatisfaction'])['Attrition'].count()
yes_summary['JobSatisfaction']=eos_yes_df.groupby(['JobSatisfaction'])['Attrition'].count()
yes_summary['WorkLifeBalance']=eos_yes_df.groupby(['WorkLifeBalance'])['Attrition'].count()

yes_summary['JobInvolvement']=mos_yes_df.groupby(['JobInvolvement'])['Attrition'].count()
yes_summary['PerformanceRating']=mos_yes_df.groupby(['PerformanceRating'])['Attrition'].count()

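# Ratings with no leavers are NaN after index alignment; replace them with 0
yes_summary.fillna(0,inplace=True)

# Replace rating numbers with text labels (assign back rather than modifying the column in place)
yes_summary['Ratings']=yes_summary['Ratings'].replace({1:'Low',2:'Medium',3:'High',4:'Very High'})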

yes_summary.plot(kind='bar',x='Ratings',figsize=(20,10),title='Attrition related to Employee and Manager Survey')



## Observation

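### The chart suggests that many of the employees who left had reported High or Very High satisfaction and had received High or Very High performance ratings. These are raw counts, so they also reflect how many employees fall into each rating level.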
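
Raw counts of leavers per rating depend on how many employees gave each rating. One way to separate the two effects is to compute the attrition rate within each rating level. The sketch below does this on a small made-up frame; the values are hypothetical and only the column names mirror the survey data.

```python
import pandas as pd

# Toy survey data (hypothetical values)
toy = pd.DataFrame({
    'JobSatisfaction': [1, 1, 2, 3, 3, 3, 4, 4, 4, 4],
    'Attrition': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No'],
})

# Boolean flag: True where the employee left
left = toy['Attrition'].eq('Yes')

# Per rating level: number of leavers, number of employees, and share who left
by_rating = left.groupby(toy['JobSatisfaction']).agg(
    leavers='sum', employees='count', rate='mean'
)
print(by_rating)
```

In this toy example rating 4 has as many leavers as rating 3, yet its attrition rate is lower because more employees gave that rating.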


In [None]:
# Let's look at the age of people who have left the organisation
attrition_age=gen_df.loc[gen_df['Attrition']=='Yes'].groupby('Age')['Attrition'].count()\
.plot(kind='line',figsize=(20,10),title='Age wise Attrition')


### A high number of people between the ages of 25 and 35 have left the organisation, with peaks at ages 29 and 31.
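
A per-age line chart can be noisy, so grouping ages into bands and looking at the attrition rate per band may make the pattern easier to read. The following is a sketch under assumptions: the band edges and labels are arbitrary choices, and the data is made up.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Age': [22, 27, 29, 31, 34, 38, 45, 52],
    'Attrition': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No'],
})

# Arbitrary age bands; pd.cut intervals are right-inclusive, e.g. (25, 35]
bins = [17, 25, 35, 45, 60]
labels = ['18-25', '26-35', '36-45', '46-60']
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# Share of employees in each band who left
rate_by_band = toy['Attrition'].eq('Yes').groupby(toy['AgeBand'], observed=False).mean()
print(rate_by_band)
```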

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(20,7))
sns.countplot(x='Attrition',data=gen_df,hue='JobLevel',ax=ax1)
sns.countplot(x='Attrition',data=gen_df,hue='Gender',ax=ax2)


### The majority of people after Job Level 2 have left the firm.


In [None]:
sns.pairplot(gen_df[['Age','MonthlyIncome','DistanceFromHome','Attrition']],hue = 'Attrition')

In [None]:
x=gen_df.drop(['Attrition','EmployeeID'],axis=1)
#x=gen_df.drop(['Attrition','EmployeeID','BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus'],axis=1)
# encode the target as 1 (left) / 0 (stayed) without modifying gen_df in place
y=gen_df['Attrition'].map({'Yes':1,'No':0})



In [None]:
# A few fields have dtype object, indicating that they hold categorical values:
# Attrition (the target, encoded above), BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()


x['BusinessTravel'] = label_encoder.fit_transform(x['BusinessTravel'].fillna('0'))    
x['Department'] = label_encoder.fit_transform(x['Department'].fillna('0'))    
x['EducationField'] = label_encoder.fit_transform(x['EducationField'].fillna('0'))    
x['Gender'] = label_encoder.fit_transform(x['Gender'].fillna('0'))    
x['JobRole'] = label_encoder.fit_transform(x['JobRole'].fillna('0'))    
x['MaritalStatus'] = label_encoder.fit_transform(x['MaritalStatus'].fillna('0'))    
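
`LabelEncoder` maps categories to integers in alphabetical order, which a linear model such as logistic regression reads as an ordering. For nominal fields, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a made-up frame; the category values are illustrative assumptions.

```python
import pandas as pd

# Toy feature frame (hypothetical values)
toy = pd.DataFrame({
    'Department': ['Sales', 'Research & Development', 'Human Resources', 'Sales'],
    'MaritalStatus': ['Single', 'Married', 'Divorced', 'Married'],
    'Age': [30, 41, 35, 28],
})

# One indicator column per category; drop_first removes one level per field
# to avoid perfectly collinear columns
encoded = pd.get_dummies(
    toy, columns=['Department', 'MaritalStatus'], drop_first=True, dtype=int
)
print(encoded)
```

With `drop_first=True` the dropped level (the alphabetically first one) becomes the baseline against which the other coefficients are interpreted.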


#### Model 1 - Without Scaling

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=1)
type(x_train)


model=LogisticRegression()
model.fit(x_train,y_train)
y_predict=model.predict(x_test)
model_score=model.score(x_test,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))
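
With roughly 15% attrition, a model can reach high accuracy while predicting almost nobody as leaving, so recall on the positive class is worth reporting next to accuracy. The sketch below compares a default logistic regression with one using `class_weight='balanced'`. It runs on synthetic data from `make_classification`, which is only a stand-in for the HR data, and `max_iter=1000` is an assumed setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the HR data (about 16% positives); not the real dataset
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

results = {}
for name, weights in [('default', None), ('balanced', 'balanced')]:
    clf = LogisticRegression(max_iter=1000, class_weight=weights)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    results[name] = {
        'accuracy': clf.score(X_test, y_test),
        'recall': recall_score(y_test, y_pred),
        'tn': int(tn), 'fp': int(fp), 'fn': int(fn), 'tp': int(tp),
    }
    print(name, results[name])
```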


#### Model 2 - Scale

In [None]:
from sklearn import preprocessing

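# scale() standardises each array with its own mean and standard deviation,
# so the training and test sets are scaled independently here
x_train_scaled = preprocessing.scale(x_train)
x_test_scaled = preprocessing.scale(x_test)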

model=LogisticRegression()
model.fit(x_train_scaled,y_train)
y_predict=model.predict(x_test_scaled)
model_score=model.score(x_test_scaled,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))



#### Model 3 - MinMaxScaler

In [None]:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
# fit the scaler on the training data only, then reuse it on the test data
x_train_scaled = min_max_scaler.fit_transform(x_train)
x_test_scaled = min_max_scaler.transform(x_test)

model=LogisticRegression()
model.fit(x_train_scaled,y_train)
y_predict=model.predict(x_test_scaled)
model_score=model.score(x_test_scaled,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))
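
A scaler should learn its parameters from the training data only and then be applied unchanged to the test data. Wrapping the scaler and the model in a scikit-learn `Pipeline` enforces this automatically. The following is a minimal sketch on synthetic data (not the HR dataset), with `max_iter=1000` as an assumed setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the real dataset
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# fit() fits the scaler on X_train only; score() reuses those fitted
# minima and maxima to transform X_test before predicting
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
print('Accuracy = ', acc)

# The fitted scaler is available by its lowercased class name
scaler = pipe.named_steps['minmaxscaler']
```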


#### Model 4 - Standard Scaler

In [None]:
from sklearn import preprocessing


# fit the scaler on the training data only and apply it to both sets
x_scaler = preprocessing.StandardScaler().fit(x_train)

x_train_scaled=x_scaler.transform(x_train)
x_test_scaled=x_scaler.transform(x_test)

model=LogisticRegression()
model.fit(x_train_scaled,y_train)
y_predict=model.predict(x_test_scaled)
model_score=model.score(x_test_scaled,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))
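
Since the goal is to learn which variables matter most, the coefficients of a model fitted on standardised features can be ranked by absolute size, and `exp(coefficient)` read as the change in odds per one standard deviation. The sketch below illustrates this on synthetic data with made-up feature names (`strong`, `weak`, `noise`), so it shows the technique rather than any result for the HR data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known structure: 'strong' drives the outcome, 'noise' does not
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
logit = 2.0 * X['strong'] + 0.3 * X['weak'] - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardise so that coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

coef = pd.Series(model.coef_[0], index=X.columns)
ranking = coef.abs().sort_values(ascending=False)   # largest effect first
odds_ratios = np.exp(coef)                          # odds multiplier per 1 std. dev.

print(ranking)
print(odds_ratios)
```

Coefficient size is only a rough guide to importance: correlated features share their effect, and integer-coded categorical fields do not have a meaningful "per standard deviation" interpretation.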


#### Model 5 - Max Absolute Scaler

In [None]:
from sklearn import preprocessing

max_abs_scaler = preprocessing.MaxAbsScaler()
# fit the scaler on the training data only, then reuse it on the test data
x_train_scaled = max_abs_scaler.fit_transform(x_train)
x_test_scaled = max_abs_scaler.transform(x_test)

model=LogisticRegression()
model.fit(x_train_scaled,y_train)
y_predict=model.predict(x_test_scaled)
model_score=model.score(x_test_scaled,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))


#### Model 6 - Quantile Transformer

In [None]:
from sklearn import preprocessing

quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
# learn the quantiles from the training data only, then reuse them on the test data
x_train_scaled = quantile_transformer.fit_transform(x_train)
x_test_scaled = quantile_transformer.transform(x_test)

model=LogisticRegression()
model.fit(x_train_scaled,y_train)
y_predict=model.predict(x_test_scaled)
model_score=model.score(x_test_scaled,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))


#### Model 7 - Quantile Transformer with Normal Distribution

In [None]:
from sklearn import preprocessing

quantile_transformer = preprocessing.QuantileTransformer(output_distribution='normal',random_state=0)
# learn the quantiles from the training data only, then reuse them on the test data
x_train_scaled = quantile_transformer.fit_transform(x_train)
x_test_scaled = quantile_transformer.transform(x_test)

model=LogisticRegression()
model.fit(x_train_scaled,y_train)
y_predict=model.predict(x_test_scaled)
model_score=model.score(x_test_scaled,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))


#### Model 8 - Log Transformation using FunctionTransformer

In [None]:
from sklearn import preprocessing

from sklearn.preprocessing import FunctionTransformer

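# log1p is stateless, so nothing is learned from the data; it requires all feature values to be greater than -1
transformer = FunctionTransformer(np.log1p, validate=True)
x_train_scaled = transformer.fit_transform(x_train)
x_test_scaled = transformer.transform(x_test)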

model=LogisticRegression()
model.fit(x_train_scaled,y_train)
y_predict=model.predict(x_test_scaled)
model_score=model.score(x_test_scaled,y_test)
print('Accuracy = ',model_score)
print(metrics.confusion_matrix(y_test,y_predict))


## Observation

### Summary of Models

<table>
    <tr><th>Scaler</th><th>Accuracy</th><th>True Positives</th><th>True Negatives</th><th>False Positives (Type I Error)</th><th>False Negatives (Type II Error)</th></tr>
    <tr><td>None</td><td>83.14</td><td>0</td><td>1100</td><td>0</td><td>223</td></tr>
    <tr><td>Scale</td><td>83.82</td><td>12</td><td>1097</td><td>3</td><td>211</td></tr>
    <tr><td>MinMaxScaler</td><td>83.67</td><td>10</td><td>1097</td><td>3</td><td>213</td></tr>
    <tr><td>Standard Scaler</td><td>83.67</td><td>12</td><td>1097</td><td>5</td><td>211</td></tr>
    <tr><td>MaxAbsScaler</td><td>83.9</td><td>10</td><td>1100</td><td>0</td><td>213</td></tr>
    <tr style="background-color:#90ee90"><td>QuantileTransformer</td><td>83.97</td><td>19</td><td>1092</td><td>8</td><td>204</td></tr>
    <tr><td>QuantileTransformer - Normal Distribution</td><td>83.59</td><td>24</td><td>1082</td><td>18</td><td>199</td></tr>
    <tr><td>FunctionTransformer</td><td>83.82</td><td>23</td><td>1086</td><td>14</td><td>200</td></tr>
</table>

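### From the comparison table above, QuantileTransformer (default uniform output) has the highest accuracy (83.97%). The fewest Type II errors (199) come from QuantileTransformer with a normal output distribution, at the cost of more Type I errors (18).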
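
A comparison like the one above can be generated in a single loop rather than copied by hand from each cell. The sketch below builds one pipeline per scaler and collects accuracy and confusion-matrix cells into a DataFrame. It uses synthetic data, so its numbers are not comparable with the table; `max_iter=1000` and `n_quantiles=100` are assumed settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, QuantileTransformer, StandardScaler

# Synthetic, imbalanced stand-in data; not the real dataset
X, y = make_classification(n_samples=1500, n_features=12, weights=[0.84, 0.16], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

scalers = {
    'None': None,
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'MaxAbsScaler': MaxAbsScaler(),
    'QuantileTransformer': QuantileTransformer(n_quantiles=100, random_state=0),
}

rows = []
for name, scaler in scalers.items():
    # Build the pipeline with or without a scaling step
    steps = [] if scaler is None else [scaler]
    pipe = make_pipeline(*steps, LogisticRegression(max_iter=1000))
    pipe.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, pipe.predict(X_test)).ravel()
    rows.append({
        'Scaler': name,
        'Accuracy (%)': round(100 * (tp + tn) / len(y_test), 2),
        'TP': int(tp), 'TN': int(tn), 'FP': int(fp), 'FN': int(fn),
    })

summary = pd.DataFrame(rows).set_index('Scaler')
print(summary)
```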