# Source of this dataset


https://www.kaggle.com/vjchoudhary7/hr-analytics-case-study/

# About the data


Given the available employee data and attrition information (whether each employee left the company in the previous year), we want to predict the probability that an employee stays with the company.
The data should also help us reduce the attrition rate by focusing on the right factors.

# Importing Modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Loading the data

In [None]:
employee_survey = pd.read_csv("../input/hr-analytics-case-study/employee_survey_data.csv")
manager_survey = pd.read_csv("../input/hr-analytics-case-study/manager_survey_data.csv")
general_data = pd.read_csv("../input/hr-analytics-case-study/general_data.csv")

In [None]:
print(employee_survey.columns)
print(manager_survey.columns)
print(general_data.columns)

We will merge the 3 datasets on the common column "EmployeeID" and work on a single dataset.

In [None]:
from functools import reduce
df_list = [employee_survey, manager_survey, general_data]
emp_df = reduce(lambda left,right: pd.merge(left,right,how='inner',on='EmployeeID'), df_list)
emp_df.columns
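
An inner join silently drops employees that are missing from any of the three files, and duplicated IDs would multiply rows. The sketch below shows one way to guard against both. It uses small made-up frames (the `*_toy` names and their values are assumptions for illustration, not the Kaggle data) and the `validate` argument of `pd.merge`.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the three CSV files (values are made up for illustration).
employee_survey_toy = pd.DataFrame({"EmployeeID": [1, 2, 3], "JobSatisfaction": [4, 3, 2]})
manager_survey_toy = pd.DataFrame({"EmployeeID": [1, 2, 3], "PerformanceRating": [3, 4, 3]})
general_data_toy = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [51, 31, 32]})

frames = [employee_survey_toy, manager_survey_toy, general_data_toy]

# validate="one_to_one" makes pandas raise a MergeError if EmployeeID
# is duplicated in either frame being merged.
merged_toy = reduce(
    lambda left, right: pd.merge(left, right, how="inner", on="EmployeeID", validate="one_to_one"),
    frames,
)

# An inner join drops unmatched IDs without warning, so compare row counts afterwards.
rows_lost = len(frames[0]) - len(merged_toy)
print(merged_toy.shape, rows_lost)
```

If `rows_lost` were non-zero on the real data, some employees would be missing from at least one of the files.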

Let's have a quick look at the data we are going to analyze.

In [None]:
emp_df.shape

In [None]:
emp_df.info()

In [None]:
emp_df.describe()

In [None]:
print(emp_df['Over18'].unique())
print(emp_df['EmployeeCount'].unique())
print(emp_df['StandardHours'].unique())

A quick inspection shows that we can drop the following features, for the reasons given:

*   EmployeeID - is a unique ID, the attrition rate does not depend on it
*   Over18 - has a single value in all rows (i.e. Y)
*   EmployeeCount - has a single value in all rows (i.e. 1)
*   StandardHours - has a single value in all rows (i.e. 8)
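
Constant columns can also be found programmatically instead of checking `unique()` column by column. A minimal sketch, assuming a small made-up frame (`toy_df` and its values are illustrative only):

```python
import pandas as pd

# Made-up rows that mimic the constant columns described above.
toy_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "Over18": ["Y", "Y", "Y", "Y"],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [8, 8, 8, 8],
    "Age": [51, 31, 31, 38],
})

# Number of distinct values per column; NaN is counted as its own value.
n_unique = toy_df.nunique(dropna=False)

# A column with exactly one distinct value carries no information for a model.
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

An ID column such as EmployeeID is not caught by this check and still has to be removed by hand.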



# Drop Unnecessary Features (Part 1)

In [None]:
# This function takes the dataframe and the list of features to be dropped.
# It modifies the dataframe in place and does not return anything.

def drop_features(df, feat_list):
    for col in feat_list:
        if col in df.columns:
            df.drop(col, axis=1, inplace=True)
            print(f"{col} is dropped")
        else:
            print(f"{col} is already dropped")

In [None]:
drop_features(emp_df, ['EmployeeID', 'EmployeeCount', 'Over18', 'StandardHours'])

# Visualizing Categorical Data

Our target column is "Attrition"

Let's check out all the labeled columns in the dataframe
- We will list the categorical features
- get their position/column index
- see how target column "Attrition" is related with them

In [None]:
print("The categorical columns and their index-")
for col in emp_df.columns:
    if emp_df[col].dtype == 'object':
        print(col, emp_df.columns.get_loc(col))

In [None]:
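def show_percentage_of_people_left(column_name):
    # describe() on a text column gives count, unique, top (most frequent value) and freq (its count)
    df = emp_df.groupby(column_name)['Attrition'].describe()
    # Assumes 'No' is the most frequent value (top) in every group, so 1 - freq/count is the share of 'Yes'
    df['percentage of people left'] = (1 - (df['freq']/df['count']))*100
    print(df)
    print('===============================')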

In [None]:
for col in ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']:
    show_percentage_of_people_left(col)
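
The helper above relies on "No" being the most frequent value in every group. A sketch of a more direct calculation, which does not depend on that, is shown here on made-up data (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up data: HR is split 50/50, so "top" from describe() would be ambiguous there.
toy_df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales", "HR", "HR"],
    "Attrition": ["No", "No", "No", "Yes", "Yes", "No"],
})

# True where the employee left; the mean of a boolean series is the share of True values.
left = toy_df["Attrition"].eq("Yes")
attrition_pct = left.groupby(toy_df["Department"]).mean().mul(100)
print(attrition_pct)
```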

Let's visualize the categorical data in plots as well.

- To work with the categorical data, we need to encode it.
One-hot encoding will increase the number of features dramatically
(e.g. the column 'JobRole' has 9 distinct values, so it would still add 8 columns after dropping one!)

- Understanding how Attrition varies across the labels helps us group similar labels together
and thus reduce the number of columns.

- We will also drop one column after the encoding to avoid the dummy variable trap
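
The dummy variable trap can be illustrated with a small sketch. It assumes a made-up `toy_df` with the three BusinessTravel labels and uses the `drop_first` and `dtype` arguments of `pd.get_dummies`:

```python
import pandas as pd

# Made-up rows with the three BusinessTravel labels.
toy_df = pd.DataFrame(
    {"BusinessTravel": ["Non-Travel", "Travel_Rarely", "Travel_Frequently", "Travel_Rarely"]}
)

# Full one-hot encoding: one column per label, and every row sums to 1,
# so any one column can be derived from the others (the dummy variable trap).
full = pd.get_dummies(toy_df["BusinessTravel"], dtype=int)

# drop_first=True removes the first label in sorted order ("Non-Travel" here);
# that label is then represented by zeros in all remaining columns.
reduced = pd.get_dummies(toy_df["BusinessTravel"], drop_first=True, dtype=int)

print(full.sum(axis=1).tolist())
print(reduced)
```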

In [None]:
sns.countplot(x='Attrition', hue='BusinessTravel', data=emp_df)

In [None]:
sns.countplot(x='Attrition', hue='Department', data=emp_df)

**Observation** : The Human Resources department has a noticeably higher attrition rate than the others


In [None]:
sns.countplot(x='Attrition', hue='EducationField', data=emp_df)

**Observation** : Employees whose education field is Human Resources have a noticeably higher attrition rate than the others

In [None]:
sns.countplot(x='Attrition', hue='Gender', data=emp_df)

In [None]:
sns.countplot(x='Attrition', hue='JobRole', data=emp_df)

In [None]:
sns.countplot(x='Attrition', hue='MaritalStatus', data=emp_df)

# Encoding Categorical Data

**Encoding "MaritalStatus"**

We will encode this column "MaritalStatus" in a new column named "Single"

- Value : 1 means Single, Attrition rate 25% 
- Value : 0 means Married/Divorced, Attrition rate ~11% on average

In [None]:
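# dtype=int keeps the dummy as 0/1 integers (newer pandas versions return booleans by default)
emp_df['Single'] = pd.get_dummies(emp_df["MaritalStatus"], dtype=int)['Single']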

**Encoding "JobRole"**

```

Let's divide this into 3 categories-
JobRole                   count unique top freq percentage of people left
1st:
Research Director           240      2  No  183                     23.75
2nd:
Laboratory Technician       777      2  No  651                   16.2162
Research Scientist          876      2  No  717                   18.1507
Sales Executive             978      2  No  813                   16.8712
3rd:
Healthcare Representative   393      2  No  336                   14.5038
Human Resources             156      2  No  135                   13.4615
Manager                     306      2  No  264                   13.7255
Manufacturing Director      435      2  No  387                   11.0345
Sales Representative        249      2  No  213                   14.4578

Column Representation-
RD    LT_RS_SE  
1     0        - means 1st category (Attrition rate ~24%)
0     1        - means 2nd category (avg. Attrition rate 17%)
0     0        - means 3rd category (avg. Attrition rate 13%)

```
Setting the values row by row with chained indexing (for example emp_df['RD'][row_num] = 1) triggers the warning below

C:\Users\Dell\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

Ref link:

https://www.dataquest.io/blog/settingwithcopywarning/

This is a warning, not an error, but it is better not to silence it: we do want to update our original dataframe, and chained assignment does not guarantee that. The cell below therefore builds each column with a single vectorized assignment, which does not raise the warning.


In [None]:
# 1 if the employee is a Research Director (1st category), else 0
emp_df['RD'] = (emp_df['JobRole'] == 'Research Director').astype(int)

# 1 if the employee belongs to the 2nd category, else 0
emp_df['LT_RS_SE'] = emp_df['JobRole'].isin(['Laboratory Technician', 'Research Scientist', 'Sales Executive']).astype(int)

In [None]:
# dtype=int keeps the dummies as 0/1 integers (newer pandas versions return booleans by default)
emp_df['Male'] = pd.get_dummies(emp_df["Gender"], dtype=int)["Male"]

# In EducationField
# HR : Avg Attrition Rate 40 %
# Others : Avg Attrition Rate 14 %
emp_df["EducationField_HR"] = pd.get_dummies(emp_df["EducationField"], prefix='EducationField', dtype=int)["EducationField_Human Resources"]

# In Department
# HR : Avg Attrition Rate 30 %
# Others : Avg Attrition Rate 15 %

emp_df["Department_HR"] = pd.get_dummies(emp_df["Department"], prefix='Department', dtype=int)["Department_Human Resources"]

# Encode BusinessTravel once and keep two of its three labels (Non-Travel is the all-zeros case)
travel_dummies = pd.get_dummies(emp_df["BusinessTravel"], dtype=int)
emp_df["Travel_Frequently"] = travel_dummies["Travel_Frequently"]
emp_df["Travel_Rarely"] = travel_dummies["Travel_Rarely"]

# The target is added last so that it ends up as the final column
emp_df["Attrition_Yes"] = pd.get_dummies(emp_df["Attrition"], prefix='Attrition', dtype=int)["Attrition_Yes"]

Now that all our categorical columns are encoded, we can remove the original columns.

# Drop Unnecessary Features (Part 2)

In [None]:
drop_features(emp_df, ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Attrition'])

In [None]:
emp_df.shape

In [None]:
emp_df.info()

# Study Correlation of the features

In [None]:
plt.figure(figsize=(24,10))
sns.heatmap(emp_df.corr(), annot=True)

**Observations :**
- "DistanceFromHome" and "StockOptionLevel" are barely correlated with "Attrition_Yes", hence can be removed
- "PercentSalaryHike" is highly correlated with "PerformanceRating", so "PercentSalaryHike" can be removed
- "YearsAtCompany", "YearsSinceLastPromotion", "YearsWithCurrManager" and "TotalWorkingYears" are highly correlated with each other
- Of these, we keep "YearsAtCompany" and remove "YearsSinceLastPromotion", "YearsWithCurrManager" and "TotalWorkingYears"
- Keeping both HR columns (education field and department), as the other education fields/departments would not be considered otherwise
- Keeping both "Travel_Frequently" and "Travel_Rarely", as we would lose the "Non-Travel" information otherwise
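
Highly correlated pairs can also be listed from the correlation matrix instead of being read off the heatmap. A minimal sketch on synthetic data follows; the generated values and the 0.8 threshold are assumptions for illustration, not figures from this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: YearsWithCurrManager is built to track YearsAtCompany closely.
rng = np.random.default_rng(0)
years = rng.integers(0, 20, size=200).astype(float)
toy_df = pd.DataFrame({
    "YearsAtCompany": years,
    "YearsWithCurrManager": years * 0.5 + rng.normal(0, 0.5, size=200),
    "Age": rng.integers(20, 60, size=200).astype(float),
})

corr = toy_df.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# One row per feature pair; the comparison is False for the masked (NaN) entries.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]  # 0.8 is an arbitrary illustrative threshold
print(high_pairs)
```

For each pair found this way, one of the two features is a candidate for removal.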

# Drop Unnecessary Features (Part 3)

In [None]:
drop_features(emp_df, ["DistanceFromHome", "StockOptionLevel", "PercentSalaryHike", "YearsSinceLastPromotion", "YearsWithCurrManager", "TotalWorkingYears"])

In [None]:
emp_df.shape

In [None]:
emp_df.columns

# Visualizing Numerical Data

In [None]:
sns.jointplot(x='Age', y='NumCompaniesWorked', data=emp_df, kind='kde')

**Observation :** Most employees have worked in 1-2 companies, aged between 28-35 (roughly).

In [None]:
sns.jointplot(x='Age', y='NumCompaniesWorked', data=emp_df, kind='hex')

In [None]:
sns.jointplot(x='JobLevel', y='NumCompaniesWorked', data=emp_df, kind='kde')

In [None]:
sns.countplot(x='EnvironmentSatisfaction', hue='Attrition_Yes', data=emp_df)

In [None]:
sns.countplot(x='JobSatisfaction', hue='Attrition_Yes', data=emp_df)

In [None]:
sns.countplot(x='WorkLifeBalance', hue='Attrition_Yes', data=emp_df)

**Observation** : Employees with poor work-life balance are more likely to leave

In [None]:
plt.figure(figsize=(18,10))
sns.countplot(x='Age', hue='Attrition_Yes', data=emp_df)

# Splitting the data for training and testing

In [None]:
X = emp_df.iloc[:, :-1].values
y = emp_df.iloc[:, -1].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)
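
Because the target classes are imbalanced, a random split can leave the training and test sets with different attrition rates. The sketch below shows the `stratify` argument of `train_test_split` as one possible variant; the toy arrays and their 20% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 20 of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the class proportions the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=47, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```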

# Handling Missing Values

In [None]:
plt.figure(figsize=(24,10))
sns.heatmap(emp_df.isnull())

In [None]:
for col in ['EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'NumCompaniesWorked']:
  print("Column : ", col)
  print("Mean : ", emp_df[col].mean())
  print("Mode : ", emp_df[col].mode())
  print("Unique values : ", emp_df[col].unique())
  print("Index : ", emp_df.columns.get_loc(col))

**Comment** :
- We will impute missing values of 'EnvironmentSatisfaction', 'WorkLifeBalance', 'NumCompaniesWorked' with the most frequent value
- We will impute missing values of 'JobSatisfaction' with the constant 3

- These features are not continuous, hence we chose a whole number (mode) rather than a fraction (mean)

**Imputing missing values with SimpleImputer**

- SimpleImputer takes a 2-D numpy array
```
X_train[:, 0].reshape(X_train[:, 0].shape[0], 1) is a 2-D array made from X_train[:, 0] which is a 1-D array
```
- The shape of the 2-D array is (n, 1) and the shape of the 1-D array is (n,)
- array.shape[0] = no. of rows
- array.shape[1] = no. of cols; this raises an IndexError for a 1-D array
- Missing value imputation is done after splitting the data into training and test sets.
- Note that fit is done only once, on the training set rather than the complete set, to avoid data leakage
- The training and test sets are both transformed with the same value (e.g. the mode) calculated by the fit() method on the training set
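
The points above can be sketched on two tiny made-up columns (`train_col` and `test_col` are assumptions for illustration). `reshape(-1, 1)` is a shorter way of writing the same reshape, and `statistics_` holds the value learned by `fit()`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up 1-D columns; the mode of the training column is 3.0.
train_col = np.array([3.0, 3.0, 4.0, np.nan, 1.0])
test_col = np.array([np.nan, 2.0, 2.0])

# reshape(-1, 1) turns shape (n,) into (n, 1); -1 lets numpy infer the row count.
train_2d = train_col.reshape(-1, 1)
test_2d = test_col.reshape(-1, 1)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer.fit(train_2d)  # learns the mode from the training data only

# Both sets are filled with the training mode, even though the test mode is 2.0.
train_filled = imputer.transform(train_2d)[:, 0]
test_filled = imputer.transform(test_2d)[:, 0]

print(imputer.statistics_, test_filled)
```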



In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_const = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=3)

# for 'EnvironmentSatisfaction', 'WorkLifeBalance', 'NumCompaniesWorked'

for col in [0, 2, 9]:
    imputer.fit(X_train[:, col].reshape(X_train[:, col].shape[0], 1))
    X_train[:, col] = imputer.transform(X_train[:, col].reshape(X_train[:, col].shape[0], 1))[:, 0]
    X_test[:, col] = imputer.transform(X_test[:, col].reshape(X_test[:, col].shape[0], 1))[:, 0]

# for 'JobSatisfaction'
col = 1
imputer_const.fit(X_train[:, col].reshape(X_train[:, col].shape[0], 1))
X_train[:, col] = imputer_const.transform(X_train[:, col].reshape(X_train[:, col].shape[0], 1))[:, 0]
X_test[:, col] = imputer_const.transform(X_test[:, col].reshape(X_test[:, col].shape[0], 1))[:, 0]

Verifying that the missing values are correctly filled.

In [None]:
for i in [0, 1, 2, 9]:
    array_sum = np.sum(X_train[:,i])
    array_has_nan = np.isnan(array_sum)
    print(array_has_nan)
    
    array_sum = np.sum(X_test[:,i])
    array_has_nan = np.isnan(array_sum)
    print(array_has_nan)
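
The sum-based check works because any NaN in a column makes the column sum NaN. A shorter alternative for a purely numeric (float) array is sketched below on a made-up array (`X_toy` is an assumption for illustration):

```python
import numpy as np

# Made-up float array with one missing value in the first column.
X_toy = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])

# Count of missing values in each column.
nan_per_column = np.isnan(X_toy).sum(axis=0)

# Single flag for the whole array.
has_nan = bool(np.isnan(X_toy).any())

print(nan_per_column, has_nan)
```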

# Scaling the data

- Scaling the data is recommended for Logistic Regression (the solver converges more easily and the default L2 regularization treats all features alike).
- Decision Tree Classifier and Random Forest do not need scaled data.
- Scaling is done only on the independent columns.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
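
To make the transformation concrete, the sketch below reproduces what StandardScaler does by hand: each column has the training mean subtracted and is divided by the training standard deviation. The small arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two features on very different scales.
X_tr = np.array([[20.0, 1000.0], [30.0, 3000.0], [40.0, 5000.0]])
X_te = np.array([[30.0, 7000.0]])

scaler = StandardScaler().fit(X_tr)

# Same calculation by hand, using the training statistics only
# (np.std defaults to the population standard deviation, as StandardScaler does).
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(scaler.transform(X_te), manual)
```

The test row is scaled with the training mean and standard deviation, so the test set does not influence the fitted scaler.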

**Importing Modules for Performance Evaluation** 

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Decision Tree Classification


In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Conclusion

- We got 84.5% accuracy with our Logistic Regression model
- The Decision Tree Classifier reached 98.8% accuracy
- The Random Forest Classifier achieved the highest accuracy, 99.4%!
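
When reading accuracy figures on an imbalanced target, it helps to compare them with a trivial baseline. The sketch below uses scikit-learn's DummyClassifier on made-up labels; the 84/16 class split is an assumption for illustration, not a figure measured on this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 84 employees stayed (0) and 16 left (1).
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.zeros((100, 1))  # the features are ignored by this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_base)  # share of correct predictions
rec = recall_score(y_toy, y_base)    # share of leavers that were found

print(acc, rec)
```

A model is only useful here if it clearly beats this baseline's accuracy while also identifying the employees who leave.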



**Note** :

- It is important to encode all the categorical columns before training the classification models. Otherwise, fitting throws an error similar to "ValueError: could not convert string to float: b"
- Reference Link - https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree#:~:text=question%20is%20misleading.-,As%20it%20stands%2C%20sklearn%20decision%20trees%20do%20not%20handle%20categorical,()%20will%20treat%20as%20numeric.

- This was observed for Logistic Regression, Decision Tree and Random Forest
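
The error can be reproduced on a tiny made-up example (`X_raw` and `y_toy` are assumptions for illustration): fitting on raw strings fails, while fitting on an encoded 0/1 column works.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up feature column containing raw text labels.
X_raw = np.array(
    [["Sales"], ["Research & Development"], ["Sales"], ["Human Resources"]], dtype=object
)
y_toy = np.array([0, 0, 1, 1])

# Fitting on the raw strings is expected to fail with a ValueError.
try:
    DecisionTreeClassifier().fit(X_raw, y_toy)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# After encoding the label as a 0/1 flag, the same model fits without error.
X_encoded = (X_raw[:, 0] == "Human Resources").astype(int).reshape(-1, 1)
model = DecisionTreeClassifier(random_state=0).fit(X_encoded, y_toy)

print(error_message)
```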