# 0. Introduction

## What is employee attrition?
---

From "Tech Funnel": _"Employee attrition is a gradual but deliberate reduction in the number of employees in a company or business organization. Employees will at some point in time look to change their job places for a number of reasons. It might be for professional or personal reasons but it does happen."_

This definition gives us a key point: "Employees change their jobs for a number of reasons". So the question now is: **What are the reasons why an employee quits?**

Personally, I think this question is very important for any company, because the hiring process is expensive in most cases.

In her article [Top 10 Reasons Why Employees Quit Their Jobs](https://www.thebalancecareers.com/top-reasons-why-employees-quit-their-job-1918985), Susan Heathfield lists the following reasons:

1. Relationship With the Boss
2. Bored and Unchallenged by the Work Itself
3. Relationships With Coworkers
4. Opportunities to Use Their Skills and Abilities
5. Contribution of Their Work to the Organization’s Business Goals
6. Autonomy and Independence on the Job
7. Meaningfulness of the Employee's Job
8. Knowledge About Your Organization’s Financial Stability
9. Overall Corporate Culture
10. Management’s Recognition of Employee Job Performance

In this notebook I'll concentrate on examining how employee attrition relates to five factors: the relationship with the boss, being bored and unchallenged, relationships with coworkers, salary, and overtime.


# 1. Exploratory Data Analysis (EDA)

In [None]:
#Library section
import pandas as pd 
import numpy as np 

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from collections import Counter



pd.set_option('display.max_columns', None)  # None removes the column limit, so all columns are displayed

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the data
df = pd.read_csv("/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")
print(df.shape)

In [None]:
df.info() #check data type of all columns

In [None]:
df.isna().sum()  # check whether there are missing values

Looking at the dataset, most of the columns are of integer (numerical) type and only nine of them are categorical. There are also no missing values, which reduces the time spent on data cleaning.

The Attrition column is our target variable, but it is stored as a category, so we need to convert it to a numerical value.

In [None]:
df['Attrition'] = df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)

Other binary features are "Over18", "OverTime" and "Gender":

- Over18 : Specifies if the worker is over 18 years old.

- OverTime: Specifies if the worker is working overtime.

- Gender: Specifies the worker's gender.

In [None]:
df['Over18'] = df['Over18'].apply(lambda x: 1 if x == 'Y' else 0)
df['OverTime'] = df['OverTime'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Gender'] = df['Gender'].apply(lambda x: 1 if x == 'Female' else 0)
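
As an alternative sketch of the same encoding, `Series.map` with an explicit dictionary turns any unexpected label into `NaN` instead of silently converting it to 0. The toy frame and the `mappings` dictionary below are hypothetical and only mimic the raw string columns; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame that mimics the raw string columns
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "OverTime": ["No", "Yes", "No"],
    "Gender": ["Female", "Male", "Female"],
})

# Explicit label -> integer mappings, one per column (assumed labels)
mappings = {
    "Attrition": {"Yes": 1, "No": 0},
    "OverTime": {"Yes": 1, "No": 0},
    "Gender": {"Female": 1, "Male": 0},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    encoded[column] = toy[column].map(mapping)

# A label missing from its mapping would appear here as NaN
unmapped_total = int(encoded.isna().sum().sum())
print(encoded)
print("unmapped labels:", unmapped_total)
```

Checking `unmapped_total` right after the encoding makes typos or unexpected categories visible before they reach the model.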

Let's apply the `.describe` method to compare the workers who quit with the workers who stayed.

In [None]:
df[df['Attrition'] == 1].describe()

In [None]:
df[df['Attrition'] == 0].describe()

A short analysis reveals the following key points:


- People who leave the company (on average):

    - Are younger: 33 years
    - Live farther from work: 11 km
    - Are less satisfied with the work environment: 2
    - Hold a lower job level: 1
    - Are less satisfied with their work: 2
    - Earn a lower monthly salary: $4800.00
    - Work more overtime: 0.5
    - Have fewer years in the company: 5
    - Have fewer years in their current position: 2
    - Have fewer years with their current manager: 2.8
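
The comparison above can also be read from a single table instead of two separate `.describe()` outputs. The following is a minimal sketch on a hypothetical toy frame (the values are made up, not taken from the dataset) that places the group means side by side.

```python
import pandas as pd

# Hypothetical toy data; Attrition is already encoded as 1 (left) / 0 (stayed)
toy = pd.DataFrame({
    "Attrition": [1, 1, 0, 0, 0],
    "Age": [28, 34, 40, 45, 38],
    "MonthlyIncome": [3000, 5000, 7000, 9000, 5000],
})

# Mean of each feature per group, transposed so features become rows
group_means = toy.groupby("Attrition")[["Age", "MonthlyIncome"]].mean().T
group_means.columns = ["stayed", "left"]  # groupby sorts keys: 0 first, then 1
group_means["difference"] = group_means["left"] - group_means["stayed"]
print(group_means)
```

The `difference` column makes the direction of each gap explicit, which is easier to scan than two long summary tables.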

## 1.1 Relationship With the Boss
The employee does not necessarily have to be friends with the boss, but good communication between them is necessary.
According to many sources: _"A bad boss is also the number one reason why employees quit their job."_

The dataset has no feature that rates the relationship with the boss, but it does have a column that records the number of years with the current manager.

In [None]:
fig = px.histogram(df, x="YearsWithCurrManager", color="Attrition", marginal="box")
fig.show()

We can observe that employees who resign have spent less time with their current manager than employees who keep their jobs.

## 1.2 Bored and Unchallenged by the Work Itself

No one wants to be bored and unchallenged by their work. 

Employees want to enjoy their job. They spend more than a third of their days working, getting ready for work, and transporting themselves to work.

In this dataset, this reason is most closely related to the job satisfaction rating.

In [None]:
job_satisfaction = df.groupby(["JobSatisfaction", "Attrition"]).agg(count_col=pd.NamedAgg(column="Attrition", aggfunc="count")).reset_index()
fig = px.histogram(job_satisfaction, x="JobSatisfaction", y = 'count_col' ,color="Attrition")
fig.update_layout(barmode='group')
fig.show()

A high level of attrition can be observed when job satisfaction is low, but also when it is high. This suggests that some employees leave the company for other reasons.
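
Raw counts can mislead when the satisfaction levels contain different numbers of employees, so it may help to look at the attrition rate per level as well. The sketch below uses a hypothetical toy frame (made-up values) in which two levels have the same number of quits but different rates.

```python
import pandas as pd

# Hypothetical toy data: 4 employees at level 1, 8 employees at level 4
toy = pd.DataFrame({
    "JobSatisfaction": [1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4],
    "Attrition":       [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Because Attrition is 0/1, its sum is the number of quits and its mean is the rate
summary = toy.groupby("JobSatisfaction")["Attrition"].agg(
    quits="sum",
    employees="count",
    attrition_rate="mean",
)
print(summary)
```

In this toy example both levels show two quits, yet the rate at level 1 is twice the rate at level 4, which a count-based bar chart alone would hide.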

## 1.3 Relationships With Coworkers

"When an employee leaves the company, every email that is sent to the whole company, to say good-bye, includes a comment about passionate coworkers who the employee cares about and will miss." Research from the Gallup organization indicates that one of the 12 factors that illuminate whether an employee is happy on their job is having a best friend at work. 

Relationships with coworkers retain employees. 


In [None]:
# RelationshipSatisfaction is the dataset's rating of relationships at work
fig = px.box(df, x = 'Attrition', y = 'RelationshipSatisfaction', color = 'Attrition')
fig.update_layout(title = 'Relationships With Coworkers')
fig.show()

Most employees are grouped between quartiles 1 and 2, which corresponds to a lower satisfaction rating with coworkers.

## 1.4 Salary and attrition

Passion is very important in any job: a worker who does not like what they are doing will eventually quit.

However, even a worker who enjoys the job cannot live on passion alone. Salary is also important.

In [None]:
fig = px.box(df, x = 'Attrition', y = 'MonthlyIncome', color = 'Attrition')
fig.update_layout(title = 'Monthly Income and Attrition')
fig.show()

## 1.5 Overtime and attrition

The emotional burnout that a job can generate is also a major factor in an employee's decision to quit, regardless of salary and position.

Burnout can be caused by overtime, so it is important to find out how overtime and quitting are related.

In [None]:
# Count employees for each combination of overtime and attrition
overtime_counts = df.groupby(["OverTime", "Attrition"]).agg(count_col=pd.NamedAgg(column="Attrition", aggfunc="count")).reset_index()
fig = px.histogram(overtime_counts, x="OverTime", y = 'count_col' ,color="Attrition")
fig.update_layout(barmode='group')
fig.show()

The plot shows that the number of employees who quit is higher among those who work overtime.

## 1.6 Quick overview

All five factors analyzed are related to employee attrition, but no single factor is decisive on its own. For an employee to resign, several factors usually have to combine.



# 2. Feature Selection

### To Drop:
+ EmployeeCount: All rows have the same value.
+ Over18: All rows have the same value.
+ StandardHours: All rows have the same value.
+ EmployeeNumber: Irrelevant variable, it is only an employee identifier.
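
Constant columns like the ones listed above can be found programmatically with `DataFrame.nunique`. This is a minimal sketch on a hypothetical toy frame; the values are illustrative and not copied from the dataset.

```python
import pandas as pd

# Hypothetical toy frame with two constant columns
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "EmployeeNumber": [1, 2, 4, 5],
    "MonthlyIncome": [6000, 5100, 2100, 2900],
})

# A column with a single distinct value carries no information for a model
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Identifier columns such as `EmployeeNumber` are not caught by this check, because every value is distinct, so they still have to be reviewed by hand.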

### About DailyRate, HourlyRate and MonthlyRate
+ From [Sunix Liu](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/discussion/139552) (Kaggle user): Monthly rate is the internal charge-out rate used to calculate the monthly cost of each employee. In general, the monthly rate covers salary, social insurance, administration, logistics, overhead, etc.
+ HourlyRate and DailyRate: These are not considered because StandardHours is 80 hours for every employee.

I decided to drop these three variables and keep only "MonthlyIncome", which is the total salary.

In [None]:
df.drop(columns = ["EmployeeCount", "Over18", "StandardHours", "EmployeeNumber", "MonthlyRate", "DailyRate", "HourlyRate"], inplace = True)
df.shape

## 2.1 Input and Output variables (X & Y)

In [None]:
# Create an object scaler
MMS = MinMaxScaler()
# get dummies 
dummies = pd.get_dummies(df[df.columns.difference(["Attrition"])])
# scaling the data and define features
X = MMS.fit_transform(dummies)
# Define target variable
y = df[["Attrition"]].values.ravel()

In [None]:
#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0, shuffle = True)
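
One possible refinement, sketched here on synthetic data rather than on the notebook's `df`: splitting before scaling keeps test-set statistics out of the scaler, and the `stratify` argument of `train_test_split` keeps the class proportions equal in both splits, which matters for an imbalanced target. The variable names (`X_raw`, `y_toy`, `X_tr`, and so on) are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 200 rows, 3 features, 20% positive class
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 3))
y_toy = np.array([1] * 40 + [0] * 160)

# Split first; stratify keeps the 20% positive share in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Fit the scaler on the training part only, then reuse it on the test part
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

With this order, scaled test values may fall slightly outside the [0, 1] range, which is expected because the test data played no part in fitting the scaler.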

In [None]:
Counter(y_train)

# 3. Modeling

To finish this notebook, I tried to answer the following question:

### Can you predict who will leave the company?

To achieve this I used two models:
 + Logistic regression
 + Random Forest



## 3.1 Logistic regression

In [None]:
log_reg_model = LogisticRegression(max_iter=1000, solver = "newton-cg")
log_reg_model.fit(X_train, y_train)

In [None]:
y_pred = log_reg_model.predict(X_test)
print("Model accuracy score: {}".format(accuracy_score(y_test, y_pred)))

In [None]:
print(classification_report(y_test, y_pred))

The classification report shows that the model predicts the employees who stay quite well (94%), but it is much weaker at predicting the employees who quit (53%).
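
The per-class figures in a classification report are precision and recall, not accuracy. As a rough illustration of how they differ from the overall accuracy score, the sketch below computes them by hand on hypothetical labels; `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall) when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 8 employees who stayed (0) and 2 who quit (1)
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
stay_precision, stay_recall = per_class_metrics(y_true_toy, y_pred_toy, positive=0)
quit_precision, quit_recall = per_class_metrics(y_true_toy, y_pred_toy, positive=1)

print("accuracy:", accuracy)
print("stay precision/recall:", stay_precision, stay_recall)
print("quit precision/recall:", quit_precision, quit_recall)
```

In this toy case the overall accuracy looks acceptable while only half of the quitting employees are found, which is the same pattern an imbalanced target tends to produce.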

## 3.2 Random Forest Classifier

In [None]:
random_forest_model = RandomForestClassifier(random_state = 0)
random_forest_model.fit(X_train, y_train)

In [None]:
y_pred = random_forest_model.predict(X_test)
print("Model accuracy score: {}".format(accuracy_score(y_test, y_pred)))

In [None]:
print(classification_report(y_test, y_pred))

Again, the classification report shows that the model predicts the employees who stay quite well (92%), but it predicts the employees who quit poorly (29%).

## 3.3 SMOTE Data 
When applying SMOTE I followed one golden rule:

DON'T PUT SYNTHETIC DATA IN YOUR TEST DATA!!!
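
To see why this rule matters, it helps to recall that SMOTE builds each synthetic sample by interpolating between a minority sample and one of its minority-class neighbors. The sketch below shows only that interpolation step with NumPy; it is a simplified illustration, not the `imblearn` implementation, and `interpolate_sample` is a hypothetical helper.

```python
import numpy as np


def interpolate_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between `point` and `neighbor`."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return point + gap * (neighbor - point)


rng = np.random.default_rng(0)

# Two hypothetical minority-class samples with two scaled features each
minority_point = np.array([0.2, 0.8])
minority_neighbor = np.array([0.4, 0.6])

synthetic = interpolate_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

The synthetic point is not a real employee, so scoring a model on such points would measure how well it fits the interpolation rather than how well it predicts actual resignations.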

In [None]:
smt = SMOTE(random_state=0, sampling_strategy = 0.4)
# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases
X_train_SMOTE, y_train_SMOTE = smt.fit_resample(X_train, y_train)

In [None]:
Counter(y_train_SMOTE) #new shape of the target

### 3.3.1 Logistic regression with SMOTE data

In [None]:
log_reg_model = LogisticRegression(max_iter=1000, solver = "newton-cg")
log_reg_model.fit(X_train_SMOTE, y_train_SMOTE)

In [None]:
y_pred = log_reg_model.predict(X_test)
print("Model accuracy score: {}".format(accuracy_score(y_test, y_pred)))

In [None]:
print(classification_report(y_test, y_pred))

With SMOTE, the precision on the attrition cases improves (62%).

### 3.3.2 Random Forest Classifier with SMOTE

In [None]:
random_forest_model = RandomForestClassifier(random_state = 0)
random_forest_model.fit(X_train_SMOTE, y_train_SMOTE)

In [None]:
y_pred = random_forest_model.predict(X_test)
print("Model accuracy score: {}".format(accuracy_score(y_test, y_pred)))

In [None]:
print(classification_report(y_test, y_pred))

With the Random Forest Classifier, the prediction of the employees who quit also improves, but it is still not better than logistic regression.

# 4. Conclusions

## About the data.

A non-competitive salary, a poor work environment, or a bad relationship with the boss may each push a worker toward quitting, but none of them is a sufficient reason on its own. Resignation is caused by a combination of multiple factors, which may or may not be among the features of this dataset. Each company will also have its own factors and its own ways of rating its workers, so this dataset should be taken as a general overview.

## About the model.

Logistic regression proved to be a good tool for predicting which employees will not quit. However, the imbalance of the dataset makes it hard to predict which employees will quit. To compensate, SMOTE was used to generate synthetic data for the underrepresented employees who quit.

I recommend using this technique carefully because it generates synthetic data around a cluster of existing samples, which is not always appropriate.