## **Employee Attrition Problem:**

The key to success in any organization is attracting and retaining top talent. One of the key tasks is therefore to prevent employees from leaving the company. We will use machine learning to predict which employees are going to leave, so that the company can try to prevent it.

Data Description:

1. **status** – Current employment status (Employed / Left)
2. **department** – Department employees belong(ed) to
3. **salary** – Salary level relative to rest of their department
4. **tenure** – Number of years at the company
5. **recently_promoted** – Was the employee promoted in the last 3 years?
6. **n_projects** – Number of projects employee is staffed on
7. **avg_monthly_hrs** – Average number of hours worked per month
8. **satisfaction** – Score for employee’s satisfaction with the company (higher is better)
9. **last_evaluation** – Score for most recent evaluation of employee (higher is better)
10. **filed_complaint** – Has the employee filed a formal complaint in the last 3 years?


# Importing Python Libraries

In [None]:
from sklearn import metrics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier

# **Importing Dataset and Performing Descriptive Statistics**

In [None]:
# Load the employee dataset
data=pd.read_csv('../input/predicting-employee-status/employee_data (1).csv')

In [None]:
#First few rows of the dataframe
data.head()

In [None]:
# Get the data type of every column
data.dtypes

In [None]:
# Summary statistics for each column to understand the data better
data.describe(include='all')

# **Duplicate Rows**

Before we start working on the data, let's make sure there are no duplicate rows in our dataset.

In [None]:
# Count the duplicate rows, then view them
print(data.duplicated().sum())
data[data.duplicated()]

**Observation:**

We can see that most of the duplicated rows contain null values, so they are probably not genuine duplicates. We won't remove them.
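
One way to back up this observation is to split the duplicate rows into those that contain missing values and those that are fully populated. The sketch below uses a small made-up frame (the values and the `toy` name are assumptions, not the real dataset); pandas treats missing values in the same position as equal when it looks for duplicates, which is why such rows are flagged.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one duplicate pair with NaN cells
# and one duplicate pair that is fully populated.
toy = pd.DataFrame({
    "department": ["sales", "sales", np.nan, np.nan, "IT"],
    "tenure": [3.0, 3.0, np.nan, np.nan, 5.0],
})

dup_mask = toy.duplicated()           # True for the second and later copies of a row
has_null = toy.isnull().any(axis=1)   # True if any cell in the row is missing

dups_with_nulls = int((dup_mask & has_null).sum())
dups_complete = int((dup_mask & ~has_null).sum())
print("duplicates containing nulls:", dups_with_nulls)
print("fully populated duplicates:", dups_complete)
```

Running the same two masks on `data` would show how many of the flagged rows are complete records, which are the ones more likely to be true duplicates.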

# **Handle Missing Values**

Let's check whether we have missing cells in our dataset.

In [None]:
# Finding number of null values in individual column
data.isnull().sum()

**Overall Observations Missing Value Analysis:**

Below are the columns with missing values in our dataset:

1. department
2. filed_complaint
3. last_evaluation
4. recently_promoted
5. satisfaction
6. tenure

Let's plot the missing values.

In [None]:
nullTable=round((data.isnull().sum()/data.shape[0])*100,2)
nullValueCols=pd.DataFrame(nullTable,columns=['Missing Value %'])
nullValueCols.reset_index(inplace=True)
nullValueCols.rename(columns={'index': 'Column Name'},inplace=True)
# Show only the columns that actually have missing values
print(nullValueCols[nullValueCols['Missing Value %']!=0])

sns.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='plasma')
plt.title("Heat map plotting the missing values in the columns")
plt.show()

**Fill Missing Values in department:**

Let's look at the values in this column to decide how to fill the missing entries.


In [None]:
plt.figure(figsize = (10, 8))
Attrplot=sns.countplot(x = 'department', data = data)
plt.title("Countplot for department Column")
plt.show()

The department column is categorical and contains several different departments. Some employees have no department value, perhaps because they do not belong to a specific department, so we will assign them a new department type called 'other'.

In [None]:
# filling all null values with new department type 'other'
data['department'] = data['department'].fillna('other')

**Fill Missing Values in filed_complaint and recently_promoted:**

Let's look at the values in these columns to decide how to fill the missing entries.

In [None]:
plt.figure(figsize = (10, 8))
Attrplot=sns.countplot(x = 'filed_complaint', data = data)
plt.title("Countplot for filed_complaint Column")
plt.show()

plt.figure(figsize = (10, 8))
Attrplot=sns.countplot(x = 'recently_promoted', data = data)
plt.title("Countplot for recently_promoted Column")
plt.show()

filed_complaint indicates whether the employee filed a formal complaint in the last 3 years. As the plot shows, the only recorded value is 1, for employees who filed a complaint, so we assume the missing values belong to employees who did not file one and fill them with 0. recently_promoted behaves the same way: only recently promoted employees are marked with 1, so the missing values are filled with 0 as well.

In [None]:
# filling all null values of filed_complaint and recently_promoted with 0
data['filed_complaint'] = data['filed_complaint'].fillna(0)
data['recently_promoted'] = data['recently_promoted'].fillna(0)

**Fill Missing Values in last_evaluation & satisfaction & tenure:**

Let's look at the values in these columns to decide how to fill the missing entries.

In [None]:
plt.figure(figsize = (10, 8))
Attrplot=sns.countplot(x = 'last_evaluation', data = data.sample(n=1000))
plt.title("Countplot for last_evaluation Column")
plt.show()

plt.figure(figsize = (10, 8))
Attrplot=sns.countplot(x = 'satisfaction', data = data.sample(n=1000))
plt.title("Countplot for satisfaction Column")
plt.show()

plt.figure(figsize = (10, 8))
Attrplot=sns.countplot(x = 'tenure', data = data.sample(n=1000))
plt.title("Countplot for tenure Column")
plt.show()

**Observation:**

As we can see in the plots, all of these fields are numeric and reasonably well distributed, so we will fill their missing values with the column mean.

In [None]:
data['last_evaluation'] = data['last_evaluation'].fillna(data['last_evaluation'].mean())
data['satisfaction'] = data['satisfaction'].fillna(data['satisfaction'].mean())
data['tenure'] = data['tenure'].fillna(data['tenure'].mean())
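
Mean filling hides which rows were originally missing. As an optional variation (not part of the original workflow), a flag column can record that before the fill. The column name `satisfaction_missing` and the toy values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing entries
toy = pd.DataFrame({"satisfaction": [0.2, np.nan, 0.8, 0.5, np.nan]})

# Remember which rows were missing before the fill (hypothetical column name)
toy["satisfaction_missing"] = toy["satisfaction"].isnull().astype(int)

# Series.mean() skips NaN by default, so the mean uses the observed values only
col_mean = toy["satisfaction"].mean()
toy["satisfaction"] = toy["satisfaction"].fillna(col_mean)
print(toy)
```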

**Verify That There Are No Missing Values:**

In [None]:
sns.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='plasma')
plt.title("Heat map plotting the missing values in the columns")
plt.show()

# **Outliers**

Let's plot our numerical columns to see if we have any outliers.

In [None]:
NewNumeric=data[['last_evaluation','n_projects','satisfaction','tenure']]
NewNumericMelt=NewNumeric.melt()
plt.figure(figsize=(15,10))
plt.title("Boxplots for Numerical variables")
bp=sns.boxplot(x='variable',y='value',data=NewNumericMelt)
bp.set_xticklabels(bp.get_xticklabels(),rotation=90)
plt.show()
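
By default, seaborn's boxplot draws the whiskers at 1.5 times the interquartile range (IQR) and marks points beyond them as outliers. The same rule can be computed by hand, as in this minimal sketch on made-up tenure values (the numbers are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical tenure values, in years
tenure = pd.Series([2, 3, 3, 3, 4, 4, 5, 6, 8, 10], name="tenure")

q1, q3 = tenure.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are what the boxplot draws as individual dots
outliers = tenure[(tenure < lower_fence) | (tenure > upper_fence)]
print("upper fence:", upper_fence)
print("outliers:", outliers.tolist())
```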

**Observation:**

We have some outliers in the tenure column. Most people stay at the company for 3-4 years, but some have stayed more than 6 years, or even 10 (maybe the founders). Let's check what share of the dataset these outliers represent:

In [None]:
100 * (data["tenure"] > 6).sum() / data.shape[0]

As we can see, the outliers make up 3.76% of our data. Let's see how many of the people who stayed more than 6 years left the company.

In [None]:
above6years = data[data['tenure'] > 6]
(above6years['status'] == 'Left').sum()

No one in this group left the company. Since our goal is to find the people who leave, we will remove these outliers because they will not help us much.

In [None]:
# keep only the employees who have been at the company for at most 6 years
data = data[data['tenure'] <= 6]

Now let's make sure we don't have any outliers left:

In [None]:
NewNumeric=data[['last_evaluation','n_projects','satisfaction','tenure']]
NewNumericMelt=NewNumeric.melt()
plt.figure(figsize=(15,10))
plt.title("Boxplots for Numerical variables")
bp=sns.boxplot(x='variable',y='value',data=NewNumericMelt)
bp.set_xticklabels(bp.get_xticklabels(),rotation=90)
plt.show()

# **Imbalance Analysis**

Before we start modeling, let's make sure the data is balanced and that we have enough cases of people who left the company. If not, we could use SMOTE to create more samples.




In [None]:
plt.figure(figsize = (10, 8))
Attrplot=sns.countplot(x = 'status', data = data)
plt.title("Countplot for status Column")
plt.show()

**Observation:**

Good for us, the data is not severely imbalanced: about a third of the rows describe employees who left the company. We are good to go.
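
The class shares can also be read off numerically instead of from the plot. This is a minimal sketch on a made-up `status` column (the 7-to-3 split is an assumption for illustration, not the real counts):

```python
import pandas as pd

# Hypothetical status column: 7 employed, 3 left
status = pd.Series(["Employed"] * 7 + ["Left"] * 3, name="status")

# normalize=True returns proportions instead of raw counts
shares = status.value_counts(normalize=True)
print(shares)
```

On the real data the equivalent call would be `data['status'].value_counts(normalize=True)`.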

# **Create Feature and Target Matrices**

In [None]:
# Convert 'Left' to 1 and 'Employed' to 0
Y = np.where(data['status'].values == 'Left', 1, 0)

# dropping the target column and create the matrix of features
X = data.drop(['status'], axis=1)

# **Categorical Analysis**

In [None]:
# view the number of unique values in each column
features=data.columns
for i in features:
    uniqueValues=data[i].nunique()
    print(i,uniqueValues)
    
data.dtypes

**Observation:**

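Our categorical columns are:
1. department
2. filed_complaint (already 0 and 1)
3. recently_promoted (already 0 and 1)
4. salary

department and salary are stored as objects, while filed_complaint and recently_promoted are already numeric 0/1 flags and all the other columns are numeric. get_dummies will therefore encode only department and salary and leave the rest unchanged, so we can run it.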

In [None]:
# one-hot encode the categorical columns into separate indicator columns
X = pd.get_dummies(X, drop_first=True)
X.head()
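
To see what `get_dummies` with `drop_first=True` does, here is a small sketch on a made-up frame (the values are assumptions). Object columns are expanded into indicator columns, the first category of each is dropped to avoid a redundant column, and numeric columns pass through untouched.

```python
import pandas as pd

# Toy frame (hypothetical values) mixing object and numeric columns
toy = pd.DataFrame({
    "department": ["sales", "IT", "other"],
    "salary": ["low", "medium", "high"],
    "filed_complaint": [0.0, 1.0, 0.0],
    "tenure": [3.0, 4.0, 2.0],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

The dropped category of each column (here `IT` and `high`, the first in sorted order) becomes the baseline that is implied when all of that column's indicators are 0.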

# **Numeric Analysis**

**Observation:**

Our numeric columns are:
1. avg_monthly_hrs
2. last_evaluation
3. n_projects
4. satisfaction
5. tenure

We will scale these columns with StandardScaler, which transforms each value as (x - mean) / std.

In [None]:
num_cols=['avg_monthly_hrs','last_evaluation','n_projects','satisfaction','tenure']
scaler = StandardScaler()
X[num_cols] = scaler.fit_transform(X[num_cols])
X.head()
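
As a quick check of what StandardScaler computes, the sketch below compares its output with the manual formula on a few made-up values (the numbers are assumptions). Note that StandardScaler uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical average monthly hours, as a single-column matrix
x = np.array([[150.0], [200.0], [250.0], [300.0]])

scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean(axis=0)) / x.std(axis=0)   # population std (ddof=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

In a stricter setup the scaler would be fitted on the training split only and then applied to the test split, so that no test statistics leak into training.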

# **Train Test Split**

We will split our data into train and test sets so we can verify performance on unseen rows. Since only about 30% of the rows refer to people who left the company, we split each class separately so that the train and test sets keep the same class proportions.

In [None]:
# train test split


y_left = Y[Y == 1]
x_left = X[Y == 1]
y_stay = Y[Y == 0]
x_stay = X[Y == 0]

x_train_left, x_test_left, y_train_left, y_test_left = train_test_split(x_left,y_left , test_size = .25, random_state=45)
x_train_stay, x_test_stay, y_train_stay, y_test_stay = train_test_split(x_stay,y_stay , test_size = .25, random_state=45)
x_train = np.concatenate((x_train_left, x_train_stay), axis=0)
y_train = np.concatenate((y_train_left, y_train_stay), axis=0)
x_test = np.concatenate((x_test_left, x_test_stay), axis=0)
y_test = np.concatenate((y_test_left, y_test_stay), axis=0)
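
The per-class split above can also be expressed with the `stratify` argument of `train_test_split`, which keeps the class proportions in both sets in a single call. This is an alternative sketch on synthetic data (`X_toy` and `y_toy` are made-up stand-ins for `X` and `Y`), not the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 200 rows, 30% positive class
rng = np.random.default_rng(45)
X_toy = rng.normal(size=(200, 4))
y_toy = np.array([1] * 60 + [0] * 140)

# stratify=y_toy preserves the 30/70 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=45, stratify=y_toy
)
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

Unlike the concatenation approach, the stratified call also shuffles the two classes together, so the rows are not ordered by class.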

# **Model Building**

For every tuned model, I first run RandomizedSearchCV to find the general range of good parameters from a wide range of candidates. Then I run GridSearchCV to find the best parameters from a narrow range by trying every combination.
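
The two-stage idea can be sketched end to end on synthetic data. Everything below is illustrative: the data comes from `make_classification`, and the parameter ranges and the names `wide`, `narrow`, `coarse` and `fine` are assumptions rather than the values used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem so the example runs quickly
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Stage 1: sample a handful of combinations from a wide range
wide = {"max_depth": np.arange(2, 20), "min_samples_split": np.arange(2, 40)}
coarse = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0), wide,
    n_iter=10, cv=3, scoring="recall", random_state=0,
)
coarse.fit(X_toy, y_toy)
best_depth = int(coarse.best_params_["max_depth"])
best_split = int(coarse.best_params_["min_samples_split"])

# Stage 2: try every value in a narrow band around the stage-1 winner
narrow = {
    "max_depth": list(range(max(1, best_depth - 2), best_depth + 3)),
    "min_samples_split": [best_split],
}
fine = GridSearchCV(
    DecisionTreeClassifier(random_state=0), narrow, cv=3, scoring="recall"
)
fine.fit(X_toy, y_toy)
print(coarse.best_params_)
print(fine.best_params_)
```

The randomized stage keeps the cost fixed at `n_iter` fits per fold no matter how wide the ranges are, while the grid stage is exhaustive only over the small neighbourhood.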

**Goal:**
Our main goal is to catch as many as possible of the people who are going to leave the company. It is better to find more of them at the cost of some false alarms, meaning recall is more important to us than precision.
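
To make the trade-off concrete, here is a small sketch that computes recall and precision from a confusion matrix. The labels are made up (1 stands for an employee who left); the manual formulas are checked against scikit-learn's own scorers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = left the company)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)       # share of actual leavers that were caught
precision = tp / (tp + fp)    # share of predicted leavers that really left

print("recall:", recall, "precision:", precision)
```

In this toy case the model catches 3 of the 4 leavers at the price of 2 false alarms, which is the kind of trade this notebook is willing to make.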

# **DecisionTree**

First, let's run a randomized search to find the general range of the best parameters.

In [None]:
# random search over the hyperparameters given in param_grid
dt=DecisionTreeClassifier()
param_grid={
    'criterion':['gini','entropy'],
    'max_depth':np.arange(4,20,1),
    'min_samples_split':np.arange(0.001,0.1,0.01),
    'max_features':['log2','sqrt'],  # 'auto' was an alias for 'sqrt' and was removed in scikit-learn 1.3
    'min_weight_fraction_leaf':np.arange(0.001,0.25,0.05)
}
r_search=RandomizedSearchCV(dt,param_distributions=param_grid,n_iter=10,verbose=1)
r_search.fit(x_train,y_train)

In [None]:
#getting best performing hyper parameters from random search 
r_search.best_params_

Now let's run GridSearchCV for specific parameters to get the best values

In [None]:
param_grid = {'min_weight_fraction_leaf': np.arange(0.001,0.01,0.001),
 'min_samples_split': np.arange(0.07,0.12,0.01),
 'max_features': ['sqrt'],  # 'auto' was an alias for 'sqrt' and was removed in scikit-learn 1.3
 'max_depth': np.arange(10,20,1),
 'criterion': ['entropy']}
dt=DecisionTreeClassifier()
grid_search=GridSearchCV(estimator=dt,param_grid = param_grid,cv=5,verbose=1,n_jobs=-1)
grid_search.fit(x_train,y_train)
print(grid_search.best_params_)  # best parameters found by the grid search
y_predictions = grid_search.best_estimator_.predict(x_test)
print(classification_report(y_test,y_predictions))

In [None]:
# 'sqrt' is what the removed 'auto' option meant for classifiers
dt=DecisionTreeClassifier(criterion= 'entropy',max_depth= 17,max_features= 'sqrt',min_samples_split= 0.07,min_weight_fraction_leaf= 0.006)
dt.fit(x_train,y_train)
y_predictions = dt.predict(x_test)
print(classification_report(y_test,y_predictions))

# **Random Forest**

First, let's run a randomized search to find the general range of the best parameters.

In [None]:
dt=RandomForestClassifier()
param_grid={
    'criterion':['gini','entropy'],
    'max_depth':np.arange(4,20,1),
    'min_samples_split':np.arange(0.001,0.1,0.01),
    'max_features':['log2','sqrt'],  # 'auto' was an alias for 'sqrt' and was removed in scikit-learn 1.3
    'min_weight_fraction_leaf':np.arange(0.001,0.25,0.05),
    'n_estimators': np.arange(50,500,50)
}
r_search=RandomizedSearchCV(dt,param_distributions=param_grid,n_iter=50,verbose=1)
r_search.fit(x_train,y_train)
r_search.best_params_


Now let's run GridSearchCV for specific parameters to get the best values

In [None]:
param_grid = {'min_weight_fraction_leaf': np.arange(0.001,0.005,0.001),
 'min_samples_split': np.arange(0.01,0.06,0.01),
 'max_features': ['sqrt'],  # 'auto' was an alias for 'sqrt' and was removed in scikit-learn 1.3
 'max_depth': np.arange(15,20,1),
 'criterion': ['entropy','gini'],
 'n_estimators': [100]}
dt=RandomForestClassifier()
grid_search=GridSearchCV(estimator=dt,param_grid = param_grid,cv=5,verbose=1,n_jobs=-1, scoring='recall')
grid_search.fit(x_train,y_train)
print(grid_search.best_params_)  # best parameters found by the grid search
m_best = grid_search.best_estimator_
rf_predictions_val_y=m_best.predict(x_test)
print(classification_report(y_test,rf_predictions_val_y))


In [None]:
clf = RandomForestClassifier(max_depth=18, random_state=45)
clf.fit(x_train, y_train)
rf_predictions_val_y=clf.predict(x_test)
print(classification_report(y_test,rf_predictions_val_y))

# **Gradient Boosting**

In [None]:
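# max_depth is declared once: a repeated dict key silently keeps only its last value
param_grid = {'learning_rate': np.arange(0.1,1,0.05),
 'max_features': ['sqrt', 'log2'],  # 'auto' was an alias for 'sqrt' and was removed in scikit-learn 1.3
 'max_depth': np.arange(15,20,1),
 'n_estimators': np.arange(80,150,20)}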
gradient_boosting=GradientBoostingClassifier()
grid_search=GridSearchCV(estimator=gradient_boosting,param_grid = param_grid,cv=5,verbose=1,n_jobs=-1, scoring='recall')
grid_search.fit(x_train,y_train)
print(grid_search.best_params_)  # best parameters found by the grid search
m_best = grid_search.best_estimator_
rf_predictions_val_y=m_best.predict(x_test)
print(classification_report(y_test,rf_predictions_val_y))

In [None]:
gradient_boosting=GradientBoostingClassifier(learning_rate= 0.15,max_depth= 8,max_features= 'log2',n_estimators= 100)
gradient_boosting.fit(x_train, y_train)
rf_predictions_val_y=gradient_boosting.predict(x_test)
print(classification_report(y_test,rf_predictions_val_y))

# **AdaBoost**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=10, random_state=0, learning_rate= 1)
clf.fit(x_train, y_train)
# predict with the AdaBoost model that was just fitted
rf_predictions_val_y=clf.predict(x_test)
print(classification_report(y_test,rf_predictions_val_y))
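
Since recall on the "Left" class is the metric that matters here, it can help to collect it for all models in one place. The sketch below does this on synthetic data; the dataset, the hyperparameters and the `models` dictionary are assumptions for illustration and do not reproduce the tuned models above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the employee data: roughly 30% positive class
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=45
)

# Hypothetical settings, chosen only to keep the example fast
models = {
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=10, random_state=0),
}

recalls = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # recall_score defaults to the positive class (label 1)
    recalls[name] = recall_score(y_te, model.predict(X_te))

for name, value in sorted(recalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: recall = {value:.3f}")
```

Each model predicts with its own fitted instance inside the loop, which avoids accidentally scoring one model with another model's predictions.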