# <center style="background-color:#63809e; color:white;">Employee Burn Rate Prediction</center>

<center><img src="https://smallville.com.au/wp-content/uploads/2019/12/10-Questions-To-Ask-Yourself-To-Monitor-Your-Mental-HealthAsset-1@4x-100.jpg" ></center>

<br><br>
## <center style="background-color:#6abada; color:white;">About</center>
<div style="text-align: justify;">This notebook predicts the Burn Rate of employees in an organization during the current pandemic, when working from home is both a boon and a bane. How do the conditions captured in the dataset affect an employee's Burn Rate? We explore the data to understand the mental health of a company's employees, then predict each employee's burn-out rate from the provided features, so that the company can take appropriate measures to protect its employees' health and improve their throughput.</div>
<br>


<div style="text-align: justify;">World Mental Health Day is observed globally on <b>October 10</b> each year. Its objective is to raise awareness of mental health issues around the world and to mobilize efforts in support of mental health. According to an anonymous survey, about <b>450 million</b> people live with mental disorders, which are among the primary causes of poor health and disability worldwide. With the world going through a pandemic, maintaining mental fitness has become especially hard.
 </div>

## <center style="background-color:#6abada; color:white;">Features in our Data</center>

* `Employee ID`: The unique ID allocated for each employee (example: **fffe390032003000**)
* `Date of Joining`: The date on which the employee joined the organization (example: **2008-12-30**)
* `Gender`: The gender of the employee (**Male/Female**)
* `Company Type`: The type of company where the employee is working (**Service/Product**)
* `WFH Setup Available`: Whether a work-from-home facility is available for the employee (**Yes/No**)
* `Designation`: The employee's designation in the organization.
    * In the range of **[0.0, 5.0]**; a larger value means a higher designation.
* `Resource Allocation`: The amount of resource allocated to the employee, i.e. the number of working hours.
    * In the range of **[1.0, 10.0]** (higher means more resource)
* `Mental Fatigue Score`: The level of mental fatigue the employee is facing.
    * In the range of **[0.0, 10.0]** where 0.0 means no fatigue and 10.0 means complete fatigue.
* `Burn Rate`: The value we need to predict for each employee: the rate of burn-out while working.
    * In the range of **[0.0, 1.0]** where a higher value means more burn-out.
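
The documented ranges can double as a sanity check once the data is loaded. The sketch below is a minimal, hypothetical helper (`out_of_range_counts` and the `EXPECTED_RANGES` dict are illustrative names, not part of the original notebook) that counts values falling outside the ranges listed above. It runs on a tiny made-up frame rather than the real dataset.

```python
import pandas as pd

# Ranges copied from the feature list above; the dict name is illustrative.
EXPECTED_RANGES = {
    "Designation": (0.0, 5.0),
    "Resource Allocation": (1.0, 10.0),
    "Mental Fatigue Score": (0.0, 10.0),
    "Burn Rate": (0.0, 1.0),
}

def out_of_range_counts(frame, ranges=EXPECTED_RANGES):
    """Count non-missing values outside [low, high] for each known column."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            values = frame[col].dropna()
            counts[col] = int(((values < low) | (values > high)).sum())
    return counts

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Designation": [0.0, 2.0, 6.0],
    "Resource Allocation": [1.0, None, 10.0],
    "Mental Fatigue Score": [3.5, 10.0, 11.0],
    "Burn Rate": [0.1, 0.5, 1.0],
})
range_report = out_of_range_counts(toy)
print(range_report)
```

Missing values are ignored here on purpose; they are a separate concern from out-of-range values.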

# Getting and Understanding Data

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from scipy import stats 
import scipy.stats as st

# !pip install pandas-profiling
from pandas_profiling import ProfileReport

# !pip install cufflinks
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import os
import re

## Getting Data

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
TRAIN_DATA_URL = "/kaggle/input/are-your-employees-burning-out/train.csv"
TEST_DATA_URL = "/kaggle/input/are-your-employees-burning-out/test.csv"
SAMPLE_DATA_URL = "/kaggle/input/are-your-employees-burning-out/sample_submission.csv"

df = pd.read_csv(TRAIN_DATA_URL)
df_test = pd.read_csv(TEST_DATA_URL)
print(df.shape)
df.tail()

## Profiling Data 

In [None]:
profile = ProfileReport(df, title='Pandas Profiling Report')
profile.to_file("./BurnOut_Profiling.html")
profile.to_widgets()

In [None]:
num_cols = ["Designation", "Resource Allocation", "Mental Fatigue Score", "Burn Rate"]
df[num_cols].iplot()

## Understanding Data

In [None]:
df.describe()

In [None]:
df.info()

## Dealing with missing values

In [None]:
df.isna().sum()

In [None]:
df.dropna(subset = ["Burn Rate"], inplace=True)
print(df.shape)

In [None]:
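# numeric_only=True keeps the median restricted to numeric columns
# (the frame still holds string columns such as Gender at this point).
df = df.fillna(df.median(numeric_only=True))
print("Are there any values missing now? "+str(df.isna().any().any()))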
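
Only the training frame is imputed above. If the test frame also has gaps, one option is to reuse the training medians so that both frames are filled consistently. This is a minimal sketch under that assumption: `fill_with_train_median` is a hypothetical helper and the toy frames are made up.

```python
import numpy as np
import pandas as pd

def fill_with_train_median(train, test, cols):
    """Fill missing values in both frames using medians computed on train only."""
    medians = train[cols].median()
    train_filled = train.copy()
    test_filled = test.copy()
    train_filled[cols] = train_filled[cols].fillna(medians)
    test_filled[cols] = test_filled[cols].fillna(medians)
    return train_filled, test_filled

# Made-up data: the training median of [2, 4, 6] is 4.
toy_train = pd.DataFrame({"Resource Allocation": [2.0, 4.0, np.nan, 6.0]})
toy_test = pd.DataFrame({"Resource Allocation": [np.nan, 8.0]})
filled_train, filled_test = fill_with_train_median(
    toy_train, toy_test, ["Resource Allocation"]
)
print(filled_test)
```

Computing the medians on the training frame alone keeps information from the test set out of the preprocessing step.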

In [None]:
print("Numerical valued features counts:----------", end="\n\n")

print(df["Designation"].value_counts(), end="\n\n")
print(df["Resource Allocation"].value_counts(), end="\n\n")
print(df["Mental Fatigue Score"].value_counts(), end="\n\n")

# Exploratory Data Analysis

In [None]:
sns_plot = sns.pairplot(df, height=2.5)
sns_plot.savefig("pairplot.png")

## Checking Data Normality

In [None]:
def normalize_features(original_data):
    # Box-Cox needs strictly positive input; callers drop values <= 0 first.
    fitted_data, fitted_lambda = stats.boxcox(original_data)
    fig, ax = plt.subplots(1, 2)

    # Plot the original (non-normal) data and the Box-Cox fitted data.
    # sns.kdeplot is used because sns.distplot is deprecated in recent seaborn.
    sns.kdeplot(original_data, fill=True, linewidth=2,
                label="Non-Normal", color="green", ax=ax[0])

    sns.kdeplot(fitted_data, fill=True, linewidth=2,
                label="Normal", color="green", ax=ax[1])

    # Add a legend to each subplot.
    ax[0].legend(loc="upper right")
    ax[1].legend(loc="upper right")

    # Resize the figure.
    fig.set_figheight(5)
    fig.set_figwidth(10)
    return fitted_data

In [None]:
original_data = df.drop(df[df["Mental Fatigue Score"] <= 0.0].index)["Mental Fatigue Score"]
normalize_features(original_data)

In [None]:
original_data = df.drop(df[df["Designation"] <= 0.0].index)["Designation"]
normalize_features(original_data)

In [None]:
original_data = df.drop(df[df["Resource Allocation"] <= 0.0].index)["Resource Allocation"]
normalize_features(original_data)

In [None]:
original_data = df.drop(df[df["Burn Rate"] <= 0.0].index)["Burn Rate"]
normalize_features(original_data)
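
The plots give a visual impression only. A rough numeric complement is to compare skewness before and after the Box-Cox transform. The sketch below uses synthetic log-normal data as a stand-in for a skewed feature; the data and variable names are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed sample (stand-in for a real column).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# boxcox returns the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print("lambda:", fitted_lambda)
print("skewness before:", skew_before, "after:", skew_after)
```

Because `stats.boxcox` only accepts positive values, the cells above drop rows with values `<= 0.0` before calling it.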

# Feature Engineering

## Categorize features

In [None]:
def categorize_designation(data):
    if data["Designation"] <= 1.0:
        return 0
    if data["Designation"] > 1.0 and data["Designation"] <= 2.0:
        return 1
    if data["Designation"] > 2.0 and data["Designation"] <= 5.0:
        return 2
    return -1


def categorize_resource(data):
    if data["Resource Allocation"] <= 3.0:
        return 0
    if data["Resource Allocation"] > 3.0 and data["Resource Allocation"] <= 5.0:
        return 1
    if data["Resource Allocation"] > 5.0 and data["Resource Allocation"] <= 10.0:
        return 2
    return -1
    

def categorize_Mental_Fatigue(data):
    if data["Mental Fatigue Score"] <= 4.0:
        return 0
    if data["Mental Fatigue Score"] > 4.0 and data["Mental Fatigue Score"] <= 5.0:
        return 1
    if data["Mental Fatigue Score"] > 5.0 and data["Mental Fatigue Score"] <= 6.0:
        return 2
    if data["Mental Fatigue Score"] > 6.0 and data["Mental Fatigue Score"] <= 7.0:
        return 3
    if data["Mental Fatigue Score"] > 7.0:
        return 4
    return -1



df["categorize_designation"] = df.apply(categorize_designation, axis=1)
df["categorize_resource"] = df.apply(categorize_resource, axis=1)
df["categorize_Mental_Fatigue"] = df.apply(categorize_Mental_Fatigue, axis=1)

df_test["categorize_designation"] = df_test.apply(categorize_designation, axis=1)
df_test["categorize_resource"] = df_test.apply(categorize_resource, axis=1)
df_test["categorize_Mental_Fatigue"] = df_test.apply(categorize_Mental_Fatigue, axis=1)
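
The row-wise `apply` calls above can be slow on large frames. As a sketch of an alternative, `pd.cut` can produce the same right-closed bins in one vectorized call. `bin_column` is a hypothetical helper, and the toy values are made up to exercise the bin edges.

```python
import numpy as np
import pandas as pd

def bin_column(values, edges):
    """Right-closed bins; values above the last edge (or missing) map to -1."""
    bins = [-np.inf] + list(edges)
    codes = pd.cut(values, bins=bins, labels=False, right=True)
    return codes.fillna(-1).astype(int)

# Same edges as categorize_designation: <=1 -> 0, (1, 2] -> 1, (2, 5] -> 2.
toy_designation = pd.Series([0.0, 1.0, 1.5, 2.0, 3.0, 5.0, 6.0])
designation_bins = bin_column(toy_designation, edges=[1.0, 2.0, 5.0])

# Same edges as categorize_Mental_Fatigue; the last bin is open-ended.
toy_fatigue = pd.Series([3.0, 4.5, 7.0, 9.9])
fatigue_bins = bin_column(toy_fatigue, edges=[4.0, 5.0, 6.0, 7.0, np.inf])

print(designation_bins.tolist())
print(fatigue_bins.tolist())
```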

In [None]:
print("Categorized valued features values:----------", end="\n\n")

print(df["categorize_designation"].value_counts(), end="\n\n")
print(df["categorize_resource"].value_counts(), end="\n\n")
print(df["categorize_Mental_Fatigue"].value_counts(), end="\n\n")

## Date of Joining

In [None]:
current_date = pd.to_datetime('today')

df["Date of Joining"] = pd.to_datetime(df["Date of Joining"])
df_test["Date of Joining"] = pd.to_datetime(df_test["Date of Joining"])

In [None]:
def create_days_count(data):
    return (current_date - data["Date of Joining"])

df["days_count"] = df.apply(create_days_count, axis=1)
df["days_count"] = df["days_count"].dt.days

df_test["days_count"] = df_test.apply(create_days_count, axis=1)
df_test["days_count"] = df_test["days_count"].dt.days
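
Because `current_date` is "today", `days_count` changes every time the notebook runs. A sketch of a reproducible, vectorized variant is shown here; `REFERENCE_DATE` is an assumed cut-off chosen for illustration, not a value from the original notebook, and the toy dates are made up.

```python
import pandas as pd

# Assumed fixed reference date so results do not depend on the run date.
REFERENCE_DATE = pd.Timestamp("2021-01-01")

toy = pd.DataFrame({"Date of Joining": ["2020-12-22", "2020-01-01"]})
toy["Date of Joining"] = pd.to_datetime(toy["Date of Joining"])

# Subtracting a datetime column from a Timestamp yields timedeltas;
# .dt.days extracts whole days without a row-wise apply.
toy["days_count"] = (REFERENCE_DATE - toy["Date of Joining"]).dt.days
print(toy)
```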

## Encoding Features

In [None]:
print(df["Gender"].value_counts(), end="\n\n")
print(df["Company Type"].value_counts(), end="\n\n")
print(df["WFH Setup Available"].value_counts(), end="\n\n")

In [None]:
one = 1
zero = 0

def gender_encoder(data):
    if data["Gender"] == "Female":
        return one
    return zero


def wfh_setup_encoder(data):
    if data["WFH Setup Available"] == "Yes":
        return one
    return zero


def company_encoder(data):
    if data["Company Type"] == "Service":
        return one
    return zero



df["Gender"] = df.apply(gender_encoder, axis=1)
df["WFH Setup Available"] = df.apply(wfh_setup_encoder, axis=1)
df["Company Type"] = df.apply(company_encoder, axis=1)

df_test["Gender"] = df_test.apply(gender_encoder, axis=1)
df_test["WFH Setup Available"] = df_test.apply(wfh_setup_encoder, axis=1)
df_test["Company Type"] = df_test.apply(company_encoder, axis=1)
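
The encoders above silently map any unexpected value (including a missing one) to 0. A dictionary-based sketch makes the mapping explicit and leaves unknown values as `NaN` so they can be spotted. `ENCODINGS` and `encode_binary` are hypothetical names, and the `"Other"` row is made up to show the difference.

```python
import pandas as pd

# Same 1/0 assignments as the encoder functions above.
ENCODINGS = {
    "Gender": {"Female": 1, "Male": 0},
    "WFH Setup Available": {"Yes": 1, "No": 0},
    "Company Type": {"Service": 1, "Product": 0},
}

def encode_binary(frame, encodings=ENCODINGS):
    """Return a copy with each listed column mapped to 1/0 (unknown -> NaN)."""
    out = frame.copy()
    for col, mapping in encodings.items():
        out[col] = out[col].map(mapping)
    return out

toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Other"],
    "WFH Setup Available": ["Yes", "No", "Yes"],
    "Company Type": ["Service", "Product", "Service"],
})
encoded = encode_binary(toy)
print(encoded)
```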

## Normalize Data

In [None]:
norm_cols = ["Designation", "Resource Allocation", "Mental Fatigue Score"]
#              + ["days_count", "categorize_designation", "categorize_resource", "categorize_Mental_Fatigue"]

train_df_min = df[norm_cols].min()
train_df_max = df[norm_cols].max()

df[norm_cols] = (df[norm_cols] - train_df_min)/(train_df_max - train_df_min)
df_test[norm_cols] = (df_test[norm_cols] - train_df_min)/(train_df_max - train_df_min)
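
The scaling above is min-max normalization, x' = (x - min) / (max - min), with `min` and `max` taken from the training frame. One consequence is that test values outside the training range land outside [0, 1]. The sketch below illustrates this with made-up numbers; `fit_min_max` and `apply_min_max` are hypothetical helper names.

```python
import pandas as pd

def fit_min_max(train, cols):
    """Return per-column minimum and maximum computed on the training frame."""
    return train[cols].min(), train[cols].max()

def apply_min_max(frame, cols, col_min, col_max):
    """Scale the given columns with previously fitted min/max values."""
    out = frame.copy()
    out[cols] = (out[cols] - col_min) / (col_max - col_min)
    return out

cols = ["Designation"]
toy_train = pd.DataFrame({"Designation": [0.0, 2.0, 5.0]})
toy_test = pd.DataFrame({"Designation": [1.0, 6.0]})  # 6.0 exceeds the train max

col_min, col_max = fit_min_max(toy_train, cols)
scaled_train = apply_min_max(toy_train, cols, col_min, col_max)
scaled_test = apply_min_max(toy_test, cols, col_min, col_max)
print(scaled_test)
```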

In [None]:
df.head()

## Removing useless columns

In [None]:
df.drop(['Date of Joining', "Employee ID"], axis=1, inplace=True)
clean_df_test = df_test.drop(['Date of Joining', "Employee ID"], axis=1)

# Understand Correlation

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
plt.savefig("correlation_heatmap.png")
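
A heatmap of the full matrix can be hard to scan for the one column that matters here. As a sketch, the features can be ranked by the absolute value of their correlation with the target. `rank_by_target_correlation` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

def rank_by_target_correlation(frame, target):
    """Pearson correlation of each numeric feature with target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

toy = pd.DataFrame({
    "Mental Fatigue Score": [1.0, 2.0, 3.0, 4.0],
    "WFH Setup Available": [1, 0, 1, 0],
    "Burn Rate": [0.2, 0.4, 0.6, 0.8],
})
ranked = rank_by_target_correlation(toy, "Burn Rate")
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top instead of at the bottom of the list.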

In [None]:
# df = df.loc[:, ["WFH Setup Available", "Designation", "Resource Allocation", "Mental Fatigue Score", "Burn Rate"]]
# clean_df_test = df_test.loc[:, ["WFH Setup Available", "Designation", "Resource Allocation", "Mental Fatigue Score"]]

## Working with clean data

In [None]:
clean_df = df.copy()

df.to_csv("clean_df_train.csv", index=False)
train_file_path = "./clean_df_train.csv"
new_df = pd.read_csv(train_file_path)

clean_df_test.to_csv("clean_df_test.csv", index=False)
test_file_path = "./clean_df_test.csv"
new_df_test = pd.read_csv(test_file_path)

new_df_test.head()

In [None]:
from sklearn.model_selection import train_test_split

X = clean_df.drop(columns="Burn Rate")
y = clean_df["Burn Rate"]  # 1-D target avoids column-vector warnings in scikit-learn

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Model Training and Predictions

<center><img src="https://media.giphy.com/media/JstFYY8FwlBm48n7De/giphy.gif" width=70%></center>

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from catboost import CatBoostRegressor
from sklearn.neural_network import MLPRegressor

from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import RandomizedSearchCV

import xgboost


from sklearn.metrics import r2_score

In [None]:
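def print_r2_score(y_train, train_pred, y_test, test_pred):
    # Report R^2 (as a percentage) on the train split and the held-out split.
    # The helper is shared by every model below, so the labels are model-agnostic.
    r2_train = r2_score(y_train, train_pred)
    print("R2 Train: "+str(round(100*r2_train, 4))+" %")

    r2_test = r2_score(y_test, test_pred)
    print("R2 Test: "+str(round(100*r2_test, 4))+" %")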
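
For reference, the R² reported by `r2_score` is 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below recomputes it by hand with NumPy on made-up numbers; `r2_manual` is a hypothetical name.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]

# SS_res = 4 * 0.05**2 = 0.01 and SS_tot = 0.2, so R^2 = 0.95.
r2_value = r2_manual(y_true, y_pred)
# Always predicting the mean gives R^2 = 0.
r2_mean_baseline = r2_manual(y_true, [0.5, 0.5, 0.5, 0.5])
print(r2_value, r2_mean_baseline)
```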

In [None]:
sub = pd.read_csv(TEST_DATA_URL)
sub = sub.loc[:, ["Employee ID"]]

## Linear Regression

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

train_pred_linear = lr_model.predict(X_train)
test_pred_linear = lr_model.predict(X_test)
print_r2_score(y_train, train_pred_linear, y_test, test_pred_linear)

lr_main_pred = lr_model.predict(clean_df_test)

sub["Burn Rate"] = lr_main_pred
sub.to_csv('submission_lr.csv', index=False)
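
Unconstrained regressors such as linear regression can predict values outside the documented `Burn Rate` range of [0.0, 1.0]. One possible post-processing step, not used in the original notebook, is to clip predictions before writing a submission. `clip_to_target_range` is a hypothetical helper and the raw predictions are made up.

```python
import numpy as np

def clip_to_target_range(predictions, low=0.0, high=1.0):
    """Flatten predictions to 1-D and clip them into [low, high]."""
    return np.clip(np.asarray(predictions, dtype=float).ravel(), low, high)

# Made-up model outputs, shaped (n, 1) as some regressors return them.
raw_pred = np.array([[-0.05], [0.42], [1.10]])
clipped_pred = clip_to_target_range(raw_pred)
print(clipped_pred)
```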

## Ridge

In [None]:
ridge_model = Ridge()
ridge_model.fit(X_train, y_train)

train_pred_ridge = ridge_model.predict(X_train)
test_pred_ridge = ridge_model.predict(X_test)
print_r2_score(y_train, train_pred_ridge, y_test, test_pred_ridge)

ridge_main_pred = ridge_model.predict(clean_df_test)

sub["Burn Rate"] = ridge_main_pred
sub.to_csv('submission_ridge.csv', index=False)

## Lasso

In [None]:
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

train_pred_lasso = lasso_model.predict(X_train)
test_pred_lasso = lasso_model.predict(X_test)
print_r2_score(y_train, train_pred_lasso, y_test, test_pred_lasso)

lasso_main_pred = lasso_model.predict(clean_df_test)

sub["Burn Rate"] = lasso_main_pred
sub.to_csv('submission_lasso.csv', index=False)

## Elastic net

In [None]:
elastic_model = ElasticNet()
elastic_model.fit(X_train, y_train)

train_pred_elastic = elastic_model.predict(X_train)
test_pred_elastic = elastic_model.predict(X_test)
print_r2_score(y_train, train_pred_elastic, y_test, test_pred_elastic)

elastic_main_pred = elastic_model.predict(clean_df_test)

sub["Burn Rate"] = elastic_main_pred
sub.to_csv('submission_elastic.csv', index=False)

## SVR

In [None]:
svr_model = SVR(C=1, gamma=1e-6)
svr_model.fit(X_train, y_train)

train_pred_svr = svr_model.predict(X_train)
test_pred_svr = svr_model.predict(X_test)
print_r2_score(y_train, train_pred_svr, y_test, test_pred_svr)

svr_main_pred = svr_model.predict(clean_df_test)

sub["Burn Rate"] = svr_main_pred
sub.to_csv('submission_svr.csv', index=False)

## Random Forest Regression

In [None]:
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)

train_pred_rf = rf_model.predict(X_train)
test_pred_rf = rf_model.predict(X_test)
print_r2_score(y_train, train_pred_rf, y_test, test_pred_rf)

rf_main_pred = rf_model.predict(clean_df_test)

sub["Burn Rate"] = rf_main_pred
sub.to_csv('submission_rf.csv', index=False)

## XGB (with tuning)

In [None]:
params = {  
    "n_estimators": range(1, 500, 50),
    "max_depth": range(1, 20, 2),
    "learning_rate": st.uniform(0.1, 0.9)     
}

xgbreg = xgboost.XGBRegressor(nthread=-1, objective='reg:squarederror', seed=42)  
gs = RandomizedSearchCV(xgbreg,params,n_jobs=-1, n_iter=15, cv=10, verbose=3, random_state=42)  
gs.fit(X_train, y_train) 
xgb_best_params = gs.best_params_
print(xgb_best_params, end="\n\n")

# Refit a fresh XGBRegressor with the best parameters found by the search.
xgb_model = xgboost.XGBRegressor(
    n_estimators=xgb_best_params["n_estimators"],
    max_depth=xgb_best_params["max_depth"],
    learning_rate=xgb_best_params["learning_rate"])

xgb_model.fit(X_train, y_train)

train_pred_xgb = xgb_model.predict(X_train)
test_pred_xgb = xgb_model.predict(X_test)
print_r2_score(y_train, train_pred_xgb, y_test, test_pred_xgb)

xgb_main_pred = xgb_model.predict(clean_df_test)

# Submit the XGBoost predictions.
sub["Burn Rate"] = xgb_main_pred
sub.to_csv('submission_xgb.csv', index=False)
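
The search space is easy to misread: `st.uniform(loc, scale)` covers `[loc, loc + scale]`, so `st.uniform(0.1, 0.9)` samples learning rates from 0.1 up to 1.0, and `range(1, 500, 50)` yields 1, 51, ..., 451. The sketch below checks both. The log-uniform distribution at the end is a hypothetical alternative with assumed bounds, not something the original search uses.

```python
import scipy.stats as st

# Same distribution as in the search: loc=0.1, scale=0.9 -> support [0.1, 1.0].
lr_dist = st.uniform(0.1, 0.9)
lr_low, lr_high = lr_dist.ppf(0.0), lr_dist.ppf(1.0)
lr_samples = lr_dist.rvs(size=1000, random_state=42)
print("learning_rate support:", lr_low, lr_high)

# Same grid as in the search.
n_estimators_grid = list(range(1, 500, 50))
print("n_estimators candidates:", n_estimators_grid)

# Hypothetical alternative (assumed bounds): favour smaller learning rates.
log_lr_dist = st.loguniform(0.01, 0.3)
log_lr_samples = log_lr_dist.rvs(size=1000, random_state=42)
```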

## AdaBoostRegressor

In [None]:
abr_model = AdaBoostRegressor() 
abr_model.fit(X_train, y_train)

train_pred_abr = abr_model.predict(X_train)
test_pred_abr = abr_model.predict(X_test)
print_r2_score(y_train, train_pred_abr, y_test, test_pred_abr)

abr_main_pred = abr_model.predict(clean_df_test)

sub["Burn Rate"] = abr_main_pred
sub.to_csv('submission_abr.csv', index=False)

## CatBoostRegressor

In [None]:
cat_model = CatBoostRegressor()
cat_model.fit(X_train, y_train)

train_pred_cat = cat_model.predict(X_train)
test_pred_cat = cat_model.predict(X_test)
print_r2_score(y_train, train_pred_cat, y_test, test_pred_cat)

cat_main_pred = cat_model.predict(clean_df_test)

sub["Burn Rate"] = cat_main_pred
sub.to_csv('submission_cat.csv', index=False)

## GradientBoostingRegressor

In [None]:
gbr_model = GradientBoostingRegressor() 
gbr_model.fit(X_train, y_train)

train_pred_gbr = gbr_model.predict(X_train)
test_pred_gbr = gbr_model.predict(X_test)
print_r2_score(y_train, train_pred_gbr, y_test, test_pred_gbr)

gbr_main_pred = gbr_model.predict(clean_df_test)

sub["Burn Rate"] = gbr_main_pred
sub.to_csv('submission_gbr.csv', index=False)

## MLPRegressor

In [None]:
mlp_model = MLPRegressor(random_state=42) 
mlp_model.fit(X_train, y_train)

train_pred_mlp = mlp_model.predict(X_train)
test_pred_mlp = mlp_model.predict(X_test)
print_r2_score(y_train, train_pred_mlp, y_test, test_pred_mlp)

mlp_main_pred = mlp_model.predict(clean_df_test)

sub["Burn Rate"] = mlp_main_pred
sub.to_csv('submission_mlp.csv', index=False)

## StackingRegressor

In [None]:
estimators = [('lr', LinearRegression()),
              ('ridge', Ridge()), 
              ('rf', RandomForestRegressor()),
              ('xgb', xgboost.XGBRegressor(nthread=-1, learning_rate=0.1185260448662222, max_depth=3, n_estimators=351)),
              ('mlp', MLPRegressor()),
              ('ada', AdaBoostRegressor()),
              ('gbr', GradientBoostingRegressor()),
              ('cat', CatBoostRegressor())]


stacking_model = StackingRegressor(estimators=estimators, final_estimator=GradientBoostingRegressor(random_state=42))
stacking_model.fit(X_train, y_train)

train_pred_stacking = stacking_model.predict(X_train)
test_pred_stacking = stacking_model.predict(X_test)
print_r2_score(y_train, train_pred_stacking, y_test, test_pred_stacking)

stacking_main_pred = stacking_model.predict(clean_df_test)

sub["Burn Rate"] = stacking_main_pred
sub.to_csv('submission_stacking.csv', index=False)
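
The model sections above repeat the same fit, predict, and score pattern. As a sketch, the comparison can be collected into one table with a small loop. `compare_models` is a hypothetical helper, and the data here is synthetic, so the scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def compare_models(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and return train/test R^2, best test score first."""
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        rows.append({
            "model": name,
            "r2_train": r2_score(y_tr, model.predict(X_tr)),
            "r2_test": r2_score(y_te, model.predict(X_te)),
        })
    table = pd.DataFrame(rows).sort_values("r2_test", ascending=False)
    return table.reset_index(drop=True)

# Synthetic, nearly linear data used only to exercise the helper.
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(200, 3))
y_demo = X_demo @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.01, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scores = compare_models(
    {"lr": LinearRegression(), "ridge": Ridge()}, X_tr, y_tr, X_te, y_te
)
print(scores)
```

Any estimator with `fit` and `predict` can be added to the dictionary, including the boosted and stacked models used in this notebook.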

In [None]:
# !pip install tensor-dash

# from tensordash.tensordash import Tensordash
# histories = Tensordash(
#     ModelName = 'burnout-1')

In [None]:
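from keras.models import Sequential
from keras.layers import Dense
# KerasRegressor is not used below, and keras.wrappers.scikit_learn may be
# missing from recent Keras releases, so this import is left commented out.
# from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline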

In [None]:
model = Sequential()
model.add(Dense(4, input_dim=10, kernel_initializer='normal', activation='relu'))
model.add(Dense(2670, activation='relu'))
model.add(Dense(1, activation='linear'))
model.summary()

In [None]:
model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
history=model.fit(clean_df.loc[:, clean_df.columns != "Burn Rate"], 
                  clean_df.loc[:, clean_df.columns == "Burn Rate"], 
                  epochs=100, 
                  batch_size=150, 
                  verbose=1, 
                  validation_split=0.08)

neural_main_pred = model.predict(clean_df_test)

In [None]:
sub["Burn Rate"] = neural_main_pred
sub.to_csv('submission_neural.csv', index=False)

In [None]:
print(history.history.keys())
# "Loss"
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

<center><h2>Project Under Development</h2></center>
<img src="https://cdn1.iconfinder.com/data/icons/construction-220/64/43-512.png" width=100 height=100>
<center><h4>I hope it was helpful!!</h4></center>
