# What we will do in this notebook

- Data cleaning
- Basic EDA
- Visualization
- Try to understand why people left
- Profiling the typical employee who left.
- Profiling the atypical employee who left.
- Regression analysis.
- Random Forest classification.

Plus, we got the <font color=deepskyblue> highest </font> classification score so far  <font color=deepskyblue> : ) </font>

## This notebook is still under construction.
## Feel free to <font color=deepskyblue> FORK  </font> this notebook, Please  <font color=deepskyblue> UPVOTE !! </font> if it's helpful to you  <font color=deepskyblue> : ) </font>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
pd.set_option('display.max_columns', None)
general_data=pd.read_csv('/kaggle/input/hr-analytics-case-study/general_data.csv')
employee_survey_data=pd.read_csv('/kaggle/input/hr-analytics-case-study/employee_survey_data.csv')
manager_survey_data=pd.read_csv('/kaggle/input/hr-analytics-case-study/manager_survey_data.csv')

# 1. KNOW YOUR DATA
1. We have 401 entries and 35 columns (features).
2. We have 3 datasets for this project.
3. We have 23 numerical features and 8 string features.
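
Counts like these can be read off a frame's shape and dtypes. The sketch below does that on a small made-up frame; `toy` and `summarize_frame` are hypothetical names used only for illustration and are not part of the dataset or of this notebook.

```python
import pandas as pd

def summarize_frame(df):
    """Return row count, column count, and numeric vs. non-numeric column counts."""
    n_numeric = df.select_dtypes(include="number").shape[1]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "numeric": n_numeric,
        "non_numeric": df.shape[1] - n_numeric,
    }

# Hypothetical toy frame, only to show the shape of the output.
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "Age": [51, 31, 32],
    "Department": ["Sales", "Research & Development", "Sales"],
})
summary = summarize_frame(toy)
print(summary)
```

The same helper could be called on the merged frame once it exists, which is a quick way to double-check the numbers quoted in a summary list.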


In [None]:
print('\n','-'*20,'General info','-'*20,'\n')
general_data.info()
print('\n','-'*20,'Employee Survey info','-'*20,'\n')
employee_survey_data.info()
print('\n','-'*20,'Manager Survey info','-'*20,'\n')
manager_survey_data.info()

### Check the uniqueness of the EmployeeID column, since we will merge the three DataFrames on it
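
As a minimal sketch of why uniqueness matters, the example below merges two made-up frames on `EmployeeID` and lets pandas enforce the one-row-per-employee assumption. The frames `general` and `survey` are hypothetical stand-ins for the real CSV files, and the row order of `survey` is shuffled on purpose to show that a key-based merge does not depend on position.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real CSV files.
general = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [51, 31, 32]})
survey = pd.DataFrame({"EmployeeID": [3, 1, 2], "JobSatisfaction": [2.0, 4.0, 2.0]})

# A key is safe to merge on when it is unique and never missing.
for frame in (general, survey):
    assert frame["EmployeeID"].is_unique
    assert frame["EmployeeID"].notna().all()

# validate="one_to_one" makes pandas raise MergeError if either side repeats a key.
merged = general.merge(survey, on="EmployeeID", how="inner", validate="one_to_one")
print(merged)
```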

In [None]:
print(general_data.EmployeeID.nunique())
print(employee_survey_data.EmployeeID.nunique())
print(manager_survey_data.EmployeeID.nunique())

In [None]:
# Merge the three tables on EmployeeID so that rows are matched by key, not by position,
# and EmployeeID appears only once in the result.
data=(general_data
      .merge(employee_survey_data,on='EmployeeID')
      .merge(manager_survey_data,on='EmployeeID'))

### Checking NaNs in our dataset

Normally, we should check every missing value carefully to find out why it is missing and handle it accordingly. Since this is a quick analysis and NaNs make up only a tiny portion of the dataset, we leave them in place and ignore them during the analysis.

**What we know**

1. We have 110 entries with at least one NaN, about 2% of the dataset.

**What we do**

1. We ignore them.
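
The row count and percentage above can be derived as follows. This is a sketch on a made-up frame (`toy` is hypothetical), so the printed numbers are not the ones from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing survey answers.
toy = pd.DataFrame({
    "Age": [51, 31, 32, 38],
    "JobSatisfaction": [4.0, np.nan, 2.0, 4.0],
    "WorkLifeBalance": [2.0, 4.0, np.nan, 3.0],
})

rows_with_nan = toy.isna().any(axis=1)    # True where a row has at least one NaN
n_incomplete = int(rows_with_nan.sum())   # number of incomplete rows
share_incomplete = rows_with_nan.mean()   # fraction of incomplete rows
per_column = toy.isna().sum()             # missing values per column

print(f"{n_incomplete} incomplete rows ({share_incomplete:.1%})")
print(per_column)
```

Looking at `per_column` as well as the row-level share shows whether the gaps are concentrated in a few survey questions or spread across the table.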

In [None]:
data.head()

In [None]:
print(f'{data.isna().any(axis=1).mean()*100:.2f}% of rows contain at least one NaN')

# 2. DESCRIPTIVE STATISTICS

**What we know**

1. About 16% of employees left last year.
2. 60% of employees are male.
3. Most employees are between 30 and 38 years old.
4. Employees who left are generally younger than those who stayed.
5. The features most correlated with attrition are MaritalStatus, EnvironmentSatisfaction, JobSatisfaction, YearsAtCompany, YearsWithCurrManager, Age, and TotalWorkingYears.
6. People who left tend to have low JobSatisfaction, whereas their EnvironmentSatisfaction varies.
7. There is relatively little correlation between leaving and PerformanceRating or JobInvolvement.
8. Single employees are more likely to leave.
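
A claim such as "single employees are more likely to leave" is about the attrition rate within each group, not about each group's share of the whole table. A minimal sketch of that per-group rate, on made-up data (`toy` is hypothetical and its rates are not the real ones):

```python
import pandas as pd

# Hypothetical toy data: four single and four married employees.
toy = pd.DataFrame({
    "MaritalStatus": ["Single"] * 4 + ["Married"] * 4,
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Share of leavers inside each marital-status group.
left = toy["Attrition"] == "Yes"
rate_by_status = left.groupby(toy["MaritalStatus"]).mean()
print(rate_by_status)
```

The same numbers can be obtained with `pd.crosstab(..., normalize='index')`, which normalizes each row of the table separately.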

In [None]:
data.describe(include='all')

In [None]:
# last year's attrition
data.Attrition.value_counts(normalize=True).to_frame()

In [None]:
cmap = plt.get_cmap("tab20c")
outer_colors = cmap(np.arange(2)*13)

plt.subplots(figsize=(20,10))
ax=plt.subplot(1,2,1)
sns.distplot(data.Age,bins=23)
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.subplot(1,2,2)
gender_share=data.Gender.value_counts(normalize=True)  # labels are taken from the data so they always match the wedges
plt.pie(gender_share,radius=1,autopct='%1.1f%%',wedgeprops=dict(width=0.5, edgecolor='w'),colors=outer_colors,labels=gender_share.index)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.violinplot(data=data,x='Gender',y='Age',hue='Attrition')
plt.show()

In [None]:
from sklearn.preprocessing import LabelEncoder
data_labeled=data.copy()
# Encode every string-valued column as integer codes, with a fresh encoder per column.
categorical_cols=['Attrition','BusinessTravel','Department','EducationField','Gender','JobRole','MaritalStatus']
for col in categorical_cols:
    data_labeled[col]=LabelEncoder().fit_transform(data_labeled[col])
# Drop columns that are not used in the analysis.
data_labeled.drop(['Over18','StandardHours','EmployeeCount'],axis=1,inplace=True)

In [None]:
# check relationship between attrition and other features.
data_labeled.corr()['Attrition'].sort_values().to_frame()

In [None]:
data_labeled.dropna(inplace=True)
ax=plt.figure(figsize=(10,5))
sns.distplot(data_labeled[data_labeled['Attrition']==1].TotalWorkingYears,label='yes')
sns.distplot(data_labeled.TotalWorkingYears,label='all')
plt.xticks(range(0,40,2))
plt.legend()
plt.show()

In [None]:
plt.subplots(figsize=(10,5))
sns.distplot(data[data['Attrition']=='Yes'].Age,label='yes')
sns.distplot(data.Age,label='all')
plt.legend()
plt.show()

In [None]:
plt.subplots(figsize=(10,5))
sns.distplot(data[data['Attrition']=='Yes'].YearsWithCurrManager,label='yes')
sns.distplot(data.YearsWithCurrManager,label='all')
plt.legend()
plt.show()

In [None]:
plt.subplots(figsize=(10,5))
sns.distplot(data[data['Attrition']=='Yes'].YearsAtCompany,label='yes')
sns.distplot(data.YearsAtCompany,label='all')
plt.legend()
plt.show()

In [None]:
plt.subplots(figsize=(10,10))
plt.subplot(2,1,1)
sns.boxplot(data=data,x='Attrition',y='JobSatisfaction')
plt.subplot(2,1,2)
sns.boxplot(data=data,x='Attrition',y='EnvironmentSatisfaction')
plt.show()

In [None]:
pd.crosstab(data['Attrition'],data['MaritalStatus'],margins=True,normalize=True)

# 3. Profile of a typical employee who left

1. You are fairly young for this company (under 32), probably male, and slightly more likely to be single.
2. You have been working at this company for 1~6 years, and your total working years are under 12.
3. But you hate it: your JobSatisfaction is very low, at 1.0.
4. Despite that, your feelings don't affect your JobInvolvement or your job performance.
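
One way to arrive at such a profile is to summarize the leavers with medians for numeric columns and the most frequent value for categorical ones. The sketch below does this on made-up data; `toy` and `profile` are hypothetical names, and the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not the real HR records.
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "Yes", "No", "No"],
    "Age": [28, 30, 35, 45, 50],
    "YearsAtCompany": [1, 3, 5, 10, 20],
    "MaritalStatus": ["Single", "Single", "Married", "Married", "Divorced"],
})

left = toy[toy["Attrition"] == "Yes"]
profile = {
    "median_age": left["Age"].median(),
    "median_years_at_company": left["YearsAtCompany"].median(),
    "most_common_marital_status": left["MaritalStatus"].mode().iloc[0],
}
print(profile)
```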

In [None]:
data_left=data[data.Attrition=='Yes']

In [None]:
plt.figure(figsize=(12,6))
ax=plt.subplot(131)
sns.distplot(data_left.Age,ax=ax)
ax=plt.subplot(132)
sns.distplot(data_left.YearsAtCompany,ax=ax)
ax=plt.subplot(133)
sns.distplot(data_left.TotalWorkingYears,ax=ax)
plt.grid()
plt.show()

In [None]:
data[data.Attrition=='Yes'].Gender.value_counts(normalize=True)

# 4. Profile of an atypical employee who left

But why do senior people and people with higher job satisfaction leave?

1. Senior people tend to leave because of low EnvironmentSatisfaction; TotalWorkingYears is another negatively correlated factor.
2. Among people with high JobSatisfaction, leaving is mostly associated with Age and TotalWorkingYears.
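
The reasoning here is to restrict the data to a subgroup first and only then correlate each feature with attrition. A minimal sketch on made-up, already-encoded data (`toy` is hypothetical; `Attrition` is 1 for leavers):

```python
import pandas as pd

# Hypothetical toy data with Attrition already encoded as 0/1.
toy = pd.DataFrame({
    "Age": [45, 50, 55, 60, 25, 30],
    "Attrition": [1, 1, 0, 0, 0, 1],
    "EnvironmentSatisfaction": [1, 2, 3, 4, 4, 1],
})

# Keep only the senior subgroup, then correlate every feature with Attrition.
senior = toy[toy["Age"] > 40]
senior_corr = senior.corr()["Attrition"].drop("Attrition")
print(senior_corr)
```

A negative value means that higher values of the feature go together with staying inside this subgroup. It describes an association only and says nothing about cause.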

In [None]:
data_labeled[data_labeled.Age>40].corr()['Attrition']

In [None]:
plt.figure(figsize=(12,6))
ax=plt.subplot(121)
sns.boxplot(data=data_labeled[data_labeled.JobSatisfaction>3],x='Attrition',y='TotalWorkingYears',ax=ax)
ax=plt.subplot(122)
sns.boxplot(data=data_labeled[data_labeled.JobSatisfaction>3],x='Attrition',y='Age',ax=ax)
plt.show()

In [None]:
data_labeled.drop(columns=['EmployeeID'],inplace=True)

In [None]:
corr_cols = ['Age','Attrition','BusinessTravel','DistanceFromHome','Education', 'EducationField','Gender', 'JobLevel', 'JobRole',
       'MaritalStatus', 'MonthlyIncome', 'NumCompaniesWorked',
       'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'YearsAtCompany', 'YearsSinceLastPromotion',
       'YearsWithCurrManager']
corr = data_labeled[corr_cols].corr()
plt.figure(figsize=(16,14))
sns.heatmap(corr, annot =True)
plt.show()

# 5. Modeling

## 5.1 LogisticRegression
Let's first try LogisticRegression on this data.
It turns out that LogisticRegression is not ideal for this dataset.
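
When reading the scores below, it helps to know what a trivial model achieves on imbalanced labels. The sketch assumes roughly 16% leavers, as quoted in the descriptive statistics, and uses synthetic labels rather than the real data; the `baseline` model simply always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with 16% positives (1 = left), an assumption for illustration.
y = np.array([1] * 16 + [0] * 84)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)  # share of correct predictions
baseline_recall = recall_score(y, pred)      # share of leavers actually caught
print(baseline_accuracy, baseline_recall)
```

A model that never predicts attrition still looks accurate here, which is why the per-class recall in the classification report is worth checking alongside accuracy.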

In [None]:
y = data_labeled['Attrition']
x = data_labeled.drop('Attrition', axis = 1)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(x,y, test_size = 0.20, random_state=1)

In [None]:
from sklearn.preprocessing import StandardScaler
Scaler_X = StandardScaler()
X_train = Scaler_X.fit_transform(X_train)
X_test = Scaler_X.transform(X_test)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
# accuracy and confusion matrix on the held-out test set
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

## 5.2 Model Selection

Let's see whether any other model produces better results.

In [None]:
cols_todrop = ['JobLevel','Department','JobRole','NumCompaniesWorked','PercentSalaryHike','StockOptionLevel',
               'YearsWithCurrManager']
x = data_labeled.drop(['Attrition'], axis=1).reset_index(drop=True)
y = data_labeled['Attrition'].values
x.drop(cols_todrop, axis=1, inplace=True)
x.Age = pd.cut(x.Age, 4)

In [None]:
x = pd.get_dummies(x)
x_copy=x.copy()

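In [None]:
# Split first, then fit the scaler on the training rows only,
# so that no test-set statistics leak into training.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, random_state=42)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)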

In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve, train_test_split, RandomizedSearchCV
from sklearn import preprocessing
from sklearn.metrics import r2_score, accuracy_score, roc_auc_score, mean_squared_error

In [None]:
def get_scores(score1, score2):
    """Compare models by mean cross-validation score on the training set
    (score1, a scorer name) and by score on the held-out test set
    (score2, a metric function taking y_true and y_pred)."""
    models = [
        ('LR', LogisticRegression()),
        ('LDA', LinearDiscriminantAnalysis()),
        ('KNN', KNeighborsClassifier()),
        ('CART', DecisionTreeClassifier()),
        ('NB', GaussianNB()),
        ('SVM', SVC()),
        ('ADA', AdaBoostClassifier()),
        ('GradientBooster', GradientBoostingClassifier()),
        ('ExtraTrees', ExtraTreesClassifier()),
        ('RandomForest', RandomForestClassifier()),
    ]
    kfold = StratifiedKFold(n_splits=7)
    rows = []
    for name, model in models:
        cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring=score1)
        model.fit(x_train, y_train)
        # scikit-learn metrics expect (y_true, y_pred), in that order
        test_score = score2(y_test, model.predict(x_test))
        rows.append({'Model': name,
                     score1+'(train)': cv_results.mean(),
                     score1+'(test_score)': test_score,
                     'Std': cv_results.std(),
                     'difference': cv_results.mean() - test_score})
    return pd.DataFrame(rows)

We can see that RandomForest gives the best results among these models.

In [None]:
get_scores('accuracy', accuracy_score)

Let's use RandomizedSearchCV to tune the hyperparameters.
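
As a scaled-down sketch of the same idea, the example below runs a tiny randomized search on synthetic data, so it finishes in seconds, and shows which attributes to read afterwards. The names `X_demo`, `y_demo`, `demo_params` and `demo_search` are hypothetical, and the grid is deliberately much smaller than the one used for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced data standing in for the real features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, weights=[0.84, 0.16], random_state=0
)

demo_params = {
    "n_estimators": [10, 20],
    "max_depth": [3, None],
    "max_features": ["sqrt", "log2"],
}
demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=demo_params,
    n_iter=4,            # sample 4 of the 8 possible combinations
    cv=3,
    scoring="accuracy",
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)   # winning combination
print(demo_search.best_score_)    # its mean cross-validated accuracy
```

`best_score_` is a cross-validation mean on the training data, so it should still be confirmed on a held-out test set.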

In [None]:
params = {'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['sqrt', 'log2'],  # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000]}
RandomForest = RandomForestClassifier()
randomgrid_forest = RandomizedSearchCV(estimator=RandomForest, param_distributions = params, 
                               cv=5, n_iter=25, scoring = 'accuracy',
                               n_jobs = 4, verbose = 3, random_state = 42,
                               return_train_score = True)
randomgrid_forest.fit(x_train,y_train)

In [None]:
forest_preds = randomgrid_forest.predict(x_test)
print(classification_report(y_test,forest_preds))

In [None]:
feature_importances_=randomgrid_forest.best_estimator_.feature_importances_.tolist()
feature_names = x_copy.columns
pd.DataFrame(pd.Series(feature_importances_,feature_names),columns=['importance']).sort_values('importance',ascending=False)

Let's draw a ROC curve
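
To make the area under the curve concrete, here is a four-sample sketch with made-up labels and scores (`y_true` and `scores` are illustrative). The AUC equals the share of (leaver, stayer) pairs in which the leaver receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels (1 = left) and predicted probabilities of leaving.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve, one per decision threshold.
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true, scores)

# Three of the four (positive, negative) pairs are ranked correctly.
auc_demo = roc_auc_score(y_true, scores)
print(auc_demo)  # 0.75
```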

In [None]:
from sklearn.metrics import roc_curve, auc
y_score = randomgrid_forest.predict_proba(x_test)  # random forest class probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)
def drawRoc(roc_auc,fpr,tpr):
    plt.subplots(figsize=(7, 5.5))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([-0.1, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
    plt.show()
    
drawRoc(roc_auc, fpr, tpr)