**Your client is a large MNC with 9 broad verticals across the organisation. One of the problems the client faces is identifying the right people for promotion (manager positions and below only) and preparing them in time. The current process is:**
1. They first identify a set of employees based on recommendations and past performance.
2. Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills each vertical requires. At the end of the program, employees are promoted based on factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered).
3. Final promotions are announced only after the evaluation, which delays the transition to new roles. The company therefore needs help identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

**Evaluation Metric to be Used:** 

**The evaluation metric for this analysis and model should be F1 Score.**
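
To make the metric concrete, here is a minimal sketch of how F1 is computed for the positive class. The function name and the toy labels are hypothetical and written in plain Python so the cell runs without the notebook's dependencies. It also shows why accuracy alone can mislead when few employees are promoted.

```python
def f1_score_binary(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 of 10 employees promoted.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_no = [0] * 10                       # a model that never predicts promotion
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # one hit, one false alarm, one miss

accuracy_always_no = sum(t == p for t, p in zip(y_true, always_no)) / len(y_true)
f1_always_no = f1_score_binary(y_true, always_no)
f1_one_hit = f1_score_binary(y_true, one_hit)
```

The never-promote model reaches 80% accuracy on these labels while its F1 is 0, which is the behaviour F1 is meant to expose.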

### Dataset Description

<table>
  <thead>
    <tr>
      <th>Field</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>employee_id</td>
      <td>Unique ID for employee</td>
    </tr>
    <tr>
      <td>department</td>
      <td>Department of employee</td>
    </tr>
    <tr>
      <td>region</td>
      <td>Region of employment (unordered)</td>
    </tr>
    <tr>
      <td>education</td>
      <td>Education Level</td>
    </tr>
    <tr>
      <td>gender</td>
      <td>Gender of Employee</td>
    </tr>
     <tr>
      <td>recruitment_channel</td>
      <td>Channel of recruitment for employee</td>
    </tr>
     <tr>
      <td>no_of_trainings</td>
      <td>Number of other trainings completed in the previous year (soft skills, technical skills, etc.)</td>
    </tr>
    <tr>
     <td>age</td>
     <td>Age of Employee</td>
    </tr>
    <tr>
     <td>previous_year_rating</td>
     <td>Employee Rating for the previous year</td>
    </tr>
    <tr>
     <td>length_of_service</td>
     <td>Length of service in years</td>
    </tr>
    <tr>
     <td>KPIs_met >80%</td>
     <td>1 if more than 80% of KPIs (Key Performance Indicators) were met, else 0</td>
    </tr>
    <tr>
      <td>awards_won?</td>
      <td>1 if awards were won during the previous year, else 0</td>
    </tr>
      <tr>
      <td>avg_training_score</td>
      <td>Average score in current training evaluations</td>
    </tr>
      <tr>
      <td>is_promoted (Target)</td>
      <td>Recommended for promotion</td>
    </tr>
  </tbody>
</table>

#### Importing Packages

In [None]:
import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
from pandas import DataFrame
import pylab as pl
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train_data=pd.read_csv("../input/hranalyticsav/train_data.csv")
#test_data=pd.read_csv("C:\\Users\\ARPIT\\Desktop\\New folder\\HR Analytics\\test_data.csv")

In [None]:
train_data.head()

In [None]:
train_data.shape

In [None]:
numeric_data = train_data.select_dtypes(include=[np.number])
categorical_data = train_data.select_dtypes(exclude=[np.number])
print("Numeric_Column_Count =", numeric_data.shape)
print("Categorical_Column_Count =", categorical_data.shape)

In [None]:
allna = (train_data.isnull().sum() / len(train_data))*100
allna = allna.drop(allna[allna == 0].index).sort_values()
NA=train_data[allna.index.to_list()]
NAcat=NA.select_dtypes(include='object')
NAnum=NA.select_dtypes(exclude='object')
print(f'We have :{NAcat.shape[1]} categorical features with missing values')
print(f'We have :{NAnum.shape[1]} numerical features with missing values')

In [None]:
total_missing=train_data.isnull().sum().sort_values(ascending=False)
percent=(train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total_missing,percent],axis=1,keys=['Missing_Total','Percent'])
missing_data.head()

In [None]:
import missingno as msno
msno.matrix(numeric_data)
total = numeric_data.isnull().sum().sort_values(ascending=False)
percent_1 = numeric_data.isnull().sum()/numeric_data.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total','%'])

In [None]:
train_data.describe().T

In [None]:
train_data['previous_year_rating']=train_data.groupby(["department","region"])['previous_year_rating'].transform(lambda x: x.fillna(x.median()))

In [None]:
train_data['previous_year_rating'].isna().sum()

In [None]:
train_data['education']=train_data['education'].fillna(train_data['education'].mode()[0], inplace = False)
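
The group-wise median imputation used above for `previous_year_rating` can be sketched in plain Python as follows. The rows are hypothetical and the sketch only illustrates the idea; it is not the pandas implementation.

```python
from statistics import median

# Hypothetical rows: (department, region, previous_year_rating); None marks a missing rating.
rows = [
    ("Sales", "region_1", 3.0),
    ("Sales", "region_1", 5.0),
    ("Sales", "region_1", None),
    ("HR", "region_2", 4.0),
    ("HR", "region_2", None),
]

# Collect the observed ratings of each (department, region) group.
observed = {}
for dept, region, rating in rows:
    if rating is not None:
        observed.setdefault((dept, region), []).append(rating)

# Replace each missing rating with the median of its own group.
imputed = [
    (dept, region, rating if rating is not None else median(observed[(dept, region)]))
    for dept, region, rating in rows
]
```

Unlike this sketch, which would raise a `KeyError`, the pandas version leaves a value missing when its whole group has no observed rating, so the `isna().sum()` check after imputation is still worth running.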

### Exploratory Data Analysis

In [None]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

In [None]:
pd.crosstab( train_data.is_promoted,train_data.education,margins=True).style.background_gradient(cmap='Wistia')

In [None]:
import matplotlib.style as style
style.use('fivethirtyeight')

In [None]:
plt.figure(figsize=(15,9))
# Derive the slice sizes from the data instead of hard-coding the counts
edu_counts = train_data['education'].value_counts().sort_index()
size = edu_counts.values
labels = edu_counts.index
colors = ['green','blue','yellow']
explode = [0, 0.2,0.3]
plt.pie(size, labels = labels, colors = colors, explode = explode, shadow = False, autopct = "%.2f%%")
plt.title('Pie Chart Representing Distribution of Employees Based on Their Education', fontsize =20)
plt.axis('on')
plt.legend(bbox_to_anchor=(0.1, 1.05, 1., .80), loc='lower left',
           ncol=3, mode="expand", borderaxespad=1.5)
plt.show()

In [None]:
# checking dependency of different regions in promotion
data = pd.crosstab(train_data['region'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (15, 8), color = ['lightblue', 'purple'])
plt.title('Dependency of Regions in determining Promotion of Employees', fontsize = 30)
plt.xlabel('Different Regions of the Company', fontsize = 18)
plt.legend(bbox_to_anchor=(0.1, 1.05, 1., .80), loc='lower left',
           ncol=3, mode="expand", borderaxespad=1.5)
plt.show()

In [None]:
# dependency of awards won on promotion
data = pd.crosstab(train_data['awards_won?'], train_data['is_promoted'])
data.div(data.sum(1).astype('float'), axis = 0).plot(kind = 'bar', stacked = True, figsize = (6, 6), color = ['magenta', 'purple'])
plt.title('Dependency of Awards in determining Promotion', fontsize = 25)
plt.xlabel('Awards Won or Not', fontsize = 20)
plt.legend()
plt.show()

In [None]:
# stacked bar chart of promotion share for each average training score
data = pd.crosstab(train_data['avg_training_score'], train_data['is_promoted'])
data.div(data.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (20, 9), color = ['darkred', 'lightgreen'])
plt.title('Looking at the Dependency of Training Score in promotion', fontsize = 30)
plt.xlabel('Average Training Scores', fontsize = 15)
plt.legend(bbox_to_anchor=(0.1, 1.05, 1., .80), loc='lower left',
           ncol=3, mode="expand", borderaxespad=1.5)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
# recent seaborn versions require the column to be passed by keyword (x=...)
sns.countplot(x='recruitment_channel',hue='is_promoted',data=train_data).set_title('Promotion_Recruitment Channel')

In [None]:
plt.figure(figsize = (15,5))
# fill=True replaces the deprecated shade=True; each curve gets its own legend label
sns.kdeplot(train_data.loc[train_data.is_promoted == 0, "length_of_service"], color = "magenta", fill = True, label = "Length of service, promoted = 0")
sns.kdeplot(train_data.loc[train_data.is_promoted == 1, "length_of_service"], color = "blue", fill = True, label = "Length of service, promoted = 1")
sns.kdeplot(train_data.loc[train_data.is_promoted == 0, "avg_training_score"], color = "green", fill = True, label = "Avg training score, promoted = 0")
sns.kdeplot(train_data.loc[train_data.is_promoted == 1, "avg_training_score"], color = "yellow", fill = True, label = "Avg training score, promoted = 1")
plt.title("Length of Service and Avg Training Score by Promotion Status")
plt.legend()

In [None]:
train_data['education'].value_counts()

In [None]:
plt.figure(figsize = (15,5))
train_data.age[train_data.education == "Bachelor's"].plot(kind='kde')    
train_data.age[train_data.education == "Master's & above"].plot(kind='kde')
train_data.age[train_data.education == "Below Secondary"].plot(kind='kde')
# set the x-axis label
plt.xlabel("Age")    
plt.title("Age Distribution with Education Qualification")
# sets our legend for our graph.
plt.legend(("Bachelor's","Master's & above","Below Secondary"),loc='best') ;

In [None]:
plt.figure(figsize=(15,8))
c = 'y'
# Create a dictionary of keyword arguments to pass to the boxplot call
red_dict =  {'patch_artist': True,
             'boxprops': dict(color=c, facecolor=c),
             'capprops': dict(color=c),
             'flierprops': dict(color=c, markeredgecolor=c),
             'medianprops': dict(color=c),
             'whiskerprops': dict(color=c)}
train_data.boxplot(column=['age','length_of_service','avg_training_score'],**red_dict,grid=False)

In [None]:
import plotly.express as px 
df = train_data
fig = px.box(df, x="is_promoted", y="age",points="outliers")
fig.update_traces(quartilemethod="inclusive") 
fig.show()

In [None]:
promoted= 'promoted'
not_promoted = 'not promoted'
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(14, 6))
female = train_data[train_data['gender']=='f']
male = train_data[train_data['gender']=='m']
# histplot replaces the deprecated distplot(kde=False); both draw count histograms
ax = sns.histplot(female[female['is_promoted']==0].age.dropna(), bins=40, label = not_promoted, ax = axes[0])
ax = sns.histplot(female[female['is_promoted']==1].age.dropna(), bins=18, label = promoted, ax = axes[0])
ax.legend()
ax.set_title('Female')
ax = sns.histplot(male[male['is_promoted']==1].age.dropna(), bins=18, label = promoted, ax = axes[1])
ax = sns.histplot(male[male['is_promoted']==0].age.dropna(), bins=40, label = not_promoted, ax = axes[1])
ax.legend()
_ = ax.set_title('Male')

In [None]:
plt.figure(figsize = (10,7))
sns.countplot(x='is_promoted', data=train_data)
plt.show()
print('Percent of people getting promoted: ',len(train_data[train_data['is_promoted']==1])/len(train_data['is_promoted'])*100,"%")
print('Percent of people not getting promoted: ',len(train_data[train_data['is_promoted']==0])/len(train_data['is_promoted'])*100,"%")

### Min-Max Scaling

In [None]:
x=train_data.drop(['employee_id','KPIs_met >80%','awards_won?','is_promoted','department','region','education','gender','recruitment_channel'],axis=1)
y=train_data['is_promoted']

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaled_features = MinMaxScaler().fit_transform(x.values)
scaled_features_df = pd.DataFrame(scaled_features,index=x.index, columns=x.columns)
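
For reference, min-max scaling maps each column linearly onto [0, 1] via (x - min) / (max - min). A minimal plain-Python sketch of that formula, with a hypothetical helper name and toy ages, looks like this:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: this sketch returns zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]          # hypothetical ages
scaled_ages = min_max_scale(ages)
```

Because the minimum and maximum come from the data being scaled, the same fitted scaler should be reused on any later test data rather than refitting it.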

In [None]:
train_data=train_data.drop(['no_of_trainings', 'age', 'previous_year_rating', 'length_of_service','avg_training_score'],axis=1,inplace=False)

In [None]:
train_data=pd.concat([scaled_features_df,train_data], axis=1).reindex(train_data.index)

In [None]:
train_data.head()

### Feature Engineering

In [None]:
# One-hot encode gender; dtype=int keeps the dummy columns numeric (0/1) for the later VIF step
train_data = pd.concat([train_data, pd.get_dummies(train_data['gender'], prefix='G', dtype=int)], axis=1)

In [None]:
# One-hot encode recruitment channel
train_data = pd.concat([train_data, pd.get_dummies(train_data['recruitment_channel'], prefix='R', dtype=int)], axis=1)

In [None]:
# One-hot encode region
train_data = pd.concat([train_data, pd.get_dummies(train_data['region'], prefix='Re', dtype=int)], axis=1)

In [None]:
# One-hot encode department
train_data = pd.concat([train_data, pd.get_dummies(train_data['department'], prefix='Dep', dtype=int)], axis=1)

In [None]:
train_data=train_data.drop(['employee_id','department','Dep_Technology','region','Re_region_8','recruitment_channel','R_other','gender','G_f'],axis=1,inplace=False)
train_data.head()
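
The drop above removes one dummy column per categorical variable (for example `R_other` and `G_f`), so that category becomes the reference level. A small plain-Python sketch of one-hot encoding with a dropped reference category, using a hypothetical helper and toy values:

```python
def one_hot(values, prefix, drop):
    """One-hot encode values, omitting the column for the reference category `drop`."""
    categories = sorted(set(values) - {drop})
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

channels = ["sourcing", "other", "referred", "other"]   # hypothetical recruitment channels
encoded = one_hot(channels, prefix="R", drop="other")
```

A row whose dummies are all 0 belongs to the reference category, so no information is lost and the dummy columns no longer sum to a constant.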

#### Ordinal Encoding on Education

In [None]:
train_data['education'].value_counts()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinalencoder = OrdinalEncoder()

In [None]:
# Set an explicit order for the education levels
categories = pd.Categorical(train_data['education'], categories=["Master's & above","Bachelor's","Below Secondary"], ordered=True)
# Factorize the column; the integer codes follow the category order given above
labels, unique = pd.factorize(categories, sort=True)
train_data['education'] = labels
# Counts of the encoded education levels
train_data['education'].value_counts()

#### Checking for Correlated Columns

In [None]:
my_corr=train_data.corr()
plt.figure(figsize=(18,18))
sns.set(font_scale=0.8)
sns.heatmap(my_corr, cbar=True, annot=True, square=True, fmt='.1f', annot_kws={'size': 8},linewidth=0.8)
plt.show()

#### Correlation Among Columns > 0.5

In [None]:
# Absolute pairwise correlations between all columns
cor_target = train_data.corr().abs()
# Flatten the matrix to one row per column pair (each pair appears twice, once per ordering)
Feature_corr = cor_target.unstack().to_frame(name='Correlation')
# Keep pairs with |correlation| >= 0.5, excluding each column's correlation with itself
Feature = Feature_corr[(Feature_corr['Correlation']>=0.5)&(Feature_corr['Correlation']<1)].sort_values(by='Correlation', ascending = False).reset_index()
Feature.head(10)

### VIF

In [None]:
X=train_data.drop('is_promoted',axis=1)
Y=train_data.is_promoted

In [None]:
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = add_constant(X)
vif = pd.Series([variance_inflation_factor(X_vif.values, i) 
               for i in range(X_vif.shape[1])], 
              index=X_vif.columns)

In [None]:
print(vif.sort_values(ascending = False).head(20))
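
As a reminder of what the numbers mean: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. With only one other predictor, R² is the squared Pearson correlation, which allows a small plain-Python illustration. The helper names and toy values are hypothetical.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with a single other predictor, R^2 is the squared correlation."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

age = [25, 30, 35, 40, 45, 50]
length_of_service = [1, 4, 8, 12, 19, 22]   # hypothetical, strongly tied to age
no_of_trainings = [1, 3, 1, 2, 1, 2]        # hypothetical, uncorrelated with age

vif_collinear = vif_two_predictors(age, length_of_service)
vif_independent = vif_two_predictors(age, no_of_trainings)
```

A VIF near 1 indicates little collinearity; values above roughly 5 to 10 are a common, informal signal that a column is largely explained by the others.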

In [None]:
X=train_data.drop(['Re_region_2','Dep_Sales & Marketing','Re_region_22','avg_training_score','Re_region_7','Dep_Operations','is_promoted'],axis=1)
Y=train_data.is_promoted

### Feature Selection

#### Train/Test Split for Feature Selection

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=123)

#### Feature Selection using Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [None]:
feat_sel=SelectFromModel(RandomForestClassifier(n_estimators=1000))
feat_sel_fit=feat_sel.fit(X_train,y_train)

In [None]:
feat_sel_fit.get_support()

In [None]:
selected_feat= X_train.columns[(feat_sel_fit.get_support())]
selected_feat

### Modeling 

#### Phase1: Data Preprocessing, Data Balancing

##### Creating the subset from the existing DataFrame for analysis


In [None]:
new_df=train_data[['no_of_trainings', 'age', 'previous_year_rating', 'length_of_service','KPIs_met >80%', 'awards_won?','G_m','R_sourcing','is_promoted']]

In [None]:
x=new_df.drop(['is_promoted'],axis=1)
y=new_df.is_promoted

**x contains all the independent variables: no_of_trainings, age, previous_year_rating, length_of_service, KPIs_met >80%, awards_won?, G_m and R_sourcing.<br>
y holds is_promoted, the dependent variable.**

##### Data Balancing using SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='minority')

In [None]:
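# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases.
# Note: oversampling before the train/test split puts synthetic samples in the test set,
# so scores measured on it are likely optimistic.
X, Y = smote.fit_resample(x, y)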
X.shape,Y.shape
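
The core idea of SMOTE is to create each synthetic minority sample by interpolating between a real minority sample and one of its minority-class neighbours. The plain-Python sketch below shows only that interpolation step; the function name and feature values are hypothetical, and the real algorithm also selects the neighbour among the k nearest ones.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between a minority sample and a neighbour."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
promoted_a = [0.2, 0.8]   # hypothetical scaled features of a promoted employee
promoted_b = [0.4, 0.6]   # a nearby promoted employee
synthetic = smote_like_sample(promoted_a, promoted_b, rng)
```

Every synthetic point lies between two real promoted employees, which is why binary columns such as `awards_won?` can end up with fractional values after resampling.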

##### Splitting the dataset into the Training set and Test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

#### Phase 2: Making the Neural Network (NN)

##### Importing the Keras libraries and packages

In [None]:
import keras
from keras.models import Sequential # sequential module reqd to initialize the NN
from keras.layers import Dense # Dense is the only layer type needed to build this NN

##### Initialising the ANN

In [None]:
promotion_pred = Sequential()# creating object of Sequential class

##### Adding the input layer and the hidden layer

In [None]:
promotion_pred.add(Dense(input_dim=8, activation="relu", kernel_initializer="uniform", units=5))

**1.The add method of the promotion_pred object adds layers.<br>
2.Dense takes care of the first step of the ANN, i.e. randomly initializing the weights to small numbers close to 0 (but not 0); this is done with kernel_initializer = 'uniform' (weights drawn from a uniform distribution). The dataset gives 8 input nodes, hence input_dim = 8.<br>
3.Forward propagation applies the activation function: each neuron applies it to the weighted sum of its inputs. The larger the activation value, the more strongly the neuron passes on its signal.<br>
4.Use the rectifier activation function for hidden layers (activation = 'relu'). units, i.e. the output dimension, is set to 5, the chosen number of nodes in this hidden layer.<br>
TIP: there is no rule of thumb for choosing the output dimension; one option is the average of the number of nodes in the input layer and in the output layer (here (8 + 1) / 2, rounded up to 5).**
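
To illustrate what a dense layer computes, here is a plain-Python sketch of a forward pass through a tiny network. The sizes, weights and helper names are hypothetical and much smaller than the model above; Keras performs the same arithmetic with matrices.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each unit."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# Hypothetical tiny network: 3 inputs -> 2 hidden units (ReLU) -> 1 output (sigmoid).
x = [1.0, 0.5, -1.0]
hidden = dense(x, weights=[[0.2, 0.4, 0.1], [-0.3, 0.2, 0.5]], biases=[0.0, 0.0], activation=relu)
output = dense(hidden, weights=[[1.0, -1.0]], biases=[0.0], activation=sigmoid)
```

The second hidden unit receives a negative weighted sum, so ReLU sets it to 0 and it contributes nothing to the output for this input.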

##### Adding the output layer

In [None]:
promotion_pred.add(Dense(activation = 'sigmoid', kernel_initializer = "uniform", units = 1))

##### Compiling the ANN

In [None]:
promotion_pred.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

**1.optimizer is the algorithm used to find a good set of weights for the NN (until now the weights have only been initialized); 'adam' is a variant of stochastic gradient descent (SGD). loss is the loss function that the optimizer minimizes.<br>
2.The loss function is the same as for logistic regression (logarithmic loss): since the output layer uses a sigmoid activation, we use binary_crossentropy. The accuracy metric is only reported during training and is not optimized directly; the metrics parameter expects a list, so 'accuracy' is passed in [].**
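
The logarithmic loss mentioned above can be written out in a few lines of plain Python. This is a sketch with a hypothetical function name; the clipping constant `eps` is an assumption added to keep `log` finite.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Same two employees, predicted confidently right versus confidently wrong.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
```

Confident wrong predictions are penalised far more heavily than confident right ones are rewarded, which is what pushes the sigmoid output toward calibrated probabilities.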


##### Fitting the ANN to the Training set

In [None]:
History=promotion_pred.fit(X_train, y_train, batch_size = 10, epochs = 100,validation_data=(X_test,y_test))

**1.Weights are updated via mini-batch learning, so the batch size needs to be specified (no rule of thumb).<br>
2.One pass of the whole training set through the ANN makes an epoch.<br>
3.The number of epochs needs to be specified (no rule of thumb).**
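
The relationship between batch size, epochs and the number of weight updates can be made explicit with a short calculation. The training-set size below is hypothetical; the batch size and epoch count match the fit call.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and so weight updates) in one pass over the training data."""
    return math.ceil(n_samples / batch_size)

n_train = 1000                                   # hypothetical training-set size
per_epoch = updates_per_epoch(n_train, batch_size=10)
total_updates = per_epoch * 100                  # 100 epochs
```

A smaller batch size means more, noisier updates per epoch and longer wall-clock time per epoch; a larger one means fewer, smoother updates.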

#### Phase 3: Making the predictions and evaluating the model


##### Predicting the Test set results.

In [None]:
y_pred = promotion_pred.predict(X_test) # predicted probability of each employee getting promoted

In [None]:
y_pred = (y_pred > 0.5) # True (1) if the predicted probability exceeds 0.5, otherwise False (0): binary classification

##### Making the Confusion Matrix


In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true classes, columns are predicted classes
conf_mat=confusion_matrix(y_test,y_pred)

In [None]:
from sklearn.metrics import classification_report , confusion_matrix , accuracy_score
from mlxtend.plotting import plot_confusion_matrix
cm_test = confusion_matrix(y_test, y_pred)
fig, ax = plot_confusion_matrix(conf_mat=conf_mat,figsize=(8, 8),
                                show_absolute=True,
                                show_normed=True,
                                colorbar=True)
plt.show()
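
To read the plot, it helps to see how the four cells are counted. The plain-Python sketch below thresholds hypothetical probabilities and tallies them in the same layout scikit-learn uses (rows are true classes, columns are predicted classes); the function name and data are illustrative only.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] after thresholding predicted probabilities."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_prob):
        predicted = 1 if p > threshold else 0
        matrix[t][predicted] += 1
    return matrix

y_true = [0, 0, 1, 1, 1]                 # hypothetical true labels
y_prob = [0.1, 0.7, 0.4, 0.8, 0.9]       # hypothetical predicted probabilities
cm = confusion_counts(y_true, y_prob)
```

Precision, recall and therefore F1 for the promoted class all come from the bottom-right (TP), top-right (FP) and bottom-left (FN) cells.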

##### Accuracy

In [None]:
print("Accuracy of the model is - " , promotion_pred.evaluate(X_test,y_test)[1]*100 , "%")

In [None]:
y_pred=y_pred.astype(int).flatten()
print(y_pred)

**Classification Report**

In [None]:
from sklearn.metrics import classification_report
cls = classification_report(y_test,y_pred)
print(cls)

In [None]:
print(History.history.keys())

In [None]:
import matplotlib.style as style
style.use('fivethirtyeight')

**Loss-Train-Test**

In [None]:
plt.figure(figsize=(18,8))
plt.plot(History.history['loss'],label='train')
plt.xlabel('epochs')
plt.plot(History.history['val_loss'],label='test')
plt.ylabel('loss')
plt.legend()
plt.show()

**Accuracy-Train-Test**

In [None]:
plt.figure(figsize=(18,8))
plt.plot(History.history['accuracy'],label='train')
plt.xlabel('epochs')
plt.plot(History.history['val_accuracy'],label='test')
plt.ylabel('accuracy')
plt.legend()
plt.show()