Problem Statement
----------------------------

Corporation X is a multinational company with nine broad verticals across the organization. One of the problems Corporation X faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time.

The process currently followed is:

   ● A set of employees is first identified based on recommendations or past performance.

   ● Selected employees go through a separate training and evaluation program for each vertical.

   ● These programs are based on the skills required in each vertical. At the end of the program, an employee is promoted based on various factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered).

Under this process, the final promotions are only announced after the evaluation, which delays the employees' transition to their new roles.

Business Problem that needs to be Solved
--------------------------------------------------

Build a machine learning classification model that identifies eligible candidates at a particular checkpoint so that the company can expedite the entire promotion cycle.

Metric Of Success
----------------------

The model should have an accuracy of at least 90% to ensure that the right employees are promoted on time.

In [1]:
# Load the required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


In [2]:
#Load the data set
df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')

In [3]:
#Check the first five Records
df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [4]:
#Check the shape of the data
df.shape

(54808, 14)

In [5]:
#Check for missing Data
df.isna().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [6]:
#Check for duplicated data
df.duplicated().sum()

0

Observations:
1. There is no duplicated data.
2. There are missing values in education and previous_year_rating.

The strategy for dealing with the missing values is to fill each column with its mode (its most frequent value).

In [7]:
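# Deal with the missing values by filling each column with its mode.
# The result is assigned back to the column because newer pandas versions
# warn about calling fillna(..., inplace=True) on a selected column.
df['education'] = df['education'].fillna(df['education'].mode()[0])
df['previous_year_rating'] = df['previous_year_rating'].fillna(df['previous_year_rating'].mode()[0])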
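
The cell above computes each mode over the whole dataset, before any train/test split. A possible variation, shown here only as a minimal sketch on made-up rows, is to learn the fill values from the training rows and reuse them on the test rows. The frames `train_demo` and `test_demo` and their values are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames with the two columns that contain gaps.
train_demo = pd.DataFrame({
    'education': ["Bachelor's", np.nan, "Bachelor's", "Master's & above"],
    'previous_year_rating': [3.0, 5.0, np.nan, 3.0],
})
test_demo = pd.DataFrame({
    'education': [np.nan, "Master's & above"],
    'previous_year_rating': [np.nan, 4.0],
})

# Learn the most frequent value of each column from the training rows only.
fill_values = {col: train_demo[col].mode()[0] for col in train_demo.columns}

# Apply the same fill values to both sets.
train_filled = train_demo.fillna(fill_values)
test_filled = test_demo.fillna(fill_values)

print(fill_values)
print(test_filled)
```

In this toy data the learned fill values are "Bachelor's" for education and 3.0 for previous_year_rating, and both are reused unchanged on the test rows.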


In [8]:
#Check if Missing Values have been fixed

df.isna().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

In [9]:
#Checking summary statistics
df.describe()

Unnamed: 0,employee_id,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
count,54808.0,54808.0,54808.0,54808.0,54808.0,54808.0,54808.0,54808.0,54808.0
mean,39195.830627,1.253011,34.803915,3.304481,5.865512,0.351974,0.023172,63.38675,0.08517
std,22586.581449,0.609264,7.660169,1.21477,4.265094,0.47759,0.15045,13.371559,0.279137
min,1.0,1.0,20.0,1.0,1.0,0.0,0.0,39.0,0.0
25%,19669.75,1.0,29.0,3.0,3.0,0.0,0.0,51.0,0.0
50%,39225.5,1.0,33.0,3.0,5.0,0.0,0.0,60.0,0.0
75%,58730.5,1.0,39.0,4.0,7.0,1.0,0.0,76.0,0.0
max,78298.0,10.0,60.0,5.0,37.0,1.0,1.0,99.0,1.0
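
The mean of is_promoted in the summary above is about 0.085, so roughly 8.5% of the employees were promoted. As a minimal sketch of why this matters for an accuracy target, the snippet below builds a small made-up label series with a similar imbalance (the values are illustrative, not the real data) and measures the accuracy of always predicting "not promoted".

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels: 2 promoted out of 20, standing in for the
# imbalanced is_promoted column. These values are illustrative only.
y_demo = pd.Series([1, 1] + [0] * 18)

# A "model" that always predicts the majority class (0 = not promoted).
majority_class = y_demo.mode()[0]
baseline_predictions = pd.Series([majority_class] * len(y_demo))

baseline_accuracy = accuracy_score(y_demo, baseline_predictions)
print('Majority-class baseline accuracy:', baseline_accuracy)
```

On the real data the equivalent figure would be `1 - df['is_promoted'].mean()`, which is about 0.915 given the mean reported above. It may be worth keeping this number in mind when reading any accuracy score for this dataset.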


In [10]:
#Check data types
df.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

In [11]:
# Rename awards_won? by removing the question mark
df.rename(columns = {'awards_won?':'awards_won'}, inplace = True)
# Drop the employee_id column, as an identifier is not a useful feature
df = df.drop('employee_id',axis=1)

In [12]:
# One-hot encode the categorical columns because the chosen algorithm requires numerical features
OHE = pd.get_dummies(df, drop_first= True)
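
To illustrate what `drop_first=True` does, here is a small sketch on a made-up frame (`demo` and its rows are hypothetical). For each categorical column the first category in sorted order is dropped, so a column with k categories becomes k - 1 indicator columns. `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical miniature frame with two categorical columns and one numeric.
demo = pd.DataFrame({
    'gender': ['f', 'm', 'm'],
    'recruitment_channel': ['sourcing', 'other', 'referred'],
    'age': [35, 30, 34],
})

# Numeric columns pass through unchanged; each categorical column is
# replaced by indicator columns, minus its first category.
encoded = pd.get_dummies(demo, drop_first=True, dtype=int)

print(list(encoded.columns))
print(encoded)
```

Here 'f' and 'other' are the dropped reference categories, which is why the encoded notebook data has gender_m but no gender_f column.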

In [16]:
# View the final df
OHE.head()

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won,avg_training_score,is_promoted,department_Finance,department_HR,...,region_region_5,region_region_6,region_region_7,region_region_8,region_region_9,education_Below Secondary,education_Master's & above,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,1,35,5.0,8,1,0,49,0,0,0,...,0,0,1,0,0,0,1,0,0,1
1,1,30,5.0,4,0,0,60,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1,34,3.0,7,0,0,50,0,0,0,...,0,0,0,0,0,0,0,1,0,1
3,2,39,1.0,10,0,0,50,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,1,45,3.0,2,0,0,73,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [17]:
#Selecting the features
X = OHE.drop('is_promoted', axis = 1)
y = OHE['is_promoted']

In [18]:
# Split the data into train and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state = 42)
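
Because promoted employees are a small minority, a plain random split can leave the two sets with slightly different promotion rates. One option, sketched below on synthetic arrays (`X_demo` and `y_demo` are made up for illustration), is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both sets. This is a possible variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positives, mimicking an imbalanced target.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

# stratify=y_demo preserves the 10% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print('Train positive rate:', y_tr.mean())
print('Test positive rate:', y_te.mean())
```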

In [19]:
# Normalize the features
# Instantiate the min-max scaler and fit it on the training data only
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train) 
X_train = norm.transform(X_train) 
X_test = norm.transform(X_test)
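
The scaler above is fitted on the training set and then applied to both sets. The following sketch, using made-up scores, shows what that implies: the training values land exactly in [0, 1], while a test value outside the training range is mapped outside [0, 1]. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data (e.g. a training score).
train_scores = np.array([[40.0], [60.0], [80.0]])
test_scores = np.array([[50.0], [90.0]])

# Fit on the training data only: min = 40, max = 80.
scaler = MinMaxScaler().fit(train_scores)

scaled_train = scaler.transform(train_scores)  # (x - 40) / 40
scaled_test = scaler.transform(test_scores)    # same formula, training min/max

print(scaled_train.ravel())
print(scaled_test.ravel())
```

The test value 90 is above the training maximum of 80, so it scales to 1.25. Tree-based models such as random forests split on thresholds and are generally insensitive to this kind of monotonic rescaling, so the step mainly matters if another estimator is tried later.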

In [20]:
# Compare n_estimators values from 1 to 10 before the final training
# Note: the score printed here is the accuracy on the training set
for n in range(1,11):
    random_forest_classifier = RandomForestClassifier(n_estimators= n)
    random_forest_classifier.fit(X_train,y_train)
    print('model_score :',n, random_forest_classifier.score(X_train,y_train))

model_score : 1 0.9572527042877623
model_score : 2 0.9633520135540206
model_score : 3 0.9817281376254399
model_score : 4 0.9765411182066989
model_score : 5 0.9886615404665711
model_score : 6 0.9837351752899779
model_score : 7 0.9915287371301968
model_score : 8 0.987723185194839
model_score : 9 0.9933793822494461
model_score : 10 0.9898605499804509
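
The scores above are measured on the same data the forests were trained on, so they mostly reflect how well each forest memorizes the training set. A common alternative is to compare candidates with cross-validation. The sketch below does this on synthetic data from `make_classification`, since the real dataset is not loaded here; `X_demo`, `y_demo` and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the HR data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

cv_scores = {}
for n in range(1, 11):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: each fold is scored on data not used for fitting.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[n] = scores.mean()
    print('n_estimators:', n, 'mean CV accuracy:', round(cv_scores[n], 4))

best_n = max(cv_scores, key=cv_scores.get)
print('Best n_estimators by cross-validation:', best_n)
```

In the notebook the same loop could be run with `X_train` and `y_train` in place of the synthetic arrays, keeping the test set untouched until the final evaluation.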


In [24]:
# From the output above, n_estimators = 9 gave the highest training accuracy, so it is used here

random_forest_classifier = RandomForestClassifier(n_estimators=9)
random_forest_classifier.fit(X_train,y_train)
#Predict
train_predictions = random_forest_classifier.predict(X_train)
test_predictions = random_forest_classifier.predict(X_test)

In [25]:
# Checking Model Accuracy

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))

Accuracy
Training set: 0.9926234849472175
Test set: 0.9343185550082101
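
Accuracy does not show how many promoted employees the model actually finds. As a minimal sketch, the snippet below computes a confusion matrix, precision and recall on made-up labels and predictions (`y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's results).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels and predictions: 8 not promoted, 2 promoted.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
demo_precision = precision_score(y_true_demo, y_pred_demo)
demo_recall = recall_score(y_true_demo, y_pred_demo)

print(cm)
print('Accuracy:', demo_accuracy)
print('Precision:', demo_precision)
print('Recall:', demo_recall)
```

In this toy case the accuracy is 0.8 while precision and recall for the promoted class are both 0.5. In the notebook the same functions could be called with `y_test` and `test_predictions`.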


Conclusions
-----------

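The problem required a classification model since there are only two possible outcomes after evaluation, i.e. whether or not an employee is promoted.

For the chosen model, a random forest classifier, n_estimators = 9 gave the highest training accuracy among the values 1 to 10.

The model reached about 99.3% accuracy on the training set and about 93.4% on the test set, which meets the 90% target. The gap between the two suggests some overfitting, and since only about 8.5% of the employees in the data were promoted, accuracy alone should be read with care.

Further improvements may be possible by tuning the tree depth parameter (max_depth).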
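
One way to explore the tree depth is a small grid search with cross-validation. The sketch below runs on synthetic data from `make_classification`; the candidate depths, `X_demo`, `y_demo` and the fixed `random_state` are assumptions chosen for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the HR data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Candidate depths; None lets each tree grow until its leaves are pure.
param_grid = {'max_depth': [3, 5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=9, random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print('Best max_depth:', search.best_params_['max_depth'])
print('Best mean CV accuracy:', round(search.best_score_, 4))
```

Limiting the depth usually lowers the training accuracy somewhat, and it may narrow the gap between training and test accuracy.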