# Business Problem: 
***Gain insights from the INX Future Inc. dataset to find out why the employees' Performance Index falls short of expectations and what can be done to improve the situation.***

# Objective: 
- In this notebook we use the processed data (transformed from the raw data) to build a machine learning model.
- The input file is 'INX_Future_Inc_Employee_Performance_Processed_Data.xlsx'.

**Steps in Train_Model**

Step 1 : Import the libraries

Step 2 : Import the Processed data-set

Step 3 : Split the Processed data-set

Step 4 : Try Different Machine Learning Models

Step 5 : Select the Model, Tune Its Hyperparameters and Train it

Step 6 : Export the Trained Model

## **Step 1 : Import the libraries**

In [1]:
# Import the libraries
import pandas as pd  # pandas is for data manipulation and analysis

# Import Different Models 
from sklearn.linear_model import LogisticRegression
from sklearn import svm, tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import joblib 

## **Step 2 : Import the Processed data-set**

In [2]:
#pd.set_option('display.height', 500)
#pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
df = pd.read_excel('INX_Future_Inc_Employee_Performance_Processed010_Data.xlsx')
df.head()

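| | EmpNumber | SalaryHike_NewCat | Env_Satis_NewCat | EmpEnvironmentSatisfaction | EmpLastSalaryHikePercent | YearsSinceLastPromotion | EmpWorkLifeBalance | ExperienceYearsInCurrentRole | EmpJobRole | EmpHourlyRate | EmpDepartment_Development | Age | PerformanceRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E1001000 | 1 | 2 | 4 | 12 | 0 | 2 | 7 | 13 | 55 | 0 | 32 | 3 |
| 1 | E1001006 | 1 | 2 | 4 | 12 | 1 | 3 | 7 | 13 | 42 | 0 | 47 | 3 |
| 2 | E1001007 | 2 | 2 | 4 | 21 | 1 | 3 | 13 | 13 | 48 | 0 | 40 | 4 |
| 3 | E1001009 | 1 | 1 | 2 | 15 | 12 | 2 | 6 | 8 | 73 | 0 | 41 | 3 |
| 4 | E1001010 | 1 | 1 | 1 | 14 | 2 | 3 | 2 | 13 | 84 | 0 | 60 | 3 |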


## **Step 3 : Split the Processed data-set into X and y**

In [3]:
# Save EmpNumber for later
Emp_Number = df.EmpNumber

In [4]:
df.drop("EmpNumber",axis=1, inplace = True)

In [5]:
# Separate the features (X) from the target (y)
target_name = 'PerformanceRating'
X = df.drop(target_name, axis=1)
y = df[target_name]

In [6]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as a test set; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=None)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(960, 11)
(240, 11)
(960,)
(240,)
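
The split above passes `stratify=None`, so the class proportions of the two subsets can drift from those of the full data-set. The following is a minimal sketch of a stratified alternative. It runs on synthetic labels: `X_demo` and `y_demo` are made-up stand-ins for illustration, not the INX data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class labels (hypothetical stand-in for PerformanceRating)
rng = np.random.default_rng(0)
y_demo = np.array([2] * 20 + [3] * 70 + [4] * 10)
X_demo = rng.normal(size=(len(y_demo), 4))

# stratify=y_demo asks for the same class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# Count how many test rows fall in each class
test_counts = {int(label): int((y_te == label).sum()) for label in np.unique(y_demo)}
print(test_counts)
```

Whether stratification changes the reported scores on the real data would need to be checked by rerunning the notebook.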


## **Step 4 : Try Different Machine Learning Models**

In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier

In [8]:
classifiers = []
model1 = xgboost.XGBClassifier()
classifiers.append(model1)
model2 = svm.SVC()
classifiers.append(model2)
model3 = tree.DecisionTreeClassifier()
classifiers.append(model3)
model4 = RandomForestClassifier()
classifiers.append(model4)
model5 = KNeighborsClassifier()
classifiers.append(model5)
model6 = GaussianNB()
classifiers.append(model6)
model7 = MLPClassifier(alpha=1, max_iter=1000)
classifiers.append(model7)
model8 = AdaBoostClassifier()
classifiers.append(model8)


In [9]:
for clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred= clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Accuracy of %s is %s"%(clf, acc))

Accuracy of XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1) is 0.9625
Accuracy of SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False) is 0.7458333333333333
Accuracy of DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impur
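
Printing the full estimator `repr` makes the comparison hard to scan. One possible alternative, sketched here, stores each accuracy under the estimator's class name. It uses synthetic data from `make_classification` and two of the models above as stand-ins; the data and the `results` dict are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (not the INX data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Map each estimator's class name to its test accuracy
results = {}
for clf in [DecisionTreeClassifier(random_state=0), GaussianNB()]:
    clf.fit(X_tr, y_tr)
    results[type(clf).__name__] = accuracy_score(y_te, clf.predict(X_te))

# Print the models from best to worst
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```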

In [11]:
# Evaluate RandomForestClassifier with 10-fold cross-validation
from sklearn.model_selection import cross_val_score

model4 = RandomForestClassifier()

# The F1 score is the harmonic mean of precision and recall; it reaches its
# best value at 1 and its worst at 0, and precision and recall contribute equally.
# With scoring='f1_micro' on a single-label multiclass target, the score equals accuracy.
scores = cross_val_score(model4, X, y, cv=10, scoring='f1_micro')
print(scores)
# Mean score and a rough 95% interval (mean +/- 2 standard deviations of the fold scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


[0.89166667 0.925      0.98333333 0.93333333 0.99166667 0.98333333
 0.94166667 0.89166667 0.90833333 0.88333333]
Accuracy: 0.93 (+/- 0.08)
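
As a check on how the summary line is derived, the sketch below recomputes it from the ten fold scores printed above, using only the standard library. Note that "+/- 2 standard deviations" of the fold scores is a rough indication of spread, not a formal confidence interval.

```python
from statistics import mean, pstdev

# Fold scores copied from the RandomForestClassifier output above
rf_scores = [0.89166667, 0.925, 0.98333333, 0.93333333, 0.99166667,
             0.98333333, 0.94166667, 0.89166667, 0.90833333, 0.88333333]

rf_mean = mean(rf_scores)
# pstdev is the population standard deviation, matching NumPy's default std()
rf_spread = 2 * pstdev(rf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (rf_mean, rf_spread))
# prints: Accuracy: 0.93 (+/- 0.08)
```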


In [12]:
# Evaluate XGBClassifier with 10-fold cross-validation
from sklearn.model_selection import cross_val_score

model1 = xgboost.XGBClassifier()

# The F1 score is the harmonic mean of precision and recall; it reaches its
# best value at 1 and its worst at 0, and precision and recall contribute equally.
# With scoring='f1_micro' on a single-label multiclass target, the score equals accuracy.
scores = cross_val_score(model1, X, y, cv=10, scoring='f1_micro')
print(scores)
# Mean score and a rough 95% interval (mean +/- 2 standard deviations of the fold scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.89166667 0.94166667 0.98333333 0.925      0.99166667 0.975
 0.91666667 0.9        0.93333333 0.89166667]
Accuracy: 0.94 (+/- 0.07)
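
The cells above score with `'f1_micro'` but label the result "Accuracy". For a single-label multiclass target the two coincide, which the sketch below illustrates on ten hypothetical ratings (the `y_true` and `y_pred` lists are invented for the example). The macro-averaged F1 is shown for contrast, since it generally differs.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted ratings for ten employees
y_true = [3, 3, 2, 4, 3, 2, 3, 4, 3, 3]
y_pred = [3, 3, 3, 4, 3, 2, 2, 4, 3, 4]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over all classes
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(acc, f1_micro, f1_macro)
```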


### Key Observations:
After 10-fold cross-validation:
- Accuracy of RandomForestClassifier() is 0.93 (+/- 0.08)
- Accuracy of XGBClassifier() is 0.94 (+/- 0.07)


XGBClassifier() scores marginally higher, although the gap is well within the fold-to-fold spread of both models. We select XGBClassifier() as our final Machine Learning Model for prediction.
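
To see how close the two models are, the fold scores printed above can be compared fold by fold. This sketch assumes both runs used the same folds, which holds for an integer `cv` with a classifier because scikit-learn then uses unshuffled stratified folds. The variable names are illustrative.

```python
from statistics import mean

# Fold scores copied from the two cross-validation outputs above
rf_scores = [0.89166667, 0.925, 0.98333333, 0.93333333, 0.99166667,
             0.98333333, 0.94166667, 0.89166667, 0.90833333, 0.88333333]
xgb_scores = [0.89166667, 0.94166667, 0.98333333, 0.925, 0.99166667,
              0.975, 0.91666667, 0.9, 0.93333333, 0.89166667]

# Per-fold difference (XGB minus RF); a small tolerance treats equal scores as ties
diffs = [x - r for x, r in zip(xgb_scores, rf_scores)]
mean_diff = mean(diffs)
xgb_wins = sum(d > 1e-9 for d in diffs)
rf_wins = sum(d < -1e-9 for d in diffs)
ties = len(diffs) - xgb_wins - rf_wins

print(f"mean difference: {mean_diff:.4f}")
print(f"XGB better: {xgb_wins}, RF better: {rf_wins}, ties: {ties}")
```

The mean per-fold difference is about 0.002, far smaller than the spread reported for either model, so the choice between the two rests on a narrow margin.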

## **Step 5 : Select the Model and Train it**

In [81]:
xgb_ht = xgboost.XGBClassifier(
    base_score=0.75,
    booster='gbtree',
    learning_rate=0.02,
    max_depth=5,
    min_child_weight=3,  # other values such as 1, 10 or 100 also work in practice
    n_estimators=300,
    objective='multi:softprob',
    random_state=0,
    reg_alpha=0,
    reg_lambda=1,
    colsample_bylevel=1,
    colsample_bynode=1,
    colsample_bytree=1,  # works better than the other two
)

In [None]:
# Evaluate the tuned XGBClassifier with 10-fold cross-validation
from sklearn.model_selection import cross_val_score

# The F1 score is the harmonic mean of precision and recall; it reaches its
# best value at 1 and its worst at 0, and precision and recall contribute equally.
# With scoring='f1_micro' on a single-label multiclass target, the score equals accuracy.
scores = cross_val_score(xgb_ht, X, y, cv=10, scoring='f1_micro')
print(scores)
# Mean score and a rough 95% interval (mean +/- 2 standard deviations of the fold scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
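
The notebook imports `GridSearchCV` and `RandomizedSearchCV` but sets the hyperparameters by hand. As a rough sketch of how a randomized search could automate that step, the example below tunes scikit-learn's `GradientBoostingClassifier` on synthetic data. Both the estimator and the search ranges are assumptions chosen so that the example runs without xgboost; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class problem (not the INX data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Illustrative search space; these ranges are assumptions
param_distributions = {
    "learning_rate": [0.02, 0.05, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [50, 100],
}

# Sample 4 parameter combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring="f1_micro",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern should apply to `xgboost.XGBClassifier`, since it follows the scikit-learn estimator interface, with parameter names such as `min_child_weight` or `colsample_bytree` added to the search space.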

#### Comparing cross-validation to train/test split

Advantages of cross-validation:

- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)

Advantages of train/test split:

- Runs K times faster than K-fold cross-validation
- Simpler to examine the detailed results of the testing process
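
The claim that cross-validation uses every observation for both training and testing can be checked directly. The sketch below counts how often each of ten hypothetical rows lands in the training and test parts of a 5-fold split (`X_demo` is made up for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # ten hypothetical observations

# Count how many times each row is used for training and for testing
test_counts = np.zeros(len(X_demo), dtype=int)
train_counts = np.zeros(len(X_demo), dtype=int)
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print(test_counts)   # each observation is tested exactly once
print(train_counts)  # and used for training in the other four folds
```

The cost is also visible here: five folds mean five model fits instead of one.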

In [84]:
xgb_ht.fit(X_train, y_train)

XGBClassifier(base_score=0.75, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.02, max_delta_step=0, max_depth=5,
              min_child_weight=3, missing=None, n_estimators=300, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

## **Step 6 : Export the Trained Model**

In [85]:
# Serialize the trained model to a pickle file with joblib
joblib.dump(xgb_ht, 'Xbgboost_Classifier_INX_performace_predict.pkl')

['Xbgboost_Classifier_INX_performace_predict.pkl']
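
An exported model is only useful if it can be loaded again. The sketch below shows the round trip with `joblib.load`, using a small `DecisionTreeClassifier` on synthetic data as a stand-in for the trained model; the file name `demo_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in model trained on synthetic data (not the INX model)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Write the model to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Pickled models are generally only safe to load with the same library versions that produced them, so it helps to record those versions alongside the file.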