### Predict Model
* **Importing necessary libraries**
* **Loading Train and Test data**
* **Independent and Dependent variable creation**
* **Importing trained models and predicting results**
* **Exporting the results**
* **User-defined Performance Ratings Prediction**

#### Importing necessary libraries

In [112]:
#Importing necessary libraries
import pandas as pd
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
import numpy as np

# Import Trained models
import pickle

import warnings
warnings.filterwarnings("ignore")

#### Loading Train and Test data 

In [113]:
#Loading the data
processed_file_location = 'C:/Users/User/Desktop/E10901-PR2-V18_Certified Data Scientist - Project/data/processed/'

# Files are generated from src-> Data Processing -> data_processing.ipynb
train_data = pd.read_csv(processed_file_location +'train_data.csv')
test_data = pd.read_csv(processed_file_location +'test_data.csv')

In [114]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Unnamed: 0                    960 non-null    int64 
 1   Age                           960 non-null    int64 
 2   MaritalStatus                 960 non-null    object
 3   EmpDepartment                 960 non-null    object
 4   EmpJobRole                    960 non-null    object
 5   BusinessTravelFrequency       960 non-null    object
 6   DistanceFromHome              960 non-null    int64 
 7   EmpEducationLevel             960 non-null    int64 
 8   EmpEnvironmentSatisfaction    960 non-null    int64 
 9   EmpHourlyRate                 960 non-null    int64 
 10  EmpJobInvolvement             960 non-null    int64 
 11  EmpJobLevel                   960 non-null    int64 
 12  NumCompaniesWorked            960 non-null    int64 
 13  OverTime            

In [115]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Unnamed: 0                    240 non-null    int64 
 1   Age                           240 non-null    int64 
 2   MaritalStatus                 240 non-null    object
 3   EmpDepartment                 240 non-null    object
 4   EmpJobRole                    240 non-null    object
 5   BusinessTravelFrequency       240 non-null    object
 6   DistanceFromHome              240 non-null    int64 
 7   EmpEducationLevel             240 non-null    int64 
 8   EmpEnvironmentSatisfaction    240 non-null    int64 
 9   EmpHourlyRate                 240 non-null    int64 
 10  EmpJobInvolvement             240 non-null    int64 
 11  EmpJobLevel                   240 non-null    int64 
 12  NumCompaniesWorked            240 non-null    int64 
 13  OverTime            

In [116]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,Age,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,NumCompaniesWorked,OverTime,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition,PerformanceRating
0,437,28,Single,Sales,Sales Executive,Travel_Frequently,7,3,3,55,3,2,0,No,14,4,3,2,1,No,3
1,1091,25,Single,Sales,Sales Executive,Travel_Rarely,4,2,2,99,2,2,1,Yes,11,2,3,4,1,No,2
2,327,25,Single,Research & Development,Research Scientist,Travel_Rarely,1,3,4,40,3,1,1,No,18,4,2,2,2,No,3
3,576,31,Married,Sales,Sales Executive,Travel_Rarely,5,3,1,51,3,2,1,No,19,3,3,2,0,No,4
4,1078,30,Married,Sales,Sales Representative,Travel_Rarely,2,1,3,72,3,1,1,No,18,1,3,0,0,No,3


In [117]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,Age,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,NumCompaniesWorked,OverTime,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition,PerformanceRating
0,811,35,Married,Development,Business Analyst,Travel_Rarely,23,4,3,30,3,1,3,Yes,15,3,3,2,2,No,3
1,1149,26,Single,Development,Developer,Travel_Rarely,24,3,3,66,1,1,1,Yes,18,2,1,0,0,Yes,3
2,662,36,Married,Sales,Sales Executive,Travel_Rarely,17,2,3,33,2,2,2,No,16,3,1,2,1,No,2
3,542,53,Married,Finance,Finance Manager,Travel_Rarely,24,4,2,48,4,3,3,No,15,3,3,3,1,No,2
4,858,34,Divorced,Development,Business Analyst,Travel_Rarely,6,4,3,45,2,2,6,No,15,3,3,0,0,No,3


The `Unnamed: 0` column holds the row index carried over from the raw data, so it is excluded from the features.
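A column with this name typically appears when a CSV was written with the DataFrame index included, which is pandas' default. The sketch below uses a made-up two-row frame (not the project data) to show how the column arises and how `index_col=0` would restore it as the index on read. Whether the processed files were produced this way is an assumption.

```python
import io
import pandas as pd

# Toy frame standing in for the processed data (values are illustrative only)
toy = pd.DataFrame({"Age": [28, 25], "PerformanceRating": [3, 2]}, index=[437, 1091])

# to_csv writes the index as an unnamed first column by default
csv_text = toy.to_csv()

# Reading it back without options yields an "Unnamed: 0" column
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores that column as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```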

#### Independent and Dependent variable creation

In [118]:
X_train = train_data.iloc[:,1:-1]# Independent variable
X_test = test_data.iloc[:,1:-1] # Independent variable
y_train = train_data.PerformanceRating # Dependent variable
y_test = test_data.PerformanceRating # Dependent variable
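The positional slice `iloc[:, 1:-1]` works only while the index column stays first and the target stays last. As a hedged alternative, the sketch below selects by column name; `split_features_target` is a hypothetical helper and the toy frame is illustrative, not the project data.

```python
import pandas as pd

TARGET = "PerformanceRating"

def split_features_target(df, target=TARGET, drop=("Unnamed: 0",)):
    """Return (X, y), selecting columns by name instead of by position."""
    X = df.drop(columns=[target, *drop], errors="ignore")
    y = df[target]
    return X, y

# Toy frame (made-up values) with a leading index column and a trailing target
toy = pd.DataFrame({
    "Unnamed: 0": [437, 1091],
    "Age": [28, 25],
    "OverTime": ["No", "Yes"],
    "PerformanceRating": [3, 2],
})
X, y = split_features_target(toy)
print(list(X.columns))
```

On a frame laid out like the one above, this returns the same features as `toy.iloc[:, 1:-1]`, but it keeps working if the column order changes.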

In [119]:
X_train.head()

Unnamed: 0,Age,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,NumCompaniesWorked,OverTime,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition
0,28,Single,Sales,Sales Executive,Travel_Frequently,7,3,3,55,3,2,0,No,14,4,3,2,1,No
1,25,Single,Sales,Sales Executive,Travel_Rarely,4,2,2,99,2,2,1,Yes,11,2,3,4,1,No
2,25,Single,Research & Development,Research Scientist,Travel_Rarely,1,3,4,40,3,1,1,No,18,4,2,2,2,No
3,31,Married,Sales,Sales Executive,Travel_Rarely,5,3,1,51,3,2,1,No,19,3,3,2,0,No
4,30,Married,Sales,Sales Representative,Travel_Rarely,2,1,3,72,3,1,1,No,18,1,3,0,0,No


In [120]:
X_test.head()

Unnamed: 0,Age,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,NumCompaniesWorked,OverTime,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition
0,35,Married,Development,Business Analyst,Travel_Rarely,23,4,3,30,3,1,3,Yes,15,3,3,2,2,No
1,26,Single,Development,Developer,Travel_Rarely,24,3,3,66,1,1,1,Yes,18,2,1,0,0,Yes
2,36,Married,Sales,Sales Executive,Travel_Rarely,17,2,3,33,2,2,2,No,16,3,1,2,1,No
3,53,Married,Finance,Finance Manager,Travel_Rarely,24,4,2,48,4,3,3,No,15,3,3,3,1,No
4,34,Divorced,Development,Business Analyst,Travel_Rarely,6,4,3,45,2,2,6,No,15,3,3,0,0,No


In [121]:
y_train

0      3
1      2
2      3
3      4
4      3
      ..
955    3
956    3
957    3
958    3
959    4
Name: PerformanceRating, Length: 960, dtype: int64

In [122]:
y_test

0      3
1      3
2      2
3      2
4      3
      ..
235    2
236    3
237    3
238    3
239    2
Name: PerformanceRating, Length: 240, dtype: int64

In [123]:
X_train.shape,X_test.shape

((960, 19), (240, 19))

In [124]:
y_train.value_counts(),y_test.value_counts()

(3    699
 2    155
 4    106
 Name: PerformanceRating, dtype: int64,
 3    175
 2     39
 4     26
 Name: PerformanceRating, dtype: int64)
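
The counts are easier to compare across splits as proportions. The sketch below recomputes them from the counts printed above; the proportion view is an addition for illustration and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
train_counts = pd.Series({3: 699, 2: 155, 4: 106}, name="PerformanceRating")
test_counts = pd.Series({3: 175, 2: 39, 4: 26}, name="PerformanceRating")

# Share of each rating per split, side by side
shares = pd.DataFrame({
    "train": train_counts / train_counts.sum(),
    "test": test_counts / test_counts.sum(),
}).round(3)
print(shares)
```

Rating 3 accounts for roughly 73% of both splits, so the classes are imbalanced, and the similar shares in train and test are consistent with a stratified split.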

#### Importing trained models and predicting results

In [104]:
model = ['Logistic_Regression',
         'Support Vector Classifier_1',
         'Support_Vector_Classifier_2',
         'Decision_Tree_Classifier',
         'Random_Forest_Classifier',
         'AdaBoost_Classifier',
         'Gradient_Boosting_Classifier',
         'Stacking_Classifier']

# Creating results dataframes to store results (copies, so X_train / X_test stay unchanged)
train_results_df = X_train.copy()
train_results_df['Actual_PerformanceRating'] = train_data.PerformanceRating

test_results_df = X_test.copy()
test_results_df['Actual_PerformanceRating'] = test_data.PerformanceRating

# Trained models are generated from src -> models -> train_model
# Predicting the results for all trained models and storing them in the results dataframes
for i in model:
    with open(processed_file_location + i + '_trained_model.pkl', 'rb') as f:
        pickled_model = pickle.load(f)
    train_results_df['Predicted_PerformanceRating_'+ i] = pickled_model.predict(X_train)
    test_results_df['Predicted_PerformanceRating_'+ i] = pickled_model.predict(X_test)
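
When a results frame is built from a feature frame, it matters whether the two names share one object. Plain assignment binds a second name to the same DataFrame, so any column added through one name also appears under the other, and the feature matrix passed to `predict` grows with each added column. A minimal sketch with a toy frame (illustrative values only):

```python
import pandas as pd

features = pd.DataFrame({"Age": [28, 25]})

# Plain assignment: both names refer to the same DataFrame
alias = features
alias["Actual_PerformanceRating"] = [3, 2]
print(list(features.columns))  # the new column is visible through `features` too

# .copy() gives an independent frame, leaving the feature matrix untouched
features = pd.DataFrame({"Age": [28, 25]})
results = features.copy()
results["Actual_PerformanceRating"] = [3, 2]
print(list(features.columns))
```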

In [105]:
# Displaying the result for Training data prediction
train_results_df

Unnamed: 0,Age,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,NumCompaniesWorked,OverTime,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition,Actual_PerformanceRating,Predicted_PerformanceRating_Logistic_Regression,Predicted_PerformanceRating_Support Vector Classifier_1,Predicted_PerformanceRating_Support_Vector_Classifier_2,Predicted_PerformanceRating_Decision_Tree_Classifier,Predicted_PerformanceRating_Random_Forest_Classifier,Predicted_PerformanceRating_AdaBoost_Classifier,Predicted_PerformanceRating_Gradient_Boosting_Classifier,Predicted_PerformanceRating_Stacking_Classifier
0,28,Single,Sales,Sales Executive,Travel_Frequently,7,3,3,55,3,2,0,No,14,4,3,2,1,No,3,3,3,3,3,3,3,3,3
1,25,Single,Sales,Sales Executive,Travel_Rarely,4,2,2,99,2,2,1,Yes,11,2,3,4,1,No,2,2,2,2,2,2,2,2,2
2,25,Single,Research & Development,Research Scientist,Travel_Rarely,1,3,4,40,3,1,1,No,18,4,2,2,2,No,3,4,3,3,3,3,3,3,3
3,31,Married,Sales,Sales Executive,Travel_Rarely,5,3,1,51,3,2,1,No,19,3,3,2,0,No,4,3,4,3,2,4,3,4,4
4,30,Married,Sales,Sales Representative,Travel_Rarely,2,1,3,72,3,1,1,No,18,1,3,0,0,No,3,3,3,3,3,3,3,3,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
955,51,Single,Research & Development,Laboratory Technician,Travel_Frequently,6,2,2,40,2,1,0,No,14,2,2,0,7,No,3,2,3,3,2,2,3,3,3
956,42,Married,Development,Developer,Travel_Rarely,2,4,3,81,3,2,1,No,13,2,3,2,2,No,3,3,3,3,3,3,3,3,3
957,31,Married,Sales,Sales Representative,Travel_Rarely,7,4,2,41,2,1,3,No,15,2,4,7,5,No,3,2,2,2,3,2,3,3,2
958,33,Married,Development,Developer,Travel_Frequently,1,4,3,84,4,2,1,Yes,13,1,3,5,1,No,3,3,3,3,3,3,3,3,3


In [106]:
# Displaying the result for Testing data prediction
test_results_df

Unnamed: 0,Age,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,NumCompaniesWorked,OverTime,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition,Actual_PerformanceRating,Predicted_PerformanceRating_Logistic_Regression,Predicted_PerformanceRating_Support Vector Classifier_1,Predicted_PerformanceRating_Support_Vector_Classifier_2,Predicted_PerformanceRating_Decision_Tree_Classifier,Predicted_PerformanceRating_Random_Forest_Classifier,Predicted_PerformanceRating_AdaBoost_Classifier,Predicted_PerformanceRating_Gradient_Boosting_Classifier,Predicted_PerformanceRating_Stacking_Classifier
0,35,Married,Development,Business Analyst,Travel_Rarely,23,4,3,30,3,1,3,Yes,15,3,3,2,2,No,3,3,3,3,3,3,3,3,3
1,26,Single,Development,Developer,Travel_Rarely,24,3,3,66,1,1,1,Yes,18,2,1,0,0,Yes,3,3,3,3,3,3,3,3,3
2,36,Married,Sales,Sales Executive,Travel_Rarely,17,2,3,33,2,2,2,No,16,3,1,2,1,No,2,3,3,3,3,3,3,3,3
3,53,Married,Finance,Finance Manager,Travel_Rarely,24,4,2,48,4,3,3,No,15,3,3,3,1,No,2,2,2,2,2,2,3,2,2
4,34,Divorced,Development,Business Analyst,Travel_Rarely,6,4,3,45,2,2,6,No,15,3,3,0,0,No,3,3,3,3,3,3,3,3,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,58,Married,Research & Development,Manager R&D,Travel_Frequently,15,4,1,87,3,4,2,Yes,14,2,3,2,2,No,2,2,2,2,2,2,3,3,2
236,30,Divorced,Sales,Sales Executive,Travel_Rarely,15,2,3,94,2,3,2,No,11,1,3,7,1,No,3,3,3,3,3,3,3,3,3
237,33,Single,Development,Developer,Travel_Rarely,13,1,2,53,3,1,3,No,18,1,3,2,0,No,3,3,3,3,3,3,3,3,3
238,27,Single,Sales,Sales Executive,Travel_Frequently,8,3,4,37,3,3,1,No,15,4,3,8,1,No,3,3,3,3,3,3,3,3,3
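
With the actual and predicted ratings side by side, a per-model accuracy can be read off the results frame directly. The sketch below assumes the column naming used above; `accuracy_by_model` is a hypothetical helper, and the toy values are made up rather than taken from the project results.

```python
import pandas as pd

def accuracy_by_model(results, actual="Actual_PerformanceRating",
                      prefix="Predicted_PerformanceRating_"):
    """Fraction of rows where each prediction column matches the actual rating."""
    pred_cols = [c for c in results.columns if c.startswith(prefix)]
    acc = {c[len(prefix):]: (results[c] == results[actual]).mean() for c in pred_cols}
    return pd.Series(acc, name="accuracy").sort_values(ascending=False)

# Toy results frame (made-up values, two of the model names from the list above)
toy = pd.DataFrame({
    "Actual_PerformanceRating": [3, 2, 3, 4],
    "Predicted_PerformanceRating_Logistic_Regression": [3, 2, 4, 3],
    "Predicted_PerformanceRating_Random_Forest_Classifier": [3, 2, 3, 3],
})
acc = accuracy_by_model(toy)
print(acc)
```

Calling `accuracy_by_model(test_results_df)` would give the same summary for the real predictions. Because the classes are imbalanced, accuracy alone can flatter a model that mostly predicts rating 3.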


#### Exporting the results

In [107]:
train_results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 28 columns):
 #   Column                                                    Non-Null Count  Dtype 
---  ------                                                    --------------  ----- 
 0   Age                                                       960 non-null    int64 
 1   MaritalStatus                                             960 non-null    object
 2   EmpDepartment                                             960 non-null    object
 3   EmpJobRole                                                960 non-null    object
 4   BusinessTravelFrequency                                   960 non-null    object
 5   DistanceFromHome                                          960 non-null    int64 
 6   EmpEducationLevel                                         960 non-null    int64 
 7   EmpEnvironmentSatisfaction                                960 non-null    int64 
 8   EmpHourlyRate                 

In [108]:
test_results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 28 columns):
 #   Column                                                    Non-Null Count  Dtype 
---  ------                                                    --------------  ----- 
 0   Age                                                       240 non-null    int64 
 1   MaritalStatus                                             240 non-null    object
 2   EmpDepartment                                             240 non-null    object
 3   EmpJobRole                                                240 non-null    object
 4   BusinessTravelFrequency                                   240 non-null    object
 5   DistanceFromHome                                          240 non-null    int64 
 6   EmpEducationLevel                                         240 non-null    int64 
 7   EmpEnvironmentSatisfaction                                240 non-null    int64 
 8   EmpHourlyRate                 

In [109]:
# Exporting the results
train_results_df.to_csv(processed_file_location + 'train_data_with_predicted_results.csv')
test_results_df.to_csv(processed_file_location + 'test_data_with_predicted_results.csv')
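
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. If the index is not needed downstream (an assumption about how these files are used), `index=False` leaves it out. The sketch below writes a toy frame to a temporary folder that stands in for the processed-data folder, and builds the path with `pathlib` instead of string concatenation.

```python
import tempfile
from pathlib import Path
import pandas as pd

toy = pd.DataFrame({"Age": [28, 25], "Actual_PerformanceRating": [3, 2]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # stand-in for the processed-data folder
    out_path = out_dir / "train_data_with_predicted_results.csv"
    # index=False omits the row index, so no extra column appears on reload
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.equals(toy))
```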

#### User-defined Performance Ratings Prediction

In [125]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Age                           240 non-null    int64 
 1   MaritalStatus                 240 non-null    object
 2   EmpDepartment                 240 non-null    object
 3   EmpJobRole                    240 non-null    object
 4   BusinessTravelFrequency       240 non-null    object
 5   DistanceFromHome              240 non-null    int64 
 6   EmpEducationLevel             240 non-null    int64 
 7   EmpEnvironmentSatisfaction    240 non-null    int64 
 8   EmpHourlyRate                 240 non-null    int64 
 9   EmpJobInvolvement             240 non-null    int64 
 10  EmpJobLevel                   240 non-null    int64 
 11  NumCompaniesWorked            240 non-null    int64 
 12  OverTime                      240 non-null    object
 13  EmpLastSalaryHikePer

In [126]:
# Provide appropriate data to predict the performance rating
input_data = []
input_data.append(int(input('Age : '))) 
print('\nMaritalStatus can be Single / Married / Divorced')
input_data.append(str(input('MaritalStatus  : '))) 
print('\nEmpDepartment can be Sales / Human Resources / Development / Data Science / Research & Development / Finance')
input_data.append(str(input('EmpDepartment  : ')))
print('\nEmpJobRole can be\nSales Executive / Manager / Developer / Sales Representative / Human Resources / \
Senior Developer / Data Scientist / Senior Manager R&D / Laboratory Technician / Manufacturing Director /\
Research Scientist / Healthcare Representative / Research Director / Manager R&D / Finance Manager /\
Technical Architect / Business Analyst / Technical Lead / Delivery Manager') 
input_data.append(str(input('EmpJobRole  : ')))
print('\nBusinessTravelFrequency can be Travel_Rarely / Travel_Frequently / Non-Travel')
input_data.append(str(input('BusinessTravelFrequency  : ')))
input_data.append(int(input('DistanceFromHome  : '))) 
input_data.append(int(input('EmpEducationLevel [1-5] : ')))
input_data.append(int(input('EmpEnvironmentSatisfaction [1-4]  : '))) 
input_data.append(int(input('EmpHourlyRate  : '))) 
input_data.append(int(input('EmpJobInvolvement [1-4] : ')))
input_data.append(int(input('EmpJobLevel [1-5] : '))) 
input_data.append(int(input('NumCompaniesWorked  : '))) 
input_data.append(str(input('OverTime  [Yes / No] : ')))
input_data.append(int(input('EmpLastSalaryHikePercent  : '))) 
input_data.append(int(input('EmpRelationshipSatisfaction [1-4] : ')))
input_data.append(int(input('EmpWorkLifeBalance [1-4] : '))) 
input_data.append(int(input('ExperienceYearsInCurrentRole  : ')))
input_data.append(int(input('YearsSinceLastPromotion  : '))) 
input_data.append(str(input('Attrition [Yes / No] : ')))
print('\n\n\n')

# Converting the numeric features back to integers before predicting the performance rating
# (np.array coerces the mixed inputs to strings)
input_df = pd.DataFrame(np.array(input_data).reshape(1,19),columns = X_test.columns)
for col in X_test.describe().columns:
    input_df[col] = input_df[col].astype('int')

# Predicting the performance rating with each trained model and printing the result
for i in model:
    with open(processed_file_location + i + '_trained_model.pkl', 'rb') as f:
        pickled_model = pickle.load(f)
    print(f'{i} predicted PerformanceRating : {int(pickled_model.predict(input_df)[0])}')

Age : 26

MaritalStatus can be Single / Married / Divorced
MaritalStatus  : Single

EmpDepartment can be Sales / Human Resources / Development / Data Science / Research & Development / Finance
EmpDepartment  : Sales

EmpJobRole can be
Sales Executive / Manager / Developer / Sales Representative / Human Resources / Senior Developer / Data Scientist / Senior Manager R&D / Laboratory Technician / Manufacturing Director /Research Scientist / Healthcare Representative / Research Director / Manager R&D / Finance Manager /Technical Architect / Business Analyst / Technical Lead / Delivery Manager
EmpJobRole  : Manager

BusinessTravelFrequency can be Travel_Rarely / Travel_Frequently / Non-Travel
BusinessTravelFrequency  : Travel_Rarely
DistanceFromHome  : 20
EmpEducationLevel [1-5] : 3
EmpEnvironmentSatisfaction [1-4]  : 2
EmpHourlyRate  : 25
EmpJobInvolvement [1-4] : 2
EmpJobLevel [1-5] : 3
NumCompaniesWorked  : 7
OverTime  [Yes / No] : Yes
EmpLastSalaryHikePercent  : 14
EmpRelationshipSatisf
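
The prompts list the allowed categories but do not enforce them, and routing the answers through `np.array` turns every value into a string. As a hedged alternative, the sketch below checks categorical answers against the allowed values and builds the one-row frame from a dict so each value keeps its own type. `build_input_row` and `ALLOWED` are hypothetical names, and only a subset of the 19 features is shown for brevity.

```python
import pandas as pd

# Allowed values copied from the prompts above (subset of the features)
ALLOWED = {
    "MaritalStatus": {"Single", "Married", "Divorced"},
    "BusinessTravelFrequency": {"Travel_Rarely", "Travel_Frequently", "Non-Travel"},
    "OverTime": {"Yes", "No"},
}

def build_input_row(values, allowed=ALLOWED):
    """Validate categorical entries and return a one-row DataFrame."""
    for col, choices in allowed.items():
        if col in values and values[col] not in choices:
            raise ValueError(f"{col} must be one of {sorted(choices)}, got {values[col]!r}")
    # A list of dicts keeps ints as ints and strings as strings, so no re-casting is needed
    return pd.DataFrame([values])

row = build_input_row({"Age": 26, "MaritalStatus": "Single",
                       "BusinessTravelFrequency": "Travel_Rarely", "OverTime": "Yes"})
print(row.dtypes)
```

A misspelled category then fails early with a clear message instead of reaching the model's encoder.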

### Summary

**Importing necessary libraries**
* pandas 
* numpy
* pickle to import trained models in the processed folder.

**Loading Train and Test data**
* Train and test data are loaded from data -> processed -> train_data.csv and test_data.csv

**Independent and Dependent variable creation**
* X_train and X_test are created as the independent variables, and y_train and y_test as the dependent variable

**Importing trained models and predicting results**
* The models below are imported using pickle from file location data -> processed -> `*_trained_model.pkl`
	* Logistic_Regression
	* Support Vector Classifier_1
	* Support_Vector_Classifier_2
	* Decision_Tree_Classifier
	* Random_Forest_Classifier
	* AdaBoost_Classifier
	* Gradient_Boosting_Classifier
	* Stacking_Classifier (LogisticRegression + SVC + RandomForestClassifier)
* Results are predicted with every trained model for both X_train and X_test and stored in dataframes.

**Exporting the results**
* Both train and test results are exported to the processed folder as CSV files using `to_csv`.

**User-defined Performance Ratings Prediction**
* Input is collected from the user, and the performance rating is predicted with every trained model in the processed folder.