### Dataset Description:

- employee_id : Unique ID for employee
- department : Department of employee
- region : Region of employment (unordered)
- education : Education Level
- gender : Gender of Employee
- recruitment_channel : Channel of recruitment for employee
- no_of_trainings : Number of other trainings completed in the previous year (soft skills, technical skills, etc.)
- age : Age of Employee
- previous_year_rating : Employee Rating for the previous year
- length_of_service : Length of service in years
- KPIs_met >80% : 1 if more than 80% of KPIs (Key Performance Indicators) were met, else 0
- awards_won? : 1 if an award was won during the previous year, else 0
- avg_training_score : Average score in current training evaluations
- is_promoted : (Target) Recommended for promotion

**In this kernel I use VotingClassifier as the ensemble technique. The base algorithms are XGBoost, LightGBM and CatBoost.**

### Importing Basic libraries:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:
# loading training data and reading top 5 records

df = pd.read_csv('../input/hr-analytics-analytics-vidya/train.csv')
df.head()

In [None]:
# Reading the bottom 5 records

df.tail()

In [None]:
print("There are {} rows and {} columns in the training dataset.".format(df.shape[0],df.shape[1]))

### Exploratory Data Analysis:

In [None]:
# To know the datatypes of the column

df.info()

In [None]:
print("There are {} duplicate records.".format(df.shape[0] - len(df['employee_id'].unique())))

In [None]:
# Dropping the employee_id column as it does not provide any predictive information

df.drop('employee_id',axis=1,inplace=True)

In [None]:
# Name of the columns

print("Column Names: {}".format(list(df.columns)))

In [None]:
# Column names into list

col_name = df.columns.to_list()

In [None]:
# To find the number of unique values and the unique values themselves for each column

for i in col_name:
    print("In the column - {}:".format(i))
    print("There are {0} unique values".format(len(df[i].unique())))
    print("Unique values in the column are - \n{}".format(list(df[i].unique())))
    print("")

In [None]:
df.info()

In [None]:
# Correlation Matrix

plt.figure(figsize=(10,5))
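# numeric_only=True skips the string columns, which would otherwise raise an error in recent pandas versions
sns.heatmap(df.corr(numeric_only=True), annot=True)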
plt.show()

- 'length_of_service' is highly correlated with 'age'
- 'KPIs_met >80%' is slightly correlated with 'previous_year_rating'

In [None]:
# Count of each value in every column
for i in col_name:
    plt.figure(figsize=(15,5))
    plt.title("Count of each value in column '{}'".format(i))
    # Pass the data as x=; recent seaborn versions no longer accept it positionally
    sns.countplot(x=df[i])
    plt.show()

In [None]:
# Pair plot

sns.pairplot(df)
plt.show()

##### Finding missing values and imputing it:

In [None]:
print("There are totally {} missing values in the dataset.".format(df.isnull().sum().sum()))

In [None]:
# Count of missing values in column

for i in col_name:
    if df[i].isnull().sum() > 0:
        print("There are {} missing values in the '{}' column.\n".format(df[i].isnull().sum(),i))

In [None]:
# Imputing missing values in the education column with forward fill (each gap takes the previous row's value)

df['education'] = df['education'].ffill()

In [None]:
# Value counts of "length_of_service" for rows where "previous_year_rating" is null

df[df["previous_year_rating"].isnull()]['length_of_service'].value_counts()

In [None]:
# Imputing missing "previous_year_rating" values with 0, as length of service is 1 wherever the rating is missing

df['previous_year_rating'] = df['previous_year_rating'].fillna(0.0)
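
Forward fill depends on row order and cannot fill a gap in the very first row. The sketch below contrasts it with mode imputation on a tiny made-up column. Mode imputation is shown only as a possible alternative, not as the method used in this kernel, and the `toy` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical education values with gaps, including one in the first row.
toy = pd.DataFrame({
    "education": [np.nan, "Bachelor's", np.nan, "Master's & above", "Bachelor's"]
})

# Forward fill: each gap copies the previous row; the leading gap stays empty.
ffilled = toy["education"].ffill()

# Mode imputation: every gap gets the most frequent value, independent of row order.
mode_filled = toy["education"].fillna(toy["education"].mode()[0])

print(ffilled.tolist())
print(mode_filled.tolist())
```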

### Feature Engineering:

In [None]:
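# Binning the age column
# Note: pd.cut uses right-closed intervals, so these bins are (20, 29], (29, 39], (39, 49];
# any age outside that range (exactly 20, or above 49) becomes NaN

df['age'] = pd.cut(x=df['age'], bins=[20, 29, 39, 49],
                   labels=['20 to 30', '30 to 40', '40+'])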

In [None]:
# Changing datatype 'category' to 'object'

df['age'] = df['age'].astype('object')
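
To make the bin edges concrete, here is a small sketch of how `pd.cut` treats boundary ages with the bins used above. The `ages` series is made up for illustration. The second call, with a lower edge of 19 and an open upper edge, is one possible way to cover every age; it is an assumption and not what this kernel does.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([20, 21, 29, 30, 39, 40, 49, 50, 60])
labels = ['20 to 30', '30 to 40', '40+']

# Same bins as in the kernel: (20, 29], (29, 39], (39, 49].
binned = pd.cut(ages, bins=[20, 29, 39, 49], labels=labels)
print(binned.tolist())

# Alternative edges (assumption): (19, 29], (29, 39], (39, inf) leave no age unbinned.
covered = pd.cut(ages, bins=[19, 29, 39, np.inf], labels=labels)
print(covered.tolist())
```

With the original edges, the ages 20, 50 and 60 fall outside every interval and come back as NaN.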

### Splitting train data into Predictors (Independent) & Target (Dependent):

In [None]:
X = df.drop('is_promoted',axis=1)
y = df['is_promoted']

### Data encoding using OneHot encoding technique:

In [None]:
X_encode = pd.get_dummies(X,drop_first=True)
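
One thing to watch with `pd.get_dummies` is that the output columns depend on which categories are present in the frame being encoded. The sketch below uses two small made-up frames (hypothetical values) to show the mismatch, and one common way to align a second frame to the training columns with `reindex`.

```python
import pandas as pd

# Hypothetical train/test frames; "HR" never appears in the test frame.
train = pd.DataFrame({"department": ["HR", "Sales", "Tech"], "age": [30, 41, 25]})
test = pd.DataFrame({"department": ["Sales", "Tech"], "age": [33, 52]})

train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True)
print(list(train_enc.columns))
print(list(test_enc.columns))  # differs: the first category present gets dropped

# Encode without drop_first, then keep exactly the training columns.
test_aligned = pd.get_dummies(test).reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```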

### Data scaling using RobustScaler:

In [None]:
from sklearn import preprocessing

scaler = preprocessing.RobustScaler()
X_standard = scaler.fit_transform(X_encode)
X_standard = pd.DataFrame(X_standard, columns=X_encode.columns)
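
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile), both learned in `fit`. The sketch below checks that against a manual calculation on a tiny made-up array with one outlier; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature data with one large outlier.
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
new = np.array([[3.0], [5.0]])

# The statistics are learned from the training data only...
toy_scaler = RobustScaler().fit(train)
# ...and then reused unchanged on new data.
scaled = toy_scaler.transform(new)

# Manual version: (x - median) / (Q3 - Q1), using the training statistics.
median = np.median(train, axis=0)
iqr = np.percentile(train, 75, axis=0) - np.percentile(train, 25, axis=0)
manual = (new - median) / iqr

print(scaled.ravel(), manual.ravel())
```

The outlier of 100 barely influences the median and IQR, which is the reason to prefer this scaler over mean and standard deviation scaling when outliers are present.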

### The train dataset is not divided with train_test_split, as splitting it gave a lower F1 score.
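
Without a held-out split there is no local estimate of the F1 score on unseen rows. One possible way to get such an estimate while still using every row for training is stratified k-fold cross-validation. The sketch below is not part of the original workflow: it uses synthetic data from `make_classification` and a plain `LogisticRegression` as a lightweight stand-in model, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (hypothetical), roughly 10% positives.
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stand-in model; any scikit-learn compatible classifier could be passed instead.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="f1")

print("F1 per fold:", scores)
print("Mean F1:", scores.mean())
```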

### Creating Baseline ML Models for the Binary Classification Problem:

In [None]:
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier


Classifiers = {'0._XGBoost' : XGBClassifier(learning_rate =0.1, n_estimators=500, max_depth=5,subsample = 0.70,
                                            verbosity = 0, scale_pos_weight = 2.5,updater ="grow_histmaker",
                                            base_score  = 0.2),
               
               '1.CatBoost' : CatBoostClassifier(learning_rate=0.15, n_estimators=500, subsample=0.085, max_depth=5,
                                                 scale_pos_weight=2.5),
               
               '2.LightGBM' : LGBMClassifier(subsample_freq = 2, objective ="binary",importance_type = "gain",verbosity = -1,
                                             max_bin = 60,num_leaves = 300, boosting_type = 'dart',learning_rate=0.15, 
                                             n_estimators=500, max_depth=5, scale_pos_weight=2.5)}

- Parameter values were chosen through tuning and trial and error.
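
All three models set `scale_pos_weight`, which up-weights the positive (promoted) class. A commonly suggested starting point for this parameter is the ratio of negative to positive samples, which can then be tuned, as was done here to arrive at 2.5. The sketch below computes that ratio on made-up labels; `y_toy` is hypothetical.

```python
import pandas as pd

# Hypothetical target with 90 negatives and 10 positives.
y_toy = pd.Series([0] * 90 + [1] * 10)

counts = y_toy.value_counts()
# Heuristic starting value: number of negatives divided by number of positives.
ratio = counts.loc[0] / counts.loc[1]
print(ratio)
```

On the real data, the same calculation on `y` gives a reference point to compare the tuned value against.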

### Improving the Model with a Voting Classifier (Evaluation Metric: F1) and Predicting the Target "is_promoted":

In [None]:
from sklearn.ensemble import VotingClassifier

vc_model = VotingClassifier(estimators=[('XGBoost_Best', Classifiers['0._XGBoost']),
                                        ('CatBoost_Best', Classifiers['1.CatBoost']),
                                        ('LightGBM_Best', Classifiers['2.LightGBM']),
                                       ],
                            voting='soft',weights=[2, 1, 3])

vc_model.fit(X_standard,y)

- The voting weights were chosen through tuning.
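
With `voting='soft'`, the ensemble takes a weighted average of each model's predicted class probabilities and picks the class with the highest average. The sketch below reproduces that arithmetic with NumPy for the weights `[2, 1, 3]`. The probabilities are made up for illustration and do not come from the fitted models.

```python
import numpy as np

# Hypothetical [P(class 0), P(class 1)] for three employees, one block per model.
probas = np.array([
    [[0.8, 0.2], [0.4, 0.6], [0.6, 0.4]],   # XGBoost
    [[0.7, 0.3], [0.3, 0.7], [0.2, 0.8]],   # CatBoost
    [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]],   # LightGBM
])
weights = [2, 1, 3]

# Weighted average over the model axis, then the most probable class per row.
weighted_avg = np.average(probas, axis=0, weights=weights)
weighted_pred = weighted_avg.argmax(axis=1)

# Same thing with equal weights, for comparison.
plain_pred = probas.mean(axis=0).argmax(axis=1)

print(weighted_avg)
print(weighted_pred, plain_pred)
```

For the second employee the two schemes disagree: two models lean towards class 1, but the most heavily weighted model leans towards class 0 and tips the weighted average.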

## Scoring:

In [None]:
# Loading test dataset

df1 = pd.read_csv('../input/hr-analytics-analytics-vidya/test.csv')
df1.head()

In [None]:
# Applying to the unseen data every step that was performed on the training data

df2 = df1.copy()

df1.drop('employee_id',axis=1,inplace=True)

df1['education'] = df1['education'].ffill()

df1['previous_year_rating'] = df1['previous_year_rating'].fillna(0.0)

df1['age'] = pd.cut(x=df1['age'], bins=[20, 29, 39, 49], labels=['20 to 30', '30 to 40', '40+']) 
df1['age'] = df1['age'].astype('object')

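# Encode, then align the columns with the encoded training columns
df1_encode = pd.get_dummies(df1,drop_first=True).reindex(columns=X_encode.columns, fill_value=0)

# Reuse the scaler fitted on the training data; refitting it on the test data would scale the two sets differently
df_standard = scaler.transform(df1_encode)
df_standard = pd.DataFrame(df_standard, columns=df1_encode.columns)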

### Predicting and storing the submission file:

In [None]:
df2['is_promoted'] = vc_model.predict(df_standard)

df1 = df2[['employee_id','is_promoted']]
# index=False keeps the row index out of the submission file
df1.to_csv('Predict19.csv', index=False)