<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
XGBoost using Python (sklearn):</p><br>
<p style="font-family: Arial; font-size:2.25em;color:green; font-style:bold"><br>
Kumar Rahul</p><br>

### We will be using HR data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the HR data to answer the questions below.

1.	Load the dataset in Jupyter Notebook using pandas
2.	Build a correlation matrix between all the numeric features in the dataset. Report the features that are correlated at a cut-off of 0.70. What actions will you take on the highly correlated features?
3.	Build a new feature named LOB_Hike_Offered using LOB and percentage hike offered. Include this as a part of the data frame created in step 1. What assumption are you trying to test with such variables?
4.	Create a new data frame with the numeric features and with the categorical features coded as dummy variables. Which features will you include for model building and why?
5.	Split the data into a training set and a test set. Use 80% of the data for model training and 20% for model testing.
6.	Build a model using Gender and Age as independent variables and Status as the dependent variable.
    >* Are Gender and Age significant features in this model?
    * What inferences can be drawn from this model? 
7.	Build a model using the sklearn package to predict the probability of Not Joining.


**PS: Not all the questions are being answered as a part of the same notebook. You are encouraged to answer the questions if you find them missing.**

**Exhibit 1**


|Sl.No.|Name of Variable|Variable Description|
|:-------|----------------|:--------------------|
|1	|Candidate reference number|	Unique number to identify the candidate|
|2	|DOJ extended|Binary variable identifying whether candidate asked for date of joining extension (Yes/No)|
|3	|Duration to accept the offer|	Number of days taken by the candidate to accept the offer (continuous variable)|
|4	|Notice period|	Notice period to be served in the parting company before candidate can join this company (continuous variable)|
|5	|Offered band|	Band offered to the candidate based on experience and performance in interview rounds (categorical variable labelled C0/C1/C2/C3/C4/C5/C6)|
|6	|Percentage hike (CTC) expected|	Percentage hike expected by the candidate (continuous variable)|
|7	|Percentage hike offered (CTC)| Percentage hike offered by the company (continuous variable)|
|8	|Percent difference CTC|	Percentage difference between offered and expected CTC (continuous variable)|
|9	|Joining bonus|	Binary variable indicating if joining bonus was given or not (Yes/No)|
|10	|Gender|	Gender of the candidate (Male/Female)|
|11	|Candidate source|	Source from which resume of the candidate was obtained (categorical variables with categories  Employee referral/Agency/Direct)|
|12	|REX (in years)|	Relevant years of experience of the candidate for the position offered (continuous variable)|
|13	|LOB|	Line of business for which offer was rolled out (categorical variable)|
|14	|DOB|	Date of birth of the candidate|
|15	|Joining location|	Company location for which offer was rolled out for candidate to join (categorical variable)|
|16	|Candidate relocation status|	Binary variable indicating whether candidate has to relocate from one city to another city for joining (Yes/No)|
|17 |HR status|	Final joining status of candidate (Joined/Not-Joined)|

***

# Code starts here

To identify the Python environment used by the kernel



In [None]:
import sys, os

sys.executable

Suppress the warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

We are going to use the libraries mentioned below for **data import, processing and visualization**. As we progress, we will use other specific libraries for model building and evaluation. 

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sn # visualization library based on matplotlib
import matplotlib.pyplot as plt # pyplot is the recommended plotting interface (pylab is discouraged)

#the output of plotting commands is displayed inline within Jupyter notebook
%matplotlib inline 


## Data Import and Manipulation

### 1. Importing a data set

_Give the correct path to the data_



Modify the `ast_node_interactivity` kernel option to see the value of multiple statements at once.

In [None]:
import os

os.getcwd()

#os.chdir()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
raw_df = pd.read_csv( "../HR_case/data/HR_Data_No_Missing_Value.csv", 
                        sep = ',', na_values = ['', ' '])

raw_df.columns = raw_df.columns.str.lower().str.replace(' ', '_')
raw_df.head()

In [None]:
#?pd.read_csv

Dropping SLNo and Candidate.Ref as these will not be used for any analysis or model building.

In [None]:
#?raw_df.drop()

In [None]:
if set(['slno','candidate_ref']).issubset(raw_df.columns):
    raw_df.drop(['slno','candidate_ref'],axis=1, inplace=True)
    
raw_df.head()


### 2. Structure of the dataset



In [None]:
raw_df.info()

In [None]:
raw_df.status.value_counts()
#raw_df.describe(include='all').transpose()
raw_df.describe().transpose()

To get help on the attributes of an object

In [None]:
#?raw_df.status.value_counts()

### 2. Summarizing the dataset
Create a new data frame that stores a copy of the raw data, so that the raw data stays intact for further manipulation if needed. The *dropna()* function is used for row-wise deletion of missing values: axis = 0 drops rows that contain missing values, while axis = 1 drops columns.


In [None]:
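# how='any' drops every row that has at least one missing value
# (recent pandas versions raise an error when how and thresh are both passed);
# .copy() makes filter_df independent of raw_df
filter_df = raw_df.dropna(axis=0, how='any').copy()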

list(filter_df.columns )

In [None]:
#?raw_df.dropna
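
The effect of the `axis` argument can be checked on a small toy frame. The sketch below is only an illustration: the frame `toy` and its columns `a` and `b` are hypothetical and are not part of the HR data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing value in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [4.0, 5.0, 6.0]})

# axis=0: drop the rows that contain a missing value
rows_dropped = toy.dropna(axis=0, how="any")

# axis=1: drop the columns that contain a missing value
cols_dropped = toy.dropna(axis=1, how="any")

print(rows_dropped.shape)          # two rows remain, both columns kept
print(list(cols_dropped.columns))  # only column "b" remains
```

Row-wise deletion keeps every feature but loses observations; column-wise deletion does the opposite.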

We will first start by printing the unique labels in categorical features

In [None]:
numerical_features = ['duration_to_accept_offer','notice_period','pecent_hike_expected_in_ctc',
                      'percent_hike_offered_in_ctc','percent_difference_ctc','rex_in_yrs','age']

categorical_features = ['doj_extended','offered_band','joining_bonus','candidate_relocate_actual',
                        'gender','candidate_source','lob','location','status']

for f in categorical_features:
    print("\nThe unique labels in {} is {}\n".format(f, filter_df[f].unique()))
    print("The values in {} is \n{}\n".format(f,  filter_df[f].value_counts()))


Looking at the feature **line of business**, it seems that *EAS, Healthcare and MMS* do not have enough observations and may be clubbed together.

In [None]:
# Club the three sparse lines of business into a single 'Others' level
filter_df['lob']=np.where(filter_df['lob'].isin(['EAS', 'Healthcare', 'MMS']), 'Others', filter_df['lob'])
filter_df.lob.value_counts()
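
The clubbing above names the sparse levels by hand. A minimal sketch of a count-based alternative follows; the helper `club_rare_levels`, the threshold `min_count` and the toy labels `ERS` and `INFRA` are hypothetical and chosen only for illustration.

```python
import pandas as pd

def club_rare_levels(series, min_count, other_label="Others"):
    """Replace every level seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare_levels = counts[counts < min_count].index
    # where() keeps the values for which the condition is True
    return series.where(~series.isin(rare_levels), other_label)

# Hypothetical line-of-business column with three sparse levels
toy_lob = pd.Series(["ERS"] * 5 + ["INFRA"] * 4 + ["EAS", "MMS", "Healthcare"])

clubbed = club_rare_levels(toy_lob, min_count=3)
print(clubbed.value_counts())
```

A threshold makes the rule reusable when the data is refreshed, at the cost of having to justify the chosen cut-off.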

We will use the **groupby** function of pandas to get deeper insights into the behaviour of people **Joining** or **Not Joining** the company. We will write a generic function to report the mean by any categorical variable.

In [None]:
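def group_by (categorical_features):
    # numeric_only=True restricts the mean to numeric columns (needed in pandas 2.x,
    # where averaging text columns raises an error)
    return filter_df.groupby(categorical_features).mean(numeric_only=True)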



In [None]:
group_by("doj_extended")
group_by("status")
group_by("location")

### 3. Visualizing the Data

Plots can be drawn using the plotting functions of the

>1. pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html),
2. matplotlib library (https://matplotlib.org/), or
3. seaborn library (https://seaborn.pydata.org/), which is based on matplotlib and provides an interface for drawing attractive statistical graphics.

#### 3a. Visualizing the Data using pandas

In [None]:
def hist_plot(data, group_by, xlabel,ylabel):
    pd.crosstab(data,group_by).plot(kind='density')
    plt.xlabel(xlabel, size = 14)
    plt.ylabel(ylabel, size = 14)
    plt.title('Plot', size = 18)
    plt.grid(True)
    x1,x2,y1,y2 = plt.axis()
    plt.axis((0,x2,y1,y2))
    plt.show()
    #plt.subplot(1, 2)

In [None]:
numerical_features_set = ['duration_to_accept_offer','notice_period','age']
categorical_features_set = ['offered_band','gender','status']

for c in categorical_features_set:
    for n in numerical_features_set:
        hist_plot(filter_df[n], filter_df[c], n,c)

## Model Building: 

### Dummy Variable coding

Remove the response variable from the dataset


In [None]:
X_features = list(filter_df.columns)
X_features.remove('status')
X_features.remove('pecent_hike_expected_in_ctc')
X_features.remove('percent_hike_offered_in_ctc')
X_features.remove('candidate_relocate_actual')

In [None]:
X_features

In [None]:
categorical_features = ['doj_extended','offered_band','joining_bonus','gender','candidate_source','lob','location']

In [None]:
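# dtype=int gives 0/1 columns (newer pandas versions default to True/False)
encoded_X_df = pd.get_dummies( filter_df[X_features], columns = categorical_features, drop_first = False, dtype = int )
encoded_Y_df = pd.get_dummies( filter_df['status'], drop_first=False, dtype = int )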

In [None]:
#?pd.get_dummies
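
The effect of `drop_first` is easiest to see on a toy frame. The sketch below is illustrative only; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric feature
toy = pd.DataFrame({"gender": ["Male", "Female", "Male"],
                    "age": [30, 25, 41]})

# drop_first=False keeps one dummy column per level
full = pd.get_dummies(toy, columns=["gender"], drop_first=False, dtype=int)

# drop_first=True drops the first level, which becomes the reference category
reduced = pd.get_dummies(toy, columns=["gender"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Tree-based models such as XGBoost can work with the full set of dummies, which is why the notebook keeps `drop_first = False`; linear models usually need the reduced coding to avoid perfectly collinear columns.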

In [None]:
pd.options.display.max_columns = None
encoded_X_df.info()

In [None]:
Y = encoded_Y_df.filter(['Joined'], axis =1)
X = encoded_X_df
Y.head()

### Train and test data split using Python

The train and test split can also be done using the **sklearn module**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 42)
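
The split above is purely random. When the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both sets. The sketch below shows this option on hypothetical toy data; it is not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 positives and 20 negatives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(y_tr.mean(), y_te.mean())  # share of positives in each split
```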

## Model Building: Using the **xgboost** 



To install this package with conda run the following:

* conda install -c conda-forge xgboost

If the above does not work try below one:

* conda install -c conda-forge/label/gcc7 xgboost
* conda install -c conda-forge/label/cf201901 xgboost

In [None]:
import xgboost as xgb

from scipy.stats import uniform, randint

#from xgboost.sklearn import XGBClassifier

In [None]:
xgb.XGBClassifier?

## XGBoost parameters

The overall parameters have been divided into 3 categories by XGBoost authors:

* General Parameters: Guide the overall functioning
* Booster Parameters: Guide the individual booster (tree/regression) at each step
* Learning Task Parameters: Guide the optimization performed

### General Parameters
These define the overall functionality of XGBoost.

1. booster [default=gbtree]
       Select the type of model to run at each iteration. The options are:
           gbtree: tree-based models
           gblinear: linear models
           dart: tree-based models with dropout
2. verbosity [default=1] (replaces the deprecated silent parameter)
       Controls how many messages are printed: 0 (silent), 1 (warning), 2 (info), 3 (debug). It is generally good to keep some messages on, as they might help in understanding the model.
3. nthread [defaults to the maximum number of threads available if not set] (n_jobs in the sklearn wrapper)
        This is used for parallel processing, and the number of cores to use should be entered. If you wish to run on all cores, leave the value unset and the algorithm will detect them automatically.

### Booster Parameters
There are two main types of boosters (tree and linear). Below are the parameters of the tree booster, which usually outperforms the linear booster; the latter is therefore rarely used.

1. learning_rate or eta [default=0.3 in the native API; older sklearn wrappers used 0.1]
        * Analogous to learning rate in GBM. 
        * Makes the model more robust by shrinking the weights on each step. 
        * Typical final values to be used: 0.01-0.2
2. min_child_weight [default=1]
        * Defines the minimum sum of weights of all observations required in a child. 
        * This is similar to min_samples_leaf in GBM but not exactly. This refers to the minimum “sum of weights” of observations, while GBM has a minimum “number of observations”.
        * Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
        * Values that are too high can lead to under-fitting; hence, it should be tuned using CV.
3. max_depth [default=6 in the native API; older sklearn wrappers used 3]
        * The maximum depth of a tree, same as GBM.
        * Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
        * Should be tuned using CV.
        * Typical values: 3-10
4. max_leaf_nodes (named max_leaves in XGBoost)
        * The maximum number of terminal nodes or leaves in a tree.
        * Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
        * If this is defined, GBM will ignore max_depth.
5. gamma [default=0]
        * A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
        * Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
6. max_delta_step [default=0]
        * The maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
        * Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
        * This is generally not used but you can explore further if you wish.
7. subsample [default=1]
        * Denotes the fraction of observations to be randomly sampled for each tree.
        * Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
        * Typical values: 0.5-1
8. colsample_bytree [default=1]
        * Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
        * Typical values: 0.5-1
9. colsample_bylevel [default=1]
        * Denotes the subsample ratio of columns for each split, in each level.
        * May not be needed because subsample and colsample_bytree will do the job.
10. reg_alpha [default=0]
        * L1 regularization term on weight (analogous to Lasso regression)
        * Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
11. reg_lambda [default=1]
        * L2 regularization term on weights (analogous to Ridge regression)
        * This is used to handle the regularization part of XGBoost. 
        * It should be explored to reduce overfitting.

12. scale_pos_weight [default=1]
        * Controls the balance of positive and negative weights. In case of high class imbalance, a typical value to consider is sum(negative instances) / sum(positive instances).
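
The roles of gamma, reg_lambda, reg_alpha and min_child_weight listed above can be read off the regularized objective that XGBoost minimizes. The sketch below follows the notation of the XGBoost documentation; the symbols are introduced here for illustration and do not appear elsewhere in this notebook: $T$ is the number of leaves of a tree, $w_j$ the weight of leaf $j$, and $G$, $H$ the sums of first and second derivatives of the loss over the observations in a node.

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|$$

With reg_alpha set to 0, the gain of splitting a node into a left child $L$ and a right child $R$ is

$$\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

A split is kept only when this gain is positive, so a larger gamma removes more splits, a larger reg_lambda shrinks every term inside the brackets, and min_child_weight requires both $H_L$ and $H_R$ to reach the stated minimum.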
        

### Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each step.

1. objective [default=reg:squarederror, formerly named reg:linear; XGBClassifier uses binary:logistic]
        * This defines the loss function to be minimized. Mostly used values are:
                binary:logistic –logistic regression for binary classification, returns predicted probability (not class)
                multi:softmax –multiclass classification using the softmax objective, returns predicted class (not probabilities). you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
                multi:softprob –same as softmax, but returns predicted probability of each data point belonging to each class.
2. seed [default=0] (random_state in the sklearn wrapper)
       * The random number seed.
       * Can be used for generating reproducible results and also for parameter tuning.
       

More details about the parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

## Implement XGBoost with RandomizedSearchCV

To understand the details of parallel processing while using RandomizedSearchCV, refer to: https://stackoverflow.com/questions/32673579/scikit-learn-general-question-about-parallel-computing

In [None]:
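# Create the random grid
# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low to high - 1
random_grid = { "colsample_bytree": uniform(0.7, 0.3), # 0.7 to 1.0
               "gamma": uniform(0, 0.5), # 0 to 0.5
               "learning_rate": uniform(0.03, 0.3), # 0.03 to 0.33
               "max_depth": randint(2, 6), # 2 to 5
               "n_estimators": randint(100, 150), # 100 to 149, default 100
               "subsample": uniform(0.6, 0.4)} # 0.6 to 1.0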

random_grid
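
The two arguments of `scipy.stats.uniform` are a start point and a width, not a lower and an upper bound, which is easy to misread. The short check below draws samples from two of the distributions used in the grid; the sample size and the seed are arbitrary choices made for illustration.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers the interval [loc, loc + scale]
colsample_draws = uniform(0.7, 0.3).rvs(size=1000, random_state=0)

# randint(low, high) covers the integers low, ..., high - 1
depth_draws = randint(2, 6).rvs(size=1000, random_state=0)

print(colsample_draws.min(), colsample_draws.max())
print(sorted(set(depth_draws.tolist())))
```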

In [None]:
from sklearn.model_selection import RandomizedSearchCV

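To report the performance on the selected KPI, use `sklearn.metrics.get_scorer_names()` to list all the available metrics and pass the relevant one to `RandomizedSearchCV` or `GridSearchCV`. Older scikit-learn releases exposed the same list as `sklearn.metrics.SCORERS.keys()`, which has since been removed.

In [None]:
from sklearn.metrics import get_scorer_names

get_scorer_names()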

### Handle Imbalanced Dataset

In common cases such as ad click-through logs, the dataset is extremely imbalanced. This can affect the training of an XGBoost model, and there are two ways to improve it.

1. If you care only about the overall performance metric (AUC) of your prediction

        * Balance the positive and negative weights via scale_pos_weight
        * Use AUC for evaluation

2. If you care about predicting the right probability

        * In such a case, you cannot re-balance the dataset
        * Set parameter max_delta_step to a finite number (say 1) to help convergence
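
For the first option, the XGBoost documentation suggests the ratio of negative to positive instances as a typical value for scale_pos_weight. A minimal sketch of that calculation on a hypothetical label vector follows; in this notebook the positive class (1) is "Joined".

```python
import numpy as np

# Hypothetical labels: 80 positives (1) and 20 negatives (0)
y_toy = np.array([1] * 80 + [0] * 20)

n_pos = int((y_toy == 1).sum())
n_neg = int((y_toy == 0).sum())

# Typical starting value: sum(negative instances) / sum(positive instances)
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)
```

A value below 1 down-weights a positive class that is in the majority, while a value above 1 up-weights a rare positive class.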

In [None]:
# Use the random grid to search for the best hyperparameters
from sklearn.model_selection import RandomizedSearchCV

# scale_pos_weight=1 leaves the two classes equally weighted
xgb_model = xgb.XGBClassifier(scale_pos_weight=1)

# Random search over the grid: 50 sampled candidates, 3-fold cross-validation,
# scored on precision; n_jobs=-2 uses all CPU cores but one
xg_best_model = RandomizedSearchCV(estimator = xgb_model, 
                                   param_distributions = random_grid, scoring = "precision",
                                   n_iter = 50, cv = 3, verbose=2, 
                                   return_train_score=True,random_state=42, n_jobs = -2, pre_dispatch =2)

# Fit the random search model
xg_best_model.fit(X_train, y_train.values.ravel())
xg_best_model

### Steps to fine tuning the parameters

In general, the fine tuning of all the parameters in one step can be time consuming and limited by the infrastructure. We can follow the below sequence to fine tune the parameters of xgboost:

1. Start with a fixed learning rate (say 0.1) and a fixed number of trees (n_estimators = 500)

Iteration 1: 
* max_depth
* min_child_weight

Iteration 2: Keeping the parameter values from above iterations.
* gamma

**Note: the two steps above are critical for resolving model overfitting.**

Iteration 3: Keeping the parameter values from above iterations.
* subsample:
* colsample_bytree

Note that subsample and colsample_bytree add randomness, which makes training robust to noise (another way to deal with overfitting). You can also reduce the step size learning_rate/eta to deal with overfitting. Remember to increase num_round (n_estimators in the sklearn wrapper) when you do so.

Iteration 4: Keeping the parameter values from above iterations.
* reg_alpha


2. At the end, with all the other parameters fixed, tune learning_rate and n_estimators using cross validation.

### Report the parameter

The best model has the following parameters, selected from the random search grid:

In [None]:
xg_best_model.best_params_

xg_best_model.best_estimator_

xg_best_model.best_score_

### Grid Search with Cross Validation

Random search allows us to narrow down the range for each hyperparameter, so we know where to concentrate the search when fine-tuning the model further.

`GridSearchCV` is a method that, instead of sampling randomly from a distribution, evaluates all the combinations that are defined. It can be imported with `from sklearn.model_selection import GridSearchCV`.

You are encouraged to fine tune the model further using Grid Search
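
A minimal sketch of such a grid search is shown below. To keep it runnable without xgboost installed, it uses scikit-learn's `GradientBoostingClassifier` as a stand-in estimator and a synthetic dataset from `make_classification`; both are assumptions made for illustration. In this notebook the estimator would be `xgb.XGBClassifier`, the data would be `X_train` and `y_train`, and the grid values would be centred on `xg_best_model.best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the HR data)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

# A small grid: every combination is evaluated, so 2 x 2 = 4 candidates
param_grid = {"max_depth": [2, 3],
              "learning_rate": [0.05, 0.1]}

grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=42),
    param_grid=param_grid,
    scoring="precision",
    cv=3)
grid_search.fit(X_toy, y_toy)

n_candidates = len(grid_search.cv_results_["params"])
print(n_candidates, grid_search.best_params_)
```

Because the number of candidates is the product of the list lengths, it helps to keep each list short and to tune only a couple of parameters per iteration.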

## Model Evaluation


### 1. The prediction on train data.

To predict the outcome on the **train set**
> * Use **predict** function of the model object 


In [None]:
# Make predictions using the training set
#pd.options.display.max_rows = None

predict_class_train_df = pd.DataFrame(xg_best_model.predict(X_train))
predict_class_train_df.head()

# Column 0 holds the probability of class 0, column 1 that of class 1 (Joined)
predict_prob_train_df = pd.DataFrame(xg_best_model.predict_proba(X_train))
predict_prob_train_df.head()

The output above shows that the predicted class is the one with the higher calculated probability.
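
The same rule can be reproduced by hand from a probability matrix. The sketch below uses hypothetical probabilities and also shows how a stricter cut-off on the class 1 probability changes the predicted class; the 0.6 cut-off is an arbitrary value chosen for illustration.

```python
import numpy as np

# Hypothetical predict_proba output: column 0 = P(class 0), column 1 = P(class 1)
toy_proba = np.array([[0.20, 0.80],
                      [0.70, 0.30],
                      [0.45, 0.55]])

# Default rule: pick the class with the higher probability
default_class = toy_proba.argmax(axis=1)

# Stricter rule: predict class 1 only when its probability is at least 0.6
strict_class = (toy_proba[:, 1] >= 0.6).astype(int)

print(default_class, strict_class)
```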

### 2. The prediction on test data.

The prediction can be carried out by **defining functions** as well. Below is one such instance wherein a function is defined and is used for prediction

In [None]:
def get_predictions ( test_class, model, test_data ):
    predicted_df = pd.DataFrame(model.predict(test_data))
    y_pred_df = pd.concat([test_class.reset_index(drop=True), predicted_df], axis =1)
    return y_pred_df

The Y column of the test set is labelled using a Python dictionary. This is done for the model that was built using dummy variable coding; the labels will be used to generate the confusion matrix later.

In [None]:
test_series = y_test
train_series = y_train

# Map the 0/1 dummy codes back to readable class labels
status_dict = {1:"Joined", 0:"Not Joined"}

# astype(int) keeps the mapping valid if the dummy column was created as boolean
class_test_df = test_series.astype(int).replace({'Joined':status_dict})
class_test_df.rename({'Joined': 'actual'}, axis='columns', inplace=True )

class_train_df = train_series.astype(int).replace({'Joined':status_dict})
class_train_df.rename({'Joined': 'actual'}, axis='columns', inplace=True )

class_test_df.actual.value_counts()
class_train_df.actual.value_counts()

In [None]:
predict_test_df = pd.DataFrame(get_predictions(class_test_df.actual, xg_best_model, X_test))
predict_test_df.rename(columns = {0:'prediction'}, inplace=True)

predict_test_df = predict_test_df.replace(dict(prediction=status_dict))
predict_test_df.head()

### 3. Confusion Matrix

We will build a confusion (classification) matrix using the **metrics** module of the **sklearn** package. We will also write a custom function to plot the matrix and use it for reporting the performance measures.

To understand the concept of micro average and macro average:

https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin
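
As a small numeric illustration of the difference, the sketch below computes the macro and micro averages of precision from a hypothetical confusion matrix; the counts are made up for the example.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predictions
toy_cm = np.array([[50, 10],
                   [5, 35]])

true_pos = np.diag(toy_cm)              # correct predictions per class
predicted_totals = toy_cm.sum(axis=0)   # number of predictions per class

# Macro average: compute precision per class, then take the plain mean
per_class_precision = true_pos / predicted_totals
macro_precision = per_class_precision.mean()

# Micro average: pool the counts of all classes before dividing
micro_precision = true_pos.sum() / predicted_totals.sum()

print(round(macro_precision, 4), round(micro_precision, 4))
```

The macro average treats both classes equally, while the micro average is dominated by the larger class; for single-label problems the micro-averaged precision equals the overall accuracy.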

#### 3a. Confusion Matrix using sklearn

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
print("The model with dummy variable coding output: ")
confusion_matrix(predict_test_df.actual, predict_test_df.prediction)
lg_reg_report = (classification_report(predict_test_df.actual, predict_test_df.prediction))
print(lg_reg_report)


#### 3b Confusion Matrix using generic function

In [None]:
def draw_cm( actual, predicted ):
    plt.figure(figsize=(9,9))
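    # Fix the label order explicitly so that it matches the tick labels below
    cm = metrics.confusion_matrix( actual, predicted, labels=["Joined", "Not Joined"] )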
    sn.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["Joined", "Not Joined"] , 
               yticklabels = ["Joined", "Not Joined"],cmap = 'Blues_r')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Classification Matrix Plot', size = 15);
    plt.show()

The classification matrix plot as reported with dummy variable coding is:

In [None]:
draw_cm( predict_test_df.actual, predict_test_df.prediction)

### 4. Performance Measure on the test set


In [None]:
def measure_performance (clasf_matrix):
    measure = pd.DataFrame({
                        'sensitivity': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)], 
                        'specificity': [round(clasf_matrix[1,1]/(clasf_matrix[1,0]+clasf_matrix[1,1]),2)],
                        'recall': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)],
                        'precision': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[1,0]),2)],
                        'overall_acc': [round((clasf_matrix[0,0]+clasf_matrix[1,1])/
                                              (clasf_matrix[0,0]+clasf_matrix[0,1]+clasf_matrix[1,0]+clasf_matrix[1,1]),2)]
                       })
    return measure
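
The function assumes that row 0 and column 0 of the matrix hold the positive class ("Joined" here, because sklearn sorts the labels alphabetically) and that rows are actual classes. The sketch below spells out the same formulas on a hypothetical matrix with named cells; the counts are made up for illustration.

```python
import numpy as np

# Hypothetical matrix: rows = actual, columns = predicted, order [Joined, Not Joined]
toy_cm = np.array([[50, 10],
                   [5, 35]])

tp, fn = toy_cm[0]   # actual Joined: predicted Joined / predicted Not Joined
fp, tn = toy_cm[1]   # actual Not Joined: predicted Joined / predicted Not Joined

sensitivity = tp / (tp + fn)        # same quantity as recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
overall_acc = (tp + tn) / toy_cm.sum()

print(round(sensitivity, 2), round(specificity, 2),
      round(precision, 2), round(overall_acc, 2))
```

Sensitivity and recall are two names for the same measure, which is why the two columns of the resulting data frame always agree.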

In [None]:
cm = metrics.confusion_matrix(predict_test_df.actual, predict_test_df.prediction)

lg_reg_metrics_df = pd.DataFrame(measure_performance(cm))
lg_reg_metrics_df

print( 'Total Accuracy sklearn: ',np.round( metrics.accuracy_score( predict_test_df.actual, predict_test_df.prediction ), 2 ))




#### End of Document

***
***
