***
# Week 8 Bank Data Case Study
MSDS 7333 Quantifying the World  
*Allison Roderick, Jenna Ford, and Will Arnost* 
***

## Table of Contents

<a href='#Section_1'> 1. Introduction </a>  
<a href='#Section_2'> 2. Question </a>  
<a href='#Section_3'> 3. Methods </a>  
<a href='#Section_3_a'> &nbsp;&nbsp;&nbsp; a. Dataset </a>  
<a href='#Section_3_b'> &nbsp;&nbsp;&nbsp; b. Dataset Preparation </a>  
<a href='#Section_3_c'> &nbsp;&nbsp;&nbsp; c. Weight of Evidence </a>  
<a href='#Section_4'> 4. Modeling </a>  
<a href='#Section_4_a'> &nbsp;&nbsp;&nbsp; a. Random Forest </a>  
<a href='#Section_4_b'> &nbsp;&nbsp;&nbsp; b. XGBoost </a>  
<a href='#Section_4_c'> &nbsp;&nbsp;&nbsp; c. SVM </a>  
<a href='#Section_5'> 5. Results </a>  
<a href='#Section_5_a'> &nbsp;&nbsp;&nbsp; a. Random Forest </a>  
<a href='#Section_5_b'> &nbsp;&nbsp;&nbsp; b. XGBoost </a>  
<a href='#Section_5_c'> &nbsp;&nbsp;&nbsp; c. SVM </a>  
<a href='#Section_6'> 6. Conclusion </a>  
<a href='#Section_7'> 7. References </a>  
<a href='#Section_8'> 8. Code </a>  

In [213]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

<a id = 'Section_1'></a>

## 1. Introduction

This week's case study involves binary classification. The dataset provided is banking data. No information about the dataset was provided and all column headings have been masked. Random Forest, XGBoost and SVM will be used to classify the 'target' variable found in the dataset.

<a id = 'Section_2'></a>

## 2. Question

1. Build 3 tuned models: An XGBoost, a Random Forest, and an SVM.
2. Show the log loss and accuracy for XGBoost and Random Forest models on out of fold predictions. Show the accuracy of the SVM on a validation set.
3. Time how long it takes to do a sample of 1000, 2000, 5000 and 10000 rows in the SVM. What is the rough scaling of SVM with sample size?

<a id = 'Section_3'></a>

## 3. Methods

This section will give an overview of what we know about the data and how we prepared the dataset for modeling.

<a id = 'Section_3_a'></a>

### 3a. Dataset

The dataset contains 114,321 observations and 133 columns (including an 'ID' column and the 'target' column). The following is a breakout of the columns by data type:

* Float: 108 columns
* Integer: 6 columns ('ID' and 'target' are included)
* Object: 19 columns

The next 2 sections detail how the dataset is prepared for modeling.

<a id = 'Section_3_b'></a>

### 3b. Dataset Preparation

The float columns appear to have been scaled already, mostly to a range of 0 to 20, although a few columns have a maximum below 20. To put every column on a common scale, all float variables are standardized to zero mean and unit variance.

For the 4 remaining integer columns (after removing 'ID' and 'target'), we suspect that the values may represent month of the year or day of the week. If that were the case, it might be appropriate to one-hot encode these variables. However, since we do not have any information about the dataset, we keep these variables numeric rather than one-hot encoding them; the preprocessing code in Section 8 standardizes them along with the float columns.

The object columns are all one-hot encoded, with the exception of 'v22'. 'v22' has 18,210 unique values, so one-hot encoding it would add a very large number of columns to the dataset. Instead, we use weight of evidence for this variable, as described in Section 3c. The weight of evidence column is not standardized like the other numeric columns because it is already a log-ratio on a comparable scale.

Once the columns have been prepared for modeling, we split the dataset into a training dataset with 80% of the data and a test dataset with the remaining 20%. 5-fold cross validation is used within the training dataset for Random Forest and XGBoost. Combining cross validation with a separate test dataset helps us check that the models are not overfitting.
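The split described above can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: the helper names `train_test_indices` and `k_fold_indices`, the seed, and the 100-row toy size are all assumptions made for the example. In practice, scikit-learn utilities such as `train_test_split` and `KFold` are the usual tools for this.

```python
import random

def train_test_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (training, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Hypothetical toy size; the real dataset is far larger
train_idx, test_idx = train_test_indices(100)
splits = list(k_fold_indices(train_idx, k=5))

for fold_number, (training, validation) in enumerate(splits, start=1):
    print(fold_number, len(training), len(validation))
```

Each training row lands in exactly one validation fold, so the out-of-fold predictions cover the whole training dataset while the 20% test dataset stays untouched until the final check.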

<a id = 'Section_3_c'></a>

### 3c. Weight of Evidence

Weight of Evidence (WOE) gives a categorical variable a numeric representation by comparing, for each category, the share of all events that fall in that category with the share of all non-events that fall in it. The formula is:

WOE = ln(Event % / Non-Event %)

Using this representation we can summarize v22, which has over 18K categories, using a single numeric column instead of using dummy variables.  
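As a rough illustration of the formula, the sketch below computes WOE for a small made-up categorical column. The category labels, the target values, and the helper name `weight_of_evidence` are invented for this example; it is not the implementation used in Section 8. Categories with no events or no non-events get a WOE of 0 here, which mirrors how the Section 8 code replaces infinite values with 0.

```python
import math
from collections import Counter

def weight_of_evidence(categories, targets):
    """Return {category: WOE}, where WOE = ln(event share / non-event share)."""
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe = {}
    for cat in set(categories):
        dist_event = events[cat] / total_events              # share of all events
        dist_non_event = non_events[cat] / total_non_events  # share of all non-events
        if dist_event == 0 or dist_non_event == 0:
            # The log is undefined or infinite here; fall back to 0 (assumption)
            woe[cat] = 0.0
        else:
            woe[cat] = math.log(dist_event / dist_non_event)
    return woe

# Toy data: 8 observations across 3 categories
cats = ["A", "A", "A", "B", "B", "B", "B", "C"]
y = [1, 1, 0, 1, 0, 0, 0, 1]
woe_by_category = weight_of_evidence(cats, y)

for cat in sorted(woe_by_category):
    print(cat, round(woe_by_category[cat], 4))
```

A positive value means the category holds a larger share of events than of non-events, and a negative value means the opposite.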

We did not find a standard Python package for WOE, so we used an implementation by Sundar Krishnan, found here: https://github.com/Sundar0989/WOE-and-IV/blob/master/WOE_IV.ipynb

Sundar also wrote a Medium article explaining weight of evidence, which you can read here: https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb

<a id = 'Section_4'></a>

## 4. Modeling

In this section, a brief overview is given of the three modeling techniques that will be used to classify the 'target' variable in the dataset.

<a id = 'Section_4_a'></a>

### 4a. Random Forest

<a id = 'Section_4_b'></a>

### 4b. XGBoost

XGBoost, which stands for eXtreme Gradient Boosting, is an optimized distributed gradient boosting library. It was initially released in March 2014 as part of the Distributed Machine Learning Community (DMLC). Since its release it has spread throughout the machine learning community rapidly due to its speed and performance. As such, it is a natural choice for a classification algorithm for this case study.

XGBoost creates trees, where each tree attempts to reduce the residuals from previous trees. Trees are produced until no improvement can be made (or until not enough improvement is made). Special attention needs to be given to the parameters used to tune the model and the results to ensure that overfitting does not occur.

5-fold cross validation is used during parameter tuning to identify the parameter values that obtain the best fit, without overfitting. Log loss is used as the evaluation metric. The resulting parameters are used in a model that is scored against a separate test dataset to obtain the accuracy of the winning model.
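For reference, the two metrics reported for this model can be written out in a few lines. This is a minimal sketch on made-up labels and probabilities; the function names and the toy values are assumptions for illustration, and the clipping constant `eps` is only there to keep the logarithm finite.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, cutoff=0.5):
    """Share of observations classified correctly at the given probability cutoff."""
    preds = [1 if p >= cutoff else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)

# Toy example: four observations
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]

print(round(log_loss(y_true, y_prob), 4))
print(accuracy(y_true, y_prob))
```

Log loss rewards well-calibrated probabilities, while accuracy only looks at which side of the cutoff each prediction falls on, so the two metrics can rank models differently.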

Below is a list of parameters and corresponding values that will be used to tune the XGBoost model for this case study:

* objective - 'binary:logistic', used for binary classification and outputs a probability. 0.5 is used as the cutoff for this case study.
* booster - 'gbtree', the default learner which is a tree ensemble. 
* eval_metric - 'logloss', the lower the log loss the better the resulting predictions
* tree_method - 'hist', a faster optimized approximate algorithm often used with large datasets
* max_depth - 6, 7, maximum depth of a tree. The larger this number the higher the likelihood of overfitting (default = 6).
* min_child_weight - 5, 6, minimum sum of instance weight needed in a leaf node. As this number increases, the likelihood of overfitting decreases because it encourages smaller trees (default = 1).
* subsample - 0.5, 0.6, ratio of observations sampled in each boosting iteration (default = 1, or all observations).
* colsample_bytree - 0.5, 0.6, ratio of columns to be sampled for each tree constructed (default = 1).
* eta - 0.01, 0.05, learning rate that controls how much the model is adjusted based on the error from each tree. This is a critical parameter to tune because a small number will increase the time it takes to run the model, but too large a number may result in not enough training. The learning rate is used to prevent overfitting (default = 0.3).
* gamma - 0, 0.1, minimum loss reduction required to make another split in the tree. The higher the value, the more conservative the model (default = 0).
* alpha - 0, L1 regularization on weights. The higher the value, the more conservative the model (default = 0). This parameter was not tuned initially. However, other values were tested to try and improve upon the winning model.
* lambda - 1, L2 regularization on weights. The higher the value, the more conservative the model (default = 1). This parameter was not tuned initially. However, other values were tested to try and improve upon the winning model (unsuccessfully).
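To make the size of the search concrete, the sketch below enumerates a full grid over the six tuned parameters, using the candidate values as they appear in Table 2 of Section 5b. The dictionary layout and variable names are assumptions for illustration and are not the authors' tuning code; `eval_metric` is the XGBoost parameter name for the evaluation metric.

```python
from itertools import product

# Candidate values for the tuned parameters (as they appear in Table 2)
param_grid = {
    "max_depth": [6, 7],
    "min_child_weight": [5, 6],
    "subsample": [0.5, 0.6],
    "colsample_bytree": [0.5, 0.6],
    "eta": [0.01, 0.05],
    "gamma": [0, 0.1],
}

# Settings held fixed during the grid search
fixed_params = {
    "objective": "binary:logistic",
    "booster": "gbtree",
    "eval_metric": "logloss",
    "tree_method": "hist",
    "alpha": 0,
    "lambda": 1,
}

names = list(param_grid)
combinations = [
    {**fixed_params, **dict(zip(names, values))}
    for values in product(*(param_grid[name] for name in names))
]

print(len(combinations))  # 2**6 = 64 candidate settings
```

Each dictionary could then be handed to a cross-validation routine such as `xgboost.cv`, together with the number of boosting rounds and early stopping rounds; the exact call used for this case study is not shown in this section.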

The number of boosting rounds was set to 499 because certain parameter settings kept the model from stopping early. We discovered that the learning rate of 0.01 was causing this. Using a learning rate of 0.05 resulted in early stopping in the upper 100s or lower 200s. For parameter tuning, we set early stopping rounds to 5, indicating that the model will stop after 5 boosting rounds of no improvement in the log loss against the training dataset. This helps control for overfitting and speeds up modeling.
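The early stopping rule can be illustrated with a simplified sketch. This is not XGBoost's internal code: the function name `early_stopping_round` and the loss values are made up, and the rule shown is simply "stop once the best loss has not improved for `patience` consecutive rounds".

```python
def early_stopping_round(losses, patience=5):
    """Return (best_round, stop_round) for a sequence of per-round losses.

    Training stops once `patience` consecutive rounds fail to improve
    on the best loss seen so far. If that never happens, the last
    round is returned as the stopping round.
    """
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(losses) - 1

# Made-up loss curve: improves until round 3, then stalls
toy_losses = [0.60, 0.55, 0.52, 0.50, 0.51, 0.505, 0.502, 0.503, 0.501, 0.52, 0.49]
best, stopped = early_stopping_round(toy_losses, patience=5)
print(best, stopped)
```

With a very small learning rate, each round tends to improve the loss by a small but steady amount, so a rule like this may never trigger before the round limit is reached. That is consistent with the 0.01 learning rate runs in this case study using all available boosting rounds.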

Note: We first attempted to use GradientBoostingClassifier from scikit-learn to model this data. However, without an explicit setting for boosting rounds and early stopping rounds, we struggled to adjust the parameters available in a manner to produce models in a reasonable time period. It was taking approximately 20-30 minutes to train a model (without grid searching parameters). Switching to XGBoost, where we could explicitly set the boosting rounds and early stopping rounds, enabled us to grid search on the parameters with greater success.

<a id = 'Section_4_c'></a>

### 4c. SVM

Support Vector Machines are supervised learning algorithms that can be applied to classification problems. Points in the dataset are mapped to a space, and the algorithm tries to separate the categories with the widest possible margin. SVM can use the ‘kernel trick’ to solve non-linear problems by implicitly projecting the data into higher dimensional spaces.

For our problem, we used the SVC implementation of SVM in sklearn. We search across 3 parameters: kernel, gamma, and C. Gamma is a parameter of the kernel function and determines how sensitive the model is to a single observation. C is the regularization parameter and determines how smooth the decision boundary is: a low value of C gives a smoother boundary, whereas a high value emphasizes classifying more training observations correctly. Due to long training times, we use a randomized search to test 12 different parameter combinations from our grid. We use 2-fold cross validation, then use the best parameters from the search to make predictions on our hold-out dataset.
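To give a feel for what gamma does, the sketch below evaluates the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), for one pair of points at several gamma values. The points and gamma values are arbitrary choices for illustration, and `rbf_kernel` is a hypothetical helper written for this example rather than a library function.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
z = [1.0, 1.0]

# Larger gamma makes similarity fall off faster with distance
for gamma in (0.01, 0.1, 1.0):
    print(gamma, round(rbf_kernel(x, z, gamma), 4))
```

With a small gamma the two points still look similar to the model, so each observation influences a wide region. With a large gamma the similarity drops quickly, so the decision boundary can bend tightly around individual observations.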

We run the search for samples of 1000, 2000, 5000, and 10000 from our data set. The hold out set contains half the original data, or 57,160 observations.

<a id = 'Section_5'></a>

## 5. Results

In Sections 5a - 5c we provide the results from the three different classification methods used.

<a id = 'Section_5_a'></a>

### 5a. Random Forest

<a id = 'Section_5_b'></a>

### 5b. XGBoost

Table 2 below shows the log loss for each set of parameter values used in the grid search. There are a few extra models where we tried to do further adjustment after the grid search, unsuccessfully. A total of 70 different models were created. The results are sorted in ascending log loss order, where the winning model is at the top. 5-fold cross validation was used for the grid search.

#### Table 2: XGBoost Performance with 5-Fold Cross Validation
*Sorted by ascending log loss*

| ID | max_depth | min_child_weight | subsample | colsample_bytree | eta | gamma | alpha | lambda | logloss | Stopping Rounds |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | 
|1|**7**|**6**|**0.6**|**0.6**|**0.05**|**0**|**0**|**1**|**0.44227**|**197**|
|2|8|6|0.6|0.6|0.05|0|0|1|0.44237|154|
|3|7|6|0.6|0.6|0.05|0.01|0|1|0.44252|178|
|4|7|6|0.6|0.6|0.05|0.1|0|1|0.44268|179|
|5|7|5|0.6|0.5|0.05|0.1|0|1|0.44269|205|
|6|7|5|0.6|0.5|0.05|0|0|1|0.44271|203|
|7|7|5|0.6|0.6|0.05|0|0|1|0.44272|172|
|8|7|5|0.6|0.6|0.05|0.1|0|1|0.44274|178|
|9|9|6|0.6|0.6|0.05|0|0|1|0.44276|152|
|10|6|6|0.6|0.5|0.05|0.1|0|1|0.44276|256|
|11|6|5|0.6|0.6|0.05|0|0|1|0.44277|205|
|12|6|5|0.6|0.6|0.05|0.1|0|1|0.4428|205|
|13|7|6|0.6|0.5|0.05|0|0|1|0.44285|211|
|14|7|6|0.6|0.6|0.05|0|0|10|0.44286|209|
|15|6|6|0.6|0.5|0.05|0|0|1|0.44292|226|
|16|6|5|0.6|0.5|0.05|0|0|1|0.44302|227|
|17|7|6|0.5|0.6|0.05|0.1|0|1|0.44309|205|
|18|6|5|0.6|0.5|0.05|0.1|0|1|0.44316|206|
|19|9|6|0.6|0.6|0.06|0|0.1|1|0.44318|119|
|20|6|5|0.5|0.5|0.05|0|0|1|0.44321|243|
|21|7|6|0.6|0.5|0.05|0.1|0|1|0.44331|154|
|22|6|5|0.5|0.6|0.05|0.1|0|1|0.44333|209|
|23|7|6|0.5|0.5|0.05|0|0|1|0.44334|194|
|24|9|6|0.6|0.6|0.06|0|0|1|0.44344|118|
|25|6|6|0.6|0.6|0.05|0.1|0|1|0.44345|172|
|26|6|6|0.5|0.6|0.05|0.1|0|1|0.44347|190|
|27|7|6|0.5|0.6|0.05|0|0|1|0.44348|169|
|28|7|5|0.5|0.6|0.05|0.1|0|1|0.44349|169|
|29|6|6|0.6|0.6|0.05|0|0|1|0.44353|172|
|30|6|5|0.5|0.6|0.05|0|0|1|0.44355|205|
|31|7|5|0.5|0.6|0.05|0|0|1|0.44356|151|
|32|7|6|0.5|0.5|0.05|0.1|0|1|0.4436|168|
|33|6|5|0.5|0.5|0.05|0.1|0|1|0.44366|178|
|34|7|5|0.5|0.5|0.05|0.1|0|1|0.44369|178|
|35|7|5|0.5|0.5|0.05|0|0|1|0.44372|178|
|36|6|6|0.5|0.5|0.05|0|0|1|0.44383|178|
|37|6|6|0.5|0.6|0.05|0|0|1|0.44392|151|
|38|9|6|0.6|0.6|0.075|0|0|1|0.44405|95|
|39|7|5|0.6|0.6|0.01|0|0|1|0.44414|498|
|40|7|5|0.6|0.6|0.01|0.1|0|1|0.44414|498|
|41|7|6|0.6|0.6|0.01|0|0|1|0.44418|498|
|42|7|6|0.6|0.6|0.01|0.1|0|1|0.44419|498|
|43|7|5|0.5|0.6|0.01|0|0|1|0.44473|498|
|44|7|5|0.5|0.6|0.01|0.1|0|1|0.44473|498|
|45|7|6|0.5|0.6|0.01|0|0|1|0.44474|498|
|46|7|6|0.5|0.6|0.01|0.1|0|1|0.44475|498|
|47|7|5|0.6|0.5|0.01|0.1|0|1|0.44483|498|
|48|7|5|0.6|0.5|0.01|0|0|1|0.444861|498|
|49|7|6|0.6|0.5|0.01|0.1|0|1|0.44491|498|
|50|7|6|0.6|0.5|0.01|0|0|1|0.444915|498|
|51|7|6|0.6|0.6|0.05|0|0|100|0.44513|210|
|52|7|5|0.5|0.5|0.01|0|0|1|0.44536|498|
|53|7|6|0.5|0.5|0.01|0.1|0|1|0.44536|498|
|54|7|6|0.5|0.5|0.01|0|0|1|0.44538|498|
|55|6|5|0.6|0.6|0.01|0.1|0|1|0.4454|498|
|56|6|5|0.6|0.6|0.01|0|0|1|0.44541|498|
|57|6|6|0.6|0.6|0.01|0|0|1|0.44546|498|
|58|6|6|0.6|0.6|0.01|0.1|0|1|0.44547|498|
|59|6|5|0.5|0.6|0.01|0|0|1|0.44582|498|
|60|6|5|0.5|0.6|0.01|0.1|0|1|0.44583|498|
|61|6|6|0.5|0.6|0.01|0.1|0|1|0.44586|498|
|62|6|6|0.5|0.6|0.01|0|0|1|0.44587|498|
|63|6|6|0.6|0.5|0.01|0|0|1|0.44606|498|
|64|6|6|0.6|0.5|0.01|0.1|0|1|0.44607|498|
|65|6|5|0.6|0.5|0.01|0|0|1|0.4461|498|
|66|6|5|0.6|0.5|0.01|0.1|0|1|0.4461|498|
|67|6|5|0.5|0.5|0.01|0|0|1|0.44639|498|
|68|6|5|0.5|0.5|0.01|0.1|0|1|0.44639|498|
|69|6|6|0.5|0.5|0.01|0|0|1|0.44639|498|
|70|6|6|0.5|0.5|0.01|0.1|0|1|0.4464|498|

Table 3 below shows the model results, using the parameters from the winning model from Table 2. This model was trained on all the data in the training dataset at once, not using 5-fold cross validation. The model was scored against the test dataset and the resulting accuracy was 80.02%.

The log loss from 5-fold cross validation was 0.44227. Training on the whole training dataset produced a slightly better log loss of 0.43536. This model does not appear to have an issue with overfitting.

#### Table 3: Winning XGBoost Performance on Test Dataset

| max_depth | min_child_weight | subsample | colsample_bytree | eta | gamma | alpha | lambda | logloss | Stopping Rounds | Accuracy |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 7 | 6 | 0.6 | 0.6 | 0.05 | 0 | 0 | 1 | 0.43536 | 198 | 80.02% |

<a href='#XGB_CV'>Click Here</a>  to see the cross validation out-of-fold average log loss. 

<a id = 'Section_5_c'></a>

### 5c. SVM

The best model came from the 10,000-observation sample and achieved 76.172% accuracy. This is only slightly better than a naïve model that always predicts 1. It had an RBF kernel, gamma = 1, and C = 100.
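The comparison with the naïve model can be checked from the class counts printed in Section 8. The sketch below is a back-of-the-envelope calculation that assumes the hold-out set has roughly the same class balance as the full dataset, which this section does not confirm.

```python
# Class counts from the value_counts() output in Section 8
n_ones, n_zeros = 87021, 27300

# Accuracy of a model that always predicts 1 (full-dataset class balance)
baseline_accuracy = n_ones / (n_ones + n_zeros)

# Accuracy reported for the best SVM
svm_accuracy = 0.76172

print(f"baseline: {baseline_accuracy:.4%}")
print(f"SVM improvement over baseline: {svm_accuracy - baseline_accuracy:.4%}")
```

Under that assumption, the SVM improves on the always-predict-1 baseline by well under a tenth of a percentage point.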

Accuracy did not change much across the different sample sizes. The same three parameters were chosen for the best model at each sample size: {'kernel': 'rbf', 'gamma': 1.0, 'C': 100}. Training time for the parameter search, however, varied greatly. Going from 1,000 to 2,000 observations increased the time from 4 seconds to 118 seconds (almost 30x). The training time for 5,000 observations was 557 seconds, about 4.72x that of the 2,000-observation sample. The 10,000-observation sample took 1,392 seconds, about 2.5x that of the 5,000-observation sample. In general, doubling the sample size more than doubles the training time.
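One way to turn these timings into a rough scaling estimate is to assume time grows like n^k and solve for k. The sketch below does this for each consecutive pair of sample sizes and with a least-squares fit on the log-log values. The power-law form is an assumption, and four timings from a single run are a thin basis for any firm conclusion.

```python
import math

# Sample sizes and search times (seconds) reported above
sizes = [1000, 2000, 5000, 10000]
seconds = [4, 118, 557, 1392]

# Exponent k implied by each consecutive pair, assuming time ~ n**k
pairwise = [
    math.log(t2 / t1) / math.log(n2 / n1)
    for (n1, t1), (n2, t2) in zip(zip(sizes, seconds), zip(sizes[1:], seconds[1:]))
]

# Least-squares slope of log(time) against log(size)
xs = [math.log(n) for n in sizes]
ys = [math.log(t) for t in seconds]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    / sum((x - x_bar) ** 2 for x in xs)
)

print([round(k, 2) for k in pairwise])
print(round(slope, 2))
```

The pairwise exponents come out at about 4.9, 1.7, and 1.3, and the overall fitted exponent is about 2.4. The first pair is dominated by the very short 4-second run, so the later pairs are probably the more reliable guide: training time grows faster than linearly with sample size.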

<a id = 'Section_6'></a>

## 6. Conclusion

<a id = 'Section_7'></a>

## 7. References

1. https://xgboost.readthedocs.io/en/latest/parameter.html
2. https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f
3. https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e
4. https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/

<a id = 'Section_8'></a>

## 8. Code

### Load Packages

In [1]:
import pandas as pd
import numpy as np

### Read In the Data

In this section we read in the data. 

In [3]:
#df = pd.read_csv("../../../case_8.csv")
df = pd.read_csv("../Data/case_8.csv")
df.head()

Unnamed: 0,ID,target,v1,v2,v3,v4,v5,v6,v7,v8,...,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131
0,3,1,1.335739,8.727474,C,3.921026,7.915266,2.599278,3.176895,0.012941,...,8.0,1.98978,0.035754,AU,1.804126,3.113719,2.024285,0,0.636365,2.857144
1,4,1,1.630686,7.464411,C,4.145098,9.191265,2.436402,2.483921,2.30163,...,6.822439,3.549938,0.598896,AF,1.672658,3.239542,1.957825,0,1.925763,1.739389
2,5,1,0.943877,5.310079,C,4.410969,5.326159,3.979592,3.928571,0.019645,...,9.333333,2.477596,0.013452,AE,1.773709,3.922193,1.120468,2,0.883118,1.176472
3,6,1,0.797415,8.304757,C,4.22593,11.627438,2.0977,1.987549,0.171947,...,7.018256,1.812795,0.002267,CJ,1.41523,2.954381,1.990847,1,1.677108,1.034483
4,8,1,1.630686,7.464411,C,4.145098,8.742359,2.436402,2.483921,1.496569,...,6.822439,3.549938,0.919812,Z,1.672658,3.239542,2.030373,0,1.925763,1.739389


In [4]:
# Print out the data types
df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114321 entries, 0 to 114320
Data columns (total 133 columns):
ID        int64
target    int64
v1        float64
v2        float64
v3        object
v4        float64
v5        float64
v6        float64
v7        float64
v8        float64
v9        float64
v10       float64
v11       float64
v12       float64
v13       float64
v14       float64
v15       float64
v16       float64
v17       float64
v18       float64
v19       float64
v20       float64
v21       float64
v22       object
v23       float64
v24       object
v25       float64
v26       float64
v27       float64
v28       float64
v29       float64
v30       object
v31       object
v32       float64
v33       float64
v34       float64
v35       float64
v36       float64
v37       float64
v38       int64
v39       float64
v40       float64
v41       float64
v42       float64
v43       float64
v44       float64
v45       float64
v46       float64
v47       object
v48       float64


Object data types will need to be one-hot encoded. Next, we check for missing values (there are none).

In [174]:
df.isnull().values.any()

False

Below are counts for the target variable. The target is binary and somewhat unbalanced: about 76% of observations have a target of 1 and about 24% have a target of 0.

In [6]:
counts = df.target.value_counts()
print(counts)
print(round(counts[0]/sum(counts),4))

1    87021
0    27300
Name: target, dtype: int64
0.2388


### Data Cleaning

In [7]:
# Print out statistics for the numeric variables
pd.set_option('display.max_columns', 500)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

df.describe()

Unnamed: 0,ID,target,v1,v2,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v23,v25,v26,v27,v28,v29,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v48,v49,v50,v51,v53,v54,v55,v57,v58,v59,v60,v61,v62,v63,v64,v65,v67,v68,v69,v70,v72,v73,v76,v77,v78,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v108,v109,v111,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v126,v127,v128,v129,v130,v131
count,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0,114321.0
mean,114228.92823,0.7612,1.63069,7.46441,4.1451,8.74236,2.4364,2.48392,1.49657,9.03186,1.88305,15.44741,6.8813,3.7984,12.09428,2.08091,4.92322,3.83227,0.84105,0.2223,17.77359,7.02974,1.09309,1.69813,1.87603,2.74345,5.09333,8.20642,1.62215,2.16163,6.40624,8.12239,13.3756,0.74147,0.09093,1.23718,10.46593,7.18255,12.92497,2.2166,10.79517,9.14223,1.63053,12.53802,8.01655,1.50426,7.19816,15.7113,1.25386,1.55956,4.07783,7.70165,10.58794,1.71429,14.58303,1.03069,1.68733,6.34371,15.84756,9.28728,17.56412,9.44934,12.26996,1.43177,2.4333,2.40506,7.30737,13.33448,2.2097,7.28717,6.20836,2.17381,1.60796,2.82225,1.22018,10.18022,1.92418,1.51843,0.96691,0.58237,5.47518,3.85288,0.66576,6.45795,7.62255,7.66762,1.25072,12.09162,6.86641,2.89029,5.29672,2.64283,1.08105,11.79136,2.15262,4.18128,3.36531,13.57445,10.54805,2.29122,8.30386,8.36465,3.16897,1.29122,2.7376,6.82244,3.54994,0.91981,1.67266,3.23954,2.03037,0.31014,1.92576,1.73939
std,65934.48736,0.42635,0.81326,2.22504,0.86266,1.54344,0.45061,0.44271,2.10979,1.44954,1.39347,0.59338,0.92415,0.88317,1.44392,0.55045,1.34464,1.43607,0.46286,0.12868,0.86743,1.0694,2.98732,2.24158,0.41398,0.62666,2.01131,0.96545,0.42324,0.7397,2.0242,1.00628,1.78573,0.40657,0.58348,1.77108,3.16764,0.75443,0.7488,0.48667,1.58586,1.55058,2.19532,1.64993,0.67797,1.16789,1.87306,0.60036,1.7546,0.62668,0.50925,5.13806,1.5564,0.40378,1.59344,0.69624,2.24951,1.89742,1.4105,0.84371,1.71983,1.4267,1.75436,0.92227,0.59981,1.03956,0.94339,1.38423,0.80726,1.68567,2.78821,0.79785,0.70691,1.06186,0.34985,2.27357,0.78753,2.13245,0.13438,0.1804,1.23201,0.64216,0.19835,0.84155,1.44498,1.76276,0.34655,5.17341,1.76901,1.35412,0.92291,0.66527,1.70317,2.21935,0.69222,2.81395,1.11715,2.61288,1.42744,0.5034,2.74269,1.50358,3.1636,0.55455,1.0186,1.3487,1.94343,1.59155,0.37791,1.22123,0.81434,0.69326,0.94964,0.85182
min,3.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,1.51678,0.10618,-0.0,0.04104,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.06935,-0.0,-0.0,-0.0,-0.0,-0.0,0.01306,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.05306,-0.0,0.6593,-0.0,1.50136,-0.0,0.42709,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.8724,-0.0,0.02237,-0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,9e-05,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.01914,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0
25%,57280.0,1.0,1.34615,6.57577,4.0687,8.39409,2.34097,2.37659,0.26531,8.81356,1.05033,15.39823,6.32262,3.46409,11.25602,1.90569,4.70588,3.37983,0.71949,0.19241,17.77359,6.41875,0.0,0.27448,1.75553,2.56647,4.74236,8.11437,1.49129,1.83074,5.0558,7.89615,13.09935,0.59072,0.0,0.30522,8.41039,7.06762,12.81317,2.06897,10.54256,8.88634,0.2563,12.15693,7.914,0.65879,6.83727,15.66772,0.20819,1.2766,3.97665,4.06136,10.2168,1.5996,14.58303,1.0,0.27094,5.85294,15.84756,9.16798,17.56412,9.27007,12.08748,1.0,2.23612,2.06003,7.19023,13.15175,1.95122,7.20499,3.59298,1.81535,1.31804,2.44395,1.10841,9.55184,1.62351,0.22454,0.94962,0.51793,5.13587,3.65008,0.59461,6.36225,7.18954,7.29994,1.17271,12.09162,6.34055,2.32905,4.98815,2.43202,0.17355,11.70203,1.85201,2.72122,2.92386,11.99667,10.26667,2.13904,7.604,7.86517,1.16943,1.05263,2.28261,6.51961,2.57105,0.08471,1.57097,2.7625,1.68126,0.0,1.44948,1.46341
50%,114189.0,1.0,1.63069,7.46441,4.1451,8.74236,2.4364,2.48392,1.49657,9.03186,1.31291,15.44741,6.61324,3.7984,11.96783,2.08091,4.92322,3.83227,0.84105,0.2223,17.77359,7.03937,0.33059,1.69813,1.87603,2.74345,5.09333,8.20642,1.62215,2.16163,6.53443,8.12239,13.3756,0.74147,0.0,1.23718,10.33934,7.18255,12.92497,2.2166,10.79517,9.14223,1.63053,12.53802,8.01655,1.21194,7.19816,15.7113,1.25386,1.55956,4.07783,7.70165,10.58794,1.71429,14.58303,1.0,1.68733,6.34371,15.84756,9.28728,17.56412,9.44934,12.26996,1.0,2.4333,2.40506,7.30737,13.33448,2.2097,7.28717,6.20836,2.17381,1.60796,2.82225,1.22018,10.18022,1.92418,1.51843,0.96691,0.58237,5.47518,3.85288,0.66576,6.45795,7.62255,7.66762,1.25072,12.09162,6.86641,2.89029,5.29672,2.64283,1.08105,11.79136,2.15262,4.18128,3.36531,14.03888,10.54805,2.29122,8.30386,8.36465,3.16897,1.29122,2.7376,6.82244,3.54994,0.91981,1.67266,3.23954,2.03037,0.0,1.92576,1.73939
75%,171206.0,1.0,1.63069,7.5515,4.34023,8.9248,2.4847,2.52845,1.49657,9.30233,2.10066,15.5939,7.0194,3.7984,12.71577,2.08091,5.14286,3.83227,0.84105,0.2223,18.1546,7.66652,1.09309,1.69813,1.89891,2.7791,5.33034,8.47939,1.62215,2.16163,7.70145,8.25076,14.32492,0.74147,0.0,1.23718,12.76246,7.34477,13.04965,2.23749,11.0221,9.41516,1.63053,12.67463,8.13559,2.00572,7.41788,15.87156,1.25386,1.55956,4.15366,7.70165,10.83954,1.73502,15.31291,1.0,1.68733,6.3844,16.47085,9.46899,18.4375,9.73384,12.9166,2.0,2.43665,2.40506,7.55221,13.55932,2.24359,7.82301,6.20836,2.17381,1.60796,2.82225,1.22018,10.43359,1.92418,1.51843,0.9901,0.58237,5.47518,3.85288,0.66576,6.669,7.71084,8.00612,1.30167,15.69721,6.93119,2.89029,5.29672,2.64283,1.08105,12.44363,2.15262,4.18128,3.36531,15.37219,10.71895,2.31017,8.64537,8.41772,3.16897,1.29122,2.7376,7.0,3.54994,0.91981,1.67266,3.23954,2.03037,0.0,1.92576,1.73939
max,228713.0,1.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,18.53392,20.0,18.71055,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,19.29605,20.0,20.0,20.0,20.0,19.84819,20.0,17.56098,20.0,20.0,20.0,20.0,20.0,12.0,19.91553,20.0,20.0,20.0,20.0,19.83168,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,18.84696,7.0,20.0,20.0,20.0,20.0,20.0,20.0,19.81631,12.0,20.0,20.0,15.97351,20.0,20.0,20.0,20.0,20.0,20.0,20.0,17.56098,19.84275,20.0,20.0,6.30577,8.92384,20.0,19.01631,9.07054,20.0,20.0,19.0588,20.0,20.0,20.0,20.0,18.77525,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,10.39427,20.0,20.0,19.68607,20.0,15.63161,20.0,20.0,11.0,20.0,20.0


In [8]:
# Print out unique values for objects
# v22 has 18,210 unique values, so we will use weight of evidence for it instead of one-hot encoding
df.describe(include='object')

Unnamed: 0,v3,v22,v24,v30,v31,v47,v52,v56,v66,v71,v74,v75,v79,v91,v107,v110,v112,v113,v125
count,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321,114321
unique,3,18210,5,7,3,10,12,122,3,9,3,4,18,7,7,3,22,36,90
top,C,AGDF,E,C,A,C,J,BW,A,F,B,D,C,A,E,A,F,G,BM
freq,114041,2886,55177,92288,91804,55425,11106,18233,70353,75094,113560,75087,34561,27082,27082,55688,22053,71556,5836


### Weight of Evidence

In [9]:
#https://github.com/Sundar0989/WOE-and-IV/blob/master/WOE_IV.ipynb
import pandas.core.algorithms as algos
from pandas import Series
from scipy import stats  # scipy.stats.stats is a deprecated private path
import re
import traceback
import string

max_bin = 20
force_bin = 3

# define a binning function
def mono_bin(Y, X, n = max_bin):
    
    df1 = pd.DataFrame({"X": X, "Y": Y})
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]
    r = 0
    while np.abs(r) < 1:
        try:
            d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y, "Bucket": pd.qcut(notmiss.X, n)})
            d2 = d1.groupby('Bucket', as_index=True)
            r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
            n = n - 1 
        except Exception as e:
            n = n - 1

    if len(d2) == 1:
        n = force_bin         
        bins = algos.quantile(notmiss.X, np.linspace(0, 1, n))
        if len(np.unique(bins)) == 2:
            bins = np.insert(bins, 0, 1)
            bins[1] = bins[1]-(bins[1]/2)
        d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y, "Bucket": pd.cut(notmiss.X, np.unique(bins),include_lowest=True)}) 
        d2 = d1.groupby('Bucket', as_index=True)
    
    d3 = pd.DataFrame({},index=[])
    d3["MIN_VALUE"] = d2.min().X
    d3["MAX_VALUE"] = d2.max().X
    d3["COUNT"] = d2.count().Y
    d3["EVENT"] = d2.sum().Y
    d3["NONEVENT"] = d2.count().Y - d2.sum().Y
    d3=d3.reset_index(drop=True)
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    d3["EVENT_RATE"] = d3.EVENT/d3.COUNT
    d3["NON_EVENT_RATE"] = d3.NONEVENT/d3.COUNT
    d3["DIST_EVENT"] = d3.EVENT/d3.sum().EVENT
    d3["DIST_NON_EVENT"] = d3.NONEVENT/d3.sum().NONEVENT
    d3["WOE"] = np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]       
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    
    return(d3)

def char_bin(Y, X):
        
    df1 = pd.DataFrame({"X": X, "Y": Y})
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]    
    df2 = notmiss.groupby('X',as_index=True)
    
    d3 = pd.DataFrame({},index=[])
    d3["COUNT"] = df2.count().Y
    d3["MIN_VALUE"] = df2.sum().Y.index
    d3["MAX_VALUE"] = d3["MIN_VALUE"]
    d3["EVENT"] = df2.sum().Y
    d3["NONEVENT"] = df2.count().Y - df2.sum().Y
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    d3["EVENT_RATE"] = d3.EVENT/d3.COUNT
    d3["NON_EVENT_RATE"] = d3.NONEVENT/d3.COUNT
    d3["DIST_EVENT"] = d3.EVENT/d3.sum().EVENT
    d3["DIST_NON_EVENT"] = d3.NONEVENT/d3.sum().NONEVENT
    d3["WOE"] = np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]      
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    d3 = d3.reset_index(drop=True)
    
    return(d3)

def data_vars(df1, target):
    
    stack = traceback.extract_stack()
    filename, lineno, function_name, code = stack[-2]
    vars_name = re.compile(r'\((.*?)\).*$').search(code).groups()[0]
    final = (re.findall(r"[\w']+", vars_name))[-1]
    
    x = df1.dtypes.index
    count = -1
    
    for i in x:
        if i.upper() not in (final.upper()):
            if np.issubdtype(df1[i], np.number) and len(Series.unique(df1[i])) > 2:
                conv = mono_bin(target, df1[i])
                conv["VAR_NAME"] = i
                count = count + 1
            else:
                conv = char_bin(target, df1[i])
                conv["VAR_NAME"] = i            
                count = count + 1
                
            if count == 0:
                iv_df = conv
            else:
                iv_df = pd.concat([iv_df, conv], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    iv = pd.DataFrame({'IV':iv_df.groupby('VAR_NAME').IV.max()})
    iv = iv.reset_index()
    return(iv_df,iv)

In [10]:
pd.set_option('display.max_rows', 500)
forWOE = df[["v22","target"]].copy()

final_iv, IV = data_vars(forWOE , forWOE.target)
final_iv.sort_values("WOE",ascending=False)


| | VAR_NAME | MIN_VALUE | MAX_VALUE | COUNT | EVENT | EVENT_RATE | NONEVENT | NON_EVENT_RATE | DIST_EVENT | DIST_NON_EVENT | WOE | IV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9055 | v22 | JEA | JEA | 36 | 35 | 0.97222 | 1 | 0.02778 | 0.00040 | 0.00004 | 2.39609 | 0.29734 |
| 14740 | v22 | TPX | TPX | 25 | 24 | 0.96000 | 1 | 0.04000 | 0.00028 | 0.00004 | 2.01879 | 0.29734 |
| 2624 | v22 | AEUR | AEUR | 23 | 22 | 0.95652 | 1 | 0.04348 | 0.00025 | 0.00004 | 1.93178 | 0.29734 |
| 12366 | v22 | PEF | PEF | 23 | 22 | 0.95652 | 1 | 0.04348 | 0.00025 | 0.00004 | 1.93178 | 0.29734 |
| 16518 | v22 | WWN | WWN | 22 | 21 | 0.95455 | 1 | 0.04545 | 0.00024 | 0.00004 | 1.88526 | 0.29734 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 731 | v22 | ABIM | ABIM | 6 | 1 | 0.16667 | 5 | 0.83333 | 0.00001 | 0.00018 | -2.76870 | 0.29734 |
| 4876 | v22 | BMV | BMV | 6 | 1 | 0.16667 | 5 | 0.83333 | 0.00001 | 0.00018 | -2.76870 | 0.29734 |
| 1112 | v22 | ACBD | ACBD | 6 | 1 | 0.16667 | 5 | 0.83333 | 0.00001 | 0.00018 | -2.76870 | 0.29734 |
| 12304 | v22 | PAZ | PAZ | 7 | 1 | 0.14286 | 6 | 0.85714 | 0.00001 | 0.00022 | -2.95102 | 0.29734 |
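
To make the WOE and IV columns above concrete, the following is a minimal sketch on made-up data. It follows the same formulas as the binning code (`DIST_EVENT`, `DIST_NON_EVENT`, `WOE`, `IV`), but the toy values and the names `toy`, `grouped`, and `iv` are illustrative assumptions and not part of the case-study code.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one categorical feature and a binary target.
toy = pd.DataFrame({
    "cat":    ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "target": [1,   1,   1,   0,   1,   0,   0,   0,   0,   0],
})

# Count events (target == 1) and all rows per category, then derive non-events.
grouped = toy.groupby("cat")["target"].agg(EVENT="sum", COUNT="count")
grouped["NONEVENT"] = grouped["COUNT"] - grouped["EVENT"]

# Share of all events / all non-events that fall into each category.
grouped["DIST_EVENT"] = grouped["EVENT"] / grouped["EVENT"].sum()
grouped["DIST_NON_EVENT"] = grouped["NONEVENT"] / grouped["NONEVENT"].sum()

# Weight of evidence per category, and the information value of the whole feature.
grouped["WOE"] = np.log(grouped["DIST_EVENT"] / grouped["DIST_NON_EVENT"])
iv = ((grouped["DIST_EVENT"] - grouped["DIST_NON_EVENT"]) * grouped["WOE"]).sum()

print(grouped[["EVENT", "NONEVENT", "WOE"]])
print("IV:", round(iv, 4))
```

A category with a positive WOE holds a larger share of the events than of the non-events, which matches the high `EVENT_RATE` rows at the top of the table.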


### One-Hot Encoding and Standardization

In [11]:
from sklearn.preprocessing import StandardScaler
def transform_data(data):
    #OH encode
    label_encode = data.select_dtypes(include='object').columns
    normalize = data.drop(columns=["ID","target"]).select_dtypes(include='number').columns

    data_OHE = pd.get_dummies(data, columns=label_encode)
    
    #Standardize the variables
    scaler = StandardScaler()
    data_OHE[normalize] = scaler.fit_transform(data_OHE[normalize])
 
    return data_OHE

In [12]:
# add weight of evidence column for v22 back into dataset to get final dataset
df['v22'] = df['v22'].astype('category')
df2 = transform_data(df)
df3 = df2.merge(final_iv[["MIN_VALUE","WOE"]], how='left', left_on="v22",right_on="MIN_VALUE")
preModel_data = df3.drop(columns=["v22","MIN_VALUE"]).copy()
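
The left merge above swaps each `v22` category for its WOE value. The sketch below shows the same pattern on a made-up lookup table; `data`, `woe_table`, and the values in them are hypothetical stand-ins for `df2` and `final_iv`.

```python
import pandas as pd

# Hypothetical stand-ins: a frame with a high-cardinality column and a WOE lookup table.
data = pd.DataFrame({"ID": [1, 2, 3, 4], "v22": ["AAA", "BBB", "AAA", "CCC"]})
woe_table = pd.DataFrame({"MIN_VALUE": ["AAA", "BBB"], "WOE": [0.8, -0.4]})

# A left merge keeps every row of `data` and attaches the WOE of its category.
merged = data.merge(woe_table, how="left", left_on="v22", right_on="MIN_VALUE")
encoded = merged.drop(columns=["v22", "MIN_VALUE"])

print(encoded)
# A category that is absent from the lookup table is left with a missing WOE.
print("rows without a WOE value:", int(encoded["WOE"].isna().sum()))
```

Because a left merge never drops rows, it is worth checking for missing WOE values afterwards, since many estimators will not accept them.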

Print a sample of the final dataset.

In [13]:
preModel_data.head()

Unnamed: 0,ID,target,v1,v2,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v23,v25,v26,v27,v28,v29,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v48,v49,v50,v51,v53,v54,v55,v57,v58,v59,v60,v61,v62,v63,v64,v65,v67,v68,v69,v70,v72,v73,v76,v77,v78,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v108,v109,v111,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v126,v127,v128,v129,v130,v131,v3_A,v3_B,v3_C,v24_A,v24_B,v24_C,v24_D,v24_E,v30_A,v30_B,v30_C,v30_D,v30_E,v30_F,v30_G,v31_A,v31_B,v31_C,v47_A,v47_B,v47_C,v47_D,v47_E,v47_F,v47_G,v47_H,v47_I,v47_J,v52_A,v52_B,v52_C,v52_D,v52_E,v52_F,v52_G,v52_H,v52_I,v52_J,v52_K,v52_L,v56_A,v56_AA,v56_AB,v56_AC,v56_AE,v56_AF,v56_AG,v56_AH,v56_AI,v56_AJ,v56_AK,v56_AL,v56_AM,v56_AN,v56_AO,v56_AP,v56_AR,v56_AS,v56_AT,v56_AU,v56_AV,v56_AW,v56_AX,v56_AY,v56_AZ,v56_B,v56_BA,v56_BC,v56_BD,v56_BE,v56_BF,v56_BG,v56_BH,v56_BI,v56_BJ,v56_BK,v56_BL,v56_BM,v56_BN,v56_BO,v56_BP,v56_BQ,v56_BR,v56_BS,v56_BT,v56_BU,v56_BV,v56_BW,v56_BX,v56_BY,v56_BZ,v56_C,v56_CA,v56_CB,v56_CC,v56_CD,v56_CE,v56_CF,v56_CG,v56_CH,v56_CI,v56_CJ,v56_CK,v56_CL,v56_CM,v56_CN,v56_CO,v56_CP,v56_CQ,v56_CS,v56_CT,v56_CV,v56_CW,v56_CX,v56_CY,v56_CZ,v56_D,v56_DA,v56_DB,v56_DC,v56_DD,v56_DE,v56_DF,v56_DG,v56_DH,v56_DI,v56_DJ,v56_DK,v56_DL,v56_DM,v56_DN,v56_DO,v56_DP,v56_DQ,v56_DR,v56_DS,v56_DT,v56_DU,v56_DV,v56_DW,v56_DX,v56_DY,v56_DZ,v56_E,v56_F,v56_G,v56_H,v56_I,v56_L,v56_M,v56_N,v56_O,v56_P,v56_Q,v56_R,v56_T,v56_U,v56_V,v56_W,v56_X,v56_Y,v56_Z,v66_A,v66_B,v66_C,v71_A,v71_B,v71_C,v71_D,v71_F,v71_G,v71_I,v71_K,v71_L,v74_A,v74_B,v74_C,v75_A,v75_B,v75_C,v75_D,v79_A,v79_B,v79_C,v79_D,v79_E,v79_F,v79_G,v79_H,v79_I,v79_J,v79_K,v79_L,v79_M,v79_N,v79_O,v79_P,v79_Q,v79_R,v91_A,v91_B,v91_C,v91_D,v91_E,v91_F,v91_G,v107_A,v107_B,v107_C,v107_D,v107_E,v107_F,v107_G,v110_A,v110_B,v110_C,v112_A,v112_B,v112_C,v112_D,v112_E,v112_F,v112_G,v112_H,v112_I,v112_J,v112_K,v112_L,v112_M,v112_N,v112_O,v112_P,v112_Q,v112_R,v112_S,v112_T,v112_U,v112_V,v113_A,v113_AA,v113_AB,v113_AC,v113_AD,v113_AE,v113_AF,v113_AG,v113_AH,v113_AI,v113_AJ,v113_AK,v113_B,v113_C,v113_D,v113_E,v113_F,v113_G,v113_H,v113_I,v113_J,v113_L,v113_M,v113_N,v113_O,v113_P,v113_Q,v113_R,v113_S,v113_T,v113_U,v113_V,v113_W,v113_X,v113_Y,v113_Z,v125_A,v125_AA,v125_AB,v125_AC,v125_AD,v125_AE,v125_AF,v125_AG,v125_AH,v125_AI,v125_AJ,v125_AK,v125_AL,v125_AM,v125_AN,v125_AO,v125_AP,v125_AQ,v125_AR,v125_AS,v125_AT,v125_AU,v125_AV,v125_AW,v125_AX,v125_AY,v125_AZ,v125_B,v125_BA,v125_BB,v125_BC,v125_BD,v125_BE,v125_BF,v125_BG,v125_BH,v125_BI,v125_BJ,v125_BK,v125_BL,v125_BM,v125_BN,v125_BO,v125_BP,v125_BQ,v125_BR,v125_BS,v125_BT,v125_BU,v125_BV,v125_BW,v125_BX,v125_BY,v125_BZ,v125_C,v125_CA,v125_CB,v125_CC,v125_CD,v125_CE,v125_CF,v125_CG,v125_CH,v125_CI,v125_CJ,v125_CK,v125_CL,v125_D,v125_E,v125_F,v125_G,v125_H,v125_I,v125_J,v125_K,v125_L,v125_M,v125_N,v125_O,v125_P,v125_Q,v125_R,v125_S,v125_T,v125_U,v125_V,v125_W,v125_X,v125_Y,v125_Z,WOE
0,3,1,-0.36267,0.56766,-0.25975,-0.53588,0.36146,1.56529,-0.70322,0.6679,-0.99017,1.66284,-0.8609,-1.0548,-0.31712,-1.31874,2.71315,-0.11275,-1.58649,-0.57054,1.26315,0.65568,-0.36591,-0.69537,-0.37492,1.03733,-2.23895,0.69859,-1.27378,-1.55579,0.42679,0.25149,-1.14744,-0.70572,-0.15584,1.5668,-0.86957,0.62416,-0.29031,-0.4132,-0.18717,0.4556,-0.691,-0.222,0.10339,-0.5179,0.04252,1.72675,-0.69347,-0.41475,-0.20954,-1.39583,0.19471,-0.31167,0.80023,-0.04409,-0.68187,0.01026,1.7415,0.03177,-1.35614,5.39255,-0.27669,-0.46816,-1.36429,-0.16749,-0.01584,-3.44096,0.97899,0.14306,0.95162,-1.91013,-0.43615,-1.04999,-1.01117,-0.27639,1.77406,-0.66719,-0.45817,-0.7767,0.27503,-0.52193,-1.02718,1.16318,-1.50037,0.68631,-0.17136,1.42626,0.86084,-0.09816,-0.99947,-1.60655,-0.63044,0.35498,0.33237,-0.08897,-2.62463,0.78858,-5.38792,-0.6754,-0.62415,-1.63315,-1.0017,-0.41766,-1.89871,0.87311,-0.80279,-0.55547,0.34788,-0.10303,-0.00748,-0.44737,-1.35778,1.3122,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0
1,4,1,0.0,0.0,0.0,0.29085,0.0,0.0,0.38159,0.0,-0.40915,0.0,-0.40433,0.0,-0.31712,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.24933,-0.0,0.60583,0.0,-0.0,0.0,0.0,0.0,0.0,-1.3789,0.0,0.67417,0.0,-0.15584,-0.0,1.21221,-0.0,-0.0,-0.0,0.0,-0.0,0.37327,-0.0,0.0,-0.10708,-0.0,0.0,-0.07089,-0.0,-0.0,0.0,-0.0,0.0,-0.0,1.3922,0.38116,0.0,0.0,-0.0,0.0,0.0,-0.12347,0.61613,0.0,0.0,0.0,0.0,0.0,-0.00565,-0.99622,0.0,0.0,0.0,0.0,-0.14612,-0.0,0.54405,0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.36099,-0.0,0.0,0.0,0.0,0.0,0.0,0.24912,0.0,-0.47277,0.02366,0.0,-1.25012,0.0,-0.0,0.8355,0.0,0.0,0.0,0.0,-0.0,0.0,-0.20164,0.0,-0.0,-0.08909,-0.44737,-0.0,-0.0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.02343
2,5,1,-0.84451,-0.96823,0.3082,-2.21338,3.42466,3.26318,-0.70004,2.50757,-0.80173,-1.16504,-0.5374,-1.46383,-1.72499,-0.17583,0.7133,-0.46058,-1.28873,-0.60648,0.20606,-1.66889,-0.36591,-0.70671,0.89102,4.08945,-2.1167,-0.7317,-0.3971,-0.57719,-1.16707,-0.16218,-0.36124,-1.18487,-0.15584,3.46784,0.82436,-1.33756,-0.77195,1.45938,-1.19659,-2.44968,-0.68532,0.10504,-1.74021,-0.77042,1.3024,-1.01483,-0.66584,-1.26739,-0.09271,-0.66644,-0.9525,1.08244,0.9367,-0.04409,-0.69512,-0.43521,0.37551,-1.09034,-3.79321,-2.50017,-2.17135,1.70042,-0.03282,-0.4243,-1.47236,-1.13405,1.39191,1.7247,0.73806,-0.80617,-0.10941,-0.36949,-0.4252,-0.76213,1.83253,-0.65982,-1.15688,-1.72334,-0.25862,0.56382,-1.79533,-0.94432,-1.14087,2.22449,-1.17988,0.65715,-0.55798,0.29724,0.68018,-1.46471,-0.62987,-0.05442,-1.12229,-1.06508,0.00182,-0.90662,1.67652,1.66475,-1.75921,-1.41721,-1.0017,1.52829,-0.48968,1.86172,-0.55178,-0.56948,0.26739,0.55899,-1.11736,2.43755,-1.09794,-0.66084,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0
3,6,1,-1.0246,0.37768,0.0937,1.86926,-0.75165,-1.1212,-0.62785,-0.04577,3.34392,1.51685,2.99234,0.11878,1.38543,-0.24683,0.44177,-0.15423,0.82761,0.0725,0.69495,0.45576,-0.36591,-0.69116,-1.37146,-0.70185,1.9059,0.69202,-0.08153,-0.66915,1.13493,0.77124,-1.16076,-0.75691,-0.15584,-0.53632,0.33372,0.99213,0.0145,-1.5323,1.20655,0.3409,-0.69336,-0.20799,0.84872,1.56258,-1.29084,1.51642,-0.63498,-0.6086,-0.22054,-1.16183,0.76457,-1.20128,0.84369,-0.04409,-0.68774,-0.02674,0.82531,0.49281,0.58379,-0.01686,0.75513,0.61613,-0.26802,-0.20861,0.96062,0.08221,-0.3251,-1.47675,2.54913,-0.61763,-0.32345,-1.16326,0.06469,0.24936,-0.65539,-0.69375,0.56192,1.00643,0.01925,-0.66797,0.84075,1.09029,-0.60594,-0.41164,0.24208,1.19162,0.92757,-0.28597,-0.46002,-0.10396,-0.56781,0.34376,0.11288,-0.77868,-0.64596,0.07777,0.01868,-1.54977,-1.22298,-0.78741,-0.82295,-0.2253,-0.76681,0.14519,-0.89386,-0.57651,-0.68118,-0.2335,-0.04854,0.99509,-0.26184,-0.82753,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30707
4,8,1,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.59759,0.0,-0.60728,0.0,-0.76402,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.57525,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.15961,0.0,-0.0,0.0,-0.15584,-0.0,-0.10323,-0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.11964,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.04409,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.46816,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.20003,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.44737,-0.0,-0.0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.23863


### Model Preparation

In [30]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score 
import pickle
#https://scikit-learn.org/stable/modules/model_evaluation.html

Create an X dataset by removing 'ID' and 'target'. Create a Y dataset containing only 'target'.

In [34]:
X = preModel_data.copy().drop(columns=["ID","target"]).select_dtypes(include=['number'])
print("The shape of X is: ", X.shape)

y = preModel_data.loc[:,"target"].copy()
print("The shape of y is: ", y.shape)

The shape of X is:  (114321, 477)
The shape of y is:  (114321,)


Even though we use cross validation, we still split the data into training (80%) and test (20%) sets. Cross validation is used on the training set, and the selected models are then run against the held-out test set to help ensure overfitting has not occurred.

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2, random_state=42)
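
The workflow described above (cross validate on the training portion, then check once against the held-out portion) can be sketched on synthetic data as follows. The classifier, the synthetic dataset, the `stratify` argument, and all `*_demo` names are assumptions chosen only to keep the example small; they are not the models or data used in this case study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (not the bank dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the rows; stratify keeps the class balance similar in both parts.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf_demo = LogisticRegression(max_iter=1000)

# Cross validation touches only the training portion.
cv_scores_demo = cross_val_score(
    clf_demo, X_tr_demo, y_tr_demo, cv=StratifiedKFold(n_splits=5), scoring="accuracy"
)

# The held-out portion is scored once, after the model has been chosen and refit.
clf_demo.fit(X_tr_demo, y_tr_demo)
test_accuracy_demo = clf_demo.score(X_te_demo, y_te_demo)

print("mean CV accuracy:", cv_scores_demo.mean())
print("held-out accuracy:", test_accuracy_demo)
```

A held-out accuracy far below the cross-validated mean would be a sign of overfitting.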

### XGBoost

In [71]:
#https://xgboost.readthedocs.io/en/latest/parameter.html
#https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f

# Create D matrices from the training and test datasets
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)



#### Parameter Tuning

In [145]:
# set variables that will not be tuned
params = {
    "objective": "binary:logistic",
    "booster": "gbtree",
    "eval_metric": "logloss",
    "tree_method": "hist"
}

In [146]:
# create a gridsearch for the variables that will be tuned
gridsearch_params = [
    (max_depth, min_child_weight, subsample, colsample, eta, gamma, alpha, lambd)
    for max_depth in range(6,8)
    for min_child_weight in range(5,7)
    for subsample in [i/10. for i in range(5,7)]
    for colsample in [i/10. for i in range(5,7)]
    for eta in [.01, .05]
    for gamma in [0, .1]
    for alpha in [0]
    for lambd in [1]
]

# set the number of boosting rounds
num_boost_round = 499
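
The nested comprehension above enumerates every combination of the candidate values. As a rough check on how many cross-validation runs that implies, the same grid can be sketched with `itertools.product`; the names `grid` and `combos` are introduced here for illustration only.

```python
from itertools import product

# Same candidate values as the comprehension, keyed by parameter name.
grid = {
    "max_depth": range(6, 8),
    "min_child_weight": range(5, 7),
    "subsample": [0.5, 0.6],
    "colsample": [0.5, 0.6],
    "eta": [0.01, 0.05],
    "gamma": [0, 0.1],
    "alpha": [0],
    "lambd": [1],
}

# product() varies the last parameter fastest, like the innermost `for` clause.
combos = list(product(*grid.values()))

print("number of combinations:", len(combos))
print("first combination:", combos[0])
```

With 64 combinations and `nfold=5`, the search runs up to 320 boosted-model fits, which is one reason to keep each range narrow.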

In [147]:
%%time

# Define initial best params and log loss
min_logloss = float("Inf")
best_params = None
for max_depth, min_child_weight, subsample, colsample, eta, gamma, alpha, lambd in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}, subsample={}, colsample={}, eta={}, gamma={}, alpha={}, lambd={}".format(
        max_depth, min_child_weight, subsample, colsample, eta, gamma, alpha, lambd))
    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample
    params['eta'] = eta
    params['gamma'] = gamma
    params['alpha'] = alpha
    params['lambda'] = lambd
    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'logloss'},
        early_stopping_rounds=5
    )
    # Update best log loss
    mean_logloss = cv_results['test-logloss-mean'].min()
    # 0-based index of the best round (the number of rounds run is this value + 1)
    boost_rounds = cv_results['test-logloss-mean'].argmin()
    print("\tLogLoss {} for {} rounds".format(mean_logloss, boost_rounds))
    if mean_logloss < min_logloss:
        min_logloss = mean_logloss
        best_params = (max_depth, min_child_weight, subsample, colsample, eta, gamma, alpha, lambd)
print("Best params: {}, {}, {}, {}, {}, {}, {}, {} Logloss: {}".format(best_params[0], best_params[1], best_params[2], best_params[3], best_params[4], best_params[5], best_params[6], best_params[7], min_logloss))

CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.5, eta=0.01, gamma=0, alpha=0, lambd=1


	LogLoss 0.44639059999999997 for 498 rounds
CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.5, eta=0.01, gamma=0.1, alpha=0, lambd=1
	LogLoss 0.44639380000000006 for 498 rounds
CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.5, eta=0.05, gamma=0, alpha=0, lambd=1
	LogLoss 0.44321400000000005 for 243 rounds
CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.5, eta=0.05, gamma=0.1, alpha=0, lambd=1
	LogLoss 0.4436626 for 178 rounds
CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.6, eta=0.01, gamma=0, alpha=0, lambd=1
	LogLoss 0.44582879999999997 for 498 rounds
CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.6, eta=0.01, gamma=0.1, alpha=0, lambd=1
	LogLoss 0.4458362 for 498 rounds
CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.6, eta=0.05, gamma=0, alpha=0, lambd=1
	LogLoss 0.4435594 for 205 rounds
CV with max_depth=6, min_child_weight=5, subsample=0.5, colsample=0.6, eta

	LogLoss 0.44491519999999996 for 498 rounds
CV with max_depth=7, min_child_weight=6, subsample=0.6, colsample=0.5, eta=0.01, gamma=0.1, alpha=0, lambd=1
	LogLoss 0.4449168 for 498 rounds
CV with max_depth=7, min_child_weight=6, subsample=0.6, colsample=0.5, eta=0.05, gamma=0, alpha=0, lambd=1
	LogLoss 0.4428656 for 211 rounds
CV with max_depth=7, min_child_weight=6, subsample=0.6, colsample=0.5, eta=0.05, gamma=0.1, alpha=0, lambd=1
	LogLoss 0.44331339999999997 for 154 rounds
CV with max_depth=7, min_child_weight=6, subsample=0.6, colsample=0.6, eta=0.01, gamma=0, alpha=0, lambd=1
	LogLoss 0.4441844 for 498 rounds
CV with max_depth=7, min_child_weight=6, subsample=0.6, colsample=0.6, eta=0.01, gamma=0.1, alpha=0, lambd=1
	LogLoss 0.44419339999999996 for 498 rounds
CV with max_depth=7, min_child_weight=6, subsample=0.6, colsample=0.6, eta=0.05, gamma=0, alpha=0, lambd=1
	LogLoss 0.4422738 for 197 rounds
CV with max_depth=7, min_child_weight=6, subsample=0.6, colsample=0.6, eta=0.05, gam

Take the winning parameters from the grid search and train the model with early stopping to find the best iteration.

In [207]:
params = {
    "objective": "binary:logistic",
    "booster": "gbtree",
    "eval_metric": "logloss",
    "tree_method": "hist",
    "max_depth": 7,
    "min_child_weight": 6,
    "subsample": 0.6,
    "colsample_bytree": 0.6,
    "eta": 0.05,
    "gamma": 0,
    "alpha": 0,
    "lambda": 1
}

In [208]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=5
)

print("Best LogLoss: {:.5f} in {} rounds".format(model.best_score, model.best_iteration+1))

[0]	Test-logloss:0.674669
Will train until Test-logloss hasn't improved in 5 rounds.
[1]	Test-logloss:0.656831
[2]	Test-logloss:0.641536
[3]	Test-logloss:0.62685
[4]	Test-logloss:0.614337
[5]	Test-logloss:0.602826
[6]	Test-logloss:0.59235
[7]	Test-logloss:0.581552
[8]	Test-logloss:0.572877
[9]	Test-logloss:0.563684
[10]	Test-logloss:0.556792
[11]	Test-logloss:0.54989
[12]	Test-logloss:0.542251
[13]	Test-logloss:0.535272
[14]	Test-logloss:0.529616
[15]	Test-logloss:0.523708
[16]	Test-logloss:0.518598
[17]	Test-logloss:0.514059
[18]	Test-logloss:0.510259
[19]	Test-logloss:0.506305
[20]	Test-logloss:0.501895
[21]	Test-logloss:0.498091
[22]	Test-logloss:0.494206
[23]	Test-logloss:0.490993
[24]	Test-logloss:0.488448
[25]	Test-logloss:0.485897
[26]	Test-logloss:0.483345
[27]	Test-logloss:0.480892
[28]	Test-logloss:0.478709
[29]	Test-logloss:0.476263
[30]	Test-logloss:0.47424
[31]	Test-logloss:0.472092
[32]	Test-logloss:0.47052
[33]	Test-logloss:0.468616
[34]	Test-logloss:0.466876
[35]	Test-l
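
The early-stopping rule used in training (stop once the evaluation log loss has not improved for 5 rounds) can be illustrated with a small pure-Python sketch. This mirrors the idea only and is not XGBoost's implementation; the function name `best_round_with_early_stopping` and the `demo_losses` values are made up for illustration.

```python
def best_round_with_early_stopping(losses, patience=5):
    """Return (index, loss) of the best round seen before stopping.

    Scanning stops once `patience` consecutive rounds fail to improve
    on the best loss seen so far.
    """
    best_idx, best_loss = 0, float("inf")
    for idx, loss in enumerate(losses):
        if loss < best_loss:
            best_idx, best_loss = idx, loss
        elif idx - best_idx >= patience:
            break
    return best_idx, best_loss


# Made-up evaluation losses: improvement stalls after the fifth round (index 4).
demo_losses = [0.67, 0.60, 0.55, 0.52, 0.50, 0.51, 0.505, 0.502, 0.501, 0.503, 0.40]

best_idx, best_loss = best_round_with_early_stopping(demo_losses, patience=5)
rounds_to_train = best_idx + 1  # indices are 0-based, so add 1 to get a round count

print("best index:", best_idx, "best loss:", best_loss, "rounds:", rounds_to_train)
```

The final value of 0.40 is never reached because the scan stops first, which is the trade-off early stopping makes: it saves training time but can miss a late improvement.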

Use the best iteration found previously and run the model again to get accuracy.

In [209]:
num_boost_round = model.best_iteration + 1
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)

[0]	Test-logloss:0.674669
[1]	Test-logloss:0.656831
[2]	Test-logloss:0.641536
[3]	Test-logloss:0.62685
[4]	Test-logloss:0.614337
[5]	Test-logloss:0.602826
[6]	Test-logloss:0.59235
[7]	Test-logloss:0.581552
[8]	Test-logloss:0.572877
[9]	Test-logloss:0.563684
[10]	Test-logloss:0.556792
[11]	Test-logloss:0.54989
[12]	Test-logloss:0.542251
[13]	Test-logloss:0.535272
[14]	Test-logloss:0.529616
[15]	Test-logloss:0.523708
[16]	Test-logloss:0.518598
[17]	Test-logloss:0.514059
[18]	Test-logloss:0.510259
[19]	Test-logloss:0.506305
[20]	Test-logloss:0.501895
[21]	Test-logloss:0.498091
[22]	Test-logloss:0.494206
[23]	Test-logloss:0.490993
[24]	Test-logloss:0.488448
[25]	Test-logloss:0.485897
[26]	Test-logloss:0.483345
[27]	Test-logloss:0.480892
[28]	Test-logloss:0.478709
[29]	Test-logloss:0.476263
[30]	Test-logloss:0.47424
[31]	Test-logloss:0.472092
[32]	Test-logloss:0.47052
[33]	Test-logloss:0.468616
[34]	Test-logloss:0.466876
[35]	Test-logloss:0.465593
[36]	Test-logloss:0.46407
[37]	Test-logloss

In [210]:
# accuracy
accuracy_score(y_test,(best_model.predict(dtest)>.5).astype(int))

0.8001749398644216

Below we run the same model with cross validation to get the log loss for the training and test folds.

In [211]:
%%time
best_model_cv = xgb.cv(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    seed=42,
    nfold=5,
    metrics={'logloss'},
    early_stopping_rounds=5,
    verbose_eval=False,
    maximize=False
)

Wall time: 2min 55s


<a id = 'XGB_CV'></a>

In [212]:
# Average out-of-fold cross validation results for winning XGB model
best_model_cv

| | train-logloss-mean | train-logloss-std | test-logloss-mean | test-logloss-std |
|---|---|---|---|---|
| 0 | 0.67452 | 6e-05 | 0.67489 | 0.00025 |
| 1 | 0.65646 | 0.00011 | 0.65721 | 0.00034 |
| 2 | 0.64105 | 0.00015 | 0.64215 | 0.00048 |
| 3 | 0.62611 | 0.0002 | 0.6276 | 0.00057 |
| 4 | 0.61348 | 0.00023 | 0.61526 | 0.0007 |
| 5 | 0.6019 | 0.00028 | 0.60402 | 0.00081 |
| 6 | 0.59104 | 0.00028 | 0.59345 | 0.00092 |
| 7 | 0.5801 | 0.00031 | 0.58287 | 0.00101 |
| 8 | 0.5712 | 0.00032 | 0.57431 | 0.00113 |
| 9 | 0.56177 | 0.00041 | 0.56529 | 0.00116 |
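
For a scikit-learn estimator, out-of-fold log loss and accuracy can be computed from out-of-fold predicted probabilities. The sketch below assumes `cross_val_predict` on synthetic data with a small random forest; the data, the estimator settings, and the `oof_*` names are illustrative assumptions and do not reproduce the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (not the bank dataset).
X_oof, y_oof = make_classification(n_samples=400, n_features=8, random_state=0)

clf_oof = RandomForestClassifier(n_estimators=50, random_state=0)

# Each row is predicted by a model that did not see that row during fitting.
oof_proba = cross_val_predict(
    clf_oof, X_oof, y_oof, cv=StratifiedKFold(n_splits=5), method="predict_proba"
)[:, 1]

oof_logloss = log_loss(y_oof, oof_proba)
oof_accuracy = accuracy_score(y_oof, (oof_proba > 0.5).astype(int))

print("out-of-fold log loss:", oof_logloss)
print("out-of-fold accuracy:", oof_accuracy)
```

Log loss is computed from the probabilities themselves, while accuracy needs a threshold; 0.5 is used here, as in the XGBoost accuracy cell.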


### Random Forest

In [32]:
# Use Stratified K-Fold cross validation with 5 folds
cv = StratifiedKFold(n_splits=5)

In [20]:
n_estimators= list(range(80, 110, 10))
max_features = list(range(5, 50, 5))
min_samples_split = list(range(500, 701, 100))
min_samples_leaf = [10, 20]
print(f'n_estimator_grid_search:{n_estimators}')
print(f'max_features_grid_search:{max_features}')
print(f'min_samples_split_grid_search:{min_samples_split}')
print(f'min_samples_leaf_grid_search:{min_samples_leaf}')


param_dist = {'n_estimators': n_estimators,
              'max_features': max_features,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf}

scoring = {  'Accuracy':'accuracy'
            , 'Log Loss':'neg_log_loss'}

n_estimator_grid_search:[80, 90, 100]
max_features_grid_search:[5, 10, 15, 20, 25, 30, 35, 40, 45]
min_samples_split_grid_search:[500, 600, 700]
min_samples_leaf_grid_search:[10, 20]
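
The search call that consumes `param_dist`, `scoring`, and `cv` is not shown in this section. One way such a multi-metric search might be wired up is sketched below, assuming `RandomizedSearchCV`; the synthetic data, the much smaller `demo_param_dist`, and the choice of `refit="Log Loss"` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data (not the bank dataset).
X_rf, y_rf = make_classification(n_samples=300, n_features=12, random_state=42)

# A deliberately tiny, hypothetical grid so the sketch runs in seconds.
demo_param_dist = {
    "n_estimators": [20, 30],
    "max_features": [3, 6],
    "min_samples_split": [10, 20],
    "min_samples_leaf": [2, 5],
}
demo_scoring = {"Accuracy": "accuracy", "Log Loss": "neg_log_loss"}

rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=demo_param_dist,
    n_iter=4,
    scoring=demo_scoring,
    refit="Log Loss",  # with several metrics, refit must name the one to optimise
    cv=StratifiedKFold(n_splits=5),
    random_state=42,
)
rf_search.fit(X_rf, y_rf)

best = rf_search.best_index_
print("best params:", rf_search.best_params_)
print("mean CV accuracy:", rf_search.cv_results_["mean_test_Accuracy"][best])
# neg_log_loss is negated so that larger is better; flip the sign to report log loss.
print("mean CV log loss:", -rf_search.cv_results_["mean_test_Log Loss"][best])
```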


### SVC

In [None]:
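import time
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# Candidate values sampled by the randomized search
param_grid = {'C':[1,10,100,1000],
              'gamma':[1.0,0.1,0.001,0.0001], 
              'kernel':['linear','poly','rbf']}

svc = SVC(cache_size = 1000,class_weight = 'balanced', random_state=42)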

In [None]:
start = time.time()

n_iter_search = 12
svc_random_search = RandomizedSearchCV(
    svc, 
    param_distributions=param_grid, 
    cv = 2, 
    random_state=42,
    n_iter=n_iter_search, 
    refit=True, 
    n_jobs=-1)

svc_random_search.fit(X_1k, y_1k)

end = time.time()
time_1k = end - start
print(time_1k)

filename = 'svc_random_search_1k.p'
with open(filename, 'wb') as f:
    pickle.dump(svc_random_search, f)

preds = svc_random_search.predict(validation.drop(columns=["ID","target"]))
acc1k = accuracy_score(y_pred=preds,y_true=validation.target)
print("accuracy: " + str(round(acc1k,4)))
pd.crosstab(preds,validation.target)

In [None]:
start = time.time()

n_iter_search = 12
svc_random_search = RandomizedSearchCV(
    svc, 
    param_distributions=param_grid, 
    cv = 2, 
    random_state=42,
    n_iter=n_iter_search, 
    refit=True, 
    n_jobs=-1)

svc_random_search.fit(X_2k, y_2k)

end = time.time()
time_2k = end - start
print(time_2k)

filename = 'svc_random_search_2k.p'
with open(filename, 'wb') as f:
    pickle.dump(svc_random_search, f)

preds = svc_random_search.predict(validation.drop(columns=["ID","target"]))
acc2k = accuracy_score(y_pred=preds,y_true=validation.target)
print("accuracy: " + str(round(acc2k,4)))
pd.crosstab(preds,validation.target)

In [None]:
start = time.time()

n_iter_search = 12
svc_random_search = RandomizedSearchCV(
    svc, 
    param_distributions=param_grid, 
    cv = 2, 
    random_state=42,
    n_iter=n_iter_search, 
    refit=True, 
    n_jobs=-1)

svc_random_search.fit(X_5k, y_5k)

end = time.time()
time_5k = end - start
print(time_5k)

filename = 'svc_random_search_5k.p'
with open(filename, 'wb') as f:
    pickle.dump(svc_random_search, f)

preds = svc_random_search.predict(validation.drop(columns=["ID","target"]))
acc5k = accuracy_score(y_pred=preds,y_true=validation.target)
print("accuracy: " + str(round(acc5k,4)))
pd.crosstab(preds,validation.target)

In [None]:
start = time.time()

n_iter_search = 12
svc_random_search = RandomizedSearchCV(
    svc, 
    param_distributions=param_grid, 
    cv = 2, 
    random_state=42,
    n_iter=n_iter_search, 
    refit=True, 
    n_jobs=-1)

svc_random_search.fit(X_10k, y_10k)

end = time.time()
time_10k = end - start
print(time_10k)

filename = 'svc_random_search_10k.p'
with open(filename, 'wb') as f:
    pickle.dump(svc_random_search, f)

preds = svc_random_search.predict(validation.drop(columns=["ID","target"]))
acc10k = accuracy_score(y_pred=preds,y_true=validation.target)
print("accuracy: " + str(round(acc10k,4)))
pd.crosstab(preds,validation.target)
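
The four timing cells above repeat the same steps for different sample sizes. A loop is one possible way to express that pattern; the sketch below uses synthetic data, much smaller sample sizes, and a reduced grid so that it runs quickly. All `demo_*` names and values are hypothetical, and the timings it prints are not the case-study results.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook's X_1k ... X_10k subsamples are not used here.
X_all, y_all = make_classification(n_samples=800, n_features=10, random_state=42)

demo_grid = {"C": [1, 10], "gamma": [0.1, 0.01], "kernel": ["linear", "rbf"]}
demo_sizes = [100, 200, 400]  # small hypothetical sizes so the sketch runs quickly

demo_times = {}
for n in demo_sizes:
    demo_search = RandomizedSearchCV(
        SVC(class_weight="balanced", random_state=42),
        param_distributions=demo_grid,
        n_iter=4,
        cv=2,
        random_state=42,
    )
    start = time.perf_counter()
    demo_search.fit(X_all[:n], y_all[:n])  # first n rows as a simple subsample
    demo_times[n] = time.perf_counter() - start

for n, seconds in demo_times.items():
    print(f"{n} rows: {seconds:.3f} s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring elapsed intervals.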

In [None]:
times = [time_1k,time_2k,time_5k,time_10k]
accuracies = [acc1k,acc2k,acc5k,acc10k]

In [None]:
svc_random_search_summary = pd.DataFrame(data={'Training Time':times, 'Accuracy':accuracies})
filename = 'svc_random_search_summary.p'
with open(filename, 'wb') as f:
    pickle.dump(svc_random_search_summary, f)
svc_random_search_summary
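
To answer the question about the rough scaling of SVM training time with sample size, one option is to fit a straight line to log(time) against log(sample size); the slope estimates the exponent k in time ≈ c · nᵏ. The timings below are placeholders generated from an exact quadratic law to show the mechanics. They are not measured results and should be replaced with the recorded `times` list.

```python
import numpy as np

sample_sizes = np.array([1000, 2000, 5000, 10000])

# Placeholder timings in seconds generated from t = c * n^2; NOT measured results.
placeholder_times = 2e-6 * sample_sizes.astype(float) ** 2

# Slope of the log-log fit is the estimated scaling exponent.
slope, intercept = np.polyfit(np.log(sample_sizes), np.log(placeholder_times), 1)

print(f"estimated exponent: {slope:.2f}")
```

An exponent near 2 would mean that doubling the sample size roughly quadruples the training time. The scikit-learn documentation notes that `SVC` fit time scales at least quadratically with the number of samples, so a measured exponent of 2 or more would be consistent with that.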