<br>
<br>

# Table of Contents

* [Introduction](#intro)
* [Load Libraries](#1)
* [Find Data Files In Input Folder](#2)
* [Read In The Data Files](#3)
* [Exploratory Data Analysis (EDA)](#4)
    * [Dataset Dimensions](#5)
    * [Dataset Head/Tail](#6)
    * [Target Column ("claim")](#target)
        * [Head](#target1)
        * [Unique Values In The Target Column](#target2)
        * [Target Column Data Type](#target3)
        * [Value Counts Of The Target Column](#target4)
        * [How Balanced Is The Target Column?](#target5)
    * [Column Names And Check For Differences Between Train and Test](#7)
    * [Describe](#8)
    * [Info](#9)
    * [Missing Values](#10)
        * [Nullity Matrices](#11)
        * [Nullity Bar Charts](#12)
        * [Percent Missing In Each Column Bar Charts](#13)
        * [How Many Missing Data Points Are There Total?  What Percent of Total Is Missing?](#14)
        * [What If We Just Drop All Rows With Missing Values From The Datasets?](#15)
    * [Feature Column Distributions](#16)
        * [Train](#17)
        * [Test](#18)
    * [Feature Column Skewness and Kurtosis](#19)
        * [Skewness](#20)
        * [Kurtosis](#21)
    * [Correlations](#22)
        * [Correlations of Subsets](#23)
* [TLDR: EDA Findings/Conclusions](#24)
* [Preprocess](#25)
    * [Train/Test Split](#28)
    * [Min-Max Scaling](#26)
    * [Impute Using Mean](#27)
* [Baseline Model](#29)
    * [Baseline Predictions](#30)
    * [Define A Scoring Function](#31)
    * [Choosing N Estimators](#32)
    * [Tuning Results](#33)
    * [Model Selection](#34)
    * [Final Model](#35)
    * [Final Baseline Prediction](#36)
    

<br>
<br>

<a id=intro></a>

## Introduction

The goal of this notebook is to explore the Tabular Playground Series - Sep 2021 dataset, make some decisions about pre-processing steps based on my exploration, look for promising feature engineering opportunities, and then make a benchmark submission to the competition.

This is only my second public notebook so I'm sure mistakes will be made.  I'm open to suggestions for improvement.  Thanks!

***From the TPS September 2021 Competition Description Page:***

*The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting whether a claim will be made on an insurance policy. Although the features are anonymized, they have properties relating to real-world features.*

*Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.*

*For each id in the test set, you must predict a probability for the claim variable. The file should contain a header and have the following format:  id, claim*
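
Since submissions are scored on area under the ROC curve, it helps me to remember what that number measures: the probability that a randomly chosen positive row gets a higher predicted score than a randomly chosen negative row. Here is a minimal sketch of that definition using NumPy. The `pairwise_auc` helper and the toy labels/scores are my own made-up illustration, not part of the competition code, and the pairwise approach is only practical for tiny inputs.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n_pos * n_neg), so it is
    only meant for small illustrative inputs.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Toy example: hypothetical labels and predicted claim probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Only the ordering of the predictions matters for this metric, which is why the submission can contain probabilities rather than hard 0/1 labels.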

<br>
<br>
<a id=1></a>
    
## Load Libraries  

In [None]:
import numpy as np #working with matrices, arrays, data science-friendly arrays
import pandas as pd #data processing, CSV file I/O, preprocessing
import matplotlib.pyplot as plt  #data viz library
#jupyter notebook magic function to make plots show in a notebook cell
%matplotlib inline  
plt.style.use('seaborn-whitegrid') #set my default matplotlib style to 'seaborn-whitegrid'

import seaborn as sns  #additional data viz helper library
import scipy.stats as st  #used to fit non-normal distributions with seaborn
import missingno as msno  #visualize missing values in the dataset

import os  #working with the operating system, filepaths, folders,etc.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer #replace missing values with the mean
from sklearn.feature_selection import mutual_info_regression #used to create a ranking with a feature utility metric 
from xgboost import XGBClassifier #first classifier model
from sklearn.metrics import roc_auc_score

<br>
<br>

<a id=2></a>

## Find Data Files In Input Folder

In [None]:
#This is default from Kaggle.  Basically uses os.walk to recursively 
#print the full filepath and filename for all files stored in the kaggle/input folder.
#Some people modify this to not just print the full filepath but to also read them into a dataframe.  
#I'm just going to use the print statement to inform my pd.read_csv function call later.

####### DEFAULT COMMENTS AND CODE FROM KAGGLE ###############


# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
####### DEFAULT COMMENTS AND CODE FROM KAGGLE ###############

<br>
<br>

<a id ="3"></a>

## Read In The Data Files

In [None]:
#using the above print statements, set the filepath of the csv files I want to read.
train_filepath = "../input/tabular-playground-series-sep-2021/train.csv"
test_filepath = "../input/tabular-playground-series-sep-2021/test.csv"
sample_solution_filepath = "../input/tabular-playground-series-sep-2021/sample_solution.csv"

#read train.csv, test.csv, sample_solution.csv into a pandas dataframe
#Need to use index_col = 0 because the id column is located at column index 0.  I found this below
#after reading in the dataset and looking at the head of the dataframe.
train_df = pd.read_csv(train_filepath, index_col = 0)  
test_df = pd.read_csv(test_filepath, index_col =0)
ss_df = pd.read_csv(sample_solution_filepath)

<br>
<br>

<a id="4"></a>

## Exploratory Data Analysis (EDA)

<a id="5"></a>

### Dataset Dimensions

In [None]:
# training dataset dimensions using shape method
train_df.shape

In [None]:
# test dataset dimensions using shape method
test_df.shape

In [None]:
# sample submission dataset dimensions using shape method
ss_df.shape

#### OK, this is a huge dataset, so it's time to use the GPU.  Sklearn's RandomForest classifier is out because it does not support GPU training and would take too long to train on a dataset this size.  Better options might be the XGBoost classifier or CatBoost with GPU enabled, or the RAPIDS version of the RandomForest classifier instead of sklearn's.

<a id='6'></a>

### Dataset Head/Tail

#### Training dataset

In [None]:
#training dataset head and tail to get a feel for what the data looks like

train_df.head()

In [None]:
train_df.tail()

#### Test Dataset

In [None]:
#test dataset head and tail to get a feel for what the data looks like
test_df.head()

In [None]:
test_df.tail()

#### Sample Submission 

In [None]:
ss_df.head()

In [None]:
ss_df.tail()

In [None]:
## the sample solution dataframe is no longer needed. delete the variable from memory to conserve a little RAM
del ss_df

<br>
<br>

<a id="target"></a>

### Target Column ("claim")

<a id="target1"></a>

### Head

In [None]:
train_df['claim'].head()

<a id="target2"></a>

### Unique Values In The Target Column

In [None]:
train_df['claim'].unique()

<a id="target3"></a>

### Target Column Data Type

In [None]:
train_df['claim'].dtype

<a id="target4"></a>

### Value Counts Of The Target Column

In [None]:
#value counts of 'claim' column
train_df['claim'].astype('str').value_counts()

<a id="target5"></a>

### How Balanced Is The Target Column?

In [None]:
claim_counts = train_df['claim'].value_counts()  #row counts per class, looked up by label (0 or 1) below
print('There is a {:.2f}% difference between "0" counts and "1" counts in the target "claim" column.'.format(((claim_counts.loc[0] - claim_counts.loc[1])/claim_counts.loc[0])*100))

In [None]:
plt.subplots(1,1, figsize=(5,5))
plt.subplot(1,1,1)
target_data_obj = train_df['claim'].astype('str').value_counts()
plt.bar(x=target_data_obj.index, height=target_data_obj.values, alpha=0.75, color='#7571B0')
plt.title("Target ('claim') Column Value Counts", fontsize=12,fontweight='bold')

#### The target "claim" column is a binary 1 or 0 integer column.  It looks like the claim column in the  submission file should be a probability between 0 and 1.
#### The target "claim" column is very balanced with only a 0.6% difference in counts between the two classes (0 or 1).
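
Another way I like to read class balance is as each class's share of the rows, because the larger share is also the accuracy of a model that always predicts the majority class. This is just a sketch on a made-up series (`toy_claim` stands in for `train_df['claim']`):

```python
import pandas as pd

# Hypothetical stand-in for train_df['claim'].
toy_claim = pd.Series([0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0])

# Share of rows in each class (sums to 1).
class_share = toy_claim.value_counts(normalize=True)

# Accuracy of always predicting the most common class.
majority_baseline = class_share.max()
print(class_share)
print(majority_baseline)
```

With nearly balanced classes that baseline sits close to 0.5, so no resampling or class weighting should be needed for this target.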

In [None]:
del target_data_obj

<br>

<a id="7"></a>

### Column Names And Check For Differences Between Train and Test

In [None]:
print("There are {} columns in the training dataset.".format(len(train_df.columns)))
train_df.columns

In [None]:
print("There are {} columns in the test dataset.".format(len(test_df.columns)))
test_df.columns

In [None]:
#find any columns in train that do not appear in test
col_diff_list = [x for x in train_df.columns if x not in test_df.columns]
col_diff_list2 = [x for x in test_df.columns if x not in train_df.columns]
print('The column(s) that are in the training dataset but not in the test dataset are: {}'.format(col_diff_list))
print('The column(s) that are in the test dataset but not in the train dataset are: {}'.format(col_diff_list2))

#### Both the training and test datasets appear to have the same columns, except the training dataset also includes the target "claim" column.

<a id="8"></a>

### Describe

In [None]:
train_df.describe()

In [None]:
test_df.describe()

<a id="9"></a>

### Info

There are lots of columns, so I'm breaking the dataset into 3 subsets.

In [None]:
train_df.iloc[:,:round(len(train_df.columns)/3)].info()

In [None]:
train_df.iloc[:,round(len(train_df.columns)/3):round(len(train_df.columns) * (2/3))].info()

In [None]:
train_df.iloc[:,round(len(train_df.columns) * (2/3)):round(len(train_df.columns) * (3/3))].info()

In [None]:
train_df['claim'].dtype

In [None]:
test_df.iloc[:,:round(len(test_df.columns)/3)].info()

In [None]:
test_df.iloc[:,round(len(test_df.columns)/3):round(len(test_df.columns) * (2/3))].info()

In [None]:
test_df.iloc[:,round(len(test_df.columns) * (2/3)):round(len(test_df.columns) * (3/3))].info()

#### All feature columns are float64 type with the target "claim" column being an int64 type column.
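
When all I want is the data types, a more compact check than reading three `info()` printouts is to count the dtypes directly. A small sketch on a made-up frame (`toy_df` is a hypothetical miniature of `train_df`):

```python
import pandas as pd

# Hypothetical miniature of train_df: float features plus an integer target.
toy_df = pd.DataFrame({
    "f1": [0.1, 0.2, None],
    "f2": [10.0, None, 30.0],
    "claim": [0, 1, 0],
})

# How many columns there are of each dtype.
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Names of the columns that are not float64 (here, only the target).
non_float_cols = toy_df.select_dtypes(exclude="float64").columns.tolist()
print(non_float_cols)
```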

<a id="10"></a>

### Missing Values

<br>

#### Which columns have missing values?

In [None]:
print('There are {} column(s) in {} with NULL values.'.format(len([col for col in train_df.columns if train_df[col].isnull().any()]),'train_df'))

In [None]:
print('There are {} column(s) in {} with NULL values.'.format(len([col for col in test_df.columns if test_df[col].isnull().any()]),'test_df'))

<br>

<a id="11"></a>

### Nullity Matrices

In [None]:
#nullity matrix on training dataset to understand which columns have NULL values and where NULL values are dispersed throughout the dataset.  Is there a pattern?
msno.matrix(train_df,color=(0.27, 0.52, 1.0))

In [None]:
#nullity matrix test dataset
msno.matrix(test_df,color=(0.27, 0.52, 1.0))

#### At first glance, it appears the NULL values are relatively randomly dispersed throughout all feature columns.

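#### The 105 and 119 labels on the sparkline at the right of the training matrix are, as far as I can tell, not row numbers.  Missingno labels the sparkline with the minimum and maximum number of non-null values found in any row, so the most complete rows have all 119 columns filled and the least complete row has 105.

#### Visually, it does not appear that a large percentage of the dataset is NULL, but all feature columns appear to have at least some missing values.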
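
The matrix gives a visual impression, but the per-row picture can also be counted directly. Below is a minimal sketch, on a made-up frame (`toy_df` is hypothetical, standing in for `train_df`), of counting missing cells per row and the share of rows that have at least one.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for train_df.
toy_df = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3, 0.4],
    "f2": [1.0, np.nan, np.nan, 4.0],
    "f3": [5.0, 6.0, 7.0, 8.0],
})

# Number of missing cells in each row.
missing_per_row = toy_df.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing cells.
print(missing_per_row.value_counts().sort_index())

# Share of rows with at least one missing cell.
share_rows_with_missing = (missing_per_row > 0).mean()
print(share_rows_with_missing)
```

The per-row count could also be kept as an extra column to try during feature engineering, though whether it helps the model is something I would have to test.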

<br>

<a id="12"></a>

### Nullity Bar Charts

#### Due to the large number of columns, the Missingno nullity bar chart cannot display them all legibly by default, so I'm going to break up the dataset into thirds to get a better look.  I'm using these plots to understand, approximately, how many NULLs are in each column.

#### Train

In [None]:
msno.bar(train_df.iloc[:,:round(len(train_df.columns)/3)],color=(0.27, 0.52, 1.0))

In [None]:
msno.bar(train_df.iloc[:,round(len(train_df.columns)/3):round(len(train_df.columns) * (2/3))],color=(0.27, 0.52, 1.0))

In [None]:
msno.bar(train_df.iloc[:,round(len(train_df.columns) * (2/3)):],color=(0.27, 0.52, 1.0))

#### Test

In [None]:
msno.bar(test_df.iloc[:,:round(len(test_df.columns)/3)],color=(0.27, 0.52, 1.0))

In [None]:
msno.bar(test_df.iloc[:,round(len(test_df.columns)/3):round(len(test_df.columns) * (2/3))],color=(0.27, 0.52, 1.0))

In [None]:
msno.bar(test_df.iloc[:,round(len(test_df.columns) * (2/3)):],color=(0.27, 0.52, 1.0))

#### Based on the above, there appear to be at least some NULL values in each feature column and no NULL values in the target 'claim' column.  The NULL values appear to be randomly spread throughout the dataset in each column and appear to represent a very small percentage of each column's data.

<br>

<a id="13"></a>

### Percent Missing In Each Column Bar Charts

In [None]:
percent_missing_train_df = train_df.isnull().sum() * 100 / len(train_df)
missing_value_train_df = pd.DataFrame({'column_name': train_df.columns,
                                 'percent_missing': percent_missing_train_df})

percent_missing_test_df = test_df.isnull().sum() * 100 / len(test_df)
missing_value_test_df = pd.DataFrame({'column_name': test_df.columns,
                                 'percent_missing': percent_missing_test_df})

In [None]:
missing_value_train_df.sort_values('percent_missing', inplace=True, ascending=True)
missing_value_test_df.sort_values('percent_missing', inplace=True, ascending=True)

In [None]:
missing_value_train_df.plot.barh(x='column_name', y='percent_missing', rot=5,figsize=(10, 40),alpha=0.85,legend=False,color='#4F66AF')
plt.title('Training Dataframe Percent Missing Values By Column Descending Order')

In [None]:
missing_value_test_df.plot.barh(x='column_name', y='percent_missing', rot=5,figsize=(10, 40),alpha=0.85,legend=False,color='#4F66AF')
plt.title('Test Dataframe Percent Missing Values By Column Descending Order')

#### Every feature column is a float64 type column.  All columns have, at least, some missing values.  No column has more than 2% missing values in either the training or test datasets.

#### There are no missing values in the target "claim" column

#### f31 and f46 have the most NaN values but still not more than 2%

In [None]:
#delete the above dataframes from memory to help conserve RAM
del percent_missing_train_df
del missing_value_train_df
del percent_missing_test_df
del missing_value_test_df

<br>
<br>

<a id="14"></a>

### How Many Missing Data Points Are There Total?  What Percent of Total Is Missing?

In [None]:
train_df_missing_values_count = train_df.isnull().sum()
train_df_total_cells = np.prod(train_df.shape)  #np.product is deprecated (removed in NumPy 2.0); np.prod is the supported name
train_df_total_missing = train_df_missing_values_count.sum()
train_df_percent_missing = (train_df_total_missing/train_df_total_cells) * 100
print("There are {} missing data points in the training dataset out of {} total possible cells.".format(train_df_total_missing,train_df_total_cells))
print("{}% of the training dataset is missing.".format(round(train_df_percent_missing,3)))

In [None]:
test_df_missing_values_count = test_df.isnull().sum()
test_df_total_cells = np.prod(test_df.shape)  #np.product is deprecated (removed in NumPy 2.0); np.prod is the supported name
test_df_total_missing = test_df_missing_values_count.sum()
test_df_percent_missing = (test_df_total_missing/test_df_total_cells) * 100
print("There are {} missing data points in the test dataset out of {} total possible cells.".format(test_df_total_missing,test_df_total_cells))
print("{}% of the test dataset is missing.".format(round(test_df_percent_missing,3)))

In [None]:
#delete the above variables to help conserve RAM
del train_df_missing_values_count
del train_df_total_cells
del train_df_total_missing
del train_df_percent_missing

del test_df_missing_values_count
del test_df_total_cells
del test_df_total_missing
del test_df_percent_missing

#### There isn't really documentation or column name clues to tell me why there are missing values in this dataset.  Overall, only about 1.6% of the values are missing in both the training and test datasets, so the two are consistent with each other.  It may be reasonable to just drop these NULL values and move forward.

<br>
<br>

<a id="15"></a>

### What If We Just Drop All Rows With Missing Values From The Datasets?

In [None]:
print("{}% of the rows would remain in the training dataset if we simply dropped all rows with any missing value!".format(round((train_df.dropna().shape[0]/train_df.shape[0])*100,2)))

In [None]:
print("{}% of the rows would remain in the test dataset if we simply dropped all rows with any missing value!".format(round((test_df.dropna().shape[0]/test_df.shape[0])*100,2)))

#### Wow, so 62% of the rows have at least 1 missing value.  We also know from above missing value analysis that all (100%) columns also have at least one missing value.  Therefore, simply dropping all rows or all columns with a missing value would remove too much of the original dataset.  In this case, it's better to impute, in my opinion.

<br>
<br>

<a id="16"></a>

### Feature Column Distributions

##### We know all feature columns are numeric so we don't have any categorical columns in the initial dataset to look at.  Let's look at each feature's distribution to see if we could possibly transform any.

In [None]:
feature_cols = [col for col in train_df.columns if col != 'claim']

In [None]:
#define a function to loop over numeric feature columns and plot their distribution
def plot_feature_distributions(figrows,figcols,colstart,colend,collist,df_to_plot):
    #create one figure holding a figrows x figcols grid of histograms
    plt.subplots(figrows,figcols, figsize=(20,20))
    for i, item in enumerate(collist[colstart:colend]):
        plt.subplot(figrows,figcols,i+1)
        plt.hist(x=df_to_plot[item],color='#7571B0',alpha=0.75)
        plt.title(item)
        plt.grid(True)
    plt.subplots_adjust(top=1.5, bottom=0.2, left=0.10, right=0.95, hspace=0.3,
        wspace=0.35)

<br>

<a id="17"></a>

#### Train

In [None]:
#plot first 39 columns of train_df
plot_feature_distributions(13,3,0,39,feature_cols,train_df)

In [None]:
#plot columns 40-78 of train_df
plot_feature_distributions(13,3,39,78,feature_cols,train_df)

In [None]:
#plot columns 79-118 of train_df (there are 118 feature columns)
plot_feature_distributions(14,3,78,119,feature_cols,train_df)

<a id="18"></a>

#### Test

In [None]:
#plot first 39 columns of test_df
plot_feature_distributions(13,3,0,39,feature_cols,test_df)

In [None]:
#plot columns 40-78 of test_df
plot_feature_distributions(13,3,39,78,feature_cols,test_df)

In [None]:
#plot columns 79-118 of test_df (there are 118 feature columns)
plot_feature_distributions(14,3,78,119,feature_cols,test_df)

#### With this many columns, it is hard to wrap my mind around all the different distributions in the dataset.  In short, the 118 numeric feature columns have a variety of distributions: some look pretty normal, some are pretty skewed, and some even look sort of binary or like a categorical column with most values falling into a narrow bin.

####  All the different columns appear to be on different scales, so some sort of scaling such as min-max scaling could be useful with this dataset. Some features are on a 0-1 scale while some are in the tens of thousands.  Some have negative numbers while some have only positive numbers.

#### Spot-checking a few features shows similar distributions in both train and test, which makes me feel the train/test split is reasonably balanced.  For example, the distributions for f1 and f57 look pretty similar in both the training dataset and the test dataset.

#### With so many skewed-looking columns, we could try some log, exponential, boxcox transformations during feature engineering to see if that may improve the model.
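
To make the transformation idea concrete, here is a rough sketch of a log transform on a made-up right-skewed feature. The `toy_feature` values and the `signed_log1p` helper are my own illustration (not something from the competition data), and I have not yet tested whether any of this helps the model.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature: a few very large values in a long tail.
toy_feature = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 150, 900], dtype=float)

# log1p handles zeros; it requires values greater than -1.
log_feature = np.log1p(toy_feature)

# Sign-preserving variant for columns that also contain negative values.
def signed_log1p(values):
    return np.sign(values) * np.log1p(np.abs(values))

skew_before = toy_feature.skew()
skew_after = log_feature.skew()
print(skew_before, skew_after)
```

A plain log or Box-Cox transform only works on positive values, so for the columns with negative numbers something like the sign-preserving version (or a Yeo-Johnson transform) would be needed instead.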

<br>
<br>

<a id="19"></a>

### Feature Column Skewness and Kurtosis

<a id="20"></a>

### Skewness

In [None]:
skewness_df = pd.DataFrame(train_df.skew(),columns=['skewness'])
skewness_test_df = pd.DataFrame(test_df.skew(),columns=['skewness'])

In [None]:
#which features are extremely skewed either positively or negatively?
skewness_df[(skewness_df['skewness'] < -1) | (skewness_df['skewness'] > 1)]

In [None]:
#which features are extremely skewed either positively or negatively?
skewness_test_df[(skewness_test_df['skewness'] < -1) | (skewness_test_df['skewness'] > 1)]

#### There are 67 columns in both the training and testing datasets that are extremely skewed!

<a id="21"></a>

### Kurtosis

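##### Kurtosis describes how heavy the tails of a distribution are, which relates to how prone it is to outliers.  Pandas reports excess kurtosis, so a normal distribution scores 0.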

In [None]:
kurt_df = pd.DataFrame(train_df.kurt(),columns=['kurtosis'])
kurt_test_df = pd.DataFrame(test_df.kurt(),columns=['kurtosis'])

In [None]:
kurt_df[kurt_df['kurtosis'] < 0].shape

In [None]:
#how many features have a kurtosis less than 0?
kurt_df[kurt_df['kurtosis'] < 0]

In [None]:
#how many features have a kurtosis less than 0?
kurt_test_df[kurt_test_df['kurtosis'] < 0]

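#### 38 columns have kurtosis less than 0 and some have a kurtosis less than -1.  Because this is excess kurtosis, negative values indicate flatter, lighter-tailed distributions than a normal distribution, so these columns are less prone to extreme outliers, not more.  Heavy tails and outliers would show up as large positive kurtosis instead.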
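
To sanity check how to read the sign of the kurtosis, here is a small sketch comparing three simulated distributions with known theoretical values. The samples are randomly generated for illustration only and have nothing to do with the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

samples = pd.DataFrame({
    "uniform": rng.uniform(-1, 1, n),   # light tails, theoretical excess kurtosis -1.2
    "normal": rng.normal(0, 1, n),      # reference, theoretical excess kurtosis 0
    "laplace": rng.laplace(0, 1, n),    # heavy tails, theoretical excess kurtosis 3
})

# DataFrame.kurt() returns excess (Fisher) kurtosis per column.
excess_kurtosis = samples.kurt()
print(excess_kurtosis)
```

A column with kurtosis near -1.2 looks roughly uniform (or bimodal), which fits the binned, categorical-looking histograms seen earlier.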

<br>
<br>

<a id="22"></a>

### Correlations

In [None]:
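def highlight_abs_max(s):
    '''
    highlight the value with the largest absolute value in a Series yellow.
    '''
    #compare absolute values on both sides so a negative value can be the match
    is_max = s.abs() == s.abs().max()
    return ['background-color: yellow' if v else '' for v in is_max]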

In [None]:
correlations =train_df.corr()
corrs_sorted = correlations['claim'].sort_values(ascending=False, key=abs).to_frame(name='Correlations With Target')
corrs_sorted[~corrs_sorted.index.isin(['claim','id'])].style.apply(highlight_abs_max)

#### A correlation heatmap of the entire dataset is too dense to read, so I did not make one big correlation matrix between all 118 features plus the 'claim' column.  However, the largest correlation with the target 'claim' column is only -0.021, so none of the feature columns are strongly correlated with the 'claim' column.
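
If only the correlations with the target are needed, computing the full 119 x 119 matrix is more work than necessary. A lighter sketch using `DataFrame.corrwith` on a made-up frame (`toy_df` is hypothetical, standing in for `train_df`):

```python
import pandas as pd

# Hypothetical miniature of train_df.
toy_df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [6.0, 5.0, 5.0, 2.0, 1.0, 1.0],
    "f3": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    "claim": [0, 0, 0, 1, 1, 1],
})

# Correlate every feature column with the target only.
features = toy_df.drop(columns="claim")
target_corr = features.corrwith(toy_df["claim"])

# Rank by absolute value but keep the sign for reading the direction.
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Keep in mind that Pearson correlation only captures linear relationships, so a near-zero value here does not rule out a feature being useful to a tree-based model.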

<br>

<a id="23"></a>

### Correlations of Subsets

In [None]:
#correlations of subsections
correlations = train_df.iloc[:,:round(len(train_df.columns)/4)].corr()
f , ax = plt.subplots(figsize = (14,14))
plt.title('Correlation of Numeric Variables f1 - f30',y=1,size=16)
sns.heatmap(correlations,square = True,  vmax=0.8, cmap='viridis',linewidths=0.01,annot=True,annot_kws = {'size':7})

In [None]:
#correlations of subsections
correlations = train_df.iloc[:,round(len(train_df.columns)/4):round(len(train_df.columns) * (2/4))].corr()
f , ax = plt.subplots(figsize = (14,14))
plt.title('Correlation of Numeric Variables f31 - f60',y=1,size=16)
sns.heatmap(correlations,square = True,  vmax=0.8, cmap='viridis',linewidths=0.01,annot=True,annot_kws = {'size':7})

In [None]:
#correlations of subsections
correlations = train_df.iloc[:,round(len(train_df.columns) * (2/4)):round(len(train_df.columns) * (3/4))].corr()
f , ax = plt.subplots(figsize = (14,14))
plt.title('Correlation of Numeric Variables f61 - f89',y=1,size=16)
sns.heatmap(correlations,square = True,  vmax=0.8, cmap='viridis',linewidths=0.01,annot=True,annot_kws = {'size':8})

In [None]:
#correlations of subsections
correlations = train_df.iloc[:,round(len(train_df.columns) * (3/4)):round(len(train_df.columns) * (4/4))].corr()
f , ax = plt.subplots(figsize = (14,14))
plt.title('Correlation of Numeric Variables f90 - claim',y=1,size=16)
sns.heatmap(correlations,square = True,  vmax=0.8, cmap='viridis',linewidths=0.01,annot=True,annot_kws = {'size':8})

#### Lots of purple, meaning very low correlation (less than 0.1 on the color scale).  Across all subsections I could not find any pair of columns that is even mildly correlated.
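
One limitation of splitting the heatmap into quarters is that it only shows pairs that fall inside the same quarter. A way to check every pair at once, without plotting, is to rank the upper triangle of the absolute correlation matrix. This is a sketch on a made-up frame (`toy_features` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature feature frame.
toy_features = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.1, 9.8],   # nearly 2 * f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

abs_corr = toy_features.corr().abs()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
upper_mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
pair_corr = abs_corr.where(upper_mask).stack().dropna().sort_values(ascending=False)
print(pair_corr.head())
```

On the real data, the first few rows of `pair_corr` would show directly whether any two features are meaningfully correlated.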

<br>
<br>

<a id="24"></a>

### TLDR: EDA Findings/Conclusions

1. Training Dataset Shape:  ***(957919, 119)***
2. Test Dataset Shape: ***(493474, 118)***
3. This is a ***huge*** dataset that will likely test the default Kaggle CPU and RAM allocation.  GPU will likely be needed for faster iteration.  
4. Due to necessity of GPU, models such as sklearn's RandomForest Classifier may not be appropriate due to its inability to work with GPU. Better choices might be XGBoost, Catboost, RAPIDS RandomForest, and other GPU-friendly classifier models/packages.  Check out this discussion for more [tips on using GPU in Kaggle](https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/271900#1511854)
5. The target column "claim" is a binary integer column with no missing values.  This is a probabilistic classification problem, so suitable models include logistic regression and tree-based models such as DecisionTrees, RandomForest, XGBoostClassifier, etc.
6. All feature columns are float64 datatype
7. All feature columns have at least one missing value, but no single column is missing more than 2% of its data (about 1.6% of all cells are missing overall).
8. 62% of the rows have at least one missing value.
9. The target "claim" column has balanced classes.  Only a 0.6% difference between class value counts.
10. The training and test datasets have the same columns, column types and similar distributions of all feature columns.
11. All 118 feature columns have at least one missing value.  Missing values appear randomly dispersed throughout the dataset.
12. There are 67 feature columns with extremely skewed distributions.  Some distributions even look categorical due to apparent binning of values within short ranges.
13. Several feature columns have negative (excess) kurtosis, some below -1, which indicates flat, light-tailed distributions rather than outlier-heavy ones.
14. None of the columns are correlated with each other (correlation is very small).
15. None of the feature columns are correlated with the target column (correlation is very small).
16. The feature columns (all numeric) appear to be on different scales.  For example, some features are on a 0-1 scale, some are in the 10,000s, some features have negative values.
17. Ideas for feature engineering:  Lots of skewed numeric columns.  I'd like to try some transformations to see if those would improve the model.  Binning numeric columns.  Clustering the feature columns to create a new cluster feature.

<br>
<br>

<a id="25"></a>

### Preprocess

<br>

<a id="28"></a>

### Train/Test Split

In [None]:
#copy the training dataset
X = train_df.copy()
y = X.pop('claim')

In [None]:
#split the dataset into a training/validation set 
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0,train_size=0.8, test_size=0.2)

<br>

<a id="26"></a>

### Min-Max Scaling

We saw above that all feature columns are float64 type with the target column ("claim") being a binary integer column.  The numeric feature columns vary greatly in scale, so I would like to put all the feature columns on the same scale.

In [None]:
float_cols = [col for col in train_df if col != 'claim']

In [None]:
#save the minmax scaler function as a variable
mm_scaler = MinMaxScaler()

In [None]:
#min-max scale the numeric columns only.  In this case, that is every column.

#fit and transform the training df
scaled_cols_train = pd.DataFrame(mm_scaler.fit_transform(X_train[float_cols]),index = X_train.index, columns = X_train.columns)

#just transform the validation and test df.  
scaled_cols_valid = pd.DataFrame(mm_scaler.transform(X_valid[float_cols]),index = X_valid.index, columns = X_valid.columns)
scaled_cols_test = pd.DataFrame(mm_scaler.transform(test_df),index= test_df.index, columns = test_df.columns)
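
To show what fitting on the training split and only transforming the others means, here is the min-max formula written out in NumPy on made-up values. The arrays and the `min_max_scale` helper are hypothetical. As far as I know, sklearn's `MinMaxScaler` also ignores NaN when fitting and keeps NaN in place when transforming, which is why scaling before imputing works.

```python
import numpy as np

# Hypothetical single feature, split into training and validation values.
train_values = np.array([10.0, np.nan, 30.0, 50.0])
valid_values = np.array([20.0, 60.0, np.nan])

# "Fit": learn min and max from the training values only, ignoring NaN.
train_min = np.nanmin(train_values)
train_max = np.nanmax(train_values)

def min_max_scale(values):
    # x' = (x - min_train) / (max_train - min_train); NaN stays NaN.
    return (values - train_min) / (train_max - train_min)

scaled_train = min_max_scale(train_values)
scaled_valid = min_max_scale(valid_values)
print(scaled_train)
print(scaled_valid)
```

Note that the validation value 60 scales to 1.25, so validation and test values can land outside the 0-1 range when they exceed what was seen in training.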

<br>

<a id="27"></a>


### Impute Using Mean


In [None]:
#1.6% of the dataset is missing, however, 62% of the rows and 100% of the columns have at least 1 missing value.  This means
#I will impute rather than drop.

# Imputing AFTER min-max scaling so the mean imputation is on the same scale.

#set simple imputer variable.  By default, this imputes using the mean to replace missing values
my_imputer = SimpleImputer()

#fit and transform the training df
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(scaled_cols_train), index=X_train.index)

imputed_X_valid = pd.DataFrame(my_imputer.transform(scaled_cols_valid), index=X_valid.index)
imputed_X_test = pd.DataFrame(my_imputer.transform(scaled_cols_test), index=test_df.index)


# Imputation removed column names; put them back

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
imputed_X_test.columns = test_df.columns
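
For reference, the same mean imputation idea can be sketched in plain pandas, which keeps the column names without having to put them back. The `toy_train` and `toy_valid` frames are made up for illustration; this is an alternative sketch, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical scaled training and validation frames.
toy_train = pd.DataFrame({"f1": [0.0, np.nan, 1.0], "f2": [0.2, 0.4, np.nan]})
toy_valid = pd.DataFrame({"f1": [np.nan, 0.5], "f2": [np.nan, np.nan]})

# "Fit": column means from the training rows only (NaN is skipped by default).
train_means = toy_train.mean()

# "Transform": fill every frame with the training means.
imputed_train = toy_train.fillna(train_means)
imputed_valid = toy_valid.fillna(train_means)
print(imputed_valid)
```

Using the training means for the validation and test sets mirrors calling `fit_transform` on train and `transform` on the others, so no information from the held-out rows leaks into the imputation.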

In [None]:
# deleting some variables to save memory
del train_df
del test_df
del X_train
del X_valid
del scaled_cols_valid
del scaled_cols_train
del scaled_cols_test


<br>
<br>

<a id="29"></a>

### Baseline Model

In [None]:
baseline_model = XGBClassifier(random_state=0, verbosity=0, tree_method='gpu_hist',use_label_encoder=False,n_estimators=500,learning_rate=0.05,n_jobs=4)
baseline_model.fit(imputed_X_train, y_train,
             verbose = False,
             eval_set = [(imputed_X_valid, y_valid)],
             eval_metric = "auc",
             early_stopping_rounds = 200)
preds_valid = baseline_model.predict_proba(imputed_X_valid)[:,1]
print(roc_auc_score(y_valid, preds_valid))

<br>

<a id="30"></a>

### Baseline Predictions

In [None]:
#commented out:  Score:  0.78903  Rank:  1117


predictions = baseline_model.predict_proba(imputed_X_test)[:,1]

# Save the predictions to a CSV file
#the competition asks for the header "id, claim", so use a lowercase id column name
output = pd.DataFrame({'id': imputed_X_test.index,
                       'claim': predictions})
output.to_csv('submission.csv', index=False)


<br>

<a id="31"></a>

### Define A Scoring Function

In [None]:
def get_score(x_t,y_t,x_v,y_v,n_estimator_var):
    """Fit an XGBClassifier with n_estimator_var trees on (x_t, y_t) and
    return the area under the ROC curve on the validation set (x_v, y_v).
    """
    
    model = XGBClassifier(random_state=0, verbosity=0, tree_method='gpu_hist',use_label_encoder=False,n_estimators=n_estimator_var,learning_rate=0.05,n_jobs=4)
    model.fit(x_t,y_t,
             verbose = False,
             eval_set = [(x_v, y_v)],
             eval_metric = "auc",
             early_stopping_rounds = 200)
    preds_valid = model.predict_proba(x_v)[:,1]
    #compute the score once, print it to track progress, then return it
    score = roc_auc_score(y_v, preds_valid)
    print(score)
    return score

<a id="32"></a>

### Choosing N Estimators

In [None]:
#function commented out to save on runtime when saving.  results saved below in results variable.
#results = dict((key, get_score(imputed_X_train, y_train, imputed_X_valid, y_valid, key)) for key in range(50,5000,500))

results = {50: 0.6903504543569663, 550: 0.7900936067312764, 1050: 0.7938331780560485, 1550: 0.7941154722257339, 2050: 0.7941154722257339, 2550: 0.7941154722257339, 3050: 0.7941154722257339, 3550: 0.7941154722257339, 4050: 0.7941154722257339, 4550: 0.7941154722257339}

<a id="33"></a>

### Tuning Results

In [None]:
#plotting the results of all get_score() results found above.  Plotting number of trees vs. auc
plt.plot(list(results.keys()), list(results.values()))
plt.title("XG Boost Classifier Model N Trees Vs. Area under the ROC Curve")
plt.xlabel("N Trees")
plt.ylabel("AUC")
plt.show()

<br>

<a id="34"></a>

### Model Selection

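Based on the above chart, performance stopped improving at 1550 estimators.  The scores from 1550 upward are identical to every decimal place, which suggests that early stopping (`early_stopping_rounds = 200`) halted training at the same point in each of those runs, so allowing more estimators beyond 1550 changes nothing.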

{50: 0.6903504543569663,
 550: 0.7900936067312764,
 1050: 0.7938331780560485,
 1550: 0.7941154722257339,
 2050: 0.7941154722257339,
 2550: 0.7941154722257339,
 3050: 0.7941154722257339,
 3550: 0.7941154722257339,
 4050: 0.7941154722257339,
 4550: 0.7941154722257339}
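
Rather than reading the value off the chart, the choice can also be made programmatically. This small sketch picks the smallest number of estimators that reaches the best validation AUC, using the results copied from above (the variable names `best_auc` and `best_n_estimators` are my own):

```python
# Validation AUC by n_estimators, copied from the results above.
results = {50: 0.6903504543569663, 550: 0.7900936067312764,
           1050: 0.7938331780560485, 1550: 0.7941154722257339,
           2050: 0.7941154722257339, 2550: 0.7941154722257339,
           3050: 0.7941154722257339, 3550: 0.7941154722257339,
           4050: 0.7941154722257339, 4550: 0.7941154722257339}

best_auc = max(results.values())

# Smallest number of estimators that reaches the best AUC.
best_n_estimators = min(n for n, auc in results.items() if auc == best_auc)
print(best_n_estimators, best_auc)
```

Since the grid moves in steps of 500, the true stopping point could be anywhere between 1050 and 1550 trees; a finer grid in that range would narrow it down.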

In [None]:
results

<br>

<a id="35"></a>

### Final Model

In [None]:
final_model = XGBClassifier(random_state=0, verbosity=0, tree_method='gpu_hist',use_label_encoder=False,n_estimators=1550,learning_rate=0.05,n_jobs=4)
final_model.fit(imputed_X_train, y_train,
             verbose = False,
             eval_set = [(imputed_X_valid, y_valid)],
             eval_metric = "auc",
             early_stopping_rounds = 200)
preds_valid = final_model.predict_proba(imputed_X_valid)[:,1]
print(roc_auc_score(y_valid, preds_valid))

<br>

<a id="36"></a>

### Final Baseline Prediction

In [None]:
predictions = final_model.predict_proba(imputed_X_test)[:,1]

# Save the predictions to a CSV file
#the competition asks for the header "id, claim", so use a lowercase id column name
output = pd.DataFrame({'id': imputed_X_test.index,
                       'claim': predictions})
output.to_csv('submission.csv', index=False)