<a id="top"></a>

# Data Exploration Part 2:  Feature Selection

<br>
<br>

## Table of Contents
* [Introduction And Findings From Exploratory Data Analysis](#1)
* [Load Libraries](#1.1)
* [Find Data Files In Input Folder and Read Them Into Pandas DataFrame](#2)
* [Dataset Dimensions](#3)
* [Preprocess The Data (Same As EDA Notebook)](#4)
    * [Train/Test Split](#p1)
    * [Min-Max Scaling](#p2)
    * [Imputation](#p3)
* [Baseline Mutual Information](#5)
    * [Mutual Information Scores Plot And Findings](#6)
    * [Mutual Information Findings/Conclusions](#6a)
* [Creating Some Features](#7)
    * [Counting Missing Values Per Row](#8)
    * [Flagging Missing Value Rows](#8a)
    * [Counting Feature Values Greater Than Zero and Less Than Zero](#9)
    * [Average Across Features](#10)
    * [Min Across Features](#11)
    * [Max Across Features](#12)
    * [Check For Transformation Ideas](#13)
        * [Positively Skewed Features](#14)
        * [Negatively Skewed Features](#15)
        * [Transformation Findings](#16)
        * [Transforming Features](#17)
    * [Looking For Binning Opportunities](#23)
        * [10 Bins](#24)
        * [5 Bins](#26)
        * [3 Bins](#27)
        * [Binning Findings/Conclusions and Bin Creation](#28)
* [Preprocess Number 2](#18)
    * [Train/Test Split](#19)
    * [Min-Max Scaling](#20)
    * [Imputation](#21)
* [BorutaShap](#22)
    * [An initial run of BorutaShap found the following features to be important](#30)
    * [Final Feature Set](#31)
* [Final Preprocess](#32)
    * [Filter Features And Final Train/Test Split](#33)
    * [Final Min-Max Scaling](#34)
    * [Final Imputation](#35)
* [Baseline Model With Selected Features (Previous Best Score:  0.79414 AUC).](#36)
    * [Final Baseline Predictions With Selected Features](#37)
* [TLDR:  Summary/Findings](#38)

<br>
<br>

***[back to top](#top)***



<a id="1"></a>

## Introduction And Findings From Exploratory Data Analysis

The goal of this notebook is to expand upon my earlier exploratory data analysis (EDA) of the Tabular Playground Series - Sep 2021 dataset, make some decisions about feature importance, possibly create some new features, and train a model to beat my previous submissions.  [Link to my previous notebook covering EDA, preprocessing, and my baseline model](https://www.kaggle.com/abrambeyer/tps-sep21-eda-preprocess-baseline-model)

If you see any areas for improvement, I'm happy to hear about it.  Thanks!

<br>

***From the TPS September 2021 Competition Description Page:***

*The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting whether a claim will be made on an insurance policy. Although the features are anonymized, they have properties relating to real-world features.*

*Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.*

*For each id in the test set, you must predict a probability for the claim variable. The file should contain a header and have the following format:  id, claim*

<br>

***Findings From Previous EDA:***
1. Training Dataset Shape:  ***(957919, 119)***
2. Test Dataset Shape: ***(493474, 118)***
3. This is a ***huge*** dataset that will likely test the default Kaggle CPU and RAM allocation.  GPU will likely be needed for faster iteration.  
4. Due to necessity of GPU, models such as sklearn's RandomForest Classifier may not be appropriate due to its inability to work with GPU. Better choices might be XGBoost, Catboost, RAPIDS RandomForest, and other GPU-friendly classifier models/packages.  Check out this discussion for more [tips on using GPU in Kaggle](https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/271900#1511854)
5. The target column "claim" is a binary integer column with no missing values.  This is a probabilistic classification problem, so candidate models include logistic regression and tree-based models such as DecisionTrees, RandomForest, XGBoostClassifier, etc.
6. All feature columns are float64 datatype
7. All feature columns have at least one missing value with the most sparse column only missing 1.6% of its data.
8. 62% of the rows have at least one missing value.
9. The target "claim" column has balanced classes.  Only a 0.6% difference between class value counts.
10. The training and test datasets have the same columns, column types and similar distributions of all feature columns.
11. All 118 feature columns have at least one missing value.  Missing values appear randomly dispersed throughout the dataset.
12. There are 67 feature columns with extremely skewed distributions.  Some distributions even look categorical because the values appear to be binned into short ranges.
13. Several feature columns have very negative kurtosis indicating possible outliers.
14. None of the columns are correlated with each other (correlation is very small).
15. None of the feature columns are correlated with the target column (correlation is very small).
16. The feature columns (all numeric) appear to be on different scales.  For example, some features are on a 0-1 scale, some are in the 10,000s, some features have negative values.
17. Ideas for feature engineering:  Lots of skewed numeric columns.  I'd like to try some transformations to see if those would improve the model.  Binning numeric columns.  Clustering the feature columns to create a new cluster feature.

***Inspiration Acknowledgments:***

I recently noticed another Kaggler with a notebook organization style similar to (but much better than) mine.  Although I did not take the table of contents idea from their notebooks, I am borrowing the idea of adding "back to top" links throughout the notebook and of hiding my code from view unless toggled.  I think it looks nicer and is a better experience for the reader.  [Link to awesome example notebook from Kaggler @dwin183287](https://www.kaggle.com/dwin183287/kagglers-seen-by-continents)

<br>
<br>

***[back to top](#top)***

<a id="1.1"></a>

## Load Libraries

In [None]:
import numpy as np #working with matrices, arrays, data science-friendly arrays
import pandas as pd #data processing, CSV file I/O, preprocessing
import matplotlib.pyplot as plt  #data viz library
#jupyter notebook magic function to make plots show in a notebook cell
%matplotlib inline  
plt.style.use('seaborn-whitegrid') #set my default matplotlib style to 'seaborn-whitegrid'

import seaborn as sns  #additional data viz helper library
import scipy.stats as st  #used to fit non-normal distributions with seaborn
import os  #working with the operating system, filepaths, folders,etc.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer #replace missing values with the mean
from sklearn.feature_selection import mutual_info_classif #used to create a ranking with a feature utility metric 
from xgboost import XGBClassifier #first classifier model
from sklearn.metrics import roc_auc_score

<br>
<br>

***[back to top](#top)***

<a id="2"></a>

## Find Data Files In Input Folder and Read Them Into Pandas DataFrame

In [None]:
#This is default from Kaggle.  Basically uses os.walk to recursively 
#print the full filepath and filename for all files stored in the kaggle/input folder.


####### DEFAULT COMMENTS AND CODE FROM KAGGLE ###############

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        if 'train.csv' in os.path.join(dirname, filename):
            train_df = pd.read_csv(os.path.join(dirname, filename), index_col = 0)  
        elif 'test.csv' in os.path.join(dirname, filename):
            test_df = pd.read_csv(os.path.join(dirname, filename), index_col = 0)
        elif 'sample_solution.csv' in os.path.join(dirname, filename):
            ss_df = pd.read_csv(os.path.join(dirname, filename), index_col = 0)
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<br>
<br>


***[back to top](#top)***


<a id="3"></a>

## Dataset Dimensions

Just as a check, make sure we read in our datasets successfully.

In [None]:
print('There are {} rows and {} columns in {}.'.format(train_df.shape[0],train_df.shape[1],'train_df'))

In [None]:
print('There are {} rows and {} columns in {}.'.format(test_df.shape[0],test_df.shape[1],'test_df'))

In [None]:
print('There are {} rows and {} columns in {}.'.format(ss_df.shape[0],ss_df.shape[1],'ss_df'))

<br>
<br>
<br>


***[back to top](#top)***


<a id="4"></a>

## Preprocess The Data (Same As EDA Notebook)

***[back to top](#top)***

<a id="p1"></a>

### Train/Test Split

In [None]:
#copy the training dataset
X = train_df.copy()
X['claim'] = X['claim'].astype('str')
y = X.pop('claim')

In [None]:
#split the dataset into a training/validation set 
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0,train_size=0.3, test_size=0.0125)

***[back to top](#top)***

<a id="p2"></a>

### Min-Max Scaling
<br>
<br>

During EDA, I found all the feature columns are numeric (float64) but they are on several different scales.  I'm using min-max scaling to put all the numbers on the same scale.

In [None]:
float_cols = [col for col in train_df if col != 'claim']
#save the minmax scaler function as a variable
mm_scaler = MinMaxScaler()

#min-max scale the numeric columns only.  In this case, that is every column.

#fit and transform the training df
scaled_cols_train = pd.DataFrame(mm_scaler.fit_transform(X_train[float_cols]),index = X_train.index, columns = X_train.columns)

#just transform the validation and test df.  
scaled_cols_valid = pd.DataFrame(mm_scaler.transform(X_valid[float_cols]),index = X_valid.index, columns = X_valid.columns)
scaled_cols_test = pd.DataFrame(mm_scaler.transform(test_df),index= test_df.index, columns = test_df.columns)

***[back to top](#top)***

<a id="p3"></a>
    
### Imputation

In [None]:
#1.6% of the dataset is missing, however, 62% of the rows and 100% of the columns have at least 1 missing value.  This means
#I will impute rather than drop.

# Imputing AFTER min-max scaling so the mean imputation is on the same scale.

#set simple imputer variable.  By default, this imputes using the mean to replace missing values
my_imputer = SimpleImputer()

#fit and transform the training df
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(scaled_cols_train), index=X_train.index)

imputed_X_valid = pd.DataFrame(my_imputer.transform(scaled_cols_valid), index=X_valid.index)
imputed_X_test = pd.DataFrame(my_imputer.transform(scaled_cols_test), index=test_df.index)


# Imputation removed column names; put them back

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
imputed_X_test.columns = test_df.columns
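
To double-check the order of operations above (scale first, then impute, fitting both on the training rows only), here is a minimal sketch on made-up toy data.  The names `toy_train`, `toy_valid`, `train_ready`, and `valid_ready` are mine and are not part of the notebook's pipeline.  As far as I understand it, `MinMaxScaler` ignores NaNs when fitting and passes them through when transforming, which is why imputing afterwards works.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration); NaN marks a missing value.
toy_train = pd.DataFrame({"a": [0.0, 5.0, 10.0, np.nan],
                          "b": [100.0, np.nan, 300.0, 200.0]})
toy_valid = pd.DataFrame({"a": [2.5, np.nan],
                          "b": [400.0, 150.0]})

scaler = MinMaxScaler()
imputer = SimpleImputer()  # default strategy is "mean"

# Fit both steps on the training rows only.
train_scaled = scaler.fit_transform(toy_train)
train_ready = pd.DataFrame(imputer.fit_transform(train_scaled),
                           columns=toy_train.columns)

# Reuse the fitted scaler and imputer on the validation rows.
valid_scaled = scaler.transform(toy_valid)
valid_ready = pd.DataFrame(imputer.transform(valid_scaled),
                           columns=toy_valid.columns)

print(train_ready)
print(valid_ready)
```

Note that a validation value outside the training range (400 in column `b`) lands outside 0-1 after scaling, since the scaler only knows the training min and max.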


<br>
<br>

***[back to top](#top)***


<a id="5"></a>

## Baseline Mutual Information
<br>
<br>
I saw in my EDA process there was very little correlation in the dataset between any columns.  None of the feature columns were even mildly correlated with the 'claim' column nor were any correlated with any other feature column.  Kind of a bummer.
<br>
<br>
I'm going to try mutual information (the classification version, since the target is binary) first to see if I can create some sort of ranking of feature importance so I can, hopefully, eliminate noise from the data.  
<br>
<br>
Per Kaggle's Feature Engineering Course:

***Mutual information describes relationships in terms of uncertainty. The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more confident would you be about the target?***

Code Taken From Kaggle's Feature Engineering Course:  [Link To Kaggle's Feature Engineering Course Page Here](https://www.kaggle.com/ryanholbrook/mutual-information)

In [None]:
def make_mi_scores(X, y):
    mi_scores = mutual_info_classif(X, y)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

In [None]:
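def plot_mi_scores(scores):
    #sort ascending so the highest score ends up at the top of the horizontal bar chart
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))  #bar positions along the y-axis
    ticks = list(scores.index)  #feature names used as tick labels
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")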

In [None]:
mi_scores = make_mi_scores(imputed_X_train, y_train)
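
For intuition about the scale of these scores, here is a small sketch that computes mutual information by hand for discrete variables.  The helper `discrete_mi` and the toy arrays are hypothetical and only for illustration; `mutual_info_classif` uses a different (nearest-neighbor based) estimator for continuous features, so this is not what scikit-learn runs internally.

```python
import numpy as np

def discrete_mi(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            if p_xy > 0:
                p_x = np.mean(x == xv)  # marginal of x
                p_y = np.mean(y == yv)  # marginal of y
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

target = np.array([0, 0, 1, 1])          # balanced binary target
copy_of_target = np.array([0, 0, 1, 1])  # perfectly informative feature
unrelated = np.array([0, 1, 0, 1])       # independent of the target

mi_dependent = discrete_mi(copy_of_target, target)
mi_independent = discrete_mi(unrelated, target)
print(mi_dependent, mi_independent)
```

A feature that fully determines a balanced binary target scores ln 2 (about 0.69 nats), while an independent feature scores 0.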

***[back to top](#top)***
<a id="6"></a>

### Mutual Information Scores Plot And Findings

In [None]:
plt.figure(dpi=100, figsize=(10, 30))
plot_mi_scores(mi_scores)

***[back to top](#top)***

<br>

<a id="6a"></a>

### Mutual Information Findings/Conclusions

##### According to the mutual information estimates, no feature has a mutual information score greater than 0.005.

##### A mutual information score of 0 indicates that two variables are independent.  There is no universal maximum of 1; scikit-learn reports scores in nats, and for a binary target like "claim" the score cannot exceed the target's entropy (about 0.69 nats for balanced classes).  All the features ***appear to have almost zero*** relationship with the y variable.

##### With scores this low, mutual information doesn't appear very helpful for ranking features.

In [None]:
# deleting some variables to save memory
del X_train
del y_train
del X_valid
del y_valid
del scaled_cols_valid
del scaled_cols_train
del scaled_cols_test
del imputed_X_train
del imputed_X_valid
del imputed_X_test
del X
del y

<br>
<br>

***[back to top](#top)***

<a id="7"></a>
## Creating Some Features

<a id="8"></a>

### Counting Missing Values Per Row

@craigmthomas found that counting missing values per row was informative to the model.  [Link to discussion here](https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/274758)

<br>

I'm starting with that feature based on his suggestion!

In [None]:
#There are no null values in the target column so I shouldn't need to exclude it here.
train_df['nullcount'] = train_df.isnull().sum(axis=1)
test_df['nullcount'] = test_df.isnull().sum(axis=1)

In [None]:
train_df.groupby(['nullcount','claim']).size().unstack().plot(kind = 'bar', legend=True, title="Row Null Counts By Claim Group")

##### As @craigmthomas mentioned, the more missing values a row has, the more likely it seems that a claim was made.  Shout out to him for the suggestion.
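
Here is a tiny sketch of the row-wise null count on a made-up frame, just to show what the feature looks like.  The frame `toy` and the summary `claim_rate_by_nullcount` are hypothetical names for illustration, and the toy values say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows: two features and a binary target.
toy = pd.DataFrame({"f1": [1.0, np.nan, 3.0],
                    "f2": [np.nan, np.nan, 0.5],
                    "claim": [0, 1, 0]})

# Count missing feature values per row (the target is excluded explicitly here).
toy["nullcount"] = toy.drop(columns="claim").isnull().sum(axis=1)

# Share of rows with claim == 1 for each null count.
claim_rate_by_nullcount = toy.groupby("nullcount")["claim"].mean()

print(toy)
print(claim_rate_by_nullcount)
```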

In [None]:
not_claim_col = [col for col in train_df.columns if col != 'claim' and col != 'nullcount']

***[back to top](#top)***

<a id="9"></a>

### Counting Feature Values Greater Than Zero and Less Than Zero

In [None]:
train_df["gt_count"] = train_df[not_claim_col].gt(0).sum(axis=1)
test_df["gt_count"] = test_df[not_claim_col].gt(0).sum(axis=1)

In [None]:
train_df.groupby(['gt_count','claim']).size().unstack().plot(kind = 'bar', legend=True)

In [None]:
train_df["lt_count"] = train_df[not_claim_col].lt(0).sum(axis=1)
test_df["lt_count"] = test_df[not_claim_col].lt(0).sum(axis=1)

In [None]:
train_df.groupby(['lt_count','claim']).size().unstack().plot(kind = 'bar', legend=True)

***[back to top](#top)***

<a id="10"></a>

### Average Across Features

In [None]:
train_df['avg'] = train_df[not_claim_col].mean(axis=1)
test_df['avg'] = test_df[not_claim_col].mean(axis=1)

<a id="11"></a>

### Min Across Features

In [None]:
train_df["min_val"] = train_df[not_claim_col].min(axis=1)
test_df["min_val"] = test_df[not_claim_col].min(axis=1)

<a id="12"></a>

### Max Across Features

In [None]:
#store the row-wise max in its own column so it does not overwrite "min_val"
train_df["max_val"] = train_df[not_claim_col].max(axis=1)
test_df["max_val"] = test_df[not_claim_col].max(axis=1)
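
The row-wise aggregates in this section can be sketched on a toy frame as follows.  The values are made up, and `feature_cols` and `max_val` are names I am assuming for illustration.  One thing worth noticing is that pandas skips NaNs by default in `mean`, `min`, and `max`, and that comparisons against NaN are False, so missing cells never count as greater than or less than zero.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"f1": [1.0, -2.0, np.nan],
                    "f2": [3.0, np.nan, np.nan],
                    "f3": [-1.0, 4.0, 0.0]})

# Fix the list of original features up front so the new columns
# created below are not fed back into later aggregates.
feature_cols = ["f1", "f2", "f3"]

toy["gt_count"] = toy[feature_cols].gt(0).sum(axis=1)  # values above zero
toy["lt_count"] = toy[feature_cols].lt(0).sum(axis=1)  # values below zero
toy["avg"] = toy[feature_cols].mean(axis=1)            # NaNs are skipped
toy["min_val"] = toy[feature_cols].min(axis=1)
toy["max_val"] = toy[feature_cols].max(axis=1)

print(toy)
```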

***[back to top](#top)***

<a id="8a"></a>

### Flagging Missing Value Rows

<br>

We now know that counting the missing values per row appears to be pretty informative to the model (confirmed by running BorutaShap below).  The new 'nullcount' column is actually the most informative column in the dataset.  That was not a high bar to clear, since none of the features were correlated with the target column and all of them had tiny mutual information scores.  

<br>

When I impute the missing values during preprocessing, I lose this signal because every missing cell is replaced with a mean.  To extend this idea, what if we create new features flagging which values were missing in the original dataset and then impute?  This way, we can still see where things were missing in the original dataset.

The idea for this feature column came from Kaggle's and Alexis Cook's Intermediate Machine Learning Course "Missing Values" section.  [Link to the lesson here.](https://www.kaggle.com/alexisbcook/missing-values)

In [None]:
#iterating over all features columns and marking True or False if the value is missing (null).  Then converting that into an integer so it can be 
#used in our models universally.
for col in not_claim_col:
    train_df[col + '_was_missing'] = train_df[col].isnull().astype(int)
    test_df[col + '_was_missing'] = test_df[col].isnull().astype(int)
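
To make the idea concrete, here is a minimal sketch on one made-up column showing that the flag keeps the "was missing" information after the cell has been filled in.  I use `fillna` with the column mean as a stand-in for the `SimpleImputer` step, and `toy` is a hypothetical frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"f1": [1.0, np.nan, 3.0]})

# Flag first, while the NaN is still visible.
toy["f1_was_missing"] = toy["f1"].isnull().astype(int)

# Then impute with the column mean (stand-in for SimpleImputer).
toy["f1"] = toy["f1"].fillna(toy["f1"].mean())

print(toy)
```

After imputation the middle row looks like an ordinary value of 2.0, and only the flag column records that it was originally missing.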

***[back to top](#top)***

<a id="13"></a>

### Check For Transformation Ideas

#### Checking for possible transformations of positively skewed features: cube root, square root, or log transforms

<a id="14"></a>

### Positively Skewed Features

In [None]:
counter=1
num_rows = len([col for col in train_df[not_claim_col].columns if train_df[col].skew() > 1][:20])
plt.figure(1)
plt.subplots(num_rows,3,figsize=(25,25))


for i, item in enumerate([col for col in train_df[not_claim_col].columns if train_df[col].skew() > 1][:20]):
    
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**(1/3),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **(1/3)',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(np.log10(train_df[item]),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' log10',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**(1/2),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **(1/2)',fontsize=12,fontweight='bold')
    counter+=1
    plt.grid(True)
plt.subplots_adjust(top=1.5, bottom=0.2, left=0.10, right=0.95, hspace=0.45,
        wspace=0.4)

#### Potential choices:

* f4:  **(1/3)
* f8:  **(1/3)
* f14:  **(1/3)
* f30:  **(1/3)
* f38: **(1/3)
* f39: **(1/3)

***[back to top](#top)***

In [None]:
counter=1
num_rows = len([col for col in train_df[not_claim_col].columns if train_df[col].skew() > 1][20:40])
plt.figure(1)
plt.subplots(num_rows,3,figsize=(20,20))


for i, item in enumerate([col for col in train_df[not_claim_col].columns if train_df[col].skew() > 1][20:40]):
    
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**(1/3),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **(1/3)',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(np.log10(train_df[item]),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' log10',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**(1/2),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **(1/2)',fontsize=12,fontweight='bold')
    counter+=1
    plt.grid(True)
plt.subplots_adjust(top=1.5, bottom=0.2, left=0.10, right=0.95, hspace=0.45,
        wspace=0.4)

#### Potential Choices:
* f44:  **(1/3)
* f52:  **(1/3)
* f63:  **(1/3)
* f64:  **(1/3)
* f68:  **(1/3)
* f78:  **(1/3)
* f82:  **(1/3)
* f87:  **(1/3)

***[back to top](#top)***

In [None]:
counter=1
num_rows = len([col for col in train_df[not_claim_col].columns if train_df[col].skew() > 1][40:])
plt.figure(1)
plt.subplots(num_rows,3,figsize=(20,20))


for i, item in enumerate([col for col in train_df[not_claim_col].columns if train_df[col].skew() > 1][40:]):
    
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**(1/3),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **(1/3)',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(np.log10(train_df[item]),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' log10',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**(1/2),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **(1/2)',fontsize=12,fontweight='bold')
    counter+=1
    plt.grid(True)
plt.subplots_adjust(top=1.5, bottom=0.2, left=0.10, right=0.95, hspace=0.45,
        wspace=0.4)

#### Potential Choices:
* f89:  **(1/3)
* f95:  **(1/3)
* f101: **(1/3)
* f102: **(1/3)
* f103: **(1/3)
* f118: **(1/3)

***[back to top](#top)***

<a id="15"></a>

### Negatively Skewed Features

<br>

Try square, cube root, and log transformations

In [None]:
counter=1
num_rows = len([col for col in train_df if train_df[col].skew() < -1])
plt.figure(1)
plt.subplots(num_rows,3,figsize=(20,20))


for i, item in enumerate([col for col in train_df if train_df[col].skew() < -1]):
    
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**(1/3),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **(1/3)',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(np.log10(train_df[item]),color='#7571B0',alpha=0.75)
    plt.title(str(item)+' log10',fontsize=12,fontweight='bold')
    counter+=1
    plt.subplot(num_rows,3,counter)
    plt.hist(train_df[item]**2,color='#7571B0',alpha=0.75)
    plt.title(str(item)+' **2',fontsize=12,fontweight='bold')
    counter+=1
    plt.grid(True)
plt.subplots_adjust(top=1.5, bottom=0.2, left=0.10, right=0.95, hspace=0.45,
        wspace=0.4)

#### Potential Choices:
* f58:  **(1/3)
* f110: **(1/3)

***[back to top](#top)***

<a id="16"></a>

### Transformation Findings

##### I think, based on the above charts, there are several features that we could try transforming with a cube root transformation.  

***[back to top](#top)***

<a id="17"></a>

### Transforming Features

In [None]:
list_of_features_to_transform = ['f58','f110','f4','f8','f14','f30','f38','f39','f44','f52','f63','f64','f68','f78','f82','f87','f89','f95','f101','f102','f103','f118']

In [None]:
for i, item in enumerate(list_of_features_to_transform):
    train_df[str(item) + '_cubed_root'] = train_df[item] ** (1/3)

In [None]:
for i, item in enumerate(list_of_features_to_transform):
    test_df[str(item) + '_cubed_root'] = test_df[item] ** (1/3)
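
One thing I'd double-check about the cube root cells above: raising a negative float to the power `1/3` gives NaN in pandas/NumPy, so any negative values in those columns would turn into missing values and later be mean-imputed.  A possible alternative (my suggestion, not something tested in this notebook) is `np.cbrt`, which returns the real cube root for negative inputs.  The sketch below uses made-up values.

```python
import numpy as np
import pandas as pd

values = pd.Series([-8.0, 0.0, 27.0])

# Fractional power: the negative entry becomes NaN.
power_root = values ** (1 / 3)

# np.cbrt keeps the sign and returns the real cube root.
signed_root = np.cbrt(values)

print(power_root)
print(signed_root)
```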

<br>
<br>

***[back to top](#top)***

<a id="23"></a>

### Looking For Binning Opportunities

<br>

This dataset is supposed to be based on insurance claim data.  Even though it is simulated data, I would assume there is some financial information in it, such as income or amounts previously claimed, and I would imagine those would be informative to the model.  Even though I already found that none of the features are correlated with the target variable, perhaps binned (categorical) versions of them would be.  Let's check.

In [None]:
#visually looking at each original feature's distribution, below is a list of features that may be financially-based.

list_of_possible_financials = train_df.iloc[:,:118].columns
#list_of_possible_financials = ['f3','f7','f9','f10','f12','f16','f20','f25','f26','f27','f28','f32','f33','f35','f36','f37','f39','f41','f52','f62','f65','f67','f72','f73','f74','f77','f82','f83','f86','f89','f92','f96','f98','f102','f103','f104','f108','f114','f116','f117']

In [None]:
#define a visualization function.
def plot_bin_bars(num_bins,df):
    
    df2 = df.copy()
    
    num_rows = 24
    num_cols = 5

    row_ax_counter = 0
    col_ax_counter = 0

    plt.figure(1)

    fig, ax = plt.subplots(num_rows,num_cols,figsize=(35,30))

    width=0.35

    for i, item in enumerate(list_of_possible_financials):

        #bins_list = [bins for bins in range(int(train_df[item].min()),int(train_df[item].max()), int(round((train_df[item].max() - train_df[item].min())/num_bins)))]
        bins_list = list(np.arange(df2[item].min(),df2[item].max() + (df2[item].max() - df2[item].min())/num_bins,(df2[item].max() - df2[item].min())/num_bins))
        
        labels_test = ['bin'+str(lab+1) for lab in range(len(bins_list)-1)]

        new_col_name = str(item) + '_binned'

        df2[new_col_name] = pd.cut(df2[item], bins=bins_list, labels=labels_test)

        #plt.subplot(num_rows,4,i+1)

        groups_df = df2.groupby([new_col_name,'claim']).size().unstack()

        rects1 = ax[row_ax_counter,col_ax_counter].bar(x=[ind for ind in range(len(groups_df.index))],height=groups_df[0],color='#4F66AF',width=0.35,label='0',alpha=0.75)
        rects2 = ax[row_ax_counter,col_ax_counter].bar(x=[width + ind for ind in range(len(groups_df.index))],height=groups_df[1],color='#EDAC1A',width=0.35,label='1',alpha=0.75)
        ax[row_ax_counter,col_ax_counter].set_title(new_col_name + ' Bins By Claim Status')
        ax[row_ax_counter,col_ax_counter].legend()

        ax[row_ax_counter,col_ax_counter].set_xticks([ind + width/2 for ind in range(len(groups_df.index))])
        ax[row_ax_counter,col_ax_counter].set_xticklabels(groups_df.index)

        #fig.tight_layout()

        if col_ax_counter == 4:
            row_ax_counter+=1

        if col_ax_counter < 4:
            col_ax_counter+=1
        else:
            col_ax_counter = 0

        #plt.show()

        #train_df.groupby([new_col_name,'claim']).size().unstack().plot(kind = 'bar', legend=True, title=new_col_name + " Counts By Claim Group")

        plt.grid(True)
    plt.subplots_adjust(top=1.5, bottom=0.2, left=0.10, right=0.95, hspace=0.60,wspace=0.60)
    del df2

<br>
<br>

***[back to top](#top)***

<a id="24"></a>

### 10 Bins

In [None]:
plot_bin_bars(10,train_df)

<br>
<br>

***[back to top](#top)***

<a id="26"></a>

### 5 Bins

In [None]:
plot_bin_bars(5,train_df)

<br>
<br>

***[back to top](#top)***

<a id="27"></a>

### 3 Bins

In [None]:
plot_bin_bars(3,train_df)

<br>
<br>

***[back to top](#top)***

<a id="28"></a>

### Binning Findings/Conclusions and Bin Creation

<br>

Is there signal here?  Not sure.  The bars all seem very close throughout the bin distributions.  But, several of the features seem to have a little separation in bin1 so I'm just going to try those and see if it helps.

In [None]:
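def make_bin_cols(num_bins,bin_col_list,df):

    df2=df.copy()
    
    for item in bin_col_list:

        #num_bins equal-width bins need num_bins + 1 edges spanning min..max.
        #np.linspace avoids the floating point drift of np.arange with a float step
        #and keeps the top edge, so values in the highest range are not turned into NaN.
        bins_list = np.linspace(df2[item].min(), df2[item].max(), num_bins + 1)
        
        #integer labels 0..num_bins-1
        labels_test = list(range(len(bins_list)-1))

        new_col_name = str(item) + '_binned'

        #include_lowest=True keeps the column minimum in the first bin
        df2[new_col_name] = pd.cut(df2[item], bins=bins_list, labels=labels_test, include_lowest=True)
    
    return df2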

#### The list of columns to bin below comes from the bar charts above.  I visually scanned all the charts for features with more separation between claim levels.  Also, during the first iteration of BorutaShap, run previously on the default dataset, several features were tagged as important to the XGBoost classifier model.  I'm going to bin those as well.

In [None]:
cols_to_bin = \
['f2',
 'f32',
 'f21',
 'f86',
 'f42',
 'f110',
 'f94',
 'f103',
 'f34',
 'f102',
 'f5',
 'f40',
 'f46',
 'f12',
 'f83',
 'f111',
 'f112',
 'f36',
 'f30',
 'f57',
 'f9',
 'f95',
 'f52',
 'f107',
 'f78',
 'f90',
 'f70',
 'f91',
 'f14',
 'f35',
 'f25',
 'f3',
 'f81',
 'f65',
 'f48',
 'f31',
 'f47',
 'f71',
 'f92',
 'f24',
 'f69',
 'f56',
 'f11',
 'f28',
 'f7',
 'f23',
 'f62',
 'f104',
 'f39',
 'f16',
 'f77',
 'f96',
 'f1',
 'f43',
 'f45',
 'f4',
 'f118',
 'f8',
 'f27',
 'f53',
 'f10',
 'f38']

In [None]:
#run the above function to create binned columns
train_df = make_bin_cols(10,cols_to_bin,train_df)
test_df = make_bin_cols(10,cols_to_bin,test_df)
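
In the cell above, the training and test sets each get bin edges computed from their own min and max, so "bin 3" may cover a slightly different value range in each.  A possible alternative (an assumption on my part, not something I have scored) is to compute the edges on the training column once and reuse them for the test column.  The sketch below uses made-up values and hypothetical names.

```python
import numpy as np
import pandas as pd

train_col = pd.Series([0.0, 5.0, 10.0])
test_col = pd.Series([2.0, 7.0, 12.0])
num_bins = 2

# Equal-width edges from the training column only: [0, 5, 10].
edges = np.linspace(train_col.min(), train_col.max(), num_bins + 1)

# labels=False returns integer bin codes; include_lowest keeps the minimum.
train_binned = pd.cut(train_col, bins=edges, labels=False, include_lowest=True)
test_binned = pd.cut(test_col, bins=edges, labels=False, include_lowest=True)

print(train_binned.tolist())
print(test_binned.tolist())
```

Test values outside the training range (12 here) come back as NaN, so they would still need to be clipped or imputed afterwards.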

<br>
<br>

***[back to top](#top)***

<a id="18"></a>

## Preprocess Number 2

<br>

Now that new features have been added to the dataset, I will re-run preprocessing with scaling and imputation.

<a id="19"></a>
    
### Train/Test Split

In [None]:
#copy the training dataset
X = train_df.copy()
y = X.pop('claim')

In [None]:
#split the dataset into a training/validation set 
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0,train_size=0.25, test_size=0.1)

***[back to top](#top)***

<a id="20"></a>

### Min-Max Scaling

In [None]:
float_cols = [col for col in train_df if col != 'claim']
#save the minmax scaler function as a variable
mm_scaler = MinMaxScaler()

#min-max scale the numeric columns only.  In this case, that is every column.

#fit and transform the training df
scaled_cols_train = pd.DataFrame(mm_scaler.fit_transform(X_train[float_cols]),index = X_train.index, columns = X_train.columns)

#just transform the validation and test df.  
scaled_cols_valid = pd.DataFrame(mm_scaler.transform(X_valid[float_cols]),index = X_valid.index, columns = X_valid.columns)
scaled_cols_test = pd.DataFrame(mm_scaler.transform(test_df),index= test_df.index, columns = test_df.columns)

In [None]:
#removing variables to save memory
del X
del y
del X_train
del X_valid

***[back to top](#top)***

<a id="21"></a>
    
### Imputation

In [None]:
# Imputing AFTER min-max scaling so the mean imputation is on the same scale.

#set simple imputer variable.  By default, this imputes using the mean to replace missing values
my_imputer = SimpleImputer()

#fit and transform the training df
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(scaled_cols_train), index=scaled_cols_train.index)

imputed_X_valid = pd.DataFrame(my_imputer.transform(scaled_cols_valid), index=scaled_cols_valid.index)
imputed_X_test = pd.DataFrame(my_imputer.transform(scaled_cols_test), index=test_df.index)


# Imputation removed column names; put them back

imputed_X_train.columns = scaled_cols_train.columns
imputed_X_valid.columns = scaled_cols_valid.columns
imputed_X_test.columns = test_df.columns

In [None]:
#removing variables to save memory
del scaled_cols_valid
del scaled_cols_test

<br>
<br>

***[back to top](#top)***

<a id="22"></a>

## BorutaShap
    
<br>
    
From the [documentation site](https://pypi.org/project/BorutaShap/)

<br>
    
***BorutaShap is a wrapper feature selection method which combines both the Boruta feature selection algorithm with shapley values. This combination has proven to out perform the original Permutation Importance method in both speed, and the quality of the feature subset produced. Not only does this algorithm provide a better subset of features, but it can also simultaneously provide the most accurate and consistent global feature rankings which can be used for model inference too.***
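
As I understand the Boruta idea, each real feature competes against "shadow" features, which are shuffled copies of the real columns that keep the same values but lose any relationship with the target.  A feature is kept only if its importance (SHAP-based in BorutaShap) reliably beats the best shadow.  The sketch below shows only the shadow-creation step on made-up data; it is not BorutaShap's actual code, and all names in it are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

toy_X = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0],
                      "f2": [10.0, 20.0, 30.0, 40.0]})

# Shuffle each column independently to build its shadow copy.
shadows = pd.DataFrame(
    {"shadow_" + col: rng.permutation(toy_X[col].to_numpy()) for col in toy_X.columns},
    index=toy_X.index,
)

# The model would then be trained on real and shadow features together.
augmented = pd.concat([toy_X, shadows], axis=1)

print(augmented)
```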

In [None]:
!pip3 install BorutaShap
from BorutaShap import BorutaShap

<br>
<br>

***[back to top](#top)***

<a id="30"></a>

#### An initial run of BorutaShap found the following features to be important

<br>

In my first run of BorutaShap, I included all default feature columns, plus the cubed_root engineered features, max, min, average and nullcount engineered features.  Using that feature set and XGBoost Classifier, the below features were found to be important or tentative.  In the next run of BorutaShap, I will exclude all unimportant features and test all the missing row flag features to see if those are important to the model.


<br>

confirmed_important = ['f86', 'f1', 'f71', 'f2', 'f95', 'f112', 'f10', 'f5', 'f70', 'f32', 'f16', 'f83', 'nullcount', 'f11', 'f14', 'f107', 'f12', 'f69', 'f3', 'f8', 'f62', 'f96', 'f102', 'f34', 'f24', 'f42', 'f21', 'f40', 'f65', 'f48', 'f43', 'f104', 'f25', 'f36', 'f77', 'f35', 'f47', 'f52', 'f9', 'f53', 'f31', 'f111', 'f45', 'f46', 'f78', 'f92', 'f27', 'f103', 'f57', 'f28', 'f38']

<br>

tentative = ['f78_cubed_root', 'f109', 'f106', 'f81', 'f117', 'f73', 'f30', 'f7', 'f60', 'f75', 'f97', 'f116']

In [None]:
confirmed_important = ['f86', 'f1', 'f71', 'f2', 'f95', 'f112', 'f10', 'f5', 'f70', 'f32', 'f16', 'f83', 'nullcount', 'f11', 'f14', 'f107', 'f12', 'f69', 'f3', 'f8', 'f62', 'f96', 'f102', 'f34', 'f24', 'f42', 'f21', 'f40', 'f65', 'f48', 'f43', 'f104', 'f25', 'f36', 'f77', 'f35', 'f47', 'f52', 'f9', 'f53', 'f31', 'f111', 'f45', 'f46', 'f78', 'f92', 'f27', 'f103', 'f57', 'f28', 'f38']
tentative = ['f78_cubed_root', 'f109', 'f106', 'f81', 'f117', 'f73', 'f30', 'f7', 'f60', 'f75', 'f97', 'f116']

### Drop Unimportant Columns

In [None]:
#drop everything that was not important.  Keep the missing value flag ('missing') and binned ('binned') columns.
cols_to_drop = [col for col in imputed_X_train.columns if col not in confirmed_important + tentative and not col.endswith('missing') and not col.endswith('binned')]

In [None]:
imputed_X_train.drop(columns = cols_to_drop, axis=1, inplace=True)

In [None]:
test_df.drop(columns = cols_to_drop, axis=1, inplace=True)
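
The filter above can be checked on a few made-up column names before it is applied to the real frames. The sketch below is illustrative only: the column names and the helper `columns_to_drop` are hypothetical, and the keep list is turned into a set so the membership test does not rebuild the combined list for every column.

```python
def columns_to_drop(columns, keep, keep_suffixes=("missing", "binned")):
    """Return columns that are neither in `keep` nor end with a kept suffix."""
    keep = set(keep)
    # str.endswith accepts a tuple, so one call covers every suffix.
    return [c for c in columns if c not in keep and not c.endswith(keep_suffixes)]


columns = ["f1", "f2", "f4", "nullcount", "f4_missing", "f2_binned", "average"]
keep = ["f1", "f2", "nullcount"]

print(columns_to_drop(columns, keep))  # ['f4', 'average']
```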

### BorutaShap One More Time with features selected during first iteration plus '_missing' engineered features and '_binned' features

In [None]:
#baseline classifier model from Previous EDA notebook
model = XGBClassifier(random_state=0, verbosity=0, tree_method='gpu_hist',use_label_encoder=False,n_estimators=500,learning_rate=0.05,n_jobs=4)

Feature_Selector = BorutaShap(model = model, importance_measure='shap', classification=True)

Feature_Selector.fit(X=imputed_X_train, y=y_train, n_trials=20, random_state=0)

In [None]:
Feature_Selector.plot(which_features='all')

<br>
<br>

***[back to top](#top)***

<a id="31"></a>

### Final Feature Set

In [None]:
Feature_Selector.Subset().columns

In [None]:

tentative_features_final = ['f116', 'f75', 'f7', 'f11', 'f8', 'f117', 'f12', 'f102', 'f69', 'f78', 'f97', 'f30', 'f104', 'f10']


In [None]:
features_to_keep = list(Feature_Selector.Subset().columns) + tentative_features_final

In [None]:
print('Final feature set includes {} features.'.format(len(features_to_keep)))
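
One thing worth checking at this point: if the hand-picked tentative list ever overlaps with the subset BorutaShap accepted, the concatenated list will contain repeated names, and selecting DataFrame columns with such a list returns repeated columns. The two lists may well be disjoint in this run, so the following is only a precaution. It is a minimal sketch of an order-preserving de-duplication, with made-up feature names standing in for the real lists.

```python
accepted = ["f27", "f47", "nullcount"]    # stand-in for Feature_Selector.Subset().columns
tentative_final = ["f116", "f47", "f75"]  # stand-in for tentative_features_final

combined = accepted + tentative_final
# dict preserves insertion order, so this keeps the first occurrence of each name.
deduplicated = list(dict.fromkeys(combined))

print(len(combined), len(deduplicated))  # 6 5
print(deduplicated)
```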

##### Almost none of the engineered features turned out to be helpful in the model!  The only engineered feature that helped was 'nullcount'.  The rest turned out to be noise.

<br>
<br>

***[back to top](#top)***

<a id="32"></a>

## Final Preprocess

<br>

<a id="33"></a>

### Filter Features And Final Train/Test Split

In [None]:
#copy the training dataset
X = train_df.copy()
y = X.pop('claim')

#### Now only include the important features found above

In [None]:
X = X[features_to_keep]
test_df = test_df[features_to_keep]

In [None]:
#split the dataset into a training/validation set 
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0,train_size=0.8, test_size=0.2)

<br>

***[back to top](#top)***

<a id="34"></a>

### Final Min-Max Scaling

In [None]:
float_cols = [col for col in X_train if col != 'claim']
#create the min-max scaler object
mm_scaler = MinMaxScaler()

#min-max scale the numeric columns only.  In this case, that is every column.

#fit and transform the training df
scaled_cols_train = pd.DataFrame(mm_scaler.fit_transform(X_train[float_cols]),index = X_train.index, columns = X_train.columns)

#just transform the validation and test df.  
scaled_cols_valid = pd.DataFrame(mm_scaler.transform(X_valid[float_cols]),index = X_valid.index, columns = X_valid.columns)
scaled_cols_test = pd.DataFrame(mm_scaler.transform(test_df),index= test_df.index, columns = test_df.columns)
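
The reason for fitting on the training split only is that the validation and test rows should be scaled with statistics they did not contribute to. The standalone sketch below mimics that in plain Python with toy numbers. It is not how `MinMaxScaler` is implemented internally, and `fit_min_max` and `apply_min_max` are hypothetical helpers. It also shows a side effect: values outside the training range land outside [0, 1].

```python
def fit_min_max(train_values):
    """Learn the scaling statistics from the training split only."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Scale values with previously learned statistics: (x - min) / (max - min)."""
    return [(v - lo) / (hi - lo) for v in values]


train = [2.0, 4.0, 6.0, 10.0]
valid = [1.0, 6.0, 12.0]  # contains values outside the training range

lo, hi = fit_min_max(train)
print(apply_min_max(train, lo, hi))  # [0.0, 0.25, 0.5, 1.0]
print(apply_min_max(valid, lo, hi))  # [-0.125, 0.5, 1.25]
```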

In [None]:
#removing variables to save memory
del X
del y
del X_train
del X_valid

***[back to top](#top)***

<a id="35"></a>
    
### Final Imputation

In [None]:
# Imputing AFTER min-max scaling so the mean imputation is on the same scale.

#create the simple imputer.  By default, it replaces missing values with the column mean
my_imputer = SimpleImputer()

#fit and transform the training df
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(scaled_cols_train), index=scaled_cols_train.index)

imputed_X_valid = pd.DataFrame(my_imputer.transform(scaled_cols_valid), index=scaled_cols_valid.index)
imputed_X_test = pd.DataFrame(my_imputer.transform(scaled_cols_test), index=test_df.index)


# Imputation removed column names; put them back

imputed_X_train.columns = scaled_cols_train.columns
imputed_X_valid.columns = scaled_cols_valid.columns
imputed_X_test.columns = test_df.columns
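
As with scaling, the imputer learns its fill values (the column means) from the training split and reuses them for the validation and test sets. A small plain-Python sketch of that behaviour for a single column follows. The numbers are invented, and `fit_mean` and `impute` are hypothetical helpers, not scikit-learn functions.

```python
import math


def fit_mean(train_values):
    """Mean of the observed (non-NaN) training values."""
    observed = [v for v in train_values if not math.isnan(v)]
    return sum(observed) / len(observed)


def impute(values, fill):
    """Replace NaNs with the fill value learned from the training split."""
    return [fill if math.isnan(v) else v for v in values]


train = [0.0, 0.5, math.nan, 1.0]
valid = [math.nan, 0.25]

fill = fit_mean(train)       # 0.5, computed from the training values only
print(impute(train, fill))   # [0.0, 0.5, 0.5, 1.0]
print(impute(valid, fill))   # [0.5, 0.25]
```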

In [None]:
#removing variables to save memory
del scaled_cols_valid
del scaled_cols_test

***[back to top](#top)***

<a id="36"></a>



## Baseline Model With Selected Features (Previous Best Score:  0.79414 AUC).

#### My first submission, which covered EDA, preprocessing, and a baseline model, reached a best baseline score of only ***0.79414 AUC***.  Hopefully, all this feature selection will improve the baseline model performance. [Link Here To My First Notebook](https://www.kaggle.com/abrambeyer/tps-sep21-eda-preprocess-baseline-model)

<br>

#### I am using the same parameters as my baseline model.  The only differences are the selected feature subset and one engineered feature:  nullcount.

In [None]:
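#Recent XGBoost releases expect eval_metric and early_stopping_rounds on the
#XGBClassifier constructor rather than in fit(); this call uses the older API.
final_model = XGBClassifier(random_state=0, verbosity=0, tree_method='gpu_hist',use_label_encoder=False,n_estimators=1550,learning_rate=0.05,n_jobs=4)
#stop training if the validation AUC has not improved for 200 rounds
final_model.fit(imputed_X_train, y_train,
                verbose = False,
                eval_set = [(imputed_X_valid, y_valid)],
                eval_metric = "auc",
                early_stopping_rounds = 200)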
preds_valid = final_model.predict_proba(imputed_X_valid)[:,1]
print(roc_auc_score(y_valid, preds_valid))
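
For reference, the AUC reported here can be read as the probability that a randomly chosen positive row receives a higher predicted score than a randomly chosen negative row, with ties counted as one half. The sketch below computes that quantity directly on toy scores. `pairwise_auc` is a hypothetical helper written for illustration, and its quadratic loop is far too slow for the real validation set, where `roc_auc_score` is the right tool.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```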

***[back to top](#top)***

<a id="37"></a>

### Final Baseline Predictions With Selected Features

In [None]:
predictions = final_model.predict_proba(imputed_X_test)[:,1]

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': imputed_X_test.index,
                       'claim': predictions})
output.to_csv('submission.csv', index=False)

***[back to top](#top)***

<a id="38"></a>

## TLDR:  Summary/Findings

1.  The default feature set includes 118 variables, and no single feature has a mutual information score greater than 0.005.
2.  There are no highly correlated features in the default feature set.  None of the features are correlated with the target variable.
3.  After trying binning, flagging missing values, and taking the average, max, and min across columns, the best engineered feature was simply a count of how many columns had a missing value in each row.  I called this feature "nullcount."
4.  Running BorutaShap on the final engineered dataset, the following features were found to be important to the XGBoost classifier:
    ['f27', 'f47', 'f45', 'f16', 'f34', 'f36', 'f62', 'f106', 'f57', 'f24',
       'f103', 'f2', 'f5', 'f83', 'f21', 'f107', 'f3', 'f28', 'f96', 'f31',
       'f40', 'f42', 'f95', 'nullcount', 'f77', 'f32', 'f92', 'f53', 'f65',
       'f111', 'f38', 'f48', 'f35', 'f52', 'f70']
5.  By creating the 'nullcount' column and using the above feature set, I was able to improve my baseline model performance from ***0.79414*** to ***0.81351***.
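
Since `nullcount` was the only engineered feature that helped, here is a minimal sketch of how such a per-row missing-value count can be computed. The rows are made up, and the code is plain Python so it runs on its own. On a pandas DataFrame the usual equivalent is `df.isna().sum(axis=1)`.

```python
import math

rows = [
    {"f1": 0.2, "f2": math.nan, "f3": 1.5},
    {"f1": math.nan, "f2": math.nan, "f3": 0.7},
    {"f1": 0.9, "f2": 0.1, "f3": 0.3},
]

# Count the missing (NaN) entries in each row.
nullcount = [sum(math.isnan(v) for v in row.values()) for row in rows]
print(nullcount)  # [1, 2, 0]
```

The count has to be taken before imputation, because afterwards there are no missing values left to count.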