# About Tabular Playground Series - Sep 2021

The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting whether a claim will be made on an insurance policy. Although the features are anonymized, they have properties relating to real-world features.

The aim is to predict whether a customer made a claim upon an insurance policy. The ground truth claim is binary valued, but a prediction may be any number from 0.0 to 1.0, representing the probability of a claim. The features in this dataset have been anonymized and may contain missing values.

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
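To make the metric a bit more concrete, here is a minimal sketch using scikit-learn's `roc_auc_score` on made-up values (the arrays are purely illustrative, not competition data). It compares the AUC of probability predictions with the AUC of the same predictions after thresholding at 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.9])

# AUC from probabilities uses the full ranking of the predictions
auc_prob = roc_auc_score(y_true, y_prob)

# Thresholding at 0.5 first collapses the ranking to two levels
y_label = (y_prob >= 0.5).astype(int)
auc_label = roc_auc_score(y_true, y_label)

print("AUC from probabilities:", auc_prob)
print("AUC from hard labels  :", auc_label)
```

In this toy case every claim is ranked above every non-claim, so the probability AUC is perfect, while the thresholded labels lose most of that ranking information.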

This competition involves a tabular dataset. Intended as a stepping stone between the 'Titanic Getting Started' competition and a Featured competition, it should be approachable for all. Competition masters or grandmasters won't be challenged at this level.

# About this notebook

This notebook is a work in progress and will be regularly updated. It is my first notebook for this competition and concentrates on data exploration and creating a baseline for future notebooks, which will concentrate on modelling different solutions. This is a beginner-level notebook meant for my own use and not intended to be a training aid or tutorial. Some of the code is taken from other public notebooks; sources are credited at the bottom.

# About David Coxon

I work in IT for an art gallery and generally work with in-house data on visitor numbers, public donations, power usage, course attendance, sentiment analysis and AdWords. Much of my work is with time series.

This is my fourth contest. Starting with Titanic, I moved on to tackle a house pricing contest and then the 30 Days of ML challenge.

## Set up environment

Let's import all of the libraries that we will need to complete the notebook. This means loading libraries we might not need if we are only running one code block, but it saves loading the same library over and over if we add them cell by cell and run the entire notebook.

In [None]:
%%time

import os

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn import ensemble, linear_model,metrics,model_selection,neighbors,preprocessing, svm, tree
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.feature_selection import mutual_info_regression
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import cross_validate,cross_val_score,train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import MinMaxScaler,StandardScaler, RobustScaler

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

print('Libraries imported')

# Get data

In [None]:
%%time
# Get data
train=pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv', index_col='id')
test=pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv',index_col='id')
# keep the list of feature names (the test set has every column except the target)
features=list(test.columns)

print('Data Import Complete')

## Memory reduction

In [None]:
%%time
## from: https://www.kaggle.com/bextuychiev/how-to-work-w-million-row-datasets-like-a-pro
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

train = reduce_memory_usage(train, verbose=True)
test = reduce_memory_usage(test, verbose=True)
print('Memory reduced')
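As a rough illustration of the trade-off behind downcasting, the sketch below uses a made-up single-column frame (not the competition data) to show how much memory a `float64` to `float32` conversion saves, and how much rounding error a `float16` conversion introduces for values between 0 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical column of 1000 float64 values between 0 and 1
df = pd.DataFrame({"f1": np.linspace(0.0, 1.0, 1000)})

# Memory in bytes before and after downcasting to float32
before = df.memory_usage(index=False).sum()
after = df.astype(np.float32).memory_usage(index=False).sum()
print("bytes as float64:", before)
print("bytes as float32:", after)

# float16 is smaller again but only keeps about three significant digits
round_trip = df["f1"].astype(np.float16).astype(np.float64)
max_err16 = (df["f1"] - round_trip).abs().max()
print("largest float16 rounding error:", max_err16)
```

The rounding error is small for data in this range, but it is worth keeping in mind if a feature needs more precision than `float16` can hold.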

## First look at Data

In [None]:
%%time
# shape of data
print('How much data was imported?')
print('training data shape ;',train.shape)
print('test data shape ;',test.shape)

# missing data
print('\nHow much data is missing?')

# total number of missing cells, and their share of all cells in each frame
train_missing_total = train.isnull().values.sum()
test_missing_total = test.isnull().values.sum()
print('missing training data :  {} ({:.1f}%)'.format(train_missing_total, 100 * train_missing_total / train.size))
print('missing test data :  {} ({:.1f}%)'.format(test_missing_total, 100 * test_missing_total / test.size))
print('\noverview complete')

In [None]:
%%time
# DataFrame.info() prints its report directly and returns None
train.info()
print()
test.info()
print()

In [None]:
%%time
print('Sample training data')
train.head()

In [None]:
%%time
print('Sample test data')
test.head()

In [None]:
%%time
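# only the last expression in a cell is shown automatically, so display both explicitly
display(train.describe())
display(test.describe())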

## Missing values

In [None]:
%%time
#drop observations (rows) with nan's 
print('Missing Training Data')
train2=train.dropna(axis='rows')
print ("rows ; ",train.shape[0],"\nrows with missing data : ", ((train.shape[0]) - (train2.shape)[0]),(((train.shape[0]) - (train2.shape)[0])/train.shape[0])*100,"%")

#drop features (columns) with nan's 
train3=train.dropna(axis='columns')
print("columns ; ", train.shape[1], "\ncolumns with missing data : ", ((train.shape[1]) - (train3.shape)[1]))

print('\nMissing Test Data')
test2=test.dropna(axis='rows')
print ("rows ; ",test.shape[0],"\nrows with missing data : ", ((test.shape[0]) - (test2.shape)[0]),(((test.shape[0]) - (test2.shape)[0])/test.shape[0])*100,"%")

#drop features (columns) with nan's 
test3=test.dropna(axis='columns')
print("columns ; ", test.shape[1], "\ncolumns with missing data : ", ((test.shape[1]) - (test3.shape)[1]),'\n')
size=[train.shape[0],test.shape[0]]
# rows that contain at least one missing value
missing=[train.shape[0]-train2.shape[0],test.shape[0]-test2.shape[0]]
positions=np.arange(len(size))
plt.bar(positions,size,color='b',alpha=0.5,label='all rows')
plt.bar(positions,missing,color='r',alpha=0.5,label='rows with missing data')
plt.xticks(positions,('training','test'))
plt.title('Rows with missing data')
plt.legend()
plt.show()

In [None]:
%%time
# Plot missing data
training_missing_val_count_by_column = (train.isnull().sum())
plt.bar(np.arange(0,len(training_missing_val_count_by_column),1),training_missing_val_count_by_column)
plt.xlabel('column')
plt.ylabel('Number of missing values')
plt.show()
print(training_missing_val_count_by_column.describe())

## Missing data by claim

There seems to be a relationship between the target value 'claim' and the missing values. One of the discussion topics explains this by saying that insurance data often comes from a number of different insurance companies/agencies that collect and process different data, so data from different sources is often missing different features. It also suggests that different companies may be more or less likely to receive claims.

In this exercise the relationship is such that we can use the degree of missing data as a feature that makes a notable difference in model accuracy.

It's possible that, even in models where there is no such direct relationship between missing data and target value, making 'missingness' a feature will still help with model accuracy, because in a random forest based model it helps the model distinguish between observations/rows that have real values and observations where data has been estimated.

In [None]:
# missing value by label
claim=train[train['claim']==1]
missing_claim=(claim.isnull().sum())
noclaim=train[train['claim']==0]
missing_noclaim=(noclaim.isnull().sum())
plt.bar([1,2],[missing_claim.sum(),missing_noclaim.sum()])
plt.xticks([1,2],('claim','No claim'))
plt.show()
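A complementary way to look at the same relationship is to count the missing values per row and compare the claim rate for each count. The sketch below does this on a tiny made-up frame; the column names `f1`, `f2` and `n_missing` are placeholders, not the real data.

```python
import numpy as np
import pandas as pd

# Toy data: two features with gaps and a binary target (illustrative only)
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "f2": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "claim": [1, 1, 0, 1, 0, 0],
})
feature_cols = ["f1", "f2"]

# Number of missing feature values in each row
toy["n_missing"] = toy[feature_cols].isnull().sum(axis=1)

# Share of rows with a claim for each missing-value count
claim_rate = toy.groupby("n_missing")["claim"].mean()
print(claim_rate)
```

If the claim rate changes clearly with the count, as it does in this constructed example, the count is likely to be a useful feature.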

### Initial observation on missing data

There are a large number of missing data points: 1820782 in the training data alone (about 1.6%). Approximately 62% of the observations (rows) are missing at least some data, and every feature (column) is missing some data. Working out how to handle the missing data will be a priority.

Both training and test sets are missing approximately the same percentage of data.
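The three figures above measure different things: the share of missing cells, the share of rows with at least one gap, and the number of columns with at least one gap. A small sketch on a made-up frame shows how each one can be computed with pandas.

```python
import numpy as np
import pandas as pd

# Toy frame with 4 rows and 3 columns (illustrative values only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})
is_missing = toy.isnull()

# Share of all cells that are missing
cell_pct = 100 * is_missing.values.sum() / toy.size
# Share of rows that have at least one missing value
row_pct = 100 * is_missing.any(axis=1).mean()
# Number of columns that have at least one missing value
cols_with_missing = int(is_missing.any(axis=0).sum())

print(cell_pct, row_pct, cols_with_missing)
```

A low cell-level percentage can therefore sit alongside a high row-level percentage when the gaps are spread thinly over many rows.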

# Visualising the data

## Claims

In [None]:
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(train['claim'])

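## Initial observations on claims

The target is binary, so the two equally sized peaks at 0 and 1 show that claims and non-claims are roughly balanced in the training set. The bell shapes around each value come from the smoothing of the density plot rather than from the data.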
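For a binary target, the class balance can also be read directly from the proportions rather than from a density plot. A minimal sketch, using a made-up `claim` series:

```python
import pandas as pd

# Toy binary target (illustrative values only)
claim = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="claim")

# Proportion of rows in each class, ordered by class label
balance = claim.value_counts(normalize=True).sort_index()
print(balance)
```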

## Distribution boxplots by feature

In [None]:
%%time
# generate box plots
train_outliers = ((train - train.min())/(train.max() - train.min()))
fig, ax = plt.subplots(4, 1, figsize = (25,25))
# start at column 0: 'id' is the index here, so column 0 is the first feature
sns.boxplot(data = train_outliers.iloc[:, 0:30], ax = ax[0])
sns.boxplot(data = train_outliers.iloc[:, 30:60], ax = ax[1])
sns.boxplot(data = train_outliers.iloc[:, 60:90], ax = ax[2])
sns.boxplot(data = train_outliers.iloc[:, 90:120], ax = ax[3])

## Train and test distribution histograms by feature.

In [None]:
%%time
train_outliers = ((train - train.min())/(train.max() - train.min()))
test_outliers = ((test - test.min())/(test.max() - test.min()))
try:
    train_outliers.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
print('Training data (blue), Test Data (red)')
fig = plt.figure(figsize = (20, 140))
for idx, i in enumerate(train_outliers.columns):
    # add_subplot needs integer grid sizes, and np.ceil returns a float
    fig.add_subplot(int(np.ceil(len(train_outliers.columns)/4)), 4, idx+1)
    train_outliers.iloc[:, idx].hist(bins = 20,color='b',alpha=0.5)
    test_outliers.iloc[:, idx].hist(bins = 20,color='r',alpha=0.5)
    plt.title(i)
#plt.text(9, -20000, caption, size = 12)
plt.show()

## Observations on distribution of feature values.

The test and training data follow very similar distributions.

Some of the features appear to have normal distributions while others appear to have logarithmic distributions. Some features appear to have anomalies in their distributions. Further exploration may be required to discover which of these features are useful and which are noise.

## Feature correlation

In [None]:
%%time
# Generate correlations
corr=train.corr()

# create heatmap
mask = np.triu(np.ones_like(corr, dtype = bool))
plt.figure(figsize = (15, 15))
plt.title('Correlation matrix for Train data')
sns.heatmap(corr, mask = mask,annot=False, linewidths = .5,square=True,cbar_kws={"shrink": .60})
plt.show()

In [None]:
%%time
# get list of columns with high correlations
corr_matrix = train.corr().abs()
high_corr=np.where(corr_matrix>0.02)
high_corr=[(corr_matrix.columns[x],corr_matrix.columns[y]) for x,y in zip(*high_corr) if x!=y and x<y]
print("high correlation \n",high_corr)
featuresofinterest=['f6','f15', 'f32', 'f34','f36','f45','f46','f51', 'f57','f86','f90','f97','f111']

## Observations on Correlation 

There are a reasonably high number of features (118) with a relatively low correlation between features. Feature reduction/engineering may be required to reduce noise.

All of the features are numeric, but there is a high degree of variability in the scale of the different features, so normalization may be required.

It is unlikely that we can use correlations between features to help fill missing values more accurately.
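To show the idea behind listing correlated feature pairs, here is a small sketch on synthetic data. The column names and the 0.5 threshold are illustrative assumptions, not values from this notebook. It keeps only the upper triangle of the absolute correlation matrix so each pair appears once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)

# f2 is built from f1 plus a little noise; f3 is independent
toy = pd.DataFrame({
    "f1": x,
    "f2": x + rng.normal(scale=0.1, size=500),
    "f3": rng.normal(size=500),
})

corr = toy.corr().abs()
# Keep the strict upper triangle so each pair is listed once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Hypothetical threshold for calling a pair strongly correlated
strong = pairs[pairs > 0.5]
print(strong)
```

With a threshold like this, only the pair that was constructed to be related shows up, which is the kind of pair that could help when filling missing values.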

# Data Engineering

## Add the number of missing values as a feature

Based on the discussion at https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/270206 we are going to add the number of missing values for each row as a feature.

In [None]:
# add missing data field
if "Missing" in train.columns:
    print('Missing training feature exists')
else:
    missing_values_list=train.isnull().sum(axis=1).tolist()
    train.insert(0,"Missing",missing_values_list, True)
    print('Missing training feature added')
    
if "Missing" in test.columns:
    print('Missing test feature exists')
else:
    missing_values_list2=test.isnull().sum(axis=1).tolist()
    test.insert(0, "Missing",missing_values_list2, True)
    print('Missing test feature added')

## Impute missing values

For now let's use the same method to impute all missing numbers; we might look at this again later and use different strategies for different features. We can choose between mean and median by commenting out the appropriate line.

During exploration I had very different results from different models and wanted to rule out problems with preprocessing, so as well as using an imputer I tried using a loop to substitute the mean for each feature. You can choose either approach to deal with the missing numbers.



In [None]:
%%time
# impute missing values
imputer = SimpleImputer(strategy='mean')
train[features]=imputer.fit_transform(train[features])
test[features] =imputer.transform(test[features])
print('Missing values imputed')
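As a small sanity check on the options mentioned above, the sketch below compares mean and median imputation on a made-up array, and confirms that a simple per-column loop with `fillna` gives the same result as `SimpleImputer(strategy='mean')`. The data and column names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one gap per column and one large outlier in the first column
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [100.0, 30.0],
])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X_toy)
median_filled = SimpleImputer(strategy="median").fit_transform(X_toy)

# Loop version: fill each column with its own mean
loop_filled = pd.DataFrame(X_toy, columns=["f1", "f2"])
for col in loop_filled.columns:
    loop_filled[col] = loop_filled[col].fillna(loop_filled[col].mean())

print("mean fill for f1  :", mean_filled[1, 0])
print("median fill for f1:", median_filled[1, 0])
print("loop matches imputer:", np.allclose(loop_filled.to_numpy(), mean_filled))
```

The outlier pulls the mean fill well away from the median fill, which is one reason to try both strategies on skewed features.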

## Normalize data

Let's normalise all of the features. We can choose from standard, min/max or robust scaling by commenting out the appropriate options.

In [None]:
%%time
# select scaler
scale = StandardScaler()
#scale = RobustScaler()
#scale = MinMaxScaler()
train[features]=scale.fit_transform(train[features])
test[features]= scale.transform(test[features])  

print('Data scaled')
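To get a feel for how the three scalers differ, here is a minimal sketch on a single made-up column that contains one outlier. The values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier
col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std = StandardScaler().fit_transform(col)   # zero mean, unit variance
mm = MinMaxScaler().fit_transform(col)      # rescaled to the range [0, 1]
rob = RobustScaler().fit_transform(col)     # centred on the median, scaled by the IQR

print("standard:", std.ravel())
print("min/max :", mm.ravel())
print("robust  :", rob.ravel())
```

With min/max scaling the outlier squeezes the other values close to zero, while the robust scaler keeps them spread out around the median.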

# Prepare data for modeling

In [None]:
%%time
try:
    y = train.claim
    train.drop(['claim'], axis=1, inplace=True)
except:
    print('claim already separated')
X=train
print('Target data separated')
# Create data sets for training (20%) and validation (20%); the remaining 60% of rows are not used here
X_train, X_valid, y_train, y_valid = train_test_split(X, y,train_size=0.2,test_size = 0.2,random_state = 0)

## Create function to score performance

In [None]:
%%time
# taken from https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground

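def score(X, y, model, cv):
    scoring = ["roc_auc"]
    # cross-validate on the data passed in, not on the global X_train/y_train
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    scores = pd.DataFrame(scores).T
    # compute both summaries from the fold columns only, so std does not include the mean column
    return scores.assign(
        mean = scores.mean(axis=1),
        std = scores.std(axis=1),
    )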
print('Function built')

modelcomparison=pd.DataFrame(columns=['Model', 'ROC_AUC', 'Score'])
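For reference, this is roughly what the cross-validation summary looks like on a small synthetic problem. The data comes from `make_classification` and the model is a plain `LogisticRegression`; both are stand-ins chosen only to keep the sketch fast, not the models used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (stand-in for the competition data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

cv_result = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    scoring=["roc_auc"],
    cv=4,
    return_train_score=True,
)

# One row per metric, one column per fold, plus summary columns
folds = pd.DataFrame(cv_result).T
summary = folds.assign(mean=folds.mean(axis=1), std=folds.std(axis=1))
print(summary)
```

Comparing the `test_roc_auc` and `train_roc_auc` rows gives a quick view of how much the model overfits.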

# Baseline modelling with XGBoost

Essentially this is a classification problem in that claims are binary: customers either make a claim or they don't. So we are going to use the XGBClassifier model to create a baseline with which to compare future models.

In [None]:
%%time
# Instantiate model
xgbmodel = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=3,
    min_child_weight=3,
    subsample=0.5,
    colsample_bytree=0.5,
    n_jobs=-1,
    objective='binary:logistic',
    eval_metric='auc', 
    # Uncomment if you want to use GPU. Recommended for whole training set.
    #tree_method='gpu_hist',
    random_state=38)
xgbmodel.fit(X_train, y_train, verbose = False)

print('model fit complete')

## Evaluate model performance

In [None]:
%%time
# evaluate model on predicted probabilities, as the competition metric does
y_preds = xgbmodel.predict_proba(X_valid)[:, 1]
xgb_score=roc_auc_score(y_valid, y_preds)
print(xgb_score)

xgbscores = score(X_train, y_train, xgbmodel, cv=4)
display(xgbscores)
print('scoring completed')

A "neutral" AUC is 0.5, so anything better than that means our model learned something useful.

Note: the model scored around 0.8 ROC AUC on both test and train folds. The basic model, without imputation and the missing-data feature, scored 0.75 and 0.86 for test and train respectively.

## Confusion Matrix

In [None]:
%%time
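# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator (1.0+) replaces it
metrics.ConfusionMatrixDisplay.from_estimator(xgbmodel, X_valid, y_valid)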
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

## Make a baseline submission
This is a classification problem, so the model's class predictions are binary (0 or 1). The next code block produces binary predictions, but you're allowed to submit probabilities instead, so the code block under that uses the predict_proba method instead of predict to produce a probability-based submission.

In [None]:
%%time
# taken from https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground

# Make predictions
predictions = xgbmodel.predict(test)

# Save the predictions to a CSV file
sample_submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")
sample_submission.claim = predictions
sample_submission.to_csv("prediction_binary_submission.csv",index=False)
print('xgboost binary prediction complete')

In [None]:
%%time
# taken from https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground

# Make probabilistic predictions
# predict_proba returns one column per class; column 1 is the probability of a claim
prob_predictions = xgbmodel.predict_proba(test)[:, 1]

# Save the predictions to a CSV file
sample_submission_prob = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")
sample_submission_prob.claim = prob_predictions
sample_submission_prob.to_csv("normalised_prediction_proba_submission.csv",index=False)
print('xgboost probabilistic prediction complete')
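The difference between the two prediction methods is easiest to see from the shapes they return. The sketch below uses a `LogisticRegression` on synthetic data as a stand-in classifier (an assumption made only to keep the example light); scikit-learn style classifiers generally follow the same convention.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a simple stand-in classifier
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = clf.predict(X_demo)        # hard 0/1 class labels, shape (n_samples,)
proba = clf.predict_proba(X_demo)   # one column per class, shape (n_samples, 2)

# Columns follow clf.classes_, so column 1 is the probability of class 1
p_claim = proba[:, 1]

print(labels.shape, proba.shape, p_claim.shape)
print("classes:", clf.classes_)
```

For a submission scored on AUC, it is the single column `p_claim` that would go into the `claim` field.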

In [None]:
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(sample_submission['claim'])
sns.kdeplot(sample_submission_prob['claim'])

### Observations  on baseline results

The scores for the untuned XGBClassifier model without imputing missing values or normalizing the data were 0.49385 for the binary and 0.50924 for the probabilistic submissions.

The scores for the XGBClassifier model with imputed missing values and minmax normalization of the data improved to 0.54.

The leaderboard has scores of 0.8, so there is definitely room for improvement.

# Try other Models:

Having got a baseline with XGBoost, I tried a couple of alternative models to see how they performed. These were just best guesses and not yet tuned. I went with 1000 estimators for each, as that seemed to work for XGBoost, which I did more testing on.

In [None]:
%%time
try:
    y = train.claim
    train.drop(['claim'], axis=1, inplace=True)
except:
    print('claim already separated')
X=train
print('Target data separated')
# Create data sets for training (80%) and validation (20%)
X_train, X_valid, y_train, y_valid = train_test_split(X, y,train_size=0.8,test_size = 0.2,random_state = 0)

## LGBMClassifier

In [None]:
%%time
# instantiate model
lgbmmodel = LGBMClassifier(learning_rate=0.05,
                      n_estimators=1000,
                      reg_lambda = 1)
# fit model
lgbmmodel.fit(X_train, y_train)
# score model on predicted probabilities, which is how the competition metric is defined
y_probs = lgbmmodel.predict_proba(X_valid)[:, 1]
lgbm_score=roc_auc_score(y_valid, y_probs)
print("Area under the curve score ; ", lgbm_score)

# evaluate model on hard 0/1 predictions for comparison
y_preds = lgbmmodel.predict(X_valid)
lgbscores=roc_auc_score(y_valid, y_preds)
print("Area under the curve score (binary) ; ",lgbscores)

lgbscore = score(X_train, y_train, lgbmmodel, cv=4)
display(lgbscore)
print('scoring completed')

# create confusion matrix
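# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator (1.0+) replaces it
metrics.ConfusionMatrixDisplay.from_estimator(lgbmmodel, X_valid, y_valid)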
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

# predict test
lgbm_preds = lgbmmodel.predict(test)#[:, 1]

# Save the predictions to a CSV file
lgbm_submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")
lgbm_submission.claim = lgbm_preds
lgbm_submission.to_csv("lgbm_prediction_submission.csv",index=False)
print('lgbm prediction complete')   

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(lgbm_submission['claim'])

### Observations on lgbm model

The lightgbm model scored 0.794 on the Area Under Curve score and achieved a public leaderboard score of 0.64211. 

137820 correct predictions, 53764 incorrect predictions.
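The two counts above can be turned into an accuracy figure with a little arithmetic, which works out to roughly 72% of the validation rows:

```python
# Counts read from the confusion matrix above
correct = 137820
incorrect = 53764

total = correct + incorrect
accuracy = correct / total

print("validation rows:", total)
print("accuracy:", round(accuracy, 4))
```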

## Catboost

In [None]:
%%time
# catboost

# instantiate model
catboostmodel = CatBoostClassifier(iterations=1555, 
    bootstrap_type='Bernoulli', 
    od_wait=1144, 
    learning_rate=0.025, 
    reg_lambda=36, 
    random_strength=43.75, 
    depth=7, 
    min_data_in_leaf=11, 
    leaf_estimation_iterations= 1, 
    subsample= 0.8227911142845009,
    verbose=0)
# fit model
catboostmodel.fit(X_train, y_train)
print("model fit")
# score model on predicted probabilities
y_probs = catboostmodel.predict_proba(X_valid)[:, 1]
catboostmodel_score=roc_auc_score(y_valid, y_probs)
print("Area under the curve score ; ", catboostmodel_score)

# evaluate model on hard 0/1 predictions for comparison
y_preds = catboostmodel.predict(X_valid)
catboostscores=roc_auc_score(y_valid, y_preds)
print("Area under the curve score (binary) ; ", catboostscores)
catboostscore = score(X_train, y_train, catboostmodel, cv=4)
display(catboostscore)
print('scoring completed')

# create confusion matrix
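# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator (1.0+) replaces it
metrics.ConfusionMatrixDisplay.from_estimator(catboostmodel, X_valid, y_valid)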
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

# predict test
catboost_preds = catboostmodel.predict(test)
# Save the predictions to a CSV file
catboost_submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")
catboost_submission.claim = catboost_preds
catboost_submission.to_csv("catboost_prediction_submission.csv",index=False)
print('catboost prediction complete')   

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(catboost_submission['claim'])

## Comparing model performance

With missing values filled using feature means and features standardized, all of the models managed an area under the curve score of 0.79. However, the models scored very differently on the public leaderboard: the best performing model was the lgbm with 0.64, followed by the random forest with 0.54 and xgboost with only 0.33.

The xgboost scores that I got ranged from 0.33 to 0.54 depending on how I filled the missing values and normalized the data.

# Feature engineering

Now that we have an idea of how well the basic model works, we can start to look at how important some of the features are. We can then try and remove some of the noisy features to produce better models. 

### Important features

In [None]:
# Feature importance lgbm
# instantiate and fit model
lgbm_checker = LGBMClassifier(learning_rate=0.05,
                      n_estimators=1000,
                      reg_lambda = 1)
lgbm_checker.fit(X_train, y_train)

# put feature importance into table
importances_df = pd.DataFrame(lgbm_checker.feature_importances_, columns=['Feature_Importance'],
                              index=X_train.columns)
# sort ascending so the most important feature is drawn at the top of the horizontal bar chart
importances_df = importances_df.sort_values(['Feature_Importance'])
#print(importances_df)

# plot importance as bar chart
# take the labels after sorting so that they line up with the bars
importances=importances_df.index
y_pos = np.arange(len(importances_df))
plt.figure(figsize=(8,20))
plt.barh(y_pos,importances_df['Feature_Importance'])
plt.yticks(y_pos, importances,fontsize=10)
plt.xlabel('importance')
plt.ylabel('feature')
plt.title('Feature importance (lgbm)')
plt.show()
# print(importances_df)

In [None]:
# get important features: rank from most to least important before taking the top 95
importart_features = (
    importances_df.sort_values('Feature_Importance', ascending=False).head(95).index.tolist()
)
print(importart_features)
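The selection step is easy to get backwards, so here is a small self-contained sketch of ranking features by importance and keeping the top few. It uses a `RandomForestClassifier` on synthetic data as a stand-in for the LightGBM model, and the choice of three features is arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 8 features, only some of which carry signal
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i + 1}" for i in range(8)])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Rank features from most to least important, then keep the first k
ranked = pd.Series(model.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
top_features = ranked.head(3).index.tolist()

print(ranked)
print("kept:", top_features)
```

Sorting in descending order before calling `head` is what makes the kept list the most important features rather than the least important ones.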

### Visualize important features

In [None]:
%%time
train_important=train[importart_features]
test_important=test[importart_features]
train_outliers = ((train_important - train_important.min())/(train_important.max() - train_important.min()))
test_outliers = ((test_important - test_important.min())/(test_important.max() - test_important.min()))
try:
    train_outliers.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
print('Training data (blue), Test Data (red)')
fig = plt.figure(figsize = (20, 40))
for idx, i in enumerate(train_outliers.columns):
    # add_subplot needs integer grid sizes, and np.ceil returns a float
    fig.add_subplot(int(np.ceil(len(train_outliers.columns)/4)), 4, idx+1)
    train_outliers.iloc[:, idx].hist(bins = 20,color='b',alpha=0.5)
    test_outliers.iloc[:, idx].hist(bins = 20,color='r',alpha=0.5)
    plt.title(i)
#plt.text(9, -20000, caption, size = 12)
plt.show()

## Feature Reduction (lgbmodel)

In [None]:
%%time
#get test
#testcolumns=pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv',index_col='id')
#test.columns=testcolumns.columns

# Create data sets for training (80%) and validation (20%)
rX_train, rX_valid, ry_train, ry_valid = train_test_split(X[importart_features], y,train_size=0.8,test_size = 0.2,random_state = 0)

# instantiate model
lgbmmodel = LGBMClassifier(learning_rate=0.05,
                      n_estimators=1000,
                      reg_lambda = 1)
# fit model
lgbmmodel.fit(rX_train, ry_train)
print('model fit')
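# score model on predicted probabilities
y_probs = lgbmmodel.predict_proba(rX_valid)[:, 1]
lgbm_score=roc_auc_score(ry_valid, y_probs)
print("Area under the curve score ; ", lgbm_score)

# evaluate model on hard 0/1 predictions for comparison
y_preds = lgbmmodel.predict(rX_valid)
lgbmmodelscores=roc_auc_score(ry_valid, y_preds)
print("Area under the curve score (binary) ; ", lgbmmodelscores)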
lgbmmodelscore = score(rX_train, ry_train, lgbmmodel, cv=4)
display(lgbmmodelscore)
print('scoring completed')

# predict test
lgbm_preds = lgbmmodel.predict(test[importart_features])

#confusion matrix
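# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator (1.0+) replaces it
metrics.ConfusionMatrixDisplay.from_estimator(lgbmmodel, rX_valid, ry_valid)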
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

# Save the predictions to a CSV file
lgbm_submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")
lgbm_submission.claim = lgbm_preds
lgbm_submission.to_csv("lgbm_prediction_featurereduction_submission2.csv",index=False)
print('lgbm prediction with reduced features complete')   

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(lgbm_submission['claim'])

### Observations of feature reduction

In this really basic test, reducing the feature set to the top 25 features reduced the model's performance. The predictions in the top half of the confusion matrix improved a little, but the lower half was significantly worse.

# Blending

Blending is a type of stacking that places the predictions of several different models into a new dataframe and then models that dataframe to use the best predictors for each feature.

We will use the same parameters as used previously for our models. I've used 0.33 as the hold-out (validation) size. It is quite a big data set and we are using a lot of estimators in each model, so it will take a while to run.

In [None]:
%%time

# Split data 
X_train,X_valid,y_train,y_valid = train_test_split(X,y,test_size=0.33, random_state=48,stratify=y)

# instantiate models
lgbm_model =  LGBMClassifier(learning_rate=0.05,
                      n_estimators=500,
                      reg_lambda = 1)
catboost_model =  CatBoostClassifier(iterations=1555, 
    bootstrap_type='Bernoulli', 
    od_wait=1144, 
    learning_rate=0.025, 
    reg_lambda=36, 
    random_strength=43.75, 
    depth=7, 
    min_data_in_leaf=11, 
    leaf_estimation_iterations= 1, 
    subsample= 0.8227911142845009,
    verbose=0)
xgb_model =  XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=3,
    min_child_weight=3,
    subsample=0.5,
    colsample_bytree=0.5,
    n_jobs=-1,
    objective='binary:logistic',
    eval_metric='auc', 
    random_state=38)

# create zero values
lgbm_preds = np.zeros(len(test))
catboost_preds = np.zeros(len(test))
xgb_preds = np.zeros(len(test))

lgbm_oof = np.zeros(len(X_valid))
catboost_oof = np.zeros(len(X_valid))
xgb_oof = np.zeros(len(X_valid))

# LGBM
lgbm_model.fit(X_train,y_train)
lgbm_oof = lgbm_model.predict_proba(X_valid)[:,1]
lgbm_preds = lgbm_model.predict_proba(test)[:,1]
print('lgbm model complete')

# Catboost
catboost_model.fit(X_train,y_train)
catboost_oof = catboost_model.predict_proba(X_valid)[:,1]
catboost_preds = catboost_model.predict_proba(test)[:,1]
print('catboost model complete')

# XGBoost
xgb_model.fit(X_train,y_train)
xgb_oof = xgb_model.predict_proba(X_valid)[:,1]
xgb_preds = xgb_model.predict_proba(test)[:,1]
print('xgboost model complete')

blend_x_valid

### Observations on blending

The highest score from a single model was 0.79 which improved to 0.81 when 3 models were blended.
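The blending cell above stops once the base models have produced their hold-out and test predictions. As a rough sketch of how a blend like this can be finished, the example below fits a `LogisticRegression` meta-model on the hold-out predictions and applies it to predictions for unseen rows. Everything here is an assumption for illustration: the data is synthetic, the two base models are scikit-learn stand-ins for the boosted models, and names such as `blend_x_hold` and `blend_x_new` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data; X_new plays the role of the unseen test set
X_all, y_all = make_classification(n_samples=900, n_features=10, random_state=0)
X_rest, X_new, y_rest, y_new = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0, stratify=y_all
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.33, random_state=48, stratify=y_rest
)

# Stand-in base models (the notebook uses LightGBM, CatBoost and XGBoost)
base_models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}

hold_preds, new_preds = {}, {}
for name, model in base_models.items():
    model.fit(X_fit, y_fit)
    hold_preds[name] = model.predict_proba(X_hold)[:, 1]
    new_preds[name] = model.predict_proba(X_new)[:, 1]

# One column of predicted probabilities per base model
blend_x_hold = pd.DataFrame(hold_preds)
blend_x_new = pd.DataFrame(new_preds)

# Meta-model learns how to weight the base models from the hold-out rows
blender = LogisticRegression().fit(blend_x_hold, y_hold)
blend_scores = blender.predict_proba(blend_x_new)[:, 1]
blend_auc = roc_auc_score(y_new, blend_scores)

print("blended AUC on unseen rows:", blend_auc)
```

The meta-model is fitted only on rows the base models did not train on, which keeps it from simply rewarding whichever base model overfits the most.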

# History
This notebook is my first notebook on this competition and is very much about exploring the data to see what we have to work with before i decide whether to  

I have added a few more models including random forest and LGBMClassifier to compare results. I have also revisited the code for filling the missing values and normalising the data because I was getting quite inconsistent public scores.

I've added confusion matrices to each of the models to better understand their scores.

This is version 10 of this notebook but is very much a draft | work in progress.

Version 10 includes some blending and built on previous feature engineering.

## Next Steps

Parameter tuning,
Stacking and Ensembles.

# Credit where credit's due
The following notebooks were super inspirational in creating this notebook, and gave me insights I may not have otherwise come up with. Thank you all for sharing your code, it's awesome!

Memory reduction https://www.kaggle.com/bextuychiev/how-to-work-w-million-row-datasets-like-a-pro

Boxplots https://www.kaggle.com/snikhil17/making-basic-eda-attractive based on https://www.kaggle.com/suharkov/sep-2021-playground-eda-no-model-for-now

Model and evaluation https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground

Confusion matrix https://www.kaggle.com/maksymshkliarevskyi/tps-sep-all-for-start-eda-xgb-catboost-baseline

Blending https://www.kaggle.com/raahulsaxena/tps-sept-21-ensembling-stacking-blending

I'm relatively new to Kaggle contests. If you have any helpful hints, observations or recommendations, please post them in the comments.

If you found this notebook useful please upvote. If you found any of the notebooks listed above useful or use any of their code, please credit and upvote the source.