## About Tabular Playground Series - Oct 2021

The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN.

The dataset deals with predicting the biological response of molecules given various chemical properties. Although the features are anonymized, they have properties relating to real-world features.

## About this notebook

This notebook is a work in progress and will be regularly updated. It is my first notebook for this competition and concentrates on data exploration and on creating a baseline for future notebooks that will focus on modelling different solutions. This is a beginner-level notebook meant for my own use and is not intended to be a training aid or tutorial. Some of the code is taken from other public notebooks; sources are credited at the bottom.

## First thoughts on this month's project

* This month's Tabular Playground dataset is once again quite large, so managing both CPU usage and RAM is going to be an important element of the project.
* It looks like another classification problem.
* There is no missing data, so imputing values will not be required.
* There are both categorical and continuous features. The categorical data is all binary, and some of the continuous data appears to be category-like. It may be possible to reduce memory requirements by redefining data types without losing any meaningful information.
* Data engineering and feature importance may be important.
* It's likely that model selection and hyperparameter tuning will be important.
* Stacking, blending and ensembles are likely to be important for getting higher scores.
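
As a rough illustration of the data type point above, here is a minimal sketch of downcasting columns with `pd.to_numeric`. The toy frame and its column names are made up for the example and are not the competition data; note that `downcast="float"` stops at `float32`, so it is more conservative than a conversion to `float16`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the competition data
df = pd.DataFrame({
    "binary_flag": np.array([0, 1, 1, 0], dtype=np.int64),
    "measurement": np.array([0.25, 0.5, 0.75, 1.0], dtype=np.float64),
})

before = df.memory_usage(deep=True).sum()

# Downcast each column to the smallest dtype pandas considers safe for its values
df["binary_flag"] = pd.to_numeric(df["binary_flag"], downcast="integer")
df["measurement"] = pd.to_numeric(df["measurement"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```

On a frame with hundreds of columns and a million rows the saving can be substantial, but it is worth checking that the smaller float type keeps enough precision for the features in question.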

## Set up environment

In [None]:
%%time

import os, psutil
import gc

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import cross_validate,cross_val_score,train_test_split, KFold, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, log_loss, roc_auc_score
from sklearn import ensemble,metrics,model_selection,neighbors,preprocessing, svm, tree
import lightgbm as lgb
from lightgbm import LGBMClassifier

from statsmodels.graphics.mosaicplot import mosaic

# machine learning tools
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator, H2ORandomForestEstimator, H2OGradientBoostingEstimator

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Create functions

In [None]:
%%time
# taken from https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground

def cpu_stats():
    pid = os.getpid()
    py = psutil.Process(pid)
    memory_use = py.memory_info()[0] / 2. ** 30
    return 'memory GB:' + str(np.round(memory_use, 2))

def score(X, y, model, cv):
    scoring = ["roc_auc"]
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True  # use the arguments passed in, not globals
    )
    scores = pd.DataFrame(scores).T
    return scores.assign(
        mean = lambda x: x.mean(axis=1),
        std = lambda x: x.std(axis=1),
    )
print('Function built')


## Get data

In [None]:
%%time
# Get data
train=pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
test=pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
print("Data imported")

## from: https://www.kaggle.com/bextuychiev/how-to-work-w-million-row-datasets-like-a-pro
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

train = reduce_memory_usage(train, verbose=True)
test = reduce_memory_usage(test, verbose=True)
print(cpu_stats())
print('Memory reduced')

## Get features

In [None]:
%%time
features=[]
cat_features=[]

In [None]:
cont_features=[]
for feature in test.columns:
    features.append(feature)
    if test.dtypes[feature]=='int8':
        cat_features.append(feature)
    if test.dtypes[feature]=='float16':
        cont_features.append(feature)
    #print(test.dtypes[feature])
print('features obtained')

plt.pie([len(cat_features), len(cont_features)], 
        labels=['Categorical', 'Continuous'],
        colors=['skyblue', 'blue'],
        textprops={'fontsize': 13},
        autopct='%1.1f%%')
plt.show()

## About imported data

In [None]:
%%time
# Get shape of data
print('*'*40, '\nHow much data was imported?')
print('*'*40)
print('Training data :', train.shape)
print('Test data :', test.shape)
print('*'*40,"\n")

# missing data
print('*'*40,'\nHow much data is missing?')
print('*'*40)
# total number of missing cells, reported as a share of all cells (rows x columns)
training_missing_val_count_by_column = (train.isnull().values.sum())
test_missing_val_count_by_column = (test.isnull().values.sum())
print('Missing training data :  {} ({:.1f}%)'.format (training_missing_val_count_by_column,100*training_missing_val_count_by_column/train.size))
print('Missing test data :  {} ({:.1f}%)'.format (test_missing_val_count_by_column,100*test_missing_val_count_by_column/test.size))
print('*'*40,"\n")

# categorical data
print('*'*40,'\nFeature types?')
print('*'*40)
print('Categorical features : ', (len(cat_features)))
print('Continuous features : ', (len(cont_features)))
print('*'*40,'\n')

# get info
print('*'*40,'\nInfo on datasets')
print('*'*40)
print(train.info(),'\n')
print(test.info(),'\n')
print('*'*40)

print('\noverview complete')  

plt.pie([len(train), len(test)], 
        labels=['train', 'test'],
        colors=['skyblue', 'blue'],
        textprops={'fontsize': 13},
        autopct='%1.1f%%')
plt.show()

del training_missing_val_count_by_column,test_missing_val_count_by_column

## Sample Data

In [None]:
%%time
# Get column titles
#print('*'*40,'\nColumn Names')
#print('*'*40)
#print(features)
#print('*'*40)

# Get sample data
print('*'*40,'\nSample Training Data')
print('*'*40)
print(train.head(),'\n')
#print('*'*40,'\n')
#print('*'*40,'\nSample Test Data')
#print('*'*40)

## Distribution of Categorical Data

In [None]:
%%time
train_outliers = ((train[cat_features] - train[cat_features] .min())/(train[cat_features] .max() - train[cat_features] .min()))
test_outliers = ((test[cat_features] - test[cat_features].min())/(test[cat_features].max() - test[cat_features].min()))
try:
    train_outliers.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
print('Training data (blue), Test Data (red)')
fig = plt.figure(figsize = (20, 40))
for idx, i in enumerate(train_outliers.columns):
    fig.add_subplot(int(np.ceil(len(train_outliers.columns)/4)), 4, idx+1)  # subplot counts must be integers
    train_outliers.iloc[:, idx].hist(bins = 2,color='b',alpha=0.5)
    test_outliers.iloc[:, idx].hist(bins = 2,color='r',alpha=0.5)
    plt.title(i)
#plt.text(9, -20000, caption, size = 12)
plt.show()
del train_outliers, test_outliers

## Distribution of Categorical Data by target value

In [None]:
%%time
# category features by target value

# separate data by target value
train_0=train[train.target==0]
train_1=train[train.target==1]

# separate cat features
train_outliers = ((train[cat_features] - train[cat_features] .min())/(train[cat_features] .max() - train[cat_features] .min()))
train_outliers_0 = ((train_0[cat_features] - train_0[cat_features] .min())/(train_0[cat_features] .max() - train_0[cat_features] .min()))
train_outliers_1 = ((train_1[cat_features] - train_1[cat_features].min())/(train_1[cat_features].max() - train_1[cat_features].min()))
try:
    train_outliers_0.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
try:
    train_outliers_1.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
print('Distribution for category features, sky blue=target 0, blue=target 1')
fig = plt.figure(figsize = (20, 40))
for idx, i in enumerate(train_outliers_0.columns):
    fig.add_subplot(int(np.ceil(len(train_outliers_0.columns)/4)), 4, idx+1)  # subplot counts must be integers
    train_outliers_0.iloc[:, idx].hist(bins = 2,color='skyblue',alpha=0.3)
    train_outliers_1.iloc[:, idx].hist(bins = 2,color='blue',alpha=0.3)
    plt.title(i)
plt.show()
del train_outliers_0, train_outliers_1,train_0, train_1

## Distribution of Continuous data

In [None]:
%%time
#train_outliers = ((train[cont_features] - train[cont_features].min())/(train[cont_features].max() - train[cont_features].min()))
#train_test = ((test[cont_features] - test[cont_features].min())/(test[cont_features].max() - test[cont_features].min()))
train_outliers=train[cont_features]
test_outliers=test[cont_features]
try:
    train_outliers.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
print('Training data (blue), Test Data (red)')
fig = plt.figure(figsize = (40, 140))
for idx, i in enumerate(train_outliers.columns):
    fig.add_subplot(int(np.ceil(len(train_outliers.columns)/4)), 4, idx+1)  # subplot counts must be integers
    train_outliers.iloc[:, idx].hist(bins=20,color='b',alpha=0.5)
    test_outliers.iloc[:, idx].hist(bins = 20,color='r',alpha=0.5)    
    plt.subplots_adjust(hspace=0.2)
    plt.title(i)
plt.show()

In [None]:
%%time
# continuous features by target value

# separate data by target value
train_0=train[train.target==0]
train_1=train[train.target==1]

train_outliers_0=train_0[cont_features]
train_outliers_1=train_1[cont_features]
try:
    train_outliers_0.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
try:
    train_outliers_1.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
print('Distribution of continuous features, blue=target0, red=target1')

# one figure, four subplots per row
fig = plt.figure(figsize=(18, 150), facecolor='#EAEAF2')
for idx, i in enumerate(train_outliers_0.columns):
    fig.add_subplot(int(np.ceil(len(train_outliers_0.columns)/4)), 4, idx+1)
    train_outliers_0.iloc[:, idx].hist(bins=20,color='b',alpha=0.5)
    train_outliers_1.iloc[:, idx].hist(bins = 20,color='r',alpha=0.5)  
    plt.subplots_adjust(hspace=0.2)
    plt.title(i)
plt.show()

del train_0,train_1,train_outliers_0,train_outliers_1

I thought for one glorious moment that I'd hit upon some great secret within the data, where features like 196 and 202 had separable distributions of feature values directly correlated with the target value. However, it looks like this is merely down to the way the data is scaled to produce the histograms: the original histograms and the KDE plots of the same data don't show the same separation.
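
One way to check whether an apparent separation is only an artefact of how the histograms are built is to put both classes on the same bin edges and normalise each to unit area. The sketch below uses made-up values drawn from a single distribution (so there is no real separation) and is only an illustration of the idea, not the check I actually ran.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature values for the two target classes:
# same underlying distribution, different class sizes
values_0 = rng.normal(loc=0.0, scale=1.0, size=5000)
values_1 = rng.normal(loc=0.0, scale=1.0, size=20000)

# Shared bin edges computed from the pooled data
edges = np.histogram_bin_edges(np.concatenate([values_0, values_1]), bins=20)

# density=True normalises each histogram to unit area,
# so the class sizes no longer affect the bar heights
dens_0, _ = np.histogram(values_0, bins=edges, density=True)
dens_1, _ = np.histogram(values_1, bins=edges, density=True)

max_gap = np.abs(dens_0 - dens_1).max()
print(f"largest density gap between classes: {max_gap:.3f}")
```

If the gap stays small with shared edges and density normalisation, any separation seen in the raw overlaid histograms is more likely to come from differing bin ranges or class counts than from the feature itself.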

## KDE plot of feature distribution colour coded by target value

In [None]:
%%time
# continuous features by target value

# separate data by target value
train_0=train[train.target==0]
train_1=train[train.target==1]

train_outliers_0=train_0[cont_features]
train_outliers_1=train_1[cont_features]
try:
    train_outliers_0.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
try:
    train_outliers_1.drop(['claim'], axis=1, inplace=True)
except:
    print('Already separated')
print('Distribution of continuous features, blue=target0, orange=target1')

print("Feature distribution of continuous features: ")
ncols = 5
nrows = int(len(cont_features) / ncols )#+ (len(features) % ncols > 0))

fig, axes = plt.subplots(nrows, ncols, figsize=(18, 150), facecolor='#EAEAF2')

for r in range(nrows):
    for c in range(ncols):
        col = cont_features[r*ncols+c]
        sns.kdeplot(x=train_outliers_0[col], ax=axes[r, c], color='blue', label='target=0')
        sns.kdeplot(x=train_outliers_1[col], ax=axes[r, c], color='orange', label='target=1')
        axes[r, c].set_ylabel('')
        axes[r, c].set_xlabel(col, fontsize=8, fontweight='bold')
        axes[r, c].tick_params(labelsize=5, width=0.5)
        axes[r, c].xaxis.offsetText.set_fontsize(4)
        axes[r, c].yaxis.offsetText.set_fontsize(4)
plt.show()

## Boxplots of continuous features

In [None]:
%%time
# generate box plots
train_outliers = ((train[cont_features] - train[cont_features].min())/(train[cont_features].max() - train[cont_features].min()))
fig, ax = plt.subplots(8, 1, figsize = (25,25))
sns.boxplot(data = train_outliers.iloc[:, 0:30], ax = ax[0])  # start at column 0 so the first feature is included
sns.boxplot(data = train_outliers.iloc[:, 30:60], ax = ax[1])
sns.boxplot(data = train_outliers.iloc[:, 60:90], ax = ax[2])
sns.boxplot(data = train_outliers.iloc[:, 90:120], ax = ax[3])
sns.boxplot(data = train_outliers.iloc[:, 120:150], ax = ax[4])
sns.boxplot(data = train_outliers.iloc[:, 150:180], ax = ax[5])
sns.boxplot(data = train_outliers.iloc[:, 180:210], ax = ax[6])
sns.boxplot(data = train_outliers.iloc[:, 210:240], ax = ax[7])
plt.show()
del train_outliers

## Feature correlation of categorical data

In [None]:
%%time
# Generate correlations in categorical data
corr=train[cat_features].corr()

# create heatmap
mask = np.triu(np.ones_like(corr, dtype = bool))
plt.figure(figsize = (15, 15))
plt.title('Correlation matrix for categorical features of training data')
sns.heatmap(corr,cmap='coolwarm', mask = mask,annot=False, linewidths = .5,square=True,cbar_kws={"shrink": .60})
plt.show()

## Feature correlation of continuous data

In [None]:
%%time
# get list of columns with high correlations
corr = train[cont_features].corr().abs()
high_corr=np.where(corr>0.02)
high_corr=[(corr.columns[x],corr.columns[y]) for x,y in zip(*high_corr) if x!=y and x<y]
print('initial correlations calculated')
#print("high correlation \n",high_corr)
high_corr_features=[]
for x in high_corr:
    for item in x:
        if item not in high_corr_features:
            high_corr_features.append(item)
corr_matrix = train[high_corr_features].corr()
print('highest correlations calculated')

# create heatmap
mask = np.triu(np.ones_like(corr_matrix, dtype = bool))
plt.figure(figsize = (15, 15))
plt.title('Correlation matrix for continuous features of training data')
sns.heatmap(corr_matrix, cmap='coolwarm',mask = mask,annot=False, linewidths = .5,square=True,cbar_kws={"shrink": .60})
plt.show()

## Feature correlation to target

In [None]:
%%time
# correlations to target
# idea/code taken https://www.kaggle.com/rahullalu/tps-oct-2021-eda-and-baseline

corr_cat=pd.DataFrame()
corr_cat['target'] = train[cat_features].corrwith(train['target'])
df_cat=corr_cat.sort_values(by='target', ascending=False)

corr_cont=pd.DataFrame()
corr_cont['target'] = train[high_corr_features].corrwith(train['target'])
df_cont=corr_cont.sort_values(by='target', ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(18, 10))
fig.suptitle('Correlation to Target')

heatmap = sns.heatmap(ax=axes[0],data=df_cat,annot=True,cmap='tab20c',linewidth=0.5,xticklabels=df_cat.columns,yticklabels=df_cat.index)
heatmap = sns.heatmap(ax=axes[1],data=df_cont,annot=True,cmap='tab20c',linewidth=0.5,xticklabels=df_cont.columns,yticklabels=df_cont.index)
plt.show()

corr_cont=pd.DataFrame()
corr_cont['target'] = train[cont_features].corrwith(train['target'])
df_cont=corr_cont.sort_values(by='target', ascending=False)

del df_cont, df_cat, corr_cat, corr_cont

## Feature importance - categorical features

In [None]:
# Feature importance lgbm
X=train[cat_features]
y=train['target']
# Split data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,train_size=0.8,test_size = 0.2,random_state = 0)

# instantiate and fit model
lgbm_checker = LGBMClassifier(learning_rate=0.05,
                      n_estimators=1000,
                      reg_lambda = 1)

lgbm_checker.fit(X_train, y_train)

# put feature importance into table
importances_df = pd.DataFrame(lgbm_checker.feature_importances_, columns=['Feature_Importance'],
                              index=X_train.columns)
importances_df.sort_values(by=['Feature_Importance'], ascending=False, inplace=True)
#print(importances_df)

# plot importance as bar chart
# take the labels after the final sort so each tick label matches its bar
importances_df = importances_df.sort_values(['Feature_Importance'])
importances=importances_df.index
y_pos = np.arange(len(importances_df))
plt.figure(figsize=(8,10))
plt.barh(y_pos,importances_df['Feature_Importance'])
plt.yticks(y_pos, importances,fontsize=10)
plt.xlabel('importance')
plt.title('Feature importance for Lgbm classifier')
plt.show()

## Feature importance - continuous features

In [None]:
# Feature importance lgbm
X=train[cont_features]
y=train['target']
# Split data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,train_size=0.8,test_size = 0.2,random_state = 0)

# instantiate and fit model
lgbm_checker = LGBMClassifier(learning_rate=0.05,
                      n_estimators=1000,
                      reg_lambda = 1)

lgbm_checker.fit(X_train, y_train)

# put feature importance into table
importances_df = pd.DataFrame(lgbm_checker.feature_importances_, columns=['Feature_Importance'],
                              index=X_train.columns)
importances_df.sort_values(by=['Feature_Importance'], ascending=False, inplace=True)
#print(importances_df)

# plot importance as bar chart
# take the labels after the final sort so each tick label matches its bar
importances_df = importances_df.sort_values(['Feature_Importance'])
importances=importances_df.index
y_pos = np.arange(len(importances_df))
plt.figure(figsize=(8,25))
plt.barh(y_pos,importances_df['Feature_Importance'])
plt.yticks(y_pos, importances,fontsize=10)
plt.xlabel('importance')
plt.title('Feature importance for Lgbm classifier')
plt.show()

## SHAP analysis

The goal of SHAP is to explain the prediction of an instance by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory. The feature values of a data instance act as players in a coalition. Shapley values tell us how to fairly distribute the “payout” among the features.

The idea behind SHAP feature importance is simple: Features with large absolute Shapley values are important. Since we want the global importance, we sum the absolute Shapley values per feature across the data:

In [None]:
%%time
#taken from https://www.kaggle.com/docxian/tabular-playground-10-first-glance-baseline/notebook#Model 

# extract list of features
features = train.columns.tolist()
features.remove('id')
features.remove('target')

# select predictors
predictors = features
print('Number of predictors: ', len(predictors))

# start H2O
h2o.init(max_mem_size='12G', nthreads=4) # define maximum memory usage and number of cores

# upload data in H2O environment
# let's start with a SUBSET of training data only to reduce RAM use!
n_sub = 50000
train_sub = train.sample(n=n_sub, random_state=42)
train_hex = h2o.H2OFrame(train_sub)

# force categorical target
train_hex['target'] = train_hex['target'].asfactor()

# fit Gradient Boosting model
n_cv = 5 # 5 folds

fit_GBM = H2OGradientBoostingEstimator(ntrees=250,
                                       max_depth=6,
                                       min_rows=10,
                                       learn_rate=0.1, # default: 0.1
                                       sample_rate=1,
                                       col_sample_rate=0.5,
                                       nfolds=n_cv,
                                       score_each_iteration=True,
                                       stopping_metric='auc',
                                       stopping_rounds=5,
                                       stopping_tolerance=0.0001*0.5,
                                       seed=999)
# train model
fit_GBM.train(x=predictors,
              y='target',
              training_frame=train_hex)

# variable importance
fit_GBM.varimp_plot()

# alternative variable importance using SHAP => see direction as well as severity of feature impact
fit_GBM.shap_summary_plot(train_hex);
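
To make the global importance idea described above concrete, here is a minimal sketch with a made-up SHAP matrix (rows are instances, columns are features). The feature names and values are hypothetical and do not come from the H2O model. The sketch averages the absolute values rather than summing them; the two differ only by a constant factor (the number of rows), so the ranking is the same.

```python
import numpy as np

# Hypothetical SHAP matrix: one row per instance, one column per feature
feature_names = ["f_a", "f_b", "f_c"]
shap_values = np.array([
    [ 0.20, -0.05,  0.00],
    [-0.30,  0.10,  0.01],
    [ 0.25, -0.15, -0.02],
    [-0.25,  0.10,  0.01],
])

# Global importance: mean absolute SHAP value per feature
global_importance = np.abs(shap_values).mean(axis=0)

# Rank features from most to least important
order = np.argsort(global_importance)[::-1]
for idx in order:
    print(feature_names[idx], round(float(global_importance[idx]), 3))
```

Taking absolute values first matters: the contributions of `f_a` are large but alternate in sign, so a plain mean would nearly cancel them out.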

## Correlation between feature f22 and target

Feature f22 produced some unusual results: it appears to have a high negative correlation with the target but a very low correlation with other features, and it scores poorly on feature importance using basic models. For this reason I've singled it out to compare its distribution with the target distribution and to produce a mosaic plot of its values against the target.

In [None]:
%%time
# feature 'f22' distribution vs target
# kdeplot replaces the deprecated distplot(kde=True, hist=False)
sns.kdeplot(x=train['f22'], color='blue', label='f22')
sns.kdeplot(x=train['target'], color='red', label='target')
plt.legend()


## Mosaic plot for f22 versus target values

In [None]:
# plot target vs binary features using mosaic plot
# taken from https://www.kaggle.com/docxian/tabular-playground-10-first-glance-baseline/notebook#Model
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

for f in ['f22']:
    plt.rcParams['figure.figsize'] = (6,4) # increase plot size for mosaics
    mosaic(train, [f, 'target'], title='Target vs ' + f)
    plt.show()

It looks like approximately 25-30% of rows have matching f22 and target values, either (0,0) or (1,1). Predicting the opposite of f22 would therefore be right about 70-75% of the time, which works really well if that is all you want. However, because there is little correlation between f22 and the other features, you shouldn't rely too heavily on this feature if you want to score above 70-75%.
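
The numbers behind a mosaic plot like this can be read off a normalised cross-tabulation. The sketch below uses a tiny made-up frame (not the competition values) to show how the cell shares and the agreement rate between a binary feature and the target can be computed.

```python
import pandas as pd

# Hypothetical toy data, not the competition values
toy = pd.DataFrame({
    "f22":    [0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
    "target": [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})

# Share of all rows falling in each (f22, target) cell
cell_share = pd.crosstab(toy["f22"], toy["target"], normalize="all")
print(cell_share)

# Fraction of rows where f22 equals the target, and the accuracy
# obtained by predicting the opposite of f22
agreement = (toy["f22"] == toy["target"]).mean()
inverse_accuracy = 1 - agreement
print(f"agreement: {agreement:.2f}, inverse accuracy: {inverse_accuracy:.2f}")
```

On the real data, replacing `toy` with `train` would give the exact shares that the mosaic plot only shows approximately.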

## Baseline lgbm submission

This notebook is mainly about exploring the data, but we'll also produce a basic model with the default parameters to get a baseline on how well the basic model performs without any parameter tuning or feature engineering.

In [None]:
%%time
X=train[cont_features]
y=train['target']
# Split data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,train_size=0.8,test_size = 0.2,random_state = 0)
print('data split')

# instantiate model
lgbmmodel = LGBMClassifier(learning_rate=0.05,
                      n_estimators=1000,
                      reg_lambda = 1)
# fit model
lgbmmodel.fit(X_train, y_train)
print('model fit')

# score model on predicted probabilities of the positive class
y_proba = lgbmmodel.predict_proba(X_valid)[:, 1]
lgbm_score=roc_auc_score(y_valid, y_proba)
print("Area under the curve score ; ", lgbm_score)

# evaluate model on hard class predictions
y_preds = lgbmmodel.predict(X_valid)
lgbscores=roc_auc_score(y_valid, y_preds)
print("Area under the curve score (binary) ; ",lgbscores)

lgbscore = score(X_train, y_train, lgbmmodel, cv=4)
display(lgbscore)
print('scoring completed')

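# create confusion matrix
# (ConfusionMatrixDisplay needs scikit-learn >= 1.0; plot_confusion_matrix was removed in 1.2)
metrics.ConfusionMatrixDisplay.from_estimator(lgbmmodel, X_valid, y_valid)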
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

# predict test
lgbm_preds = lgbmmodel.predict((test)[cont_features])
print('predictions complete')

# Save the predictions to a CSV file
lgbm_submission = pd.read_csv("../input/tabular-playground-series-oct-2021/sample_submission.csv")
lgbm_submission.target = lgbm_preds
lgbm_submission.to_csv("lgbm_pbaseline_submission.csv",index=False)
print('lgbm submission complete')   

# kdeplot replaces the deprecated distplot(kde=True, hist=False)
sns.kdeplot(x=lgbm_submission['target'], color='blue', label='prediction')
sns.kdeplot(x=train['target'], color='red', label='target')
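
ROC AUC is a ranking metric, so it generally makes more sense to compute it from predicted probabilities than from hard 0/1 labels. The sketch below uses made-up labels and scores (not output from the model above) to show how thresholding at 0.5 throws away ranking information and changes the score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.10, 0.20, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])

# Hard labels obtained by thresholding the probabilities at 0.5
y_label = (y_proba >= 0.5).astype(int)

# AUC from probabilities uses the full ranking of the scores
auc_proba = roc_auc_score(y_true, y_proba)

# AUC from hard labels only sees two distinct score values, so many pairs tie
auc_label = roc_auc_score(y_true, y_label)

print(f"AUC from probabilities: {auc_proba:.4f}")
print(f"AUC from hard labels:   {auc_label:.4f}")
```

The same reasoning applies to a submission file when the competition metric is AUC: the output of `predict_proba` carries more information than the output of `predict`.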

### Total memory usage

In [None]:
print(cpu_stats())

## Observation on Data

* The test dataset is approx 1/2 the size of the training dataset
* The training dataset is highly representative of the test dataset
* There is no missing data
* Approx 1/6th of features are binary features
* Approx 5/6th of features are continuous features
* There is relatively low correlation between features
* There appears to be a relatively high correlation between f22 and target value.
* The majority of categorical features have a negative correlation to target classification.
* Continuous features show both positive and negative correlations to target classification.
* Feature importance indicates that a number of both categorical and continuous features are important.
* Feature importance doesn't indicate f22 as an important feature.



## Next Steps

The next step is to build some basic models and see how they perform.
This is covered in my second notebook, where I compare results for 20 classification models. It can be found here: https://www.kaggle.com/davidcoxon/20-model-comparison-oct-tabular-playground

## Credit where credit's due

First up, thanks to the Kaggle team for their tireless work putting the Tabular Playground together. Many thanks to the Kaggle and Stack Overflow communities and to the folks who contribute to the documentation for the various Python modules, without whom finding solutions to these problems would be so much rougher.

Specific thanks this month go to:

https://www.kaggle.com/rahullalu for https://www.kaggle.com/rahullalu/tps-oct-2021-eda-and-baseline on correlation to target

https://www.kaggle.com/docxian for https://www.kaggle.com/docxian/tabular-playground-10-first-glance-baseline/notebook#Model on mosaic plots

https://www.kaggle.com/docxian for https://www.kaggle.com/docxian/tabular-playground-10-first-glance-baseline/notebook#Model on SHAP

https://www.kaggle.com/pourchot for comments on notebook.

If you found my notebook useful or you have comments, please upvote / comment here. If you found any of the code from other sources useful (or used it in your own projects), please take the time to upvote their notebooks.