## About Tabular Playground Series - Oct 2021

The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN.

The dataset deals with predicting the biological response of molecules given various chemical properties. Although the features are anonymized, they have properties relating to real-world features.

## Previous notebooks

My first notebook on this competition explored the data in detail. At the time of writing that notebook has had 38 upvotes and 15 comments, so it might be worth a look before moving on to this notebook if you have not yet explored the data fully.
https://www.kaggle.com/davidcoxon/first-look-at-october-data

The second notebook concentrated on feature engineering and evaluated a number of basic models, creating a baseline for parameter tuning. https://www.kaggle.com/davidcoxon/20-model-comparison-oct-tabular-playground

## About this notebook

This notebook is currently a work in progress and will be regularly updated. It will concentrate on evaluating the performance of several hyperparameter-tuned models.

This is a beginner-level notebook meant mainly for my own use. Some of the code is taken from other public notebooks; sources will be credited at the bottom of the notebook.

## First thoughts on this month's project

* This month's Tabular Playground dataset is once again quite large, so managing both CPU usage and RAM is going to be an important element of the project.
* It looks like another classification problem.
* There is no missing data, so imputing values will not be required.
* There are both categorical and continuous features. The categorical data is all binary and some of the continuous data appears to be category-like. It may be possible to reduce memory requirements by redefining data types without losing any meaningful information.
* Data engineering and feature importance may be important.
* It's likely that model selection and hyperparameter tuning will be important.
* Stacking, blending and ensembles are likely to be important to get higher scores.

## Exploring the data

You can find a complete exploration of the data in this notebook: https://www.kaggle.com/davidcoxon/first-look-at-october-data/notebook

The summary of the data exploration is:

* The test dataset is approx 1/2 the size of the training dataset
* The training dataset is highly representative of the test dataset
* There is no missing data
* Approx 1/6th of features are binary features
* Approx 5/6th of features are continuous features
* There is relatively low correlation between features
* There appears to be a relatively high correlation between f22 and target value.
* The majority of categorical features have a negative correlation to target classification.
* Continuous features show both positive and negative correlations to target classification.
* Feature importance indicates that there are a number of both categorical and continuous features of importance.
* Feature importance doesn't indicate f22 as an important feature.

## Model performance

* The ROC AUC scores fell into 2 groups clustered around scores of 76 and 50.
* The boosted classifiers generally produced the best results, with CatBoost coming top by a small margin. Ridge, Linear Discriminant Analysis and Random Forest were also among the higher scoring models.
* Low scoring models included Gaussian naive Bayes, k-nearest neighbours, linear SVC, logistic regression, Stochastic Gradient Descent and passive aggressive models.

* If we take just the categorical features and run the same models we get a range of ROC_AUC scores between 61.93 and 75.77; in fact 12 of the models produce a score of 75.77.

* If we take just the continuous features and run the models we get a range of ROC_AUC scores between 49.5 and 63.4. This means that, despite 85% of the features being continuous and only 15% being binary categorical features, the categorical features carry most of the predictive signal.
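
A score of around 50 is what a model with no ranking ability achieves, which is why the weaker models cluster there. The sketch below, using made-up labels and scores rather than competition data, shows how ROC AUC depends on the ranking of the predictions and how much is lost when hard class labels are scored instead of probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and model scores (made-up values for illustration only)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

# AUC from the continuous scores uses the full ranking of the rows
auc_from_scores = roc_auc_score(y_true, y_score)

# AUC from thresholded labels throws away the ranking within each class
y_label = (y_score >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, y_label)

# A constant prediction carries no ranking information at all
auc_constant = roc_auc_score(y_true, np.full(len(y_true), 0.5))

print(auc_from_scores, auc_from_labels, auc_constant)
```

On this toy data the probabilities score higher than the thresholded labels, and the constant prediction lands on 0.5.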

## Set up environment

In [None]:
%%time

import os, psutil
import gc

import numpy as np 
import pandas as pd 
from statsmodels.graphics.mosaicplot import mosaic
from scipy.stats import randint

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn import ensemble, linear_model,metrics,model_selection,neighbors,preprocessing, svm, tree
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier,BaggingClassifier,ExtraTreesClassifier,GradientBoostingClassifier,RandomForestClassifier,StackingClassifier,VotingClassifier)
from sklearn.feature_selection import mutual_info_regression, SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Perceptron, SGDClassifier, PassiveAggressiveClassifier, RidgeClassifierCV
from sklearn.metrics import classification_report, accuracy_score, log_loss, roc_auc_score, mean_squared_error
from sklearn.model_selection import cross_validate,cross_val_score, GridSearchCV,KFold,train_test_split,StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder, OneHotEncoder,MinMaxScaler,StandardScaler, RobustScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

import lightgbm as lgb

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
# imported under a separate alias so it does not shadow the plain lightgbm import above
from optuna.integration import lightgbm as optuna_lgb
from xgboost import XGBClassifier

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import (HistGradientBoostingClassifier)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Create functions

In [None]:
%%time
# taken from https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground

def cpu_stats():
    pid = os.getpid()
    py = psutil.Process(pid)
    memory_use = py.memory_info()[0] / 2. ** 30
    return 'memory GB:' + str(np.round(memory_use, 2))

def score(X, y, model, cv):
    # cross-validate on the data passed in rather than the global X_train / y_train
    scoring = ["roc_auc"]
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    scores = pd.DataFrame(scores).T
    return scores.assign(
        mean = lambda x: x.mean(axis=1),
        std = lambda x: x.std(axis=1),
    )

## from: https://www.kaggle.com/bextuychiev/how-to-work-w-million-row-datasets-like-a-pro
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

def SelectKBestFeatures(features, target, threshold):
    kbest = SelectKBest(score_func = f_classif, k = len(features.columns))
    X = kbest.fit_transform(features, target.values.ravel())
    print('Before the SelectKBest =',features.shape)
    
    selected_features = []
    
    for i in range(len(features.columns)):
        if kbest.pvalues_[i]<=threshold:
            selected_features.append(features.columns[i])
            
    X_selected =  pd.DataFrame(X)
    X_selected.columns = features.columns
    X_selected = X_selected[selected_features]
    
    print('After the SelectKBest = ', X_selected.shape)
    
    return X_selected, selected_features

print('Function built')

# Create an empty dataframe for performance results
AdvancedModelPerformanced_df = pd.DataFrame(columns=['Score'])
# Create an empty list for outputs
Outputs=[]

print('Dataframe created')

In [None]:
print(cpu_stats())

## Get Data

In [None]:
%%time
# Get data
train=pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
test=pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
test_id=test.id
print("Data imported")

train = reduce_memory_usage(train, verbose=True)
test = reduce_memory_usage(test, verbose=True)
print(cpu_stats())
print('Memory reduced')

plt.pie([len(train), len(test)], 
        labels=['train', 'test'],
        colors=['skyblue', 'blue'],
        textprops={'fontsize': 13},
        autopct='%1.1f%%')
plt.show()

# Merge training and test
combine=[train,test]
combined=pd.concat(combine)
combined = reduce_memory_usage(combined, verbose=True)

In [None]:
print(cpu_stats())

## Add statistical features

We can add the standard deviation, variance, absolute sum, minimum and maximum of the continuous features in each row as additional features.

In [None]:
# create lists
features=[]
cat_features=[]
cont_features=[]

# get initial features
for item in combined.columns:
    features.append(item)    
features.remove('target')
for feature in features:
    if combined.dtypes[feature]=='float16':
        cont_features.append(feature)

# add std field (the column is named 'std' in lower case)
if "std" in combined.columns:
    print('Std training feature exists')
else:
    combined['std'] = combined[cont_features].std(axis=1)
    print('Std training feature added')
    
# add abs_sum field
if "abs_sum" in combined.columns:
    print('Abs_sum training feature exists')
else:
    combined['abs_sum'] = combined[cont_features].abs().sum(axis=1)
    print('Abs_sum training feature added')
        
# add var field
if "var" in combined.columns:
    print('var training feature exists')
else:
    combined['var'] = combined[cont_features].var(axis=1)
    print('var training feature added')   
    
# add min field
if "min" in combined.columns:
    print('min training feature exists')
else:
    combined['min'] = combined[cont_features].min(axis=1)
    print('min training feature added') 
    
# add max field
if "max" in combined.columns:
    print('max training feature exists')
else:
    combined['max'] = combined[cont_features].max(axis=1)
    print('max training feature added') 
    
# Add features to lists
new_features=["std","abs_sum","var","min","max"]
for item in new_features:
    features.append(item) 
    cat_features.append(item)

In [None]:
print(cpu_stats())

## Get features

In [None]:
# create lists
cat_features=[]
cont_features=[]

# get features
for feature in features:
    if combined.dtypes[feature]=='int8':
        cat_features.append(feature)
    if combined.dtypes[feature]=='float16':
        cont_features.append(feature)
    #print(test.dtypes[feature])
print('features obtained')

plt.pie([len(cat_features), len(cont_features)], 
        labels=['Categorical', 'Continuous'],
        colors=['skyblue', 'blue'],
        textprops={'fontsize': 13},
        autopct='%1.1f%%')
plt.show()

In [None]:
print(cpu_stats())

## Normalize data

RobustScaler - Scale features using statistics that are robust to outliers. This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

MinMaxScaler - Scale features to the range between the lowest and highest values, making the lowest 0 and the highest 1. Use this option if you are going to convert category-like continuous features later on.
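
To make the difference concrete, here is a small sketch on a single made-up feature with one large outlier. The values are purely illustrative and are not taken from the competition data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One toy feature with a single large outlier (values are made up)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler: subtract the median (3) and divide by the IQR (4 - 2 = 2)
robust = RobustScaler().fit_transform(x)

# MinMaxScaler: map the minimum to 0 and the maximum to 1
minmax = MinMaxScaler().fit_transform(x)

print(robust.ravel())
print(minmax.ravel())
```

With RobustScaler the four ordinary values keep a usable spread and only the outlier ends up far away, whereas MinMaxScaler squeezes the ordinary values into a narrow band near 0 because the outlier defines the top of the range.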

In [None]:
%%time

#select scaler (remark out the options you don't want to use)
#scale = StandardScaler()
scale = RobustScaler()
#scale = MinMaxScaler()

combined[cont_features]=scale.fit_transform(combined[cont_features]) 
print('Data scaled using : ', scale)

In [None]:
print(cpu_stats())

## Prepare data for modelling

In this notebook we are really only exploring how different models might perform, to identify some models that we'll go on to look at in more detail later. Splitting out only 10% of the data for training and 5% for validation allows the full notebook to run in about an hour. The cell below is set to 80% / 20%, which tests individual models more robustly, but it's a relatively large dataset and the session will most likely time out if you try to run all cells with 80% of the data.
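
As a minimal sketch of the reduced 10% / 5% split, `train_test_split` accepts fractions for both `train_size` and `test_size` and simply leaves the remaining rows unused. The random data and the `_demo` names below are stand-ins for illustration, not the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data (assumption: 1000 rows, 5 features, binary target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# Keep 10% for training and 5% for validation; the rest is left out
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    train_size=0.10, test_size=0.05,
    stratify=y_demo, random_state=0,
)
unused = len(X_demo) - len(X_tr) - len(X_va)
print(len(X_tr), len(X_va), unused)
```

Passing `stratify` keeps the class balance roughly the same in both parts, which matters more as the sample gets smaller.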

In [None]:
%%time
# split training and test
try:
    train = combined.iloc[:1000000,:]
    print(train.shape)
    test=combined.iloc[1000000:,:]
    print(test.shape)
    # delete dataframe
    del combined
    print('test/train datasets separated')
except:
    print('test/train datasets already separated')

#train = reduce_memory_usage(train, verbose=True)
#test = reduce_memory_usage(test, verbose=True)

try:
    y = train.target
    train.drop(['target'], axis=1, inplace=True)
    test.drop(['target'], axis=1, inplace=True)
except:
    print('target already separated')
X=train#[cat_features] #slice training here to explore feature reduction
print('Target data separated')
# Create data sets for training (80%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(X, y,train_size=0.8,test_size = 0.2,random_state = 0)
print('Model data split')

plt.pie([len(X_train), len(X_val),len(X)-((len(X_train)+len(X_val)))], 
        labels=['X-train', 'X-val','unused data'],
        colors=['skyblue', 'blue','cornflowerblue'],
        textprops={'fontsize': 13},
        autopct='%1.1f%%')
plt.show()

In [None]:
print(cpu_stats())

## Target distribution

We can chart the distribution of the target values in the training dataset. On the assumption that the test dataset will have a similar distribution, this gives us an idea of what the distribution might look like for our predicted results.

In [None]:
sns.distplot(y, kde=True, hist=False)

## Observations on feature engineering and data processing

* The test and training data were combined before being normalised and were then separated, to ensure that the values were not shifted relative to each other.

* Normalising the data with RobustScaler seemed to produce the best result.

* Initial tests showed that removing some features yielded better results, but doing this with kbestfeatures proved more effective than doing it manually.

* Converting some numeric category-like features to a number of binary features produced better results but took a lot of time / memory, and the difference in results was very small.

* The amount of time / processing taken to produce some models was so great that memory reduction was needed, converting some features into data types that took less memory without losing any meaningful data.

* Good housekeeping, i.e. deleting temporary data that was no longer required as you went, was also necessary to ensure that the notebook did not run out of resources before completing.

* When trying to run several models in the same notebook for comparison it sometimes became necessary to reduce the size of the data used in the train|test split in order to minimize resource use; a 60|15 split produced similar results to 80|20.

* Adding features for the statistical values std, var, abs sum, min and max for each row seems to improve performance.
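
The memory reduction mentioned above can be sketched on a small made-up frame. The `flag` and `value` columns are hypothetical, and `float32` is used here as an assumption for illustration; the `reduce_memory_usage` function in this notebook goes further and allows `float16`, which saves more memory at the cost of precision.

```python
import numpy as np
import pandas as pd

# Made-up frame: one binary column and one continuous column, both 64-bit
demo = pd.DataFrame({
    "flag": np.array([0, 1, 1, 0] * 250, dtype=np.int64),
    "value": np.linspace(0.0, 1.0, 1000),
})
before = demo.memory_usage(index=False).sum()

small = demo.copy()
# Binary values fit in the smallest integer type
small["flag"] = pd.to_numeric(small["flag"], downcast="integer")
# Halve the width of the continuous column
small["value"] = small["value"].astype(np.float32)
after = small.memory_usage(index=False).sum()

# Largest change introduced by the narrower float type
max_error = (small["value"] - demo["value"]).abs().max()
print(before, after, max_error)
```

Checking the largest error after the cast is a cheap way to confirm that no meaningful information was lost.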

## Observations on memory use

* Loading all of the libraries and functions required in the entire notebook took up 0.3 GB.

* Loading the data increased memory usage to 5 GB, which came down to 2.5 GB after the data was compressed.

* Producing feature correlations added 0.65 GB to memory usage.

* Normalising the data did not affect overall memory usage.

* Splitting the data added 0.5 GB to memory usage.

* LgbModel added 0.15 GB to memory usage.

# Optimized Models

The best performing models in the previous notebook https://www.kaggle.com/davidcoxon/20-model-comparison-oct-tabular-playground#Model-Comparison were all boosting models. This notebook will take the 3 main boosting models and build on the default parameters. As well as just running each model with enhanced parameters we will run them with some form of cross fold validation to further improve performance.
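
The cross-fold pattern used later in this notebook can be sketched as follows. This is a simplified illustration on synthetic data with a `LogisticRegression` stand-in rather than the boosting models; `X_demo`, `X_new` and the other names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 500 labelled rows and 100 "unseen" rows
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
X_demo, y_demo = X_all[:500], y_all[:500]
X_new = X_all[500:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(X_demo))        # out-of-fold predictions on the labelled rows
new_preds = np.zeros(len(X_new))   # fold-averaged predictions on the unseen rows
fold_auc = []

for tr_idx, va_idx in skf.split(X_demo, y_demo):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[tr_idx], y_demo[tr_idx])
    # each labelled row is predicted by a model that never saw it
    oof[va_idx] = model.predict_proba(X_demo[va_idx])[:, 1]
    # each fold contributes an equal share of the final prediction
    new_preds += model.predict_proba(X_new)[:, 1] / skf.n_splits
    fold_auc.append(roc_auc_score(y_demo[va_idx], oof[va_idx]))

print(np.mean(fold_auc))
```

Keeping the out-of-fold predictions is optional for scoring, but they are what a later blending or stacking step would need as training features.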

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
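
The idea of fitting each new model to the residuals of the previous ones can be sketched by hand for squared-error regression. This is a toy illustration, not how CatBoost, LightGBM or XGBoost are implemented; the data, the tree depth and the learning rate are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
losses = [np.mean((y_demo - prediction) ** 2)]
trees = []

for _ in range(50):
    # For squared error the negative gradient is simply the residual
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a damped version of the new tree to the running prediction
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    losses.append(np.mean((y_demo - prediction) ** 2))

print(losses[0], losses[-1])
```

Each round takes a small step in the direction that reduces the training loss, which is the gradient descent view described above.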

## Catboost
CatBoost is a classifier that uses gradient boosting on decision trees.

We can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and combinations of categorical and numerical features.
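
As a rough illustration of the kind of statistic involved, the sketch below computes a simplified ordered target statistic with pandas: each row is encoded using only the targets of earlier rows in the same category. It is not CatBoost's actual implementation. The `colour` column, the toy data and the prior of 0.5 are assumptions made for illustration, and the library itself works over random permutations and feature combinations.

```python
import pandas as pd

# Made-up categorical feature and binary target
demo = pd.DataFrame({
    "colour": ["red", "blue", "red", "red", "blue", "red"],
    "target": [1, 0, 0, 1, 1, 1],
})

prior = 0.5  # assumed prior, used when a category has no history yet
grp = demo.groupby("colour")["target"]

# Number of earlier rows with the same category
count_before = grp.cumcount()
# Sum of the target over those earlier rows (cumulative sum minus the current row)
sum_before = grp.cumsum() - demo["target"]

# Smoothed mean of the earlier targets; the current row's own target is never used
demo["colour_ts"] = (sum_before + prior) / (count_before + 1)
print(demo)
```

Because a row never sees its own target, the encoding avoids the most direct form of target leakage.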

It reduces the need for extensive hyperparameter tuning and lowers the chance of overfitting, which leads to more generalized models.

In [None]:
%%time
# tuned catboost model
modelname ="tuned_CatBoost"
y_train=y_train.astype(float)
# set parameters
cat_params = {
    'iterations': 15585, 
    'objective': 'CrossEntropy', 
    'bootstrap_type': 'Bernoulli',
    'learning_rate': 0.023575206684596582, 
    'reg_lambda': 36.30433203563295, 
    'random_strength': 43.75597655616195, 
    'depth': 8, 
    'min_data_in_leaf': 11, 
    'leaf_estimation_iterations': 1, 
    'subsample': 0.8227911142845009,
    'eval_metric' : 'AUC',
    'verbose' : 1000,
    'early_stopping_rounds' : 500,
}
# instantiate model
cat = CatBoostClassifier(**cat_params)
# fit model; early_stopping_rounds only takes effect when an eval_set is supplied
cat.fit(X_train, y_train, eval_set=(X_val, y_val))

# evaluate performance
y_pred = cat.predict(X_val)
y_pred_proba = cat.predict_proba(X_val)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_val,  y_pred_proba)
auc = metrics.roc_auc_score(y_val, y_pred_proba)
print(auc)
metrics.plot_confusion_matrix(cat, X_val, y_val)
plt.title('Confusion matrix for catboost model')
plt.grid(False)
plt.show()

plt.plot(fpr,tpr,label="Catboost, auc="+str(auc))
plt.legend(loc=4)
plt.show()

# Use the model to generate predictions (probabilities, since the models are scored on ROC AUC)
predictions2 = cat.predict_proba(test)[:, 1]

# Save the predictions to a CSV file
output1 = pd.DataFrame({'Id': test_id,
                       'target': predictions2})
output1.to_csv('tuned_cat_submission.csv', index=False)
print('tuned cat submission completed')

# plot distribution of target
sns.distplot(output1['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output1')
    
# record the score; .loc adds the row if it does not exist yet
AdvancedModelPerformanced_df.loc[modelname,'Score']=auc

# tidy up
del auc,predictions2 

In [None]:
print(cpu_stats())

## Light Gradient Boosting

In [None]:
%%time
## lgb
modelname ="Light_Gradient_Boosting"
# set parameters

lgb_params = {'reg_alpha': 8.158768860412389,
        'reg_lambda': 8.793022151019823,
        'colsample_bytree': 0.2,
        'subsample': 0.4,
        'learning_rate': 0.02,
        'max_depth': 100,
        'num_leaves': 12,
        'min_child_samples': 68,
        'cat_smooth': 91,
        'objective': 'binary',  
        'random_state': 48,
        'n_estimators': 20000,
        'n_jobs': -1}

lgbmmodel = LGBMClassifier(**lgb_params)
lgbmmodel.fit(X_train, y_train)
print('model fit')
y_pred = lgbmmodel.predict(X_val)
acc_lgbmmodel = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_lgbmmodel)

# create confusion matrix
metrics.plot_confusion_matrix(lgbmmodel, X_val, y_val)
plt.title('Confusion matrix for light gradient boosting')
plt.grid(False)
plt.show()

# Use the model to generate predictions (probabilities, since the models are scored on ROC AUC)
lgbpredictions = lgbmmodel.predict_proba(test)[:, 1]

# Save the predictions to a CSV file
output2 = pd.DataFrame({'Id': test_id,
                       'target': lgbpredictions})
output2.to_csv('tuned_lgbmodel_submission.csv', index=False)
print('tuned lgbmodel submission completed')

# plot distribution of target
sns.distplot(output2['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output2')

# record ROC AUC from predicted probabilities so the score is comparable with the other models
AdvancedModelPerformanced_df.loc[modelname,'Score']=roc_auc_score(y_val, lgbmmodel.predict_proba(X_val)[:, 1])
    
# tidy up
del acc_lgbmmodel 

In [None]:
print(cpu_stats())

## xgboost


XGBoost (or Extreme Gradient Boosting) is an implementation of gradient boosted decision trees designed for speed and performance.

In [None]:
%%time
# Taken from https://www.kaggle.com/pallavisinha12/october-playground-series

# tuned xgboost
modelname ="tuned_XGBoost"
# set parameters
xgb_params = {
    "subsample": 0.65,
    "colsample_bytree": 0.4,
    "max_depth": 7,
    "learning_rate": 0.01,
    "objective": "binary:logistic",
    'eval_metric': 'auc',
    "nthread": -1,
    "max_bin": 192, 
    'min_child_weight': 2,
    'reg_lambda': 0.003,
    'reg_alpha': 0.02, 
    'seed' : 42,
    }

# instantiate model
xgb = XGBClassifier(**xgb_params)
xgb.fit(X_train, y_train)
print('model fit')
y_pred = xgb.predict(X_val)
acc_xgb = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_xgb)

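# use predicted probabilities, not hard labels, for the ROC curve and AUC
y_pred_proba = xgb.predict_proba(X_val)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_val, y_pred_proba)
auc = metrics.roc_auc_score(y_val, y_pred_proba)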
print(auc)
plt.plot(fpr,tpr,label="xgb, auc="+str(auc))
plt.legend(loc=4)
plt.show()

# create confusion matrix
metrics.plot_confusion_matrix(xgb, X_val, y_val)
plt.title('Confusion matrix for xgb model')
plt.grid(False)
plt.show()

# Use the model to generate predictions (probabilities, since the models are scored on ROC AUC)
xgbpredictions = xgb.predict_proba(test)[:, 1]

# Save the predictions to a CSV file
output3 = pd.DataFrame({'Id': test_id,
                       'target': xgbpredictions})
output3.to_csv('tuned_xgb_submission.csv', index=False)
print('tuned xgb submission completed')

# plot distribution of target
sns.distplot(output3['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output3')

# Add to comparison file; .loc adds the row if it does not exist yet
AdvancedModelPerformanced_df.loc[modelname,'Score']=auc
    
# tidy up
del acc_xgb 

In [None]:
print(cpu_stats())

## catboost with kbestfeatures

In [None]:
#%%time
# based on https://www.kaggle.com/pallavisinha12/october-playground-series

# kbestfeatures
modelname ="catboost_kbest"
p_feature = 0.0001
# NOTE: the selected columns are returned here, but the folds below still train on the full X
train_numerical, selected_numerical = SelectKBestFeatures(train[cont_features], y, p_feature)
train_categorical, selected_categorical = SelectKBestFeatures(train[cat_features], y, p_feature)

catboost_params = {
    'iterations': 15585, 
    'objective': 'CrossEntropy', 
    'bootstrap_type': 'Bernoulli',
    'learning_rate': 0.023575206684596582, 
    'reg_lambda': 36.30433203563295, 
    'random_strength': 43.75597655616195, 
    'depth': 8, 
    'min_data_in_leaf': 11, 
    'leaf_estimation_iterations': 1, 
    'subsample': 0.8227911142845009,
    'eval_metric' : 'AUC',
    'verbose' : 1000,
    'early_stopping_rounds' : 500,
}

preds = np.zeros((test.astype('float32')).shape[0])

kf = StratifiedKFold(n_splits = 5, random_state=42,shuffle=True)

auc = []
n = 0

try:
    X.drop(['kfold'], axis=1, inplace=True) # remove columns
    X.drop('target', axis=1, inplace=True) # remove columns
except:
    a="do nothing"# do nothing

for train_idx, test_idx in kf.split(X,y):
    X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]
    cat_model = CatBoostClassifier(**catboost_params)
    y_train=y_train.astype('float32')
    X_train=X_train.astype('float32')
    y_val=y_val.astype('float32')
    X_val=X_val.astype('float32')
    test=test.astype('float32')
    cat_model.fit(X_train, y_train, eval_set = [(X_val,y_val)],early_stopping_rounds = 100,verbose=False)
    preds += cat_model.predict_proba(test)[:,1]/kf.n_splits
    auc.append(roc_auc_score(y_val, cat_model.predict_proba(X_val)[:, 1]))
    gc.collect()
    print(f"fold: {n+1}, auc: {auc[n]}")
    n+=1  
    
print(np.mean(auc))

# Save the predictions to a CSV file
output4 = pd.DataFrame({'Id': test_id,
                       'target': preds})
output4.to_csv('tuned_catboost_kbest_submission.csv', index=False)
print('tuned catboost kbest submission completed')

# plot distribution of target
sns.distplot(output4['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output4')

# record the mean fold score; .loc adds the row if it does not exist yet
AdvancedModelPerformanced_df.loc[modelname,'Score']=np.mean(auc)
    
# tidy up
del auc 

In [None]:
print(cpu_stats())

## Light Gradient Boosting with kbestfeatures

In [None]:
#%%time
# kbestfeatures
modelname ="Light_Gradient_Boosting_kbest"
p_feature = 0.001
# NOTE: the selected columns are returned here, but the folds below still train on the full X
train_numerical, selected_numerical = SelectKBestFeatures(train[cont_features], y, p_feature)
train_categorical, selected_categorical = SelectKBestFeatures(train[cat_features], y, p_feature)

#set parameters
params={'reg_alpha': 8.158768860412389,
        'reg_lambda': 8.793022151019823,
        'colsample_bytree': 0.2,
        'subsample': 0.4,
        'learning_rate': 0.02,
        'max_depth': 100,
        'num_leaves': 12,
        'min_child_samples': 68,
        'cat_smooth': 91,
        'objective': 'binary',  
        'random_state': 48,
        'n_estimators': 20000,
        'n_jobs': -1}

#create preds
preds = np.zeros(test.shape[0])

kf = StratifiedKFold(n_splits = 5, random_state=42,shuffle=True)
auc = []
n = 0

try:
    X.drop(['kfold'], axis=1, inplace=True) # remove columns
    X.drop('target', axis=1, inplace=True) # remove columns
except:
    a="do nothing"# do nothing

for train_idx, test_idx in kf.split(X,y):
    X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]
    lgbm_model = LGBMClassifier(**params)
    # verbose must be the boolean False; the string "False" is truthy and keeps logging on
    lgbm_model.fit(X_train, y_train, eval_set = [(X_val,y_val)], early_stopping_rounds = 100, eval_metric = "auc", verbose = False)
    preds += lgbm_model.predict_proba(test)[:,1]/kf.n_splits
    auc.append(roc_auc_score(y_val, lgbm_model.predict_proba(X_val)[:, 1]))
    gc.collect()
    print(f"fold: {n+1}, auc: {auc[n]}")
    n+=1  
    
print(np.mean(auc))

# Save the predictions to a CSV file
output5 = pd.DataFrame({'Id': test_id,
                       'target': preds})
output5.to_csv('tuned_lgbm_kbest_submission.csv', index=False)
print('tuned LGBM kbest submission completed')

# plot distribution of target
sns.distplot(output5['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output5')

# record the mean fold score; .loc adds the row if it does not exist yet
AdvancedModelPerformanced_df.loc[modelname,'Score']=np.mean(auc)
    
# tidy up
del auc 

In [None]:
print(cpu_stats())

## xgboost with kbestfeatures

In [None]:
#%%time
# based on https://www.kaggle.com/pallavisinha12/october-playground-series

# kbestfeatures
modelname ="xgBoosting_kbest"
p_feature = 0.0001
# NOTE: the selected columns are returned here, but the folds below still train on the full X
train_numerical, selected_numerical = SelectKBestFeatures(train[cont_features], y, p_feature)
train_categorical, selected_categorical = SelectKBestFeatures(train[cat_features], y, p_feature)

xgb_params = {
    "subsample": 0.65,
    "colsample_bytree": 0.4,
    "max_depth": 7,
    "learning_rate": 0.01,
    "objective": "binary:logistic",
    'eval_metric': 'auc',
    "nthread": -1,
    "max_bin": 192, 
    'min_child_weight': 2,
    'reg_lambda': 0.003,
    'reg_alpha': 0.02, 
    'seed' : 42,
    }

preds = np.zeros((test.astype('float32')).shape[0])

kf = StratifiedKFold(n_splits = 5, random_state=42,shuffle=True)

auc = []
n = 0

try:
    X.drop(['kfold'], axis=1, inplace=True) # remove columns
    X.drop('target', axis=1, inplace=True) # remove columns
except:
    a="do nothing"# do nothing

for train_idx, test_idx in kf.split(X,y):
    X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]
    xgb_model = XGBClassifier(**xgb_params)
    xgb_model.fit(X_train, y_train, eval_set = [(X_val,y_val)],early_stopping_rounds = 100,verbose=0)
    test=test.astype('float32')
    y_val=y_val.astype('float32')
    X_val=X_val.astype('float32')
    preds += xgb_model.predict_proba(test)[:,1]/kf.n_splits
    auc.append(roc_auc_score(y_val, xgb_model.predict_proba(X_val)[:, 1]))
    gc.collect()
    print(f"fold: {n+1}, auc: {auc[n]}")
    n+=1  
    
print(np.mean(auc))

# Save the predictions to a CSV file
output6 = pd.DataFrame({'Id': test_id,
                       'target': preds})
output6.to_csv('tuned_xgboost_kbest_submission.csv', index=False)
print('tuned xgboost kbest submission completed')

# plot distribution of target
sns.distplot(output6['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output6')

# record the mean fold score; .loc adds the row if it does not exist yet
AdvancedModelPerformanced_df.loc[modelname,'Score']=np.mean(auc)
    
# tidy up
del auc 

In [None]:
print(cpu_stats())

## Power averaging

Power averaging raises each model's predictions to a power before averaging them (the code below squares them). Averaging several models can reduce problems associated with overfitting, but the flip side is that if some of the better performing models are finding genuine patterns that the others aren't, averaging them with several poorer models will reduce the score.
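
A minimal numeric sketch of the idea, using made-up probabilities from three hypothetical models and an assumed power of 2:

```python
import numpy as np

# Made-up predicted probabilities from three hypothetical models
p_cat = np.array([0.90, 0.20, 0.60, 0.40])
p_lgbm = np.array([0.80, 0.30, 0.50, 0.45])
p_xgb = np.array([0.70, 0.10, 0.70, 0.35])

power = 2  # assumed exponent for this illustration

# Plain average of the three models
simple_avg = (p_cat + p_lgbm + p_xgb) / 3

# Power average: raise each prediction to the power first, then average
power_avg = (p_cat ** power + p_lgbm ** power + p_xgb ** power) / 3

print(simple_avg)
print(power_avg)
```

Squaring pulls every value in the 0 to 1 range downwards, but low values shrink proportionally more than high ones, so confident predictions stand out more. The result is no longer a calibrated probability, which does not matter for a ranking metric such as ROC AUC.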

In [None]:
# based on https://www.kaggle.com/vamsikrishnab/exploring-submissions-and-power-averaging

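# plot models, one curve per model
hist_data = [output4.target, output5.target, output6.target]
group_labels = ['Catboost', 'lgbm', 'xgboost']
for data, label in zip(hist_data, group_labels):
    sns.distplot(data, kde=True, hist=False, label=label)
plt.legend()
plt.show()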

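output7 = output4.copy()
# square each model's predicted probabilities (target column only), then average them
output7['target'] = (output4['target']**2 + output5['target']**2 + output6['target']**2)/3
output7.to_csv('poweraverage.csv', index=False)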

# plot distribution of target
sns.distplot(output4['target'], kde=True, hist=False)
sns.distplot(output5['target'], kde=True, hist=False)
sns.distplot(output6['target'], kde=True, hist=False)
sns.distplot(output7['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output7')

In [None]:
print(cpu_stats())

## Blending models

In [None]:
modelname ="blending"

# create test and training df
# Assumption: lgbm_model and xgb_model from the k-best cells above are still in memory.
# They are only the last-fold models, so their predictions on X are partly in-sample;
# out-of-fold predictions saved inside each CV loop would avoid that leakage.
useful_features = ["pred_1", "pred_2"]
train_df = pd.DataFrame({
    "pred_1": lgbm_model.predict_proba(X)[:, 1],
    "pred_2": xgb_model.predict_proba(X)[:, 1],
    "target": y.values,
})
# the test features are the fold-averaged predictions saved earlier
test_df = pd.DataFrame({
    "pred_1": output5['target'].values,
    "pred_2": output6['target'].values,
})

NFOLDS = 5
SEED = 42

final_predictions = []
scores = []

kfold = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)

for fold, (train_idx, valid_idx) in enumerate(kfold.split(train_df)):
    xtrain =  train_df.iloc[train_idx].reset_index(drop=True)
    xvalid = train_df.iloc[valid_idx].reset_index(drop=True)
    
    xtest = test_df[useful_features]
    
    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]

    model = linear_model.LinearRegression()
    model.fit(xtrain, ytrain)
    
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_predictions.append(test_preds)
    rmse = mean_squared_error(yvalid, preds_valid) ** 0.5
    print(fold, rmse)
    scores.append(rmse)

print(np.mean(scores), np.std(scores))

# average the test predictions from each fold for the blended output
output12 = pd.DataFrame({'Id': test_id,
                       'target': np.mean(np.column_stack(final_predictions), axis=1)})

# plot distribution of target
sns.distplot(output12['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output12')

In [None]:
print(cpu_stats())

## Stacking model

In [None]:
modelname ="stacking"

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# define first layer estimators
estimators = [
    ("catboost", CatBoostClassifier(n_estimators=10, random_state=42)),
    ("lgbm", LGBMClassifier(random_state=42)),
    ("xgboost", XGBClassifier(random_state=42)),
]

# create model 
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# fit model
clf.fit(X_train, y_train)
print('model fit')

# evaluate model: accuracy uses hard labels, AUC uses predicted probabilities
y_pred = clf.predict(X_test)
acc_clf = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_clf)

y_proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_proba)
auc = metrics.roc_auc_score(y_test, y_proba)
print(auc)

# plot results
plt.plot(fpr,tpr,label="stacking, auc="+str(auc))
plt.legend(loc=4)
plt.show()

# create confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.title('Confusion matrix for stacking model')
plt.grid(False)
plt.show()

# Use the model to generate predicted probabilities (the competition metric is AUC)
clfpredictions = clf.predict_proba(test)[:, 1]

# Save the predictions to a CSV file
output10 = pd.DataFrame({'Id': test_id,
                       'target': clfpredictions})
output10.to_csv('Stacking_submission.csv', index=False)
print('Stacking submission completed')

# plot distribution of target
sns.distplot(output10['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output10')

# record the accuracy; .loc adds the row if modelname is not in the index yet
AdvancedModelPerformanced_df.loc[modelname, 'Score'] = acc_clf

In [None]:
print(cpu_stats())

## Ensemble model

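An ensemble (voting) model trains several first-layer models and combines their predictions. With soft voting, as used here, the predicted class probabilities are averaged across the models and the class with the highest average probability is chosen. Unlike stacking, no secondary model is trained on the first-layer outputs.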
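
A minimal sketch of what soft voting does, written in plain Python rather than with scikit-learn so the arithmetic is visible. The function name `soft_vote`, the optional `weights` argument and the toy numbers are illustrative assumptions, not part of this notebook.

```python
def soft_vote(probabilities, weights=None):
    """Combine per-model positive-class probabilities by (weighted) averaging.

    probabilities: list of equal-length lists, one list per model.
    """
    n_models = len(probabilities)
    if weights is None:
        weights = [1.0] * n_models  # equal weights when none are given
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*probabilities)
    ]


# Toy positive-class probabilities for four rows from three hypothetical models
model_a = [0.9, 0.4, 0.2, 0.6]
model_b = [0.8, 0.6, 0.1, 0.4]
model_c = [0.7, 0.3, 0.3, 0.7]

averaged = soft_vote([model_a, model_b, model_c])
# For a binary target, the predicted class is 1 when the averaged probability reaches 0.5
labels = [int(p >= 0.5) for p in averaged]
```

For an AUC-scored submission the averaged probabilities, not the thresholded labels, are the values worth saving.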

In [None]:
modelname ="ensemble"

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# define first layer estimators
estimators = [
    ("catboost", CatBoostClassifier(n_estimators=10, random_state=42)),
    ("lgbm", LGBMClassifier(random_state=42)),
    ("xgboost", XGBClassifier(random_state=42)),
]

# create model
votingC = VotingClassifier(estimators=estimators, voting='soft', n_jobs=4)

# fit model
votingC = votingC.fit(X_train, y_train)
print('model fit')

# evaluate model: accuracy uses hard labels, AUC uses predicted probabilities
y_pred = votingC.predict(X_test)
acc_clf = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_clf)

y_proba = votingC.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_proba)
auc = metrics.roc_auc_score(y_test, y_proba)
print(auc)

# plot results
plt.plot(fpr,tpr,label="ensemble, auc="+str(auc))
plt.legend(loc=4)
plt.show()

# create confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
metrics.ConfusionMatrixDisplay.from_estimator(votingC, X_test, y_test)
plt.title('Confusion matrix for ensemble model')
plt.grid(False)
plt.show()

# Save the predicted probabilities to a CSV file (the competition metric is AUC)
output9 = pd.DataFrame({'Id': test_id,
                       'target': votingC.predict_proba(test)[:, 1]})
print('predictions completed')
output9.to_csv('ensemble_submission.csv', index=False)
print('ensemble submission completed')

# plot distribution of target
sns.distplot(output9['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output9')

In [None]:
modelname ="folded ensemble"

# tidy up before modelling: drop helper columns if they are present
X.drop(columns=['kfold', 'target'], errors='ignore', inplace=True)

# no separate train/test split is needed here; StratifiedKFold below creates the splits

# define first layer estimators
estimators = [
    ("catboost", CatBoostClassifier(n_estimators=10, random_state=42)),
    ("lgbm", LGBMClassifier(random_state=42)),
    ("xgboost", XGBClassifier(random_state=42)),
]

# kbestfeatures
p_feature = 0.0001
train_numerical, selected_numerical = SelectKBestFeatures(train[cont_features], y, p_feature)
train_categorical, selected_categorical = SelectKBestFeatures(train[cat_features], y, p_feature)

# prepare for folds
test = test.astype('float32')  # cast once, outside the loop, to avoid repeated copies
preds = np.zeros(test.shape[0])
kf = StratifiedKFold(n_splits = 5, random_state=42,shuffle=True)
auc = []

for n, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    # create model
    model = VotingClassifier(estimators=estimators, voting='soft', n_jobs=4)
    model.fit(X_train, y_train)
    y_val = y_val.astype('float32')
    X_val = X_val.astype('float32')
    # accumulate the fold-averaged test probabilities
    preds += model.predict_proba(test)[:,1]/kf.n_splits
    auc.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
    gc.collect()
    print(f"fold: {n+1}, auc: {auc[n]}")
    
print(np.mean(auc))

# Save the predictions to a CSV file
output11 = pd.DataFrame({'Id': test_id,
                       'target': preds})
output11.to_csv('ensemble_kbest_submission.csv', index=False)
print('ensemble kbest submission completed')

# plot distribution of target
sns.distplot(output11['target'], kde=True, hist=False)
sns.distplot(y, kde=True, hist=False)
Outputs.append('output11')

# record the mean AUC; .loc adds the row if modelname is not in the index yet
AdvancedModelPerformanced_df.loc[modelname, 'Score'] = np.mean(auc)
    
# tidy up
del auc 

In [None]:
print(cpu_stats())

# Model Comparison

In [None]:
%%time
#order and print model comparison
AdvancedModelPerformanced_df=AdvancedModelPerformanced_df.sort_values(by='Score', ascending=False)
print(AdvancedModelPerformanced_df)

# plot results
x_pos = list(range(len(AdvancedModelPerformanced_df.index)))
plt.bar(x_pos,AdvancedModelPerformanced_df.Score)
plt.xlabel("Model")
plt.ylabel("Score")
plt.title("Advanced Model Comparison")
plt.xticks(x_pos, AdvancedModelPerformanced_df.index,rotation='vertical')
plt.show()

AdvancedModelPerformanced_df.to_csv("Advanced_model_comparison.csv")
print('Comparison saved as csv')

# plot distribution of target for all models
sns.distplot(y, kde=True, hist=False, color="Red")
#sns.distplot(output1['target'], kde=True, hist=False)
#sns.distplot(output2['target'], kde=True, hist=False)
#sns.distplot(output3['target'], kde=True, hist=False)
#sns.distplot(output4['target'], kde=True, hist=False)
#sns.distplot(output5['target'], kde=True, hist=False)
#sns.distplot(output6['target'], kde=True, hist=False)
#sns.distplot(output7['target'], kde=True, hist=False)
#sns.distplot(output8['target'], kde=True, hist=False)
#sns.distplot(output9['target'], kde=True, hist=False)
#sns.distplot(output10['target'], kde=True, hist=False)
plt.show()

In [None]:
print(cpu_stats())

## Observations on model comparison

* The boosted classifiers produced the best results in our initial tests, so we have gone with light gradient boosting, xgboost and catboost as candidates for parameter tuning. I should really have performed my own parameter tuning, but at this stage I'm just exploring performance differences and have taken the parameters from the highest-scoring public models for now.

* The boosted models with the default parameters scored between 75 and 76 in our previous test.

* The boosted models with tuned parameters scored between 76 and 77.

* Using the default settings from the boosting models, a stacking model achieved a score of 76.6.

* The boosted models with kbestfeatures scored between 84 and 85.63.

* Power averaging produced a score of 85.55, coming in just between the second- and third-placed boosted models on the public score.





#### NOTE:

This notebook is possibly more complicated than it needed to be because I wanted to do all of the tuned models, then the kfold versions and then all of the kbestfeatures versions, in that order, which meant that extra columns were sometimes added only to be removed later. Similarly, some functions require particular data types, for example 'float32', while others wanted something else, so some columns had to be recast several times. This could have been avoided by re-importing the data each time or copying it for each model; however, because of the size of the dataset, this was complicated. These changes make it slightly harder to follow - sorry about that!

## Credit where credit's due

First up, thanks to the Kaggle team for their tireless work putting the Tabular Playground together. Many thanks to the Kaggle and Stack Overflow communities and the folks that contribute to the documentation for the various Python modules, without whom finding solutions to these problems would be so much tougher.

Selectkbest feature selection and lgb parameters from : https://www.kaggle.com/pallavisinha12/october-playground-series by PALLAVI SINHA

Catboost tuning parameters from : https://www.kaggle.com/shenurisumanasekara/tabular-october-catboost by SHENURI SUMANASEKARA

Blending from : https://www.kaggle.com/mmellinger66/tps-oct-2021-the-melling-blend by mmellinger66

If you found this notebook useful or you have comments please upvote / comment here. Also please do upvote any of the notebooks above if you use them or find them helpful. 