# September Tabular Data Challenge - Binary Classification

In [None]:
import numpy as np
import datatable as dt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import xgboost as xgb

## Preprocessing

The following function reduces the memory footprint of a pandas DataFrame by changing the data types of its columns. Each numeric column is cast to the smallest data type that can store the values in that column. [Here is a link to the source notebook](https://www.kaggle.com/somang1418/tuning-hyperparameters-under-10-minutes-lgbm#Bayesian-Optimization-with-LightGBM) and [here is a notebook with helpful tips, including this one](https://www.kaggle.com/bextuychiev/how-to-work-w-million-row-datasets-like-a-pro).

In [None]:
def reduce_memory_usage(df, verbose=True):
    # Only numeric dtypes are downcast; object columns cannot be cast to float
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
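
As a rough illustration of the dtype selection above, the sketch below picks the narrowest integer type for a value range on a toy DataFrame, and shows one caveat of the float branch: `float16` rounds values noticeably. The helper name `smallest_int_dtype` and the toy data are hypothetical and not part of the original notebook. The helper uses inclusive bounds, whereas the function above uses strict inequalities.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype that can hold [c_min, c_max].

    Hypothetical helper for illustration only.
    """
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= c_min and c_max <= info.max:
            return dtype
    raise ValueError("range does not fit in int64")


# Toy data: one column fits in int8, the other needs int32
toy = pd.DataFrame({"small": [0, 100, -100], "large": [0, 40_000, -5]})
for col in toy.columns:
    toy[col] = toy[col].astype(smallest_int_dtype(toy[col].min(), toy[col].max()))

print(toy.dtypes)

# float16 has a 10-bit mantissa, so the stored value differs from the input
rounded = float(np.float16(0.123456789))
print(rounded)
```

Whether the `float16` rounding matters depends on the model. Tree-based models split on thresholds, so small rounding errors often have little effect, but this is worth checking on your own data.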

In initial tests, the constant imputing strategy performed better than the mean strategy, so we will use it here. The SimpleImputer replaces every missing value in the dataset with 0.0. Because the fill value is a constant, fitting a separate SimpleImputer on each dataset behaves the same as reusing a single fitted one. Before imputing, the function also counts the missing values in each row and stores that count in a new `missing` column. The target column has no missing values, which can be checked with the same `isna()` logic.

In [None]:
def preprocess_data(df):
    num_missing = df.isna().sum(axis=1)
    # Fill missing values with 0.0 (applied to whichever DataFrame is passed in)
    numerical_transformer = SimpleImputer(strategy='constant', fill_value=0.0)
    imputed_df1 = pd.DataFrame(numerical_transformer.fit_transform(df))
    imputed_df1.columns = df.columns
    
    df1 = reduce_memory_usage(imputed_df1)
    
    df1['missing'] = num_missing
    
    return df1
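
To make the two steps concrete, here is a minimal sketch on a toy DataFrame (the column names `f1` and `f2` and the values are made up for illustration). It counts the missing values per row first, then fills them with the constant 0.0.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values (hypothetical, for illustration only)
toy = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [np.nan, np.nan, 6.0]})

# Count missing values per row before they are filled in
num_missing = toy.isna().sum(axis=1)

# A constant fill does not learn anything from the data it is fitted on
imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
filled["missing"] = num_missing

print(filled)
```

The count has to be taken before imputing, since afterwards a filled-in 0.0 cannot be told apart from a genuine 0.0.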

The datatable library can read CSV files faster than pandas, so we will use it here and then convert the datatable objects into pandas DataFrames.

In [None]:
train_df = preprocess_data(dt.fread('../input/tabular-playground-series-sep-2021/train.csv').to_pandas())
test_df = preprocess_data(dt.fread('../input/tabular-playground-series-sep-2021/test.csv').to_pandas())

In [None]:
test_df.id = test_df.id.astype('int32') # Makes sure they are int and not float, which happens during imputing

# X holds the feature columns (everything except the id and the target)
X = train_df.drop(columns=['id','claim'])
# y holds the target/dependent variable, the 'claim' column
y = train_df['claim'].astype('int16')

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1)

We can check the shape of the dataset to get a better idea of which models may give the best predictions.

In [None]:
X_train.shape

**`X_train` has 119 columns: the 118 original features plus the engineered `missing` count (the `id` and `claim` columns were dropped). Random Forests are effective classifiers for datasets with many rows. A scikit-learn implementation of a Random Forest will be used to set a baseline of performance.**

**XGBoost will be used to train an even more accurate classifier and the parameters will be tuned with GridSearchCV.**

## Random Forest Classifier - Scikit Learn

### Turn the cells below into code cells to see the baseline AUC of the random forest
The RandomForestClassifier uses the bootstrap strategy (the default), so with `max_samples=100000` each of the 10 individual decision tree estimators is trained on a random sample of 100,000 rows drawn with replacement.

In [None]:
rf_model = RandomForestClassifier(n_estimators=10, max_samples=100000, n_jobs=-1)

In [None]:
rf_model.fit(X_train, y_train)

rf_score = roc_auc_score(y_valid, rf_model.predict_proba(X_valid)[:, 1])

print("On the validation set, rf_model had a score(auc) of:", rf_score)
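
The same baseline pattern can be tried quickly on synthetic data. The sketch below is an illustration only: the dataset comes from `make_classification`, and the sizes (2,000 rows, `max_samples=500`) are arbitrary assumptions chosen so it runs in a moment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption, not the real dataset)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0
)

# max_samples only applies when bootstrap=True, which is the default
toy_rf = RandomForestClassifier(n_estimators=10, max_samples=500, random_state=0)
toy_rf.fit(X_tr, y_tr)

# AUC is computed from the predicted probability of the positive class
toy_auc = roc_auc_score(y_va, toy_rf.predict_proba(X_va)[:, 1])
print(f"toy validation AUC: {toy_auc:.3f}")
```

Passing `random_state` to both the split and the model makes the score repeatable between runs, which helps when comparing against later models.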

## XGB Classifier - XGBoost

In [None]:
xgb_model = xgb.XGBClassifier(n_estimators=20000, learning_rate=0.13, tree_method='gpu_hist',
                              max_depth=2, n_jobs=-1, gamma=0,
                              reg_alpha=0, reg_lambda=1, subsample=0.9, colsample_bytree=0.9, 
                              max_bin=256, objective='binary:logistic', eval_metric='auc',
                              max_delta_step=0, predictor='gpu_predictor', use_label_encoder=False,
                              random_state=459)

In [None]:
xgb_model.fit(X_train, y_train, early_stopping_rounds=1000, eval_set=[(X_valid, y_valid)], verbose=2000)

xgb_preds = xgb_model.predict_proba(X_valid)[:,1]

xgb_score = roc_auc_score(y_valid, xgb_preds)

print("On the validation set, xgb_model had a score(auc) of:", xgb_score)

Parameter search spaces used:

{'learning_rate':[0.01, 0.1, 0.3], 'max_depth':[3,5,7,9], 'reg_alpha':[0.1,1e-4,1e-5], 'reg_lambda':[0.5,1.0,1.5], 
    'subsample':[0.6,0.7,0.9], 'colsample_bytree':[0.6,0.7,0.9], 'gamma':[0,1,2], 'max_bin':[256,320,512]}
    
NB: After seeing some other notebooks, I also changed the values by hand to experiment with different combinations of parameters. Increasing the number of estimators and decreasing the max_depth of the trees increased the roc_auc score by less than a percent. Optuna makes hyperparameter tuning much easier, and I would recommend using it instead.

Only two parameters were searched at a time with GridSearchCV. Turn the cell below into a code cell in order to optimize hyperparameters.

%%time
xgb_params = {} # Fill params here for testing

xgb_cv = GridSearchCV(xgb_model, xgb_params, n_jobs=-1, cv=2)
xgb_cv.fit(X_train, y_train)  # X_train is already imputed by preprocess_data

results = pd.DataFrame(xgb_cv.cv_results_)
print(results)
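
For a quick feel of searching two parameters at a time, here is a small sketch that runs on CPU in a few seconds. It swaps in scikit-learn's `GradientBoostingClassifier` and a synthetic dataset in place of the XGBoost model and the competition data, and it sets `scoring="roc_auc"` so candidates are ranked by the same metric used elsewhere in this notebook. The grid values are arbitrary examples.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the competition dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Two parameters per search keeps the grid small: 2 x 2 = 4 candidates
toy_grid = {"learning_rate": [0.1, 0.3], "max_depth": [2, 3]}

toy_cv = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    toy_grid,
    scoring="roc_auc",  # without this, classifiers are scored by accuracy
    cv=2,
)
toy_cv.fit(X_toy, y_toy)

# One row per candidate, with the mean cross-validated AUC and its rank
results = pd.DataFrame(toy_cv.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
print(toy_cv.best_params_)
```

Searching pairs of parameters is much cheaper than a full grid, at the cost of missing interactions between parameters that are never searched together.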

The confusion matrix shows which kinds of classification errors the model makes, that is, how many false positives and false negatives it produces.

In [None]:
plot_confusion_matrix(xgb_model, X_valid, y_valid)
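
The numbers behind the plot can also be read directly. The sketch below uses made-up labels and predictions to show how the four cells of a binary confusion matrix unpack. Note that `plot_confusion_matrix` was removed in scikit-learn 1.2; on newer versions, `ConfusionMatrixDisplay.from_estimator` takes the same three arguments.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and hard predictions, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes,
# so ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 3
```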

## CatBoost Classifier - CatBoost

CatBoost is another gradient boosting library, like XGBoost.

In [None]:
from catboost import CatBoostClassifier

In [None]:
cat_model = CatBoostClassifier(iterations=20000, depth=6, task_type="GPU", 
                               thread_count=-1, loss_function='Logloss', 
                               eval_metric='AUC', od_type='Iter', 
                               early_stopping_rounds=1000, 
                               use_best_model=True, verbose=2000,
                               random_state=459)

In [None]:
cat_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

cat_preds = cat_model.predict_proba(X_valid)[:,1]

cat_score = roc_auc_score(y_valid, cat_preds)

print("On the validation set, cat_model had a score(auc) of:", cat_score)

## Combining Methods

I am doing a very naive version of ensembling, but it does improve the AUC score slightly. This is likely because XGBoost and CatBoost make different errors in their predictions. The average of their predictions should be a little closer to the ground truth, which is the likelihood of an insurance claim resulting from the information in the record.

In [None]:
avg_preds = (xgb_preds + cat_preds) / 2.0

In [None]:
combined_score = roc_auc_score(y_valid, avg_preds)

print("On the validation set, the combined model had a score(auc) of:", combined_score)
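
A tiny hand-made example shows why averaging can help when two models make different errors. The scores below are invented for illustration: each toy model misranks a different negative/positive pair, and the average ranks all pairs correctly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Invented scores: model_a ranks the second negative above the first positive,
# model_b ranks the first negative above the second positive
model_a = np.array([0.1, 0.6, 0.5, 0.9])
model_b = np.array([0.6, 0.1, 0.9, 0.5])

# Simple unweighted average of the two score vectors
blend = (model_a + model_b) / 2.0

auc_a = roc_auc_score(y_true, model_a)
auc_b = roc_auc_score(y_true, model_b)
auc_blend = roc_auc_score(y_true, blend)
print(auc_a, auc_b, auc_blend)  # 0.75 0.75 1.0
```

Real models are rarely this complementary, so the gain from averaging is usually much smaller, as it is in this notebook.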

In [None]:
drop_test = test_df.drop(columns=['id'])

combined_output = (cat_model.predict_proba(drop_test)[:,1] + xgb_model.predict_proba(drop_test)[:,1]) / 2.0

output = pd.DataFrame({
    'id': test_df.id,
    'claim': combined_output
})
output.to_csv('submission.csv', index=False)

## This is my first public notebook, please let me know of any errors I made in the explanation/code. Thanks for reading! 