# Tabular Playground Series - Oct 2021

#### Oct 01, 2021 to Oct 31, 2021

#### https://www.kaggle.com/c/tabular-playground-series-oct-2021/

#### _**Predicting the biological response of molecules given various chemical properties**_

Notebook Author:

| Name  | Pradip Kumar Das  |
| ------------: | :------------ |
| **Profile:**  | [LinkedIn](https://www.linkedin.com/in/daspradipkumar/ "LinkedIn") l [GitHub](https://github.com/PradipKumarDas "GitHub") l [Kaggle](https://www.kaggle.com/pradipkumardas "Kaggle")  |
| **Contact:**  | pradipkumardas@hotmail.com (Email)  |
| **Location:**  | Bengaluru, India  |

**Sections:**

* Dependencies
* Exploratory Data Analysis (EDA) & Preprocessing
* Modeling & Evaluation
* Submission

## Dependencies

In [None]:
# Loads required packages

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score

from xgboost import XGBClassifier
import xgboost as xgb

from hyperopt import hp, fmin, tpe, Trials, STATUS_OK

## Exploratory Data Analysis (EDA) & Preprocessing

In [None]:
# Loads train dataset
train = pd.read_csv("../input/tabular-playground-series-oct-2021/train.csv")

In [None]:
# Checks how the train data set looks
with pd.option_context('display.max_rows', 10, 'display.max_columns', None): 
    display(train)

In [None]:
# Drops ID column as it is not required
train.drop(["id"], axis=1, inplace=True)

In [None]:
# Checks for data types used in the data set
train.dtypes.unique()

In [None]:
# Counts the total number of missing values across all columns ('0' indicates that no row has a missing value)
sum(train.isna().sum())
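
A total of zero missing cells also means that no row has a missing value. The two counts only differ when missing values exist. As a minimal sketch on a hypothetical toy frame (the frame and the variable names below are illustrative, not part of the competition data), the difference looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 3 rows, with NaN cells spread over the first two rows
toy = pd.DataFrame({
    "f0": [1.0, np.nan, 3.0],
    "f1": [np.nan, np.nan, 6.0],
})

# Counts every missing cell in the frame
total_missing_cells = int(toy.isna().sum().sum())

# Counts rows that contain at least one missing cell
rows_with_missing = int(toy.isna().any(axis=1).sum())

print("missing cells:", total_missing_cells)
print("rows with missing values:", rows_with_missing)
```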

In [None]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                # Downcasting to float16 is deliberately left out here;
                # float32 is the smallest float type used.
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
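
For comparison, pandas offers a built-in downcast through `pd.to_numeric`. The sketch below is an alternative technique shown on a hypothetical toy frame, not the function above, and it only illustrates how much memory smaller dtypes can save:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the default 64-bit dtypes
toy = pd.DataFrame({
    "flag": np.array([0, 1, 1, 0], dtype=np.int64),
    "value": np.array([0.25, 0.5, 0.75, 1.0], dtype=np.float64),
})
before = toy.memory_usage(index=False).sum()

# Picks the smallest integer / float dtype that can hold the values
toy["flag"] = pd.to_numeric(toy["flag"], downcast="integer")
toy["value"] = pd.to_numeric(toy["value"], downcast="float")
after = toy.memory_usage(index=False).sum()

print(toy.dtypes)
print("bytes before:", before, "bytes after:", after)
```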

In [None]:
# Downcasts column types to shrink the training data, which is otherwise large enough to reset the Kaggle kernel
train = reduce_mem_usage(train)

In [None]:
# Shows the column data types after data compression
train.dtypes

In [None]:
# Checks distribution of categorical target variable
train.target.value_counts()

**As `target` is equally distributed, the target itself can serve as the stratification label in stratified K-Fold validation.**
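
The effect of stratifying on the label can be sketched on toy data. The arrays below are hypothetical. The point is only that each validation fold keeps the class ratio of the full label vector:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 20 samples with balanced binary labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

skf = StratifiedKFold(n_splits=5)

# Share of positive labels in each validation fold
val_positive_rates = [
    float(y_toy[idx_val].mean()) for _, idx_val in skf.split(X_toy, y_toy)
]
print(val_positive_rates)
```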

## Modeling & Evaluation

In [None]:
# Separates predictor variables from target

y = train.target
train.drop(["target"], axis=1, inplace=True)

In [None]:
# Creates the stratified K-Fold splitter for cross validation
sk_fold = StratifiedKFold(n_splits=5)

In [None]:
# Performs cross validation on XGB Classifier

cv_generator = sk_fold.split(train, y)

model = XGBClassifier(
    n_estimators=100,
    objective='binary:logistic', 
    eval_metric='auc',
    tree_method='gpu_hist'
)

cv_scores = cross_val_score(model, train, y, scoring='roc_auc', cv=cv_generator, n_jobs=-1, verbose=10)

In [None]:
print("ROC AUC score of XGBoost (with default parameters) Model:", cv_scores.mean())

In [None]:
del cv_scores, model, cv_generator

**Automated Hyperparameter Tuning with Hyperopt**

In [None]:
# Instead of performing cross validation during hyperparameter tuning,
# the tuning is done over a fixed train and validation split to save a significant amount of time.
# The following code snippet extracts that stratified train and validation split (the first fold).

cv_generator = sk_fold.split(train, y)

for fold, (idx_train, idx_val) in enumerate(cv_generator):
    y_val = y.iloc[idx_val]
    dtrain = xgb.DMatrix(data=train.iloc[idx_train], label=y.iloc[idx_train])
    dval = xgb.DMatrix(data=train.iloc[idx_val], label=y.iloc[idx_val])
    break

In [None]:
# Sets up a search space for XGBoost hyperparameters
space = {
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
    'max_depth': hp.quniform("max_depth", 2, 6, 1),
    'min_child_weight' : hp.quniform('min_child_weight', 1, 8, 1),
    'reg_alpha' : hp.uniform('reg_alpha', 1e-8, 100),
    'reg_lambda' : hp.uniform('reg_lambda', 1e-8, 100),
    'gamma': hp.uniform ('gamma', 0.0, 1.0),
    'subsample': hp.uniform("subsample", 0.1, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.1, 1.0)
}
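
Hyperopt documents `hp.quniform(label, low, high, q)` as returning a value like `round(uniform(low, high) / q) * q`, which is a float even when `q` is 1. The sketch below mimics that formula with NumPy (the helper `quniform_like` is hypothetical and is not Hyperopt's own code) to show why a parameter such as `max_depth` needs an explicit `int` cast before it reaches XGBoost:

```python
import numpy as np

rng = np.random.default_rng(0)

def quniform_like(low, high, q):
    """Mimics the documented quniform formula; illustrative only."""
    return float(np.round(rng.uniform(low, high) / q) * q)

# Draws a few values from the same range as 'max_depth' in the search space
samples = [quniform_like(2, 6, 1) for _ in range(5)]

# The draws are whole numbers stored as floats, so they are cast to int
depths = [int(s) for s in samples]

print(samples)
print(depths)
```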

In [None]:
def trial_loss(space):
    """
    Trial function that Hyperopt calls with a set of trial hyperparameters
    to train a model and make predictions on the validation set.
    
    Parameters:
    ----------
    space: A set of trial hyperparameters
    
    Returns a dict holding the negated validation ROC AUC as the loss that Hyperopt minimizes.
    """
    
    # Converts parameter value to int as required by XGBoost
    space["max_depth"] = int(space["max_depth"])
    space["objective"] = "binary:logistic"
    space["eval_metric"] = "auc"
    space["tree_method"] = "gpu_hist"
    
    model = xgb.train(
        space, 
        dtrain, 
        num_boost_round=2000, 
        evals=[(dtrain, 'train'), (dval, 'eval')],
        early_stopping_rounds=50, verbose_eval=False)
    
    predictions = model.predict(dval)
    
    roc_auc = roc_auc_score(y_val, predictions)
    
    del predictions, model, space
    
    return {"loss": -roc_auc, "status": STATUS_OK}
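
Hyperopt's `fmin` minimizes the reported loss, while a higher ROC AUC is better, which is why the score is negated. A minimal sketch with hypothetical labels and scores shows that the better ranking ends up with the lower loss:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and two sets of predicted scores
y_true = np.array([0, 0, 1, 1])
good_scores = np.array([0.1, 0.2, 0.8, 0.9])  # every positive ranked above every negative
poor_scores = np.array([0.1, 0.8, 0.2, 0.9])  # one positive ranked below a negative

# Negating the AUC turns "higher is better" into "lower is better"
loss_good = -roc_auc_score(y_true, good_scores)
loss_poor = -roc_auc_score(y_true, poor_scores)

print(loss_good, loss_poor)
```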

In [None]:
# Starts hyperparameter tuning
trials = Trials()
best_trial = fmin(fn=trial_loss, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

In [None]:
# Views the best hyperparameters
best_trial

In [None]:
del dtrain, dval, y_val, cv_generator

## Submission

In [None]:
# Loads test data set
test = pd.read_csv("../input/tabular-playground-series-oct-2021/test.csv")

# Removes ID column as it is not required for prediction
test.drop(["id"], axis=1, inplace=True)

# Loads submission data set that acts just as a template for submission
submission = pd.read_csv("../input/tabular-playground-series-oct-2021/sample_submission.csv")

**Prepares final XGBoost model with optimized parameters**

In [None]:
# Adds other important parameters
best_trial["max_depth"] = int(best_trial["max_depth"])
best_trial["objective"] = "binary:logistic"
best_trial["eval_metric"] = "auc"
best_trial["tree_method"] = "gpu_hist"

In [None]:
# Trains the model on each cross-validation fold and
# stores the test predictions from each fold

test_predictions = []

cv_generator = sk_fold.split(train, y)

dtest = xgb.DMatrix(data=test)

for fold, (idx_train, idx_val) in enumerate(cv_generator):
    print("fold", fold)

    dtrain = xgb.DMatrix(data=train.iloc[idx_train], label=y.iloc[idx_train])
    dval = xgb.DMatrix(data=train.iloc[idx_val], label=y.iloc[idx_val])
    
    model = xgb.train(
        best_trial, 
        dtrain, 
        num_boost_round=2000, 
        evals=[(dtrain, 'train'), (dval, 'eval')],
        early_stopping_rounds=50, verbose_eval=200)
    
    predictions = model.predict(dtest)
    
    test_predictions.append(predictions)
    
    del predictions, model, dval, dtrain

In [None]:
test_predictions

In [None]:
del dtest, cv_generator, test, train

In [None]:
# Averages the predictions stored for each cross-validation fold
# and sets the target column to the averaged predictions
submission["target"] = np.mean(np.column_stack(test_predictions), axis=1)

# Checks the submission data before saving
submission
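
The averaging step can be sketched with two hypothetical folds of three test rows each. `np.column_stack` places each fold's predictions in its own column, so the mean over `axis=1` gives one averaged prediction per test row:

```python
import numpy as np

# Hypothetical per-fold predictions for three test rows
fold_predictions = [
    np.array([0.2, 0.9, 0.4]),
    np.array([0.4, 0.7, 0.4]),
]

# One column per fold, one row per test sample
stacked = np.column_stack(fold_predictions)

# Row-wise mean across folds
averaged = stacked.mean(axis=1)

print(stacked.shape)
print(averaged)
```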

In [None]:
# Saves test predictions
submission.to_csv("./submission.csv", index=False)