In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

Hello! If you're reading this, thanks for taking the time to look over my work. This is the abridged version of what I did for the March competition, and I was pretty happy with how well I did (top 38%) since it's only my second competition. If you see anything you would do differently or improve upon, please feel free to let me know!

Reading in and glancing over the training file

In [None]:
train = pd.read_csv(r'../input/tabular-playground-series-mar-2021/train.csv')
train

Early in this competition, I realized that some categorical columns in the testing dataset contained values that never appear in the training dataset. The machine learning algorithms were not working correctly because they were encountering category values they had not been trained on.
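
A quick way to confirm this kind of mismatch is to compare the values each categorical column takes in the two files. The sketch below uses two tiny made-up dataframes; toy_train, toy_test, and the helper unseen_levels are illustrative names and are not part of the competition data.

```python
import pandas as pd

# Toy stand-ins for the real files; the values are made up for illustration
toy_train = pd.DataFrame({'cat0': ['A', 'B', 'A'], 'cat1': ['X', 'Y', 'X']})
toy_test = pd.DataFrame({'cat0': ['A', 'C', 'B'], 'cat1': ['X', 'X', 'Y']})

def unseen_levels(train_df, test_df, cols):
    """Return {column: sorted list of values that appear only in test_df}."""
    result = {}
    for col in cols:
        only_in_test = set(test_df[col].dropna()) - set(train_df[col].dropna())
        if only_in_test:
            result[col] = sorted(only_in_test)
    return result

unseen = unseen_levels(toy_train, toy_test, ['cat0', 'cat1'])
print(unseen)
```

Any column that shows up in the result has values that an encoder fit on the training data alone would not recognize.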

To alleviate this, I completed the following steps:
- Read the training and testing data in as separate variables
- Marked which variable was training data and which was testing data
- Concatenated the dataframes
- Marked which columns contained categorical and continuous variables
- Encoded the categorical columns with an encoder fit on the concatenated dataframe
- Replaced any cat10 value with a frequency of 10 or less in the training or testing data with a missing value
- Dropped the training and testing labels
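
The rare-value step in the list above can be wrapped in a small helper so that it could be applied to any column. This is only a sketch: mask_rare_levels is a hypothetical name, and the toy series is made up to show the behavior.

```python
import numpy as np
import pandas as pd

def mask_rare_levels(series, threshold=10):
    """Replace values that occur `threshold` times or fewer with NaN."""
    counts = series.value_counts()
    rare = counts[counts <= threshold].index
    # Keep the value where it is not rare, otherwise use NaN
    return series.where(~series.isin(rare), np.nan)

# Made-up encoded column: 0 appears 12 times, 1 appears 11 times, 2 appears 3 times
toy = pd.Series([0] * 12 + [1] * 11 + [2] * 3)
masked = mask_rare_levels(toy, threshold=10)
print(masked.value_counts(dropna=False))
```

Only the value 2 falls at or below the threshold here, so its three rows become NaN and the column is converted to floats.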

In [None]:
from sklearn.preprocessing import LabelEncoder

# Read in the data
train = pd.read_csv(r'../input/tabular-playground-series-mar-2021/train.csv')
test = pd.read_csv(r'../input/tabular-playground-series-mar-2021/test.csv')

# Adding labels to distinguish the data after it has been concatenated 
train['label'] = 'train'
test['label'] = 'score'

# Putting the data together
concat_df = pd.concat([train, test])

# Labeling the categorical and continuous columns for identification
categorical_cols=['cat'+str(i) for i in range(19)]
continous_cols=['cont'+str(i) for i in range(11)]

# Encoding all the categorical variables
for e in categorical_cols:
    le = LabelEncoder()
    concat_df[e]=le.fit_transform(concat_df[e])
    train[e]=le.transform(train[e])
    test[e]=le.transform(test[e])

# Replacing the cat10 values with exceptionally low frequencies in the training dataset
threshold = 10 # Any value with a frequency of 10 or less will be replaced with NaN
value_counts_cat10 = train['cat10'].value_counts()
to_remove = value_counts_cat10[value_counts_cat10 <= threshold].index
train['cat10'] = train['cat10'].replace(to_remove, np.nan)

# Replacing the cat10 values with exceptionally low frequencies in the testing dataset
value_counts_cat10 = test['cat10'].value_counts()
to_remove = value_counts_cat10[value_counts_cat10 <= threshold].index
test['cat10'] = test['cat10'].replace(to_remove, np.nan)

# Creating a target 
target=train['target']

# Dropping the labels that were added at the beginning to tell the datasets apart
train = train.drop('label', axis=1)
test = test.drop('label', axis=1)

In [None]:
from pandas_profiling import ProfileReport

train_profile = ProfileReport(train, title='EDA')
train_profile

After looking through the data, I found that several of the variables were not as close to normal as I would like, so I checked the skewness of each. Any variable with a skew outside of [-0.5, 0.5] I transformed until its skew fell within that range.

While there were other variables that were obviously not normal (such as cont3 and cont4), I found that after transforming them, I got a decrease in performance. Given this, I have not included those transformations here.

In [None]:
for e in continous_cols:
    print(e, 'Skew Value: ', train[e].skew())

From this, I transformed continuous variables 7-10 using the Box-Cox transformation, since Box-Cox requires strictly positive input and none of the data for these variables is zero or negative.
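
As a minimal sketch of how Box-Cox behaves, the example below fits the transformation on a made-up right-skewed sample and then reuses the fitted lambda on a second sample, which is one way to keep training and testing data on the same scale. The arrays toy_train and toy_test are synthetic and are not the competition columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up, strictly positive, right-skewed samples standing in for a 'cont' column
toy_train = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
toy_test = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With no lmbda given, boxcox returns the transformed data and the fitted lambda
train_bc, fitted_lambda = stats.boxcox(toy_train)

# Passing lmbda applies that fixed lambda and returns only the transformed data
test_bc = stats.boxcox(toy_test, lmbda=fitted_lambda)

# scipy's skew is not bias-corrected, so it differs slightly from pandas' .skew()
skew_before = stats.skew(toy_train)
skew_after = stats.skew(train_bc)
print('Fitted lambda:', fitted_lambda)
print('Skew before:', skew_before)
print('Skew after: ', skew_after)
```

Fitting lambda on the training data only and reusing it elsewhere avoids transforming the two datasets with different parameters.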

In [None]:
from scipy import stats

train_cont7_box_cox = stats.boxcox(train['cont7'])[0]
print('Cont 7 box_cox_skew: ', pd.Series(train_cont7_box_cox).skew())
train_cont8_box_cox = stats.boxcox(train['cont8'])[0]
print('Cont 8 box_cox_skew: ', pd.Series(train_cont8_box_cox).skew())
train_cont9_box_cox = stats.boxcox(train['cont9'])[0]
print('Cont 9 box_cox_skew: ', pd.Series(train_cont9_box_cox).skew())
train_cont10_box_cox = stats.boxcox(train['cont10'])[0]
print('Cont 10 box_cox_skew: ', pd.Series(train_cont10_box_cox).skew())

In [None]:
# Replacing the original columns with their Box-Cox transformed values
train['cont7'] = train_cont7_box_cox
train['cont8'] = train_cont8_box_cox
train['cont9'] = train_cont9_box_cox
train['cont10'] = train_cont10_box_cox

Double checking that everything worked for these variables.

In [None]:
from pandas_profiling import ProfileReport

train_profile = ProfileReport(train, title='EDA')
train_profile

For this competition, I initially tested XGB, LightGBM, and CatBoost to see what their performances looked like. After seeing they were all performing pretty similarly, I decided to combine the three methods for my submission to this competition. In order to start this process, I looked for the best parameters using the code below.
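
The parameter search described above boils down to trying every combination in a small grid and keeping the best score. A minimal, library-independent sketch of that pattern follows; grid_search and toy_score are hypothetical names, and toy_score is only a stand-in for "train a model and return its validation AUC".

```python
from itertools import product

def grid_search(score_fn, grid):
    """Try every combination in `grid` and return (best_score, best_params)."""
    names = list(grid)
    best_score, best_params = float('-inf'), None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical scoring function; a real one would fit a model and return its AUC
def toy_score(learning_rate, max_depth):
    return 1.0 - abs(learning_rate - 0.1) - 0.01 * abs(max_depth - 8)

best_score, best_params = grid_search(
    toy_score,
    {'learning_rate': [0.05, 0.1, 0.2, 0.3], 'max_depth': [6, 8, 10]},
)
print(best_score, best_params)
```

Using itertools.product keeps the loop flat no matter how many parameters are added to the grid.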

In [None]:
from lightgbm import LGBMClassifier
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Making the dataset for training and testing
X = train.drop('target', axis=1)
Y = train['target']

# Splitting the data into training and testing 
# Saving 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Storing the data as a LightGBM dataset
d_train = lgb.Dataset(x_train, label=y_train)

def tuninglightgbm(num_leaves, learning_rate, max_depth):
    
    # Setting the parameters for the model
    # Setting the tested variables to the variables as indicated
    params = {}
    params['learning_rate']=learning_rate
    params['boosting_type']='gbdt' #GradientBoostingDecisionTree
    params['num_leaves']=num_leaves
    params['objective']='binary' #Binary target feature
    params['metric']='auc' 
    params['max_depth']=max_depth
    
    # Training the model and then scoring it
    lightgbm_model = lgb.train(params, d_train, 500)
    y_pred_tuning_lightgbm = lightgbm_model.predict(x_test)
    auc_score = roc_auc_score(y_test, y_pred_tuning_lightgbm)
    
    # Returning the score together with the parameters that produced it
    # so the best combination can be recorded by the caller
    parameters = [auc_score, num_leaves, learning_rate, max_depth]
    
    return parameters   
    
# Values being tested    
num_leaves = [10, 15, 20]
learning_rate = [0.01, 0.1, 0.2, 0.3]
max_depth = [10, 15, 20]

best_auc_score = 0

# Iterating over every combination of the values one by one
for i in range(len(num_leaves)):
    for j in range(len(learning_rate)):
        for k in range(len(max_depth)):
            parameters = tuninglightgbm(num_leaves[i], learning_rate[j], max_depth[k])
            
            # The best set of values will have the highest AUC by the end.
            if parameters[0] > best_auc_score:
                best_auc_score = parameters[0]
                best_num_leaves = parameters[1]
                best_learning_rate = parameters[2]
                best_max_depth = parameters[3]

# Reporting the best scores overall
print('Best AUC: %.6f' % best_auc_score)
print('Best Number of Leaves: ', best_num_leaves)
print('Best Learning Rate: ', best_learning_rate)
print('Best Max Depth: ', best_max_depth)

Best AUC: 0.889088

Best Number of Leaves:  10

Best Learning Rate:  0.2

Best Max Depth:  10

In [None]:
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Making the dataset for training and testing
X = train.drop('target', axis=1)
Y = train['target']

# Splitting the data into training and testing 
# Saving 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

def tuningcatboost(learning_rate, max_depth):
    
    
    # Setting the tested variables to the variables as indicated
    model = CatBoostClassifier(iterations=1000,
                               learning_rate=learning_rate,
                               max_depth=max_depth
                              )

    model.fit(x_train, y_train, verbose=False)

    y_pred_catboost = model.predict_proba(x_test)[:,1]
    
    auc_score = roc_auc_score(y_test, y_pred_catboost)
    
    # Returning the score together with the parameters that produced it
    # so the best combination can be recorded by the caller
    parameters = [auc_score, learning_rate, max_depth]
    
    return parameters   
    
# These are the values being tested for CatBoost    
learning_rate = [0.05, 0.1, 0.2, 0.3]
max_depth = [6, 8, 10]

best_auc_score = 0

for i in range(len(learning_rate)):
    for k in range(len(max_depth)):
        parameters = tuningcatboost(learning_rate[i], max_depth[k])

        # The best set of values will have the highest AUC by the end.
        if parameters[0] > best_auc_score:
            best_auc_score = parameters[0]
            best_learning_rate = parameters[1]
            best_max_depth = parameters[2]
                
# Reporting the best scores at the end
print('Best AUC: %.6f' % best_auc_score)
print('Best Learning Rate: ', best_learning_rate)
print('Best Max Depth: ', best_max_depth)

Best AUC: 0.890524

Best Learning Rate:  0.05

Best Max Depth:  8

In [None]:
from xgboost import XGBClassifier

# XGBoost performs better with one-hot encoding
# This is marking the categorical columns
categorical_cols=['cat'+str(i) for i in range(19)]

# Creating a new dataframe object with one-hot encoding
train_one_hot = pd.get_dummies(train, columns=categorical_cols)
print(train_one_hot.shape)

# Making the datasets for training and testing
X = train_one_hot.drop('target', axis=1)
Y = train_one_hot['target']

# Splitting the data into training and testing 
# Saving 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

def tuningxgb(learning_rate, depth):
    params={
        'n_estimators':500,
        'objective': 'binary:logistic',
        'learning_rate': learning_rate, # Testing the best learning rate
        'gamma':0.1,
        'subsample':0.8,
        'colsample_bytree':0.3,
        'min_child_weight':3,
        'max_depth': depth, # Testing the best depth
        'seed':1024,
        }

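    # early_stopping_rounds is left out: early stopping needs an eval_set passed
    # to fit(), and recent XGBoost versions raise an error without one
    model = XGBClassifier(**params)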
    
    model.fit(x_train, y_train)

    y_pred_xgb = model.predict_proba(x_test)[:,1]
    
    auc_score = roc_auc_score(y_test, y_pred_xgb)
    
    # Returning the score together with the parameters that produced it
    # so the best combination can be recorded by the caller
    parameters = [auc_score, learning_rate, depth]
    
    return parameters 
    
# These are the values being tested for XGBoost
learning_rate = [0.05, 0.1, 0.2, 0.3]
max_depth = [6, 8, 10]

best_auc_score = 0

for i in range(len(learning_rate)):
    for k in range(len(max_depth)):
        parameters = tuningxgb(learning_rate[i], max_depth[k])
        
        # The best set of values will have the highest AUC by the end.
        if parameters[0] > best_auc_score:
            best_auc_score = parameters[0]
            best_learning_rate = parameters[1]
            best_max_depth = parameters[2]

# Reporting the best scores at the end
print('Best AUC: %.6f' % best_auc_score)
print('Best Learning Rate: ', best_learning_rate)
print('Best Max Depth: ', best_max_depth)

Best AUC: 0.893018

Best Learning Rate:  0.05

Best Max Depth:  10

In [None]:
from lightgbm import LGBMClassifier
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Making the dataset for training and testing
X = train.drop('target', axis=1)
Y = train['target']

# Splitting the data into training and testing 
# Saving 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Storing the data as a LightGBM dataset
d_train = lgb.Dataset(x_train, label=y_train)

def tuninglightgbm(num_leaves, learning_rate, max_depth):
    
    # Setting the parameters for the model
    # Setting the tested variables to the variables as indicated
    params = {}
    params['learning_rate']=learning_rate
    params['boosting_type']='gbdt' #GradientBoostingDecisionTree
    params['num_leaves']=num_leaves
    params['objective']='binary' #Binary target feature
    params['metric']='auc' 
    params['max_depth']=max_depth
    
    # Training the model and then scoring it
    lightgbm_model = lgb.train(params, d_train, 500)
    y_pred_lightgbm = lightgbm_model.predict(x_test)
    
    return y_pred_lightgbm

y_pred_lightgbm = tuninglightgbm(20, 0.1, 10)


# *******************************************
# Starting CatBoost
# *******************************************

def tuningcatboost(learning_rate, max_depth):
    
    
    # Setting the tested variables to the variables as indicated
    model = CatBoostClassifier(iterations=1000,
                               learning_rate=learning_rate,
                               max_depth=max_depth
                              )

    model.fit(x_train, y_train, verbose=False)

    y_pred_catboost = model.predict_proba(x_test)[:,1]
    
    auc_score = roc_auc_score(y_test, y_pred_catboost)
    return y_pred_catboost
    
y_pred_catboost = tuningcatboost(0.05, 8)


# *******************************************
# Starting XGBoost
# *******************************************

categorical_cols=['cat'+str(i) for i in range(19)]

train_one_hot = pd.get_dummies(train, columns=categorical_cols)
print(train_one_hot.shape)

X = train_one_hot.drop('target', axis=1)
Y = train_one_hot['target']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

def tuningxgb(learning_rate, depth):
    params={
        'n_estimators':500,
        'objective': 'binary:logistic',
        'learning_rate': learning_rate,
        'gamma':0.1,
        'subsample':0.8,
        'colsample_bytree':0.3,
        'min_child_weight':3,
        'max_depth': depth,
        'seed':1024,
        }

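    # early_stopping_rounds is left out: early stopping needs an eval_set passed
    # to fit(), and recent XGBoost versions raise an error without one
    model = XGBClassifier(**params)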
    
    model.fit(x_train, y_train)

    y_pred_xgb = model.predict_proba(x_test)[:,1]
    
    auc_score = roc_auc_score(y_test, y_pred_xgb)
    print('XGBoost AUC: ', auc_score)
    
    return y_pred_xgb

y_pred_xgb = tuningxgb(0.05, 10)

y_pred_avg = (y_pred_xgb + y_pred_lightgbm + y_pred_catboost) / 3
y_pred_avg
roc_auc_score(y_test, y_pred_avg)

By combining all three of these methods, I have an AUC of 0.8923692458946146, which is actually slightly worse than XGBoost by itself (0.893018). Based on this, I'm going to see how different weights affect the blended result.

In [None]:
y_pred_avg = (3.5*y_pred_xgb + y_pred_lightgbm + y_pred_catboost) / 5.5
y_pred_avg
roc_auc_score(y_test, y_pred_avg)

After playing with the weights of the three different approaches, I found that treating XGBoost as the dominant method gave the best results.
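
The weighted blend above is just a weighted average of the three prediction arrays. A small sketch of that calculation follows; weighted_blend is a hypothetical helper, and the probabilities are made up rather than taken from the models in this notebook.

```python
import numpy as np

def weighted_blend(predictions, weights):
    """Weighted average of equally shaped prediction arrays."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # np.average divides by the sum of the weights
    return np.average(predictions, axis=0, weights=weights)

# Made-up probabilities standing in for the three models' predictions
p_xgb = np.array([0.9, 0.2, 0.6])
p_lgb = np.array([0.8, 0.3, 0.5])
p_cat = np.array([0.7, 0.1, 0.4])

blend = weighted_blend([p_xgb, p_lgb, p_cat], [3.5, 1, 1])
manual = (3.5 * p_xgb + p_lgb + p_cat) / 5.5
print(blend)
```

Looping over a list of candidate weights and scoring each blend with roc_auc_score would make the weight search repeatable. Keep in mind that weights chosen on the same holdout set used for scoring will tend to make that score look a little better than it really is.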

From here, I trained all three methods on the entirety of the training data and then blended them with a weight of 3.5 for XGBoost and 1 for each of the others, which gave my highest score for the March competition.