https://www.kaggle.com/c/tabular-playground-series-jan-2021/overview

Overview:
<br>
<br>
One challenging aspect of this competition for me was how weakly correlated the features are with the target. I tried several different approaches before settling on this one. My previous attempts used Featuretools (a library for automated feature engineering) to try to find new features that were more strongly correlated with the target. I did find some that were slightly more correlated, but not enough to move the needle on model performance.
<br>
<br>
The solution in this notebook is based on the observation that the features and target are multimodal. I've read that a multimodal distribution can mean you are looking at several populations that are being treated as a single one. To address the multimodal character of the data, I clustered the training target and treated each target cluster as coming from a distinct population. I then built a classification model to predict the target cluster from the features, and fit a separate linear regression model for each target cluster to predict the target.
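
As a rough illustration of the first step (clustering a one-dimensional, multimodal target), here is a minimal sketch on synthetic data. It is not the notebook's code: the notebook uses scikit-learn's `KMeans`, while this sketch swaps in a small hand-written Lloyd's loop in NumPy so it runs on its own. The function name `kmeans_1d`, the quantile-based starting centers, and the two synthetic populations are all assumptions made for the example.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=50):
    """Plain Lloyd's algorithm on a 1-D array; returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # Deterministic start (an assumption of this sketch): spread the
    # initial centers over evenly spaced quantiles of the values.
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
# Synthetic bimodal "target": two well-separated populations (hypothetical values).
target = np.concatenate([rng.normal(6.0, 0.3, 500), rng.normal(9.0, 0.3, 500)])
labels, centers = kmeans_1d(target, k=2)
print('cluster centers:', np.sort(centers))
print('cluster sizes:', np.bincount(labels))
```

With modes this far apart, each cluster should line up with one of the two populations; on real data the modes overlap and the split is less clean.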

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

from sklearn.metrics import mean_squared_error as MSE, classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb

from datetime import datetime as dt
import warnings
warnings.filterwarnings("ignore")
pd.options.display.float_format = "{:,.4f}".format

start = dt.now()
print(f'Start time: {start}')

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# train_df = pd.read_csv('train.csv', index_col=0)
# eval_df = pd.read_csv('test.csv', index_col=0)

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/train.csv', index_col=0)
eval_df = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/test.csv', index_col=0)
print('train_df', train_df.shape)
print('eval_df', eval_df.shape)

# EDA

In [None]:
# Check for nulls in the training data
nulls = train_df.isna().sum()
nulls = nulls[nulls > 0]
print(f'# columns with null values: {len(nulls)}')
print(nulls.index)

In [None]:
# Check for nulls in the testing data
nulls = eval_df.isna().sum()
nulls = nulls[nulls > 0]
print(f'# columns with null values: {len(nulls)}')
print(nulls.index)

In [None]:
# Check data types
train_df.info()

In [None]:
# Check for outliers
for i in train_df.columns:
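    # Pass the column as x= explicitly; recent seaborn versions no longer
    # treat the first positional argument as x
    sns.boxplot(x=train_df[i], color='#99c2a2')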
    plt.title(i)
    plt.show()

In [None]:
# Separate features and target
X = train_df.drop(columns=['target'])
feat_cols = X.columns
y = train_df['target']

In [None]:
# Target variable is multimodal
sns.histplot(y)

In [None]:
# Features have low correlation with the target
train_df.corr()['target'].sort_values()

In [None]:
# All of the features are multimodal
for i in X.columns:
    sns.histplot(X[i], element='poly')
    plt.title(i)
    plt.show()

In [None]:
# Measure correlation between features
plt.figure(figsize=(15, 10))
sns.heatmap(X.corr(), annot=True, fmt='.2f')

In [None]:
# Looks like the features are on the same scale since they have similar summary stats
sumstats = X.describe().T[['mean', 'min', 'max', 'std']]
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
sumstats.style.background_gradient(axis=0, cmap='Blues').format(precision=2)

# Baseline model
Establish a baseline model performance to improve upon.
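
The baseline cell in this notebook scores the model on the same rows it was fit on. As a point of comparison, here is a minimal sketch of scoring a linear baseline on held-out rows instead. It runs on synthetic data and uses NumPy least squares in place of scikit-learn's `LinearRegression`; the names `X_demo`, `y_demo`, `true_coefs`, the 80/20 split, and the noise level are assumptions made for the example, not values from the competition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the competition data (hypothetical sizes and coefficients).
n_rows, n_feats = 1000, 5
X_demo = rng.uniform(0, 1, size=(n_rows, n_feats))
true_coefs = np.array([0.5, -0.3, 0.2, 0.0, 0.1])
y_demo = 8.0 + X_demo @ true_coefs + rng.normal(0, 0.7, n_rows)

# Shuffle the row indices and hold out 20% of the rows for scoring.
idx = rng.permutation(n_rows)
holdout_idx, fit_idx = idx[:200], idx[200:]

def with_intercept(X):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Ordinary least squares on the fitting rows only.
beta, *_ = np.linalg.lstsq(with_intercept(X_demo[fit_idx]), y_demo[fit_idx], rcond=None)

fit_rmse = rmse(y_demo[fit_idx], with_intercept(X_demo[fit_idx]) @ beta)
holdout_rmse = rmse(y_demo[holdout_idx], with_intercept(X_demo[holdout_idx]) @ beta)
print(f'RMSE on fitted rows:   {fit_rmse:.3f}')
print(f'RMSE on held-out rows: {holdout_rmse:.3f}')
```

For a plain linear model the two numbers are usually close; the gap matters more for flexible models such as the gradient-boosted classifier used later.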

In [None]:
baselinemodel = LinearRegression().fit(X, y)
y_pred = baselinemodel.predict(X)
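# Take the square root explicitly; the squared=False argument has been
# removed from mean_squared_error in recent scikit-learn releases
rmse = np.sqrt(MSE(y, y_pred))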
print('Model name:', baselinemodel.__class__)
print(f'Baseline RMSE: {rmse:.3f}')

# Cluster target
The target could be multimodal if it is composed of more than one population. If that's the case, a single model will have a hard time learning the relationship between the features and the target, because the features would relate differently to each population in the target. To address this, I'll cluster the target and model each target cluster individually.
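
To make that argument concrete, here is a minimal sketch on synthetic data: two populations whose targets depend on the same feature in opposite directions, fit once with a single pooled linear model and once with one linear model per population. Everything in it is an assumption made for the example (the intercepts, slopes, noise level, and the helper names `fit_predict` and `rmse`), and it uses NumPy least squares instead of scikit-learn. Note also that population membership is known here, whereas in this notebook it has to be predicted by a classifier.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two hypothetical populations: same feature range, opposite slopes,
# and target levels far enough apart to give a bimodal target.
x_a = rng.uniform(0, 1, n)
x_b = rng.uniform(0, 1, n)
y_a = 6.0 + 1.0 * x_a + rng.normal(0, 0.2, n)
y_b = 9.0 - 1.0 * x_b + rng.normal(0, 0.2, n)

def fit_predict(x, y):
    """Fit y = b0 + b1 * x by least squares and return the fitted values."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])

# One model for everything vs. one model per population.
pooled_rmse = rmse(y_all, fit_predict(x_all, y_all))
split_pred = np.concatenate([fit_predict(x_a, y_a), fit_predict(x_b, y_b)])
split_rmse = rmse(y_all, split_pred)

print(f'Pooled model RMSE:         {pooled_rmse:.3f}')
print(f'Per-population model RMSE: {split_rmse:.3f}')
```

The pooled fit has to average the two opposite slopes, so its error is dominated by the gap between the populations rather than by the noise.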

In [None]:
# Choosing the number of clusters with the elbow method
kmodel = KElbowVisualizer(KMeans(), k=(2,10))
kmodel.fit(y.values.reshape(-1, 1))
kmodel.show()

In [None]:
# Clustering with the elbow value
kmeans = KMeans(n_clusters=kmodel.elbow_value_, random_state=1).fit(y.values.reshape(-1, 1))

y_df = pd.DataFrame(y)

y_df['kmeans_cluster'] = kmeans.predict(y.values.reshape(-1, 1))

print(y_df['kmeans_cluster'].value_counts().sort_index())

In [None]:
# The target clusters are an attempt to split the target into its component populations
# The distribution of the target makes more sense when it's colored by cluster
sns.histplot(x=y_df['target'], hue=y_df['kmeans_cluster'])

# Predict the target cluster
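Build a classification model that predicts the target cluster from the features, since the target (and therefore its cluster) is not available for the test set.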

In [None]:
# LightGBM classifier
y_labels = y_df['kmeans_cluster']
baseline_lgb = lgb.LGBMClassifier(random_state=1, objective='multiclass').fit(X, y_labels)
y_pred = baseline_lgb.predict(X)
print(classification_report(y_labels, y_pred))

In [None]:
# # LightGBM classifier: hyperparameter tuning
# params = {'reg_alpha': np.random.uniform(0, 1, 200),
#          'reg_lambda': np.random.uniform(0, 1, 200),
#          'num_leaves': np.random.randint(-1, 250, 500),
#          'max_depth': np.random.randint(-1, 250, 500),
#          'min_child_samples': np.random.randint(0, 100, 80),
#          'learning_rate': np.random.uniform(0.001, 0.1, 200)}

# baseline_lgb = lgb.LGBMClassifier(random_state=1, objective='multiclass')
# search = RandomizedSearchCV(estimator = baseline_lgb,
#                             param_distributions = params,
#                             n_iter = 100,
#                             cv = 5,
#                             random_state = 1,
#                             verbose = 2).fit(X, y_labels)

bestparams = {'reg_lambda': 0.7404690869123333,
             'reg_alpha': 0.6050163922906838,
             'num_leaves': 217,
             'min_child_samples': 30,
             'max_depth': 43,
             'learning_rate': 0.09386873247205824}

In [None]:
tuned_lgb = lgb.LGBMClassifier(**bestparams, random_state=1, objective='multiclass').fit(X, y_labels)
y_pred = tuned_lgb.predict(X)
print(classification_report(y_labels, y_pred))

# Model each target cluster

In [None]:
# Predict the target cluster
train_df['predict_target_cluster'] = tuned_lgb.predict(train_df[feat_cols])

# Model each target cluster
train_result = pd.DataFrame()
pred_clusters = sorted(train_df['predict_target_cluster'].unique())
for i in pred_clusters:
    subset = train_df[train_df['predict_target_cluster'] == i]
    X_sub = subset[feat_cols]
    y_sub = subset['target']
    
    lr_model = LinearRegression().fit(X_sub, y_sub)
    y_pred = lr_model.predict(X_sub)
    res = pd.DataFrame({'Act_Train_Target': y_sub,
                       'Pred_Train_Target': y_pred})
    res['Target_Cluster'] = i
    
    train_result = pd.concat([train_result, res])
    
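# Take the square root explicitly; the squared=False argument has been
# removed from mean_squared_error in recent scikit-learn releases
rmse = np.sqrt(MSE(train_result['Act_Train_Target'], train_result['Pred_Train_Target']))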
# Baseline RMSE: 0.726
print(f'RMSE: {rmse:.3f}')

# Submission

In [None]:
# Predict the target cluster
eval_df['predict_target_cluster'] = tuned_lgb.predict(eval_df[feat_cols])

# Model each target cluster
eval_result = pd.DataFrame()
pred_clusters = sorted(eval_df['predict_target_cluster'].unique())
for i in pred_clusters:
    train_subset = train_df[train_df['predict_target_cluster'] == i]
    eval_subset = eval_df[eval_df['predict_target_cluster'] == i]
    
    X_sub_train = train_subset[feat_cols]
    y_sub_train = train_subset['target']
    
    X_sub_eval = eval_subset[feat_cols]
    
    lr_model = LinearRegression().fit(X_sub_train, y_sub_train)
    y_pred = lr_model.predict(X_sub_eval)
    
    eval_res = pd.DataFrame(y_pred, columns=['target'], index=X_sub_eval.index)    
    eval_result = pd.concat([eval_result, eval_res])
    
print(f'Shape of evaluation predictions: {eval_result.shape}')

In [None]:
eval_result.sort_index(inplace=True)
eval_result.to_csv('submission.csv')

In [None]:
# total_seconds() covers the full elapsed time; divmod splits it into minutes and seconds
duration = int((dt.now() - start).total_seconds())
mins, secs = divmod(duration, 60)
print(f'Notebook run time:\n{mins} minutes and {secs} seconds')