As a start I just copied over my notebook from the Tabular Playground May competition (https://www.kaggle.com/jenssvensmark/scikit-stacking-tabular-play-may), adapted it to the number of classes in this new dataset, and ran it as is.

# Load and inspect data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
train_data = pd.read_csv("/kaggle/input/tabular-playground-series-jun-2021/train.csv")
test_data = pd.read_csv("/kaggle/input/tabular-playground-series-jun-2021/test.csv")

In [None]:
train_data.head()

In [None]:
train_data.info()

No missing data, and all the columns are integers except the target.
There are a lot of features here! And it looks like there are a lot of zeros as well.
Let's check out the range of values in each feature.

In [None]:
train_data.describe()
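To put a number on the "lot of zeros" impression, one option is the fraction of zero entries per feature. The following is a minimal sketch on a made-up frame (`toy` and its values are placeholders, not the competition data); on the real data the same expressions could be applied to the feature columns of `train_data`.

```python
import pandas as pd

# Toy stand-in for the feature columns (hypothetical values).
toy = pd.DataFrame({
    "feature_0": [0, 0, 3, 0],
    "feature_1": [1, 0, 0, 0],
    "feature_2": [0, 0, 0, 0],
})

# Fraction of entries equal to zero, per column and over the whole frame.
zero_frac_per_feature = (toy == 0).mean()
zero_frac_overall = (toy == 0).to_numpy().mean()

print(zero_frac_per_feature)
print(zero_frac_overall)
```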

In [None]:
cols = list(train_data.columns)
cols.remove("id")
cols.remove("target")
classes = [f"Class_{i}" for i in range(1, 10)]

# EDA

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import cufflinks as cf # for using plotly with pandas
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

In [None]:
train_data.describe().loc["max", cols].iplot(kind="bar")

In [None]:
fig, ax = plt.subplots(figsize=(25, 6))
sns.boxplot(data=train_data.drop("id", axis=1), ax=ax)

In [None]:
for i in range(0, 50, 20):  # feature_0, feature_20, feature_40
    col = f"feature_{i}"
    sns.countplot(x=col, data=train_data)
    plt.yscale("log")  # log scale so that rare values remain visible
    plt.show()

Above I just plotted a few selected features. The count appears to change continuously with the value of each feature.

All the features are integers, but there are a lot of features, and each of them takes a lot of distinct values. I have no idea whether these features are categorical, ordinal or numerical.
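One rough way to look at this is to tabulate, per feature, the number of distinct values next to the range they span. The sketch below uses a hypothetical `toy` frame; the `contiguous` flag (every integer between min and max occurs) is only a heuristic I am assuming here as a hint that a feature behaves like a count, and it does not settle whether a feature is categorical or numerical.

```python
import pandas as pd

# Toy stand-in for two feature columns (hypothetical values).
toy = pd.DataFrame({
    "feature_0": [0, 1, 0, 1, 0],
    "feature_1": [0, 5, 12, 40, 3],
})

summary = pd.DataFrame({
    "n_unique": toy.nunique(),
    "min": toy.min(),
    "max": toy.max(),
})
# True when every integer between min and max is present in the column.
summary["contiguous"] = summary["n_unique"] == summary["max"] - summary["min"] + 1

print(summary)
```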

In [None]:
train_data.groupby("target")["target"].count()

The dataset is quite imbalanced: some of the nine classes occur far more often than others.

In [None]:
# Correlate the feature columns only; the string-valued target cannot be correlated
sns.heatmap(train_data[cols].corr())

It doesn't look like the features are correlated (although I don't know whether all the zeros in the data affect this).

I'm not really sure whether we can eliminate some of the features or do feature engineering, so I will just use all the features as is in the following.

# scikit models

Imports and a couple of utility functions

In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, OneHotEncoder
from sklearn.metrics import log_loss
from sklearn.compose import ColumnTransformer
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.utils.class_weight import compute_class_weight
from sklearn.dummy import DummyClassifier

In [None]:
def test_model(model):
    model.fit(X_train, y_train)
    stats = just_test_model(model)
    return stats

def just_test_model(model):
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    loss = log_loss(y_test, model.predict_proba(X_test))
    stats = {"train": train_score, "test": test_score, "loss": loss}
    return stats

In [None]:
def test_model_partial(model, X, y, select=slice(None)):
    X = X[select]
    y = y[select]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    loss = log_loss(y_test, model.predict_proba(X_test))
    stats = {"train": train_score, "test": test_score, "loss": loss}
    return stats

In [None]:
def predict_proba_df(model, X, labels=classes):
    proba = model.predict_proba(X)
    proba = pd.DataFrame(proba, columns=labels)
    return proba

def plot_class_distribution(model, X_test, y_test, labels=classes):
    proba = predict_proba_df(model, X_test, labels=labels)
    proba["target"] = y_test.values
    plot_class_proba_distribution(proba)

def plot_class_proba_distribution(proba_df):
    # One panel per true class present in proba_df
    groups = list(proba_df.groupby("target"))
    fig, axes = plt.subplots(1, len(groups), figsize=(5 * len(groups), 2), squeeze=False)
    for (real_val, group), ax in zip(groups, axes.ravel()):
        sns.kdeplot(data=group.drop(columns="target"), fill=True, ax=ax)
        ax.set_title(real_val)
        ax.set_ylim(top=10)
    plt.show()

In [None]:
y = train_data.target
X = train_data.drop(["target", "id"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

As a baseline I start with a dummy classifier that just predicts probabilities based on how often each class occurs in the training data

In [None]:
dummy = DummyClassifier(strategy="prior")
print("prior dummy", test_model(dummy))
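As a sanity check on what this baseline means: when the prior is scored on the same labels it was fitted on, its log loss equals the entropy of the class distribution. The sketch below shows this on made-up labels (`y_toy` is hypothetical). In the cell above the dummy is scored on the held-out split, so its loss should be close to, but not exactly, the entropy of the training distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

# Hypothetical labels with an uneven class distribution.
y_toy = np.array(["Class_1"] * 6 + ["Class_2"] * 3 + ["Class_3"] * 1)
X_toy = np.zeros((len(y_toy), 1))  # the dummy ignores the features

dummy = DummyClassifier(strategy="prior").fit(X_toy, y_toy)
dummy_loss = log_loss(y_toy, dummy.predict_proba(X_toy), labels=dummy.classes_)

# Entropy (in nats) of the empirical class distribution.
_, counts = np.unique(y_toy, return_counts=True)
p = counts / counts.sum()
entropy = -(p * np.log(p)).sum()

print(dummy_loss, entropy)
```

Any model that is worth keeping should get a log loss clearly below this number.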

Below I build a list of categories to be used by the one-hot encoder. I'm including the test data in case it contains a value that is not present in the train data.

In [None]:
all_data = pd.concat((train_data, test_data))[cols]
categories = [sorted(all_data[col].unique()) for col in all_data]
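An alternative that avoids looking at the test data is to let the encoder learn the categories from the train data only and set `handle_unknown="ignore"`, so that unseen values are encoded as all zeros. This is a sketch of that alternative on toy frames (`train_toy` and `test_toy` are hypothetical), not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_toy = pd.DataFrame({"feature_0": [0, 1, 2]})
test_toy = pd.DataFrame({"feature_0": [1, 7]})  # 7 never appears in train_toy

# Categories are learned from the train data; unseen values become all-zero rows.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_toy)
encoded = ohe.transform(test_toy).toarray()

print(encoded)
```

The trade-off is that an unseen value carries no information for the model, whereas the explicit `categories` list reserves a column for it (which still never receives a non-zero weight update during training).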

Below I try out logistic regression considering the features as numerical using `StandardScaler` and as categorical using `OneHotEncoder`.

In [None]:
models = {"LogReg": Pipeline([("scaler", StandardScaler()),
                              ("model", LogisticRegression())])}
for C in [0.001, 0.01, 0.1, 1.0, 10]:
    models[f"LogReg_ohe_C{C}"] = Pipeline([
        ("ohe", OneHotEncoder(categories=categories, dtype="int")),  # drop="first" not used
        ("model", LogisticRegression(max_iter=150, C=C)),
    ])
for name, model in models.items():
    print(name, test_model(model))

From the loss it appears that the one-hot encoded model did better than the one treating the features as numerical, although neither seems to improve much on the prior dummy. The logistic regressions failed to converge within the iteration limit, but I'm choosing to ignore this since they got pretty good scores.
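If the convergence warnings were to be followed up rather than ignored, one way is to record whether a `ConvergenceWarning` was raised and how many iterations the solver used. The sketch below does this on synthetic data from `make_classification`; the helper `fit_and_report` and the chosen `max_iter` values are hypothetical and only illustrate the check.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

def fit_and_report(max_iter):
    """Fit a scaled logistic regression; return (warned, iterations used)."""
    model = Pipeline([("scaler", StandardScaler()),
                      ("model", LogisticRegression(max_iter=max_iter))])
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_toy, y_toy)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    n_iter = int(model.named_steps["model"].n_iter_.max())
    return warned, n_iter

print(fit_and_report(max_iter=2))
print(fit_and_report(max_iter=2000))
```

On the real one-hot encoded data a much larger `max_iter` may be slow, so this is a diagnostic rather than a recommendation.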

Let's look at whether the imbalance of the dataset might be an issue

In [None]:
def loss_per_class(model):
    # Local copy of the hold-out split with the true label attached
    data = X_test.copy()
    data["target"] = y_test
    for idx, group in data.groupby("target"):
        # Log loss restricted to the rows whose true class is idx
        loss = log_loss(group.target, model.predict_proba(group.drop(columns=["target"])), labels=classes)
        print(idx, " loss", loss)

loss_per_class(models["LogReg_ohe_C0.01"])
plot_class_distribution(models["LogReg_ohe_C0.01"], X_test, y_test)

So the model is heavily imbalanced towards predicting `Class_6`. Let's see what happens if we apply class weights to even this out

In [None]:
%%time
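# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False
model = Pipeline([("ohe", OneHotEncoder(categories=categories, dtype="int", sparse=False)),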
                   ("model", LogisticRegression(max_iter=200, C=0.01, 
                                                class_weight="balanced",
                                               ))])
models["ohe_log_weighted"] = model
model = model.fit(X_train, y_train)
just_test_model(model)
loss_per_class(model)
plot_class_distribution(model, X_test, y_test)

So, now losses are the same across the classes, but the total loss is somewhat larger than the previous models.
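For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * count)`, so rare classes count more in the training loss. The sketch below reproduces this with `compute_class_weight` on made-up labels (`y_toy` is hypothetical).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with an uneven class distribution.
y_toy = np.array(["Class_1"] * 6 + ["Class_2"] * 3 + ["Class_3"] * 1)
labels, counts = np.unique(y_toy, return_counts=True)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=labels, y=y_toy)

# The same weights computed by hand.
manual = len(y_toy) / (len(labels) * counts)

print(dict(zip(labels, weights)))
```

A plausible reason for the larger total loss is that the reweighted model no longer predicts probabilities that match the class frequencies of the (unweighted) hold-out set, which is what log loss rewards.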

Below I test whether it works well to treat features with fewer than 10 unique values as categorical and the remainder as numerical.

In [None]:
bool_array = X.nunique() < 10
cat_cols = list(X.columns[bool_array])
cat_cols_idx = np.flatnonzero(bool_array)
#len(cat_cols_idx)

In [None]:
selected_cats = list(np.array(categories, dtype="object")[cat_cols_idx])
# Note: scikit-learn >= 1.2 spells this parameter sparse_output=False
ohe = OneHotEncoder(categories=selected_cats, dtype="int", drop="first", sparse=False)
ct_ohe = ColumnTransformer([("ohe", ohe, cat_cols_idx)], remainder="passthrough")
model = Pipeline([("ct_ohe", ct_ohe),
                  ("scaler", StandardScaler()),
                  # same C as the last value tried in the loop over C above
                  ("model", LogisticRegression(max_iter=100, C=10))])
test_model(model)

Not particularly...

Okay, let's just try a bunch of classifiers

In [None]:
%%time
model = GradientBoostingClassifier()
models["gradboost"] = model
test_model(model)

In [None]:
%%time
forest = RandomForestClassifier(max_depth=10)
models["forest"] = forest
print("forest", test_model(forest))

None of these classifiers did particularly well. Let's just stack them all together...

In [None]:
%%time
voter_list = ["gradboost", "forest", "LogReg", "LogReg_ohe_C0.01", "ohe_log_weighted"]
voters = [(voter, models[voter]) for voter in voter_list]
sc = StackingClassifier(voters)
test_model(sc)
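The call above relies on the defaults of `StackingClassifier`: a logistic regression as the final estimator, trained on cross-validated predictions of the base models. The sketch below spells these choices out on synthetic data; the data, the two base models and their settings are placeholders, not the models used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

base = [
    ("forest", RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
]
stack = StackingClassifier(
    estimators=base,
    final_estimator=LogisticRegression(max_iter=1000),  # combines the base predictions
    cv=5,                          # base predictions come from 5-fold cross-validation
    stack_method="predict_proba",  # feed class probabilities to the final estimator
)
stack.fit(X_tr, y_tr)
proba = stack.predict_proba(X_te)

print(proba.shape)
```

Since every base model is refitted once per fold plus once on the full training data, stacking slow models such as the one-hot encoded logistic regressions multiplies the run time accordingly.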

... and submit that

In [None]:
def predict_test_data(model):
    model.fit(X, y)
    proba = model.predict_proba(test_data.drop(["id"], axis=1))
    predicted = pd.DataFrame(proba, columns=classes)
    predicted["id"] = test_data.id
    predicted = predicted[["id"]+classes]
    return predicted

In [None]:
predicted = predict_test_data(sc)
predicted.to_csv("stacking_model.csv", index=False)
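Before uploading, a quick check of the submission frame can catch obvious mistakes. This is a sketch that assumes the expected layout is an `id` column followed by one probability column per class, as the code above produces; `check_submission`, the toy frame and the ids in it are hypothetical.

```python
import numpy as np
import pandas as pd

classes = [f"Class_{i}" for i in range(1, 10)]

def check_submission(df, class_cols):
    """Raise AssertionError if df does not look like a valid probability table."""
    assert list(df.columns) == ["id"] + class_cols, "unexpected columns or column order"
    probs = df[class_cols].to_numpy()
    assert np.isfinite(probs).all(), "NaN or inf in probabilities"
    assert ((probs >= 0) & (probs <= 1)).all(), "probabilities outside [0, 1]"
    assert np.allclose(probs.sum(axis=1), 1.0), "rows do not sum to 1"
    return True

# Hypothetical two-row submission with uniform probabilities.
toy = pd.DataFrame(np.full((2, 9), 1 / 9), columns=classes)
toy.insert(0, "id", [200000, 200001])

print(check_submission(toy, classes))
```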