Here's my notebook for the February 2022 Tabular Playground Series. Nothing too special here, just an ExtraTreesClassifier.

I learnt a lot looking through others' notebooks. In particular, the following had lots of good ideas (some of which I borrowed for my entries):

https://www.kaggle.com/ayoubchaoui/extratreesclassifier-vs-randomforestclassifier

https://www.kaggle.com/munumbutt/extratrees-stratifiedkfold-memory-optimization

https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees

https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense

https://www.kaggle.com/kotrying/extra-blender-addition

https://www.kaggle.com/ambrosm/tpsfeb22-03-clustering-improves-the-predictions

Script parameters:

In [None]:
num_estimators = 80
num_splits = 10

Library imports and data importing:

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv',
                    index_col='row_id')

# Encode the class labels as integers; the encoder is kept so that
# predictions can be mapped back to the original labels later
target_encoder = LabelEncoder()
y = pd.Series(target_encoder.fit_transform(train["target"]))
X = train.drop(columns=['target'])
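
As a quick illustration of what the label encoding does, here is a minimal sketch on made-up labels (the names `species_a` etc. are placeholders, not the competition's classes). LabelEncoder sorts the distinct labels and numbers them from 0, and `inverse_transform` recovers the original strings:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the 'target' column (illustration only)
labels = np.array(["species_c", "species_a", "species_c", "species_b"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # distinct labels are sorted, then numbered from 0

# Map the integer codes back to the original strings
decoded = encoder.inverse_transform(codes)

print(encoder.classes_)   # sorted distinct labels
print(codes)              # integer code for each input label
print(decoded)            # round trip back to the strings
```

The same round trip is what lets the integer predictions be turned back into class names before the submission is written.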

In [None]:
test = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv',
                    index_col='row_id')
print(test.head())

Train an ExtraTreesClassifier. I also tried RandomForestClassifier and a range of hyperparameters (evaluated with sklearn.model_selection.GridSearchCV, among other tools), but nothing provided much improvement over an ExtraTreesClassifier.
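
For reference, a hyperparameter search of that kind might look roughly like the sketch below. The synthetic data, the parameter grid, and the names `X_demo`, `y_demo` and `param_grid` are placeholders for illustration; they are not the values actually searched for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic dataset standing in for the competition data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Hypothetical grid: every combination of these values is evaluated
param_grid = {
    "n_estimators": [20, 40],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

# Best combination found and its mean cross-validated accuracy
print(search.best_params_, search.best_score_)
```

On the full competition data a search like this can take a long time, since each grid point is refit once per fold.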

For each cross-validation fold, the model's predicted class probabilities on the test set are saved for later averaging.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import time

fold_probs = [] # Store the probabilities from each fold for later use
                # The final predicted value is determined by the
                # average across all cross-validation folds

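# Evaluate the model using stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=num_splits, shuffle=True, random_state=456)
for fold, (train_id, valid_id) in enumerate(skf.split(X, y)):
    Xt = X.iloc[train_id]   # training features for this fold
    yt = y.iloc[train_id]   # training labels for this fold
    Xv = X.iloc[valid_id]   # held-out validation features
    yv = y.iloc[valid_id]   # held-out validation labels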
    model = ExtraTreesClassifier(n_estimators = num_estimators)
    start = time.time()

    model.fit(Xt, yt)
    
    end = time.time()
    
    valid_pred = model.predict(Xv)
    valid_score = accuracy_score(yv, valid_pred)
    
    print("Fold:", fold + 1, "Accuracy:", valid_score, 'Time (min.):', (end - start)/60)
    
    
    fold_probs.append(model.predict_proba(test))
    

Although the cross-validated accuracy is quite high, accuracy on the test set is a lot lower due to target drift. Essentially, the bacteria mutated between the training and test sets, so decision boundaries learned on the training set are less accurate on the test set. This is explained in more detail (with figures) in AmbrosM's notebooks (linked above) and elsewhere. I spent a bit of time exploring this, but ran out of time before coming up with a novel way to improve the test-set predictions.

Next, we average the class probabilities across the cross-validation folds to obtain the final prediction:

In [None]:
mean_prob = sum(fold_probs) / len(fold_probs) # Mean probability for each row
print(mean_prob)

mean_pred = target_encoder.inverse_transform(np.argmax(mean_prob, axis=1))
print(mean_pred)
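
To make the averaging step concrete, here is a tiny worked sketch with made-up probabilities for two rows, three classes and three folds (all names ending in `_demo` are illustrative, not part of the notebook). Individual folds can disagree on the most likely class, and the averaged probabilities settle the vote:

```python
import numpy as np

# Made-up predict_proba outputs from three folds: shape (2 rows, 3 classes) each
fold_probs_demo = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]),
    np.array([[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]),
]

# Element-wise mean over the fold axis gives one probability row per sample
mean_prob_demo = np.mean(fold_probs_demo, axis=0)

# The predicted class is the column with the highest averaged probability
pred_demo = np.argmax(mean_prob_demo, axis=1)

print(mean_prob_demo)
print(pred_demo)
```

Averaging like this relies on every fold's model using the same column order in `predict_proba`, which follows each model's `classes_` attribute. With stratified folds every class appears in every training split, so the columns line up.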

In [None]:
output = pd.DataFrame(data = {'row_id': test.index, 'target': mean_pred})
print(output)
output.to_csv('submission.csv', index=False)