# Intro

Two algorithms from the scikit-learn library performed very well out-of-the-box in this competition, compared to the gradient boosting machines and deep learning models:
* ExtraTreesClassifier, see for example https://www.kaggle.com/hiro5299834/tps-feb-2022-extratreeclassifier by BIZEN (upvote!)
* KNeighborsClassifier, see for example https://www.kaggle.com/kartik2khandelwal/startified-kfold-with-knn-improved-score/ by Kartik Khandelwal (upvote!)

AmbrosM has explained in his insightful notebook https://www.kaggle.com/ambrosm/tpsfeb22-03-clustering-improves-the-predictions (upvote!) that "the bacteria in the test set will have undergone mutation and have slightly different DNA", which explains the unreliability of out-of-fold cross-validation. The competition paper (https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full) also explains that there are two kinds of obstacles: experimental noise and mutations. Experimental noise seems to affect both the training and testing sets equally but, as pointed out by AmbrosM, mutations affect each set differently.

If I understood these insights correctly, this competition is therefore about how to deal with non-representative data due to both sampling bias (i.e. mutations) and sampling noise (i.e. experimental noise):
* The sampling noise does not seem to be an issue thanks to the large amount of data.
* The sampling bias is the main issue because it cannot be avoided without "peeking" into the testing set, or can it? Interestingly, the fact that "extra trees" and "nearest neighbors" performed so well might provide some hints about this. Let's start by loading the data.

# Input

In [None]:
import pandas as pd
xtrain = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv',index_col=0).drop_duplicates()
xtest = pd.read_csv('../input/tabular-playground-series-feb-2022/test.csv',index_col=0)
ytrain = xtrain.pop('target').to_frame()

# Target encoding as one-hot or integer:
ytrain_ohe = pd.get_dummies(ytrain['target']) #one-hot
ytoname = {x:y for x,y in enumerate(ytrain_ohe.columns)} #integer to name
ytoint = {y:x for x,y in enumerate(ytrain_ohe.columns)} #name to integer
ytrain['target_num'] = ytrain['target'].map(ytoint)

# KNN

Two observations regarding the nearest neighbor algorithm:
* The performance is better without cross-validation, that is, using the full training set only once.
* The performance is better with the manhattan metric, inverse-distance weights and only 2 neighbors.

This indicates that for a specific observation in the testing set, there might be only one or two observations in the training set with similar mutations. The absolute value used by the Manhattan distance (L1) appears to be more robust against outlier mutations than the squared difference used by the Euclidean distance (L2).
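
To make the L1 versus L2 argument concrete, here is a minimal sketch on made-up vectors (the values are hypothetical and not taken from the competition data). One candidate differs a little in every feature, while the other matches the query exactly except for a single large deviation, standing in for an "outlier mutation":

```python
import numpy as np

# Hypothetical 4-feature observations (illustrative values only).
query = np.array([0.0, 0.0, 0.0, 0.0])
# Candidate A differs a little in every feature.
cand_a = np.array([0.5, 0.5, 0.5, 0.5])
# Candidate B matches exactly except for one large deviation.
cand_b = np.array([0.0, 0.0, 0.0, 1.5])

def manhattan(u, v):
    # L1: sum of absolute differences.
    return np.abs(u - v).sum()

def euclidean(u, v):
    # L2: square root of the sum of squared differences.
    return np.sqrt(((u - v) ** 2).sum())

l1_a, l1_b = manhattan(query, cand_a), manhattan(query, cand_b)
l2_a, l2_b = euclidean(query, cand_a), euclidean(query, cand_b)

print('L1:', l1_a, l1_b)  # B is the nearer neighbor under L1
print('L2:', l2_a, l2_b)  # A is the nearer neighbor under L2
```

Because L2 squares each difference, the single large deviation dominates and pushes candidate B away, whereas L1 still ranks B as the closer match.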

In [None]:
# Fit on full training set once and predict class probabilities:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_jobs=-1,metric='manhattan',weights='distance',n_neighbors=2)
model.fit(xtrain,ytrain['target_num'])
ytest_prob = pd.DataFrame(model.predict_proba(xtest),index=xtest.index,columns=ytrain_ohe.columns)

As AmbrosM pointed out, the probabilities need to be adjusted in order to balance the number of observations per class. Most notebooks did this by adding a different constant to each class.

In [None]:
# Check class distributions:
ytest_knn = ytest_prob.apply(lambda x: x.argmax(),axis=1)
ytest_knn = ytest_knn.map(ytoname).rename('target').reset_index()
display(ytest_knn.target.value_counts(normalize=True))

I found that the 90th percentile could be used more effectively in this case. Notice how the per-class values are centered around 50% probability:

In [None]:
display(ytest_prob.quantile(0.9))

We can therefore simply divide the probabilities of each class by their 90th percentile and then halve the result, so that the 90th percentile of every class sits at exactly 50% probability. As pointed out by xuedaolao in the comments section, this results in 10% of the observations having more than 50% probability for each class:

In [None]:
ytest_prob /= ytest_prob.quantile(0.9)
ytest_prob /= 2
display(ytest_prob.quantile(0.9))
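
The effect of this rescaling can be checked on synthetic data. The sketch below uses randomly generated probabilities for three hypothetical classes (the class names, the Dirichlet parameters and the sample size are all assumptions made for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical probabilities for 3 classes and 1000 observations.
raw = rng.dirichlet(alpha=[0.3, 0.5, 0.8], size=1000)
prob = pd.DataFrame(raw, columns=['class_a', 'class_b', 'class_c'])

# Divide each column by its own 90th percentile, then halve,
# so that the 90th percentile of every class lands on 0.5.
scaled = prob / prob.quantile(0.9) / 2

q90 = scaled.quantile(0.9)
share_above_half = (scaled > 0.5).mean()
print(q90)
print(share_above_half)
```

Note that the rescaled rows no longer sum to one, so they are scores used for ranking the classes rather than calibrated probabilities.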

The resulting submission scored **0.98835** on the public LB:

In [None]:
ytest_knn = ytest_prob.apply(lambda x: x.argmax(),axis=1)
ytest_knn = ytest_knn.map(ytoname).rename('target').to_frame()
ytest_knn[['target']].to_csv('submission_knn.csv')
display(ytest_knn.target.value_counts(normalize=True))

# ET + pseudolabel LightAutoML 

The "Extra Trees" approach has been excellently applied by several people. An additional improvement in performance has been observed by using the extra-trees predictions on the testing-set as "pseudolabels" to re-fit a second model. An excellent example of this is the following notebook by Alexander Ryzhkov, in which the "LightAutoML" model was fitted on the pseudolabels: https://www.kaggle.com/alexryzhkov/tps-feb-22-lightautoml-pseudolabel (upvote!).

Most notebooks combined different algorithms by taking the mode of the predicted classes; however, I found that simply averaging the probabilities predicted by KNN and LightAutoML worked better (the submission file below scored **0.99066** on the public LB). Since the notebook above did not output the probabilities, I forked it and made two changes: the pseudolabels were updated to the following shared ensemble, https://www.kaggle.com/kotrying/extra-blender-addition by kotrying (upvote!), and the test probabilities were written to a file, as shown below:

> pseudolabels = pd.read_csv('../input/extra-blender-addition/submission.csv')

> test_pred_out = pd.DataFrame(test_pred.data) <br>
> test_pred_out.columns = mapper.keys() <br>
> test_pred_out.to_csv('test_pred.csv') <br>

In [None]:
# Combine probabilities with current best tree-based model shared by Alexander Ryzhkov:
# https://www.kaggle.com/alexryzhkov/tps-feb-22-lightautoml-pseudolabel
ytest_prob_automl = pd.read_csv(
    '../input/a-fork-of-tps-feb-22-lightautoml-pseudolabel/test_pred.csv',index_col=0,header=0
)
ytest_prob = 0.5*ytest_prob_automl[ytest_prob.columns].values + 0.5*ytest_prob
ytest = ytest_prob.apply(lambda x: x.argmax(),axis=1).rename('target_num').to_frame()
ytest['target'] = ytest.target_num.map(ytoname)
ytest[['target']].to_csv('submission_knn_automl.csv')
display(ytest.target.value_counts(normalize=True))
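
To show how taking the mode and averaging the probabilities can disagree, here is a minimal sketch with made-up probabilities from three hypothetical models for a single observation. It only illustrates the mechanism and is not evidence that one method is better in general:

```python
import numpy as np

# Hypothetical class probabilities from three models for one observation.
probs = np.array([
    [0.51, 0.49],   # model 1: barely prefers class 0
    [0.52, 0.48],   # model 2: barely prefers class 0
    [0.05, 0.95],   # model 3: strongly prefers class 1
])

# Mode of the predicted classes: each model casts one vote.
votes = probs.argmax(axis=1)
mode_class = np.bincount(votes).argmax()

# Average the probabilities first, then take the argmax.
mean_probs = probs.mean(axis=0)
soft_class = mean_probs.argmax()

print(mode_class, soft_class, mean_probs)
```

The mode discards how confident each model is, while the average lets one confident model outweigh two nearly undecided ones.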

The scikit-learn user guide (https://scikit-learn.org/stable/modules/ensemble.html#forest) explains that the "extra trees" approach "allows to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias" in comparison to other tree-based models. This would mean that, where a higher-variance model might "jump" to another bacteria class prediction after a very small change in the data caused by the mutations, the extra-trees model is better able to avoid overfitting on the mutations, similarly to the Manhattan (L1) distance in the case of nearest neighbors.

# GCD postprocessing with pseudolabel KNN

In AmbrosM's notebook (referenced in the intro), he applied an unsupervised method on a subset of the GCDs in order to re-label overlapping clusters of observations, which required searching for and isolating each pair of overlapping clusters in a lower-dimensional space. 

A perhaps simpler alternative is to fit a supervised algorithm such as KNN on the pseudolabels of the GCDs of interest. Two things to point out:
* Due to the training set's sampling bias, the algorithm was fitted entirely on the testing set.
* Since the distance of each observation to itself is zero, the number of neighbors had to be increased and the weights had to become "uniform" instead of "inverse-distance".
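
The second point can be illustrated with a small NumPy-only sketch. The data, the labels and the `knn_relabel` helper are all hypothetical; the treatment of zero distances (an exact match takes all the weight) is an assumption that mirrors, as far as I understand it, how scikit-learn handles exact matches with inverse-distance weights:

```python
import numpy as np

# Hypothetical 1-D data: two clusters, index 3 carries a wrong pseudolabel.
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

def knn_relabel(X, y, k, weights):
    # Manhattan distances between every pair of observations.
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    # Indices of the k nearest neighbors; each point is its own nearest neighbor.
    nearest = np.argsort(dist, axis=1, kind='stable')[:, :k]
    out = np.empty_like(y)
    for i, idx in enumerate(nearest):
        d = dist[i, idx]
        if weights == 'distance':
            # Assumption: an exact match (distance zero) takes all the weight.
            w = (d == 0).astype(float) if np.any(d == 0) else 1.0 / d
        else:
            w = np.ones(k)
        out[i] = np.bincount(y[idx], weights=w, minlength=2).argmax()
    return out

same = knn_relabel(X, y, k=2, weights='distance')
smoothed = knn_relabel(X, y, k=3, weights='uniform')
print(same)      # every point keeps its own pseudolabel
print(smoothed)  # index 3 is re-labeled by its neighbors
```

With inverse-distance weights the fit simply reproduces the pseudolabels, so nothing is corrected; uniform weights with more neighbors let the surrounding observations outvote a wrong label.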

In [None]:
# Get GCD as explained by AmbrosM:
# https://www.kaggle.com/ambrosm/tpsfeb22-03-clustering-improves-the-predictions
import numpy as np
from math import factorial
def bias_of(s):
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T')+1:s.index('G')])
    y = int(s[s.index('G')+1:s.index('C')])
    z = int(s[s.index('C')+1:])
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)
def gcd_of_all(df_i):
    gcd = df_i[xtrain.columns[0]]
    for col in xtrain.columns[1:]:
        gcd = np.gcd(gcd, df_i[col])
    return gcd
itrain = pd.DataFrame({col: ((xtrain[col] + bias_of(col)) * 1000000).round().astype(int) for col in list(xtrain.columns)})
itest = pd.DataFrame({col: ((xtest[col] + bias_of(col)) * 1000000).round().astype(int) for col in list(xtest.columns)})
ytrain['gcd'], ytest['gcd'] = gcd_of_all(itrain), gcd_of_all(itest)

# Subset GCD to either 1 or 10:
xtest_sel, ytest_sel = xtest.loc[ytest['gcd'].isin([1,10])], ytest.loc[ytest['gcd'].isin([1,10])]
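# dtype=float keeps the zero-initialised probability frames derived from this one in floating point:
ytest_ohe = pd.get_dummies(ytest_sel['target'], dtype=float)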

# Re-label classes through Manhattan metric and uniform weights:
ytest_proba_knn = 0*ytest_ohe.copy()
ytest_int = 0*ytest_sel['target_num'].copy()
model = KNeighborsClassifier(n_jobs=-1,n_neighbors=20,weights='uniform',metric='manhattan')
ytest_proba_knn += model.fit(xtest_sel,ytest_sel.target_num).predict_proba(xtest_sel)
ytest_int += np.argmax(ytest_proba_knn.values, axis=1)
display(pd.crosstab(ytest_sel.target_num,ytest_int))


# Convert to target name, check class distributions and save:
ysub = ytest.copy()
ysub.loc[ytest_int.index,'target_num'] = ytest_int
ysub['target'] = ysub['target_num'].map(ytoname)
display(ysub.target.value_counts(normalize=True))
ysub[['target']].to_csv('submission_pseudo_knn.csv')
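
As a toy illustration of why the GCD is informative, the sketch below assumes that each row is a histogram of integer read counts, divided by the number of reads and shifted by the bias (this construction, the counts and the four bins are assumptions made for illustration). It also uses `np.gcd.reduce` as a compact alternative to the explicit loop in `gcd_of_all`:

```python
import numpy as np
from math import factorial

def multinomial_bias(w, x, y, z):
    # Same quantity that bias_of() computes from a column name.
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)

# Hypothetical integer read counts for two samples over 4 histogram bins.
counts = np.array([
    [30, 51, 0, 19],
    [40001, 24999, 30003, 4997],
])
reads = counts.sum(axis=1, keepdims=True)  # 100 and 100000 reads

# Stand-in biases for the 4 illustrative bins.
bias = np.array([multinomial_bias(10, 0, 0, 0), multinomial_bias(9, 1, 0, 0),
                 multinomial_bias(8, 1, 1, 0), multinomial_bias(7, 1, 1, 1)])

# Assumed feature construction: normalised counts minus the bias.
features = counts / reads - bias

# Undo it as in the cell above: add the bias back and scale to integers.
ints = np.round((features + bias) * 1_000_000).astype(np.int64)

# np.gcd is a ufunc, so reduce() folds it across the columns of each row.
row_gcd = np.gcd.reduce(ints, axis=1)
print(row_gcd)
```

Under these assumptions, and when the counts in a row share no common factor, the GCD equals 1,000,000 divided by the number of reads, so a small GCD corresponds to a sample built from many reads.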

# GCD postprocessing with pseudolabel FFNN

I am curious about whether the sampling bias of the training set is an inherent property of this data or whether a new variable could be added to measure the "mutation level" of each observation relative to some reference point. Perhaps then it would be possible for a neural network to learn how the data evolves as a function of the mutations... But as it is, I found it better to fit the model (a simple feed-forward network) entirely on the testing set.

In [None]:
# Load Tensorflow libraries:
import os,random
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras import layers as L
from tensorflow.keras.callbacks import ReduceLROnPlateau,ModelCheckpoint,EarlyStopping
from tensorflow_addons.layers import WeightNormalization
from sklearn.model_selection import StratifiedKFold
def do_seed_tf(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

# A simple feed-forward network with both batch and weight normalization:
def model_builder():
    in_num = L.Input(shape=(xtrain.shape[1],))
    out_num = L.BatchNormalization()(in_num)
    out_num = L.Dropout(0.1)(out_num)
    out_num = WeightNormalization(L.Dense(688,activation='relu',kernel_initializer="HeUniform"))(out_num)
    out_num = L.BatchNormalization()(out_num)
    out_num = L.Dropout(0.1)(out_num)
    out_num = WeightNormalization(L.Dense(688,activation='swish'))(out_num)
    out_num = L.BatchNormalization()(out_num)
    out_num = L.Dropout(0.1)(out_num)
    out = WeightNormalization(L.Dense(10,activation='softmax'))(out_num)
    model = tf.keras.Model(inputs=in_num, outputs=out)
    model.compile(
        metrics=['accuracy'], loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.01),
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001)
    )
    return model

# Repeated stratified K-fold on the pseudolabeled test subset;
# the out-of-fold probabilities are averaged over the seeds:
seeds, folds = 5, 10
ytest_proba_ffnn = 0*ytest_ohe.copy()
ytest_int = 0*ytest_sel['target_num'].copy()
for seed in range(seeds):
    skf = StratifiedKFold(n_splits=folds,random_state=seed,shuffle=True)
    for fold, (idt,idv) in enumerate(skf.split(ytest_sel,ytest_sel['target_num'])):
        print('\r',seed,fold,end='\t')
        K.clear_session()
        do_seed_tf(seed)  # seed before building so that the weight initialization is reproducible
        model = model_builder()
        history = model.fit(
            xtest_sel.iloc[idt], ytest_ohe.iloc[idt],
            validation_data=(xtest_sel.iloc[idv], ytest_ohe.iloc[idv]),
            batch_size=256, epochs=100, verbose=0,
            callbacks=[
                ReduceLROnPlateau(monitor='val_loss',mode='min',
                    verbose=0,factor=0.5,patience=4),
                EarlyStopping(monitor='val_accuracy',mode='max',restore_best_weights=True,
                    verbose=0,min_delta=1e-5,patience=12),
                ModelCheckpoint(f'tmp.hdf5',monitor='val_accuracy',mode='max',
                    verbose=0,save_best_only=True,save_weights_only=True),
            ]
        )
        model.load_weights(f'tmp.hdf5')
        ytest_proba_ffnn.iloc[idv] += model.predict(xtest_sel.iloc[idv]) / seeds

# Predictions:
ytest_int += np.argmax(ytest_proba_ffnn.values, axis=1)
display(pd.crosstab(ytest_sel.target_num,ytest_int))

# Save result:
ysub = ytest.copy()
ysub.loc[ytest_int.index,'target_num'] = ytest_int
ysub['target'] = ysub['target_num'].map(ytoname)
display(ysub.target.value_counts(normalize=True))
ysub[['target']].to_csv('submission_pseudo_ffnn.csv')
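
The accumulation in the loop above divides by `seeds` only, not by `seeds * folds`, because every observation falls into a validation fold exactly once per seed. A minimal NumPy-only sketch of this bookkeeping follows; it uses a plain (non-stratified) random split and a uniform stand-in for `model.predict()`, both of which are simplifications for illustration:

```python
import numpy as np

n_obs, n_classes, seeds, folds = 12, 3, 2, 4
oof = np.zeros((n_obs, n_classes))   # out-of-fold probabilities
hits = np.zeros(n_obs, dtype=int)    # how often each row was validated

for seed in range(seeds):
    # A fresh random partition of the rows for every seed.
    order = np.random.default_rng(seed).permutation(n_obs)
    for idv in np.array_split(order, folds):
        # Stand-in for model.predict(): a uniform distribution over the classes.
        fold_pred = np.full((len(idv), n_classes), 1.0 / n_classes)
        oof[idv] += fold_pred / seeds
        hits[idv] += 1

print(hits)             # every row is validated once per seed
print(oof.sum(axis=1))  # so the averaged rows still sum to one
```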

The last step was to average the GCD postprocessing probabilities:

In [None]:
# Ensemble pseudo KNN and pseudo FFNN:
ytest_proba_ensemble = 0*ytest_ohe.copy()
ytest_int = 0*ytest_sel['target_num'].copy()
ytest_proba_ensemble += 0.5*ytest_proba_knn.values + 0.5*ytest_proba_ffnn.values
ytest_int += np.argmax(ytest_proba_ensemble.values, axis=1)
display(pd.crosstab(ytest_sel.target_num,ytest_int))

# Save result:
ysub = ytest.copy()
ysub.loc[ytest_int.index,'target_num'] = ytest_int
ysub['target'] = ysub['target_num'].map(ytoname)
display(ysub.target.value_counts(normalize=True))
ysub[['target']].to_csv('submission.csv')

Thank you for reading!