
<img src="https://i.insider.com/51f6ca3deab8eac47b000004?width=1200&format=jpeg" width="800">

In this notebook I want to show a proof of concept for a possibly new way of tackling this problem. The starting point is what many other notebooks have already shown, and which I explain further [here](https://www.kaggle.com/vpallares/semi-supervised-learning-extratrees). Basically, as introduced [here](https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense), this dataset contains 4 types of samples. The type represents the resolution at which the DNA sequences were sampled. Two of them, resolutions 1 and 10, are the easiest to classify, especially in the CV. The other two, 1000 and 10000, are quite noisy and therefore more challenging. I think we can help the classifier during the CV by taking advantage of having both high-resolution and low-resolution samples already labelled.

In [None]:
%%capture

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import accuracy_score
from scipy.stats import mode
from math import factorial
import gc
import sys
from tqdm import tqdm

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Model

!pip install scikit-learn-intelex

## 1. Preparing the data

I'm going to load only the training set and apply the same preprocessing that I used [here](https://www.kaggle.com/vpallares/semi-supervised-learning-extratrees).

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
 
    return df

In [None]:
train_df = pd.read_csv("../input/tabular-playground-series-feb-2022/train.csv", index_col=0)
train_df = reduce_mem_usage(train_df)

In [None]:
def bias_of(s):
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T')+1:s.index('G')])
    y = int(s[s.index('G')+1:s.index('C')])
    z = int(s[s.index('C')+1:])
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)

def gcd_of_all(df_i, elements):
    gcd = df_i[elements[0]]
    for col in elements[1:]:
        gcd = np.gcd(gcd, df_i[col])
    return gcd
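
To make the two helpers above concrete, here is a minimal sketch on made-up data. It assumes that each raw value is an integer count divided by 1,000,000 minus the column's bias, so adding the bias back and scaling recovers the counts, and the row-wise GCD of those counts then gives the resolution. The four column names and the toy counts are hypothetical, chosen only for illustration.

```python
from math import factorial

import numpy as np
import pandas as pd


def bias_of(s):
    # Parse the four base counts out of a name such as 'A3T2G4C1'
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T') + 1:s.index('G')])
    y = int(s[s.index('G') + 1:s.index('C')])
    z = int(s[s.index('C') + 1:])
    # Multinomial probability of that composition with four equally likely bases
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)


def gcd_of_all(df_i, elements):
    # Row-wise greatest common divisor across the given columns
    gcd = df_i[elements[0]]
    for col in elements[1:]:
        gcd = np.gcd(gcd, df_i[col])
    return gcd


# Hypothetical toy data: 4 feature names, 3 rows of integer counts
cols = ['A10T0G0C0', 'A0T10G0C0', 'A0T0G10C0', 'A0T0G0C10']
counts = pd.DataFrame(
    [[3, 5, 7, 11], [30, 50, 70, 110], [2000, 5000, 0, 3000]],
    columns=cols,
)

# Simulate the raw format (assumption: value = count / 1e6 - bias)
bias = pd.Series({c: bias_of(c) for c in cols})
raw = counts / 1_000_000 - bias

# Same transformation as in the notebook: add the bias back, scale to integers
recovered = pd.DataFrame(
    {c: ((raw[c] + bias_of(c)) * 1_000_000).round().astype(int) for c in cols}
)
resolution = gcd_of_all(recovered, cols)
print(resolution.tolist())
```

Each toy row was built from multiples of 1, 10 and 1000 respectively, and the GCD picks that factor up again.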

In [None]:
feat = [col for col in train_df.columns if col != 'target']

le = LabelEncoder()
y = train_df['target']
y_le = le.fit_transform(y)

train_int = pd.DataFrame({col: ((train_df[col] + bias_of(col)) * 1000000).round().astype(int) for col in feat})
train_int['res'] = gcd_of_all(train_int, feat)
train_int['target'] = y_le
train_int.head()

Just by looking at the data samples reformatted as integers, two things caught my eye. First, samples 1, 2 and 3 belong to class 6 and are actually very similar: sample 1 has a lower resolution but roughly follows the same pattern as the other two. Second, this reminded me a lot of an image, so I thought, why not plot it as one?

## 2. Plotting the images

In [None]:
del train_df
gc.collect()

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(20,6), sharex=True)
sns.heatmap(train_int[(train_int.target==4) & (train_int.res==1)][feat].head(100), ax=ax[0,0])
sns.heatmap(train_int[(train_int.target==4) & (train_int.res==10)][feat].head(100), ax=ax[0,1])
sns.heatmap(train_int[(train_int.target==4) & (train_int.res==1000)][feat].head(100), ax=ax[1,0])
sns.heatmap(train_int[(train_int.target==4) & (train_int.res==10000)][feat].head(100), ax=ax[1,1])
plt.show()

The four plots above show samples of the same class (target 4), one panel per resolution. The horizontal axis is the feature axis, while the y-axis holds the first 100 rows of that class and resolution in the training data. When I saw this I thought: this looks like a real spectrum, with two high-resolution images and two others with a lot of noise in them.

Then I came up with an idea: maybe we can reconstruct those noisy samples with autoencoders... So, next I'm rearranging the data into a 4-dimensional format (samples, rows, features, channels) in order to train and test an autoencoder for image reconstruction.

Note: I haven't explored these parameters. I just thought that 100 rows was okay since we have 286 features, which is already a long x-axis. Since the images for resolutions 1 and 10 look similar, I decided to mix them in the train set for the autoencoder. Each 2D image is made of 100 rows sampled from the training data for a single class, scaled by its maximum and padded with two zero columns to a width of 288.

In [None]:
n_rows = 100
n_samples = 5000
train_data = np.zeros((n_samples, 100, 288, 1))
labels = []
for k in tqdm(range(n_samples)):
    c = np.random.randint(10)
    if np.random.rand() > 0.5:
        im = train_int[(train_int.target == c) & (train_int.res == 1)][feat].sample(n_rows).values
    else:
        im = train_int[(train_int.target == c) & (train_int.res == 10)][feat].sample(n_rows).values
    train_data[k, :, :, 0] = np.append(im/im.max(), np.zeros((100,2)), axis=1)
    labels.append(c)
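
The image construction in the loop above can be isolated into a small helper. This is only a sketch on random stand-in data: the function name `rows_to_image` and the demo array are hypothetical, and the guard against an all-zero block is an addition that the loop above does not have.

```python
import numpy as np


def rows_to_image(rows, width=288):
    """Scale a block of rows to [0, 1] and zero-pad it to `width` columns."""
    rows = np.asarray(rows, dtype=float)
    peak = rows.max()
    # Guard against an all-zero block (not present in the original loop)
    scaled = rows / peak if peak > 0 else rows
    pad = width - scaled.shape[1]
    # Pad only on the right of the feature axis, then add a channel axis
    padded = np.pad(scaled, ((0, 0), (0, pad)))
    return padded[..., np.newaxis]


# Stand-in for 100 sampled rows with 286 integer features
rng = np.random.default_rng(0)
demo = rng.integers(0, 1000, size=(100, 286))
image = rows_to_image(demo)
print(image.shape)
```

Note that the scaling is per image, so intensities are comparable within one image but not necessarily between images.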

In [None]:
res = 1000
n_rows = 100
n_samples = 5000
lowres_data = np.zeros((n_samples, 100, 288, 1))
lowres_labels = []
for k in tqdm(range(n_samples)):
    c = labels[k]  # reuse the class of train_data[k] so each noisy input is paired with a target of the same class
    im = train_int[(train_int.target == c) & (train_int.res == res)][feat].sample(n_rows).values
    lowres_data[k, :, :, 0] = np.append(im/im.max(), np.zeros((100,2)), axis=1)
    lowres_labels.append(c)

## 3. Building the autoencoder


 <img src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/60bcd0b7b750bae1a953d61d_autoencoder.png" width="500">

As you may know, an autoencoder consists of two blocks, an encoder and a decoder. By fitting a latent representation of the input data, they make it possible to reconstruct an input with noisy or missing data from that latent representation. Here I'm just going to build the encoder-decoder architecture with very standard parameters.

In [None]:
# Named `inputs` to avoid shadowing the Python builtin `input`
inputs = layers.Input(shape=(100, 288, 1))

# Encoder
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)

# Decoder
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)

# Autoencoder
autoencoder = Model(inputs, x)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.summary()
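
A quick way to see why the images are 288 columns wide rather than 286 is to follow the spatial size through the two pooling and two upsampling stages. The sketch below does that with plain arithmetic, without building the model; the helper names are hypothetical. It relies on `padding="same"` pooling rounding the size up and a stride-2 `Conv2DTranspose` with `padding="same"` doubling it.

```python
from math import ceil


def pooled(n, pool=2):
    # MaxPooling2D with padding="same": output size is ceil(n / pool)
    return ceil(n / pool)


def upsampled(n, stride=2):
    # Conv2DTranspose with padding="same": output size is n * stride
    return n * stride


def round_trip(n, stages=2):
    # Size after `stages` poolings followed by `stages` upsamplings
    for _ in range(stages):
        n = pooled(n)
    for _ in range(stages):
        n = upsampled(n)
    return n


for size in (100, 286, 288):
    print(size, "->", round_trip(size))
```

With 286 columns the decoder would come back with 288, so the output could not be compared directly with the input. Sizes divisible by 4, such as 100 and 288, survive the round trip unchanged.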

The next step is to fit the model to the training data, using the same data as target, and check the output.

In [None]:
autoencoder.fit(
    x=train_data,
    y=train_data,
    epochs=10,
    batch_size=128,
    shuffle=True,
    validation_data=(train_data,train_data),
)
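
In the call above the validation data is the training data itself, so the validation loss mostly repeats the training loss. One possible alternative, sketched here under the assumption that holding out 10% of the images is acceptable, is to split the image indices first. The helper `holdout_split` is hypothetical.

```python
import numpy as np


def holdout_split(n_samples, val_fraction=0.1, seed=0):
    """Return shuffled, disjoint (train_idx, val_idx) index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    return idx[n_val:], idx[:n_val]


train_idx, val_idx = holdout_split(5000)
print(len(train_idx), len(val_idx))
```

The index arrays could then be used as `x=train_data[train_idx]` and `validation_data=(train_data[val_idx], train_data[val_idx])`. Since every image is a random draw of rows from the same pool, held-out images still share rows with the training images, so this is only a partial check.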

Now I predict on some samples from the same training data and plot the result. 

In [None]:
X_pred = autoencoder.predict(train_data[:100,:,:,:])

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(14,6), sharex=True)
sns.heatmap(train_data[7,:,:,0], ax=ax[0])
sns.heatmap(X_pred[7,:,:,0], ax=ax[1])
plt.show()

We can see that for this sample the autoencoder is actually learning, and the output even appears to have better resolution. The discontinuities that we see in the original image are due to the noise that we have in each of the four resolution types (check AmbrosM's notebook if you don't know what I mean).

So this means it's working with a very simple architecture. Can we actually reconstruct images with 1000 or 10000 resolution?

## 4. Reconstructing the noisy data

<img src="https://i.ytimg.com/vi/h58lRIVHhGc/maxresdefault.jpg" width="600">

Now we use the images with resolution=1000 as input and the high-resolution training data as target. The idea is that the autoencoder learns to fill the gaps and resolution errors in the input samples.

In [None]:
#callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
autoencoder.fit(
    x=lowres_data,
    y=train_data,
    epochs=30,
    batch_size=128,
    shuffle=True,
    validation_data=(lowres_data, train_data),
)

In [None]:
gc.collect()

Let's see the result on those samples!

In [None]:
lowres_pred = autoencoder.predict(lowres_data)

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(14,8), sharex=True)
sns.heatmap(lowres_data[10,:,:,0], ax=ax[0])
sns.heatmap(lowres_pred[10,:,:,0], ax=ax[1])
plt.show()

This is pretty cool! The autoencoder is learning to reconstruct the noisy input data, and the output is almost as good as one of the high-resolution images. However, a resolution of 1000 isn't that bad. Would it work with the poor-quality images of resolution=10000?
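
To go beyond eyeballing the heatmaps, the reconstruction could also be scored numerically. The sketch below uses a per-image mean squared error on synthetic arrays; the function `per_image_mse` is hypothetical, and `denoised` is a stand-in for a model output, not a result from this notebook.

```python
import numpy as np


def per_image_mse(a, b):
    """Mean squared error per image for arrays shaped (n, rows, cols, channels)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return ((a - b) ** 2).mean(axis=(1, 2, 3))


# Synthetic stand-ins with the same layout as the notebook's image arrays
rng = np.random.default_rng(0)
target = rng.random((8, 100, 288, 1))
noisy = np.clip(target + rng.normal(0, 0.2, target.shape), 0, 1)
denoised = 0.5 * (noisy + target)  # placeholder for a model's prediction

print(per_image_mse(noisy, target).mean())
print(per_image_mse(denoised, target).mean())
```

In the notebook, comparing `per_image_mse(lowres_data, train_data)` with `per_image_mse(lowres_pred, train_data)` would show whether the prediction is closer to the target than the input was. The target images are a separate random draw of rows, so even a very good reconstruction would not reach zero.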

In [None]:
res = 10000
n_rows = 100
n_samples = 1000
lowres_data2 = np.zeros((n_samples, 100, 288, 1))
lowres_labels = []
for k in tqdm(range(n_samples)):
    c = labels[k]  # reuse the class of train_data[k] so each noisy input is paired with a target of the same class
    im = train_int[(train_int.target == c) & (train_int.res == res)][feat].sample(n_rows).values
    lowres_data2[k, :, :, 0] = np.append(im/im.max(), np.zeros((100,2)), axis=1)
    lowres_labels.append(c)

In [None]:
autoencoder.fit(
    x=lowres_data2,
    y=train_data[:1000,:,:,:],
    epochs=30,
    batch_size=128,
    shuffle=True,
    validation_data=(lowres_data2, train_data[:1000,:,:,:]),
)

lowres_pred2 = autoencoder.predict(lowres_data2)

fig, ax = plt.subplots(2, 1, figsize=(14,8), sharex=True)
sns.heatmap(lowres_data2[4,:,:,0], ax=ax[0])
sns.heatmap(lowres_pred2[4,:,:,0], ax=ax[1])
plt.show()

Well, that's not too bad. The reconstructed output looks way better than the input. We'll probably get better results if we train with more samples and for longer.

## 5. But there was a trick...

However, I cheated a bit in this problem. I am using class information so I can group similar samples into a 2D structure, and I obviously don't have that information in the test set. But I still think this idea could be useful for:

1. Training with more different resolution images
2. Samples could be grouped using clustering and then passed to the autoencoder to be reconstructed
3. A similar idea could be used with a VAE to generate more samples for the different resolution types
4. Reconstructing those samples in test with resolution 1 and 10 (which should be easy to cluster) would already remove the intra-class noise that we see in the PCA
5. A sample could be replicated 100 times which would make it an image and then reconstructed
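
Idea 5 from the list above can be sketched without any labels: tile one sample into a 100-row image, and after reconstruction average over the rows to get a single sample back. This is a minimal sketch on random data. The helper names are hypothetical, and the autoencoder call is left out, so the round trip below only checks the reshaping.

```python
import numpy as np


def row_to_image(row, n_rows=100, width=288):
    """Tile one 1D sample into a batch of shape (1, n_rows, width, 1)."""
    row = np.asarray(row, dtype=float)
    peak = row.max()
    if peak > 0:
        row = row / peak  # same max-scaling as the images in this notebook
    padded = np.pad(row, (0, width - row.size))
    image = np.tile(padded, (n_rows, 1))
    return image[np.newaxis, :, :, np.newaxis]


def image_to_row(batch, n_features=286):
    """Drop the padding and average over rows to return to one 1D sample."""
    return batch[0, :, :n_features, 0].mean(axis=0)


rng = np.random.default_rng(0)
sample = rng.integers(0, 1000, size=286)
batch = row_to_image(sample)
restored = image_to_row(batch)
print(batch.shape, restored.shape)
```

In practice `batch` would go through `autoencoder.predict` before `image_to_row`. Whether a model trained on images of 100 different rows generalizes to 100 identical rows is an open question.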

So, that's all for the moment... Thanks for reading!

PS: Of course, you might think of doing the same in 1D: why should we complicate our lives by transforming it into 2D? The first answer is that in 2D you can apply convolutions, which could be useful to build a CNN, for example. But the second and more important reason is that in 2D we can compensate for the intra-class error across all the resolutions (the differences in intensities that we see in some vertical lines). The price, as we saw, is that you need the label to group them as a 2D image.

![gif](https://media.giphy.com/media/IL1sMUfQVRNFC/giphy.gif)