Shalom,  

In this notebook we will do Exploratory Data Analysis (EDA).  
This is the first thing to do when we are given data and a particular problem.  
Data is the tool for solving problems, so let's dive into the data. 

This dataset contains:
- /kaggle/input/tabular-playground-series-aug-2021/sample_submission.csv
- /kaggle/input/tabular-playground-series-aug-2021/train.csv
- /kaggle/input/tabular-playground-series-aug-2021/test.csv

Let's get some global information about the files. We set the row index to `id`.

## Import Data & Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler

df_submission = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/sample_submission.csv")\
# [:10000]# <<< Development mode
df_train = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/train.csv", index_col="id").fillna(0)\
# [:10000]# <<< Development mode
df_test = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/test.csv", index_col="id").fillna(0)\
# [:10000]# <<< Development mode

display(df_submission.head())
display(df_train.head())
display(df_test.head())

In [None]:
# display(
#     df_train.describe().T.style.background_gradient(
#         subset=["mean"], cmap="coolwarm"
#     ).background_gradient(
#         subset=["std"], cmap="inferno"
#     )
# )

# df_train.info()

- a large number of features
- `loss` is the submission (`target`) variable

We move `loss` to a separate variable called `target`.

## Investigate the loss (target)

In [None]:
corr_mat = df_train.corr()

# loss -> target
target = df_train["loss"]
df_train = df_train.drop(['loss'], axis=1)

df_all = pd.concat([df_train,df_test], axis=0, copy=False)

In [None]:
target_cnt = target.value_counts()

plt.figure(figsize=(14, 5))
sns.barplot(x=target_cnt.index, y=target_cnt.values, palette="coolwarm")
plt.title("Target unique values", fontdict={"fontsize":20})

In [None]:
plt.figure(figsize=(25, 6))
corr_mat["loss"][:-1].plot(kind="bar", grid=True)
plt.title("Features correlation to target label", fontdict={"fontsize": 20})

# Investigate features [f0 - f99]
The dataset contains a lot of features, so let's investigate them  
and create some functions to visualize the dataframes.

In [None]:
print(f'target unique values : {target.nunique()}')
for columns in df_train.columns.values.reshape((10, -1)):
    txt = ""
    for col in columns:
        txt += f'{col}: {df_train[col].nunique()},\t'
    print(txt)

In [None]:
df_train[df_train.duplicated()]

The dataset has __250000__ rows, and most of the features have more than __200000__ unique values.  
There is little chance that a prediction is possible from a single feature when nearly every row has a unique value.
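
One way to put a number on this is the share of unique values per column. The sketch below uses a small synthetic frame; the column names and sizes are made up for illustration and are not taken from the competition files.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (an assumption, not the competition files)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "continuous": rng.normal(size=1000),           # almost every value is unique
    "categorical": rng.integers(0, 5, size=1000),  # only a handful of distinct values
})

# fraction of rows that carry a distinct value, per column
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)
```

A ratio close to 1 means the column behaves like a continuous measurement rather than a category.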

In [None]:
# normalize data to [0, 1]
def normalize(df_1, df_2):
    df = pd.concat([df_1, df_2])
    # MinMaxScaler scales every column independently, so one call is enough;
    # rebuilding the frame keeps the original index instead of relying on
    # the index happening to match a fresh RangeIndex
    scaled = MinMaxScaler().fit_transform(df)
    df = pd.DataFrame(scaled, index=df.index, columns=df.columns)

    split = len(df_1)
    return df.iloc[:split, :], df.iloc[split:, :]

# plot dataframe
def plot_features(df_train, df_test, file_name=None, n_loss=3, n_rows=10):
    # optionally save the dataframes
    if file_name is not None:
        df_train.to_pickle(f"{file_name}_train.pkl")
        df_test.to_pickle(f"{file_name}_test.pkl")
    
    train, test = df_train.copy(), df_test.copy()
    qdf_train, qdf_test = normalize(train, test)

    # one figure per loss value: 1, 2, ..., n_loss
    for loss in range(1, n_loss + 1):
        # positions (not index labels) of the first n_rows rows with this loss
        matches = np.flatnonzero(target.values == loss)[:n_rows]
        rows = qdf_train.iloc[matches]

        # display
        fig, ax = plt.subplots(1, 1, figsize=(20, 6))
        for index in range(len(rows)):
            row = rows.iloc[index]
            # label each row so that ax.legend() has something to show
            ax.bar(range(len(row)), row, alpha=0.5, label=f"row {rows.index[index]}")
        ax.set_yticks(range(0, 20, 3))
        ax.margins(0)
        ax.set_title(f'Rows with loss = {loss}', loc='left', fontweight='bold')
        ax.legend()
        plt.show()

In [None]:
plot_features(df_train, df_test)

The X-axis lists all features that are present in a single row.  
The Y-axis represents the value of each feature; each color is a unique row.  
Each plotted graph uses a single label value.  

This way we can see whether there is any pattern in the data that represents the target (loss) value.  
As we can see, there is a lot of noise present in the data.

To reduce the noise and the number of dimensions, we will use:
- Quantile Normalization
- Quantile Binning
- Denoising Autoencoder + PCA

## Quantile Normalization

In [None]:
# Robust scaling first: center on the median and divide by the interquartile range
df_all_median = df_all.median(axis=0)
df_all_25quan = df_all.quantile(0.25, axis=0)
df_all_75quan = df_all.quantile(0.75, axis=0)
df_all = (df_all - df_all_median) / (df_all_75quan - df_all_25quan)

# Quantile Normalization
def normalize_quantile(df):
    # sort every column independently
    df_sorted = pd.DataFrame(
        np.sort(df.values, axis=0), 
        index=df.index, 
        columns=df.columns
    )
    
    # mean across columns at each rank, indexed by rank (1..n)
    df_sorted_mean = df_sorted.mean(axis=1)
    df_sorted_mean.index = np.arange(1, len(df_sorted_mean) + 1)
    
    # replace every value by the mean that belongs to its rank
    qdf = df.rank(axis=0, method="min").stack().astype(int).map(df_sorted_mean).unstack()
    return qdf
    
qdf = normalize_quantile(df_all)
qdf_train = qdf.iloc[:len(df_train), :]
qdf_test = qdf.iloc[len(df_train):, :]

plot_features(qdf_train, qdf_test, file_name="quantile_df")

Comparing the first set of plots with the plots above:
- the area where the bars of all rows overlap has grown
- the high feature values have decreased
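
To make the effect of quantile normalization concrete, here is a minimal sketch on a tiny hand-made frame. The helper `quantile_normalize_small` and the toy values are assumptions for illustration; it ranks with `method="first"` to avoid ties, whereas the notebook's `normalize_quantile` uses `method="min"`.

```python
import numpy as np
import pandas as pd

def quantile_normalize_small(df):
    # sort every column, then average across columns at each rank
    sorted_vals = np.sort(df.to_numpy(dtype=float), axis=0)
    rank_means = sorted_vals.mean(axis=1)
    # zero-based rank of every value within its own column
    ranks = df.rank(axis=0, method="first").astype(int).to_numpy() - 1
    # replace each value by the mean that belongs to its rank
    return pd.DataFrame(rank_means[ranks], index=df.index, columns=df.columns)

toy = pd.DataFrame({
    "a": [5.0, 2.0, 3.0, 4.0],
    "b": [4.0, 1.0, 6.0, 2.0],
    "c": [3.0, 8.0, 7.0, 9.0],
})
toy_qn = quantile_normalize_small(toy)
print(toy_qn)
```

After the transformation every column contains exactly the same set of values, only in a different order, so all features share one distribution.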

## Quantile Binning

In [None]:
# Quantile Binning
def zeros_like(df):
    # zero-filled frame with the same index and columns as df
    return pd.DataFrame(0, index=df.index, columns=df.columns)

def quantile_binning(df):
    df_bin = zeros_like(df)
    for i in range(df_bin.shape[1]):
        # up to 50 equal-frequency bins per column; labels=False returns the bin number
        binning = pd.qcut(df.iloc[:, i], 50, labels=False, duplicates="drop")
        df_bin.iloc[:, i] = binning
        
    return df_bin

bdf = quantile_binning(qdf)    
bdf_train = bdf.iloc[:qdf_train.shape[0], :]
bdf_test = bdf.iloc[qdf_train.shape[0]:, :]

plot_features(bdf_train, bdf_test, file_name="binning_df")

In simple terms, as far as I understand it:  
we divide the values of each feature into quantile bins, so large values shrink more than small values.  
- There is less empty white space, which means all features now produce values in a similar range.
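
A small sketch of what `pd.qcut` does to skewed values. The toy series and the choice of 4 bins are illustrative assumptions; the notebook itself uses 50 bins.

```python
import numpy as np
import pandas as pd

# Heavily skewed toy values (illustrative only): from 1 up to roughly 22026
skewed = pd.Series(np.exp(np.linspace(0, 10, 100)))

# 4 equal-frequency bins; labels=False returns the bin number instead of an interval
bins = pd.qcut(skewed, 4, labels=False, duplicates="drop")

print(bins.value_counts().sort_index())  # 25 values land in each bin
print(skewed.iloc[[75, 99]].round(1).tolist(), "->", bins.iloc[[75, 99]].tolist())
```

The two printed values differ by orders of magnitude, yet both end up in the top bin. This is why the large values appear to shrink the most.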

## Denoising AutoEncoder + PCA
[`Simple Denoise Autoencoder sample.`](https://www.kaggle.com/arenddejong/denoising-autoencoder-dae)  
To teach a denoiser what noise is, we add noise to the original data and train the network to reconstruct the clean data.  
The network consists of 2 parts: the `Encoder` and the `Decoder`.  

We need both of them to train the network. As input we use the Quantile Normalized data.
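
The training setup (noisy input, clean target) can be sketched without a neural network. The example below is only a linear stand-in fitted with least squares, not the Keras model used in this notebook; the toy data, its 2 hidden factors and all names are assumptions. The noise level of 0.1 mirrors the one used for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" data: 10 features driven by only 2 hidden factors (an assumption)
mixing = rng.normal(size=(2, 10))

def make_clean(n):
    return rng.normal(size=(n, 2)) @ mixing

clean_train, clean_valid = make_clean(2000), make_clean(500)

# Corrupt the inputs with Gaussian noise; the targets stay clean
noise_std = 0.1
noisy_train = clean_train + rng.normal(0, noise_std, clean_train.shape)
noisy_valid = clean_valid + rng.normal(0, noise_std, clean_valid.shape)

# Linear "denoiser": weights that map noisy inputs to clean targets (least squares)
weights, *_ = np.linalg.lstsq(noisy_train, clean_train, rcond=None)
denoised_valid = noisy_valid @ weights

mse_noisy = np.mean((noisy_valid - clean_valid) ** 2)
mse_denoised = np.mean((denoised_valid - clean_valid) ** 2)
print(f"MSE before denoising: {mse_noisy:.4f}, after: {mse_denoised:.4f}")
```

Because the clean data lives in a low-dimensional subspace, even this linear map removes part of the noise on held-out rows. The autoencoder follows the same idea with non-linear layers.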





In [None]:
import tensorflow as tf
import tensorflow.keras.backend as K

from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

def custom_loss(y_true, y_pred):
    # mean squared error between the reconstruction and the clean data
    loss = K.mean(K.square(y_pred - y_true))
    return loss

def dae_network():
    # shape must be a tuple: (100,) rather than (100)
    ae_input = layers.Input(shape=(100,))
    ae_encoded = layers.Dense(
        units = 100,
        activation='elu')(ae_input)
    ae_encoded = layers.Dense(
        units = 300,
        activation='elu')(ae_encoded)
    ae_decoded = layers.Dense(
        units = 100,
        activation='elu')(ae_encoded)
    
    # first model: input -> 300-unit encoding; second model: the full autoencoder
    return models.Model(ae_input, ae_encoded), models.Model(ae_input, ae_decoded)

# create training data
df_noisy = qdf + np.random.normal(0, .1, df_all.shape)

split = int(0.8 * len(df_noisy))# split on 80%
xtrain, ytrain = df_noisy.iloc[:split], qdf.iloc[:split]
xvalid, yvalid = df_noisy.iloc[split:], qdf.iloc[split:]

# define callbacks
early_stop = EarlyStopping(
    monitor="val_loss", patience=20, verbose=0, mode="min",
    min_delta=1e-9,
    baseline=None,
    restore_best_weights=True
)
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss", patience=4, verbose=0, mode="min",
    factor=0.8
)

# define network (the first returned model maps the input to the 300-unit encoding)
encoder, autoencoder = dae_network()
autoencoder.compile(loss=custom_loss, optimizer=Adam(learning_rate=5e-3))

# train network
history = autoencoder.fit(
    xtrain, 
    ytrain,
    epochs=200, 
    batch_size=512,
    verbose=0,# 1 = logs network
    validation_data=(xvalid, yvalid),
    callbacks=[early_stop, reduce_lr]
)

In [None]:
# Denoising AutoEncoder: encode the clean data
enp = encoder.predict(qdf)
print(f"max encoded value : {np.max(enp)}")

# Output of the encoder has 300 features;
#  we keep the columns with the highest variance so that we end up with about 100 features.
enp_var = np.var(enp, axis=0, ddof=1)
enp_var1 = np.where(enp_var > 0.8)[0]# 0.8=RELEASE, 0.108=DEVELOPMENT

# chained comparison; the original `>= 95 & ... <= 105` did not test this range
assert 95 <= len(enp_var1) <= 105
print("number of selected columns", len(enp_var1))
columns = [f"col_{i}" for i in range(len(enp_var1))]

edf = pd.DataFrame(enp[:, enp_var1], columns=columns)
edf_train = edf.iloc[:len(df_train),:]
edf_test = edf.iloc[len(df_train):,:]

plot_features(edf_train, edf_test) # excludes PCA

In [None]:
# (Add) PCA to dataframe
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca = PCA(n_components=10)
pnp_all = pca.fit_transform(enp)

scaler = StandardScaler()
pnp_all = scaler.fit_transform(pnp_all)

pnp_train = pnp_all[:len(df_train)]
pnp_test = pnp_all[len(df_train):]

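# add pca
# copy the slices so that new columns can be added safely
edf_train, edf_test = edf_train.copy(), edf_test.copy()
n_cols = edf_train.shape[1]  # number of existing feature columns

for i in range(pnp_train.shape[1]):
    # assign the raw array: a pd.Series would be aligned on its own index
    edf_train[f"col_{n_cols + i}"] = pnp_train[:, i]

for i in range(pnp_test.shape[1]):
    # same column names as in train, filled by position
    edf_test[f"col_{n_cols + i}"] = pnp_test[:, i]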

# PCA is the last 10 bars
plot_features(edf_train, edf_test, file_name="dae_pca_df")

The results look nice: the different colors get closer to each other.  
We add the PCA components of this dataframe to the dataframe itself; the last 10 bars in the plotted graphs are the PCA components.
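
One thing to watch when appending array columns to a slice of a dataframe: pandas aligns a `pd.Series` on its index, while a plain array is assigned by position. The toy frame and column names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; its last two rows mimic a test split (illustrative sizes)
full = pd.DataFrame({"col_0": np.arange(6, dtype=float)})
part_test = full.iloc[4:].copy()  # index labels are 4 and 5

extra = np.array([10.0, 20.0])  # stand-in for one PCA component of the test rows

# A Series carries its own index (0, 1), which does not match the slice's index (4, 5)
part_test["as_series"] = pd.Series(extra)
# A plain array is assigned by position
part_test["as_array"] = extra

print(part_test)
```

The `as_series` column comes out as all `NaN`, while `as_array` holds the intended values.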

In [None]:
# More ideas will be added later....

# ??? Random Trees Embeddings

# How to use this data?
As we can now see, this is a `Regression Problem`, as the data is continuous/sparse.  
Any algorithm that yields a better representation than the current data will be used to produce input data.  

The saved data are the following files in this kernel:
- `quantile_df_train.pkl` & `quantile_df_test.pkl`
- `binning_df_train.pkl` & `binning_df_test.pkl`
- `dae_pca_df_train.pkl` & `dae_pca_df_test.pkl`
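
Files written with `to_pickle` can be read back with `pd.read_pickle`. The sketch below does a round trip in a temporary directory with a toy frame. The file name is hypothetical, and the real path depends on where the output of this kernel is attached.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for one of the saved dataframes (illustrative only)
toy = pd.DataFrame({"col_0": [0.1, 0.2], "col_1": [0.3, 0.4]}, index=[10, 11])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# index, columns and values survive the round trip
print(restored.equals(toy))
```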

We can use this data as input for an MLP network, and hopefully move up a few places.
<!-- start on model: https://www.kaggle.com/arenddejong/tps-aug-nn-torch -->

Suggestions, irritations or improvements?   
Leave a comment, thanks  

Niek Tuytel