![](https://drive.google.com/uc?id=16xvFPMC5llfX_AAM4cwzdS13g5hgZCBJ)

[Image Source](https://en.wikipedia.org/wiki/Pan-genome)

The Tabular Playground Series - Feb 2022 competition is about classifying 10 different bacteria species using data from a genomic analysis technique that has some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give the histogram of base counts. In other words, the DNA segment `ATATGGCCTT` becomes `A2T4G2C2`.
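To make the compression step concrete, here is a minimal sketch of how a 10-mer could be reduced to its base-count label. The helper name `kmer_to_histogram` is hypothetical and only illustrates the naming scheme used for the dataset's columns.

```python
from collections import Counter

def kmer_to_histogram(kmer: str) -> str:
    """Collapse a DNA snippet into its base-count label, e.g. 'A2T4G2C2'."""
    counts = Counter(kmer.upper())
    # The column names in this dataset list the bases in the order A, T, G, C.
    return "".join(f"{base}{counts.get(base, 0)}" for base in "ATGC")

print(kmer_to_histogram("ATATGGCCTT"))  # A2T4G2C2
```

The order of the bases is discarded, so different snippets with the same base counts map to the same label. This is the lossy part of the measurement.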

# **<span style="color:#e76f51;">Goal</span>**
 
The goal is to predict bacteria species based on repeated lossy measurements of DNA snippets.

# **<span style="color:#e76f51;">Data</span>**

**Training Data**

Each row of data contains a spectrum of histograms generated by repeated measurements of a sample, each row containing the output of all 286 histogram possibilities (e.g.,  to ), which then has a bias spectrum (of totally random ATGC) subtracted from the results.
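The figure of 286 can be checked by enumerating every way to split 10 bases among A, T, G and C. The sketch below also computes a multinomial probability for each histogram under uniformly random bases, which is one plausible reading of the "bias spectrum". That interpretation and the helper name `random_bias` are assumptions, not something stated in this notebook.

```python
from itertools import product
from math import factorial

K = 10  # snippet length

# Every (a, t, g, c) with a + t + g + c == K is one possible histogram.
histograms = [
    (a, t, g, K - a - t - g)
    for a, t, g in product(range(K + 1), repeat=3)
    if a + t + g <= K
]
print(len(histograms))  # 286

def random_bias(a, t, g, c):
    """Probability of this histogram if each base is drawn uniformly from ATGC (assumed model)."""
    ways = factorial(K) // (factorial(a) * factorial(t) * factorial(g) * factorial(c))
    return ways / 4 ** K

# Map each column-style name to its probability under the random model.
bias = {f"A{a}T{t}G{g}C{c}": random_bias(a, t, g, c) for a, t, g, c in histograms}
```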

The data (both train and test) also contains simulated measurement errors (of varying rates) for many of the samples, which makes the problem more challenging.

> - ```train.csv``` -  the training set, which contains the spectrum of 10-mer histograms for each sample
> - ```test.csv``` -  the test set; your task is to predict the bacteria species (target) for each row_id
> - ```sample_submission.csv``` - a sample submission file in the correct format

# **<span style="color:#e76f51;">Metric</span>**

Classification accuracy is a metric that summarizes the performance of a classification model as the number of correct predictions divided by the total number of predictions. [Source](https://developers.google.com/machine-learning/crash-course/classification/accuracy)
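As a quick illustration of the definition, the sketch below computes accuracy by hand. The helper name `accuracy` and the toy labels are made up for the example and are not predictions from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration only.
y_true = ["Escherichia_coli", "Salmonella_enterica", "Enterococcus_hirae", "Escherichia_coli"]
y_pred = ["Escherichia_coli", "Escherichia_coli", "Enterococcus_hirae", "Escherichia_coli"]
print(accuracy(y_true, y_pred))  # 0.75
```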


<img src="https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

> I will be integrating W&B for visualizations and logging artifacts!
> 
> [TPS Feb 2022 Project on W&B Dashboard](https://wandb.ai/usharengaraju/TabTransformer)
> 
> - To get the API key, create an account on the [website](https://wandb.ai/site).
> - Use secrets to use API keys more securely.

In [None]:
pip install -U tensorflow-addons

In [None]:
import math
import numpy as np
import pandas as pd
import wandb
import warnings

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_addons as tfa

import matplotlib.pyplot as plt
import seaborn as sns

#ignore warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    # The secret must be stored under the label "api_key" in Add-ons -> Secrets.
    secret_value_0 = user_secrets.get_secret("api_key")
    wandb.login(key=secret_value_0)
    anony = None
except Exception:
    # Fall back to an anonymous W&B run when no secret is available.
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as api_key. \nGet your W&B access token from here: https://wandb.ai/authorize')
    
CONFIG = dict(competition = 'TabTransformer',_wandb_kernel = 'tensorgirl')

Based on this [notebook](https://www.kaggle.com/odins0n/tps-feb-22-eda-modelling) by Sanskar Hasija, I have used only the top 9 features according to the LGBM feature importance attribute.

In [None]:
train = pd.read_csv("../input/tabular-playground-series-feb-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-feb-2022/test.csv")
submission = pd.read_csv("../input/tabular-playground-series-feb-2022/sample_submission.csv")

all_features = [
    "A0T1G3C6",
    "A0T0G6C4",
    "A0T0G5C5",
    "A0T0G4C6",
    "A0T1G2C7",
    "A0T0G7C3",
    "A0T1G0C9",
    "A0T0G3C7",
    "A0T0G8C2",
    "A0T0G10C0",
    "A0T10G0C0",
    "target"
    ]

NUMERIC_FEATURE_NAMES = [
    "A0T1G3C6",
    "A0T0G6C4",
    "A0T0G5C5",
    "A0T0G4C6",
    "A0T1G2C7",
    "A0T0G7C3",
    "A0T1G0C9",
    "A0T0G3C7",
    "A0T0G8C2",
    ]

TARGET_FEATURE_NAME  = "target"
TARGET_LABELS = ["Bacteroides_fragilis","Streptococcus_pyogenes","Streptococcus_pneumoniae","Campylobacter_jejuni",        
"Salmonella_enterica",  
"Escherichia_coli",           
"Enterococcus_hirae",          
"Escherichia_fergusonii",      
"Staphylococcus_aureus",       
"Klebsiella_pneumoniae"]

# A dictionary of the categorical features and their vocabulary.
# StringLookup expects string tokens, so the values are cast to str.
CATEGORICAL_FEATURES_WITH_VOCABULARY = {
    "A0T0G10C0": sorted(train["A0T0G10C0"].astype(str).unique()),
    "A0T10G0C0": sorted(train["A0T10G0C0"].astype(str).unique())}

# A list of the categorical feature names.
CATEGORICAL_FEATURE_NAMES = list(CATEGORICAL_FEATURES_WITH_VOCABULARY.keys())
# A list of all the input features.
FEATURE_NAMES = NUMERIC_FEATURE_NAMES + CATEGORICAL_FEATURE_NAMES
# A list of column default values for each feature.

# **<span style="color:#e76f51;">Exploratory Data Analysis</span>**



In [None]:
# basic stats of features
train.describe().style.background_gradient(cmap="Pastel1")

## **<span style="color:#e76f51;">Null Values</span>**

📌 There are no null values in the dataset

In [None]:
plt.figure(figsize = (25,9))
sns.heatmap(train.isna().values, cmap = ["#2a9d8f","#ff355d"], xticklabels=train.columns)
plt.title("Missing values in training Data", size=20);

## **<span style="color:#e76f51;">Numerical Feature Distribution</span>**

In [None]:
fig, ax = plt.subplots(3,3, figsize=(18, 18))
for i, feature in enumerate(NUMERIC_FEATURE_NAMES):
    sns.distplot(train[feature], color = "#ff355d", ax=ax[math.floor(i/3),i%3]).set_title(f'{feature} Distribution')
fig.show()

## **<span style="color:#e76f51;">Categorical Feature Distribution</span>**

In [None]:
def countplot_features(df_train, feature, title):
    '''Plots the count of each distinct value in one column of the dataframe.'''

    plt.figure(figsize = (25, 9))

    # Pass the column explicitly as x so the call does not depend on positional-argument order.
    sns.countplot(x=df_train[feature], color = '#ff355d')

    plt.title(title, fontsize=15)
    plt.xticks(rotation=45)
    plt.show()

# plot distributions of categorical features
for feature in CATEGORICAL_FEATURE_NAMES:
    countplot_features(train, feature=feature, title = "Frequency of "+ feature)

## **<span style="color:#e76f51;">Target Distribution</span>**

In [None]:
#code copied from https://www.kaggle.com/hamzaghanmi/welcome-tps-feb-2022

pie, ax = plt.subplots(figsize=[18,8])
train.groupby('target').size().plot(kind='pie',autopct='%.2f',ax=ax,title='Target distribution' , cmap = "Pastel1")

## **<span style="color:#e76f51;">Correlation of Features</span>**

In [None]:
plt.figure(figsize=(25, 9))
# Correlate only the numeric input columns; the string target column has no numeric correlation.
sns.heatmap(train[FEATURE_NAMES].corr(), annot=True, cmap="Pastel1");

# **<span style="color:#e76f51;">Preprocessing</span>**

In [None]:
train, val = np.split(train.sample(frac=1), [int(0.8*len(train))])
train = train[all_features]
val = val[all_features]
train = train.dropna()
test = test.dropna()

In [None]:
train_data_file = "train_data.csv"
test_data_file = "test_data.csv"

train.to_csv(train_data_file, index=False, header=False)
val.to_csv(test_data_file, index=False, header=False)

# **<span style="color:#e76f51;">W & B Artifacts</span>**

An artifact is a versioned folder of data. Entire datasets can be stored directly as artifacts.

W&B Artifacts are used for dataset versioning and model versioning. They are also used for tracking dependencies and results across machine learning pipelines. Artifact references can be used to point to data in other systems like S3, GCP, or your own system.

You can learn more about W&B artifacts [here](https://docs.wandb.ai/guides/artifacts)

![](https://drive.google.com/uc?id=1JYSaIMXuEVBheP15xxuaex-32yzxgglV)

In [None]:
# Save train data to W&B Artifacts
train.to_csv("train_wandb.csv", index = False)
run = wandb.init(project='TabTransformer', name='training_data', anonymous=anony,config=CONFIG) 
artifact = wandb.Artifact(name='training_data',type='dataset')
artifact.add_file("./train_wandb.csv")

wandb.log_artifact(artifact)
wandb.finish()

In [None]:
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.0001
DROPOUT_RATE = 0.2
BATCH_SIZE = 265
NUM_EPOCHS = 15

NUM_TRANSFORMER_BLOCKS = 3  # Number of transformer blocks.
NUM_HEADS = 4  # Number of attention heads.
EMBEDDING_DIMS = 16  # Embedding dimensions of the categorical features.
MLP_HIDDEN_UNITS_FACTORS = [
    2,
    1,
]  # MLP hidden layer units, as factors of the number of inputs.
NUM_MLP_BLOCKS = 2  # Number of MLP blocks in the baseline model.



## **<span style="color:#e76f51;">🎯tf.data</span>**

The tf.data API is used for building efficient input pipelines that can handle large amounts of data and perform complex data transformations. It also has provisions for handling different data formats.


## **<span style="color:#e76f51;">🎯tf.data.Dataset</span>**

`tf.data.Dataset` is an abstraction introduced by the tf.data API. It represents a sequence of elements, where each element has one or more components. For example, in a tabular data pipeline, an element might be a single training example, with a pair of tensor components representing the input features and its label.

A `tf.data.Dataset` can be created in two distinct ways:

- A data source constructs a dataset from data stored in memory or in one or more files.
- A data transformation constructs a dataset from one or more `tf.data.Dataset` objects.
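As a small illustration of these two routes, the sketch below first builds a dataset from in-memory data and then derives a second dataset from it with `map` and `batch`. The feature name `x` and the toy values are made up for the example.

```python
import tensorflow as tf

# Toy in-memory data: one feature column and integer labels.
features = {"x": [1.0, 2.0, 3.0, 4.0]}
labels = [0, 1, 0, 1]

# Route 1: a data source builds the dataset from data held in memory.
source_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Route 2: data transformations build a new dataset from an existing one.
transformed_ds = source_ds.map(
    lambda f, y: ({"x": f["x"] * 2.0}, y)
).batch(2)

# Each element is a (features, label) pair of tensors.
first_features, first_labels = next(iter(transformed_ds))
print(first_features["x"].numpy(), first_labels.numpy())
```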

In [None]:
def get_dataset_from_csv(csv_file_path, batch_size=128, shuffle=False):
    dataset = tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=all_features,
        label_name=TARGET_FEATURE_NAME,
        num_epochs=1,
        header=False,
        shuffle=shuffle,
    ).map(prepare_example, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    return dataset.cache()

## **<span style="color:#e76f51;">🎯Categorical Feature Transformation</span>**


In [None]:
target_label_lookup = layers.StringLookup(
    vocabulary=TARGET_LABELS, mask_token=None, num_oov_indices=0
)


def prepare_example(features, target):
    target_index = target_label_lookup(target)
    
    return features, target_index


In [None]:
def run_experiment(
    model,
    train_data_file,
    test_data_file,
    num_epochs,
    learning_rate,
    weight_decay,
    batch_size,
):

    optimizer = tfa.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    )

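    # The target is an integer class index over 10 species, so this uses the
    # sparse categorical loss and assumes the model outputs 10 class probabilities.
    model.compile(
        optimizer=optimizer,
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")],
    )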

    train_dataset = get_dataset_from_csv(train_data_file, batch_size, shuffle=True)
    validation_dataset = get_dataset_from_csv(test_data_file, batch_size)

    print("Start training the model...")
    history = model.fit(
        train_dataset, epochs=num_epochs, validation_data=validation_dataset
    )
    print("Model training finished")

    _, accuracy = model.evaluate(validation_dataset, verbose=0)

    print(f"Validation accuracy: {round(accuracy * 100, 2)}%")

    return history



## **<span style="color:#e76f51;">🎯Model Inputs</span>**

In [None]:
def create_model_inputs():
    inputs = {}
    for feature_name in FEATURE_NAMES:
        if feature_name in NUMERIC_FEATURE_NAMES:
            inputs[feature_name] = layers.Input(
                name=feature_name, shape=(), dtype=tf.float32
            )
        else:
            inputs[feature_name] = layers.Input(
                name=feature_name, shape=(), dtype=tf.string
            )
    return inputs

## **<span style="color:#e76f51;">Feature representation using Keras Preprocessing Layers</span>**

Feature representation can be one of the crucial aspects of model development workflows. It is an experimental process and there is no perfect solution. Keras preprocessing layers help us create more flexible preprocessing pipelines, where new data transformations can be applied while changing the model architecture.

![](https://drive.google.com/uc?id=1248y8JYTwjnxZnIEaTQHr1xV5jUZotLm)

[ImageSource](https://blog.tensorflow.org/2021/11/an-introduction-to-keras-preprocessing.html)

## **<span style="color:#e76f51;">Keras Preprocessing Layers - Numerical Features</span>**

The Keras preprocessing layers available for numerical features are listed below.

`tf.keras.layers.Normalization`: performs feature-wise normalization of input features.
  
`tf.keras.layers.Discretization`: turns continuous numerical features into integer categorical features.

`adapt():`

`adapt()` is an optional utility function that sets the internal state of a layer from input data. It is available on all stateful preprocessing layers. For a `Normalization` layer, for example, it computes the mean and variance of the data and stores them as the layer's weights. `adapt()` is called before `fit()`, `evaluate()` or `predict()`.
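A minimal sketch of the two numerical layers on toy data follows. The values and the bin boundary are illustrative assumptions, not taken from the competition data. `adapt()` lets `Normalization` learn the mean and variance, while `Discretization` is given its bin boundary directly.

```python
import numpy as np
import tensorflow as tf

# Toy data: four samples of a single numerical feature.
data = np.array([[1.0], [2.0], [3.0], [4.0]], dtype="float32")

# Normalization learns the feature-wise mean and variance during adapt().
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(data)
normalized = normalizer(data).numpy()  # roughly zero mean, unit variance

# Discretization maps each value to the index of the bin it falls into.
discretizer = tf.keras.layers.Discretization(bin_boundaries=[2.5])
bins = discretizer(data).numpy()

print(normalized.ravel(), bins.ravel())
```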


## **<span style="color:#e76f51;">Keras Preprocessing Layers - Categorical Features</span>**

The Keras preprocessing layers available for categorical features are listed below.

`tf.keras.layers.CategoryEncoding`: turns integer categorical features into one-hot, multi-hot, or count dense representations.

`tf.keras.layers.Hashing`: performs categorical feature hashing, also known as the "hashing trick".

`tf.keras.layers.StringLookup`: turns string categorical values into an encoded representation that can be read by an Embedding layer or Dense layer.

`tf.keras.layers.IntegerLookup`: turns integer categorical values into an encoded representation that can be read by an Embedding layer or Dense layer.
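The sketch below shows how some of these layers behave on a toy vocabulary. The tokens `low`, `mid` and `high` and the number of hash bins are assumptions chosen for illustration only.

```python
import tensorflow as tf

tokens = tf.constant(["mid", "high", "low"])

# StringLookup maps each string to its position in the vocabulary.
lookup = tf.keras.layers.StringLookup(
    vocabulary=["low", "mid", "high"], mask_token=None, num_oov_indices=0
)
indices = lookup(tokens)

# CategoryEncoding turns the integer indices into one-hot vectors.
one_hot = tf.keras.layers.CategoryEncoding(num_tokens=3, output_mode="one_hot")(indices)

# Hashing needs no vocabulary: each string is hashed into a fixed number of bins.
hashed = tf.keras.layers.Hashing(num_bins=4)(tf.reshape(tokens, (-1, 1)))

print(indices.numpy(), one_hot.numpy(), hashed.numpy().ravel())
```

With hashing, different strings can collide in the same bin, so it trades exactness for not having to store a vocabulary.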

In [None]:
def encode_inputs(inputs, embedding_dims):

    encoded_categorical_feature_list = []
    numerical_feature_list = []

    for feature_name in inputs:
        if feature_name in CATEGORICAL_FEATURE_NAMES:

            # Get the vocabulary of the categorical feature.
            vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]

            # Create a lookup to convert string values to an integer indices.
            # Since we are not using a mask token nor expecting any out of vocabulary
            # (oov) token, we set mask_token to None and  num_oov_indices to 0.
            lookup = layers.StringLookup(
                vocabulary=vocabulary,
                mask_token=None,
                num_oov_indices=0,
                output_mode="int",
            )

            # Convert the string input values into integer indices.
            encoded_feature = lookup(inputs[feature_name])

            # Create an embedding layer with the specified dimensions.
            embedding = layers.Embedding(
                input_dim=len(vocabulary), output_dim=embedding_dims
            )

            # Convert the index values to embedding representations.
            encoded_categorical_feature = embedding(encoded_feature)
            encoded_categorical_feature_list.append(encoded_categorical_feature)

        else:

            # Use the numerical features as-is.
            numerical_feature = tf.expand_dims(inputs[feature_name], -1)
            numerical_feature_list.append(numerical_feature)

    return encoded_categorical_feature_list, numerical_feature_list



## **<span style="color:#e76f51;">Tab Transformer</span>**


[Source](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/structured_data/ipynb/tabtransformer.ipynb)

The TabTransformer architecture works as follows:

📌 All the categorical features are encoded as embeddings, using the same embedding_dims. This means that each value in each categorical feature will have its own embedding vector.

📌 A column embedding, one embedding vector for each categorical feature, is added (point-wise) to the categorical feature embedding.

📌 The embedded categorical features are fed into a stack of Transformer blocks. Each Transformer block consists of a multi-head self-attention layer followed by a feed-forward layer.

📌 The outputs of the final Transformer layer, which are the contextual embeddings of the categorical features, are concatenated with the input numerical features, and fed into a final MLP block.

📌 A softmax classifier is applied at the end of the model.

The [paper](https://arxiv.org/pdf/2012.06678.pdf) discusses both addition and concatenation of the column embedding in the Appendix: Experiment and Model Details section. The architecture of TabTransformer is shown below, as presented in the paper.

<img src="https://raw.githubusercontent.com/keras-team/keras-io/master/examples/structured_data/img/tabtransformer/tabtransformer.png" width="500"/>

In [None]:
def create_mlp(hidden_units, dropout_rate, activation, normalization_layer, name=None):

    mlp_layers = []
    for units in hidden_units:
        mlp_layers.append(normalization_layer)
        mlp_layers.append(layers.Dense(units, activation=activation))
        mlp_layers.append(layers.Dropout(dropout_rate))

    return keras.Sequential(mlp_layers, name=name)

In [None]:
def create_tabtransformer_classifier(
    num_transformer_blocks,
    num_heads,
    embedding_dims,
    mlp_hidden_units_factors,
    dropout_rate,
    use_column_embedding=False,
):

    # Create model inputs.
    inputs = create_model_inputs()
    # encode features.
    encoded_categorical_feature_list, numerical_feature_list = encode_inputs(
        inputs, embedding_dims
    )
    # Stack categorical feature embeddings for the Transformer.
    encoded_categorical_features = tf.stack(encoded_categorical_feature_list, axis=1)
    # Concatenate numerical features.
    numerical_features = layers.concatenate(numerical_feature_list)

    # Add column embedding to categorical feature embeddings.
    if use_column_embedding:
        num_columns = encoded_categorical_features.shape[1]
        column_embedding = layers.Embedding(
            input_dim=num_columns, output_dim=embedding_dims
        )
        column_indices = tf.range(start=0, limit=num_columns, delta=1)
        encoded_categorical_features = encoded_categorical_features + column_embedding(
            column_indices
        )

    # Create multiple layers of the Transformer block.
    for block_idx in range(num_transformer_blocks):
        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embedding_dims,
            dropout=dropout_rate,
            name=f"multihead_attention_{block_idx}",
        )(encoded_categorical_features, encoded_categorical_features)
        # Skip connection 1.
        x = layers.Add(name=f"skip_connection1_{block_idx}")(
            [attention_output, encoded_categorical_features]
        )
        # Layer normalization 1.
        x = layers.LayerNormalization(name=f"layer_norm1_{block_idx}", epsilon=1e-6)(x)
        # Feedforward.
        feedforward_output = create_mlp(
            hidden_units=[embedding_dims],
            dropout_rate=dropout_rate,
            activation=keras.activations.gelu,
            normalization_layer=layers.LayerNormalization(epsilon=1e-6),
            name=f"feedforward_{block_idx}",
        )(x)
        # Skip connection 2.
        x = layers.Add(name=f"skip_connection2_{block_idx}")([feedforward_output, x])
        # Layer normalization 2.
        encoded_categorical_features = layers.LayerNormalization(
            name=f"layer_norm2_{block_idx}", epsilon=1e-6
        )(x)

    # Flatten the "contextualized" embeddings of the categorical features.
    categorical_features = layers.Flatten()(encoded_categorical_features)
    # Apply layer normalization to the numerical features.
    numerical_features = layers.LayerNormalization(epsilon=1e-6)(numerical_features)
    # Prepare the input for the final MLP block.
    features = layers.concatenate([categorical_features, numerical_features])

    # Compute MLP hidden_units.
    mlp_hidden_units = [
        factor * features.shape[-1] for factor in mlp_hidden_units_factors
    ]
    # Create final MLP.
    features = create_mlp(
        hidden_units=mlp_hidden_units,
        dropout_rate=dropout_rate,
        activation=keras.activations.selu,
        normalization_layer=layers.BatchNormalization(),
        name="MLP",
    )(features)

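    # Add a softmax classifier with one unit per target class; this pairs with
    # a sparse categorical cross-entropy loss on integer class labels.
    outputs = layers.Dense(
        units=len(TARGET_LABELS), activation="softmax", name="softmax"
    )(features)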
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

### Acknowledgements :
Google supported this work by providing Google Cloud credit

### References :

https://www.kaggle.com/odins0n/tps-feb-22-eda-modelling#Feature-Engineering

https://www.kaggle.com/hamzaghanmi/welcome-tps-feb-2022

https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/structured_data/ipynb/tabtransformer.ipynb

https://arxiv.org/pdf/2012.06678.pdf

### Work in progress 🚧