# Background
In the exploratory data analysis notebook https://www.kaggle.com/headsortails/explorations-of-action-moa-eda, we saw that the data contains:
* treated and control samples
* high and low treatment doses
* treatment durations of 24h, 48h, and 72h

Gene expression features were randomly sampled and found to be approximately normally distributed.

# Goal
Build a basic neural network using PyTorch, then train, validate, and test it to classify the mechanisms of action (MoA) of each drug.

# Libraries
The code in the cell directly below comes with every Kaggle notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Next I import the libraries that will be used to build the neural network.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as opt
import torch.utils.data as data

import tqdm
import itertools

# Set up the dataset

In [None]:
train_df = pd.read_csv("/kaggle/input/lish-moa/train_features.csv")
print(train_df.shape)
train_df.head()

Let's take a quick look at our categorical variables

In [None]:
train_df.cp_type.value_counts()

In [None]:
train_df.cp_dose.value_counts()

Cool, cool. Now we look at the labels for each row in the training set.

In [None]:
train_labels_df = pd.read_csv("/kaggle/input/lish-moa/train_targets_scored.csv")
print(train_labels_df.shape)
train_labels_df.head()

We can later match the `sig_id` in the rows to get the labels. First, let's set up a validation set to monitor the model during training. We randomly choose 5000 of the 23814 rows.

In [None]:
val_df = train_df.sample(5000, random_state=100) 
print(val_df.shape)
val_df.head()

And why not, let's import the testing set.

In [None]:
test_df = pd.read_csv("/kaggle/input/lish-moa/test_features.csv")
test_df.head()

# Create the Model
Now the code starts to get more complicated as we dive into PyTorch and a feed-forward network!

Here is the reference I used to get acquainted as a beginner:
* https://www.marktechpost.com/2019/06/30/building-a-feedforward-neural-network-using-pytorch-nn-module/

A feed-forward network simply takes input, feeds the input through several layers, and then finally gives the output. 

To make this network, we create a class `FeedForward_Network` which inherits from the class `nn.Module`. An `nn.Module` contains layers, and a method `forward(input)` that returns the output.

In the `__init__` function we define 5 layers that apply linear transformations, as well as a list of dropout rates. Because we have 5 layers, the data passes through 4 hidden representations before we return the output. The four hidden sizes here are 700, 500, 300, and 250. These can be optimized; we use decreasing sizes so that the network progressively compresses the input, a common heuristic that can help the model generalize.

The dropout rates can also be optimized. Dropouts are useful to prevent overfitting.

With each forward pass through the network we run the input through the first layer, apply ReLU (which sets negative activations to zero and gives the network its non-linearity; without it, the stacked linear layers would collapse into a single linear transformation), randomly zero out some of the activations with dropout, repeat these steps with the second, third, and fourth layers, apply the fifth layer, and then pass the result through a sigmoid function to scale the outputs to be between 0 and 1.
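
To make the two activation functions concrete, here is a minimal sketch of what ReLU and the sigmoid do to a small tensor. The values in `x` are hand-picked for illustration and do not come from the competition data.

```python
import torch
import torch.nn.functional as F

# Illustrative activations, including negative values
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

# ReLU: negative values become 0, positive values pass through unchanged
relu_out = F.relu(x)

# Sigmoid: every value is squashed into the open interval (0, 1)
sig_out = torch.sigmoid(x)

print(relu_out)
print(sig_out)
```

ReLU keeps the hidden layers non-linear, while the sigmoid on the last layer turns each raw output into something that can be read as a probability for one target.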

In [None]:
class FeedForward_Network(nn.Module):
    def __init__(self, input_size=100, output_size=100,
                 dropout=(0.25, 0.25, 0.25, 0.25)):
        super().__init__()

        self.layer1 = nn.Linear(input_size, 700)
        self.layer2 = nn.Linear(700, 500)
        self.layer3 = nn.Linear(500, 300)
        self.layer4 = nn.Linear(300, 250)
        self.layer5 = nn.Linear(250, output_size)

        # one dropout rate per hidden layer
        self.dropout = dropout

    def forward(self, x):
        # run linear transformation using layer 1
        output = self.layer1(x)
        # zero out negative activations (non-linearity)
        output = F.relu(output)
        # randomly drop activations, but only in training mode; without
        # training=self.training, dropout would stay on after model.eval()
        output = F.dropout(output, p=self.dropout[0], training=self.training)

        # run linear transformation using layer 2
        output = self.layer2(output)
        # zero out negative activations
        output = F.relu(output)
        # prevent overfitting
        output = F.dropout(output, p=self.dropout[1], training=self.training)

        # run linear transformation using layer 3
        output = self.layer3(output)
        # zero out negative activations
        output = F.relu(output)
        # prevent overfitting
        output = F.dropout(output, p=self.dropout[2], training=self.training)

        # run linear transformation using layer 4
        output = self.layer4(output)
        # zero out negative activations
        output = F.relu(output)
        # prevent overfitting
        output = F.dropout(output, p=self.dropout[3], training=self.training)

        # run linear transformation using layer 5
        output = self.layer5(output)
        # translate output to range between 0-1
        return torch.sigmoid(output)


Now we create a dictionary `data_map` that contains both our training and validation data in tuples. Each tuple contains:
0. the features of the (training or validation) set
1. the targets of the (training or validation) set

In [None]:
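# Split by sig_id, then put each label frame in the same row order as its
# feature frame (val_df was sampled, so its rows are no longer in file order).
labels_by_id = train_labels_df.set_index("sig_id")
train_feats = train_df[~train_df.sig_id.isin(val_df.sig_id)]
data_map = {
    'train':(
        train_feats,
        labels_by_id.loc[train_feats.sig_id].reset_index()
    ),
    'validation':(
        val_df,
        labels_by_id.loc[val_df.sig_id].reset_index()
    )
}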
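
Because features and labels are later paired row by row, their row order has to match. The toy sketch below, using made-up frames (`features`, `labels`) rather than the competition files, shows that filtering labels by membership keeps the file order, while looking them up by `sig_id` follows the order of the sampled rows.

```python
import pandas as pd

# Hypothetical miniature versions of the feature and label tables
features = pd.DataFrame({"sig_id": ["a", "b", "c", "d"],
                         "g0": [0.1, 0.2, 0.3, 0.4]})
labels = pd.DataFrame({"sig_id": ["a", "b", "c", "d"],
                       "target": [0, 1, 0, 1]})

# Sampling returns the rows in a random order
sampled = features.sample(3, random_state=0)

# Membership filter: keeps the original file order of `labels`
filtered = labels[labels.sig_id.isin(sampled.sig_id)]

# Lookup by sig_id: rows come back in the order of `sampled`
aligned = labels.set_index("sig_id").loc[sampled.sig_id].reset_index()

print(sampled.sig_id.tolist())
print(filtered.sig_id.tolist())
print(aligned.sig_id.tolist())
```

Only the `aligned` frame is guaranteed to pair each feature row with its own label row.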

Next we create a dictionary of dataloaders `dataloader_map` using the above `data_map` dictionary. For beginners, note that we use "[dictionary comprehension](https://www.python.org/dev/peps/pep-0274/)" rather than a for-loop to fill the `dataloader_map`.
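
For readers new to the syntax, here is a small sketch comparing a for-loop with the equivalent dictionary comprehension. The `splits` list and the use of `len` as the value are placeholders chosen only for illustration.

```python
splits = ["train", "validation"]

# For-loop version: start with an empty dict and fill it key by key
sizes_loop = {}
for name in splits:
    sizes_loop[name] = len(name)

# Dictionary comprehension: the same result in a single expression
sizes_comp = {name: len(name) for name in splits}

print(sizes_comp)
```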

The `dataloader_map` contains a [DataLoader object](https://pytorch.org/docs/stable/data.html) for both the 'train' and 'validation' set. This is the `DataLoader` constructor signature:
```
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
```
There are two possible dataset types: iterable-style and map-style. Iterable-style datasets are particularly suitable for cases where random reads are either improbable, impossible, or expensive. Since this training data is not being streamed in, we do not need this type. Instead we will use a map-style dataset, which implements `__getitem__()` and `__len__()`.
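
As a rough sketch of what "map-style" means, the hypothetical class `ToyMapDataset` below implements the two required methods by hand. It is not used in this notebook and only illustrates the protocol.

```python
import torch
from torch.utils.data import Dataset

class ToyMapDataset(Dataset):
    """Hypothetical map-style dataset: supports len() and integer indexing."""

    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        # number of samples in the dataset
        return len(self.features)

    def __getitem__(self, index):
        # return one (features, targets) pair
        return self.features[index], self.targets[index]

dataset = ToyMapDataset(
    torch.arange(12.0).reshape(4, 3),      # 4 samples, 3 features each
    torch.tensor([0.0, 1.0, 0.0, 1.0]),    # 4 targets
)
print(len(dataset))
print(dataset[2])
```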

We set the "dataset" in the DataLoader to a TensorDataset object, which wraps a set of tensors. The TensorDataset class takes tensors that have the same number of rows but not necessarily the same number of columns. We pass two FloatTensors into the TensorDataset: one which contains the features of the (training or validation) set and the other which contains the targets.
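
A minimal sketch of this wrapping, with random tensors standing in for the real features and targets (the shapes 5x3 and 5x2 are arbitrary assumptions):

```python
import torch
from torch.utils.data import TensorDataset

toy_features = torch.randn(5, 3)   # 5 rows, 3 feature columns
toy_targets = torch.randn(5, 2)    # 5 rows, 2 target columns

# Both tensors share the first dimension (rows), so they can be wrapped together
toy_dataset = TensorDataset(toy_features, toy_targets)

# Indexing returns one row from each tensor as a tuple
row_features, row_targets = toy_dataset[0]
print(len(toy_dataset), row_features.shape, row_targets.shape)
```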

Because we are using a map-style dataset we can control how the data in our DataLoader is loaded. Batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model's internal parameters are updated. Here we set this to 256, but it can always be optimized.

Every time we iterate over the training dataloader we want to reshuffle the data (shuffle=True), so that the batches differ from epoch to epoch and the model does not learn from the order of the samples. However, we do not need to shuffle the validation data (shuffle=False).
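
The effect of the batch size can be checked on a toy dataset. In this sketch, 10 random samples and a batch size of 4 are arbitrary choices for illustration.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_dataset = TensorDataset(torch.randn(10, 3), torch.randn(10, 2))
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# Collect the number of rows in each batch of one pass over the data
batch_sizes = [features.shape[0] for features, targets in toy_loader]

print(batch_sizes)
print(len(toy_loader) == math.ceil(len(toy_dataset) / 4))
```

With `drop_last=False` (the default), the final batch simply holds the leftover samples, so one pass over the loader still visits every row exactly once.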

In [None]:
#Create the dataloader
column_drops = ["sig_id", "cp_type", "cp_time", "cp_dose"]
dataloader_map = {
    label:data.DataLoader(
        data.TensorDataset(
            torch.FloatTensor(data_map[label][0].drop(column_drops, axis=1).values), 
            torch.FloatTensor(data_map[label][1].drop("sig_id", axis=1).values)
        ), 
        batch_size=256,
        shuffle=(label == 'train')
    )
    for label in data_map
}

Now we can finally define the feed-forward network model using the `FeedForward_Network` class we made above. We use the number of columns in the feature tensor and in the target tensor to set the input and output sizes of the network.

For this competition submissions are scored by the log loss:

$$ \text{score} = - \frac{1}{M}\sum_{m=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \left[ y_{i,m} \log(\hat{y}_{i,m}) + (1 - y_{i,m}) \log(1 - \hat{y}_{i,m})\right] $$

where:

* `N` is the number of sig_id observations in the test data ($i=1,\dots,N$)
* `M` is the number of scored MoA targets ($m=1,\dots,M$)
* $\hat{y}_{i,m}$ is the predicted probability of a positive MoA response for a sig_id
* $y_{i,m}$ is the ground truth, 1 for a positive response, 0 otherwise
* $\log()$ is the natural (base e) logarithm
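
The score formula can be sketched in NumPy as follows. The function name `mean_columnwise_log_loss`, the clipping constant `eps`, and the toy arrays are assumptions made for this illustration; clipping is added only to avoid taking the logarithm of 0.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average the binary log loss over samples (rows), then over targets (columns)."""
    # keep predictions strictly inside (0, 1) so log() stays finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_element = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    # inner mean over the N samples, outer mean over the M targets
    return -per_element.mean(axis=0).mean()

# Toy example: N=3 samples, M=2 targets
y_true = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.1], [0.8, 0.6]])

score = mean_columnwise_log_loss(y_true, y_pred)
print(score)
```

Confident, correct predictions push the score towards 0, while predicting 0.5 everywhere gives $\log 2$.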

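Since each target is a binary label and the network outputs a probability for it through the sigmoid, we set our loss function to binary cross-entropy (`nn.BCELoss`), which computes the same log loss as the competition metric.

In [None]:
model = FeedForward_Network(
    input_size=dataloader_map['train'].dataset[0][0].shape[0],
    output_size=dataloader_map['train'].dataset[0][1].shape[0]
)
# binary cross-entropy on the sigmoid outputs, averaged over all samples and targets
loss_fn = nn.BCELoss()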

We set a few more hyper-parameters. 
* The number of epochs is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset. This can be optimized. 
* lr is the learning rate, the step size used when the optimizer updates the model weights. If this number is too low, training will converge very slowly, but if it is too high, the updates may overshoot the minimum and the loss may fail to decrease.
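
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$, not the Adam optimizer and not the network from this notebook; the three learning rates are picked only to show the slow, reasonable, and overshooting regimes.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; the minimum is at w = 0."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    # distance from the minimum after 20 steps
    print(lr, abs(gradient_descent(lr)))
```

The smallest rate barely moves away from the starting point, the middle one gets close to 0, and the largest one ends up further from the minimum than where it started.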

We can try multiple values for these hyper-parameters by adding them to the list. We put these hyper-parameters into a grid so we can test each of the epoch values with each of the learning-rate values.

In [None]:
#epochs = [50,100, 200]
#lr = [1e-3, 1e-5, 1e-7]
epochs = [50]
lr = [1e-3]
grid = itertools.product(epochs, lr)
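
One detail worth knowing about `itertools.product` is that it returns a one-shot iterator. The sketch below uses example values for the two lists to show that the grid is empty after it has been looped over once, so the cell that builds it has to be re-run before training again.

```python
import itertools

# Example hyper-parameter values for illustration
epochs = [50, 100]
lr = [1e-3, 1e-5]

grid = itertools.product(epochs, lr)

first_pass = list(grid)    # every (epochs, lr) combination
second_pass = list(grid)   # the iterator is already exhausted

print(first_pass)
print(second_pass)
```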

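The code below trains the model for each combination of epoch count and learning rate in the grid. Note that the model is created only once, so if the grid holds more than one combination, each run continues from the weights of the previous one; re-create the model inside the outer loop to compare combinations fairly.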

In [None]:
for epoch_grid, lr_param in grid:
    optimizer = opt.Adam(model.parameters(), lr=lr_param)

    # Run the epochs; epoch 0 only evaluates the untrained model as a baseline
    for epoch in range(epoch_grid+1):

        if epoch > 0:
            print(epoch)
            # switch back to training mode (the evaluation below sets eval mode)
            model.train()
            train_loss = []

            for batch in dataloader_map['train']:
                # reset the gradients from the previous batch
                optimizer.zero_grad()

                # pass data through the model
                prediction = model(batch[0])
                loss = loss_fn(prediction, batch[1])

                # Track the model loss
                train_loss.append(loss.item())
                # backpropagate and update the weights
                loss.backward()
                optimizer.step()

            print(f"train loss: {np.mean(train_loss)}")

        # Set model to evaluation
        model.eval()
        val_loss = []

        # no gradients are needed for validation
        with torch.no_grad():
            for batch in dataloader_map['validation']:
                # Pass data through model
                prediction = model(batch[0])

                # Get the loss
                loss = loss_fn(prediction, batch[1])

                # Track the loss
                val_loss.append(loss.item())

        print(f"val loss: {np.mean(val_loss)}")
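
Calling `model.eval()` only changes layers that look at the module's training flag. The sketch below, on a tensor of ones chosen for illustration, contrasts an `nn.Dropout` layer (which respects the flag) with a bare `F.dropout` call (which defaults to `training=True` unless told otherwise).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.ones(1000)
layer = nn.Dropout(p=0.5)

# Training mode: about half of the values are zeroed, the rest are scaled by 1 / (1 - p)
layer.train()
train_out = layer(x)

# Evaluation mode: the layer passes its input through unchanged
layer.eval()
eval_out = layer(x)

# The functional form has no module state, so it still drops values by default
functional_out = F.dropout(x, p=0.5)

print((train_out == 0).sum().item(), (eval_out == 0).sum().item(),
      (functional_out == 0).sum().item())
```

Passing `training=self.training` to `F.dropout` inside a module's `forward` makes the functional form follow `model.train()` and `model.eval()` as well.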


Once the model has finished training, we can predict the target values for the test set and submit the predictions by writing them to the "submission.csv" file.

In [None]:
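# names of the target columns, in the same order as the model outputs
target_cols = (
    train_labels_df
    .drop("sig_id", axis=1)
    .columns
)

# evaluation mode and no gradient tracking for inference
model.eval()
with torch.no_grad():
    test_pred = model(
        torch.FloatTensor(
            test_df
            .drop(column_drops, axis=1)
            .values
        )
    ).numpy()

submission = pd.DataFrame(test_pred, columns=target_cols)
# put sig_id first, matching the layout of train_targets_scored.csv
submission.insert(0, "sig_id", test_df.sig_id.values)
submission.to_csv("submission.csv", index=False)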