<h1 style="background-color:SteelBlue; color:white" >-> Content:</h1>
Hi all, this notebook covers the most important concepts of deep learning on structured (tabular) data. 

This project is still a work in progress. Feel free to leave a comment to suggest further improvements.

## 0. [Prerequisites](#sec0)

## 1. [Tensor Handling](#sec1)
* Rank And Size
* Reshaping Tensors
* Dtypes
* Tensor Broadcasting
* Use Tensors On GPUs

## 2. [Data Loading](#sec2)
* Custom Datasets
* Samplers And Data Loaders
* Demonstration

## 3. [Modules, Linear Layers And FNN](#sec3)
* Linear Layers From Scratch
* Create A Custom FNN
* Model Inspection
* Demonstration

## 4. [Autograd And Training](#sec4)
* Autograd
* Training Neural Networks
* Training Loop And Evaluation
* Perform Training

## 5. [Regularization](#sec5)
* Overfitting And Underfitting
* Vanishing And Exploding Gradients
* Weight Decay
* Dropout

## 6. [Gradient Accumulation](#sec6)

## 7. [(Batch) Normalization](#sec7)

<a id="sec0"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 0. Prerequisites</h1>


1. **python fundamentals:** [This simple & free Kaggle Course](https://www.kaggle.com/learn/python) is already enough!
2. **notebooks & numpy:** [Chapter 1&2 of this free book](https://jakevdp.github.io/PythonDataScienceHandbook/) is probably the best way to learn it! 

From now on I expect you all to be familiar with the concepts used in the named sources. I don't expect any further python skills.

In [None]:
#!pip install tabulate

# all imports we will need
from wand.image import Image as WImage
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import RandomSampler, SequentialSampler, DataLoader
from tabulate import tabulate
from graphviz import Digraph
from torch.autograd import Variable
from torch.optim import Adam
from sklearn.metrics import accuracy_score


<a id="sec1"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 1. Tensor Handling</h1>

There are multiple ways of interpreting tensors. From a python perspective it's best to interpret them as regular arrays.
We can simply create them from existing lists or numpy arrays.

## Rank And Size
To describe tensors we need some terminology. The most important properties of a tensor are its **rank** and its **size**.

Tensor of...

* **rank 0**: a number / scalar
* **rank 1**: an array of rank 0 tensors
* **rank 2**: an array of rank 1 tensors
* **rank 3**: an array of rank 2 tensors
...

The **size** of a tensor describes how many tensors of smaller ranks are stored in it.

In [None]:
array = np.array([[[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [0, 0, 0, 1]],

                  [[4, 3, 2, 1],
                   [8, 7, 6, 5],
                   [1, 0, 0, 0]]
                  ])

# bridge from np to torch
tensor = torch.tensor(array)
# bridge from torch to np
array = tensor.numpy()

print(tensor, "\n\n")
print("size =", tensor.size())
print("\ntype:\n", type(tensor.numpy()))

The **size** of this tensor tells us that it is a tensor of rank 3 consisting of 2 tensors of rank 2, each of which stores 3 tensors of rank 1. Each of these tensors of rank 1 contains 4 tensors of rank 0.
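As a small sketch of the terminology (the variable names are only illustrative), `dim()` returns the rank of a tensor and `size()` lists the length of every axis:

```python
import torch

scalar = torch.tensor(5)                 # rank 0
vector = torch.tensor([1, 2, 3])         # rank 1
matrix = torch.tensor([[1, 2], [3, 4]])  # rank 2
cube = torch.zeros(2, 3, 4)              # rank 3

for t in (scalar, vector, matrix, cube):
    # dim() is the rank, size() lists the length of every axis
    print("rank:", t.dim(), "size:", tuple(t.size()))
```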

## Reshaping Tensors
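We can simply reshape tensors to change their size, as long as the total number of elements stays the same.
Reshaping keeps the elements in row-major order: the values are read from the original tensor with the last axis varying fastest and are written into the new shape in that same order.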

In [None]:
tensor = torch.rand(size=(3, 4, 2))
print("original tensor:\n", tensor)

new_size = [4, 6]
tensor_reshaped = tensor.reshape(new_size)
print("reshaped:\n", tensor_reshaped)
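The element order is easier to follow with consecutive numbers instead of random values. The following sketch (with illustrative variable names) checks that both tensors contain the same sequence of values once flattened, and shows that `-1` lets PyTorch infer one axis:

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)   # values 0..23 in row-major order
r = t.reshape(4, 6)

# flattening either tensor yields the same sequence of values
same_order = torch.equal(t.flatten(), r.flatten())
print(r)
print("same element order:", same_order)

# -1 lets PyTorch infer one axis from the total number of elements
inferred = t.reshape(-1, 8)
print(inferred.size())
```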

## Dtypes
Each tensor stores values of a single type only.
You will see later on that some core functionalities of PyTorch expect tensors of a certain **dtype**.

Standard **dtypes**:
* **int64 aka long**: the default for tensors created from integers only
* **float32 aka float**: the default for tensors created from at least one non-integer

In [None]:
tensor = torch.rand(size=(3, 3))

# change dtype of a tensor
tensor_long = tensor.long()
print("dtype of a long tensor:", tensor_long.dtype)

tensor_float = tensor.float()
print("dtype of a float tensor:", tensor_float.dtype)
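A short sketch of how dtypes are inferred and what a cast does to the values (variable names are illustrative, and the default floating point dtype is assumed to be unchanged):

```python
import torch

ints = torch.tensor([1, 2, 3])        # integers only
floats = torch.tensor([1, 2.5, 3])    # at least one non-integer
print(ints.dtype, floats.dtype)

# mixing dtypes in an operation promotes the result to the more general one
mixed = ints + floats
print(mixed.dtype)

# casting float -> long drops the fractional part
truncated = torch.tensor([0.9, 1.9, -1.9]).long()
print(truncated)
```

This also explains why `tensor.long()` in the cell above only contains zeros: `torch.rand` draws values from the interval [0, 1).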

## Tensor Broadcasting

Similar to numpy arrays, we can add, subtract, multiply ... two tensors. 

The following example shows that tensor operations follow the rules of **numpy broadcasting** as explained in [this chapter](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html) of the book listed in the prerequisites:

If we add a tensor **a of rank 2** to a tensor **b of rank 1**, b is added to each tensor of rank 1 stored in a.

In [None]:
tensor_a = torch.tensor([[1, 2, 3],
                         [4, 5, 6]])

tensor_b = torch.tensor([7, 8, 9])

tensor_a + tensor_b
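To make the rules more tangible, here is a sketch with three cases (all names are illustrative): broadcasting along rows, broadcasting along columns, and a pair of sizes that cannot be broadcast.

```python
import torch

a = torch.ones(2, 3)
b = torch.tensor([10.0, 20.0, 30.0])   # size (3,), treated like (1, 3)
c = torch.tensor([[1.0], [2.0]])       # size (2, 1)

row_wise = a + b   # b is added to every row of a
col_wise = a + c   # c is stretched along the columns of a

try:
    a + torch.ones(2)   # trailing axes of length 3 and 2 are incompatible
    failed = False
except RuntimeError:
    failed = True

print(row_wise)
print(col_wise)
print("incompatible sizes raise an error:", failed)
```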

## Use Tensors On GPUs

One of the most important benefits of tensors is that you can perform tensor operations on the GPU (if you have one).

Both tensors have to be on the same device if you want to perform an operation on them.

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# move them to the gpu (if one is available);
# .to() returns a new tensor, so we have to reassign the result
tensor_a = tensor_a.to(device)
tensor_b = tensor_b.to(device)

tensor_a + tensor_b
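The following sketch (illustrative names, and it also runs on a machine without a GPU) shows the typical round trip: move a tensor to the selected device, compute there, and bring the result back to the CPU before converting it to numpy.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.tensor([1, 2, 3])   # created on the CPU
t_dev = t.to(device)          # copy on the target device (same tensor if it is already there)
print(t.device, t_dev.device)

# compute on the device, then move the result back for numpy
result = (t_dev + t_dev).cpu()
print(result.numpy())
```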

<a id="sec2"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 2. Data Loading</h1>

## Custom Datasets
Each entry of a `Dataset` contains the independent and the dependent variable(s) of one observation.

In [None]:
class CustomDataset(Dataset):
    """
    contains all the mandatory methods
    to load it using a pytorch dataloader
    """

    def __init__(self, data):
        super().__init__()
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        obs = self.data.iloc[item, :]
        x = torch.tensor(obs.drop("label").values).float()
        y = torch.tensor(obs["label"]).long()
        return x, y

## Samplers And Data Loaders
* Samplers define the policy by which observations are drawn from a Dataset (e.g. in random or in sequential order)
* Data Loaders use such a policy to draw batches of data from a Dataset

In [None]:
def create_loader(df: pd.DataFrame, sampler_class, batch_size: int):
    dataset = CustomDataset(data=df)
    sampler = sampler_class(data_source=dataset)
    return DataLoader(dataset=dataset, 
                      sampler=sampler, 
                      batch_size=batch_size)
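To see the difference between the two samplers without loading a real file, here is a self-contained sketch. `ToyDataset` and `toy_df` are hypothetical stand-ins that mirror `CustomDataset` on a five-row DataFrame; they are not part of the notebook's pipeline.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, SequentialSampler, RandomSampler


class ToyDataset(Dataset):
    """Hypothetical stand-in for CustomDataset, backed by a tiny DataFrame."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        obs = self.data.iloc[item, :]
        x = torch.tensor(obs.drop("label").values.astype("float32"))
        y = torch.tensor(int(obs["label"]))
        return x, y


toy_df = pd.DataFrame({"label": [0, 1, 0, 1, 0],
                       "f1": [0.0, 1.0, 2.0, 3.0, 4.0],
                       "f2": [5.0, 6.0, 7.0, 8.0, 9.0]})
toy_ds = ToyDataset(toy_df)

seq_loader = DataLoader(toy_ds, sampler=SequentialSampler(toy_ds), batch_size=2)
rnd_loader = DataLoader(toy_ds, sampler=RandomSampler(toy_ds), batch_size=2)

# collect the first feature of every observation in drawing order
seq_f1 = torch.cat([x[:, 0] for x, _ in seq_loader])
rnd_f1 = torch.cat([x[:, 0] for x, _ in rnd_loader])
first_x, first_y = next(iter(seq_loader))

print("sequential order:", seq_f1.tolist())
print("random order:    ", rnd_f1.tolist())
print("batch sizes:", tuple(first_x.size()), tuple(first_y.size()))
```

Both loaders visit every observation exactly once per pass; only the order differs. The last batch contains a single observation because 5 is not divisible by the batch size of 2.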

## Demonstration

In [None]:
def get_demo_batch():
    # parameters
    batch_size = 16

    # read data
    demo_df = pd.read_csv("../input/mnist-in-csv/mnist_train.csv")

    # create loader
    demo_loader = create_loader(df=demo_df,
                                sampler_class=RandomSampler,
                                batch_size=batch_size)

    # draw first batch from loader
    demo_x_batch, demo_y_batch = next(iter(demo_loader))
    return demo_x_batch, demo_y_batch

In [None]:
def demo_dataloaders():
    demo_x_batch, demo_y_batch = get_demo_batch()
    print("each batch is stored in a list.")
    print("\nthe first entry is of type", type(demo_x_batch))
    print("and has a size of", demo_x_batch.size())
    print("\nthe second entry is of type", type(demo_y_batch))
    print("and has a size of", demo_y_batch.size())


demo_dataloaders()

<a id="sec3"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 3. Modules, Linear Layers And FNN</h1>

## Linear Layers From Scratch

PyTorch combines tensors and tensor operations into modules. These modules are building blocks which can be used to construct neural networks. They can contain single layers or large neural networks. 
Probably the simplest module is the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layer.

Given some input $x$, the linear layer computes its output $y$ via $y=xA^T+b$. Both the weight matrix $A$ and the bias $b$ are adapted during the training process so that the output $y$ becomes as close as possible to the ground truth.

In [None]:
class CustomLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        A = torch.randn(size=(out_features, in_features))
        self.weight = nn.Parameter(data=A, requires_grad=True)
        b = torch.randn(size=(out_features,))
        self.bias = nn.Parameter(data=b, requires_grad=True) 

    def forward(self, x):
        return x @ self.weight.t() + self.bias

The `CustomLinear` module is a simplified implementation of the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) module.
The fundamental functionalities are the same.
Note that this layer is nothing but a combination of tensors with `requires_grad=True` and tensor operations.
Since `A` and `b` are wrapped in `nn.Parameter`, they are iteratively altered during the training process to improve the performance of the module.
Later on we will see how this works in detail.
I renamed `A` and `b` to `weight` and `bias`, respectively, to match the naming of the original linear layer.
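As a quick sanity check of the formula $y=xA^T+b$, the following sketch applies it by hand to the parameters of a built-in `nn.Linear` layer and compares the result with the layer's own output. The layer sizes and variable names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
builtin = nn.Linear(in_features=4, out_features=3)
x = torch.randn(5, 4)   # a batch of 5 observations with 4 features each

# the formula from above, written out by hand
manual = x @ builtin.weight.t() + builtin.bias

matches = torch.allclose(manual, builtin(x), atol=1e-6)
print("output size:", tuple(manual.size()), "matches nn.Linear:", matches)
```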

## Create A Custom FNN
Note: The first linear layer is builtin, the last linear layer is our custom module

As we can see, the whole neural network is based on the module-abstraction as well.

In [None]:
class ExampleFNN(nn.Module):
    def __init__(self, num_feats, num_classes):
        super(ExampleFNN, self).__init__()

        # hidden layer 1
        self.linear1 = nn.Linear(in_features=num_feats, 
                                 out_features=256)
        self.relu1 = nn.ReLU()

        # output layer 
        self.linear2 = CustomLinear(in_features=256, 
                                    out_features=num_classes)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu1(x)
        return self.linear2(x)

## Model Inspection
Let's investigate our model a little. To do so, we can either use a predefined model summarizer like [torch-summary](https://pypi.org/project/torch-summary/) or write our own summarizer from scratch. The latter provides some further insights into handling the model.

Let's investigate how the model parameters are initialized:

In [None]:
def show_weights(module):
    if (type(module) == nn.Linear) or (type(module) == CustomLinear):
        print(module)
        print(type(module))
        print(module.weight)
        print("\n")

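Note that we can use the `apply` method to iterate over the whole model: it calls the given function on every submodule and finally on the model itself. We can exploit this functionality to select certain module types and transform them. For example, we can perform our own parameter initialization before training the model.

Because `apply` visits the children first, the last module passed to the function is the whole model.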
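A minimal sketch of such a custom initialization, assuming a small `nn.Sequential` network instead of the notebook's model. The function name `init_linear` and the choice of Xavier initialization are only examples.

```python
import torch.nn as nn


def init_linear(module):
    # only touch linear layers; every other module type is left as it is
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)


net = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# record the order in which apply visits the modules
visited = []
net.apply(lambda m: visited.append(type(m).__name__))
print(visited)

# re-initialize all linear layers in place
net.apply(init_linear)
print(net[0].bias)
```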

In [None]:
def summary(module, x):
    """
    Iterates over all inner module children,
    calculates the number of parameters per child
    and prints a summary table.

    :param module: module to summarize
    :param x: demo module input
    """
    print_list = []
    total_params = 0

    # iterate over all inner module children
    for child in module.named_children():
        x = child[1](x)
        param_string = ""
        child_params = 0
        for param in child[1].named_parameters():
            shape = list(param[1].size())
            params = 1
            for ax in shape:
                params *= ax
            child_params += params
            param_string = param_string+f"'{param[0]}'"+" shape: "+str(shape)+" "
        total_params += child_params
        print_list.append([child[0], list(x.size()), param_string, child_params])

    print(f"Using a Batch Size of {x.size(0)}:\n")
    headers = ["Name", "Out Shape", "Weights", "Trainable Parameters"]
    print(tabulate(print_list, headers=headers))
    print("\nTrainable Model Parameters:", total_params)

Visualize the model as a computational graph:

In [None]:
# source: https://gist.github.com/wangg12/f11258583ffcc4728eb71adc0f38e832
def make_dot(var, params=None):
    if params is not None:
        # dict views are not subscriptable in Python 3
        assert isinstance(next(iter(params.values())), Variable)
        param_map = {id(v): k for k, v in params.items()}

    node_attr = dict(style="filled", 
                     shape="box", 
                     align="left", 
                     fontsize="12", 
                     ranksep="0.1", 
                     height="0.2")
    dot = Digraph(node_attr=node_attr, graph_attr=dict(size="12,12"))
    seen = set()

    def size_to_str(size):
        return "(" + (", ").join(["%d" % v for v in size]) + ")"

    def add_nodes(var):
        if var not in seen:
            if torch.is_tensor(var):
                dot.node(str(id(var)), 
                         size_to_str(var.size()), 
                         fillcolor="orange")
                dot.edge(str(id(var.grad_fn)), str(id(var)))
                var = var.grad_fn
            if hasattr(var, "variable"):
                u = var.variable
                name = param_map[id(u)] if params is not None else ""
                node_name = "%s\n %s" % (name, size_to_str(u.size()))
                dot.node(str(id(var)), 
                         node_name, 
                         fillcolor="lightblue")
            else:
                dot.node(str(id(var)), str(type(var).__name__))
            seen.add(var)
            if hasattr(var, "next_functions"):
                for u in var.next_functions:
                    if u[0] is not None:
                        dot.edge(str(id(u[0])), str(id(var)))
                        add_nodes(u[0])
            if hasattr(var, "saved_tensors"):
                for t in var.saved_tensors:
                    dot.edge(str(id(t)), str(id(var)))
                    add_nodes(t)

    add_nodes(var)
    return dot

## Demonstration

In [None]:
def get_demo_model():
    return ExampleFNN(num_feats=784, num_classes=10)

In [None]:
def demo_fnn():
    demo_model = get_demo_model()

    demo_x_batch, _ = get_demo_batch()
    demo_pred = demo_model(demo_x_batch)

    # show weight initialization
    demo_model.apply(show_weights)

    # model summary
    summary(module=demo_model, x=demo_x_batch)

    # create computational graph
    graph = make_dot(demo_pred)
    graph.view()
    img = WImage(filename='../input/computational-graph/Digraph.gv (2).pdf')
    return img


demo_fnn()

Each layer $\ell$ of an FNN has $M_{\ell-1} \cdot M_{\ell}$ weights plus $M_{\ell}$ bias terms, where $M_{\ell}$ is the number of nodes (i.e. the output size) of layer $\ell$ and $M_0$ is the number of input features.
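The parameter count of the demo architecture (784 inputs, 256 hidden nodes, 10 outputs) can be checked against this formula. The sketch below rebuilds the same layer sizes with built-in layers only; `layer_sizes` is an illustrative helper variable.

```python
import torch.nn as nn

layer_sizes = [784, 256, 10]   # input features, hidden nodes, output nodes

# weights plus biases per layer, summed over all layers
by_formula = sum(m_prev * m + m
                 for m_prev, m in zip(layer_sizes[:-1], layer_sizes[1:]))

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
by_counting = sum(p.numel() for p in net.parameters())

print(by_formula, by_counting)
```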

The computational graph displays:

* orange = model output of size `batch_size x n_classes` (leaf of the forward pass tree, root of the backward pass tree)
* blue = trainable model parameters (root of the forward pass tree, leaf of the backward pass tree)

Note: the edges of the backward pass point in the opposite direction to those of the forward pass. This turns the roots into leaves, and vice versa. Thus, when we say that the gradient is calculated wrt the leaves, we actually refer to the model parameters.

<a id="sec4"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 4. Autograd And Training</h1>

## Autograd

We can describe every tensor operation as a function. PyTorch Autograd allows us to calculate the gradient of such a function with respect to each individual input. Let me give you an example. If you are not familiar with matrix calculus, you can always refer to [this](https://en.wikipedia.org/wiki/Matrix_calculus):

$$f:\mathbb{R}^2\rightarrow \mathbb{R}, f(x) = x^Tx+1 = x_1^2+x_2^2 +1$$

This function has the following partial derivatives:
$$\frac{\partial f(x)}{\partial x_1}=2x_1$$

$$\frac{\partial f(x)}{\partial x_2}=2x_2$$

So we would obtain at $x=\left(\begin{array}{c} 2 \\ 3 \end{array}\right)$:


$$f(x)=2^2+3^2+1=4+9+1=14$$

$$\frac{\partial f(x)}{\partial x_1}=4$$

$$\frac{\partial f(x)}{\partial x_2}=6$$

As we can see, PyTorch autograd provides the same results:

In [None]:
def f(x):
    return x.t() @ x + 1


x = torch.tensor([[2],
                  [3]], dtype=torch.float32)

# enable autograd for that tensor
# underscore in the end denotes inplace operations
x.requires_grad_()

y = f(x)

print(y)

y.backward()  # populate gradients
x.grad

This way, we can automatically calculate the gradient of a function (e.g. a loss function) with respect to all parameters, even if the computation of the function contains loops and conditionals.
Let's exploit this for training a neural network!
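The claim about loops and conditionals can be illustrated with a toy function. `g` is a made-up example: depending on the current value it either doubles or increments its argument, and autograd differentiates along the path that was actually executed.

```python
import torch


def g(x):
    y = x
    for _ in range(3):
        if y < 10:
            y = y * 2   # branch taken while the value is small
        else:
            y = y + 1
    return y


x = torch.tensor(2.0, requires_grad=True)
y = g(x)        # 2 -> 4 -> 8 -> 16, so on this path y = 8 * x
y.backward()

print(y.item(), x.grad.item())
```

For `x = 2` the doubling branch is taken three times, so the derivative is 8.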

## Training Neural Networks

One update step is performed as follows:
1. fetch a batch
2. perform the forward pass
3. calculate the loss
4. calculate the gradient of the loss wrt all model parameters
5. perform a variant of gradient descent to update the model parameters
6. clear the gradient to perform further updates

Let's perform such a step to get an idea of its mechanics!

In [None]:
def demo_update_step():
    # select a model to train
    demo_model = get_demo_model()

    # set the model into training mode if not done yet
    demo_model.train()

    # an elaborate variant of gradient descent
    optimizer = Adam(demo_model.parameters())

    # a loss function to determine the "goodness" of the model
    loss_func = nn.CrossEntropyLoss()

    # fetch a batch
    x, y = get_demo_batch()

    # forward (the model outputs raw scores, so-called logits)
    logits = demo_model(x)

    # calculate loss (CrossEntropyLoss applies the softmax internally)
    loss = loss_func(logits, y)

    print("weight gradient before backward:\n", 
          demo_model.linear2.weight.grad)
    print("x gradient before backward:\n", 
          x.grad)

    # calculate gradient wrt all model parameters
    loss.backward()

    # gradient descent to update the model parameters
    optimizer.step()

    print("\nweight gradient after backward:\n", 
          demo_model.linear2.weight.grad)
    print("x gradient after backward:\n", 
          x.grad)

    # clear gradient
    optimizer.zero_grad()

    print("\nweight gradient after clearing:\n", 
          demo_model.linear2.weight.grad)
    print("x gradient after clearing:\n", 
          x.grad)


demo_update_step()

As we can see, `loss.backward()` populates the gradient tensors of the parameters, not of the input.
Moreover, we can see that we can clear them manually using `optimizer.zero_grad()` (recent PyTorch versions reset them to `None` by default, older ones set them to 0). Otherwise the gradients of successive backward passes would be added together (which has some benefits in more complex scenarios we will talk about later on).
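The summing behavior can be observed on a single scalar parameter. This is a minimal sketch without a model or an optimizer; `w` is an illustrative tensor and the gradient is cleared by hand.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()    # d(3w)/dw = 3

(3 * w).backward()       # no clearing in between
second = w.grad.item()   # 3 + 3 = 6, the gradients were summed

w.grad = None            # clear the gradient manually
(3 * w).backward()
third = w.grad.item()    # back to 3

print(first, second, third)
```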

## Training Loop And Evaluation

In [None]:
def calculate_loss(batch, model, loss_func, device):
    x_batch = batch[0].to(device)
    y_batch = batch[1].to(device)

    model.train()  # training mode  
    out = model(x_batch)  # forward = make predictions
    loss = loss_func(out, y_batch)  # calculate error
    return loss


def train(batch, model, optimizer, loss_func, device):
    # clear the gradient of the previous step (the optimizer is
    # passed in here instead of being read from a global variable)
    optimizer.zero_grad()
    loss = calculate_loss(batch=batch, 
                          model=model, 
                          loss_func=loss_func, 
                          device=device)
    loss.backward()  # populate gradient wrt model parameters
    optimizer.step()  # update model parameters
    return loss.detach()  # detached, so no graph is kept alive


def train_loop(model, optimizer, loss_func, train_loader, device, n_loss_prints):
    # at least 1, so short loaders do not cause a division by zero
    print_interval = max(1, len(train_loader) // n_loss_prints)
    loss_summed = 0.0
    for i, batch in enumerate(train_loader):
        loss = train(batch=batch,
                     model=model,
                     optimizer=optimizer,
                     loss_func=loss_func,
                     device=device)
        loss_summed += float(loss)  # plain number, no autograd history
        # print the mean loss of the last print_interval iterations
        if (i + 1) % print_interval == 0:
            print(f"iteration {i+1}/{len(train_loader)}, loss:{loss_summed/print_interval}")
            loss_summed = 0.0


def predict(model, sequential_loader, device):
    model.eval()
    y_proba = []
    y_true = []
    for batch in sequential_loader:
        x_batch = batch[0].to(device)
        y_batch = batch[1].to(device)
        with torch.no_grad():
            out = model(x_batch)
            y_proba.append(out)
            y_true.append(y_batch)
    y_proba_tensor = torch.cat(y_proba)
    y_true_tensor = torch.cat(y_true).cpu().numpy()
    y_pred = np.argmax(y_proba_tensor.cpu().numpy(), axis=1)
    return [y_pred, y_true_tensor]
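Note that the model outputs raw scores (logits) rather than probabilities. For predicting the class this makes no difference, as the following sketch with made-up logits shows: the softmax turns each row into probabilities but does not change which entry is the largest.

```python
import torch

# made-up logits for 2 observations and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

probas = torch.softmax(logits, dim=1)   # each row now sums to 1

pred_from_logits = logits.argmax(dim=1)
pred_from_probas = probas.argmax(dim=1)

print(probas.sum(dim=1))
print(pred_from_logits, pred_from_probas)
```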

<a id="sec5"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 5. Regularization</h1>

## Overfitting And Underfitting
todo

## Vanishing And Exploding Gradients
todo

## Weight Decay
* weight decay (L2 regularization) is a regularization technique that penalizes large weights
* in every update step each decayed weight is pulled towards 0, so the weights stay small and extreme values become unlikely
* the model is less sensitive to individual inputs and focuses more on underlying patterns instead of noise
* smaller weights can also help to keep the gradients in a reasonable range, which lowers the risk of exploding gradients (more on that later)
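To isolate the effect of the decay term, the following sketch uses plain `SGD` instead of Adam (a deliberate simplification, because the SGD update is easy to verify by hand) and a loss whose gradient is zero. The only thing that moves the weights is then the decay itself.

```python
import torch
from torch.optim import SGD

w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = SGD([w], lr=0.1, weight_decay=0.5)

# the data loss has a zero gradient, so only the decay acts on w
loss = (0.0 * w).sum()
loss.backward()
optimizer.step()

# update rule: w - lr * (grad + weight_decay * w) = w * (1 - 0.1 * 0.5)
print(w.data)
```

Both weights shrink by 5% towards 0, regardless of their sign.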

Let's perform training with some weight decay:

In [None]:
# parameters
total_epochs = 2
batch_size = 16

# create loaders
train_df = pd.read_csv("../input/mnist-in-csv/mnist_train.csv")
val_df = pd.read_csv("../input/mnist-in-csv/mnist_test.csv")
train_sampler = RandomSampler
eval_sampler = SequentialSampler
train_loader = create_loader(df=train_df,
                             sampler_class=train_sampler,
                             batch_size=batch_size)
train_loader_eval = create_loader(df=train_df,
                                  sampler_class=eval_sampler,
                                  batch_size=batch_size)
val_loader_eval = create_loader(df=val_df,
                                sampler_class=eval_sampler,
                                batch_size=batch_size)

In [None]:
model = ExampleFNN(num_feats=784, num_classes=10).to(device)

# access all parameters
# perform weight decay on selected ones
optimizer = Adam([{"params": model.linear1.bias},
                  {"params": model.linear2.bias},
                  {"params": model.linear1.weight, "weight_decay": 1},
                  {"params": model.linear2.weight, "weight_decay": 1}], lr=0.0001)
loss_func = nn.CrossEntropyLoss()

# train
for epoch in range(1, total_epochs + 1):
    print(f"Epoch {epoch} / {total_epochs}:")
    train_loop(model=model,
               optimizer=optimizer,
               loss_func=loss_func,
               train_loader=train_loader,
               device=device,
               n_loss_prints=10)
    y_pred_train, y_true_train = predict(model=model,
                                         sequential_loader=train_loader_eval,
                                         device=device)

    y_pred_val, y_true_val = predict(model=model,
                                       sequential_loader=val_loader_eval,
                                       device=device)
    acc_train = accuracy_score(y_true=y_true_train, y_pred=y_pred_train)
    acc_val = accuracy_score(y_true=y_true_val, y_pred=y_pred_val)
    print("Train Accuracy:", acc_train)
    print("Validation Accuracy:", acc_val)
    print("\n")

## Dropout
todo

<a id="sec6"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 6. Gradient Accumulation</h1>
Gradient accumulation can be beneficial for the following reasons:

* The optimal batch size depends on both data and algorithm. 

* Too small batch sizes lead to less stable learning behavior and give more importance to outliers. The gradient calculated per batch is just an estimate of the gradient over the whole dataset.

* GPU memory is limited, thus we might be forced to use batch sizes smaller than optimal.

* GPU memory consumption mostly depends on the `model complexity` (#parameters) and the `data size` (e.g. large images vs. small images).

* We can reduce the part of the memory consumption that depends on the data size by using gradient accumulation.

* `Gradient Accumulation` sums up the gradients of several consecutive batches before performing a single update step, which emulates pseudo batches that are larger than the actual batches.
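That accumulated gradients really emulate a larger batch can be verified numerically. The sketch below uses a small stand-in model and random data (all names and sizes are illustrative). It compares the gradient of one pass over 8 observations with the gradient accumulated over 4 micro batches of 2 observations, where each micro batch loss is divided by the number of accumulation steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
loss_func = nn.CrossEntropyLoss()
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))

# reference: one backward pass over the full batch of 8
model.zero_grad()
loss_func(model(x), y).backward()
full_grad = model.weight.grad.clone()

# accumulation: 4 micro batches of 2, each loss scaled by 1/4
model.zero_grad()
steps = 4
for x_part, y_part in zip(x.chunk(steps), y.chunk(steps)):
    (loss_func(model(x_part), y_part) / steps).backward()
accumulated_grad = model.weight.grad.clone()

equal = torch.allclose(full_grad, accumulated_grad, atol=1e-6)
print("gradients match:", equal)
```

Calling `backward()` after every micro batch matters for memory: the computational graph of a micro batch can be freed before the next one is processed.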

In [None]:
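def train_loop_ga(model, optimizer, loss_func, train_loader, device, accumulation):
    model.train()  # training mode
    optimizer.zero_grad()  # start with a clean gradient
    loss_accumulated = 0.0
    for i, batch in enumerate(train_loader):
        x_batch = batch[0].to(device)
        y_batch = batch[1].to(device)
        # scale the loss so that the summed gradient is the mean over the pseudo batch
        loss = loss_func(model(x_batch), y_batch) / accumulation
        # backward per batch: gradients add up in .grad and the graph
        # of this batch is freed right away, which keeps memory low
        loss.backward()
        loss_accumulated += loss.item()
        if ((i+1)%accumulation==0):
            optimizer.step()
            optimizer.zero_grad() 
            print(f"iteration {i+1}/{len(train_loader)}, loss:{loss_accumulated}")
            loss_accumulated = 0.0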

In [None]:
model = ExampleFNN(num_feats=784, num_classes=10).to(device)
optimizer = Adam([{"params": model.linear1.bias},
                  {"params": model.linear2.bias},
                  {"params": model.linear1.weight, "weight_decay": 50},
                  {"params": model.linear2.weight, "weight_decay": 50}], lr=0.0001)
loss_func = nn.CrossEntropyLoss()

# train
for epoch in range(1, total_epochs + 1):
    print(f"Epoch {epoch} / {total_epochs}:")
    train_loop_ga(model=model,
                  optimizer=optimizer,
                  loss_func=loss_func,
                  train_loader=train_loader,
                  device=device,
                  accumulation=150)
    y_pred_train, y_true_train = predict(model=model,
                                         sequential_loader=train_loader_eval,
                                         device=device)

    y_pred_val, y_true_val = predict(model=model,
                                       sequential_loader=val_loader_eval,
                                       device=device)
    acc_train = accuracy_score(y_true=y_true_train, y_pred=y_pred_train)
    acc_val = accuracy_score(y_true=y_true_val, y_pred=y_pred_val)
    print("Train Accuracy:", acc_train)
    print("Validation Accuracy:", acc_val)
    print("\n")

As we can see, a larger batch size (aka a bigger accumulation range) doesn't automatically result in a better training behavior. 

<a id="sec7"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 7. (Batch) Normalization</h1>
todo

Helpful Videos and Blogs:
* [Elliot Waite: Autograd](https://www.youtube.com/watch?v=MswxJw-8PvE&t=75s)

This Project is still in progress =)

<div class="alert alert-danger" role="alert">
    <h3>Feel free to <span style="color:red">comment</span> if you have any suggestions   |   motivate me with an <span style="color:red">upvote</span> if you like this project.</h3>
</div>