In [2]:
import numpy as np

**Module** is an abstract class that defines the fundamental methods needed to train a neural network. You do not need to change anything here; just read the comments.

In [3]:
class Module(object):
    """
    You can think of a module as a black box
    that processes `input` data and produces `output` data.
    This is like applying a function called `forward`:

        output = module.forward(input)

    The module should also be able to perform a backward pass, i.e. to differentiate the `forward` function.
    Moreover, it should be able to differentiate it when it is part of a chain (chain rule).
    The latter implies that a gradient arrives from the previous step of the chain rule.

        gradInput = module.backward(input, gradOutput)
    """
    def __init__ (self):
        self.output = None
        self.gradInput = None
        self.training = True

    def forward(self, input):
        """
        Takes an input object and computes the corresponding output of the module.
        """
        return self.updateOutput(input)

    def backward(self,input, gradOutput):
        """
        Performs a backpropagation step through the module with respect to the given input.

        This includes
         - computing the gradient w.r.t. `input` (needed for further backprop),
         - computing the gradient w.r.t. the parameters (to update them while optimizing).
        """
        self.updateGradInput(input, gradOutput)
        self.accGradParameters(input, gradOutput)
        return self.gradInput


    def updateOutput(self, input):
        """
        Computes the output using the current parameter set of the class and the input.
        This function returns the result, which is also stored in the `output` field.

        Make sure to both store the data in the `output` field and return it.
        """

        # The easiest case:

        # self.output = input
        # return self.output

        pass

    def updateGradInput(self, input, gradOutput):
        """
        Computes the gradient of the module with respect to its own input.
        The gradient is returned, and the `gradInput` state variable is updated accordingly.

        The shape of `gradInput` is always the same as the shape of `input`.

        Make sure to both store the gradients in the `gradInput` field and return them.
        """

        # The easiest case:

        # self.gradInput = gradOutput
        # return self.gradInput

        pass

    def accGradParameters(self, input, gradOutput):
        """
        Computes the gradient of the module with respect to its own parameters.
        No need to override if the module has no parameters (e.g. ReLU).
        """
        pass

    def zeroGradParameters(self):
        """
        Zeroes the `gradParams` variable if the module has parameters.
        """
        pass

    def getParameters(self):
        """
        Returns a list with the module's parameters.
        If the module has no parameters, returns an empty list.
        """
        return []

    def getGradParameters(self):
        """
        Returns a list with the gradients with respect to the module's parameters.
        If the module has no parameters, returns an empty list.
        """
        return []

    def train(self):
        """
        Sets training mode for the module.
        Training and testing behaviour differs for Dropout and BatchNorm.
        """
        self.training = True

    def evaluate(self):
        """
        Sets evaluation mode for the module.
        Training and testing behaviour differs for Dropout and BatchNorm.
        """
        self.training = False

    def __repr__(self):
        """
        Pretty printing. Should be overridden in every module if you want
        a readable description.
        """
        return "Module"
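
To make the `forward`/`backward` contract concrete, here is a minimal sketch of a parameter-free layer checked against finite differences. The `Tanh` class and the `numeric_grad` helper are hypothetical names introduced only for illustration, and the class is a stand-alone stand-in that mimics the `Module` interface rather than subclassing it.

```python
import numpy as np

class Tanh:
    """Stand-in layer following the Module interface (illustrative, not part of the assignment)."""

    def forward(self, input):
        self.output = np.tanh(input)
        return self.output

    def backward(self, input, gradOutput):
        # Chain rule: dL/dx = dL/dy * dy/dx, with d tanh(x)/dx = 1 - tanh(x)^2
        self.gradInput = gradOutput * (1.0 - np.tanh(input) ** 2)
        return self.gradInput

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
grad_out = rng.normal(size=(3, 4))

layer = Tanh()
layer.forward(x)
analytic = layer.backward(x, grad_out)

# Scalar "loss" whose gradient w.r.t. the layer output is exactly grad_out
numeric = numeric_grad(lambda v: np.sum(np.tanh(v) * grad_out), x)
max_error = np.max(np.abs(analytic - numeric))
```

If `backward` is implemented correctly, `max_error` should be tiny; the same kind of check can be applied to any layer defined later in this notebook.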

# Sequential container

**Define** the forward and backward pass procedures.

In [4]:
class Sequential(Module):
    """
         This class implements a container that processes `input` data sequentially.

         `input` is processed by each module (layer) in self.modules consecutively.
         The resulting array is called `output`.
    """

    def __init__ (self):
        super(Sequential, self).__init__()
        self.modules = []

    def add(self, module):
        """
        Adds a module to the container.
        """
        self.modules.append(module)

    def updateOutput(self, input):
        """
        Basic workflow of the FORWARD PASS:

            y_0    = module[0].forward(input)
            y_1    = module[1].forward(y_0)
            ...
            output = module[n-1].forward(y_{n-2})


        Just write a little loop.
        """

        # Your code goes here. ################################################

        output = input
        self.output_modules = [output]  # Initialize list with input
        for module in self.modules:
            output = module.forward(output)
            self.output_modules.append(output)  # Store intermediate outputs
        self.output = output

        return self.output

    def backward(self, input, gradOutput):
        """
        Workflow of the BACKWARD PASS:

            g_{n-1} = module[n-1].backward(y_{n-2}, gradOutput)
            g_{n-2} = module[n-2].backward(y_{n-3}, g_{n-1})
            ...
            g_1 = module[1].backward(y_0, g_2)
            gradInput = module[0].backward(input, g_1)


        !!!

        Each module must be given the input it saw during the forward pass,
        because that input is used when computing gradients.
        Make sure that the input for the `i`-th layer is the output of `module[i-1]` (exactly the same input as in the forward pass)
        and NOT the `input` to this Sequential module.

        !!!
        """
        # Your code goes here. ################################################
        gradInput = gradOutput
        for module, output in zip(reversed(self.modules), reversed(self.output_modules[:-1])):
            gradInput = module.backward(output, gradInput)
        self.gradInput = gradInput

        return self.gradInput


    def zeroGradParameters(self):
        for module in self.modules:
            module.zeroGradParameters()

    def getParameters(self):
        """
        Should gather all parameters in a list.
        """
        return [x.getParameters() for x in self.modules]

    def getGradParameters(self):
        """
        Should gather all gradients w.r.t. parameters in a list.
        """
        return [x.getGradParameters() for x in self.modules]

    def __repr__(self):
        string = "".join([str(x) + '\n' for x in self.modules])
        return string

    def __getitem__(self,x):
        return self.modules.__getitem__(x)

    def train(self):
        """
        Propagates training mode to all modules.
        """
        self.training = True
        for module in self.modules:
            module.train()

    def evaluate(self):
        """
        Propagates evaluation mode to all modules.
        """
        self.training = False
        for module in self.modules:
            module.evaluate()
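
The bookkeeping behind the backward pass (each layer receives the input it saw during the forward pass) can be sketched with two toy layers. `Scale`, `Square`, `chain_forward`, and `chain_backward` are hypothetical names used only for this illustration; they mirror the logic of `Sequential` without depending on the classes above.

```python
import numpy as np

class Scale:
    def __init__(self, k):
        self.k = k

    def forward(self, x):
        return self.k * x

    def backward(self, x, grad_out):
        return self.k * grad_out

class Square:
    def forward(self, x):
        return x ** 2

    def backward(self, x, grad_out):
        # Needs the input seen in the forward pass: d(x^2)/dx = 2x
        return 2 * x * grad_out

def chain_forward(layers, x):
    inputs = []                 # inputs[i] is what layers[i] received
    for layer in layers:
        inputs.append(x)
        x = layer.forward(x)
    return x, inputs

def chain_backward(layers, inputs, grad_out):
    # Walk the layers in reverse, pairing each with its own forward input
    for layer, x in zip(reversed(layers), reversed(inputs)):
        grad_out = layer.backward(x, grad_out)
    return grad_out

layers = [Scale(3.0), Square()]
x = np.array([1.0, 2.0])
y, inputs = chain_forward(layers, x)                   # y = (3x)^2 = 9x^2
g = chain_backward(layers, inputs, np.ones_like(y))    # dy/dx = 18x
```

Passing the container's original `x` to `Square.backward` instead of `inputs[1]` would give `6x` rather than `18x`, which is exactly the mistake the docstring warns about.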

# Layers

## 1 (0.2). Linear transform layer
Also known as a dense layer, fully-connected layer, FC layer, InnerProductLayer (in Caffe), or affine transform.
- input:   **`batch_size x n_feats1`**
- output: **`batch_size x n_feats2`**

In [None]:
class Linear(Module):
    """
    A module that applies a linear transformation.
    Commonly called a fully-connected layer; InnerProductLayer in Caffe.

    The module should work with 2D input of shape (n_samples, n_features).
    """
    def __init__(self, n_in, n_out):
        super(Linear, self).__init__()

        # This is a nice initialization
        stdv = 1./np.sqrt(n_in)
        self.W = np.random.uniform(-stdv, stdv, size = (n_out, n_in))
        self.b = np.random.uniform(-stdv, stdv, size = n_out)

        self.gradW = np.zeros_like(self.W)
        self.gradb = np.zeros_like(self.b)

    def updateOutput(self, input):
        # Your code goes here. ################################################
        # self.output = ...
        self.output = input @ self.W.T + self.b
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        # self.gradInput = ...
        self.gradInput = gradOutput @ self.W
        return self.gradInput

    def accGradParameters(self, input, gradOutput):
        # Your code goes here. ################################################
        # self.gradW = ... ; self.gradb = ...
        self.gradW = gradOutput.T @ input
        self.gradb = np.sum(gradOutput, axis=0)

    def zeroGradParameters(self):
        self.gradW.fill(0)
        self.gradb.fill(0)

    def getParameters(self):
        return [self.W, self.b]

    def getGradParameters(self):
        return [self.gradW, self.gradb]

    def __repr__(self):
        s = self.W.shape
        q = 'Linear %d -> %d' %(s[1],s[0])
        return q
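
The three gradient formulas used in `Linear` can be verified on random data. The sketch below is stand-alone and assumes a scalar loss of the form `sum(y * grad_out)`, so that `grad_out` plays the role of `gradOutput`; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 5, 4, 3
x = rng.normal(size=(batch_size, n_in))
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=n_out)
grad_out = rng.normal(size=(batch_size, n_out))

# Forward pass: y = x W^T + b
y = x @ W.T + b

# Backward pass for the loss L = sum(y * grad_out)
grad_x = grad_out @ W            # shape (batch_size, n_in)
grad_W = grad_out.T @ x          # shape (n_out, n_in)
grad_b = grad_out.sum(axis=0)    # shape (n_out,)

def loss(W_):
    return np.sum((x @ W_.T + b) * grad_out)

# Finite-difference check of a single weight entry
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric_entry = (loss(W_plus) - loss(W_minus)) / (2 * eps)
```

Each gradient has the same shape as the quantity it differentiates with respect to, which is a quick sanity check before running any numerical comparison.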

## 2. (0.2) SoftMax
- input:   **`batch_size x n_feats`**
- output: **`batch_size x n_feats`**

$\text{softmax}(x)_i = \frac{\exp x_i} {\sum_j \exp x_j}$

Recall that $\text{softmax}(x) = \text{softmax}(x - \text{const})$. This makes it possible to avoid computing exp() of a large argument.
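
A small sketch of why the shift matters in floating point. `softmax_naive` and `softmax_stable` are illustrative helper names, not part of the assignment.

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def softmax_stable(x):
    # Subtracting the row maximum leaves the result unchanged but keeps exp() arguments <= 0
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1000.0, 1001.0, 1002.0]])

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)      # exp(1000) overflows to inf, and inf / inf is nan
stable = softmax_stable(x)        # finite, and each row sums to 1
```

The stable version returns the same probabilities for `x` and for `x - 1000`, as the identity above predicts.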

In [None]:
class SoftMax(Module):
    def __init__(self):
         super(SoftMax, self).__init__()

    def updateOutput(self, input):
        # start with normalization for numerical stability
        shifted_input = np.subtract(input, input.max(axis=1, keepdims=True))

        # Your code goes here. ################################################
        exp = np.exp(shifted_input)
        self.output = exp / np.sum(exp, axis=1, keepdims=True)
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        shifted_input = np.subtract(input, input.max(axis=1, keepdims=True))
        exp = np.exp(shifted_input)
        softmax = exp / np.sum(exp, axis=1, keepdims=True)

        # Jacobian-vector product with J = diag(s) - s s^T, computed row-wise
        # without building the Jacobian: J g = s * (g - <g, s>)
        self.gradInput = softmax * (gradOutput - np.sum(gradOutput * softmax, axis=1, keepdims=True))

        return self.gradInput

    def __repr__(self):
        return "SoftMax"

## 3. (0.2) LogSoftMax
- input:   **`batch_size x n_feats`**
- output: **`batch_size x n_feats`**

$\text{logsoftmax}(x)_i = \log\text{softmax}(x)_i = x_i - \log {\sum_j \exp x_j}$

The main purpose of this layer is to be used when computing the log-likelihood loss.
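
As a hedged illustration of why a dedicated layer is useful, the sketch below compares taking `log` of a softmax output with computing log-softmax directly, and then evaluates a negative log-likelihood. The helper names and the `targets` array are assumptions made for this example.

```python
import numpy as np

def log_softmax(x):
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[0.0, -1000.0]])
targets = np.array([1])  # index of the true class for each row (hypothetical labels)

with np.errstate(divide="ignore"):
    via_log_of_softmax = np.log(softmax(x))   # softmax underflows to 0, so the log is -inf
direct = log_softmax(x)                        # stays finite

# Negative log-likelihood of the true classes, averaged over the batch
nll = -direct[np.arange(len(targets)), targets].mean()
```

With the direct formula the loss stays finite even for a badly misclassified sample, so its gradient remains usable.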

In [None]:
class LogSoftMax(Module):
    def __init__(self):
         super(LogSoftMax, self).__init__()

    def updateOutput(self, input):
        # start with normalization for numerical stability
        shifted_input = np.subtract(input, input.max(axis=1, keepdims=True))

        # Your code goes here. ################################################
        exp = np.exp(shifted_input)
        self.output = shifted_input - np.log(np.sum(exp, axis=1, keepdims=True))

        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        # self.output holds the log-softmax from the forward pass, so exp() recovers the softmax
        softmax = np.exp(self.output)
        self.gradInput = gradOutput - softmax * np.sum(gradOutput, axis=1, keepdims=True)

        return self.gradInput

    def __repr__(self):
        return "LogSoftMax"

## 4. (0.3) Batch normalization
One of the most significant recent ideas that impacted NNs a lot is [**Batch normalization**](http://arxiv.org/abs/1502.03167). The idea is simple yet effective: the features should be normalized ($mean = 0$, $std = 1$) all the way through the NN. This improves convergence for deep models, letting you train them in days rather than weeks. **You are** to implement the first part of the layer: feature normalization. The second part (the `ChannelwiseScaling` layer) is implemented below.

- input:   **`batch_size x n_feats`**
- output: **`batch_size x n_feats`**

The layer should work as follows. While training (`self.training == True`) it transforms the input as $$y = \frac{x - \mu}  {\sqrt{\sigma^2 + \epsilon}}$$
where $\mu$ and $\sigma^2$ are the mean and variance of the feature values in the **batch** and $\epsilon$ is a small number for numerical stability. Also during training, the layer should maintain exponential moving averages of the mean and variance:
```
    self.moving_mean = self.moving_mean * alpha + batch_mean * (1 - alpha)
    self.moving_variance = self.moving_variance * alpha + batch_variance * (1 - alpha)
```
During testing (`self.training == False`) the layer normalizes input using moving_mean and moving_variance.

Note that splitting batch normalization into the normalization itself and channelwise scaling is just a common **implementation** choice. In general, "batch normalization" always means normalization + scaling.
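
The training and evaluation behaviour described above can be sketched in a few lines of NumPy. The value of `alpha` and the starting values of the moving statistics (zero mean, unit variance) are assumptions made for this example only.

```python
import numpy as np

EPS = 1e-3     # same role as BatchNormalization.EPS
alpha = 0.9    # hypothetical smoothing factor for the moving averages

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(256, 3))

batch_mean = x.mean(axis=0)
batch_variance = x.var(axis=0)

# Training mode: normalize with the statistics of the current batch
y_train = (x - batch_mean) / np.sqrt(batch_variance + EPS)

# Exponential moving averages, starting here from zero mean and unit variance
moving_mean = np.zeros(3)
moving_variance = np.ones(3)
moving_mean = moving_mean * alpha + batch_mean * (1 - alpha)
moving_variance = moving_variance * alpha + batch_variance * (1 - alpha)

# Evaluation mode: normalize with the accumulated statistics instead
y_eval = (x - moving_mean) / np.sqrt(moving_variance + EPS)
```

After one update the moving statistics are still far from the batch statistics, so `y_eval` is not yet standardized; they converge only after many batches.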

In [None]:
class BatchNormalization(Module):
    EPS = 1e-3
    def __init__(self, alpha = 0.):
        super(BatchNormalization, self).__init__()
        self.alpha = alpha
        self.moving_mean = None
        self.moving_variance = None

    def updateOutput(self, input):
        # Your code goes here. ################################################
        # use self.EPS please
        if self.training:
            batch_mean = np.mean(input, axis=0)
            batch_variance = np.var(input, axis=0)

            if self.moving_mean is None:
                self.moving_mean = batch_mean
            else:
                self.moving_mean = self.moving_mean * self.alpha + batch_mean * (1 - self.alpha)

            if self.moving_variance is None:
                self.moving_variance = batch_variance
            else:
                self.moving_variance = self.moving_variance * self.alpha + batch_variance * (1 - self.alpha)

            self.output = (input - batch_mean) / np.sqrt(batch_variance + self.EPS)
        else:
            self.output = (input - self.moving_mean) / np.sqrt(self.moving_variance + self.EPS)

        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        batch_size, n_feats = input.shape
        batch_mean = np.mean(input, axis=0)
        batch_variance = np.var(input, axis=0)
        std_inv = 1.0 / np.sqrt(batch_variance + self.EPS)

        dvar = np.sum(gradOutput * (input - batch_mean) * -0.5 * std_inv**3, axis=0)

        dmean = np.sum(gradOutput * -std_inv, axis=0) + dvar * np.mean(-2.0 * (input - batch_mean), axis=0)

        self.gradInput = gradOutput * std_inv + dvar * 2.0 * (input - batch_mean) / batch_size + dmean / batch_size

        return self.gradInput

    def __repr__(self):
        return "BatchNormalization"

In [None]:
class ChannelwiseScaling(Module):
    """
       Implements linear transform of input y = gamma * x + beta
       where gamma, beta - learnable vectors of length x.shape[-1]
    """
    def __init__(self, n_out):
        super(ChannelwiseScaling, self).__init__()

        stdv = 1./np.sqrt(n_out)
        self.gamma = np.random.uniform(-stdv, stdv, size=n_out)
        self.beta = np.random.uniform(-stdv, stdv, size=n_out)

        self.gradGamma = np.zeros_like(self.gamma)
        self.gradBeta = np.zeros_like(self.beta)

    def updateOutput(self, input):
        self.output = input * self.gamma + self.beta
        return self.output

    def updateGradInput(self, input, gradOutput):
        self.gradInput = gradOutput * self.gamma
        return self.gradInput

    def accGradParameters(self, input, gradOutput):
        self.gradBeta = np.sum(gradOutput, axis=0)
        self.gradGamma = np.sum(gradOutput*input, axis=0)

    def zeroGradParameters(self):
        self.gradGamma.fill(0)
        self.gradBeta.fill(0)

    def getParameters(self):
        return [self.gamma, self.beta]

    def getGradParameters(self):
        return [self.gradGamma, self.gradBeta]

    def __repr__(self):
        return "ChannelwiseScaling"

Practical notes. If BatchNormalization is placed after a linear transformation layer (a dense layer, a convolution, or channelwise scaling) that implements a function like `y = weight * x + bias`, then adding the bias becomes useless and can be omitted, since its effect is cancelled by the batch mean subtraction. If BatchNormalization (followed by `ChannelwiseScaling`) is placed before a layer that propagates scale (such as ReLU or LeakyReLU) followed by any linear transformation layer, then the `gamma` parameter in `ChannelwiseScaling` can be frozen, since it can be absorbed into the linear transformation layer.
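
The first note is easy to check numerically. In this sketch `normalize` is an illustrative stand-in for the training-mode behaviour of `BatchNormalization`, and the weights and bias are random.

```python
import numpy as np

def normalize(x, eps=1e-3):
    # Training-mode batch normalization without the scaling step
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
W = rng.normal(size=(6, 4))
b = rng.normal(size=6)

with_bias = normalize(x @ W.T + b)
without_bias = normalize(x @ W.T)
```

The two results agree up to rounding, because a constant per-feature shift changes the batch mean by the same amount and leaves the variance untouched.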

## 5. (0.3) Dropout
Implement [**dropout**](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf). The idea and implementation are really simple: just multiply the input by a Bernoulli mask that zeroes each element with probability $p$ (so each mask entry is 1 with probability $1 - p$).

This has proven to be an effective technique for regularization and preventing the co-adaptation of neurons.

While training (`self.training == True`) it should sample a mask on each iteration (for every batch), zero out elements, and multiply the remaining elements by $1 / (1 - p)$. The latter keeps the mean values of the features close to the mean values seen in test mode. When testing, this module should implement the identity transform, i.e. `self.output = input`.

- input:   **`batch_size x n_feats`**
- output: **`batch_size x n_feats`**
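
A quick sketch of why the $1 / (1 - p)$ factor is needed: with it, the expected value of each activation is unchanged by dropout. The input of all ones and the seed are arbitrary choices made for this example.

```python
import numpy as np

p = 0.5
rng = np.random.default_rng(0)
x = np.ones((1000, 100))

# Each mask entry is 1 with probability 1 - p (the element is kept)
mask = rng.binomial(1, 1 - p, size=x.shape)
y = x * mask / (1 - p)

kept_fraction = mask.mean()   # close to 1 - p
output_mean = y.mean()        # close to the input mean of 1.0
```

Surviving elements are scaled up to `1 / (1 - p)`, so no rescaling is needed at test time.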

In [None]:
class Dropout(Module):
    def __init__(self, p=0.5):
        super(Dropout, self).__init__()

        self.p = p
        self.mask = None

    def updateOutput(self, input):
        # Your code goes here. ################################################
        if self.training:
            self.mask = np.random.binomial(1, 1 - self.p, size=input.shape)
            self.output = input * self.mask / (1 - self.p)
        else:
            self.output = input
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        if self.training:
            self.gradInput = gradOutput * self.mask / (1 - self.p)
        else:
            self.gradInput = gradOutput
        return self.gradInput

    def __repr__(self):
        return "Dropout"

## 6. (2.0) Conv2d
Implement [**Conv2d**](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html). Use only this list of parameters: (in_channels, out_channels, kernel_size, stride, padding, bias, padding_mode) and fix dilation=1 and groups=1.
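
For orientation, here is a hedged sketch of the output-size arithmetic and of a vectorized forward pass for the zero-padding case. `conv_output_size` and `conv2d_forward` are hypothetical helpers, not the required class; the sketch relies on `numpy.lib.stride_tricks.sliding_window_view`, which needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_output_size(size, kernel, stride, padding):
    """Spatial output size for dilation=1: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def conv2d_forward(x, weight, bias=None, stride=(1, 1), padding=(0, 0)):
    """Zero-padded cross-correlation of x (B, C_in, H, W) with weight (C_out, C_in, kH, kW)."""
    x = np.pad(x, ((0, 0), (0, 0), (padding[0], padding[0]), (padding[1], padding[1])))
    kh, kw = weight.shape[2:]
    # windows has shape (B, C_in, H_out, W_out, kH, kW) after subsampling by the stride
    windows = sliding_window_view(x, (kh, kw), axis=(2, 3))[:, :, ::stride[0], ::stride[1]]
    out = np.einsum("bchwij,ocij->bohw", windows, weight)
    if bias is not None:
        out = out + bias[None, :, None, None]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 7, 7))
weight = rng.normal(size=(4, 3, 3, 3))
bias = rng.normal(size=4)
out = conv2d_forward(x, weight, bias, stride=(2, 2), padding=(1, 1))
```

A loop-based implementation is easier to debug, while the windowed version avoids Python loops over every output position; both should agree on the same inputs.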

In [14]:
class Conv2d(Module):
    def __init__(self, in_channels, out_channels, kernel_size,
                 stride=1, padding=0, bias=True, padding_mode='zeros'):
        super(Conv2d, self).__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.bias = bias
        self.padding_mode = padding_mode

        self.training = True


        if isinstance(kernel_size, int):
            self.kernel_size = (kernel_size, kernel_size)
        if isinstance(stride, int):
            self.stride = (stride, stride)
        if isinstance(padding, int):
            self.padding = (padding, padding)

        self.weight = np.random.randn(out_channels, in_channels, self.kernel_size[0], self.kernel_size[1]) / np.sqrt(in_channels * self.kernel_size[0] * self.kernel_size[1])
        if bias:
            self.bias_term = np.zeros(out_channels)
        else:
            self.bias_term = None

        self._parameters = [self.weight]
        if bias:
            self._parameters.append(self.bias_term)

        self._gradient = [np.zeros_like(w) for w in self._parameters]

    def _pad_input(self, input):
        if self.padding_mode == 'zeros':
            padded_input = np.pad(input, ((0, 0), (0, 0), (self.padding[0], self.padding[0]), (self.padding[1], self.padding[1])), 'constant')
        elif self.padding_mode == 'reflect':
            padded_input = np.pad(input, ((0, 0), (0, 0), (self.padding[0], self.padding[0]), (self.padding[1], self.padding[1])), 'reflect')
        elif self.padding_mode == 'replicate':
            padded_input = np.pad(input, ((0, 0), (0, 0), (self.padding[0], self.padding[0]), (self.padding[1], self.padding[1])), 'edge')
        elif self.padding_mode == 'same':
            height = input.shape[2]
            width = input.shape[3]
            out_height = (height + self.stride[0] - 1) // self.stride[0]
            out_width = (width + self.stride[1] - 1) // self.stride[1]
            pad_h = max((out_height - 1) * self.stride[0] + self.kernel_size[0] - height, 0)
            pad_w = max((out_width - 1) * self.stride[1] + self.kernel_size[1] - width, 0)
            pad_top = pad_h // 2
            pad_bottom = pad_h - pad_top
            pad_left = pad_w // 2
            pad_right = pad_w - pad_left
            padded_input = np.pad(input, ((0, 0), (0, 0), (pad_top, pad_bottom), (pad_left, pad_right)), 'constant')
        else:
            raise ValueError(f"Invalid padding_mode: {self.padding_mode}")
        return padded_input

    def updateOutput(self, input):
        input = np.array(input)
        padded_input = self._pad_input(input)
        batch_size, in_channels, in_height, in_width = padded_input.shape
        out_height = (in_height - self.kernel_size[0]) // self.stride[0] + 1
        out_width = (in_width - self.kernel_size[1]) // self.stride[1] + 1
        output = np.zeros((batch_size, self.out_channels, out_height, out_width))

        for b in range(batch_size):
            for out_c in range(self.out_channels):
                for out_h in range(out_height):
                    for out_w in range(out_width):
                        input_slice = padded_input[b, :, out_h * self.stride[0]:out_h * self.stride[0] + self.kernel_size[0], out_w * self.stride[1]:out_w * self.stride[1] + self.kernel_size[1]]
                        output[b, out_c, out_h, out_w] = np.sum(input_slice * self.weight[out_c])
                        if self.bias_term is not None:
                            output[b, out_c, out_h, out_w] += self.bias_term[out_c]

        self.output = output
        return self.output

    def updateGradInput(self, input, gradOutput):
        input = np.array(input)
        padded_input = self._pad_input(input)
        batch_size, in_channels, in_height, in_width = padded_input.shape
        out_height, out_width = gradOutput.shape[2], gradOutput.shape[3]
        gradInput = np.zeros_like(padded_input)
        gradWeight = np.zeros_like(self.weight)
        gradBias = np.zeros_like(self.bias_term) if self.bias_term is not None else None

        for b in range(batch_size):
            for out_c in range(self.out_channels):
                for out_h in range(out_height):
                    for out_w in range(out_width):
                        input_slice = padded_input[b, :, out_h * self.stride[0]:out_h * self.stride[0] + self.kernel_size[0], out_w * self.stride[1]:out_w * self.stride[1] + self.kernel_size[1]]
                        gradInput[b, :, out_h * self.stride[0]:out_h * self.stride[0] + self.kernel_size[0], out_w * self.stride[1]:out_w * self.stride[1] + self.kernel_size[1]] += gradOutput[b, out_c, out_h, out_w] * self.weight[out_c]
                        gradWeight[out_c] += gradOutput[b, out_c, out_h, out_w] * input_slice
                        if self.bias_term is not None:
                            gradBias[out_c] += gradOutput[b, out_c, out_h, out_w]

        if self.padding_mode == 'same':
            # Recover the (possibly asymmetric) padding applied in _pad_input
            pad_top = (padded_input.shape[2] - input.shape[2]) // 2
            pad_left = (padded_input.shape[3] - input.shape[3]) // 2
        else:
            pad_top, pad_left = self.padding
        # Crop the padded border so gradInput has the same shape as input
        gradInput = gradInput[:, :, pad_top:pad_top + input.shape[2], pad_left:pad_left + input.shape[3]]

        self.gradInput = gradInput
        self._gradient[0] = gradWeight
        if self.bias_term is not None:
            self._gradient[1] = gradBias
        return self.gradInput

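    def zeroGradParameters(self):
        for grad in self._gradient:
            grad.fill(0)

    def getParameters(self):
        # Expose the weight (and bias, if any) so an optimizer can update them
        return self._parameters

    def getGradParameters(self):
        return self._gradient

    def __repr__(self):
        return "Conv2d"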

## 7. (0.5) MaxPool2d and AvgPool2d
Implement [**MaxPool2d**](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html) and [**AvgPool2d**](https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html). Use only the parameters kernel_size, stride, and padding (padding with negative infinity for max pooling and with zero for average pooling); keep all other parameters fixed at the framework defaults.
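
The choice of padding value matters for max pooling when activations are negative. The sketch below uses a hypothetical `max_pool2d` helper (square windows, integer parameters) built on `sliding_window_view` to compare the two paddings.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool2d(x, kernel_size, stride, padding, pad_value=-np.inf):
    """Max pooling over (B, C, H, W) with square windows; pad_value fills the border."""
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)),
               mode="constant", constant_values=pad_value)
    windows = sliding_window_view(x, (kernel_size, kernel_size), axis=(2, 3))
    return windows[:, :, ::stride, ::stride].max(axis=(4, 5))

x = -np.ones((1, 1, 2, 2))  # all activations are negative

with_neg_inf = max_pool2d(x, kernel_size=3, stride=1, padding=1)
with_zeros = max_pool2d(x, kernel_size=3, stride=1, padding=1, pad_value=0.0)
```

With negative-infinity padding the border never wins the maximum, so the output contains only real activations; with zero padding every window here would report the padded zero instead.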

In [None]:
class MaxPool2d(Module):
    def __init__(self, kernel_size, stride, padding):
        super(MaxPool2d, self).__init__()

        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

    def updateOutput(self, input):
        # Your code goes here. ################################################
        batch_size, channels, height, width = input.shape

        out_height = (height - self.kernel_size + 2 * self.padding) // self.stride + 1
        out_width = (width - self.kernel_size + 2 * self.padding) // self.stride + 1

        padded_input = np.pad(input,
                              ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),
                              mode='constant', constant_values=-np.inf)

        self.output = np.zeros((batch_size, channels, out_height, out_width))

        for i in range(out_height):
            for j in range(out_width):
                h_start = i * self.stride
                w_start = j * self.stride

                region = padded_input[:, :, h_start:h_start + self.kernel_size, w_start:w_start + self.kernel_size]
                self.output[:, :, i, j] = np.max(region, axis=(2, 3))
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        batch_size, channels, height, width = input.shape
        out_height, out_width = gradOutput.shape[2], gradOutput.shape[3]

        padded_input = np.pad(input,
                              ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),
                              mode='constant', constant_values=-np.inf)

        # Accumulate in padded coordinates, then crop the padding away.
        # Indexing an unpadded array with padded offsets misaligns the windows when padding > 0.
        padded_grad = np.zeros_like(padded_input)

        for i in range(out_height):
            for j in range(out_width):
                h_start = i * self.stride
                w_start = j * self.stride

                region = padded_input[:, :, h_start:h_start + self.kernel_size, w_start:w_start + self.kernel_size]

                # Every element equal to the window maximum receives the gradient (ties included)
                max_mask = (region == np.max(region, axis=(2, 3), keepdims=True))

                padded_grad[:, :, h_start:h_start + self.kernel_size, w_start:w_start + self.kernel_size] += \
                    gradOutput[:, :, i, j][:, :, np.newaxis, np.newaxis] * max_mask

        p = self.padding
        self.gradInput = padded_grad[:, :, p:p + height, p:p + width]

        return self.gradInput

    def __repr__(self):
        return "MaxPool2d"


class AvgPool2d(Module):
    def __init__(self, kernel_size, stride, padding):
        super(AvgPool2d, self).__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

    def updateOutput(self, input):
        # Your code goes here. ################################################
        batch_size, channels, in_h, in_w = input.shape
        k_h, k_w = (self.kernel_size, self.kernel_size) if isinstance(self.kernel_size, int) else self.kernel_size
        s_h, s_w = (self.stride, self.stride) if isinstance(self.stride, int) else self.stride
        p_h, p_w = (self.padding, self.padding) if isinstance(self.padding, int) else self.padding

        out_h = (in_h + 2 * p_h - k_h) // s_h + 1
        out_w = (in_w + 2 * p_w - k_w) // s_w + 1

        padded_input = np.pad(input,
                            ((0,0), (0,0), (p_h,p_h), (p_w,p_w)),
                            mode='constant',
                            constant_values=0)

        self.output = np.zeros((batch_size, channels, out_h, out_w))

        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        h_start = i * s_h
                        h_end = h_start + k_h
                        w_start = j * s_w
                        w_end = w_start + k_w

                        window = padded_input[b, c, h_start:h_end, w_start:w_end]
                        self.output[b, c, i, j] = np.mean(window)
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        batch_size, channels, in_h, in_w = input.shape
        k_h, k_w = (self.kernel_size, self.kernel_size) if isinstance(self.kernel_size, int) else self.kernel_size
        s_h, s_w = (self.stride, self.stride) if isinstance(self.stride, int) else self.stride
        p_h, p_w = (self.padding, self.padding) if isinstance(self.padding, int) else self.padding

        padded_grad = np.pad(np.zeros_like(input),
                           ((0,0), (0,0), (p_h,p_h), (p_w,p_w)),
                           mode='constant',
                           constant_values=0)

        out_h, out_w = gradOutput.shape[2], gradOutput.shape[3]

        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        h_start = i * s_h
                        h_end = h_start + k_h
                        w_start = j * s_w
                        w_end = w_start + k_w

                        grad = gradOutput[b, c, i, j] / (k_h * k_w)
                        padded_grad[b, c, h_start:h_end, w_start:w_end] += grad

        self.gradInput = padded_grad[:, :, p_h:p_h+in_h, p_w:p_w+in_w] if (p_h > 0 or p_w > 0) else padded_grad
        return self.gradInput

    def __repr__(self):
        return "AvgPool2d"


#8. (0.3) Implement **GlobalMaxPool2d** and **GlobalAvgPool2d**. No tests are provided for them and the parameters are up to you, but they must aggregate information within each channel. Write test functions for these layers on your own.

In [None]:
class GlobalMaxPool2d(Module):
    def __init__(self):
        super(GlobalMaxPool2d, self).__init__()

    def updateOutput(self, input):
        self.input = input

        self.output = np.max(input, axis=(2, 3), keepdims=True)
        return self.output

    def updateGradInput(self, input, gradOutput):
        self.gradInput = np.zeros_like(input)

        mask = (input == self.output)
        self.gradInput[mask] = gradOutput.repeat(input.shape[2], axis=2).repeat(input.shape[3], axis=3)[mask]

        return self.gradInput

    def __repr__(self):
        return "GlobalMaxPool2d"

class GlobalAvgPool2d(Module):
    def __init__(self):
        super(GlobalAvgPool2d, self).__init__()

    def updateOutput(self, input):
        self.input = input

        self.output = np.mean(input, axis=(2, 3), keepdims=True)
        return self.output

    def updateGradInput(self, input, gradOutput):
        self.gradInput = np.ones_like(input) * (gradOutput / (input.shape[2] * input.shape[3]))
        return self.gradInput

    def __repr__(self):
        return "GlobalAvgPool2d"
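
Since the task asks for self-written tests, one possible approach is a finite-difference gradient check. The sketch below is only an illustration: to stay self-contained it uses plain functions that mirror the formulas of `GlobalMaxPool2d` and `GlobalAvgPool2d` above, and the helper names (`numeric_grad`, `global_max_forward`, and so on), the step `eps`, and the tolerance are assumptions rather than part of the assignment.

```python
import numpy as np

def numeric_grad(forward, x, grad_output, eps=1e-6):
    """Central-difference estimate of d sum(forward(x) * grad_output) / dx."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        plus = np.sum(forward(x) * grad_output)
        x[idx] = orig - eps
        minus = np.sum(forward(x) * grad_output)
        x[idx] = orig  # restore the perturbed entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Stand-ins that mirror the two layers above (hypothetical helper names).
def global_max_forward(x):
    return np.max(x, axis=(2, 3), keepdims=True)

def global_max_backward(x, grad_output):
    mask = (x == global_max_forward(x))
    return grad_output * mask

def global_avg_forward(x):
    return np.mean(x, axis=(2, 3), keepdims=True)

def global_avg_backward(x, grad_output):
    return np.ones_like(x) * grad_output / (x.shape[2] * x.shape[3])

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 5))   # batch x channels x height x width
g = rng.normal(size=(2, 3, 1, 1))   # upstream gradient, one value per channel

max_ok = np.allclose(global_max_backward(x, g),
                     numeric_grad(global_max_forward, x, g), atol=1e-6)
avg_ok = np.allclose(global_avg_backward(x, g),
                     numeric_grad(global_avg_forward, x, g), atol=1e-6)
print(max_ok, avg_ok)
```

In the notebook itself, the stand-in functions would be replaced by `module.forward(x)` and `module.backward(x, g)` of the actual layers. Random continuous inputs are used on purpose: the maximum is then unique in each channel, so the finite-difference estimate for max pooling is well defined.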

#9. (0.2) Implement [**Flatten**](https://pytorch.org/docs/stable/generated/torch.flatten.html)

In [None]:
class Flatten(Module):
    def __init__(self, start_dim=0, end_dim=-1):
        super(Flatten, self).__init__()

        self.start_dim = start_dim
        self.end_dim = end_dim

    def updateOutput(self, input):
        # Your code goes here. ################################################
        self.input_shape = input.shape

        # Normalize negative dims so that values such as end_dim=-2 are handled, not only -1.
        ndim = input.ndim
        start_dim = self.start_dim % ndim
        end_dim = self.end_dim % ndim

        new_shape = input.shape[:start_dim] + (-1,) + input.shape[end_dim + 1:]

        self.output = input.reshape(new_shape)
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        self.gradInput = gradOutput.reshape(self.input_shape)
        return self.gradInput

    def __repr__(self):
        return "Flatten"

# Activation functions

Here's the complete example for the **Rectified Linear Unit** non-linearity (aka **ReLU**):

In [None]:
class ReLU(Module):
    def __init__(self):
        super(ReLU, self).__init__()

    def updateOutput(self, input):
        self.output = np.maximum(input, 0)
        return self.output

    def updateGradInput(self, input, gradOutput):
        self.gradInput = np.multiply(gradOutput , input > 0)
        return self.gradInput

    def __repr__(self):
        return "ReLU"

## 10. (0.1) Leaky ReLU
Implement [**Leaky Rectified Linear Unit**](http://en.wikipedia.org/wiki%2FRectifier_%28neural_networks%29%23Leaky_ReLUs). Experiment with the slope.

In [None]:
class LeakyReLU(Module):
    def __init__(self, slope = 0.03):
        super(LeakyReLU, self).__init__()

        self.slope = slope

    def updateOutput(self, input):
        self.output = np.maximum(input, self.slope * input)
        return self.output

    def updateGradInput(self, input, gradOutput):
        self.gradInput = gradOutput * (input > 0) + self.slope * (input <= 0) * gradOutput
        return self.gradInput

    def __repr__(self):
        return "LeakyReLU"
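
As a starting point for experimenting with the slope, the following sketch evaluates the same formula as `LeakyReLU.updateOutput` for a few slope values. The function name `leaky_relu` and the chosen slopes are illustrative assumptions. Note that the `np.maximum(x, slope * x)` form only matches the usual definition when `0 <= slope <= 1`.

```python
import numpy as np

def leaky_relu(x, slope):
    # Same formula as LeakyReLU.updateOutput; assumes 0 <= slope <= 1.
    return np.maximum(x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
for slope in (0.0, 0.03, 0.3):
    # slope = 0.0 reduces to plain ReLU; larger slopes let more of the
    # negative part of the signal through.
    print(slope, leaky_relu(x, slope))
```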

## 11. (0.1) ELU
Implement [**Exponential Linear Units**](http://arxiv.org/abs/1511.07289) activations.

In [None]:
class ELU(Module):
    def __init__(self, alpha = 1.0):
        super(ELU, self).__init__()

        self.alpha = alpha

    def updateOutput(self, input):
        # Your code goes here. ################################################
        # np.where evaluates both branches, so the exponent argument is clamped
        # to avoid overflow in np.exp for large positive inputs.
        self.output = np.where(input > 0, input, self.alpha * (np.exp(np.minimum(input, 0)) - 1))
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        self.gradInput = np.where(input > 0, gradOutput, gradOutput * self.alpha * np.exp(np.minimum(input, 0)))
        return self.gradInput

    def __repr__(self):
        return "ELU"

## 12. (0.1) SoftPlus
Implement [**SoftPlus**](https://en.wikipedia.org/wiki%2FRectifier_%28neural_networks%29) activations. Note how closely they resemble ReLU.

In [None]:
class SoftPlus(Module):
    def __init__(self):
        super(SoftPlus, self).__init__()

    def updateOutput(self, input):
        # Your code goes here. ################################################
        # logaddexp(0, x) = log(exp(0) + exp(x)) = log(1 + exp(x)), without overflow for large x
        self.output = np.logaddexp(0, input)
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Your code goes here. ################################################
        self.gradInput = gradOutput / (1 + np.exp(-input))
        return self.gradInput

    def __repr__(self):
        return "SoftPlus"
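
To make the resemblance to ReLU concrete, the sketch below measures the gap between the two functions on a few sample points. It is a standalone illustration: the `softplus` helper is an assumption written for this example and uses `np.logaddexp` so that large inputs do not overflow.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written as logaddexp(0, x) for numerical stability
    return np.logaddexp(0.0, x)

x = np.array([-20.0, -2.0, 0.0, 2.0, 20.0])
gap = softplus(x) - np.maximum(x, 0.0)  # softplus(x) - ReLU(x) = log(1 + exp(-|x|))
print(gap)
```

The gap is largest at `x = 0`, where it equals `log(2)`, and shrinks quickly as `|x|` grows, so SoftPlus behaves like a smoothed ReLU.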

#13. (0.2) Gelu
Implement [**Gelu**](https://pytorch.org/docs/stable/generated/torch.nn.GELU.html) activations.

In [None]:
from scipy.special import erf

class Gelu(Module):
    def __init__(self):
        super(Gelu, self).__init__()

    def updateOutput(self, input):
        # Use the exact erf-based formula, as in PyTorch
        self.output = input * 0.5 * (1.0 + erf(input / np.sqrt(2.0)))
        return self.output

    def updateGradInput(self, input, gradOutput):
        # Exact derivative of GELU, expressed via erf
        x = input
        sqrt_2 = np.sqrt(2.0)
        erf_term = erf(x / sqrt_2)
        pdf_term = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
        derivative = 0.5 * (1 + erf_term) + x * pdf_term
        self.gradInput = gradOutput * derivative
        return self.gradInput

    def __repr__(self):
        return "Gelu"
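
For reference, the derivative used in `updateGradInput` follows from the product rule. Here $\Phi$ denotes the standard normal CDF and $\varphi$ its density (notation introduced for this note):

$$
\mathrm{GELU}(x) = x\,\Phi(x), \qquad \Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}\!\left(\tfrac{x}{\sqrt{2}}\right)\right)
$$

$$
\frac{d}{dx}\,\mathrm{GELU}(x) = \Phi(x) + x\,\varphi(x), \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^{2}/2}
$$

The two terms correspond to `0.5 * (1 + erf_term)` and `x * pdf_term` in the code above.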

# Criterions

Criterions are used to score the model's answers.

In [None]:
class Criterion(object):
    def __init__ (self):
        self.output = None
        self.gradInput = None

    def forward(self, input, target):
        """
            Given an input and a target, compute the loss function
            associated to the criterion and return the result.

            For consistency this function should not be overridden,
            all the code goes in `updateOutput`.
        """
        return self.updateOutput(input, target)

    def backward(self, input, target):
        """
            Given an input and a target, compute the gradients of the loss function
            associated to the criterion and return the result.

            For consistency this function should not be overridden,
            all the code goes in `updateGradInput`.
        """
        return self.updateGradInput(input, target)

    def updateOutput(self, input, target):
        """
        Function to override.
        """
        return self.output

    def updateGradInput(self, input, target):
        """
        Function to override.
        """
        return self.gradInput

    def __repr__(self):
        """
        Pretty printing. Should be overridden in every criterion if you want
        to have a readable description.
        """
        return "Criterion"

The **MSECriterion**, the basic L2 loss commonly used for regression, is implemented here for you.
- input:   **`batch_size x n_feats`**
- target: **`batch_size x n_feats`**
- output: **scalar**

In [None]:
class MSECriterion(Criterion):
    def __init__(self):
        super(MSECriterion, self).__init__()

    def updateOutput(self, input, target):
        self.output = np.sum(np.power(input - target,2)) / input.shape[0]
        return self.output

    def updateGradInput(self, input, target):
        self.gradInput  = (input - target) * 2 / input.shape[0]
        return self.gradInput

    def __repr__(self):
        return "MSECriterion"

## 14. (0.2) Negative LogLikelihood criterion (numerically unstable)
Your task is to implement the **ClassNLLCriterion**. It should implement [multiclass log loss](http://scikit-learn.org/stable/modules/model_evaluation.html#log-loss). Although that formula contains a sum over `y` (target),
remember that the targets are one-hot encoded, which simplifies the computation a lot. Note that criterions are the only place where you divide by the batch size. There is also a small hack: the probabilities are kept away from zero by a small number to avoid computing log(0).
- input:   **`batch_size x n_feats`** - probabilities
- target: **`batch_size x n_feats`** - one-hot representation of ground truth
- output: **scalar**
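
The one-hot simplification mentioned above can be checked on a tiny example. This is a sketch with made-up probabilities and targets; the variable names are illustrative, and `EPS` matches the value used in the class below.

```python
import numpy as np

EPS = 1e-15
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])    # batch_size x n_feats, rows sum to 1
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])   # one-hot ground truth

clamped = np.clip(probs, EPS, 1 - EPS)

# Full formula: sum over samples and classes, divided by the batch size.
loss_full = -np.sum(target * np.log(clamped)) / probs.shape[0]

# One-hot shortcut: only the probability of the true class contributes.
true_class = target.argmax(axis=1)
loss_short = -np.mean(np.log(clamped[np.arange(probs.shape[0]), true_class]))

print(loss_full, loss_short)
```

Both expressions give the same value, because every term with `target == 0` vanishes from the sum.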



In [None]:
class ClassNLLCriterionUnstable(Criterion):
    EPS = 1e-15
    def __init__(self):
        super(ClassNLLCriterionUnstable, self).__init__()

    def updateOutput(self, input, target):

        # Use this trick to avoid numerical errors
        input_clamp = np.clip(input, self.EPS, 1 - self.EPS)

        # Your code goes here. ################################################
        self.output = -np.sum(target * np.log(input_clamp)) / input.shape[0]
        return self.output

    def updateGradInput(self, input, target):

        # Use this trick to avoid numerical errors
        input_clamp = np.clip(input, self.EPS, 1 - self.EPS)

        # Your code goes here. ################################################
        self.gradInput = -target / (input_clamp * input.shape[0])
        return self.gradInput

    def __repr__(self):
        return "ClassNLLCriterionUnstable"

## 15. (0.3) Negative LogLikelihood criterion (numerically stable)
- input:   **`batch_size x n_feats`** - log probabilities
- target: **`batch_size x n_feats`** - one-hot representation of ground truth
- output: **scalar**

The task is similar to the previous one, but now the criterion input is the output of a log-softmax layer. This decomposition avoids the numerical problems of computing the forward and backward passes of log().
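
To illustrate why this decomposition helps, here is a minimal sketch with deliberately extreme logits. The `log_softmax` helper is an assumption written for this example and stands in for the notebook's log-softmax layer; it subtracts the row maximum before exponentiating, so no overflow occurs.

```python
import numpy as np

def log_softmax(z):
    # Shift by the row maximum (log-sum-exp trick) before exponentiating.
    shifted = z - z.max(axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

logits = np.array([[1000.0, 0.0, -1000.0],
                   [2.0, 1.0, 0.0]])
target = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])   # one-hot ground truth

log_probs = log_softmax(logits)

# Same formulas as ClassNLLCriterion: no log() is taken inside the criterion.
loss = -np.sum(target * log_probs) / logits.shape[0]
grad = -target / logits.shape[0]
print(loss)
```

The loss stays finite even though a naive `np.exp(1000.0)` would overflow, and the gradient with respect to the log probabilities does not involve any division by a probability.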

In [None]:
class ClassNLLCriterion(Criterion):
    def __init__(self):
        super(ClassNLLCriterion, self).__init__()

    def updateOutput(self, input, target):
        # Your code goes here. ################################################
        self.output = (-np.sum(target * input) / input.shape[0])
        return self.output

    def updateGradInput(self, input, target):
        # Your code goes here. ################################################
        self.gradInput = -target / input.shape[0]
        return self.gradInput

    def __repr__(self):
        return "ClassNLLCriterion"

Part 1 of the assignment: implementing the layers, losses, and activation functions - 5 points. \\
Part 2 of the assignment: building models from your own classes. What is required:
  1. Choose an optimizer and implement it so that it works with your classes. - 1 point.
  2. A model for a multi-output regression task on data of your choice. Use an FCNN, dropout, batchnorm, and MSE. Try different activation functions. For the first model, try a large, a medium, and a small variant. - 1 point.
  3. A model for a multiclass classification task on MNIST. Use convolutions, max pooling, flatten, and softmax. - 1 point.
  4. An autoencoder for data of your choice. It should be built from convolutional and fully connected layers, dropout, batchnorm, etc. - 2 points. \\

In addition, the grading of each model will take into account:
1. A correctly chosen metric and loss function.
2. Plots of the losses and metrics on train and validation. Evaluation of model quality on the test set.
3. A learning-rate scheduler.
4. Warmup.
5. An early-stopping mechanism and saving of the best model.
6. Switching of the loss (metric) and the optimizer.
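
As a rough illustration of items 3 to 5 in the list above, the sketch below combines linear warmup with a cosine learning-rate schedule and adds a small early-stopping helper. Everything here is an assumption made for illustration: the names `lr_at` and `EarlyStopping`, the choice of cosine decay, and all default values are hypothetical and not prescribed by the assignment.

```python
import math

def lr_at(step, base_lr=1e-2, warmup_steps=100, total_steps=1000):
    """Linear warmup followed by cosine decay (all values are illustrative)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return (is_best, should_stop) for the latest validation loss."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [stopper.step(v) for v in (1.0, 0.9, 0.95, 0.97)]
print(lr_at(0), lr_at(99), lr_at(1000), history)
```

In a training loop, `lr_at(step)` would be passed to the optimizer before each update, and whenever `is_best` is true a copy of the model parameters would be saved so the best model can be restored after stopping.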