In [7]:
# An input in deep learning (DL) can be a number, a vector, or a matrix.
# A tensor can represent all of the above:
# it is essentially an n-dimensional array.
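
To make the number/vector/matrix comparison concrete, here is a minimal sketch (the variable names are illustrative and not part of the original notebook) showing that each of them is a tensor with a different number of dimensions:

```python
import torch

scalar = torch.tensor(3.0)               # 0-dimensional tensor (a single number)
vector = torch.tensor([1.0, 2.0, 3.0])   # 1-dimensional tensor
matrix = torch.zeros(2, 3)               # 2-dimensional tensor
cube = torch.zeros(2, 3, 4)              # 3-dimensional tensor

tensors = (scalar, vector, matrix, cube)
dims = [t.dim() for t in tensors]            # number of dimensions of each tensor
shapes = [tuple(t.shape) for t in tensors]   # size along each dimension
print(dims)
print(shapes)
```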

In [8]:
# PyTorch has 8 basic numeric tensor types: 3 float and 5 integer
# (later releases add more, such as bool and bfloat16)
# float: 16, 32, and 64 bit
# integer: 8-bit signed, 8-bit unsigned, 16, 32, and 64 bit
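
As a quick check of the list above, the following sketch creates a one-element tensor of each type and records its width in bits. It assumes PyTorch's default settings, under which Python floats become float32 and Python ints become int64; the dictionary name `bits` is illustrative.

```python
import torch

float_types = [torch.float16, torch.float32, torch.float64]
int_types = [torch.int8, torch.uint8, torch.int16, torch.int32, torch.int64]

# element_size() returns the number of bytes one element occupies
bits = {str(d): torch.zeros(1, dtype=d).element_size() * 8
        for d in float_types + int_types}

# Types chosen when no dtype is given (assuming default settings)
default_float = torch.tensor([1.0]).dtype
default_int = torch.tensor([1]).dtype
print(bits, default_float, default_int)
```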

In [10]:
# A tensor can be created in three ways:
# calling a constructor of the required type
# converting a NumPy array into a tensor
# asking PyTorch to create a tensor with specific data, e.g. with the torch.zeros function

In [11]:
import torch
import numpy as np


In [14]:
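a = torch.FloatTensor(3,2)   # constructor call: 3x2 float32 tensor, memory is uninitialized
a.zero_()                    # zero it in place (a trailing underscore marks an in-place operation)
n = np.zeros(shape=(3,3))    # NumPy array, float64 by default
torch.tensor(n, dtype=torch.float32)   # convert to a tensor, casting to float32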

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
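
The cell above covers the constructor and the NumPy conversion but not `torch.zeros`. The sketch below puts all three routes side by side and, as an addition not in the original notes, contrasts `torch.tensor` (which copies the data) with `torch.from_numpy` (which shares memory with the array):

```python
import numpy as np
import torch

# 1) Constructor of the required type; contents are undefined until zeroed
a = torch.FloatTensor(3, 2)
a.zero_()

# 2) Conversion from a NumPy array (float64 by default)
n = np.zeros(shape=(3, 2))
b = torch.tensor(n, dtype=torch.float32)   # copies the data and casts to float32
c = torch.from_numpy(n)                    # shares memory with n, keeps float64

# 3) Asking PyTorch for a tensor filled with zeros
d = torch.zeros(3, 2)

n[0, 0] = 5.0   # visible through c (shared memory), not through b (a copy)
print(b[0, 0].item(), c[0, 0].item())
```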

In [15]:
# Double precision means using the 64-bit float tensor type.
# It usually adds memory and performance overhead compared with 32-bit floats.
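
The memory side of that overhead is easy to measure. A minimal sketch, where the helper `nbytes` is a hypothetical name introduced here for illustration:

```python
import torch

single = torch.zeros(1000, 1000, dtype=torch.float32)
double = single.to(torch.float64)   # same values stored in double precision

def nbytes(t):
    # Bytes of data held by the tensor: element count times bytes per element
    return t.numel() * t.element_size()

print(nbytes(single), nbytes(double))
```

The double-precision copy takes twice the memory for the same number of elements.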

In [18]:
# PyTorch transparently supports CUDA GPUs,
# which means that all operations have two versions—CPU and GPU—which are automatically selected.
# The decision is made based on the type of tensors that you are operating on. 
# Every tensor type that we mentioned is for CPU and has its GPU equivalent.
# The only difference is that GPU tensors reside in the torch.cuda package,
# instead of just torch. 
# For example, torch.FloatTensor is a 32-bit float tensor which resides in CPU memory, 
# but torch.cuda.FloatTensor is its GPU counterpart.
# To convert from CPU to GPU, there is a tensor method, **to(device)**, 
# which creates a copy of the tensor to a specified device (which could be CPU or GPU).
# If the tensor is already on the device, nothing happens and the original tensor will be returned.
# Device type can be specified in different ways. 
# 1) you can just pass a string name of the device, which is "cpu" for CPU memory or "cuda" for GPU. 
#   A GPU device could have an optional device index specified after the colon, 
#     for example, the second GPU card in the system could be 
#     addressed by "cuda:1" (index is zero-based)
# 2) you can pass a torch.device object to the to() method;
#   the torch.device class accepts the device name and an optional index.
# To find out which device a tensor currently resides on, use its device property.
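
The device-handling rules above can be sketched as follows. The fallback to CPU when no GPU is present is an assumption added here so that the snippet also runs on machines without CUDA:

```python
import torch

# Use the first GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.zeros(3, 2)
moved = a.to(device)   # a copy on the target device (or `a` itself if it is already there)
same = a.to("cpu")     # `a` is already on the CPU, so the original tensor is returned

# A device object can be built from a name with an optional zero-based index
second_gpu = torch.device("cuda:1")   # only describes a device; does not require a GPU

print(a.device, moved.device, same is a, second_gpu.index)
```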

In [19]:
ca = a.cuda()
ca

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], device='cuda:0')

In [22]:
a+1,ca+1,ca.device

(tensor([[1., 1.],
         [1., 1.],
         [1., 1.]]),
 tensor([[1., 1.],
         [1., 1.],
         [1., 1.]], device='cuda:0'),
 device(type='cuda', index=0))

# Gradients:
Gradients can be calculated with two different approaches:
1) Static graph 2) Dynamic graph

1) *Static*: Calculations need to be defined in advance, and no changes are possible afterward. The graph is processed and optimized by the DL library before any computation is made.

2) *Dynamic*: You don't need to define your graph in advance exactly as it will be executed. You just execute the operations that you want to use for data transformation on your actual data. During this, the library records the order of the operations performed, and when you ask it to calculate gradients, it unrolls its history of operations, accumulating the gradients of the network parameters. This method is also called **notebook gradients**.


With a static graph, the library has much more freedom in optimizing the order in which computations are performed, or even in removing parts of the graph. A dynamic graph, on the other hand, has a higher computation overhead but gives a developer much more freedom. For example, they can say, "For this piece of data, I can apply this network two times, and for this piece of data, I'll use a completely different model with gradients clipped by the batch mean." Another very appealing strength of the dynamic graph model is that it allows you to express your transformation more naturally, in a more "Pythonic" way. In the end, it's just a Python library with a bunch of functions, so just call them and let the library do the magic.
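
A small sketch of what this freedom looks like in practice: ordinary Python control flow decides, per input, which operations get recorded, and the gradients follow whatever path was actually executed. The function `transform` and its threshold are hypothetical and chosen only for illustration.

```python
import torch

def transform(x, w):
    # The branch taken depends on the data, so the recorded graph differs per input
    out = x * w
    if out.sum() > 1.0:
        out = out * w   # apply the weight a second time for "large" inputs
    return out.sum()

w = torch.tensor([2.0], requires_grad=True)

small = transform(torch.tensor([0.1]), w)   # recorded graph: x * w
small.backward()
grad_small = w.grad.clone()                 # d(x*w)/dw = x

w.grad.zero_()                              # gradients accumulate, so reset them
large = transform(torch.tensor([1.0]), w)   # recorded graph: x * w * w
large.backward()
grad_large = w.grad.clone()                 # d(x*w^2)/dw = 2*x*w

print(grad_small, grad_large)
```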


# Tensors and Gradients

There are several gradient-related attributes that every tensor has:
- grad: A property which holds a tensor of the same shape containing computed gradients.
- is_leaf: *True* if this tensor was constructed by the user and *False* if the object is the result of a function transformation.
- requires_grad: *True* if this tensor requires gradients to be calculated.

## Example

<img src="./image/summation_image.png">

In [32]:
# Example: only v1 requires gradients; v_sum and v_res are computed from v1 and v2
v1 = torch.tensor([1.0, 1.0], requires_grad=True)
v2 = torch.tensor([2.0, 2.0])
v_sum = v1 + v2
v_res = (v_sum*2).sum()

In [33]:
v1,v2,v_sum,v_res

(tensor([1., 1.], requires_grad=True),
 tensor([2., 2.]),
 tensor([3., 3.], grad_fn=<AddBackward0>),
 tensor(12., grad_fn=<SumBackward0>))

In [30]:
# check whether they are leaf tensors (meaning constructed by the user)
v1.is_leaf,v2.is_leaf,v_sum.is_leaf,v_res.is_leaf

(True, True, False, False)

In [31]:
# check whether these tensors require gradients
v1.requires_grad, v2.requires_grad, v_sum.requires_grad, v_res.requires_grad

(True, False, True, True)
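
The cells above build the graph but never ask for gradients. As a hedged continuation of the same example, calling `backward()` on the result fills in the `grad` attribute of the leaf tensor that requires gradients:

```python
import torch

v1 = torch.tensor([1.0, 1.0], requires_grad=True)
v2 = torch.tensor([2.0, 2.0])
v_sum = v1 + v2
v_res = (v_sum * 2).sum()

# Walk the recorded operations backwards and accumulate gradients in the leaves
v_res.backward()

# Each element of v1 contributes to v_res with a factor of 2
print(v1.grad)
# v2 does not require gradients, so nothing is stored for it
print(v2.grad)
```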

## NN Building blocks

### Basic example
Below, we create a randomly initialized feed-forward layer with two inputs and five outputs and apply it to a float tensor. All classes in the torch.nn package inherit from the nn.Module base class, which you can also use to implement your own higher-level NN blocks.

All nn.Module children provide the following useful methods:

- **parameters()**: A function that returns an iterator over all variables which require gradient computation (that is, module weights)
- **zero_grad()**: This function resets the gradients of all parameters (to zero, or to None in recent PyTorch releases)
- **to(device)**: This moves all module parameters to a given device (CPU or GPU)
- **state_dict()**: This returns the dictionary with all module parameters and is useful for model serialization
- **load_state_dict()**: This initializes the module with the state dictionary
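
The methods in this list can be exercised on a single layer. This is a minimal sketch; the names `layer` and `clone` are illustrative, and the check after `zero_grad()` accepts both behaviours (gradients set to zero or to None) because the default differs between PyTorch versions.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 5)

# parameters(): the weight matrix (5x2) and the bias vector (5)
shapes = [tuple(p.shape) for p in layer.parameters()]

# Populate the gradients with one backward pass, then clear them
layer(torch.ones(1, 2)).sum().backward()
had_grads = all(p.grad is not None for p in layer.parameters())
layer.zero_grad()
cleared = all(p.grad is None or not bool(p.grad.any())
              for p in layer.parameters())

# state_dict() / load_state_dict(): copy the weights into a second layer
clone = nn.Linear(2, 5)
clone.load_state_dict(layer.state_dict())
same_weights = torch.equal(clone.weight, layer.weight)

print(shapes, list(layer.state_dict().keys()), same_weights)
```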


The whole list of available classes can be found in the documentation at http://pytorch.org/docs.

In [4]:
import torch.nn as nn
import torch
l = nn.Linear(2,5)
v = torch.FloatTensor([1,2])
l(v)

tensor([-1.6061, -0.9103,  1.8675, -0.1647,  0.4338], grad_fn=<AddBackward0>)

The **Sequential** class allows you to combine other layers into a pipe.

In [7]:
s = nn.Sequential(nn.Linear(2,5),nn.ReLU(),
                  nn.Linear(5,20),nn.ReLU(),
                  nn.Linear(20,10),nn.Dropout(p=0.3),nn.Softmax(dim=1))
s

Sequential(
  (0): Linear(in_features=2, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=20, bias=True)
  (3): ReLU()
  (4): Linear(in_features=20, out_features=10, bias=True)
  (5): Dropout(p=0.3, inplace=False)
  (6): Softmax(dim=1)
)
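
To see the pipe in action, the sketch below rebuilds the same stack and applies it to a batch of two samples. Switching to `eval()` mode is an addition made here so that dropout is disabled and the output is deterministic for a given set of weights.

```python
import torch
import torch.nn as nn

s = nn.Sequential(nn.Linear(2, 5), nn.ReLU(),
                  nn.Linear(5, 20), nn.ReLU(),
                  nn.Linear(20, 10), nn.Dropout(p=0.3), nn.Softmax(dim=1))
s.eval()   # inference mode: Dropout passes its input through unchanged

batch = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # 2 samples, 2 features each
out = s(batch)                                    # layers are applied in order
row_sums = out.sum(dim=1)                         # Softmax(dim=1) normalizes each row

print(out.shape, row_sums)
```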

## Custom Layers
By subclassing the **nn.Module** class, you can create your own building blocks which can be stacked together, reused later, and integrated into the PyTorch framework flawlessly.
At its core, nn.Module provides quite rich functionality to its children:

- It tracks all submodules that the current module includes. For example, your building block can have two feed-forward layers used somehow to perform the block's transformation.
- It provides functions to deal with all parameters of the registered submodules. You can obtain a full list of the module's parameters (parameters() method), zero its gradients (zero_grad() method), move to CPU or GPU (to(device) method), serialize and deserialize the module (state_dict() and load_state_dict()), and even perform generic transformations using your own callable (apply() method).
- It establishes the convention of module application to data. Every module needs to perform its data transformation in the forward() method by overriding it.
- There are some more functions, such as the ability to register a hook function to tweak module transformation or gradients flow, but it's more for advanced use cases.

In **PyTorch**, to create a custom module, we usually have to do only two things: 
- register submodules 
- implement the forward() method


In [8]:
class OurModule(nn.Module):
    def __init__(self, num_inputs, num_classes, dropout_prob=0.3):
        super(OurModule, self).__init__()
        self.pipe = nn.Sequential(
            nn.Linear(num_inputs, 5),
            nn.ReLU(),
            nn.Linear(5, 20),
            nn.ReLU(),
            nn.Linear(20, num_classes),
            nn.Dropout(p=dropout_prob),
            nn.Softmax(dim=1)  # explicit dim: normalize over classes for each sample
        )
    def forward(self, x):
        return self.pipe(x)

### Explaining above example
This is our module class that inherits from nn.Module. In the constructor, we pass three parameters: the size of the input, the size of the output, and an optional dropout probability. The first thing we need to do is to call the parent's constructor to let it initialize itself. In the second step, we create an already familiar nn.Sequential with a bunch of layers and assign it to our class field named pipe. By assigning a Sequential instance to our field, we automatically register this module (nn.Sequential inherits from nn.Module, as does everything in the nn package). To register, we don't need to call anything; we just assign our submodules to fields. After the constructor finishes, all those fields will be registered automatically (if you really want to, nn.Module also has a function to register submodules explicitly, add_module()).

Here, we override the forward() function with our implementation of the data transformation. As our module is a very simple wrapper around other layers, we just need to ask them to transform the data. Note that to apply a module to the data, you need to call the module as a callable (that is, pretend that the module instance is a function and call it with the arguments) and not use the forward() function of the nn.Module class. This is because nn.Module overrides the __call__() method, which is used when we treat an instance as a callable. This method does some nn.Module bookkeeping (such as running registered hooks) and calls your forward() method. **If you call forward() directly, you'll interfere with the nn.Module duty, which can give you wrong results.**
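
One observable difference between the two ways of invoking a module is hook handling. In this minimal sketch (the hook function `record_shape` is a hypothetical name), a forward hook fires when the module is called as a callable but is skipped when forward() is called directly:

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 3)
calls = []

def record_shape(module, inputs, output):
    # Forward hooks are run by __call__() after forward() has produced its output
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record_shape)

x = torch.ones(4, 2)
layer(x)           # goes through __call__(): the hook runs
layer.forward(x)   # bypasses __call__(): the hook is skipped
handle.remove()

print(calls)
```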

In [12]:
if __name__ == "__main__":
    net = OurModule(num_inputs=2, num_classes=3)
    v = torch.FloatTensor([[2,3]])
    out = net(v)
    print(net)
    print(out)

NotImplementedError: 

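The code above checks that the module executes and shows what our custom module looks like. The `NotImplementedError` in the recorded output is what nn.Module's default forward() raises, so it usually indicates that forward() was not defined on the class when the cell was run (for example, because of wrong indentation).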


### Loss Function and Loss value
We need to define our learning objective, which is a function that accepts two arguments: the network's output and the desired output. Its responsibility is to return a single number: how close the network's prediction is to the desired result. This function is called the **loss function**, and its output is the **loss value**. Using the loss value, we calculate gradients of the network parameters and adjust them to decrease this loss value, which pushes our model to better results in the future. Both of those pieces (the loss function and the method of tweaking a network's parameters by gradients) are so common and exist in so many forms that both of them form a significant part of the PyTorch library. Let's start with loss functions.

Loss functions reside in the nn package and are implemented as nn.Module subclasses. Usually, they accept two arguments: the output from the network (prediction) and the desired output (ground-truth data, which is also called the label of the data sample). At the time of writing, PyTorch 0.4 contains 17 different loss functions.

The most commonly used are:

- **nn.MSELoss**: The mean square error between arguments, which is the standard loss for regression problems
- **nn.BCELoss and nn.BCEWithLogitsLoss**: Binary cross-entropy loss. The first version expects a single probability value (usually it's the output of the Sigmoid layer), while the second version assumes raw scores as input and applies Sigmoid itself. The second way is usually more numerically stable and efficient. These losses (as their names suggest) are frequently used in binary classification problems.
- **nn.CrossEntropyLoss and nn.NLLLoss**: Famous "maximum likelihood" criteria, which are used in multi-class classification problems. The first version expects raw scores for each class and applies LogSoftmax internally, while the second expects to have log probabilities as the input.
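
The relationships between these losses can be checked numerically. In this sketch the inputs are made-up values (a fixed seed keeps them reproducible); it illustrates the equivalences described above and is not a training recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# MSELoss: mean of squared differences, here (0 + 0 + 4) / 3
pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 2.0, 5.0])
mse = nn.MSELoss()(pred, target)

# BCEWithLogitsLoss on raw scores matches BCELoss on sigmoid probabilities
logits = torch.randn(4)
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce_logits = nn.BCEWithLogitsLoss()(logits, labels)
bce_probs = nn.BCELoss()(torch.sigmoid(logits), labels)

# CrossEntropyLoss on raw scores matches NLLLoss on log-probabilities
scores = torch.randn(4, 3)            # 4 samples, 3 classes
classes = torch.tensor([0, 2, 1, 2])  # class index for each sample
ce = nn.CrossEntropyLoss()(scores, classes)
nll = nn.NLLLoss()(torch.log_softmax(scores, dim=1), classes)

print(mse.item(), bce_logits.item(), bce_probs.item(), ce.item(), nll.item())
```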

### Optimizers

The responsibility of the basic optimizer is to take gradients of model parameters and change these parameters, in order to decrease loss value. By decreasing loss value, we're pushing our model towards desired outputs, which can give us hope of better model performance in the future. 
PyTorch provides lots of popular optimizer implementations and the most widely known are as follows:

- **SGD**: A vanilla stochastic gradient descent algorithm with optional momentum extension
- **RMSprop**: An optimizer, proposed by G. Hinton
- **Adagrad**: An adaptive gradients optimizer
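
How a loss function and an optimizer fit together can be sketched with a toy regression problem. The data (y = 3x + 1), the learning rate, the momentum, and the number of steps are all illustrative assumptions, not values from the original notes.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data: y = 3x + 1
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 3 * x + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

losses = []
for step in range(200):
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = loss_fn(model(x), y)     # forward pass and loss value
    loss.backward()                 # gradients of the loss w.r.t. the parameters
    optimizer.step()                # update the parameters using the gradients
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Swapping in another optimizer, such as `optim.RMSprop` or `optim.Adagrad`, only changes the line that constructs `optimizer`; the loop itself stays the same.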

## Monitoring NN

DL practitioners have developed a list of things that you should observe during your training, which usually includes the following:

- Loss value, which normally consists of several components like base loss and regularization losses. You should monitor both total loss and individual components over time.
- Results of validation on training and test sets.
- Statistics about gradients and weights.
- Learning rates and other hyperparameters, if they are adjusted over time
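
Before wiring anything into a dashboard, the quantities in this list can simply be collected during training. The sketch below records them in a plain Python list; the model, data, and the `history` structure are hypothetical and only show where each value comes from.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
x = torch.randn(64, 4)
y = x.sum(dim=1, keepdim=True)   # toy target: sum of the input features

history = []
for step in range(20):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Statistics about gradients and weights: global L2 norms over all parameters
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    weight_norm = torch.sqrt(sum(p.detach().pow(2).sum() for p in model.parameters()))

    optimizer.step()
    history.append({
        "step": step,
        "loss": loss.item(),
        "grad_norm": grad_norm.item(),
        "weight_norm": weight_norm.item(),
        "lr": optimizer.param_groups[0]["lr"],   # current learning rate
    })

print(history[0])
print(history[-1])
```

Each entry could then be passed to a writer's `add_scalar()` call, one tag per quantity.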


In [14]:
import math
from tensorboardX import SummaryWriter

if __name__ == "__main__":
    writer = SummaryWriter()

    funcs = {"sin": math.sin, "cos": math.cos, "tan": math.tan}
    for angle in range(-360, 360):
        angle_rad = angle * math.pi / 180
        for name, fun in funcs.items():
            val = fun(angle_rad)
            writer.add_scalar(name, val, angle)
    writer.close()

Command to launch TensorBoard and view the logged values (SummaryWriter writes to the runs directory by default):

tensorboard --logdir runs --host localhost