<a href="https://colab.research.google.com/github/RickyMacharm/PyTorch/blob/master/05_Intro2DeepNeuralNets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Dealing with Neural Networks**
Deep learning is a class of machine learning algorithms that is designed to loosely mimic
the neurons in our brain. 

A neuron receives inputs from a number of
surrounding neurons and sums them up, and if the sum exceeds a certain threshold, then the
neuron fires. Between each pair of neurons there is a gap called a synapse.

Signals are carried across
these synapses by neurotransmitter chemicals, and the amount and type of these chemicals
will dictate how strong the input to the neuron is. 

The function of the biological neural
network is replicated by artificial neural networks using weights, biases (a bias is defined as
a weight multiplied by a constant input of 1), and activation functions.
The following is a diagrammatic representation of a neural unit:

![picture](https://drive.google.com/uc?id=1WyEkqvBsBx_Kzsl5GkhVpihBObyNX6V9)
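
To make the idea of a neural unit concrete, here is a minimal sketch of a single unit written with plain tensors. The input values, weights, and bias are made up for illustration and are not taken from the notebook; ReLU is used as the activation only because it is the one used later on.

```python
import torch

# Three inputs arriving from "surrounding neurons" (arbitrary values).
inputs = torch.tensor([0.5, -1.0, 2.0])
# One weight per input (the "synapse strength") and a single bias.
weights = torch.tensor([0.8, 0.2, -0.5])
bias = torch.tensor(0.1)

# Weighted sum of the inputs plus the bias.
weighted_sum = torch.dot(inputs, weights) + bias

# ReLU activation: the unit passes the sum on only when it is positive.
output = torch.relu(weighted_sum)

print(weighted_sum.item(), output.item())
```

With these particular numbers the weighted sum is negative, so the unit outputs 0; changing the weights or the bias changes whether it "fires".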

All a neural network sees are sets of numbers, and it tries to identify a pattern in the data.

Through training, the neural network learns to recognize a pattern in the input; however,
there are certain specialized architectures that perform better when applied to a certain
category of problems than others. 

A simple neural network architecture consists of three
kinds of layer: the **input layer**, the **output layer**, and the **hidden layer**. When there is more
than one hidden layer, it is called a **deep neural network**.

![picture](https://drive.google.com/uc?id=1DkRXCm69pDrquYEd5kFB34kYf_oUGRxa)

In the preceding diagram, each circle represents a neuron or, in deep learning terms, a node,
which is a computation unit. The edges represent the connections between the nodes and
hold the connection weight (synapse strength) between the two nodes.

Here is what we will work through in this segment:

* Defining the neural network class
* Creating a fully connected network
* Defining the loss function
* Implementing optimizers
* Implementing dropouts
* Implementing functional APIs

In [0]:
import torch
from torch import nn
from torchvision import datasets, transforms

**transforms:** A module of common image transformations that can be applied to the samples of a dataset.
The dataset itself is represented by a `Dataset` class, which simply holds the data on which PyTorch can perform manipulations.
Transforms are the operations used to transform the data coming from the dataset.

The `transforms` module helps with a lot of
image preprocessing tasks. For the particular case that we are dealing with, an image
consisting of 28 x 28 grayscale pixels, we first need to read the image and convert it
into a tensor using a `transforms.ToTensor()` transform, which scales the pixel values to
the range [0, 1]. We then normalize the pixel values with a mean of 0.5 and a standard
deviation of 0.5, that is, we subtract 0.5 and divide by 0.5, so that the values fall in the
range [-1, 1] and it becomes easier for the model to train; to do this, we use
`transforms.Normalize((0.5,),(0.5,))`. We
combine all of the transformations together with `transforms.Compose()`.
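
The effect of the normalization step can be checked with plain tensor arithmetic. The sketch below uses a random tensor as a stand-in for an image that has already been through `ToTensor()` (so this is an assumption, not real Fashion-MNIST data) and applies the same subtract-and-divide operation that `Normalize((0.5,), (0.5,))` performs per channel.

```python
import torch

# Stand-in for the output of ToTensor(): one 1 x 28 x 28 image with values in [0, 1).
image = torch.rand(1, 28, 28)

mean, std = 0.5, 0.5
# The same arithmetic that Normalize((0.5,), (0.5,)) applies to each channel.
normalized = (image - mean) / std

print(image.min().item(), image.max().item())
print(normalized.min().item(), normalized.max().item())
```

The first line printed stays within [0, 1] and the second within [-1, 1], which is the range the model will see.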


In [0]:
transform = transforms.Compose([transforms.ToTensor(), 
                                transforms.Normalize((0.5,), (0.5,)),])

We want to feed our model in chunks, so let's define `batch_size` to divide our dataset into those chunks.

With the transforms ready, we define a suitable batch size. A larger batch size means
fewer training steps per epoch, so each epoch usually finishes faster, but it also means
higher memory requirements.

In [0]:
batch_size = 64

We will pull the dataset from `torchvision`, apply the transform, and
create batches.

TorchVision comes with a lot of popular datasets in its `datasets` module; if a dataset is not
available on the machine, it will download it for you, apply the transformations, and convert
the data into the desired format for the model to train on.

In our case, the dataset comes
with a training and testing set, and we load them accordingly. 

We use
`torch.utils.data.DataLoader` to load this processed data in batches, along with
other operations such as shuffling. Moving each batch to the right device, CPU or GPU, is a separate step in the training loop.
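
To see what the loader does with a batch size of 64, here is a small sketch that uses a synthetic dataset instead of the downloaded one. The `TensorDataset` of 200 random "images" and the variable names are assumptions made for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset: 200 samples of shape 1 x 28 x 28.
images = torch.rand(200, 1, 28, 28)
labels = torch.randint(0, 10, (200,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=64, shuffle=True)

# 200 samples in batches of 64 give 4 batches; the last one holds the 8 leftovers.
batch_sizes = [image_batch.shape[0] for image_batch, _ in loader]
print(len(loader), batch_sizes)
```

Shuffling changes which samples land in which batch on every pass, but not the number or size of the batches.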

We will first create a training dataset.

In [4]:
trainset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', 
                                 download=True, train=True, transform=transform)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-images-idx3-ubyte.gz
Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting /root/.pytorch/F_MNIST_data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to /root/.pytorch/F_MNIST_data/FashionMNIST/raw
Processing...
Done!

In [0]:
trainloader = torch.utils.data.DataLoader(trainset, 
                                          batch_size=batch_size, shuffle=True)

Next, we create the `testset`:

In [0]:
testset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', 
                                download=True, train=False, transform=transform)

In [0]:
testloader = torch.utils.data.DataLoader(testset, 
                                         batch_size=batch_size, shuffle=True)

Now our main task is to define the neural network class, which has to be a
subclass of `nn.Module`.

We could define the model class with any name, but what is important is that it is a
subclass of `nn.Module` and has `super().__init__()`, which provides the model with a
lot of useful methods and attributes and retains knowledge of the architecture.

```python
class FashionNetwork(nn.Module):
  ```
We define the `__init__` method for the class:
```python
def __init__(self):
  super().__init__()
```
We define the layers for our model within `__init__`. The first hidden layer
looks like the following:
```python
self.hidden1 = nn.Linear(784, 256)
```
We use `nn.Linear()` to define fully connected layers by passing in the input and output
dimensions. The `hidden1` layer takes 784 input units and gives out 256 output units. The
`hidden2` layer outputs 128 units, and the output layer has 10 output units representing the 10
output classes. We use ReLU activation in the layers before the output layer to learn nonlinearity in
the data, and a softmax layer on the output of the last layer, which converts the activations
into probabilities so that they add up to 1 along dimension 1, that is, across the 10 classes
for each sample.

Define the second hidden layer:
```python
self.hidden2 = nn.Linear(256, 128)
```
Define our output layer:
```python
self.output = nn.Linear(128, 10)
```
Define our softmax activation for our last layer:
```python
self.softmax = nn.Softmax(dim=1)
```
Finally, we will define the activation function in the inner layers:
```python
self.activation = nn.ReLU()
```
With these steps, we have completed our network units.
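
As a quick sanity check on the layer sizes, the sketch below counts the trainable parameters of three `nn.Linear` layers with the same dimensions. The dictionary and variable names are only illustrative; the layers are created on their own here rather than inside the class.

```python
from torch import nn

# The three fully connected layers described above.
layers = {
    "hidden1": nn.Linear(784, 256),
    "hidden2": nn.Linear(256, 128),
    "output": nn.Linear(128, 10),
}

# Each layer holds in_features * out_features weights plus out_features biases.
counts = {
    name: sum(p.numel() for p in layer.parameters())
    for name, layer in layers.items()
}
total = sum(counts.values())

print(counts)
print(total)
```

Almost all of the parameters sit in the first layer, because it connects every one of the 784 input pixels to each of its 256 units.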

In [0]:
class FashionNetwork(nn.Module):

  def __init__(self):
    super().__init__()

    self.hidden1 = nn.Linear(784, 256)
    self.hidden2 = nn.Linear(256, 128)
    self.output = nn.Linear(128, 10)
    self.softmax = nn.Softmax(dim=1)
    self.activation = nn.ReLU()

### **Fully Connected Network**
We will expand on the class that we defined in the previous section.

So far, we have only created the
components of the architecture that we need; now we will look at tying all these pieces
together to make a sensible network.

The progression for our layers will be from 784 units
to 256, then to 128, and finally the output layer of 10 units.

The network is completed by defining a `forward` method, in which we tie
together the network components defined in the constructor. A network defined with
`nn.Module` needs to have a `forward()` method defined. It takes the input tensor and
passes it through the network components defined in the `__init__()` method of the
network class, in the sequence of operations defined in the forward method.

The `forward` method is called automatically when the model object is called with an
input, as in `model(x)`.

The `forward` function is where all the magic happens. This is where the data enters and is fed into the computation graph (i.e., the neural network structure we have built).

The `nn.Linear` layers automatically create the weight and bias tensors that
we'll use in the forward method, and `nn.Module` keeps track of them as model parameters.

The linear unit by itself defines a linear function, such
as $xW^T + b$; to have nonlinear capabilities, we need to insert nonlinear activation functions,
and here we use one of the most popular activation functions, `ReLU`, although you could
use other available activation functions in PyTorch.
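
The linear function can be verified directly. In the sketch below, a small `nn.Linear(4, 3)` layer (sizes chosen only to keep the example readable) is compared against the same computation written out by hand. Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, which is why the weight is transposed.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(4, 3)   # 4 input features, 3 output features
x = torch.randn(2, 4)     # a batch of 2 samples

# The same computation written out by hand: x times the transposed weight, plus the bias.
manual = x @ layer.weight.T + layer.bias

# ReLU keeps positive values and replaces negative values with 0.
activated = torch.relu(manual)

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(layer(x), manual))
print(activated)
```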

Let's start with the `forward()` method in the class, passing in the input:

```python
def forward(self, x):
```

We will pass the input to the first hidden layer, with 256 nodes:
```python
x = self.hidden1(x)
```
Next, we pass the outputs from the first hidden layer through the activation
function, which in our case is ReLU:
```python
x = self.activation(x)
```

We will repeat the same for the second layer, which has 128 nodes, and pass it
through ReLU:
```python
x = self.hidden2(x)
x = self.activation(x)
```

Now we pass the result to the output layer, with 10 output classes:
```python
x = self.output(x)
```
Then we will pass the output through the softmax function:
```python
output = self.softmax(x)
```
We return the output tensor:
```python
return output
```

We will then create the network object:
```python
model = FashionNetwork()
```
Our input layer has 784 units (from 28 x 28 pixels), and the first layer has 256 units with
ReLU activation, then 128 units with ReLU activation, and finally 10 units with softmax
activation. The reason we squish the final layer output through softmax is that we want
one output class to have a higher probability than all the other classes, and the sum of
the output probabilities should equal 1. The softmax function has a parameter `dim=1` that
ensures that softmax is taken across the class dimension, so that the 10 probabilities in each
row (one row per image) sum to 1. We then create an object
using the model class and print the details of the class using `print(model)`.
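
The role of `dim=1` is easy to check on a tiny example. The logits below are made-up values for a batch of two samples and three classes, not outputs of the network.

```python
import torch
from torch import nn

# Two samples (rows) with three class scores each (columns).
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 0.5]])

softmax = nn.Softmax(dim=1)
probs = softmax(logits)

# With dim=1, every row is turned into a probability distribution over the classes.
row_sums = probs.sum(dim=1)

print(probs)
print(row_sums)
```

The first row puts the highest probability on the class with the largest score, and the second row, where all scores are equal, is split evenly.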

Let's have a quick look at our model:
```python
print(model)
FashionNetwork(
    (hidden1): Linear(in_features=784, out_features=256, bias=True)
    (hidden2): Linear(in_features=256, out_features=128, bias=True)
    (output): Linear(in_features=128, out_features=10, bias=True)
    (softmax): Softmax(dim=1)
    (activation): ReLU()
)
```

We can also define the network architecture without defining a network class by using the
`nn.Sequential` module. When we do define a class, it is important to ensure that the sequence
of operations in the forward method is ordered properly, although the order of the definitions
doesn't matter in `__init__`. You can use `nn.Tanh` for tanh activation. You can access the weight
and bias tensors of a layer from the model object, for example with `model.hidden1.weight` and
`model.hidden1.bias`.
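
For comparison, here is a sketch of the same stack of layers expressed with `nn.Sequential`. The name `sequential_model` and the random input batch are assumptions for illustration; in a `Sequential` container the layers are applied in exactly the order in which they are listed and are accessed by index rather than by attribute name.

```python
import torch
from torch import nn

# The same 784 -> 256 -> 128 -> 10 progression, applied in the listed order.
sequential_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
    nn.Softmax(dim=1),
)

# A random batch of 64 flattened "images" stands in for real data.
batch = torch.rand(64, 784)
probs = sequential_model(batch)

print(probs.shape)
# Layers are indexed by position: index 0 is the first Linear layer.
print(sequential_model[0].weight.shape, sequential_model[0].bias.shape)
```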

In [0]:
class FashionNetwork(nn.Module):

  def __init__(self):
    super().__init__()

    self.hidden1 = nn.Linear(784, 256)
    self.hidden2 = nn.Linear(256, 128)
    self.output = nn.Linear(128, 10)
    self.softmax = nn.Softmax(dim=1)
    self.activation = nn.ReLU()

  def forward(self, x):
    x = self.hidden1(x)
    x = self.activation(x)
    x = self.hidden2(x)
    x = self.activation(x)
    x = self.output(x)
    output = self.softmax(x)
    return output

In [0]:
model = FashionNetwork()

In [11]:
print(model)

FashionNetwork(
  (hidden1): Linear(in_features=784, out_features=256, bias=True)
  (hidden2): Linear(in_features=256, out_features=128, bias=True)
  (output): Linear(in_features=128, out_features=10, bias=True)
  (softmax): Softmax(dim=1)
  (activation): ReLU()
)


### **Defining the loss function**

A machine learning model, when being trained, may have some deviation between the
predicted output and the actual output, and this difference is called the **error** of the model.

The function that lets us calculate this error is called the **loss function**, or **error function**.
This function provides a metric to evaluate all possible solutions and choose the most
optimized model. 

The loss function has to be able to reduce all attributes of the model
down to a single number so that an improvement in that loss function value is
representative of a better model.

Let us define a loss function for our fashion dataset using the loss function
available in PyTorch.

We will define our loss function thus:

* First, we will modify our existing network architecture to output the log of
softmax instead of softmax, starting with the `__init__` method of the
network:

```python
self.log_softmax = nn.LogSoftmax(dim=1)
```

* Next, we will make the same change in the forward method of the neural
network:
```python
output = self.log_softmax(x)
```

The following cell shows our final code.

In [0]:
class FashionNetwork(nn.Module):
  def __init__(self):
    super().__init__()
    self.hidden1 = nn.Linear(784, 256)
    self.hidden2 = nn.Linear(256, 128)
    self.output = nn.Linear(128, 10)
    self.log_softmax = nn.LogSoftmax(dim=1)
    self.activation = nn.ReLU()

  def forward(self, x):
    x = self.hidden1(x)
    x = self.activation(x)
    x = self.hidden2(x)
    x = self.activation(x)
    x = self.output(x)
    output = self.log_softmax(x)
    return output

We define the model object as follows:

In [13]:
model = FashionNetwork(); model

FashionNetwork(
  (hidden1): Linear(in_features=784, out_features=256, bias=True)
  (hidden2): Linear(in_features=256, out_features=128, bias=True)
  (output): Linear(in_features=128, out_features=10, bias=True)
  (log_softmax): LogSoftmax(dim=1)
  (activation): ReLU()
)

Now, we will define our loss function; we will use negative log likelihood loss
for this:

In [0]:
criterion = nn.NLLLoss()

We now have our loss function ready.

We replaced softmax with log softmax so that we could work with the log of
probabilities rather than the probabilities themselves, which has nice theoretical interpretations.

There are various
reasons for doing this, including improved numerical stability and better-behaved gradients
during optimization. These advantages can be extremely important when training a model that
is computationally challenging and expensive.

Furthermore, it heavily penalizes the model
when it assigns a low probability to the correct class.
We use negative log likelihood together with log softmax because `nn.NLLLoss` expects
log-probabilities as its input, which a plain softmax output does not provide.

It is useful for classification among n classes. Taking the log ensures
that we are not dealing with very small values between 0 and 1, and negating it makes the
loss positive, because the logarithm of a probability that is less than 1 is negative.

Our goal is to
reduce this negative log loss error function. In PyTorch, the loss function is conventionally
called a criterion, and so we named our loss function `criterion`.

We can provide an optional argument, weight, that has to be a 1D tensor that assigns
weights to each of the output classes to deal with unbalanced training sets.
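
The sketch below works through the loss on a tiny made-up batch (random scores for 4 samples and 10 classes, with arbitrary labels). It shows that `nn.NLLLoss` on log softmax outputs is the mean of the negated log-probabilities of the correct classes, that this matches `nn.CrossEntropyLoss` applied to the raw scores, and how the optional `weight` tensor is passed. The class weights used here are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up raw scores for a batch of 4 samples and 10 classes.
logits = torch.randn(4, 10)
labels = torch.tensor([3, 0, 9, 3])

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)

# By hand: pick the log-probability of the correct class, negate, and average.
picked = -log_probs[torch.arange(4), labels]
manual = picked.mean()

# CrossEntropyLoss combines log softmax and NLLLoss in a single step.
cross_entropy = nn.CrossEntropyLoss()(logits, labels)

# Optional per-class weights: here class 3 counts twice as much as the others.
class_weights = torch.ones(10)
class_weights[3] = 2.0
weighted = nn.NLLLoss(weight=class_weights)(log_probs, labels)

print(nll.item(), manual.item(), cross_entropy.item(), weighted.item())
```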

### **Implementing optimizers**
**Backpropagation** is a method by
which neural networks learn from errors; the errors are used to modify the weights in such
a way that the errors are minimized.

Optimization functions are responsible for modifying the
weights to reduce the error. They use the partial derivatives of the error
with respect to the weights, which are calculated during backpropagation. The derivative points
in the direction of a positive slope, that is, of increasing error, and so we
need to move in the opposite direction of the gradient.

The optimizer function combines the model
parameters and loss function to iteratively modify the model parameters to reduce the
model error. Optimizers can be thought of as fiddling with the model weights to get the
best possible model based on the difference in prediction from the model and the actual
output, and the loss function acts as a guide by indicating when the optimizer is going right
or wrong.

The **learning rate** is a hyperparameter of the optimizer that controls the amount by
which the weights are updated. If the learning rate is too high, the weights are updated
by such a huge amount that the algorithm fails to converge and the error gets bigger and
bigger; if it is too low, the updates are so small that it
takes forever to reach the minimum of the cost function/error function.

![picture](https://drive.google.com/uc?id=1CY1o1eGIH09_nwhQyWNJBIOfdyMBagxA)
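
The effect of the learning rate can be seen on a toy problem. The sketch below is not the Adam optimizer used in this section; it is plain gradient descent on the made-up one-parameter function $(w - 3)^2$, whose minimum is at $w = 3$. The helper name `descend` and the three learning rates are assumptions chosen for illustration.

```python
import torch

def descend(learning_rate, steps=50):
    """Run plain gradient descent on (w - 3)^2, starting from w = 0."""
    w = torch.tensor(0.0, requires_grad=True)
    for _ in range(steps):
        loss = (w - 3.0) ** 2
        loss.backward()                    # gradient of the loss with respect to w
        with torch.no_grad():
            w -= learning_rate * w.grad    # step against the gradient
        w.grad.zero_()                     # clear the gradient for the next step
    return w.item()

well_chosen = descend(0.1)     # settles very close to the minimum at 3
too_small = descend(0.001)     # moves toward 3, but far too slowly
too_large = descend(1.1)       # overshoots more on every step and diverges

print(well_chosen, too_small, too_large)
```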



We will start by importing the `optim` module:
```python
from torch import optim
```
Next, we will create an optimizer object. We will use the Adam optimizer and pass
model parameters:
```python
optimizer = optim.Adam(model.parameters())
```
To check for the defaults of the optimizer, you can do the following:
```python
optimizer.defaults
```
Output:
```python
{'lr': 0.001,
'betas': (0.9, 0.999),
'eps': 1e-08,
'weight_decay': 0,
'amsgrad': False}
```
You can also add the learning rate as an additional parameter:
```python
optimizer = optim.Adam(model.parameters(), lr=3e-3)
```
Now we will start training our model, starting with the number of epochs:
```python
epoch = 10
```
We will then start the loop:
```python
for _ in range(epoch):
```
We initialize `running_loss` as 0:
```python
running_loss = 0
```
We will iterate through each batch of images and labels in the training loader, which we defined
earlier in this chapter:
```python
for image, label in trainloader:
  ```
We then reset the gradients to zero:
```python
optimizer.zero_grad()
```
Next, we will reshape the image:
```python
image = image.view(image.shape[0],-1)
```
Then we get the prediction from the model:
```python
pred = model(image)
```
Then we calculate the loss/error:
```python
loss = criterion(pred, label)
```
Then we call the .backward() method on the loss:
```python
loss.backward()
```
Then we call the .step() method on the optimizer:
```python
optimizer.step()
```
Then we append to the running loss:
```python
running_loss += loss.item()
```
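Finally, we will print the average loss after each epoch, in the `else` clause of the inner `for` loop, which runs once the loop over the batches has finished: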
```python
else:
    print(f'Training loss: {running_loss/len(trainloader):.4f}')
```
The following is a sample output:
```python
Training loss: 0.4978
Training loss: 0.3851
Training loss: 0.3498
Training loss: 0.3278
Training loss: 0.3098
Training loss: 0.2980
Training loss: 0.2871
Training loss: 0.2798
Training loss: 0.2717
Training loss: 0.2596
```
Now we have completed the training.

In [0]:
from torch import optim
optimizer = optim.Adam(model.parameters())

In [16]:
optimizer.defaults

{'amsgrad': False,
 'betas': (0.9, 0.999),
 'eps': 1e-08,
 'lr': 0.001,
 'weight_decay': 0}

In [0]:
optimizer = optim.Adam(model.parameters(), lr=3e-3)

In [0]:
epoch = 10

In [31]:
for _ in range(epoch):
  running_loss = 0
  for image, label in trainloader:
    optimizer.zero_grad()
    image = image.view(image.shape[0],-1)
    pred = model(image)
    loss = criterion(pred, label)
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
  else:
    print(f'Training loss: {running_loss/len(trainloader):.4f}')


Training loss: 0.1955
Training loss: 0.1947
Training loss: 0.1933
Training loss: 0.1926
Training loss: 0.1934
Training loss: 0.1870
Training loss: 0.1839
Training loss: 0.1793
Training loss: 0.1782
Training loss: 0.1724


We started by defining the optimizer using the **Adam optimizer**, had a look at its default
parameters, and then set a learning rate for the optimizer. We set the number of
epochs to 10 (an epoch is one pass of the model over the whole dataset) and started the
iterations, setting `running_loss` to 0 at the start of each
epoch and iterating over each batch of images within the epoch.
We started each step by clearing the gradients using the `.zero_grad()` method.
PyTorch accumulates gradients on each backward pass, which is useful in some cases, and
so it is important to zero out the gradients to properly update the model parameters.

Next, we reshaped the images by flattening each batch of 64 images (consisting of 28 x 28
pixels in each image) to 784 values per image, thereby changing the tensor shape from
64 x 1 x 28 x 28 to 64 x 784, as our model expects this shape for the input. Next, we sent this
input over to the model and got the output predictions for the batch from the model, and then
passed them to the loss function, also called the criterion; there, it assessed the difference
between the predicted and the actual class.
The `loss.backward()` function calculated the gradients, that is, the partial derivatives of
the error with respect to the weights, and we called the `optimizer.step()` function to
update the weights of the model to adapt to the error that was evaluated. The `.item()`
method pulls a scalar out of a single-element tensor, and so with `loss.item()` we get a
scalar value of the error for the batch, add it to the running loss over all the batches,
and finally print the average loss at the end of the epoch.
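
The accumulation of gradients mentioned above can be observed on a single made-up parameter. In this sketch the gradient of `2 * w` with respect to `w` is always 2, so any other value in `w.grad` comes from accumulation across backward passes.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: the gradient of 2 * w with respect to w is 2.
(2 * w).backward()
first = w.grad.item()

# Second backward pass without resetting: the new gradient is added to the old one.
(2 * w).backward()
accumulated = w.grad.item()

# After zeroing, a backward pass gives the plain gradient again.
w.grad.zero_()
(2 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```

Calling `optimizer.zero_grad()` at the start of each step plays the same role for all of the model parameters at once.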

We can use a callback function called a closure as a parameter for `.step(closure)`; the
closure recomputes the loss, and the optimizer uses it to update the weights. We
could also explore other optimizer functions, such as Adadelta, Adagrad, SGD, and so on,
which are available with PyTorch.
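
Here is a minimal sketch of the closure pattern. It uses a tiny made-up regression problem and `optim.SGD` rather than the Fashion-MNIST model and Adam, purely to keep the example short; the data, learning rate, and number of steps are arbitrary. The closure clears the gradients, recomputes the loss, runs the backward pass, and returns the loss. Optimizers such as `optim.LBFGS` need to re-evaluate the loss several times per step and therefore require a closure, while for most others it is optional.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Tiny made-up regression problem: the target is the sum of the three inputs.
model = nn.Linear(3, 1)
inputs = torch.randn(16, 3)
targets = inputs.sum(dim=1, keepdim=True)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)

def closure():
    # Clear old gradients, recompute the loss, and backpropagate.
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    return loss

initial_loss = criterion(model(inputs), targets).item()
for _ in range(100):
    # step() calls the closure itself and then updates the weights.
    optimizer.step(closure)
final_loss = criterion(model(inputs), targets).item()

print(initial_loss, final_loss)
```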

### **Implementing dropouts**
We will look at implementing dropouts. One of the more common
phenomena that we might encounter while training a neural network model, or any
machine learning model in general, is overfitting. Overfitting happens when a model learns
the data that is given to it for training rather than generalizing on the solution space—that
is, it learns the minute details and noises of the training data, instead of grasping the bigger
picture, and so performs poorly on new data. 

Regularization is the process of preventing
models from overfitting.
Using a dropout is one of the most popular regularization techniques in neural networks, in
which randomly selected neurons are turned off while training—that is, the contribution of
neurons is temporarily removed from the forward pass and the backward pass doesn't
affect the weights, so that no single neuron or subset of neurons gets all the decisive power
of the model; rather, all the neurons are forced to make active contributions to predictions.

Dropouts can be intuitively understood as creating a large number of ensemble models,
learning to capture various features under one big definition of a model.
In this recipe, we will look at how to add dropouts to our model definition to improve the
overall model performance by preventing overfitting. It should be remembered that
dropouts are to be applied only while training; however, when testing and during the
actual prediction, we want all of the neurons to make contributions.
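
The difference between training and evaluation can be seen on a dropout layer by itself. The sketch below passes a tensor of ones (an arbitrary input chosen so that the effect is easy to read) through `nn.Dropout(p=0.25)` in both modes. In training mode, the dropped entries become 0 and the surviving ones are scaled by 1 / (1 - p) so that the expected value stays the same; in evaluation mode, the input passes through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.25)
x = torch.ones(1, 1000)

# Training mode: roughly a quarter of the entries are zeroed,
# and the remaining ones are scaled up by 1 / (1 - 0.25).
drop.train()
train_out = drop(x)
zero_fraction = (train_out == 0).float().mean().item()

# Evaluation mode: dropout does nothing.
drop.eval()
eval_out = drop(x)

print(zero_fraction)
print(train_out.unique())
print(torch.equal(eval_out, x))
```

For a whole model, `model.train()` and `model.eval()` switch every dropout layer inside it in the same way.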

We will start with our initial model definition:
```python
class FashionNetwork(nn.Module):
  def __init__(self):
    super().__init__()
    self.hidden1 = nn.Linear(784, 256)
    self.hidden2 = nn.Linear(256, 128)
    self.output = nn.Linear(128, 10)
    self.log_softmax = nn.LogSoftmax(dim=1)
    self.activation = nn.ReLU()

  def forward(self, x):
    x = self.hidden1(x)
    x = self.activation(x)
    x = self.hidden2(x)
    x = self.activation(x)
    x = self.output(x)
    output = self.log_softmax(x)
    return output
```

Then we will add a dropout to our model's `__init__`:
```python
self.drop = nn.Dropout(p=0.25)
```

Our updated `__init__()` looks as follows:
```python
def __init__(self):
  super().__init__()
  self.hidden1 = nn.Linear(784, 256)
  self.hidden2 = nn.Linear(256, 128)
  self.output = nn.Linear(128, 10)
  self.log_softmax = nn.LogSoftmax(dim=1)
  self.activation = nn.ReLU()
  self.drop = nn.Dropout(p=0.25)

```

Now, we will add dropouts in our `forward()` method:
```python
def forward(self, x):
  x = self.hidden1(x)
  x = self.activation(x)
  x = self.drop(x)
  x = self.hidden2(x)
  x = self.activation(x)
  x = self.drop(x)
  x = self.output(x)
  output = self.log_softmax(x)
  return output
```
We now have a network with dropouts.

We altered the `__init__()` method to add a dropout layer with a dropout
probability of 0.25, which means that, on average, 25% of the neurons in the layer where this
dropout is applied will be turned off randomly. Then, we edited our forward function, applying
the dropout to the first hidden layer, which has 256 units, and then to the second hidden
layer, which has 128 units. In both layers, we applied the dropout after
the activation function. We have to keep in mind that dropouts should be applied only to
hidden layers, so that we do not lose input data or drop outputs.

In [0]:
class FashionNetwork(nn.Module):
  def __init__(self):
    super().__init__()
    self.hidden1 = nn.Linear(784, 256)
    self.hidden2 = nn.Linear(256, 128)
    self.output = nn.Linear(128, 10)
    self.log_softmax = nn.LogSoftmax(dim=1)
    self.activation = nn.ReLU()
    self.drop = nn.Dropout(p=0.25)

  def forward(self, x):
    x = self.hidden1(x)
    x = self.activation(x)
    x = self.drop(x)
    x = self.hidden2(x)
    x = self.activation(x)
    x = self.drop(x)
    x = self.output(x)
    output = self.log_softmax(x)
    return output

### **Implementing functional APIs**
We will explore functional APIs in PyTorch; doing so will allow us to write
cleaner and more concise network architectures and components. We will be looking at
functional APIs and defining models, or a part of a model, with functional APIs.

In the following steps, we use our existing neural network class definition and then rewrite
it using functional APIs:

We will start by making an import:
```python
import torch.nn.functional as F
```
Then we define our FashionNetwork class with `F.relu()` and `F.log_softmax()`:
```python
class FashionNetwork(nn.Module):
  def __init__(self):
    super().__init__()
    self.hidden1 = nn.Linear(784,256)
    self.hidden2 = nn.Linear(256,128)
    self.output = nn.Linear(128,10)

  def forward(self,x):
    x = F.relu(self.hidden1(x))
    x = F.relu(self.hidden2(x))
    x = F.log_softmax(self.output(x), dim=1)
    return x
```
We redefined our model with functional APIs.

We defined the exact same network as before, but replaced the activation
function and the log softmax with `F.relu` and `F.log_softmax`, which
makes our code look a lot cleaner and more concise.

You could also use functional APIs for linear layers with `F.linear()` and for dropouts with
`F.dropout()`, but for dropout you must take care to pass the model
state, for example `training=self.training`, to indicate whether it is in training or evaluation/prediction mode.
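
As a sketch of how the model state can be passed to functional dropout, the class below adds `F.dropout` to the functional version of the network. The class name `FunctionalDropoutNetwork`, the dropout probability of 0.25, and the random input batch are assumptions for illustration. The flag `self.training` is maintained by `nn.Module` and is switched by `model.train()` and `model.eval()`.

```python
import torch
from torch import nn
import torch.nn.functional as F

class FunctionalDropoutNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(784, 256)
        self.hidden2 = nn.Linear(256, 128)
        self.output = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.hidden1(x))
        # self.training is True after model.train() and False after model.eval().
        x = F.dropout(x, p=0.25, training=self.training)
        x = F.relu(self.hidden2(x))
        x = F.dropout(x, p=0.25, training=self.training)
        return F.log_softmax(self.output(x), dim=1)

model = FunctionalDropoutNetwork()
batch = torch.rand(8, 784)   # random stand-in for 8 flattened images

# In evaluation mode dropout is disabled, so repeated calls give identical outputs.
model.eval()
with torch.no_grad():
    first = model(batch)
    second = model(batch)

print(first.shape, torch.equal(first, second))
```

Without the `training=self.training` argument, `F.dropout` defaults to `training=True` and would keep dropping units even during evaluation.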

In [0]:
import torch.nn.functional as F

In [0]:
class FashionNetwork(nn.Module):
  def __init__(self):
    super().__init__()
    self.hidden1 = nn.Linear(784,256)
    self.hidden2 = nn.Linear(256,128)
    self.output = nn.Linear(128,10)
 
  def forward(self,x):
    x = F.relu(self.hidden1(x))
    x = F.relu(self.hidden2(x))
    x = F.log_softmax(self.output(x), dim=1)
    return x