<a href="https://colab.research.google.com/github/rahulsing/pytorch_demo/blob/master/04_pytorch_ClassificationUsingFullyConnectedCNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Classification using Single Neuron: Just like regression using single neuron

**Optimizers**

In this simple neural network that we built in order to predict automobile prices using regression, we updated the model parameters by calculating gradients. However, if you're building complex neural networks, you** won't manually update your model parameters**, **instead**, you'll **use** an **optimizer**. 

We spoke earlier of **updating our model parameter values using a technique called Gradient Descent Optimization**. This is converging on the best value for our model parameters using an optimization algorithm. But so far, we haven't used an optimizer yet in order to update our model parameters. 
When you're constructing neural networks, there are a wide variety of optimizers that you can use with neural networks. You first construct an optimizer object and pass to this optimizer an iterable of all your model parameters. These are the parameters that you want to train in your neural network. It's also possible for you to specify per-parameter options within your optimizer. In the back propagation pass of your training phase, you'll invoke the backward function on your loss as usual. This will use autograd for reverse auto-differentiation in order to compute gradients. All of this is as before. Instead of you applying the gradients to update your model parameters, you can use the optimizer instead. You'll invoke the optimizer. step function, which will then update all of your model parameters for that epoch. The** torch. optim ** model within PyTorch offers a number of different optimizers that you can use. The simplest and the most plain vanilla of all optimizers is the stochastic gradient descent. This basically uses the formula that you see on screen multiplying the learning rate by the gradients in order to calculate the updated parameter values. Other optimizers might use more complex computations in order to update your model weights. There are momentum-based optimizers available which use a momentum vector to accelerate in the direction where gradient is descending.
Momentum-based optimizers tend to work very well in the real world. The gradients at each step are weighed by those in the previous step. This results in a faster convergence of your machine learning model. Optimizers and neural networks are a very popular field of research and there are constantly new ones being developed with more hyper parameters and better performance. A popular optimizer that we'll use in our demos is the **Adam optimizer, which is a momentum-based optimizer**.


Neural Networks for Classification:

Let's now talk about the basic structure of a neural network that we'll set up to perform classification. We know that the simplest possible neural network that we could build is one that we use for linear regression. Linear regression can be performed with just one neuron. The affine transformation alone is enough. The activation function for linear regression is just the identity function. 
Just like we can perform regression with a single neuron, we can use a single neuron to perform classification as well. In this case, the **activation function is the softmax function**, a single neuron can be used** to perform binary classification where the outputs can be true or false, yes or no, 0 or 1 can be divided into 2 classes**. The output of a softmax function that we use in classification problems is a set of probabilities. The probability that the output Y is one category or the other, true or false. Once we get a set of probability values as the output of the softmax function, the actual classification or the prediction from our model is basically the one with the higher probability. If the probability of Y is equal to true is higher, then the output of the classifier is true. If the probability of Y equal to false is higher, the output of the classifier is false. Binary classification can be performed with a single neuron, **but the softmax function can be extended to N-category classification**. So you can have a**ny number of discreet classes as your output** and you can **use the softmax function to calculate probabilities**. Let's assume that you're classifying handwritten digits from 0 to 9, then you can use softmax to get a set of probability values that the output Y is 0 all the way through 9. The o**utput with the highest probability value is the prediction of your classifier**. When you're performing classification using a machine learning model, the output is a set of probabilities and **the last function** that you'll use when you're working with probabilities is the **cross-entropy loss.** *Cross entropy can be defined as the measure of the distance between two probability distributions*, in this case, the measure of the distance between Yactual and Ypredicted. A lower value of cross entropy between Ypredicted and Yactual implies that the labels of the two series are in sync, they're very similar to one another. A high value of cross entropy implies that the labels of the two Yactual and Ypredicted are out of sync. When you're using PyTorch to build a neural network for classification, you'll find that the output layer is often the log softmax layer as opposed to the clean softmax layer. When you're **training the neural network, the log softmax function works out to be slightly more stable and has nicer properties**, which is why it's preferred. When you use the softmax function to output probabilities, you use cross entropy as the loss function. If you're using the log softmax as your output layer, you'll use the NLL loss or the negative log likelihood as your loss function. We've already discussed cross entropy as a measure of the distance between two probability distributions and when we use the softmax function, the objective is to minimize the cross entropy of your predicted and actual labels. Now it turns out that the log softmax with the NLL loss can be considered to be mathematically equivalent to the softmax with the cross-entropy loss. When you're using the softmax function, you simply specify the softmax layer as your output layer. When you use the log softmax, you specify the softmax and then a log rhythm, but in PyTorch there is the LogSoftMax function that you can use directly. ***The LogSoftMax function in PyTorch is more numerically stable during training as compared to the softmax followed by a log function***. So when you're building up a classification model using a neural network, you have the layers of the neural network followed by a softmax that outputs probabilities and you'll minimize the cross-entropy loss. In PyTorch, you'll have the log softmax function and you'll minimize the NLL loss. If you're used to building neural networks on TensorFlow, this might seem a little different. The NLL loss with the log softmax is preferred in PyTorch because it's more numerically stable. Without writing about the nuances, s**oftmax plus cross-entropy loss can be considered to be mathematically equivalent** to the log softmax plus the NLL loss.

In [0]:
import pandas as pd

In [2]:

from google.colab import files

uploaded = files.upload()

Saving train.csv to train.csv


In [3]:
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

User uploaded file "train.csv" with length 61194 bytes


In [4]:
import io

titanic_data=pd.read_csv(io.StringIO(uploaded['train.csv'].decode('utf-8')))
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [0]:
unwanted_features=['PassengerId','Name','Ticket','Cabin','SibSp','Cabin','Embarked','Parch']

In [6]:
titanic_data=titanic_data.drop(unwanted_features,axis=1)
titanic_data.head()


Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [0]:
titanic_data=titanic_data.dropna()

In [0]:
from sklearn import preprocessing

In [0]:
# Encode Categorical values a numeric Values, instead of one-hot encoding
le=preprocessing.LabelEncoder()

In [10]:
titanic_data['Sex']=le.fit_transform(titanic_data['Sex'])
titanic_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,1,22.0,7.25
1,1,1,0,38.0,71.2833
2,1,3,0,26.0,7.925
3,1,1,0,35.0,53.1
4,0,3,1,35.0,8.05


In [0]:
features=['Pclass','Sex','Age','Fare']

In [13]:
titanic_features=titanic_data[features]
titanic_features.head()

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,1,22.0,7.25
1,1,0,38.0,71.2833
2,3,0,26.0,7.925
3,1,0,35.0,53.1
4,3,1,35.0,8.05


In [14]:
# one hot encoding of categrorical variable Pclass, has more than 2 categories, and transposed to muiltiple columns
titanic_features=pd.get_dummies(titanic_features,columns=['Pclass'])
titanic_features.head()

Unnamed: 0,Sex,Age,Fare,Pclass_1,Pclass_2,Pclass_3
0,1,22.0,7.25,0,0,1
1,0,38.0,71.2833,1,0,0
2,0,26.0,7.925,0,0,1
3,0,35.0,53.1,1,0,0
4,1,35.0,8.05,0,0,1


In [18]:
titanic_target=titanic_data[['Survived']]
titanic_target.head()

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0


In [0]:
from sklearn.model_selection import train_test_split
X_train,x_test,Y_train,y_test=train_test_split(titanic_features,titanic_target,test_size=0.2,random_state=0)

In [20]:
X_train.shape,Y_train.shape

((571, 6), (571, 1))

In [0]:
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [0]:
import torch
import numpy as np


In [0]:
Xtrain_=torch.from_numpy(X_train.values).float()
Xtest_=torch.from_numpy(x_test.values).float()

In [0]:
Ytrain_=torch.from_numpy(Y_train.values).view(1,-1)[0]
Ytest_=torch.from_numpy(y_test.values).view(1,-1)[0]

In [0]:
input_size=6
output_size=2
hidden_size=10

In [0]:
class Net(nn.Module):
  def __init__(self):
    super(Net,self).__init__()
    #initating 3 fully connected layers
    self.fc1=nn.Linear(input_size,hidden_size)
    self.fc2=nn.Linear(hidden_size,hidden_size)
    self.fc3=nn.Linear(hidden_size,output_size)
  
  # Forward pass implemntation on input x data
  def forward(self,x):
    #input x to first layer, and then sigmoid activation funcation
    x=F.sigmoid(self.fc1(x))
    # output of first layer to second layer, again a sigmoid activation funcation
    x=F.sigmoid(self.fc2(x))
    # last layer a liner layer with no activation funcation
    x=(self.fc2(x))
    
    #output of 3rd fully connected layer passed to log softmax funcation
    return F.log_softmax(x,dim=-1)
    
    
    

In [0]:
model=Net()

In [0]:
import torch.optim as optim

# optim otimizer, a adaptive learning rate optimizer, works very well in NNs and is very popular
optimizer=optim.Adam(model.parameters())

loss_fn=nn.NLLLoss() 

In [0]:
epoch_data=[]
epochs=1001

In [56]:
# training
# run training on 1000 epochs


for epoch in range(1,epochs): 
  # every epoch zero out the gradient to calculate fresh gradient
  optimizer.zero_grad()
  # calcuate the predication based on the current model values by passing xtrain
  Ypred=model(Xtrain_)
  
  # perfrom backward pass
  # first calculate loss on predication (predication-actual)
  loss=loss_fn(Ypred,Ytrain_)
  # then call loss_fn to calculate the gredient
  loss.backward()
  
  # once grident calculate update the model paratmers by calling the below function
  optimizer.step()
  
  # calculate the loss on test data as well to see how our model peforms on test data
  # we will not call the loss.backward on test data as weights are not updated on test data
  Ypred_test=model(Xtest_)
  loss_test=loss_fn(Ypred_test,Ytest_)
  
  # when we perform classification, the predicated values are in form of probabilty
  # find the value with the heighest probablity - this is our predicated value
  
  _,pred=Ypred_test.data.max(1)
  
  accuracy=pred.eq(Ytest_.data).sum().item()/y_test.values.size
  epoch_data.append([epoch,loss.data.item(),loss_test.data.item(),accuracy])
  
  if epoch%100==0:
      print('epoch - %d (%d%%) train loss - %.2f test - %.2f loss accuracy - %.4f' \
           % (epoch, epoch/150*10,loss.data.item(),loss_test.data.item(),accuracy))



epoch - 100 (6%) train loss - 0.87 test - 0.88 loss accuracy - 0.5594
epoch - 200 (13%) train loss - 0.80 test - 0.81 loss accuracy - 0.5594
epoch - 300 (20%) train loss - 0.75 test - 0.77 loss accuracy - 0.5524
epoch - 400 (26%) train loss - 0.72 test - 0.74 loss accuracy - 0.5524
epoch - 500 (33%) train loss - 0.70 test - 0.71 loss accuracy - 0.6014
epoch - 600 (40%) train loss - 0.68 test - 0.69 loss accuracy - 0.6154
epoch - 700 (46%) train loss - 0.65 test - 0.67 loss accuracy - 0.6643
epoch - 800 (53%) train loss - 0.63 test - 0.65 loss accuracy - 0.6713
epoch - 900 (60%) train loss - 0.60 test - 0.62 loss accuracy - 0.7133
epoch - 1000 (66%) train loss - 0.58 test - 0.60 loss accuracy - 0.7343
