# Making Your First Neural Net in PyTorch

* For this task, I'll use an example text dataset about baseball.

### Loading the data. 
* Before making the neural net, we need to understand the dataset.

In [None]:
# Data is stored as a pickle file
import pickle
import pandas as pd
with open('data/tfidf_1000.pkl','rb') as f:
    data = pickle.load(f)

In [None]:
#Data is a dictionary with 4 fields and 582 members in each field
# raw - emails
# data - term frequency encoding
# target - 1 if about baseball, 0 otherwise
# features - key for the encoding
data

In [None]:
# Let's check out one example
print(data['raw'][0])

In [None]:
#Clearly not about baseball, so its target should be 0
print(data['target'][0])

In [None]:
# Let's see the encoded version
encoded_data = data['data'][0]
encoded_ser = pd.Series(encoded_data, index = data['features'])
encoded_ser

### The task: Using Neural Networks, predict whether a document is about baseball based on its term-frequency encoding.
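
For readers who have not seen this kind of encoding before, here is a minimal sketch of a plain term-frequency vector. The vocabulary and document are made up for illustration, and `term_frequency` is a hypothetical helper, not how this dataset was built (the file name suggests the real data may be TF-IDF weighted, which this sketch does not attempt).

```python
from collections import Counter

def term_frequency(doc, vocabulary):
    """Return the relative frequency of each vocabulary word in doc."""
    tokens = doc.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # One number per vocabulary word, in vocabulary order
    return [counts[word] / total if total else 0.0 for word in vocabulary]

# Hypothetical toy vocabulary and document
vocabulary = ["ball", "bat", "email", "pitcher"]
doc = "the pitcher threw the ball and the bat hit the ball"
tf_vector = term_frequency(doc, vocabulary)
print(dict(zip(vocabulary, tf_vector)))
```

Each document becomes a fixed-length vector of numbers, one per vocabulary word, which is the kind of input a neural network can consume.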

In [None]:
X = data['data']

In [None]:
X

In [None]:
X.shape

In [None]:
y = data['target']

In [None]:
y

In [None]:
y.shape

### The task: Use neural networks to predict y from X.

<img src="data/pic.png">

* Let's build the neural network above.
* Ours will have one big difference though. What is it?
* What does the output layer represent?

### Building the network in PyTorch
* We'll build up piece by piece.



In [None]:
# Import the necessary packages
import torch
import torch.nn as nn
from torch.autograd import Variable

In [None]:
#Neural Nets in PyTorch are classes that you have to make.
#This is the class definition
class My_Neural_Net(nn.Module):
    #put in pass to avoid errors
    pass
    

In [None]:
class My_Neural_Net(nn.Module):    
    #constructor
    def __init__(self):
        super(My_Neural_Net, self).__init__()
        
        # Define the layers. This matches the image above 
        # Except that our input size is 1000 dimensions
        self.layer_1 = nn.Linear(1000, 4)
        self.layer_2 = nn.Linear(4,4)
        self.layer_3 = nn.Linear(4,1)

        # Define activation functions. I'll be using ReLU
        # for the hidden layers. Must use sigmoid for the 
        # final layer so we get a number between 0 and 1 for 
        # the probability of being about baseball.
        # Luckily PyTorch already has ReLU and 
        # sigmoid.
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
        # Define what optimization we want to use.
        # Adam is a popular method so I'll use it.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)


### At this point we have laid the groundwork. The network is defined. Now we must tell PyTorch what going forward through our network means. Here is what it means in pseudocode:

    1. input X
    2. linearly transform X into hidden data 1 via weights
    3. perform ReLU on hidden data
    4. linearly transform hidden data into hidden data 2 via weights
    5. perform ReLU on hidden data
    6. linearly transform hidden data into output layer via weights
    7. perform sigmoid on output data to get f(X) predictions between 0 and 1
    8. output f(X) which should approximate y
    
### In PyTorch:

In [None]:
    # 1. input X
    def forward(self, X):
        # 2. linearly transform X into hidden data 1 via weights
        X = self.layer_1(X)
        # 3. perform ReLU on hidden data
        X = self.relu(X)
        # 4. linearly transform hidden data into hidden data 2 via weights
        X = self.layer_2(X)
        # 5. perform ReLU on hidden data
        X = self.relu(X)
        # 6. linearly transform hidden data into output layer via weights
        X = self.layer_3(X)
        # 7. perform sigmoid on output data to get f(X) predictions between 0 and 1
        X = self.sigmoid(X)
        
        # 8. output predictions
        return X

### Now this needs to go into our class

In [None]:
class My_Neural_Net(nn.Module):    
    #constructor
    def __init__(self):
        super(My_Neural_Net, self).__init__()
        
        # Define the layers. This matches the image above 
        # Except that our input size is 1000 dimensions
        self.layer_1 = nn.Linear(1000, 4)
        self.layer_2 = nn.Linear(4,4)
        self.layer_3 = nn.Linear(4,1)

        # Define activation functions. I'll be using ReLU
        # for the hidden layers. Must use sigmoid for the 
        # final layer so we get a number between 0 and 1 for 
        # the probability of being about baseball.
        # Luckily PyTorch already has ReLU and 
        # sigmoid.
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
        # Define what optimization we want to use.
        # Adam is a popular method so I'll use it.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
        
    # 1. input X
    def forward(self, X):
        # 2. linearly transform X into hidden data 1 via weights
        X = self.layer_1(X)
        # 3. perform ReLU on hidden data
        X = self.relu(X)
        # 4. linearly transform hidden data into hidden data 2 via weights
        X = self.layer_2(X)
        # 5. perform ReLU on hidden data
        X = self.relu(X)
        # 6. linearly transform hidden data into output layer via weights
        X = self.layer_3(X)
        # 7. perform sigmoid on output data to get f(X) predictions between 0 and 1
        X = self.sigmoid(X)
        
        # 8. output predictions
        return X


### Defining Loss: Now we need to define how PyTorch knows how good our predictions are, i.e. our loss at each iteration.

* For a classification task you will typically use cross-entropy loss. However, you can define any loss function you want.
* For example, maybe you want to modify cross entropy loss to weight false positives higher than false negatives or vice versa. You could do that.
* For the popular loss functions, such as vanilla cross entropy loss and mean squared error, there is already a PyTorch function. These functions are highly optimized, so it is recommended to use them when they exist.
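
As an illustration of the custom-loss idea, here is a minimal sketch of a binary cross entropy that weights false positives more heavily than false negatives. The function name `weighted_bce` and the weights `fp_weight` and `fn_weight` are assumptions made up for this example, not part of PyTorch.

```python
import torch

def weighted_bce(pred, true, fp_weight=2.0, fn_weight=1.0, eps=1e-7):
    # Keep predictions away from exactly 0 or 1 so log() stays finite
    pred = pred.clamp(eps, 1 - eps)
    # Active when the true label is 1 (penalizes false negatives)
    fn_term = -fn_weight * true * torch.log(pred)
    # Active when the true label is 0 (penalizes false positives)
    fp_term = -fp_weight * (1 - true) * torch.log(1 - pred)
    return (fn_term + fp_term).mean()

# Two toy predictions of 0.9: one correct (label 1), one a false positive (label 0)
pred = torch.tensor([[0.9], [0.9]])
true = torch.tensor([[1.0], [0.0]])
plain = weighted_bce(pred, true, fp_weight=1.0)
weighted = weighted_bce(pred, true, fp_weight=2.0)
print(plain.item(), weighted.item())
```

With both weights set to 1 this reduces to ordinary binary cross entropy. Raising `fp_weight` makes the false positive cost more.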

<img src="data/cross_entropy.png">



* Cross entropy loss is very high when the true label is a 1 but the predicted probability is close to 0, and vice versa.
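
To make that concrete, here is a small sketch that computes binary cross entropy by hand for one confident correct prediction and one confident wrong prediction, then compares the average with `nn.BCELoss`. The probabilities are toy values chosen for illustration, and `bce_single` is a hypothetical helper.

```python
import math
import torch
import torch.nn as nn

def bce_single(p, y):
    # Cross entropy for one example: -[y*log(p) + (1-y)*log(1-p)]
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_right = bce_single(0.99, 1)   # true label 1, predicted 0.99: small loss
confident_wrong = bce_single(0.01, 1)   # true label 1, predicted 0.01: large loss
print(confident_right, confident_wrong)

# nn.BCELoss averages the same quantity over all examples
pred = torch.tensor([[0.99], [0.01]])
true = torch.tensor([[1.0], [1.0]])
torch_loss = nn.BCELoss()(pred, true).item()
by_hand = (confident_right + confident_wrong) / 2
print(torch_loss, by_hand)
```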

In [None]:
    def loss(self, pred, true):
        #PyTorch's own cross entropy loss function.
        score = nn.BCELoss()
        return score(pred, true)
    


### Put the loss into our class

In [None]:
class My_Neural_Net(nn.Module):    
    #constructor
    def __init__(self):
        super(My_Neural_Net, self).__init__()
        
        # Define the layers. This matches the image above 
        # Except that our input size is 1000 dimensions
        self.layer_1 = nn.Linear(1000, 4)
        self.layer_2 = nn.Linear(4,4)
        self.layer_3 = nn.Linear(4,1)

        # Define activation functions. I'll be using ReLU
        # for the hidden layers. Must use sigmoid for the 
        # final layer so we get a number between 0 and 1 for 
        # the probability of being about baseball.
        # Luckily PyTorch already has ReLU and 
        # sigmoid.
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
        # Define what optimization we want to use.
        # Adam is a popular method so I'll use it.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
        
    # 1. input X
    def forward(self, X):
        # 2. linearly transform X into hidden data 1
        X = self.layer_1(X)
        # 3. perform ReLU on hidden data
        X = self.relu(X)
        # 4. linearly transform hidden data into hidden data 2
        X = self.layer_2(X)
        # 5. perform ReLU on hidden data
        X = self.relu(X)
        # 6. linearly transform hidden data into output layer
        X = self.layer_3(X)
        # 7. perform sigmoid on output data to get f(X) predictions between 0 and 1
        X = self.sigmoid(X)
        
        # 8. output predictions
        return X
    
    def loss(self, pred, true):
        #PyTorch's own cross entropy loss function.
        score = nn.BCELoss()
        return score(pred, true)

* In other frameworks, now would be the time where you need to define a backwards pass through your network. 
* Not in PyTorch. 
* Once we define forward, PyTorch can automatically compute derivatives, so we don't need to calculate the formulas by hand.

### Essentially: We don't need a `backward()` function!
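
A tiny sketch of what that automatic differentiation looks like, using a single made-up parameter `w` rather than our network. We only write the forward computation; PyTorch fills in the derivative.

```python
import torch

# A single trainable number, tracked by autograd
w = torch.tensor(3.0, requires_grad=True)

# Forward computation only: loss = (2w - 4)^2
loss = (2 * w - 4) ** 2

# PyTorch derives d(loss)/dw from the operations recorded above
loss.backward()

# By hand: d/dw (2w - 4)^2 = 2 * (2w - 4) * 2 = 8 at w = 3
print(w.grad)
```

Calling `backward()` on a loss tensor is a method PyTorch provides on tensors, which is why our class never has to define one.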

### What we do need now is a function that tells PyTorch how to fit the network. Here is the pseudocode:
    
    1. input: N - number of iterations to train, X - data, y - target
    2. for n going from 0 to N - 1:
        3. f(X) = forward(X)
        4. l = loss(f(X),y)
        5. update weights in model to decrease loss and make f(X) closer to y
  
* How to update those weights is done by back propagation and gradient descent, which are outside the scope of this tutorial.
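
Without going into the theory, step 5 can be sketched on a toy problem with one weight. This uses plain gradient descent with a hand-written update (not the Adam optimizer our class uses), and the learning rate `lr` is a made-up value for this example.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # step size (hypothetical value for this toy problem)

for step in range(50):
    loss = (w - 2.0) ** 2      # smallest when w == 2
    loss.backward()            # fills w.grad with d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad       # move against the gradient
    w.grad.zero_()             # gradients accumulate, so clear them each step

print(w.item())
```

The weight moves toward 2, the value that minimizes the loss. An optimizer such as `torch.optim.Adam` wraps the update and the gradient reset in `step()` and `zero_grad()`.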

In [None]:
# Writing a fit function

    # 1. input: N - number of iterations to train, X - data, y - target
    def fit(self,X,y,N = 5000):
        
        # 2. for n going from 0 to N -1 :
        for epoch in range(N):
            
            # Reset gradients, since PyTorch accumulates them across backward() calls
            self.optimizer.zero_grad()
            
            # 3. f(X) = forward(X) 
            pred = self.forward(X)
            
            # 4. l = loss(f(X),y)
            l = self.loss(pred, y)
            
            #print loss
            print(l)
            
            # 5. Back propagation
            l.backward()
            # 5. Gradient Descent
            self.optimizer.step()

* Without early stopping, dropout, or regularization, the network can overfit badly, but not everything fits into one lecture.

# Let's put the fit function into our class.

In [None]:
class My_Neural_Net(nn.Module):    
    #constructor
    def __init__(self):
        super(My_Neural_Net, self).__init__()
        
        # Define the layers. This matches the image above 
        # Except that our input size is 1000 dimensions
        self.layer_1 = nn.Linear(1000, 4)
        self.layer_2 = nn.Linear(4,4)
        self.layer_3 = nn.Linear(4,1)

        # Define activation functions. I'll be using ReLU
        # for the hidden layers. Must use sigmoid for the 
        # final layer so we get a number between 0 and 1 for 
        # the probability of being about baseball.
        # Luckily PyTorch already has ReLU and 
        # sigmoid.
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
        # Define what optimization we want to use.
        # Adam is a popular method so I'll use it.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
        
    # 1. input X
    def forward(self, X):
        # 2. linearly transform X into hidden data 1
        X = self.layer_1(X)
        # 3. perform ReLU on hidden data
        X = self.relu(X)
        # 4. linearly transform hidden data into hidden data 2
        X = self.layer_2(X)
        # 5. perform ReLU on hidden data
        X = self.relu(X)
        # 6. linearly transform hidden data into output layer
        X = self.layer_3(X)
        # 7. perform sigmoid on output data to get f(X) predictions between 0 and 1
        X = self.sigmoid(X)
        
        # 8. output predictions
        return X
    
    def loss(self, pred, true):
        #PyTorch's own cross entropy loss function.
        score = nn.BCELoss()
        return score(pred, true)
    

    # 1. input: N - number of iterations to train, X - data, y - target
    def fit(self,X,y,N = 5000):
        
        # 2. for n going from 0 to N -1 :
        for epoch in range(N):
            
            # Reset gradients, since PyTorch accumulates them across backward() calls
            self.optimizer.zero_grad()
            
            # 3. f(X) = forward(X) 
            pred = self.forward(X)
            
            # 4. l = loss(f(X),y)
            l = self.loss(pred, y)
            
            #print loss
            print(l)
            
            # 5. Back propagation
            l.backward()
            # 5. Gradient Descent
            self.optimizer.step()
            

# What function are we missing?
* Once we've fit our model we probably want to use it to predict
    * Ex: Predict whether a document is about baseball.
* In practice, we typically split this into 2 functions.
1. `predict_proba` outputs probabilities
    * Our forward function outputs a vector of probabilities of being a `1`. For compatibility with `sklearn` functions, it's a good idea to output a matrix where the first column is the probability of being a `0` and the second column is the probability of being a `1`. This is easy to create, since the first and second columns must add to 1 in each row.
2. `predict` outputs class values
    * Generally, if `predict_proba` gives an example a probability of `0.5` or greater of being a `1`, we predict a `1`, and a `0` otherwise.
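
The two-column layout and the 0.5 threshold described above can be sketched on toy numbers. The probabilities below are made up and stand in for the output of `forward`.

```python
import torch

# Hypothetical forward() output for three documents: P(label == 1), shape (3, 1)
prob_1 = torch.tensor([[0.8], [0.5], [0.1]])
prob_0 = 1 - prob_1

# Side by side: column 0 is P(0), column 1 is P(1), shape (3, 2)
probs = torch.cat((prob_0, prob_1), dim=1)

# Threshold the second column at 0.5 to get class labels
preds = (probs[:, 1:] >= 0.5).int()

print(probs.shape, preds.flatten().tolist())
```

Note that slicing with `1:` rather than `1` keeps the result as a column of shape `(3, 1)`.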
    
### Writing `predict_proba` and `predict`

In [None]:
    def predict_proba(self, X):
        # probability of being a 1
        prob_1 = self.forward(X)
              
        # vectorwise subtraction
        prob_0 = 1 - prob_1
        
        # make into a matrix
        probs = torch.cat((prob_0,prob_1), dim = 1)
        
        return probs

In [None]:
    def predict(self, X):
        probs = self.predict_proba(X)
        
        # get only second column (probability of being a 1)
        probs_1 = probs[:,1:]
        
        # 1 if prob_1 is greater than or equal to 0.5 for a given example
        # 0 if less than 0.5
        preds = (probs_1 >= 0.5).int()
        
        return preds
    

### Again, add to class.

In [None]:
class My_Neural_Net(nn.Module):    
    #constructor
    def __init__(self):
        super(My_Neural_Net, self).__init__()
        
        # Define the layers. This matches the image above 
        # Except that our input size is 1000 dimensions
        self.layer_1 = nn.Linear(1000, 4)
        self.layer_2 = nn.Linear(4,4)
        self.layer_3 = nn.Linear(4,1)

        # Define activation functions. I'll be using ReLU
        # for the hidden layers. Must use sigmoid for the 
        # final layer so we get a number between 0 and 1 for 
        # the probability of being about baseball.
        # Luckily PyTorch already has ReLU and 
        # sigmoid.
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
        # Define what optimization we want to use.
        # Adam is a popular method so I'll use it.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
        
    # 1. input X
    def forward(self, X):
        # 2. linearly transform X into hidden data 1
        X = self.layer_1(X)
        # 3. perform ReLU on hidden data
        X = self.relu(X)
        # 4. linearly transform hidden data into hidden data 2
        X = self.layer_2(X)
        # 5. perform ReLU on hidden data
        X = self.relu(X)
        # 6. linearly transform hidden data into output layer
        X = self.layer_3(X)
        # 7. perform sigmoid on output data to get f(X) predictions between 0 and 1
        X = self.sigmoid(X)
        
        # 8. output predictions
        return X
    
    def loss(self, pred, true):
        #PyTorch's own cross entropy loss function.
        score = nn.BCELoss()
        return score(pred, true)
    

    # 1. input: N - number of iterations to train, X - data, y - target
    def fit(self,X,y,N = 5000):
        
        # 2. for n going from 0 to N -1 :
        for epoch in range(N):
            
            # Reset gradients, since PyTorch accumulates them across backward() calls
            self.optimizer.zero_grad()
            
            # 3. f(X) = forward(X) 
            pred = self.forward(X)
            
            # 4. l = loss(f(X),y)
            l = self.loss(pred, y)
            #print loss
            print(l)
            
            # 5. Back propagation
            l.backward()
            # 5. Gradient Descent
            self.optimizer.step()
    
    def predict_proba(self, X):
        # probability of being a 1
        prob_1 = self.forward(X)
              
        # vectorwise subtraction
        prob_0 = 1 - prob_1
        
        # make into a matrix
        probs = torch.cat((prob_0,prob_1), dim = 1)
        
        return probs
    
    def predict(self, X):
        probs = self.predict_proba(X)
        
        # get only second column (probability of being a 1)
        probs_1 = probs[:,1:]
        
        # 1 if prob_1 is greater than or equal to 0.5 for a given example
        # 0 if less than 0.5
        preds = (probs_1 >= 0.5).int()
        
        return preds

### One last function we may want to add (optional, but a good idea): A scoring function.
### Tells us how we did after the whole process is over.
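
Accuracy is just the fraction of predictions that match the labels. Here is a minimal sketch on made-up tensors, which also shows a shape pitfall: `predict` returns a column of shape `(n, 1)`, so the labels should be a column too before comparing.

```python
import torch

preds = torch.tensor([[1], [0], [1], [1]], dtype=torch.int32)  # shape (4, 1), like predict()
y_flat = torch.tensor([1.0, 0.0, 0.0, 1.0])                    # shape (4,)

# Reshape the labels to a column so the comparison is element by element
y_col = y_flat.view(-1, 1).int()
acc = (preds == y_col).float().mean()
print(acc.item())

# Without the reshape, broadcasting compares every prediction with every label
print((preds == y_flat.int()).shape)
```

The `.float()` call is needed because the comparison yields a boolean tensor, and `mean()` is not defined for booleans.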

In [None]:
    def score(self, X, y):
        # proportion of times where we're correct
        # conversions just allow the math to work
        acc = (self.predict(X) == y.int()).float().mean()
        
        return acc

# Put it all together

In [None]:
class My_Neural_Net(nn.Module):    
    #constructor
    def __init__(self):
        super(My_Neural_Net, self).__init__()
        
        # Define the layers. This matches the image above 
        # Except that our input size is 1000 dimensions
        self.layer_1 = nn.Linear(1000, 4)
        self.layer_2 = nn.Linear(4,4)
        self.layer_3 = nn.Linear(4,1)

        # Define activation functions. I'll be using ReLU
        # for the hidden layers. Must use sigmoid for the 
        # final layer so we get a number between 0 and 1 for 
        # the probability of being about baseball.
        # Luckily PyTorch already has ReLU and 
        # sigmoid.
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
        # Define what optimization we want to use.
        # Adam is a popular method so I'll use it.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
        
    # 1. input X
    def forward(self, X):
        # 2. linearly transform X into hidden data 1
        X = self.layer_1(X)
        # 3. perform ReLU on hidden data
        X = self.relu(X)
        # 4. linearly transform hidden data into hidden data 2
        X = self.layer_2(X)
        # 5. perform ReLU on hidden data
        X = self.relu(X)
        # 6. linearly transform hidden data into output layer
        X = self.layer_3(X)
        # 7. perform sigmoid on output data to get f(X) predictions between 0 and 1
        X = self.sigmoid(X)
        
        # 8. output predictions
        return X
    
    def loss(self, pred, true):
        #PyTorch's own cross entropy loss function.
        score = nn.BCELoss()
        return score(pred, true)
    

    # 1. input: N - number of iterations to train, X - data, y - target
    def fit(self,X,y,N = 5000):
        
        # 2. for n going from 0 to N -1 :
        for epoch in range(N):
            
            # Reset gradients, since PyTorch accumulates them across backward() calls
            self.optimizer.zero_grad()
            
            # 3. f(X) = forward(X) 
            pred = self.forward(X)
            
            # 4. l = loss(f(X),y)
            l = self.loss(pred, y)
            #print loss
            print(l)
            
            # 5. Back propagation
            l.backward()
            # 5. Gradient Descent
            self.optimizer.step()
    
    def predict_proba(self, X):
        # probability of being a 1
        prob_1 = self.forward(X)
              
        # vectorwise subtraction
        prob_0 = 1 - prob_1
        
        # make into a matrix
        probs = torch.cat((prob_0,prob_1), dim = 1)
        
        return probs
    
    def predict(self, X):
        probs = self.predict_proba(X)
        
        # get only second column (probability of being a 1)
        probs_1 = probs[:,1:]
        
        # 1 if prob_1 is greater than or equal to 0.5 for a given example
        # 0 if less than 0.5
        preds = (probs_1 >= 0.5).int()
        
        return preds
    
    def score(self, X, y):
        # proportion of times where we're correct
        # conversions just allow the math to work
        acc = (self.predict(X) == y.int()).float().mean()
        
        return acc
    

# Fitting our Neural Net

In [None]:
#Create our neural net
neur_net = My_Neural_Net()

# Split into train and test so we can fit on some data and see performance
# on data we haven't seen yet.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

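# Turn X and y (train and test) into PyTorch objects. We always have to do this step.
# forward() returns one column per example, so y is reshaped into a column
# of shape (n_examples, 1) to match it in the loss and in score().
X_train_tens = Variable(torch.Tensor(X_train).float())
X_test_tens = Variable(torch.Tensor(X_test).float())
y_train_tens = Variable(torch.Tensor(y_train).float().view(-1, 1))
y_test_tens = Variable(torch.Tensor(y_test).float().view(-1, 1))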



In [None]:
neur_net.fit(X_train_tens,y_train_tens)

In [None]:
neur_net.score(X_train_tens,y_train_tens)

In [None]:
# We are overfitting, but still good score!
# There are ways to correct overfitting.
neur_net.score(X_test_tens,y_test_tens)

# One last thing:
* At the very beginning we specified that our first layer took the 1000-dimensional input and converted it to the hidden data, which was 4-dimensional.
* What if we didn't want to have to look at the dimensions of our input data every time?
* We can do that in our constructor.

In [None]:
X_train_tens.shape[1]

In [None]:
    #constructor
    #take in X as a parameter
    def __init__(self, X):
        super(My_Neural_Net, self).__init__()
        
        #Find dimensionality of X
        X_dim = X.shape[1]
        
        # Define the layers. This matches the image above 
        # Except that our input size is X_dim dimensions
        self.layer_1 = nn.Linear(X_dim, 4)

# That's all we have to modify

In [None]:
class My_Neural_Net(nn.Module): 
    
    #constructor
    #take in X as a parameter
    def __init__(self, X):
        super(My_Neural_Net, self).__init__()
        
        #Find dimensionality of X
        X_dim = X.shape[1]
        
        # Define the layers. This matches the image above 
        # Except that our input size is X_dim dimensions
        self.layer_1 = nn.Linear(X_dim, 4)
        self.layer_2 = nn.Linear(4,4)
        self.layer_3 = nn.Linear(4,1)

        # Define activation functions. I'll be using ReLU
        # for the hidden layers. Must use sigmoid for the 
        # final layer so we get a number between 0 and 1 for 
        # the probability of being about baseball.
        # Luckily PyTorch already has ReLU and 
        # sigmoid.
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
        # Define what optimization we want to use.
        # Adam is a popular method so I'll use it.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
        
    # 1. input X
    def forward(self, X):
        # 2. linearly transform X into hidden data 1
        X = self.layer_1(X)
        # 3. perform ReLU on hidden data
        X = self.relu(X)
        # 4. linearly transform hidden data into hidden data 2
        X = self.layer_2(X)
        # 5. perform ReLU on hidden data
        X = self.relu(X)
        # 6. linearly transform hidden data into output layer
        X = self.layer_3(X)
        # 7. perform sigmoid on output data to get f(X) predictions between 0 and 1
        X = self.sigmoid(X)
        
        # 8. output predictions
        return X
    
    def loss(self, pred, true):
        #PyTorch's own cross entropy loss function.
        score = nn.BCELoss()
        return score(pred, true)
    

    # 1. input: N - number of iterations to train, X - data, y - target
    def fit(self,X,y,N = 5000):
        
        # 2. for n going from 0 to N -1 :
        for epoch in range(N):
            
            # Reset gradients, since PyTorch accumulates them across backward() calls
            self.optimizer.zero_grad()
            
            # 3. f(X) = forward(X) 
            pred = self.forward(X)
            
            # 4. l = loss(f(X),y)
            l = self.loss(pred, y)
            #print loss
            print(l)
            
            # 5. Back propagation
            l.backward()
            # 5. Gradient Descent
            self.optimizer.step()
    
    def predict_proba(self, X):
        # probability of being a 1
        prob_1 = self.forward(X)
              
        # vectorwise subtraction
        prob_0 = 1 - prob_1
        
        # make into a matrix
        probs = torch.cat((prob_0,prob_1), dim = 1)
        
        return probs
    
    def predict(self, X):
        probs = self.predict_proba(X)
        
        # get only second column (probability of being a 1)
        probs_1 = probs[:,1:]
        
        # 1 if prob_1 is greater than or equal to 0.5 for a given example
        # 0 if less than 0.5
        preds = (probs_1 >= 0.5).int()
        
        return preds
    
    def score(self, X, y):
        # proportion of times where we're correct
        # conversions just allow the math to work
        acc = (self.predict(X) == y.int()).float().mean()
        
        return acc
    

In [None]:
# Only difference now is when we make our instance we need to pass in our X data.

neur_net = My_Neural_Net(X_train_tens)

In [None]:
# Everything else the same

neur_net.fit(X_train_tens,y_train_tens)
neur_net.score(X_test_tens,y_test_tens)

# Let Dr. Cook or me know if you're interested in:

### More advanced usage:
* Early stopping
* Regularization
* Dropout
* Batch training
* Etc.

### More complicated networks
* RNNs/LSTMs
* CNNs
* Others