# PyTorch Neural Network For Wine Quality Classification



A neural network is a computational model based on biological neural networks. Artificial Neural Networks (ANNs) loosely mimic the way biological neural networks receive, process, and transmit information from neuron to neuron. A *neuron* in our ANN belongs to a layer: input, hidden, or output, as shown in the diagram below. Each neuron holds weights that get updated using gradient descent. The neuron receives information, multiplies it by the weights it holds, and then transmits the result to the next layer. The end goal of an ANN is to learn weights that minimize error on our training data. Neural networks are highly useful for complex problems such as image classification, speech recognition, and natural language processing.

<img src="ANN.png">

The input layer takes in the information that is fed in by the user. Every neuron in the input layer represents an independent variable that influences the output of our neural network. The information from our input layer is passed to the hidden layer. The hidden layer consists of neurons that have activation functions applied to them, and it holds *hidden* information that is useful for getting our output value. The hidden layer sits between the input and output layers, and its job is to process the information fed in by the input layer. This layer is responsible for extracting the important features from our input data in order to get the correct output value. In practice, one or two hidden layers are enough for many machine learning problems. The output layer of the ANN collects the information held by the hidden layer and transmits it out of the model.
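
To make the flow from input layer to hidden layer to output layer concrete, here is a minimal sketch in plain Python. The layer sizes (2 inputs, 2 hidden neurons, 1 output), the weights, and the helper names `sigmoid` and `layer_forward` are made up for illustration and are not the values used later in this tutorial. The bias term is also an assumption of the sketch; it mirrors what `nn.Linear` adds by default.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Each neuron takes a weighted sum of its inputs, adds a bias, and applies the activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(weighted_sum))
    return outputs

# Hypothetical inputs and weights, chosen only for illustration
x = [0.5, -1.0]
hidden_weights = [[0.1, 0.4], [-0.3, 0.2]]  # one row of weights per hidden neuron
hidden_biases = [0.0, 0.1]
output_weights = [[0.7, -0.5]]              # one output neuron reading both hidden neurons
output_biases = [0.05]

hidden = layer_forward(x, hidden_weights, hidden_biases)
output = layer_forward(hidden, output_weights, output_biases)
print(hidden, output)
```

Each layer only sees the values produced by the layer before it, which is why the output of one call is the input of the next.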

Before we move on, if you don't have experience with feedforward neural networks and how they are trained, you can read more about the backpropagation algorithm [here](https://en.wikipedia.org/wiki/Backpropagation).

We will cover the following topics in this tutorial:
- Introduction to PyTorch
- Data Loading & Pre-Processing
- Building the DataLoader and Two-Layer ANN
- Training & Testing

## Introduction to PyTorch

In [1]:
import torch
import numpy as np

PyTorch is a Python package that provides two high-level features:

1. Tensor computation (like numpy) with strong GPU acceleration
2. Deep Neural Networks built on a tape-based autodiff system

You can install the package by going to the PyTorch website and selecting your operating system. On a Mac with Anaconda, you can install it using:

`$ conda install pytorch torchvision -c pytorch`

### Tensors

Tensors are data objects in PyTorch. We will be storing our data for the neural network model as Tensors. [Here](http://pytorch.org/docs/master/torch.html#tensors) you can find different initialization functions for Tensors. They're similar to NumPy arrays, and they also carry a type. For example, we can initialize a Tensor from the standard normal distribution, a Tensor of byte zeros, and a Tensor of integer ones, each with 4 rows and 5 columns, and also determine the size of a Tensor (similar to `.shape` in numpy) like this:

In [2]:
# initialize a tensor from the standard normal distribution
x = torch.randn(4, 5)
# initialize a tensor of byte zeros
y = torch.zeros(4, 5).byte()
# initialize a tensor of integer ones
z = torch.ones(4, 5).int()

print(x, y, z)
print(x.size())


-0.6388 -0.7407  0.7655  0.3467  0.0453
-0.3150  0.5700  1.1143  0.7824 -0.6093
-0.4862 -1.2893  0.5329  0.0096 -0.9164
 0.5025 -0.9909  0.3173  0.3256 -0.7056
[torch.FloatTensor of size 4x5]
 
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
[torch.ByteTensor of size 4x5]
 
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.IntTensor of size 4x5]

torch.Size([4, 5])


You can slice through tensors as you can with numpy arrays. You also have the option to create tensors from numpy arrays, and switch back.

In [3]:
z = z[:, :3]

# convert numpy array to torch
a = np.random.randn(3)
print(torch.from_numpy(a))

# convert tensor to numpy array
print(z.numpy())


-0.9817
 0.9750
 1.2354
[torch.DoubleTensor of size 3]

[[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]]


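Lastly, we can do arithmetic on tensors just as we do in numpy. A method such as `add` or `mul` returns a new tensor and leaves the original unchanged, while adding an underscore after the method name (`add_`, `mul_`) performs the calculation in place. Note that `mul` multiplies element by element. For example: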

In [4]:
y = torch.ones(4, 5)
x.add(y)
print(x)
x.add_(y)
print(x)

a = torch.randn(4, 4)
b = torch.randn(4, 4)
a.mul(b)
print(a) # a didn't change
a.mul_(b) # a changes
print(a)


-0.6388 -0.7407  0.7655  0.3467  0.0453
-0.3150  0.5700  1.1143  0.7824 -0.6093
-0.4862 -1.2893  0.5329  0.0096 -0.9164
 0.5025 -0.9909  0.3173  0.3256 -0.7056
[torch.FloatTensor of size 4x5]


 0.3612  0.2593  1.7655  1.3467  1.0453
 0.6850  1.5700  2.1143  1.7824  0.3907
 0.5138 -0.2893  1.5329  1.0096  0.0836
 1.5025  0.0091  1.3173  1.3256  0.2944
[torch.FloatTensor of size 4x5]


-0.1972 -1.5291  1.4242  0.3151
-0.1207 -0.1091  0.0299 -0.6320
 1.6606  1.6022  0.3049 -0.6754
 0.9234 -0.5473  0.7598  0.2179
[torch.FloatTensor of size 4x4]


 0.2149 -0.6179 -1.3002  0.1482
-0.0807 -0.0109 -0.0124  0.7098
 0.6654 -0.1049  0.3331  0.0191
-0.8010 -0.2615  0.6902 -0.0177
[torch.FloatTensor of size 4x4]



### Automatic Differentiation (Autograd)

Our neural network model learns its weights through backpropagation and gradient descent. Backpropagation computes the gradient of a loss function with respect to every weight, and gradient descent uses that gradient to update the weights. In order to decrease the loss, we therefore need to calculate derivatives.

Autograd is a package built into PyTorch that computes derivatives without us having to derive closed-form solutions. It backpropagates gradients automatically, so we don't have to calculate them manually. To use `autograd`, we wrap a tensor in a `Variable` object. When creating a `Variable`, we can turn gradient calculation on or off with `requires_grad`. For response variables we won't need gradients, since we only compare them against the outputs, but we will need gradients for the weights we want to update.

In [5]:
from torch.autograd import Variable

x = Variable(torch.randn(4), requires_grad=True)
# FloatTensor of size 4
print(x.data)
# Gradient (starts off with nothing since we didn't backpropagate yet)
print(x.grad)


 0.8868
 1.3589
-0.0781
 2.0799
[torch.FloatTensor of size 4]

None


Here, we will compute the gradient of $y = \sum_{i=1}^{4} (x_i \times 10)$. We first call `backward()` on `y` to backpropagate, and then we can see that `x.grad` holds the gradient of `y` with respect to each element of `x`.

In [6]:
y = x.mul(10).sum()
print(y)
y.backward()
print(y)
print(x.grad)

Variable containing:
 42.4742
[torch.FloatTensor of size 1]

Variable containing:
 42.4742
[torch.FloatTensor of size 1]

Variable containing:
 10
 10
 10
 10
[torch.FloatTensor of size 4]
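
The result above can be checked by hand: since $y$ is the sum of $10 x_i$, the derivative of $y$ with respect to each $x_i$ is 10, whatever values `x` holds. The sketch below repeats the computation and also shows that a second backward pass adds to the stored gradient rather than replacing it. It assumes PyTorch 0.4 or later, where a tensor created with `requires_grad=True` tracks gradients directly and the `Variable` wrapper is optional.

```python
import torch

# Assumes PyTorch 0.4 or later: tensors track gradients without a Variable wrapper
x = torch.randn(4, requires_grad=True)

y = x.mul(10).sum()
y.backward()
first_grad = x.grad.clone()        # dy/dx_i = 10 for every element

# A second backward pass adds to the stored gradient instead of replacing it
x.mul(10).sum().backward()
accumulated_grad = x.grad.clone()  # 10 + 10 = 20 for every element

# Clear the stored gradient in place before any further backward pass
x.grad.zero_()

print(first_grad, accumulated_grad, x.grad)
```

This accumulation is the reason gradients have to be cleared before each weight update during training.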



## Data Loading & Pre-Processing

To load our data, we will be using [pandas](https://pandas.pydata.org/) to load in and pre-process our data, and [sklearn](http://scikit-learn.org/stable/model_selection.html#model-selection)'s `train_test_split` to split our training and testing dataset. We will be using a [wine quality dataset](http://archive.ics.uci.edu/ml/datasets/Wine+Quality) from the UCI Machine Learning Repository to classify the quality of our wine as high or low quality.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

red_df = pd.read_csv("winequality-red.csv", sep = ";")
white_df = pd.read_csv("winequality-white.csv", sep = ";")

Download the two datasets: winequality-red.csv and winequality-white.csv to your directory. We will load the datasets (note the csv file is separated with a semicolon) into two pandas dataframes. We will see the following columns:

1. fixed acidity 
2. volatile acidity 
3. citric acid 
4. residual sugar 
5. chlorides 
6. free sulfur dioxide 
7. total sulfur dioxide 
8. density 
9. pH 
10. sulphates 
11. alcohol 
12. quality (score between 0 and 10)
13. type (red or white; we add this column ourselves below)

There are 6497 complete cases with 13 attributes each in the red and white wine datasets combined.

Because we want to use the type of wine (red or white) as one of our predictor variables, we will need to add in an additional column `type`, and mark red wines with 1, white wines with -1. We also need to change the variable `quality` to a categorical variable, where we mark wines with rating over 6 as high quality and 6 and below as low quality. 

In [8]:
red_df['type'] = 1
white_df['type'] = -1

df = pd.concat([red_df, white_df])
df = df[['type']+list(df)[0:12]]

df['quality'] = df['quality'].apply(lambda x: 1 if x > 6 else 0)

After processing our data into the format we want, we will normalize our input. Normalizing our inputs so that they are scaled similarly will help our model converge faster. In our case, we will normalize the data so that it ranges from -1 to 1. We will do so by scaling all the continuous columns to range from 0 to 1, multiplying the new values by 2, and subtracting 1.

In [9]:
# columns: list of continuous variables
columns = list(df)
del columns[-1]
del columns[0]

# normalizing continuous columns
df[columns]=(df[columns]-df[columns].min())/(df[columns].max()-df[columns].min())
df[columns] = df[columns]*2-1
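
As a worked example of the same two-step scaling, here is a small sketch on a plain Python list. The helper name `scale_to_minus_one_one` and the sample values are made up for illustration, and the sketch assumes the column is not constant (otherwise the division by `high - low` fails).

```python
def scale_to_minus_one_one(values):
    """Min-max scale to [0, 1], then stretch and shift to [-1, 1]."""
    low, high = min(values), max(values)
    unit = [(v - low) / (high - low) for v in values]  # now in [0, 1]
    return [2 * u - 1 for u in unit]                   # now in [-1, 1]

# Made-up values standing in for one continuous column
sample_column = [8.0, 10.0, 11.5, 14.0]
scaled = scale_to_minus_one_one(sample_column)
print(scaled)
```

The smallest value maps to -1, the largest to 1, and everything else keeps its relative position in between.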

In [10]:
print(df.shape)

(6497, 13)


Finally, we will split our training and testing data. We will pick 80% of our data randomly for training and the remaining 20% for testing, and then convert our dataframes to numpy matrices. Using the `train_test_split` function from `sklearn`, we can do the split with one line.

In [10]:
# split train & test
train, test = train_test_split(df, test_size=0.2)
# .values returns the underlying NumPy array (as_matrix() was removed in pandas 1.0)
train, test = train.values, test.values
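
To illustrate what a random 80/20 split does with 6497 rows, here is a minimal sketch using only the standard library. The helper `split_indices` is hypothetical and is not how `sklearn` is implemented; it assumes the test size is rounded up, which matches the 5197 training rows and 1300 test rows used in the rest of this tutorial.

```python
import math
import random

def split_indices(n_rows, test_fraction, seed):
    """Hypothetical helper: shuffle row indices and cut them into train and test parts."""
    n_test = math.ceil(n_rows * test_fraction)  # assumption: the test size rounds up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # a fixed seed makes the split repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(6497, 0.2, seed=0)
print(len(train_idx), len(test_idx))  # 5197 1300
```

Every row lands in exactly one of the two parts, so the model is never tested on a row it was trained on.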

## Building the DataLoader and Two-Layer ANN

Because our training data is fairly large (it has 5197 rows), it would be time consuming to run through every row of data one at a time to compute the loss, compute the gradient, and update the weights. Instead, we are going to divide the entire dataset into small *batches*, and process one batch at a time to compute the loss and gradient and then update the weights. PyTorch allows us to create iterable objects called DataLoaders that create batches to feed into our model. Once we have our custom DataLoader, we can simply iterate through the batches to compute the loss, compute the gradient, and update the weights.

First, we will need to import `Dataset` and `DataLoader` from `torch.utils.data`. Then we are going to make a class called `WineDataset` by extending the `Dataset` class. To do so, we need to store the data as `x_data` and `y_data`, and write a `__getitem__(index)` method and a `__len__()` method for our class.

After making two instances of `WineDataset` using the train and test data, we will feed the train and test Datasets into `DataLoader` with a batch size of 80. We will also turn on shuffling and set `num_workers`, the number of subprocesses used to load batches in parallel. Note that `num_workers` is unrelated to the number of epochs (the number of times we go through our dataset), which we set later in the training loop.

In [11]:
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

torch.manual_seed(1013)

class WineDataset(Dataset):
    """
    Wine Dataset: contains 12 predictor variables and 1 response variable
    """
    
    def __init__(self, winedata):
        self.x_data = torch.from_numpy(winedata[:, 0:-1]).float()
        self.y_data = torch.from_numpy(winedata[:, -1]).float()
        
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
    
    def __len__(self):
        return self.x_data.shape[0]

# Create train and test datasets
train_data = WineDataset(train)
test_data = WineDataset(test)

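# Create dataloaders with a batch size of 80.
# num_workers is the number of subprocesses that load data, not the number of epochs;
# a small value is enough for a dataset of this size.
train_loader = DataLoader(dataset=train_data, batch_size=80, shuffle=True, num_workers=2)
test_loader = DataLoader(dataset=test_data, batch_size=80, shuffle=True, num_workers=2)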
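
As a quick arithmetic check on what the training loader yields, the sketch below works out the number of batches per epoch for 5197 rows and a batch size of 80. It assumes the `DataLoader` default of keeping the final, smaller batch rather than dropping it; the variable names are illustrative only.

```python
import math

n_train = 5197
batch_size = 80

batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
updates_over_100_epochs = 100 * batches_per_epoch

print(batches_per_epoch, last_batch_size, updates_over_100_epochs)  # 65 77 6500
```

So each epoch consists of 64 full batches plus one batch of 77 rows, and the weights are updated once per batch.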

Next we will build the feedforward neural network model. This model will have an input layer, two hidden layers, and an output layer. The activation function we will use is the sigmoid function. Before we train our model, we have to define a class that suits our purpose. In our two-layered ANN, we will have 12 input nodes (since we have 12 predictor variables), two hidden layers with 10 nodes and 5 nodes, and an output layer with one node (since our task is simply to classify the wine as low or high quality, 0 or 1).

\begin{equation*}
\textrm{Sigmoid}(x)= \frac{1}{1 + e^{-x}}
\end{equation*}
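
The formula above can be evaluated directly. This small sketch, written with only the standard library, shows that the sigmoid maps any real number into the range (0, 1) and returns exactly 0.5 at zero, which is what makes 0.5 a natural cutoff between the two classes.

```python
import math

def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

for value in (-4.0, 0.0, 4.0):
    print(value, round(sigmoid(value), 4))
```

Large negative inputs give outputs close to 0 and large positive inputs give outputs close to 1.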

There are a lot of design choices that we can make with a neural network model. We can choose the number of layers, the number of nodes in a hidden layer, and the activation function that we want to use. The more layers we use, the more complex our model becomes, and the more likely it is to overfit our data. Using more hidden neurons allows us to extract more features, whereas using fewer hidden neurons keeps only the key features. Though it all depends on the amount of data you have, a common rule of thumb is to use fewer neurons than the layer before (so hidden layer 1 has fewer neurons than the input layer, and so on).

We create the class and instantiate three linear layers: one for each of the two hidden layers and one for the output. The layers connect to each other, so the number of inputs to each layer must equal the number of outputs of the layer before it. Layer 1, for example, takes the number of input neurons as its input size and the number of neurons in hidden layer 1 as its output size. We will also define our activation function as the sigmoid function. Then, we are required to write the `forward()` function, where we "connect" our network using the layers and activation function defined in `__init__`. This function is in charge of the forward pass, calculating output values for a given input X.

*Note that we can use different activation functions, such as ReLU, LogSigmoid, and Softmax, and different layers such as Conv2d, which are provided in torch.nn*

We will instantiate a `TwoLayerANN`, and call it `wine_model`. To see our network's parameters and the weights at each layer, we can call `list(wine_model.parameters())`

In [12]:
class TwoLayerANN(nn.Module):
    def __init__(self, inputs, hidden1, hidden2, output):
        super(TwoLayerANN,self).__init__()
        self.layer1 = nn.Linear(inputs, hidden1)
        self.layer2 = nn.Linear(hidden1, hidden2)
        self.layer3 = nn.Linear(hidden2, output)
        self.activation = nn.Sigmoid()

        
    def forward(self, X):
        # Feed input layer into hidden layer 1 & apply activation
        hidden1_in = self.layer1(X)
        hidden1_out = self.activation(hidden1_in)
        
        # Feed hidden layer 1 into hidden layer 2 & apply activation
        hidden2_in = self.layer2(hidden1_out)
        hidden2_out = self.activation(hidden2_in)
        
        # Feed hidden layer 2 into output layer & apply activation
        output_in = self.layer3(hidden2_out)
        output_out = self.activation(output_in)
        return output_out
    
wine_model = TwoLayerANN(12, 10, 5, 1)
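
To get a feel for the size of this model, we can count its trainable parameters by hand. The sketch below assumes each linear layer has one weight per input-output pair plus one bias per output node, which is the `nn.Linear` default; the helper name `count_parameters` is made up for illustration.

```python
layer_sizes = [12, 10, 5, 1]  # inputs, hidden layer 1, hidden layer 2, output

def count_parameters(sizes):
    """Sum weights (n_in * n_out) and biases (n_out) over consecutive layer pairs."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out
    return total

print(count_parameters(layer_sizes))  # 130 + 55 + 6 = 191
```

The same number should come out of `sum(p.numel() for p in wine_model.parameters())`.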

## Training and Testing

Now that we have created the model, we will move on to training the weights. We do not have to implement a backward function in which we calculate the gradient and update the weights using the delta rule, because the `autograd` package has it built in.

### Loss Function

We will need a loss function that calculates how well our model performs. There are several options for the loss function in `torch.nn`, such as cross-entropy loss and negative log-likelihood loss. In our case, we are going to use the [Mean Squared Error loss function](http://pytorch.org/docs/master/nn.html#torch.nn.MSELoss), which is based on the squared error between our predicted and observed y values. Because we pass `size_average=False` when we create it, the squared errors in a batch are summed rather than averaged.
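
As a worked example of the squared error, here is a small sketch in plain Python that computes both the summed and the averaged version on three made-up predictions. The helper name `squared_error_sum` and the numbers are for illustration only.

```python
def squared_error_sum(predictions, targets):
    """Add up (prediction - target)^2 over all examples."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))

# Made-up model outputs and true labels
predictions = [0.9, 0.2, 0.6]
targets = [1.0, 0.0, 0.0]

total = squared_error_sum(predictions, targets)  # 0.01 + 0.04 + 0.36
mean = total / len(targets)
print(round(total, 4), round(mean, 4))
```

The two values differ only by a factor of the number of examples, so summing instead of averaging effectively scales the gradient by the batch size.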


### Optimization Algorithm

Once we have a loss function, we will also need an optimization algorithm. There are several options for the optimization algorithm in `torch.optim`, but we are going to stick to stochastic gradient descent. This means that we are going to update the weights for every batch, unlike full-batch gradient descent, where we update the weights once per epoch, that is, after a full pass over the training data. For the optimization algorithm, we will have to set the parameters to optimize as our model's parameters (`wine_model.parameters()`) and set the learning rate to 0.1. This is the step size that our algorithm takes to gradually improve our model. Here is the stochastic gradient descent update, where $\theta$ are the parameters we want to optimize over, and $\delta$ is the learning rate.


\begin{equation*}
\theta = \theta - \delta \times \nabla_\theta MSE(\theta; X_i, y_i)
\end{equation*}
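
To see the update rule in action, here is a minimal sketch on a toy problem with a single parameter: we fit $\theta$ so that $\theta \times x$ matches $y$ for one made-up data point. The problem, the helper name `gradient`, and the numbers are assumptions for illustration; only the update line itself comes from the formula above.

```python
def gradient(theta, x, y):
    """Derivative of the squared error (theta * x - y)**2 with respect to theta."""
    return 2 * (theta * x - y) * x

theta = 0.0          # initial parameter value
learning_rate = 0.1  # delta in the formula above
x, y = 2.0, 4.0      # one made-up data point; the best theta is 2.0

history = [theta]
for step in range(5):
    theta = theta - learning_rate * gradient(theta, x, y)
    history.append(theta)

print(history)
```

Each step moves $\theta$ closer to 2.0, with the remaining error shrinking by the same factor every time.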

### Training Algorithm

Once we have our loss function and optimization algorithm, we can set our model to training mode, and run through 100 epochs of training. We will use our `train_loader` that we created earlier, and iterate through the batches to update the weights. As we iterate through the batches and epochs, we will get inputs (predictor variable data) and outputs (quality), and we will wrap them in a Variable class. Then we will feed the model with our input variables to get an output, and calculate the loss using our MSE loss function. To update the weights, we will follow these steps:

1. Clear the existing gradients in order to make sure we are not accumulating existing gradients
2. Do a backward pass of network to calculate the sum of gradients
3. Update the weights using the learning rate defined in our optimization algorithm

In [13]:
lossfn = torch.nn.MSELoss(size_average=False) # squared-error loss, summed over the batch (newer PyTorch versions spell this reduction='sum')
optimizer = torch.optim.SGD(wine_model.parameters(), lr=0.1) # optimization algorithm

wine_model.train()

for epoch in range(100):
    for i, (inputs, outputs) in enumerate(train_loader):
        # Wrap data in Variable class
        inputs, outputs = Variable(inputs.float()), Variable(outputs.float(), requires_grad=False)
        
        # Feed the model with inputs to get predictions
        y_pred = wine_model(inputs)
        
        # Calculate loss using MSE (reshape targets to (batch, 1) to match y_pred)
        loss = lossfn(y_pred, outputs.view(-1, 1))
        
        # Clear existing gradients, backward pass, update weights
        optimizer.zero_grad() 
        loss.backward()
        optimizer.step()
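
One detail worth checking in a loop like this is that predictions and targets have the same shape. The model returns one column per row, shape `(batch, 1)`, while `WineDataset` stores the labels as a 1-D tensor, so a batch of targets has shape `(batch,)`. The sketch below uses made-up tensors to show what happens under broadcasting when the two shapes are subtracted, assuming a PyTorch version that follows NumPy-style broadcasting; how a particular loss function reacts to the mismatch depends on the PyTorch version.

```python
import torch

y_pred = torch.zeros(4, 1)                    # shape (4, 1), like the output of a model with one output node
targets = torch.tensor([1.0, 0.0, 0.0, 1.0])  # shape (4,), like a batch of 1-D labels

broadcast_diff = y_pred - targets             # broadcasts to (4, 4): every prediction against every target
matched_diff = y_pred - targets.view(-1, 1)   # shape (4, 1): one prediction per target

print(broadcast_diff.shape, matched_diff.shape)
print((broadcast_diff ** 2).sum().item(), (matched_diff ** 2).sum().item())
```

Reshaping the targets with `.view(-1, 1)` before computing the loss keeps each prediction paired with its own label.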

### Testing our model

Now that we have trained our model, we can go ahead and see the accuracy level of the model. We will do so by setting the model to evaluation mode, since we won't be updating any of the weights. Then we will get our test data matrix, and split our inputs and outputs. Now all we have to do is gather the predicted values from our test data, and calculate the accuracy. 

We classify our predicted values as high quality if the output is greater than 0.5, and low quality if the output is lower than or equal to 0.5. 

In [14]:
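wine_model.eval()

# Split the test matrix into inputs and outputs
test_x = Variable(torch.from_numpy(test[:, 0:-1]).float())
test_y = torch.from_numpy(test[:, [-1]]).int()

# Forward pass on the whole test set
out = wine_model(test_x)

# Classify as high quality (1) if the output is greater than 0.5, else low quality (0)
pred = (out.data > 0.5).int()

# Accuracy: fraction of test rows where the prediction matches the label
print(float(torch.sum(test_y == pred)) / len(test_y))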
        

0.816923076923077
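
The thresholding and accuracy calculation can also be written out in plain Python, which makes the treatment of an output of exactly 0.5 explicit. The helper name `accuracy` and the sample values below are made up for illustration and are not results from the wine model.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Label an output 1 only if it is strictly greater than the threshold, then compare."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)

# Made-up model outputs and true labels
probabilities = [0.91, 0.12, 0.50, 0.67, 0.30]
labels = [1, 0, 1, 0, 0]

print(accuracy(probabilities, labels))  # 0.6
```

Here the third output is exactly 0.5, so it is classified as low quality and counted as a miss.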


## References

- [PyTorch](http://pytorch.org/)
- [PyTorch Tutorials](https://github.com/pytorch/tutorials)
- [UCI Machine Learning Repo](http://archive.ics.uci.edu/ml/datasets/Wine+Quality)