# CS 342 Neural Nets: Final Project

Authors: Ryan Gahagan (rg32643) and Dustan Helm (dbh878)

### Overview

In this project, we will create a neural network that will hopefully be able to predict the quality of various wines given their chemical compositions.

This project's data set and idea are based off another paper, cited here:

  "P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
  
  Modeling wine preferences by data mining from physicochemical properties.
  
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236."

We plan to make a feed-forward network to process this data, as well as some experiments to test and analyze our network.

## Data Pre-processing

First, we have to take the data set and process it into a format that Python can use.

In [2]:
# Run this block to load important libraries and set things up
import torch
from torch import nn
import numpy as np
import scipy.signal

%matplotlib inline
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'

from torch.utils.data.sampler import SubsetRandomSampler

from tqdm.notebook import tqdm

In [3]:
# Load in dataset files
white_file = open("winequality-white-nolabels.csv")
wine_quality_white = np.loadtxt(white_file, delimiter=";")
red_file = open("winequality-red-nolabels.csv")
wine_quality_red = np.loadtxt(red_file, delimiter=";")

In [4]:
# Create suitable data_arrays
num_samples_white = wine_quality_white.shape[0]
num_samples_red = wine_quality_red.shape[0]
num_samples_total = num_samples_white + num_samples_red

# Combine white and red wine datasets into one np array
wine_quality_combined_whitered = np.append(wine_quality_white, wine_quality_red, axis=0)

print(f"wine_quality_combined_whitered.shape = {wine_quality_combined_whitered.shape}")

assert wine_quality_combined_whitered.shape[0] == num_samples_total

# Rename
data_array = wine_quality_combined_whitered
data_array_white = wine_quality_white
data_array_red = wine_quality_red

wine_quality_combined_whitered.shape = (6497, 12)


Now we've loaded our datasets into lists. `data_array_red` is an `np.array` whose shape is `(1599,12)` and `data_array_white` is likewise an array of shape `(4898,12)`. `data_array` is simply those two arrays concatenated into an array of shape `(6497,12)`.

Note that these 12 columns represent both features and labels.

The columns, in order, are:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality (integer in \[0,10\])
where quality is our label.

In [5]:
# Split data array into features and label arrays
# Inputs: 
#     data_array: Data array to be split into features and labels
# Outputs:
#     data_array_feats: Features for data_array
#     data_array_labels: Labels for data_array
def split_features_labels(data_array):
    assert isinstance(data_array, np.ndarray)
    assert data_array.shape[1] == 12
    
    data_array_feats = data_array[:,:-1] # first 11 columns
    data_array_labels = data_array[:,-1] # last column

    assert data_array_feats.shape[1] == 11
    assert data_array_feats.shape[0] == data_array_labels.shape[0]
    
    return data_array_feats, data_array_labels

In [6]:
# Split data array into training and testing sets based on the provided train_proportion parameter
# Inputs: 
#     data_array: Dataset to split for training and testing
#     train_proportion: Proportion of datapoints to be kept as training data
# Outputs:
#     train_set: Training set containing a proportion of the datapoints contained in data_array specified by input parameter.
#     test_set: Testing set containing held-back datapoints to test the trained model
def train_test_split(data_array, train_proportion):
    assert isinstance(data_array, np.ndarray)
    assert data_array.shape[1] == 12
    
    num_samples = data_array.shape[0]
    
    feats, labels = split_features_labels(data_array)
    data_set = torch.utils.data.TensorDataset(torch.tensor(feats), torch.tensor(labels).long())
    
    train_size = int(train_proportion*num_samples)
    test_size = num_samples - train_size
    
    train_set, test_set = torch.utils.data.random_split(data_set, [train_size, test_size])
    
    assert abs(len(train_set) / len(data_array) - train_proportion) < 0.01
    assert abs(len(test_set) / len(data_array)  - (1 - train_proportion)) < 0.01
    
    return train_set, test_set

In [7]:
# Perform the actual split on all of our datasets into training and testing
train_proportion = 0.8
train_proportion_white = 0.8
train_proportion_red = 0.8

train_set, test_set = train_test_split(data_array, train_proportion)
train_set_white, test_set_white = train_test_split(data_array_white, train_proportion_white)
train_set_red, test_set_red = train_test_split(data_array_red, train_proportion_red)

In [8]:
# Split training set into true training and validation data based on the input proportion
# Inputs:
#     ntotal: Total number of datapoints in the original training set to be used to determine the split
#     train_proportion: Proportion of the training set examples which should not be placed into the validation set
# Outputs:
#     train_ix: Indices for training examples
#     val_ix: Indices for validation examples

def train_val_split_ix(ntotal, train_proportion):
    ntrain = int(train_proportion*ntotal)
    nval = ntotal - ntrain
    
    val_ix = np.random.choice(range(ntotal), size=nval, replace=False)
    train_ix = list(set(range(ntotal)) - set(val_ix))
    
    assert abs(len(train_ix) / ntotal - train_proportion) < 0.01
    assert abs(len(val_ix) / ntotal - (1 - train_proportion)) < 0.01
    
    return (train_ix, val_ix)

In [9]:
# Perform the training/validation split and then confirm array lengths
train_proportion2 = 0.9
train_proportion_white2 = 0.9
train_proportion_red2 = 0.9

train_ix, val_ix = train_val_split_ix(len(train_set), train_proportion2)
train_white_ix, val_white_ix = train_val_split_ix(len(train_set_white), train_proportion_white2)
train_red_ix, val_red_ix = train_val_split_ix(len(train_set_red), train_proportion_red2)

print(f"(len(train_ix), len(val_ix)) = ({len(train_ix)}, {len(val_ix)})")
print(f"(len(train_white_ix), len(val_white_ix)) = ({len(train_white_ix)}, {len(val_white_ix)})")
print(f"(len(train_red_ix), len(val_red_ix)) = ({len(train_red_ix)}, {len(val_red_ix)})")

assert len(train_ix) + len(val_ix) == len(train_set)
assert len(train_white_ix) + len(val_white_ix) == len(train_set_white)
assert len(train_red_ix) + len(val_red_ix) == len(train_set_red)

(len(train_ix), len(val_ix)) = (4677, 520)
(len(train_white_ix), len(val_white_ix)) = (3526, 392)
(len(train_red_ix), len(val_red_ix)) = (1151, 128)


In [10]:
# Set up data samplers for use in DataLoader objects
# Inputs: 
#     datalist_ix: Tuple of index lists used to determine each loader's data
# Outputs:
#     result: Tuple of SubsetRandomSamplers representing each index list object in datalist_ix
def setup_samplers(datalist_ix):
    result = ()
    for data_ix in datalist_ix:
        result += (SubsetRandomSampler(data_ix),)
    return result

In [11]:
# Set up a tuple of Data Loaders based on provided datasets, samplers, and batch_size
# Inputs:
#     datalist: List of datasets to give to the DataLoaders
#     samplers: List of samplers to use in the DataLoaders
#     batch_size: DataLoader batch size (Currently uses the same batch_size for every dataset passed)
# Outputs:
#     Tuple of DataLoader objects len(datalist) size long 
def setup_data_loaders(datalist, samplers, batch_size):
    assert len(datalist) == len(samplers)
    
    result = ()
    for i in range(len(datalist)):
        data = datalist[i]
        sampler = samplers[i]
        result += (torch.utils.data.DataLoader(data, batch_size, sampler=sampler),)
    return result

#TODO: It may be useful to have different batch_size values for training, validation, and testing. 
# In that event, batch_size should be replaced with a touple of batch_size values and the following should be added to the loop:
# batch_size = batch_sizes[i]

Now that we've declared methods to do various helper tasks, we're going to actually break our data into useful information.

The three sets of data (red, white, and combined) will be each partitioned into a training set, a validation set, and a testing set (whose sizes will be proportional to the variables declared above). We will then create `DataLoader` objects for each of these partitions so that we can iterate over them in our training section.

Note here that we also declare batch sizes to determine how many pieces of information are trained on at a time.

In [12]:
# Set up samplers and DataLoaders for all three datasets (Combined, White, Red)
batch_size = 100
batch_size_white = 100
batch_size_red = 100

#COMBINED DATASET
sampler_input = (train_ix, val_ix)
train_sampler, val_sampler = setup_samplers(sampler_input)

datalist = (train_set, train_set, test_set)
samplers = (train_sampler, val_sampler, None)
train_loader, val_loader, test_loader = setup_data_loaders(datalist, samplers, batch_size)

#JUST WHITE
sampler_input_white = (train_white_ix, val_white_ix)
train_sampler_white, val_sampler_white = setup_samplers(sampler_input_white)

datalist_white = (train_set_white, train_set_white, test_set_white)
samplers_white = (train_sampler_white, val_sampler_white, None)
train_loader_white, val_loader_white, test_loader_white = setup_data_loaders(datalist_white, samplers_white, batch_size_white)

#JUST RED
sampler_input_red = (train_red_ix, val_red_ix)
train_sampler_red, val_sampler_red = setup_samplers(sampler_input_red)

datalist_red = (train_set_red, train_set_red, test_set_red)
samplers_red = (train_sampler_red, val_sampler_red, None)
train_loader_red, val_loader_red, test_loader_red = setup_data_loaders(datalist_red, samplers_red, batch_size_red)

## Training

With our data processed and ready to be used, now we will write some functions to train and test a network.

In [13]:
# Train the provided model with data gathered from train_loader, the given criterion, and the given optimizer.
# Additionally perform validation checks with data from val_loader.
# Inputs:
#     model: Neural network to train
#     train_loader: DataLoader which provides training data to the model 
#     val_loader: DataLoader which provides validation data to the model
#     criterion: Loss Function which trains the model
#     optimizer: Optimization algorithm to improve loss during training 
#     nepoch: Number of epochs to train for (Defaults to 100)
# Outputs:
#     Prints the Training and Validation loss at each epoch
def train_network(model, train_loader, val_loader, criterion, optimizer, nepoch=100,silent=True):
    try:
        cur_range = tqdm(range(nepoch))
        if silent:
            cur_range = range(nepoch)
        for epoch in cur_range:
            # Train over each epoch with a progress bar (tqdm)
            if not silent: 
                print('EPOCH %d'%epoch)
            
            total_loss = 0
            count = 0
            for inputs, labels in train_loader:
                # For each train input: 
                optimizer.zero_grad()
                
                # Forward propagate inputs
                outputs = model.forward(inputs)
                
                # Compute loss and learn
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                
                # Add current loss to batch average
                total_loss += loss.item()
                count += 1
            
            # Show Training loss for current epoch over the train_loader data
            if not silent:
                print('{:>12s} {:>7.5f}'.format('Train loss:', total_loss/count))
            
            with torch.no_grad():
                # Perform Validation checks on the newly trained model
                total_loss = 0
                count = 0
                for inputs, labels in val_loader:
                    # Forward propagate inputs
                    outputs = model.forward(inputs)
                    
                    # Compute Loss
                    loss = criterion(outputs, labels)
                    
                    # Add current loss to batch average
                    total_loss += loss.item()
                    count += 1
                    
                # Show Validation loss for current epoch over the val_loader data
                if not silent:
                    print('{:>12s} {:>7.5f}'.format('Val loss:', total_loss/count))
            print()
    except KeyboardInterrupt:
        print('Exiting from training early')
    return

In [14]:
# Test the provided model with data from test_loader
# Inputs:
#     model: Model to test using unseen data
#     test_loader: DataLoader to provide held-back testing data to trained model
#     mode: String used at front of each loss printout
# Outputs:
#     acc: Top-1 accuracy of the model on the testing data in percent
#     true: Array of actual labels
#     pred: Array of model-predicted labels
def test_network(model, test_loader, mode, silent=True):
    correct = 0
    total = 0
    true, pred = [], []
    with torch.no_grad():
        
        for inputs, labels  in test_loader:
            # Forward propagate testing data
            outputs = model.forward(inputs)
            
            # Get the prediction for inputs
            vals, predicted = torch.max(outputs, dim=1) 
            
            # Tally results 
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            true.append(labels)
            pred.append(predicted)
    
    # Compute and print final accuracy, then format outputs
    acc = (100 * correct / total)
    if not silent:
        print('%s accuracy: %0.3f' % (mode, acc))
    true = np.concatenate(true)
    pred = np.concatenate(pred)
    return acc, true, pred

In [15]:
# (Kind of) placeholder method that wraps training and testing
def train_and_test(model, train_loader, val_loader, test_loader, criterion, optimizer, \
                   nepoch=100, mode="Model", train_silent=True, test_silent=True):
    train_network(model, train_loader, val_loader, criterion, optimizer, train_silent)
    model.eval()
    acc, true, pred = test_network(model, test_loader, mode, test_silent)
    return acc, true, pred

## Model

With our training and testing functionality equipped, now we will actually decide how to build our model.

Note here that there will be a heavy focus on making the model flexible so that we can tune hyperparameters or test new input varieties.

In [16]:
class WineQualityModel(torch.nn.Module):
    # Constructor for a WineQualityModel
    # Inputs:
    #     layers: a tuple of layers that you want in the model
    #       note that the output shape must be of length 10
    def __init__(self, layers):
        super().__init__()

        # NOTE: this gives the construction tons of flexibility
        # but also leaves plenty of room for dimensionality errors
        self.layers = torch.nn.Sequential(*layers)
        
    def forward(self, x):
        return self.layers.forward(x.float())

In [19]:
model = WineQualityModel((
    torch.nn.Linear(11, 81),
    torch.nn.LeakyReLU(),
    torch.nn.Dropout(),
    torch.nn.Linear(81, 11)
))

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_and_test(model, train_loader, val_loader, test_loader, criterion, optimizer, "Combined Model")

model_white = WineQualityModel((
    torch.nn.Linear(11, 81),
    torch.nn.LeakyReLU(),
    torch.nn.Dropout(),
    torch.nn.Linear(81, 11)
))

optimizer_white = torch.optim.Adam(model_white.parameters(), lr=1e-3)
train_and_test(model_white, train_loader_white, val_loader_white, test_loader_white, criterion, optimizer_white, "White Model")

model_red = WineQualityModel((
    torch.nn.Linear(11, 81),
    torch.nn.LeakyReLU(),
    torch.nn.Dropout(),
    torch.nn.Linear(81, 11)
))

optimizer_red = torch.optim.Adam(model_red.parameters(), lr=1e-3)
train_and_test(model_red, train_loader_red, val_loader_red, test_loader_red, criterion, optimizer_red, "Red Model")



  0%|          | 0/100 [00:00<?, ?it/s]

EPOCH 0
 Train loss: 6.78688
   Val loss: 5.16468

EPOCH 1
 Train loss: 3.62577
   Val loss: 2.55897

EPOCH 2
 Train loss: 2.14245
   Val loss: 1.73878

EPOCH 3
 Train loss: 1.68527
   Val loss: 1.58852

EPOCH 4
 Train loss: 1.50094
   Val loss: 1.47331

EPOCH 5
 Train loss: 1.41662
   Val loss: 1.44346

EPOCH 6
 Train loss: 1.35345
   Val loss: 1.37156

EPOCH 7
 Train loss: 1.33272
   Val loss: 1.40017

EPOCH 8
 Train loss: 1.31496
   Val loss: 1.41594

EPOCH 9
 Train loss: 1.29974
   Val loss: 1.27166

EPOCH 10
 Train loss: 1.27952
   Val loss: 1.32786

EPOCH 11
 Train loss: 1.29297
   Val loss: 1.27876

EPOCH 12
 Train loss: 1.28166
   Val loss: 1.35561

EPOCH 13
 Train loss: 1.28191
   Val loss: 1.37350

EPOCH 14
 Train loss: 1.27386
   Val loss: 1.33346

EPOCH 15
 Train loss: 1.27898
   Val loss: 1.34314

EPOCH 16
 Train loss: 1.26867
   Val loss: 1.26711

EPOCH 17
 Train loss: 1.26171
   Val loss: 1.26615

EPOCH 18
 Train loss: 1.26755
   Val loss: 1.25451

EPOCH 19
 Train loss: 

  0%|          | 0/100 [00:00<?, ?it/s]

EPOCH 0
 Train loss: 9.58724
   Val loss: 7.05852

EPOCH 1
 Train loss: 5.52440
   Val loss: 4.11414

EPOCH 2
 Train loss: 3.40280
   Val loss: 2.65779

EPOCH 3
 Train loss: 2.26251
   Val loss: 1.88995

EPOCH 4
 Train loss: 1.66877
   Val loss: 1.49958

EPOCH 5
 Train loss: 1.43899
   Val loss: 1.42410

EPOCH 6
 Train loss: 1.38683
   Val loss: 1.45508

EPOCH 7
 Train loss: 1.34084
   Val loss: 1.41675

EPOCH 8
 Train loss: 1.32461
   Val loss: 1.34208

EPOCH 9
 Train loss: 1.31467
   Val loss: 1.37092

EPOCH 10
 Train loss: 1.28189
   Val loss: 1.31587

EPOCH 11
 Train loss: 1.28602
   Val loss: 1.32465

EPOCH 12
 Train loss: 1.29119
   Val loss: 1.35254

EPOCH 13
 Train loss: 1.27810
   Val loss: 1.27902

EPOCH 14
 Train loss: 1.27376
   Val loss: 1.34762

EPOCH 15
 Train loss: 1.27572
   Val loss: 1.26647

EPOCH 16
 Train loss: 1.25634
   Val loss: 1.33309

EPOCH 17
 Train loss: 1.26837
   Val loss: 1.33992

EPOCH 18
 Train loss: 1.25904
   Val loss: 1.29679

EPOCH 19
 Train loss: 

  0%|          | 0/100 [00:00<?, ?it/s]

EPOCH 0
 Train loss: 7.19860
   Val loss: 4.35149

EPOCH 1
 Train loss: 3.58519
   Val loss: 2.58308

EPOCH 2
 Train loss: 2.83806
   Val loss: 2.45503

EPOCH 3
 Train loss: 2.48463
   Val loss: 2.43819

EPOCH 4
 Train loss: 2.24376
   Val loss: 1.86303

EPOCH 5
 Train loss: 1.93680
   Val loss: 2.31657

EPOCH 6
 Train loss: 1.87584
   Val loss: 1.77567

EPOCH 7
 Train loss: 1.72627
   Val loss: 1.85507

EPOCH 8
 Train loss: 1.55620
   Val loss: 1.75324

EPOCH 9
 Train loss: 1.58548
   Val loss: 1.35582

EPOCH 10
 Train loss: 1.48428
   Val loss: 1.35905

EPOCH 11
 Train loss: 1.41701
   Val loss: 1.18005

EPOCH 12
 Train loss: 1.36052
   Val loss: 1.22191

EPOCH 13
 Train loss: 1.29172
   Val loss: 1.43618

EPOCH 14
 Train loss: 1.28235
   Val loss: 1.54188

EPOCH 15
 Train loss: 1.27234
   Val loss: 1.24457

EPOCH 16
 Train loss: 1.25984
   Val loss: 1.24183

EPOCH 17
 Train loss: 1.23643
   Val loss: 1.26077

EPOCH 18
 Train loss: 1.25079
   Val loss: 1.25868

EPOCH 19
 Train loss: 

## Hypertuning

Now that we have the ability to initialize and train a model, we're going to try hypertuning some of the parameters. Specifically we want to try tuning the number of layers and the hidden size of each of those layers. 

In [23]:
def create_and_report_model(layers):
    loaders = [(train_loader, val_loader, test_loader), \
               (train_loader_white, val_loader_white, test_loader_white), \
               (train_loader_red, val_loader_red, test_loader_red)]
    #models = []
    #accuracies = []
    true_lists = []
    pred_lists = []
    for i in range(3):
        criterion = torch.nn.CrossEntropyLoss()
        cur_loaders = loaders[i]
        cur_model = WineQualityModel(layers)
        optimizer = torch.optim.Adam(cur_model.parameters(), lr=1e-3)
        acc, true, pred = train_and_test(cur_model, *cur_loaders, criterion, optimizer, f"Model {i}", 200)
        true_lists.append(true)
        pred_lists.append(pred)
    return true_lists, pred_lists

In [24]:
"""
num_layers = 1
hidden_sizes = range()

for i in range(num_layers):
"""

'\nnum_layers = 1\nhidden_sizes = range()\n\nfor i in range(num_layers):\n'