# Predicting Wine Types with Neural Networks

## Introduction

This notebook is a simple demonstration of how to use scikit-learn to build a Neural Network for classification. It uses a dataset of 178 wines and their various attributes. There are three different classes of wine in the data and the goal is to predict the wine class based upon the attributes.

## The Data

The data has been taken from [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Wine). 

Information on the data can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names). The main thing we are going to focus on is the attribute names, as the raw data doesn't have a header. The first column is the class, which is what we wish to predict, and we will use the remaining attributes as features:
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash  
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline   

We import the Python libraries we need

In [1]:
import pandas as pd
import numpy as np

We read in the data we've saved, passing the column names

In [2]:
wine = pd.read_csv("data/wine.csv", names=[
    "class", "alcohol", "malic_acid", "ash", "alcalinity_of_ash", "magnesium", "total_phenols",
    "flavanoids", "nonflavanoid_phenols", "proanthocyanins", "color_intensity", "hue",
    "od280_od315_of_diluted_wines", "proline",
])

Let's check out the first few rows of data

In [3]:
wine.head()

|   | class | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280_od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


## Train & Test Data

The purpose of splitting the data is to be able to assess the quality of a predictive model when it is used on unseen data. When training, you will try to build a model that fits to the data as closely as possible, to be able to most accurately make a prediction. However, without a test set you run the risk of overfitting - the model works very well for the data it has seen but not for new data.

The split ratio is often debated and in practice you might split your data into three sets: train, validation and test. You would use the training data to understand which classifier you wish to use; the validation set to test on whilst tweaking parameters; and the test set to get an understanding of how your final model would work in practice. Furthermore, there are techniques such as K-Fold cross validation that also help to reduce bias.
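To make the K-Fold idea concrete, here is a minimal sketch of how scikit-learn's `KFold` partitions rows into folds. It is an illustration only: `row_indices` is a hypothetical stand-in for the 178 rows of the wine data, and the choice of five folds and the seed are assumptions, not part of this notebook's workflow.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the 178 rows of the wine data;
# only the row positions matter for showing how the folds are formed.
row_indices = np.arange(178)

# Assumption: 5 folds, shuffled with a fixed seed so the split is repeatable.
kfold = KFold(n_splits=5, shuffle=True, random_state=123)

fold_sizes = []
for fold, (train_idx, validation_idx) in enumerate(kfold.split(row_indices)):
    # Each row lands in the validation part of exactly one fold.
    fold_sizes.append(len(validation_idx))
    print(f"fold {fold}: {len(train_idx)} train rows, {len(validation_idx)} validation rows")
```

Each fold in turn is held out for validation while the model is trained on the remaining folds, so every row is used for validation exactly once.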

For the purpose of this demonstration, we will only be randomly splitting our data into train and test sets, with an 80/20 split.

We import the required library from scikit-learn, [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [4]:
from sklearn.model_selection import train_test_split

We wish for all features to be used for training, therefore we are taking all columns except "class"

In [5]:
X = wine.drop(["class"], axis=1)

The column "class" is our target variable, so we set y to this column

In [6]:
y = wine["class"]

We use the *train_test_split* function to create the appropriate train and test data for our features ("X_train" and "X_test" respectively) and target data ("y_train" and "y_test"). We are specifying our test data to be 20% of the total data. We are also providing a seed to be able to reproduce this split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

We can check the number of examples we have in each of our train and test data sets using "shape"

In [8]:
X_train.shape

(142, 13)

In [9]:
X_test.shape

(36, 13)
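With only 178 wines, a purely random split can leave the three classes unevenly represented in the test set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both sets. The sketch below assumes the copy of the wine data bundled with scikit-learn (`load_wine`), whose class labels are 0, 1 and 2 rather than 1, 2 and 3; the `_demo` names are hypothetical and chosen to avoid overwriting the notebook's variables.

```python
from collections import Counter

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled copy of the wine data
# instead of data/wine.csv, so the example is self-contained.
X_demo, y_demo = load_wine(return_X_y=True)

# stratify=y_demo asks for class proportions to be preserved in both sets.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Count how many wines of each class ended up in each set.
print("train:", Counter(y_train_demo))
print("test: ", Counter(y_test_demo))
```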

## Standardisation

All features are numeric, so we do not need to worry about converting categorical data with techniques such as one-hot encoding. However, we will demonstrate how to standardise our data. Standardisation rescales each attribute so that it has a mean of 0 and a standard deviation of 1. A neural network can be trained on unscaled data, but it is sensitive to feature scale, so standardising usually makes training faster and more reliable. Standardisation does not require the data to be Gaussian, although it works better when the distribution is roughly bell-shaped. Alternatively, normalisation can be used to rescale each attribute to the range 0 to 1.

We use scikit-learn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [10]:
from sklearn.preprocessing import StandardScaler

We create the scaler, leaving parameters as default

In [11]:
scaler = StandardScaler()

We fit the scaler on the training data and, in the same call, transform that data, storing the result in a variable named "train_scaled"

In [12]:
train_scaled = scaler.fit_transform(X_train)

We then transform our test data with the same fitted scaler

In [13]:
test_scaled = scaler.transform(X_test)
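To show what the scaler is doing, the following sketch reproduces standardisation by hand on a tiny, made-up feature matrix and compares the result with `StandardScaler`. The `toy_*` arrays are hypothetical values chosen for illustration, not rows from the wine data. Note that the mean and standard deviation are computed from the training rows only and then reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training rows and 1 test row, 2 features on very different scales.
toy_train = np.array([[12.0, 700.0], [13.0, 1000.0], [14.0, 1300.0], [13.0, 1000.0]])
toy_test = np.array([[13.5, 850.0]])

# Per-feature statistics are learned from the training data only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# Standardise: subtract the mean, divide by the standard deviation.
manual_train = (toy_train - mean) / std
manual_test = (toy_test - mean) / std  # the test row reuses the training statistics

# The same result via scikit-learn.
toy_scaler = StandardScaler().fit(toy_train)
print(np.allclose(manual_train, toy_scaler.transform(toy_train)))
print(np.allclose(manual_test, toy_scaler.transform(toy_test)))
```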

## Neural Networks

Neural networks can learn complex patterns using layers of neurons which mathematically transform the data. The layers between the input and output are referred to as “hidden layers”. A neural network can learn relationships between the features that other algorithms cannot easily discover.
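As a rough illustration of what "layers of neurons which mathematically transform the data" means, here is a minimal NumPy sketch of a single forward pass through a network with one hidden layer. It is not scikit-learn's implementation: the weights are random rather than learned, and the layer sizes (13 inputs, 100 hidden neurons, 3 classes), the ReLU activation and the softmax output are assumptions chosen to mirror this dataset and the classifier's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 13 features in, 100 hidden neurons, 3 wine classes out.
n_features, n_hidden, n_classes = 13, 100, 3

# Random (untrained) weights and zero biases, for illustration only.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)


def forward(x):
    """Return class probabilities for a batch of rows x with shape (n_rows, n_features)."""
    hidden = np.maximum(0, x @ W1 + b1)  # hidden layer: linear transform followed by ReLU
    logits = hidden @ W2 + b2            # output layer: one score per class
    # Softmax turns the scores into probabilities; subtracting the max avoids overflow.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


# A hypothetical standardised wine: 13 feature values drawn at random.
sample = rng.normal(size=(1, n_features))
probabilities = forward(sample)
print(probabilities)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that the probability assigned to the correct class is as high as possible across the training data.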

We are using scikit-learn's [MLP Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) which is a simple type of neural network called a multi-layer perceptron (MLP).

In [14]:
from sklearn.neural_network import MLPClassifier

We create an MLP model

In [15]:
model = MLPClassifier()

We train it with our scaled training data and target values

In [16]:
model.fit(train_scaled, y_train)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

## Model Evaluation

We wish to understand how good our model is; one simple metric to use for evaluation is accuracy. This returns the fraction of predictions that are correct, as a value between 0 and 1.

We import [scikit-learn's accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

In [17]:
from sklearn.metrics import accuracy_score

We calculate accuracy for our training data

In [18]:
accuracy_score(y_train, model.predict(train_scaled))

1.0

We have an accuracy of 100%! More importantly, we should check if we find good results on unseen data (to ensure we haven't overfit). We calculate the accuracy for our test data

In [19]:
accuracy_score(y_test, model.predict(test_scaled))

0.9722222222222222

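The model still performs well on unseen data, with 97.2% accuracy: 35 of the 36 test wines are classified correctly. Because the model was created with random_state=None, the exact figure may differ slightly between runs.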
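Accuracy summarises performance in a single number. To see which classes are being confused with each other, a confusion matrix can help. The sketch below is a self-contained variation on the steps above, not a reproduction of this notebook's results: it assumes the copy of the wine data bundled with scikit-learn (`load_wine`, with labels 0, 1 and 2), and the `_demo` names, `max_iter=1000` and the fixed seeds are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the wine data stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Scale using statistics from the training data only, as in the Standardisation section.
demo_scaler = StandardScaler()
train_scaled_demo = demo_scaler.fit_transform(X_train_demo)
test_scaled_demo = demo_scaler.transform(X_test_demo)

# Assumption: a fixed seed and a higher iteration limit, for a repeatable fit.
demo_model = MLPClassifier(max_iter=1000, random_state=123)
demo_model.fit(train_scaled_demo, y_train_demo)

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_test_demo, demo_model.predict(test_scaled_demo))
print(matrix)
```

Entries on the diagonal are correct predictions; any off-diagonal entry shows a wine of one class being predicted as another.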

## Neural Network Parameters

More information on Neural Networks can be found in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/neural_networks_supervised.html)

There are a number of parameters that can be tuned that should be explored when trying to improve Neural Network models. A common approach is to test many different parameters, building multiple models and testing their accuracy to find the best combination.

### Parameters
For Multi-layer Perceptron (MLP) classifiers, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) lists the parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

Here is high-level information on the parameters; the documentation has more details:
- hidden_layer_sizes : default (100,)
    - The ith element represents the number of neurons in the ith hidden layer.

- activation : default ‘relu’
    - Activation function for the hidden layer.

- solver : default ‘adam’
    - The solver for weight optimization.

- alpha : default 0.0001
    - L2 penalty (regularization term) parameter.

- batch_size : default ‘auto’
    - Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batch_size=min(200, n_samples)

- learning_rate : default ‘constant’
    - Learning rate schedule for weight updates.

- learning_rate_init : default 0.001
    - The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.

- power_t : default 0.5
    - The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. Only used when solver=’sgd’.

- max_iter : default 200
    - Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.

- shuffle : default True
    - Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.

- random_state : default None
    - If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

- tol : default 1e-4
    - Tolerance for the optimization. When the loss or score is not improving by at least tol for two consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops.

- verbose : default False
    - Whether to print progress messages to stdout.

- warm_start : default False
    - When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.

- momentum : default 0.9
    - Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’.

- nesterovs_momentum : default True
    - Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.

- early_stopping : default False
    - Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for two consecutive epochs. Only effective when solver=’sgd’ or ‘adam’

- validation_fraction : default 0.1
    - The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True

- beta_1 : default 0.9
    - Exponential decay rate for estimates of first moment vector in adam, should be in [0, 1). Only used when solver=’adam’

- beta_2 : default 0.999
    - Exponential decay rate for estimates of second moment vector in adam, should be in [0, 1). Only used when solver=’adam’

- epsilon : default 1e-8
    - Value for numerical stability in adam. Only used when solver=’adam’
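Any of the parameters listed above can be passed as keyword arguments when the classifier is created. The following sketch shows the mechanics with a few of them; the specific values (two hidden layers of 50 and 25 neurons, a larger `alpha`, a fixed seed) are arbitrary illustrations rather than recommendations, and the bundled `load_wine` data is assumed in place of data/wine.csv.

```python
from sklearn.datasets import load_wine
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the wine data, standardised in one step.
X_demo, y_demo = load_wine(return_X_y=True)
X_scaled_demo = StandardScaler().fit_transform(X_demo)

# Illustrative values only: two hidden layers, stronger L2 penalty, fixed seed.
tuned_model = MLPClassifier(
    hidden_layer_sizes=(50, 25),  # 50 neurons in the first hidden layer, 25 in the second
    alpha=0.001,                  # L2 penalty
    max_iter=1000,                # allow more epochs before stopping
    random_state=123,             # make the weight initialisation repeatable
)
tuned_model.fit(X_scaled_demo, y_demo)

# One weight matrix per connection between consecutive layers.
for layer, weights in enumerate(tuned_model.coefs_):
    print(f"weights {layer}: shape {weights.shape}")
```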


### Parameter Search

To search for the best hyper-parameters for your algorithm and data, grid search cross validation is commonly used. Alternatively, there is a randomized search approach and this is often preferred for deep learning (working with neural networks with many layers). The [scikit-learn documentation](http://scikit-learn.org/stable/modules/grid_search.html) provides more thorough information on how to use these. 
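As a minimal sketch of grid search cross validation, the example below wraps the scaler and the classifier in a `Pipeline` so that scaling is refitted inside each cross-validation fold, then tries every combination in a small grid. The grid values, the three folds, the seeds and the use of the bundled `load_wine` data are all assumptions made for illustration; a real search would usually cover more parameters and values.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the wine data stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Scaling happens inside the pipeline, so each fold is scaled using its own training part.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=1000, random_state=123)),
])

# Illustrative grid: 2 x 2 = 4 parameter combinations.
# The "mlp__" prefix routes each parameter to the pipeline step named "mlp".
param_grid = {
    "mlp__hidden_layer_sizes": [(50,), (100,)],
    "mlp__alpha": [0.0001, 0.01],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train_demo, y_train_demo)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
print("accuracy on the held-out test set:", search.score(X_test_demo, y_test_demo))
```

`RandomizedSearchCV` from the same module has a similar interface but samples a fixed number of combinations instead of trying them all.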

#### Data Citation

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 