# Predicting Wine Types with Logistic Regression

## Introduction

This notebook is a simple demonstration of how to use scikit-learn to build a Logistic Regression model for classification. It uses a dataset of 178 wines and their various attributes. There are three different classes of wine in the data and the goal is to predict the wine class based upon the attributes.

## The Data

The data has been taken from [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Wine). 

Information on the data can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names). The main thing we are going to focus on is the attribute names, as the raw data doesn't have a header. The first column is the class, which is what we wish to predict, and the remaining attributes are used as features:
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash  
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline   

We import the Python libraries we need

In [1]:
import pandas as pd
import numpy as np

We read in the data we've saved, passing the column names

In [2]:
wine = pd.read_csv("data/wine.csv", names=["class", "alcohol", "malic_acid", "ash", "alcalinity_of_ash", "magnesium", "total_phenols",
                                          "flavanoids", "nonflavanoid_phenols", "proanthocyanins", "color_intensity", "hue", 
                                           "od280_od315_of_diluted_wines", "proline"])

Let's check out the first few rows of data

In [3]:
wine.head()

| | class | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280_od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
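
Before modelling, it can be useful to confirm how many wines fall into each class. The following is a minimal sketch that assumes scikit-learn's bundled copy of this dataset (`load_wine`) as a stand-in for the local `data/wine.csv`; note that the bundled copy labels the classes 0, 1 and 2 rather than 1, 2 and 3.

```python
import numpy as np
from sklearn.datasets import load_wine

# Assumption: the bundled copy stands in for data/wine.csv used above.
X_all, y_all = load_wine(return_X_y=True)

# Count the examples in each class (labels are 0, 1, 2 in this copy).
labels, counts = np.unique(y_all, return_counts=True)
for label, count in zip(labels, counts):
    print(f"class {label}: {count} wines")
```

If one class were much rarer than the others, accuracy alone could be a misleading metric, so this check is a cheap safeguard.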


## Train & Test Data

The purpose of splitting the data is to be able to assess the quality of a predictive model when it is used on unseen data. When training, you will try to build a model that fits to the data as closely as possible, to be able to most accurately make a prediction. However, without a test set you run the risk of overfitting - the model works very well for the data it has seen but not for new data.

The split ratio is often debated, and in practice you might split your data into three sets: train, validation and test. You would fit candidate models on the training set; use the validation set to compare them and tune their parameters; and keep the test set for a final estimate of how the chosen model would perform in practice. Furthermore, techniques such as K-Fold cross validation repeat the train/validation split several times, which gives an estimate of performance that depends less on any single split.
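
As an illustration of K-Fold cross validation, here is a minimal sketch. It assumes scikit-learn's bundled copy of the dataset (`load_wine`) rather than our local file, and it uses the scaler and classifier that are introduced later in this notebook; the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled copy stands in for data/wine.csv.
X_cv, y_cv = load_wine(return_X_y=True)

# The pipeline refits the scaler inside each fold, so the held-out fold
# never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# With an integer cv and a classifier, scikit-learn uses stratified folds.
scores = cross_val_score(pipeline, X_cv, y_cv, cv=5)
print(scores.mean(), scores.std())
```

Each of the five scores is the accuracy on one held-out fold; the spread between them gives a rough idea of how much the result depends on the particular split.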

For the purpose of this demonstration, we will only be randomly splitting our data into train and test sets, with an 80/20 split.

We import the required library from scikit-learn, [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [4]:
from sklearn.model_selection import train_test_split

We wish for all features to be used for training, therefore we are taking all columns except "class"

In [5]:
X = wine.drop(["class"], axis=1)

The column "class" is our target variable, so we set y to this column

In [6]:
y = wine["class"]

We use the *train_test_split* function to create the appropriate train and test data for our features ("X_train" and "X_test" respectively) and target data ("y_train" and "y_test"). We are specifying our test data to be 20% of the total data. We are also providing a seed so that the split can be reproduced

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

We can check the number of examples we have in each of our train and test data sets using "shape"

In [8]:
X_train.shape

(142, 13)

In [9]:
X_test.shape

(36, 13)
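
The 142/36 split follows from the arithmetic: 20% of 178 is 35.6, and the test size is rounded up to 36, leaving 142 rows for training. The sketch below reproduces this with placeholder arrays (`X_dummy` and `idx` are hypothetical stand-ins, not our wine data) and also checks that the same seed yields the same split.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 178

# Placeholder feature matrix; only the number of rows matters here.
X_dummy = np.zeros((n_samples, 13))
X_tr, X_te = train_test_split(X_dummy, test_size=0.2, random_state=123)

# 0.2 * 178 = 35.6, which is rounded up to 36 test rows.
expected_test = math.ceil(0.2 * n_samples)
print(X_tr.shape, X_te.shape, expected_test)

# Splitting row indices twice with the same seed gives identical test sets.
idx = np.arange(n_samples)
_, test_a = train_test_split(idx, test_size=0.2, random_state=123)
_, test_b = train_test_split(idx, test_size=0.2, random_state=123)
same_split = np.array_equal(test_a, test_b)
print(same_split)
```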

## Standardisation

All features are numeric so we do not need to worry about converting categorical data with techniques such as one-hot encoding. However, we will demonstrate how to standardise our data. Standardisation rescales each attribute so that it has a mean of 0 and a standard deviation of 1. It does not strictly require the distribution to be Gaussian, although it tends to work better when it roughly is. Alternatively, normalisation can be used to rescale each attribute to the range 0 to 1.
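
To make the two rescaling options concrete, here is a small NumPy sketch using the alcohol and proline values of the five wines previewed earlier. Standardisation computes (x - mean) / std per column, and normalisation computes (x - min) / (max - min) per column. This is an illustration of the formulas, not a replacement for the scaler used below.

```python
import numpy as np

# Alcohol and proline for the five wines shown in the preview of the data.
features = np.array([
    [14.23, 1065.0],
    [13.20, 1050.0],
    [13.16, 1185.0],
    [14.37, 1480.0],
    [13.24, 735.0],
])

# Standardisation: subtract the column mean, divide by the column
# standard deviation (population standard deviation, NumPy's default).
standardised = (features - features.mean(axis=0)) / features.std(axis=0)

# Normalisation (min-max): rescale each column to the range [0, 1].
col_min = features.min(axis=0)
col_max = features.max(axis=0)
normalised = (features - col_min) / (col_max - col_min)

print(standardised.round(2))
print(normalised.round(2))
```

Without rescaling, proline (hundreds to thousands) would dominate alcohol (around 13 to 14) in any computation that is sensitive to feature magnitude, such as a regularisation penalty.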

We use scikit-learn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [10]:
from sklearn.preprocessing import StandardScaler

We create the scaler, leaving parameters as default

In [11]:
scaler = StandardScaler()

We fit the scaler to the training data and, in the same call, transform that data, storing the result in a variable named "train_scaled"

In [12]:
train_scaled = scaler.fit_transform(X_train)

We then transform our test data with the same fitted scaler

In [13]:
test_scaled = scaler.transform(X_test)
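
The reason for fitting on the training data only is that the test data should play no part in choosing the scaling statistics. The sketch below, which uses randomly generated stand-ins (`train` and `test` are hypothetical, not our wine data), shows that `transform` simply reuses the mean and scale learned during `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for X_train and X_test.
train = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

demo_scaler = StandardScaler()
train_demo_scaled = demo_scaler.fit_transform(train)
test_demo_scaled = demo_scaler.transform(test)

# transform() applies the statistics learned from the training data.
manual = (test - demo_scaler.mean_) / demo_scaler.scale_
print(np.allclose(test_demo_scaled, manual))

# The scaled test data is therefore only approximately centred on 0.
print(test_demo_scaled.mean(axis=0))
```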

## Logistic Regression

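Logistic regression predicts the probability of a binary outcome. A new observation is predicted to be within the class if its probability is above a set threshold (commonly 0.5). Logistic Regression can be extended to scenarios with multiple classes, for example by fitting one binary model per class (one-vs-rest, 'ovr') or by minimising a multinomial loss; our data has three classes.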
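
The binary case can be sketched in a few lines of NumPy: a weighted sum of the features is passed through the logistic (sigmoid) function to give a probability, which is then compared with a threshold. The weights, intercept and observations here are hypothetical values chosen for illustration, not fitted to our data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept and observations, for illustration only.
weights = np.array([1.5, -2.0])
intercept = 0.25
observations = np.array([[1.0, 0.2], [-0.5, 1.0], [0.0, 0.0]])

# Linear score -> probability -> class label via a 0.5 threshold.
probabilities = sigmoid(observations @ weights + intercept)
threshold = 0.5
predictions = (probabilities >= threshold).astype(int)

print(probabilities.round(3))
print(predictions)
```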

We are using scikit-learn's [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [14]:
from sklearn.linear_model import LogisticRegression

We create a Logistic Regression model

In [15]:
model = LogisticRegression()

We train it with our scaled training data and target values

In [16]:
model.fit(train_scaled, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
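
Once fitted, the model exposes the class labels, the learned coefficients and per-class probabilities. The following is a minimal sketch, assuming scikit-learn's bundled copy of the dataset (`load_wine`, with classes labelled 0 to 2) and, for brevity, fitting on all rows rather than on a training split.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled copy stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(max_iter=1000).fit(X_demo_scaled, y_demo)

print(clf.classes_)     # class labels, in the column order of predict_proba
print(clf.coef_.shape)  # one row of coefficients per class

# Probabilities for the first three wines; each row sums to 1.
probs = clf.predict_proba(X_demo_scaled[:3])
print(probs.round(3))
```

`predict` returns the class with the highest probability in each row, so inspecting `predict_proba` shows how confident the model is in each prediction.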

## Model Evaluation

We wish to understand how good our model is; one simple metric to use for evaluation is accuracy. This will return the fraction of correct predictions, which we can read as a percentage.

We import [scikit-learn's accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

In [17]:
from sklearn.metrics import accuracy_score

We calculate accuracy for our training data

In [18]:
accuracy_score(y_train, model.predict(train_scaled))

1.0

We have an accuracy of 100%! More importantly, we should check if we find good results on unseen data (to ensure we haven't overfit). We calculate the accuracy for our test data

In [19]:
accuracy_score(y_test, model.predict(test_scaled))

0.9722222222222222

The accuracy on the test data is still high at 97.2%, which corresponds to 35 of the 36 test wines being classified correctly.
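
Accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on hypothetical labels (36 wines with a single mistake, chosen to mirror the size of our test set rather than taken from the model).

```python
import numpy as np

# Hypothetical labels: 36 test wines, with a single misclassification.
y_true = np.array([1] * 12 + [2] * 14 + [3] * 10)
y_pred = y_true.copy()
y_pred[0] = 2  # one wrong prediction

# Fraction of positions where the prediction equals the true label.
accuracy = np.mean(y_true == y_pred)
print(accuracy)
```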

## Logistic Regression Parameters

More information on logistic regression can be found in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

There are a number of parameters that can be tuned that should be explored when trying to improve Logistic Regression models. A common approach is to test many different parameters, building multiple models and testing their accuracy to find the best combination.

### Parameters
For Logistic Regression, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) lists the parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

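Here is high-level information on the parameters; the documentation has more details. The defaults listed are those reported by the fitted model above and may differ in other scikit-learn versions: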
- penalty : default: ‘l2’
    - Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties.

- dual : default: False
    - Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.

- tol : default: 1e-4
    - Tolerance for stopping criteria.

- C : default: 1.0
    - Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

- fit_intercept : default: True
    - Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

- intercept_scaling : default: 1
    - Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.

- class_weight : default: None
    - Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

- random_state : default: None
    - The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘sag’ or ‘liblinear’.

- solver : default: ‘liblinear’
    - Algorithm to use in the optimization problem.

- max_iter : default: 100
    - Useful only for the newton-cg, sag and lbfgs solvers. Maximum number of iterations taken for the solvers to converge.

- multi_class : default: ‘ovr’
    - Multiclass option can be either ‘ovr’ or ‘multinomial’. If the option chosen is ‘ovr’, then a binary problem is fit for each label. Otherwise, the loss minimised is the multinomial loss fit across the entire probability distribution. The ‘multinomial’ option does not work with the liblinear solver.

- verbose : default: 0
    - For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.

- warm_start : default: False
    - When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver.

- n_jobs : default: 1
    - Number of CPU cores used when parallelizing over classes if multi_class=‘ovr’. This parameter is ignored when the ‘solver’ is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. If given a value of -1, all cores are used.
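
To see one of these parameters in action, the sketch below varies C and reports the overall size of the fitted coefficients. It assumes scikit-learn's bundled copy of the dataset (`load_wine`) and, for brevity, fits on all rows; the three values of C are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled copy stands in for data/wine.csv.
X_reg, y_reg = load_wine(return_X_y=True)
X_reg_scaled = StandardScaler().fit_transform(X_reg)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means stronger regularisation, which shrinks the coefficients.
    reg_model = LogisticRegression(C=C, max_iter=5000).fit(X_reg_scaled, y_reg)
    coef_norms[C] = np.linalg.norm(reg_model.coef_)
    print(f"C={C}: coefficient norm = {coef_norms[C]:.2f}")
```

Heavily shrunk coefficients can underfit, while very large ones can overfit, which is why C is usually among the first parameters to tune.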

### Grid Search

To search for the best hyper-parameters for your algorithm and data, grid search cross validation is commonly used. The [scikit-learn documentation](http://scikit-learn.org/stable/modules/grid_search.html) provides more thorough information on how to use this. 
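
A minimal grid search sketch follows. It assumes scikit-learn's bundled copy of the dataset (`load_wine`), and the grid of C values is hypothetical and illustrative rather than recommended. Wrapping the scaler and model in a pipeline means the scaler is refitted within each cross-validation fold.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled copy stands in for data/wine.csv.
X_gs, y_gs = load_wine(return_X_y=True)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=123
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])

# Hypothetical grid; "model__C" targets the C parameter of the "model" step.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_gs_train, y_gs_train)

print(search.best_params_, search.best_score_)
# The held-out test set is used only once, after the search has finished.
print(search.score(X_gs_test, y_gs_test))
```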

#### Data Citation

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 