In this notebook I will cover a basic machine learning pipeline using the sklearn library. For today we will use the Wine dataset from the UCI repository: https://archive-beta.ics.uci.edu/ml/datasets/Wine. The goal is to use the chemical composition of each wine to determine its origin.

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [5]:
def get_wine_data():
    ! wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
    ! wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names

In [6]:
get_wine_data()

--2021-09-09 16:19:07--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10782 (11K) [application/x-httpd-php]
Saving to: ‘wine.data.1’


2021-09-09 16:19:07 (37.7 MB/s) - ‘wine.data.1’ saved [10782/10782]

--2021-09-09 16:19:07--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3036 (3.0K) [application/x-httpd-php]
Saving to: ‘wine.names’


2021-09-09 16:19:07 (39.1 MB/s) - ‘wine.names’ saved [3036/3036]



In [9]:
! tail wine.data

3,13.58,2.58,2.69,24.5,105,1.55,.84,.39,1.54,8.66,.74,1.8,750
3,13.4,4.6,2.86,25,112,1.98,.96,.27,1.11,8.5,.67,1.92,630
3,12.2,3.03,2.32,19,96,1.25,.49,.4,.73,5.5,.66,1.83,510
3,12.77,2.39,2.28,19.5,86,1.39,.51,.48,.64,9.899999,.57,1.63,470
3,14.16,2.51,2.48,20,91,1.68,.7,.44,1.24,9.7,.62,1.71,660
3,13.71,5.65,2.45,20.5,95,1.68,.61,.52,1.06,7.7,.64,1.74,740
3,13.4,3.91,2.48,23,102,1.8,.75,.43,1.41,7.3,.7,1.56,750
3,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
3,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840
3,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560


In [13]:
df = pd.read_csv("wine.data", header=None) # header=None because the csv has no header
df.head()

,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [15]:
df.describe()

,0,1,2,3,4,5,6,7,8,9,10,11,12,13
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


In [21]:
# the first column is the class label
df.iloc[:,0].value_counts()

2    71
1    59
3    48
Name: 0, dtype: int64

## Make a train / validation / test split
The training set is used to fit the model. The validation set is used to choose model hyperparameters (more on this later). The test set is used to estimate the actual error/performance. We do this because the model is going to have a smaller error on the data that was used for training than on "never before seen" data. The goal of machine learning is to **generalize**, that is, to perform well on "never before seen" data.

Since I am not planning to tune any hyperparameters, I will split the data into train and test sets only.
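
For reference, a three-way split can be built from two calls to `train_test_split`. This is only a sketch: the data below is a random stand-in with the same shape as the wine data, and the `_demo` names, the 60/20/20 proportions, and `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in with the same shape as the wine data (178 rows, 13 features, labels 1..3).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(178, 13))
y_demo = rng.integers(1, 4, size=178)

# Step 1: hold out 20% of all rows as the test set.
x_rest, x_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% so that validation is about 20% of the total
# (0.25 of the remaining rows) and training is about 60% of the total.
x_train_demo, x_val_demo, y_train_demo, y_val_demo = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

print(x_train_demo.shape, x_val_demo.shape, x_test_demo.shape)
```

The model would be fit on the train part, hyperparameters compared on the validation part, and the test part touched only once at the end.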

In [25]:
df.shape

(178, 14)

In [23]:
Y = df.iloc[:,0].values
X = df.iloc[:,1:].values

In [24]:
X.shape

(178, 13)

In [28]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [29]:
x_train.shape, x_test.shape

((142, 13), (36, 13))

In [34]:
pd.Series(y_train).value_counts()/142

2    0.429577
1    0.338028
3    0.232394
dtype: float64

In [36]:
pd.Series(y_test).value_counts()/36

3    0.416667
1    0.305556
2    0.277778
dtype: float64

In [40]:
# we can split the data in a stratified fashion, using Y as the class labels
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, stratify=Y)

In [41]:
pd.Series(y_train).value_counts()/142

2    0.401408
1    0.330986
3    0.267606
dtype: float64

In [42]:
pd.Series(y_test).value_counts()/36

2    0.388889
1    0.333333
3    0.277778
dtype: float64

## Standardize your data
We will be fitting a linear model. In this case it is always a good idea to standardize your data. If we want to "regularize" we have to standardize, because the penalty acts on the size of the coefficients, and that size depends on the scale of each feature. The `StandardScaler()` computes the mean and standard deviation from the training set, and those same values are then applied to both the training and test sets.

If we don't scale the data, the optimizer for the linear model sometimes does not converge fast enough.
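
To make the scaling step concrete, the sketch below checks that `StandardScaler` gives the same result as subtracting the training mean and dividing by the training standard deviation by hand. The arrays are random stand-ins, and the `_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 100 "training" rows and 20 "test" rows with 4 features.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
test_demo = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

# Fit on the training rows only.
scaler_demo = StandardScaler().fit(train_demo)

# The same statistics computed by hand.
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the *training* statistics to the test rows.
manual_test = (test_demo - mu) / sigma
scaled_test = scaler_demo.transform(test_demo)

print(np.allclose(manual_test, scaled_test))
```

The test rows never contribute to `mu` or `sigma`, so no information from the test set leaks into the preprocessing.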

In [47]:
scaler = StandardScaler() # creates the scaler
scaler.fit(x_train) # computes mean and standard deviation from training data

StandardScaler()

In [48]:
scaler.mean_

array([1.30245775e+01, 2.33218310e+00, 2.36239437e+00, 1.93366197e+01,
       9.94507042e+01, 2.27225352e+00, 2.00978873e+00, 3.65704225e-01,
       1.55809859e+00, 5.08633802e+00, 9.63563380e-01, 2.58612676e+00,
       7.45640845e+02])

In [50]:
# transforms both the train and test sets
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)

In [53]:
X_train.std(0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [54]:
X_test.std(0) # it is not perfectly 1 because we used the std computed from training data

array([0.9511094 , 1.06975048, 0.75575318, 0.92528777, 1.33251968,
       0.91436947, 0.97037093, 1.05405061, 1.02187247, 0.93144176,
       0.9437501 , 1.02538909, 0.9986537 ])

## Fit a model

In [55]:
clf = LogisticRegression(random_state=0) # this creates a model
clf.fit(X_train, y_train)  # this fits the model to the training data

LogisticRegression(random_state=0)

## Predict on test data
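
A minimal sketch of this step, assuming sklearn's bundled copy of the wine data (`load_wine`) in place of the downloaded `wine.data`. Its labels are 0, 1, 2 rather than 1, 2, 3. The `_demo` names, `random_state=0`, and the choice of accuracy as the metric are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled copy of the wine data: 178 rows, 13 features, 3 classes.
X_demo, y_demo = load_wine(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit the scaler and the model on the training rows only.
scaler_demo = StandardScaler().fit(x_tr)
clf_demo = LogisticRegression(random_state=0).fit(scaler_demo.transform(x_tr), y_tr)

# Predict class labels for the held-out rows and compare with the true labels.
y_pred = clf_demo.predict(scaler_demo.transform(x_te))
test_accuracy = accuracy_score(y_te, y_pred)
print(f"test accuracy: {test_accuracy:.3f}")
```

With the notebook's own variables, the equivalent calls would be `clf.predict(X_test)` followed by a comparison against `y_test`.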

## K-fold cross-validation
K-fold cross-validation is a useful method for estimating performance. It is particularly useful for small datasets.
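
In K-fold cross-validation the data is divided into K parts; each part serves once as the held-out set while the model is trained on the other K-1 parts, and the K scores are averaged. The sketch below is one possible setup, again assuming sklearn's bundled `load_wine` data; K=5, the stratified and shuffled folds, and `random_state=0` are illustrative choices. Putting the scaler inside a pipeline means it is refit on the training folds of every split.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the wine data: 178 rows, 13 features, 3 classes.
X_demo, y_demo = load_wine(return_X_y=True)

# Scaling and the classifier are chained so that each fold standardizes
# with statistics from its own training part only.
model_demo = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

# 5 folds, each keeping roughly the same class proportions as the full data.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv_demo)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread of the per-fold scores gives a rough sense of how much the estimate depends on which rows happen to be held out.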