## Using sklearn for basic data transformation, cross-validation

Many common machine learning operations and procedures are already implemented in various `sklearn` modules.  Here, we see one way of handling a couple of basic tasks: data transformation and cross-validation testing.

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
np.set_printoptions(suppress=True, precision=2)
plt.style.use('seaborn-v0_8') # pretty matplotlib plots (this style was named 'seaborn' before matplotlib 3.6)
sns.set(font_scale=2)

### A built-in data set

There are a number of existing data-sets in `sklearn`, many drawn from real-world data-sources. Here, we use the Wisconsin breast cancer set, which allows high-accuracy classification with pretty simple models:  

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

Here, the data is loaded into a dataframe; in basic form, it consists of 569 data-points, each characterized by 30 real-valued features in a number of different units. For more information about the data, see the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

In [15]:
dataset = load_breast_cancer(as_frame=True)
frame = dataset.frame
X = frame.iloc[:, :-1]  # all 30 feature columns
y = frame.iloc[:, -1]   # the last column of the frame is the target

X

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |
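
As a quick check on the description above, the following sketch loads the same data with `return_X_y=True` (so the features and target come back directly, without going through `dataset.frame`) and looks at the range of each feature. The names `X_df`, `y_sr`, and `ranges` are chosen here for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

# return_X_y=True returns (features, target); as_frame=True keeps them as pandas objects
X_df, y_sr = load_breast_cancer(return_X_y=True, as_frame=True)

print(X_df.shape)  # 569 rows, 30 feature columns

# Per-feature minimum and maximum: the features live on very different scales
ranges = X_df.agg(["min", "max"]).T
print(ranges.head(10))
```

Looking at the minimum and maximum of each column makes the difference in units visible before any model is trained.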


### A basic perceptron model: one test

A simple perceptron model will often do an OK job on this data.  Performance can vary quite a bit, however, depending upon the exact test/train split we get, which by default is randomized across runs.  Performance can also be hampered somewhat by the fact that the original data is not scaled: the features span several orders of magnitude.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html

In [29]:
print("-----------------\nClassify with base data, 1 split\n-----------------")

# Start by converting data to Numpy arrays for convenience
X = np.array(X)
y = np.array(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

model = Perceptron()
model.fit(X_train, y_train)
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train, pred_train)  # arguments are (y_true, y_pred)
acc_test = accuracy_score(y_test, pred_test)
print("Train accuracy: ", acc_train)
print("Test accuracy: ", acc_test)

-----------------
Classify with base data, 1 split
-----------------
Train accuracy:  0.9090909090909091
Test accuracy:  0.9254385964912281
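
To see how much the result depends on the particular split, one option is to repeat the experiment with several fixed values of `random_state`. This is only a sketch: the seeds `0` through `4` are arbitrary choices, and the names `test_accs`, `clf`, and `X_tr`/`X_te` are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

test_accs = []
for seed in range(5):  # arbitrary seeds, each giving a different 60/40 split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.4, random_state=seed
    )
    clf = Perceptron().fit(X_tr, y_tr)
    test_accs.append(accuracy_score(y_te, clf.predict(X_te)))

print("Test accuracies:", np.round(test_accs, 3))
print("Spread (max - min):", np.max(test_accs) - np.min(test_accs))
```

Fixing `random_state` also makes a single run reproducible, which is useful when comparing scaled and unscaled data on the same split.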


### Scaling data features

We can use the exact same test/train split, but scale all our features to the $[0,1]$ range, based upon the training set (i.e., each is scaled according to the maximum/minimum values in that set).  This simulates what we would do in practice, since our incoming test data would not be known in advance.

Scaling to a uniform range tends to give significantly better performance, since large-magnitude features (here, the area measurements, which run into the thousands) are less likely to dominate the weight updates at the expense of features whose values are small fractions.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [31]:
print("-----------------\nClassify with scaled data, 1 split\n-----------------")

scaler = MinMaxScaler()
scaler.fit(X_train)
X_scaled_train = scaler.transform(X_train)
X_scaled_test = scaler.transform(X_test)

model.fit(X_scaled_train, y_train)  # fit() retrains from scratch (warm_start is False by default)
pred_train = model.predict(X_scaled_train)
pred_test = model.predict(X_scaled_test)

acc_train = accuracy_score(y_train, pred_train)  # arguments are (y_true, y_pred)
acc_test = accuracy_score(y_test, pred_test)
print("Train accuracy: ", acc_train)
print("Test accuracy: ", acc_test)

-----------------
Classify with scaled data, 1 split
-----------------
Train accuracy:  0.9824046920821115
Test accuracy:  0.9692982456140351
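
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each feature $x$ to $(x - x_{\min}) / (x_{\max} - x_{\min})$, where the minimum and maximum are taken from the data passed to `fit()`. The small sketch below reproduces that by hand on made-up numbers (the toy arrays are assumptions for illustration only) and shows that test values outside the training range land outside $[0,1]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, invented for illustration: two features on different scales
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_test_toy = np.array([[4.0, 50.0]])  # both values lie outside the training range

scaler_toy = MinMaxScaler().fit(X_train_toy)

# Manual version of the same transformation, using training min/max only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)
manual_test = (X_test_toy - col_min) / (col_max - col_min)

sk_test = scaler_toy.transform(X_test_toy)
print(sk_test)      # first feature maps to 1.5, second to -0.25
print(manual_test)  # same values
```

Values slightly outside $[0,1]$ in the test set are expected and generally harmless; they simply reflect test points beyond the range seen in training.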


### Cross-validation testing

Rather than relying on a single randomized test/train split, we can use $k$-fold cross-validation: the data is divided into $k$ folds, and each fold serves once as the test set while the remaining $k-1$ folds are used for training.  There are a number of ways of handling this in `sklearn`; one basic approach uses a `KFold` object to generate the splits of our data automatically.  Note that, by default, `KFold` does not shuffle, so the folds are consecutive blocks of rows.  Here, we do this for our basic, non-scaled data, using 5 folds.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
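
To make the behaviour of `KFold.split()` concrete, here is a minimal sketch on a toy array of ten items (the array `toy` and the list `folds` are illustrative names, not part of the notebook). It yields pairs of index arrays, not the data itself.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # ten placeholder samples

# Without shuffling, each test fold is a consecutive block of indices
folds = list(KFold(n_splits=5).split(toy))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
# test folds are [0 1], [2 3], [4 5], [6 7], [8 9]
```

If the rows of a data set are ordered in some meaningful way, passing `shuffle=True` (optionally with a `random_state`) to `KFold` randomizes which rows end up in each fold.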

In [33]:
print("-----------------\nClassify with base data, 5 folds\n-----------------")

k = 5
kfold = KFold(n_splits=k)
train_scores = []
test_scores = []

for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X[train_idx,:], X[test_idx,:]
    y_train, y_test = y[train_idx], y[test_idx]
    
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    acc_train = accuracy_score(y_train, pred_train)  # arguments are (y_true, y_pred)
    acc_test = accuracy_score(y_test, pred_test)
    print("Train accuracy: ", acc_train)
    print("Test accuracy: ", acc_test)
    
    train_scores.append(acc_train)
    test_scores.append(acc_test)
    
print("\nAverage train accuracy: ", np.average(train_scores))
print("Average test accuracy: ", np.average(test_scores))

-----------------
Classify with base data, 5 folds
-----------------
Train accuracy:  0.7714285714285715
Test accuracy:  0.8771929824561403
Train accuracy:  0.8945054945054945
Test accuracy:  0.8859649122807017
Train accuracy:  0.865934065934066
Test accuracy:  0.8771929824561403
Train accuracy:  0.9032967032967033
Test accuracy:  0.9298245614035088
Train accuracy:  0.9166666666666666
Test accuracy:  0.911504424778761

Average train accuracy:  0.8703663003663002
Average test accuracy:  0.8963359726750506
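
The explicit loop above is useful for seeing what happens in each fold. As one of the other ways of handling this, `sklearn.model_selection.cross_validate` can run the same kind of procedure in a single call. The sketch below assumes the same unscaled data, a fresh `Perceptron`, and the same 5-fold `KFold`; the name `results` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold, cross_validate

X, y = load_breast_cancer(return_X_y=True)

# cross_validate clones the model, fits it on each training fold,
# and scores it on the matching test fold
results = cross_validate(
    Perceptron(),
    X,
    y,
    cv=KFold(n_splits=5),
    scoring="accuracy",
    return_train_score=True,
)

print("Train accuracy per fold:", results["train_score"])
print("Test accuracy per fold: ", results["test_score"])
print("Average test accuracy:  ", results["test_score"].mean())
```

The manual loop remains the better choice when something custom has to happen inside each fold, such as printing intermediate results.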


### Combining data transformation and cross-validation

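We can also do our $k$-fold cross-validation after scaling each feature to $[0,1]$.  In the runs shown here, this raises the average test accuracy from about 0.90 (unscaled folds) to about 0.96.  We scale the entire data-set first, and then do the cross-validation by splitting it up.  This is simple to code, but note that, unlike the single-split example above, the minimum and maximum used for scaling may come from rows that end up in a test fold, so a little information about the test data leaks into training; a stricter estimate would fit the scaler on each training fold only.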
    
**NB**: the `KFold.split()` function can handle data in pandas data-frame format, basic array-based format, and several others, like basic Python lists, so we could have left the data in pandas form, changing only how we index elements.  In general, most of `sklearn` is pretty good at handling all sorts of array-like data.  See the entry for 'array-like' at:
    
https://scikit-learn.org/stable/glossary.html
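
As a sketch of what leaving the data in pandas form might look like, the loop below passes a data-frame to `KFold.split()` and uses `.iloc` for positional indexing instead of `X[train_idx,:]`. The names `X_df`, `y_sr`, and `fold_shapes` are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

X_df, y_sr = load_breast_cancer(return_X_y=True, as_frame=True)

fold_shapes = []
for train_idx, test_idx in KFold(n_splits=5).split(X_df):
    # split() returns integer positions, so .iloc is the matching indexer
    X_train_df, X_test_df = X_df.iloc[train_idx], X_df.iloc[test_idx]
    y_train_sr, y_test_sr = y_sr.iloc[train_idx], y_sr.iloc[test_idx]
    fold_shapes.append((X_train_df.shape, X_test_df.shape))

for shapes in fold_shapes:
    print(shapes)
```

Since 569 is not divisible by 5, the first four test folds hold 114 rows and the last one holds 113.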

In [37]:
print("-----------------\nClassify with scaled data, 5 folds\n-----------------")

# If we like, we can combine the fit() and transform() calls 
# into a single fit_transform() call.
X_scaled = MinMaxScaler().fit_transform(X)

train_scores = []
test_scores = []

for train_idx, test_idx in kfold.split(X_scaled):
    X_train, X_test = X_scaled[train_idx,:], X_scaled[test_idx,:]
    y_train, y_test = y[train_idx], y[test_idx]
    
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    acc_train = accuracy_score(y_train, pred_train)  # arguments are (y_true, y_pred)
    acc_test = accuracy_score(y_test, pred_test)
    print("Train accuracy: ", acc_train)
    print("Test accuracy: ", acc_test)
    
    train_scores.append(acc_train)
    test_scores.append(acc_test)
    
print("\nAverage train accuracy: ", np.average(train_scores))
print("Average test accuracy: ", np.average(test_scores))

-----------------
Classify with scaled data, 5 folds
-----------------
Train accuracy:  0.9846153846153847
Test accuracy:  0.956140350877193
Train accuracy:  0.9406593406593406
Test accuracy:  0.9210526315789473
Train accuracy:  0.978021978021978
Test accuracy:  0.9736842105263158
Train accuracy:  0.978021978021978
Test accuracy:  0.9824561403508771
Train accuracy:  0.9649122807017544
Test accuracy:  0.9646017699115044

Average train accuracy:  0.9692461924040872
Average test accuracy:  0.9595870206489675
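
A variant of this experiment, closer to the single-split example where the scaler was fit on the training set only, is to wrap the scaler and the model in a pipeline so that scaling is re-fit inside every fold. This is a sketch of one common way to do that, not the approach used in the cell above; `pipe` and `scores` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline fits MinMaxScaler on each training fold only,
# then applies that fitted scaler to the corresponding test fold
pipe = make_pipeline(MinMaxScaler(), Perceptron())
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5))

print("Test accuracy per fold:", scores)
print("Average test accuracy: ", scores.mean())
```

On a data set of this size the difference from scaling everything up front is likely to be small, but the pipeline version keeps each test fold fully unseen during fitting.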
