# Classification example
UCI Wine recognition dataset. Three-class classification.

More information about the data here:
 * https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine
 * https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset

In [None]:
# Import packages
import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.preprocessing

# Make the results reproducible
np.random.seed(42)

data = sklearn.datasets.load_wine()

# The samples in this data set are sorted by class, so shuffle them first.
# Without shuffling, the validation and test sets would contain a single class.
perm = np.random.permutation(data.data.shape[0])
data_X = data.data[perm, :]
data_y = data.target[perm]

# Split into training, validation, and test data sets
n_train = int(data.data.shape[0] * 0.75 + 0.5)  # Train on 75 %
n_val = int(data.data.shape[0] * 0.15 + 0.5)  # Validate on 15 %
n_test = data.data.shape[0] - n_train - n_val  # Test on the remaining ~10 %

X = data_X[:n_train, :]
y = data_y[:n_train]
X_val = data_X[n_train:n_train + n_val, :]
y_val = data_y[n_train:n_train + n_val]
# Note! Do not use (at all!) the test data until the very end!
X_test = data_X[n_train + n_val:, :]
y_test = data_y[n_train + n_val:]

print(f"Training set size X  : {X.shape}")
print(f"Training set size y  : {y.shape}")
print(f"Validation set size X: {X_val.shape}")
print(f"Validation set size y: {y_val.shape}")
print(f"Test set size X      : {X_test.shape}")
print(f"Test set size y      : {y_test.shape}")
print(f"Output classes       : {set(y)}")
print(f"Feature names        : {data.feature_names}")
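
The wine samples are stored sorted by class, so how the data is divided matters. As an alternative way to build the three sets, here is a minimal sketch using `sklearn.model_selection.train_test_split` with stratification, which keeps the class proportions roughly equal in each part. The names `X_tr`, `X_va`, `X_te` and the two-step split are illustrative assumptions, not part of the notebook's own code.

```python
import numpy as np
import sklearn.datasets
import sklearn.model_selection

data = sklearn.datasets.load_wine()

# Step 1: split off roughly 75 % for training, preserving class proportions.
X_tr, X_rest, y_tr, y_rest = sklearn.model_selection.train_test_split(
    data.data, data.target, train_size=0.75,
    stratify=data.target, random_state=42)

# Step 2: split the remaining ~25 % into validation (60 % of the rest,
# about 15 % overall) and test (40 % of the rest, about 10 % overall).
X_va, X_te, y_va, y_te = sklearn.model_selection.train_test_split(
    X_rest, y_rest, train_size=0.6,
    stratify=y_rest, random_state=42)

# Show the size and the per-class counts of each part.
for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    counts = np.bincount(labels, minlength=3)
    print(f"{name:5s}: n={labels.size:3d}, class counts={counts}")
```

Every class should appear in all three parts, which is what makes the validation and test accuracies meaningful for a three-class problem.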

In [None]:
# Task 1: Preprocess the data
#  - Try without preprocessing, try with different kinds.
#  - Evaluate and compare models on the validation data.
#
# Note that we fit the preprocessing function to the training data!
# Then we apply the learned transformation to the validation and test data sets.
standard_scaler = sklearn.preprocessing.StandardScaler()
standard_scaler.fit(X)
X_ = standard_scaler.transform(X)
X_val_ = standard_scaler.transform(X_val)
X_test_ = standard_scaler.transform(X_test)
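
One possible way to approach Task 1 is to loop over a few preprocessing choices and compare the validation accuracy of the same classifier for each. The sketch below is self-contained and makes its own stratified split; the names `X_tr`, `X_va`, `scalers`, and `val_scores`, the choice of scalers, and the default L2-penalized logistic regression are all assumptions made for illustration.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing

data = sklearn.datasets.load_wine()
X_tr, X_va, y_tr, y_va = sklearn.model_selection.train_test_split(
    data.data, data.target, test_size=0.25,
    stratify=data.target, random_state=42)

# Candidate preprocessing steps; None means "use the raw features".
scalers = {
    "none": None,
    "standard": sklearn.preprocessing.StandardScaler(),
    "min-max": sklearn.preprocessing.MinMaxScaler(),
    "robust": sklearn.preprocessing.RobustScaler(),
}

val_scores = {}
for name, scaler in scalers.items():
    if scaler is None:
        A_tr, A_va = X_tr, X_va
    else:
        # Fit on the training data only, then apply to both sets.
        A_tr = scaler.fit(X_tr).transform(X_tr)
        A_va = scaler.transform(X_va)
    clf = sklearn.linear_model.LogisticRegression(max_iter=5000)
    clf.fit(A_tr, y_tr)
    val_scores[name] = sklearn.metrics.accuracy_score(y_va, clf.predict(A_va))
    print(f"{name:9s} validation accuracy: {val_scores[name]:.2f}")
```

The solver may emit a convergence warning on the unscaled features, which is itself a hint about why scaling helps gradient-based solvers.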

In [None]:
# Fit baseline model
model_baseline = sklearn.linear_model.LogisticRegression(
    penalty=None,  # No regularization; scikit-learn < 1.2 spells this "none"
    tol=0.0001,
    fit_intercept=True,
    solver="lbfgs",  # Fits a multinomial model for multiclass targets
    max_iter=100)
_ = model_baseline.fit(X_, y)

In [None]:
# Evaluate baseline model
yhat = model_baseline.predict(X_)
yhat_val = model_baseline.predict(X_val_)
acc = sklearn.metrics.accuracy_score(y, yhat)
acc_val = sklearn.metrics.accuracy_score(y_val, yhat_val)
print(f"Training data accuracy  : {acc:.2f}")
print(f"Validation data accuracy: {acc_val:.2f}")

Compare the two accuracies. If the model makes no errors on the training data but makes more errors on the validation data, what does this mean? Can we do anything about it?
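
A perfect training score paired with a lower validation score is commonly read as overfitting, and regularization is one thing to try. The sketch below sweeps the inverse regularization strength `C` of an L2-penalized logistic regression (the scikit-learn default) and records training and validation accuracy for each value. It is self-contained and uses its own stratified split; the grid `C_values` and the names `A_tr`, `A_va`, and `results` are assumptions for illustration.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing

data = sklearn.datasets.load_wine()
X_tr, X_va, y_tr, y_va = sklearn.model_selection.train_test_split(
    data.data, data.target, test_size=0.25,
    stratify=data.target, random_state=42)

# Standardize using statistics from the training data only.
scaler = sklearn.preprocessing.StandardScaler().fit(X_tr)
A_tr = scaler.transform(X_tr)
A_va = scaler.transform(X_va)

# Smaller C means stronger regularization.
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
results = []
for C in C_values:
    clf = sklearn.linear_model.LogisticRegression(C=C, max_iter=1000)
    clf.fit(A_tr, y_tr)
    acc_tr = sklearn.metrics.accuracy_score(y_tr, clf.predict(A_tr))
    acc_va = sklearn.metrics.accuracy_score(y_va, clf.predict(A_va))
    results.append((C, acc_tr, acc_va))
    print(f"C={C:<7g} train={acc_tr:.2f}  val={acc_va:.2f}")
```

Pick the value of `C` by looking at the validation column only; the test data stays untouched at this stage.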

In [None]:
# Task 2: Find a better model
#  - Try different classification methods
#  - Evaluate them on the validation data
#  - Beat the baseline model and select the best one you can find
#  - See what the penalty parameter does in the LogisticRegression class:
#    https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
#  - You can look here for other potential models to use:
#    https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

model = "... add your own classification model code here!"

_ = model.fit(X_, y)

In [None]:
# Evaluate better model
yhat = model.predict(X_)
yhat_val = model.predict(X_val_)
acc = sklearn.metrics.accuracy_score(y, yhat)
acc_val = sklearn.metrics.accuracy_score(y_val, yhat_val)
print(f"Training data accuracy  : {acc:.2f}")
print(f"Validation data accuracy: {acc_val:.2f}")

In [None]:
# Task 3: Determine the importance of the input variables
# ... your code here
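
One possible approach to Task 3 is permutation importance: shuffle one input variable at a time and measure how much the validation accuracy drops. The sketch below uses `sklearn.inspection.permutation_importance` on a standardized logistic regression. It is self-contained, and the split, the classifier, `n_repeats=30`, and the names `X_tr`, `X_va`, `result`, and `order` are assumptions chosen for illustration rather than the notebook's intended solution.

```python
import numpy as np
import sklearn.datasets
import sklearn.inspection
import sklearn.linear_model
import sklearn.model_selection
import sklearn.preprocessing

data = sklearn.datasets.load_wine()
X_tr, X_va, y_tr, y_va = sklearn.model_selection.train_test_split(
    data.data, data.target, test_size=0.25,
    stratify=data.target, random_state=42)

# Fit the scaler and the classifier on the training data only.
scaler = sklearn.preprocessing.StandardScaler().fit(X_tr)
clf = sklearn.linear_model.LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_tr), y_tr)

# Shuffle each feature column in turn and record the mean drop in accuracy
# on the validation data over n_repeats shuffles.
result = sklearn.inspection.permutation_importance(
    clf, scaler.transform(X_va), y_va, n_repeats=30, random_state=42)

# List the features from most to least important.
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f"{data.feature_names[i]:30s} "
          f"{result.importances_mean[i]: .3f} +/- {result.importances_std[i]:.3f}")
```

With correlated inputs, permutation importance can understate the role of any single variable, so the ranking is best treated as a rough guide.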

In [None]:
# Evaluate the final model on the test data.
# This is only ever done once, and as the last thing we do.
# Training another model after this, based on the performance on the test data,
# leads to biased results.
yhat = model.predict(X_)
yhat_val = model.predict(X_val_)
yhat_test = model.predict(X_test_)
acc = sklearn.metrics.accuracy_score(y, yhat)
acc_val = sklearn.metrics.accuracy_score(y_val, yhat_val)
acc_test = sklearn.metrics.accuracy_score(y_test, yhat_test)
print(f"Training data accuracy  : {acc:.2f}")
print(f"Validation data accuracy: {acc_val:.2f}")
print(f"Test data accuracy      : {acc_test:.2f}")