# AI Lab 08

Data Transformation and MLPs.


# **Problem Statement:**
**Classification of Wines Using Machine Learning**

Wine classification is a crucial task in the food and beverage industry, enabling the differentiation of wines based on their chemical composition. The Wine dataset from the UCI Machine Learning Repository

    https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

provides a structured collection of 178 wine samples, each described by 13 numerical features, including alcohol content, magnesium levels, and color intensity. These samples belong to three distinct classes, representing different types of wine grown in the same region of Italy.

The goal of this study is to develop a machine learning model, in particular an artificial neural network, that can accurately classify wines based on their chemical attributes.

By leveraging classification algorithms, such models can assist in quality control, wine authentication, and recommendation systems. The dataset will first be explored and analyzed, followed by preprocessing steps such as normalization to ensure that all features contribute equally. The performance of different classifiers will be evaluated to determine the most effective model for accurate wine classification.

## Loading the dataset

The first cell below loads the ``wine dataset``.

https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The second cell prints out the description of this dataset. In short, it is for **classification of wines** based on features like alcohol, magnesium and colour.

In [1]:
from sklearn.datasets import load_wine

# Loading the dataset
dataset = load_wine()

# Get the X (feature matrix) and y (class label vector) from the data
X, y = dataset.data, dataset.target

# **Explanation:**

This code is using Scikit-learn's load_wine function to load the Wine dataset, which is a well-known dataset used for classification tasks. Below is a step-by-step explanation:

**Importing the Dataset**

    from sklearn.datasets import load_wine

a) This imports the load_wine function from the sklearn.datasets module.

b) load_wine() is a function that loads the Wine dataset, which contains chemical properties of different wines and their corresponding labels.

**Loading the Dataset**

    dataset = load_wine()

load_wine() returns a dictionary-like object (a Bunch object) containing the dataset.

This dataset includes:

a) data: The feature matrix (chemical composition of wine samples).

b) target: The class labels (wine types).

c) feature_names: The names of the features.

d) target_names: The names of the classes.

**Extracting Features and Labels**

    X, y = dataset.data, dataset.target

    X = dataset.data

a) X (feature matrix) contains the numerical values of the chemical composition of wines.

b) Each row represents a different wine sample.

c) Each column represents a different feature (like alcohol content, malic acid, etc.).

    y = dataset.target

a) y is the target vector (class labels) that indicates the type of wine.

b) The dataset has three different classes (0, 1, 2), which correspond to different types of wine.
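
The structure described above can be checked directly. The following is a minimal sketch that prints the shapes of `X` and `y`, a few feature names, the class names, and the number of samples per class; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine

# Load the Bunch object and pull out features and labels
dataset = load_wine()
X, y = dataset.data, dataset.target

print(X.shape)                     # (number of samples, number of features)
print(y.shape)                     # one class label per sample
print(dataset.feature_names[:3])   # first few feature names
print(dataset.target_names)        # names of the three classes

# Count how many samples belong to each class (labels are 0, 1, 2)
class_counts = np.bincount(y)
print(class_counts)
```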


In [6]:
# Print out the dataset description
print(dataset.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
    - Alcohol
    - Malic acid
    - Ash
    - Alcalinity of ash
    - Magnesium
    - Total phenols
    - Flavanoids
    - Nonflavanoid phenols
    - Proanthocyanins
    - Color intensity
    - Hue
    - OD280/OD315 of diluted wines
    - Proline
    - class:
        - class_0
        - class_1
        - class_2

:Summary Statistics:

                                Min   Max   Mean     SD
Alcohol:                      11.0  14.8    13.0   0.8
Malic Acid:                   0.74  5.80    2.34  1.12
Ash:                          1.36  3.23    2.36  0.27
Alcalinity of Ash:            10.6  30.0    19.5   3.3
Magnesium:                    70.0 162.0    99.7  14.3
Total Phenols:                0.98  3.88    2.29  0.63
Flavanoids:                   0.34  5.08    2.03  1.00

# **Explanation:**
The command print(dataset.DESCR) prints a detailed description of the Wine dataset. This description provides an overview of the dataset, including its origin, purpose, structure, and features. The dataset is sourced from the UCI Machine Learning Repository and is used for classification tasks. It consists of 178 wine samples, each described by 13 numerical features representing chemical properties such as alcohol content, malic acid, ash, alkalinity, magnesium, and phenols. The dataset is labeled into three classes (0, 1, and 2), corresponding to three different types of wine grown in the same region in Italy. The description also includes details about the feature names, class distribution, and references to original research papers where the dataset was used. This information helps researchers and data scientists understand the dataset before applying machine learning models.

## Scaling datasets

**TODO**: do the normalisation and scaling as per the lab notes to create 2 additional versions of the dataset.

In [4]:
# Normalising feature matrix
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalizer.fit(X)
X_normalised = normalizer.transform(X)

# **Explanation**

This code snippet normalizes the feature matrix X using L2 normalization, which rescales each sample (row) to have unit norm (i.e., the sum of the squared values in each row equals 1). Note that this operates per sample, not per feature: it puts every sample on the same overall scale, but it does not equalize the ranges of the individual features (that is the job of standardization).

**Importing the Normalizer**

    from sklearn.preprocessing import Normalizer

This imports the Normalizer class from Scikit-learn’s preprocessing module.

The Normalizer applies row-wise normalization, meaning each wine sample (row) is transformed individually.

**Creating a Normalizer Instance**

    normalizer = Normalizer()

This creates an instance of the Normalizer class.

By default, Normalizer() applies L2 normalization, which scales each row so that the sum of squared values equals 1.

**Fitting the Normalizer to the Data**

    normalizer.fit(X)

Normalizer is stateless: unlike a standardizing scaler, it does not learn any parameters (such as a mean or variance) from the data, so fit(X) does not compute any statistics.

The call is included here only to follow the usual Scikit-learn fit/transform convention.

**Transforming the Feature Matrix**

    X_normalised = normalizer.transform(X)

The transform(X) method applies L2 normalization to each row.

The transformed feature matrix X_normalised contains scaled values where the sum of squared feature values in each row equals 1.
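
The unit-norm property can be verified numerically. This is a small sketch, assuming NumPy is available; `row_norms` and `X_manual` are names introduced here for illustration. It checks that every row of the transformed matrix has an L2 norm of 1, and that the result matches dividing each row by its own norm by hand.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import Normalizer

X = load_wine().data
X_normalised = Normalizer().fit_transform(X)

# L2 norm of every row after normalisation (all values should be 1)
row_norms = np.linalg.norm(X_normalised, axis=1)
print(row_norms.min(), row_norms.max())

# The same transformation done by hand: divide each row by its own L2 norm
X_manual = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.allclose(X_normalised, X_manual))
```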

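Because the rescaling is applied per row rather than per column, it puts every sample on the same overall scale, but features with larger numerical ranges (such as proline) can still dominate within each sample.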

**Key Takeaways:**

Normalization is different from standardization (which centers each feature to have zero mean and unit variance).

Normalizer is useful when the direction of a sample's feature vector matters more than its magnitude, for example with similarity-based methods such as K-Nearest Neighbors (KNN) using cosine similarity.

It rescales all samples (rows) to the same norm, so no sample stands out purely because of its overall magnitude.
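
To make the difference between the two approaches concrete, here is a small sketch on a made-up 3x2 matrix (`X_toy` is a hypothetical example, not part of the lab data). `Normalizer` works across each row, while `StandardScaler` works down each column.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data: the second feature is on a much larger scale than the first
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0]])

X_rows = Normalizer().fit_transform(X_toy)      # each ROW gets unit L2 norm
X_cols = StandardScaler().fit_transform(X_toy)  # each COLUMN gets mean 0, std 1

print(np.linalg.norm(X_rows, axis=1))           # row norms after normalisation
print(X_cols.mean(axis=0), X_cols.std(axis=0))  # column statistics after scaling
```

In the normalised version the first column stays tiny compared with the second, which is why row-wise normalisation alone does not balance features that live on different scales.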


In [5]:
# TODO: Scaling feature matrix
from sklearn.preprocessing import StandardScaler
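
One possible way to complete the scaling step is sketched below. It is not necessarily the approach given in the lab notes; the names `scaler` and `X_scaled` are assumptions chosen to mirror `normalizer` and `X_normalised` above.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import numpy as np

X = load_wine().data

# Standardise each feature (column) to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Column means should be (numerically) 0 and standard deviations 1
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))
```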



## MLP validation

The 1st cell below defines a helper function that prints the validation results. The 2nd cell does the following:

 * Creates a 3-layer ``MLPClassifier`` with 10 neurons in the hidden layer, to be trained for at most 10 epochs (iterations)
 * Validates the MLP with 5-fold cross-validation (using the original feature values ``X``), which repeatedly splits the dataset for training and testing
 * Calculates and prints out the accuracy, along with the training and testing times

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate
from sklearn import metrics

def print_results(scores):
    print("Accuracy:          %0.2f (+/- %0.2f)" % (scores['test_score'].mean(), scores['test_score'].std() * 2))
    print("Training time (s): %0.2f (+/- %0.2f)" % (scores['fit_time'].mean(), scores['fit_time'].std() * 2))
    print("Testing time (s):  %0.2f (+/- %0.2f)" % (scores['score_time'].mean(), scores['score_time'].std() * 2))

In [None]:
# Instantiating MLP ((10,) is a one-element tuple: a single hidden layer of 10 neurons)
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=10)

# Validating MLP model
scores = cross_validate(model, X, y, cv=5)

# Printing performance results
print_results(scores)
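
`cross_validate` only reports per-fold scores and timings. If a confusion matrix is also wanted, one option is `cross_val_predict`, which collects the out-of-fold prediction for every sample. The sketch below assumes the same model settings as above; `random_state=0` is an addition made here only to keep the run repeatable. Convergence warnings are expected with so few iterations.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)

# Same architecture as above; random_state is an assumption for repeatability
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=10, random_state=0)

# Each sample is predicted by the model trained on the folds that exclude it
y_pred = cross_val_predict(model, X, y, cv=5)

cm = confusion_matrix(y, y_pred)   # rows: true class, columns: predicted class
print(cm)
print("Accuracy: %0.2f" % accuracy_score(y, y_pred))
```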

You should have seen warnings about the MLP not having converged above, and a rather sub-optimal performance!

**TODO**: see what the performance is like by just increasing the number of training iterations (epochs).

**QUESTION**: after increasing ``max_iter`` to the point the convergence warning disappears, what's the performance like? Is it good enough?
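
One way to explore this is to loop over a few values of ``max_iter`` and record both the mean accuracy and how many convergence warnings were raised. This is only a sketch: the values tried, `random_state=0`, and the names `mean_accuracy` and `n_warnings` are assumptions made for illustration.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)

mean_accuracy = {}
for max_iter in (10, 100, 1000):
    model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=max_iter, random_state=0)

    # Capture warnings instead of printing them, so they can be counted
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", category=ConvergenceWarning)
        scores = cross_validate(model, X, y, cv=5)

    n_warnings = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    mean_accuracy[max_iter] = scores["test_score"].mean()
    print("max_iter=%4d  accuracy=%0.2f  convergence warnings=%d"
          % (max_iter, mean_accuracy[max_iter], n_warnings))
```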

## Normalised vs Scaled feature values

If you tweak the number of neurons in the hidden layer and the maximum number of iterations (and other hyper-parameters), you will probably find that the performance remains quite poor.

So, let us now move on to comparing the performance when using the normalised and scaled feature matrices instead.

### Normalised feature matrix
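
A minimal sketch of this comparison, assuming the same 10-neuron MLP and 5-fold cross-validation as before. The value ``max_iter=1000`` and ``random_state=0`` are assumptions made here, not settings prescribed by the lab.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import Normalizer

X, y = load_wine(return_X_y=True)

# Row-wise L2 normalisation, as in the scaling section
X_normalised = Normalizer().fit_transform(X)

model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
scores = cross_validate(model, X_normalised, y, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)"
      % (scores["test_score"].mean(), scores["test_score"].std() * 2))
```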

### Scaled feature matrix
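
A minimal sketch for the scaled case. Instead of scaling the whole matrix up front, this version wraps ``StandardScaler`` and the MLP in a pipeline, so the scaler is fitted only on the training folds during cross-validation; this is a design choice made here, and passing a pre-scaled matrix to ``cross_validate`` would also work. ``max_iter=1000`` and ``random_state=0`` are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The pipeline refits the scaler inside every fold, using training data only
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0),
)
scores = cross_validate(model, X, y, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)"
      % (scores["test_score"].mean(), scores["test_score"].std() * 2))
```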

## Discussion and Conclusions

Above, you should be able to make a few key observations regarding:

* The performance of MLPs in general on this dataset
* How the performance is affected by
  - Hyper-parameters like the number of neurons and number of epochs
  - Data processing: original feature values vs normalised vs scaled

What seems to be best?

What seems to be worst?

What seems to make the biggest difference to the performance?

**PS**: Feel free to play around with other hyper-parameters as well, which you can see in the API reference documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html