Name | Matr.Nr. | Due Date
:--- | ---: | ---:
Firstname Lastname | 01234567 | 16.03.2023, 08:00 am

<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 1 &ndash; Recap Hands-on AI I (Assignment)</h2>

<b>Authors:</b> B. Schäfl, S. Lehner, J. Brandstetter, A. Schörgenhumer<br>
<b>Date:</b> 07-03-2023

This file is part of the "Hands-on AI II" lecture material. The following copyright statement applies to all code within this file.

<b>Copyright statement:</b><br>
This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this material, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

<h3 style="color:rgb(0,120,170)">How to use this notebook</h3>

This notebook is designed to run from start to finish. It contains several tasks (displayed in <span style="color:rgb(248,138,36)">orange boxes</span>) that require your contribution (in the form of code, plain text, ...). Most or all of the supplied functions are imported from the file <code>u1_utils.py</code>, which can be treated as a black box. For a deeper understanding, however, you can look at the implementations of the helper functions. To run this notebook, the packages imported at the beginning of <code>u1_utils.py</code> must be installed.

<div class="alert alert-warning">
    <b>Important:</b> Set the random seed with <code>u1.set_seed(17)</code> to enable reproducible results in all tasks that incorporate randomness (e.g., t-SNE, splitting data into train and test sets, initializing weights of a neural network, running the model optimization with random batches, etc.). You must use <code>17</code> as seed.
</div>
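
How <code>u1.set_seed</code> works internally is not shown in this notebook. As a rough illustration of what seeding usually involves, the following sketch seeds Python, NumPy and PyTorch with a hypothetical helper <code>seed_everything</code> (an assumed name, not part of <code>u1_utils.py</code>) and checks that two draws after re-seeding are identical.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the most common sources of randomness."""
    random.seed(seed)        # Python's built-in generator
    np.random.seed(seed)     # NumPy's global generator
    torch.manual_seed(seed)  # PyTorch's default generator


seed_everything(17)
first_draw = np.random.rand(3)

seed_everything(17)
second_draw = np.random.rand(3)

# Re-seeding with the same value reproduces the same random numbers.
print(np.array_equal(first_draw, second_draw))
```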

In [None]:
# Import pre-defined utilities specific to this notebook.
import u1_utils as u1

# Import additional utilities needed in this notebook.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch

from pathlib import Path
from PIL import Image
from scipy import signal
from sklearn.neighbors import KNeighborsClassifier

# Set default plotting style.
sns.set()

# Setup Jupyter notebook (warning: this may affect all Jupyter notebooks running on the same Jupyter server).
u1.setup_jupyter()

# Check minimum versions.
u1.check_module_versions()

<h2>1. Tabular data</h2>

<p>In this exercise you'll be working with another famous data set, the <i>breast cancer</i> data set. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image [<a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29">1</a>]. Publication:</p>

<center><cite>W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.</cite></center>

<div class="alert alert-warning">
    <b>Exercise 1.1. [3 Points]</b>
    <ul>
        <li>Load the <i>breast cancer</i> data set using the appropriate function as supplied by us.</li>
        <li>Split the data set into the feature vector matrix and the label vector.</li>
        <li>Visualize the data set in tabular form.</li>
    </ul>
</div>
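
The course loader in <code>u1_utils.py</code> is not reproduced here. As a stand-in, the sketch below uses the copy of the Wisconsin diagnostic data that ships with scikit-learn and shows one way to separate a data frame into a feature matrix and a label vector. Renaming the label column to <code>class</code> is an assumption made to mirror the column name used in this notebook.

```python
from sklearn.datasets import load_breast_cancer

# Stand-in for the course loader: scikit-learn's bundled copy of the data set.
frame = load_breast_cancer(as_frame=True).frame.rename(columns={"target": "class"})

features = frame.drop(columns="class")  # feature vector matrix (one row per sample)
labels = frame["class"]                 # label vector (one entry per sample)

# Tabular view of the first rows.
print(frame.head())
```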

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.2. [3 Points]</b>
    <ul>
        <li>How many samples does the data set contain?</li>
        <li>How many features does the data set consist of (not counting the class label column <i>class</i>)?</li>
        <li>How many different classes are there?</li>
    </ul>
</div>

your answer goes here

<div class="alert alert-warning">
    <b>Exercise 1.3. [3 Points]</b>
    <ul>
        <li>Compute a pairplot of the data set with respect to all features that contain the <i>mean</i> value.</li>
    </ul>
</div>
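
One possible way to restrict a data frame to the <i>mean</i> features is to filter the column names, sketched here on scikit-learn's bundled copy of the data (an assumed stand-in whose column names may differ from those returned by the course loader).

```python
from sklearn.datasets import load_breast_cancer

# Stand-in data; the label column is renamed to "class" to mirror this notebook.
frame = load_breast_cancer(as_frame=True).frame.rename(columns={"target": "class"})

# Keep every feature whose name contains "mean", plus the label for coloring.
mean_columns = [name for name in frame.columns if "mean" in name]
subset = frame[mean_columns + ["class"]]

print(mean_columns)
# seaborn.pairplot(subset, hue="class") would then draw the pairplot.
```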

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.4. [2 Points]</b>
    <ul>
        <li>Name one feature which might indicate linear separability of the classes.</li>
    </ul>
</div>

your answer goes here

<div class="alert alert-warning">
    <b>Exercise 1.5. [3 Points]</b>
    <ul>
        <li>Reduce the dimensionality of the data set using <i>PCA</i> with 2 components and visualize the downprojection.</li>
    </ul>
</div>
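
A minimal sketch of a two-component PCA with scikit-learn, again on the bundled stand-in data. Standardizing the features beforehand is an assumption of this sketch. The exercise does not prescribe it, and the course helper may handle scaling differently.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features, labels = load_breast_cancer(return_X_y=True)

# Assumption: put all features on a comparable scale before projecting.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=2, random_state=17)
projected = pca.fit_transform(scaled)

print(projected.shape)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```

The two columns of <code>projected</code> can be used as x and y coordinates of a scatter plot, colored by <code>labels</code>.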

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.6. [3 Points]</b>
    <ul>
        <li>Apply <i>$k$-means</i> on the original data set and plot the resulting clusters (the plotting must be done using the PCA-downprojected data).</li>
    </ul>
</div>

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.7. [3 Points]</b>
    <ul>
        <li>Apply <i>$k$-means</i> on the PCA-downprojected data set and plot the resulting clusters.</li>
    </ul>
</div>
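
Exercises 1.6 and 1.7 differ only in the input handed to $k$-means. The sketch below runs both variants with scikit-learn and measures how much the two clusterings agree. Using $k=2$ (one cluster per class) and the adjusted Rand index as agreement measure are assumptions of this sketch, not requirements of the exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

features, _ = load_breast_cancer(return_X_y=True)
projected = PCA(n_components=2, random_state=17).fit_transform(features)

# Variant 1: cluster in the original feature space.
clusters_original = KMeans(n_clusters=2, n_init=10, random_state=17).fit_predict(features)

# Variant 2: cluster in the 2D PCA space.
clusters_projected = KMeans(n_clusters=2, n_init=10, random_state=17).fit_predict(projected)

# 1.0 means identical partitions (up to renaming of the cluster ids).
agreement = adjusted_rand_score(clusters_original, clusters_projected)
print(agreement)
```

In both variants the points are plotted at their PCA coordinates; only the cluster assignment used for coloring changes.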

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.8. [3 Points]</b>
    <ul>
        <li>Reduce the dimensionality of the data set using <i>t-SNE</i> with 2 components and visualize the downprojection.</li>
        <li>Choose a fitting perplexity.</li>
    </ul>
</div>
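
To get a feeling for the perplexity, it can help to compute the embedding for more than one value and compare the plots. The following sketch uses scikit-learn's <code>TSNE</code> on the bundled stand-in data; the two perplexity values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE

features, _ = load_breast_cancer(return_X_y=True)

embeddings = {}
for perplexity in (5, 30):  # arbitrary example values
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=17)
    embeddings[perplexity] = tsne.fit_transform(features)
    print(perplexity, embeddings[perplexity].shape)
```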

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.9. [3 Points]</b>
    <ul>
        <li>Apply <i>$k$-means</i> on the original data set and plot the resulting clusters (the plotting must be done using the t-SNE-downprojected data).</li>
    </ul>
</div>

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.10. [3 Points]</b>
    <ul>
        <li>Apply <i>$k$-means</i> on the t-SNE-downprojected data set and plot the resulting clusters.</li>
    </ul>
</div>

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 1.11. [2 Points]</b>
    <ul>
        <li>Compare and interpret the results.</li>
    </ul>
</div>

your answer goes here

<h2>2. Sequence data</h2>
<p>In this exercise you'll be working with <i>electricity demand</i> data as collected from the <i>Australian New South Wales Electricity Market</i>. It was first published/described by:</p>

<p><center><cite>M. Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales, 1999.</cite></center></p>

Currently, it is maintained by the <a href="https://www.openml.org/d/151">OpenML</a> project:

<p><center><cite>Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. SIGKDD Explorations 15(2), pp 49-60, 2013.</cite></center></p>
<p><center><cite>Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490 [cs.LG], 2019.</cite></center></p>

<div class="alert alert-warning">
    <b>Exercise 2.1. [2 Points]</b>
    <ul>
        <li>Load the <i>electricity</i> data set using the appropriate function as supplied by us.</li>
        <li>Visualize the electricity data set in tabular form.</li>
    </ul>
</div>

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 2.2. [2 Points]</b>
    <ul>
        <li>How many samples does the data set contain?</li>
        <li>How many features does the data set consist of (not counting the class label column <i>demand</i>)?</li>
    </ul>
</div>

your answer goes here

<div class="alert alert-warning">
    <b>Exercise 2.3. [5 Points]</b>
    <ul>
        <li>Visualize the electricity data set using <i>lineplots</i> with <i>period</i> as the x-axis, once <i>vicdemand</i> and once <i>vicprice</i> as the y-axis, colored by the feature <i>day</i>.</li>
    </ul>
</div>

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 2.4. [2 Points]</b>
    <ul>
        <li>Do you observe any correlations between both plots?</li>
    </ul>
</div>

your answer goes here

<div class="alert alert-warning">
    <b>Exercise 2.5. [5 Points]</b>
    <ul>
        <li>Compute the average electricity demand <i>per day</i> for <i>NSW</i> and <i>VIC</i>, as well as the average electricity transfer. Hint: have a look at the <a href="https://pandas.pydata.org/docs/user_guide/groupby.html#splitting-an-object-into-groups">pandas documentation</a> to group by <i>day</i>.</li>
        <li>Visualize the average electricity demand for <i>NSW</i> and <i>VIC</i>, as well as the average electricity transfer (the <i>day</i> should be on the x-axis).</li>
    </ul>
</div>
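
The grouping step can be sketched on a tiny hand-made table. The column names <code>nswdemand</code> and <code>transfer</code> are assumptions based on the OpenML version of the data set, and the numbers are invented purely for illustration.

```python
import pandas as pd

# Invented toy data: two days with two measurements each.
toy = pd.DataFrame(
    {
        "day": [1, 1, 2, 2],
        "nswdemand": [0.2, 0.4, 0.5, 0.7],
        "vicdemand": [0.1, 0.3, 0.6, 0.6],
        "transfer": [0.4, 0.6, 0.5, 0.5],
    }
)

# One row per day, holding the mean of each selected column.
daily_mean = toy.groupby("day")[["nswdemand", "vicdemand", "transfer"]].mean()
print(daily_mean)
```

Because <code>day</code> becomes the index of <code>daily_mean</code>, plotting the frame directly puts the day on the x-axis.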

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 2.6. [2 Points]</b>
    <ul>
        <li>Does the above plot make sense?</li>
    </ul>
</div>

your answer goes here

<h2>3. Image data</h2>

<p>In this exercise you'll be working with a data set composed of various <i>images</i> of fashion items (e.g. shoes or shirts). The data set distinguishes <i>ten</i> different classes, one for each type of fashion item. For curious minds, more information regarding this data set can be found at:</p>

<center><cite>Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747</cite></center>

<div class="alert alert-warning">
    <b>Exercise 3.1. [2 Points]</b>
    <ul>
        <li>Load the <i>Fashion-MNIST</i> data set using the appropriate function as supplied by us.</li>
        <li>Visualize the Fashion-MNIST data set in tabular form.</li>
    </ul>
</div>

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 3.2. [10 Points]</b>
    <ul>
        <li>Define the following two $3 \times 3$ filters (shown in the formulae below) and apply them on 12 random images $A$ from the above data set (with $*$ as the convolution and $\sigma{}$ as the sigmoid operation) to produce the following 4 outputs $G_x, G'_x, G_y, G'_y$:</li>
    </ul>
    <p>
        \begin{equation}G_x = \left(
            \begin{array}{rrr}
                -2 & 0 & 2 \\
                -2 & 0 & 2 \\
                -2 & 0 & 2
            \end{array}\right) * A
            \qquad
            G'_x = \sigma (G_x)
        \end{equation}
    </p>
    <p>
        \begin{equation}G_y = \left(
            \begin{array}{rrr}
                -2 & -2 & -2 \\
                 0 &  0 &  0 \\
                 2 &  2 &  2
            \end{array}\right) * A
            \qquad
            G'_y = \sigma (G_y)
        \end{equation}
    </p>
    <ul>
        <li>Hint: Make sure to exclude the class label column <i>item_type</i> before processing your data.</li>
    </ul>
</div>
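
A minimal sketch of the two filters applied to a single image, using <code>scipy.signal.convolve2d</code>. The random array stands in for one $28 \times 28$ image $A$, and <code>mode="same"</code> (zero padding, output as large as the input) is an assumption; the exercise does not fix the border handling.

```python
import numpy as np
from scipy import signal

# The two 3x3 filters from the formulae above.
kernel_x = np.array([[-2, 0, 2], [-2, 0, 2], [-2, 0, 2]], dtype=float)
kernel_y = np.array([[-2, -2, -2], [0, 0, 0], [2, 2, 2]], dtype=float)


def sigmoid(values: np.ndarray) -> np.ndarray:
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-values))


rng = np.random.default_rng(17)
image = rng.random((28, 28))  # stand-in for one image A

g_x = signal.convolve2d(image, kernel_x, mode="same")
g_x_sigmoid = sigmoid(g_x)
g_y = signal.convolve2d(image, kernel_y, mode="same")
g_y_sigmoid = sigmoid(g_y)

print(g_x.shape, g_y.shape)
```

Note that <code>convolve2d</code> flips the kernel, as a true convolution does, whereas <code>scipy.signal.correlate2d</code> does not. For these antisymmetric filters the two differ only in sign.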

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 3.3. [5 Points]</b>
    <ul>
        <li>Using the data of the 12 samples from above, create a plot with 5 rows (or 5 columns, choose what you like), where</li>
        <ul>
            <li>(1) shows the original samples</li>
            <li>(2) shows the samples after the convolution using the first filter, i.e., $G_x$</li>
            <li>(3) shows the samples after the convolution using the first filter and after the application of sigmoid, i.e., $G'_x$</li>
            <li>(4) shows the samples after the convolution using the second filter, i.e., $G_y$</li>
            <li>(5) shows the samples after the convolution using the second filter and after the application of sigmoid, i.e., $G'_y$</li>
        </ul>
    </ul>
</div>

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 3.4. [8 Points]</b>
    <ul>
        <li>Implement a class <code>FNN</code> that derives from <code>torch.nn.Module</code> with the following architecture:</li>
    </ul>
    <table style="text-align:center;vertical-align:middle">
        <th>Position</th>
        <th>Element</th>
        <th>Comment</th>
        <tr>
            <td>0</td>
            <td>input</td>
            <td>input size = $28\times{}28$ (flattened)</td>
        </tr>
        <tr>
            <td>1</td>
            <td>fully connected</td>
            <td>$1000$ output features</td>
        </tr>
        <tr>
            <td>2</td>
            <td>ReLU</td>
            <td>-</td>
        </tr>
        <tr>
            <td>3</td>
            <td>fully connected</td>
            <td>$1000$ output features</td>
        </tr>
        <tr>
            <td>4</td>
            <td>ReLU</td>
            <td>-</td>
        </tr>
        <tr>
            <td>5</td>
            <td>fully connected</td>
            <td>$10$ output features</td>
        </tr>
    </table>
</div>
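
The table translates almost line by line into PyTorch layers. The sketch below is one possible reading of it; the use of <code>torch.nn.Sequential</code> and of <code>torch.nn.Flatten</code> for the input is a design choice of this sketch.

```python
import torch


class FNN(torch.nn.Module):
    """Fully connected network: 784 -> 1000 -> 1000 -> 10."""

    def __init__(self, in_features: int = 28 * 28, hidden: int = 1000, classes: int = 10):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Flatten(),                    # position 0: flatten the input
            torch.nn.Linear(in_features, hidden),  # position 1
            torch.nn.ReLU(),                       # position 2
            torch.nn.Linear(hidden, hidden),       # position 3
            torch.nn.ReLU(),                       # position 4
            torch.nn.Linear(hidden, classes),      # position 5
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


# Shape check with a random batch of four images.
logits = FNN()(torch.rand(4, 28, 28))
print(logits.shape)
```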

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 3.5. [8 Points]</b>
    <ul>
        <li>Implement a class <code>CNN</code> that derives from <code>torch.nn.Module</code> with the following architecture:</li>
    </ul>
    <table style="text-align:center;vertical-align:middle">
        <th>Position</th>
        <th>Element</th>
        <th>Comment</th>
        <tr>
            <td>0</td>
            <td>input</td>
            <td>input size = $28\times{}28$</td>
        </tr>
        <tr>
            <td>1</td>
            <td>2D convolution</td>
            <td>$64$ output channels and a kernel size of $3\times{}3$</td>
        </tr>
        <tr>
            <td>2</td>
            <td>ReLU</td>
            <td>-</td>
        </tr>
        <tr>
            <td>3</td>
            <td>max pooling</td>
            <td>kernel size of $2\times{}2$</td>
        </tr>
        <tr>
            <td>4</td>
            <td>fully connected</td>
            <td>$1000$ output features</td>
        </tr>
        <tr>
            <td>5</td>
            <td>ReLU</td>
            <td>-</td>
        </tr>
        <tr>
            <td>6</td>
            <td>fully connected</td>
            <td>$10$ output features</td>
        </tr>
    </table>
</div>
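
For the CNN, the input size of the first fully connected layer follows from the earlier layers. Assuming a single input channel, no padding and a stride of 1 (none of which the table specifies), the $3\times{}3$ convolution shrinks $28\times{}28$ to $26\times{}26$, the $2\times{}2$ pooling halves that to $13\times{}13$, and flattening yields $64 \cdot 13 \cdot 13 = 10816$ features. A sketch under these assumptions:

```python
import torch


class CNN(torch.nn.Module):
    """Convolution -> ReLU -> max pooling -> two fully connected layers."""

    def __init__(self):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(1, 64, kernel_size=3),  # 1x28x28 -> 64x26x26
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2),      # 64x26x26 -> 64x13x13
        )
        self.classifier = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(64 * 13 * 13, 1000),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 10),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


# Shape check: a batch of four single-channel images.
cnn_logits = CNN()(torch.rand(4, 1, 28, 28))
print(cnn_logits.shape)
```

Unlike the FNN, this model expects an explicit channel dimension, so flattened rows would have to be reshaped to <code>(N, 1, 28, 28)</code> first.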

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 3.6. [3 Points]</b>
    <ul>
        <li>Split the Fashion-MNIST data set into a <i>training</i> set ($80\%$) and a <i>test</i> set ($20\%$).</li>
        <li>Print the size of the full data set, the training set and the test set.</li>
    </ul>
</div>
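
An 80/20 split can be sketched with scikit-learn's <code>train_test_split</code>. The random arrays below merely stand in for the image rows and labels, and whether <code>u1_utils.py</code> offers its own splitting helper is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(17)
images = rng.random((100, 28 * 28))      # stand-in for flattened images
labels = rng.integers(0, 10, size=100)   # stand-in for the ten class ids

x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=17
)

print(len(images), len(x_train), len(x_test))
```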

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 3.7. [5 Points]</b>
    <ul>
        <li>Create a corresponding <tt>TensorDataset</tt> for the training as well as the test set.</li>
        <li>Wrap the previously defined <tt>TensorDataset</tt> instances in separate <tt>DataLoader</tt> instances with a batch size of $64$ (shuffle the training data set).</li>
    </ul>
</div>
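
The wrapping step might look as follows. The tensors are random placeholders for the real training and test data, and their sizes are chosen only to keep the example small.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(17)

# Placeholder tensors: features as float32, labels as int64 class ids.
x_train, y_train = torch.rand(160, 28 * 28), torch.randint(0, 10, (160,))
x_test, y_test = torch.rand(40, 28 * 28), torch.randint(0, 10, (40,))

train_set = TensorDataset(x_train, y_train)
test_set = TensorDataset(x_test, y_test)

# Only the training data is shuffled; the test order does not matter.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)

batch_x, batch_y = next(iter(train_loader))
print(batch_x.shape, batch_y.shape)
```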

In [None]:
# your code goes here

<div class="alert alert-warning">
    <b>Exercise 3.8. [10 Points]</b>
    <ul>
        <li>For both an instance of your <code>FNN</code> and <code>CNN</code> model from above, train for $5$ epochs, print the training accuracy as well as the loss per epoch, and afterwards, print the final test set loss and accuracy.</li>
    </ul>
</div>
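
The structure of such a training run is sketched below on random placeholder data with a deliberately tiny model. The optimizer (Adam), the learning rate and the helper name <code>run_epoch</code> are assumptions of this sketch; the exercise does not prescribe them, and <code>u1_utils.py</code> may already provide training utilities.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(17)

# Placeholder data and a tiny stand-in model.
inputs, targets = torch.rand(256, 28 * 28), torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True)
model = torch.nn.Sequential(
    torch.nn.Linear(28 * 28, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed settings
loss_fn = torch.nn.CrossEntropyLoss()


def run_epoch(model, loader, optimizer=None):
    """Hypothetical helper: one pass over the data; trains only if an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for x, y in loader:
            logits = model(x)
            loss = loss_fn(logits, y)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * len(y)
            correct += (logits.argmax(dim=1) == y).sum().item()
            seen += len(y)
    return total_loss / seen, correct / seen


history = []
for epoch in range(5):
    epoch_loss, epoch_accuracy = run_epoch(model, loader, optimizer)
    history.append((epoch_loss, epoch_accuracy))
    print(f"epoch {epoch + 1}: loss={epoch_loss:.4f} accuracy={epoch_accuracy:.4f}")

# Evaluation pass without gradient updates (here on the same placeholder data).
eval_loss, eval_accuracy = run_epoch(model, loader)
print(f"evaluation: loss={eval_loss:.4f} accuracy={eval_accuracy:.4f}")
```

In the actual exercise, the evaluation pass would use the test <tt>DataLoader</tt> instead of the training one.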

In [None]:
# your code goes here