# AI & Big Data - Time series - Lab. 2 - Classification

This lab consists of six parts, each building upon the previous one, gradually increasing in complexity.

Deliverables:
- A .ipynb file containing your solutions.
- A PDF export of the notebook, ensuring that all cells have been executed.

For every question, please provide **short** but **informative** answers.

⚠️ Important: While you are free to use external resources, please don't rely blindly on ChatGPT - I'll find out!

Let's go!

## Part 1: Setup environment and download datasets

In [None]:
!pip install aeon
!pip install seaborn
!pip install dtw-python
!pip install tqdm
!pip install fastdtw
!pip install pycatch22

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from aeon.datasets import load_arrow_head, load_basic_motions
from tqdm import tqdm
import pandas as pd
import time
import fastdtw

x_train_full, y_train = load_arrow_head(split='train')
x_test_full, y_test = load_arrow_head(split='test')
print(f"Arrow Head dataset of type {type(x_train_full)} and shapes, x_train: {x_train_full.shape}, y_train: {y_train.shape}, x_test: {x_test_full.shape}, y_test: {y_test.shape}")

In [None]:
x_train, x_test = x_train_full[:, 0], x_test_full[:, 0]

In [None]:
print(y_train)

**Questions:**
1. What are the dimensions of the downloaded datasets? Write their meaning and value for every subset.
2. What is the format of the labels? How many classes do we have?
3. Look into the documentation of the aeon library and write below the meaning of each class.
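
As a hint for inspecting the data, the sketch below uses a small synthetic array in the `(n_cases, n_channels, n_timepoints)` layout that aeon uses for collections. The names `x_toy` and `y_toy`, the sizes, and the label values are made up for illustration and are not those of ArrowHead.

```python
import numpy as np

# Synthetic stand-in for a dataset: 6 univariate series of length 10.
# Sizes and labels are illustrative only, not the ArrowHead values.
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(6, 1, 10))  # (n_cases, n_channels, n_timepoints)
y_toy = np.array(["a", "b", "a", "c", "b", "a"])

n_cases, n_channels, n_timepoints = x_toy.shape
print(f"cases={n_cases}, channels={n_channels}, timepoints={n_timepoints}")

# Selecting channel 0 drops the channel axis, as x_train_full[:, 0] does.
x_uni = x_toy[:, 0]
print(x_uni.shape)

# Label format and number of samples per class.
classes, counts = np.unique(y_toy, return_counts=True)
print(y_toy.dtype, dict(zip(classes.tolist(), counts.tolist())))
```

The same three steps (unpack the shape, check the dtype of the labels, count the unique labels) apply unchanged to the real train and test subsets.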

**Answers:**

## Part 2: Visualizing and Understanding the Data

- Create a subplot with 3 rows and 2 columns.
- Each row should correspond to a different class in the dataset.
- Each column should correspond to a different subset (training set on the left, test set on the right).
- Plot all samples within each class to observe variations and patterns.

This visualization will help in understanding the structure of the dataset and how different classes are distributed.
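
One possible way to fill such a grid is sketched below on synthetic data. The helper `make_subset`, the sine-based series, and the class labels are assumptions for illustration; only the row-per-class, column-per-subset layout comes from the task description.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: series of length 50 with string labels "0", "1", "2".
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50)

def make_subset(n_per_class):
    """Hypothetical helper: noisy sines whose frequency depends on the class."""
    xs, ys = [], []
    for c in range(3):
        for _ in range(n_per_class):
            xs.append(np.sin((c + 1) * t) + 0.1 * rng.normal(size=t.size))
            ys.append(str(c))
    return np.array(xs), np.array(ys)

subsets = {"train": make_subset(4), "test": make_subset(6)}
classes = np.unique(subsets["train"][1])

fig, axes = plt.subplots(len(classes), len(subsets), figsize=(10, 5),
                         sharex=True, sharey=True)
for row, label in enumerate(classes):
    for col, (name, (x, y)) in enumerate(subsets.items()):
        # Transposing puts time on axis 0, so each column is drawn as one line.
        axes[row, col].plot(x[y == label].T, color="tab:blue", alpha=0.4)
        axes[row, col].set_title(f"class {label} ({name})")
plt.tight_layout()
```

Sharing the axes across panels keeps amplitudes comparable between classes; a low `alpha` makes overlapping samples and outliers easier to spot.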

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(10, 5))

# TODO

plt.tight_layout()
plt.show()

**Questions:**

1. What do you observe? Describe the differences between the classes based on their visual patterns.
2. Do you understand the classification task? If you were the classifier, what key characteristics would you focus on to distinguish between classes?
3. Do you notice any samples that could be problematic? If so, explain why they might be challenging for classification (e.g., overlap between classes, noise, outliers).

**Answers:**

## Part 3: Distance-based classifiers

### Part 3.1: Implement the Euclidean and the DTW distances
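
For reference, one common way to write the two distances for series $a$ and $b$ of lengths $n$ and $m$ is given below. The squared pointwise cost is an assumption (absolute differences are also used), and the exact band rule tied to `window` is left open here because conventions differ.

$$
d_{\mathrm{ED}}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (n = m)
$$

$$
D(i, j) = (a_i - b_j)^2 + \min\bigl\{ D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \bigr\}
$$

with $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$ for $i, j \ge 1$, and $D(i, j) = \infty$ for cells outside the band around the diagonal. The DTW cost is read from the last cell, $D(n, m)$. Under this squared-cost convention it is a sum of squared differences along the warping path, so compare it with the squared Euclidean distance, or take a square root first.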

In [None]:
def euclidean_distance(a, b):
  # TODO
  pass

print(f"Euclidean distance value: {euclidean_distance(x_train[0], x_train[1])}")

In [None]:
def dtw(a, b, window=100):
  # TODO
  pass

distance_matrix = dtw(x_train[0], x_train[1], window=1)
print(f"DTW distance value: {distance_matrix[-1, -1]}")

In [None]:
def min_cost_path(matrix):
  i, j = matrix.shape[0] - 1, matrix.shape[1] - 1
  path = [(i, j)]

  while i > 0 or j > 0:
      neighbors = []
      if i > 0:
          neighbors.append((matrix[i-1, j], i-1, j))
      if j > 0:
          neighbors.append((matrix[i, j-1], i, j-1))
      if i > 0 and j > 0:
          neighbors.append((matrix[i-1, j-1], i-1, j-1))

      _, i, j = min(neighbors)
      path.append((i, j))

  return path[::-1]

def visualize_distance_matrix(matrix):
  path = min_cost_path(matrix)

  fig, ax = plt.subplots(figsize=(8, 6))
  # TODO

  path_x, path_y = zip(*path)
  ax.plot(np.array(path_y) + 0.5, np.array(path_x) + 0.5, marker="o", color="red", markersize=2)

  ax.invert_yaxis()
  plt.title("DTW Distance Matrix with Path")
  plt.show()

visualize_distance_matrix(distance_matrix)

In [None]:
def dtw_distance(a, b):
  # TODO: Wrapper that returns the final distance
  pass

def my_fastdtw(a, b):
    return fastdtw.fastdtw(a, b, radius=20)[0]

Run a random example between two time series and answer the questions below.
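
To get a feel for how a band constraint changes the result, here is a minimal sketch on two synthetic sine waves. The function name `banded_dtw_matrix`, the parameter `radius`, the rule that cells with `|i - j| > radius` are forbidden, and the squared pointwise cost are all assumptions; your own `dtw` may define `window` differently.

```python
import numpy as np

def banded_dtw_matrix(a, b, radius):
    """Accumulated-cost matrix with squared pointwise cost.

    Assumed band rule: cells with |i - j| > radius stay at infinity.
    """
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)  # extra row/column for the boundary
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - radius), min(m, i + radius)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[1:, 1:]  # drop the padding row and column

t = np.linspace(0, 2 * np.pi, 60)
a = np.sin(t)
b = np.sin(t - 0.5)  # same shape, shifted in time

squared_ed = np.sum((a - b) ** 2)
for radius in (0, 2, 5, 10, 59):
    d = banded_dtw_matrix(a, b, radius)[-1, -1]
    print(f"radius={radius:2d}  dtw={d:.4f}  squared_ed={squared_ed:.4f}")
```

Because a wider band only adds candidate paths, the printed cost cannot increase as `radius` grows, and once the band covers the whole matrix a larger value changes nothing.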

**Questions:**
1. Does your implementation match the Euclidean distance when `window=1`? Why or why not?
2. What do you observe as you change the window size? How does it affect the DTW alignment?
3. Up to which value does the window size affect the result? Justify your answer.

**Answers:**

### Part 3.2: Implement the kNN classifier
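
The idea behind k-nearest-neighbour classification can be sketched on toy data as follows. The names `knn_predict_one` and `l2`, the constant toy series, and the tie-breaking behaviour are assumptions for illustration, not the required solution.

```python
import numpy as np

def knn_predict_one(train_x, train_y, query, distance, k=3):
    # Distance from the query to every training sample.
    dists = np.array([distance(sample, query) for sample in train_x])
    # Indices of the k smallest distances (stable sort keeps ties in input order).
    nearest = np.argsort(dists, kind="stable")[:k]
    # Majority vote; np.unique sorts the labels, so argmax breaks
    # vote ties in favour of the smallest label.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def l2(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Toy data: two well-separated groups of constant series of length 5.
toy_train_x = np.array([np.full(5, v) for v in (0.0, 0.1, 0.2, 5.0, 5.1, 5.2)])
toy_train_y = np.array(["low", "low", "low", "high", "high", "high"])
toy_test_x = np.array([np.full(5, 0.05), np.full(5, 4.9)])
toy_test_y = np.array(["low", "high"])

toy_preds = np.array(
    [knn_predict_one(toy_train_x, toy_train_y, q, l2, k=3) for q in toy_test_x]
)
toy_acc = float(np.mean(toy_preds == toy_test_y))
print(toy_preds, toy_acc)
```

Passing the distance as a function argument is what lets the same classifier run with the Euclidean distance, DTW, or FastDTW. Note that `k` cannot usefully exceed the number of training samples.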

In [None]:
def knn_classifier_single(x_train, y_train, x_test, distance_function, k=5):
  # TODO
  # Compute the distance from the test sample to every training sample
  distances =

  # Find the k min distances
  nearest_neighbors =

  # Return their voted most label
  votes =
  labels, counts =
  prediction =

  return prediction

ed_tic = time.time()
knn_classifier_single(x_train, y_train, x_test[0], euclidean_distance, k=5)
ed_toc = time.time()

dtw_tic = time.time()
knn_classifier_single(x_train, y_train, x_test[0], dtw_distance, k=5)
dtw_toc = time.time()

fast_dtw_tic = time.time()
knn_classifier_single(x_train, y_train, x_test[0], my_fastdtw, k=5)
fast_dtw_toc = time.time()

print(f"ED distance time: {(ed_toc - ed_tic):.4f} secs")
print(f"DTW distance time: {(dtw_toc - dtw_tic):.4f} secs")
print(f"Fast DTW distance time: {(fast_dtw_toc - fast_dtw_tic):.4f} secs")

In [None]:
def knn_classifier(x_train, y_train, x_test, distance_function, k=5):
  predictions = []
  # TODO: Implement a wrapper for multiple predictions

  return predictions

preds = knn_classifier(x_train, y_train, x_test, euclidean_distance, k=22)

In [None]:
def accuracy(a, b):
  # TODO
  pass

print(accuracy(preds, y_test))

In [None]:
ed_trials = []
for k in tqdm(range(1, 20)):
  # TODO: Test for multiple k and save results in list of dictionaries
  pass

In [None]:
# preds = knn_classifier(x_train, y_train, x_test, my_fastdtw, 7)

dtw_trials = []
for window in tqdm(range(1, 20, 2)):
  # TODO: Use best k from above and test for multiple windows
  # TODO: Save results in list of dictionaries
  pass

In [None]:
ed_trials_df = pd.DataFrame(ed_trials)
dtw_trials_df = pd.DataFrame(dtw_trials)

# TODO: Visualize acc for different k and different window sizes
plt.show()

**Questions:**
1. What is the best value for k and the window size? How do they impact the classification performance?

2. Which distance measure should we use? Justify your choice based on the dataset and DTW behavior.

3. What do you think about the results?

**Answers:**

## Part 4: Feature-based classifiers

### Part 4.1: Feature extraction

1. Extract features from the time series data using the catch22 feature set (the `pycatch22` package). Ensure that each sample is transformed into a fixed-length feature vector.
2. Convert the extracted features from NumPy arrays to Torch tensors, preparing them for use in a PyTorch model. Don't forget the labels too.
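
The data preparation above can be sketched as follows. The function `toy_features` is a hypothetical stand-in for a real extractor such as catch22 (it computes three simple statistics instead), and the series and labels are synthetic.

```python
import numpy as np

def toy_features(series):
    # Hypothetical stand-in for a feature extractor such as catch22:
    # maps one series to a fixed-length vector (here 3 summary statistics).
    return np.array([series.mean(), series.std(), np.abs(np.diff(series)).mean()])

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(8, 40))  # 8 series of length 40
toy_y = np.array(["2", "0", "1", "0", "2", "1", "0", "2"])  # string labels

# One row per sample, one column per feature, stored as float32.
feats = np.stack([toy_features(s) for s in toy_x]).astype(np.float32)

# Map string labels to integer class indices 0..n_classes-1.
classes, y_idx = np.unique(toy_y, return_inverse=True)
y_idx = y_idx.astype(np.int64)
print(feats.shape, feats.dtype, classes, y_idx)
```

From here, `torch.from_numpy(feats)` and `torch.from_numpy(y_idx)` give tensors with the dtypes that `nn.CrossEntropyLoss` expects: floating-point inputs and `int64` class indices as targets.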

In [None]:
import pycatch22

X_train, Y_train =
X_test, Y_test =

### Part 4.2: Setup the training pipeline with Torch

1. Define a simple linear model using PyTorch's `nn.Linear`. It should take the extracted feature vectors as input and output one score (logit) per class; `nn.CrossEntropyLoss` applies the softmax internally, so the model itself should not.
2. Choose a loss function and an optimizer, such as Cross-Entropy Loss and Adam/SGD.
3. Implement the training loop, including forward pass, loss computation, backpropagation, and parameter updates. Train the model for a reasonable number of epochs.
4. Evaluate the model on the test set and report relevant metrics (e.g., accuracy, precision, recall).
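
The four steps above can be sketched on toy data as follows. The Gaussian blobs, the learning rate, the number of epochs, and all `toy_`-prefixed names are assumptions chosen for illustration; they are not tuned for the lab dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 3 Gaussian blobs in 2 dimensions, 30 samples each.
centers = torch.tensor([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
toy_X = torch.cat([c + 0.3 * torch.randn(30, 2) for c in centers])
toy_Y = torch.arange(3).repeat_interleave(30)  # int64 class indices

toy_model = nn.Linear(2, 3)            # one logit per class
toy_loss_fn = nn.CrossEntropyLoss()    # applies log-softmax internally
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=0.1)

toy_history = []
for epoch in range(200):
    idx = torch.randperm(toy_X.shape[0])[:16]  # random mini-batch of 16
    logits = toy_model(toy_X[idx])             # forward pass
    batch_loss = toy_loss_fn(logits, toy_Y[idx])

    toy_opt.zero_grad()                        # clear old gradients
    batch_loss.backward()                      # backpropagation
    toy_opt.step()                             # parameter update

    with torch.no_grad():                      # accuracy on the full toy set
        train_acc = (toy_model(toy_X).argmax(dim=1) == toy_Y).float().mean().item()
    toy_history.append(
        {"epoch": epoch, "loss": batch_loss.item(), "accuracy": train_acc}
    )

print(toy_history[0], toy_history[-1])
```

Storing one dictionary per epoch makes it easy to build a `pd.DataFrame` afterwards and plot loss and accuracy against the epoch number.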

In [None]:
import torch
import torch.nn as nn

model = nn.Sequential(
    # TODO
)

In [None]:
loss_fn =
optimizer =

In [None]:
num_epochs = 1000
batch_size = 16
dataset_size = Y_train.shape[0]

history = []
pbar = tqdm(range(num_epochs), desc='Training', leave=True)
for n in pbar:
  curr_indexes =
  curr_x =
  curr_y =
  # TODO: Complete the training loop

  acc =
  history.append({'epoch': n, 'loss': , 'accuracy': })
  pbar.set_description(f"Training loss {loss}, Training accuracy: {acc}")
  pbar.refresh()

In [None]:
# Print the accuracy on the test set
print(f"MLPs accuracy: {}")

In [None]:
df = pd.DataFrame(history)

# TODO: Plot the training loss and accuracy per epoch
plt.show()

**Questions:**
1. Experiment with different hyperparameters (n_layers, neurons per layer, activation functions, n_epochs, batch_size, etc.). What is the best combination of hyperparameters you found? Justify your choice.
2. Compare the results of the feature-based approach with the previous approach. Are the results better? Why do you think this is the case, considering the data?

**Answers:**

## Part 5: Raw-based classifiers

In this section, we will train a classifier directly on the raw time series data.
1. Prepare the dataset in its raw form (without feature extraction).
2. Define a small 1D Convolutional Neural Network (1D-CNN) architecture for classification.
3. Implement the training pipeline, including:
  * Loss function and optimizer selection
  * Training loop
  * Model evaluation
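
As a starting point for the architecture, the sketch below checks tensor shapes through a small 1D-CNN. The layer sizes, kernel sizes, the input length of 100, and the `toy_` names are assumptions for illustration and are not tuned for this dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed toy sizes: batch of 4 univariate series of length 100, 3 classes.
toy_batch = torch.randn(4, 1, 100)  # Conv1d expects (batch, channels, length)

toy_cnn = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=8, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),    # length 100 -> 50
    nn.Conv1d(8, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),        # (batch, 16, 50) -> (batch, 16, 1)
    nn.Flatten(),                   # -> (batch, 16)
    nn.Linear(16, 3),               # one logit per class
)

toy_logits = toy_cnn(toy_batch)
print(toy_logits.shape)

# A 2-D tensor of shape (n_cases, n_timepoints) needs a channel axis first.
flat = torch.randn(4, 100)
print(flat.unsqueeze(1).shape)
```

The adaptive pooling layer makes the classifier head independent of the series length, which avoids computing the flattened size by hand. The training loop itself can follow the same pattern as in Part 4.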

In [None]:
model = nn.Sequential(
    # TODO
)

In [None]:
X_train, Y_train =
X_test, Y_test =

In [None]:
loss_fn =
optimizer =

num_epochs =
batch_size =
dataset_size = Y_train.shape[0]

cnn_history = []
pbar = tqdm(range(num_epochs), desc='Training', leave=True)
for n in pbar:
  # TODO

  pbar.set_description(f"Training loss {loss}, Training accuracy: {acc}")
  pbar.refresh()

In [None]:
# Evaluate the model on the test set
print(f"CNNs accuracy: {}")

In [None]:
df = pd.DataFrame(cnn_history)

# TODO: Produce the same plots as before
plt.show()

**Questions:**
1. Are the results now better than before? If so, why do you think that is?
2. Adjust the hyper-parameters (e.g., number of filters, kernel size, learning rate, batch size, etc.) as you did before. What do you think are the best hyper-parameter settings based on your experiments? Justify your answer by presenting the results and explaining the reasoning behind your choices.

**Answers:**

## Part 6: Discussion

This part of the exercise focuses on discussing your results and demonstrating your understanding of the methods and concepts. Please answer the following questions in at most two short paragraphs.

**Questions:**
- Which method did you like the most, and why?
- Based on your experience, which distance measure do you think works best for this task?
- What did you find challenging to understand, and what concepts did you find straightforward?
- What are the pros and cons of each method you tried?

There are no strictly "correct" answers, but your grade will be based on how well you understand and articulate the concepts. Reflect on what you've done, and make sure to explain your reasoning clearly.

**Answer:**

**The End!**