<font size="+3"><mark>Explore the UCR Archive metadata</mark></font>

# Introduction

## README

_Associated GitHub repository: https://github.com/sylvaincom/astride._

This notebook explores the metadata of the UCR Time Series Classification Archive (number of data sets, number of samples, etc.). All signals are univariate. The notebook:
- selects the univariate equal-size data sets whose signals have at least 100 samples,
- computes the space complexity of SAX, ABBA, and ASTRIDE on one data set.

This notebook inputs:
- `data/DataSummary.csv` (downloaded from the [UCR Archive](https://www.cs.ucr.edu/~eamonn/time_series_data_2018))

This notebook outputs:
- the `data/DataSummary_prep_equalsize.csv` file which contains the 117 univariate and equal-size data sets from the UCR archive.
- the `data/DataSummary_prep_equalsize_min100samples.csv` file which contains the 94 univariate and equal-size data sets with at least 100 samples from the UCR archive.

## Configuration parameters

In [1]:
IS_EXPORT_DF = True

## Imports

In [2]:
import numpy as np
import pandas as pd
from pathlib import Path
import pprint

from src.utils import load_ucr_dataset
from src.metadata import l_datasets_classif_bench

In [3]:
pp = pprint.PrettyPrinter()
cwd = Path.cwd()

# Load and clean the (meta)data

## Load

In [4]:
df_ucr = pd.read_csv(cwd / "data/DataSummary.csv")

## Summary

In [5]:
df_ucr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 128 non-null    int64  
 1   Type               128 non-null    object 
 2   Name               128 non-null    object 
 3   Train              128 non-null    int64  
 4   Test               128 non-null    int64  
 5   Class              128 non-null    int64  
 6   Length             128 non-null    object 
 7   ED (w=0)           128 non-null    float64
 8   DTW (learned_w)    128 non-null    object 
 9   DTW (w=100)        128 non-null    float64
 10  Default rate       128 non-null    float64
 11  Data donor/editor  128 non-null    object 
dtypes: float64(3), int64(4), object(5)
memory usage: 12.1+ KB


In [6]:
df_ucr.head()

Unnamed: 0,ID,Type,Name,Train,Test,Class,Length,ED (w=0),DTW (learned_w),DTW (w=100),Default rate,Data donor/editor
0,1,Image,Adiac,390,391,37,176,0.3887,0.3913 (3),0.3964,0.9591,A. Jalba
1,2,Image,ArrowHead,36,175,3,251,0.2,0.2000 (0),0.2971,0.6057,L. Ye & E. Keogh
2,3,Spectro,Beef,30,30,5,470,0.3333,0.3333 (0),0.3667,0.8,K. Kemsley & A. Bagnall
3,4,Image,BeetleFly,20,20,2,512,0.25,0.3000 (7),0.3,0.5,J. Hills & A. Bagnall
4,5,Image,BirdChicken,20,20,2,512,0.45,0.3000 (6),0.25,0.5,J. Hills & A. Bagnall


In [7]:
n_datasets_total = df_ucr["Name"].nunique()
print(f"Total number of unique data sets:\n\t{n_datasets_total}")

Total number of unique data sets:
	128


## Feature names

In [8]:
pp.pprint(list(df_ucr.columns))

['ID',
 'Type',
 'Name',
 'Train ',
 'Test ',
 'Class',
 'Length',
 'ED (w=0)',
 'DTW (learned_w) ',
 'DTW (w=100)',
 'Default rate',
 'Data donor/editor']


Some feature names have trailing spaces (for example `'Train '`, `'Test '`, and `'DTW (learned_w) '`). Let us strip them:

In [9]:
df_ucr.columns = df_ucr.columns.str.strip()
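
As a small illustration of what this cleaning step does, the sketch below applies `str.strip` to a toy `DataFrame`. The frame `df_toy` and its contents are made up to mimic the trailing spaces seen above; it is not the UCR summary file.

```python
import pandas as pd

# Toy frame with hypothetical column labels that mimic the trailing spaces above
df_toy = pd.DataFrame({"Name": ["Adiac"], "Train ": [390], "Test ": [391]})

# Labels that would change once leading/trailing whitespace is stripped
cols_with_spaces = [col for col in df_toy.columns if col != col.strip()]

# Same cleaning step as in the notebook
df_toy.columns = df_toy.columns.str.strip()
cols_clean = list(df_toy.columns)
```

Listing the offending labels before stripping is a cheap way to check that nothing other than whitespace is being changed.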

## `Length` feature

In [10]:
df_ucr["Length"].unique()

array(['176', '251', '470', '512', '577', '128', '166', '1639', '286',
       '720', '300', '345', '80', '96', '140', '136', '131', '350', '270',
       '463', '500', '150', '431', '2709', '1092', '1882', '256', '24',
       '637', '319', '1024', '448', '99', '84', '750', '570', '427',
       '144', '70', '65', '235', '398', '60', '277', '343', '275', '82',
       '945', '315', '152', '234', '900', '426', '1460', 'Vary', '46',
       '288', '1250', '1751', '301', '201', '2000', '601', '2844', '1500',
       '15'], dtype=object)

In [11]:
df_ucr.query("Length == 'Vary'")

Unnamed: 0,ID,Type,Name,Train,Test,Class,Length,ED (w=0),DTW (learned_w),DTW (w=100),Default rate,Data donor/editor
86,87,Sensor,AllGestureWiimoteX,300,700,10,Vary,0.4843,0.2829 (14),0.2843,0.9,J. Guna
87,88,Sensor,AllGestureWiimoteY,300,700,10,Vary,0.4314,0.2700 (9),0.2714,0.9,J. Guna
88,89,Sensor,AllGestureWiimoteZ,300,700,10,Vary,0.5457,0.3486 (11),0.3571,0.9,J. Guna
101,102,Trajectory,GestureMidAirD1,208,130,26,Vary,0.4231,0.3615 (5),0.4308,0.9615,H. A. Dau
102,103,Trajectory,GestureMidAirD2,208,130,26,Vary,0.5077,0.4000 (6),0.3923,0.9615,H. A. Dau
103,104,Trajectory,GestureMidAirD3,208,130,26,Vary,0.6538,0.6231 (1),0.6769,0.9615,H. A. Dau
104,105,Sensor,GesturePebbleZ1,132,172,6,Vary,0.2674,0.1744 (2),0.2093,0.814,I. Maglogiannis
105,106,Sensor,GesturePebbleZ2,146,158,6,Vary,0.3291,0.2215 (6),0.3291,0.8101,I. Maglogiannis
115,116,Sensor,PickupGestureWiimoteZ,50,50,10,Vary,0.44,0.3400 (17),0.34,0.9,J. Guna
119,120,Device,PLAID,537,537,11,Vary,0.4767,0.1657 (12),0.1639,0.838,P. Schafer


Some data sets have signals of varying lengths (`Length == 'Vary'`). Let us remove them and cast `Length` to integers:

In [12]:
df_ucr = df_ucr.query("Length != 'Vary'").copy()  # copy to avoid a SettingWithCopyWarning
df_ucr["Length"] = df_ucr["Length"].astype(int)
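
An alternative way to perform the same filtering, sketched here on a toy frame, is to let `pd.to_numeric` turn any non-numeric length into `NaN` and keep only the rows that parsed. The frame `df_toy` is a made-up three-row excerpt (its values are taken from the tables above), and this is only one possible variant, not the approach used in the notebook.

```python
import pandas as pd

# Toy excerpt: two equal-size data sets and one with varying length
df_toy = pd.DataFrame(
    {"Name": ["Adiac", "PLAID", "Beef"], "Length": ["176", "Vary", "470"]}
)

# Non-numeric entries such as 'Vary' become NaN instead of raising an error
length_num = pd.to_numeric(df_toy["Length"], errors="coerce")

# Keep the rows whose length could be parsed, then store the lengths as integers
is_equal_size = length_num.notna()
df_equal = df_toy.loc[is_equal_size].copy()
df_equal["Length"] = length_num[is_equal_size].astype(int)
```

This variant would also drop any other unexpected non-numeric value, which may or may not be desirable depending on how strict the check should be.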

In [13]:
n_datasets_equalsize = df_ucr["Name"].nunique()
print(f"Total number of equal-size univariate datasets:\n\t{n_datasets_equalsize}")

Total number of equal-size univariate datasets:
	117


In [14]:
if IS_EXPORT_DF:
    df_ucr.to_csv(cwd / "data/DataSummary_prep_equalsize.csv", index=False)

In [15]:
df_ucr = df_ucr.query("Length >= 100")
n_datasets_equalsize_long = df_ucr["Name"].nunique()
print(f"Total number of equal-size univariate datasets that have at least 100 samples:\n\t{n_datasets_equalsize_long}")

Total number of equal-size univariate datasets that have at least 100 samples:
	94


In [16]:
if IS_EXPORT_DF:
    df_ucr.to_csv(cwd / "data/DataSummary_prep_equalsize_min100samples.csv", index=False)

# Focus on the 86 data sets from the classification benchmark

Note that some data sets encountered computational issues during the classification benchmark. Hence, out of the 94 equal-size univariate data sets with at least 100 samples, 86 data sets are used in the benchmark.

## Get data set names (hard-coded)

In [17]:
print(len(l_datasets_classif_bench))

86


In [18]:
# Check that the data sets of the classification benchmark are of equal size and have at least 100 samples
l_datasets_scope = df_ucr.query("Length >= 100")["Name"].unique().tolist()
l = []
for dataset in l_datasets_classif_bench:
    l.append(dataset in l_datasets_scope)
print(sum(l))

86
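
The count above only tells how many benchmark data sets are in scope. A set difference, sketched below on toy lists, would also name the ones that are not. The lists `bench_toy` and `scope_toy` are hypothetical, and `"Hypothetical"` is a deliberately invalid name used to show a failing case.

```python
# Toy lists standing in for l_datasets_classif_bench and l_datasets_scope
bench_toy = ["Meat", "Beef", "Hypothetical"]
scope_toy = ["Beef", "CBF", "Meat", "Strawberry"]

# Benchmark data sets that are missing from the scope, in alphabetical order
missing = sorted(set(bench_toy) - set(scope_toy))

# The benchmark is fully in scope only if nothing is missing
all_in_scope = not missing
```

With the real lists, an empty `missing` would be equivalent to the count of 86 printed above.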


In [19]:
# Check if the data sets that are explicitly mentioned in the paper are part of the classification benchmark
l_datasets_paper = ["Meat", "Strawberry", "CBF", "Beef"]
for dataset in l_datasets_paper:
    print(dataset in l_datasets_classif_bench)

True
True
True
True


In [20]:
df_ucr_prep = df_ucr.query("Name in @l_datasets_classif_bench")

## Describe

In [21]:
df_ucr_prep_desc = df_ucr_prep.copy()
df_ucr_prep_desc["Train and Test"] = df_ucr_prep_desc["Train"].values + df_ucr_prep_desc["Test"].values
df_ucr_prep_desc = df_ucr_prep_desc[["Train and Test", "Length", "Class"]].describe().round(0).astype(int)

In [22]:
df_ucr_prep_desc.loc[["mean", "min", "50%", "max"]]

Unnamed: 0,Train and Test,Length,Class
mean,1357,644,10
min,40,128,2
50%,687,456,4
max,9236,2844,60


*Note*: This table corresponds to Table 4 of the paper.

In [23]:
print(df_ucr_prep_desc.loc[["mean", "min", "50%", "max"]].style.to_latex())

\begin{tabular}{lrrr}
 & Train and Test & Length & Class \\
mean & 1357 & 644 & 10 \\
min & 40 & 128 & 2 \\
50% & 687 & 456 & 4 \\
max & 9236 & 2844 & 60 \\
\end{tabular}



Total number of samples (time points) over the 86 data sets, i.e. the sum of (`Train` + `Test`) × `Length`:

In [24]:
pp.pprint(((df_ucr_prep["Train"]+df_ucr_prep["Test"])*df_ucr_prep["Length"]).sum())

66827003
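
To make the arithmetic behind this total explicit, here is a minimal sketch on a three-row excerpt. The frame `df_toy` is built by hand from the first rows shown earlier, so its total is of course much smaller than the one above.

```python
import pandas as pd

# Three rows copied from the summary table (Adiac, ArrowHead, Beef)
df_toy = pd.DataFrame(
    {
        "Name": ["Adiac", "ArrowHead", "Beef"],
        "Train": [390, 36, 30],
        "Test": [391, 175, 30],
        "Length": [176, 251, 470],
    }
)

# Number of signals per data set, then number of time points per data set
n_signals = df_toy["Train"] + df_toy["Test"]
n_points = n_signals * df_toy["Length"]

# Total number of time points over the excerpt
total_points = int(n_points.sum())
```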


# Compute the total space complexity of some symbolization methods on a data set

In [25]:
df_ucr_prep.head()

Unnamed: 0,ID,Type,Name,Train,Test,Class,Length,ED (w=0),DTW (learned_w),DTW (w=100),Default rate,Data donor/editor
0,1,Image,Adiac,390,391,37,176,0.3887,0.3913 (3),0.3964,0.9591,A. Jalba
1,2,Image,ArrowHead,36,175,3,251,0.2,0.2000 (0),0.2971,0.6057,L. Ye & E. Keogh
2,3,Spectro,Beef,30,30,5,470,0.3333,0.3333 (0),0.3667,0.8,K. Kemsley & A. Bagnall
3,4,Image,BeetleFly,20,20,2,512,0.25,0.3000 (7),0.3,0.5,J. Hills & A. Bagnall
4,5,Image,BirdChicken,20,20,2,512,0.45,0.3000 (6),0.25,0.5,J. Hills & A. Bagnall


In [26]:
dataset = "Meat"
w = 10  # word length
A = 9  # alphabet size
r = 64  # number of bits used to encode a real value

In [27]:
N = df_ucr_prep.query(f"Name == '{dataset}'")[["Train", "Test"]].sum(axis=1).values[0]
print(f"Number of samples:\n\t{N}")

mem_sax = N*w*np.log2(A) + r*A
print(f"Total memory usage of SAX (bits):\n\t{mem_sax:.0f}")

mem_abba = N*w*np.log2(A) + 2*r*N*A
print(f"Total memory usage of ABBA (bits):\n\t{mem_abba:.0f}")

print(f"Comparison between ABBA and SAX (bits):\n\t{mem_abba/mem_sax = :.0f}")

mem_astride = N*w*np.log2(A) + r*(A+w)
print(f"Total memory usage of ASTRIDE (bits):\n\t{mem_astride:.0f}")

Number of samples:
	120
Total memory usage of SAX (bits):
	4380
Total memory usage of ABBA (bits):
	142044
Comparison between ABBA and SAX (bits):
	mem_abba/mem_sax = 32
Total memory usage of ASTRIDE (bits):
	5020
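
The same formulas can be wrapped in a small helper so that other values of `N`, `w`, `A`, and `r` can be tried easily. The function name `space_complexity_bits` and its argument names are hypothetical; the three expressions are copied from the cell above, and the example call reuses the `Meat` setting (`N = 120`, `w = 10`, `A = 9`, `r = 64`).

```python
import numpy as np


def space_complexity_bits(n_signals, word_length, alphabet_size, n_bits_real=64):
    """Memory usage in bits of SAX, ABBA, and ASTRIDE (formulas of the cell above)."""
    # Term shared by the three methods: the symbolic sequences
    seq = n_signals * word_length * np.log2(alphabet_size)
    return {
        "SAX": seq + n_bits_real * alphabet_size,
        "ABBA": seq + 2 * n_bits_real * n_signals * alphabet_size,
        "ASTRIDE": seq + n_bits_real * (alphabet_size + word_length),
    }


# Same setting as the Meat example above
mem = space_complexity_bits(n_signals=120, word_length=10, alphabet_size=9)
mem_rounded = {name: round(float(bits)) for name, bits in mem.items()}
```

Only the ABBA term grows with the product of `n_signals` and `alphabet_size`, which is why its total is so much larger in this setting.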


In [28]:
mem_abba_seq = N*w*np.log2(A)
print(f"Space complexity of the symbolic sequences of ABBA (bits):\n\t{mem_abba_seq:.0f}")

mem_abba_dict = 2*r*N*A
print(f"Space complexity of the dictionary of symbols of ABBA (bits):\n\t{mem_abba_dict:.0f}")

print(f"Comparison:\n\t{mem_abba_dict/mem_abba_seq = :.0f}")

Space complexity of the symbolic sequences of ABBA (bits):
	3804
Space complexity of the dictionary of symbols of ABBA (bits):
	138240
Comparison:
	mem_abba_dict/mem_abba_seq = 36
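
As a side remark derived from the two expressions in the cell above, the ratio between the dictionary term and the sequence term of ABBA does not depend on the number of signals $N$, since $N$ cancels out:

```latex
$$
\frac{2 r N A}{N w \log_2 A}
= \frac{2 r A}{w \log_2 A}
= \frac{2 \times 64 \times 9}{10 \times \log_2 9}
\approx 36
$$
```

Under these formulas, the ratio is therefore driven only by `r`, `A`, and `w`.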
