# EDA of the dataset metadata

## Libraries

Let's load the libraries for this notebook:
- Pandas to create and manipulate DataFrames
- warnings to silence warning messages in the cell outputs

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [2]:
print(f"Pandas: {pd.__version__}")

Pandas: 1.2.4


## Dataset

Let's see how many audio samples the train and test sets have. To save time, I have created a one-column CSV file with the name of each audio sample in the zip file.

In [3]:
filenames_df = pd.read_csv("https://raw.githubusercontent.com/xiaoxi-david/malfunctioning-machines/main/development/jupyter/csv/filenames.csv")
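A CSV like the one loaded above could be produced by listing the members of the zip archive. The following is only a sketch of that idea, not the author's actual script: the helper name `list_audio_filenames` and the tiny in-memory archive are hypothetical stand-ins, and the demo paths use forward slashes while the real file uses backslashes.

```python
import io
import zipfile

import pandas as pd


def list_audio_filenames(zip_source) -> pd.DataFrame:
    """Return a one-column DataFrame with the .wav member names of a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        names = [name for name in archive.namelist() if name.lower().endswith(".wav")]
    return pd.DataFrame({"filename": names})


# Tiny in-memory archive standing in for the real dataset zip (hypothetical content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("pump/train/normal_id_00_00000000.wav", b"")
    archive.writestr("pump/test/anomaly_id_00_00000001.wav", b"")
    archive.writestr("pump/readme.txt", b"")  # non-audio member, filtered out
buffer.seek(0)

demo_df = list_audio_filenames(buffer)
print(demo_df)
```

Only the member names are read, so the audio data itself never has to be extracted.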

Audio files are divided into two folders, train and test. Each filename states whether the audio sample is normal or an anomaly, which machine it belongs to, and an audio id.

In [4]:
filenames_df.sample(5)

,filename
3542,pump\train\normal_id_06_00000273.wav
479,pump\test\normal_id_00_00000023.wav
340,pump\test\anomaly_id_04_00000086.wav
2344,pump\train\normal_id_02_00000582.wav
99,pump\test\anomaly_id_00_00000099.wav


Let's extract the split (train/test), the label (normal/anomaly), the machine id and the last four digits of the audio id for each filename. 

In [5]:
machines_df = (
    filenames_df["filename"]
    .str.extract(r"(train|test).(normal|anomaly)_id_(\d{2})_\d{4}(\d{4})", expand=True)
    .rename(columns={0: "split", 1: "label", 2: "machine_id", 3: "audio_id"})
)
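To see what each capture group picks up, here is a small sketch that applies the same pattern to two of the filenames shown earlier. The names `pattern`, `demo` and `parts` are only for illustration. Note that the unescaped `.` matches any single character, which is how it consumes the path separator between the split and the label.

```python
import pandas as pd

# Same pattern as in the cell above: split, label, machine id, last four digits of the audio id.
pattern = r"(train|test).(normal|anomaly)_id_(\d{2})_\d{4}(\d{4})"

demo = pd.Series([
    r"pump\train\normal_id_06_00000273.wav",
    r"pump\test\anomaly_id_04_00000086.wav",
])

# One output column per capture group; every captured value is a string.
parts = demo.str.extract(pattern, expand=True)
parts.columns = ["split", "label", "machine_id", "audio_id"]
print(parts)
```

Because every captured value is a string, the resulting columns have the *object* dtype until they are converted.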

In [6]:
machines_df.sample(5)

,split,label,machine_id,audio_id
371,test,anomaly,6,17
719,test,normal,4,63
3631,train,normal,6,362
900,train,normal,0,44
2148,train,normal,2,386


## Dataframe manipulation

Let's check the data types of the columns to see if we can save some memory.

In [7]:
machines_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4205 entries, 0 to 4204
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   split       4205 non-null   object
 1   label       4205 non-null   object
 2   machine_id  4205 non-null   object
 3   audio_id    4205 non-null   object
dtypes: object(4)
memory usage: 131.5+ KB


In [8]:
machines_df.memory_usage(deep=True)

Index            128
split         259854
label         265371
machine_id    248095
audio_id      256505
dtype: int64

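Object columns need more memory than category columns: an object column stores a separate Python string for every row, while a category column stores each distinct value once plus a small integer code per row. So, let's convert the columns from *object* to *category* to save memory.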

In [9]:
dct_types = {
    "split": "category",
    "label": "category",
    "machine_id": "category",
    "audio_id": "category",
}
machines_df = machines_df.astype(dct_types)

In [10]:
machines_df.dtypes

split         category
label         category
machine_id    category
audio_id      category
dtype: object

In [11]:
machines_df.memory_usage(deep=True)

Index           128
split          4436
label          4440
machine_id     4613
audio_id      98570
dtype: int64

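The contents of the DataFrame are the same, but the columns now take about 112 KB instead of about 1.03 MB, roughly a 9x reduction. The saving is smallest for audio_id because it has many distinct values, all of which must be kept in its category lookup table.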
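The link between the number of distinct values and the memory saved can be checked on synthetic data. This is a rough sketch, not part of the original analysis: `demo_df` and `report` are made-up names, and the columns are forced to the *object* dtype so the comparison mirrors the one above.

```python
import pandas as pd

# Synthetic stand-in: one low-cardinality and one high-cardinality string column.
n_rows = 4000
demo_df = pd.DataFrame(
    {
        "label": ["normal" if i % 4 else "anomaly" for i in range(n_rows)],
        "audio_id": [f"{i % 1000:04d}" for i in range(n_rows)],
    },
    dtype=object,  # force object dtype so the starting point matches the notebook
)

before = demo_df.memory_usage(deep=True, index=False)
after = demo_df.astype("category").memory_usage(deep=True, index=False)

report = pd.DataFrame({
    "unique_values": demo_df.nunique(),
    "object_bytes": before,
    "category_bytes": after,
})
report["ratio"] = report["object_bytes"] / report["category_bytes"]
print(report)
```

The column with few distinct values shrinks far more than the one with many, which matches the pattern seen for `audio_id`.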

In [12]:
machines_df.sample(5)

,split,label,machine_id,audio_id
1239,train,normal,0,383
978,train,normal,0,122
485,test,normal,0,29
1113,train,normal,0,257
2770,train,normal,4,103


## Summary

Let's make a pivot table to count the audio samples per split (train/test), machine_id (00,02,04,06) and label (normal/anomaly).

In [13]:
(
    machines_df
    .filter(["machine_id", "split", "label", "audio_id"])
    .pivot_table(
        values="audio_id",
        index=["machine_id"],
        columns=["split","label"],
        aggfunc='count',
        fill_value=0,
        observed=True)
)

split,test,test,train
label,anomaly,normal,normal
machine_id,,,
0,143,100,906
2,111,100,905
4,100,100,602
6,102,100,936
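The same kind of count table can also be built with `pd.crosstab`, which may be a useful cross-check of the pivot table. The sketch below runs on a small made-up frame (`demo_df` is hypothetical, not the real data) and only illustrates the call.

```python
import pandas as pd

# Made-up rows with the same three columns used by the pivot table.
demo_df = pd.DataFrame({
    "machine_id": ["00", "00", "00", "02", "02"],
    "split": ["train", "test", "test", "train", "train"],
    "label": ["normal", "normal", "anomaly", "normal", "normal"],
})

# Rows: machine_id; columns: (split, label) pairs that actually occur in the data.
counts = pd.crosstab(
    index=demo_df["machine_id"],
    columns=[demo_df["split"], demo_df["label"]],
)
print(counts)
```

With plain string columns, combinations that never occur, such as train/anomaly here, do not get a column, and missing cells are filled with 0.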


From the pivot table, we see that:
- The train set has only normal audio samples.
- The test set has both normal and anomaly audio samples.
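As a quick consistency check, the counts from the pivot table can be totalled per split and label. The sketch below re-enters the numbers by hand, so the frame `counts` and the variable names are illustrative rather than taken from the notebook's own variables.

```python
import pandas as pd

# Counts copied from the pivot table above, one row per machine.
counts = pd.DataFrame(
    {
        ("test", "anomaly"): [143, 111, 100, 102],
        ("test", "normal"): [100, 100, 100, 100],
        ("train", "normal"): [906, 905, 602, 936],
    },
    index=pd.Index(["00", "02", "04", "06"], name="machine_id"),
)
counts.columns.names = ["split", "label"]

totals = counts.sum()             # total per (split, label) pair
total_samples = int(totals.sum())  # should equal the number of rows in machines_df
anomaly_share = totals[("test", "anomaly")] / totals["test"].sum()

print(totals)
print("total samples:", total_samples)
print("train/anomaly column present:", ("train", "anomaly") in counts.columns)
print(f"share of anomalies in the test set: {anomaly_share:.2f}")
```

The grand total matches the 4205 entries reported by `machines_df.info()`, so no filename was lost during extraction.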

## Conclusion

As we can see from the pivot table, this dataset is suitable for *semi-supervised anomaly detection*, as the train split contains only normal audio samples and the test split contains both normal and anomaly audio samples.