# Anomaly sound detection in pumps

## Introduction

**Anomaly sound detection (ASD)** is the task of identifying whether the sound emitted by a target machine is normal or anomalous. Automatically detecting mechanical failure is an essential technology in the fourth industrial revolution, including artificial intelligence (AI)-based factory automation. Promptly detecting machine anomalies from the sounds a machine makes is useful for machine condition monitoring.

 ![Anomaly detector](http://d33wubrfki0l68.cloudfront.net/268bbc4666654d6e5ef28c449067626fbfee7488/2ad7c/images/tasks/challenge2020/task2_unsupervised_detection_of_anomalous_sounds_for_machine_condition_monitoring_01.png)

Anomaly detection techniques can be categorized as:
- **Supervised anomaly detection** requires the entire dataset to be labeled "normal" or "abnormal". This technique is a binary classification task. 
- **Semi-supervised anomaly detection** requires only data considered "normal" to be labeled. In this technique, the model will learn what "normal" data are like. 
- **Unsupervised anomaly detection** involves unlabeled data. In this technique, the model has to work out by itself which data are "normal" and which are "abnormal".
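To make the semi-supervised setting concrete, here is a minimal sketch, assuming a synthetic one-dimensional feature instead of real audio. The detector is fitted on "normal" examples only and flags any value whose z-score exceeds a cut-off. The function names, the feature values and the threshold of 3 are illustrative assumptions, not part of this project.

```python
import random
import statistics


def fit_normal_profile(normal_values):
    """Learn the mean and standard deviation from normal examples only."""
    return statistics.fmean(normal_values), statistics.stdev(normal_values)


def anomaly_score(value, profile):
    """Absolute z-score: distance from the normal mean in standard deviations."""
    mean, std = profile
    return abs(value - mean) / std


def is_anomalous(value, profile, threshold=3.0):
    """Flag a value whose score exceeds the (assumed) threshold."""
    return anomaly_score(value, profile) > threshold


rng = random.Random(0)
# Hypothetical 1-D feature (for example, the energy of a clip) for normal clips.
normal_train = [rng.gauss(1.0, 0.1) for _ in range(500)]
profile = fit_normal_profile(normal_train)

print(is_anomalous(1.05, profile))  # value close to the normal mean
print(is_anomalous(2.00, profile))  # value far from the normal mean
```

No anomalous example is needed at training time; anomalies are only required later, to evaluate the detector.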

## State of the art

The word cloud below shows the terms most cited in the paper [Anomalous Sound Detection with Machine Learning: A Systematic Review](https://www.arxiv-vanity.com/papers/2102.07820/): the ToyADMOS, MIMII, and Mivia datasets; Mel-frequency cepstral coefficients (MFCC) for feature extraction; Autoencoder (AE) and Convolutional Neural Network (CNN) models; and AUC and F1-score as evaluation metrics.

![Wordcloud](wordcloud.png)

The **2-StateArt** notebook shows how to create this word cloud.
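Since AUC and F1-score are the most cited evaluation metrics, the following sketch shows how both can be computed from anomaly scores. It is a plain-Python illustration on made-up labels and scores (1 = anomaly, 0 = normal), with an assumed decision threshold of 0.5; it is not taken from the project's notebooks.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random anomaly scores higher than
    a random normal clip (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for the anomaly class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: three normal clips followed by three anomalous clips.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]
predictions = [int(s > 0.5) for s in scores]  # assumed threshold of 0.5

print(f"AUC = {roc_auc(labels, scores):.3f}")
print(f"F1  = {f1_score(labels, predictions):.3f}")
```

AUC is threshold-free and measures how well the scores rank anomalies above normal clips, whereas F1 depends on the chosen threshold.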

## Dataset

For this project, I will use the [development dataset](https://zenodo.org/record/3678171) of pumps from Task 2 of the [2020 DCASE Challenge](http://dcase.community/challenge2020/task-unsupervised-detection-of-anomalous-sounds).

This pivot table shows the number of audio clips per split (train/test), machine_id (00, 02, 04, 06) and label (normal/anomaly).

```python
import pandas as pd

# Load the filenames of the audio clips
filenames_df = pd.read_csv("https://raw.githubusercontent.com/xiaoxi-david/malfunctioning-machines/main/development/jupyter/csv/filenames.csv")

dct_types = {
    "split": "category",
    "label": "category",
    "machine_id": "category",
    "audio_id": "category",
}

# Extract the split, label, machine_id and audio_id from the filenames
machines_df = (
    filenames_df["filename"]
    .str.extract(r"(train|test).(normal|anomaly)_id_(\d{2})_\d{4}(\d{4})", expand=True)
    .rename(columns={0: "split", 1: "label", 2: "machine_id", 3: "audio_id"})
    # Convert columns to categories to save memory
    .astype(dct_types)
)

# Summary of the number of examples per machine_id, split and label
(
    machines_df
    .filter(["machine_id", "split", "label", "audio_id"])
    .pivot_table(
        values="audio_id",
        index=["machine_id"],
        columns=["split", "label"],
        aggfunc="count",
        fill_value=0,
        observed=True,
    )
)
```

| machine_id | test / anomaly | test / normal | train / normal |
|------------|---------------:|--------------:|---------------:|
| 00         |            143 |           100 |            906 |
| 02         |            111 |           100 |            905 |
| 04         |            100 |           100 |            602 |
| 06         |            102 |           100 |            936 |


This dataset is suitable for *semi-supervised anomaly detection* because the train split contains only normal clips, while the test split contains both normal and anomalous clips.

All the audio clips last 10 seconds and were recorded at 16 kHz with a resolution of 16 bits. You can listen to some of them in the **4-EDA_samples** notebook.
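As a quick sanity check of what these figures imply, the arithmetic below derives the number of samples and the raw size of one clip. It assumes single-channel, uncompressed PCM audio, which is an assumption of this sketch and is not stated above.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz
DURATION_S = 10          # each clip lasts 10 seconds
BIT_DEPTH = 16           # bits per sample
CHANNELS = 1             # assumption: single-channel clips

# Number of samples a model receives for one clip
samples_per_clip = SAMPLE_RATE_HZ * DURATION_S

# Raw payload size, ignoring any file header
bytes_per_clip = samples_per_clip * CHANNELS * BIT_DEPTH // 8

print(f"{samples_per_clip} samples, {bytes_per_clip / 1000:.0f} kB per clip")
```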

## Methodology

Although this dataset is designed for semi-supervised anomaly detection, the test split contains both normal and anomalous clips, so we can also use it to practice supervised anomaly detection.

- Notebooks 5 and 6 show how to create supervised anomaly detectors using CNNs.
- Notebooks 7, 8 and 9 show how to create semi-supervised anomaly detectors using AEs.
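The idea behind an AE-based semi-supervised detector can be sketched without a deep learning framework. The example below is not the model used in the notebooks: it swaps the autoencoder for PCA, which behaves like a linear encoder/decoder pair, and it runs on synthetic feature vectors. The model is fitted on normal data only, the reconstruction error serves as the anomaly score, and the threshold (the 99th percentile of the training errors) is an assumed choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for audio features: normal data lies close to a
# 2-D subspace of an 8-D space, anomalies do not follow that structure.
basis = rng.normal(size=(2, 8))
normal_train = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 8))
normal_test = rng.normal(size=(100, 2)) @ basis + 0.05 * rng.normal(size=(100, 8))
anomaly_test = rng.normal(size=(100, 8))

# "Train" on normal data only: the top principal components play the
# role of the encoder weights, their transpose the decoder weights.
mean = normal_train.mean(axis=0)
_, _, vt = np.linalg.svd(normal_train - mean, full_matrices=False)
components = vt[:2]  # bottleneck of size 2


def reconstruction_error(x):
    """Mean squared error between each input and its reconstruction."""
    centered = x - mean
    encoded = centered @ components.T  # encode to the bottleneck
    decoded = encoded @ components     # decode back to the input space
    return np.mean((centered - decoded) ** 2, axis=1)


# Assumed rule: flag anything above the 99th percentile of training errors.
threshold = np.percentile(reconstruction_error(normal_train), 99)
flagged_normal = np.mean(reconstruction_error(normal_test) > threshold)
flagged_anomaly = np.mean(reconstruction_error(anomaly_test) > threshold)

print(f"normal clips flagged:    {flagged_normal:.0%}")
print(f"anomalous clips flagged: {flagged_anomaly:.0%}")
```

A non-linear autoencoder follows the same recipe: only the encoder and decoder change, while the scoring and thresholding steps stay the same.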

I use [TensorFlow](https://www.tensorflow.org), [TensorBoard](https://www.tensorflow.org/tensorboard/) and [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) to solve the problem.

## Conclusions

Training a semi-supervised anomaly detector is much more difficult than training a supervised one because common guidelines for supervised machine learning don't apply to semi-supervised machine learning.