# Content
* [1  Problem description](#1)
    * [1.1  Testbed description](#1.1)
    * [1.2  Business problem](#1.2)
    * [1.3  Metrics](#1.3)
    * [1.4  Data description](#1.4)
    * [1.5  Data structure](#1.5)
* [2  Data analysis](#2)
    * [2.1  Data loading](#2.1)
    * [2.2  Summarized information about the data](#2.2)
        * [2.2.1  Number of experiments, dataset sizes](#2.2.1)
        * [2.2.2  Aggregated data for experiments](#2.2.2)
    * [2.3  Signal analysis](#2.3)        
        * [2.3.1  Descriptive statistics](#2.3.1)
        * [2.3.2  Signal plots](#2.3.2)
        * [2.3.3  Pairwise Correlations](#2.3.3)
        * [2.3.4  Pairplot](#2.3.4)
        * [2.3.5  Distribution plot](#2.3.5)

<a id="1"></a>
# 1 Problem description

<a id="1.1"></a>
## 1.1 Testbed description  

![Testbed](https://github.com/waico/SKAB/blob/master/docs/pictures/testbed.png?raw=true)

Front panel and composition of the water circulation, control, and monitoring systems (quantities in parentheses): 1, 2 - solenoid valve (1); 3 - water tank (1); 4 - water pump (1); 5 - emergency stop button (1); 6 - electric motor (1); 7 - inverter (1); 8 - compactRIO (1); 9 - mechanical lever for shaft misalignment (1). Parts not shown: vibration sensor (2); pressure meter (1); flow meter (1); thermocouple (2).

<a id="1.2"></a>
## 1.2 Business problem

The main advantages of anomaly detection in the operation of equipment:
- Reducing equipment maintenance costs
- Optimization of terms and duration of repair work
- Reducing the probability of failures

<a id="1.3"></a>
## 1.3 Metrics

An anomaly detection problem is usually solved as a binary classification problem. The following metrics are proposed to evaluate the performance of the algorithms:
- False Alarm Rate
$$FAR = \frac{FP}{FP+TN}$$
- Missing Alarm Rate
$$MAR = \frac{FN}{TP+FN}$$
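As a minimal sketch of the two formulas above, the snippet below computes FAR and MAR from binary label arrays, where 1 marks an anomalous point. The helper name `far_mar` and the toy labels are assumptions made for illustration; they are not part of the dataset or of any library.

```python
import numpy as np


def far_mar(y_true, y_pred):
    """Return (FAR, MAR) for binary labels where 1 marks an anomaly."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = np.sum(y_true & y_pred)    # anomalies that were detected
    tn = np.sum(~y_true & ~y_pred)  # normal points left alone
    fp = np.sum(~y_true & y_pred)   # false alarms
    fn = np.sum(y_true & ~y_pred)   # missed anomalies

    # Return NaN instead of dividing by zero when a class is absent
    far = fp / (fp + tn) if (fp + tn) > 0 else float("nan")
    mar = fn / (tp + fn) if (tp + fn) > 0 else float("nan")
    return float(far), float(mar)


# Toy labels, made up for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 0]
far, mar = far_mar(y_true, y_pred)
print(f"FAR = {far:.2f}, MAR = {mar:.2f}")  # FAR = 0.25, MAR = 0.50
```

In this toy case one of four normal points raises an alarm (FAR = 1/4) and two of four anomalous points are missed (MAR = 2/4).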

Also, to compare the results of the algorithm, you can consider the metrics described in:
https://tsad.readthedocs.io/en/latest/Evaluating.html 

<a id="1.4"></a>
## 1.4 Data description

Each file represents one experiment and contains one anomaly (the exception is the anomaly-free file, which does not contain any anomalies). The dataset is a multivariate time series collected from testbed sensors. The data folder contains the datasets from the test. Data Folder Structure:

1. anomaly-free - Data obtained from the experiments with normal mode
2. valve1 - Data obtained from the experiments with closing the valve at the outlet of the flow from the pump.
3. valve2 - Data obtained from the experiments with closing the valve at the flow inlet to the pump.
4. other - Data obtained from the other experiments 
> - 1.csv  Simulation of fluid leaks and fluid additions         
> - 2.csv  Simulation of fluid leaks and fluid additions
> - 3.csv  Simulation of fluid leaks and fluid additions
> - 4.csv  Simulation of fluid leaks and fluid additions         
> - 5.csv  Sharp behavior of rotor imbalance
> - 6.csv  Linear behavior of rotor imbalance
> - 7.csv  Step behavior of rotor imbalance
> - 8.csv  Dirac delta function behavior of rotor imbalance
> - 9.csv  Exponential behavior of rotor imbalance
> - 10.csv The slow increase in the amount of water in the circuit
> - 11.csv The sudden increase in the amount of water in the circuit
> - 12.csv Draining water from the tank until cavitation
> - 13.csv Two-phase flow supply to the pump inlet (cavitation)
> - 14.csv Water supply of increased temperature

<a id="1.5"></a>
## 1.5 Data structure

The columns in each data file are as follows:
* datetime - Date and time at which the value was written to the database (YYYY-MM-DD hh:mm:ss)
* Accelerometer1RMS - Vibration acceleration (in g units)
* Accelerometer2RMS - Vibration acceleration (in g units)
* Current - Current drawn by the electric motor (Ampere)
* Pressure - Pressure in the loop after the water pump (Bar)
* Temperature - Temperature of the engine body (degrees Celsius)
* Thermocouple - Temperature of the fluid in the circulation loop (degrees Celsius)
* Voltage - Voltage on the electric motor (Volt)
* Volume Flow RateRMS - Circulation flow rate of the fluid inside the loop (Liter per minute)
* anomaly - Shows if the point is anomalous (0 or 1)
* changepoint - Shows if the point is a changepoint for collective anomalies (0 or 1)
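The layout above can be illustrated with a small sketch that parses a semicolon-separated file and separates the sensor channels from the two label columns. The two rows below are made-up values, not real measurements, and the flow column is named `Volume Flow RateRMS` as in the plotting code later in this notebook. Files without labels (such as the anomaly-free one) would need the label step skipped.

```python
import io

import pandas as pd

# Two made-up rows in the layout described above (not real measurements)
sample_csv = io.StringIO(
    "datetime;Accelerometer1RMS;Accelerometer2RMS;Current;Pressure;"
    "Temperature;Thermocouple;Voltage;Volume Flow RateRMS;anomaly;changepoint\n"
    "2020-01-01 00:00:00;0.02;0.04;1.3;0.05;70.0;25.0;230.0;32.0;0.0;0.0\n"
    "2020-01-01 00:00:01;0.03;0.04;1.2;0.38;70.1;25.0;231.0;32.0;1.0;1.0\n"
)

# Same reading options as the loading cells in this notebook
df = pd.read_csv(sample_csv, sep=";", index_col="datetime", parse_dates=True)

label_columns = ["anomaly", "changepoint"]
features = df.drop(columns=label_columns)  # the eight sensor channels
labels = df[label_columns].astype(int)     # 0/1 flags

print(features.shape, labels.shape)  # (2, 8) (2, 2)
```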

<a id="2"></a>
# 2 Data analysis

In [None]:
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import display
from IPython.core.display import Markdown

import warnings

warnings.filterwarnings("ignore")

<a id="2.1"></a>
## 2.1 Data loading

In [None]:
all_files = []
for root, dirs, files in os.walk("../data/"):
    for file in files:
        if file.endswith(".csv"):
            all_files.append(f"{root}/{file}")

all_files.sort()
display(all_files)

In [None]:
other_description = {
    "1.csv": "Simulation of fluid leaks and fluid additions",
    "2.csv": "Simulation of fluid leaks and fluid additions",
    "3.csv": "Simulation of fluid leaks and fluid additions",
    "4.csv": "Simulation of fluid leaks and fluid additions",
    "5.csv": "Sharp behavior of rotor imbalance",
    "6.csv": "Linear behavior of rotor imbalance",
    "7.csv": "Step behavior of rotor imbalance",
    "8.csv": "Dirac delta function behavior of rotor imbalance",
    "9.csv": "Exponential behavior of rotor imbalance",
    "10.csv": "The slow increase in the amount of water in the circuit",
    "11.csv": "The sudden increase in the amount of water in the circuit",
    "12.csv": "Draining water from the tank until cavitation",
    "13.csv": "Two-phase flow supply to the pump inlet (cavitation)",
    "14.csv": "Water supply of increased temperature",
}

In [None]:
# Group the data by anomaly type
anomaly_free_data = pd.read_csv(
    "../data/anomaly-free/anomaly-free.csv",
    sep=";",
    index_col="datetime",
    parse_dates=True,
)
valve1_data = {
    file.split("/")[-1]: pd.read_csv(
        file, sep=";", index_col="datetime", parse_dates=True
    )
    for file in all_files
    if "valve1" in file
}
valve2_data = {
    file.split("/")[-1]: pd.read_csv(
        file, sep=";", index_col="datetime", parse_dates=True
    )
    for file in all_files
    if "valve2" in file
}
other_data = {
    file.split("/")[-1]: pd.read_csv(
        file, sep=";", index_col="datetime", parse_dates=True
    )
    for file in all_files
    if "other" in file
}

<a id="2.2"></a>
## 2.2 Summarized information about the data
This section focuses on general information about data and contains the following subsections:
* Number of experiments, dataset sizes
* Aggregated experiment data
* Descriptive statistics
* Gaps and outliers in the data
* Measurement resolution
* Pairwise correlations
* Determination of the operation modes: transfer mode, work mode, stop mode

<a id="2.2.1"></a>
### 2.2.1 Number of experiments, dataset sizes
To get started, you can look at the number of experiments in each group, the first few rows in these experiments, and the dimension of the data in the experiments.

In [None]:
display(Markdown("<br>__Number of experiments in each group:__"))
print(
    "Experiments with closing the valve at the outlet of the pump:",
    len(valve1_data),
)
print(
    "Experiments with closing the valve at the inlet flow to the pump:",
    len(valve2_data),
)
print("Other experiments:", len(other_data))
print("Datasets without anomalies: 1")

display(Markdown("<br><br>__Dataset without anomalies__"))
display(anomaly_free_data.head(3))
print("Dataset size:", anomaly_free_data.shape)

display(
    Markdown(
        '<br><br>__The first dataset from the group "Other experiments"__'
    )
)
display(other_data["1.csv"].head(3))
print("Dataset size:", other_data["1.csv"].shape)

<a id="2.2.2"></a>
### 2.2.2 Aggregated data for experiments
To compare experiments, we collect aggregated information for each one and build a summary table with the following fields:
* type of experiment
* experiment number
* description of the experiment
* experiment duration
* number of rows
* percentage of rows with anomalies
* number of state change points
* number of missing values (NaN)
* main time between samples, in seconds
* percentage of samples taken at the main sampling interval
* number of time gaps between samples
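The last three fields can be sketched on a toy timestamp index as follows. The timestamps are made up for illustration, and the gap rule (an interval longer than three dominant intervals) mirrors the threshold used in `experiment_describe`. One deliberate difference: this sketch uses `dt.total_seconds()`, which does not wrap around for intervals longer than a day, whereas the notebook cell uses `dt.seconds`.

```python
import pandas as pd

# Made-up timestamps: mostly 1-second steps, with one 10-second gap
index = pd.to_datetime(
    [
        "2020-01-01 00:00:00",
        "2020-01-01 00:00:01",
        "2020-01-01 00:00:02",
        "2020-01-01 00:00:03",
        "2020-01-01 00:00:13",  # gap
        "2020-01-01 00:00:14",
    ]
)

# Interval between consecutive samples, in seconds
steps = index.to_series().diff().dropna().dt.total_seconds()

# value_counts() sorts by frequency, so the first entry is the dominant interval
counts = steps.value_counts()
main_step = counts.index[0]
main_step_share = round(counts.iloc[0] / counts.sum() * 100, 2)

# Count an interval as a gap when it exceeds three dominant intervals
gap_count = int(counts[counts.index > main_step * 3].sum())

print(main_step, main_step_share, gap_count)  # 1.0 80.0 1
```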


In [None]:
def experiment_describe(data, anomaly_type, description, experiment):
    start_time = data.index.min()
    finish_time = data.index.max()
    duration = finish_time - start_time

    nas = data.isna().sum().sum()

    rows = data.shape[0]
    if "anomaly" in data.columns:
        anomaly_percent = np.round(data["anomaly"].mean() * 100, 2)
        changepoints = data["changepoint"].sum()
    else:
        anomaly_percent = 0
        changepoints = 0

    dif_time = data.index.to_series().diff().dropna().dt.seconds.value_counts()
    main_sample_rate = dif_time.index[0]
    percent_of_main_sample_rate = round(
        dif_time.values[0] / dif_time.sum() * 100, 2
    )
    number_of_gaps = dif_time[dif_time.index > main_sample_rate * 3].sum()

    columns = [
        "anomaly_type",
        "experiment",
        "description",
        "duration",
        "rows",
        "anomaly_percent",
        "changepoints",
        "Nas",
        "Main time between samples, seconds",
        "Percent of main sample frequency among all samples",
        "Number of gaps",
    ]
    values = np.array(
        [
            anomaly_type,
            experiment,
            description,
            duration,
            rows,
            anomaly_percent,
            changepoints,
            nas,
            main_sample_rate,
            percent_of_main_sample_rate,
            number_of_gaps,
        ]
    ).reshape(1, -1)

    describe_df = pd.DataFrame(columns=columns, data=values)
    return describe_df


experiment = "1.csv"
display(
    Markdown(
        "<br>__An example of aggregated information for the experiment with the closing of the outlet valve__<br><br>"
    )
)
display(
    experiment_describe(
        valve1_data[experiment],
        "valve1",
        "Closing the valve downstream of the pump",
        experiment,
    )
)

In [None]:
def get_summary_table():
    df = experiment_describe(
        anomaly_free_data, "anomaly_free", "Normal mode", "anomaly-free.csv"
    )

    for experiment in valve1_data:
        df = pd.concat(
            [
                df,
                experiment_describe(
                    valve1_data[experiment],
                    "valve1",
                    "Closing the valve downstream of the pump",
                    experiment,
                ),
            ]
        )

    for experiment in valve2_data:
        df = pd.concat(
            [
                df,
                experiment_describe(
                    valve2_data[experiment],
                    "valve2",
                    "Closing the valve at the inlet flow to the pump",
                    experiment,
                ),
            ]
        )

    for experiment in other_data:
        df = pd.concat(
            [
                df,
                experiment_describe(
                    other_data[experiment],
                    "other",
                    other_description[experiment],
                    experiment,
                ),
            ]
        )

    df.index = pd.Index([x for x in range(df.shape[0])])
    return df


summary_table = get_summary_table()
display(Markdown("<br>__Summary table for all experiments__<br><br>"))
display(summary_table)

__From this table, the following can be noted:__
* The target duration of each experiment is about 20 minutes; the anomaly-free run lasted 2 hours 46 minutes
* The share of time spent in an anomalous mode ranges from 25.83% to 57.74%, but for most experiments it is around 35%
* For most experiments, 4 state change points are recorded
* The data contain no missing (np.NaN) values

<a id="2.3"></a>
## 2.3 Signal analysis

In [None]:
# Join datasets
all_data = (
    pd.concat(
        (
            [anomaly_free_data]
            + list(valve1_data.values())
            + list(valve2_data.values())
            + list(other_data.values())
        )
    )
    .sort_index()
    .drop_duplicates()
)

In [None]:
all_data

<a id="2.3.1"></a>
### 2.3.1 Descriptive statistics
To become more familiar with the data, you can display descriptive statistics, first for all experiments combined and then for individual experiments.
They include the following fields:
* count - number of non-missing records (not np.NaN)
* minimum, mean, and maximum values, the median, and the 25th and 75th percentiles
* standard deviation

In [None]:
display(all_data.iloc[:, :-2].describe().T)

In [None]:
columns = anomaly_free_data.columns
mean_table = pd.DataFrame(index=[x + " mean" for x in columns])
mean_table["anomaly free"] = anomaly_free_data.describe().loc["mean"].values
mean_table["valve1[1.csv]"] = (
    valve1_data["1.csv"][columns].describe().loc["mean"].values
)
mean_table["valve2[1.csv]"] = (
    valve2_data["1.csv"][columns].describe().loc["mean"].values
)
mean_table["other[1.csv]"] = (
    other_data["1.csv"][columns].describe().loc["mean"].values
)

std_table = pd.DataFrame(index=[x + " std" for x in columns])
std_table["anomaly free"] = anomaly_free_data.describe().loc["std"].values
std_table["valve1[1.csv]"] = (
    valve1_data["1.csv"][columns].describe().loc["std"].values
)
std_table["valve2[1.csv]"] = (
    valve2_data["1.csv"][columns].describe().loc["std"].values
)
std_table["other[1.csv]"] = (
    other_data["1.csv"][columns].describe().loc["std"].values
)

std_mean_table = pd.concat(
    [
        mean_table,
        std_table,
    ]
)
display(std_mean_table)

__From this table, the following points can be distinguished:__
* The average vibration acceleration for anomaly free is almost an order of magnitude higher than for the experiments with closing the outlet and inlet valves (valve1[1.csv], valve2[1.csv])
* The average water flow rate for anomaly free is 4 times higher than for the experiments with closing the outlet and inlet valves (valve1[1.csv], valve2[1.csv])
* The average values of temperature, pressure, current, and thermocouple for anomaly free are higher than for the experiments with closing the outlet and inlet valves (valve1[1.csv], valve2[1.csv])

<a id="2.3.2"></a>
### 2.3.2 Signal plots

In [None]:
for column in all_data.columns[:-2]:
    plt.figure(figsize=(16, 3))
    plt.plot(all_data[column].values)  # without gaps
    plt.title(column)
    plt.show()

<a id="2.3.3"></a>
### 2.3.3 Pairwise Correlations
Pairwise correlation information can often be used to search for relationships between features.

In [None]:
plt.figure(figsize=(8, 8))
display(Markdown("__All data (anomaly-free and anomalous experiments):__"))

# Pairwise correlations over the combined dataset
corr = all_data.corr()

plt.imshow(corr, cmap="viridis", interpolation="nearest")
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)  # type: ignore
plt.yticks(range(len(corr.columns)), corr.columns)  # type: ignore
plt.title("Correlation Heatmap")
plt.show()

<a id="2.3.4"></a>
### 2.3.4 Pairplot
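To search for relationships between features, you can also use a pairplot, which combines all pairwise scatterplots with a histogram for each feature. The seaborn library provides this as the pairplot function; the cell below uses the similar scatter_matrix function from pandas.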

In [None]:
%%time
features = [
    "Accelerometer1RMS",
    "Current",
    "Temperature",
    "Thermocouple",
    "Voltage",
    "Volume Flow RateRMS",
]

# Resetting the index if needed
all_data_reset = all_data.reset_index()

# Creating a scatter matrix
scatter_matrix = pd.plotting.scatter_matrix(
    all_data_reset[features], figsize=(10, 10)
)

# Rotate and align the axis labels so they stay readable
for ax in scatter_matrix.flatten():
    ax.xaxis.label.set_rotation(45)
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha("right")

plt.suptitle("Pairplot of Features")
plt.show()

<a id="2.3.5"></a>
### 2.3.5 Distribution plot

When analyzing the parameters, it is useful to start by looking at the distribution of values using a histogram.

In [None]:
features = ["Volume Flow RateRMS"]
for col in features:
    plt.figure(figsize=(16, 6))
    # Overlay the same signal for all data and for the first experiment of each group
    plt.hist(all_data[col].values, label="all data")
    plt.hist(valve1_data["1.csv"][col].values, label="valve1 1.csv")
    plt.hist(valve2_data["1.csv"][col].values, label="valve2 1.csv")
    plt.hist(other_data["1.csv"][col].values, label="other 1.csv")
    plt.title(col)
    plt.legend()
    plt.show()