# Anomaly Detection Project: A Demo
- Author: Heba Mahdi (htmwtw@umsl.edu)
- University of Missouri - St. Louis
- Last modified date: July 09, 2024

#### Summary

The anomaly-detection process begins by drawing equal-sized samples of anomalous and non-anomalous GONG H-Alpha solar observation images. Each image is partitioned by an n × n grid into n² cells, and the average pixel intensity of each cell is computed, which effectively downsamples the image to an n × n grid of average values.

Users then choose one of two methods, 'minmax' or 'anova', to establish a normal intensity range for each cell. The 'minmax' approach uses only the labeled normal data: each cell's range is bounded by the minimum and maximum average pixel values observed across the non-anomalous images. The 'anova' approach instead applies a one-way ANOVA F-test to the two labeled groups to refine the normal intensity range for each cell. It selects the upper and lower bounds for each cell using the statistic S = |U − I| + |L − I|, where U and L are candidate upper and lower bounds and I is the cell's average intensity, so that cells with strongly deviating pixel intensities can be detected.

A cell is flagged as anomalous if the Sigmoid of its standardized S statistic exceeds a specified threshold, and an entire image is classified as anomalous if its number of corrupt cells exceeds a predefined minimum.
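
As a rough illustration of the downsampling step and the S statistic described above, the following minimal sketch computes per-cell averages for a toy image and evaluates S for one cell. The names `cell_averages` and `s_statistic` are hypothetical and are not taken from the `anomalyzer` module; the image is a plain nested list whose height and width are assumed to be divisible by n.

```python
def cell_averages(image, n):
    """Downsample a 2D list of pixel intensities to an n x n grid of cell means.

    Assumes the image height and width are both divisible by n.
    """
    height, width = len(image), len(image[0])
    cell_h, cell_w = height // n, width // n
    grid = []
    for grid_row in range(n):
        row = []
        for grid_col in range(n):
            pixels = [
                image[r][c]
                for r in range(grid_row * cell_h, (grid_row + 1) * cell_h)
                for c in range(grid_col * cell_w, (grid_col + 1) * cell_w)
            ]
            row.append(sum(pixels) / len(pixels))
        grid.append(row)
    return grid


def s_statistic(intensity, lower, upper):
    """S = |U - I| + |L - I| for one cell."""
    return abs(upper - intensity) + abs(lower - intensity)


# A 4 x 4 toy image split into a 2 x 2 grid.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [50, 50, 0, 0],
    [50, 50, 0, 0],
]
averages = cell_averages(image, 2)

# Inside the range [L, U] the statistic equals the range width U - L;
# outside the range it grows with the distance from the range.
inside = s_statistic(50.0, lower=40.0, upper=120.0)
outside = s_statistic(0.0, lower=40.0, upper=120.0)
```

Because S stays constant at U − L for any intensity inside the range and grows linearly outside it, larger values point to cells that lie further from their normal range.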

#### Import Required Modules and Classes

In [None]:
import os
import random

from anomalyzer import Anomalyzer
from performance_evaluator import f1_metric
from plots_generator import get_plot_data, plot_cell_avg, plot_cell_wise_scatter, plot_cell_wise_hist_plot

#### Set Up Sampled Paths

This cell defines the paths to the folders containing anomalous and non-anomalous images, builds a list of image file paths for each folder, and randomly samples 20 paths from each list.

In [None]:
anomalous_folder_path     = '../../Data/Anomalous'
non_anomalous_folder_path = '../../Data/Non-Anomalous'

anomalous_paths = [os.path.join(anomalous_folder_path, file_name) for file_name in os.listdir(anomalous_folder_path) if file_name.lower().endswith(('.jpg', '.jpeg', '.png'))]
non_anomalous_paths = [os.path.join(non_anomalous_folder_path, file_name) for file_name in os.listdir(non_anomalous_folder_path) if file_name.lower().endswith(('.jpg', '.jpeg', '.png'))]

sampled_anomalous_paths = random.sample(anomalous_paths, 20)
sampled_non_anomalous_paths = random.sample(non_anomalous_paths, 20)
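
The sampling above changes on every run, and `random.sample` raises a `ValueError` when a folder holds fewer images than requested. A minimal sketch of a repeatable, size-guarded draw is shown below; the helper `sample_paths`, its `seed` parameter, and the file names are assumptions made for illustration and are not part of the project code.

```python
import random


def sample_paths(paths, k, seed=None):
    """Return up to k randomly chosen paths; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(paths, min(k, len(paths)))


# Hypothetical file names used only for illustration.
demo_paths = [f"img_{i:03d}.jpg" for i in range(50)]

first_draw = sample_paths(demo_paths, 20, seed=42)
second_draw = sample_paths(demo_paths, 20, seed=42)   # same seed, same sample
small_draw = sample_paths(demo_paths[:5], 20, seed=42)  # capped at the 5 available paths
```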

#### Compute Best Range for Each Cell Using ANOVA or MinMax

This cell creates an instance of the `Anomalyzer` class to process the sampled images: it reads them in grayscale, divides them into a grid of cells, and calculates each cell's average pixel value. It then computes the optimal lower and upper range values for each grid cell from the sampled non-anomalous and anomalous paths. The `method` argument determines whether the `anova` or `minmax` method is used to establish the normal range of average pixel values per cell. With the `anova` method, additional parameters such as `lower_range_end`, `upper_range_start`, and `step_size` can be specified to define the candidate ranges for the one-way ANOVA F-test.

Two points are worth keeping in mind. First, with the `anova` method and 40 sampled images, executing this cell takes approximately 10 minutes on a typical personal computer. Second, the sample should include at least one image with issues in all four corners. Otherwise the average pixel values in those cells are constant (e.g., 0.0) and the F-statistic is undefined: the computation returns `nan` and issues a `ConstantInputWarning`. Any `nan` F-statistic is replaced with a sentinel value (-1) so that candidate ranges can still be ranked by their F-statistics.

In [None]:
grid_size = 8
anomalyzer = Anomalyzer(grid_size=grid_size)
anomalyzer.compute_best_ranges(sampled_non_anomalous_paths, sampled_anomalous_paths, method='anova')
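
To make the two range-selection ideas more concrete, here is a minimal, dependency-free sketch for a single cell. It is an interpretation of the description above, not the code inside `compute_best_ranges`: the helper names are hypothetical, the F statistic is computed by hand rather than with a statistics library, and scoring each candidate range by an F-test on the S values of the two groups is an assumption. How candidates are generated from `lower_range_end`, `upper_range_start`, and `step_size` is not shown.

```python
import math


def minmax_range(normal_values):
    """'minmax' idea: the range spans the observed normal cell averages."""
    return min(normal_values), max(normal_values)


def s_statistic(intensity, lower, upper):
    """S = |U - I| + |L - I| for one cell."""
    return abs(upper - intensity) + abs(lower - intensity)


def f_statistic(group_a, group_b):
    """One-way ANOVA F statistic for two groups; nan when every value is identical."""
    groups = [group_a, group_b]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        # No within-group spread: undefined if the group means coincide as well.
        return math.nan if ss_between == 0 else math.inf
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def best_range(normal_values, anomalous_values, candidates):
    """Assumed scoring: keep the candidate whose S values separate the groups best."""
    scored = []
    for lower, upper in candidates:
        f = f_statistic(
            [s_statistic(v, lower, upper) for v in normal_values],
            [s_statistic(v, lower, upper) for v in anomalous_values],
        )
        # Undefined F statistics get the sentinel -1 so they rank last.
        scored.append((-1.0 if math.isnan(f) else f, (lower, upper)))
    return max(scored, key=lambda item: item[0])[1]


# Hypothetical cell averages for one grid cell.
normal = [100, 110, 120, 105]
anomalous = [0, 5, 240, 250]
candidates = [(90, 130), (0, 255), (50, 200)]

simple_range = minmax_range(normal)
chosen = best_range(normal, anomalous, candidates)
```

The candidate `(0, 255)` contains every value, so all S values are identical, the F statistic is `nan`, and the sentinel pushes that candidate to the bottom of the ranking.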

#### Set Up Test Paths

This cell builds the list of test image paths by excluding the paths already sampled for analysis from the full list of image paths.

In [None]:
all_img_paths = anomalous_paths + non_anomalous_paths
sampled_img_paths = sampled_non_anomalous_paths + sampled_anomalous_paths
test_img_paths = [path for path in all_img_paths if path not in sampled_img_paths]
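
For larger folders, the same exclusion can be done with a set lookup instead of repeated list scans. The sketch below uses hypothetical paths and also checks that no sampled image leaks into the test split.

```python
# Hypothetical paths used only for illustration.
all_paths = [f"data/img_{i}.jpg" for i in range(10)]
sampled = ["data/img_2.jpg", "data/img_7.jpg"]

sampled_set = set(sampled)  # set membership is O(1) on average
test_paths = [p for p in all_paths if p not in sampled_set]  # original order is kept

# No sampled image may appear in the test split.
assert sampled_set.isdisjoint(test_paths)
```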

#### Find Corrupt Images

This cell processes the test images by reading them in grayscale, dividing them into a grid, calculating the average pixel value of each cell, and computing the Sigmoid of the standardized S statistic (the anomaly likelihood) from each cell's best range. A cell is marked corrupt if its anomaly likelihood exceeds `likelihood_threshold`, and an image is marked corrupt if its number of corrupt cells exceeds `min_corrupt_cells`. Data for the detected corrupt images is returned in a DataFrame.

In [None]:
corrupt_images = anomalyzer.find_corrupt_images(test_img_paths, likelihood_threshold=0.6, min_corrupt_cells=0)
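
The flagging rule can be sketched as follows. This is an illustration under stated assumptions, not the body of `find_corrupt_images`: the function names are hypothetical, and the reference mean and standard deviation used for standardization are made-up values, since the notebook does not show which population they are computed from.

```python
import math


def anomaly_likelihood(s, s_mean, s_std):
    """Sigmoid of the standardized S value; s_mean and s_std are reference statistics."""
    z = (s - s_mean) / s_std
    return 1.0 / (1.0 + math.exp(-z))


def is_corrupt_image(likelihoods, likelihood_threshold, min_corrupt_cells):
    """Flag an image when more than min_corrupt_cells cells exceed the threshold."""
    corrupt = [p for p in likelihoods if p > likelihood_threshold]
    return len(corrupt) > min_corrupt_cells


# Hypothetical S values for four cells of one image, with assumed reference statistics.
s_values = [80.0, 82.0, 79.0, 160.0]
likelihoods = [anomaly_likelihood(s, s_mean=80.0, s_std=10.0) for s in s_values]

flagged = is_corrupt_image(likelihoods, likelihood_threshold=0.6, min_corrupt_cells=0)
```

Under this reading, `min_corrupt_cells=0` means that a single cell above the threshold is enough to flag the whole image.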

#### Performance Evaluation

This cell calculates the F1 score and confusion matrix metrics, prints the F1 score, and computes and prints the True Positive Rate (TPR), False Positive Rate (FPR), and False Negative Rate (FNR).

In [None]:
F1, TP, FP, TN, FN = f1_metric(anomalous_paths=anomalous_paths, corrupt_images=corrupt_images, test_img_paths=test_img_paths)
num_detected = len(corrupt_images['image_name'].unique())
print(f"Number of corrupt images detected: {num_detected}")
print(f"F1 Score = {F1}, TPR = {TP / (TP + FN)}, FPR = {FP / (FP + TN)}, FNR = {FN / (TP + FN)}")
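
For reference, the reported rates follow from the confusion-matrix counts as sketched below. The internals of `f1_metric` are not shown in this notebook, so this is only the standard textbook definition with hypothetical helper names and counts; the zero-denominator guard is an added assumption.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_rates(tp, fp, tn, fn):
    """Derive F1, TPR, FPR, and FNR from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    tpr = safe_div(tp, tp + fn)  # recall, also called sensitivity
    return {
        "F1": safe_div(2 * precision * tpr, precision + tpr),
        "TPR": tpr,
        "FPR": safe_div(fp, fp + tn),
        "FNR": safe_div(fn, tp + fn),
    }


# Hypothetical counts used only for illustration.
rates = classification_rates(tp=40, fp=10, tn=45, fn=5)
```

TPR and FNR share the denominator TP + FN, so they always sum to 1 when at least one anomalous image is in the test set.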

#### Plot Corrupt Image

This cell plots the corrupt image with the path `image_path`, marks corrupt cells with a purple bounding box if their anomaly likelihood value exceeds the threshold, and uses a blue colormap to indicate anomaly likelihood.

In [None]:
# Plot the 16th detected corrupt image; use a smaller index if fewer images were detected.
anomalyzer.plot_corrupt_image(corrupt_images['image_path'].unique()[15], corrupt_images)

#### Prepare Plot Data for Visualizations

This cell prepares plot data to include labels, average pixel values, S statistics, centralized S statistics, standardized S statistics, Sigmoid of centralized S statistics, and Sigmoid of standardized S statistics for all sampled image cells.

In [None]:
plot_data = get_plot_data(images_data=anomalyzer.images_data, best_ranges=anomalyzer.best_ranges)

#### Visualize the Distribution of Cells' Average Pixel Values using Histograms and KDE with Markings for Best Range Values

This cell creates a histogram for each cell of the sampled images to visualize the distribution of average pixel values, and marks each cell's best range values on the plot. Setting `kde_flag` to `True` overlays a Kernel Density Estimate (KDE), a smoothed estimate of the distribution, on each histogram.

In [None]:
plot_cell_avg(
    images_data=anomalyzer.images_data,
    best_ranges=anomalyzer.best_ranges,
    grid_size=grid_size,
    kde_flag=False,
    save=False,
    save_path=None
)

#### Plot Cell-wise Scatter Plot of Standardized S Statistic Values Against Their Sigmoid Values for Each Cell

This cell creates a scatter plot for each cell of sampled images to visualize the relationship between the standardized S statistic values and their Sigmoid values.

In [None]:
plot_cell_wise_scatter(
    plot_data,
    col1="standardized_S",
    col2="standardized_S_Sigmoid",
    grid_size=grid_size,
    save=False,
    save_path=None
)

#### Plot Cell-wise Histogram Plot of Standardized S Statistic Sigmoid Values for Each Cell

This cell creates a histogram for each cell of the sampled images to visualize the distribution of the Sigmoid values of the standardized S statistic. Setting `kde_flag` to `True` overlays a Kernel Density Estimate (KDE), a smoothed estimate of the distribution, on each histogram.

In [None]:
plot_cell_wise_hist_plot(
    plot_data,
    col="standardized_S_Sigmoid",
    grid_size=grid_size,
    kde_flag=False,
    save=False,
    save_path=None
)