# Anomaly Detection

This example is based on the anomaly detection tutorial of Yggdrasil Decision Forests (`ydf`). See that tutorial for a more complete introduction to anomaly detection and to inspecting decision forest anomaly detectors: https://ydf.readthedocs.io/en/latest/tutorial/anomaly_detection/

Anomaly detection techniques are unsupervised learning algorithms for identifying rare and unusual patterns in data that deviate significantly from the norm. For example, anomaly detection can be used for fraud detection, network intrusion detection, and fault diagnosis, without the need to define abnormal instances in advance.

Anomaly detection with decision forests is a straightforward but effective technique for tabular data. The model assigns an anomaly score to each data point, ranging from 0 (normal) to 1 (abnormal). Decision forests also offer interpretability tools and properties, making it easier to understand and characterize detected anomalies.

In anomaly detection, labeled examples are used not for training but for evaluating the model. These labels ensure that the model can detect known anomalies.

We train and evaluate an anomaly detection model on the UCI Covertype dataset, which describes forest cover types and other geographic attributes of land cells. The model is trained on spruce/fir and cottonwood/willow data. Given that cottonwood/willow is much rarer than spruce/fir, the model can differentiate between them without labels. The model is then converted with `conifer`, first to the C++ backend and then to HLS for FPGAs.

In [None]:
import ydf  # For learning the anomaly detection model
import pandas as pd  # We use Pandas to load small datasets
from sklearn import metrics  # Use sklearn to compute AUC
from ucimlrepo import fetch_ucirepo  # To download the dataset
import matplotlib.pyplot as plt  # For plotting
import seaborn as sns  # For plotting
import conifer  # For converting the model to C++ and HLS
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logger = logging.getLogger('conifer')
logger.setLevel(logging.DEBUG)

## Prepare dataset

In [None]:
# https://archive.ics.uci.edu/dataset/31/covertype
covertype_repo = fetch_ucirepo(id=31)
raw_dataset = pd.concat([covertype_repo.data.features, covertype_repo.data.targets], axis=1)

Select the columns of interest and clean the labels.

In [None]:
dataset = raw_dataset.copy()

# Features of interest
features = ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology",
            "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways",
            "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
            "Horizontal_Distance_To_Fire_Points"]
dataset = dataset[features + ["Cover_Type"]]

# Cover type as text
dataset["Cover_Type"] = dataset["Cover_Type"].map({
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz"
})

dataset.head()

The model is trained on the "filtered dataset" that only contains spruce/fir and cottonwood/willow examples.

In [None]:
filtered_dataset = dataset[dataset["Cover_Type"].isin(["Spruce/Fir", "Cottonwood/Willow"])]

As you can see, the spruce/fir cover is much more common than the cottonwood/willow cover:

In [None]:
filtered_dataset["Cover_Type"].value_counts()

We train a popular anomaly detection decision forest algorithm called isolation forest.

## Anomaly detection model

The model trained here is a bit smaller (fewer and shallower trees) than the one from the `ydf` tutorial, which makes it faster to synthesize and gives it a smaller FPGA footprint.

In [None]:
model = ydf.IsolationForestLearner(num_trees=50, max_depth=4, features=features).train(filtered_dataset)
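
To give some intuition for why an isolation forest separates rare examples, here is a minimal one-dimensional sketch. It is not how `ydf` is implemented: the helper names `isolation_depth` and `mean_isolation_depth`, the synthetic data, and the restriction to a single feature are all assumptions made for illustration. The idea is that random splits isolate an unusual point in fewer steps than a typical one.

```python
import random


def isolation_depth(x, points, rng, max_depth=4, depth=0):
    """Count the random splits needed to isolate x from points (1-D, capped)."""
    if depth >= max_depth or len(points) <= 1:
        return depth
    low, high = min(points), max(points)
    if low == high:
        return depth
    # Pick a random threshold within the current range of the data.
    threshold = rng.uniform(low, high)
    # Keep only the points that fall on the same side of the split as x.
    same_side = [p for p in points if (p < threshold) == (x < threshold)]
    return isolation_depth(x, same_side, rng, max_depth, depth + 1)


def mean_isolation_depth(x, points, num_trees=50, max_depth=4, seed=0):
    """Average the isolation depth of x over several random trees."""
    rng = random.Random(seed)
    depths = [isolation_depth(x, points, rng, max_depth) for _ in range(num_trees)]
    return sum(depths) / num_trees


# Synthetic data (assumption): a dense cluster around 0 plus one far-away point.
data_rng = random.Random(42)
points = [data_rng.gauss(0.0, 1.0) for _ in range(200)] + [10.0]

inlier_depth = mean_isolation_depth(0.0, points)
outlier_depth = mean_isolation_depth(10.0, points)
print(f"mean depth of a typical point: {inlier_depth:.2f}")
print(f"mean depth of the outlier:     {outlier_depth:.2f}")
```

A real isolation forest also picks a random feature at each split and turns the mean depth into a normalized score, which this sketch leaves out.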

We can then generate "predictions" i.e. anomaly scores.

In [None]:
predictions = model.predict(filtered_dataset)
predictions[:5]
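
Since labels are only used for evaluation, one way to summarize how well the scores rank known anomalies is the area under the ROC curve (AUC). The following is a small dependency-free sketch on toy values; the helper name `auc_from_scores` and the numbers are assumptions for illustration, not part of the original tutorial.

```python
def auc_from_scores(scores, is_anomaly):
    """Probability that a random anomaly scores higher than a random normal example."""
    anomalies = [s for s, a in zip(scores, is_anomaly) if a]
    normals = [s for s, a in zip(scores, is_anomaly) if not a]
    if not anomalies or not normals:
        raise ValueError("Both classes are needed to compute an AUC.")
    # Count pairs where the anomaly ranks higher; ties count as half.
    wins = sum((a > n) + 0.5 * (a == n) for a in anomalies for n in normals)
    return wins / (len(anomalies) * len(normals))


# Toy scores and labels (assumption), not taken from the Covertype data.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_is_anomaly = [False, False, True, True]
toy_auc = auc_from_scores(toy_scores, toy_is_anomaly)
print(f"AUC on toy data: {toy_auc}")
```

In this notebook, the labels would be `filtered_dataset["Cover_Type"] == "Cottonwood/Willow"` and the scores would be `predictions`. The `metrics.roc_auc_score` function from the `sklearn` import above computes the same quantity and scales better to the full dataset.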

Next, we plot the distribution of the model's anomaly score for the spruce/fir and cottonwood/willow covers. We see that the two distributions are "separated", indicating the model's ability to differentiate between the two covers.

Note: Since the cottonwood/willow cover is much less frequent, the two distributions are normalized separately. Otherwise, the cottonwood/willow distribution would appear flat.

In [None]:
sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Spruce/Fir"], label="Spruce/Fir")
sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Cottonwood/Willow"], label="Cottonwood/Willow")
plt.xlabel("predicted anomaly score")
plt.ylabel("distribution")
plt.legend()
None
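
The effect of normalizing the two distributions separately can be sketched with plain histograms instead of kernel density estimates. The scores below are synthetic and the class sizes are assumptions chosen to mimic a strong imbalance; they are not the Covertype values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores (assumption): a common class and a class that is 100x rarer.
common_scores = rng.normal(loc=0.45, scale=0.03, size=10_000)
rare_scores = rng.normal(loc=0.60, scale=0.03, size=100)

bins = np.linspace(0.2, 0.9, 71)
bin_width = bins[1] - bins[0]

# Separate normalization: each histogram integrates to 1 on its own.
common_density, _ = np.histogram(common_scores, bins=bins, density=True)
rare_density, _ = np.histogram(rare_scores, bins=bins, density=True)

# Joint normalization: divide the raw counts by the total number of examples.
total = len(common_scores) + len(rare_scores)
rare_counts, _ = np.histogram(rare_scores, bins=bins)
rare_joint_density = rare_counts / (total * bin_width)

print(f"rare peak, separate normalization: {rare_density.max():.2f}")
print(f"rare peak, joint normalization:    {rare_joint_density.max():.2f}")
```

With joint normalization the rare class peak shrinks by roughly the class ratio, which is why it would look flat next to the common class.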

## Convert to conifer

First, we convert the anomaly detection model to `conifer` with the C++ backend to verify that we get the correct outputs.

In [None]:
cfg = conifer.backends.cpp.auto_config()
cfg['OutputDir'] = 'prj_ydf_anomaly_detection_cpp'
cfg['Precision'] = 'float'
cnf_model_cpp = conifer.converters.convert_from_ydf(model, cfg)
cnf_model_cpp.compile()

Make predictions with the conifer model

In [None]:
cnf_predictions = cnf_model_cpp.decision_function(filtered_dataset[features].to_numpy())

To compare the anomaly predictions with `ydf`, we have to raise 2 to the power of the conifer predictions.

In [None]:
print(f'First 5 conifer predictions  : {2**cnf_predictions[:5][:,0]}')
print(f'First 5 yggdrasil predictions: {predictions[:5]}')
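
The base-two exponentiation is consistent with the standard isolation forest score from the original isolation forest paper, s(x) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of x over the trees and c(n) normalizes for the number of training examples per tree. The sketch below assumes this formula applies here and that the conifer output corresponds to the exponent; both are inferences from the cell above rather than documented behaviour. The function names and the value n = 256 are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """c(n): expected depth of an unsuccessful binary search tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # Approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def log2_anomaly_score(mean_depth, n):
    """Exponent of the score: the quantity that still has to be base-two exponentiated."""
    return -mean_depth / average_path_length(n)


def anomaly_score(mean_depth, n):
    """Score in (0, 1]: close to 1 for quickly isolated (anomalous) examples."""
    return 2.0 ** log2_anomaly_score(mean_depth, n)


n = 256  # Illustrative number of examples per tree (assumption)
for mean_depth in (2.0, average_path_length(n), 20.0):
    print(f"mean depth {mean_depth:6.2f} -> score {anomaly_score(mean_depth, n):.3f}")
```

An example whose mean depth equals c(n) gets a score of exactly 0.5, and shorter paths push the score towards 1.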

Plot again the distribution of the anomaly score, adding the predictions from conifer. They should overlap well with the yggdrasil predictions

In [None]:
sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Spruce/Fir"], label="Spruce/Fir (yggdrasil)", color='b')
sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Cottonwood/Willow"], label="Cottonwood/Willow (yggdrasil)", color='orange')
# Select column 0 so that a 1-D array of scores is passed to kdeplot
sns.kdeplot(2**cnf_predictions[:, 0][filtered_dataset["Cover_Type"] == "Spruce/Fir"], label="Spruce/Fir (conifer)", linestyle='--', color='g')
sns.kdeplot(2**cnf_predictions[:, 0][filtered_dataset["Cover_Type"] == "Cottonwood/Willow"], label="Cottonwood/Willow (conifer)", linestyle='--', color='red')
plt.xlabel("predicted anomaly score")
plt.ylabel("distribution")
plt.legend()
None
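
The overlap in the plot can be complemented with a numeric check. The helper below is a sketch with a hypothetical name (`score_agreement`) and toy inputs; it only reports the largest and the mean absolute difference between two sets of scores.

```python
import numpy as np


def score_agreement(reference, candidate):
    """Summarize the absolute differences between two sets of anomaly scores."""
    reference = np.asarray(reference, dtype=float).ravel()
    candidate = np.asarray(candidate, dtype=float).ravel()
    if reference.shape != candidate.shape:
        raise ValueError("Both inputs must contain the same number of scores.")
    abs_diff = np.abs(reference - candidate)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
    }


# Toy values (assumption) standing in for the reference and converted-model scores.
toy_reference = [0.41, 0.47, 0.62]
toy_candidate = [[0.41], [0.48], [0.60]]  # A column vector is flattened by ravel()
report = score_agreement(toy_reference, toy_candidate)
print(report)
```

In this notebook the call would be `score_agreement(predictions, 2**cnf_predictions)`. What counts as an acceptable difference depends on the application.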

## FPGA

Now that we have seen that we can convert yggdrasil isolation forests and make predictions on CPU with conifer, we'll convert the same model to HLS. For that, we also need to choose the fixed-point precisions.

In [None]:
hls_cfg = conifer.backends.xilinxhls.auto_config(granularity='full')
hls_cfg['InputPrecision'] = 'ap_fixed<14,14,AP_RND_CONV,AP_SAT>'     # 14 bit integer based on the range of the input features
hls_cfg['ThresholdPrecision'] = 'ap_fixed<16,14,AP_RND_CONV,AP_SAT>' # 14 bit integer + 2 bit fractional to have some resolution between the features
hls_cfg['ScorePrecision'] = 'ap_fixed<20,11,AP_RND_CONV,AP_SAT>'     # 11 bit integer + 9 bit fractional to cover both leaf values and normalisation factor
cnf_model_hls = conifer.converters.convert_from_ydf(model, hls_cfg)
cnf_model_hls.compile()
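
To get a feeling for what these precision strings mean, the sketch below emulates `ap_fixed<W,I,AP_RND_CONV,AP_SAT>` for a single value in plain Python: `W` total bits, `I` integer bits including the sign, rounding halves to even, and saturating on overflow. The function name `quantize_ap_fixed` is hypothetical, and this is an approximation for intuition, not the HLS library itself.

```python
def quantize_ap_fixed(value, width, integer_bits):
    """Approximate ap_fixed<width, integer_bits, AP_RND_CONV, AP_SAT> for one value."""
    frac_bits = width - integer_bits
    scale = 2 ** frac_bits
    # Python's round() rounds halves to the nearest even integer (convergent rounding).
    raw = round(value * scale)
    # Saturate to the signed two's-complement range of `width` bits.
    raw_min = -(2 ** (width - 1))
    raw_max = 2 ** (width - 1) - 1
    raw = max(raw_min, min(raw_max, raw))
    return raw / scale


# ap_fixed<14,14>: integers only, saturating at 8191.
print(quantize_ap_fixed(7173, 14, 14))
print(quantize_ap_fixed(9000, 14, 14))
# ap_fixed<16,14>: steps of 0.25.
print(quantize_ap_fixed(2850.3, 16, 14))
# ap_fixed<20,11>: steps of 1/512.
print(quantize_ap_fixed(-3.14159, 20, 11))
```

Applying such a function to the feature ranges and thresholds of a model is one way to sanity-check a precision choice before running the emulation below.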

Validate the fixed-point precision choices by emulating the predictions and comparing them to the original `ydf` predictions.

In [None]:
cnf_predictions_hls = cnf_model_hls.decision_function(filtered_dataset[features].to_numpy())

In [None]:
sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Spruce/Fir"], label="Spruce/Fir (yggdrasil)", color='b')
sns.kdeplot(predictions[filtered_dataset["Cover_Type"] == "Cottonwood/Willow"], label="Cottonwood/Willow (yggdrasil)", color='orange')
sns.kdeplot(2**cnf_predictions_hls[filtered_dataset["Cover_Type"] == "Spruce/Fir"], label="Spruce/Fir (conifer)", linestyle='--', color='g')
sns.kdeplot(2**cnf_predictions_hls[filtered_dataset["Cover_Type"] == "Cottonwood/Willow"], label="Cottonwood/Willow (conifer)", linestyle='--', color='red')
plt.xlabel("predicted anomaly score")
plt.ylabel("distribution")
plt.legend()
None

Now run the HLS and HDL synthesis steps so that we can inspect the resources and latency. Check the `hls_accelerator.py` example to see how to target a supported board to produce a binary that can run on an FPGA device.

In [None]:
cnf_model_hls.build(synth=True, vsynth=True)

Print the HLS and HDL Synthesis reports

In [None]:
cnf_model_hls.read_report()