In [None]:
What is anomaly detection and what is its purpose?

In [None]:
Anomaly detection refers to the process of identifying data points or patterns that deviate from what is
considered normal or expected within a dataset. Anomaly detection can be used in a wide range of applications, such as
fraud detection, network intrusion detection, predictive maintenance, and identifying outliers in scientific data.

The purpose of anomaly detection is to automatically identify data points or patterns that are unusual or potentially interesting. 
In many cases, anomalies can be indicative of important or abnormal events that may require further investigation or action. By identifying 
anomalies in data, anomaly detection can help to improve decision-making, increase efficiency, reduce costs, and prevent potential problems
or issues from arising.

In [None]:
What are the key challenges in anomaly detection?

In [None]:
Lack of labeled data: Labeled examples of anomalies are often difficult or expensive to obtain, which makes it hard to train
supervised machine learning models. The problem is most severe for rare or novel anomalies, which may not appear in the
available data at all.

High-dimensional data: Many real-world datasets are high-dimensional, meaning they have a large number of features or variables. As the
number of dimensions grows, data becomes sparse and distances between points become less informative, which makes anomalies harder to
separate from normal points.

Imbalanced datasets: Anomaly detection datasets are often highly imbalanced, meaning that the proportion of normal data points greatly exceeds
the proportion of anomalous data points. This can make it difficult to train models that accurately identify the anomalies, as they may be
overshadowed by the abundance of normal data.

Dynamic environments: In some applications, such as network intrusion detection or fraud detection, the characteristics of normal and anomalous
behavior may change over time. This can make it difficult to build models that remain effective over long periods of time.

Interpretability: Many anomaly detection techniques, particularly those based on deep learning or other complex machine learning algorithms,
can be difficult to interpret. This can make it challenging to understand why a particular data point has been identified as anomalous, which 
can limit the usefulness of the technique in some applications.

Cost of false positives and false negatives: Depending on the application, false positives and false negatives can have different costs. In
some cases, false positives can be costly in terms of resources or missed opportunities, while false negatives can lead to serious consequences
such as security breaches or equipment failure. Finding a balance between false positives and false negatives can be challenging, especially 
when there are trade-offs between the two.

In [None]:
How does unsupervised anomaly detection differ from supervised anomaly detection?

In [None]:
Supervised anomaly detection requires labeled data, meaning data that has already been classified as either normal
or anomalous. In this approach, the algorithm is trained using examples of both normal and anomalous data points, and
it learns to differentiate between the two classes based on the labeled examples. Once the algorithm is trained, it can be used to classify
new data points as normal or anomalous based on what it learned during training.

Unsupervised anomaly detection does not require labeled data. Instead, it aims to identify patterns or data points that are significantly 
different from the rest of the data, without explicitly knowing what constitutes a normal or anomalous data point. In this approach, the
algorithm learns to identify patterns in the data that are different from what is expected based on the characteristics of the majority of 
the data points. Unsupervised anomaly detection techniques include clustering-based methods, density-based methods, and distance-based methods.

In [None]:
What are the main categories of anomaly detection algorithms?

In [None]:
Statistical methods: These methods use statistical techniques to identify data points that are significantly different from the rest of the 
data based on measures such as mean, standard deviation, or variance. Examples of statistical methods include Z-score, Dixon's Q-test, and 
Grubbs' test.

Machine learning methods: These methods use machine learning algorithms to identify anomalies based on patterns in the data. Supervised machine
learning algorithms, such as support vector machines (SVM) and decision trees, can be used when labeled data is available. Unsupervised machine
learning algorithms, such as k-means clustering and isolation forest, can be used when labeled data is not available.

Distance-based methods: These methods measure the distance between data points and use this distance to identify anomalies that are
significantly different from the rest of the data. Examples of distance-based methods include k-nearest neighbors (k-NN) and local outlier 
factor (LOF).

Density-based methods: These methods identify anomalies based on the density of the data points. Anomalies are identified as data points that
are in low-density areas or that have a significantly different density than the rest of the data. Examples of density-based methods include 
DBSCAN and LOF.

Rule-based methods: These methods use predefined rules to identify anomalies. These rules are often based on domain knowledge or expert 
opinions. Examples of rule-based methods include expert systems and decision rules.

Deep learning methods: These methods use neural networks to identify anomalies based on patterns in the data. Examples of deep learning methods
include autoencoders and convolutional neural networks (CNN).
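
As an illustration of the simplest category above, the statistical Z-score method can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the function name zscore_outliers, the sample readings, and the threshold of 2.5 are assumptions made for the example.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for every value whose |z-score| exceeds threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing can be flagged
    flagged = []
    for i, v in enumerate(values):
        z = (v - mean) / std
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged

# Hypothetical sensor readings with one obvious outlier at the end.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 25.0]
outliers = zscore_outliers(readings, threshold=2.5)
print(outliers)  # only the last reading (index 9) is flagged
```

The threshold is lowered to 2.5 here because the outlier itself inflates the standard deviation: in a sample of 10 points, no value can have a population Z-score above 3, so the usual cutoff of 3 would flag nothing.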

In [None]:
What are the main assumptions made by distance-based anomaly detection methods?

In [None]:
Euclidean distance: Many distance-based methods measure the distance between data points using Euclidean distance, and assume that this
distance reflects how similar two points are. This works well for numerical data on comparable scales, but it may not be appropriate for
other types of data, such as categorical or text data, where a different metric is needed.

Normal points are close together: Distance-based methods do not require the data to be normally distributed, but they do assume that normal
data points lie close to many other points, while anomalies lie far from their nearest neighbors. If normal data is itself sparse, or if
anomalies form their own tight cluster, distance-based methods may not be effective.

Comparable, uncorrelated features: Euclidean distance gives every feature equal weight and ignores correlations between features. If the
features are on very different scales or are strongly correlated, the distances can be misleading unless the data is rescaled or a metric
such as Mahalanobis distance is used.

Fixed number of clusters: Some distance-based methods, such as k-means clustering, assume that the data can be partitioned into a fixed number
of clusters. This may not be appropriate for all datasets, especially those with complex or overlapping clusters.

Stationarity: Distance-based methods assume that the statistical properties of the data remain constant over time. This means that the 
characteristics of normal and anomalous data points do not change over time. If the characteristics of the data change over time, distance-based methods may not be effective.

In [None]:
How does the LOF algorithm compute anomaly scores?

In [None]:
The LOF algorithm computes a score for each data point based on its local density compared to the densities of its k-nearest neighbors. The algorithm works as follows:

For each data point, the k-nearest neighbors are identified based on a distance metric such as Euclidean distance.

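The local reachability density (LRD) of each data point is computed. The reachability distance from a point p to a neighbor o is the larger of the true distance between them and the k-distance of o (the distance from o to its own k-th nearest neighbor). The LRD of p is the inverse of the average reachability distance from p to its k-nearest neighbors. It provides an estimate of the density of the data points in its local neighborhood.

The local outlier factor (LOF) of each data point is computed. The LOF of a data point measures the extent to which it is an outlier compared to its k-nearest neighbors. It is computed as the ratio of the average LRD of the k-nearest neighbors of a data point to its own LRD. A LOF close to 1 means the point is about as dense as its neighbors, while a LOF well above 1 means the point lies in a sparser region than its neighbors.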

A threshold is applied to the LOF scores to identify anomalous data points. Data points with a LOF score above the threshold are considered to be anomalous.
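
The steps above can be sketched in plain Python. This is a brute-force illustration under stated assumptions, not the implementation of any particular library: the function name lof_scores and the sample points are hypothetical, ties at the k-th distance are ignored, and the data is assumed to contain no duplicate points (which would make a density infinite).

```python
import math

def lof_scores(points, k):
    """Local outlier factor for each point, using Euclidean distance."""
    n = len(points)
    # Step 1: the k nearest neighbours of each point as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])
    # k-distance: distance from each point to its own k-th nearest neighbour.
    k_distance = [nb[-1][0] for nb in neighbors]

    # Step 2: local reachability density = 1 / mean reachability distance,
    # where reach(p, o) = max(k_distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # Step 3: LOF = mean LRD of the neighbours divided by the point's own LRD.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Five points in a tight cluster plus one far-away point (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

The clustered points come out with scores near 1, while the isolated point at (5.0, 5.0) scores several times higher, so any reasonable threshold above 1 would flag only that point.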

In [None]:
What are the key parameters of the Isolation Forest algorithm?

In [None]:
The Isolation Forest algorithm has two main parameters: the number of trees (n_estimators) and the size of the subsample used to build each
tree (max_samples). The parameter names used here are those of scikit-learn's IsolationForest.

n_estimators: This parameter controls the number of trees in the isolation forest. Adding trees makes the averaged path lengths, and therefore
the anomaly scores, more stable, but it also increases the computational cost. The default value is 100.

max_samples: This parameter controls the size of the subsample used to build each tree. A smaller subsample speeds up the computation, but
a subsample that is too small may not represent the normal data well. The original paper uses 256, and scikit-learn's default of "auto"
uses the smaller of 256 and the number of samples.

Tree depth: There is no max_depth parameter to tune. Each tree is grown until every point is isolated or a height limit of
ceil(log2(max_samples)) is reached, which is 8 for a subsample of 256. Anomalies are isolated in short paths, so growing trees beyond this
limit adds little.

There are also other optional parameters that can be used to fine-tune the Isolation Forest algorithm, such as the contamination parameter
(contamination) and the random seed (random_state).

contamination: This parameter controls the expected proportion of anomalies in the data. It is used to compute the decision threshold for
classifying data points as anomalies or non-anomalies, and it does not change the anomaly scores themselves. In scikit-learn the default
is "auto", which uses the threshold from the original paper; older releases used a default of 0.1.

random_state: This parameter sets the random seed for the Isolation Forest algorithm. It is used to ensure reproducibility of the results.
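
To show where these parameters enter the algorithm, here is a small pure-Python sketch of an isolation forest. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names are hypothetical, the keyword names n_estimators, max_samples, and random_state are borrowed from the text, the data needs at least two rows, and the height limit of ceil(log2(subsample size)) follows the original paper.

```python
import math
import random

def c(n):
    """Expected path length of an unsuccessful BST search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, depth, height_limit, rng):
    """Split rows on a random feature and value until isolated or too deep."""
    if depth >= height_limit or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop instead of trying another feature
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, height_limit, rng),
        "right": build_tree(right, depth + 1, height_limit, rng),
    }

def path_length(x, node, depth=0):
    """Depth at which x lands in a leaf, plus c(leaf size) for unsplit points."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["feature"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def isolation_scores(data, n_estimators=100, max_samples=256, random_state=0):
    rng = random.Random(random_state)       # random_state: reproducibility
    psi = min(max_samples, len(data))       # max_samples: subsample per tree
    height_limit = math.ceil(math.log2(psi))
    trees = [
        build_tree(rng.sample(data, psi), 0, height_limit, rng)
        for _ in range(n_estimators)        # n_estimators: number of trees
    ]
    scores = []
    for x in data:
        mean_path = sum(path_length(x, t) for t in trees) / n_estimators
        scores.append(2.0 ** (-mean_path / c(psi)))
    return scores

# 200 points around the origin plus one far-away point (hypothetical data).
gen = random.Random(42)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)] + [(8.0, 8.0)]
scores = isolation_scores(data, n_estimators=100, max_samples=256, random_state=0)
print(scores.index(max(scores)))  # index of the most anomalous point
```

A contamination value would be applied after this step, for example by flagging the top fraction of scores, which is why it changes the labels but not the scores.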

In [None]:
If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?

In [None]:
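To compute the anomaly score of a data point using the KNN algorithm, we find its K nearest neighbors and take either the distance to the
K-th nearest neighbor or the average distance to all K neighbors. A larger value means the point lies in a sparser region and is more
likely to be an anomaly.

In this case, the data point has only 2 neighbors within a radius of 0.5. With K=10, the remaining 8 of its 10 nearest neighbors must lie
farther than 0.5 away, so the distance to the 10th neighbor is greater than 0.5. The exact score cannot be computed without the actual
distances, but the point sits in a sparser region than one that has all 10 neighbors within 0.5, so it receives a relatively high anomaly
score.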
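
One common convention scores a point by its mean distance to the K nearest neighbours, or by the distance to the K-th neighbour, with larger values meaning more isolated. The sketch below follows that convention; the function name knn_anomaly_score and the coordinates are hypothetical, chosen so that exactly 2 neighbours fall within a radius of 0.5.

```python
import math

def knn_anomaly_score(point, others, k):
    """Return (mean distance to the k nearest neighbours, distance to the k-th)."""
    dists = sorted(math.dist(point, o) for o in others)
    if len(dists) < k:
        raise ValueError("need at least k other points")
    nearest = dists[:k]
    return sum(nearest) / k, nearest[-1]

query = (0.0, 0.0)
others = [
    (0.3, 0.0), (0.0, 0.4),                            # 2 neighbours within 0.5
    (2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0),  # 4 at distance 2
    (3.0, 0.0), (-3.0, 0.0), (0.0, 3.0), (0.0, -3.0),  # 4 at distance 3
]
mean_dist, kth_dist = knn_anomaly_score(query, others, k=10)
print(mean_dist, kth_dist)
```

With this made-up data the 10th-neighbour distance is 3.0, well beyond the 0.5 radius, which is what marks the query point as sitting in a sparse region.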

In [None]:
Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?

In [None]:
In the Isolation Forest algorithm, the anomaly score of a data point is computed from its average path length across the isolation
trees, normalized by the expected path length of an unsuccessful search in a binary search tree built on n points. This expected path length
depends only on n, the number of points used to build each tree. The question gives a dataset of 3000 points, so n = 3000 is used here; if
each tree is built on a subsample (for example 256 points), n should be the subsample size instead.

For a data point with an average path length of 5.0, we can compute the anomaly score as follows:

Compute the expected average path length:
c(n) = 2 * (ln(n - 1) + 0.5772156649) - 2 * (n - 1) / n

expected_avg_path_length = c(3000)

Here ln is the natural logarithm and 0.5772156649 is the Euler-Mascheroni constant; together they approximate the harmonic number H(n - 1).
In this case, c(3000) = 2 * (8.0060 + 0.5772) - 1.9993 ≈ 15.167.

Compute the anomaly score of the data point:
anomaly_score = 2 ** (-5.0 / expected_avg_path_length)

The average path length of the data point is divided by the expected average path length, and the negated ratio is used as the exponent of 2.
Scores close to 1 indicate anomalies, and scores well below 0.5 indicate normal points.

With expected_avg_path_length ≈ 15.167 and an average path length of 5.0:

anomaly_score = 2 ** (-5.0 / 15.167) ≈ 0.796

A path length of 5.0 is much shorter than the expected 15.167, so the point is isolated quickly and is likely an anomaly. The number of
trees (100) only affects how stable the average path length is; it does not appear in the formula.
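
The arithmetic can be checked directly by evaluating the formula for c(n) with the natural logarithm. The function names below are hypothetical, and n is taken to be the full dataset size of 3000, as the question states.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_path_length(n):
    """c(n): expected path length of an unsuccessful BST search among n points."""
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation Forest score s = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-avg_path_length / expected_path_length(n))

c_3000 = expected_path_length(3000)
score = anomaly_score(5.0, 3000)
print(round(c_3000, 3), round(score, 3))  # 15.167 0.796
```

If the trees were instead built on subsamples of 256 points, calling anomaly_score(5.0, 256) would use c(256) for the normalization and give a lower score for the same path length.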