Q1. What is anomaly detection and what is its purpose?
--
---
Anomaly detection is the process of identifying data points, events, or observations that deviate significantly from the majority of a data set and do not conform to a well-defined notion of normal behavior. Such items are typically rare.

The purpose of anomaly detection is manifold:
- It's critical for industrial applications, especially for predictive and prescriptive maintenance.
- It helps in defining system baselines, identifying deviations from that baseline, and investigating inconsistent data.
- It's used to minimize risk factors, improve communication, and enable real-time anomaly monitoring.
- It's important within financial data because it can indicate potential risks, control failures, or business opportunities.
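The idea of defining a baseline and identifying deviations from it can be sketched in a few lines. This is a minimal illustration, not a production detector: it assumes the baseline readings are known to be normal, and the function name `flag_deviations`, the sample readings, and the threshold of 3 standard deviations are all illustrative choices.

```python
from statistics import mean, stdev

def flag_deviations(baseline, new_values, threshold=3.0):
    """Return new values more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation of the baseline
    return [x for x in new_values if abs(x - mu) / sigma > threshold]

# Hypothetical sensor readings that define "normal" behavior
baseline_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]

# Only the last new reading is far from the baseline
print(flag_deviations(baseline_readings, [20.2, 19.9, 23.5]))  # [23.5]
```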

Q2. What are the key challenges in anomaly detection?
--
---
Anomaly detection is a complex task and faces several challenges:

1. Data Quality: The quality of the underlying dataset is the biggest driver in creating an accurate model. Problems can include null data or incomplete datasets, inconsistent data formats, duplicate data, different scales of measurement, and human error.

2. Training Sample Sizes: If the training set is too small, the algorithm doesn’t have enough exposure to past examples to build an accurate representation of the expected value at a given time. A small training set is also more sensitive to any anomalies it contains: they skew the baseline, which reduces the overall accuracy of the model.

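3. Feature Extraction: Choosing or engineering features that actually separate normal from abnormal behavior is difficult, because anomalies are rare and often not known in advance.

4. Defining Normal Behaviors: The boundary between normal and abnormal behavior is often imprecise, and what counts as normal can change over time.

5. Handling Imbalanced Distribution: Anomalies are rare, so normal examples usually far outnumber abnormal ones. This makes models harder to train and makes plain accuracy a misleading metric.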
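To illustrate the imbalance problem, here is a small sketch with made-up labels (990 normal points, 10 anomalies) and a hypothetical "model" that always predicts normal. The helper functions `accuracy` and `recall` are written by hand for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true anomalies (label `positive`) that were detected."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for _, p in positives) / len(positives)

# Made-up data: 1 = anomaly, 0 = normal
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a useless model that never flags anything

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

The model looks 99% accurate while missing every anomaly, which is why metrics such as recall or precision are usually reported for this kind of task.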

Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?
---
----
The main difference between supervised and unsupervised anomaly detection lies in the approach and the type of data they deal with:

Supervised Anomaly Detection: This type of anomaly detection requires labeled data, where for each row you know whether it is an outlier/anomaly or not. A model is trained on those labels, and any modeling technique for binary responses will work here, such as logistic regression or gradient boosting. A supervised approach is most useful when the kinds of anomalies are known in advance and the dataset patterns are highly predictable and repetitive. It is less suitable when labels are scarce, when the data is high-volume or complex, or when new kinds of anomalies are expected, because the model can only recognize what it was trained on. It can also be more demanding in terms of resources and maintenance, as engineers are required to continuously reassess the labels and update the model.

Unsupervised Anomaly Detection: This type of anomaly detection analyzes unlabeled data, which may itself contain anomalies, without being told in advance which points are abnormal. It models the general structure of the data, for example through clusters, density, or distance, and flags the points that do not fit the overall pattern as anomalies. An unsupervised approach is often preferred in practice, because labels are rarely available and such tasks often involve multiple variables and relatively high complexity, so a rule-based or supervised approach may not be effective. When reliable labels do exist, a supervised model is usually more accurate on the anomaly types it was trained on.
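The contrast can be sketched on a tiny one-dimensional example. Everything here is illustrative: the data is made up, anomalies are assumed to be unusually large values, the supervised side simply places a cut halfway between the labeled classes, and the unsupervised side uses a median/MAD rule with an assumed cutoff of 3.5.

```python
from statistics import median

def fit_supervised_threshold(values, labels):
    """Supervised: use labels (1 = anomaly) to place a cut between the two classes."""
    normal_max = max(v for v, y in zip(values, labels) if y == 0)
    anomaly_min = min(v for v, y in zip(values, labels) if y == 1)
    return (normal_max + anomaly_min) / 2

def unsupervised_flags(values, cutoff=3.5):
    """Unsupervised: no labels; flag points far from the median, measured in MAD units."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # assumed to be non-zero
    return [abs(v - med) / mad > cutoff for v in values]

values = [10, 11, 9, 10, 12, 11, 10, 30]
labels = [0, 0, 0, 0, 0, 0, 0, 1]  # only the supervised method sees these

threshold = fit_supervised_threshold(values, labels)
print(threshold)                   # 21.0
print(unsupervised_flags(values))  # only the last point is flagged
```

Both methods agree on this toy data, but only the first one needed labels to get there.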

Q4. What are the main categories of anomaly detection algorithms?
--
----
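Anomaly detection algorithms fall into a few broad families: isolation-based, density-based, statistical, and boundary-based. Commonly used algorithms from these families are:

   - Isolation Forest: This is an unsupervised, isolation-based algorithm. It builds an ensemble of randomly partitioned decision trees (isolation trees); anomalies are easier to isolate, so they end up with shorter average path lengths.
   - Local Outlier Factor: This density-based technique compares the local density of a point with the density around its neighbors.
   - Robust Covariance: This statistical technique fits a robust estimate of the data's covariance (an elliptic envelope, assuming roughly Gaussian data) and flags points that lie far from the fit.
   - One-Class Support Vector Machine (SVM): This boundary-based technique learns a frontier around the normal data; points that fall outside it are treated as anomalies.
   - One-Class SVM with Stochastic Gradient Descent (SGD): This is a variation that solves a linear One-Class SVM with SGD, which scales better to large datasets.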

Q5. What are the main assumptions made by distance-based anomaly detection methods?
--
---
The main assumptions made by distance-based anomaly detection methods are:

1. Normal instances are related and appear close to each other: This assumption is based on the idea that normal data points in a dataset will have similar characteristics and therefore will be located near each other in the data space.

2. Anomalies are different and relatively far from other instances: Anomalies, or outliers, are assumed to be different from the normal data points and therefore will be located far from the other data points.

3. Anomaly score is derived from the distances to the k-nearest neighbours: A common choice is the sum (or average) of the distances between a data point and its k-nearest neighbours; another is the distance to the k-th nearest neighbour alone.

4. Scores separate the two groups: Because normal data points are close to their neighbours, they receive low scores, while anomalous data points are far from the normal data and receive high scores.
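The scoring rule in point 3 can be sketched as follows. This is a brute-force illustration using Euclidean distance on a made-up set of 2-D points; the function name `knn_distance_scores` is hypothetical, and a real implementation would typically use a spatial index rather than comparing every pair.

```python
from math import dist

def knn_distance_scores(points, k):
    """Score each point by the sum of distances to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]))
    return scores

# Four points on a unit square plus one point far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = knn_distance_scores(points, k=2)
print(scores)  # the last point has by far the largest score
```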

Q6. How does the LOF algorithm compute anomaly scores?
--
---
The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method. It computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.

Here's how it works:
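- For each data point, the algorithm finds its k nearest neighbors and the distance to the k-th one (the k-distance).
- It computes the reachability distance from the point to each neighbor, which is the larger of the actual distance and that neighbor's own k-distance.
- The local reachability density (LRD) of the point is the inverse of its average reachability distance to its neighbors.
- The LOF score is the average LRD of the neighbors divided by the point's own LRD.
- A score close to 1.0 means the point has a density similar to its neighbors. Scores well above 1.0 mean the point lies in a sparser region than its neighbors, so it is more likely to be an outlier or anomaly.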

In essence, the LOF algorithm identifies those data points that have a substantially lower density than their neighbors as anomalies.
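A minimal brute-force sketch of the LOF computation is shown below. It makes simplifying assumptions: Euclidean distance, exactly k neighbors per point (ties at the k-distance are not expanded), and no duplicate points that would make a density infinite. The function name `lof_scores` and the sample points are illustrative.

```python
from math import dist

def lof_scores(points, k):
    """Return the Local Outlier Factor of each point (simplified, O(n^2))."""
    n = len(points)
    # For each point: its k nearest neighbours as (distance, index) pairs
    neigh = []
    for i in range(n):
        d = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neigh.append(d[:k])
    # k-distance: distance to the k-th nearest neighbour
    k_dist = [nb[-1][0] for nb in neigh]

    def lrd(i):
        # Reachability distance to neighbour j: max(k-distance of j, d(i, j))
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF: average neighbour density divided by the point's own density
    return [sum(lrds[j] for _, j in neigh[i]) / (k * lrds[i]) for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print(scores)  # square corners score about 1.0, the far point scores much higher
```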

Q7. What are the key parameters of the Isolation Forest algorithm?
--
----
The key parameters of the Isolation Forest algorithm, as exposed by scikit-learn's `IsolationForest`, are:

1. n_estimators: The number of base estimators in the ensemble.
2. max_samples: The number of samples to draw from X to train each base estimator.
3. contamination: The amount of contamination of the data set, i.e., the proportion of outliers in the data set. This parameter is used when fitting to define the threshold on the scores of the samples.
4. max_features: The number of features to draw from X to train each base estimator.
5. bootstrap: If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
6. n_jobs: The number of jobs to run in parallel for both fit and predict.
7. random_state: Controls the pseudo-randomness of the selection of the feature and split values for each branching step and each tree in the forest.
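As a sketch, the parameters above could be collected into a configuration like the following. The specific values are illustrative assumptions, not recommendations, and the snippet only builds the model if scikit-learn happens to be installed.

```python
# Illustrative settings; the keys mirror the parameter names listed above
iforest_params = {
    "n_estimators": 100,    # number of trees in the ensemble
    "max_samples": 256,     # samples drawn to train each tree
    "contamination": 0.01,  # assumed share of outliers (1%)
    "max_features": 1.0,    # use all features for every tree
    "bootstrap": False,     # sample without replacement
    "n_jobs": -1,           # use all available CPU cores
    "random_state": 42,     # fixed seed for reproducible trees
}

try:
    from sklearn.ensemble import IsolationForest
    model = IsolationForest(**iforest_params)
except ImportError:
    model = None  # scikit-learn not installed; the dict still documents the settings
```

After fitting, `model.predict(X)` returns -1 for points judged to be outliers and 1 for inliers.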

Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?
--
----
KNN does not define a single standard anomaly score, so the exact number depends on the scoring convention. If the score is based on class membership, it is determined by how many of the K nearest neighbors belong to a different class than the data point:

1. If the majority of the neighbors (K/2 + 1) have the same class as the data point, the anomaly score is low.
2. If the majority of the neighbors have a different class than the data point, the anomaly score is high.

In this scenario, K is set to 10, so the majority would be at least 6 neighbors. Assuming the 2 neighbors within the radius of 0.5 are the only ones among the 10 nearest that share the data point's class, 2 is fewer than 6, so the anomaly score is high and the point would be treated as a likely anomaly.
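One way to turn this majority rule into a number is to use the fraction of the K nearest neighbors whose class differs from the data point's class. This convention is an assumption made for illustration (the question does not specify one), as are the class names `"A"` and `"B"` and the function name.

```python
def class_disagreement_score(neighbour_labels, own_label):
    """Fraction of the nearest neighbours whose class differs from the point's class."""
    k = len(neighbour_labels)
    different = sum(label != own_label for label in neighbour_labels)
    return different / k

# K = 10 neighbours: 2 share the point's class "A", 8 belong to another class
labels = ["A"] * 2 + ["B"] * 8
score = class_disagreement_score(labels, "A")
print(score)  # 0.8
```

Under this convention the score is 0.8, which is well above 0.5 and matches the "high" verdict of the majority rule.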

Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?
--
---
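The anomaly score for the data point is approximately **0.80**, assuming each tree is built on all 3000 data points.

Here's how we can calculate it:

1. **Isolation Forest scores:** Isolation Forest assigns anomaly scores between 0 and 1. Values close to 1 indicate anomalies, values well below 0.5 indicate normal points, and values around 0.5 mean the point is not clearly distinguishable.
2. **Normalising the path length:** The data point's average path length over the trees (5.0) is compared with `c(n)`, the average path length of an unsuccessful search in a binary search tree built on `n` points. It is defined as `c(n) = 2 * H(n - 1) - 2 * (n - 1) / n`, where `H(i)` is approximately `ln(i) + 0.5772` (Euler's constant).
3. **Score formula:** The Isolation Forest anomaly score is calculated using the following formula:

```
anomaly_score = 2 ^ (-data_point_path_length / c(n))
```

4. **Plugging in values:** In your case, `data_point_path_length = 5.0` and `n = 3000`, so `c(3000) = 2 * (ln(2999) + 0.5772) - 2 * 2999 / 3000 ≈ 17.166 - 1.999 ≈ 15.167`. Substituting these values into the formula, we get:

```
anomaly_score = 2 ^ (-5.0 / 15.167) = 2 ^ (-0.3297) ≈ 0.796
```

Therefore, the anomaly score for the data point is about 0.80. A path length of 5.0 is much shorter than the expected 15.17, so the point is isolated quickly and is likely an anomaly. The number of trees (100) only determines how many path lengths are averaged; it does not appear in the formula. If each tree is instead built on a subsample (for example 256 points, the default cap in scikit-learn), then `c(256) ≈ 10.24` and the score is about 0.71.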
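For reference, the normalised score from the original Isolation Forest paper, `s = 2^(-E[h(x)] / c(n))`, can be evaluated directly. The sketch below uses the usual approximation `H(i) ≈ ln(i) + 0.5772156649` for the harmonic number; the function names are illustrative, and `n` is taken to be the number of points each tree is built on.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation Forest score: close to 1 = anomaly, well below 0.5 = normal."""
    return 2 ** (-avg_path_length / average_path_length(n))

print(round(average_path_length(3000), 3))  # about 15.167
print(round(anomaly_score(5.0, 3000), 3))   # about 0.796
print(round(anomaly_score(5.0, 256), 3))    # about 0.713 with 256-point subsamples
```

With a path length of 5.0 the score comes out near 0.80 for `n = 3000`, well above 0.5, which points towards an anomaly rather than a normal point.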