# 1] What is anomaly detection and what is its purpose?


### => Anomaly detection is the process of identifying data points, events or observations that deviate significantly from the norm. The main purposes of anomaly detection are:

## 1) Identifying outliers: 
### => Anomaly detection can help identify outliers that are due to errors, faults or abnormalities in the data. This allows for correcting or removing the outliers before they negatively impact model training or predictions.
## 2) Detecting novelties: 
### => It can allow the discovery of unknown phenomena and new patterns in data that do not conform to expected behavior. This includes detecting fraud, network intrusions, breakdowns, etc.
## 3) Enhancing robustness: 
### => Anomaly detection improves the robustness of models by flagging examples that differ from the majority, so they can be removed or handled separately. Models trained on clean, regular data are less likely to fail on new data.
## 4) Identifying change: 
### => It enables the monitoring of changes in behavior over time to detect if a process or system has deviated from its baseline. This allows identifying if something fundamental has changed.
## 5) Providing insights: 
### => Understanding anomalies and their underlying causes can provide useful business insights and opportunities for improvements.

# 2] What are the key challenges in anomaly detection?


## 1) Defining normal: 
### => It can be difficult to precisely define what constitutes "normal" behavior, especially in complex systems. Without a good sense of normal, it's hard to identify anomalies.
## 2) Imbalanced data:
### => Anomalies by definition are rare events compared to normal data. The extreme imbalance between anomalies and normal data makes modeling challenging.
## 3) Masking:
### => Anomalies may be masked by, or absorbed into, large clusters of normal data points. This makes them difficult to discern from the surrounding data.
## 4) Data quality:
### => Poor data quality, such as missing values and noise, can substantially increase false positives and false negatives. Data preprocessing is critical.
## 5) Context dependence:
### => Anomalies are often context specific and what is anomalous in one scenario may be perfectly normal in another. Accounting for context is challenging.
## 6) Concept drift: 
### => In dynamic systems, the concept of normal behavior can itself drift over time, requiring the models to be updated.
## 7) Interpretability: 
### => It's often not enough just to detect anomalies, but also critical to understand why a data point is anomalous. Lack of interpretability makes debugging and root cause analysis tough.
## 8) Evaluation:
### => Defining good evaluation metrics and getting annotated anomaly data for validation is difficult.

# 3] How does unsupervised anomaly detection differ from supervised anomaly detection?


## 1) Training data:
### => Unsupervised anomaly detection does not require labeled or annotated data. It works on unlabeled data that is assumed to be mostly normal.
### => Supervised requires data labeled as normal and anomalous to train models.
## 2) Assumptions:
### => Unsupervised methods assume anomalies are rare and different from normal instances.
### => Supervised methods do not rely on the rarity assumption, but they do assume the labeled anomalies are representative of the anomalies seen at prediction time.
## 3) Models:
### => Unsupervised methods typically rely on statistical models or unsupervised ML models like clustering.
### => Supervised methods leverage classification algorithms, neural networks etc.
## 4) Detection:
### => Unsupervised methods detect anomalies by identifying points deviating from clusters or density estimates of normal data.
### => Supervised models detect anomalies based on learned decision boundaries between normal and anomalous classes.
## 5) Performance:
### => Unsupervised methods generally have higher false-positive rates because they rely on assumptions rather than labels.
### => Supervised models can leverage labeled data to improve detection accuracy.
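
### => The contrast above can be sketched on a toy 1-D dataset. This is only an illustration under stated assumptions: the data, the function names and the cutoff of 3.5 are hypothetical, the unsupervised side uses a median/MAD modified z-score, and the supervised side is a one-feature threshold that assumes anomalies take larger values than normal points.

```python
import statistics

def unsupervised_flags(values, cutoff=3.5):
    """Flag points far from the median without using any labels.

    Uses the modified z-score 0.6745 * |x - median| / MAD.
    Assumes MAD is non-zero (not all values identical).
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [0.6745 * abs(v - med) / mad > cutoff for v in values]

def fit_supervised_threshold(values, labels):
    """Learn a 1-D decision boundary from labeled data (True = anomaly).

    Assumes anomalies are larger than every normal value.
    """
    normal_max = max(v for v, y in zip(values, labels) if not y)
    anomaly_min = min(v for v, y in zip(values, labels) if y)
    return (normal_max + anomaly_min) / 2

# Hypothetical data: seven normal readings and one anomaly.
values = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 6.0]
labels = [False] * 7 + [True]

flags = unsupervised_flags(values)                     # no labels needed
threshold = fit_supervised_threshold(values, labels)   # labels required
print(flags, threshold)
```

### => The unsupervised function never sees `labels`, so it depends entirely on the assumption that anomalies are rare and far from the bulk of the data. The supervised function needs at least one labeled anomaly before it can be fitted.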

# 4] What are the main categories of anomaly detection algorithms?


## 1) Statistical models: 
### => These assume data is generated from a statistical distribution and identify anomalies as deviations from that distribution. Examples include Gaussian mixture models and multivariate Gaussians.
## 2) Proximity-based models:
### => These use distance or density measures to identify anomalies that are far from their neighbors. Examples include k-nearest neighbors and local outlier factor algorithms.
## 3) Clustering models:
### => These rely on clustering algorithms such as k-means or DBSCAN to group similar data points. Points that belong to no cluster, lie far from their cluster centroid, or fall in very small clusters are flagged as anomalies.
## 4) Classification models:
### => These leverage classification algorithms like SVM and random forest trained on labeled data to classify new points as anomalous or normal.
## 5) Neural networks:
### => Models like autoencoders and deep belief networks are used to learn representations of normal data. Reconstruction errors are used to detect anomalies.
## 6) Information theoretic models:
### => These detect anomalies based on metrics like entropy and Kolmogorov complexity which quantify information content of data points.
## 7) Ensemble models: 
### => Combinations of multiple models can be used together to improve anomaly detection through techniques like stacking and voting.
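
### => As a minimal sketch of the statistical category, the snippet below fits a univariate Gaussian to reference data and flags new points whose z-score exceeds 3. The data, the function names and the threshold of 3 are illustrative assumptions, not values taken from these notes.

```python
import statistics

def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of the reference data."""
    return statistics.mean(values), statistics.stdev(values)

def z_scores(values, mean, std):
    """Absolute z-score of each value under the fitted Gaussian."""
    return [abs(v - mean) / std for v in values]

# Hypothetical reference data assumed to be normal.
train = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 9.7, 10.3, 10.0]
mean, std = fit_gaussian(train)

# Score unseen points; anything more than 3 standard deviations away is flagged.
new_points = [10.05, 12.0]
flags = [z > 3.0 for z in z_scores(new_points, mean, std)]
print(flags)  # [False, True]
```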

# 5] What are the main assumptions made by distance-based anomaly detection methods?


### => Anomalies are rare occurrences within the data. Typically it is assumed anomalies make up a very small percentage (1-5%) of the overall data.
### => Normal instances lie close to their closest neighbors while anomalies are far away. The distance between a normal data point and its k-nearest neighbors will be small compared to the distances for anomalous points.
### => Normal data points belong to dense neighborhoods and clusters. The local density around most normal points is assumed to be higher than that around anomalies.
### => Similar instances have small pairwise distances while anomalies are dissimilar to normal points. The distance metrics used accurately capture the notion of similarity for the data.
### => The boundaries between normal and anomalous regions are defined by the distances of points from their neighbors. Thresholds based on these distances can segregate anomalies from normal instances.
### => The distance metrics generalize to unseen data. Distance measures that work well on training data will also be meaningful for new test data.
### => There are no camouflaged anomalies. Anomalies do not take on values that mask their presence within clusters of normal data points.

# 6] How does the LOF algorithm compute anomaly scores?


### => For each point, identify its k-nearest neighbors (the local neighborhood).
### => Calculate the reachability distance from each point to each of its neighbors. This is the maximum of the neighbor's k-distance (the distance from the neighbor to its own kth nearest neighbor) and the actual distance between the two points.
### => Compute the local reachability density (LRD) of each point, which is the inverse of the average reachability distance from the point to its neighbors.
### => Calculate the local outlier factor (LOF) of each point, which is the ratio of the average LRD of its neighbors to its own LRD.
### => Points with LOF well above 1 have lower density than their neighbors and are considered anomalies. Values near 1 indicate a density similar to that of the neighborhood.
### Intuitively, anomalies have lower local density and are more isolated. LOF captures this by comparing the relative densities. It doesn't require knowing the global data distribution.
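
### => The steps above can be sketched in plain Python. This is a simplified illustration, not a reference implementation: it takes exactly k neighbors per point (ties at the k-distance are not expanded), assumes no duplicate points, and uses a hypothetical 2-D dataset with k = 3.

```python
import math

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [nb[-1][0] for nb in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_dist[j], d(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean LRD of the neighbors divided by the point's own LRD.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Hypothetical data: a tight cluster plus one isolated point at (5, 5).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

### => In this toy dataset the cluster points score close to 1, while the isolated point scores several times higher.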

# 7] What are the key parameters of the Isolation Forest algorithm?


## 1) n_estimators 
### => The number of isolation trees to construct in the ensemble. More trees generally give more stable scores at a higher computing cost. The scikit-learn default is 100.
## 2) max_samples
### => The number of samples to draw from the dataset when constructing each isolation tree. Smaller subsamples make each tree cheaper to build and more random. Typical values are in the low hundreds; the scikit-learn default, "auto", uses min(256, n_samples).
## 3) max_features 
### => The number of features drawn from the dataset to train each isolation tree. Can be an integer, or a float in (0, 1] giving a fraction of the features. Lower values make the model more random and may improve anomaly detection.
## 4) random_state 
### => The seed used to initialize the random number generator for reproducibility.
## 5) n_jobs 
### => The number of CPU cores to use for parallel processing. -1 uses all available cores.
## 6) contamination 
### => The expected proportion of anomalies in the data, used to set the threshold on the anomaly scores.
## 7) max_depth 
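### => Maximum depth of each isolation tree. It normally does not need tuning. Note that scikit-learn's IsolationForest does not expose this as a constructor parameter; it limits tree depth internally to ceil(log2(max_samples)).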
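
### => A minimal sketch of how these parameters are passed to scikit-learn's `IsolationForest`. The synthetic data and the chosen values are assumptions for illustration only; `max_depth` is left out because the scikit-learn constructor does not accept it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 500 normal points around the origin plus 2 obvious outliers.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([X_normal, X_outliers])

model = IsolationForest(
    n_estimators=100,     # number of isolation trees
    max_samples=256,      # subsample size per tree
    max_features=1.0,     # use all features for every tree
    contamination=0.01,   # expected fraction of anomalies
    random_state=42,      # reproducible results
    n_jobs=1,             # set to -1 to use all available cores
)

labels = model.fit_predict(X)    # -1 = anomaly, 1 = normal
scores = model.score_samples(X)  # lower = more anomalous
print(labels[-2:], scores[-2:])
```

### => `contamination` only moves the cutoff used by `fit_predict`; the raw values from `score_samples` do not depend on it.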

# 8] If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?



### => In KNN anomaly detection, the anomaly score for a data point is typically calculated from the distance to its Kth nearest neighbor.

### => Since K=10 in this case, we are interested in the distance to the 10th nearest neighbor.

### => The question states that within a radius of 0.5, there are only 2 neighbors of the same class.

### => This means the distance to the 10th nearest neighbor must be greater than 0.5.

### => As the anomaly score is based on this distance, with a higher distance implying a higher chance of being an anomaly, the anomaly score for this point will be relatively high.

### => Without knowing the exact distance to the 10th neighbor, it is not possible to calculate the precise anomaly score.

### => But we can conclude that a point with only 2 neighbors within a radius of 0.5, when K=10, sits in a sparse region and is likely to have a high anomaly score.
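
### => The reasoning above can be checked with a small sketch. The coordinates are hypothetical and chosen only so that exactly 2 points fall within radius 0.5; the score is the distance to the Kth nearest neighbor and, as an assumption of this sketch, class labels are ignored.

```python
import math

def knn_anomaly_score(query, points, k):
    """Distance from `query` to its k-th nearest neighbor in `points`."""
    distances = sorted(math.dist(query, p) for p in points)
    return distances[k - 1]

# Hypothetical data: 2 neighbors inside radius 0.5, the other 8 farther away.
query = (0.0, 0.0)
points = [(0.3, 0.0), (0.0, 0.4)] + [(1.0 + 0.1 * i, 0.0) for i in range(8)]

inside = sum(math.dist(query, p) <= 0.5 for p in points)
score = knn_anomaly_score(query, points, k=10)
print(inside, score)
```

### => With only 2 points inside the radius, the 10th nearest neighbor necessarily lies outside it, so the score exceeds 0.5 (here it is about 1.7).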

# 9] Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?

In [1]:
import math

In [2]:
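def anomaly_score(n, average_path_length):
    # Harmonic number H(n-1), approximated as ln(n-1) + Euler-Mascheroni constant
    H = math.log(n - 1) + 0.5772156649
    # c(n): average path length of an unsuccessful search in a binary search tree
    c = 2 * H - (2 * (n - 1) / n)
    # s(x, n) = 2^(-E[h(x)] / c(n))
    score = 2 ** (-average_path_length / c)
    return score

In [3]:
score = anomaly_score(3000, 5.0)
round(score, 3)

0.796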
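
### => For reference, the Isolation Forest anomaly score is usually written as follows, where $E[h(x)]$ is the average path length of point $x$ over the trees, $n$ is the number of samples used to build each tree, and $\gamma \approx 0.5772156649$ is the Euler-Mascheroni constant. Using the full dataset size for $n$ follows the wording of the question; implementations that subsample use the subsample size instead.

$$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln(i) + \gamma
$$

### => Scores close to 1 indicate anomalies, scores well below 0.5 indicate normal points, and scores around 0.5 for all points suggest the data has no distinct anomalies.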