In [None]:
#Q1. What is anomaly detection and what is its purpose?
"""
Anomaly detection is a technique used to identify patterns or instances that deviate significantly from the normal behavior or expected patterns 
in a dataset. Anomalies, also known as outliers, are data points or patterns that differ from the majority of the data and may indicate unusual or 
potentially interesting observations.

The purpose of anomaly detection is to detect and identify these unusual patterns or instances that do not conform to the expected behavior.
"""

In [None]:
#Q2. What are the key challenges in anomaly detection?
"""
The key challenges in anomaly detection include:

1. Lack of labeled data: Anomalies are often rare and unexpected, making it difficult to obtain a sufficient amount of labeled data for training 
        anomaly detection models.

2. Imbalanced datasets: Anomalies typically represent a small portion of the overall dataset, leading to imbalanced class distributions. This can 
        result in biased models that struggle to accurately detect anomalies.

3. Dynamic and evolving anomalies: Anomalies can change over time, requiring anomaly detection models to adapt and be capable of detecting new and 
        emerging anomalies.
"""

In [2]:
#Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?
"""
Unsupervised Anomaly Detection:
1. Approach: Unsupervised anomaly detection does not require labeled data or prior knowledge of anomalies. It focuses on identifying patterns or 
    instances that deviate significantly from the normal behavior observed in the dataset.
2. Training: Unsupervised methods learn from the characteristics of the majority of the data to establish a notion of "normal" behavior. They aim to 
    find anomalies that do not conform to the learned normal patterns.
3. Anomaly Detection: Unsupervised methods detect anomalies by comparing data points to the learned normal patterns or by identifying statistical 
    deviations, clustering anomalies away from the majority of the data, or using density-based techniques.
4. Applicability: Unsupervised anomaly detection is useful when labeled anomalies are scarce or unavailable, and the goal is to discover unknown or 
    novel anomalies in the data.
    
Supervised Anomaly Detection:
1. Approach: Supervised anomaly detection requires labeled data where anomalies are explicitly identified. It involves training a model to classify 
    instances as normal or anomalous based on the provided labels.
2. Training: Supervised methods learn from labeled data, which consists of both normal and anomalous instances, and build a model that can generalize
    from these labeled examples.
3. Anomaly Detection: Once trained, supervised models can predict and classify new instances as normal or anomalous based on the patterns learned 
    during training. They rely on the labeled anomalies to guide their decision-making process.
4. Applicability: Supervised anomaly detection is suitable when labeled examples of anomalies are available, and the focus is on classifying new 
    instances as either normal or anomalous based on the learned patterns.
"""

In [None]:
#Q4. What are the main categories of anomaly detection algorithms?
"""
Anomaly detection algorithms can be broadly categorized into the following main categories:

Statistical Methods:
1. Gaussian Distribution: These methods assume that the data follows a Gaussian (normal) distribution, and anomalies are identified as instances that
    significantly deviate from the expected distribution.
2. Extreme Value Analysis: These methods focus on modeling the tail behavior of the data distribution and identify anomalies as extreme values that 
    fall outside a predefined threshold.
    
Machine Learning Methods:
1. Clustering-Based: These methods group similar instances together and identify anomalies as data points that do not belong to any cluster or are 
    distant from any cluster center.
2. Density-Based: These methods estimate the density of the data and identify anomalies as instances that lie in regions of low density.
3. Isolation Forest: This method constructs random decision trees to isolate anomalies by identifying instances that require fewer partitions to be 
    separated from the majority of the data.
4. k-Nearest Neighbors (k-NN): These methods measure the distance or similarity between instances and identify anomalies as data points that have 
    few or no nearby neighbors.
5. Local Outlier Factor (LOF): LOF calculates the density of an instance compared to its neighbors and identifies anomalies as instances with a 
    significantly lower density.
"""
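
As a small illustration of the Gaussian-style statistical approach listed above, the sketch below flags values whose z-score exceeds a threshold. The function name `zscore_outliers`, the sample readings, and the threshold values are illustrative assumptions, not part of the original answer.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7, 10.1, 25.0]
print(zscore_outliers(readings, threshold=2.5))  # [25.0]
```

In small samples a single extreme value inflates the standard deviation (with 10 points no z-score can exceed 3), which is why the example uses a threshold of 2.5 rather than the common default of 3.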

In [4]:
#Q5. What are the main assumptions made by distance-based anomaly detection methods?
"""
Distance-based anomaly detection methods rely on certain assumptions to identify anomalies based on the distances or dissimilarities between data 
points. The main assumptions made by distance-based anomaly detection methods include:

1. Normality Assumption: These methods assume that the majority of the data points in a dataset represent normal or non-anomalous instances. Anomalies
    are considered as deviations from this normal behavior.

2. Distance Metric Assumption: Distance-based methods assume the availability of a meaningful distance metric or dissimilarity measure that quantifies
    how far apart two data points are. Common choices include Euclidean distance, Mahalanobis distance, cosine distance, or other domain-specific 
    dissimilarity measures.

3. Nearest Neighbor Assumption: These methods often assume that anomalies are located far away from their nearest neighbors or have fewer neighbors
    compared to normal instances. Anomalies are expected to have dissimilarities or distances that exceed a certain threshold.

4. Density-Based Assumption: Some distance-based methods assume that anomalies lie in regions of low data density or have dissimilarities that deviate
    significantly from the average dissimilarity of the data. Anomalies are identified as instances that are sparsely surrounded by other data points.

5. Independence Assumption: Plain metrics such as Euclidean distance implicitly treat the attributes or dimensions of the data as uncorrelated and 
    comparably scaled. When features are strongly correlated or on very different scales, distances can be misleading unless the data is standardized
    or a correlation-aware metric such as Mahalanobis distance is used.
"""
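
The nearest-neighbor assumption above can be sketched as a simple score: the distance from a point to its k-th nearest neighbour. This is a minimal pure-Python sketch using Euclidean distance; the function name `knn_distance_score` and the toy points are assumptions made for illustration.

```python
import math

def knn_distance_score(point, others, k):
    """Distance from `point` to its k-th nearest neighbour among `others`."""
    if len(others) < k:
        raise ValueError("need at least k other points")
    distances = sorted(math.dist(point, other) for other in others)
    return distances[k - 1]

# A tight cluster plus one distant point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]

# Score every point against all the other points.
scores = [
    knn_distance_score(p, points[:i] + points[i + 1:], k=2)
    for i, p in enumerate(points)
]
for p, s in zip(points, scores):
    print(p, round(s, 3))
```

The distant point receives by far the largest score, because its second-nearest neighbour is several units away while cluster members have neighbours within about 0.1.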

In [None]:
#Q6. How does the LOF algorithm compute anomaly scores?
"""
The LOF (Local Outlier Factor) algorithm computes anomaly scores by comparing the local density of a data point with the densities of its neighbors. 
It measures the degree to which a point is an outlier as the ratio of the average local reachability density of its k-nearest neighbors to its own 
local reachability density. A ratio close to 1 means the point is about as dense as its neighbors, while a ratio significantly greater than 1 means 
the point lies in a sparser region than its neighbors and receives a higher anomaly score.
"""
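
The LOF computation described above can be sketched from scratch as follows. This is a simplified illustration, not a library implementation: it takes exactly k neighbours per point (ties broken by index order, whereas the full definition includes all tied neighbours) and assumes there are no duplicate points. The function name `lof_scores` and the toy data are assumptions.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for every point, using Euclidean distance."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # For each point: indices of its k nearest neighbours and its k-distance.
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def local_reachability_density(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]

    # LOF = average neighbour density divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]
for p, score in zip(points, lof_scores(points, k=2)):
    print(p, round(score, 2))
```

Cluster members come out with scores near 1, while the isolated point gets a score far above 1 because its neighbours are much denser than it is.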

In [None]:
#Q7. What are the key parameters of the Isolation Forest algorithm?
"""
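The Isolation Forest algorithm has four key parameters:

1. n_estimators: This parameter specifies the number of isolation trees to be created in the forest.

2. contamination: This parameter defines the expected proportion of anomalies in the dataset. It is used to set the threshold on the anomaly 
    scores that separates anomalies from normal instances.

3. max_depth: It limits the maximum depth or height of each isolation tree. In scikit-learn's IsolationForest this is not a user-facing 
    parameter; the limit is derived from the subsample size as ceil(log2(max_samples)).

4. max_samples: It specifies the number of samples to be used when creating each isolation tree.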
"""
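
To show where each parameter enters, here is a simplified pure-Python sketch of isolation-based scoring. It is not scikit-learn's implementation: instead of building and storing trees, it grows a fresh random tree lazily along the path of each query point, which gives path lengths with the same distribution. The function names and the synthetic data are assumptions; the height limit is derived as ceil(log2(subsample size)), and `contamination` is applied as a simple top-fraction threshold.

```python
import math
import random

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(x, sample, depth, height_limit, rng):
    """Number of random splits followed before x is isolated within `sample`."""
    if depth >= height_limit or len(sample) <= 1:
        return depth + c(len(sample))  # adjust for the unexplored subtree
    feature = rng.randrange(len(x))
    lo = min(p[feature] for p in sample)
    hi = max(p[feature] for p in sample)
    if lo == hi:
        return depth + c(len(sample))
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that x falls on.
    if x[feature] < split:
        branch = [p for p in sample if p[feature] < split]
    else:
        branch = [p for p in sample if p[feature] >= split]
    return path_length(x, branch, depth + 1, height_limit, rng)

def anomaly_score(x, data, n_estimators=100, max_samples=256, seed=0):
    """Score in (0, 1]; higher means x is easier to isolate."""
    psi = min(max_samples, len(data))  # subsample size per tree
    if psi < 2:
        raise ValueError("need at least two data points")
    rng = random.Random(seed)
    height_limit = math.ceil(math.log2(psi))
    total = 0.0
    for _ in range(n_estimators):
        sample = rng.sample(data, psi)
        total += path_length(x, sample, 0, height_limit, rng)
    return 2.0 ** (-(total / n_estimators) / c(psi))

def flag_anomalies(data, contamination=0.05, **forest_params):
    """Flag the top `contamination` fraction of scores as anomalies."""
    scores = [anomaly_score(x, data, **forest_params) for x in data]
    n_flagged = max(1, round(contamination * len(data)))
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return scores, [s >= threshold for s in scores]

# Synthetic data: 100 Gaussian points plus one far-away point.
gen = random.Random(42)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100)] + [(8.0, 8.0)]
scores, flags = flag_anomalies(data, contamination=0.05, n_estimators=50, max_samples=256)
print(flags[-1], round(scores[-1], 2))
```

Here `max_samples=256` is capped at the dataset size, so every tree sees all 101 points. More trees make the averaged path length more stable, and a larger `contamination` flags a larger share of points.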

In [1]:
#Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?
"""
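A KNN anomaly score with K=10 is based on the distance to the 10th nearest neighbor (or on the average distance to the 10 nearest neighbors). 
The question only states that 2 neighbors lie within a radius of 0.5, so the exact score cannot be calculated from the information given. What 
can be said is that the 10th nearest neighbor lies farther than 0.5 away, so the point sits in a sparse region and is a likely anomaly.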
"""

In [6]:
#Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an 
#    average path length of 5.0 compared to the average path length of the trees?
"""
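E(h(x)) = 5
n = 3000 (assuming each tree is built on all 3000 points; the number of trees, 100, only determines how many path lengths are averaged 
    into E(h(x)) and does not enter c(n))

c(n) = 2*H(n-1) - 2*(n-1)/n, where H(i) ≈ ln(i) + 0.5772
c(3000) = 2*(ln(2999) + 0.5772) - 2*2999/3000
        ≈ 2*8.5833 - 1.9993
        ≈ 15.17

Anomaly score s(x, n) = 2^(-E(h(x)) / c(n))
                      = 2^(-5 / 15.17)
                      ≈ 0.80

A score well above 0.5 and approaching 1 indicates that the point is isolated much faster than average and is likely an anomaly.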
"""
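
The arithmetic can be checked numerically with the standard Isolation Forest normalization c(n) = 2*H(n-1) - 2*(n-1)/n, where H is approximated by ln(i) plus the Euler-Mascheroni constant. The function names below are illustrative, and the sketch assumes each tree is built on all 3000 points.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def isolation_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

print(f"{average_path_length(3000):.2f}")   # 15.17
print(f"{isolation_score(5.0, 3000):.2f}")  # 0.80
```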