## Supervised Anomaly Detection
In earlier lessons, we looked at different ways to find unusual patterns in data without using any extra information. The challenge with this unsupervised anomaly detection is that ordinary noise is sometimes mislabeled as anomalous, because the detector has no knowledge of what a real anomaly looks like. Now, what if you have labeled examples of both anomalies and normal data? How does that change things?

**Knowledge is key**
- Make use of any (domain-specific) information you have regarding your anomalies.
- If you're extremely fortunate, you may have a straightforward selection criterion:
    - Every height greater than 6' 5" is an anomaly.
- More often, information consists of examples of normal data and anomalies
- Extra information usually improves anomaly detection accuracy significantly
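When a selection criterion like the height rule is available, the detector can be a one-line comparison. The sketch below assumes heights are recorded in inches; the function name `is_anomaly` and the sample values are made up for illustration.

```python
# Domain-knowledge rule from the text: heights above 6' 5" are anomalies.
THRESHOLD_INCHES = 6 * 12 + 5  # 6 feet 5 inches = 77 inches


def is_anomaly(height_inches: float) -> bool:
    """Return True when the height exceeds the threshold."""
    return height_inches > THRESHOLD_INCHES


# Made-up sample heights in inches (illustrative only).
heights = [64.0, 70.5, 77.0, 78.2, 81.0]
flags = [is_anomaly(h) for h in heights]

for h, flag in zip(heights, flags):
    print(f"{h:5.1f} in -> {'anomaly' if flag else 'normal'}")
```

Note that a height of exactly 77 inches is not flagged, because the rule says "greater than".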

Note: `Whenever possible, use supervised methods.`

**A special case of classification**  
In general, supervised anomaly detection is a special case of the classification problem.
- Examples of anomalies and normal points = training data
- Unlabeled points = test data
- Therefore, the many classification techniques available (supervised machine learning) can be used for anomaly detection
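The classification framing can be sketched as follows. This is a minimal illustration, not a prescribed method: the synthetic data from `make_classification`, the 95/5 class split, and the shallow `DecisionTreeClassifier` are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: roughly 95% normal points (label 0), 5% anomalies (label 1).
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],
    random_state=0,
)

# Labeled examples play the role of training data; the held-out points
# stand in for the unlabeled data we want to screen.
X_labeled, X_unlabeled, y_labeled, y_hidden = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Any classifier could be used here; a shallow tree is an arbitrary choice.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_labeled, y_labeled)

# Predict a label (0 = normal, 1 = anomaly) for every unlabeled point.
predicted = clf.predict(X_unlabeled)
print("points flagged as anomalies:", int(predicted.sum()))
```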

**Supervised anomaly detection may face a few problems**
1. Class Imbalance
    - Anomalies by definition are rare, so there will be few examples
2. Contaminated normal data
    - Only the anomalies are labeled, so the normal class is contaminated by unlabeled anomalies
3. Partially labeled data (“semi-supervised anomaly detection”)
    - Typical case: only normal class labeled
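For the third case, where only the normal class is labeled, one common option is to fit a one-class model on the normal data alone. The sketch below uses scikit-learn's `OneClassSVM` as an example; the Gaussian training data, the `nu=0.05` setting, and the two query points are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Assumed setup: 500 labeled normal points drawn from a standard Gaussian.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Fit on normal data only; nu bounds the fraction of training points
# that may be treated as outliers (0.05 is an arbitrary choice).
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(X_normal)

# Two new points: one near the bulk of the data, one far away from it.
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])

# predict returns +1 for inliers (normal) and -1 for outliers (anomalies).
labels = detector.predict(X_new)
for point, label in zip(X_new, labels):
    print(point, "->", "normal" if label == 1 else "anomaly")
```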

**Illustration of Class Imbalance Problem**
- There's a test using X-ray scans to find a rare cancer.
- In the group taking the test, 99% are healthy, and only 1% have the cancer.
- We're making a computer program to decide if a scan is normal or anomalous.

Note: `Must evaluate algorithm performance with care`

Now, if we train an anomaly detector in this setting, we might end up with a useless one. Consider a detector that does the following:
- Labels all scans as normal without any analysis
- For 100 scans, the confusion matrix will be:
![image](https://github.com/surajkarki66/Freyja/assets/50628520/24c4fc91-bf64-4808-8b2b-c6fc5b8f1478)
- Accuracy = (0 + 99)/100 = 99%
- The accuracy is very high, but the detector never finds a sick patient
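The arithmetic behind this example can be checked directly. The sketch below assumes 100 scans, of which 1 is cancerous (label 1) and 99 are healthy (label 0), and a detector that labels everything as normal. It also computes recall, the fraction of sick patients actually found, which exposes the problem that accuracy hides.

```python
# 100 scans: 1 has the cancer (label 1), 99 are healthy (label 0).
y_true = [1] + [0] * 99

# Trivial detector: label every scan as normal.
y_pred = [0] * 100

# Count the four confusion-matrix cells (anomaly = positive class).
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # sick, flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # sick, missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # healthy, flagged
tn = sum(t == 0 and p == 0 for t, p in pairs)  # healthy, cleared

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of sick patients that were detected

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

A recall of zero means that no sick patient is ever detected, despite the 99% accuracy.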

**To perform effective anomaly detection on a class-imbalanced dataset, we need to evaluate the algorithm carefully**

- When the anomalies represent a tiny percentage of the total data, overall accuracy is not a relevant metric to evaluate the algorithm.
- Typically, it is more costly to misclassify an anomalous point than a normal one
    - For the cancer example: a false positive (normal point misclassified) will lead to additional diagnostic tests, which hopefully will correct the error
    - A false negative (anomalous point misclassified) will lead to overlooking the disease at an early, treatable stage and perhaps ignoring it until it is too late to treat
- This cost should be included when evaluating the effectiveness of the algorithm
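One simple way to include this cost in the evaluation is to weight each kind of error and compare detectors by total cost rather than by accuracy. In the sketch below, the cost values (1 per false positive, 50 per false negative) and the second "cautious" detector are hypothetical numbers chosen only to illustrate the idea.

```python
def total_cost(fp: int, fn: int, cost_fp: float = 1.0, cost_fn: float = 50.0) -> float:
    """Sum of misclassification costs; the default costs are hypothetical."""
    return fp * cost_fp + fn * cost_fn


def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Fraction of all points that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


# Detector A: labels all 100 scans as normal (misses the one sick patient).
a = {"tp": 0, "fn": 1, "fp": 0, "tn": 99}
# Detector B (hypothetical): finds the sick patient but raises 5 false alarms.
b = {"tp": 1, "fn": 0, "fp": 5, "tn": 94}

acc_a, acc_b = accuracy(**a), accuracy(**b)
cost_a = total_cost(a["fp"], a["fn"])
cost_b = total_cost(b["fp"], b["fn"])

print(f"A: accuracy={acc_a:.2f} cost={cost_a}")
print(f"B: accuracy={acc_b:.2f} cost={cost_b}")
```

Under these assumed costs, detector B has lower accuracy but a much lower total cost, so it would be preferred.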

**The algorithm should take into account the cost of making a mistake (misclassification)**
- Use a cost-weighted approach when implementing the algorithm
- Two main ways to do so:
    - Cost-sensitive learning
    - Adaptive resampling
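Both ideas can be sketched with scikit-learn alone. For cost-sensitive learning, the example uses the `class_weight="balanced"` option of `SVC`. For resampling, it uses plain random oversampling of the minority class via `sklearn.utils.resample`, which is a simple stand-in and not an adaptive scheme. The synthetic data and the linear kernel are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils import resample

# Synthetic imbalanced data: about 95% normal (0) and 5% anomalous (1).
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],
    random_state=0,
)

# Baseline: every misclassification is penalized equally.
plain_svm = SVC(kernel="linear").fit(X, y)

# Cost-sensitive learning: errors on the rare class are weighted more
# heavily, in inverse proportion to the class frequencies.
weighted_svm = SVC(kernel="linear", class_weight="balanced").fit(X, y)

# Resampling: draw minority points with replacement until both classes
# have the same number of samples, then train an unweighted model.
X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
resampled_svm = SVC(kernel="linear").fit(X_bal, y_bal)

# Compare how many of the original points each model flags as anomalous.
for name, model in [
    ("plain", plain_svm),
    ("class-weighted", weighted_svm),
    ("oversampled", resampled_svm),
]:
    print(f"{name:15s} flagged: {int(model.predict(X).sum())}")
```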

In [7]:
!pip install scikit-learn==1.2.2
!pip install mlxtend
!pip install imbalanced-learn



## 1. Imports

In [8]:
import warnings
import sys
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import random
import seaborn as sns

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from collections import Counter
from mlxtend.plotting import plot_decision_regions
from sklearn.metrics import confusion_matrix, accuracy_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

%matplotlib inline
warnings.filterwarnings('ignore', category=FutureWarning)

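Note: on this setup the import cell fails with `ImportError: cannot import name '_MissingValues' from 'sklearn.utils._param_validation'`. This error usually points to a version mismatch between imbalanced-learn and the scikit-learn release that is loaded. Restarting the kernel after the install cell, or installing a mutually compatible pair of versions, typically resolves it.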