# Anomaly detection

Example: find "unusual" examples in a dataset, such as aircraft engines described by features $x_i$ (heat output, vibration intensity, etc.).
The algorithm is trained on $x_{\rm train}$, most of which are good engines. It then evaluates a new example $x_{\rm test}$ and decides whether it is anomalous with respect to the others.
In other words, it checks whether new data falls within the range of the old data.

Done via `Density estimation`.  
Estimate the probability density $p(x)$ from the training data, then evaluate it for a new example to see whether it lies in a high- or low-probability region. If $p(x) < \epsilon$, the example is flagged as an _anomaly_.

Used in:
- Fraud detection (model features of user activity and find _anomalous_ users), e.g., financial fraud.
- Fault detection in production.
- Monitoring computers in a data center.

Data is modelled using a `Gaussian distribution` (normal distribution) with mean $\mu$ and variance $\sigma^2$.
$$
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x-\mu)^2 / 2\sigma^2}
$$

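Then, given a dataset $x^{(i)}$, $i=1,\dots,m$, the parameters $\mu$ and $\sigma^2$ have to be found to fit the distribution to the data.
$$
\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)} \\
\sigma^2 = \frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)^2
$$
These are the **maximum likelihood estimates** of $\mu$ and $\sigma^2$. Some texts use $1/(m-1)$ in the variance (the unbiased estimate); for large $m$ the difference is negligible.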
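
As a minimal sketch of this fit, the snippet below estimates $\mu$ and $\sigma^2$ for a single feature and evaluates the Gaussian density. The function names `fit_gaussian` and `gaussian_pdf` and the sample values are assumptions made for illustration, not part of the original notes.

```python
import math

def fit_gaussian(xs):
    """Maximum likelihood estimates (mean, variance) for a 1-D sample."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m  # divides by m, not m - 1
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical readings of one feature (e.g. heat output).
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, var = fit_gaussian(samples)

p_typical = gaussian_pdf(3.0, mu, var)   # near the mean: high density
p_unusual = gaussian_pdf(12.0, mu, var)  # far from the mean: low density
```

A value far from the mean gets a much smaller density, which is what the threshold $\epsilon$ is compared against.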

## Anomaly detection algorithm

Consider a training set $\vec{x}^{(i)}$, $i=1,\dots,m$, where each example has $n$ features.
Estimate $p(\vec{x})$, where $\vec{x}$ is the feature vector. Assuming that the features are independent, $p(\vec{x}) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$,
where $\mu_j$ and $\sigma_j^2$ are the mean and variance of feature $j$. Each feature has its own $\mu_j$ and $\sigma_j^2$.

Building an algorithm
1. Choose features
2. Fit parameters, vectorized: $\vec{\mu}=\frac{1}{m}\sum_{i=1}^m \vec{x}^{(i)}$ and $\sigma_j^2=\frac{1}{m}\sum_{i=1}^m (x_j^{(i)}-\mu_j)^2$
3. Given a new example $\vec{x}$, compute $p(\vec{x})$ as
$$
p(\vec{x}) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) =
\prod_{j=1}^n\frac{1}{\sqrt{2\pi}\,\sigma_j}e^{-(x_j-\mu_j)^2 / 2\sigma_j^2}
$$
4. Compare $p(\vec{x})$ to a threshold $\epsilon$; flag an anomaly if $p(\vec{x}) < \epsilon$
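
The steps above can be sketched in plain Python as follows. The helper names (`fit`, `density`, `is_anomaly`), the two-feature toy data, and the value of `epsilon` are assumptions chosen for illustration; in practice $\epsilon$ would be tuned on labelled data.

```python
import math

def fit(X):
    """Per-feature maximum likelihood mean and variance.

    X is a list of examples, each a list of n feature values.
    """
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    var = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, var

def density(x, mu, var):
    """p(x) as a product of independent univariate Gaussian densities."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

def is_anomaly(x, mu, var, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return density(x, mu, var) < epsilon

# Hypothetical training data: (heat output, vibration intensity) per engine.
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.0, 13.0]]
mu, var = fit(X_train)
epsilon = 1e-3  # assumed threshold, not a recommended value

typical_flag = is_anomaly([2.0, 11.0], mu, var, epsilon)
unusual_flag = is_anomaly([10.0, 30.0], mu, var, epsilon)
```

With many features the product of small densities can underflow; summing log-densities and comparing against $\log\epsilon$ is a common way around that.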

## Evaluating the algorithm

It is helpful to have some labelled data.
Assume that all the unlabelled training data, $x^{(i)}$, is normal, with $y^{(i)}=0$ (not an anomaly).
Set aside a cross-validation set $x_{\rm cv}^{(i)}$ with labels $y_{\rm cv}^{(i)}$, some of which are $1$ and the others $0$.
Set aside a test set $x_{\rm test}^{(i)}$, $y_{\rm test}^{(i)}$ that also includes both $y=1$ and $y=0$.

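Generally, the training set should contain only normal examples, but it is _ok_ if a few anomalies are mislabelled as normal.
The normal examples are generally split $60/20/20$ percent between the training, cross-validation, and test sets, with the known anomalies divided between the latter two. Use the cross-validation set to tune $\epsilon$ so that the algorithm correctly identifies $y=1$.
Assess the algorithm on the test set.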
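
One way to build such a split is sketched below, assuming a 60/20/20 division of the normal examples and an even division of the known anomalies between the cv and test sets. The function name `split_dataset` and the placeholder data are hypothetical.

```python
import random

def split_dataset(normal, anomalous, seed=0):
    """Split into an unlabelled training set and labelled cv/test sets.

    Training gets 60% of the normal examples and no known anomalies.
    The cv and test sets each get 20% of the normal examples (label 0)
    and half of the anomalies (label 1).
    """
    rng = random.Random(seed)
    normal, anomalous = list(normal), list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = len(normal) * 6 // 10
    n_cv = len(normal) * 2 // 10
    half = len(anomalous) // 2

    train = normal[:n_train]
    cv = ([(x, 0) for x in normal[n_train:n_train + n_cv]]
          + [(x, 1) for x in anomalous[:half]])
    test = ([(x, 0) for x in normal[n_train + n_cv:]]
            + [(x, 1) for x in anomalous[half:]])
    return train, cv, test

# Placeholder data: 100 normal examples and 10 known anomalies.
normal_examples = list(range(100))
anomalous_examples = list(range(1000, 1010))
train, cv, test = split_dataset(normal_examples, anomalous_examples)
```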

**Note** when the number of labelled anomalies is very small, often only the training and cv sets are used.
But you then cannot evaluate the model performance on new data.
_Risk of overfitting_ $\epsilon$ (and the chosen features) to the cv set.

Evaluating the model:  
1. Fit the model to the training set
2. On the cv set compute $p(x)$ and predict
$$
y =
\begin{cases}
1 & \text{if } p(x) < \epsilon \text{ (anomaly)} \\
0 & \text{if } p(x) \geq \epsilon \text{ (normal)}
\end{cases}
$$
**Note** that the dataset is heavily skewed (far more $y=0$ than $y=1$), so plain accuracy is misleading. To evaluate the performance of the model consider  
- True/False positives; True/False negatives;
- precision/recall;
- $F_1$ score
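
A minimal sketch of computing the $F_1$ score and using it to pick $\epsilon$ on the cv set is shown below. The names `f1_score` and `select_epsilon`, the grid of 1000 candidate thresholds, and the toy densities are assumptions for illustration only.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, with 1 meaning anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, n_steps=1000):
    """Scan thresholds between min and max density; keep the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    lo, hi = min(p_cv), max(p_cv)
    step = (hi - lo) / n_steps
    for k in range(1, n_steps + 1):
        eps = lo + k * step
        y_pred = [1 if p < eps else 0 for p in p_cv]
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical densities p(x) on the cv set with their true labels.
p_cv = [0.20, 0.18, 0.15, 0.001, 0.17, 0.002, 0.16]
y_cv = [0, 0, 0, 1, 0, 1, 0]
eps, best = select_epsilon(p_cv, y_cv)
```

Here the two labelled anomalies have much lower density than the normal examples, so any threshold between them separates the classes perfectly; real data is rarely this clean.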

#### Supervised learning vs anomaly detection
Use _Anomaly detection_ if
- There is a small number of positive examples and a large number of negative examples
- There are many ways (some still unknown) to get a positive example, i.e., future anomalies may look nothing like those seen so far
- Examples: fraud (new/unique); Manufacturing (for new, unseen defects); Machines in data centers

Use _Supervised learning_ if
- There is a large number of both positive and negative examples
- Future positives are similar to the ones in the training set
- Examples: Spam emails (generally similar); Manufacturing (finding known, seen problems, e.g., scratched screen); Weather prediction; Disease classification

#### Choosing features for anomaly detection

Generally, features should be approximately Gaussian. Plot the histogram of a feature and see how close it is to a Gaussian. If it is not, apply $x\rightarrow\log(x+c)$ or another mathematical transformation, and apply the same transformation to the cv and test sets.
Error analysis: see where the algorithm fails. The aim is $p(x) \geq \epsilon$ for normal examples and $p(x) < \epsilon$ for anomalies.  
A common problem is $p(x) \geq \epsilon$ for both cases. Sometimes a _new feature_ is required to make the algorithm recognize the anomaly. This is _feature engineering_.
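
Besides looking at a histogram, a rough numeric check of how Gaussian a feature looks is its sample skewness, which is zero for a symmetric distribution. The sketch below compares the skewness of a right-skewed feature before and after $x\rightarrow\log(x+c)$. The `skewness` helper, the data, and the choice $c=1$ are assumptions made for illustration.

```python
import math

def skewness(xs):
    """Sample skewness: third central moment over variance^(3/2)."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    third = sum((x - mu) ** 3 for x in xs) / m
    return third / var ** 1.5

# Hypothetical right-skewed feature: many small values, a few large ones.
raw = [1, 1, 2, 2, 3, 4, 6, 10, 20, 60]

c = 1.0  # assumed offset; try several values and inspect the histogram
transformed = [math.log(x + c) for x in raw]

skew_raw = skewness(raw)
skew_log = skewness(transformed)
```

A skewness closer to zero after the transformation suggests the feature is closer to symmetric, though it does not by itself show the feature is Gaussian.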

