### Developing and Evaluating an Anomaly Detection System

#### The importance of real-number evaluation
When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating the algorithm that returns a single real number.  
Assume we have some labeled data, of anomalous and non-anomalous examples ($y = 0$ if normal, $y = 1$ if anomalous).


Training set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ (assumed to be normal examples, i.e. not anomalous)

Cross validation set $(x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$  
Test set $(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$

Example: Aircraft engines    
10000 good (normal) engines    
20 flawed engines (anomalous)    

Training set: 6000 good engines   
CV: 2000 good engines ($y = 0$), 10 anomalous ($y = 1$)    
Test: 2000 good engines ($y = 0$), 10 anomalous ($y = 1$)    
It is best to use different examples for the CV and test sets, so that no engine appears in both.   
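The split above can be sketched with index arrays. This is only an illustration: the function name `split_engines`, the dictionary keys, and the use of shuffled integer indices in place of real engine records are assumptions, not part of the original notes.

```python
import numpy as np

def split_engines(n_good=10000, n_bad=20, seed=0):
    """Return index arrays for the 6000 / 2000+10 / 2000+10 split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    good = rng.permutation(n_good)  # shuffled indices of normal engines (y = 0)
    bad = rng.permutation(n_bad)    # shuffled indices of anomalous engines (y = 1)
    return {
        "train_good": good[:6000],    # training set: normal engines only
        "cv_good": good[6000:8000],   # CV: 2000 normal ...
        "cv_bad": bad[:10],           # ... plus 10 anomalous
        "test_good": good[8000:],     # test: the remaining 2000 normal ...
        "test_bad": bad[10:],         # ... plus the remaining 10 anomalous
    }

split = split_engines()
```

Because each group is a disjoint slice of one shuffled array, no engine can end up in both the CV and test sets.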

#### Algorithm evaluation

Fit model $p(x)$ on training set $x^{(1)}, \dots, x^{(m)}$    
On a cross validation/test example $x$, predict  

$y =
\begin{cases}
1       & \quad \text{if } p(x) < \epsilon \quad \text{(anomaly)}\\
0       & \quad \text{if } p(x) \geq \epsilon \quad  \text{(normal)}
\end{cases}
$

Possible evaluation metrics:
- true positive, false positive, false negative, true negative counts
- precision and recall
- F1 score
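The fit, predict, and score steps above can be sketched as follows. The notes do not fix the form of $p(x)$ in this section, so the sketch assumes a product of independent per-feature Gaussians; the function names (`fit_gaussian`, `density`, `predict`, `evaluate`) are hypothetical. Counts and precision/recall are used rather than plain accuracy because the classes are heavily skewed: with 2000 normal and 10 anomalous CV examples, always predicting $y = 0$ is already right on 2000 of 2010 examples.

```python
import numpy as np

def fit_gaussian(X_train):
    """Estimate per-feature mean and variance from (assumed normal) training examples."""
    X_train = np.asarray(X_train, dtype=float)
    return X_train.mean(axis=0), X_train.var(axis=0)

def density(X, mu, var):
    """p(x) under the assumption of independent Gaussian features."""
    X = np.asarray(X, dtype=float)
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    return np.prod(coef * np.exp(-((X - mu) ** 2) / (2.0 * var)), axis=1)

def predict(p, epsilon):
    """y = 1 (anomaly) if p(x) < epsilon, else y = 0 (normal)."""
    return (np.asarray(p) < epsilon).astype(int)

def evaluate(y_true, y_pred):
    """Return confusion counts, precision, recall, and F1 for the anomaly class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}
```

Returning 0 for precision, recall, or F1 when a denominator is zero is a convention chosen here so the function never divides by zero.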

Can also use cross validation set to choose parameter $\epsilon$.
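One way to choose $\epsilon$ is to scan candidate thresholds and keep the one with the best $F_1$ score on the cross validation set. The sketch below assumes the densities `p_cv` have already been computed by some fitted model; the names `f1_score` and `select_epsilon`, the linear grid, and the number of steps are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 for the anomaly class (y = 1); 0.0 when it is undefined."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

def select_epsilon(p_cv, y_cv, n_steps=1000):
    """Scan thresholds between min and max of p_cv; return (best_epsilon, best_f1)."""
    p_cv = np.asarray(p_cv, dtype=float)
    y_cv = np.asarray(y_cv)
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), n_steps):
        y_pred = (p_cv < eps).astype(int)  # flag low-density examples as anomalies
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:                   # keep the first threshold with the best F1
            best_eps, best_f1 = float(eps), f1
    return best_eps, best_f1

# Toy CV set: three normal examples with high density, two anomalies with low density.
eps, f1 = select_epsilon([0.9, 0.8, 0.85, 0.01, 0.02], [0, 0, 0, 1, 1])
```

Since $\epsilon$ is tuned on the cross validation set, the final performance estimate should come from the separate test set.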

### Anomaly Detection vs. Supervised Learning

Why use anomaly detection instead of supervised learning?

#### Anomaly detection
- Very small number of positive examples ($y=1$). (0-20 is common).   
- Large number of negative ($y=0$) examples.
- Many different "types" of anomalies. It is hard for any algorithm to learn from the positive examples what anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.

Examples:
- fraud detection
- manufacturing
- monitoring machines in a data center

#### Supervised learning
- Large number of positive and negative examples.
- Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.

Examples:
- email spam classification
- weather prediction
- cancer classification



### Choosing What Features to Use

Plot a histogram of each feature. If the distribution is asymmetric rather than Gaussian, apply a transformation, for example:
- $x \to \log(x)$
- $x \to x^{0.2}$

Then plot the histogram again to check that the transformed feature looks more Gaussian.
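The effect of the two transformations above can be checked numerically as well as visually. The sketch below uses a synthetic right-skewed feature (a log-normal sample, purely an assumption for illustration) and compares sample skewness before and after each transform; a value near 0 indicates a more symmetric, more Gaussian-looking histogram. The helper `skewness` is hypothetical.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment over the 1.5 power of the variance."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # synthetic, strictly positive feature

candidates = {
    "x": x,
    "log(x)": np.log(x),  # requires x > 0
    "x^0.2": x ** 0.2,
}
skews = {name: skewness(values) for name, values in candidates.items()}

# Histogram counts, e.g. to feed a plotting library of your choice.
counts, bin_edges = np.histogram(candidates["log(x)"], bins=50)
```

For features that contain zeros, $\log(x)$ is undefined; a shifted variant such as $\log(x + c)$ for some constant $c > 0$ is one common workaround, though which transform works best depends on the data.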

#### Error analysis for anomaly detection

Want $p(x)$ large for normal examples $x$.  
Want $p(x)$ small for anomalous examples $x$.  

Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples

E.g. Monitoring computer in a data center    
Choose features that might take on unusually large or small values in the event of an anomaly.      
$x_1$ = memory use of computer      
$x_2$ = number of disk accesses     
$x_3$ = CPU load     
$x_4$ = network traffic   

Create other features:             
$x_5 = \frac{\text{CPU load}}{\text{network traffic}}$      
$x_6 = \frac{(\text{CPU load})^2}{\text{network traffic}}$
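A minimal sketch of building $x_5$ and $x_6$ from the four base features. The column order, the function name `add_ratio_features`, and the small constant `tiny` that guards against division by zero are assumptions made for this example. The ratios become large when CPU load is high while network traffic is low, a combination that neither $x_3$ nor $x_4$ would flag on its own.

```python
import numpy as np

def add_ratio_features(X, tiny=1e-9):
    """Append x5 = CPU / traffic and x6 = CPU^2 / traffic as new columns.

    X is assumed to have columns [memory use, disk accesses, CPU load, network traffic].
    """
    X = np.asarray(X, dtype=float)
    cpu, traffic = X[:, 2], X[:, 3]
    x5 = cpu / (traffic + tiny)        # tiny avoids dividing by zero traffic
    x6 = cpu ** 2 / (traffic + tiny)
    return np.column_stack([X, x5, x6])

# Two hypothetical machines: the second has high CPU load but very little traffic.
X = [[0.5, 100.0, 0.2, 10.0],
     [0.5, 100.0, 0.9, 1.0]]
X_ext = add_ratio_features(X)
```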