# Exploratory Activity 12: Anomaly Detection

*adapted from Machine Learning by Andrew Ng*

*see also: https://www.johnwittenauer.net/machine-learning-exercises-in-python-part-8/*



Consider a hypothetical monitoring network consisting of 614 low-cost sensor nodes measuring ozone and temperature. While it is often possible to identify failing sensors by visual inspection of their data output, this task becomes impractical with a network of this size, especially given the natural variability in atmospheric conditions across a network domain. 

Today you will design a classification algorithm that flags possible sensor failures by detecting nodes with anomalous data. The population of nodes has already been divided into two equal halves to form a "training" set and a "test" set. The training set has been manually screened and the failed sensors labeled as anomalies.


## 12.1 Anomaly Detection Using Two Variables

Use the code below to load the ozone and temperature readings from the sensors in the "test" set. The data matrix should have a "shape" of 307 by 2, corresponding to one observation of each variable made by half of the sensor nodes.


In [None]:
# import the necessary library
from scipy.io import loadmat

# load the ozone and temperature readings
data = loadmat('data/anomaly_practice_1.mat')
test_data = data['X']

# display the variable's "shape"
print(test_data.shape)
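
If you are curious what `loadmat` actually returns, the sketch below may help. It writes a small stand-in file and reads it back, showing that the result is a dictionary keyed by variable name. The file name `example.mat` and the all-zero arrays are made up for illustration and are not the activity data.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Write a hypothetical .mat file with two variables, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.mat")
    savemat(example_path, {"X": np.zeros((307, 2)), "yval": np.zeros((307, 1))})
    example = loadmat(example_path)

# loadmat returns a dictionary; keys starting with "__" hold file metadata.
variable_names = [key for key in example if not key.startswith("__")]

print(variable_names)
print(example["X"].shape)
print(example["yval"].shape)
```

Listing the keys this way is a quick check that the variable you want is in the file before you index into it.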

**Question 12.1.1** Determine the Gaussian distributions most representative of the ozone and temperature values in the test data by computing the mean and variance of these two variables.

In [None]:
## ENTER CODE HERE TO COMPUTE TWO MEANS AND TWO VARIANCES
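
One possible way to approach Question 12.1.1 is sketched below on synthetic data. The array `example_data` and its values are invented stand-ins with the same shape as the test matrix, and the names `example_means` and `example_variances` are placeholders rather than names the activity requires.

```python
import numpy as np

# Synthetic stand-in for a (307, 2) data matrix; the values are made up.
rng = np.random.default_rng(0)
example_data = rng.normal(loc=[40.0, 20.0], scale=[5.0, 3.0], size=(307, 2))

# axis=0 collapses the rows (sensor nodes), leaving one statistic per column (variable).
example_means = example_data.mean(axis=0)
example_variances = example_data.var(axis=0)  # population variance (ddof=0) by default

print(example_means.shape, example_variances.shape)
```

NumPy's `var` divides by N by default. Passing `ddof=1` divides by N - 1 instead, which makes little difference for a few hundred nodes but is worth choosing deliberately.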

**Question 12.1.2** The code shown below calculates the probability (strictly, the probability density) that a data point belongs to a given multivariate normal distribution. Update the variable names in this code so that it correctly calculates the probability that each of the sensor nodes in your test data set belongs to the normal distribution defined by the ozone and temperature means and variances you calculated above. Your final probability array should contain 307 values, one for each sensor node.

In [None]:
# import the necessary library
from scipy import stats

# generate a normal distribution based on the calculated means and variances
## UPDATE THIS CODE TO REFLECT THE VARIABLE NAMES YOU USED IN Q12.1.1 AS NECESSARY
dist = stats.multivariate_normal(means, variances)

# calculate the probability that the test data belongs to this normal distribution
prob = dist.pdf(test_data)

# display the shape of the probability variable
prob.shape
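
It may help to see what passing a list of variances to `stats.multivariate_normal` implies. The sketch below, again on made-up data with placeholder names, suggests that a one-dimensional covariance argument is treated as the diagonal of the covariance matrix, so the two variables are modeled as independent and the joint density equals the product of the two single-variable densities.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data; the values are made up.
rng = np.random.default_rng(1)
example_data = rng.normal(loc=[40.0, 20.0], scale=[5.0, 3.0], size=(307, 2))
example_means = example_data.mean(axis=0)
example_variances = example_data.var(axis=0)

# A 1-D covariance argument is interpreted as the diagonal of the covariance matrix.
example_dist = stats.multivariate_normal(example_means, example_variances)
example_prob = example_dist.pdf(example_data)

# Equivalent calculation: multiply the univariate normal densities of each column.
univariate = stats.norm(example_means, np.sqrt(example_variances)).pdf(example_data)
product = univariate.prod(axis=1)

print(example_prob.shape)
print(np.allclose(example_prob, product))
```

Because the off-diagonal terms are zero, this model ignores any correlation between ozone and temperature. Passing a full covariance matrix (for example from `np.cov`) would be a different modeling choice.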

You now know how probable it is that each data point belongs to the normal distributions of ozone and temperature readings characterized by the test data set and can use this information to identify sensor nodes exhibiting anomalous behavior (i.e., those whose ozone and/or temperature readings have a lower probability of belonging to the normal distribution(s)). The remaining question is: what probability threshold should you use as the boundary between anomalies and non-anomalies?

This is where the "training" data set becomes useful! Use the code below to load the ozone and temperature readings from the sensors in the training set, as well as the quality "flags" that have already been assigned to those sensors. Anomalous sensors are assigned a flag of "1"; all other sensors are assigned a flag of "0".

In [None]:
train_data = data['Xval']
train_labels = data['yval']

**Question 12.1.3** Make a scatter plot of the training data and use the training labels to color the markers to verify visually that the proper sensors have been flagged as anomalous.

In [None]:
## ENTER CODE HERE TO CREATE A SCATTER PLOT OF THE TRAINING DATA
## COLORED BY QUALITY FLAG
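
A scatter plot colored by flag could be built along the lines of the sketch below. The points and flags are synthetic, the axis labels are generic because the column order is an assumption you should check yourself, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in: 300 points, the first 5 shifted away and flagged as anomalies.
rng = np.random.default_rng(2)
example_points = rng.normal(loc=[40.0, 20.0], scale=[5.0, 3.0], size=(300, 2))
example_flags = np.zeros((300, 1), dtype=int)
example_points[:5] += [30.0, -15.0]
example_flags[:5] = 1

# Labels loaded from a .mat file are typically (n, 1); flatten before masking rows.
is_anomaly = example_flags.ravel() == 1

fig, ax = plt.subplots()
ax.scatter(example_points[~is_anomaly, 0], example_points[~is_anomaly, 1],
           s=12, label="flag 0")
ax.scatter(example_points[is_anomaly, 0], example_points[is_anomaly, 1],
           s=40, marker="x", color="red", label="flag 1 (anomalous)")
ax.set_xlabel("first variable")
ax.set_ylabel("second variable")
ax.legend()
```

Plotting the two groups in separate `scatter` calls makes it easy to give the anomalies their own marker and a legend entry.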

These are the sites that we would ideally like to be able to detect using our probability distribution function!

**Question 12.1.4** To get started, calculate the probability that each sensor node from the *training* data set belongs to the multivariate normal distribution defined by the means and variances of the *test* data set. Then make a scatter plot of the probabilities to get a sense of the range of magnitudes.

In [None]:
## ENTER CODE HERE TO CALCULATE THE PROBABILITY THAT THE TRAINING DATA
## BELONGS TO THE TEST DATA'S NORMAL DISTRIBUTION

## ENTER CODE HERE TO PLOT THE PROBABILITIES CALCULATED ABOVE

**Question 12.1.5** Next, write code that loops over a range of possible threshold probabilities and uses each threshold to classify the data points in the training set. The code should also compare this probability-based classification to the true labels and calculate the number of true positives, false positives, and false negatives. Together, these metrics can be used to compute the “F<sub>1</sub>” score of the classification. 

*Hint 1: Consult Activity 11 for the definitions of true positives, false positives, false negatives, precision, recall, and F<sub>1</sub> scores.*

*Hint 2: You should only examine threshold probabilities that actually result in some anomalous flags. Otherwise, you will end up with no true positives and no false positives, and calculating the precision, TP / (TP + FP), would require dividing by zero!*

In [None]:
## ENTER CODE HERE TO CLASSIFY THE TRAINING SENSORS BASED ON DIFFERENT THRESHOLD PROBABILITIES

## ALSO INCLUDE CALCULATIONS OF TRUE POSITIVES, FALSE POSITIVES, FALSE NEGATIVES
## PRECISION, RECALL, AND F1 SCORES 
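
The bookkeeping for a single threshold can be sketched as follows. The probabilities and labels are invented, `f1_for_threshold` is a hypothetical helper name, and returning 0 when there are no true positives is one common convention for sidestepping the division by zero, not something the activity prescribes.

```python
import numpy as np

# Hypothetical probabilities and true flags for 8 nodes (made-up values).
example_prob = np.array([0.001, 0.002, 0.09, 0.10, 0.12, 0.15, 0.004, 0.11])
example_labels = np.array([1, 1, 0, 0, 0, 0, 0, 0])


def f1_for_threshold(prob, labels, epsilon):
    """Flag nodes with prob < epsilon as anomalies and score them against labels."""
    predicted = prob < epsilon
    actual = labels.ravel() == 1
    tp = np.sum(predicted & actual)    # flagged and truly anomalous
    fp = np.sum(predicted & ~actual)   # flagged but actually fine
    fn = np.sum(~predicted & actual)   # missed anomalies
    if tp == 0:
        return 0.0                     # assumed convention: avoids dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Score a range of candidate thresholds spanning the observed probabilities.
example_thresholds = np.linspace(example_prob.min(), example_prob.max(), 50)
example_scores = np.array(
    [f1_for_threshold(example_prob, example_labels, eps) for eps in example_thresholds]
)

print(f1_for_threshold(example_prob, example_labels, 0.003))
print(f1_for_threshold(example_prob, example_labels, 0.005))
```

In this toy case a threshold of 0.003 flags exactly the two labeled anomalies, while 0.005 also flags one healthy node and lowers the precision.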

**Question 12.1.6** Find the threshold probability that resulted in the highest F<sub>1</sub> score. If there are multiple threshold probabilities that result in the same maximum F<sub>1</sub> score, choose the highest threshold.

In [None]:
## ENTER CODE HERE TO CALCULATE THE HIGHEST F1 SCORE
## AND DETERMINE THE CORRESPONDING THRESHOLD PROBABILITY
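
Ties need a little care. The sketch below uses made-up thresholds and scores to show that `np.argmax` on its own returns the first (lowest) tied threshold, whereas masking for the maximum score and then taking the largest threshold follows the tie rule in Question 12.1.6.

```python
import numpy as np

# Hypothetical thresholds and F1 scores with a three-way tie at 0.80.
example_thresholds = np.array([0.001, 0.002, 0.003, 0.004, 0.005])
example_scores = np.array([0.50, 0.80, 0.80, 0.80, 0.60])

best_score = example_scores.max()

# argmax stops at the first occurrence of the maximum.
first_tied_threshold = example_thresholds[np.argmax(example_scores)]

# Keep every threshold that reaches the best score, then take the highest one.
tied = np.isclose(example_scores, best_score)
best_threshold = example_thresholds[tied].max()

print(best_score, first_tied_threshold, best_threshold)
```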

**Question 12.1.7** Once you have identified the optimal probability threshold, use it to identify anomalous nodes in the test data set. 


In [None]:
## ENTER CODE HERE TO CLASSIFY TEST SENSORS BASED ON THE
## THRESHOLD PROBABILITY IDENTIFIED ABOVE
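
Applying a chosen threshold is a single comparison. In the sketch below the probabilities and the threshold are invented, and the variable names are placeholders.

```python
import numpy as np

# Hypothetical probabilities for 6 nodes and a hypothetical threshold.
example_prob = np.array([0.08, 0.0004, 0.11, 0.09, 0.00002, 0.10])
example_threshold = 0.001

# Nodes whose probability falls below the threshold get a flag of 1.
example_flags = (example_prob < example_threshold).astype(int)

# Row indices of the flagged nodes, handy for looking them up later.
anomalous_nodes = np.flatnonzero(example_flags)

print(example_flags)
print(anomalous_nodes)
```

Using the same strict `<` comparison here as you used when choosing the threshold keeps the two steps consistent.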

**Question 12.1.8** Make a scatter plot of the test data and use the new labels you generated to color the markers.

In [None]:
## ENTER CODE HERE TO CREATE A SCATTER PLOT OF THE TEST DATA
## COLORED BY QUALITY FLAG

## 12.2 Anomaly Detection Using 11 Variables


In the previous example, you used a combination of two variables to detect anomalous node behavior in a hypothetical monitoring network. Consider another network consisting of 1100 nodes, each measuring 11 different variables. This time your "training" set consists of 100 pre-labeled nodes and your "test" set consists of the remaining 1000. Use the code below to load the necessary data.


In [None]:
from scipy.io import loadmat
data = loadmat('data/anomaly_practice_2.mat')
test_data = data['X']
train_data = data['Xval']
train_labels = data['yval']

**Question 12.2.1** Repeat the procedure from the 2-variable example above to: (1) characterize the multivariate normal distribution that describes the 11 variables of the test data set, (2) select the threshold probability that maximizes the F<sub>1</sub> score of the classification of the training data set, and (3) apply the threshold probability to detect anomalies in the test data set.


In [None]:
## ENTER CODE HERE TO COMPUTE 11 MEANS AND 11 VARIANCES

In [None]:
## ENTER CODE HERE TO CALCULATE THE PROBABILITY THAT THE TRAINING DATA
## BELONGS TO THE TEST DATA'S NORMAL DISTRIBUTION

## ENTER CODE HERE TO PLOT THE PROBABILITIES CALCULATED ABOVE

In [None]:
## ENTER CODE HERE TO CLASSIFY THE TRAINING SENSORS BASED ON DIFFERENT THRESHOLD PROBABILITIES

## ALSO INCLUDE CALCULATIONS OF TRUE POSITIVES, FALSE POSITIVES, FALSE NEGATIVES
## PRECISION, RECALL, AND F1 SCORES

In [None]:
## ENTER CODE HERE TO CALCULATE THE HIGHEST F1 SCORE
## AND DETERMINE THE CORRESPONDING THRESHOLD PROBABILITY

In [None]:
## ENTER CODE HERE TO CLASSIFY TEST SENSORS BASED ON THE
## THRESHOLD PROBABILITY IDENTIFIED ABOVE
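
The three steps of Question 12.2.1 can be strung together roughly as sketched below. Everything here is synthetic: the arrays are random stand-ins with the same shapes as the activity data, the ten "failed" nodes are created by shifting their readings, and using the observed training probabilities as candidate thresholds is just one possible choice.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins: 1000 unlabeled nodes and 100 labeled nodes, 11 variables each.
rng = np.random.default_rng(3)
example_test = rng.normal(size=(1000, 11))
example_train = rng.normal(size=(100, 11))
example_train_labels = np.zeros((100, 1), dtype=int)
example_train[:10] += 6.0          # make the first 10 labeled nodes clearly abnormal
example_train_labels[:10] = 1

# (1) Characterize the distribution of the unlabeled data (diagonal covariance).
example_dist = stats.multivariate_normal(example_test.mean(axis=0),
                                         example_test.var(axis=0))
train_prob = example_dist.pdf(example_train)
test_prob = example_dist.pdf(example_test)

# (2) Scan candidate thresholds in ascending order and keep the best F1 score.
actual = example_train_labels.ravel() == 1
best_f1, best_threshold = 0.0, None
for threshold in np.sort(train_prob):
    predicted = train_prob < threshold
    tp = np.sum(predicted & actual)
    if tp == 0:
        continue                   # skip thresholds with no true positives
    fp = np.sum(predicted & ~actual)
    fn = np.sum(~predicted & actual)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    if f1 >= best_f1:              # ">=" keeps the highest threshold among ties
        best_f1, best_threshold = f1, threshold

# (3) Apply the selected threshold to the unlabeled data.
test_flags = test_prob < best_threshold

print(best_f1, best_threshold, test_flags.sum())
```

With 11 variables the probabilities span many orders of magnitude, so evenly spaced thresholds between the minimum and maximum may skip over the interesting range. Scanning the observed values, or working with `logpdf`, are two ways you might handle that.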

**Discussion Questions.** Pair up with a partner and discuss the following:

1. When you visualized the anomalous nodes in your 2-variable test data set, were you satisfied with your anomaly detection algorithm? Why or why not?
    * How might you visualize the performance of your 11-variable classifier?
2. Compare your F<sub>1</sub> score, threshold probability, and number of anomalous nodes detected for both the 2-variable and 11-variable examples with those calculated by your partner. If there are differences, discuss some possible reasons why. If they are largely similar, discuss some assumptions or decisions you could have made differently that might have led to a different answer.
3. Consider the real, physical phenomena that this hypothetical scenario represents.
    * What might cause a node to simultaneously report an anomalous temperature and anomalous ozone reading?
    * What might cause a node to report 11 anomalous variables simultaneously? Does it matter what the other 9 variables are?
    * Outside of this hypothetical example, are there other types of sensor failure that your algorithm might miss? How might you propose to incorporate those scenarios into your algorithm?
4. When you performed the k-nearest-neighbor classification yesterday, you normalized the units of each parameter before calculating the distances between points. In today’s anomaly classification exercise, you did not normalize the units beforehand. Was this a justifiable approach? Why or why not?
