# Supervised Anomaly Detection and Unbalanced Data

The topic we'll look at over the next couple of weeks is anomaly detection. This is probably the most common and important application of machine learning tools to security. We'll start this week with supervised anomaly detection and continue next week with unsupervised anomaly detection. There are a huge number of anomaly detection techniques out there, many of them specialized to particular types of data, so all we can do here is scratch the surface.

As the name suggests, you can think of an _anomaly_ as some sort of highly unusual event occurring in your data that you wish to find (e.g. an attack on your network, a defective device, credit card fraud). A more useful definition is the following: an anomaly is a data sample that deviates so significantly from the other data samples as to suggest that it was generated by a different mechanism. In probability language, you can think of an anomaly as something that comes from a different distribution than the "real" data.
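To make the "different distribution" picture concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the two Gaussians, their means, the sample sizes, and the names `X_toy` and `y_toy` do not come from the spam data used below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: "real" data and anomalies come from two different Gaussians.
n_normal, n_anomalous = 980, 20
normal = rng.normal(loc=0.0, scale=1.0, size=(n_normal, 2))        # the usual mechanism
anomalous = rng.normal(loc=6.0, scale=1.0, size=(n_anomalous, 2))  # a different mechanism

# Stack both groups into one dataset, labeling non-anomalies 0 and anomalies 1.
X_toy = np.vstack([normal, anomalous])
y_toy = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_anomalous, dtype=int)])

# The per-class feature means differ because the generating distributions differ.
print("mean of class 0:", X_toy[y_toy == 0].mean(axis=0))
print("mean of class 1:", X_toy[y_toy == 1].mean(axis=0))
```

In real data the two distributions are rarely this cleanly separated, which is what makes the problem hard.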

If we happen to know which points in the dataset we're training on are anomalies, we can use supervised learning techniques (specifically binary classification) to build an anomaly detection model. By convention, people usually use the label "0" for non-anomalous samples and "1" for anomalies.

Supervised anomaly detection is a special case of the more general problem of _unbalanced data_, in which the number of samples differs significantly from one class to another. Until now we've worked with _balanced data_, where each class has roughly the same number of samples (e.g. with 100 samples, you'd have 50 with "0" labels and 50 with "1" labels, or close to that ratio). With unbalanced data the class counts can be highly skewed (e.g. with 100 samples, you might have 98 with "0" labels and 2 with "1" labels).
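As a quick sketch of the 98/2 example above, the snippet below counts the labels and measures how skewed they are. The array `y_example` and the "always predict 0" baseline are illustrative assumptions, not part of the dataset used later.

```python
import numpy as np

# Hypothetical label vector matching the 98/2 example: 98 zeros and 2 ones.
y_example = np.array([0] * 98 + [1] * 2)

counts = np.bincount(y_example)                  # samples per class, indexed by label
minority_fraction = counts.min() / counts.sum()  # share of the rarer class
imbalance_ratio = counts.max() / counts.min()    # majority-to-minority ratio

# A trivial model that always predicts "0" is right whenever the true label is 0.
baseline_accuracy = (y_example == 0).mean()

print(counts, minority_fraction, imbalance_ratio, baseline_accuracy)
```

The baseline scores 98% accuracy while detecting none of the anomalies, which is one reason unbalanced problems need more care than balanced ones.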

In the examples below we talk about various ways to deal with unbalanced data for a binary classification problem like supervised anomaly detection. Many of the techniques work just as well for multiclass problems (e.g. image classification). We begin by loading in the packages we'll use. The new one we'll use here is the [imbalanced-learn](http://imbalanced-learn.org/) library, aka `imblearn`. This library has many techniques for dealing with unbalanced data that we'll use.

In [17]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.datasets import make_imbalance
np.random.seed(123)

In [18]:
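def get_data(ratio=0.01):
    # Loads the spam dataset and downsamples the spam class (1) so that it has
    # round(ratio * N) samples, where N is the size of the full dataset.
    # Non-spam samples (0) are left untouched.
    df = pd.read_csv("http://www.apps.stat.vt.edu/leman/VTCourses/spam.data.txt", sep=' ')
    X_all = df.iloc[:, :-1]  # every column except the last is a feature
    y_all = df.iloc[:, -1]   # the last column is the 0/1 label
    # Newer imbalanced-learn releases use `sampling_strategy` in place of the older `ratio` argument.
    X, y = make_imbalance(X_all, y_all, sampling_strategy={1: round(len(y_all) * ratio)})
    return X, y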

def tsne_plot(X_tsne):
    plt.scatter(X_tsne[:,0],X_tsne[:,1],s=0.5,alpha=0.5)
    plt.title('t-SNE Plot of X')
    plt.show()

In [21]:
X,y = get_data(ratio=0.01)
X.shape,y.shape

((2834, 57), (2834,))

In [22]:
y[y==1].shape,y[y==0].shape

((46,), (2788,))
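As a small arithmetic check on the counts printed above, the sketch below computes the share of spam in the resampled data. The variable names are illustrative, and the two counts are copied from the output rather than recomputed.

```python
# Class counts copied from the output above (assumed, not recomputed here).
n_spam, n_non_spam = 46, 2788

total = n_spam + n_non_spam     # should match X.shape[0]
spam_fraction = n_spam / total  # share of anomalies in the resampled data

print(total)                    # 2834
print(f"{spam_fraction:.2%}")   # 1.62%
```

The spam share comes out a little above 1% because `get_data` sizes the spam class relative to the full dataset before the spam samples are removed, not relative to the resampled one.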