# Modern Data Science 
**(Module 01: A Touch of Data Science)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session H - Abnormality Analytics (I)

**The purpose of this session is to demonstrate:**

1. The characteristics of data in anomaly and novelty detection problems
2. How to implement standard anomaly and novelty detection algorithms

**References and additional reading and resources**
- [Novelty and outlier detection with scikit-learn](http://scikit-learn.org/stable/modules/outlier_detection.html)

---



# <span style="color:#0b486b">1. Challenges with anomaly and novelty detection datasets</span>

When dealing with anomaly and novelty detection problems, you usually encounter the following challenges:
1. High-dimensional data, which contains a relatively large number of attributes. The data points cannot be plotted directly to get a sense of the data.
2. Imbalanced data, whose class distribution is far from equal between the two classes: normal and abnormal. For example, in credit card fraud detection, fraudulent transactions usually make up only 5% to 10% of the entire dataset. As a result, standard classification algorithms usually perform poorly on anomaly and novelty detection problems.
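
The effect of class imbalance on standard evaluation can be illustrated with a minimal sketch. The labels below are hypothetical toy data (95 normal points, 5 anomalies), not the EMNIST data used later in this session. A predictor that always answers "normal" reaches high accuracy while detecting no anomalies at all.

```python
import numpy as np

# Hypothetical toy labels: 95 normal points (+1) and 5 anomalies (-1).
y_true = np.array([1] * 95 + [-1] * 5)

# A "classifier" that ignores its input and always predicts the normal class.
y_pred = np.ones_like(y_true)

accuracy = np.mean(y_pred == y_true)

# Recall on the anomaly class: the fraction of true anomalies that were flagged.
is_anomaly = y_true == -1
anomaly_recall = np.mean(y_pred[is_anomaly] == -1)

print("accuracy:", accuracy)              # 0.95
print("anomaly recall:", anomaly_recall)  # 0.0
```

This is why accuracy alone is not a reliable measure for this kind of problem.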

In this section, you will explore the following dataset:
- [EMNIST Dataset](https://www.nist.gov/itl/iad/image-group/emnist-dataset) is a set of handwritten digits and characters, each of which is a 28x28 pixel image. In this practical, the digits are treated as normal data, and a subsample of characters (about 10% of the data) is treated as abnormal data. A real-world application of this setting is handwritten form recognition for detecting phone numbers.

The code in the following sections lets you explore the characteristics of this dataset.

# <span style="color:#0b486b"> 2. Anomaly Detection Dataset: EMNIST </span> 

The *__EMNIST__* dataset used here contains many digit images and a few non-digit images. Our aim is to train a model on this dataset to detect the non-digit images. Such a model could be used to verify that a handwritten phone number is valid. The data is stored in CSV format: the first column holds the label of each data instance, and the remaining columns hold its feature vector.

You can now load the data and get some basic information about the dataset using the <code>info()</code> method.

In [None]:
!ls

In [None]:
import wget
    
link_to_data = 'https://github.com/tuliplab/mds/blob/master/Jupyter/data/emnist.digits_letters.small.csv?raw=true'
DataSet = wget.download(link_to_data) 

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv('emnist.digits_letters.small.csv',index_col=0)

In [None]:
df.info()


In [None]:
df = df.sort_values(['0'])  # for further visualization
df.info()

You can also have a look at the top five rows using the DataFrame’s <code>head()</code> method.

In [None]:
df.head()

To use *numpy* array methods, you can convert the data frame *df* to a numpy array object *data_array*.

In [None]:
data_array = df.to_numpy()  # as_matrix() was removed in pandas 1.0
x = data_array[:,1:]
y = data_array[:,0]

num_samples = x.shape[0]

print("Feature matrix for the first 5 images\n {}".format(x[:5,:]))
print("\nLabels for the first 5 images\n {}".format(y[:5]))

print("\nNumber of samples: {}".format(num_samples))


Since the dataset contains images, you can sample and plot some digit images (labeled as 1) from the dataset. Note that each feature vector is flattened to 1D, so you need to reshape it into a 28x28 matrix before using imshow to view the image.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 6)              # configure the size of images

num_subplots = 5                                      # the number of images plotted
fig, ax = plt.subplots(1,num_subplots)
for idx in range(num_subplots):
    n = np.random.randint(np.sum(y < 0), len(y))      # randomly choose an image index
    img1 = x[n,:].reshape((28,28)).T                  # reshape the vector into the image size 28x28
    ax[idx].imshow(img1, cmap = plt.get_cmap('gray')) # show the selected image
plt.show()

Similarly, you can sample and plot some non-digit images in the dataset.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 6)               # configure the size of images

num_subplots = 5                                       # the number of images plotted
fig, ax = plt.subplots(1,num_subplots)
for idx in range(num_subplots):
    n = np.random.randint(0, np.sum(y < 0))            # randomly choose an image index
    img1 = x[n,:].reshape((28,28)).T                   # reshape the vector into the image size 28x28
    ax[idx].imshow(img1, cmap = plt.get_cmap('gray'))  # show the selected image
plt.show()
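
The reshape step used in the plotting cells can be checked on a small array. This is a minimal sketch on a hypothetical 4x4 "image"; it assumes, for illustration only, that each stored vector is the transposed image flattened row by row, which would explain why the cells apply `.T` after `reshape`.

```python
import numpy as np

side = 4  # toy image size; the EMNIST images in this session are 28x28
image = np.arange(side * side).reshape(side, side)

# Assumption for illustration: the stored vector is the transposed image
# flattened row by row.
flat = image.T.flatten()

# Reshape back to a square matrix, then transpose to undo the storage order.
restored = flat.reshape((side, side)).T
print(np.array_equal(restored, image))  # True
```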

## Dataset Statistics

First, we examine the ratio between normal data (labeled as **'1'**) and abnormal data (labeled as **'-1'**). Intuitively, we can see it is an imbalanced dataset.

In [None]:
import matplotlib.pyplot as plt
(counts, _) = np.histogram(y,2)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar([0,1], counts)
classlabels=["abnormal","normal"]

rects =ax.patches

# Now make labels of percentage
labels = ['{:.3f}%'.format(i*100) for i in counts/np.sum(counts)]
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 2, label, ha='center', va='bottom')

plt.ylabel("Count")
plt.xticks(np.arange(2),classlabels)
plt.show()

You can also visualize the data in 2D using PCA. In particular, you project the data onto the space of the first two principal components and plot it using the scatter function.

In [None]:
from sklearn.decomposition import PCA

x_reduced = PCA(n_components=2).fit_transform(x) # reduce data dimension to 2

plt.figure(2, figsize=(8, 6))
plt.scatter(x_reduced[:, 0], x_reduced[:, 1], c=y, cmap='PiYG')  # plot 2-d data where each data point is decorated with its label.
plt.show()
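
To make the projection step more concrete, here is a minimal sketch of a two-component PCA computed directly with NumPy's singular value decomposition. The data is hypothetical random data, not EMNIST, and the result should match a library PCA only up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 points in 5 dimensions with very different spreads per axis.
data = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 0.5, 0.2, 0.1])

centered = data - data.mean(axis=0)   # PCA works on mean-centered data
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Rows of vt are the principal directions, ordered by decreasing variance.
projected = centered @ vt[:2].T       # coordinates in the first two components
explained = singular_values**2 / np.sum(singular_values**2)

print(projected.shape)
print("variance explained by 2 components:", explained[:2].sum())
```

When the first two components explain only a small share of the variance, the 2D scatter plot should be read with caution, since most of the structure is not visible in it.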

# <span style="color:#0b486b"> 3.  Anomaly Detection Systems using Classifier</span> 

The dataset is provided with categorical labels, so we can apply classification algorithms to learn from it and make predictions. Here we use *Logistic Regression* and *Naive Bayes* models, and report the performance of each method in terms of *accuracy, precision, recall, and F-measure*.

In [None]:
from sklearn import decomposition
pca = decomposition.PCA(n_components=100)
pca_X = pca.fit_transform(x)


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(pca_X, y, test_size = 0.3, random_state=2)
logistic = LogisticRegression()
logistic.fit(X_train, y_train)

y_prediction=logistic.predict(X_test )

rec=recall_score(y_test,y_prediction, average='macro')
pre=precision_score(y_test,y_prediction, average='macro')
acc=accuracy_score(y_test,y_prediction)
f1=f1_score(y_test,y_prediction, average='macro')
print("\t\t\tAccuracy\tPrecision\tRecall\t\tF-measure")
print("Logistic Regression\t{:f}\t{:f}\t{:f}\t{:f}".format(acc,pre,rec,f1))



from sklearn.naive_bayes import GaussianNB

model=GaussianNB()
model.fit( X_train, y_train )
y_prediction = model.predict( X_test )

rec=recall_score(y_test,y_prediction, average='macro')
pre=precision_score(y_test,y_prediction, average='macro')
acc=accuracy_score(y_test,y_prediction)
f1=f1_score(y_test,y_prediction, average='macro')
print("Naive Bayes\t\t{:f}\t{:f}\t{:f}\t{:f}".format(acc,pre,rec,f1))



We can observe that the accuracy is always high in this problem because of the nature of the data. If we label every data point as normal, the accuracy equals the proportion of the normal class, about 91%, which is similar to that of the Naive Bayes classifier. However, the other metrics are not good. We need a better class of algorithms to deal with imbalanced data in anomaly detection problems.
In weeks 3 and 4, we present three algorithms designed to cope with this problem:
- Two distance-based approaches: $DB(p,d)$ and $DB(k,N)$
- PCA-based approach.

In the following section, we introduce the two distance-based approaches: $DB(p,d)$ and $DB(k,N)$.

# <span style="color:#0b486b"> 4.  Anomaly Detection Systems with Specialized Design Algorithms </span> 

## 4.1 Distance-based approaches

### 4.1.1 $DB(p,d)$ algorithm 

In the $DB(p,d)$ algorithm, an object $o$ is an anomaly if at least a fraction $p$ of the objects in the dataset have distances greater than $d$ from $o$.

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances
from sklearn import metrics


Then, we calculate the pairwise distance matrix.

In [None]:
# calculate distance matrix
dist_matrix = pairwise_distances(x, metric='euclidean')
np.max(dist_matrix)

Now, we need to set the model hyper-parameters $d$ and $p$, and compute, for each data point, the proportion of objects in the dataset whose distance from it is greater than $d$. If this proportion is greater than $p$, we mark the point as an anomaly.

In [None]:
# set model hyper-parameters
d = 74.0
p = 0.009

dist_matrix_greater_d = dist_matrix > d
sum_dist_matrix_greater_d = np.sum(dist_matrix_greater_d, axis=1)
percent_greater_d = sum_dist_matrix_greater_d / (num_samples - 1)

y_predict = np.ones(num_samples)
y_predict[percent_greater_d > p] = -1

The number (or fraction) of objects lying farther than $d$ from a point serves as its anomaly score. The higher the score, the more likely the point is an anomaly.

In [None]:
import matplotlib.pyplot as plt

plt.figure(2, figsize=(12, 8))

data_idx = np.arange(num_samples)
idx_anomaly = data_idx[percent_greater_d > p]

plt.scatter(data_idx, sum_dist_matrix_greater_d,s=3)
plt.scatter(data_idx[percent_greater_d > p], sum_dist_matrix_greater_d[percent_greater_d > p],s=3)
threshold_line = np.ones(num_samples) * np.min(sum_dist_matrix_greater_d[idx_anomaly])
plt.plot(data_idx, threshold_line, color='green', linewidth=1.5)
plt.yscale('log')
plt.show()

In [None]:
print('Classification results:')
print(metrics.classification_report(y, y_predict))

confusion_mat = metrics.confusion_matrix(y, y_predict, labels=[1, -1])  # labels must be passed by keyword in recent scikit-learn
print('Confusion matrix')
df_confusion = pd.DataFrame(confusion_mat, columns=['Prediction Positive','Prediction Negative'])
df_confusion.index = ['Original Positive','Original Negative']
df_confusion
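
The $DB(p,d)$ rule can also be checked by hand on a tiny dataset. The following is a minimal sketch: the function name `db_pd_outliers` and the toy points are hypothetical, and the strict comparison `> p` mirrors the cell above rather than the "at least" wording of the definition.

```python
import numpy as np

def db_pd_outliers(points, p, d):
    """Return a boolean mask that is True where more than a fraction p
    of the other points lie farther than d from the point."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))          # pairwise Euclidean distances
    far_fraction = (dist > d).sum(axis=1) / (len(points) - 1)
    return far_fraction > p

# Four points in a tight cluster and one isolated point.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
mask = db_pd_outliers(toy, p=0.9, d=1.0)
print(mask.tolist())  # [False, False, False, False, True]
```

Each clustered point is far from only one of its four peers (a fraction of 0.25), while the isolated point is far from all four (a fraction of 1.0), so only the last point exceeds $p = 0.9$.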

### 4.1.2 $DB(k,N)$ algorithm 


Now, we implement the $DB(k,N)$ algorithm.
 - **Input**: $k$ (the number of nearest neighbours), $N$ (the number of anomalies) and the dataset
 - **Output**: anomalies in the dataset
 
First, we load the dataset.

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn import metrics

# load data
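df = pd.read_csv('emnist.digits_letters.small.csv',index_col=0)
df = df.sort_values(['0'])   # keep the same row order as in Section 2
data_array = df.to_numpy()   # as_matrix() was removed in pandas 1.0
x = data_array[:,1:]
y = data_array[:,0]          # labels aligned with the rows of x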

num_samples = x.shape[0]
print('Number of samples:', num_samples)

In the $DB(k,N)$ algorithm, an object $o$ is an anomaly if it is among the top $N$ data instances whose distances to their $k$ nearest neighbours are the largest. Now, we need to set the model hyper-parameters $k$ and $N$. Then, for each data instance, we find its $k$ nearest neighbours.

In [None]:
# set model hyper-parameters
k = 10
N = 50
# find k-NN
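# ask for k+1 neighbours because each point is returned as its own nearest neighbour
nbrs = NearestNeighbors(n_neighbors=k+1, algorithm='ball_tree').fit(x)
distances, indices = nbrs.kneighbors(x)
avg_distances = np.average(distances[:, 1:], axis=1)  # drop the zero self-distance in column 0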

The average distance to the $k$ nearest neighbours is used as the anomaly score. The higher the score, the more likely the instance is an anomaly. We mark the top $N$ data instances that have the largest average distance to their $k$ nearest neighbours.

In [None]:
# get top N far from neighbours
plt.figure(2, figsize=(12, 8))


idx_anomaly = avg_distances.argsort()[-N:][::-1]
plt.scatter(np.arange(num_samples),avg_distances,s=3)
plt.scatter(idx_anomaly,avg_distances[idx_anomaly],color='red',s=3)
threshold_line = np.ones(num_samples) * np.min(avg_distances[idx_anomaly])
plt.plot(np.arange(num_samples), threshold_line, color='green',linewidth=1)
plt.show()

Finally, we report the prediction performance.

In [None]:
y_predict = np.ones(num_samples)
y_predict[idx_anomaly] = -1
print('Classification results:')
print(metrics.classification_report(y, y_predict))

confusion_mat = metrics.confusion_matrix(y, y_predict, labels=[1, -1])  # labels must be passed by keyword in recent scikit-learn
print('Confusion matrix')
df_confusion = pd.DataFrame(confusion_mat, columns=['Prediction Positive','Prediction Negative'])
df_confusion.index = ['Original Positive','Original Negative']
df_confusion
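
As with $DB(p,d)$, the $DB(k,N)$ rule can be traced on a tiny dataset. The sketch below uses brute-force distances in NumPy instead of `NearestNeighbors`; the function name `db_kn_outliers` and the toy points are hypothetical.

```python
import numpy as np

def db_kn_outliers(points, k, n_outliers):
    """Return the indices of the n_outliers points with the largest average
    distance to their k nearest neighbours, together with all scores."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    knn_dist = np.sort(dist, axis=1)[:, :k]      # distances to the k nearest neighbours
    scores = knn_dist.mean(axis=1)               # average k-NN distance as anomaly score
    top = np.argsort(scores)[-n_outliers:][::-1] # largest scores first
    return top, scores

# Four points in a tight cluster and one isolated point.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
top, scores = db_kn_outliers(toy, k=2, n_outliers=1)
print(top.tolist())  # [4]
```

Unlike $DB(p,d)$, this rule always returns exactly $N$ points, so $N$ should reflect how many anomalies you expect in the data.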

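We can also plot the ROC curve. Anomalies (label -1) are treated as the positive class because a larger average distance indicates an anomaly.

In [None]:
from sklearn import metrics
FPR, TPR, _ = metrics.roc_curve(y, avg_distances, pos_label=-1)
plt.plot(FPR, TPR)  # false positive rate on the x-axis, true positive rate on the y-axis
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()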
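
The ROC curve can be summarised by the area under it (AUC). One way to read the AUC is as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal point. The sketch below computes it by direct pairwise comparison on hypothetical toy scores; the helper name `auc_anomaly_positive` is an assumption and this is not a scikit-learn function.

```python
import numpy as np

def auc_anomaly_positive(y_true, scores):
    """AUC as the probability that a random anomaly (-1) scores higher
    than a random normal point (+1); ties count as one half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == -1]   # scores of the anomalies
    neg = scores[y_true == 1]    # scores of the normal points
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y_toy = np.array([1, 1, 1, 1, -1, -1])
s_toy = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9])
auc = auc_anomaly_positive(y_toy, s_toy)
print(auc)  # 0.875
```

Seven of the eight anomaly/normal pairs are ranked correctly, which gives 7/8 = 0.875. The pairwise comparison needs memory proportional to the number of pairs, so it is only suitable for small examples.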

# <span style="color:#0b486b"> 5. Practical Coding Exercises</span>

1. In Section 3, can you report the performance of some other classification algorithms, such as [K-NN](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)?
2. Try varying the $p$ and $d$ values in the $DB(p,d)$ algorithm and the $k$ and $N$ values in the $DB(k,N)$ algorithm, and report the best values for each algorithm in terms of F-measure.
3. We provide a subset of the [Statlog (German Credit Data) Data Set](https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) in **german.csv**. Explore the data statistics as in Section 2, apply the classifiers from Section 3 and the distance-based anomaly detection algorithms from Section 4, and report the results.
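
As a possible starting point for exercise 2, the sketch below shows one way to organise a parameter sweep scored by F-measure. It runs on hypothetical synthetic data, not EMNIST; the helper `f1_for_anomalies` and the candidate values for $d$ and $p$ are assumptions chosen for illustration.

```python
import numpy as np

def f1_for_anomalies(y_true, y_pred):
    """F-measure with the anomaly class (-1) treated as positive."""
    tp = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == 1) & (y_pred == -1))
    fn = np.sum((y_true == -1) & (y_pred == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
# Hypothetical data: 95 normal points near the origin, 5 anomalies far away.
normal = rng.normal(0.0, 1.0, size=(95, 2))
anomalies = rng.normal(8.0, 0.5, size=(5, 2))
points = np.vstack([normal, anomalies])
y_true = np.array([1] * 95 + [-1] * 5)

# Pairwise Euclidean distances, computed once and reused for every setting.
dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))

results = {}
for d in [2.0, 4.0, 6.0]:
    far_fraction = (dist > d).sum(axis=1) / (len(points) - 1)
    for p in [0.5, 0.9]:
        y_pred = np.where(far_fraction > p, -1, 1)   # DB(p,d) rule
        results[(d, p)] = f1_for_anomalies(y_true, y_pred)

best = max(results, key=results.get)
print("best (d, p):", best, "F-measure:", results[best])
```

Selecting hyper-parameters on the same labels used for the final report gives an optimistic estimate, so a held-out split is worth considering when you apply this to the real data.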