# B. Tree-based Method for Anomaly Detection

Goals of this workbook

1. Know when to apply tree-based anomaly detection
2. Understand how an isolation forest differs from a standard random forest
3. Know how to apply it
4. Try a few detection methods on several data sets

In [1]:
# dataframe / analysis tools
import pandas as pd
import numpy as np
import scipy as sp
from scipy.io import loadmat
from sklearn.metrics import confusion_matrix, classification_report
from scipy.stats import multivariate_normal, uniform
# plotting
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Helpful functions
def normalize(x):
    mu = np.mean(x,axis=0)
    sigma = np.std(x,axis=0)
    return (x - mu)/sigma

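def ad_plot(x, y, mask):
    # Density contours of all points, with inliers in green and flagged points in red
    plt.figure(figsize=(16,8))
    # Recent seaborn versions require x and y to be passed as keywords
    sns.kdeplot(x=x, y=y)
    sns.regplot(x=x[~mask], y=y[~mask], fit_reg=False, color='g', scatter_kws={'alpha':0.25})
    sns.regplot(x=x[mask], y=y[mask], fit_reg=False, color='red')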
    
def estimate_gaussian(x):
    # Step 1: Normalize
    x = normalize(x)
    # Step 2: Use sp.stats.norm.pdf on results above
    x = x.apply(lambda v: sp.stats.norm.pdf(v),1)
    # Step 3: get Probabilities of Feature_1 x Features_2 x .... Feature_n
    return x.apply(np.prod,1)

def get_mu_sig(x):
    mu = np.mean(x, axis=0)
    sigma = np.cov(x.T)
    return mu, sigma

def multivariateGaussian(x,mu,sigma):
    p = multivariate_normal(mean=mu, cov=sigma)
    return p.pdf(x)
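
For reference, the two density models evaluated by the helpers above can be written as follows. Here $n$ is the number of features, $\mu_j$ and $\sigma_j$ are the per-feature mean and standard deviation, and $\mu$, $\Sigma$ are the mean vector and covariance matrix returned by `get_mu_sig`. Note that `estimate_gaussian` normalizes first and evaluates the standard normal density on the z-scores, so its result differs from the first expression only by the constant factor $\prod_j \sigma_j$, which does not change the ranking of points.

```latex
% Independent (per-feature) Gaussian model
p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j}
       \exp\!\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)

% Multivariate Gaussian model
p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
       \exp\!\left( -\tfrac{1}{2}\,(x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```

The first model ignores correlations between features; the second captures them through the off-diagonal entries of $\Sigma$.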

## Datasets
[Breast Cancer Wisconsin (Diagnostic)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.


[MUSK](https://archive.ics.uci.edu/ml/datasets/Musk+%28Version+2%29)

This dataset describes a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. The goal is to learn to predict whether new molecules will be musks or non-musks. However, the 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule. Because bonds can rotate, a single molecule can adopt many different shapes. To generate this data set, all the low-energy conformations of the molecules were generated to produce 6,598 conformations. Then, a feature vector was extracted that describes each conformation. 


[Lymphography](http://odds.cs.stonybrook.edu/lympho/)

*missing info*


In [59]:
# df_musk = loadmat('./data/musk.mat')
df_wbc = loadmat('./data/wbc.mat')
# df_lymph = loadmat('./data/lympho.mat')

In [60]:
df_wbc.keys()

dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])

In [61]:
print(df_wbc['X'].shape)
print(df_wbc['y'].shape)
df_in = np.concatenate((np.array(df_wbc['X']), np.array(df_wbc['y'])), axis = 1)

(378, 30)
(378, 1)


In [62]:
df_in = pd.DataFrame(df_in)
df_in[0:30]

,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,0.310426,0.157254,0.301776,0.179343,0.407692,0.189896,0.156139,0.237624,0.416667,0.162174,...,0.192964,0.24548,0.129276,0.480948,0.14554,0.190895,0.442612,0.278336,0.115112,0.0
1,0.288655,0.202908,0.28913,0.159703,0.495351,0.330102,0.107029,0.154573,0.458081,0.382266,...,0.225746,0.227501,0.109443,0.396421,0.242852,0.150958,0.250275,0.319141,0.175718,0.0
2,0.119409,0.092323,0.114367,0.055313,0.449309,0.139685,0.06926,0.103181,0.381313,0.402064,...,0.097015,0.07331,0.031877,0.404345,0.084903,0.070823,0.213986,0.174453,0.148826,0.0
3,0.286289,0.294555,0.268261,0.161315,0.335831,0.05607,0.060028,0.145278,0.205556,0.182603,...,0.28758,0.16958,0.08865,0.17064,0.018337,0.038602,0.172268,0.083185,0.043618,0.0
4,0.057504,0.241123,0.05473,0.024772,0.301255,0.122845,0.037207,0.029409,0.358081,0.317397,...,0.264925,0.034115,0.014009,0.386515,0.10518,0.054952,0.08811,0.303568,0.124951,0.0
5,0.239907,0.166385,0.23668,0.129714,0.455629,0.219434,0.154452,0.13663,0.310606,0.220514,...,0.231343,0.196574,0.09767,0.516608,0.182699,0.24361,0.225017,0.232998,0.183458,0.0
6,0.30806,0.425769,0.297975,0.177094,0.314977,0.176676,0.111317,0.168191,0.378283,0.152064,...,0.527719,0.241994,0.126229,0.297365,0.139525,0.182268,0.44055,0.257441,0.09268,0.0
7,0.226182,0.402097,0.213738,0.120636,0.304595,0.092878,0.038824,0.055417,0.219697,0.187869,...,0.365139,0.162209,0.081424,0.246517,0.057106,0.044113,0.127663,0.171102,0.069461,0.0
8,0.315159,0.224214,0.300048,0.181676,0.218651,0.126403,0.04351,0.085636,0.14798,0.201559,...,0.297708,0.227452,0.115882,0.249158,0.12701,0.083866,0.295052,0.153952,0.165355,0.0
9,0.234701,0.288468,0.220579,0.124751,0.270651,0.086283,0.046204,0.067048,0.408081,0.234625,...,0.248134,0.165646,0.084054,0.285478,0.05993,0.073506,0.216357,0.240489,0.124885,0.0


## Your turn:
Apply the earlier methods (the Gaussian and multivariate Gaussian models) to this data. Look at the shape of each feature's distribution and transform it toward normal if needed. Use scatter or density plots to explore your data.
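
One way to check whether a feature needs correcting is to look at its skewness before and after a transform. The sketch below is only an illustration: it uses a synthetic right-skewed feature (not the workbook data), and `np.log1p` is just one possible transform, assumed here because the feature is non-negative.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic stand-in for one right-skewed, non-negative feature (an assumption)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

skew_before = skew(feature)

# log1p computes log(1 + x), so it stays defined when x is 0
transformed = np.log1p(feature)
skew_after = skew(transformed)

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to 0 suggests a more symmetric distribution; a histogram or density plot of each version is still worth a look before settling on a transform.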


Guiding Questions
1. Do these models work well? 
2. If not, why not?
3. Try working with an isolation forest model. Does it work well? Why or why not?

In [None]:
# pca to reduce dimension
# and then I can plot it
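
The idea in the comments above could be sketched as follows. This is a minimal example under assumptions: `X_demo` is random data with the same shape as `df_wbc['X']`, used only so the snippet runs on its own, and scaling with `StandardScaler` before PCA is a choice made here, not something the workbook prescribes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in with the same shape as df_wbc['X'] (378 rows, 30 features)
X_demo = rng.normal(size=(378, 30))

# Put every feature on the same scale so no single one dominates the components
X_scaled = StandardScaler().fit_transform(X_demo)

# Project onto the two directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

With the real data, `X_2d[:, 0]` and `X_2d[:, 1]` can be passed to a scatter plot and colored by the label in `df_wbc['y']` to see whether the anomalies separate visually.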

In [69]:
X = normalize(df_wbc['X'])
y = df_wbc['y']

mu, sig = get_mu_sig(X)
pred = multivariateGaussian(X, mu, sig)


In [116]:

threshold = 1/10000000000000000
# Flag a point as anomalous (1.0) when its density falls below the threshold
y_pred = (pred < threshold).astype(float)
y_true = y.flatten()


print(confusion_matrix(y_true=y_true, y_pred=y_pred))
print('  ')
print('  ')
print(classification_report(y_pred= y_pred, y_true= y_true))

[[344  13]
 [ 12   9]]
  
  
             precision    recall  f1-score   support

        0.0       0.97      0.96      0.96       357
        1.0       0.41      0.43      0.42        21

avg / total       0.94      0.93      0.93       378



## Isolation Forest:
- Tree-based method: like a random forest, each tree is built on a random subsample of the data and splits on randomly chosen features, so it scales well to large, high-dimensional data sets. Unlike a random forest, it is unsupervised: splits are random and do not use labels.
- Not cluster- or density-based.
- Works by recursively partitioning the data set with random splits.
- The more partitions needed to isolate a data point, the more "normal" it is. The fewer needed, the more anomalous it is.
- Each point gets an "anomaly score", derived from the number of splits needed to isolate it, averaged across all trees. Points can then be ranked by this score.
- Whereas the Gaussian and multivariate Gaussian models give relatively simple decision boundaries, an iForest can build more complex ones.
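
For reference, the anomaly score described above is defined in the original Isolation Forest paper (Liu, Ting and Zhou, 2008) as follows, where $h(x)$ is the path length of point $x$ in one tree, $E[h(x)]$ is its average over all trees, and $n$ is the number of points used to build each tree:

```latex
s(x, n) = 2^{-\,E[h(x)]\,/\,c(n)},
\qquad
c(n) = 2\,H(n-1) - \frac{2\,(n-1)}{n},
\qquad
H(i) \approx \ln(i) + 0.5772156649
```

Here $c(n)$ is the average path length of an unsuccessful search in a binary search tree, used to normalize $h(x)$. Scores close to 1 indicate anomalies and scores well below 0.5 indicate normal points. Note that scikit-learn's `score_samples` returns the negative of this quantity, so there lower values are more anomalous.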

### Isolation Trees Make up an Isolation Forest
![title](http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2016/AY/c6ay01574c/c6ay01574c-f1_hi-res.gif)


### In an iTree, how many nodes does it take to isolate?

![title](https://image.slidesharecdn.com/gerster-anomalyv4-150730232953-lva1-app6892/95/anomaly-detection-using-isolation-forests-10-638.jpg)
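
To get a feel for the question above, the sketch below counts how many random splits it takes to isolate a single point. It is a simplified illustration, not the scikit-learn implementation: it follows only the branch containing the target point and does no subsampling. The function `isolation_depth` and the synthetic data are hypothetical.

```python
import numpy as np

def isolation_depth(X, target, rng, max_steps=1000):
    """Count the random splits needed until X[target] is alone in its partition."""
    idx = np.arange(len(X))          # points still sharing a partition with the target
    depth = 0
    for _ in range(max_steps):
        if len(idx) <= 1:
            break
        feature = rng.integers(X.shape[1])       # pick a random feature
        values = X[idx, feature]
        lo, hi = values.min(), values.max()
        if lo == hi:                             # cannot split on a constant feature
            continue
        split = rng.uniform(lo, hi)              # pick a random split value
        # Keep only the points that fall on the same side as the target
        keep = (values < split) == (X[target, feature] < split)
        idx = idx[keep]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(200, 2))
X_demo = np.vstack([cluster, [[6.0, 6.0]]])      # one obvious outlier at (6, 6)

outlier = len(X_demo) - 1
inlier = int(np.argmin(np.linalg.norm(cluster, axis=1)))  # point nearest the center

n_trees = 200
outlier_depth = np.mean([isolation_depth(X_demo, outlier, rng) for _ in range(n_trees)])
inlier_depth = np.mean([isolation_depth(X_demo, inlier, rng) for _ in range(n_trees)])

print(f"average splits to isolate the outlier: {outlier_depth:.1f}")
print(f"average splits to isolate the inlier:  {inlier_depth:.1f}")
```

The outlier should need noticeably fewer splits on average, which is exactly the signal an isolation forest turns into an anomaly score.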

In [133]:
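from sklearn.ensemble import IsolationForest

# Fit on the first part of the data, evaluate on the rest.
# Note: the slice starts at 1, so row 0 is used in neither set.
X = df_wbc['X'][1:200]
y = df_wbc['y'][1:200]

clf = IsolationForest(n_estimators=500, n_jobs=-1, contamination=.03, random_state=1)
clf.fit(X)  # unsupervised: labels are not used for fitting

# predict() returns -1 for anomalies and 1 for inliers
y_pred = (clf.predict(df_wbc['X'][200:378]) == -1).astype(int)
y_true = df_wbc['y'][200:378].ravel().astype(int)

print(confusion_matrix(y_true=y_true, y_pred=y_pred))
print('')
# classification_report expects y_true first, then y_pred
print(classification_report(y_true=y_true, y_pred=y_pred))


[[153   4]
 [  7  14]]

             precision    recall  f1-score   support

          0       0.96      0.97      0.97       157
          1       0.78      0.67      0.72        21

avg / total       0.94      0.94      0.94       178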
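
Instead of relying on `predict` and a fixed `contamination`, the continuous scores can be inspected directly. The sketch below is an illustration on synthetic data (the arrays and the choice of `k` are assumptions); `score_samples` is the scikit-learn method that returns a score per point, where lower means more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in: a dense cluster plus a few widely scattered points
inliers = rng.normal(0.0, 1.0, size=(300, 5))
outliers = rng.uniform(-8.0, 8.0, size=(15, 5))
X_demo = np.vstack([inliers, outliers])
y_demo = np.r_[np.zeros(300, dtype=int), np.ones(15, dtype=int)]

clf = IsolationForest(n_estimators=200, random_state=1).fit(X_demo)
scores = clf.score_samples(X_demo)   # lower score = more anomalous

# Flag the k lowest-scoring points rather than using a contamination rate
k = 15
flagged = np.argsort(scores)[:k]
hits = int(y_demo[flagged].sum())

print(f"{hits} of the {k} lowest-scoring points are true outliers")
```

Ranking by score makes it easy to try several cut-offs and compare precision and recall at each, rather than committing to one contamination value up front.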



In [118]:
y.flatten().sum()

21.0