paper: http://neuro.bstu.by/ai/To-dom/My_research/Papers-0/For-research/D-mining/Anomaly-D/KDD-cup-99/NN/dawak02.pdf

In [1]:
import pickle
import pandas as pd
from collections import defaultdict

from keras.models import Model
from keras.layers import Dense, Input
from keras.callbacks import EarlyStopping

from sklearn.metrics import mean_squared_error

from utils import analyze_outlier

Using TensorFlow backend.


In [2]:
with open("input/breast_cancerX", "rb") as f:
    X = pickle.load(f)
with open("input/breast_cancerY", "rb") as f:
    Y = pickle.load(f)

In [3]:
X.shape

(387, 30)

In [4]:
input_dim = X.shape[1]

In [5]:
early_stop = EarlyStopping(monitor="loss", min_delta=0, patience=5, mode="auto")

In [6]:
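def get_model():
    # Symmetric autoencoder: input_dim -> input_dim//2 -> input_dim//4 -> input_dim//2 -> input_dim
    inp = Input(shape=(input_dim, ))
    x = Dense(input_dim//2, activation="tanh")(inp)
    x = Dense(input_dim//4, activation="tanh")(x)
    x = Dense(input_dim//2, activation="tanh")(x)
    # Linear output so reconstructions are not squashed into tanh's (-1, 1) range
    outp = Dense(input_dim, activation="linear")(x)

    # The model is trained to reproduce its input, so the loss is the reconstruction MSE
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss="mse", optimizer="adam")
    return model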

In [7]:
model = get_model()
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 30)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 15)                465       
_________________________________________________________________
dense_2 (Dense)              (None, 7)                 112       
_________________________________________________________________
dense_3 (Dense)              (None, 15)                120       
_________________________________________________________________
dense_4 (Dense)              (None, 30)                480       
=================================================================
Total params: 1,177
Trainable params: 1,177
Non-trainable params: 0
_________________________________________________________________
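
The parameter counts in the summary can be checked by hand: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. The sketch below redoes that arithmetic for the layer widths used in get_model(); the helper name dense_params is hypothetical and is not part of Keras or of this notebook.

```python
def dense_params(n_in, n_out):
    # Hypothetical helper: weight matrix (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

input_dim = 30
# Layer widths as built in get_model(): 30 -> 15 -> 7 -> 15 -> 30
widths = [input_dim, input_dim // 2, input_dim // 4, input_dim // 2, input_dim]

# One count per Dense layer, pairing each width with the next one
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)

print(per_layer, total)  # [465, 112, 120, 480] 1177
```

The four per-layer values and their sum match the Param # column and the "Total params" line.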


In [8]:
hist = model.fit(X, X, epochs=500, callbacks=[early_stop], verbose=0)
final_loss = hist.history["loss"][-1]
print("final loss: {}".format(final_loss))

final loss: 19293.83739098837


In [9]:
pred = model.predict(X)

In [10]:
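def calculate_outlier_factor(X, Y, pred):
    """Map each row index to its outlier factor (OF) and its label.

    OF is the mean squared reconstruction error between X[i] and pred[i].
    """
    outlier_factors = defaultdict(dict)
    for i in range(X.shape[0]):
        outlier_factors[i]["OF"] = mean_squared_error(X[i], pred[i])
        outlier_factors[i]["Y"] = Y[i]
    return outlier_factors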
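
The per-row loop above computes a mean squared error for one sample at a time. A minimal vectorized sketch of the same quantity is shown below, assuming X and pred are 2-D NumPy arrays of identical shape; the function name outlier_factor_vectorized and the demo arrays are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def outlier_factor_vectorized(X, pred):
    # Assumption: X and pred are array-like with shape (n_samples, n_features)
    X = np.asarray(X, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Mean of squared reconstruction errors across features, one value per row
    return np.mean((X - pred) ** 2, axis=1)

# Tiny made-up example: the first row is reconstructed exactly, the second is not
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
pred_demo = np.array([[1.0, 2.0], [1.0, 0.0]])
of_demo = outlier_factor_vectorized(X_demo, pred_demo)
print(of_demo)  # [ 0. 10.]
```

This returns a plain array rather than the dict of dicts used in the notebook, so it would need to be paired with Y separately before building the DataFrame.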

In [11]:
outlier_factors = calculate_outlier_factor(X, Y, pred)

In [13]:
df = pd.DataFrame.from_dict(outlier_factors, orient="index")

In [14]:
analyze_outlier(df)

Within the top 30 ranked cases (ranked according to the Outlier Factor), 23 of the malignant cases (the outliers), comprising 76.66666666666667% of all malignant cases, were identified.
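
The source of analyze_outlier (imported from utils) is not shown here. Judging only from the message it prints, it ranks rows by OF and counts how many known outliers fall in the top 30. The sketch below is one way such a check could be written, not the actual implementation: the function name top_k_hits, the assumption that Y == 1 marks a malignant case, and the demo frame are all hypothetical.

```python
import pandas as pd

def top_k_hits(df, k=30, outlier_label=1):
    # Assumption: df has an "OF" column (outlier factor) and a "Y" column (label),
    # and outlier_label is the value of Y that marks an outlier
    top = df.sort_values("OF", ascending=False).head(k)
    hits = int((top["Y"] == outlier_label).sum())
    total = int((df["Y"] == outlier_label).sum())
    # Fraction of all labelled outliers that appear among the top k rows
    recall = hits / total if total else float("nan")
    return hits, recall

# Made-up data: two outliers (Y == 1), only one of which has a high OF
demo = pd.DataFrame({"OF": [0.1, 5.0, 0.2, 3.0, 0.05], "Y": [0, 1, 0, 0, 1]})
hits, recall = top_k_hits(demo, k=2)
print(hits, recall)  # 1 0.5
```

When k equals the number of labelled outliers, as the reported 23 and 76.67% suggest it does here, the fraction of outliers recovered and the fraction of top-k rows that are outliers coincide.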
