# Anomaly Detection Homework

This notebook is for anomaly detection homework of Applied AI Week 4. The dataset is given with [this link](https://drive.google.com/file/d/1cZGOZu_zdKLXnH-Ap1w9SMffYXZqa2Ot/view?usp=sharing). If you are having problems with the link, contact with me: safak@inzva.com

## Dataset Description
"KDD CUP 99 data set is used mainly to analyze the different
attacks. It consists of nearly 4,900,000 samples with 41
features and each sample is classified as either normal or
attack" [explanation from this source](https://www.ripublication.com/ijaer18/ijaerv13n7_81.pdf)

## Task Description

The dataset is prepared and preprocessed for anomaly detection task, the dataset contains "Probe" and "Normal" targets. "Probe" is anomaly, "Normal" is normal. 

**You are supposed to build a anomaly detection model** with **Vanilla Autoencoder**, **Variational Autoencoder** and **Denoising Autoencoder**. However you are not restricted by autoencoer, you can implement a fancy state-of-the-art ensemble 1000B parameter model. It is really up to you. 

We don't really want you to do sloppy homework.

The variable descriptions:

- train set: kdd_train_probe
- validation set (for hyperparam tuning): kdd_valid_probe
- test set: kdd_test_v2_probe

## What will you report?
Report your average macro f1 score on test set:

```python
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred, average = "macro")
print(f1)
```


# Preparation (do not edit this part)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import numpy as np
import warnings
from pandas.core.common import SettingWithCopyWarning

import torch.nn as nn
import torch
import sys
from torch.utils.data import DataLoader, Dataset
from collections import defaultdict
from tqdm.auto import tqdm

import seaborn as sns
from pylab import rcParams
import torch.nn.functional as F

from sklearn.metrics import f1_score, accuracy_score, classification_report

In [3]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style="whitegrid", palette="muted", font_scale=1.2)
HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

rcParams['figure.figsize'] = 10, 4

In [4]:
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

kdd = pd.read_csv('/content/drive/MyDrive/kdd.csv')
kdd = kdd.iloc[:,1:43]
kdd = kdd.drop(['Protocol Type', 'Service', 'Flag'], axis = 1)

kdd_train = kdd.iloc[0:102563, :]
kdd_test = kdd.iloc[102563:183737, :]

kdd_train_probe = kdd_train[(kdd_train.Type_Groups == 'Normal') | (kdd_train.Type_Groups == 'Probe')]
kdd_test_probe = kdd_test[(kdd_test.Type_Groups == 'Normal') | (kdd_test.Type_Groups == 'Probe')]

kdd_train_probe['Type_Groups'] = np.where(kdd_train_probe['Type_Groups'] == 'Normal', 0, 1)
kdd_test_probe['Type_Groups'] = np.where(kdd_test_probe['Type_Groups'] == 'Normal', 0, 1)

kdd_valid_probe = kdd_test_probe.iloc[14000:34000,:]
kdd_test_v2_probe = pd.concat([kdd_test_probe.iloc[0:14000,:], kdd_test_probe.iloc[34001:64759,:]])


# classify anomalies and normals
# train set: kdd_train_probe
# validation set (for hyperparam tuning): kdd_valid_probe
# test set: kdd_test_v2_probe
# avg. macro f1 score on test set

## Pytorch DataLoaders

In [5]:
# NORMAL: class label 0
# ANOMALY: class label 1

class TabularDataset(Dataset):
    def __init__(self, df):
        super(TabularDataset, self).__init__()
        self.df = df
    
    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        data = self.df.iloc[idx, :-1].to_numpy()
        return {
            "samples": torch.Tensor(data)
        }
    
class TabularDatasetTest(Dataset):
    def __init__(self, df):
        super(TabularDatasetTest, self).__init__()
        self.df = df
    
    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        data = self.df.iloc[idx, :-1].to_numpy()
        label = self.df.iloc[idx, -1]
        return {
            "samples": torch.Tensor(data),
            "labels": torch.tensor(label)
        }


BATCH_SIZE = 128

train_normal = kdd_train_probe[kdd_train_probe.Type_Groups == 0]
val_normal = kdd_valid_probe[kdd_valid_probe.Type_Groups == 0]
test_normal = kdd_test_v2_probe[kdd_test_v2_probe.Type_Groups == 0]


train_data = TabularDataset(train_normal)
val_data = TabularDataset(val_normal)
test_data_all = TabularDatasetTest(kdd_test_v2_probe)

# train_dataloader: For training autoencoder. Contains only normal samples
# val_dataloader: For evaluating autoencoder at training phase.
#                 then use it for tune the threshold value.
#                 N.B: setting batch size of 1 at threshold finding phase should be more reasonable:
#                 DataLoader(val_data, shuffle = False, batch_size = 1)
#
# test_all_dataloader: Contains all test samples (anomalies and normals). Use it for
#                      calculating your metrics

# N.B.: finding a threshold value is challenging. iterating all val_dataloader and calculating
#       metrics over it works but it is expensive computationally.

train_dataloader = DataLoader(train_data, shuffle = True, batch_size = BATCH_SIZE)
val_dataloader = DataLoader(val_data, shuffle = False, batch_size = BATCH_SIZE)
test_all_dataloader = DataLoader(test_data_all, shuffle = False, batch_size = 1)

# VAE

In [6]:
# VAE implementation in PyTorch

class LinearVAE(nn.Module):
    def __init__(self, n_features, latent_dim):
        super(LinearVAE, self).__init__()
        self.n_features = n_features

        self.encoder = nn.Sequential(
            nn.Linear(n_features, 20),
            nn.Tanh()
        )

        self.encoder2mean = nn.Linear(20, latent_dim)
        self.encoder2logvar = nn.Linear(20, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 20),
            nn.ReLU(),
            nn.Linear(20, n_features)
        )
    
    def forward(self, x):
        bs = x.size(0)
        out = self.encoder(x)
        mu = self.encoder2mean(out)
        log_var = self.encoder2logvar(out)
        z = self.reparameterize(mu, log_var)
        out = self.decoder(z)
        return out, mu, log_var
        
    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5*log_var)
        eps = torch.randn_like(std)
        sample = mu + (eps * std)
        return sample

In [7]:
def vae_loss(recon_x, x, mu, log_var, criterion):
    variational_beta = 1
    recon_loss = criterion(recon_x, x)
    kldivergence = (-0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())) / x.size(0)
    return recon_loss + variational_beta * kldivergence

In [8]:
def evaluate(model, criterion, val_dataloader,device):
    total = len(val_dataloader)
    loss_a = []
    val_loss = 0
    val_batch_loss = 0
    val_num_batches = 0
    for step, batch in enumerate(val_dataloader):
      model.eval()
      val_batch_loss += 1
      batch, = tuple(t.to(device) for t in batch.values())

      with torch.no_grad():
          out, mu, log_var = model.forward(batch)
      
      loss = vae_loss(out, batch, mu, log_var, criterion)

      loss_a.append(loss.detach().cpu().numpy())
    return np.mean(loss_a)

In [19]:
def train(model, optimizer, criterion, train_dataloader, val_dataloader, device, num_epochs):
    model = model.to(device)
    total = len(train_dataloader) * num_epochs
    best_loss = 0
    print('Training ...')
    for epoch in range(num_epochs):
        total_loss = 0
        batch_loss = 0
        num_batches = 0
        
        for step, batch in enumerate(train_dataloader):
            model.train()
            batch, = tuple(t.to(device) for t in batch.values())
            optimizer.zero_grad()
            out, mu, log_var = model.forward(batch)
            loss = vae_loss(out, batch, mu, log_var, criterion)
            loss.backward()

            optimizer.step()

            batch_loss += loss.item()
            total_loss += loss.item()
            num_batches +=1

        val_loss = evaluate(model, criterion, val_dataloader, device)
        print('\n')
        print(f"{epoch+1}/{num_epochs}")
        print(f"Training loss: {batch_loss / num_batches}, Validation loss: {val_loss}")

        if val_loss > best_loss:
          print(f'----Best Model Saved----')
          best_loss = val_loss
          best_model = model

    return model, best_model


In [20]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LinearVAE(train_data.df.shape[1] - 1, 10)
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001, weight_decay=1e-5)
criterion = torch.nn.L1Loss(reduction="sum")
epochs = 20

In [21]:
model, best_model = train(
    model, 
    optimizer,
    criterion, 
    train_dataloader, 
    val_dataloader, 
    device, 
    epochs
)

Training ...


1/20
Training loss: 745.1585603011282, Validation loss: 515.259033203125
----Best Model Saved----


2/20
Training loss: 451.58684136240106, Validation loss: 510.6116027832031


3/20
Training loss: 350.5602393702457, Validation loss: 490.1463623046875


4/20
Training loss: 333.1674997229325, Validation loss: 483.7287292480469


5/20
Training loss: 316.4903158288253, Validation loss: 448.7687072753906


6/20
Training loss: 294.5082332209537, Validation loss: 444.7658996582031


7/20
Training loss: 288.96093521118166, Validation loss: 444.7655334472656


8/20
Training loss: 285.267223117226, Validation loss: 441.2275390625


9/20
Training loss: 282.0441728893079, Validation loss: 440.9853210449219


10/20
Training loss: 279.0639706059506, Validation loss: 436.9300231933594


11/20
Training loss: 277.03750333284074, Validation loss: 432.9464416503906


12/20
Training loss: 275.57396432976975, Validation loss: 435.4291687011719


13/20
Training loss: 274.01186535483913, Valid

# Vanilla AE

In [22]:
class VanillaAE(nn.Module):
    def __init__(self, n_features, latent_dim):
        super(VanillaAE, self).__init__()
        self.n_features = n_features

        self.encoder = nn.Sequential(
            nn.Linear(n_features, 20),
            nn.Tanh(),
            nn.Linear(20, latent_dim),
            nn.Tanh()
        )

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 20),
            nn.ReLU(),
            nn.Linear(20, n_features)
        )
    
    def forward(self, x):
        bs = x.size(0)
        out = self.encoder(x)
        out = self.decoder(out)
        return out