Why do we apply dimensionality reduction to find outliers?
Don’t we lose some information, including the outliers, when we reduce the dimensionality? Not quite: once the main patterns are identified, the points that do not fit them stand out as outliers. Many distance-based techniques (e.g. k-nearest neighbors, KNN) suffer from the curse of dimensionality because they compute distances between data points in the full feature space, so high-dimensional data usually has to be reduced first.
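
To make the curse of dimensionality concrete, here is a minimal sketch on synthetic uniform data (the function name `relative_contrast` and the sizes are assumptions for illustration). It measures how much the farthest point differs from the nearest one, relative to the nearest distance. In high dimensions this contrast tends to shrink, which is why nearest-neighbor distances become less informative.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """(max - min) / min of the distances from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform points in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

low = relative_contrast(1000, 2)
high = relative_contrast(1000, 500)
print(f"relative contrast in 2 dims:   {low:.2f}")
print(f"relative contrast in 500 dims: {high:.2f}")
```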

Interestingly, outliers are identified during the process of dimensionality reduction, so outlier detection can be seen as a by-product of dimension reduction.
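
As a rough illustration of this by-product idea, the sketch below uses PCA (computed with NumPy's SVD) on synthetic 3-D data: points near a line are reconstructed well from one component, while a point far from the line gets a large reconstruction error. The data and variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 synthetic 3-D points lying close to a 1-D line, plus one point far off the line
t = rng.normal(size=(200, 1))
direction = np.array([[1.0, 2.0, -1.0]])
normal_points = t * direction + 0.05 * rng.normal(size=(200, 3))
outlier = np.array([[2.0, -3.0, 4.0]])
X = np.vstack([normal_points, outlier])

# PCA via SVD of the centered data, keeping a single principal component
mean = X.mean(axis=0)
Xc = X - mean
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
components = vt[:1]                                   # shape (1, 3)
reconstructed = Xc @ components.T @ components + mean

# Reconstruction error per point; the largest error marks the outlier candidate
errors = np.linalg.norm(X - reconstructed, axis=1)
worst = int(np.argmax(errors))
print(worst, errors[worst])
```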

Autoencoders are an unsupervised approach to finding anomalies.

Why autoencoders?
There are many useful tools for detecting outliers, such as Principal Component Analysis (PCA). Why do we need autoencoders? The reason is that PCA is limited to linear transformations. In contrast, autoencoders can learn non-linear transformations through their non-linear activation functions and multiple layers. Training several small layers in an autoencoder can also be more efficient than fitting one very large transformation with PCA. Autoencoders therefore show their merits when the data is complex and non-linear in nature.
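
The role of the activation function can be seen in a few lines of NumPy. In this small sketch (random weights and sizes chosen only for illustration), two stacked linear layers without an activation collapse into a single matrix product, while inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # first layer weights (hypothetical sizes)
W2 = rng.normal(size=(4, 2))   # second layer weights
x = rng.normal(size=(5, 3))    # five example inputs

# Two linear layers with no activation equal one linear layer with weights W1 @ W2
two_linear = (x @ W1) @ W2
one_linear = x @ (W1 @ W2)

# With a ReLU in between, the mapping is no longer a single matrix product
with_relu = np.maximum(x @ W1, 0.0) @ W2

print(np.allclose(two_linear, one_linear))
print(np.allclose(with_relu, one_linear))
```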



In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
import optuna
import json

TIME_STEPS = 30
BATCH_SIZE = 512
EPOCHS = 100
LEARNING_RATE = 0.01
LATENT_DIM = 4
EARLY_STOPPING_PATIENCE = 20

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load data
data_org = pd.read_csv("times_series_data_no_labels.csv", index_col='datetime', parse_dates=['datetime'])

def preprocess_data(column_name):
    # Removing rows with values greater than 32 and less than 19.
    # .copy() gives an independent DataFrame and avoids SettingWithCopyWarning.
    data = data_org[(data_org[column_name] <= 32) & (data_org[column_name] >= 19)].copy()

    # Minutes since midnight and raw values, used by the time-of-day filters below
    minutes = (data.index.hour * 60 + data.index.minute).to_numpy()
    values = data[column_name].to_numpy()

    # Removing rows for the time between 5:45 and 21:00 with values less than 26
    in_day = (minutes >= 5 * 60 + 45) & (minutes <= 21 * 60)
    drop_day = in_day & (values < 26)

    # Removing rows for the time between 00:10 and 03:05 with values greater than 22.5
    # (parentheses keep the value test applied to the whole time window)
    in_night = (minutes >= 10) & (minutes <= 3 * 60 + 5)
    drop_night = in_night & (values > 22.5)

    data = data[~(drop_day | drop_night)]

    # Split data into training and test sets
    train_size = int(len(data) * 0.85)
    train, test = data.iloc[0:train_size], data.iloc[train_size:len(data)]

    return train, test
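
The time-of-day rules in the comments of `preprocess_data` combine a time window with a value limit. A minimal sketch of that logic for a single reading is shown here in plain Python; the helper names `in_window` and `should_drop` are hypothetical. It also shows why parentheses matter when masks are combined with `&` and `|`.

```python
from datetime import time

def in_window(t, start, end):
    """True when start <= t <= end (the window must not cross midnight)."""
    return start <= t <= end

def should_drop(t, value, start, end, limit):
    """Drop a reading only if it is inside the window AND above the limit."""
    return in_window(t, start, end) and value > limit

night_start, night_end = time(0, 10), time(3, 5)
print(should_drop(time(1, 30), 23.0, night_start, night_end, 22.5))
print(should_drop(time(1, 30), 22.0, night_start, night_end, 22.5))
print(should_drop(time(4, 0), 23.0, night_start, night_end, 22.5))

# With element-wise operators, `&` binds tighter than `|`,
# so `a | b & c` means `a | (b & c)`; parentheses make the intent explicit.
a, b, c = True, False, False
unparenthesized = a | b & c
parenthesized = (a | b) & c
```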




In [2]:

train_0, test_0 = preprocess_data('data_0')
train_1, test_1 = preprocess_data('data_1')

def create_dataset(X, time_steps=1):
    Xs = []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)
    return np.array(Xs)

def normalize_data(data, min_val, max_val):
    return (data - min_val) / (max_val - min_val)
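
To see what the windowing and min-max scaling produce, here is a small sketch on toy data. It uses a plain NumPy array instead of a DataFrame to stay self-contained, so `make_windows` is a stand-in that mirrors the loop bounds of `create_dataset`, not the function itself.

```python
import numpy as np

def make_windows(values: np.ndarray, time_steps: int) -> np.ndarray:
    """Stack overlapping windows; same loop bounds as create_dataset."""
    return np.array([values[i:i + time_steps] for i in range(len(values) - time_steps)])

series = np.arange(20.0, 30.0).reshape(-1, 1)   # 10 rows, 1 column (toy data)
windows = make_windows(series, time_steps=3)

# Min-max scaling with the min and max taken from the windows themselves
scaled = (windows - windows.min()) / (windows.max() - windows.min())

print(windows.shape)               # (number of windows, time_steps, number of columns)
print(scaled.min(), scaled.max())
```

Note that with these loop bounds the last possible window is not generated, so a series of length N yields N - time_steps windows.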




In [3]:


class DenseEncoder(nn.Module):
    def __init__(self, input_shape: int, latent_dim: int):
        super().__init__()
        self.l1 = nn.Linear(in_features=input_shape, out_features=4 * latent_dim)
        self.l2 = nn.Linear(in_features=4 * latent_dim, out_features=2 * latent_dim)
        self.l3 = nn.Linear(in_features=2 * latent_dim, out_features=latent_dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        x = self.l1(inputs)
        x = torch.relu(x)
        x = self.l2(x)
        x = torch.relu(x)
        x = self.l3(x)
        latent = torch.relu(x)
        return latent


class DenseDecoder(nn.Module):
    def __init__(self, output_shape: int, latent_dim: int):
        super().__init__()
        self.l4 = nn.Linear(in_features=latent_dim, out_features=2 * latent_dim)
        self.l5 = nn.Linear(in_features=2 * latent_dim, out_features=4 * latent_dim)
        self.output = nn.Linear(in_features=4 * latent_dim, out_features=output_shape)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        x = self.l4(latent)
        x = torch.relu(x)
        x = self.l5(x)
        x = torch.relu(x)
        output = self.output(x)

        return output


class DenseAutoencoderModel(nn.Module):
    def __init__(self, input_shape, latent_dim):
        super(DenseAutoencoderModel, self).__init__()
        self.encoder = DenseEncoder(input_shape, latent_dim)
        self.decoder = DenseDecoder(input_shape, latent_dim)
        self.latent_dim = latent_dim

    def forward(self, inputs):
        inputs = inputs.squeeze(2)
        latent = self.encoder(inputs)
        output = self.decoder(latent)
        output = output.unsqueeze(2)
        return output
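
To make the layer sizes concrete, here is a small sketch in plain Python (no PyTorch needed) that lists the widths the encoder and decoder above pass through and counts the weights and biases of the six `nn.Linear` layers. The helper names are made up for this illustration.

```python
def dense_autoencoder_sizes(input_shape, latent_dim):
    """Layer widths from input, through the bottleneck, back to the output."""
    return [input_shape, 4 * latent_dim, 2 * latent_dim, latent_dim,
            2 * latent_dim, 4 * latent_dim, input_shape]

def count_parameters(sizes):
    """Each linear layer has in*out weights plus out biases."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

sizes = dense_autoencoder_sizes(input_shape=30, latent_dim=4)
n_params = count_parameters(sizes)
print(sizes)
print(n_params)
```

With a window of 30 time steps and a latent dimension of 4, each window is squeezed from 30 values down to 4 before being reconstructed.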


In [4]:
def train_model(model, train_loader, criterion, optimizer, epochs, patience):
    best_loss = float('inf')
    epochs_no_improve = 0

    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for df in train_loader:
            df = df[0].to(device)
            optimizer.zero_grad()
            output = model(df)
            loss = criterion(output, df)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        print(f'Epoch {epoch+1}, Loss: {train_loss}')

        if train_loss < best_loss:
            best_loss = train_loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1

        if epochs_no_improve == patience:
            print('Early stopping!')
            break

    return best_loss
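
The early-stopping bookkeeping in the loop above can also be pulled into a small helper. The sketch below is one possible way to do it; the class name `EarlyStopping` and the `min_delta` parameter are assumptions for illustration and are not part of the original notebook.

```python
class EarlyStopping:
    """Stop when the loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
losses = [0.9, 0.5, 0.6, 0.55, 0.4]     # toy loss curve
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In practice the monitored value is often a validation loss rather than the training loss, so that stopping reflects generalization instead of memorization.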

In [5]:

def objective(trial, train_loader):
    latent_dim = trial.suggest_int('latent_dim', 2, 12)
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    
    model = DenseAutoencoderModel(input_shape=TIME_STEPS, latent_dim=latent_dim).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    final_loss = train_model(model, train_loader, criterion, optimizer, EPOCHS, EARLY_STOPPING_PATIENCE)
    
    return final_loss

def run_optuna(train_loader):
    study = optuna.create_study(direction='minimize')
    study.optimize(lambda trial: objective(trial, train_loader), n_trials=50)
    best_params = study.best_params
    print(f'Best hyperparameters: {best_params}')
    return best_params


In [6]:

def preprocess_and_train(train, column_name):
    train_col = pd.DataFrame(train, columns=[column_name])

    TIME_STEPS = 30
    BATCH_SIZE = 512
    EPOCHS = 100
    EARLY_STOPPING_PATIENCE = 20

    X_train = create_dataset(train_col, TIME_STEPS)

    min_val = X_train.min()
    max_val = X_train.max()  # Use train max for consistency

    train_data = normalize_data(X_train, min_val, max_val)

    train_tensor = torch.tensor(train_data, dtype=torch.float32).to(device)

    train_loader = DataLoader(TensorDataset(train_tensor), batch_size=BATCH_SIZE, shuffle=True)

    best_params = run_optuna(train_loader)

    with open(f'saves/best_hyperparameters_{column_name}.json', 'w') as f:
        json.dump(best_params, f)

    model = DenseAutoencoderModel(input_shape=TIME_STEPS, latent_dim=best_params['latent_dim']).to(device)
    optimizer = optim.Adam(model.parameters(), lr=best_params['learning_rate'])

    final_loss = train_model(model, train_loader, criterion, optimizer, EPOCHS, EARLY_STOPPING_PATIENCE)
    print(f'Final model training loss for {column_name}: {final_loss}')

    torch.save(model.state_dict(), f'saves/best_autoencoder_model_{column_name}.pth')
    print(f'Best model for {column_name} saved!')

    return model, min_val, max_val, best_params


In [None]:
import os 
# criterion = nn.L1Loss()
criterion = nn.MSELoss()

# os.makedirs('saves', exist_ok=True)

# # Train models for data_0 and data_1
# model_0, min_val_0, max_val_0, best_params_0 = preprocess_and_train(train_0, 'data_0')
# model_1, min_val_1, max_val_1, best_params_1 = preprocess_and_train(train_1, 'data_1')

In [7]:
def calculate_anomalies(column_name, model, min_val, max_val, threshold):
    # Create a windowed dataset for the specified column
    data_window = create_dataset(data_org[[column_name]], TIME_STEPS)
    
    # Scale the data
    data_window_scale = (data_window - min_val) / (max_val - min_val)
    
    # Convert to PyTorch tensor
    data_window_scale = torch.tensor(data_window_scale, dtype=torch.float32).to(device)
    
    # Create a DataLoader
    data_loader = torch.utils.data.DataLoader(data_window_scale, batch_size=1, shuffle=False)
    
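    # Calculate reconstruction losses (one loss value per window)
    reconstruction_loss = []
    model.eval()  # inference mode; no gradients are needed below
    with torch.no_grad():
        for window in data_loader:
            window = window.to(device)
            output = model(window)
            loss = criterion(output, window)
            reconstruction_loss.append(loss.item())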
    
    # Convert to numpy array
    array_of_values = np.array(reconstruction_loss)
    
    # Identify anomalies
    is_anomaly = array_of_values > threshold
    
    # Prepare column name for anomaly flag
    anomaly_column = f"is_anomaly_{column_name.split('_')[1]}"
    data_org[anomaly_column] = False
    
    # Calculate the starting index for updating the original DataFrame
    n = len(is_anomaly)
    start_idx = -(n + 5)
    if start_idx < 0:
        start_idx = max(len(data_org) + start_idx, 0)
    
    # Get the rows to update
    rows_to_update = data_org.index[start_idx:start_idx + n]
    
    # Update the DataFrame with anomaly information
    data_org.loc[rows_to_update, anomaly_column] = is_anomaly
    
    return reconstruction_loss
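
Scoring one window per batch keeps the code simple but is slow for long series. As a sketch of an alternative, the per-window mean squared error can be computed for many windows at once by averaging over every axis except the first. The example uses NumPy arrays as stand-ins for the scaled inputs and the model output.

```python
import numpy as np

rng = np.random.default_rng(0)
windows = rng.random((6, 30, 1))                                  # stand-in for scaled input windows
reconstructed = windows + 0.01 * rng.normal(size=windows.shape)   # stand-in for model output

# One window at a time, as a batch_size=1 loop would do
one_by_one = np.array([np.mean((r - w) ** 2) for r, w in zip(reconstructed, windows)])

# All windows at once: average over the time and feature axes only
batched = np.mean((reconstructed - windows) ** 2, axis=(1, 2))

print(np.allclose(one_by_one, batched))
```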

In [None]:
# threshold_0 = 0.0025  # Set threshold for data_0
# threshold_1 = 0.00225  # Set threshold for data_1

# with open('saves/best_hyperparameters_data_0.json', 'r') as f:
#     best_params = json.load(f)
# model_0 = DenseAutoencoderModel(input_shape=TIME_STEPS, latent_dim=best_params['latent_dim']).to(device)
# optimizer_0 = optim.Adam(model_0.parameters(), lr=best_params['learning_rate'])

# with open('saves/best_hyperparameters_data_1.json', 'r') as f:
#     best_params = json.load(f)

# model_1 = DenseAutoencoderModel(input_shape=TIME_STEPS, latent_dim=best_params['latent_dim']).to(device)
# optimizer_1 = optim.Adam(model_1.parameters(), lr=best_params['learning_rate'])

# # final_loss = train_model(model, train_loader, criterion, optimizer, EPOCHS, EARLY_STOPPING_PATIENCE)

# model_0.load_state_dict(torch.load('saves/best_autoencoder_model_data_0.pth', map_location=device))
# model_1.load_state_dict(torch.load('saves/best_autoencoder_model_data_1.pth', map_location=device))

# reconstruction_loss_0 = calculate_anomalies('data_0', model_0, min_val_0, max_val_0, threshold_0)
# reconstruction_loss_1 = calculate_anomalies('data_1', model_1, min_val_1, max_val_1, threshold_1)

# # Plot histograms
# plt.hist(reconstruction_loss_0, bins=100)
# plt.xlabel('Loss')
# plt.ylabel('Frequency')
# plt.title('Histogram of Reconstruction Losses for data_0')
# plt.show()

# plt.hist(reconstruction_loss_1, bins=100)
# plt.xlabel('Loss')
# plt.ylabel('Frequency')
# plt.title('Histogram of Reconstruction Losses for data_1')
# plt.show()

# # Plot anomalies
# from plot_anomaly import univariate_anomaly_plot
# univariate_anomaly_plot(data=data_org)


This way, typical data points and anomalies can be told apart and labeled fairly reliably.
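
The thresholds in the cells above are fixed numbers. One common alternative, sketched here under the assumption that reconstruction losses on the training windows are available, is to take a high percentile of those losses as the threshold. The losses below are synthetic and the 99th percentile is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for reconstruction losses on training windows (synthetic, for illustration only)
train_losses = rng.gamma(shape=2.0, scale=0.0005, size=5000)

# Hypothetical rule: flag anything above the 99th percentile of the training losses
threshold = float(np.quantile(train_losses, 0.99))

new_losses = np.array([0.0004, 0.0011, 0.05])
is_anomaly = new_losses > threshold
print(threshold, is_anomaly)
```

A histogram of the losses, like the ones plotted in the notebook, is still a useful check that the chosen percentile separates the bulk of the distribution from its tail.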

Pros

Autoencoders can handle high-dimensional data with ease.
Because they are non-linear, they can find complex patterns within high-dimensional datasets.

Cons

Since this is a deep learning-based strategy, it tends to struggle when little data is available.
Computation costs rise sharply as the network gets deeper and as the dataset grows.

So far we’ve seen how to detect and identify anomalies. But the real question arises after finding them: what do we do about them?

Let’s discuss some pointers you could apply in your scenario.