# Spectral Residual Transformation Exploration

This notebook explores whether applying spectral residual (SR) transformations to the datasets could be useful for anomaly detection. The assessment is mainly visual and based on inspecting figures. The plots in this notebook show the manually (subjectively) labelled anomalies for the datasets (discussed in `/data/labelled-skyspark-data`) overlaid on various transformations.

In [1]:
import pandas as pd
import numpy as np
import datetime
from sklearn.preprocessing import scale
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Create a function that applies the SR transformation, optionally standardizing the data before and/or after it.
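For reference, the SR steps used in this notebook's code can be summarized as follows. The symbols match the variable names in the code; $x$ is the input series, $\mathcal{F}$ the discrete Fourier transform, and $h_q$ a moving-average kernel of width $q$ (here $q = 7$). This is a summary of what the code computes, not a statement about any particular published formulation.

$$
\begin{aligned}
A &= \lvert \mathcal{F}(x) \rvert, \qquad P = \arg \mathcal{F}(x) \\
L &= \log A \\
A_L &= h_q * L \\
R &= L - A_L \\
S &= \bigl\lvert \mathcal{F}^{-1}\bigl(\exp(R + iP)\bigr) \bigr\rvert^{2}
\end{aligned}
$$

Note that `np.convolve(..., 'same')` implicitly pads with zeros, so $A_L$ is biased towards zero in the first and last few frequency bins.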

In [10]:
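def convert_SR(data, normal_first=False, SR=True, normal_last=False):
    """Optionally standardize, apply the spectral residual (SR) transform, then optionally standardize again."""
    if normal_first:
        data1 = scale(data)
    else:
        data1 = data.copy()

    # spectral residual transformation
    if SR:
        A = np.fft.fft(data1)                # spectrum of the series
        L = np.log(abs(A))                   # log amplitude (-inf if an amplitude is exactly 0)
        P = np.angle(A)                      # phase
        h = np.ones((7,), np.float32) / 7    # 7-point moving-average kernel
        A_L = np.convolve(L, h, 'same')      # smoothed log amplitude
        R = L - A_L                          # spectral residual
        # back to the time domain; squared magnitude of the inverse FFT
        S = np.square(abs(np.fft.ifft(np.exp(R + 1j*P))))
    else:
        S = data1.copy()

    if normal_last:
        data_out = scale(S)
    else:
        data_out = S.copy()

    return data_out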
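As a quick sanity check of the SR steps, the sketch below applies them to a synthetic series with a single injected spike. The series, the spike position and the helper name `spectral_residual` are assumptions made for illustration (the helper is a stand-alone copy of the SR branch above, not part of the project code).

```python
import numpy as np

def spectral_residual(values, window=7):
    """Hypothetical stand-alone helper mirroring the SR branch of convert_SR."""
    spectrum = np.fft.fft(values)
    log_amp = np.log(np.abs(spectrum))
    phase = np.angle(spectrum)
    kernel = np.ones(window) / window
    residual = log_amp - np.convolve(log_amp, kernel, 'same')
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase))) ** 2

# Synthetic series (not SkySpark data): 8 full sine periods, light noise,
# and one spike injected at index 300.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
series = np.sin(2 * np.pi * 8 * t / n) + rng.normal(0, 0.05, n)
series[300] += 8.0

saliency = spectral_residual(series)
print("index of largest SR value:", int(np.argmax(saliency)))
```

On this toy series the largest SR value sits at the injected spike, which is the behaviour the plots later in this notebook are meant to assess on the real data.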



In [6]:
# function to read/apply transformation/save data
def save_files(file_in, suffix, normal_first=False, SR=False, normal_last=False):
    data = pd.read_csv('../../data/labelled-skyspark-data/' + file_in + '.csv', parse_dates = ['Datetime'])
    # pass each option through unchanged (previously SR and normal_last were both set to normal_first)
    data_trans = convert_SR(data.Value, normal_first=normal_first, SR=SR, normal_last=normal_last)
    data_out = data.copy()
    data_out['Value'] = data_trans
    data_out.to_csv(file_in + '_' + suffix + '.csv', index = False)

Create a function that produces a series of plots with various transformations applied, including:

1. only standardize data
2. only apply spectral residual transformation
3. standardize, then apply spectral residual
4. apply spectral residual, then standardize

In [11]:
def plot_various(file_in, plot_portion = 'all'):
    data = pd.read_csv('../../data/labelled-skyspark-data/' + file_in + '.csv', parse_dates = ['Datetime'])

    trace1 = convert_SR(data.Value, normal_first=True, SR=False, normal_last=False)
    trace2 = convert_SR(data.Value, normal_first=False, SR=True, normal_last=False)
    trace3 = convert_SR(data.Value, normal_first=True, SR=True, normal_last=False)
    trace4 = convert_SR(data.Value, normal_first=False, SR=True, normal_last=True)

    if plot_portion == 'first_half':
        start = 0
        end = int(len(data)/2)

    elif plot_portion == 'second_half':
        start = int(len(data)/2)
        end = len(data)
    else:
        start = 0
        end = len(data)

    fig = make_subplots(rows=1, cols=4, subplot_titles=("Standardize, No SR", 
    "No Standardize, SR", 
    "Standardize, then SR", 
    "SR, then Standardize"))

    fig.append_trace(go.Scattergl(
        x = data['Datetime'][start:end],
        y = trace1[start:end],
        mode = 'markers',
        marker=dict(color=1*data['Anomaly'][start:end])
    ), row=1, col=1)

    fig.append_trace(go.Scattergl(
        x = data['Datetime'][start:end],
        y = trace2[start:end],
        mode = 'markers',
        marker=dict(color=1*data['Anomaly'][start:end])
    ), row=1, col=2)

    fig.append_trace(go.Scattergl(
        x = data['Datetime'][start:end],
        y = trace3[start:end],
        mode = 'markers',
        marker=dict(color=1*data['Anomaly'][start:end])
    ), row=1, col=3)

    fig.append_trace(go.Scattergl(
        x = data['Datetime'][start:end],
        y = trace4[start:end],
        mode = 'markers',
        marker=dict(color=1*data['Anomaly'][start:end])
    ), row=1, col=4)

    fig.update_layout(height=500, width=1200,
    margin=dict(
        l=50,
        r=50,
        b=50,
        t=50,
        pad=2
    ))
    fig.show()

## Look at Datasets with Transformations

The following sections provide code to run on each of the datasets. The code is commented out because the plots are very large. Each function call was run to create an interactive plotly figure, a snapshot of the figure was saved, and the saved images are shown below.

The functions can be run to explore the data as required, but it is recommended to remove them before saving this notebook file.

In [30]:
#plot_various('CEC_compiled_data_1b_updated')

This figure is for the CEC HW Main Meter Power.

![](images/CEC_compiled_data_1b_updated.png)

The plot suggests that applying SR highlights the self-labelled anomalies. The figures also show that standardizing and then applying SR is potentially the better option, as the other two SR options do not appear to show any anomalous data (shown in yellow) in the summer 2020 period.

In [34]:
#plot_various('CEC_compiled_data_2b_updated')

This figure is for the HW Main Meter Entering Water Temperature.

![](images/CEC_compiled_data_2b_updated.png)

The figure shows that the SR transformations appear to help isolate the self-labelled anomalies. It is hard to determine which of the three options with SR is best.

In [13]:
#plot_various('CEC_compiled_data_3b_updated', plot_portion='first_half')

The following figures are for the CEC Main Meter Flow. Note that the data had to be split into two sets of figures due to the size of the data. The figure below is for the first half.

![](images/CEC_compiled_data_3b_updated_first.png)

In [15]:
#plot_various('CEC_compiled_data_3b_updated', plot_portion='second_half')

The figure below is for the second half.

![](images/CEC_compiled_data_3b_updated_second.png)

The figures show that the SR transformations appear to help isolate the self-labelled anomalies. It is hard to determine which of the three options with SR is best. The standardize-then-SR option appears to add curvature to the general trend compared with the other SR options, but I don't believe this would be an issue for the LSTM.

In [31]:
#plot_various('CEC_compiled_data_4b_updated')

The figure below is for the Boiler B-1 Gas Pressure.

![](images/CEC_compiled_data_4b_updated.png)

The figures indicate that all manually labelled anomalies are quite distinct. Once again, the standardize-then-SR option appears to add curvature to the data, but I don't believe this would be an issue for the LSTM.

In [32]:
#plot_various('CEC_compiled_data_5b_updated')

The following figure is for the Boiler B-1 Exhaust O2.

![](images/CEC_compiled_data_5b_updated.png)

The figure indicates that the SR transformation gives the data a pattern much more similar to that of the other datasets, whereas the data without an SR transformation looks quite different. Again, the standardize-then-SR option shows curvature in the data.

## Comments based on above Plots

The SR transformations do appear to highlight the self-labelled anomalies. They also highlight additional, potentially anomalous data that were not manually (subjectively) picked out. This transformation appears to be a potential option to apply with the LSTM and should be explored.

The option where standardization is applied before the SR transformation will likely be tried first. If standardize-then-SR does not work well, applying SR without any standardization would likely be the next one to try.

In [None]:
# run this cell to save various options

# normalize, then SR
save_files('CEC_compiled_data_1b_updated', '_NormSR', normal_first=True, SR=True, normal_last=False)
save_files('CEC_compiled_data_2b_updated', '_NormSR', normal_first=True, SR=True, normal_last=False)
save_files('CEC_compiled_data_3b_updated', '_NormSR', normal_first=True, SR=True, normal_last=False)
save_files('CEC_compiled_data_4b_updated', '_NormSR', normal_first=True, SR=True, normal_last=False)
save_files('CEC_compiled_data_5b_updated', '_NormSR', normal_first=True, SR=True, normal_last=False)


## Interactive Inspection of Data

The [interactive app](../../code/labeller-app/) was also used to explore the transformed data in comparison with the untransformed datasets in `/data/labelled-skyspark-data`. Based on the inspection, the SR transformation appears to do a good job of highlighting spikes or sharp breaks in the data (as was noted in the above plots). However, one area where it does not appear to do well is odd-looking data that does not fluctuate outside the range of nearby data.

For example, the first plot below shows anomalous flatline data. In the second plot, where the data is transformed using SR, the same data is no longer flat, potentially making it much harder for the LSTM to pick up.

![](images/noSR_1.png)

![](images/SR_1.png)
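The flatline effect can be reproduced on synthetic data. The sketch below is an illustration under assumptions: the series is made up (not SkySpark data), the "stuck sensor" segment is hypothetical, and `spectral_residual` is a stand-alone copy of the SR steps used in this notebook.

```python
import numpy as np

def spectral_residual(values, window=7):
    """Hypothetical stand-alone helper mirroring the SR steps in this notebook."""
    spectrum = np.fft.fft(values)
    log_amp = np.log(np.abs(spectrum))
    phase = np.angle(spectrum)
    kernel = np.ones(window) / window
    residual = log_amp - np.convolve(log_amp, kernel, 'same')
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase))) ** 2

# Synthetic series with a sensor "stuck" at its last reading for 100 samples.
rng = np.random.default_rng(1)
n = 512
t = np.arange(n)
series = np.sin(2 * np.pi * 8 * t / n) + rng.normal(0, 0.05, n)
series[200:300] = series[199]

saliency = spectral_residual(series)
flat = slice(200, 300)
print("range of raw values in the flat segment:", np.ptp(series[flat]))
print("range of SR values in the same segment: ", np.ptp(saliency[flat]))
```

The raw segment has zero range by construction, while the SR output over the same samples does not. Because the FFT mixes information from the whole series, a constant stretch in the input is not preserved as a constant stretch in the output.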

Conversely, the transformation does a good job of picking up a piece of data that is clearly not in line with its surrounding data, as shown in the two figures below (the first figure is the untransformed data and the second has SR applied).

![](images/noSR_2.png)

![](images/SR_2.png)

## Testing Custom sklearn Transformer Class for Spectral Residual Transformation

Note that the problem with this implementation is that it loads the full dataset every time predict is called in order to apply SR.

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin

class SR(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        self.original_data = X
        self.original_length = len(X)
        return self
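    def transform(self, X):
        length_data = len(X)
        if self.original_length == length_data:
            # same length as the fitted data: assume X is the fitted data itself
            full_data = X
        else:
            # prepend the fitted data so the FFT sees the full history
            # (Series.append was removed in pandas 2.0, so use pd.concat)
            full_data = pd.concat([self.original_data, X])
        A = np.fft.fft(full_data)
        L = np.log(abs(A))
        P = np.angle(A)
        h = np.ones((7,),np.float32)/7
        A_L = np.convolve(L, h, 'same')
        R = L - A_L
        S = np.square(abs(np.fft.ifft(np.exp(R + 1j*P))))
        if self.original_length == length_data:
            return S
        else:
            # keep only the values that correspond to X (the tail of full_data)
            return S[self.original_length:]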



In [15]:
data_classtest = pd.read_csv('../../data/labelled-skyspark-data/' + 'CEC_compiled_data_5b_updated' + '.csv', parse_dates = ['Datetime'])
SR_classtest = convert_SR(data_classtest.Value, normal_first=False, SR=True, normal_last=False)

In [16]:
split_data = int(len(data_classtest)*0.8)
X_train = data_classtest.Value[:split_data]
X_test = data_classtest.Value[split_data:]
SR_transformer = SR()
SR_transformed_train = SR_transformer.fit_transform(X_train)

In [17]:
SR_transformed_train

array([0.00403629, 0.00040683, 0.00033663, ..., 0.00255812, 0.00289514,
       0.01075843])

In [18]:
SR_transformed_test = SR_transformer.transform(X_test)

In [19]:
other_SR = SR().fit_transform(X_test)

In [20]:
fig = go.Figure()

fig.add_trace(go.Scattergl(
    x = data_classtest['Datetime'],
    y = SR_classtest,
    mode = 'markers',
))

fig.add_trace(go.Scattergl(
    x = data_classtest['Datetime'][:split_data],
    y = SR_transformed_train,
    mode = 'markers',
))

fig.add_trace(go.Scattergl(
    x = data_classtest['Datetime'][split_data:],
    y = SR_transformed_test,
    mode = 'markers',
))

fig.add_trace(go.Scattergl(
    x = data_classtest['Datetime'][split_data:],
    y = other_SR,
    mode = 'markers',
))


fig.show()
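The traces compared in the figure above differ in how much data the FFT sees. The sketch below checks this numerically on synthetic data. It is an illustration under assumptions: the series is made up, the 80/20 split simply mirrors the cells above, and `spectral_residual` is a hypothetical stand-alone copy of the SR steps.

```python
import numpy as np

def spectral_residual(values, window=7):
    """Hypothetical stand-alone helper mirroring the SR steps in this notebook."""
    spectrum = np.fft.fft(values)
    log_amp = np.log(np.abs(spectrum))
    phase = np.angle(spectrum)
    kernel = np.ones(window) / window
    residual = log_amp - np.convolve(log_amp, kernel, 'same')
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase))) ** 2

# Synthetic series (not SkySpark data) with an 80/20 split.
rng = np.random.default_rng(2)
n = 500
t = np.arange(n)
series = np.sin(2 * np.pi * 10 * t / n) + rng.normal(0, 0.05, n)
split = int(n * 0.8)

# SR over the whole series, keeping only the test portion ...
sr_full_tail = spectral_residual(series)[split:]
# ... versus SR computed on the test portion alone.
sr_tail_only = spectral_residual(series[split:])

print("same values?", np.allclose(sr_full_tail, sr_tail_only))
print("largest absolute difference:", np.max(np.abs(sr_full_tail - sr_tail_only)))
```

The two results have the same length but different values, which is why the transformer class keeps the fitted data around: the SR value at a given timestamp depends on the window of data it was computed over.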

## Daily SR

Because SR has to be recalculated over the entire dataset every time predict is used, try applying SR to individual days of data instead.


In [73]:
data_daily = pd.read_csv('../../data/labelled-skyspark-data/' + 'CEC_compiled_data_1b_updated' + '.csv', parse_dates = ['Datetime'])

In [74]:
data_daily['date'] = data_daily.Datetime.dt.date

In [79]:
group_test = data_daily.groupby('date')['Value'].transform(convert_SR)

In [85]:
# fig = go.Figure()

# fig.add_trace(go.Scattergl(
#     x = data_daily['Datetime'],
#     y = group_test,
#     mode = 'markers',
#     marker=dict(color=1*data_daily['Anomaly'])
# ))

# fig.show()

![](images/dailysection_plot1.png)

The figure shows that applying SR on a daily basis is not very useful for highlighting the events self-identified as anomalies. A larger window may be required instead.

In [76]:
data_daily['month_year'] = pd.to_datetime(data_daily['Datetime']).dt.to_period('M')

In [78]:
group_test2 = data_daily.groupby('month_year')['Value'].transform(convert_SR)

In [84]:
# fig = go.Figure()

# fig.add_trace(go.Scattergl(
#     x = data_daily['Datetime'],
#     y = group_test2,
#     mode = 'markers',
#     marker=dict(color=1*data_daily['Anomaly'])
# ))

# fig.show()

![](images/dailysection_plot2.png)

Using month-year does appear to highlight the self-labelled anomalies. Therefore, this could be explored as a way to use SR without requiring the entire time period.
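The per-month grouping can be sketched end to end on synthetic data. Everything in the example is an assumption for illustration: the hourly series is made up, the column name `Value_SR` is hypothetical, and `spectral_residual` is a stand-alone copy of the SR steps used in this notebook.

```python
import numpy as np
import pandas as pd

def spectral_residual(values, window=7):
    """Hypothetical stand-alone helper mirroring the SR steps in this notebook."""
    spectrum = np.fft.fft(values)
    log_amp = np.log(np.abs(spectrum))
    phase = np.angle(spectrum)
    kernel = np.ones(window) / window
    residual = log_amp - np.convolve(log_amp, kernel, 'same')
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase))) ** 2

# 90 days of synthetic hourly data with a daily cycle (not SkySpark data).
rng = np.random.default_rng(3)
idx = pd.date_range("2020-01-01", periods=24 * 90, freq=pd.Timedelta(hours=1))
hours = np.arange(len(idx))
frame = pd.DataFrame({
    "Datetime": idx,
    "Value": np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.05, len(idx)),
})

# Apply SR independently within each calendar month; transform() keeps the
# result aligned with the original rows.
frame["month_year"] = frame["Datetime"].dt.to_period("M")
frame["Value_SR"] = frame.groupby("month_year")["Value"].transform(
    lambda s: spectral_residual(s.to_numpy())
)

print(frame.groupby("month_year")["Value_SR"].describe())
```

One caveat with any grouped approach: each group needs at least as many samples as the smoothing window (7 here), because `np.convolve(..., 'same')` returns an array as long as the longer of its two inputs and the result would otherwise not match the group length.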