## Example of anom_detect Usage

Below I use an example from the commonly used sunspots dataset to show some features of the anomaly detection library, especially its plotting functionality.

To run the example yourself, first download the data set from the link in the comments below, then run the cells in order.
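
The cells below read their file locations and column names from a JSON file (`info-EDP-ES.json`). As a rough guide to what that file needs to contain, here is a minimal sketch of its shape. Only the keys are taken from the code in this notebook; the paths are hypothetical placeholders, and the column list is copied from the table printed further down and may not match the real configuration exactly.

```python
import json

# Hypothetical configuration: every path is a placeholder, not the real layout.
config = {
    "paths": {
        # ';'-separated CSV with a timestamp index and one column per meter
        "mm": {"input": "path/to/hourly_readings.csv"},
        # installation/meter register
        "lista2": "path/to/installation_list.csv",
    },
    "lista2": {
        # Assumed column names, copied from the LISTAF table shown in this notebook
        "columns": ["cliente", "instalacao", "medidor", "data_inicio", "data_fim",
                    "lote", "ssn", "5", "6", "7", "8", "troca"],
    },
}

# Round-trip through JSON to confirm the structure serializes cleanly.
data = json.loads(json.dumps(config))
mm_path = data["paths"]["mm"]["input"]
lista_columns = data["lista2"]["columns"]
print(mm_path, len(lista_columns))
```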

In [1]:
import json
import os
import sys
from os.path import join
from pathlib import Path

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from pylab import rcParams
%matplotlib inline

# Add the sibling 'lib' directory to the import path first,
# in case anom_detect is located there.
sys.path.append(join(Path(os.getcwd()).parent, 'lib'))
from anom_detect import anom_detect

# Load the project configuration (file paths and column names).
with open('../../info/info-EDP-ES.json', 'rt') as p:
    data = json.load(p)


### Load data set into Pandas

In [2]:
# Readings table: timestamp index, one column per meter.
mm_path = data['paths']['mm']['input']
mm = pd.read_csv(mm_path, sep=';', index_col=0, parse_dates=[0])
mm = mm.fillna(0)

COLS_LISTA = data['lista2']['columns']
LISTA_CAD_PATH = data['paths']['lista2']

# Try ';' as the separator first; if that fails, fall back to the default ','.
try:
    LISTAF = pd.read_csv(LISTA_CAD_PATH, names=COLS_LISTA, header=0, sep=';', engine='python')
except Exception:
    LISTAF = pd.read_csv(LISTA_CAD_PATH, names=COLS_LISTA, header=0, engine='python')

# Zero-pad the identifiers to fixed widths (10 for installations, 8 for meters).
LISTAF['instalacao'] = LISTAF['instalacao'].apply(lambda x: '{0:0>10}'.format(x))
try:
    LISTAF['medidor'] = LISTAF['medidor'].astype(int).apply(lambda x: '{0:0>8}'.format(x))
except (ValueError, TypeError):
    # Meter IDs that cannot be cast to int are padded as they are.
    LISTAF['medidor'] = LISTAF['medidor'].apply(lambda x: '{0:0>8}'.format(x))


In [3]:
LISTAF[LISTAF['instalacao']=='0000051508']    

Unnamed: 0,cliente,instalacao,medidor,data_inicio,data_fim,lote,ssn,5,6,7,8,troca
632,IGREJA BATISTA MONTE HOREBE,51508,13831653,01/01/2015,17/06/2019,,,23/05/2019,21/06/2019,22/07/2019,22/08/2019,2-Com troca de medidor
633,IGREJA BATISTA MONTE HOREBE,51508,15347749,17/06/2019,17/06/2099,,89551180637000094687,23/05/2019,21/06/2019,22/07/2019,22/08/2019,2-Com troca de medidor


In [4]:
df = pd.DataFrame(mm['15347749'])

In [5]:
df.index.name = 'time'
df.columns = ['mm']

In [6]:
df.head()

time,mm
2019-05-01 01:00:00,0.0
2019-05-01 02:00:00,0.0
2019-05-01 03:00:00,0.0
2019-05-01 04:00:00,0.0
2019-05-01 05:00:00,0.0


### Evaluate for Anomalies

There are a number of options available in the anom_detect method. A short description of each is given below:
- method : The data filtering method. For the moment only 'average' is available, representing the moving average method. More data modelling techniques will be implemented in the future.
- max_outliers : Defaults to 'None', which means that the maximum number of outliers is set to the size of your data set. For more efficient computation this should be limited.
- window : The window size for the moving average. Defaults to 5.
- alpha : The significance level used for the ESD test.
- mode : The method used in the discrete linear convolution for dealing with boundaries. Please read the separate documentation. The default is 'same', which means that the averaging window must overlap the data points by more than len(window)/2.
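
To make the `window` and `mode` options more concrete, here is a minimal sketch of a moving average written as a discrete linear convolution. It assumes the library's smoothing behaves like `numpy.convolve` with a flat kernel, which is what the `mode` option suggests. The helper `moving_average` is hypothetical and is not part of anom_detect.

```python
import numpy as np

def moving_average(values, window=5, mode="same"):
    """Moving average as a discrete linear convolution with a flat kernel."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode=mode)

# A flat signal with a single spike in the middle.
signal = np.array([1.0, 1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 1.0])

smooth_same = moving_average(signal, window=5, mode="same")    # same length as the input
smooth_valid = moving_average(signal, window=5, mode="valid")  # full-overlap positions only

# Residual = actual value minus the moving-average estimate.
residual = signal - smooth_same

print(len(smooth_same), len(smooth_valid))
print(np.round(smooth_same, 2))
```

With `mode="same"` the output has one value per input point, but the first and last few values are computed from a window that hangs over the edge of the data (NumPy pads with zeros there), so they are pulled toward zero. With `mode="valid"` those edge positions are dropped instead.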

In [7]:
# Use default values
an = anom_detect()

In [8]:
# Find the anomalies and print them

an.evaluate(df)

  res = abs((x - x_mean) / x_std)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


time,mm
2019-05-03 10:00:00,0.00
2019-05-03 11:00:00,0.00
2019-05-03 12:00:00,0.03
2019-05-03 13:00:00,0.00
2019-05-03 14:00:00,0.00
2019-05-05 03:00:00,0.00
2019-05-05 04:00:00,0.00
2019-05-05 05:00:00,4.92
2019-05-05 06:00:00,0.33
2019-05-05 07:00:00,0.00


In [None]:
an.plot()

In [None]:
an.plot(left='2019-07-25',right='2019-09-01',top=20,bottom=0)

### Accessing data

In [None]:
# The graph values can be accessed using 'results'.
an.results

In [None]:
an.anoma_points

In [9]:
# A "morro" requires at least 2 days' worth of anomalous values.
# The data is hourly: 1 row per hour = 24 rows per day,
# so 2 days = 48 rows (minimum).
morro_min_points = 48

# Count the anomalous points (taken from anoma_points) per month
# and keep the months that reach the threshold.
month_anom_points = an.anoma_points.groupby(by=[an.anoma_points.index.month]).count()
above_threshold_points = month_anom_points[month_anom_points['mm'] >= morro_min_points]
above_threshold_points

time,mm
8,555
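
One thing to keep in mind with the grouping above: `index.month` only carries the month number, so the same month in different years would be counted together. That does not matter for a few months of data within one year, but if the readings ever span more than a year, grouping by year and month may be safer. The sketch below illustrates the difference on synthetic data. The `anoma_points` frame here is a made-up stand-in, not output from the library.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an.anoma_points: 60 hourly anomalies in August 2019
# and 10 in August 2020.
idx_2019 = pd.to_datetime("2019-08-01") + pd.to_timedelta(np.arange(60), unit="h")
idx_2020 = pd.to_datetime("2020-08-01") + pd.to_timedelta(np.arange(10), unit="h")
idx = idx_2019.append(idx_2020)
anoma_points = pd.DataFrame({"mm": np.ones(len(idx))}, index=idx)

morro_min_points = 48  # 2 days of hourly rows

# Grouping by month alone merges both Augusts into one bucket.
by_month = anoma_points.groupby(anoma_points.index.month)["mm"].count()

# Grouping by (year, month) keeps them apart.
by_year_month = anoma_points.groupby(
    [anoma_points.index.year, anoma_points.index.month]
)["mm"].count()
flagged = by_year_month[by_year_month >= morro_min_points]

print(by_month)
print(flagged)
```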


In [24]:
import warnings
warnings.filterwarnings('ignore')

morro_min_points = 48
frames = []
for index, row in LISTAF.iterrows():
    instalacao = row['instalacao']
    medidor = row['medidor']
    # Skip meters that have no column in the readings table.
    if medidor not in mm.columns:
        continue
    df = pd.DataFrame(mm[medidor])
    df.index.name = 'mes'
    df.columns = ['mm']
    # Use default values
    an = anom_detect()
    an.evaluate(df)
    # Count anomalous points per month and keep the months that reach the threshold.
    month_anom_points = an.anoma_points.groupby(by=[an.anoma_points.index.month]).count()
    above_threshold_points = month_anom_points[month_anom_points['mm'] >= morro_min_points].copy()
    above_threshold_points['instalacao'] = instalacao
    above_threshold_points['medidor'] = medidor
    if len(above_threshold_points) > 0:
        frames.append(above_threshold_points.reset_index())
    print(index, end=' ')

# Build the result once, after the loop, instead of concatenating on every iteration.
columns = ['mes', 'mm', 'instalacao', 'medidor']
morro = pd.concat(frames) if frames else pd.DataFrame(columns=columns)

0 1 2 3 4 5 

  mes   mm  instalacao   medidor
0   5   53  0001001176  13017277
1   7   93  0001001176  13017277
2   8  135  0001001176  13017277
0   5  431  0160040259  12680634
1   6  399  0160040259  12680634
2   7  302  0160040259  12680634
3   8  222  0160040259  12680634

KeyboardInterrupt: 

In [13]:
morro

Unnamed: 0,Instalacao,Medidor,Mes,count_mm


### Check Normality of Residual

In order to use the ESD test, it is important that the quantity being tested is approximately normally distributed. You can check this with the normality function, which produces two plots.
In this implementation we calculate a residual between the approximated curve (in this case the 5-point moving average) and the actual data:

<b>residual = (actual data point) -  (estimated value from moving average)</b>
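
For readers unfamiliar with the ESD test, here is a minimal sketch of the generalized ESD procedure applied to a residual like the one defined above. It follows the textbook formulation and is not taken from the anom_detect source, so the library's own implementation may differ in detail. The function name `generalized_esd`, the default `alpha=0.05`, and the synthetic residual are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the indices flagged as outliers by the generalized ESD test."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    max_outliers = min(max_outliers, n - 2)  # keep degrees of freedom >= 1
    remaining = np.arange(n)                 # indices still in the sample
    removed = []                             # indices removed so far, in order
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        sample = x[remaining]
        std = sample.std(ddof=1)
        if std == 0:                         # constant sample: nothing left to test
            break
        deviations = np.abs(sample - sample.mean())
        pos = int(np.argmax(deviations))
        r_i = deviations[pos] / std          # test statistic R_i
        removed.append(int(remaining[pos]))
        remaining = np.delete(remaining, pos)
        # Critical value lambda_i from the t distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lambda_i = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        if r_i > lambda_i:
            n_outliers = i                   # largest i with R_i > lambda_i
    return sorted(removed[:n_outliers])

# Synthetic, roughly normal residual with three injected outliers.
rng = np.random.default_rng(0)
residual = rng.normal(0.0, 1.0, 200)
residual[[20, 90, 150]] = [15.0, -12.0, 20.0]

outliers = generalized_esd(residual, max_outliers=10)
print(outliers)
```

The critical values come from the t distribution, which is why the residual needs to be approximately normal for the test to be meaningful. Capping `max_outliers` also bounds the number of iterations, which is the efficiency point made in the options list.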

The plots are simple, qualitative checks for normality:
- <b>Distribution of residuals</b> : a histogram of the residual in 100 bins.
- <b>Probability plot</b> : plots the actual data against its corresponding normal value approximation (uses scipy.stats.probplot). A perfectly normal data set would lie along the straight line.
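
The numbers behind these two plots can also be computed without drawing anything. The sketch below does this for a synthetic residual and is not the library's `normality` implementation. It bins the data into 100 bins with `numpy.histogram` and calls `scipy.stats.probplot`, whose returned correlation coefficient `r` gives a rough numeric indication of how closely the points follow the straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
residual = rng.normal(loc=0.0, scale=1.0, size=2000)   # synthetic, normal by construction
skewed = rng.exponential(scale=1.0, size=2000)         # clearly non-normal, for comparison

# Distribution of residuals: histogram with 100 bins.
counts, bin_edges = np.histogram(residual, bins=100)

# Probability plot data: ordered values against theoretical normal quantiles,
# plus the least-squares line fitted through them.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(residual, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed, dist="norm")

print(round(r, 3), round(r_skewed, 3))
```

A value of `r` close to 1 means the points lie near the line. The skewed sample gives a visibly lower value, which corresponds to the curved pattern you would see in its probability plot.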

In [None]:
an.normality()