## Anomaly Detection Algorithms to Choose From

https://towardsdatascience.com/5-anomaly-detection-algorithms-every-data-scientist-should-know-b36c3605ea16
Which algorithm fits depends on the kinds of anomalies in the data:
- Outliers: Short/small anomalous patterns that appear in a non-systematic way in data collection.
- Change in Events: Systematic or sudden change from the previous normal behavior.
- Drifts: Slow, unidirectional, long-term change in the data.



Anomalies can also be grouped by how they relate to the rest of the data:
- Point anomalies: a single data point that lies too far from the rest. An unusually large bank transaction is a typical example.
- Contextual anomalies: an event that is anomalous only in specific circumstances (its context). As data becomes more complex, it is vital to use detection methods that take context into account. This type is common in time-series data. Example: spending $10 on ice cream every day is normal during the hot months, but odd for the rest of the year.
- Collective anomalies: a collection of data points that is anomalous with respect to the whole dataset, even though the individual points are not. Example: a broken rhythm in an ECG (electrocardiogram).



We are not using simple statistical methods like comparing mean and median because our data is not univariate. Typos/errors can occur in the cost or consumption columns, so we need to choose a method that can take this into account. In addition, natural dips and peaks that are typical for that account could be marked as outliers if we were to do simple comparisons. A machine learning approach will take into account the historical context. 

Simple statistical techniques such as the mean, median, and quantiles can be used to detect univariate anomalies, that is, unusual values of a single feature in the dataset. Various data visualization and exploratory data analysis techniques can also be used to detect anomalies.
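
As a rough illustration of the univariate case, the sketch below flags values outside the usual 1.5 x IQR fences. The `charges` list and the `iqr_outliers` helper are made up for this example and are not part of the notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# hypothetical monthly charges for one account, with one x10 typo (1053.9)
charges = [98.2, 101.5, 105.39, 99.8, 110.0, 97.4, 1053.9, 103.3]
outliers = iqr_outliers(charges)
print(outliers)
```

This works for one numeric column of one account at a time, which is exactly why it does not scale to the multivariate, multi-account case described above.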

This project originated from my work with naval utility data, which contained electricity, water, natural gas, and sewer along with over 4000 different accounts and line item descriptions to take into account. Since this data is sensitive, I've recreated the assignment using NYC utility data, which only contains electricity and about 300 accounts. The NYC data is also already clean and doesn't contain any errors, so I artificially introduced some errors into the data (approx. 5% of the data will contain errors). The original dataset, where I was working with all naval utilities worldwide, had many more errors because we dealt with a lot of manual data entry. We also had bill corrections, which this data does not contain. This data comes directly from the electricity meter.



Checklist:
1. Isolation Forest
   1. Chose isolation forest because it is easy to interpret for our stakeholders
2. Local Outlier Factor
3. Robust Covariance
4. One-Class SVM
5. One-Class SVM (SGD)
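
A minimal sketch of the first item, Isolation Forest, on made-up data. The array `X_demo`, the synthetic rows, and the x100 typo are assumptions for illustration only; the parameters are not tuned for the utility data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# hypothetical data: 200 ordinary (cost, quantity) rows plus one row with a x100 cost typo
normal_rows = rng.normal(loc=[100.0, 50.0], scale=[5.0, 3.0], size=(200, 2))
X_demo = np.vstack([normal_rows, [[10000.0, 50.0]]])

iso = IsolationForest(n_estimators=100, random_state=0)
labels = iso.fit_predict(X_demo)      # 1 = inlier, -1 = outlier
scores = iso.score_samples(X_demo)    # lower = easier to isolate = more anomalous

print(labels[-1], scores.argmin())
```

Rows that can be separated from the rest with only a few random splits get the lowest scores, which is the part that tends to be easy to explain to stakeholders.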

https://www.intellspot.com/anomaly-detection-algorithms/

More Anomaly Info:
https://serokell.io/blog/anomaly-detection-in-machine-learning
- We are dealing with contextual outliers

Requirements:
- Unsupervised: we don't actually know what is or isn't an error
- Flexible/robust enough to account for data changes
- Interpretable

Looking only at variance and standard deviation assumes that all groups behave the same and ignores the context of each account.
Rule-based analysis with value thresholds assumes we know what an acceptable range is. What happens when new data arrives? How do we take that into account? Do we have to create rules for every single group? This is unwieldy for large datasets and not sustainable in the long run.
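
To make the point about context concrete, here is a small sketch with two invented accounts. The account names, values, and the `flag_far_from_median` helper are all hypothetical. The same rule is applied once against a dataset-wide baseline and once against each account's own baseline.

```python
import statistics

# hypothetical monthly charges: a small account and a large one
charges = {
    "small_account": [100.0, 105.0, 98.0, 102.0, 1010.0],   # last value is a x10 typo
    "large_account": [9800.0, 10100.0, 9900.0, 10200.0, 10000.0],
}

def flag_far_from_median(values, center, factor=5.0):
    """Flag values more than `factor` times above or below `center`."""
    return [v for v in values if v > center * factor or v < center / factor]

all_values = [v for vals in charges.values() for v in vals]
global_median = statistics.median(all_values)

# one rule for the whole dataset
global_flags = {name: flag_far_from_median(vals, global_median)
                for name, vals in charges.items()}
# the same rule applied relative to each account's own history
group_flags = {name: flag_far_from_median(vals, statistics.median(vals))
               for name, vals in charges.items()}

print(global_flags)
print(group_flags)
```

With the dataset-wide baseline every value of the small account is flagged, while the per-account baseline isolates only the typo. Maintaining such baselines by hand for every group is the part that does not scale.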

The k-NN algorithm works well in dynamic environments where frequent updates are needed. Density-based distance measures are also a good way to identify unusual conditions and gradual trends, which makes k-NN useful for outlier detection and for flagging suspicious events.
k-NN is also a good technique for building models on non-standard data types such as text.

k-NN is a proven anomaly detection algorithm that is used to increase fraud detection rates, and it is one of the best-known text mining algorithms.
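
One simple way to turn k-NN into an outlier score is to use the distance to the k-th nearest neighbor. The sketch below does this in plain Python on invented points; `knn_distance_scores` and the sample data are assumptions for illustration, not code from this project.

```python
import math

def knn_distance_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# hypothetical (consumption, cost) pairs; the last one is far from the rest
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.8), (8.0, 9.0)]
scores = knn_distance_scores(points, k=3)
most_unusual = max(range(len(points)), key=scores.__getitem__)
print(most_unusual, round(scores[most_unusual], 2))
```

The brute-force loop is quadratic in the number of points, so for a dataset of this notebook's size a library implementation with a spatial index would be the practical choice.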


LOF (Local Outlier Factor) is a key anomaly detection algorithm based on the concept of local density. It uses the distances from a point to its k nearest neighbors to estimate the density around that point.

LOF compares the local density of an item to the local densities of its neighbors. Thus one can determine areas of similar density and items that have a significantly lower density than their neighbors. These are the outliers.
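
The density comparison can be seen on a toy example. This is a sketch on invented data (a regular grid plus one distant point); `X_demo` and `n_neighbors=5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# hypothetical data: a regular 5x5 grid of points plus one point far away
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
X_demo = np.vstack([grid, [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X_demo)          # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_     # more negative = lower density than the neighbors

print(labels[-1], scores[-1])
```

Points inside the grid have roughly the same density as their neighbors and score close to -1, while the distant point sits in a much sparser region than the grid points nearest to it and scores far lower.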


K-means clustering
- Only works with numeric data - we have some categorical cols

Time series
- Not using this because we are taking into account more than just the numeric values. We want to feed in the vendor, location, rate type because anomalies could be specific to groups within these variables








In [6]:
# import packages
import os
import pandas as pd
import random
import numpy as np
import scipy
# import pycaret
import datetime as dt
import json
from sodapy import Socrata

from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

In [None]:
# go with LOF or isolation forest because of high dimensional data - robust, flexible

NYC Data from City of New York
https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2010-Feb-2022-/jr24-e7cr

In [169]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Request up to 407031 rows, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("jr24-e7cr", limit=407031)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)



In [7]:
# %store results_df

UsageError: Unknown variable 'results_df'


In [8]:
%store -r results_df

In [9]:
# drop accounts with fewer than 96 rows (8 years of data at one row per month)
df = results_df.groupby('account_name').filter(lambda x: len(x) >= 96)

df['consumption_kwh'] = df['consumption_kwh'].astype(float)
df['kwh_charges'] = df['kwh_charges'].astype(float)
df['days'] = df['days'].astype(float)


In [12]:
# parse revenue_month ('YYYY-MM') into a datetime (no filtering to the last 5 years is done in this cell)
df['revenue_month'] = pd.to_datetime(df['revenue_month'], format = '%Y-%m')


# take only month integer
df['revenue_month'] = df['revenue_month'].dt.month

In [41]:
df.loc[df['has_cost_anom']=='ANOMALY'][['kwh_charges','anom_cost']]

        kwh_charges   anom_cost
2            990.98   9909.8000
9           2585.83  25858.3000
16           885.68     88.5680
79           105.39   1053.9000
112          820.65     82.0650
...             ...         ...
406960      2908.15  29081.5000
406990       180.83  18083.0000
407020       421.83  42183.0000
407023        23.76      0.2376


In [None]:
df.isna().sum()

## Add anomalies to dataset

https://towardsdatascience.com/create-synthetic-time-series-with-anomaly-signatures-in-python-c0b80a6c093c

When generating synthetic anomalies, you want to control the following characteristics:
- Fraction of the data that needs to be anomalous
- The scale of the anomaly (how far it lies from the normal data)
- One-sided or two-sided (higher or lower than the normal data in magnitude)
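
A small sketch of a generator that exposes these three controls directly. The function name `inject_anomalies`, its parameters, and the `clean` series are hypothetical; this is not the injection code this notebook uses.

```python
import random

def inject_anomalies(values, fraction=0.05, scale=10.0, two_sided=True, seed=0):
    """Return (new_values, positions) with a fraction of values scaled up or down."""
    rng = random.Random(seed)
    n_anom = int(len(values) * fraction)
    positions = rng.sample(range(len(values)), n_anom)
    out = list(values)
    for pos in positions:
        # two-sided: randomly multiply or divide; one-sided: always multiply
        go_up = (rng.random() < 0.5) if two_sided else True
        out[pos] = out[pos] * scale if go_up else out[pos] / scale
    return out, sorted(positions)

clean = [100.0 + i for i in range(40)]   # hypothetical clean series
noisy, positions = inject_anomalies(clean, fraction=0.05, scale=10.0, two_sided=False)
print(positions)
```

Returning the altered positions alongside the data keeps the ground truth available for evaluating a detector later.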


In [10]:
# select random rows to be anomalized - build the range of all row positions and sample from it
totalRows = len(df)
inds = range(totalRows)
perc = .05 # 5% of rows are sampled for the cost column and, separately, 5% for the consumption column
n = int(totalRows * perc)

# common typos to add to data: a shifted decimal point, which also covers adding/removing 0s by accident
# x 10 / x 100 move the decimal point one / two places to the right,
# x .1 / x .01 move it one / two places to the left
multipliers = [10, 100, .1, .01]

def anomalize(df, source_col, target_col, positions):
    # copy the clean column, then scale every 4th sampled row by each multiplier in turn
    df[target_col] = df[source_col]
    col = df.columns.get_loc(target_col)
    for k, mult in enumerate(multipliers):
        rows = positions[k::4]
        # one positional assignment on df itself avoids the chained-indexing SettingWithCopyWarning
        df.iloc[rows, col] = df.iloc[rows, col].to_numpy() * mult

# cost indices
anomalized_inds_cost = random.sample(inds, n)
anomalize(df, 'kwh_charges', 'anom_cost', anomalized_inds_cost)

# qty indices
anomalized_inds_qty = random.sample(inds, n)
anomalize(df, 'consumption_kwh', 'anom_qty', anomalized_inds_qty)

# flag rows whose value actually changed; a value of 0 is unchanged by any multiplier,
# so sampled rows with 0 cost or consumption stay 'NO ANOMALY'
df['has_cost_anom'] = np.where((df['anom_cost']!=df['kwh_charges']),'ANOMALY','NO ANOMALY')
df['has_cost_anom'].value_counts()

df['has_qty_anom'] = np.where((df['anom_qty']!=df['consumption_kwh']),'ANOMALY','NO ANOMALY')
df['has_qty_anom'].value_counts()

NO ANOMALY    392621
ANOMALY        13415
Name: has_qty_anom, dtype: int64

In [23]:
# test = df.groupby('account_name').agg({lambda x: list(x)})

In [29]:
X = df[['borough', 'funding_source', 'revenue_month', 'anom_cost', 'anom_qty', 'rate_class']]
X = pd.get_dummies(X, columns=['rate_class', 'funding_source', 'borough'], prefix= ['rate_class', 'funding', 'borough'])

In [30]:
X.dtypes

revenue_month                                  int64
anom_cost                                    float64
anom_qty                                     float64
rate_class_281-Sec Com Large Gen Use           uint8
rate_class_284-Sec Com Large Multiple Per      uint8
rate_class_285-Prim Com Large Mult Per         uint8
rate_class_285-Sec Com Large Multi Period      uint8
rate_class_68                                  uint8
rate_class_EL2                                 uint8
rate_class_EL2 Small Non-Res                   uint8
rate_class_EL9                                 uint8
rate_class_GOV/NYC/062                         uint8
rate_class_GOV/NYC/064                         uint8
rate_class_GOV/NYC/068                         uint8
rate_class_GOV/NYC/068 HT                      uint8
rate_class_GOV/NYC/068 TOD                     uint8
rate_class_GOV/NYC/069                         uint8
rate_class_GOV/NYC/069 TOD                     uint8
rate_class_GOV/NYC/082                        

In [None]:
# unfinished cell: list_train is not defined in the cells shown here
# i = 0
# for i in range(len(list_train)):

In [31]:
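clf = LocalOutlierFactor()

# fit_predict returns labels rather than scores: 1 for inliers, -1 for outliers
# (the underlying scores are available afterwards in clf.negative_outlier_factor_)
df['anomaly_score'] = clf.fit_predict(X)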

In [33]:
df['anomaly_score'].value_counts()

 1    374529
-1     31507
Name: anomaly_score, dtype: int64

In [35]:
df[['has_qty_anom', 'has_cost_anom']].value_counts()

has_qty_anom  has_cost_anom
NO ANOMALY    NO ANOMALY       379810
              ANOMALY           12811
ANOMALY       NO ANOMALY        12771
              ANOMALY             644
dtype: int64
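
One possible way to compare the LOF labels against the injected flags is a cross-tabulation. The sketch below uses a tiny invented frame (`demo`) that only borrows the column names from this notebook; the values and the resulting numbers are illustrative, not results from the real data.

```python
import pandas as pd

# hypothetical miniature of the notebook's columns
demo = pd.DataFrame({
    "has_cost_anom": ["ANOMALY", "NO ANOMALY", "NO ANOMALY", "ANOMALY", "NO ANOMALY"],
    "has_qty_anom":  ["NO ANOMALY", "NO ANOMALY", "ANOMALY", "NO ANOMALY", "NO ANOMALY"],
    "anomaly_score": [-1, 1, 1, 1, -1],   # LOF labels: -1 = outlier, 1 = inlier
})

# a row counts as injected if either column was altered
demo["injected"] = (demo["has_cost_anom"] == "ANOMALY") | (demo["has_qty_anom"] == "ANOMALY")
demo["flagged"] = demo["anomaly_score"] == -1

table = pd.crosstab(demo["injected"], demo["flagged"])
hits = (demo["injected"] & demo["flagged"]).sum()
recall = hits / demo["injected"].sum()      # share of injected rows that were flagged
precision = hits / demo["flagged"].sum()    # share of flagged rows that were injected

print(table)
```

On the real data the same two boolean columns could be built from `df`, which would show how many of the flagged rows overlap with the injected typos.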

In [36]:
path = '/Users/viviantran/projects/jpl_interview/data/processed'
os.chdir(path)

modelName = 'lof'
df.to_csv('results_'+modelName+'.csv', index = False)