# You can <font color="red"> watch </font>  this notebook at **Murat Karakaya Akademi** channel on ***YOUTUBE*** in [TURKISH](https://www.youtube.com/watch?v=7MhZ2DDg89Y) or in [ENGLISH](https://youtu.be/cy7vzuuADBc)

# Anomaly Detection in Time Series using Voting Scheme

In this notebook, we will predict if a GPS tracking device consumes **abnormal** amounts of current from the car battery (accumulator).

**Keywords & Concepts**:
* **Abnormal**: deviating from what is normal or usual, typically in a way that is undesirable or worrying.

* **Anomaly**: something that deviates from what is standard, normal, or expected.

* **Ensemble**: Ensemble methods are techniques that create multiple models and then combine them to produce improved results. 

* **Voting Scheme**: Voting is one of the simplest ensemble methods. The first step is to create multiple classification/regression models using some training dataset. Each base model can be created using different splits of the same training dataset with the same algorithm, using the same dataset with different algorithms, or by any other method.

* **Majority Voting**: Every model makes a prediction (votes) for each test instance, and the final output prediction is the one that receives more than half of the votes. If no prediction gets more than half of the votes, we may say that the ensemble could not make a stable prediction for this instance. Although this is a widely used technique, you may instead take the most voted prediction (even if it has less than half of the votes) as the final prediction.

* **Weighted Voting**: Unlike majority voting, where each model has the same weight, we can increase the importance of one or more models. In weighted voting, you count the predictions of the better models multiple times. Finding a reasonable set of weights is up to you.
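
The two voting rules described above can be sketched in a few lines. This is a minimal illustration and not the code used later in this notebook; the function names `majority_vote` and `weighted_vote`, the labels, and the weights are all invented for the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label with more than half of the votes, or None if there is none."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight (ties go to the label seen first)."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0) + weight
    return max(totals, key=totals.get)

votes = ["anomaly", "normal", "anomaly", "normal"]
print(majority_vote(votes))                 # None: 2 of 4 votes is not more than half
print(majority_vote(votes + ["anomaly"]))   # anomaly: 3 of 5 votes
print(weighted_vote(votes, [1, 3, 1, 1]))   # normal: total weight 4 against 2
```

Returning `None` when no label has a strict majority mirrors the "no stable prediction" case; a plurality rule would return the most common label instead.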





#  Data Collection & Preprocessing

A car tracking company collects the data automatically.
The car battery (accumulator) supplies electricity to the tracking device, and the device logs the current it draws in milliamperes (mA). The number of log entries per day varies with how much the car is used.

To simplify the problem, we selected one car with a diminishing battery (accumulator). We then processed the data so that each day is summarized by the minimum and maximum of the recorded current values.
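
The raw logs are not part of this notebook, so the following is only a sketch of how such a daily min/max summary could be produced with pandas. The column names `timestamp` and `current_mA` and all the readings are hypothetical.

```python
import pandas as pd

# Hypothetical raw log: one row per reading, with a timestamp and a current in mA.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-01 12:30", "2020-01-01 18:45",
        "2020-01-02 09:15", "2020-01-02 17:00",
    ]),
    "current_mA": [42.0, 55.0, 40.0, 38.0, 61.0],
})

# Group the readings by calendar day and keep the daily extremes.
daily = (
    raw.groupby(raw["timestamp"].dt.date)["current_mA"]
       .agg(Min="min", Max="max")
       .rename_axis("Date")
       .reset_index()
)
print(daily)
```

The result has one row per day with `Date`, `Min`, and `Max` columns, which is the shape of the data set loaded below.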

# Goal

Our goal is to find anomalies in the electricity consumption that are caused by a battery or tracking device malfunction.



# Method
We will use an **ensemble** with **Majority Voting** that combines 4 prediction methods:
* [Facebook's Prophet](https://facebook.github.io/prophet/) 
* [Anomaly Detection with the Normal Distribution](https://anomaly.io/anomaly-detection-normal-distribution/index.html)
* [Simple Moving Average](https://towardsdatascience.com/anomaly-detection-def662294a4e) 
* [Exponential Moving Average](https://towardsdatascience.com/anomaly-detection-def662294a4e)

## NOTES:
* You can access the notebook on [COLAB](https://colab.research.google.com/drive/1Q9KPbgEXHbcJqUmsilPZntbtkGxSFdyA?usp=sharing), [GITHUB](https://github.com/kmkarakaya/ML_tutorials/blob/master/Anomaly_Detection_in_A_Time_Series.ipynb), or [Kaggle](https://www.kaggle.com/kmkarakaya/anomaly-detection-in-time-series-using-voting)

* You can **watch** it on <font color="red"> YOUTUBE </font> in [TURKISH](https://www.youtube.com/watch?v=7MhZ2DDg89Y) or in [ENGLISH](https://youtu.be/cy7vzuuADBc)

* You can download the data from [GITHUB](https://github.com/kmkarakaya/ML_tutorials/blob/master/data/Min-Max%20Daily%20Analyse.csv) or [Kaggle](https://www.kaggle.com/kmkarakaya/car-battery-measurements)

## Load data

Let's begin by importing the dependencies and the data.

In [None]:
import pandas as pd
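# Prophet by Facebook: the package was renamed from `fbprophet` to `prophet` in release 1.0
try:
    from prophet import Prophet
except ImportError:
    from fbprophet import Prophet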
from sklearn.metrics import mean_absolute_error
import warnings; warnings.simplefilter('ignore')
import matplotlib.pyplot as plt
from IPython.display import HTML
import os



In [None]:
pd.set_option('display.max_columns', None)

In [None]:
url='../input/Min-Max Daily Analyse.csv'
df = pd.read_csv(url, sep=',')
df.head()

You can also load the data directly from GitHub:

In [None]:

#url = 'https://raw.githubusercontent.com/kmkarakaya/ML_tutorials/master/data/Min-Max%20Daily%20Analyse.csv'
#df = pd.read_csv(url, sep=';')
#df.head()

## Create New Feature

In [None]:
df['Range'] = df['Max']-df['Min']
df.head()

## Add meta data


In [None]:
# Derive the weekday name from the date (day-first parsing, as in the later cells)
df['Day'] = pd.to_datetime(df['Date'], dayfirst=True).dt.day_name()
df= df[['Date','Day','Min','Max','Range']]
df.head()

## Explore Data



In [None]:
df.describe()

## Visualize the data in a plot.


In [None]:
#df['ds'] = pd.to_datetime(df['Day'],  dayfirst = True)
df.plot(x='Date',   figsize=(15, 5))


# Prediction Method 1: Facebook's Prophet Model

As a prediction method, we will use Prophet. 

[More about Prophet](https://facebook.github.io/prophet/docs/quick_start.html#python-api) 
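
A Prophet forecast contains, next to the point estimate `yhat`, the uncertainty bounds `yhat_lower` and `yhat_upper`. One way to use such a forecast for anomaly detection is to flag observations that fall outside these bounds. The sketch below illustrates the idea on a hand-made table so that it runs without Prophet; every number in it is invented.

```python
import pandas as pd

# Stand-in for the relevant columns of a Prophet forecast (values invented for illustration).
forecast = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=4, freq="D"),
    "yhat_lower": [10.0, 11.0, 10.5, 10.0],
    "yhat_upper": [20.0, 21.0, 20.5, 20.0],
})
observed = pd.Series([15.0, 25.0, 9.0, 20.0], name="y")

# An observation is suspicious when it lies outside [yhat_lower, yhat_upper].
too_low = observed < forecast["yhat_lower"]
too_high = observed > forecast["yhat_upper"]
flags = forecast[["ds"]].assign(y=observed, anomaly=too_low | too_high)
print(flags)
```

In this notebook only one side of the interval is checked per observation: the lower bound for the daily minimum and the upper bound for the daily maximum and range.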




## Prepare a Train & Predict function for  the Prophet Model


In [None]:
def prediction_Prophet(feature):
  # Prophet expects a frame with a datestamp column 'ds' and a value column 'y'
  dfNew = pd.DataFrame()
  dfNew['ds'] = pd.to_datetime(df['Date'], dayfirst=True)
  dfNew['y'] = df[feature].copy()

  m = Prophet(daily_seasonality=True)
  m.fit(dfNew)
  # Forecast the observed days plus one day ahead
  horizon = 1
  future = m.make_future_dataframe(periods=horizon)
  forecast = m.predict(future)
  print('\nForecasted {} values \n {}\n'.format(feature, forecast[['ds',  'yhat', 'yhat_lower', 'yhat_upper']].tail()))
  fig1 = m.plot(forecast)
  #fig2 = m.plot_components(forecast)
  return forecast

## Run the Prophet Model

### For Range


In [None]:
pred=prediction_Prophet('Range')

df['Range_By_Prophet']=pred['yhat_upper']

print('Anomalies for range values\n', df[df['Range']>df['Range_By_Prophet']][['Date','Day','Range','Range_By_Prophet']])

### For Min Values

In [None]:
pred=prediction_Prophet('Min')

df['Min_By_Prophet']=pred['yhat_lower']
print('Anomalies for min values\n', df[df['Min']<df['Min_By_Prophet']][['Date','Day','Min','Min_By_Prophet']])


### For Max values

In [None]:
pred=prediction_Prophet('Max')

df['Max_By_Prophet']=pred['yhat_upper']
print('Anomalies for Max values\n', df[df['Max']>df['Max_By_Prophet']][['Date','Day','Max','Max_By_Prophet']])

## Compare the predictions with collected data 


In [None]:
df.plot(title="comparison",x='Date',y=['Min','Max', 'Min_By_Prophet','Max_By_Prophet'],figsize=(20, 6))

In [None]:
df.plot(title="comparison",x='Date',y=['Range','Range_By_Prophet'],figsize=(20, 6))

# Prediction Method 2: Mean + 2 SD

[More about Normal distribution & Standard Deviation](https://anomaly.io/anomaly-detection-normal-distribution/index.html)

## Key Concepts:
* A **normal distribution** is a very common probability distribution that approximates the behavior of many natural phenomena.

* The **standard deviation**, called sigma (σ), defines how far the normal distribution is spread around the mean.

## Mathematical Rules:

When a metric is normally distributed, it follows some interesting laws. As an example, take a metric with a mean of 1000 and a standard deviation of 20:

* The **mean** and the **median** are the same: both equal 1000 in this example, because the “bell shape” is perfectly symmetric.

* The standard deviation, called sigma (σ), is 20 in this example.
* About 68% of all values fall within [mean-σ, mean+σ]; in the example, [980, 1020].
* About 95% of all values fall within [mean-2*σ, mean+2*σ]; in the example, [960, 1040].
* About 99.7% of all values fall within [mean-3*σ, mean+3*σ]; in the example, [940, 1060].

The last 3 rules are also known as the **68–95–99.7 rule** or the “**three-sigma rule of thumb**”.
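
These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2); the short sketch below evaluates it for the example values. The helper name `coverage` is just an illustrative choice.

```python
import math

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

mean, sigma = 1000, 20  # the example values used above
for k in (1, 2, 3):
    low, high = mean - k * sigma, mean + k * sigma
    print(f"within {k} sigma: [{low}, {high}] covers {coverage(k):.1%}")
```

The printed coverages are roughly 68.3%, 95.4%, and 99.7%. With a threshold of mean ± 2σ, as used in this method, about 5% of perfectly normal values would still be flagged.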



<img border="0" alt="The 68-95-99.7 (three-sigma) rule" src="https://github.com/kmkarakaya/ML_tutorials/blob/master/images/3-sigma-rules.png?raw=true" width="500" height="500">



### For Min

In [None]:
print('Mean of Min', df['Min'].mean())
print('Standard Deviation of Min', df['Min'].std())
print('Expected minimum value of Min', df['Min'].mean()-2*df['Min'].std())
df['Min_Calculated']=df['Min'].mean()-2*df['Min'].std()
print('Anomalies for Min values\n', df[df['Min']<df['Min_Calculated']][['Date','Day','Min','Min_Calculated']])

### For Max

In [None]:
print('Mean of Max', df['Max'].mean())
print('Standard Deviation of Max', df['Max'].std())
print('Expected maximum value of Max', df['Max'].mean()+2*df['Max'].std())
df['Max_Calculated']=df['Max'].mean()+2*df['Max'].std()
print('Anomalies for Max values\n', df[df['Max']>df['Max_Calculated']][['Date','Day','Max','Max_Calculated']])

### For Range

In [None]:
print('Mean of Range', df['Range'].mean())
print('Standard Deviation of Range', df['Range'].std())
maxRange=df['Range'].mean()+2*df['Range'].std()
print('Expected maximum value of Range', maxRange)
df['Range_Calculated']=maxRange
print('Anomalies for Range values\n', df[df['Range']>df['Range_Calculated']][['Date','Day','Range','Range_Calculated']])

## Compare the predictions with collected data 

In [None]:
df.plot(title="comparison",x='Date',y=['Min','Max', 'Min_Calculated','Max_Calculated'],figsize=(20, 6))


In [None]:
df.plot(title="Range",x='Date',y=['Range','Range_Calculated'],figsize=(20, 6))

# Summary


In [None]:
CodesOfInterest=['anomaly']
def hover(hover_color="#ffff99"):
    return dict(selector="tr:hover",
                props=[("background-color", "%s" % hover_color)])
def showSummary(fontSize='12px'):
  # Keep only the rows in which at least one column is flagged as an anomaly
  summary = anomaly[anomaly.isin(CodesOfInterest).any(axis=1)]
  styles = [
    hover(),
    dict(selector="th", props=[("font-size", fontSize),
                               ("text-align", "center")]),
    dict(selector="tr", props=[("font-size", fontSize),
                               ("text-align", "center")]),      
    dict(selector="caption", props=[("caption-side", "bottom")])
  ]
  html = (summary.style.set_table_styles(styles)
          .set_caption("Hover to highlight."))
  print(' Number of detected anomalies: ', len(summary) )
  return html

In [None]:
anomaly = pd.DataFrame()
anomaly = df[['Date','Day','Min','Max','Range']].copy()
anomaly['Min_anomaly_Prophet']= df['Min']
anomaly['Max_anomaly_Prophet']= df['Max']
anomaly['Range_anomaly_Prophet']=df['Range']

anomaly['Min_anomaly_Calculated']= df['Min']
anomaly['Max_anomaly_Calculated']= df['Max']
anomaly['Range_anomaly_Calculated']= df['Range']


In [None]:
df.columns

In [None]:
# Label each day per method and observation: 'anomaly' if outside the limit, '' otherwise
# (whole-column assignment avoids pandas' chained-assignment pitfalls)
labels = {True: 'anomaly', False: ''}

anomaly['Min_anomaly_Prophet'] = (df['Min'] < df['Min_By_Prophet']).map(labels)
anomaly['Max_anomaly_Prophet'] = (df['Max'] > df['Max_By_Prophet']).map(labels)
anomaly['Range_anomaly_Prophet'] = (df['Range'] > df['Range_By_Prophet']).map(labels)

anomaly['Min_anomaly_Calculated'] = (df['Min'] < df['Min_Calculated']).map(labels)
anomaly['Max_anomaly_Calculated'] = (df['Max'] > df['Max_Calculated']).map(labels)
anomaly['Range_anomaly_Calculated'] = (df['Range'] > df['Range_Calculated']).map(labels)




In [None]:
showSummary('11px')


# Prediction Method 3: Simple Moving Average (SMA)


Let's calculate the Simple Moving Average with a 7-day window and flag values that fall more than 2 rolling standard deviations away from it.
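
Before applying this to the battery data, here is a minimal sketch of a moving-average band on a toy series. It uses a 7-point window, the value set in `predict_SMA` below, and a band of 2 rolling standard deviations; the series itself is invented.

```python
import pandas as pd

values = pd.Series([10.0, 11.0, 10.0, 12.0, 11.0, 10.0, 30.0, 11.0, 10.0, 12.0])
window = 7  # same window length as in predict_SMA

sma = values.rolling(window=window).mean()   # mean of the last `window` points
rstd = values.rolling(window=window).std()   # sample standard deviation of the same points
upper = sma + 2 * rstd
lower = sma - 2 * rstd

# The first window-1 rows have no band (NaN), so they are never flagged here.
outside = (values > upper) | (values < lower)
print(pd.DataFrame({"value": values, "lower": lower, "upper": upper, "outside": outside}))
```

Only the spike of 30 is flagged. Because each window includes the point being tested, a spike also widens the band it is compared against, so only pronounced spikes end up outside it.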


## Prepare a Train & Predict function for SMA


In [None]:
def predict_SMA(feature):
  window = 7
  # Rolling mean and rolling (sample) standard deviation over the last `window` days
  sma = df[feature].rolling(window=window).mean()
  rstd = df[feature].rolling(window=window).std()
  bands = pd.DataFrame()
  bands['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
  bands['sma'] = sma
  # Band of 2 rolling standard deviations around the moving average
  bands['lower'] = sma - 2 * rstd
  bands['upper'] = sma + 2 * rstd
  bands = bands.join(df[feature])
  bands = bands.set_index('Date')
  # DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
  ax = bands.plot(title=feature,  figsize=(20, 6))
  ax.fill_between(bands.index, bands['lower'], bands['upper'], color='#ADCCFF', alpha=0.4)
  ax.set_xlabel('Date')
  ax.set_ylabel(feature)
  ax.grid()
  plt.show()
  return bands

## For Min 


In [None]:
bands = predict_SMA('Min')
bands.reset_index(inplace=True)
# The first days have no band yet; fill them with the overall minimum so they are never flagged
min_value = df['Min'].min()
bands['lower'] = bands['lower'].fillna(min_value)
df['Min_SMA']= bands['lower'].copy()
print('Anomalies for SMA_Min values\n', df[df['Min']<df['Min_SMA']][['Date','Min', 'Min_SMA']])



## For Max 


In [None]:
bands = predict_SMA('Max')
bands.reset_index(inplace=True)
# The first days have no band yet; fill them with the overall maximum so they are never flagged
max_value = df['Max'].max()
bands['upper'] = bands['upper'].fillna(max_value)

df['Max_SMA']= bands['upper'].copy()
print('Anomalies for Max_SMA values\n', df[df['Max']>df['Max_SMA']][['Date','Max', 'Max_SMA']])

## For Range

In [None]:
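bands = predict_SMA('Range')
bands.reset_index(inplace=True)
# The first days have no band yet; fill them with the overall maximum so they are never flagged
max_value = df['Range'].max()
bands['upper'] = bands['upper'].fillna(max_value)
df['Range_SMA']= bands['upper'].copy()
# Strict comparison, consistent with the summary cell that follows
print('Anomalies for Range_SMA values\n', df[df['Range']>df['Range_SMA']][['Date','Range', 'Range_SMA']])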

# Summary

In [None]:
# Label each day: 'anomaly' if outside the SMA band, '' otherwise
labels = {True: 'anomaly', False: ''}

anomaly['Min_anomaly_SMA'] = (df['Min'] < df['Min_SMA']).map(labels)
anomaly['Max_anomaly_SMA'] = (df['Max'] > df['Max_SMA']).map(labels)
anomaly['Range_anomaly_SMA'] = (df['Range'] > df['Range_SMA']).map(labels)


In [None]:
showSummary('10px')

# Prediction Method 4: Exponential Moving Average (EMA)

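The EMA is defined recursively:

$$EMA(t) = (1-\alpha)\,EMA(t-1) + \alpha\,p(t), \qquad EMA(t_0) = p(t_0)$$

where $p(t)$ is the observed value at time $t$ (a price in financial applications; a current reading in this notebook) and $\alpha$ is called the decay parameter for the EMA.

$\alpha$ is related to the lag $L$ as

$$\alpha = \frac{1}{L+1}$$

and to the length of the window (span) $M$ as

$$\alpha = \frac{2}{M+1}.$$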

The reason why EMA reduces the lag is that it puts more weight on more recent observations, whereas the SMA weights all observations equally by 1/M.
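
As a small check of the recursion EMA(t) = (1 − α)·EMA(t − 1) + α·p(t), started at the first observation, the sketch below computes it by hand with α = 2/(M + 1) and compares the result with pandas' `ewm(span=M, adjust=False)`. The input series is invented.

```python
import pandas as pd

values = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])
span = 3                    # window length M
alpha = 2 / (span + 1)      # decay parameter, 0.5 for M = 3

# Apply the recursion by hand: start at the first value, then blend old and new.
ema = [values.iloc[0]]
for p in values.iloc[1:]:
    ema.append((1 - alpha) * ema[-1] + alpha * p)
manual = pd.Series(ema, index=values.index)

# pandas computes the same recursion when adjust=False.
builtin = values.ewm(span=span, adjust=False).mean()
print(pd.DataFrame({"value": values, "manual": manual, "pandas": builtin}))
```

Both columns agree (10, 11, 11, 13, 13.5). With α = 0.5, the newest observation alone carries half of the weight, which is why the EMA reacts faster than an SMA of the same span.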


## Prepare a Train & Predict function for EMA

In [None]:
def predict_EMA(feature):
  window = 3
  # Exponential moving average with span `window`; adjust=False uses the plain recursion
  ema = df[feature].ewm(span=window,adjust=False).mean()
  # The band width still comes from the rolling (sample) standard deviation
  rstd = df[feature].rolling(window=window).std()
  bands = pd.DataFrame()
  bands['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
  bands['ema'] = ema
  bands['lower'] = ema - 2 * rstd
  bands['upper'] = ema + 2 * rstd
  bands = bands.join(df[feature])
  bands = bands.set_index('Date')
  # DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
  ax = bands.plot(title=feature,  figsize=(20, 6))
  ax.fill_between(bands.index, bands['lower'], bands['upper'], color='#ADCCFF', alpha=0.4)
  ax.set_xlabel('Date')
  ax.set_ylabel(feature)
  ax.grid()
  plt.show()
  return bands

## For Min

In [None]:
bands= predict_EMA('Min')
bands.reset_index(inplace=True)
# The first days have no band yet; fill them with the overall minimum so they are never flagged
min_value = df['Min'].min()
bands['lower'] = bands['lower'].fillna(min_value)
df['Min_EMA']= bands['lower'].copy()
print('Anomalies for EMA_Min values\n', df[df['Min']<df['Min_EMA']][['Date','Min', 'Min_EMA']])

## For Max

In [None]:
bands = predict_EMA('Max')
bands.reset_index(inplace=True)
# The first days have no band yet; fill them with the overall maximum so they are never flagged
max_value = df['Max'].max()
bands['upper'] = bands['upper'].fillna(max_value)
df['Max_EMA']= bands['upper'].copy()
print('Anomalies for EMA_Max values\n', df[df['Max']>df['Max_EMA']][['Date','Max', 'Max_EMA']])

## For Range

In [None]:
bands = predict_EMA('Range')
bands.reset_index(inplace=True)
# The first days have no band yet; fill them with the overall maximum so they are never flagged
max_value = df['Range'].max()
bands['upper'] = bands['upper'].fillna(max_value)
df['Range_EMA']= bands['upper'].copy()
print('Anomalies for EMA_Range values\n', df[df['Range']>df['Range_EMA']][['Date','Range', 'Range_EMA']])

# Summary


In [None]:
# Label each day: 'anomaly' if outside the EMA band, '' otherwise
labels = {True: 'anomaly', False: ''}

anomaly['Min_anomaly_EMA'] = (df['Min'] < df['Min_EMA']).map(labels)
anomaly['Max_anomaly_EMA'] = (df['Max'] > df['Max_EMA']).map(labels)
anomaly['Range_anomaly_EMA'] = (df['Range'] > df['Range_EMA']).map(labels)

In [None]:
showSummary('9px')

In [None]:
anomaly.info()

In [None]:
# Keep only the days flagged by at least one method and observation
anomaly = anomaly[anomaly.isin(CodesOfInterest).any(axis=1)].copy()

# Count the 'anomaly' flags in each row across the 9 Prophet, Mean + 2 SD, and SMA columns
# (columns 5 to 13; the EMA columns that follow are left out of the vote,
# so a day flagged only by EMA gets 0 votes)
anomaly['Vote_Number'] = (anomaly.iloc[:, 5:14] == 'anomaly').sum(axis=1)
anomaly['Vote_Ratio'] = anomaly['Vote_Number'] / 9 * 100
anomaly.plot.bar(x='Date', y='Vote_Number')
print(anomaly[['Date','Day', 'Vote_Number']])

In [None]:
print("Total Number of detected anomalies: ",len(anomaly))
threshold= 50
print("Number of Anomalies over the threshold ({}%) voting: {} ".format(threshold,len(anomaly[anomaly['Vote_Ratio']>threshold] )))
print(anomaly[anomaly['Vote_Ratio']>threshold][['Date','Day','Vote_Number','Min','Max','Range']])

In [None]:
anomaly['Vote_Number'].describe()

# Export the Anomaly Report

In [None]:
anomaly.to_csv('anomaly.csv')

# Conclusion

Given the data set and using Prophet, Mean + 2 SD, SMA, and EMA to detect anomalies:
* The data covers ONLY **90 days**
* Majority Voting provides a filtering mechanism for the anomalies predicted by the various methods
* We analyze each day using 3 observations: the **min, max, and range** values of the current in mA
* The four selected methods detected **22 anomalies** out of 90 data samples
* **EMA** predicted almost no anomalies
* Thus, we removed it and used only the remaining **three methods** in Majority Voting
* We decided that there are **3 anomalies** by setting the Majority Voting **threshold to 50%**


# How you can improve
* Add more prediction methods such as:

>* [ARIMA](https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/)
>*  [Autoregressive (AR)](https://machinelearningmastery.com/autoregression-models-time-series-forecasting-python/) 

* Implement a Weighted Voting scheme

* Apply what you have learned here to a different data set

* Write a comment to me


# Thank you
Murat Karakaya

First Submission: 09/05/2020