In [1]:
from IPython.display import HTML

<a href = "https://snibborevets.github.io/ac209a_anomaly/overview"><img style="float: left;" src="https://snibborevets.github.io/ac209a_anomaly/webpage_banner.jpg"></a>

In [2]:
HTML('''<blockquote>
            
        <script>
            code_show=true;
            function code_toggle() {
                if (code_show){
                     $('div.input').hide();
                 } else {
                     $('div.input').show();
                 }
             code_show = !code_show
            }
            $( document ).ready(code_toggle);
        </script>

        <form action="javascript:code_toggle()" id="button">
            <input type="submit" value="Click here to toggle on/off the raw code.">\
        </form>

        </blockquote>''')

In [5]:
HTML('''<blockquote>

        <h2> Website Links: </h2>
        
        <br>
        
        <select onChange="window.location.href=this.value">
           <option value = "overview.html" selected = "selected">Overview</option>
           <option value = "key_concepts.html">Key Concepts</option>
           <option value = "simple_anomaly_detection.html">Simple Anomaly Detection</option>
           <option value = "deterministic_power_martingales.html">Deterministic Power Martingales</option>
           <option value = "randomized_power_martingales.html">Randomized Power Martingales</option>
           <option value = "adiabatic_SVM.html">Adiabatic Iterative Support Vector Machines</option>
           <option value = "conclusion.html">Conclusion</option>
           <option value = "works_cited.html">Works Cited</option>
           <option value = "https://snibborevets.github.io/ac209a_anomaly/Stephen_Robbins_Ryan_Lapcevic_AC209a_Poster.pdf">Poster Presentation</option>
        </select>

        </blockquote>''')

---

## Overview

- [Project Objective](#Project-Objective)
- [Data Overview](#Data-Overview)
- [Algorithm Strategy](#Algorithm-Strategy)
- [Results and Conclusion](#Results-and-Conclusion)
- [Link to Poster Presentation](#Link-to-Poster-Presentation)

---

### Project Objective


***To create an algorithm to detect anomalies on-line for time-series data***

There are many practical uses for anomaly detection. Quantitative hedge fund analysts write algorithms to detect unusual fluctuations in stock markets, seeking to exploit an arbitrage for profit. Weather analysts detect unusual climate patterns to predict future weather. Network security analysts discover breaches by detecting unusual network activity.

In many problem domains it is desirable to analyze data dynamically as they are collected over time. This introduces a number of problems for traditional machine learning algorithms, many of which assume that the data set is exchangeable. Exchangeability essentially means that past data are representative of future data and can therefore be used to make inferences. A more detailed discussion of exchangeability can be found [here](https://snibborevets.github.io/ac209a_anomaly/key_concepts.html).
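
For reference, the standard formal statement of exchangeability (given here as general background, not taken from this project's pages) is that the joint distribution of the observations does not change when they are reordered:

$$
P(x_1, x_2, \ldots, x_n) = P(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)}) \quad \text{for every permutation } \pi \text{ of } \{1, \ldots, n\}.
$$

An anomaly or concept change breaks this symmetry, because observations after the change no longer look like a reshuffling of the observations before it.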

Our approach to creating an anomaly detection algorithm consisted of the following:
* Analyze [previous research](https://snibborevets.github.io/ac209a_anomaly/works_cited.html) and strategies for detecting anomalies
* Find time-series datasets to run through a simple anomaly detection algorithm that uses a dataset's derivative
* Implement the martingale algorithms described in Vovk's ["Testing Exchangeability On-Line"](http://www.aaai.org/Papers/ICML/2003/ICML03-100.pdf)
* Implement an algorithm utilizing iterative [support vector machines](https://snibborevets.github.io/ac209a_anomaly/adiabatic_SVM.html)
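
The simple derivative-based detector in the second step could look roughly like the sketch below. This is an illustration under stated assumptions, not the project's actual code: the function name `derivative_anomalies`, the threshold `k`, the `warmup` length, and the use of a first difference as the "derivative" are all hypothetical choices. To stay on-line, each point is judged only against statistics of earlier differences.

```python
import math


def derivative_anomalies(series, k=4.0, warmup=10):
    """Flag indices whose first difference is far from the running mean of past differences.

    Hypothetical helper: k (number of standard deviations) and warmup
    (differences to observe before flagging) are illustrative defaults.
    """
    flagged = []
    count, mean, m2 = 0, 0.0, 0.0  # Welford running statistics of past differences
    for i in range(1, len(series)):
        diff = series[i] - series[i - 1]
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(diff - mean) > k * std:
                flagged.append(i)
        # Update the statistics after testing, so each point is judged only on the past.
        count += 1
        delta = diff - mean
        mean += delta / count
        m2 += delta * (diff - mean)
    return flagged


# Usage: a small oscillation with one obvious spike at index 30.
series = [0.1 * (i % 2) for i in range(40)]
series[30] = 5.0
flagged = derivative_anomalies(series)
```

In this example the jump into the spike at index 30 is flagged. Because the detector looks at differences, the drop back down immediately afterwards can be flagged as well.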

---

### Data Overview

We sought to find time-series datasets across a broad set of fields. An ideal dataset has a large sample size (an n of over 1,000, though this varies with the algorithm being implemented), equally spaced samples over time (i.e., identical time intervals between samples and no time "gaps"), and samples labelled for anomalous events.
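
The size and spacing criteria above can be checked mechanically before a dataset is used. The sketch below is a minimal illustration, assuming the timestamps are already numeric (for example, seconds or day numbers). The helper name `check_series` and the `tolerance` parameter are hypothetical.

```python
def check_series(timestamps, min_samples=1000, tolerance=1e-9):
    """Report whether a series meets the sample-size and equal-spacing criteria.

    Hypothetical helper: assumes numeric, sorted timestamps.
    """
    n = len(timestamps)
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    step = gaps[0] if gaps else None
    # Every interval must match the first one, otherwise there is a time "gap".
    evenly_spaced = all(abs(g - step) <= tolerance for g in gaps)
    return {
        "n": n,
        "large_enough": n > min_samples,
        "evenly_spaced": evenly_spaced,
        "step": step,
    }


# Usage: 1,500 samples taken every 2 time units.
report = check_series(list(range(0, 3000, 2)))
```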

There are many sources online for finding clean time-series datasets. A few are listed below:
* [DataMarket's Time Series Library](https://datamarket.com/data/list/?q=provider:tsdl)
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html)
* [UCR Time Series Data](http://www.cs.ucr.edu/~eamonn/time_series_data/)
* [Links to Several Other Sites](https://www.researchgate.net/post/What_are_anomaly_detection_benchmark_datasets)

Only very minor data cleaning was needed to work with the data from the above links. We also created artificial datasets containing obviously anomalous data to test our algorithms.
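
An artificial dataset of this kind might be generated as in the sketch below. The construction is an assumption for illustration only (a noisy sine wave with a sudden level shift). The function name, the parameter values, and the choice to label only the change point are hypothetical and are not taken from the project.

```python
import math
import random


def make_synthetic_series(n=1200, change_at=800, shift=5.0, noise=0.5, seed=0):
    """Build a noisy periodic series with one obvious level shift at `change_at`.

    Hypothetical generator: returns (values, labels), where labels marks
    the single index at which the anomaly begins.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    values, labels = [], []
    for t in range(n):
        base = math.sin(2 * math.pi * t / 50)       # regular "normal" behaviour
        level = shift if t >= change_at else 0.0    # the injected anomaly
        values.append(base + level + rng.gauss(0.0, noise))
        labels.append(t == change_at)
    return values, labels


values, labels = make_synthetic_series()
```

Because the position of the shift is known, a detector's output can be compared directly against `labels`, which is not possible with most of the real datasets.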

Our biggest challenge was finding a dataset with labelled anomalies. As a result, we focused on data in which known anomalies have taken place. For example, in our poster presentation we showed an anomaly detection algorithm applied to stock market data. This enabled us to see whether our algorithm detected known anomalous events, such as the Cuban Missile Crisis, the dot-com bubble, or the Great Recession. The image below, taken from our poster presentation, is a plot of anomalies in the S&P 500 Index since 1950:

<img style="float: left;" src="https://snibborevets.github.io/ac209a_anomaly/sp500_poster.jpg">

---

### Algorithm Strategy

There were several steps needed to create our anomaly detection algorithm. Our method is best described by the following section of our poster presentation. NOTE: the formulas below are taken from Vovk's ["Testing Exchangeability On-Line"](http://www.aaai.org/Papers/ICML/2003/ICML03-100.pdf).

<img style="float: left;" src="https://snibborevets.github.io/ac209a_anomaly/algorithm_flow_chart.jpg">
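
As a rough companion to the flow chart, the sketch below computes a randomized power martingale from a stream of strangeness scores. It uses the p-value $p_n = \frac{\#\{i : \alpha_i > \alpha_n\} + \theta_n \#\{i : \alpha_i = \alpha_n\}}{n}$ and the martingale $M_n^{(\epsilon)} = \prod_{i=1}^{n} \epsilon\, p_i^{\epsilon - 1}$ from Vovk's paper. The function name, the default `epsilon`, and the choice to take precomputed scores as input are assumptions for illustration. This is not the project's implementation.

```python
import math
import random


def power_martingale(strangeness, epsilon=0.92, seed=0):
    """Return log10 of the randomized power martingale after each observation.

    Hypothetical helper: `strangeness` is a sequence of precomputed scores
    (larger = stranger). Working in log space avoids overflow and underflow.
    """
    rng = random.Random(seed)
    log_m = 0.0
    history = []
    path = []
    for alpha in strangeness:
        history.append(alpha)
        greater = sum(1 for a in history if a > alpha)
        equal = sum(1 for a in history if a == alpha)  # includes the new point itself
        theta = rng.random()                           # tie-breaking randomization
        p = (greater + theta * equal) / len(history)
        p = max(p, 1e-12)                              # guard against log(0) when theta is 0
        log_m += math.log10(epsilon) + (epsilon - 1.0) * math.log10(p)
        path.append(log_m)
    return path


# Usage: 200 ordinary scores followed by 50 much stranger ones.
rng = random.Random(1)
scores = [rng.random() for _ in range(200)] + [10 + rng.random() for _ in range(50)]
path = power_martingale(scores)
```

While the scores remain exchangeable the martingale tends to stay small. Once unusually strange points arrive, their p-values become small and the martingale grows, so a change can be signalled when it exceeds a chosen threshold. Rescanning the history at every step costs time proportional to the number of points seen so far. That is acceptable for a sketch but would need a sorted structure for long streams.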

---

### Results and Conclusion

We used several methods for detecting anomalies: a simple derivative-based detection function, deterministic and randomized power martingales, and adiabatic iterative support vector machines. Our empirical results suggest that our martingale-based anomaly detection algorithm successfully identifies concept changes (anomalies) in streaming data for each method.

While we obtained good results, we would have liked to spend more time revising our strangeness algorithms, specifically the deterministic and randomized power martingales and the adiabatic iterative SVM methods. Regarding power martingales, it would have been useful to make our algorithm exactly match the one used in Vovk's research on USPS data. Furthermore, as seen in the plots on our conclusion page, we may have been able to improve its performance so that it yielded better results than the simple derivative-based algorithm. Regarding the adiabatic iterative SVM methods, we were not able to fully implement the algorithm.

We also would have liked to explore more methods for determining the strangeness of our data. We used the nearest-neighbor and SVM methods, but additional strangeness measures could have been used. For example, we could have used a random forest classifier or other decision tree algorithms to determine how strange a given point is compared to previously received data.
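
To make the idea of a strangeness measure concrete, the sketch below scores each point of an unlabelled, one-dimensional series by its distance to the closest previously seen value. This is a deliberately simplified, hypothetical variant of a nearest-neighbor measure. The function name and the past-only scoring are assumptions, and it is not the measure used in the project or in Vovk's paper.

```python
def nearest_neighbor_strangeness(series):
    """Score each point by its distance to the closest previously seen value.

    Hypothetical helper: larger scores mean the point is less like the
    data received so far.
    """
    scores = []
    for i, x in enumerate(series):
        if i == 0:
            scores.append(0.0)  # no history yet, so treat the first point as unremarkable
        else:
            scores.append(min(abs(x - past) for past in series[:i]))
    return scores


# Usage: the last value is far from everything seen before it.
scores = nearest_neighbor_strangeness([1.0, 1.5, 1.25, 9.0])
```

Any other model that can rate how unusual a new point is, such as the tree-based methods mentioned above, could be substituted for this function without changing the rest of the pipeline.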

---

### Link to Poster Presentation

Our poster presentation from 12/9/2016 can be found [here](https://snibborevets.github.io/ac209a_anomaly/Stephen_Robbins_Ryan_Lapcevic_AC209a_Poster.pdf).

---