In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.matplotlib.rcParams['savefig.dpi'] = 144
import seaborn

# Anomaly Detection, Session 1

The first two sessions will be focused on anomaly detection in time-series data.

**Time series** differ from other sources of data in that they are explicitly ordered.  The usual intent is to use past data to make predictions about the future, so only data from the past may be used to make a prediction.  For simplicity, we will ignore this restriction for most of these sessions, but will discuss it in the context of **online learning** towards the end.

**Anomaly detection**, or novelty detection, attempts to find data that look different from the majority of the data.  It is typically an **unsupervised learning** problem.  By this, we mean that the anomalous data are not **labeled**.  We must detect them by learning what the normal data look like.

## CitiBike Ridership Data

We will be looking at ridership from the CitiBike bike sharing system.  The data are available [online](https://s3.amazonaws.com/tripdata/index.html).  The zip files should be loaded into the `anomaly/tripdata/` directory.  The script `download.sh` will do this for you.

Let's start by looking at an individual file.  Python's *zipfile* module will allow us to read data from those zip files without manually unzipping each one.

In [None]:
import zipfile

In [None]:
zf = zipfile.ZipFile('tripdata/201307-citibike-tripdata.zip', 'r')
zf.namelist()

In [None]:
data = zf.read(zf.namelist()[0])
# zf.read() returns bytes, so decode before splitting into lines
print('\n'.join(data.decode('utf-8').split('\n')[:5]))

The *pandas* module provides a `DataFrame` class for powerful manipulation of data.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(zf.open(zf.namelist()[0]))
df.head()

We will get just the day of the start time.  String processing is a bit faster than turning everything into datetime objects.

In [None]:
df['starttime'].str.split(' ', n=1).apply(lambda x: x[0]).head()

Now, all we have to do is count how many times each date occurs:

In [None]:
df['starttime'].str.split(' ', n=1).apply(lambda x: x[0]).value_counts().sort_index()
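
To see the split-and-count idiom on data small enough to check by hand, here is a minimal sketch.  The timestamps are made up for illustration and are not taken from the CitiBike files.

```python
import pandas as pd

# Hypothetical start times in the same "date time" layout as the trip data.
toy = pd.Series([
    '2013-07-01 00:00:00',
    '2013-07-01 08:15:30',
    '2013-07-02 09:00:00',
    '2013-07-01 17:45:10',
    '2013-07-02 23:59:59',
])

# Keep only the text before the first space (the date), then count each date.
toy_days = toy.str.split(' ', n=1).str[0]
toy_counts = toy_days.value_counts().sort_index()
print(toy_counts)
```

Three of the toy rides start on July 1 and two on July 2, so the counts can be verified by eye.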

We can convert the index to actual datetime objects for convenience later.  Let's wrap this all up in a function.

In [None]:
def load_counts(fn):
    zf = zipfile.ZipFile(fn, 'r')
    df = pd.read_csv(zf.open(zf.namelist()[0]))
    counts = df['starttime'].str.split(' ', n=1).apply(lambda x: x[0]).value_counts()
    if '-' in counts.index[0]:
        counts.index = pd.to_datetime(counts.index, format='%Y-%m-%d')
    else:
        counts.index = pd.to_datetime(counts.index, format='%m/%d/%Y')
    return counts.sort_index()

load_counts('tripdata/201307-citibike-tripdata.zip')

And let's do this for all of the files.

In [None]:
import glob
fns = glob.glob('tripdata/[0-9][0-9][0-9][0-9][0-9][0-9]-citibike-tripdata.zip')
counts = pd.concat([load_counts(fn) for fn in sorted(fns)])

In [None]:
counts.plot()
plt.ylabel('Rides per day')

## Detecting seasonality

**Seasonality** is the tendency of time-series data to have underlying cycles.  We typically wish to remove these variations before undertaking further analysis.

The first tool we will look at is the **autocorrelation**.  For two random variables, $X$ and $Y$, the covariance is defined as

$$ \mbox{Cov}[X, Y] = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]E[Y] \ .$$

If there is no correlation between the random variables, the covariance is 0.  If the two random variables always return the same value,

$$ \mbox{Cov}[X, Y] = \mbox{Cov}[X, X] = \mbox{Var}[X] \ . $$

The autocovariance of a time-series signal is just the covariance of the signal with a time-lagged copy.  The autocorrelation normalizes this by the variance of the signal:

$$ \rho(X \mid \tau) = \frac{\mbox{Cov}[X_t, X_{t+\tau}]}{\mbox{Var}[X]} \ . $$
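
The definition can be sketched directly in NumPy.  The helper `autocorrelation` and the synthetic 7-sample cycle below are assumptions made for illustration.  Dividing by the full-length sum of squares is one common normalization, not the only one.

```python
import numpy as np

def autocorrelation(x, lag):
    """Sample autocorrelation of a 1-D signal at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    if lag == 0:
        return 1.0
    # Covariance between the signal and its lagged copy, divided by the variance.
    return np.sum(d[:-lag] * d[lag:]) / np.sum(d * d)

# Hypothetical signal: a pure 7-sample cycle observed for 20 periods.
t = np.arange(140)
signal = np.sin(2 * np.pi * t / 7)

rho_full = autocorrelation(signal, 7)   # one full period: strongly positive
rho_half = autocorrelation(signal, 3)   # near half a period: negative
print(rho_full, rho_half)
```

A lag equal to the period lines the signal up with itself, so the autocorrelation is close to 1.  A lag near half the period lines peaks up with troughs, so it is strongly negative.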

Pandas provides a built-in autocorrelation plot.

In [None]:
pd.plotting.autocorrelation_plot(counts)
plt.axvline(365, color='k', ls=':')

The yearly cycle is clearly visible.  Zooming in, more detail becomes obvious.

In [None]:
pd.plotting.autocorrelation_plot(counts)
plt.xlim(0,60)
plt.axvline(7, color='k', ls=':')

**Fourier analysis** takes advantage of the fact that any signal can be written as the sum of sinusoids: 

$$ X_t = \sum_\nu A_\nu \sin(2\pi\nu t) + B_\nu \cos(2\pi\nu t) \ . $$

The **Fourier transform** is used to read out the coefficients $A_\nu$ and $B_\nu$ from the original signal.  If the signal is sampled, this process is known as the **discrete Fourier transform** (DFT).  The standard algorithm for doing this is the **fast Fourier transform** (FFT), which is implemented in the *numpy* module.

Quite often, we are not concerned with the Fourier coefficients directly, but with the total power provided at frequency $\nu$, $A_\nu^2 + B_\nu^2$.  This **power spectrum** is easily derived from the Fourier transform.

(In practice, Fourier analysis uses complex exponentials instead of sines and cosines.  Instead of real coefficients $A_\nu$ and $B_\nu$, the FFT returns a single complex coefficient $C_\nu$.  The contribution to the power spectrum is $|C_\nu|^2$.)
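
As a rough check of how the complex coefficients relate to the sine and cosine amplitudes, the sketch below builds a synthetic signal with known amplitudes and reads them back from `np.fft.fft`.  The signal, the amplitudes, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical signal: 3 cycles of a sine plus 5 cycles of a cosine over n samples.
n = 200
t = np.arange(n)
A, B = 2.0, 0.5
x = A * np.sin(2 * np.pi * 3 * t / n) + B * np.cos(2 * np.pi * 5 * t / n)

C = np.fft.fft(x)
power = np.abs(C) ** 2

# With NumPy's sign and scaling conventions, for 0 < k < n/2:
#   cosine amplitude = 2 * Re(C_k) / n,  sine amplitude = -2 * Im(C_k) / n
sine_amp_3 = -2 * C[3].imag / n
cosine_amp_5 = 2 * C[5].real / n

# The strongest component in the first half of the spectrum is the k = 3 sine.
peak = np.argmax(power[: n // 2])
print(sine_amp_3, cosine_amp_5, peak)
```

The factor of $2/n$ reflects NumPy's unnormalized forward transform and the fact that a real signal splits its power between positive and negative frequencies.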

In [None]:
import numpy as np
fft_counts = np.fft.fft(counts - counts.mean())
yrs = (counts.index[-1] - counts.index[0]).days / 365.

In [None]:
plt.plot(1.0*np.arange(len(fft_counts)) / yrs, np.abs(fft_counts)**2)
plt.xlabel('Freq (1/yrs)')

While the Fourier transform returns information on frequencies up to the sampling frequency (1/day, or 365/year), only the results up to half of that are valid.  This is due to the problem of **aliasing**.  In a sampled signal, you cannot distinguish a signal with a frequency above half of the sampling rate, known as the **Nyquist frequency**, from a signal with a frequency below that.

In [None]:
t = np.linspace(0, 10, 1000)
ts = np.arange(0, 11)
f = 0.65
plt.plot(t, np.sin(2*np.pi * f * t), t, -np.sin(2*np.pi * (1 - f) * t))
ml, sl, bl = plt.stem(ts, np.sin(2*np.pi * f * ts))
plt.setp(ml, 'markerfacecolor', 'r')
plt.setp(sl, 'color', 'r')
plt.setp(bl, visible=False)
plt.xticks(ts)
plt.ylim(-2,2)
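
The plot can be backed up numerically.  The sketch below uses the same frequency of 0.65 cycles per sample and compares the two curves at the integer sample points only.  The variable names are chosen for illustration.

```python
import numpy as np

f = 0.65                      # above the Nyquist frequency of 0.5 cycles per sample
samples = np.arange(0, 11)    # one sample per unit time

high = np.sin(2 * np.pi * f * samples)
low = -np.sin(2 * np.pi * (1 - f) * samples)   # alias at 0.35 cycles per sample

max_gap = np.max(np.abs(high - low))
print(max_gap)  # tiny, on the order of floating-point rounding error
```

At integer $t$, $\sin(2\pi(1-f)t) = \sin(2\pi t - 2\pi f t) = -\sin(2\pi f t)$, so the two signals agree at every sample even though they differ in between.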

Zooming in on the low-frequency components, we can clearly see the yearly cycle dominating all other components.

In [None]:
plt.plot(1.0*np.arange(len(fft_counts)) / yrs, np.abs(fft_counts)**2)
plt.axis([0,3, 0, 5e13])
plt.xlabel('Freq (1/yrs)')

Changing focus some more, we can clearly see the weekly cycle, at 52/year, as well as additional peaks at 12/year and 8/year.

In [None]:
plt.plot(1.0*np.arange(len(fft_counts)) / yrs, np.abs(fft_counts)**2)
plt.axis([0,100, 0, 1e12])
plt.xlabel('Freq (1/yrs)')
plt.axvline(365/7., color='k', ls=':')
plt.axvline(12, color='k', ls=':')
plt.axvline(8, color='k', ls=':')

## Detrending

We wish to remove the seasonality we see in the data, so that we can better distinguish anomalous points.  To this end, we will build a model of the ridership.  This is an example of **supervised machine learning**.

In supervised machine learning, we have an $n \times p$ **feature matrix** $X_{ji}$.  Each column corresponds to one of the $p$ features, and each row to a particular observation, out of $n$ total.  We also have a **label vector** $y_j$ of length $n$.  The goal is to develop a model $f$ that predicts the labels from the corresponding feature row; that is,

$$ f(X_{j\cdot}) \approx y_j \ . $$

![Feature matrix](images/matrix.svg)

When the labels are numerical, the problem is known as **regression**.  When the labels represent different categories, the problem is one of **classification**.  Our problem, estimating counts, is a regression problem.

Both regression and classification have a number of different metrics to gauge the effectiveness of a model.  The most commonly used metric for regression is the **mean-squared error** (MSE):

$$ \mbox{MSE} = \frac 1 n \sum_{j=1}^n \left[ f(X_{j\cdot}) - y_j \right]^2 \ . $$

The MSE has units of $y^2$, which can make its size difficult to judge.  We often use the **root mean-squared error** (RMSE) instead.
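
A minimal sketch of both metrics in NumPy, using made-up labels and predictions (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical labels and predictions, e.g. rides per day.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

mse = np.mean((y_pred - y_true) ** 2)   # units: rides squared
rmse = np.sqrt(mse)                     # units: rides
print(mse, rmse)
```

The RMSE is in the same units as the labels, so it can be read as a typical size of the prediction error.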

### Linear Regression

A basic, yet quite powerful, machine learning model is **linear regression**.  It is a linear model, meaning that

$$ f(X_{j\cdot}) = \sum_i X_{ji} \beta_i = (X \cdot \beta )_j \ , $$

for some $p$-vector $\beta$.  Linear regression finds this vector by minimizing the MSE

$$ \frac 1 n \left|X\cdot\beta - y\right|^2 \ . $$

There is a closed-form solution:

$$ \hat\beta = (X^T X)^{-1} X^T y \ . $$
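
The closed form can be checked on synthetic data.  The sketch below assumes a small random feature matrix and known coefficients, both invented for illustration, and compares the normal-equations solution with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 50 observations, p = 3 features, known coefficients.
X = rng.normal(size=(50, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=50)

# Normal equations: solve (X^T X) beta = X^T y rather than forming the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, generally the more numerically stable route.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_normal, beta_lstsq)
```

Both routes give the same coefficients here.  In practice, explicitly inverting $X^T X$ is usually avoided because it can be poorly conditioned.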

Our first model will be to consider only the yearly cycles:

$$ f(t) = A \sin\frac{2\pi t}{365} + B\cos\frac{2\pi t}{365} + f_0 \ . $$

At first glance, it may appear that linear regression is not suitable, since the model is not linear in $t$.  However, if we consider the $n\times 2$ feature matrix

$$ X = \left[ \sin\frac{2\pi t}{365}\ \ \cos\frac{2\pi t}{365} \right] \ , $$

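we do actually have a model that is linear in the features.  The constant $f_0$ is not a column of $X$; it is fit as the regression's intercept term.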
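
As a sketch of this idea, the block below fits a noise-free yearly sinusoid by ordinary least squares.  The amplitudes and offset are invented, and a column of ones is appended here so that the constant $f_0$ is fit alongside the sine and cosine coefficients.

```python
import numpy as np

# Hypothetical noise-free yearly signal sampled daily for two years.
t = np.arange(730)
A_true, B_true, f0_true = 3.0, -1.0, 10.0
phase = 2 * np.pi * t / 365
y = A_true * np.sin(phase) + B_true * np.cos(phase) + f0_true

# Feature matrix: sine and cosine columns, plus a column of ones for f_0.
X = np.column_stack([np.sin(phase), np.cos(phase), np.ones_like(phase)])
(A_fit, B_fit, f0_fit), *_ = np.linalg.lstsq(X, y, rcond=None)

# Amplitude of the fitted sinusoid.
amplitude = np.hypot(A_fit, B_fit)
print(A_fit, B_fit, f0_fit, amplitude)
```

Fitting $A$ and $B$ separately avoids having to fit a phase, which would make the problem nonlinear.  The amplitude is recovered afterwards as $\sqrt{A^2 + B^2}$.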

### Scikit Learn

This transformation is a particularly simple example of **feature engineering**.  Feature engineering is where much of the work of machine learning is done, so libraries will provide mechanisms to assist this process.

We will be using the *Scikit Learn* module for machine learning.  It provides many tools, but the core is two types of classes: **transformers** and **estimators**.  Transformers take in a feature matrix and return a transformed version:
``` python
class Transformer(base.BaseEstimator, base.TransformerMixin):
    
  def fit(self, X, y=None):
    # Learn about the data
    return self
  
  def transform(self, X):
    return ... # The transformed features
```


In [None]:
from sklearn import base
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
class FourierComponents(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, period):
        self.period = period
    
    def fit(self, X, y=None):
        self.X0 = X[0]
        return self
    
    def transform(self, X):
        dt = (X - self.X0).days * 2 * np.pi / self.period
        return np.c_[np.sin(dt), np.cos(dt)]

In [None]:
fc = FourierComponents(365)
fc.fit(counts.index)
plt.plot(fc.transform(counts.index))

Estimators have a `predict()` method that returns the model's prediction for the label of a particular row.
``` python
class Estimator(base.BaseEstimator, base.RegressorMixin):
  
  def fit(self, X, y):
    # Learn about the data
    return self
    
  def predict(self, X):
    return ... # The predicted labels
```
We'll use a linear regression estimator provided by Scikit Learn.

In [None]:
X_trans = fc.transform(counts.index)
lr = LinearRegression()
lr.fit(X_trans, counts)
plt.plot(counts.index, counts, counts.index, lr.predict(X_trans))

Handling each transformer manually quickly gets tedious and error-prone.  Scikit Learn comes to the rescue with **pipelines**.  Pipelines take a series of transformers and (optionally) an estimator.  A pipeline acts as an estimator itself.  When it is fit, it fits the first transformer, transforms with the first transformer, uses that value to fit the second transformer, transforms with the second transformer, *etc.*, until finally fitting the estimator.  When predict is called on the pipeline, it sends the feature matrix through each of the transformers, before finally calling predict on the estimator.

In [None]:
pipe = Pipeline([('fourier', FourierComponents(365)),
                 ('lr', LinearRegression())])
pipe.fit(counts.index, counts)
plt.plot(counts.index, counts, counts.index, pipe.predict(counts.index))
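
To make the fit-then-transform chaining concrete, the sketch below compares a pipeline against the same steps done by hand.  It uses Scikit Learn's `StandardScaler` as a stand-in transformer and random data, both of which are assumptions for illustration and not part of the ridership model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(40, 2))   # hypothetical features
y = X @ np.array([1.0, -3.0]) + 2.0                # hypothetical labels

# Manual chaining: fit the transformer, transform, then fit the estimator.
scaler = StandardScaler().fit(X)
lr_manual = LinearRegression().fit(scaler.transform(X), y)
pred_manual = lr_manual.predict(scaler.transform(X))

# The same steps expressed as a pipeline.
pipe_demo = Pipeline([('scale', StandardScaler()),
                      ('lr', LinearRegression())])
pred_pipe = pipe_demo.fit(X, y).predict(X)
print(np.max(np.abs(pred_manual - pred_pipe)))
```

The two sets of predictions agree, but the pipeline version cannot accidentally skip or reorder a transformation at prediction time.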

How well are we doing?  A good baseline is the mean model, which has an MSE equal to the variance of the data.  We'll take the square root to look at RMSE.

In [None]:
# ddof=0 gives the population variance, which is the MSE of the mean model
np.sqrt(counts.var(ddof=0))
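
The claim that the mean model's MSE equals the variance can be checked on a handful of made-up numbers:

```python
import numpy as np

y = np.array([3.0, 5.0, 8.0, 12.0])       # hypothetical daily counts
baseline = np.full_like(y, y.mean())      # the mean model predicts y.mean() everywhere

mse_baseline = np.mean((y - baseline) ** 2)
# NumPy's var() defaults to ddof=0 (the population variance), which matches exactly.
# pandas' Series.var() defaults to ddof=1, so it is larger by a factor of n / (n - 1).
print(mse_baseline, np.var(y), np.var(y, ddof=1))
```

For a series with hundreds of days, the difference between the two variance conventions is negligible, but it is visible on a sample this small.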

This model is definitely an improvement.

In [None]:
np.sqrt(metrics.mean_squared_error(counts, pipe.predict(counts.index)))

### Categorical Features

We now consider the weekly cycle we saw.  If we group the results by day of the week, we can get a better feel for the cycle.

In [None]:
counts.index[-1], counts.index[-1].dayofweek

In [None]:
day_df = pd.DataFrame(
    {'day': counts.index.dayofweek, 'count': counts.values}
)
day_df.head()

In [None]:
day_df.groupby('day').mean()

In [None]:
day_df.groupby('day').mean().plot(kind='bar')
plt.xticks(range(7), ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

The weekly cycle is not particularly sinusoidal.  Instead of treating the day of the week as a continuous variable, we will treat it as a **categorical feature**.  Such features denote membership in a class, without any particular ordering of those classes.  Therefore, we do not encode them in a single feature, but we create a new feature for each category.  Each row gets a 1 in the column corresponding to its category and a 0 in all others.  This is known as **one-hot encoding** or **dummy variables**.

In [None]:
class DayofWeek(base.BaseEstimator, base.TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def day_vector(self, day):
        v = np.zeros(7)
        v[day] = 1
        return v
    
    def transform(self, X):
        # np.stack needs a sequence (not a generator) in current NumPy
        return np.stack([self.day_vector(d) for d in X.dayofweek])

In [None]:
DayofWeek().transform(counts.index)[:10]

When used with linear regression, one-hot encoding produces per-category means.
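
This can be seen on a toy example.  The categories and labels below are invented, and the fit uses plain least squares on the one-hot columns with no separate intercept.

```python
import numpy as np

# Hypothetical data: a category (0, 1, or 2) and a numeric label for each row.
cats = np.array([0, 1, 2, 0, 1, 2, 0])
y = np.array([10.0, 20.0, 30.0, 14.0, 24.0, 36.0, 12.0])

# One-hot encode by indexing rows of the identity matrix.
onehot = np.eye(3)[cats]

# Least squares on the one-hot columns alone, with no separate intercept.
coef, *_ = np.linalg.lstsq(onehot, y, rcond=None)

group_means = np.array([y[cats == k].mean() for k in range(3)])
print(coef, group_means)
```

Each coefficient equals the mean label of its category.  When an intercept is also fit, the individual coefficients are no longer unique, but the predictions are still the per-category means.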

We want to send both the Fourier components and the one-hot encoded days to a linear regressor.  While we could write a transformer that does both, it's better to use a **feature union**.  This does in parallel what a pipeline does in series.

In [None]:
union = FeatureUnion([('fourier', FourierComponents(365)),
                      ('dayofweek', DayofWeek())])
pipe = Pipeline([('union', union),
                 ('lr', LinearRegression())])
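
To illustrate what "in parallel" means, the sketch below builds a feature union from two stand-in transformers and compares its output with the individual outputs stacked side by side.  The `FunctionTransformer` wrappers and the input values are assumptions for illustration only.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.arange(1.0, 7.0).reshape(-1, 1)   # hypothetical single-column features

# Two stand-in transformers; FunctionTransformer wraps a plain function.
square = FunctionTransformer(np.square)
log = FunctionTransformer(np.log)

union_demo = FeatureUnion([('square', square), ('log', log)])
combined = union_demo.fit_transform(X)

# The union output is the two transformed matrices placed side by side.
expected = np.hstack([np.square(X), np.log(X)])
print(combined.shape)
```

Every transformer in the union receives the same input, and the resulting columns are concatenated in the order the transformers are listed.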

In [None]:
pipe.fit(counts.index, counts)
np.sqrt(metrics.mean_squared_error(counts, pipe.predict(counts.index)))

To understand what a model is doing correctly, and what it's missing, it's useful to plot the **residual**, the difference between the actual and predicted values.

In [None]:
plt.plot(counts - pipe.predict(counts.index))
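
One simple way to turn residuals into anomaly flags is to standardize them and apply a cutoff.  The sketch below is only an illustration under stated assumptions: the data are synthetic, one point is corrupted on purpose, and the 4-sigma threshold is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical actual and predicted values; one point is deliberately corrupted.
predicted = np.full(200, 1000.0)
actual = predicted + rng.normal(scale=20.0, size=200)
actual[57] -= 400.0

residual = actual - predicted

# Standardize the residuals and flag those beyond an assumed cutoff of 4 sigma.
z = (residual - residual.mean()) / residual.std()
flagged = np.flatnonzero(np.abs(z) > 4)
print(flagged)
```

The better the model accounts for seasonality and trend, the smaller the ordinary residuals become and the more clearly a point like this stands out.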

Ridership has been growing with time.  The growth doesn't appear to be linear, but quadratic might be a good fit.  It's simple to add two more features representing $t$ and $t^2$ to attempt to fit this background growth.

In [None]:
class QuadBackground(base.BaseEstimator, base.TransformerMixin):
    
    def fit(self, X, y=None):
        self.X0 = X[0]
        return self
    
    def transform(self, X):
        days = (X - self.X0).days
        return np.c_[days, days**2]

In [None]:
union = FeatureUnion([('date', QuadBackground()),
                      ('fourier', FourierComponents(365)),
                      ('dayofweek', DayofWeek())])
pipe = Pipeline([('union', union),
                 ('lr', LinearRegression())])

In [None]:
pipe.fit(counts.index, counts.values)

In [None]:
plt.plot(counts - pipe.predict(counts.index))

In [None]:
np.sqrt(metrics.mean_squared_error(counts, pipe.predict(counts.index)))

## Exercises

1. Account for the monthly seasonality.  Examine how ridership varies over the month.  Develop a model to account for this.  How much does this improve the RMSE?

2. It seems reasonable to assume that weather affects the usage of the CitiBike system.  The `weatherdata/nycp.csv` file contains daily National Weather Service records for Central Park.  Add features from these records to your model.  Does this improve the RMSE of the model?  How much does ridership increase for every degree Fahrenheit?  (Hint: The coefficients of the linear model are stored in the `.coef_` attribute of a `LinearRegression` object.)