In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller

<font color='blue'>A commonly untested assumption in time series analysis is the stationarity of the data. 

<font color='blue'>Data are stationary when the parameters of the data generating process do not change over time. 

In [None]:
def generate_datapoint(params):
    mu = params[0]
    sigma = params[1]
    return np.random.normal(mu, sigma)

In [None]:
# Set the parameters and the number of datapoints
params = (0, 1)
T = 100

A = pd.Series(index=range(T), dtype=float)
A.name = 'A'

for t in range(T):
    A[t] = generate_datapoint(params)

plt.plot(A)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(['Series A']);

In [None]:
# Set the number of datapoints
T = 100

B = pd.Series(index=range(T), dtype=float)
B.name = 'B'

for t in range(T):
    # Now the parameters are dependent on time
    # Specifically, the mean of the series changes over time
    params = (t * 0.1, 1)
    B[t] = generate_datapoint(params)

plt.plot(B)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(['Series B']);

<font color='blue'>Many statistical tests, deep down in the fine print of their assumptions, require that the data being tested are stationary. 

Also, <font color='blue'>if you naively use certain statistics on a non-stationary data set, you will get garbage results.

In [None]:
m = np.mean(B)

plt.plot(B)
plt.hlines(m, 0, len(B), linestyles='dashed', colors='r')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(['Series B', 'Mean']);

In [None]:
def check_for_stationarity(X, cutoff=0.01):
    # H_0 in adfuller is that a unit root exists (non-stationary)
    # We need a p-value below the cutoff to reject H_0 and treat the series as stationary
    pvalue = adfuller(X)[1]
    if pvalue < cutoff:
        print('p-value = ' + str(pvalue) + ' The series ' + X.name + ' is likely stationary.')
        return True
    else:
        print('p-value = ' + str(pvalue) + ' The series ' + X.name + ' is likely non-stationary.')
        return False

In [None]:
# Set the number of datapoints
T = 100

C = pd.Series(index=range(T), dtype=float)
C.name = 'C'

for t in range(T):
    # Now the parameters are dependent on time
    # Specifically, the mean of the series changes over time
    params = (np.sin(t), 1)
    C[t] = generate_datapoint(params)

plt.plot(C)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(['Series C']);

<font color='blue'>A cyclic movement of the mean will be very difficult to tell apart from random noise. 

<font color='blue'>In practice, with noisy data and a limited sample size, it can be hard to determine whether a series is stationary and whether any drift is random noise or part of a trend.

In each individual case the test may or may not pick up subtle effects like this.

In [None]:
check_for_stationarity(C);

An important concept in time series analysis is <font color='blue'>moving average representation.

<font color='blue'>This representation expresses any time series $Y_t$ as 
$Y_t = \sum_{j=0}^\infty b_j \epsilon_{t-j} + \eta_t$
* $\epsilon$ is <font color='blue'>the 'innovation' series
* $b_j$ are the <font color='blue'>moving average weights of the innovation series
* $\eta$ is <font color='blue'>a deterministic series

$\eta$ is deterministic, such as a sine wave, so we could model it perfectly.

<font color='blue'>The innovation process is stochastic and is there to simulate new information occurring over time. </font><br> Specifically, $\epsilon_t = Y_t - \hat Y_t$, where $\hat Y_t$ is the optimal forecast of $Y_t$ using only information from before time $t$.

 In other words, <font color='blue'>the best prediction you can make at time $t-1$ cannot account for the randomness in $\epsilon$.

Each $b_j$ just says how much previous values of $\epsilon$ influence $Y_t$.
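The representation above can be sketched numerically. This is only an illustration under stated assumptions: the weights `b`, the sine-wave `eta`, and the choice to treat innovations before $t = 0$ as zero are all hypothetical and are not taken from any real series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite set of moving average weights b_0, ..., b_3
b = np.array([1.0, 0.6, 0.3, 0.1])

# Innovation series: i.i.d. standard normal draws
T = 200
eps = rng.normal(0, 1, T)

# Deterministic component: a sine wave (an assumed example of eta)
eta = np.sin(np.arange(T) / 10.0)

# Y_t = sum_j b_j * eps_{t-j} + eta_t, treating innovations before t = 0 as zero
Y = np.convolve(eps, b)[:T] + eta

# Check one value against the formula written out by hand
t = 50
by_hand = sum(b[j] * eps[t - j] for j in range(len(b))) + eta[t]
print(np.isclose(Y[t], by_hand))
```

Here `np.convolve` does the weighted sum over lagged innovations in one call, and the hand-written sum at `t = 50` confirms that it matches the formula.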

<font color='blue'>A time series is said to be $I(0)$ if the following condition holds in a moving average representation. 
$\sum_{k=0}^\infty |b_k|^2 < \infty$

This property turns out to be true of all stationary series, but by itself is not enough for stationarity to hold. <font color='blue'>This means that stationarity implies $I(0)$, but $I(0)$ does not imply stationarity. 
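To make the square-summability condition concrete, here is a small sketch comparing two hypothetical weight sequences: geometrically decaying weights $b_k = \phi^k$ with an assumed $\phi = 0.8$, and constant weights $b_k = 1$, which is what a cumulative sum of innovations produces. The values are illustrative only.

```python
import numpy as np

K = 1000
k = np.arange(K)

# Geometrically decaying weights b_k = phi^k (hypothetical choice, phi = 0.8)
phi = 0.8
decaying = phi ** k
partial_decaying = np.cumsum(decaying ** 2)

# Constant weights b_k = 1, as produced by cumulatively summing innovations
constant = np.ones(K)
partial_constant = np.cumsum(constant ** 2)

# The first sum settles near 1 / (1 - phi^2); the second grows linearly with K
print(partial_decaying[-1], 1 / (1 - phi ** 2))
print(partial_constant[-1])
```

The first partial sum converges, so the condition holds. The second keeps growing as more terms are added, so it fails, even though any finite truncation of it is still a finite number.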

For more on orders of integration, please see the following links.
- https://en.wikipedia.org/wiki/Order_of_integration
- https://en.wikipedia.org/wiki/Wold%27s_theorem

<font color='blue'>In practice, testing whether this infinite sum is finite may not be possible. 

It is possible in a mathematical derivation, but <font color='blue'>when we have a finite set of data and can only estimate a finite number of weights, the sum will always be finite. 

Given this difficulty, tests for $I(0)$ rely on stationarity implying the property.<font color='blue'> If we find that a series is stationary, then it must also be $I(0)$.

In [None]:
plt.plot(A)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(['Series A']);

<font color='blue'>If one takes an $I(0)$ series and cumulatively sums it (discrete integration), the new series will be $I(1)$. 

<font color='blue'>The same relation applies in general: to get $I(n)$, take an $I(0)$ series and iteratively take the cumulative sum $n$ times.

In [None]:
A1 = np.cumsum(A)

plt.plot(A1)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(['Series A1']);

In [None]:
A2 = np.cumsum(A1)

plt.plot(A2)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(['Series A2']);

Conversely, <font color='blue'>to find the order of integration of a given series, we perform the inverse of a cumulative sum, which is the $\Delta$ or itemwise difference function.

Specifically, $(1-L) X_t = X_t - X_{t-1} = \Delta X_t$, and applying the difference $d$ times gives $(1-L)^d X_t$. <br> In this case <font color='blue'>$L$ is the lag operator.

Sometimes also written as $B$ for 'backshift'. $L$ maps each element of a time series to the one before it, and $L^k$ maps it to the element $k$ steps before. <br>
So $L X_t = X_{t-1}$, $L^k X_t = X_{t-k}$, and $(1-L) X_t = X_t - X_{t-1}$
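As a quick check that differencing undoes cumulative summation, the sketch below integrates a simulated series once and twice and then differences it back. The variable names and the simulated data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 100)      # a simulated I(0) series

x1 = np.cumsum(x)              # integrate once
x2 = np.cumsum(x1)             # integrate twice

# (1 - L) X_t = X_t - X_{t-1}; np.diff drops the first element, which has no predecessor
d1 = np.diff(x1)
# (1 - L)^2 X_t, i.e. differencing twice
d2 = np.diff(x2, n=2)

print(np.allclose(d1, x[1:]))
print(np.allclose(d2, x[2:]))
```

Each round of differencing loses one observation at the start, which is why the comparisons are made against `x[1:]` and `x[2:]`.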

<font color='blue'>A series $Y_t$ is $I(1)$ if $Y_t - Y_{t-1}$ is $I(0)$. In other words, if you take an $I(0)$ series and cumulatively sum it, you should get an $I(1)$ series.

Once all the math has settled, remember that <font color='blue'>any stationary series is $I(0)$

In [None]:
# get_pricing is provided by the Quantopian research environment
symbol_list = ['MSFT']
prices = get_pricing(symbol_list, fields=['price'],
                     start_date='2014-01-01', end_date='2015-01-01')['price']
prices.columns = [c.symbol for c in prices.columns]
X = prices['MSFT']

In [None]:
check_for_stationarity(X);

In [None]:
plt.plot(X.index, X.values)
plt.ylabel('Price')
plt.legend([X.name]);

In [None]:
X1 = X.diff()[1:]
X1.name = X.name + ' Additive Returns'
check_for_stationarity(X1)
plt.plot(X1.index, X1.values)
plt.ylabel('Additive Returns')
plt.legend([X1.name]);

In [None]:
X1 = X.pct_change()[1:]
X1.name = X.name + ' Multiplicative Returns'
check_for_stationarity(X1)
plt.plot(X1.index, X1.values)
plt.ylabel('Multiplicative Returns')
plt.legend([X1.name]);

As always, <font color='blue'>you should not naively assume that because a time series is stationary in the past it will continue to be stationary in the future. 

<font color='blue'>Tests for consistency of stationarity such as cross validation and out of sample testing are necessary. 

 Returns may also go in and out of stationarity, and may be stationary or non-stationary depending on the timeframe and sampling frequency.
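One rough way to see whether a series behaves consistently over time is to compare summary statistics across consecutive windows. The sketch below does this on simulated returns whose volatility is assumed to double halfway through. It is a crude diagnostic rather than a formal test; a more careful version would rerun a stationarity test on each window.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical returns whose volatility doubles halfway through the sample
returns = np.concatenate([rng.normal(0, 0.01, 250), rng.normal(0, 0.02, 250)])

# Compare simple summary statistics across four consecutive windows
windows = np.split(returns, 4)
for i, w in enumerate(windows):
    print(i, round(w.mean(), 5), round(w.std(), 5))
```

A standard deviation that shifts noticeably between windows suggests the series does not have constant parameters over the whole sample.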

<font color='blue'>The reason returns are usually used for modeling in quantitative finance is that they are typically much closer to stationary than prices. This makes them easier to model and makes returns forecasting more feasible.

<font color='blue'>Forecasting prices is more difficult, as there are many trends induced by their $I(1)$ integration. 

Even using a returns forecasting model to forecast price can be tricky, <font color='blue'>as any error in the returns forecast will be magnified over time.
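A small arithmetic sketch of this compounding effect, using made-up numbers: the true per-period return, the forecast return, and the starting price are all hypothetical, and the forecast error is assumed to be a constant bias.

```python
import numpy as np

p0 = 100.0            # starting price (hypothetical)
true_r = 0.0010       # true per-period return (hypothetical)
forecast_r = 0.0012   # forecast return, off by 0.0002 per period

horizons = np.array([1, 10, 100, 250])
true_price = p0 * (1 + true_r) ** horizons
forecast_price = p0 * (1 + forecast_r) ** horizons

# Absolute price error at each horizon
error = forecast_price - true_price
for h, e in zip(horizons, error):
    print(h, round(e, 4))
```

The error in the price forecast grows with the horizon even though the error in the return forecast stays the same each period.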

<font color='blue'>A linear combination of the time series ($X_1$, $X_2$, $\dots$, $X_k$) is a new time series $Y$ constructed as follows for any set of real numbers $b_1 \dots b_k$
$$Y = b_1X_1 + b_2X_2 + \dots + b_kX_k$$


<font color='blue'>For some set of time series ($X_1$, $X_2$, $\dots$, $X_k$), if all series are $I(1)$, and some linear combination of them is $I(0)$, we say the set of time series is cointegrated.

<font color='blue'>The intuition here is that for some linear combination of the series, the result lacks much auto-covariance and is mostly noise.<br>This is useful for cases such as pairs trading, in which we find two assets whose prices are cointegrated.</font> Since the linear combination of their prices $b_1A_1 + b_2A_2$ is noise, we can bet on the relationship $b_1A_1 + b_2A_2$ mean reverting and place trades accordingly. 

In [None]:
# Length of series
N = 100

# Generate a stationary random X1
X1 = np.random.normal(0, 1, N)
# Integrate it to make it I(1)
X1 = np.cumsum(X1)
X1 = pd.Series(X1)
X1.name = 'X1'

# Make an X2 that is X1 plus some noise
X2 = X1 + np.random.normal(0, 1, N)
X2.name = 'X2'

In [None]:
plt.plot(X1)
plt.plot(X2)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend([X1.name, X2.name]);

In [None]:
Z = X2.diff()[1:]
Z.name = 'Z'

check_for_stationarity(Z);

In [None]:
Z = X2 - X1
Z.name = 'Z'

plt.plot(Z)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend(['Z']);

check_for_stationarity(Z);

<font color='blue'>There are several ways to test for cointegration. This [Wikipedia article](https://en.wikipedia.org/wiki/Cointegration) describes some. 

In general we're just trying to <font color='blue'>solve for the coefficients $b_1, \dots, b_k$ that will produce an $I(0)$ linear combination. If our best guess for these coefficients does not pass a stationarity check, then we reject the hypothesis that the set is cointegrated. 

This leads to a <font color='blue'>risk of Type II errors (false negatives), since we do not exhaustively test for stationarity on all coefficient combinations. However, Type II errors are generally acceptable here,</font> as they are safe and do not lead us to make any wrong forecasts.

<font color='blue'>In practice a common way to do this for pairs of time series is to use linear regression to estimate $\beta$ in the following model.
$X_2 = \alpha + \beta X_1 + \epsilon$

<font color='blue'>The idea is that if the two are cointegrated we can remove $X_2$'s dependency on $X_1$, leaving behind stationary noise. <br>The combination $X_2 - \beta X_1 = \alpha + \epsilon$ should be stationary.
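Here is a minimal sketch of that regression step on synthetic data, where the true coefficients are known. The names `x1_sim` and `x2_sim` and the values of `true_alpha` and `true_beta` are assumptions made for illustration, and `np.polyfit` stands in for a full OLS routine.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# x1_sim is a random walk (I(1)); x2_sim depends on it linearly plus stationary noise
true_alpha, true_beta = 2.0, 1.5
x1_sim = np.cumsum(rng.normal(0, 1, N))
x2_sim = true_alpha + true_beta * x1_sim + rng.normal(0, 1, N)

# Least squares fit of x2_sim = alpha + beta * x1_sim + eps
beta_hat, alpha_hat = np.polyfit(x1_sim, x2_sim, 1)

# The spread should be roughly alpha plus noise, with the random walk removed
spread = x2_sim - beta_hat * x1_sim
print(beta_hat, alpha_hat, spread.std())
```

If the estimate of $\beta$ is close to the true value, the spread varies far less than either input series, which is the behavior a stationarity check on the spread is looking for.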

In [None]:
# get_pricing is provided by the Quantopian research environment
symbol_list = ['ABGB', 'FSLR']
prices = get_pricing(symbol_list, fields=['price'],
                     start_date='2014-01-01', end_date='2015-01-01')['price']
prices.columns = [c.symbol for c in prices.columns]
X1 = prices[symbol_list[0]]
X2 = prices[symbol_list[1]]

In [None]:
plt.plot(X1.index, X1.values)
plt.plot(X2.index, X2.values)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend([X1.name, X2.name]);

In [None]:
X1 = sm.add_constant(X1)
results = sm.OLS(X2, X1).fit()

# Get rid of the constant column
X1 = X1[symbol_list[0]]

results.params

In [None]:
b = results.params[symbol_list[0]]
Z = X2 - b * X1
Z.name = 'Z'

plt.plot(Z.index, Z.values)
plt.xlabel('Time')
plt.ylabel('Series Value')
plt.legend([Z.name]);

check_for_stationarity(Z);

Remember as with anything else, you should not assume that because some set of assets have passed a cointegration test historically, they will continue to remain cointegrated. <font color='blue'>You need to verify that consistent behavior occurs, and use various model validation techniques as you would with any model.

One of the most important things done in finance is to make many independent bets. Here a quant would find many pairs of assets they hypothesize are cointegrated, and evenly distribute their dollars between them in bets. This only requires more than half of the asset pairs to remain cointegrated for the strategy to work. 

Luckily there are some pre-built tests for cointegration. Here's one. Read up on the [documentation](http://statsmodels.sourceforge.net/devel/_modules/statsmodels/tsa/stattools.html) on your own time.

In [None]:
from statsmodels.tsa.stattools import coint

coint(X1, X2)