# Bitcoin ARIMA analysis
Abstract: This analysis attempts to use ARIMA to analyze Bitcoin prices from the Bitstamp exchange. The objective is to build a model that can forecast future prices and identify trends, seasonality, and other notable properties. In theory, an analysis like this could be used in trading. Bitstamp's data was made stationary by taking the log and then subtracting a rolling average, achieving a p-value of 7.659524e-26 on the Dickey-Fuller test. Guided by the ACF and PACF, an ARIMA model was built with (p,d,q) of (2,1,2) and an RSS of 0.235. Overall, the model was not a good predictor of Bitcoin's price.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import datetime
from datetime import date
import random

import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

import gc

import warnings
warnings.filterwarnings('ignore')

# Load the data

In [None]:
def load_data(filename):
    df = pd.read_csv('../input/{0}'.format(filename))
    df['Seconds'] = df.Timestamp.values.astype(int) # keep the raw Unix seconds in their own column
    df.Timestamp = pd.to_datetime(df.Timestamp, unit='s') # convert seconds to a true timestamp
    df = df[df.Open.notnull()] # remove rows w/o data to lower memory requirements

    df = df.reset_index().drop('index', axis=1).reset_index()
    df['counter'] = df.index
    df = df.drop('index', axis=1)
    df = df.set_index('Seconds')
    df_original = df.copy()
    
    # downsample to days
    df = df.reset_index().set_index('Timestamp').resample('D').mean()
    df = pd.DataFrame(df)
    return (df,df_original)

Summary: Bitstamp's data is the most complete, but it has three missing rows. BTC-e has the most volume and price stability during the identified periods. Therefore, we use Bitstamp as our base and backfill it with BTC-e.

The exchanges provide data at a sub-daily resolution (the files used here are 1-minute bars). A candlestick graph of this data shows high volatility and large swings within short periods of time. The project's goal is to create a time series on a per-day basis, so the data must be smoothed, both to improve the model's predictive ability and to reduce the data size.
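
As a rough illustration of the downsampling step, here is a minimal sketch on synthetic data. The two-day minute series and its values are made up for the example; only the `resample('D').mean()` call mirrors what `load_data` does.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute prices covering exactly two days (hypothetical data).
idx = pd.date_range('2012-01-01', periods=2 * 24 * 60, freq='min')
prices = pd.Series(np.where(idx.day == 1, 5.0, 7.0), index=idx, name='Weighted_Price')

# Downsample to one row per calendar day by averaging all minutes in that day.
daily = prices.resample('D').mean()
print(daily)
```

Averaging is only one possible choice; taking the last price of each day would preserve daily closes instead of smoothing intraday swings.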

In [None]:
!ls ../input

In [None]:
# df1,df1_original = load_data('../input/btceUSD_1-min_data_2012-01-01_to_2017-05-31.csv')
df2,_ = load_data('coinbaseUSD_1-min_data_2014-12-01_to_2018-01-08.csv')
# df3,_ = load_data('../input/krakenUSD_1-min_data_2014-01-07_to_2017-05-31.csv')
df4,_ = load_data('bitstampUSD_1-min_data_2012-01-01_to_2018-01-08.csv')
# print('entries missing in df1', sum(df1.Weighted_Price.isnull()))
print('entries missing in df2', sum(df2.Weighted_Price.isnull()))
# print('entries missing in df3', sum(df3.Weighted_Price.isnull()))
print('entries missing in df4', sum(df4.Weighted_Price.isnull()))


In [None]:
len(df4), len(df2)

In [None]:
df4.head()

In [None]:
df2.head()

In [None]:
# extract the calendar date from each timestamp
_ = df4.reset_index().Timestamp.map(lambda y: y.date())
_ = np.asarray(_, dtype=date)
df4['Date'] = _
df4.head()

df4 has entries for all dates, which makes it the best data source. Next, determine which of the other DataFrames should be used to fill in the missing days. From the data below we can see that df1 is the best choice due to its volume. Finally, confirm that there are no missing entries in df4.

In [None]:
missing_entries_timestamp = df4[df4.Weighted_Price.isnull()].index
missing_entries_timestamp
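
Once the missing dates are known, one possible way to backfill them from a second exchange is `Series.combine_first`. The sketch below uses made-up daily prices; `base` and `other` are hypothetical stand-ins for the Bitstamp series and whichever exchange is chosen as the filler.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2012-01-01', periods=5, freq='D')
# Hypothetical daily prices: the base exchange is missing two days.
base = pd.Series([5.0, np.nan, 5.4, np.nan, 5.9], index=days)
other = pd.Series([5.1, 5.2, 5.5, 5.6, 6.0], index=days)

# combine_first keeps base values where present and takes `other` only for gaps.
filled = base.combine_first(other)
print(filled)
```

Because the two series are aligned on their date index, days that exist only in the filler exchange would also be added to the result.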

# EDA
Summary: The exchange's data ranges from 2011-12-31 to 2017-10-20. Although Bitcoin's early period looks calm at first glance, zooming in shows that this is not the case; the first 500 days are just as volatile. A plot of these dates shows the price rising exponentially from around 2016. Because the early days of Bitcoin are more suitable for time series analysis, we will analyze the first 380 days.

Time Series analysis typically requires data to be stationary. Just by looking at the graph we notice that it is not stationary. Per the Dickey-Fuller test we cannot reject the null hypothesis with the given p-value of 0.925053.

Making the data stationary was accomplished by taking the log and then subtracting the rolling average, producing a 'p-value' of 7.659524e-26 and a 'Test Statistic' of -10.382883, which is far below its 'Critical Value (1%)' of -3.448052.


In [None]:
sns.boxplot(x="Date", y="Weighted_Price", data=df4, palette="PRGn")
sns.despine(offset=10, trim=True)
plt.show()

In [None]:
sns.boxplot(x="Date", y="Weighted_Price", data=df4[:500], palette="PRGn")
sns.despine(offset=10, trim=True)
plt.show()

### what are the date ranges we are dealing with?

In [None]:
df = df4.Weighted_Price
df = pd.DataFrame(df)
df.index.min(), df.index.max()

### Plot out what we got so far...

It's quite clear that the series is not stationary. From the beginning of 2012 to the end of 2013 it's fairly linear, and again from 2015 to 2016. Thereafter the growth appears quadratic.

In [None]:
plt.figure(figsize=(15,6))
plt.plot(df)
plt.show()

Hence, let's first limit our TS to what we know will work: the first 380 days.

In [None]:
df = df[:380] # take the first 380 days
plt.figure(figsize=(15,6))
plt.plot(df)
plt.show()

# Make it stationary

Our time series is not stationary; to be stationary, its mean and variance must remain constant over time. The Dickey-Fuller test is the formal way of determining whether a series is stationary, so let's run it now to get a baseline for future comparisons.

In [None]:
# source: https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/ 
def test_stationarity(timeseries):
    #Determine rolling statistics
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    #Plot rolling statistics:
    plt.figure(figsize=(15,6))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    Dickey_Fuller_test(timeseries)
    
def Dickey_Fuller_test(timeseries):
    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
test_stationarity(df.Weighted_Price)

### The p-value is too high to reject the null hypothesis.

Based on the initial curve of the graph, the data needs to be transformed to get it stationary. First try transforming the data with log to remove the trend.

### take the log
Summary: Taking the log will flatten the curve to a near linear line.
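
A quick numerical check of why the log helps, using a made-up exponential curve rather than the real prices: if a series grows as a * exp(b * t), its log is exactly a + b-line in t, so its second differences vanish.

```python
import numpy as np

t = np.arange(100)
price = 4.0 * np.exp(0.03 * t)   # hypothetical exponential growth
log_price = np.log(price)

# Second differences vanish for a straight line.
curvature_raw = np.abs(np.diff(price, n=2)).max()
curvature_log = np.abs(np.diff(log_price, n=2)).max()
print(curvature_raw, curvature_log)
```

Real prices are only approximately exponential, so the log series will be closer to linear rather than perfectly straight.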

In [None]:
ts_df_log = np.log(df)
#test_stationarity(ts_df_log.Weighted_Price)
plt.figure(figsize=(15,6))
plt.plot(df, color='blue', label='original')
plt.plot(ts_df_log, color='red', label='log')
plt.title('original (blue) vs log (red)')
plt.legend(loc='best')
plt.show()
Dickey_Fuller_test(ts_df_log.Weighted_Price)

### Rolling average
Summary: Linear regression and a rolling average can both be used to remove the trend; here we can see a positive upward trend. Removing the trend should help.

In [None]:
window = 7
Rolling_average = ts_df_log.rolling(window = window, center= False).mean()
ts_df_log_rolling = Rolling_average.dropna()
plt.figure(figsize=(15,6))
plt.plot(ts_df_log, label = 'Log Transformed')
plt.plot(ts_df_log_rolling, color = 'red', label = 'Rolling Average')
plt.legend(loc = 'best')
plt.show()

Notice that the rolling average (RA) lags behind the log series. Let's fix that by shifting it, to improve our representation of the data.
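
The lag comes from the trailing window: each rolling value averages the current day and the six before it. The sketch below, on a made-up linear series, suggests that for a pure linear trend the offset is (window - 1) / 2 days, i.e. 3 for a 7-day window, and that shifting by that amount matches pandas' `center=True` option. The notebook itself shifts by 2 days, chosen by eye.

```python
import numpy as np
import pandas as pd

window = 7
s = pd.Series(np.arange(30, dtype=float))   # hypothetical linear trend

trailing = s.rolling(window=window, center=False).mean()
lag = (window - 1) // 2                       # 3 for a 7-day window

# Shifting the trailing mean back by `lag` reproduces the centred mean.
aligned = trailing.shift(-lag)
centred = s.rolling(window=window, center=True).mean()
print((aligned - s).dropna().abs().max())
```

Note that a centred or back-shifted average uses observations from the following days, which is fine for describing past data but means the smoothed value for a given day is not known on that day.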

In [None]:
window = 7
shift_by_days = -2
Rolling_average = ts_df_log.rolling(window = window, center= False).mean()
ts_df_log_rolling_temp = Rolling_average.shift(shift_by_days).dropna()
plt.figure(figsize=(15,6))
plt.plot(ts_df_log, label = 'Log Transformed')
plt.plot(ts_df_log_rolling_temp, color = 'red', label = 'Rolling Average')
plt.legend(loc = 'best')
plt.show()

MUCH BETTER! Notice how the peaks in both are consistent. Let's do a diff and figure out our new p-value.

In [None]:
ts_df_log_rolling = (ts_df_log - ts_df_log_rolling_temp).dropna()
plt.figure(figsize=(15,6))
plt.plot(ts_df_log, label = 'Log Transformed')
plt.plot(ts_df_log_rolling, color = 'red', label = 'Log and Rolling Average Transformed')
plt.legend(loc = 'best')
plt.show()
Dickey_Fuller_test(ts_df_log_rolling.Weighted_Price)

BAM!!! 

The p-value of 7.659524e-26 is less than our alpha of 0.05, so per this test the series is stationary! The 'Test Statistic' is significantly below the 'Critical Value (1%)', indicating stationarity. In addition, the red line above appears stationary to the eye.
<br/>
<br/>
<br/>


# Building a model
Summary: Now that we have achieved stationarity, the next step is to build an ARIMA model. Building the model requires three terms, (p,d,q): p = number of AR terms (read from the PACF), d = number of differences, q = number of MA terms (read from the ACF).

Per https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/ our next step is to build ACF and PACF to justify using either AR, MA, ARMA, or ARIMA. "The ARIMA forecasting for a stationary time series is nothing but a linear (like a linear regression) equation." We need to calculate out the p,q, and d.
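
For reference, the "linear equation" in the quote can be written out in its standard textbook form. The symbols are not used elsewhere in this notebook and are introduced here only for illustration: $y_t$ is the series after $d$ differences, $\varepsilon_t$ is the error at time $t$, $c$ is a constant, and $\phi_i$, $\theta_j$ are the AR and MA coefficients.

$$
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
$$

Setting $q = 0$ gives a pure AR model and setting $p = 0$ gives a pure MA model.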

The ACF bars fall within the 95% confidence interval at the 2nd bar, implying that the impact of the first bar (t) reaches through the second (t-1) and onward to the third (t-2). The question is whether this is truly due to correlation or to the remaining 5% probability. I would argue that it is due to correlation, because the remaining bars follow a sine-wave pattern rather than what a random walk would produce. We can also see that the coefficients are positive.

The PACF bars fall within the 95% confidence interval at the 2nd bar, implying that each bar is correlated with the prior bar and not with ancestor bars. Its sine-like wave indicates that it might be an AR(2+) process.
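
To make the reading of the confidence band concrete, here is a minimal sketch that computes a sample ACF by hand and flags the lags outside the approximate 95% band of ±1.96/√n. The AR(1) series, its coefficient, the seed, and the helper `sample_acf` are all assumptions for the example, not part of the notebook.

```python
import numpy as np

def sample_acf(x, nlags):
    # Biased sample autocorrelation: lag-k autocovariance divided by the lag-0 value.
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.RandomState(1)
n, phi = 2000, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()   # hypothetical AR(1) process

rho = sample_acf(x, nlags=20)
band = 1.96 / np.sqrt(n)                   # approximate 95% band under white noise
significant = np.where(np.abs(rho[1:]) > band)[0] + 1
print('band: +/-%.3f' % band)
print('lags outside the band:', significant)
```

With 20 lags, roughly one bar can be expected to cross the band by chance alone, which is why a pattern across bars is more convincing than a single crossing.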

In [None]:
# ACF and PACF plots
lag = 20
lag_pacf = pacf(ts_df_log_rolling, nlags=lag, method='ols')
lag_acf = acf(ts_df_log_rolling, nlags=lag)

In [None]:
#Plot ACF: 
plt.figure(figsize=(15,3))
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_df_log_rolling)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_df_log_rolling)),linestyle='--',color='gray')
plt.title('ACF')
plt.tight_layout()
plt.show()

plt.figure(figsize=(15,3))
plot_acf(ts_df_log_rolling, ax=plt.gca(),lags=lag)
plt.show()

In [None]:
#Plot PACF:
plt.figure(figsize=(15,3))
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_df_log_rolling)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_df_log_rolling)),linestyle='--',color='gray')
plt.title('PACF')
plt.tight_layout()
plt.show()

plt.figure(figsize=(15,3))
plot_pacf(ts_df_log_rolling, ax=plt.gca(), lags=lag)
plt.tight_layout()
plt.show()

Notice that the first entry to break the upper confidence interval is the second dot (zero based).

In [None]:
p=2

In [None]:
q=2

The graphs above tell us our p and q values.

Using the above information, let's build pure AR and MA models, and then an ARIMA model.

In [None]:
d=1

### AR

In [None]:
# AR
model = ARIMA(ts_df_log_rolling, order=(p, d, 0))  
results_AR = model.fit(disp=-1)
plt.figure(figsize=(15,6))
plt.plot(ts_df_log_rolling)
plt.plot(results_AR.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_AR.fittedvalues-ts_df_log_rolling.Weighted_Price).dropna()**2))
plt.show()

In [None]:
results_AR.summary()

### MA

In [None]:
model = ARIMA(ts_df_log_rolling, order=(0, d, q))  
results_MA = model.fit(disp=-1) 
plt.figure(figsize=(15,6))
plt.plot(ts_df_log_rolling)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_MA.fittedvalues-ts_df_log_rolling.Weighted_Price).dropna()**2))
plt.show()

In [None]:
results_MA.summary()

### ARIMA

In [None]:
# ARIMA
model = ARIMA(ts_df_log, order=(p, d, q))  
results_ARIMA = model.fit(disp=-1, trend='nc')
plt.figure(figsize=(15,6))
plt.plot(ts_df_log_rolling, label='ts_df_log_rolling')
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_df_log_rolling.Weighted_Price).dropna()**2))
plt.legend(loc='best')
plt.show()

### plot out the residuals - a good model will have residuals that look like white noise

In [None]:
x = pd.DataFrame(results_ARIMA.fittedvalues)
x.columns = ts_df_log_rolling.columns
x = x - ts_df_log_rolling
# x = x.cumsum()
plt.plot(x, label='residuals')
plt.legend(loc='best')
plt.show()

Conclusion: I don't think this is completely white noise, as a good model's residuals should be; I think there is signal coming through. Future work: why is the predictor so bad? NLP weighting? If this is signal, then there are features in here that have yet to be discovered.
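
One way to go beyond eyeballing the residual plot is a Ljung-Box test, whose null hypothesis is that the residuals are uncorrelated. Below is a hand-rolled sketch on synthetic series; the helper `ljung_box`, the seed, and the MA(1) example are assumptions for illustration, and the notebook's actual residuals are not used.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(resid, nlags):
    # Q = n(n+2) * sum_k rho_k^2 / (n-k); compared against chi-square with nlags dof.
    x = np.asarray(resid, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    rho = np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(1, nlags + 1)])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, nlags + 1)))
    return q, chi2.sf(q, df=nlags)

rng = np.random.RandomState(2)
white = rng.normal(size=1000)
coloured = np.convolve(white, [1.0, 0.8], mode='valid')  # MA(1): leftover structure

print('white noise  Q=%.2f p=%.4f' % ljung_box(white, 10))
print('MA(1) series Q=%.2f p=%.4f' % ljung_box(coloured, 10))
```

A small p-value would support the suspicion that signal remains in the residuals. When the test is applied to residuals of a fitted ARMA model, the degrees of freedom are usually reduced by p + q, which this sketch does not do.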

## Forecast
Summary: Obviously our forecast is not good at all. Bitcoin, like the stock market, is a day-to-day random walk under the assumption of no other factors. With our data, that random walk happens to have an upward trend most of the time. This makes sense: if other traders had a way to predict the future price, they would place a trade and a new equilibrium would set in. That said, the story of the Black-Scholes formula and Long-Term Capital Management is a great example of a system in transition to a new equilibrium.

In [None]:
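predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
# start from the first log price and add the cumulative fitted differences
predictions_ARIMA_log = pd.Series(ts_df_log.Weighted_Price.iloc[0], index=ts_df_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
# undo the log transform to get back to prices
predictions_ARIMA = np.exp(predictions_ARIMA_log)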

plt.figure(figsize=(15,3))
plt.plot(df, label='first 380 days')
plt.plot(predictions_ARIMA, 'r+', label='predicted')
plt.legend(loc='best')
plt.show()
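
Fitted values from a model with d=1 live on the differenced log scale, so mapping them back to prices means taking a cumulative sum, adding the first log price, and exponentiating. The round trip can be sketched on a handful of made-up prices, with true differences standing in for model output:

```python
import numpy as np
import pandas as pd

prices = pd.Series([5.0, 5.5, 5.2, 6.0, 6.6])     # hypothetical daily prices
log_prices = np.log(prices)
log_diffs = log_prices.diff().dropna()            # what a d=1 model works with

# Rebuild the log level: first log price plus the running sum of differences.
rebuilt_log = pd.Series(log_prices.iloc[0], index=log_prices.index)
rebuilt_log = rebuilt_log.add(log_diffs.cumsum(), fill_value=0)
rebuilt = np.exp(rebuilt_log)
print(rebuilt)
```

With real fitted differences the errors accumulate through the cumulative sum, so the rebuilt price path can drift away from the actual one over time.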

In [None]:
start = 360
end = 400
forecast = results_ARIMA.predict(start=start, end=end)
f = (forecast + forecast.shift(-1))
f = f.shift(-3).dropna()
forecast = f

plt.figure(figsize=(15,3))
plt.plot(df[:end].Weighted_Price, label='original data')
plt.show()
plt.plot(forecast, color='red', label='predicted')
plt.legend(loc='best')
plt.show()
plt.plot(df4[start:end].Weighted_Price, label='actual')
plt.legend(loc='best')
plt.show()