# Time Series analysis and prediction using Recurrent Neural Networks

## Introduction

With the advent of IoT devices, a lot of time series data is being generated. Time series analysis of stock prices, cryptocurrency prices, company sales, etc. can increase the profitability of an individual or of a business as a whole. Using past data points for these quantities, together with external data sources, we can try to predict future values. Because a time series is time dependent, the basic assumption of a linear regression model (that observations are independent) does not hold. The motivation for this tutorial is to predict the price of the hottest cryptocurrency, **Bitcoin**, and eventually earn **money**.

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c5/Bitcoin_logo.svg">


### Tutorial content

In this tutorial we will cover the following key points:
1. Analyse the bitcoin price by checking for trends and seasonality
2. Use traditional algorithms to analyse the bitcoin price
3. Predict the bitcoin price from historical reddit sentiment and past bitcoin prices by training a recurrent neural network

We'll be using past bitcoin prices from the API provided by [Coindesk](https://www.coindesk.com/api/). We will use [Archive.org](http://archive.org) to get snapshots of historical reddit topics and subtopics. Further, we will use NLTK's sentiment analyser to get the sentiments.

We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Data Collection](#Data-Collection)
- [Analysis using Traditional algorithms](#Analysis-using-Traditional-algorithms)
- [Time Series Forecasting](#Time-Series-Forecasting)
- [Using a RNN network with LSTM](#Using-a-RNN-network-with-LSTM)
- [Train the LSTM model](#Train-the-LSTM-model)

## Installing the libraries

Before getting started, you'll need to install the various libraries that we will use. You will need Keras with Tensorflow backend. We will be using [Plotly](https://plot.ly/) for interactive visualizations.

`$ pip install --ignore-installed --upgrade tensorflow` <br>
`$ pip install keras`<br>
`$ pip install plotly`<br>

If you want to install everything in a conda virtual environment, follow the steps [here](https://www.tensorflow.org/install/).

In [1]:
from bs4 import BeautifulSoup
from math import sqrt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_model import ARIMA


import numpy as np
import pandas as pd
import requests
import json
import datetime

import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools
py.init_notebook_mode(connected=True)

Using TensorFlow backend.


## Data Collection


Now that we've installed and loaded the libraries, let's get the historical **Bitcoin** price. Here we will use [Coindesk's](https://www.coindesk.com/api/) API with the desired date limits.
- Start date: April 11, 2016
- End date: March 12, 2018

### Bitcoin price extraction

In [2]:
#end and start date
end_year = 2018
end_month = 3
end_day = 12
end_date = datetime.date(end_year, end_month, end_day)
start_date = end_date - datetime.timedelta(days=700)

req = requests.get("https://api.coindesk.com/v1/bpi/historical/close.json?start="+str(start_date)+"&end="+str(end_date))
data = json.loads(req.text)
df = pd.Series(data["bpi"], name='DateValue')
df.index.name = 'Date'
df_price = df.reset_index()
df_price.columns = ['tmpstamp', 'price']
df_price.head()

| | tmpstamp | price |
|---|---|---|
| 0 | 2016-04-11 | 422.99 |
| 1 | 2016-04-12 | 425.99 |
| 2 | 2016-04-13 | 424.401 |
| 3 | 2016-04-14 | 425.106 |
| 4 | 2016-04-15 | 429.98 |


### Historic Reddit title extraction and sentiment generation
Here we first generate a date string and pass it in the URL: <br>
`http://archive.org/wayback/available?url=reddit.com/r/bitcoin&timestamp= + date` <br>
This returns a JSON string with the URL where the archived page resides.

> {"url": "reddit.com/r/bitcoin", "timestamp": "20180311", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20180212144640/https://www.reddit.com/r/Bitcoin/", "timestamp": "20180212144640"}}}
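As a minimal sketch of this lookup step, the snippet below builds the `YYYYMMDD` timestamp with `strftime` and pulls the snapshot URL out of the sample response shown above. The helper name `snapshot_url_from_response` is hypothetical, no network request is made, and the handling of a response with no snapshot is an assumption about how you might want to guard the lookup.

```python
import datetime
import json


def snapshot_url_from_response(text):
    """Return the closest archived snapshot URL, or None if none is listed.

    Hypothetical helper for illustration; not part of the tutorial's code.
    """
    data = json.loads(text)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None


# Zero-padded date string for the timestamp parameter
stamp = datetime.date(2018, 3, 11).strftime("%Y%m%d")

# Sample response copied from the text above
sample = (
    '{"url": "reddit.com/r/bitcoin", "timestamp": "20180311", '
    '"archived_snapshots": {"closest": {"status": "200", "available": true, '
    '"url": "http://web.archive.org/web/20180212144640/https://www.reddit.com/r/Bitcoin/", '
    '"timestamp": "20180212144640"}}}'
)

print(stamp)
print(snapshot_url_from_response(sample))
```

Using `strftime` avoids the manual zero-padding of month and day, and the `None` return lets the caller skip a day instead of failing on a missing key.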

We then scrape the titles using Beautiful Soup. The required titles are in `{"class": "title may-blank "}`. Once the titles are collected, we pass them through NLTK's sentiment analyser. This returns sentiment scores for the titles:

> {'neu': 0.88, 'compound': -0.2197, 'pos': 0.055, 'neg': 0.065}

Of these scores we store `pos` and `neg`, and only `pos` is used later as a model feature.

**Note**: Uncomment `compute()` to run the following code. I have commented it out because this chunk takes 10 minutes to gather all the titles, due to the response time of the API and the number of requests. I have already run the code chunk below to generate the required file `redditSentiments.csv`.


In [3]:
def get_sentiment_score(sia, year, month, day):
    
    month_str = str(month)
    if len(month_str) < 2:
        month_str = "0" + month_str
        
    day_str = str(day)
    if len(day_str) < 2:
        day_str = "0" + day_str
    date = str(year) + month_str + day_str
    
    url = "http://archive.org/wayback/available?url=reddit.com/r/bitcoin&timestamp=" + date
    req_arc = requests.get(url)
    
    if(req_arc.status_code == 200):
        data = json.loads(req_arc.text)
        red_archive_url = data['archived_snapshots']['closest']['url']
    else:
        red_archive_url = None
        print("Error return code: "+str(req_arc.status_code))
        return (None,None)

    req_arc_page = requests.get(red_archive_url)
    titles=[]

    if(req_arc_page.status_code == 200):
        soup = BeautifulSoup(req_arc_page.text, 'html.parser')
        all_a = soup.findAll("a", {"class": "title may-blank "})
        for a in all_a:
            # get_text() also works when a title link contains nested tags
            titles.append(a.get_text())
        titles_str = ". ".join(titles)
        res = sia.polarity_scores(titles_str)
        return (res["pos"],res["neg"])
    
    else:
        print("Error return code: "+str(req_arc_page.status_code))
        return (None,None)

def compute():
    sia = SIA()
    date = datetime.date(end_year, end_month, end_day)
    column_names = ['tmpstamp','pos','neg']
    rows = []
    for i in range(700):
        value = get_sentiment_score(sia, date.year, date.month, date.day)
        # The date is stepped back before the row is stored, so each score
        # is recorded under the previous day's timestamp.
        date = date - datetime.timedelta(days=1)
        rows.append([date, value[0], value[1]])
    # Build the frame in one go; DataFrame.append was removed in pandas 2.0
    df = pd.DataFrame(rows, columns=column_names)
    df.to_csv("redditSentiments.csv", sep=',')

# compute()

### Merge datasets and visualize 

We have two data sets `df_sentiments` and `df_price`. We can combine these into one dataframe to carry on with our analysis.

In [4]:
df_sentiments = pd.read_csv('redditSentiments.csv')
df_sentiments = df_sentiments[['tmpstamp','pos','neg']]

df = pd.merge(df_price, df_sentiments)

#Set date as the index
df['tmpstamp'] = pd.to_datetime(df['tmpstamp'], format='%Y-%m-%d')
df=df.set_index('tmpstamp')

btc_price = go.Scatter(x=df.index, y=df['price'], name= 'Price')

py.iplot([btc_price])

## Analysis using Traditional algorithms

### Stationary check

Traditional time series models assume that the time series is stationary. A series is said to be stationary if statistical properties such as its mean and variance are constant over time. It should also have an autocovariance that does not depend on time. This can be checked by using:
1. **Rolling Statistics test:** Here we’ll take the mean and variance over a rolling window of 10 days
2. **Dickey-Fuller Test:** This is a statistical test for checking stationarity, where the null hypothesis is that the time series is not stationary.<br>
**Note**: We will plot the standard deviation instead of the variance to keep the unit the same as the mean.
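To make the rolling statistics concrete, here is a small sketch on a made-up series (the values and the window of 3 are illustrative assumptions, not the tutorial's data). It shows that the first `window - 1` entries are undefined and that each later entry summarises only the most recent `window` points.

```python
import pandas as pd

# Toy series with a steady upward trend (illustrative values only)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Rolling mean and standard deviation over the last 3 observations
rolmean = toy.rolling(window=3).mean()
rolstd = toy.rolling(window=3).std()

print(pd.DataFrame({"value": toy, "rolling_mean": rolmean, "rolling_std": rolstd}))
```

A rolling mean that keeps climbing, as it does here, is a visual hint that the series is not stationary.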

In [5]:
def stationarity_stats(timeseries):
    #rolling statistics
    rolmean = timeseries.rolling(window=10,center=False).mean()
    rolstd = timeseries.rolling(window=10,center=False).std()
    
    #Plotting stats (use the series' own index so x and y always line up)
    original_plt = go.Scatter(x=timeseries.index, y=timeseries, name= 'Original')
    mean_plt = go.Scatter(x=timeseries.index, y=rolmean, name= 'Rolling Mean')
    rol_std_plt = go.Scatter(x=timeseries.index, y=rolstd, name= 'Rolling Std')
    data = [original_plt, mean_plt,rol_std_plt]
    py.iplot(data)
    
    #Perform Dickey-Fuller test:
    print ('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    df_output = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        df_output['Critical Value (%s)'%key] = value
    print (df_output)
    

In [6]:
stationarity_stats(df['price'])

Results of Dickey-Fuller Test:
Test Statistic                  -1.244490
p-value                          0.654171
#Lags Used                      20.000000
Number of Observations Used    679.000000
Critical Value (5%)             -2.865806
Critical Value (1%)             -3.440017
Critical Value (10%)            -2.569042
dtype: float64


In the above analysis we can see from the graph that the mean and standard deviation are not constant. Also, the Dickey-Fuller test statistic (-1.24) is greater than every critical value and the p-value is 0.65, so we cannot reject the null hypothesis. Hence the series is not stationary.
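The decision rule can be written out explicitly. The sketch below reuses the numbers printed by the test above; the variable names are illustrative.

```python
# Values copied from the Dickey-Fuller output above
test_statistic = -1.244490
p_value = 0.654171
critical_values = {"1%": -3.440017, "5%": -2.865806, "10%": -2.569042}

# The null hypothesis (non-stationary) is rejected at a given level only
# when the test statistic is more negative than that critical value.
rejections = {level: test_statistic < cv for level, cv in critical_values.items()}
reject_at_any_level = any(rejections.values())

print(rejections)
print("Reject non-stationarity at some level:", reject_at_any_level)
```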

### Stationarise Time Series

The first step is to estimate the trend and seasonality and remove them to get a stationary series, so that our models can be used. We can then convert the predictions back to the original scale.<br>
We can **dampen the trend** by taking a log transform, as the log suppresses higher values.


In [7]:
ts_log = np.log(df['price'])
ts_log_plt = go.Scatter(x=ts_log.index, y=ts_log, name= 'Log of Price')
py.iplot([ts_log_plt])

A **moving average** averages the last ‘k’ consecutive values. Since bitcoin prices have no obvious fixed cycle, it is hard to choose ‘k’, and a simple moving average won't suffice. So we use an ***exponentially weighted moving average***, where recent values are given a higher weight. The **halflife** defines the rate of exponential decay.
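As a rough sketch of what the half-life means, the snippet below converts a half-life into the smoothing factor using the relation documented for pandas' `ewm` (`alpha = 1 - exp(-ln(2) / halflife)`) and lists the relative weight given to an observation `k` steps in the past. The 30-step horizon is an arbitrary choice for illustration.

```python
import math

halflife = 10

# Smoothing factor implied by the half-life
alpha = 1 - math.exp(-math.log(2) / halflife)

# Relative (unnormalised) weight of the observation k steps back
weights = [(1 - alpha) ** k for k in range(31)]

print("alpha:", round(alpha, 4))
for k in (0, 10, 20, 30):
    print("lag", k, "weight", round(weights[k], 4))
```

The weight drops to half after 10 steps and to a quarter after 20, which is exactly what a half-life of 10 days expresses.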

In [8]:
expwighted_avg = ts_log.ewm(halflife=10,ignore_na=False,adjust=True,min_periods=0).mean()
expwighted_avg_plt = go.Scatter(x=expwighted_avg.index, y=expwighted_avg, name='expwighted avg')
data = [expwighted_avg_plt, ts_log_plt]
py.iplot(data)


If we subtract a moving average from the log series, we can get rid of the trend. The code below does this with a simple 12-day rolling mean. Note that the first 11 values will not be defined, since there are not enough values to calculate the mean. We will drop these NaNs and perform our test.

In [9]:
moving_avg =ts_log.rolling(window=12,center=False).mean()
ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.dropna(inplace=True)
stationarity_stats(ts_log_moving_avg_diff)

Results of Dickey-Fuller Test:
Test Statistic                -7.374121e+00
p-value                        8.813772e-11
#Lags Used                     3.000000e+00
Number of Observations Used    6.850000e+02
Critical Value (5%)           -2.865769e+00
Critical Value (1%)           -3.439932e+00
Critical Value (10%)          -2.569022e+00
dtype: float64


The time series now has smaller fluctuations in mean and variance. Also, the test statistic is smaller than the 1% critical value, so we can reject the null hypothesis of non-stationarity at the 1% level. In our case this was enough to get a convincing test statistic. In some cases, like sales of items, there is also a seasonality factor. For example, people buy more in December. Other techniques can be used there, like differencing (taking the difference with a particular time lag) or decomposition (modeling both trend and seasonality and removing them from the series).

## Time Series Forecasting


Our time series has significant dependence among its values. This can be modeled using ARIMA (Auto-Regressive Integrated Moving Average), which is a linear equation. The parameters of the model are:

**Number of AR terms (p):** The number of lags of the dependent variable. For example, if p = 10 the predictors for x(t) will be x(t-1) to x(t-10).

**Number of MA terms (q):** The number of lagged forecast errors in the prediction equation. For example, if q is 10, the predictors for x(t) will be e(t-1) to e(t-10), where e(i) is the difference between the moving average at i and the actual value.

**Number of differences (d):** The number of nonseasonal differences.

To determine the values of 'p' and 'q', we use the following analysis.

**Autocorrelation Function (ACF)** is the correlation between the time series and a lagged version of itself. **Partial Autocorrelation Function (PACF)** is the correlation between the time series and a lagged version of itself after eliminating the variation already explained by the intervening lags.

The plots below show the confidence intervals used to determine 'p' and 'q': 'p' and 'q' are the lags at which the PACF and ACF, respectively, cross the upper confidence interval for the first time. (In our case the two curves overlap a lot.)
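To show what the ACF values and the confidence band represent, here is a hand-rolled sketch on a toy alternating series. The helper `sample_acf` is hypothetical and only meant for illustration; the tutorial itself uses the `acf` and `pacf` functions from statsmodels. The band uses the same `1.96 / sqrt(n)` approximation as the plot layout in this section.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    values = [1.0]  # lag 0 is always 1
    for k in range(1, nlags + 1):
        # Correlate the series with itself shifted by k steps
        values.append(np.dot(centered[:-k], centered[k:]) / denom)
    return np.array(values)


# Toy series that flips sign at every step (illustrative values only)
toy = [1, -1, 1, -1, 1, -1, 1, -1]
acf_vals = sample_acf(toy, nlags=3)

# Approximate 95% band for a series with no autocorrelation
band = 1.96 / np.sqrt(len(toy))
outside_band = np.abs(acf_vals[1:]) > band

print("ACF:", acf_vals)
print("band: +/-", band)
print("lags 1..3 outside band:", outside_band)
```

Lags whose value falls inside the band are not distinguishable from noise at roughly the 95% level, which is why the crossing points are used to pick 'p' and 'q'.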

In [10]:
#ACF and PACF plots:
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff.dropna(inplace=True)
lag_acf = acf(ts_log_diff, nlags=20)
lag_pacf = pacf(ts_log_diff, nlags=20, method='ols')


trace0 = go.Scatter(y=lag_acf, name = 'Autocorrelation Function')
trace1 = go.Scatter(y=lag_pacf,name = 'Partial Autocorrelation Function')
data = [trace0,trace1]

layout = {
    'xaxis': {
        'range': [0, 20]
    },
    'yaxis': {
        'range': [-0.2, 1]
    },
    'shapes': [
        {
            'type': 'line',
            'x0': 0,
            'y0': -1.96/np.sqrt(len(ts_log_diff)),
            'x1': 20,
            'y1': -1.96/np.sqrt(len(ts_log_diff)),
            'line': {
                'color': 'rgb(50, 171, 96)',
                'width': 4,
                'dash': 'dashdot',
            },
        },
        {
            'type': 'line',
            'x0': 0,
            'y0': 1.96/np.sqrt(len(ts_log_diff)),
            'x1': 20,
            'y1': 1.96/np.sqrt(len(ts_log_diff)),
            'line': {
                'color': 'rgb(50, 171, 96)',
                'width': 4,
                'dash': 'dashdot',
            },
        },        
    ]
}

fig = {
    'data': data,
    'layout': layout,
}
py.iplot(fig, filename='shapes-lines')

## Testing a model

We try different values for p and q in the range 0 to 2. We get the best results with the following parameters.


In [11]:
model = ARIMA(ts_log, order=(2, 1, 2))  
results_ARIMA = model.fit(disp=0)  

print(results_ARIMA.summary())
print ('RSS: ',sum((results_ARIMA.fittedvalues-ts_log_diff)**2))

ts_log_diff = ts_log - ts_log.shift()
ts_log_diff_plt = go.Scatter(x=ts_log_diff.index, y=ts_log_diff, name='ts_log_diff')

fitted_arima_plt = go.Scatter(x=ts_log_diff.index, y=results_ARIMA.fittedvalues, name='results_ARIMA')
py.iplot([ts_log_diff_plt,fitted_arima_plt])


Maximum Likelihood optimization failed to converge. Check mle_retvals



                             ARIMA Model Results                              
Dep. Variable:                D.price   No. Observations:                  699
Model:                 ARIMA(2, 1, 2)   Log Likelihood                1201.924
Method:                       css-mle   S.D. of innovations              0.043
Date:                Mon, 26 Mar 2018   AIC                          -2391.848
Time:                        22:30:39   BIC                          -2364.550
Sample:                    04-12-2016   HQIC                         -2381.295
                         - 03-11-2018                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0044      0.002      2.514      0.012       0.001       0.008
ar.L1.D.price     0.2114      0.329      0.643      0.520      -0.433       0.856
ar.L2.D.price     0.6388      0.274     

## Rescaling it and calculating RMSE

The model was fitted on the differenced log series, so to get back to prices we take the cumulative sum of the fitted differences, add the first log value, and finally take the exponent.
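As a small check of the back-transformation used in this section, the sketch below differences the log of a made-up price series and then rebuilds the prices with a cumulative sum followed by an exponent. The prices are illustrative assumptions; with real model output, fitted differences would take the place of `log_diffs`.

```python
import numpy as np

# Toy prices (illustrative values only)
prices = np.array([100.0, 110.0, 121.0, 115.0])

log_prices = np.log(prices)
log_diffs = np.diff(log_prices)  # first difference of the log series

# Undo the differencing: start at the first log value, add the running sum
rebuilt_log = np.concatenate(([log_prices[0]], log_prices[0] + np.cumsum(log_diffs)))

# Undo the log transform
rebuilt_prices = np.exp(rebuilt_log)

print(rebuilt_prices)
```

Because each rebuilt value depends on the sum of all earlier differences, small errors in the fitted differences accumulate over time when this is applied to model output.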

In [12]:
predictions_difference = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_difference_cumsum = predictions_difference.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_difference_cumsum,fill_value=0)
predictions_ARIMA_log.head()



tmpstamp
2016-04-11    6.047349
2016-04-12    6.051798
2016-04-13    6.056327
2016-04-14    6.060492
2016-04-15    6.065009
dtype: float64

In [13]:
predictions_ARIMA = np.exp(predictions_ARIMA_log)

predictions_ARIMA_plt = go.Scatter(x=predictions_ARIMA.index, y=predictions_ARIMA, name='predictions_ARIMA')
py.iplot([btc_price,predictions_ARIMA_plt])

print('RMSE: ',np.sqrt(sum((predictions_ARIMA-df['price'])**2)/len(df['price'])))

RMSE:  2248.0215066208616


The above RMSE is very large. It means that the prices produced by this model are off by more than $2000 in root-mean-square terms.
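For reference, the RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as the price. A minimal sketch with made-up numbers (the helper `rmse` and the values are illustrative):

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error of two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Toy values (illustrative only): errors are 10, 10 and 30
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single error of 30 dominates the result, which is why a few large misses can inflate the RMSE.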

## Using a RNN network with LSTM

Let's try to improve on this using a recurrent neural network with LSTM layers. We will train this network with the extracted sentiments and the price of bitcoin.

### Normalise the data set

We first reshape the values into column vectors with `reshape(-1, 1)`. We then use MinMaxScaler to scale the price to the range 0 to 1.
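The scaling itself is a simple linear map. The sketch below applies the min-max formula by hand to a few made-up prices and then inverts it, which mirrors what `fit_transform` and `inverse_transform` do for a single feature with `feature_range=(0, 1)`.

```python
import numpy as np

# Toy prices (illustrative values only)
prices = np.array([400.0, 1000.0, 19000.0])

lo, hi = prices.min(), prices.max()

# Min-max scaling to [0, 1]
scaled = (prices - lo) / (hi - lo)

# Inverse transform back to the original unit
restored = scaled * (hi - lo) + lo

print(scaled)
print(restored)
```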
 

In [14]:
price = df['price'].values.reshape(-1,1)
pos = df['pos'].values.reshape(-1,1)
price = price.astype('float32')
pos = pos.astype('float32')
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(price)

Split the data into an 80% training set and a 20% test set.

In [15]:
train_size = int(len(scaled) * 0.8)
test_size = len(scaled) - train_size
train = scaled[0:train_size,:] 
test = scaled[train_size:len(scaled),:]
print(len(train), len(test))

560 140


The following creates a window of features for our model to train on. For example, if `look_back` is 10, it will create features like x(t-1) to x(t-10).
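The windowing idea can be sketched generically as follows. This is a simplified illustration with a hypothetical helper `make_windows`, not the `create_dataset` function used in this tutorial: each input row holds the `look_back` values before time t and the target is the value at time t.

```python
def make_windows(series, look_back):
    """Build (inputs, targets) where each input is the look_back previous values."""
    inputs, targets = [], []
    for t in range(look_back, len(series)):
        inputs.append(series[t - look_back:t])  # x(t-look_back) ... x(t-1)
        targets.append(series[t])               # x(t)
    return inputs, targets


# Toy series (illustrative values only)
X, y = make_windows([1, 2, 3, 4, 5], look_back=2)

for row, target in zip(X, y):
    print(row, "->", target)
```

Keeping the target outside the input window is what makes this a forecasting setup rather than a reconstruction of the input.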

In [16]:
# https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
def create_dataset(dataset, look_back, pos, sent=False):
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back):
        if i >= look_back:
            a = dataset[i-look_back:i+1, 0]
            a = a.tolist()
            if(sent==True):
                a.append(pos[i].tolist()[0])
            dataX.append(a)
            dataY.append(dataset[i + look_back, 0])
    #print(len(dataY))
    return np.array(dataX), np.array(dataY)

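I tested 4 values of `look_back`. The best result I got was for `look_back = 0`. Note that with `look_back = 0` the window `dataset[i:i+1]` holds the same value that is used as the target `dataset[i]`, so the network sees the price it is asked to output. We also reshape the dataset so it can be fed to the neural net.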
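The reshape step can be illustrated on a toy array. Keras LSTM layers expect three-dimensional input of shape `(samples, timesteps, features)`, so a 2-D feature matrix gets a timestep axis of length 1. The array sizes below are illustrative assumptions.

```python
import numpy as np

# 5 samples with 2 features each (illustrative values only)
X = np.arange(10, dtype="float32").reshape(5, 2)

# Insert a timestep axis of length 1: (samples, timesteps, features)
X_lstm = np.reshape(X, (X.shape[0], 1, X.shape[1]))

print(X.shape, "->", X_lstm.shape)
```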

In [17]:
look_back = 0
trainX, trainY = create_dataset(train, look_back, pos[0:train_size],sent=True)
testX, testY = create_dataset(test, look_back, pos[train_size:len(scaled)], sent=True)
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

## Train the LSTM model
Here we use a sequential model with an LSTM input layer and an LSTM hidden layer. We then add a dense layer at the end to get the regression output. We use the `adam` optimiser, which is claimed to perform better and take less time to train.
An LSTM cell looks like:

<img src ="https://upload.wikimedia.org/wikipedia/commons/5/53/Peephole_Long_Short-Term_Memory.svg">


In [21]:
model = Sequential()
model.add(LSTM(100, input_shape=(trainX.shape[1], trainX.shape[2]), return_sequences=True))
model.add(LSTM(100))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
history = model.fit(trainX, trainY, epochs=500, batch_size=100, validation_data=(testX, testY), verbose=0, shuffle=False)

Here we predict the price for the test dataset and visualise the predicted and actual values.

In [22]:
yhat = model.predict(testX)
yhat_plt = go.Scatter(y=yhat.T[0], name='yhat')
test_plt = go.Scatter(y=testY, name= 'testY')
data = [yhat_plt, test_plt]
py.iplot(data)

Now we invert the scaling of the data points and compute the `RMSE`.

In [23]:
Yhat_inverse = scaler.inverse_transform(yhat.reshape(-1, 1))
testY_inverse = scaler.inverse_transform(testY.reshape(-1, 1))
rmse = sqrt(mean_squared_error(testY_inverse, Yhat_inverse))
print('Test RMSE: %.3f' % rmse)

Test RMSE: 108.950


We see that we get an RMSE of around `$109`. This is a huge improvement over the ARIMA model. This also means that, on this test set, we can predict the price of Bitcoin with an error of about `$109`.

<br>
**Further reading:**

Keras: https://keras.io/getting-started/sequential-model-guide/ <br>
ARIMA: http://people.duke.edu/~rnau/411arim3.htm <br>
Timeseries using ARIMA: https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/ <br>

**References:**
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/ <br>
https://github.com/llSourcell/bitcoin_prediction <br>
https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/ <br>