# Cryptocurrency Prediction using an RNN

## Overview

This notebook is based on the [Deep Learning basics with Python, TensorFlow and Keras](https://pythonprogramming.net/cryptocurrency-recurrent-neural-network-deep-learning-python-tensorflow-keras/) tutorial series, starting from Part 8.

Parts 8 to 10 cover data preparation; Part 11 covers training a model.

## Parts 8 to 10


These parts of the tutorial focus on preparing the data for training a model.

We begin by importing a single file and previewing the data:

In [1]:
import pandas as pd

names = ['time', 'low', 'high', 'open', 'close', 'volume']
df = pd.read_csv('data/crypto_data/LTC-USD.csv', names=names)
df.head()

| | time | low | high | open | close | volume |
|---|---|---|---|---|---|---|
| 0 | 1528968660 | 96.580002 | 96.589996 | 96.589996 | 96.580002 | 9.6472 |
| 1 | 1528968720 | 96.449997 | 96.669998 | 96.589996 | 96.660004 | 314.387024 |
| 2 | 1528968780 | 96.470001 | 96.57 | 96.57 | 96.57 | 77.129799 |
| 3 | 1528968840 | 96.449997 | 96.57 | 96.57 | 96.5 | 7.216067 |
| 4 | 1528968900 | 96.279999 | 96.540001 | 96.5 | 96.389999 | 524.539978 |


Load all data files:

In [2]:
main_df = pd.DataFrame()

ratios = ['BTC-USD', 'LTC-USD', 'BCH-USD', 'ETH-USD']
for ratio in ratios:
    dataset = f'data/crypto_data/{ratio}.csv'
    print(f"Loading {dataset}")
    df = pd.read_csv(dataset, names=names)
    
    # Rename 'close' and 'volume' columns so that we can identify the source dataset after merging
    df.rename(columns={'close': f'{ratio}_close', 'volume': f'{ratio}_volume'}, inplace=True)
    
    # set time as index so we can join them on this shared time
    df.set_index('time', inplace=True)
    
    # Remove other columns
    df = df[[f'{ratio}_close', f'{ratio}_volume']]

    if len(main_df) == 0:
        main_df = df
    else:
        main_df = main_df.join(df)

# if there are gaps in data, use previously known values
main_df.ffill(inplace=True)
main_df.dropna(inplace=True)
main_df.head()

Loading data/crypto_data/BTC-USD.csv
Loading data/crypto_data/LTC-USD.csv
Loading data/crypto_data/BCH-USD.csv
Loading data/crypto_data/ETH-USD.csv


| time | BTC-USD_close | BTC-USD_volume | LTC-USD_close | LTC-USD_volume | BCH-USD_close | BCH-USD_volume | ETH-USD_close | ETH-USD_volume |
|---|---|---|---|---|---|---|---|---|
| 1528968720 | 6487.379883 | 7.706374 | 96.660004 | 314.387024 | 870.859985 | 26.856577 | 486.01001 | 26.019083 |
| 1528968780 | 6479.410156 | 3.088252 | 96.57 | 77.129799 | 870.099976 | 1.1243 | 486.0 | 8.4494 |
| 1528968840 | 6479.410156 | 1.4041 | 96.5 | 7.216067 | 870.789978 | 1.749862 | 485.75 | 26.994646 |
| 1528968900 | 6479.97998 | 0.753 | 96.389999 | 524.539978 | 870.0 | 1.6805 | 486.0 | 77.355759 |
| 1528968960 | 6480.0 | 1.4909 | 96.519997 | 16.991997 | 869.98999 | 1.669014 | 486.0 | 7.5033 |
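
The join-and-forward-fill step can be checked on a toy example. The sketch below uses two small hypothetical frames (`a` and `b`, with made-up timestamps and prices) instead of the real CSV files, and `DataFrame.ffill()` as the current spelling of `fillna(method="ffill")`:

```python
import pandas as pd

# Hypothetical toy data: `b` has no row for time=120.
a = pd.DataFrame({"time": [60, 120, 180], "A_close": [1.0, 1.1, 1.2]}).set_index("time")
b = pd.DataFrame({"time": [60, 180], "B_close": [5.0, 5.2]}).set_index("time")

# join() is a left join on the index, so the result keeps a's timestamps.
merged = a.join(b)
print(merged["B_close"].isna().sum())  # number of gaps introduced by the join

# Forward-fill carries the last known value into the gap.
filled = merged.ffill()
print(filled.loc[120, "B_close"])
```

Because `join` defaults to a left join, only timestamps present in the first file loaded (BTC-USD here) survive the merge. Leading rows that forward-fill cannot populate are removed by the final `dropna`.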


Next, we need to create a target. To do this, we need to decide which price we're trying to predict and how far ahead we want to predict it. We'll go with Litecoin for now. How far ahead we can reasonably predict depends on how long our sequences are: with a sequence length of 3 (3 minutes of data), predicting 10 minutes ahead is probably too hard, whereas with a sequence length of 300, predicting 10 minutes ahead might be feasible. I'd like to go with a sequence length of 60 and predict 3 periods (3 minutes) into the future. We could treat this as a regression problem, using a linear activation on the output layer, but instead I'm going to use binary classification.

If the price is higher 3 minutes from now, it's a buy (1). Otherwise, it's not a buy (0). With all of that in mind, I am going to define the following constants:

In [3]:
SEQ_LEN = 60
FUTURE_PERIOD_PREDICT = 3
RATIO_TO_PREDICT = 'LTC-USD'

Create a new column called 'future', containing the close price from `FUTURE_PERIOD_PREDICT` periods in the future:

In [4]:
main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT)

Create a new column called 'target', which will be '1' if the value of 'future' is greater than the current closing price, and '0' otherwise:

In [5]:
def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0
    
main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))
main_df.head()

| time | BTC-USD_close | BTC-USD_volume | LTC-USD_close | LTC-USD_volume | BCH-USD_close | BCH-USD_volume | ETH-USD_close | ETH-USD_volume | future | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 1528968720 | 6487.379883 | 7.706374 | 96.660004 | 314.387024 | 870.859985 | 26.856577 | 486.01001 | 26.019083 | 96.389999 | 0 |
| 1528968780 | 6479.410156 | 3.088252 | 96.57 | 77.129799 | 870.099976 | 1.1243 | 486.0 | 8.4494 | 96.519997 | 0 |
| 1528968840 | 6479.410156 | 1.4041 | 96.5 | 7.216067 | 870.789978 | 1.749862 | 485.75 | 26.994646 | 96.440002 | 0 |
| 1528968900 | 6479.97998 | 0.753 | 96.389999 | 524.539978 | 870.0 | 1.6805 | 486.0 | 77.355759 | 96.470001 | 1 |
| 1528968960 | 6480.0 | 1.4909 | 96.519997 | 16.991997 | 869.98999 | 1.669014 | 486.0 | 7.5033 | 96.400002 | 0 |
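
To see what `shift` and the comparison do at the end of the data, here is a minimal sketch on a hypothetical price series. It uses a vectorized comparison in place of `map(classify, ...)`, which should give the same labels:

```python
import pandas as pd

FUTURE_PERIOD_PREDICT = 3

# Hypothetical closing prices, one per minute.
close = pd.Series([10.0, 10.2, 10.1, 10.3, 10.0, 10.4])

# shift(-3) aligns each row with the price three rows later;
# the last three rows have no future price and become NaN.
future = close.shift(-FUTURE_PERIOD_PREDICT)

# Vectorized equivalent of classify(): 1 if the future price is higher.
# Comparisons with NaN evaluate to False, so the trailing rows get 0.
target = (future > close).astype(int)

print(future.isna().sum())
print(target.tolist())
```

The last `FUTURE_PERIOD_PREDICT` rows are labelled 0 even though their outcome is unknown. With only three such rows the effect is negligible, but dropping rows where `future` is NaN before labelling would avoid it.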


## Part 9

In [6]:
times = sorted(main_df.index.values)
times[:10]

[1528968720,
 1528968780,
 1528968840,
 1528968900,
 1528968960,
 1528969020,
 1528969080,
 1528969140,
 1528969200,
 1528969260]

In [7]:
# get the cutoff timestamp that marks the start of the last 5% of the data
last_5pct = times[-int(0.05*len(times))]
last_5pct

1534922100

In [8]:
validation_df = main_df[(main_df.index >= last_5pct)]
validation_df.shape

(4886, 10)

In [9]:
train_df = main_df[(main_df.index < last_5pct)]
train_df.shape

(92837, 10)
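
The split is chronological rather than random: the validation set is the most recent 5% of the data, so the model is evaluated on a period it has never seen. A minimal sketch of the same cutoff logic, using a hypothetical list of 100 one-minute timestamps:

```python
# Hypothetical index: 100 timestamps, 60 seconds apart.
times = list(range(0, 100 * 60, 60))

# Timestamp at which the final 5% of the rows begins.
cutoff = times[-int(0.05 * len(times))]

train_times = [t for t in times if t < cutoff]
validation_times = [t for t in times if t >= cutoff]

print(len(train_times), len(validation_times))
print(max(train_times) < min(validation_times))
```

A random split would let sequences from the validation period overlap almost entirely with training sequences taken a minute earlier or later, which would inflate the validation score.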

Next, we need to balance and normalize this data. We also need to take our data and make sequences from it.

We'll start by making a function that will process the dataframes:

In [10]:
import numpy as np
import random

from collections import deque
from sklearn import preprocessing

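def preprocess_df(df):
    # 'future' was only needed to build 'target'; keeping it would leak the answer
    df = df.drop(columns='future')
    print(df.head())
    for col in df.columns:
        if col != 'target':
            # convert raw values to percent change so that coins with different price levels are comparable
            df[col] = df[col].pct_change()
            # drop the leading NaN row that pct_change creates (one row per column)
            df.dropna(inplace=True)
            # scale each feature to zero mean and unit variance
            df[col] = preprocessing.scale(df[col].values)

    df.dropna(inplace=True)

    sequential_data = []
    # sliding window holding the most recent SEQ_LEN rows of features
    prev_days = deque(maxlen=SEQ_LEN)

    for i in df.values:
        # store the features only; the last column is the target
        prev_days.append([n for n in i[:-1]])
        if len(prev_days) == SEQ_LEN:
            sequential_data.append([np.array(prev_days), i[-1]])

    random.shuffle(sequential_data)

    buys = []
    sells = []

    for seq, target in sequential_data:
        if target == 0:
            sells.append([seq, target])
        else:
            buys.append([seq, target])

    # size of the smaller class; note that the lists are not truncated to this
    # length here, so the returned data keeps its original class proportions
    lower = min(len(buys), len(sells))

    sequential_data = buys + sells
    # shuffle again so the model does not see all buys followed by all sells
    random.shuffle(sequential_data)

    X = []
    y = []
    for seq, target in sequential_data:
        X.append(seq)
        y.append(target)

    return np.array(X), y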
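
The sequence-building step is easier to follow in isolation. The sketch below applies the same `deque` sliding window to a few hypothetical rows, with `SEQ_LEN` shortened to 3 for readability:

```python
from collections import deque

SEQ_LEN = 3  # shortened from 60 for illustration

# Hypothetical rows: (feature_1, feature_2, target).
rows = [(0.1, 1.0, 0), (0.2, 2.0, 1), (0.3, 3.0, 0), (0.4, 4.0, 1), (0.5, 5.0, 1)]

window = deque(maxlen=SEQ_LEN)  # the oldest row falls out automatically
sequences = []
for *features, target in rows:
    window.append(features)
    if len(window) == SEQ_LEN:
        # Copy the window; the label is the target of its most recent row.
        sequences.append((list(window), target))

print(len(sequences))
print(sequences[0])
```

N rows yield N - SEQ_LEN + 1 sequences. For the training data this accounts for the count in the sanity check: 92837 rows lose 8 rows to the per-column `dropna` (one for each of the 8 feature columns), and the remaining 92829 rows produce 92829 - 59 = 92770 sequences.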

Now we can preprocess the training data:

In [11]:
train_x, train_y = preprocess_df(train_df)

            BTC-USD_close  BTC-USD_volume  LTC-USD_close  LTC-USD_volume  \
time                                                                       
1528968720    6487.379883        7.706374      96.660004      314.387024   
1528968780    6479.410156        3.088252      96.570000       77.129799   
1528968840    6479.410156        1.404100      96.500000        7.216067   
1528968900    6479.979980        0.753000      96.389999      524.539978   
1528968960    6480.000000        1.490900      96.519997       16.991997   

            BCH-USD_close  BCH-USD_volume  ETH-USD_close  ETH-USD_volume  \
time                                                                       
1528968720     870.859985       26.856577      486.01001       26.019083   
1528968780     870.099976        1.124300      486.00000        8.449400   
1528968840     870.789978        1.749862      485.75000       26.994646   
1528968900     870.000000        1.680500      486.00000       77.355759   
1528968960 

And the validation data:

In [12]:
validation_x, validation_y = preprocess_df(validation_df)

            BTC-USD_close  BTC-USD_volume  LTC-USD_close  LTC-USD_volume  \
time                                                                       
1534922100    6684.500000        0.969366      57.509998       66.463028   
1534922160    6684.500000        0.611018      57.509998        3.616516   
1534922220    6682.740234        1.121768      57.509998       13.260421   
1534922280    6682.660156        0.912729      57.509998       19.851404   
1534922340    6682.450195        0.334119      57.509998       17.104265   

            BCH-USD_close  BCH-USD_volume  ETH-USD_close  ETH-USD_volume  \
time                                                                       
1534922100     550.719971        5.058020     285.739990      194.228867   
1534922160     550.710022        0.136300     285.730011       11.172032   
1534922220     551.299988       75.830658     285.730011        1.411576   
1534922280     551.299988        8.701156     285.739990        3.382381   
1534922340 

Sanity check:

In [13]:
print(f"train data: {len(train_x)} validation: {len(validation_x)}")
print(f"Dont buys: {train_y.count(0)}, buys: {train_y.count(1)}")
print(f"VALIDATION Dont buys: {validation_y.count(0)}, buys: {validation_y.count(1)}")

train data: 92770 validation: 4819
Dont buys: 53809, buys: 38961
VALIDATION Dont buys: 2889, buys: 1930
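
These counts show that the classes are still imbalanced: `preprocess_df` computes `lower` but never applies it. One common way to balance the classes is to truncate both lists to the size of the smaller one. The following is a minimal sketch on hypothetical data, not the code that produced the results in this notebook:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical (sequence, target) pairs: 7 "don't buy" and 4 "buy".
sequential_data = [([i], 0) for i in range(7)] + [([i], 1) for i in range(4)]
random.shuffle(sequential_data)

buys = [pair for pair in sequential_data if pair[1] == 1]
sells = [pair for pair in sequential_data if pair[1] == 0]

# Truncate both classes to the size of the smaller one.
lower = min(len(buys), len(sells))
balanced = buys[:lower] + sells[:lower]
random.shuffle(balanced)  # avoid all buys followed by all sells

targets = [t for _, t in balanced]
print(targets.count(0), targets.count(1))  # 4 4
```

Applied to the real data, this would reduce the training set to 38961 samples per class, so the sample counts and metrics reported below would change.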


## Part 11

### Basic model

Now that we've prepared the training data, we can build and train a model using TensorFlow:

In [14]:
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, CuDNNLSTM, BatchNormalization

Construct the model:

In [15]:
model = Sequential()

# First layer; CuDNNLSTM always uses tanh activation.
# return_sequences=True passes the full sequence on to the next LSTM layer
model.add(CuDNNLSTM(128, input_shape=(train_x.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

# Second layer
model.add(CuDNNLSTM(128, return_sequences=True))
model.add(Dropout(0.1))
model.add(BatchNormalization())

# Third layer; returns only the final output, because the next layer is Dense
model.add(CuDNNLSTM(128))
model.add(Dropout(0.2))
model.add(BatchNormalization())

# Fully connected layer
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))

# Output layer: one softmax unit per class (0 = don't buy, 1 = buy)
model.add(Dense(2, activation='softmax'))

Model compile settings:

In [16]:
opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'])
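
The output layer has two softmax units and the labels are the integers 0 and 1, which is why the loss is `sparse_categorical_crossentropy` rather than `categorical_crossentropy` with one-hot labels. The sketch below computes the same quantity by hand for a few hypothetical predictions. It is meant to illustrate the definition, not to reproduce Keras internals:

```python
import math

# Hypothetical softmax outputs for three samples: [P(class 0), P(class 1)].
probs = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
labels = [0, 1, 1]  # integer class ids, not one-hot vectors

# Loss per sample is -log of the probability assigned to the true class.
losses = [-math.log(p[y]) for p, y in zip(probs, labels)]
mean_loss = sum(losses) / len(losses)

print(round(mean_loss, 4))
```

A model that always outputs 0.5 for both classes scores ln 2 ≈ 0.693 on this loss, which is a useful reference point when reading the training output.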

Define some constants:

In [22]:
import time

# how many passes through our data
EPOCHS = 2

# number of samples per batch. Try a smaller batch size if you're getting OOM (out of memory) errors.
BATCH_SIZE = 64

Train the model:

In [23]:
history = model.fit(
    train_x, train_y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(validation_x, validation_y))

Train on 92770 samples, validate on 4819 samples
Epoch 1/2
Epoch 2/2


Evaluate the model:

In [21]:
score = model.evaluate(validation_x, validation_y, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.6613788193106824
Test accuracy: 0.6102925917250804
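
To put these numbers in context, it helps to compare them with trivial baselines. The sketch below uses only the validation class counts printed in the sanity check. The baselines are added here for comparison and are not part of the tutorial:

```python
import math

# Validation class counts from the sanity check above.
dont_buys, buys = 2889, 1930
total = dont_buys + buys

# Accuracy of always predicting the majority class.
majority_baseline = max(dont_buys, buys) / total

# Loss of a model that always outputs 50/50 probabilities.
uniform_loss = math.log(2)

# Loss of a model that always outputs the class frequencies.
p0, p1 = dont_buys / total, buys / total
prior_loss = -(p0 * math.log(p0) + p1 * math.log(p1))

print(f"majority-class accuracy: {majority_baseline:.4f}")
print(f"uniform-guess loss:      {uniform_loss:.4f}")
print(f"class-frequency loss:    {prior_loss:.4f}")
```

The reported accuracy of about 0.610 is roughly one percentage point above the majority-class baseline, and the loss of about 0.661 is only slightly below both reference losses. After two epochs on imbalanced data, the model has learned little beyond the class frequencies.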


### Using TensorBoard and Model Checkpoints

Okay, so we're able to train a model. But we can go a step further: TensorBoard lets us compare results across different hyperparameters, and model checkpoints let us save the model during training.

In [None]:
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint
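
A minimal sketch of how these two callbacks might be configured follows. The run name format, the `logs` and `models` directory names, and the helper `make_callbacks` are assumptions made for illustration. The monitored metric is assumed to be logged as `val_accuracy`, as in TensorFlow 2.x; older releases log it as `val_acc`.

```python
import time

SEQ_LEN = 60
FUTURE_PERIOD_PREDICT = 3

# Hypothetical run name; the timestamp keeps each run's logs separate.
NAME = f"{SEQ_LEN}-SEQ-{FUTURE_PERIOD_PREDICT}-PRED-{int(time.time())}"


def make_callbacks(name, log_root="logs", model_root="models"):
    """Build TensorBoard and ModelCheckpoint callbacks for one run."""
    # Imported here so the naming logic above can be used without TensorFlow.
    from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

    # Each run writes its event files to its own sub-directory.
    tensorboard = TensorBoard(log_dir=f"{log_root}/{name}")

    # Keras fills in {epoch} and the monitored metric when saving.
    filepath = f"{model_root}/{name}" + "-{epoch:02d}-{val_accuracy:.3f}.h5"
    checkpoint = ModelCheckpoint(
        filepath,
        monitor="val_accuracy",
        mode="max",
        save_best_only=True,  # only save when the monitored metric improves
        verbose=1,
    )
    return [tensorboard, checkpoint]
```

The returned list would be passed to training as `model.fit(..., callbacks=make_callbacks(NAME))`, and the runs can then be compared by starting `tensorboard --logdir logs`.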