## RNN with Cryptocurrency data
We're going to use a recurrent neural network to make predictions on a time-series dataset of cryptocurrency prices.  

The data we'll be using is `Open`, `High`, `Low`, `Close`, `Volume` data for *Bitcoin*, *Ethereum*, *Litecoin* and *Bitcoin Cash*.  
(Dataset can be downloaded from <a href="https://pythonprogramming.net/static/downloads/machine-learning-data/crypto_data.zip">here</a>.)

For our purposes here, we're going to only be focused on the Close and Volume columns.  
The **Close** column holds the final price at the end of each interval. Here the intervals are 1 minute long, so it is the price of the asset at the end of each minute.  
The **Volume** column is how much of the asset was traded during each interval, in this case, per minute.  

We're going to be tracking the Close and Volume every minute for Bitcoin, Litecoin, Ethereum, and Bitcoin Cash.

## Idea
The idea is that these cryptocoins all have relationships with each other. Could we possibly predict future movements of, say, Litecoin, by analyzing the last 60 minutes of prices and volumes for all 4 of these coins? We would start with a guess that there exists some (at least better than random) relationship here that a recurrent neural network could discover.

## PART 1: Dealing with Data
### Input and Target
Our data isn't already in some beautiful format where we have sequences mapped to targets. In fact, there are no targets at all. It's just some datapoints every 60 seconds.  

- First, we need to combine price and volume for each coin into a single featureset, then we want to take these featuresets and combine them into sequences of 60 of these featuresets. This will be our input.  
- Next, we'll be trying to predict if the price will rise or fall. So, we need to take the "prices" of the item we're trying to predict. Let's stick with saying we're trying to predict the price of Litecoin. So we need to grab the future price of Litecoin, then determine if it's higher or lower than the current price. We need to do this at every step.

Besides deciding on input and output, we also need to:
1. Balance the dataset between buys and sells. We could also use class weights, but balancing the data directly tends to work better.
2. Scale/normalize the data in some way.
3. Create reasonable out of sample data that works with the problem.


Let's import everything we may need.

In [1]:
import pandas as pd
import numpy as np

Let's look at our data.

In [2]:
df = pd.read_csv("data/LTC-USD.csv", names=['time', 'low', 'high', 'open', 'close', 'volume'])
# print(df.head())

What we want to do is somehow take the close and volume from here, and combine it with the other 3 cryptocurrencies.

In [3]:
main_df = pd.DataFrame()

currencies = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]

Let's go through each CSV file, rename the columns so they carry the currency name, and ignore all the columns other than **close** and **volume**. Then we will join all of them into a single dataframe.

In [4]:
for currency in currencies:
    print(currency)
    dataset = f"data/{currency}.csv"
    df = pd.read_csv(dataset, names=["time", "low", "high", "open", "close", "volume"])
    df.rename(columns={"close": f"{currency}_close", "volume": f"{currency}_volume"}, inplace=True)
    df.set_index("time", inplace=True)
    df = df[[f"{currency}_close", f"{currency}_volume"]]
    
    if len(main_df) == 0:
        main_df = df
    else:
        main_df = main_df.join(df)  

BTC-USD
LTC-USD
BCH-USD
ETH-USD
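
To see what the rename-and-join loop does, here is a minimal sketch on two tiny in-memory frames. The values and the `toy_` names are made up for illustration; only the column-naming pattern comes from the cell above.

```python
import pandas as pd

# Two made-up frames standing in for two of the CSV files.
toy_btc = pd.DataFrame({"time": [0, 60, 120], "close": [10.0, 11.0, 12.0], "volume": [1.0, 2.0, 3.0]})
toy_ltc = pd.DataFrame({"time": [0, 60, 180], "close": [5.0, 5.5, 6.0], "volume": [7.0, 8.0, 9.0]})

toy_main = pd.DataFrame()
for name, frame in [("BTC-USD", toy_btc), ("LTC-USD", toy_ltc)]:
    frame = frame.rename(columns={"close": f"{name}_close", "volume": f"{name}_volume"})
    frame = frame.set_index("time")[[f"{name}_close", f"{name}_volume"]]
    # join() is a left join on the index, so rows follow the first frame's timestamps
    toy_main = frame if toy_main.empty else toy_main.join(frame)

print(toy_main)
```

Note that timestamp 120 has no Litecoin row in this toy data, so the joined frame holds NaN there, and timestamp 180 is dropped because it is missing from the first frame. Gaps like these are why a fill or drop step is needed afterwards.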


Let's fill the gaps, drop any remaining **NA** values, and see the data.

In [5]:
# if there are gaps in data, use previously known values
# (ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas)
main_df.ffill(inplace=True)
# and then drop invalid values if any
main_df.dropna(inplace=True)
main_df.head()

time,BTC-USD_close,BTC-USD_volume,LTC-USD_close,LTC-USD_volume,BCH-USD_close,BCH-USD_volume,ETH-USD_close,ETH-USD_volume
1528968720,6487.379883,7.706374,96.660004,314.387024,870.859985,26.856577,486.01001,26.019083
1528968780,6479.410156,3.088252,96.57,77.129799,870.099976,1.1243,486.0,8.4494
1528968840,6479.410156,1.4041,96.5,7.216067,870.789978,1.749862,485.75,26.994646
1528968900,6479.97998,0.753,96.389999,524.539978,870.0,1.6805,486.0,77.355759
1528968960,6480.0,1.4909,96.519997,16.991997,869.98999,1.669014,486.0,7.5033
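
As a small illustration of the fill-then-drop step, here is a sketch on a made-up frame (the `toy` data is hypothetical). Forward filling carries the last known value into a gap, and `dropna` then removes rows that had nothing earlier to carry forward.

```python
import numpy as np
import pandas as pd

# Made-up frame with a leading gap in "a" and a two-row gap in "b".
toy = pd.DataFrame({"a": [np.nan, 1.0, np.nan, 3.0], "b": [2.0, np.nan, np.nan, 5.0]})

filled = toy.ffill()        # carry the last known value forward
cleaned = filled.dropna()   # drop rows that still have no value (the leading gap)

print(cleaned)
```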


Next, we need to create a target. To do this, we need to know which price we're trying to predict. We also need to know how far out we want to predict.   

We'll choose Litecoin for now. Knowing how far out we want to predict probably also depends on how long our sequences are. If our sequence length is 3 (3 minutes), we probably can't easily predict out 10 minutes. If our sequence length is 300, 10 might not be as hard. So we'll start with a sequence length of 60, and predict 3 periods into the future.  

We could also make the prediction a regression question, using a linear activation on the output layer, but instead we will just go with a binary classification.  

If the price goes up in 3 minutes, then it's a buy. If it goes down in 3 minutes, it's a not buy/sell. With all of that in mind, let's define a few constants:

In [6]:
SEQ_LEN = 60  # how long of a preceding sequence to collect for RNN
FUTURE_PERIOD_PREDICT = 3  # how far into the future are we trying to predict?
CURRENCY_TO_PREDICT = "LTC-USD"

Next, we will define a simple classification function that we'll use in future:

In [7]:
def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0

This function will take values from 2 columns. If the "future" column is higher, it's a 1 (buy). Otherwise it's a 0 (sell).   
To do this, first, we need a future column!

In [8]:
main_df['future'] = main_df[f"{CURRENCY_TO_PREDICT}_close"].shift(-FUTURE_PERIOD_PREDICT)

A `shift` will just shift the columns for us, a negative shift will shift them "up." So shifting up 3 will give us the price 3 minutes in the future, and we're just assigning this to a new column.
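
Here is a quick sketch of what a negative shift does, using a short made-up price series (`toy_close` is hypothetical):

```python
import pandas as pd

toy_close = pd.Series([96.66, 96.57, 96.50, 96.39, 96.52], name="close")

# shift(-3) moves every value up by 3 rows, so row 0 now sees the price from row 3
toy_future = toy_close.shift(-3)

print(pd.DataFrame({"close": toy_close, "future": toy_future}))
```

The last 3 rows have no future price available, so they come out as NaN.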

Now that we've got the future values, we can use them to make a target using the function we defined above.

In [9]:
main_df['target'] = list(map(classify, main_df[f"{CURRENCY_TO_PREDICT}_close"], main_df['future']))

The `map()` is used to map a function. The first parameter here is the function we want to map (classify), then the next ones are the parameters to that function. In this case, the current close price, and then the future price.  
The map part is what allows us to do this row-by-row for these columns, but also do it quite fast. The list part converts the end result to a list, which we can just set as a column.
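
As an aside, the same labels can probably be produced with a vectorized comparison instead of `map`. The sketch below uses made-up prices and re-declares a `classify` with the same rule so it runs on its own. It also shows a detail worth knowing: rows whose future price is NaN compare as "not greater", so both approaches label them 0.

```python
import pandas as pd

def classify(current, future):
    # same rule as the notebook's classify(): 1 if the future price is higher
    return 1 if float(future) > float(current) else 0

toy = pd.DataFrame({"close": [96.66, 96.57, 96.50, 96.70, 96.52]})
toy["future"] = toy["close"].shift(-3)

mapped = list(map(classify, toy["close"], toy["future"]))
vectorized = (toy["future"] > toy["close"]).astype(int).tolist()

print(mapped, vectorized)
```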

Now let's check out our data.

In [10]:
main_df.head()

time,BTC-USD_close,BTC-USD_volume,LTC-USD_close,LTC-USD_volume,BCH-USD_close,BCH-USD_volume,ETH-USD_close,ETH-USD_volume,future,target
1528968720,6487.379883,7.706374,96.660004,314.387024,870.859985,26.856577,486.01001,26.019083,96.389999,0
1528968780,6479.410156,3.088252,96.57,77.129799,870.099976,1.1243,486.0,8.4494,96.519997,0
1528968840,6479.410156,1.4041,96.5,7.216067,870.789978,1.749862,485.75,26.994646,96.440002,0
1528968900,6479.97998,0.753,96.389999,524.539978,870.0,1.6805,486.0,77.355759,96.470001,1
1528968960,6480.0,1.4909,96.519997,16.991997,869.98999,1.669014,486.0,7.5033,96.400002,0


Note that the *future* and *target* columns apply only to the currency we chose to predict at the beginning.

## PART 2: Normalizing and creating Sequences

The first thing we want to do is separate out our validation (out-of-sample) data.  

In the past, all we did was shuffle the data, then slice it. The problem with that method here is that the data is inherently sequential, so validating on sequences that don't come after the training data is likely a mistake. Sequences that are, say, 1 minute apart will be almost identical, and chances are the target will be the same too (buy or sell). Because of this, any overfitting is likely to spill over into the validation set.  
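
To make the "almost identical" point concrete, this small sketch counts how many rows two 60-row windows share when they end one minute apart. The row indices are just illustrative.

```python
SEQ_LEN = 60

# Row indices covered by two windows that end one minute apart.
window_a = set(range(0, SEQ_LEN))        # rows 0..59
window_b = set(range(1, SEQ_LEN + 1))    # rows 1..60

shared = len(window_a & window_b)
print(f"{shared} of {SEQ_LEN} rows are shared")
```

If one of those windows landed in training and the other in validation after a shuffle, the validation score would mostly measure memorization.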

Instead, we want to slice our validation while it's still in order.  

So let's take the last 5% of the data as validation data.

In [11]:
# get the sorted timestamps
times = sorted(main_df.index.values)
# timestamp that marks the start of the last 5% of the data
last_5pct = times[-int(0.05 * len(times))]
# this is the timestamp from where we want to separate validation data
print(last_5pct)

# make the validation data where the index is in the last 5%
validation_main_df = main_df[(main_df.index >= last_5pct)]
# now the main_df is all the data up to the last 5%
main_df = main_df[(main_df.index < last_5pct)]

1534922100
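
The same chronological split can be tried on a made-up frame of 100 one-minute rows (the `toy` names are hypothetical) to confirm that the two pieces don't overlap in time:

```python
import pandas as pd

# 100 made-up rows, one per minute
toy = pd.DataFrame({"value": range(100)}, index=pd.Index(range(0, 6000, 60), name="time"))

times = sorted(toy.index.values)
cutoff = times[-int(0.05 * len(times))]   # first timestamp of the last 5%

toy_validation = toy[toy.index >= cutoff]
toy_train = toy[toy.index < cutoff]

print(cutoff, len(toy_train), len(toy_validation))
```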


Next, we need to balance and normalize this data. By balance, we want to make sure the classes have equal amounts when training, so our model doesn't just always predict one class.  

One way to counteract an imbalance is to use class weights, which let you weight the loss higher for the less frequent class.  
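
For reference, here is a rough sketch of how class weights could be computed by hand. It uses the common "balanced" heuristic, `n_samples / (n_classes * count)`, on made-up labels. A dictionary like this is the kind of value Keras' `model.fit` accepts through its `class_weight` argument, though this notebook balances the data instead.

```python
from collections import Counter

toy_targets = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # made-up labels: 6 sells, 4 buys

counts = Counter(toy_targets)
n_samples = len(toy_targets)
n_classes = len(counts)

# the rarer class gets the larger weight
class_weight = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(class_weight)
```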

We also need to take our data and make sequences from it.  

- Let's start by removing the future column (the actual target is literally called target; we only needed the future column temporarily to create it).
- Then we'll normalize the data and scale it too.
- Next, we will create sequences of a fixed time period (say 60 minutes each) from the data.
- Next, we need to balance the dataset. (For a Cats vs Dogs classifier, it is better to have equal amounts of Cat images and Dog images. Similarly, in this scenario, it is better to balance the dataset with equal amounts of buys and sells, based on target.)
- Finally, let's separate the features and labels and pass them back to the caller.
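
Before the full function, here is a small sketch of the normalize-and-scale step on made-up prices. It assumes "scale" means standardizing to zero mean and unit variance, written out by hand here instead of calling a library helper. The two toy coins trade at very different price levels, but their percent changes come out the same.

```python
import numpy as np
import pandas as pd

# Made-up prices: same relative moves, very different price levels.
toy = pd.DataFrame({"btc": [6000.0, 6060.0, 6000.0], "ltc": [100.0, 101.0, 100.0]})

returns = toy.pct_change().dropna()                               # relative movement per step
standardized = (returns - returns.mean()) / returns.std(ddof=0)   # zero mean, unit variance

print(returns)
print(standardized)
```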

In [35]:
from sklearn import preprocessing
from collections import deque
import random

def preprocess_df(df):
    # the positional axis argument was removed in pandas 2.0, so name the columns explicitly
    df = df.drop(columns="future")
    
    ## NORMALIZING AND SCALING ##
    for col in df.columns:
        # normalize every column except the target
        if col != "target":
            # pct_change "normalizes" the different currencies:
            # each coin trades at a vastly different price level,
            # so we care about relative movements rather than absolute values
            df[col] = df[col].pct_change()
            # remove the NA's created by pct_change
            df.dropna(inplace=True)
            # standardize to zero mean and unit variance
            df[col] = preprocessing.scale(df[col].values)
           
    # cleanup again (just in case)
    df.dropna(inplace=True)
    
    ## CREATING THE SEQUENCES ##
    
    # this is a list that will contain the sequences
    sequential_data = []
    # these will be our actual sequences
    # they are made with deque, which keeps the maximum length by popping out older values as new ones come in
    prev_minutes = deque(maxlen=SEQ_LEN)

    # iterate over the values (rows)
    for i in df.values:
        # store everything in the row, except target
        prev_minutes.append([n for n in i[:-1]])
        if len(prev_minutes) == SEQ_LEN:
            sequential_data.append([np.array(prev_minutes), i[-1]])

    # shuffle for good measure
    random.shuffle(sequential_data)
    
    ## BALANCING THE DATASET ##
    # list that will store our buy sequences and targets
    buys = []
    # list that will store our sell sequences and targets
    sells = []
    
    # split the dataset into buys and sells based on target
    for seq, target in sequential_data:
        if target == 0:
            sells.append([seq, target])
        elif target == 1:
            buys.append([seq, target])
       
    # shuffle them both
    random.shuffle(buys)
    random.shuffle(sells)
    
    # find out which is smaller, buys or sells
    lower = min(len(buys), len(sells))
    
    # then limit each of them to that 'lower' number
    buys = buys[:lower]
    sells = sells[:lower]
    
    # add them both and shuffle them again
    sequential_data = buys + sells
    random.shuffle(sequential_data)
    
    
    X = []
    y = []
    
    for seq, target in sequential_data:
        X.append(seq)
        y.append(target)
        
    return np.array(X), np.array(y)
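
The sequence-building part is easier to follow on a tiny example. This sketch uses a window of 3 instead of 60 and made-up `(features, target)` rows. The `deque` silently drops its oldest entry once it is full, so each new row yields one new window.

```python
from collections import deque

TOY_SEQ_LEN = 3
rows = [([10], 0), ([11], 1), ([12], 0), ([13], 1), ([14], 1)]  # (features, target)

window = deque(maxlen=TOY_SEQ_LEN)
sequences = []
for features, target in rows:
    window.append(features)
    # only emit a sequence once the window is full
    if len(window) == TOY_SEQ_LEN:
        sequences.append((list(window), target))

for seq, target in sequences:
    print(seq, "->", target)
```

With 5 rows and a window of 3, that gives 5 - 3 + 1 = 3 sequences, each paired with the target of its last row.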

Let's preprocess our data.

In [36]:
train_x, train_y = preprocess_df(main_df)
validation_x, validation_y = preprocess_df(validation_main_df)

Let's print some stats real quick to make sure things are what we expect:

In [37]:
print(f"Training Data Size: {len(train_x)} \nValidation Data Size: {len(validation_x)}")
# print(f"Training data:\nSells: {np.sum(train_y == 0)} | Buys: {np.sum(train_y == 1)}")
# print(f"Validation data:\nSells: {np.sum(validation_y == 0)} | Buys: {np.sum(validation_y == 1)}")

Training Data Size: 77922 
Validation Data Size: 3860
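
To confirm the classes really are balanced, note that the labels come back as a NumPy array, which has no list-style `.count()` method. One way to count them, sketched here on made-up labels (`toy_y` is hypothetical), is `np.unique` with `return_counts=True`:

```python
import numpy as np

toy_y = np.array([0, 1, 1, 0, 1, 0])  # made-up labels

labels, counts = np.unique(toy_y, return_counts=True)
balance = dict(zip(labels.tolist(), counts.tolist()))

print(balance)
```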


Now, let's start with the TensorFlow stuff.  
Let's import all the necessary things (models, layers and callbacks) and define some constants.

In [38]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint

In [39]:
import time

# how many passes through our data
EPOCHS = 10
# number of samples per batch (try a smaller batch size if you're getting OOM, out-of-memory, errors)
BATCH_SIZE = 64
# a unique name for the model
NAME = f"{CURRENCY_TO_PREDICT}-{SEQ_LEN}-SEQ-{FUTURE_PERIOD_PREDICT}-PRED-{int(time.time())}"
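
As a quick sanity check on these constants, the number of batches per epoch follows from the training size printed earlier (77922 sequences) and the batch size:

```python
import math

TRAIN_SIZE = 77922   # training sequences reported above
BATCH_SIZE = 64

# the last batch is smaller, so round up
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)

print(steps_per_epoch)
```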

Let's build our model.  

We will start with a Sequential model, add a few LSTM layers and a Dense layer, and finally an output layer. 

In [40]:
model = Sequential()

model.add(LSTM(128, input_shape=(train_x.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())    #normalizes activation outputs

model.add(LSTM(128, activation='relu', return_sequences=True))
model.add(Dropout(0.1))
model.add(BatchNormalization())    #normalizes activation outputs

model.add(LSTM(128, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())    #normalizes activation outputs

model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(2, activation='softmax'))
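
For a feel of how large these LSTM layers are, the sketch below applies the standard parameter-count formula for an LSTM with biases, `4 * (units * (input_dim + units) + units)`. It assumes 8 input features (close and volume for 4 coins); the helper name is made up, and the numbers are meant to be compared against `model.summary()`, not taken as its output.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

first_layer = lstm_param_count(input_dim=8, units=128)     # fed by the 8 features
later_layer = lstm_param_count(input_dim=128, units=128)   # fed by the previous LSTM

print(first_layer, later_layer)
```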

Let's choose an optimizer and compile our model.

In [41]:
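# `learning_rate` replaces the deprecated `lr` alias; newer Keras optimizers may reject `decay`, so remove it if you get an argument error
opt = tf.keras.optimizers.Adam(learning_rate=0.001, decay=1e-6)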

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy']
)   
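
To get an intuition for the loss values, here is a plain-Python sketch of sparse categorical crossentropy for a single sample: the negative log of the probability the model assigned to the true class. The probabilities are made up, and the real loss is averaged over each batch.

```python
import math

def sparse_categorical_crossentropy(true_label, probabilities):
    # negative log-probability of the correct class
    return -math.log(probabilities[true_label])

coin_flip = sparse_categorical_crossentropy(1, [0.5, 0.5])
confident_right = sparse_categorical_crossentropy(1, [0.1, 0.9])
confident_wrong = sparse_categorical_crossentropy(0, [0.1, 0.9])

print(round(coin_flip, 4), round(confident_right, 4), round(confident_wrong, 4))
```

A model that always outputs 50/50 sits at a loss of ln(2), about 0.693, which is a handy baseline to compare against on a balanced two-class problem.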

Now, let's add a TensorBoard callback for future analysis.

In [42]:
tensorboard = TensorBoard(log_dir="logs/{}".format(NAME))

Let's also add the ModelCheckpoint callback to save the models with the best validation accuracy for future use.

In [65]:
checkpoint_filepath = "models/RNN_Final-{epoch:02d}-{val_accuracy:.3f}.hd5"
# saves only the best models
checkpoint = ModelCheckpoint(filepath=checkpoint_filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
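
The placeholders in the checkpoint path are ordinary Python format fields that get filled in with the epoch number and the logged metric. This sketch formats the template by hand with example values to show the kind of file name to expect:

```python
checkpoint_filepath = "models/RNN_Final-{epoch:02d}-{val_accuracy:.3f}.hd5"

# example values: epoch 1 with a validation accuracy of 0.50648
example_path = checkpoint_filepath.format(epoch=1, val_accuracy=0.50648)

print(example_path)
```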

Now, finally, let's train the model.

In [66]:
history = model.fit(
    train_x, train_y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(validation_x, validation_y),
    callbacks=[tensorboard, checkpoint]
)

Epoch 1/10
Epoch 00001: val_accuracy improved from -inf to 0.50648, saving model to models/RNN_Final-01-0.506.hd5
INFO:tensorflow:Assets written to: models/RNN_Final-01-0.506.hd5\assets
Epoch 2/10
Epoch 00002: val_accuracy did not improve from 0.50648
Epoch 3/10
Epoch 00003: val_accuracy did not improve from 0.50648
Epoch 4/10
Epoch 00004: val_accuracy improved from 0.50648 to 0.51606, saving model to models/RNN_Final-04-0.516.hd5
INFO:tensorflow:Assets written to: models/RNN_Final-04-0.516.hd5\assets
Epoch 5/10
Epoch 00005: val_accuracy did not improve from 0.51606
Epoch 6/10
Epoch 00006: val_accuracy did not improve from 0.51606
Epoch 7/10
Epoch 00007: val_accuracy did not improve from 0.51606
Epoch 8/10
Epoch 00008: val_accuracy did not improve from 0.51606
Epoch 9/10
Epoch 00009: val_accuracy did not improve from 0.51606
Epoch 10/10
Epoch 00010: val_accuracy did not improve from 0.51606


Let's visualize the accuracy and loss using TensorBoard.

In [69]:
%load_ext tensorboard
import os

logs_base_dir = "./logs"
os.makedirs(logs_base_dir, exist_ok=True)
%tensorboard --logdir {logs_base_dir}

ERROR: Timed out waiting for TensorBoard to start. It may still be running as pid 644.

After executing the above cell, go to localhost:6006 to see the output.