This notebook is part of an article about how to forecast time-series data and detect anomalies in it. The main objective is to train an RNN regressor on the Bitcoin dataset to predict future values and then detect anomalies across the whole data window; that last step is achieved with an RNN autoencoder.

The notebook also includes a few other models, in case they interest you or the RNN regressor + RNN autoencoder combination doesn't perform well for your use case.

The dataset is available at https://www.kaggle.com/mczielinski/bitcoin-historical-data and contains BITCOIN/USD 1-minute candle data from 2012-01-01 to 2020-12-31. I hope you can take advantage of this approach!

# Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import plotly.graph_objects as go
from sklearn.preprocessing import MinMaxScaler
import gc
import joblib
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers 
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
import json
import urllib
from datetime import datetime, timedelta,timezone
import requests

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
btc = pd.read_csv('/kaggle/input/bitcoin-historical-data/bitstampUSD_1-min_data_2012-01-01_to_2020-12-31.csv')
btc.head()

# Data preprocessing
Let's resample the data, keep only the variable we're going to use, and determine which window of data is most meaningful for our purpose. Let's also clean up null values.

In [None]:
btc['Timestamp'] = pd.to_datetime(btc.Timestamp, unit='s')
btc.head()

In [None]:
print('Minutes in dataset: ',len(btc))
print('Hours in dataset: ',len(btc)/60)
print('Days in dataset: ',len(btc)/60/24)

In [None]:
btc = btc[['Timestamp','Weighted_Price']]
btc.head()

In [None]:
btc.info()

In [None]:
# Resample the data to 1-hour bins
# To sample by day instead, replace 'H' with 'D'
btc = btc.resample('H', on='Timestamp')[['Weighted_Price']].mean()
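
To see what the resampling step does, here is a minimal sketch on synthetic data. The `toy` frame and its values are made up for illustration: two hours of 1-minute prices collapse into two hourly means. The `'60min'` spelling is used here only because it is an alias that both older and newer pandas releases accept.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute series: 120 rows, prices 0.0 .. 119.0 (illustrative only)
toy = pd.DataFrame({
    "Timestamp": pd.date_range("2020-01-01", periods=120, freq="min"),
    "Weighted_Price": np.arange(120.0),
})

# Same call pattern as the notebook, on an hourly grid
hourly = toy.resample("60min", on="Timestamp")[["Weighted_Price"]].mean()
print(hourly)
```

Each output row is the mean of the 60 minute-level prices that fall inside that hour, and the `Timestamp` column becomes the index.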

Let's plot the whole panorama to visually understand what portion of the set must be removed.

In [None]:
pano = btc.copy() #We're going to use this later
fig = go.Figure()
fig.add_trace(go.Scatter(x=pano.index, y=pano['Weighted_Price'],name='Full history BTC price'))
fig.update_layout(showlegend=True,title="BTC price history",xaxis_title="Time",yaxis_title="Prices",font=dict(family="Courier New, monospace"))
fig.show()

As you can see above, a portion of the data at the beginning of the set contains null values. In addition, those early prices are far from the range BTC currently trades in. We need to get rid of them.

In [None]:
print('Starting date selected: ',btc.index[51000])
print('NaN values: ',btc.iloc[51000:].isna().sum())

In [None]:
btc = btc.iloc[51000:]
btc = btc.bfill() #Back-fill the remaining gaps (avoids in-place edits on a slice)
print('NaN values: ',btc.isna().sum())

In [None]:
print('New data points quantity: ',len(btc))

Let's see what the new dataset looks like once the null and close-to-zero values have been removed.

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=btc.index, y=btc['Weighted_Price'],name='BTC price'))
fig.update_layout(showlegend=True,title="BTC price history",xaxis_title="Time",yaxis_title="Prices",font=dict(family="Courier New, monospace"))
fig.show()

Now the chart fits the current BTC price reality better. Let's use those samples as our new dataset.

# Data Splitting
We're going to take the first 20% of the window as the test set and the remaining 80% as the training set.

In [None]:
data_for_us = btc.copy() #To be used later on Unsupervised Learning
training_start = int(len(btc) * 0.2)

train = btc.iloc[training_start:]
test = btc.iloc[:training_start]
print("Datasets' shapes: ",train.shape, test.shape)

# Data scaling

This stage is a prerequisite for training neural networks. If you skip it, your model may fail to converge. Note that the scaler is fitted on the training set only and then reused for the test set.
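
As a rough illustration of what min-max scaling computes, the sketch below applies x' = (x - min) / (max - min) by hand, which is what `MinMaxScaler` does with its default `feature_range=(0, 1)`. The helper names `fit_min_max` and `apply_min_max` and the prices are hypothetical. The minimum and maximum are learned from the training values only, so values outside the training range map outside [0, 1].

```python
import numpy as np

def fit_min_max(values):
    """Return the (min, max) pair learned from the training values."""
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()

def apply_min_max(values, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo)."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

train_prices = np.array([100.0, 150.0, 200.0])  # made-up numbers
test_prices = np.array([50.0, 250.0])           # outside the training range

lo, hi = fit_min_max(train_prices)
train_scaled = apply_min_max(train_prices, lo, hi)
test_scaled = apply_min_max(test_prices, lo, hi)
print(train_scaled)  # values inside [0, 1]
print(test_scaled)   # values below 0 and above 1
```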

In [None]:
scaler = MinMaxScaler().fit(train[['Weighted_Price']])

In [None]:
def scale_samples(data,column_name,scaler):
    data[column_name] = scaler.transform(data[[column_name]])
    return data

In [None]:
joblib.dump(scaler, 'scaler.gz')
scaler = joblib.load('scaler.gz')

In [None]:
train = scale_samples(train.copy(),train.columns[0],scaler)
train.head()

In [None]:
test = scale_samples(test.copy(),test.columns[0],scaler)
test.head()

# Sequences generation and dataset creation

In [None]:
def shift_samples(data,column_name,lookback=24):
    """This function takes a *data* dataframe and returns two numpy arrays: 
    - X corresponds to the same values but packed into n frames of *lookback* values each
    - Y corresponds to the sample shifted *lookback* steps to the future
    """
    data_x = []
    data_y = []
    for i in range(len(data) - int(lookback)):
        x_floats = np.array(data.iloc[i:i+lookback])
        y_floats = np.array(data.iloc[i+lookback])
        data_x.append(x_floats)
        data_y.append(y_floats)
    return np.array(data_x), np.array(data_y)
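
To make the window shapes concrete, here is a small sketch that mirrors the same sliding-window logic on a plain NumPy array. The helper `make_windows` is hypothetical and only meant for illustration; it is not a replacement for `shift_samples`.

```python
import numpy as np

def make_windows(series, lookback):
    """Hypothetical helper: build (X, y) pairs from a 1-D series.

    X[i] holds `lookback` consecutive values and y[i] is the value
    that immediately follows that window.
    """
    series = np.asarray(series, dtype=float).reshape(-1, 1)
    n = len(series) - lookback
    x = np.array([series[i:i + lookback] for i in range(n)])
    y = np.array([series[i + lookback] for i in range(n)])
    return x, y

x, y = make_windows(np.arange(6), lookback=3)
print(x.shape, y.shape)      # (3, 3, 1) (3, 1)
print(x[0].ravel(), y[0])    # [0. 1. 2.] [3.]
```

A series of length `n` yields `n - lookback` windows, which is why the arrays built later are 24 rows shorter than the data they come from.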

In [None]:
X_train, y_train = shift_samples(train[['Weighted_Price']],train.columns[0])
X_test, y_test = shift_samples(test[['Weighted_Price']], test.columns[0])
gc.collect()

In [None]:
print("Final datasets' shapes:")
print('X_train: '+str(X_train.shape)+', y_train: '+str(y_train.shape))
print('X_test: '+str(X_test.shape)+', y_test: '+str(y_test.shape))

In [None]:
tsteps = X_train.shape[1]
nfeatures = X_train.shape[2]

# Anomaly detectors' training
## LSTM Autoencoder Neural Network

This is the one we'll be using throughout this notebook.

In [None]:
#First model - LSTM Autoencoder for anomaly detections

detector = Sequential()
detector.add(layers.LSTM(128, input_shape=(tsteps, nfeatures),dropout=0.2))
detector.add(layers.Dropout(rate=0.5))
detector.add(layers.RepeatVector(tsteps))
detector.add(layers.LSTM(128, return_sequences=True,dropout=0.2))
detector.add(layers.Dropout(rate=0.5))
detector.add(layers.TimeDistributed(layers.Dense(nfeatures))) 

detector.compile(loss='mae', optimizer='adam')
detector.summary()

In [None]:
checkpoint = ModelCheckpoint("/kaggle/working/detector.hdf5", monitor='val_loss', verbose=1,save_best_only=True, mode='auto') #Checks every epoch by default; the deprecated period argument is not needed
history1 = detector.fit(X_train,X_train,epochs=50,batch_size=128,verbose=1,validation_split=0.1,callbacks=[checkpoint],shuffle=False) #An autoencoder is trained to reconstruct its own input

In [None]:
plt.plot(history1.history['loss'], label='Training Loss')
plt.plot(history1.history['val_loss'], label='Validation Loss')
plt.legend()

In [None]:
#Let's load the best model obtained during training
detector = load_model("detector.hdf5")
detector.evaluate(X_test, X_test) #Reconstruction error on the test windows

### Determining threshold for Autoencoder detector

In [None]:
X_train_pred = detector.predict(X_train)
loss_mae = np.mean(np.abs(X_train_pred - X_train), axis=1) #This is the formula to calculate MAE
sns.distplot(loss_mae, bins=100, kde=True)

In [None]:
X_test_pred = detector.predict(X_test)
loss_mae = np.mean(np.abs(X_test_pred - X_test), axis=1) 
sns.distplot(loss_mae, bins=100, kde=True)

As you can see in the charts above, reconstruction errors above 0.150 become unusual. Let's set that number as the threshold.
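
Reading the threshold off a histogram is a judgment call. As a complement, the sketch below shows one common alternative: taking a high percentile of the per-window reconstruction error. It is an illustration under stated assumptions, not the method used in this notebook. The arrays are synthetic stand-ins for `X_train` and `detector.predict(X_train)`, and the 99th percentile is an arbitrary choice you would tune.

```python
import numpy as np

def reconstruction_mae(original, reconstructed):
    """Per-window MAE over the time axis; result has shape (n_windows, n_features)."""
    return np.mean(np.abs(reconstructed - original), axis=1)

rng = np.random.default_rng(0)
windows = rng.random((1000, 24, 1))                    # stand-in for X_train
noisy = windows + rng.normal(0, 0.05, windows.shape)   # stand-in for detector.predict(X_train)

loss = reconstruction_mae(windows, noisy)
threshold_p99 = float(np.percentile(loss, 99))         # assumed cut-off, tune as needed
flags = loss[:, 0] > threshold_p99                     # True where a window looks unusual
print(loss.shape, threshold_p99, int(flags.sum()))
```

With a percentile cut-off, roughly 1% of the training windows are flagged by construction, so the choice trades false alarms against missed anomalies.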

In [None]:
threshold = 0.15

test_df = pd.DataFrame(test[tsteps:])
test_df['loss'] = loss_mae
test_df['threshold'] = threshold
test_df['anomaly'] = test_df.loss > test_df.threshold
test_df['Weighted_Price'] = test[tsteps:].Weighted_Price

### Plotting prices' anomalies

In [None]:
anomalies = test_df[test_df.anomaly == True]
anomalies.head()

In [None]:
yvals1 = scaler.inverse_transform(test[tsteps:][['Weighted_Price']])
yvals1 = yvals1.reshape(-1)

In [None]:
yvals2 = scaler.inverse_transform(anomalies[['Weighted_Price']])
yvals2 = yvals2.reshape(-1)

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=test[tsteps:].index, y=yvals1,mode='lines',name='BTC Price'))
fig.add_trace(go.Scatter(x=anomalies.index, y=yvals2,mode='markers',name='Anomaly'))
fig.update_layout(showlegend=True,title="BTC price anomalies",xaxis_title="Time",yaxis_title="Prices",font=dict(family="Courier New, monospace"))
fig.show()

In [None]:
scaled_pano = pd.concat([test, train]) #DataFrame.append was removed in pandas 2.0
X_shifted, y_shifted = shift_samples(scaled_pano[['Weighted_Price']], scaled_pano.columns[0])
print("Scaled pano datasets' shapes:")
print('X_shifted: '+str(X_shifted.shape)+', y_shifted: '+str(y_shifted.shape))

In [None]:
X_shifted_pred = detector.predict(X_shifted)
loss_mae = np.mean(np.abs(X_shifted_pred - X_shifted), axis=1)

In [None]:
non_scaled_pano = pano.iloc[51000:].bfill()
non_scaled_pano = non_scaled_pano[:-24].copy()

In [None]:
non_scaled_pano['loss_mae'] = loss_mae
non_scaled_pano['threshold'] = threshold
non_scaled_pano['anomaly'] = non_scaled_pano.loss_mae > non_scaled_pano.threshold
non_scaled_pano.head()

In [None]:
pano_outliers = non_scaled_pano[non_scaled_pano['anomaly'] == True]

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=non_scaled_pano.index, y=non_scaled_pano['Weighted_Price'].values,mode='lines',name='BTC Price'))
fig.add_trace(go.Scatter(x=pano_outliers.index, y=pano_outliers['Weighted_Price'].values,mode='markers',name='Anomaly'))
fig.update_layout(showlegend=True,title="BTC price anomalies - Autoencoder",xaxis_title="Time",yaxis_title="Prices",font=dict(family="Courier New, monospace"))
fig.show()

## Isolation forest model (Bonus)

In [None]:
# Preparing data to be passed to the model
outliers = pano.iloc[51000:].bfill()

# Training the model
isolation_detector = IsolationForest(n_estimators=150,random_state=0,contamination='auto')
isolation_detector.fit(outliers['Weighted_Price'].values.reshape(-1, 1))

In [None]:
data_ready = outliers['Weighted_Price'].values.reshape(-1, 1) #Score each observed price so every label lines up with its own timestamp
outlier = isolation_detector.predict(data_ready)

In [None]:
outliers['outlier'] = outlier
outliers.head()

## Plotting prices' anomalies

In [None]:
a = outliers.loc[outliers['outlier'] == -1] #IsolationForest labels anomalies as -1 and inliers as 1
fig = go.Figure()
fig.add_trace(go.Scatter(x=outliers['Weighted_Price'].index, y=outliers['Weighted_Price'].values,mode='lines',name='BTC Price'))
fig.add_trace(go.Scatter(x=a.index, y=a['Weighted_Price'].values,mode='markers',name='Anomaly',marker_symbol='x',marker_size=2))
fig.update_layout(showlegend=True,title="BTC price anomalies - IsolationForest",xaxis_title="Time",yaxis_title="Prices",font=dict(family="Courier New, monospace"))
fig.show()
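
When reading Isolation Forest output, keep the label convention in mind: scikit-learn's `IsolationForest.predict` returns `-1` for outliers and `1` for inliers. The sketch below checks this on synthetic data with one obvious outlier. The data is made up, and the hyperparameters simply copy the ones used above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 values around zero plus one extreme value at the end (illustrative only)
values = np.concatenate([rng.normal(0.0, 1.0, 200), [50.0]])
X = values.reshape(-1, 1)

forest = IsolationForest(n_estimators=150, random_state=0, contamination="auto").fit(X)
labels = forest.predict(X)

print(sorted(set(labels.tolist())))  # only -1 and 1 can appear
print(labels[-1])                    # label assigned to the extreme value
```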

## K-Means Clustering (Bonus)

In [None]:
# Preparing data to be passed to the model
outliers_k_means = pano.iloc[51000:].bfill()
kmeans = KMeans(n_clusters=2, random_state=0).fit(outliers_k_means['Weighted_Price'].values.reshape(-1, 1))
outlier_k_means = kmeans.predict(outliers_k_means['Weighted_Price'].values.reshape(-1, 1))
outliers_k_means['outlier'] = outlier_k_means
outliers_k_means.head()

In [None]:
a = outliers_k_means.loc[outliers_k_means['outlier'] == 1] #anomaly

fig = go.Figure()
fig.add_trace(go.Scatter(x=outliers_k_means['Weighted_Price'].index, y=outliers_k_means['Weighted_Price'].values,mode='lines',name='BTC Price'))
fig.add_trace(go.Scatter(x=a.index, y=a['Weighted_Price'].values,mode='markers',name='Anomaly',marker_symbol='x',marker_size=2))
fig.update_layout(showlegend=True,title="BTC price anomalies - KMeans",xaxis_title="Time",yaxis_title="Prices",font=dict(family="Courier New, monospace"))
fig.show()

As you can see by comparing the charts, the K-Means model achieved better results than IsolationForest and results very similar to those of the Autoencoder.

# Time-series forecasting models

Let's test a few models to determine which one fits the dataset better and delivers better results.

## LSTM Neural Network

In [None]:
#Second model - LSTM regressor for price predictions
regressor = Sequential()
regressor.add(layers.LSTM(256, activation='relu', return_sequences=True, input_shape=(tsteps, nfeatures),dropout=0.2))
regressor.add(layers.LSTM(256, activation='relu',dropout=0.2))
regressor.add(layers.Dense(1))

regressor.compile(loss='mse', optimizer='adam')
regressor.summary()

In [None]:
checkpoint = ModelCheckpoint("/kaggle/working/regressor.hdf5", monitor='val_loss', verbose=1,save_best_only=True, mode='auto') #Checks every epoch by default; the deprecated period argument is not needed
history2 = regressor.fit(X_train,y_train,epochs=30,batch_size=128,verbose=1,validation_data=(X_test, y_test),callbacks=[checkpoint],shuffle=False)

## Conv1D Neural Network

In [None]:
#Third model - Conv1D regressor for price prediction

regressor2 = Sequential()
regressor2.add(layers.Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(tsteps, nfeatures)))
regressor2.add(layers.Conv1D(filters=64, kernel_size=3, activation='relu'))
regressor2.add(layers.Dropout(0.5))
regressor2.add(layers.MaxPooling1D(pool_size=2))
regressor2.add(layers.Flatten())
regressor2.add(layers.Dense(50, activation='relu'))
regressor2.add(layers.Dense(1))

regressor2.compile(optimizer='adam', loss='mse')
regressor2.summary()

In [None]:
checkpoint = ModelCheckpoint("/kaggle/working/regressor2.hdf5", monitor='val_loss', verbose=1,save_best_only=True, mode='auto')
history3 = regressor2.fit(X_train,y_train,epochs=30,batch_size=128,verbose=1,validation_data=(X_test, y_test),callbacks=[checkpoint],shuffle=False) #Train the Conv1D model, not the LSTM one

# Neural networks' evaluation

In [None]:
plt.plot(history2.history['loss'], label='Training Loss')
plt.plot(history2.history['val_loss'], label='Validation Loss')
plt.legend()

In [None]:
plt.plot(history3.history['loss'], label='Training Loss')
plt.plot(history3.history['val_loss'], label='Validation Loss')
plt.legend()

In [None]:
regressor = load_model("regressor.hdf5")
regressor.evaluate(X_test, y_test)

In [None]:
regressor2 = load_model("regressor2.hdf5")
regressor2.evaluate(X_test, y_test)

In [None]:
test = regressor.predict(X_test[0].reshape(1,24,1))

As you can see above, the LSTM model delivers better results, so let's keep it. Now let's check that the output has the shape we expect. The model must return a single scalar for each sequence of 24 floating-point numbers:

In [None]:
test.shape

In [None]:
scaler.inverse_transform(test)

Great! Let's move on.

# Gathering crypto data from the API

## Getting current date and time

In [None]:
now = datetime.now(tz=timezone.utc)
past = int((now - timedelta(days=1)).timestamp()) #24 hours ago, as a Unix timestamp
current = int(now.timestamp()) #current time, as a Unix timestamp

print(past)
print(current)
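
The sketch below wraps the time-window computation in a small helper and uses a fixed reference time so the result is reproducible. The helper `unix_window` is hypothetical. It relies on `datetime.timestamp()`, which behaves the same on every platform, whereas the `'%s'` `strftime` directive is a C-library extension that Python does not document.

```python
from datetime import datetime, timedelta, timezone

def unix_window(now, days=1):
    """Return (start, end) as integer Unix timestamps covering the last `days` days."""
    start = now - timedelta(days=days)
    return int(start.timestamp()), int(now.timestamp())

# Fixed, timezone-aware reference time so the output does not change between runs
reference = datetime(2021, 1, 1, tzinfo=timezone.utc)
past, current = unix_window(reference)
print(past, current)  # 1609372800 1609459200
```

Passing a timezone-aware `datetime` matters here: `timestamp()` on a naive value would be interpreted in the machine's local time zone.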

## Connecting to Poloniex public API

In [None]:
# connect to poloniex's API
url = 'https://poloniex.com/public?command=returnChartData&currencyPair=USDT_BTC&start='+str(past)+'&end='+str(current)+'&period=300'
result = requests.get(url)
result = result.json()
print(result)
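
String concatenation works for this URL, but it is easy to mistype. As a sketch, the same query string can be assembled with the standard library's `urlencode`. The helper `build_chart_url` is hypothetical, the endpoint and parameter names are copied from the cell above, and whether that endpoint is still served is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://poloniex.com/public"  # endpoint taken from the cell above

def build_chart_url(start, end, pair="USDT_BTC", period=300):
    """Compose the returnChartData URL from its query parameters."""
    query = urlencode({
        "command": "returnChartData",
        "currencyPair": pair,
        "start": start,
        "end": end,
        "period": period,
    })
    return f"{BASE_URL}?{query}"

url = build_chart_url(1609372800, 1609459200)
print(url)
```

When the request is actually sent, `requests.get(url, timeout=10)` followed by `raise_for_status()` on the response makes a failed call surface as an exception instead of as malformed data further down.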

In [None]:
last_data = pd.DataFrame(result)

In [None]:
last_data

## Preprocessing API data

In [None]:
last_data['date'] = pd.to_datetime(last_data.date, unit='s') #To get date in readable format
last_data.head()

In [None]:
last_data = last_data[['date','weightedAverage']]
last_data.head()

In [None]:
last_data = last_data.resample('H', on='date')[['weightedAverage']].mean()

In [None]:
last_data = last_data[-24:]
unscaled = last_data.copy()
len(last_data)

In [None]:
last_data_scaled = scale_samples(last_data,last_data.columns[0],scaler)
last_data_scaled.head()

# Predicting on API data

## Implementing Neural Networks approach

In [None]:
predictions = regressor.predict(last_data_scaled.values.reshape(1,24,1))
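unscaled = unscaled.iloc[1:]
next_hour = unscaled.index[-1] + timedelta(hours=1) #Timestamp of the forecasted point
forecast = pd.DataFrame(scaler.inverse_transform(predictions)[0], index=[next_hour], columns=['weightedAverage'])
unscaled = pd.concat([unscaled, forecast]) #DataFrame.append was removed in pandas 2.0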
future_scaled = scale_samples(unscaled.copy(),unscaled.columns[0],scaler)
future_scaled_pred = detector.predict(future_scaled.values.reshape(1,24,1))
future_loss = np.mean(np.abs(future_scaled_pred - future_scaled.values.reshape(1,24,1)), axis=1)
unscaled['threshold'] = threshold 
unscaled['loss'] = future_loss[0][0]
unscaled['anomaly'] = unscaled.loss > threshold
unscaled.head()

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=unscaled.index, y=unscaled.weightedAverage.values,mode='lines',name='BTC Price'))
fig.add_trace(go.Scatter(x=unscaled[unscaled['anomaly']].index, y=unscaled[unscaled['anomaly']]['weightedAverage'].values,mode='markers',marker_symbol='x',marker_size=10,name='Anomaly')) #x and y must come from the same filtered rows
fig.add_vrect(x0=unscaled.index[-2], x1=unscaled.index[-1],fillcolor="LightSalmon", opacity=1,layer="below", line_width=0)
fig.update_layout(showlegend=True,title="BTC price predictions and anomalies",xaxis_title="Time (UTC)",yaxis_title="Prices",font=dict(family="Courier New, monospace"))

fig.show()

## Detecting outliers with classic Unsupervised Learning models

### Isolation Forest

In [None]:
anomalies_24h = unscaled['weightedAverage'].values.reshape(-1, 1) #Score the observed prices so every label lines up with its own timestamp
outlier = isolation_detector.predict(anomalies_24h)
unscaled['outlier'] = outlier
unscaled.head()

In [None]:
print('Anomalies in prediction: ',len(unscaled[unscaled['outlier'] == -1])) #IsolationForest labels anomalies as -1

### KMeans

In [None]:
outlier_k_means = kmeans.predict(unscaled['weightedAverage'].values.reshape(-1, 1))
unscaled['outlier'] = outlier_k_means
unscaled.head()

In [None]:
print('Anomalies in prediction: ',len(unscaled[unscaled['outlier'] == 1]))

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=unscaled.index, y=unscaled.weightedAverage.values,mode='lines',name='BTC Price'))
fig.add_trace(go.Scatter(x=unscaled[unscaled['outlier']==1].index, y=unscaled[unscaled['outlier']==1]['weightedAverage'].values,mode='markers',marker_symbol='x',marker_size=10,name='Anomaly')) #x and y must come from the same filtered rows
fig.add_vrect(x0=unscaled.index[-2], x1=unscaled.index[-1],fillcolor="LightSalmon", opacity=1,layer="below", line_width=0)
fig.update_layout(showlegend=True,title="BTC price predictions and anomalies",xaxis_title="Time (UTC)",yaxis_title="Prices",font=dict(family="Courier New, monospace"))

fig.show()

I hope this notebook has been useful to you! Thanks a lot.