<div>
    <center>
        <h1>
            <font color="#902e59">CQF June 2023 Intake: Final Project</font>
        </h1>
    </center>
    <center>
        <h3>Deep Learning for Asset Prediction</h3>
    </center>
    <center>
        <h5>Will Colgate, Singapore, January 2024</h5>
    </center>
</div>


### Problem Statement
The objective is to produce a model that can predict positive moves using Long Short-Term Memory (LSTM) networks.

I have chosen Ethereum (ETH) as the ticker to analyse (technically a pair with USD). Crypto markets are notoriously volatile and it seems like a decent challenge to try and tease some insight out of the mess.

For this purpose, I will aim to predict an hourly positive return. Defining a positive return is discussed in more detail as part of the labels section. This will be a binary classification problem with accuracy as the main metric used to measure the effectiveness of the model. The baseline to test the effectiveness fo the model against would be a random guess (i.e. a 50% chance of being correct) and the baseline for a strategy based on this prediction would be to buy on day one and hold until the end of the test period.

In [1]:
# Imports
from src.config import *

# Base
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# Feature engineering
import pandas_ta as ta

# Warnings
import warnings
warnings.filterwarnings("ignore")

### Data Collection

##### Raw Data

Using `download.py`, I have downloaded two years worth of hourly ticker data. This script wraps a simple function that pulls data from `yfinance`. The data is saved locally in "ETH-USD_2y_1h.csv" for convenience.

`yfinance` has a restriction on the amount of hourly data that can be downloaded and restricts this to 730 days of data (i.e. 2 years). Given crypto markets never close, this amounts to 17k+ data points. As a general rule of thumb, 5 years of daily data would be required to predict daily returns. On a normal security, this would only be approximately 1,300 data points. Therefore, 2 years of hourly data should be more than sufficient for this problem. In fact, the amount of data may need to be reduced due to hardware constraints.

In [2]:
# Collect the data into a dataframe
csv = 'data/raw/ETH-USD_2y_1h.csv'
df = pd.read_csv(csv, index_col='Datetime', parse_dates=True)
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Open,17485.0,1940.922,672.844,902.475952,1559.277222,1784.905029,2040.799,4205.823
High,17485.0,1947.399,675.6417,921.278198,1564.480469,1790.515625,2048.271,4227.112
Low,17485.0,1934.211,669.7479,896.300049,1553.950684,1778.494019,2032.429,4156.187
Close,17485.0,1940.835,672.6981,896.575623,1559.341064,1784.716431,2040.379,4204.396
Volume,17485.0,163365700.0,4769001000.0,0.0,0.0,0.0,121633800.0,627337500000.0
Dividends,17485.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Stock Splits,17485.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The above looks fairly standard for data from `yfinance`. Immediately we can drop Dividends and Stock Splits given this is a crypto token and the entries are all zero in the data.

There appear to be large range of volumes (including nil volumes).

In [3]:
# Looking at the zero volume entries
df['Volume'][df['Volume'] == 0].count()

9100

The above suggests the volume data is unreliable given it is unlikely there were 375 days (i.e. 9,100 hours) where not a single transactions was registered. 

Therefore, I looked for an alternative datasource and found [www.cryptodatadownload.com]() which has hourly data for Gemini, a fairly reputable exchange. I can't speak for the source of the data and it's accuracy but the Gemini API itself only serves data for the past two months, which is unlikely to be enough. For the sake of this experiment, I will use the csv data downloaded from the website. In production, more reliable data straight from the exchange should be procured before making any investment decisions with real money.

The hourly data is expressed in New York East Coast time.

In [4]:
df = pd.read_csv('data/raw/Gemini_ETHUSD_1h.csv', skiprows=[0])
# Parse unix timestamp as UTC dates
df['unix'] = pd.to_datetime(df['unix'], unit='ms', utc=True)
df.set_index(df['unix'], inplace=True)
# Drop date, symbol and Volume USD columns
df.drop(['unix', 'date', 'symbol', 'Volume USD'], axis=1, inplace=True)
# Rename
df.rename(columns={'Volume ETH': 'volume'}, inplace=True)
# Sort date ascending
df.sort_index(inplace=True)
# Take the last 2.5 years to account for losses NaN when feature engineering begins
df = df['2021-6-13':]
df.head()

Unnamed: 0_level_0,open,high,low,close,volume
unix,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2021-06-13 00:00:00+00:00,2370.8,2395.82,2353.41,2390.15,571.740206
2021-06-13 01:00:00+00:00,2390.15,2419.83,2389.32,2412.34,320.737513
2021-06-13 02:00:00+00:00,2412.34,2416.56,2380.05,2384.52,198.350042
2021-06-13 03:00:00+00:00,2384.52,2384.85,2315.0,2336.26,3260.305701
2021-06-13 04:00:00+00:00,2336.26,2347.56,2326.2,2342.47,384.432689


In [5]:
# Plotting closing price and volume
# plot_price_vol(df, 'close', 'volume')

The new data looks much more reasonable and seems to include more data points for volatility. There is also much more data availably, which should hopefully help with training the model. Again, I may need to remove some of this data due to hardware limitations. I will use this going forwards.

The crypto market is notoriously emotion driven. Even glancing at social maedia or news outlets allows a person to gain a sense of how this is true. It follows then that some kind of sentiment regarding this emotional investing would potentially give some interesting insight into the problem statement. There is an interesting resource updated daily on [alternative.me](https://alternative.me/crypto/fear-and-greed-index/) called the fear and greed index. 

The index takes a weighted approach to a number of factors across 5 (formally, 6) datasources. A numerical value is assigned which falls into categories of:

- Extreme Fear
- Fear
- Neutral
- Greed
- Extreme Greed

The index is updated daily at 00:00 UTC.

In [6]:
# Load fear and greed data
df_fear_greed = pd.read_csv('data/raw/crypto_greed_fear_index.csv', parse_dates=True, index_col='timestamp')
# Drop unneeded columns
df_fear_greed.drop(['time_until_update', 'timestamp.1'], axis=1, inplace=True)
# Rename columns
df_fear_greed.columns = ['fg_value', 'fg_value_classification']
# Put classification to lower case
df_fear_greed['fg_value_classification'] = df_fear_greed['fg_value_classification'].str.lower()
df_fear_greed.head()

Unnamed: 0_level_0,fg_value,fg_value_classification
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-12-23 00:00:00+00:00,70,greed
2023-12-22 00:00:00+00:00,74,greed
2023-12-21 00:00:00+00:00,70,greed
2023-12-20 00:00:00+00:00,74,greed
2023-12-19 00:00:00+00:00,73,greed


In [7]:
# Join the sentiment data to price data
df = df.join(df_fear_greed)
df['fg_value'].ffill(inplace=True)
df['fg_value_classification'].ffill(inplace=True)
df.tail()

Unnamed: 0_level_0,open,high,low,close,volume,fg_value,fg_value_classification
unix,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2023-12-12 19:00:00+00:00,2187.81,2188.03,2172.59,2176.29,103.032324,67.0,greed
2023-12-12 20:00:00+00:00,2176.29,2187.75,2170.26,2186.1,262.450524,67.0,greed
2023-12-12 21:00:00+00:00,2186.1,2190.25,2166.69,2172.23,173.656678,67.0,greed
2023-12-12 22:00:00+00:00,2172.23,2190.22,2170.76,2187.34,93.143372,67.0,greed
2023-12-12 23:00:00+00:00,2187.34,2205.3,2187.34,2202.01,193.550211,67.0,greed


The above table is a sample of the final raw data that will be used for the remainder of the analyis.

##### Feature Engineering



In [8]:
# Create the technical indicators
df.ta.study(cores=0)

In [9]:
# Add hours, days, months to investigate seasonality
df['hour'] = df.index.hour
df['day_of_week'] = df.index.day
df['month'] = df.index.month

In [10]:
# Checking where there are null values for more than a thousand rows and discarding column
cols_to_discard = df.isnull().sum()[df.isnull().sum() > 1000].index.values
df.drop(cols_to_discard, axis=1, inplace=True)
df.dropna(axis=0, inplace=True)
df.isnull().sum().sum()

0

##### Labelling the Data

Given the problem statement is to predict an hourly positive return. A practical approach to predicting a positve return for these purposes would be any net reuturn (i.e. after transaction costs).

[Here](https://www.gemini.com/fees/api-fee-schedule) are the fees from the Gemini exchange for reference. The taker fee at the lowest volume per month is 0.4%. To account for interest on margin, I will round this up to 0.5% as an estimate.

Therefore, a label of 1 will mean that that the upward return in the next hour will be greater than 0.05% and 0 otherwise. Mathamatically:

$$ y_t =
  \begin{cases}
    1      & \quad \text{if  } r_{t+1} > 0.005\\
    0  & \quad \text{otherwise}
  \end{cases}
$$

In [11]:
# Adding the labels
df['label'] = np.where(df['LOGRET_1'].shift(-1) > 0.005, 1, 0)
df.dtypes.to_csv('sdsds.csv')

In [12]:
# One Hot Encoding of categorical data
encoder = OneHotEncoder(sparse_output=False)
onehot = encoder.fit_transform(df[['fg_value_classification', 'hour', 'day_of_week', 'month']])
feature_names = encoder.get_feature_names_out()
df[feature_names] = onehot
df.drop(['fg_value_classification', 'hour', 'day_of_week', 'month'], axis=1, inplace=True)
# Save train and test datasets in csv in relevant folders

In [13]:
# Train test split
y = df['label']
X = df.drop('label', axis=1)
# Train test split. Shuffle is false to retain time series attributes 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=False)

A few observations from the pairplot:

- The return does not seem to have any seasonality to it over the past 2 years, with the trend line indicating corrolation close to (if not exactly) zero.
- The fear and greed index, although having minimal correlation, appear to exhibit heteroskedasticity, indicating that as the greed index rises, the volatility in return tends to fall. The greed index could be a good indiactior of volatility in the market.
- There are a few outliers for volume, potentially indicaing it is a good candidate for a robust scaler.
- Volume tends to drop as the index indicates more greediness.
- The closing price tends to fall throughout the year. However, there are only 2 of each month in the dataset so this is unlikely to be indicitive. Potentially need to drop this point.
- As expected, returns exhibit stochastic characteristics and look relative normal in their distribution.
- There is a significant class imbalance that will need to be addressed.

Given the pairplot has a lot of variables, I have also used unifold manifold approximation & projection (UMAP) below to examine any potential realtionships in the data. 

In [14]:
# import umap
# import umap.plot
# from sklearn.preprocessing import MinMaxScaler

# reducer = umap.UMAP(n_neighbors=50, random_state=42, min_dist=0.1, n_components=3)
# scaled_X = MinMaxScaler().fit_transform(X)
# embedder = reducer.fit_transform(scaled_X)

In [15]:
# import plotly.express as px
# fig = px.scatter_3d(X, x=embedder[:, 0], y=embedder[:, 1], z=embedder[:, 2], color=y, hover_data=X.columns, opacity=0.75, width=600, title='UMAP')
# fig.update(layout_coloraxis_showscale=False)
# fig.show()

UMAP does show some clustering of data, suggesting it could be used for dimensionality reduction although it dows not seem to give much of an indication on the return label. This will be further explored later in this report. 

### Data Cleaning

Just looking at the data, it seems fairly complete but I will do some basic checks to confirm this is the case.

In [16]:
# Checking for null values
df.isnull().sum() > 0

open        False
high        False
low         False
close       False
volume      False
            ...  
month_8     False
month_9     False
month_10    False
month_11    False
month_12    False
Length: 392, dtype: bool

As expected, the data is clean and we can start with feature engineering. 

### Data Transformation
##### Feature Creation

In `feature.py`, I have created a script that calculates various technical indicators. These features are well documented online but I have included a summary pdf for easy reference.

##### Feature Scaling

### Deep Learning Model


### Model Validation


### Backtesting

