# AC109 Project Modeling Results: Predicting the returns on Cryptocurrencies

by Ali Dastjerdi, Angelina Massa, Sachin Mathur & Nate Stein

### Supporting Libraries

We moved some of the supporting code into separate modules in the main directory so that this notebook can focus on presenting results. The supporting modules are:
- `crypto_utils.py` contains the code we used to scrape and clean data from CoinMarketCap (coinmarketcap.com). It also contains the code that wrangles/preprocesses that data (saved in CSV files) into our design matrix. Keeping the construction of the design matrix in its own `.py` file also let us write unit tests that check the resulting figures against hand-calculated values, which became increasingly important as we engineered more involved features.
- `crypto_models.py` contains the code we used to iterate over multiple classification models and summarize the results in tabular form.

In [1]:
import crypto_utils as cryp
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.model_selection as model_selection
import sklearn.metrics as metrics
import time

from crypto_utils import fmt_date, print_update

In [2]:
# Custom output options.

np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.precision', 4)
sns.set_style('white')
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['savefig.bbox'] = 'tight'
plt.rcParams['savefig.pad_inches'] = 0.05
%matplotlib inline

In [3]:
RAND_STATE = 88

## Construct Design Matrix

We want the construction of the design matrix to be flexible enough that we can easily change which features are included, which cryptocurrency's price return we forecast, etc.

In [4]:
def get_data(x_cryptos, y_crypto, test_size, kwargs):
    design = cryp.DesignMatrix(x_cryptos=x_cryptos, y_crypto=y_crypto, **kwargs)
    X, Y = design.get_data(lag_indicator=True)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, Y, test_size=test_size, random_state=RAND_STATE)
    return X_train, X_test, y_train, y_test

In [5]:
crypto_scope = ['ltc', 'xrp', 'xlm', 'eth', 'btc']

# Store x cryptocurrencies and y crypto (the one we're forecasting)
# in list of tuples.
xy_crypto_pairs = []
for y_crypto in crypto_scope:
    x_cryptos = [c for c in crypto_scope if c != y_crypto]
    xy_crypto_pairs.append((x_cryptos, y_crypto))
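
As a quick check of the pairing logic above, the same leave-one-out construction can be written as a comprehension and printed. This is a standalone sketch that redefines `crypto_scope` so it runs on its own; the name `pairs` is used here only for illustration.

```python
crypto_scope = ['ltc', 'xrp', 'xlm', 'eth', 'btc']

# Each target coin is paired with the remaining four coins as predictors.
pairs = [([c for c in crypto_scope if c != y], y) for y in crypto_scope]

for x_cryptos, y_crypto in pairs:
    print('{} <- {}'.format(y_crypto, x_cryptos))
```

The last pair forecasts `btc` from the other four coins, which is the combination used in the later single-target cells.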

# Modeling: Regression

In [6]:
import scipy.stats as stats
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

In [7]:
N_CROSSVAL = 3
TEST_SIZE = 0.2

### Baseline Model

In [8]:
def evaluate_baseline_model(x_cryptos, y_crypto, kwargs):
    """Return MAE on test set."""
    X_train, X_test, y_train, y_test = get_data(x_cryptos, y_crypto, TEST_SIZE,
                                                kwargs)
    lr = LinearRegression().fit(X_train, y_train)
    mae = metrics.mean_absolute_error(y_test, lr.predict(X_test))
    return mae

#### Determine optimal rolling window for measuring changes in price and volume

Ultimately we want to determine which `n_rolling_volume`, `n_rolling_price` and `n_std_window` to use going forward, since these choices will influence our more advanced features.
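
The internals of `DesignMatrix` are not shown in this notebook, so the sketch below is only one plausible reading of the three window parameters: a percent change measured over `n_rolling_*` periods, then standardized against its trailing `n_std_window` observations. The helper `rolling_standardized_change` and the toy price/volume data are hypothetical and are not taken from `crypto_utils.py`.

```python
import numpy as np
import pandas as pd


def rolling_standardized_change(series, n_rolling, n_std_window):
    """Hypothetical helper, not the DesignMatrix implementation."""
    # Percent change over the last `n_rolling` periods.
    change = series.pct_change(periods=n_rolling)
    # Standardize each change against its trailing window (z-score).
    window = change.rolling(n_std_window)
    return (change - window.mean()) / window.std()


# Toy data: a random-walk price and random volume, for illustration only.
rng = np.random.default_rng(0)
n_days = 30
toy = pd.DataFrame({
    'price': 100 * np.exp(np.cumsum(rng.normal(0, 0.02, n_days))),
    'volume': rng.uniform(1e6, 2e6, n_days),
}, index=pd.date_range('2017-01-01', periods=n_days, freq='D'))

features = pd.DataFrame({
    'px_std': rolling_standardized_change(toy['price'], 1, 10),
    'volume_std': rolling_standardized_change(toy['volume'], 1, 10),
}).dropna()
print(features.head())
```

Under this reading, larger windows consume more leading observations before the first usable row, which is one reason the window lengths are worth tuning.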

In [9]:
def find_optimal_rolling_periods():
    """Iterates over many different rolling period windows and evaluates 
    MAE on test set.
    
    Notes: Takes ~18min to run.
    """
    rows = []  # One dict per (parameter combination, target crypto).

    params = {'n_rolling_price':None, 'n_rolling_volume':None,
              'x_assets':[], 'n_std_window':None}

    n_rolling_prices = range(1, 5)
    n_rolling_volumes = range(1, 5)
    n_std_windows = range(5, 60, 5)
    
    combo_total = len(n_rolling_prices) * len(n_rolling_volumes) * len(n_std_windows)
    combo_count = 0
    
    t0 = time.time()
    for n_price in n_rolling_prices:
        for n_vol in n_rolling_volumes:
            for n_std in n_std_windows:
                combo_count += 1
                print_update('Trying param combination {}/{}...'.format(
                    combo_count, combo_total))
                params['n_rolling_price'] = n_price
                params['n_rolling_volume'] = n_vol
                params['n_std_window'] = n_std
                for x_cryps, y_cryp in xy_crypto_pairs:
                    mae = evaluate_baseline_model(x_cryps, y_cryp, params)
                    rows.append({'y': y_cryp, 'mae': mae,
                                 'n_rolling_price': n_price,
                                 'n_rolling_volume': n_vol,
                                 'n_std_window': n_std})
    print_update('Finished all parameter combinations in {:.2f} seconds.'.format(
        time.time() - t0))
    # DataFrame.append was removed in pandas 2.0, so build the frame once.
    return pd.DataFrame(rows, columns=['y', 'mae', 'n_rolling_price', 
                                       'n_rolling_volume', 'n_std_window'])

In [10]:
# avg_results = df_results.groupby(['n_rolling_price', 'n_rolling_volume', 'n_std_window']).mean()

After iterating over many rolling window options in `find_optimal_rolling_periods()`, we can determine that the optimal parameters are:
- `n_rolling_price`: 1
- `n_rolling_volume`: 1
- `n_std_window`: 10
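
One way to pick such a combination from the results frame is to average MAE across the target cryptocurrencies for each parameter combination and take the minimum. The sketch below assumes that selection rule and uses a tiny stand-in for the frame returned by `find_optimal_rolling_periods()`; the MAE values in it are made up purely for illustration.

```python
import pandas as pd

# Toy stand-in for the results frame; these MAE values are NOT real results.
df_results = pd.DataFrame({
    'y': ['btc', 'eth', 'btc', 'eth', 'btc', 'eth'],
    'n_rolling_price': [1, 1, 2, 2, 1, 1],
    'n_rolling_volume': [1, 1, 1, 1, 2, 2],
    'n_std_window': [10, 10, 10, 10, 10, 10],
    'mae': [0.030, 0.050, 0.034, 0.052, 0.031, 0.051],
})

param_cols = ['n_rolling_price', 'n_rolling_volume', 'n_std_window']

# Average MAE over the target cryptos for each parameter combination.
avg_mae = df_results.groupby(param_cols)['mae'].mean()

# idxmin returns the index tuple of the combination with the lowest mean MAE.
best = dict(zip(param_cols, avg_mae.idxmin()))
print(best)
```

Averaging treats every target coin equally; a different rule (for example, the best combination per coin) would be an equally reasonable choice.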

In [11]:
PARAMS = {'n_rolling_price':1, 'n_rolling_volume':1,
          'x_assets':[], 'n_std_window':10}

### Introduce Regularization

In [12]:
def evaluate_lasso(x_cryptos, y_crypto):
    """Fits LassoCV and returns the test-set MAE, the alpha chosen by 
    cross-validation, the fitted model and the training features.
    """
    X_train, X_test, y_train, y_test = get_data(x_cryptos, y_crypto, TEST_SIZE, 
                                                PARAMS)
    lasso = LassoCV(n_alphas=100, cv=N_CROSSVAL, random_state=RAND_STATE)
    lasso.fit(X_train, y_train)
    mae = metrics.mean_absolute_error(y_test, lasso.predict(X_test))
    return mae, lasso.alpha_, lasso, X_train

In [13]:
# Collect rows first; DataFrame.append was removed in pandas 2.0.
lasso_rows = []
for x_cryps, y_cryp in xy_crypto_pairs:
    mae, alpha, _, _ = evaluate_lasso(x_cryps, y_cryp)
    lasso_rows.append({'y': y_cryp, 'mae': mae, 'alpha': alpha})
df_lasso = pd.DataFrame(lasso_rows, columns=['y', 'mae', 'alpha'])

In [14]:
display(df_lasso)

| | y | mae | alpha |
|---|---|---|---|
| 0 | ltc | 0.0368 | 0.0058 |
| 1 | xrp | 0.0434 | 0.0415 |
| 2 | xlm | 0.0627 | 0.0101 |
| 3 | eth | 0.05 | 0.0066 |
| 4 | btc | 0.0282 | 0.0049 |


In [15]:
def get_features_df(lasso, X_train):
    df = pd.DataFrame(columns=['coeff', 'weight'])
    df['coeff'] = X_train.columns.tolist()
    df['weight'] = lasso.coef_
    df.sort_values('weight', ascending=False, inplace=True)
    df.set_index('coeff', inplace=True, drop=True)
    return df

In [16]:
# See what weights are assigned to features.

_, _, lasso, X_train = evaluate_lasso(['ltc', 'xrp', 'xlm', 'eth'], 'btc')
feature_weights = get_features_df(lasso, X_train)
display(feature_weights)

| coeff | weight |
|---|---|
| ltc_px_std | 0.0 |
| ltc_volume_std | 0.0 |
| xrp_px_std | -0.0 |
| xrp_volume_std | -0.0 |
| xlm_px_std | -0.0 |
| xlm_volume_std | -0.0 |
| eth_px_std | -0.0 |
| eth_volume_std | 0.0 |
| btc_px_std | 0.0 |
| btc_volume_std | 0.0 |
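
When every Lasso weight is exactly zero, as in the table above, the fitted model predicts a single constant (its intercept, the mean of the training target) for every observation. The sketch below reproduces that behaviour on synthetic noise; the data and the fixed `alpha` are assumptions chosen for illustration and have nothing to do with the notebook's design matrix.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(0.0, 0.03, size=200)  # Pure noise, unrelated to X.

# A penalty that is large relative to the signal shrinks all weights to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
preds = lasso.predict(X[:3])

print('coefficients:', lasso.coef_)
print('intercept:   ', lasso.intercept_)
print('predictions: ', preds)
```

In that situation the reported MAE is simply the error of always guessing the training mean, so it serves as a floor that any feature-based model has to beat.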


### XGBRegressor

In [17]:
X_train, X_test, y_train, y_test = get_data(['ltc', 'xrp', 'xlm', 'eth'], 
                                            'btc', TEST_SIZE, PARAMS)

In [18]:
def build_xgb_model(X_train, y_train):
    """Run a randomized search over a hyperparameter space and return the 
    best model as judged by 3-fold cross-validation on the training data.
    """
    # Define hyperparam space.
    expon_distr = stats.expon(0, 50)
    cv_params = {
        'n_estimators': stats.randint(4, 100),
        'max_depth': stats.randint(2, 100),
        'learning_rate': stats.uniform(0.05, 0.95),
        'gamma': stats.uniform(0, 10),
        'reg_alpha': expon_distr,
        'min_child_weight': expon_distr
    }

    # Iterate over hyperparam space.
    xgb = XGBRegressor(n_jobs=-1)  # n_jobs=-1 => use all available cores
    
    print_update('Tuning XGBRegressor hyperparams...')
    t0 = time.time()
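    # No `scoring` is passed, so candidates are ranked by the estimator's
    # default score (R^2) rather than by MAE.
    gs = RandomizedSearchCV(xgb, cv_params, n_iter=400, n_jobs=1, 
                            cv=N_CROSSVAL, random_state=RAND_STATE)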
    gs.fit(X_train, y_train)
    print_update('Finished tuning XGBRegressor in {:.0f} secs.'.format(
        time.time() - t0))
    
    return gs.best_estimator_

In [19]:
xgb = build_xgb_model(X_train, y_train)

Finished tuning XGBRegressor in 24 secs.

In [20]:
mae_xgb = metrics.mean_absolute_error(y_test, xgb.predict(X_test))
print('XGBRegressor MAE: {:.2%}'.format(mae_xgb))

XGBRegressor MAE: 2.82%


# Modeling: Classification

In [21]:
import create_models

In [22]:
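def get_classification_data(x_cryptos, y_crypto, thresh=0.01):
    # Take the cryptos as arguments instead of relying on the globals
    # left over from the pairing loop in cell 5.
    design = cryp.DesignMatrix(x_cryptos=x_cryptos, y_crypto=y_crypto, **PARAMS)
    X, Y = design.get_data(lag_indicator=True, y_category=True,
                           y_category_thresh=thresh)
    return model_selection.train_test_split(X, Y, test_size=TEST_SIZE, 
                                            random_state=RAND_STATE)

In [23]:
# Forecast btc from the other four cryptocurrencies.
X_train, X_test, y_train, y_test = get_classification_data(
    ['ltc', 'xrp', 'xlm', 'eth'], 'btc')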

In [24]:
buy_count = len(np.where(y_train == 1)[0])
sell_count = len(np.where(y_train == -1)[0])
hold_count = len(np.where(y_train == 0)[0])
total = y_train.shape[0]
print('Training classification breakdown:')
print('\tBuy: {0} ({1:.0%})'.format(buy_count, buy_count/total))
print('\tSell: {0} ({1:.0%})'.format(sell_count, sell_count/total))
print('\tNeutral: {0} ({1:.0%})'.format(hold_count, hold_count/total))

Training classification breakdown:
	Buy: 284 (37%)
	Sell: 171 (22%)
	Neutral: 312 (41%)
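
The exact rule behind `y_category_thresh` lives in `crypto_utils.py` and is not shown here. A minimal sketch, assuming a symmetric rule (return above `thresh` is a buy, below `-thresh` is a sell, anything else is neutral), is given below together with the majority-class share implied by the training counts printed above. The function `label_returns` and the toy returns are hypothetical.

```python
import numpy as np


def label_returns(returns, thresh=0.01):
    """Hypothetical labeling rule: 1 = buy, -1 = sell, 0 = neutral."""
    returns = np.asarray(returns, dtype=float)
    labels = np.zeros(returns.shape, dtype=int)
    labels[returns > thresh] = 1
    labels[returns < -thresh] = -1
    return labels


toy_returns = [0.03, -0.02, 0.005, -0.004, 0.011]
labels = label_returns(toy_returns)
print(labels)

# Training counts reported in the breakdown above.
counts = {'buy': 284, 'sell': 171, 'neutral': 312}
majority_share = max(counts.values()) / sum(counts.values())
print('Majority-class share: {:.1%}'.format(majority_share))
```

The majority-class share is a useful reference point: a classifier that always predicts the largest training class would be right about that often on the training data.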


In [26]:
clf_perf = create_models.traditional_models(X_train, y_train, X_test, 
                                            y_test, pos_label=[1])

[ 0.     0.016  0.016  0.024  0.024  0.04   0.04   0.064  0.064  0.08   0.08
  0.096  0.096  0.112  0.112  0.144  0.144  0.152  0.152  0.16   0.16
  0.192  0.192  0.208  0.208  0.264  0.264  0.272  0.272  0.296  0.296
  0.304  0.304  0.32   0.32   0.336  0.336  0.344  0.344  0.368  0.368
  0.384  0.384  0.392  0.392  0.416  0.416  0.432  0.432  0.448  0.448
  0.496  0.496  0.504  0.504  0.536  0.536  0.544  0.544  0.592  0.592  0.6
  0.6    0.608  0.608  0.616  0.616  0.632  0.632  0.64   0.64   0.68   0.68
  0.704  0.704  0.712  0.712  0.744  0.744  0.752  0.752  0.768  0.768  0.8
  0.8    0.816  0.816  0.84   0.84   0.848  0.848  0.864  0.864  0.888
  0.888  0.936  0.936  0.944  0.944  0.96   0.96   0.976  0.976  0.992
  0.992  1.   ]


In [None]:
# from sklearn.linear_model import LogisticRegressionCV

# cvals = [1e-20, 1e-15, 1e-10, 1e-5, 1e-3, 1e-1, 1, 10, 100, 10000, 100000]
# logregcv = LogisticRegressionCV(Cs=cvals, cv=5)
# logregcv.fit(X_train, y_train)
# yhat = logregcv.predict(X_test)
# logreg_acc = metrics.accuracy_score(y_test, yhat)
# fpr_log, tpr_log, thresholds = metrics.roc_curve(
#           y_test, logregcv.predict_proba(X_test)[:, 1], pos_label=[1])
# logreg_auc = metrics.auc(fpr_log, tpr_log)

In [27]:
display(clf_perf)

| Model | AUC | Accuracy |
|---|---|---|
| LogReg | 0.5124 | 0.2917 |
| KNN | 0.5349 | 0.3333 |
| LDA | 0.479 | 0.3646 |
| QDA | 0.4165 | 0.375 |
| RandomForest | 0.4965 | 0.3125 |
| ADABoost | 0.522 | 0.3542 |
| SVM | 0.4968 | 0.3542 |