## Rakeshwer's Master Thesis 2022: 

### "Time Series Prediction with ANNs on non-linear and non-stationary Time Series"

**Note:** Former working title was slightly different: "Predicting Regime Shifts for non-linear and non-stationary Time Series with Artificial Neural Networks".

**Research Questions:**

- **Q0:** Set the benchmark: How do **naive heuristics** perform? First try a naive buy&hold strategy, which means always being long. Its accuracy equals the relative share of UP movements. Then, in addition to naive buy&hold, take the majority vote of the last *k* days' movements as predictor for the next day. What accuracy do we achieve with that? And do we already outperform the naive buy&hold approach? Get baselines for train / validation / test data separately for both heuristics and all four targets.
- **Q1:** Can we predict trends in stock price movements from their own history alone? (open, high, low, close, range, traded shares, observed volatility, moving averages, RSI, MACD, Bollinger limits) -> We expect to fail and get only poor accuracy (around 50%). These are all **technical features**.
- **Q2:** Do **external features** help to improve accuracy? (mostly macroeconomic indices: GDP, inflation expectation, interest rates,...)
- **Q3:** If we still get poor results, why and where do our models fail? Do we find similarities in terms of patterns or regimes in which our model performs better or worse? (e.g. high volatility regime,...) Start with analysing the samples with the highest loss (deviation of prediction from true target), then apply (subspace-)clustering in input feature space.
- **Q4:** Eventually we want to make use of our models and try to set up a trading model that beats the market (= naive buy-and-hold). Plot cumulative outperformance, cumulative gain from naive buy-and-hold and closing price history (abs values) to recognize in which market situations our model performs best.
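
The quantities named in Q4 can be sketched as follows. This is only a minimal illustration under assumptions: the source does not specify trading rules, so a simple long/flat strategy is assumed here, and `cumulative_gain`, `rel_chg` and `signal` are hypothetical names with made-up toy values.

```python
import numpy as np

def cumulative_gain(daily_rel_chg, position):
    """Compound daily relative changes, weighted by the position held that day.

    position: 1 = long, 0 = flat (out of the market). Assumed long/flat rule.
    """
    daily_rel_chg = np.asarray(daily_rel_chg, dtype=float)
    position = np.asarray(position, dtype=float)
    return np.cumprod(1.0 + position * daily_rel_chg) - 1.0

# Hypothetical toy data: daily relative changes and a made-up long/flat signal.
rel_chg = np.array([0.01, -0.02, 0.015, -0.005, 0.02])
signal = np.array([1, 0, 1, 0, 1])

buy_hold = cumulative_gain(rel_chg, np.ones_like(rel_chg))  # always long
strategy = cumulative_gain(rel_chg, signal)                 # long only when signal == 1
outperformance = strategy - buy_hold                        # cumulative outperformance
```

The three curves `buy_hold`, `strategy` and `outperformance` could then be plotted over time next to the absolute closing price.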

**Outline:**

Here we **load data from a *csv* file**, preprocessed in Excel, including technical AND external features plus targets. **Note:** We start the S&P500 data history on Jan 7th, 1985, since from there on we have all values available. The limiting factor is the Baltic Dry Index. Without it we could start in April 1982, since from there on we find open/high/low/close/vol on [Yahoo Finance](https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC).

We then tackle **Q0** to **set the benchmark with naive heuristics**.

### Tech Preamble

In [1]:
import sys
import datetime
import numpy as np
from os.path import join
import matplotlib.pyplot as plt

### Import data

In [2]:
# Define path to data folder:
path_to_data = "../data"

# Import data from csv file:
data = np.genfromtxt(join(path_to_data,'SP500_InputFeatures_Targets.csv'), delimiter=';', skip_header=1)

In [4]:
# Check dimensions:
print("data shape (samples, features): ", data.shape)
print("\nfirst row: \n", data[0])

print("\nfeatures [column number]:")
print("=========================")
print("[0] - year")
print("[1] - open (abs)")
print("[2] - high (abs)")
print("[3] - low (abs)")
print("[4] - close (abs)")
print("[5] - close (rel chg)")
print("[6] - vol (abs)") # number of traded shares
print("[7] - vol (rel chg)")
print("[8] - range (abs)") # high - low
print("[9] - range (rel)") # range (abs) / close (abs)
print("[10] - target 1d (rel chg)")
print("[11] - target 5d (rel chg)")
print("[12] - target 10d (rel chg)")
print("[13] - target 20d (rel chg)")
print("[14] - close (abs) moving average 20d")
print("[15] - close (abs) moving average 50d")
print("[16] - close (abs) moving average 100d")
print("[17] - close (abs) moving average 200d")
print("[18] - dist (rel) of close (abs) to moving average 20d")
print("[19] - dist (rel) of close (abs) to moving average 50d")
print("[20] - dist (rel) of close (abs) to moving average 100d")
print("[21] - dist (rel) of close (abs) to moving average 200d")
print("[22] - RSI with 14d lookback")
print("[23] - Bollinger dist (rel) of close (abs) to its moving average 20d - expressed in std devs")
print("[24] - Bollinger short signal")
print("[25] - Bollinger long signal")
print("[26] - Combined Bollinger short/long signal")
print("[27] - Consumer Confidence")
print("[28] - Inflation rate YoY")
print("[29] - Inflation rate MoM")
print("[30] - Inflation expectation (short term)")
print("[31] - Consumer Price Index (abs)")
print("[32] - Consumer Price Index (rel chg)")
print("[33] - Producer Prices change YoY")
print("[34] - GDP growth rate QoQ")
print("[35] - Unemployment rate (abs)")
print("[36] - Unemployment rate (rel chg)")
print("[37] - FED Funds rate (abs)")
print("[38] - FED Funds rate (change)")
print("[39] - M0 money supply (abs)")
print("[40] - M0 money supply (rel chg)")
print("[41] - 10y US Bond yield (abs)")
print("[42] - 10y US Bond yield (change)")
print("[43] - Oil price (abs)")
print("[44] - Oil price (rel chg)")
print("[45] - Business Confidence")
print("[46] - Philly FED Index")
print("[47] - Federal Debt relative to GDP")
print("[48] - Baltic Dry Index (abs)")
print("[49] - Baltic Dry Index (rel chg)")
print("[50] - Dist (rel) of volume (abs) to its moving average 20d - expressed in std devs")

data shape (samples, features):  (9406, 51)

first row: 
 [ 1.98500000e+03  1.63680000e+02  1.64710000e+02  1.63680000e+02
  1.64240000e+02  3.42131000e-03  8.61900000e+07  1.12416107e-01
  1.03000000e+00  6.27131000e-03 -1.52216300e-03  3.81758400e-02
  6.69142720e-02  9.80881640e-02  1.64820000e+02  1.65369600e+02
  1.65455200e+02  1.60508200e+02 -3.51899000e-03 -6.83076000e-03
 -7.34458600e-03  2.32499020e-02  5.77415074e-01 -3.02335523e-01
  0.00000000e+00  0.00000000e+00  0.00000000e+00  9.29000000e+01
  3.90000000e-02  2.00000000e-03  3.30000000e-02  1.05300000e+02
  7.65550200e-03  1.66000000e-02  3.30000000e-02  7.30000000e-02
  1.38888890e-02  8.25000000e-02 -5.00000000e-03  2.04800000e+05
  2.09371880e-02  1.14700000e-01 -1.80000000e-03  2.55600000e+01
  1.50913420e-02  5.06000000e+01  2.10000000e+00  3.96000000e-01
  9.98000000e+02 -2.00000000e-03 -2.95289790e-02]

features [column number]:
[0] - year
[1] - open (abs)
[2] - high (abs)
[3] - low (abs)
[4] - close (abs)
[5] - 

### Description of features and targets

When importing the csv data with Pandas, one finds the following column names:

- [0] **year** - Derived from date
- [1] **open_abs** - S&P500 index data from YahooFinance, covering 21.02.1984 to 27.05.2022
- [2] **high_abs** - S&P500 index data from YahooFinance, covering 21.02.1984 to 27.05.2022
- [3] **low_abs** - S&P500 index data from YahooFinance, covering 21.02.1984 to 27.05.2022
- [4] **close_abs** - S&P500 index data from YahooFinance, covering 21.02.1984 to 27.05.2022
- [5] **close_rel_chg** - Relative change compared to previous day
- [6] **vol_abs** - S&P500 index data from YahooFinance, covering 21.02.1984 to 27.05.2022
- [7] **vol_rel_chg** - Relative change compared to previous day
- [8] **range_abs** - Difference between day's high and low
- [9] **range_rel** - The calculated absolute range divided by day's closing price
- [10] **1d_rel_chg** - Target: Relative change for the next day (on closing price)
- [11] **5d_rel_chg** - Target: Relative change for the next 5 days (on closing price)
- [12] **10d_rel_chg** - Target: Relative change for the next 10 days (on closing price)
- [13] **20d_rel_chg** - Target: Relative change for the next 20 days (on closing price)
- [14] **20d_mvg_avg** - Moving average of absolute closing price for the last 20 days (including current day)
- [15] **50d_mvg_avg** - Moving average of absolute closing price for the last 50 days (including current day)
- [16] **100d_mvg_avg** - Moving average of absolute closing price for the last 100 days (including current day)
- [17] **200d_mvg_avg** - Moving average of absolute closing price for the last 200 days (including current day)
- [18] **rel_dist_close_ma20** - Relative distance of current day's closing price to moving average of last 20 days (including current day)
- [19] **rel_dist_close_ma50** - Relative distance of current day's closing price to moving average of last 50 days (including current day)
- [20] **rel_dist_close_ma100** - Relative distance of current day's closing price to moving average of last 100 days (including current day)
- [21] **rel_dist_close_ma200** - Relative distance of current day's closing price to moving average of last 200 days (including current day)
- [22] **RSI_14d** - Relative strength index (as e.g. described on https://www.investopedia.com/terms/r/rsi.asp) with 14 days lookback period.
- [23] **Boll_rel_dist_k** - Relative distance of absolute closing price to its moving average (over 20 days) expressed in number of standard deviations k (over 20 days)
- [24] **Boll_short** - Binary signal: If closing price rises above upper Bollinger limit, set to 1, else zero.
- [25] **Boll_long** - Binary signal: If closing price falls below lower Bollinger limit, set to 1, else zero.
- [26] **Boll_short_long** - Combines binary signals "Boll_short" and "Boll_long" into ONE signal with THREE states: -1 if Boll_short=1, +1 if Boll_long=1, 0 else.
- [27] **Consumer_Confidence** - Index value in points, downloaded from tradingeconomics.com, source: University of Michigan, frequency: monthly
- [28] **Inflation_YoY** - Inflation as change of prices compared to previous year: Downloaded from tradingeconomics.com, source: US Bureau of Labor Statistics, frequency: monthly
- [29] **Inflation_MoM** - Inflation as change of prices relative to previous month: Downloaded from tradingeconomics.com, source: US Bureau of Labor Statistics, frequency: monthly
- [30] **Inflation_exp** - Short-term Inflation expectation: Downloaded from tradingeconomics.com, source: University of Michigan, frequency: monthly
- [31] **CPI_abs** - Consumer Price Index: Index value in points, downloaded from tradingeconomics.com, source: US Bureau of Labor Statistics, frequency: monthly
- [32] **CPI_rel_chg** - Consumer Price Index change relative to previous month: Calculated from CPI_abs, keeping value constant until new absolute CPI value is reported.
- [33] **PP_YoY** - Producer prices change relative to previous year's same month: Downloaded from tradingeconomics.com, source: US Bureau of Labor Statistics, frequency: monthly
- [34] **GDP_growth_rate** - Gross Domestic Product of US relative change to previous quarter. Downloaded from tradingeconomics.com, source: US Bureau of Economic Analysis, frequency: quarterly
- [35] **Unemployment_rate_abs** - Unemployment rate in %. Downloaded from tradingeconomics.com, source: US Bureau of Labor Statistics, frequency: monthly.
- [36] **Unemployment_rate_rel_chg** - Unemployment rate change relative to previous month: Calculated from absolute Unemployment rate, keeping value constant until new absolute value is reported. Downloaded from tradingeconomics.com, source: US Bureau of Labor Statistics, frequency: monthly.
- [37] **FED_interest_rates_abs** - FED Funds rate set by the Federal Reserve, central bank of US: Downloaded from tradingeconomics.com, source: Federal Reserve, frequency: daily.
- [38] **FED_interest_rates_change** - Absolute change in FED Funds rate set by the Federal Reserve, keeping value constant until new absolute value is reported. Source: Federal Reserve, frequency: daily.
- [39] **M0_abs** - Money supply in Million USD: Downloaded from tradingeconomics.com, source: Federal Reserve, frequency: monthly.
- [40] **M0_rel_chg** - Money supply change relative to previous month: Calculated from M0_abs, keeping value constant until new absolute M0 value is reported, source: Federal Reserve, frequency: monthly.
- [41] **10y_Bond_yield_abs** - US Treasury bonds' yield for maturity in 10 years. Downloaded from tradingeconomics.com, source: US Department of the Treasury, frequency: daily.
- [42] **10y_Bond_yield_abs** - Absolute change in 10y Bond yield compared to previous day, source: US Department of the Treasury, frequency: daily.
- [43] **Crude_Oil_abs** - Oil price in USD. Downloaded from tradingeconomics.com, frequency: daily.
- [44] **Crude_Oil_rel_chg** - Relative change in oil price compared to previous day, frequency: daily.
- [45] **Business_Confidence** - United States ISM Purchasing Managers Index (PMI). Index value in points, downloaded from tradingeconomics.com, source: US Institute for Supply Management (ISM), frequency: monthly.
- [46] **Philly_Fed** - Philadelphia Fed Manufacturing Index. Index value in points, downloaded from tradingeconomics.com, source: Federal Reserve Bank of Philadelphia, frequency: monthly.
- [47] **Debt_to_GDP** - United States Gross Federal Debt to GDP. Relative value, downloaded from tradingeconomics.com, source: Office of Management and Budget, The White House, frequency: yearly.
- [48] **Baltic_Dry_abs** - Baltic Dry Index. Index value (daily close) in points, downloaded from tradingeconomics.com, source: Baltic Exchange in London, frequency: daily.
- [49] **Baltic_Dry_rel_chg** - Relative change of Baltic Dry Index compared to previous day, frequency: daily.
- [50] **val_rel_dist_k** - Relative distance of absolute volume to its moving average (over 20 days) expressed in number of standard deviations k (over 20 days).
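
The features in this notebook were precomputed in Excel, so the exact formulas are not shown here. As a rough sketch of how the moving-average and Bollinger columns described above could be derived from closing prices, consider the following. It rests on several assumptions: band limits at `n_std = 2.0` standard deviations, the sample standard deviation (pandas default), and relative distance defined as `close / ma - 1`. The function name and the columns `mvg_avg` and `rel_dist_close_ma` are hypothetical.

```python
import numpy as np
import pandas as pd

def add_technical_features(close, window=20, n_std=2.0):
    """Derive moving-average and Bollinger-style features from closing prices."""
    close = pd.Series(close, dtype=float)
    out = pd.DataFrame({"close_abs": close})

    # Relative change compared to the previous day.
    out["close_rel_chg"] = close / close.shift(1) - 1.0

    # Target: relative change over the next day, aligned with the current day.
    out["1d_rel_chg"] = close.shift(-1) / close - 1.0

    # Moving average and standard deviation over the last `window` days,
    # including the current day.
    ma = close.rolling(window).mean()
    sd = close.rolling(window).std()
    out["mvg_avg"] = ma
    out["rel_dist_close_ma"] = close / ma - 1.0   # assumed definition

    # Distance to the moving average in units of standard deviations k.
    k = (close - ma) / sd
    out["Boll_rel_dist_k"] = k

    # Binary signals: short above the upper limit, long below the lower limit.
    out["Boll_short"] = (k > n_std).astype(int)
    out["Boll_long"] = (k < -n_std).astype(int)
    # Combined signal: -1 if short, +1 if long, 0 otherwise.
    out["Boll_short_long"] = out["Boll_long"] - out["Boll_short"]
    return out

# Toy usage with a short window so that both signals fire.
features = add_technical_features([10, 10, 10, 13, 10, 6], window=3, n_std=1.0)
```

Rows without a full look-back window contain NaN for the rolling columns and therefore produce a signal of 0.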

In [4]:
# Extract year for separating data into train / validation / test sets
year = data[:,0]

## Extract close (rel chg) as input feature.
# [5] - close (rel chg)

inputs = data[:,(5,)]

## Extract target values for 1d / 5d / 10d / 20d as rel chg:
# [10] - target 1d (rel chg)
# [11] - target 5d (rel chg)
# [12] - target 10d (rel chg)
# [13] - target 20d (rel chg)

targets = data[:,10:14]

# Check dimensions:
print("year shape (samples): ", year.shape)
print("inputs shape (samples, input features): ", inputs.shape)
print("targets shape (samples, output features): ", targets.shape)

year shape (samples):  (9406,)
inputs shape (samples, input features):  (9406, 1)
targets shape (samples, output features):  (9406, 4)


In [5]:
## Split inputs and targets into train / validation / test sets, according to year:

# Train data: 1985 .. 2009
# Val data:   2010 .. 2019
# Test data:  2020 .. end

train_input = inputs[year <= 2009]
val_input = inputs[(year >= 2010) & (year < 2020)]
test_input = inputs[year >= 2020]

train_target = targets[year <= 2009]
val_target = targets[(year >= 2010) & (year < 2020)]
test_target = targets[year >= 2020]

# Convert to binary targets (up = 1, down = 0):
train_target_bin = train_target>0
val_target_bin = val_target>0
test_target_bin = test_target>0

# Check dimensions:
print("train_input shape (samples, input features): ", train_input.shape)
print("val_input shape (samples, input features): ", val_input.shape)
print("test_input shape (samples, input features): ", test_input.shape)
print("\ntrain_target shape (samples, features): ", train_target.shape)
print("val_target shape (samples, features): ", val_target.shape)
print("test_target shape (samples, features): ", test_target.shape)
print("\ntrain_target_bin shape (samples, features): ", train_target_bin.shape)
print("val_target_bin shape (samples, features): ", val_target_bin.shape)
print("test_target_bin shape (samples, features): ", test_target_bin.shape)

train_input shape (samples, input features):  (6303, 1)
val_input shape (samples, input features):  (2516, 1)
test_input shape (samples, input features):  (587, 1)

train_target shape (samples, features):  (6303, 4)
val_target shape (samples, features):  (2516, 4)
test_target shape (samples, features):  (587, 4)

train_target_bin shape (samples, features):  (6303, 4)
val_target_bin shape (samples, features):  (2516, 4)
test_target_bin shape (samples, features):  (587, 4)


In [6]:
## Look at heuristics to set benchmark:

# Get "naive" accuracy, when we always predict an UP movement.
# In other words this reflects the relative amount of UP movements.
train_acc_naiv = np.sum(train_target_bin, axis=0) / len(train_target)
val_acc_naiv = np.sum(val_target_bin, axis=0) / len(val_target)
test_acc_naiv = np.sum(test_target_bin, axis=0) / len(test_target)

print("naive acc (%) for target [1d  5d  10d  20d]")
print("===========================================")
print("naive train acc (%)", np.round(train_acc_naiv * 100,1), "%")
print("naive val acc (%)", np.round(val_acc_naiv * 100,1), "%")
print("naive test acc (%)", np.round(test_acc_naiv * 100,1), "%")

naive acc (%) for target [1d  5d  10d  20d]
naive train acc (%) [53.6 56.8 58.6 61.8] %
naive val acc (%) [54.8 60.7 63.9 67.2] %
naive test acc (%) [55.  60.1 63.9 68.3] %
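
The naive accuracy above is simply the share of positive targets per column. A small sketch on made-up toy values makes two details explicit: a zero change counts as DOWN because of the strict `> 0`, and always predicting DOWN yields the complementary accuracy. `constant_predictor_accuracy` and `toy_targets` are hypothetical names.

```python
import numpy as np

def constant_predictor_accuracy(target_rel_chg, predict_up=True):
    """Accuracy of always predicting UP (or always DOWN) for each target column."""
    is_up = np.asarray(target_rel_chg) > 0   # a zero change counts as DOWN
    share_up = is_up.mean(axis=0)
    return share_up if predict_up else 1.0 - share_up

# Made-up toy targets: 4 samples, 2 target columns.
toy_targets = np.array([[0.01, 0.03],
                        [-0.02, 0.01],
                        [0.00, -0.01],
                        [0.02, 0.04]])

acc_up = constant_predictor_accuracy(toy_targets)
acc_down = constant_predictor_accuracy(toy_targets, predict_up=False)
```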


In [7]:
## Try another heuristic: Use majority vote of last k days binary close (rel chg) as predictor.

# Define function to split time series 'sequence' into overlapping windows of length 'n_steps'
def split_sequence(sequence, n_steps):
    X = list()
    for i in range(len(sequence)):
        # Find the end of this pattern
        end_ix = i + n_steps
        # Check if we are beyond the sequence
        if end_ix > len(sequence):
            break
        # Collect the window of n_steps consecutive entries
        seq_x = sequence[i:end_ix]
        X.append(seq_x)
    return np.array(X)
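
The windowing in `split_sequence` can also be written without an explicit Python loop. The following is a minimal sketch, assuming NumPy 1.20 or later (where `sliding_window_view` is available); `split_sequence_vectorized` is a hypothetical name that is not used elsewhere in this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def split_sequence_vectorized(sequence, n_steps):
    """Return all overlapping windows of length n_steps along axis 0."""
    sequence = np.asarray(sequence)
    # sliding_window_view appends the window axis last: (windows, features, n_steps).
    windows = sliding_window_view(sequence, window_shape=n_steps, axis=0)
    # Move the window axis next to the sample axis: (windows, n_steps, features).
    return np.moveaxis(windows, -1, 1)

toy = np.arange(5, dtype=float).reshape(-1, 1)    # shape (5, 1)
toy_windows = split_sequence_vectorized(toy, 3)   # shape (3, 3, 1)
```

Note that the result is a read-only view on the input array; call `.copy()` on it if the windows need to be modified.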

In [8]:
# Define function to get accuracy from majority vote

def acc_majority_vote(k=1, verbose=False):
    
    # Split inputs into sequences of specified k:
    train_input_split = split_sequence(train_input, k)
    val_input_split = split_sequence(val_input, k)
    test_input_split = split_sequence(test_input, k)

    # Convert to binary inputs (up = 1, down = 0):
    train_input_bin_split = train_input_split>0
    val_input_bin_split = val_input_split>0
    test_input_bin_split = test_input_split>0

    # Adjust targets: Cut first (k - 1) entries
    train_target_cut = train_target[k-1:]
    val_target_cut = val_target[k-1:]
    test_target_cut = test_target[k-1:]

    # Adjust binary targets: Cut first (k - 1) entries
    train_target_bin_cut = train_target_bin[k-1:]
    val_target_bin_cut = val_target_bin[k-1:]
    test_target_bin_cut = test_target_bin[k-1:]

    # Check dimensions, if desired:
    if verbose:
        print("train_input shape AFTER splitting (samples, timesteps, input features): ", train_input_split.shape)
        print("val_input shape AFTER splitting (samples, timesteps, input features): ", val_input_split.shape)
        print("test_input shape AFTER splitting (samples, timesteps, input features): ", test_input_split.shape)
        print("\ntrain_input_bin shape AFTER splitting (samples, timesteps, input features): ", train_input_bin_split.shape)
        print("val_input_bin shape AFTER splitting (samples, timesteps, input features): ", val_input_bin_split.shape)
        print("test_input_bin shape AFTER splitting (samples, timesteps, input features): ", test_input_bin_split.shape)
        print("\ntrain_target shape AFTER cutting (samples, output features): ", train_target_cut.shape)
        print("val_target shape AFTER cutting (samples, output features): ", val_target_cut.shape)
        print("test_target shape AFTER cutting (samples, output features): ", test_target_cut.shape)
        print("\ntrain_target_bin shape AFTER cutting (samples, output features): ", train_target_bin_cut.shape)
        print("val_target_bin shape AFTER cutting (samples, output features): ", val_target_bin_cut.shape)
        print("test_target_bin shape AFTER cutting (samples, output features): ", test_target_bin_cut.shape)
        
    # Get majority vote from binary input samples with length k as predictor:
    # First sum the binary inputs over all k days (up=1, down=0).
    # Then divide that sum by the number of days k.
    # Finally check if the result is greater than 0.5, meaning that the majority pointed UP.
    train_pred = (np.sum(train_input_bin_split,axis=1)/k > 0.5)
    val_pred = (np.sum(val_input_bin_split,axis=1)/k > 0.5)
    test_pred = (np.sum(test_input_bin_split,axis=1)/k > 0.5)

    # Check dimensions, if desired:
    if verbose:
        print("train_pred shape: ", train_pred.shape)
        print("val_pred shape: ", val_pred.shape)
        print("test_pred shape: ", test_pred.shape)
        
    # Get accuracy as correctly predicted UP or DOWN movements:
    train_acc_majority = np.sum((train_pred == train_target_bin_cut), axis=0) / len(train_target_bin_cut)
    val_acc_majority = np.sum((val_pred == val_target_bin_cut), axis=0) / len(val_target_bin_cut)
    test_acc_majority = np.sum((test_pred == test_target_bin_cut), axis=0) / len(test_target_bin_cut)
    
    # Print resulting accuracy:
    print("train acc (%)", np.round(train_acc_majority * 100,1), "%")
    print("val acc (%)", np.round(val_acc_majority * 100,1), "%")
    print("test acc (%)", np.round(test_acc_majority * 100,1), "%")

In [9]:
## Get accuracy from majority vote as predictor: 

# Loop over odd look-back periods k = 1, 3, 5, 7, 9 days (odd k avoids ties in the vote).
for k in range(1,11,2):
    
    # Print header:
    print("Look back k =",k," days, acc (%) from majority vote for target [1d  5d  10d  20d]")
    print("===============================================================================")
    
    # Call function on k:
    acc_majority_vote(k=k)
    print("\n")

Look back k = 1  days, acc (%) from majority vote for target [1d  5d  10d  20d]
train acc (%) [49.1 50.1 49.2 50.7] %
val acc (%) [49.  50.4 52.7 51.8] %
test acc (%) [45.8 48.9 52.6 50.3] %


Look back k = 3  days, acc (%) from majority vote for target [1d  5d  10d  20d]
train acc (%) [49.  49.8 49.8 51.2] %
val acc (%) [50.  51.4 53.1 52.7] %
test acc (%) [49.6 49.1 54.2 53.3] %


Look back k = 5  days, acc (%) from majority vote for target [1d  5d  10d  20d]
train acc (%) [48.6 49.5 50.5 51.1] %
val acc (%) [50.5 52.4 54.2 54.4] %
test acc (%) [50.1 51.5 57.3 55.1] %


Look back k = 7  days, acc (%) from majority vote for target [1d  5d  10d  20d]
train acc (%) [48.5 49.6 50.5 50.8] %
val acc (%) [51.  53.8 55.  54.5] %
test acc (%) [49.7 53.4 58.9 54.2] %


Look back k = 9  days, acc (%) from majority vote for target [1d  5d  10d  20d]
train acc (%) [48.  50.4 50.8 51.5] %
val acc (%) [51.4 53.9 55.3 54.3] %
test acc (%) [52.3 54.7 59.2 55.6] %
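
Only odd look-back periods are evaluated above. The toy sketch below illustrates why that matters: with the strict `> 0.5` threshold used in `acc_majority_vote`, an even *k* can produce ties, and a tie is resolved as DOWN. `majority_vote` and `toy_chg` are hypothetical names with made-up values.

```python
import numpy as np

def majority_vote(rel_chg, k):
    """Predict UP (True) if more than half of the last k changes were positive."""
    ups = np.asarray(rel_chg) > 0
    # Count UP days in every window of k consecutive days.
    counts = np.convolve(ups.astype(int), np.ones(k, dtype=int), mode="valid")
    return counts / k > 0.5

toy_chg = np.array([0.01, -0.01, 0.02, 0.01, -0.03, -0.01])

pred_k3 = majority_vote(toy_chg, 3)   # odd k: no ties possible
pred_k2 = majority_vote(toy_chg, 2)   # even k: one UP and one DOWN day is a tie
```

With `k = 2`, every window containing exactly one UP day yields a vote of 0.5 and is therefore predicted as DOWN.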




### Discussion on Q0: Get benchmark accuracy from heuristics

To evaluate our models' performance later on, we need a benchmark to compare our results to. The question was: How do **naive heuristics** perform? We tried two heuristics:

- Naive buy&hold strategy, which means always being long. Its accuracy equals the relative share of UP movements.
- In addition to naive buy&hold, a majority vote over the binary movements of the last *k* days as predictor for the next day(s).

In the long run, stock markets tend to go up. This is reflected in a higher number of days with UP movements. For target 1d we find the relative share of UP movements to be 53.6% / 54.8% / 55.0% for train / validation / test data.

The longer the target horizon, the more likely UP movements become. Accordingly, the accuracy of our naive buy&hold strategy, i.e. the relative share of *positive* target values, increases to 61.8% / 67.2% / 68.3% for train / validation / test data for target 20d.

We then looked at accuracies from majority votes looking 1 / 3 / 5 / 7 / 9 days back. On validation and test data, accuracy tends to improve the longer we look back. On train data it stays between 48.0% and 51.5% for all *k*, and for target 1d it even drops slightly from 49.1% to 48.0%. In no case does the majority vote reach the naive buy&hold accuracy: even where it comes closest (test data, target 1d, *k* = 9), it achieves 52.3% against 55.0%. Overall, **our naive buy&hold strategy already sets a very good benchmark**.
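
The comparison between both heuristics can be made explicit by subtracting the naive accuracy from the best majority-vote accuracy over all *k*. The sketch below only re-uses the accuracies printed in the cells above; the dictionary names `naive`, `majority` and `gaps` are hypothetical.

```python
import numpy as np

# Accuracies (%) copied from the outputs above, columns = targets [1d, 5d, 10d, 20d].
naive = {
    "train": [53.6, 56.8, 58.6, 61.8],
    "val":   [54.8, 60.7, 63.9, 67.2],
    "test":  [55.0, 60.1, 63.9, 68.3],
}

# Majority vote, rows = look-back k in [1, 3, 5, 7, 9].
majority = {
    "train": [[49.1, 50.1, 49.2, 50.7],
              [49.0, 49.8, 49.8, 51.2],
              [48.6, 49.5, 50.5, 51.1],
              [48.5, 49.6, 50.5, 50.8],
              [48.0, 50.4, 50.8, 51.5]],
    "val":   [[49.0, 50.4, 52.7, 51.8],
              [50.0, 51.4, 53.1, 52.7],
              [50.5, 52.4, 54.2, 54.4],
              [51.0, 53.8, 55.0, 54.5],
              [51.4, 53.9, 55.3, 54.3]],
    "test":  [[45.8, 48.9, 52.6, 50.3],
              [49.6, 49.1, 54.2, 53.3],
              [50.1, 51.5, 57.3, 55.1],
              [49.7, 53.4, 58.9, 54.2],
              [52.3, 54.7, 59.2, 55.6]],
}

gaps = {}
for split in naive:
    # Best majority-vote accuracy over all k, per target.
    best = np.max(np.array(majority[split]), axis=0)
    # Gap to naive buy&hold in percentage points (negative = worse than naive).
    gaps[split] = np.round(best - np.array(naive[split]), 1)
    print(split, "best majority vote minus naive (pp):", gaps[split])
```

All gaps are negative, which is why naive buy&hold serves as the benchmark to beat.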