# bao-yue-rao
Source article:
- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180944

Source article raw data:
- https://figshare.com/articles/Raw_Data/5028110

In [198]:
# Data Cleaning & Loading

In [199]:
# Imports (External)
import numpy as np
import pandas as pd
import datetime as dt

from collections import OrderedDict
import xlrd
import xlsxwriter

# Visualization/plotting imports
import matplotlib as mpl
import matplotlib.pyplot as plt

# Machine learning imports
import sklearn
import tensorflow as tf
import keras

import pywt
from scipy import signal

In [200]:
# Imports (Internal)

In [201]:
# Load the Excel file and map each sheet name to its dataframe (ordered dict)
raw_xlsx_file = pd.ExcelFile("data/raw_data.xlsx")
dict_dataframes = pd.read_excel(raw_xlsx_file, sheet_name=None)

In [202]:
# Alternate method to map multiple excel sheets from file and access via dataframe/ordereddict

#sheet_to_df_map = pd.read_excel(raw_xlsx_file, sheet_name=None)
#sheet_to_df_map['HangSeng Index Data']

In [203]:
type(dict_dataframes)

collections.OrderedDict

In [204]:
# Convert ordered dict of dataframes to a regular dict
dict_dataframes = dict(dict_dataframes)
type(dict_dataframes)

dict

In [205]:
# Convert all sheet names/dict keys to lowercase using list comprehension 
    # Source: https://stackoverflow.com/a/38572808
dict_dataframes = {k.lower(): v for k, v in dict_dataframes.items()}

# Print number of sheets in raw_data
print("Number of sheets: ",len(dict_dataframes),"\n")
print("\n".join(list(dict_dataframes.keys())))
#print(raw_xlsx_file.sheet_names)

Number of sheets:  12 

hangseng index data
hangseng index future data
s&p500 index data
s&p500 index future data
csi300 index data
csi300 index future data
djia index data
djia index future data
nikkei 225 index data
nikkei 225 index future data
nifty 50 index data
nifty 50 index future data


Table 6. Profitability performance of each model (Panel A, Panel B, Panel C)
![table%206%20profitability%20non%20highlighted.PNG](attachment:table%206%20profitability%20non%20highlighted.PNG)

Panel A, Developing Market
- CSI 300 Index
- Nifty 50 Index

Panel B, Relatively Developed Market
- Hang Seng Index
- Nikkei 225 Index

Panel C, Developed Market
- S&P 500 Index
- DJIA Index
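
The panel grouping above can be written down as a small lookup so that later cells can select sheets per panel. This is only a sketch: `PANELS`, `SHEET_NAMES`, and `sheets_for_panel` are hypothetical names, and matching sheets by prefix is an assumption based on the lowercased sheet names printed earlier.

```python
# Lowercased sheet names, in the order printed by the notebook
SHEET_NAMES = [
    "hangseng index data", "hangseng index future data",
    "s&p500 index data", "s&p500 index future data",
    "csi300 index data", "csi300 index future data",
    "djia index data", "djia index future data",
    "nikkei 225 index data", "nikkei 225 index future data",
    "nifty 50 index data", "nifty 50 index future data",
]

# Table 6 panels mapped to sheet-name prefixes (prefix matching is an assumption)
PANELS = {
    "Panel A": ["csi300", "nifty 50"],      # developing market
    "Panel B": ["hangseng", "nikkei 225"],  # relatively developed market
    "Panel C": ["s&p500", "djia"],          # developed market
}


def sheets_for_panel(panel, sheet_names):
    """Return the sheet names of one panel, grouped index by index."""
    return [
        name
        for prefix in PANELS[panel]
        for name in sheet_names
        if name.startswith(prefix)
    ]


for panel in PANELS:
    print(panel, sheets_for_panel(panel, SHEET_NAMES))
```

Concatenating the three results gives the same ordering as the `key_order` list used later in the notebook.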

In [206]:
# Panel A, Developing Market


In [207]:
# Panel B, Relatively Developed Market


In [208]:
# Panel C, Developed Market

In [209]:
# Rename all dataframe column headers in each dataframe in dict_dataframes to lowercase
for item in dict_dataframes:
    dict_dataframes[item].columns = map(str.lower, dict_dataframes[item].columns)

In [210]:
# Reorder keys to match the Panel A/B/C format, then convert the dict back to an OrderedDict
    # Source: https://stackoverflow.com/a/46447976

key_order = ['csi300 index data',
'csi300 index future data',
'nifty 50 index data',
'nifty 50 index future data',
'hangseng index data',
'hangseng index future data',
'nikkei 225 index data',
'nikkei 225 index future data',
's&p500 index data',
's&p500 index future data',
'djia index data',
'djia index future data',
]
list_of_tuples = [(key, dict_dataframes[key]) for key in key_order]
dict_dataframes = OrderedDict(list_of_tuples)

In [211]:
for item in dict_dataframes:
    print("=======================================")
    print(item, "\n")
    # info() prints its own summary (row count, column count, dtypes) and
    # returns None, so wrapping it in print() also emits a trailing "None"
    print(dict_dataframes[item].info(verbose=False))

csi300 index data 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2010 entries, 0 to 2009
Columns: 20 entries, time to wvad
dtypes: float64(19), int64(1)
memory usage: 314.1 KB
None
csi300 index future data 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 3 entries, num_time to close
dtypes: float64(1), int64(2)
memory usage: 34.3 KB
None
nifty 50 index data 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2043 entries, 0 to 2042
Columns: 21 entries, date to interbank offered rate
dtypes: float64(18), int64(3)
memory usage: 335.3 KB
None
nifty 50 index future data 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Columns: 3 entries, date to close price
dtypes: float64(1), int64(2)
memory usage: 35.0 KB
None
hangseng index data 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2037 entries, 0 to 2036
Columns: 21 entries, ntime to hibor
dtypes: float64(19), int64(2)
memory usage: 334.3 KB
None
hangseng index future da

In [212]:
# Obtain column datatypes
#print(dict_dataframes['csi300 index data'].dtypes)
#print(dict_dataframes['csi300 index future data'].dtypes)

In [213]:
# Alternate method of obtaining dataframe row/column/dtype information
#for item in dict_dataframes:
    #print(item)
    #print(dict_dataframes[item].info())

In [214]:
dict_dataframes['csi300 index data'].head()

Unnamed: 0,time,open price,high price,low price,closing price,volume,us dollar index,shibor,macd,cci,atr,boll,ema20,ma10,mtm6,ma5,mtm12,roc,smi,wvad
0,20080701,2799.2,2809.38,2690.18,2698.35,288515.85,72.34,2.5006,-213.078565,-128.949052,119.2,3173.475692,3033.748201,2851.2504,-91.594,2851.3262,-280.77,-9.424605,-0.097927,-87262780.0
1,20080702,2702.63,2745.94,2670.06,2699.6,279163.65,71.99,2.7238,-213.732249,-139.719688,75.875,3140.413385,3001.924277,2822.0828,-152.318,2797.3382,-252.645,-8.557733,-0.026497,-109549300.0
2,20080703,2654.48,2807.68,2617.26,2760.61,456603.05,72.73,2.5762,-206.941406,-112.113057,190.424,3109.044731,2978.942155,2820.8364,-208.925,2753.2792,-82.064,-2.886857,-0.031251,-58557370.0
3,20080704,2751.21,2783.85,2716.02,2741.85,379050.1,72.71,2.5632,-200.759162,-81.997539,67.825,3073.107115,2956.36214,2810.0548,-239.055,2738.4454,-249.421,-8.338289,-0.017237,-74675030.0
4,20080707,2747.61,2890.99,2747.61,2882.76,527320.24,72.71,2.5679,-182.386907,21.707767,149.133,3046.256923,2949.352699,2819.337,66.742,2756.6342,109.687,3.955427,0.010701,-35311510.0


Table 1. Description of the input variables:

![table%201%20description%20of%20input%20variables.PNG](attachment:table%201%20description%20of%20input%20variables.PNG)

In [192]:
# Find column 'matlab_time' in all dataframes in OrderedDict (the drop itself is left commented out)
for item in dict_dataframes:
    for subitem in dict_dataframes[item]:
        if 'matlab_time' in subitem:
            print(item, subitem)
            #dict_dataframes[item].drop(subitem,axis=1, inplace=True)

csi300 index future data matlab_time
hangseng index future data matlab_time


In [215]:
# Alternate method to drop column 'matlab_time' from all dataframes in OrderedDict

# a = dict_dataframes
# for i, (key, value) in enumerate(a.items()):
#     print (i, key)
#     print(value.columns)
#     if ('matlab_time' in value.columns) is True:
#         print("True")
#         print(i,key)
#         dict_dataframes[key].drop('matlab_time',axis=1, inplace=True)
#     i[item].drop('matlab_time', inplace=True)
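
Both cells above leave the actual drop commented out. A minimal sketch of performing it without mutating the loaded frames is shown here; `drop_column_everywhere` and the toy frames are hypothetical, and the sketch matches the column name exactly rather than by substring as the loop above does.

```python
import pandas as pd


def drop_column_everywhere(frames, column):
    """Return a new dict in which `column` is removed wherever it exists."""
    # errors="ignore" skips frames that do not contain the column
    return {
        name: df.drop(columns=[column], errors="ignore")
        for name, df in frames.items()
    }


# Toy stand-ins for two sheets (made-up values, not the real data)
toy_frames = {
    "with_matlab_time": pd.DataFrame({"matlab_time": [1, 2], "close": [3.0, 4.0]}),
    "without_matlab_time": pd.DataFrame({"close": [5.0, 6.0]}),
}

cleaned = drop_column_everywhere(toy_frames, "matlab_time")
for name, df in cleaned.items():
    print(name, list(df.columns))
```

Returning a new dict keeps the originally loaded frames available for comparison, which is handy while it is still undecided whether `matlab_time` is needed.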

In [218]:
# Save cleaned data to disk

# frames_to_excel() source: https://stackoverflow.com/q/51696940
def frames_to_excel(df_dict, path):
    """Write a dictionary of dataframes to separate sheets within one file."""
    # The context manager saves and closes the workbook on exit;
    # ExcelWriter.save() is deprecated and absent from recent pandas versions.
    with pd.ExcelWriter(path, engine='xlsxwriter') as writer:
        for tab_name, dframe in df_dict.items():
            dframe.to_excel(writer, sheet_name=tab_name)

frames_to_excel(dict_dataframes, "data/clean_data.xlsx")

Table 1, Panel A variables (daily trading data): open/close price, high/low price, volume

Table 1, Panel B variables (technical indicators): MACD, CCI, ATR, BOLL, EMA20, MA5/MA10, MTM6/MTM12, ROC, SMI, WVAD

Panel A variables (OHLC and volume) and Panel B variables (technical indicators) are all derived from price and volume data, so they can be calculated/recreated for bitcoin and other crypto (or even stocks/forex).

Panel C variables (exchange rate and interest rate) are macroeconomic variables. For crypto I'd have to find a replacement, substitute, or rough equivalent (if possible), possibly something like the stablecoin exchange premium or the overall market cap per currency/ticker.
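
As a rough illustration of recreating some of the technical indicators from a closing-price series, here is a minimal pandas sketch. The formulas are common textbook definitions and not necessarily the exact ones used in the source article; `add_basic_indicators`, the 12-period ROC window, and the toy prices are all assumptions.

```python
import pandas as pd


def add_basic_indicators(close: pd.Series) -> pd.DataFrame:
    """Compute a few price-based indicators using common definitions."""
    out = pd.DataFrame({"close": close})
    # Simple moving averages over 5 and 10 periods
    out["ma5"] = close.rolling(window=5).mean()
    out["ma10"] = close.rolling(window=10).mean()
    # Exponential moving average with a 20-period span
    out["ema20"] = close.ewm(span=20, adjust=False).mean()
    # Momentum: price change over 6 and 12 periods
    out["mtm6"] = close - close.shift(6)
    out["mtm12"] = close - close.shift(12)
    # Rate of change in percent (12-period window is an assumption)
    out["roc"] = 100 * (close - close.shift(12)) / close.shift(12)
    return out


# Toy price series (made-up values, not from the dataset)
prices = pd.Series([float(p) for p in range(100, 120)], name="close")
indicators = add_basic_indicators(prices)
print(indicators.tail(3))
```

The first rows of each rolling or shifted column are NaN until enough history is available, so those rows would need to be trimmed before training.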

Splitting the data with `TimeSeriesSplit` produces an expanding window: each training set contains all observations from the previous split's training and test sets.

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# NOTE: clean_data is not defined in the cells shown above; it is assumed to be
# a single cleaned sheet, e.g. dict_dataframes['csi300 index data'].
train_list = []
test_list = []
X = clean_data.values

splits = TimeSeriesSplit(n_splits=6)
for train_index, test_index in splits.split(X):
    # Every fold trains on all rows that precede its test block
    train = X[train_index]
    test = X[test_index]
    print('Observations: %d' % (len(train) + len(test)))
    print('Training Observations: %d' % (len(train)))
    print('Testing Observations: %d' % (len(test)))
    train_list.append(train)
    test_list.append(test)

In [None]:
len(train_list),len(test_list)
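
To make the growing training window concrete without needing the dataset, the index arithmetic can be sketched in plain Python. `expanding_window_splits` is a hypothetical helper that mirrors, to my understanding, what `TimeSeriesSplit` does with its default `gap` and `max_train_size`; treat it as an illustration rather than a replacement for the scikit-learn class.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs with a growing training set."""
    if n_splits + 1 > n_samples:
        raise ValueError("need at least n_splits + 1 samples")
    # Every test block has the same size; leftover rows go to the first train set
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Toy example: 12 observations, 3 folds
for fold, (train_idx, test_idx) in enumerate(expanding_window_splits(12, 3), start=1):
    print("fold", fold, "train", train_idx, "test", test_idx)
```

Each fold's training indices are the previous fold's training and test indices combined, so no future observation ever leaks into a training set.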

Fig 1. The flowchart of the proposed deep learning framework for financial time series:

![fig%201%20flowchart%20of%20dl%20framework%20model.PNG](attachment:fig%201%20flowchart%20of%20dl%20framework%20model.PNG)

As defined by Bao, Yue, and Rao in the source article, the WSAE-LSTM model has three primary components:

1. Wavelet transform applied to denoise the data as part of preprocessing
2. Stacked autoencoders (SAEs) to generate high-level features
3. LSTMs to forecast the next-day closing price

Wavelet Transform

"As a result, the two-level wavelet is applied twice in this study for data preprocessing as suggested in [23]"
"First, the denoised time series is generated via discrete wavelet transform using the Haar wavelet"

In [None]:
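# Single-level wavelet transform
    # https://pywavelets.readthedocs.io/en/latest/ref/dwt-discrete-wavelet-transform.html
# clean_data.values is assumed to be 2-D (rows = trading days), so axis=0 runs the
# transform along time; the default axis=-1 would transform across the columns
cA, cD = pywt.dwt(clean_data.values, 'haar', axis=0)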

In [21]:
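# Multi-level wavelet transform
    # https://pywavelets.readthedocs.io/en/latest/ref/dwt-discrete-wavelet-transform.html#pywt.wavedec
from pywt import wavedec
# axis=0 decomposes along time (rows), assuming clean_data.values is 2-D
coeffs = wavedec(clean_data.values, 'haar', level=2, axis=0)
# Level-2 approximation, then detail coefficients from coarsest to finest
cA2, cD2, cD1 = coeffs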

NameError: name 'clean_data' is not defined
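
The pywt cells above only decompose the series. The denoising described in the quoted passages also requires shrinking the detail coefficients and reconstructing the signal. A minimal pure-Python sketch of a two-level Haar denoiser follows, applied once rather than twice. The soft-threshold rule, the threshold value, and all function names are assumptions, since the notebook does not state how the article chose them, and the input length is assumed to be a multiple of 4.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One level of the orthonormal Haar transform (len(x) must be even)."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one Haar level, interleaving the reconstructed pairs."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x


def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t (zeroing small ones)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def haar_denoise(x, threshold):
    """Two-level decomposition, shrink both detail bands, then reconstruct."""
    cA1, cD1 = haar_dwt(x)
    cA2, cD2 = haar_dwt(cA1)
    cA1_hat = haar_idwt(cA2, soft_threshold(cD2, threshold))
    return haar_idwt(cA1_hat, soft_threshold(cD1, threshold))


# Toy series (made-up values): two price levels with small wiggles
noisy = [10.0, 10.4, 9.8, 10.2, 12.0, 12.4, 11.8, 12.2]
smooth = haar_denoise(noisy, threshold=0.5)
print([round(v, 3) for v in smooth])
```

With a threshold of zero the input is reconstructed unchanged, and with a very large threshold each block of four values collapses to its mean, which gives two easy sanity checks when tuning the threshold.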