# 5.0 Modelling

---

# Contents
    
- [1.0 About ARIMAX Model](#1.0-About-ARIMAX-Model)
- [2.0 Load Data](#2.0-Load-Data)
    - [2.1 EUR/USD Data](#2.1-EUR/USD-Data)
    - [2.2 Pattern Data](#2.2-Pattern-Data)
- [3.0 Model](#3.0-Model)
    - [3.1 Train/Test Split](#3.1-Train/Test-Split)
    - [3.2 Fit Model](#3.2-Fit-Model)
    - [3.3 Calculate Results](#3.3-Calculate-Results)
    - [3.4 Run Model](#3.4-Run-Model)
- [4.0 Results](#4.0-Results)
- [5.0 Observations](#5.0-Observations)



In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import datetime
import calendar

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
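# Legacy API: these classes were removed in statsmodels 0.13 in favour of statsmodels.tsa.arima.model.ARIMA
from statsmodels.tsa.arima_model import ARIMA, ARMA, ARMAResults, ARIMAResults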
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import r2_score, mean_squared_error
from pmdarima import auto_arima
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")


from statsmodels.tools.eval_measures import rmse


In [2]:
pd.set_option('display.max_columns', None)

---

# 1.0 About ARIMAX Model

An ARIMA model has three components:

+ Differencing step - I (Integrated) - differences the series until it is stationary
+ Autoregressive piece - AR - models the dependence on previous values (longer-term trends)
+ Moving average piece - MA - models the dependence on previous errors (sudden fluctuations)

Each component has an order that is passed to the model: p, d and q.

+ d is the order of differencing, found using the Augmented Dickey-Fuller test.
+ p is the number of autoregressive terms in the model. The PACF is used to estimate this.
+ q is the number of moving average terms. The ACF is used to estimate this.
    + If the ACF has a sharp cut-off and the lag-1 autocorrelation is negative, choose q to be the lag at which the ACF cuts off.
    + If the ACF does not have a sharp cut-off or the lag-1 autocorrelation is not negative, choose q = 0.

Based on the charts from the earlier analysis, I would use:

+ p = 1
+ d = 1
+ q = 0

However, I will use auto_arima to help decide.
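To illustrate what d = 1 means in practice, here is a minimal sketch on simulated data (a random walk, not the EUR/USD series). The helper `lag1_autocorr`, the seed and the series length are illustrative choices and are not part of this notebook.

```python
import numpy as np

def lag1_autocorr(series):
    """Sample autocorrelation at lag 1 (hypothetical helper for illustration)."""
    centred = series - series.mean()
    return float(centred[1:] @ centred[:-1] / (centred @ centred))

rng = np.random.default_rng(42)

# Simulated random walk: non-stationary in levels, like many price series
walk = np.cumsum(rng.normal(size=2000))

# A first difference (d = 1) recovers the stationary increments
diffed = np.diff(walk)

level_r1 = lag1_autocorr(walk)
diff_r1 = lag1_autocorr(diffed)
print(f"lag-1 autocorrelation, levels:      {level_r1:.3f}")
print(f"lag-1 autocorrelation, differenced: {diff_r1:.3f}")
```

A lag-1 autocorrelation close to 1 in levels that drops towards 0 after one difference is the typical signature of a series that needs d = 1 and few, if any, AR or MA terms.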

The X in ARIMAX stands for exogenous features: external variables that are not derived from the time index. Pure time series models treat time as the key factor, which works well when time is the main driver of the value being predicted. For example, stock levels are linked to time because the more time passes, the less stock is likely to remain.

The FOREX market has many external features that drive prices. Therefore I added the gold price, volatility and two moving averages as exogenous features. The idea is that the model can use the extra information to help predict the price.
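The role of an exogenous feature can be sketched with simulated data. The example below assumes one common formulation, regression with ARIMA(0, 1, 0) errors, in which differencing both sides leaves an ordinary regression on the differences. Library implementations may handle exogenous terms differently, and all series and the coefficient `beta_true` here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
beta_true = 0.5  # hypothetical effect of the exogenous feature

x = np.cumsum(rng.normal(size=n))             # hypothetical exogenous series
u = np.cumsum(rng.normal(scale=0.1, size=n))  # error term following ARIMA(0, 1, 0)
y = beta_true * x + u                         # observed series

# Differencing both sides gives diff(y) = beta * diff(x) + white noise,
# so beta can be estimated by least squares on the differenced series.
dy, dx = np.diff(y), np.diff(x)
beta_hat = float(dx @ dy / (dx @ dx))
print(f"true beta = {beta_true}, estimated beta = {beta_hat:.3f}")
```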

---

In [3]:
results = {'algo':'','name':'','date':'', 'time_frame':'','success':0,'RMSE':0, 'MSE':0, 'classification':'' }

# 2.0 Load Data

## 2.1 EUR/USD Data

In [4]:
daily = pd.read_csv('/Users/stuartdaw/Documents/Capstone_data/data/resampled/eur-usd2daily.csv', index_col='date', parse_dates=True)

In [5]:
# daily.loc[(daily.index >= '2000-10-6') & (daily.index <= '2000-10-18')]

In [6]:
daily.columns

Index(['open', 'high', 'low', 'close', 'mid', 'wk_mv_avg', 'mnth_mv_avg',
       'volatility_3_day', 'volatility_10_day', 'pct_chge_3_prds',
       'pct_chge_5_prds', 'pct_chge_10_prds', 'height', 'height-1', 'height-2',
       'height-3', 'direction', 'marubozu', 'marubozu+1', 'marubozu-1',
       'marubozu-2', 'day-1_open', 'day-2_open', 'day-3_open', 'day-1_high',
       'day-2_high', 'day-3_high', 'day-1_low', 'day-2_low', 'day-3_low',
       'day-1_close', 'day-2_close', 'day-3_close', 'day+1_open', 'day+1_high',
       'day+1_low', 'day+1_close', 'day+2_high', 'day+2_low', 'day+3_high',
       'day+3_low', 'day+4_high', 'day+4_low', 'day+5_high', 'day+5_low',
       'exit_price', 'select', 'target', 'date+5'],
      dtype='object')

In [7]:
daily.head()

date,open,high,low,close,mid,wk_mv_avg,mnth_mv_avg,volatility_3_day,volatility_10_day,pct_chge_3_prds,pct_chge_5_prds,pct_chge_10_prds,height,height-1,height-2,height-3,direction,marubozu,marubozu+1,marubozu-1,marubozu-2,day-1_open,day-2_open,day-3_open,day-1_high,day-2_high,day-3_high,day-1_low,day-2_low,day-3_low,day-1_close,day-2_close,day-3_close,day+1_open,day+1_high,day+1_low,day+1_close,day+2_high,day+2_low,day+3_high,day+3_low,day+4_high,day+4_low,day+5_high,day+5_low,exit_price,select,target,date+5
2000-07-18,0.9361,0.9368,0.9227,0.9256,0.93085,0.93785,0.945633,0.003582,0.003797,-0.008151,-0.022678,-0.020519,0.0105,0.0022,0.003,0.0062,-1,-1,0.0,0.0,0.0,0.9382,0.9353,0.9416,0.9402,0.9389,0.9425,0.9342,0.9318,0.933,0.936,0.9383,0.9354,0.9255,0.927,0.9193,0.9246,0.9342,0.9204,0.9384,0.9319,0.9367,0.9313,0.9433,0.9329,0.9361,0,0.9193,2000-07-25
2000-07-20,0.9245,0.9342,0.9204,0.9325,0.9285,0.93166,0.943221,0.005881,0.004167,-0.009177,-0.010655,-0.024531,0.008,0.0009,0.0105,0.0022,1,1,0.0,0.0,-1.0,0.9255,0.9361,0.9382,0.927,0.9368,0.9402,0.9193,0.9227,0.9342,0.9246,0.9256,0.936,0.9324,0.9384,0.9319,0.9365,0.9367,0.9313,0.9433,0.9329,0.945,0.9391,0.9444,0.9314,0.9405,0,0.945,2000-07-27
2000-07-25,0.9329,0.9433,0.9329,0.9412,0.93705,0.93197,0.94279,0.00307,0.005057,0.009208,0.006661,-0.016169,0.0083,0.0036,0.0041,0.008,1,1,0.0,0.0,0.0,0.9366,0.9324,0.9245,0.9367,0.9384,0.9342,0.9313,0.9319,0.9204,0.933,0.9365,0.9325,0.9411,0.945,0.9391,0.9435,0.9444,0.9314,0.9338,0.9229,0.9295,0.9224,0.9293,0.9135,0.9495,0,0.945,2000-08-01
2000-07-27,0.9434,0.9444,0.9314,0.9319,0.93765,0.93725,0.942469,0.005403,0.004752,0.003049,0.009855,-0.000906,0.0115,0.0024,0.0083,0.0036,-1,-1,-1.0,0.0,1.0,0.9411,0.9329,0.9366,0.945,0.9433,0.9367,0.9391,0.9329,0.9313,0.9435,0.9412,0.933,0.932,0.9338,0.9229,0.9241,0.9295,0.9224,0.9293,0.9135,0.9192,0.9117,0.9174,0.8997,0.9434,0,0.8997,2000-08-03
2000-07-28,0.932,0.9338,0.9229,0.9241,0.92805,0.93597,0.941517,0.008063,0.005738,-0.009605,-0.006849,-0.00934,0.0079,0.0115,0.0024,0.0083,-1,-1,0.0,-1.0,0.0,0.9434,0.9411,0.9329,0.9444,0.945,0.9433,0.9314,0.9391,0.9329,0.9319,0.9435,0.9412,0.9241,0.9295,0.9224,0.9274,0.9293,0.9135,0.9192,0.9117,0.9174,0.8997,0.9103,0.9015,0.932,0,0.8997,2000-08-04


In [8]:
### Get correct hyper parameters

In [9]:
## Arima
auto_arima(daily['close'].dropna(), seasonal=False).summary()

Dep. Variable:,y,No. Observations:,2056.0
Model:,"SARIMAX(0, 1, 0)",Log Likelihood,6254.34
Date:,"Tue, 04 Aug 2020",AIC,-12506.68
Time:,13:32:34,BIC,-12501.052
Sample:,0,HQIC,-12504.617
,- 2056,,
Covariance Type:,opg,,

,coef,std err,z,P>|z|,[0.025,0.975]
sigma2,0.0001,3.2e-06,41.576,0.000,0.000,0.000

Ljung-Box (Q):,34.46,Jarque-Bera (JB):,159.63
Prob(Q):,0.72,Prob(JB):,0.0
Heteroskedasticity (H):,0.84,Skew:,0.06
Prob(H) (two-sided):,0.03,Kurtosis:,4.36
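As a sanity check, the information criteria in the summary can be reproduced from the reported log likelihood using the standard definitions. This sketch assumes that sigma2 is the only estimated parameter and uses the reported number of observations. Because the log likelihood is rounded in the summary, agreement is only expected to within rounding.

```python
import math

# Values copied from the auto_arima summary above
log_likelihood = 6254.34
n_obs = 2056
k_params = 1  # assumption: sigma2 is the only estimated parameter

# Standard definitions of the three criteria
aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_obs) - 2 * log_likelihood
hqic = 2 * k_params * math.log(math.log(n_obs)) - 2 * log_likelihood

print(f"AIC:  {aic:.3f}")
print(f"BIC:  {bic:.3f}")
print(f"HQIC: {hqic:.3f}")
```

Lower values indicate a better trade-off between fit and complexity, which is the basis on which auto_arima compares candidate orders.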


---

## 2.2 Pattern Data

In [10]:
daily_pattern = pd.read_csv('/Users/stuartdaw/Documents/Capstone_data/data/targets/daily_pattern2.csv', 
                           parse_dates=True)

In [11]:
daily_pattern.head()

,pattern_end
0,2000-10-11
1,2000-10-20
2,2001-04-05
3,2001-04-09
4,2001-08-20


In [12]:
daily_pattern['pattern_end'] = pd.to_datetime(daily_pattern['pattern_end'])

In [13]:
daily_pattern.loc[0]

pattern_end   2000-10-11
Name: 0, dtype: datetime64[ns]

In [14]:
len(daily_pattern)

62

---

# 3.0 Model

In [15]:
# def create_train_test_split(date, time_series, model_info):
#     test_end_date = time_series.loc[date,'date+5']
    
#     train_test = time_series.loc[time_series.index <= test_end_date]
  
#     target_value = time_series.loc[time_series.index == date,'double_height'].item()
    
#     train_test.insert(0, 'target_price', target_value)
    
#     model_info['signal'] = time_series.loc[date,'marubozu']
    
#     train_test.insert(0, 'signal', model_info['signal'])
    
#     model_info['start'] = len(train_test)-5
#     model_info['end'] = len(train_test)-1
    
#     model_info['train'] = train_test.iloc[:model_info['start']]
#     model_info['test'] = train_test.iloc[model_info['start']:]

#     return model_info

## 3.1 Train/Test Split

In [16]:
def create_train_test_split(date, time_series, model_info):

    # Locate the pattern row and keep it plus the 5 rows that follow it
    # (the end of an iloc slice is exclusive, hence + 6)
    test_end_loc = time_series.index.get_loc(date) + 6
    
    # Create the combined train/test set; copy so that the inserts below
    # do not touch the original DataFrame
    train_test = time_series.iloc[:test_end_loc].copy()
    
    # Set the target value
    target_value = time_series.loc[time_series.index == date,'exit_price'].item()
    
    # Add the target price to the dataset
    train_test.insert(0, 'target_price', target_value)
    
    # Add the signal so it can be determined whether we expect the price to go up or down
    model_info['signal'] = time_series.loc[date,'marubozu']
    
    # Insert the signal into the dataset
    train_test.insert(0, 'signal', model_info['signal'])
    
    # Create start and end positions for the test period (the last 5 rows)
    model_info['start'] = len(train_test)-5
    model_info['end'] = len(train_test)-1
    
    # Create the train and test sets
    model_info['train'] = train_test.iloc[:model_info['start']]
    model_info['test'] = train_test.iloc[model_info['start']:]

    return model_info
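The slicing logic of the split can be checked on a toy frame. This is a minimal sketch that assumes a unique, sorted DatetimeIndex; the dates, the `close` values and the names `toy`, `pattern_date` and `horizon` are made up for illustration.

```python
import pandas as pd

# Toy daily series with 20 business days (hypothetical data)
idx = pd.date_range("2000-01-03", periods=20, freq="B")
toy = pd.DataFrame({"close": range(20)}, index=idx)

pattern_date = idx[9]  # hypothetical date on which a pattern completes
horizon = 5            # number of rows held out after the pattern

# Keep everything up to and including the 5 rows after the pattern
end_loc = toy.index.get_loc(pattern_date) + horizon + 1
window = toy.iloc[:end_loc]

# Train on data up to and including the pattern row, test on the rows after it
train, test = window.iloc[:-horizon], window.iloc[-horizon:]
print(len(train), len(test))
print(train.index[-1], test.index[0])
```

The last training row is the pattern date itself, so the model never sees any of the five rows it is asked to forecast.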

## 3.2 Fit Model

In [33]:
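def train_arima(model_info, p=0, d=1, q=0):
    # Exogenous features. Note: 'gold_euro' and 'gold_usd' are not among the
    # columns of the file loaded in section 2.1, so they must be added to the
    # data before this cell will run (see the KeyError in section 3.4).
    exog_cols = ['mnth_mv_avg', 'wk_mv_avg', 'volatility_3_day',
                 'gold_euro', 'gold_usd']
    exog_train = model_info['train'][exog_cols].values
    # Out-of-sample forecasts need the exogenous values for the forecast period,
    # not the training period (with p > 0 the legacy API may need extra lagged rows)
    exog_test = model_info['test'][exog_cols].values

    # Model the low for a bearish signal (-1), otherwise the high
    target_col = 'low' if model_info['signal'] == -1 else 'high'
    model = ARIMA(model_info['train'][target_col], exog=exog_train, order=(p,d,q))

    results = model.fit()
    predictions = results.predict(start=model_info['start'], 
                                  end=model_info['end'], exog=exog_test,
                                  dynamic=True, 
                                  typ='levels').rename(f'ARIMA-{p}-{d}-{q} Predictions')
    
    return results, predictions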

## 3.3 Calculate Results

In [34]:
def meet_threshold(row):
    if row['signal'] == -1 and row['low'] <= row['target_price']:
        return -1
    elif row['signal'] == 1 and row['high'] >= row['target_price']:
        return 1    
    else:
        return 0

In [35]:
def ml_decision(row):
    if row['direction'] == -1 and row['preds'] <= row['target_price']:
        return -1
    elif row['direction'] == 1 and row['preds'] >= row['target_price']:
        return 1    
    else:
        return 0

In [36]:
def create_results_outcomes_dataframe(test, predictions):    
    outcomes = pd.DataFrame()
    outcomes['low'] = test['low']
    outcomes['high'] = test['high']
    outcomes['preds'] = predictions.values
    outcomes['target_price'] = test['target_price']
    outcomes['direction'] = test['signal']
    outcomes['correct_call'] = test.apply(meet_threshold, axis=1)
    return outcomes

In [37]:
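def print_chart(outcomes):
    # Plot the low for a bearish signal (-1), otherwise the high;
    # the signal is read from the outcomes frame rather than a global variable
    if outcomes['direction'].iloc[0] == -1:
        outcomes['low'].plot(legend=False, figsize=(12,8))
    else:
        outcomes['high'].plot(legend=False, figsize=(12,8))

    outcomes['preds'].plot(legend=False);
    outcomes['target_price'].plot(legend=False);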

In [38]:
def get_results(model_info):
    # Note: relies on the global `predictions` created in the run loop (section 3.4)
    
    # Score against the low for a bearish signal (-1), otherwise the high
    target_col = 'low' if model_info['signal'] == -1 else 'high'
    mse = mean_squared_error(model_info['test'][target_col], predictions)
    rmse_res = rmse(model_info['test'][target_col], predictions)
    
    return rmse_res, mse
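For reference, the two error measures returned above are related by RMSE = sqrt(MSE). The following small worked example uses made-up numbers (not values from this notebook) and plain Python so that the arithmetic is easy to follow.

```python
import math

# Hypothetical actual and predicted values over a 4-step horizon
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

# Mean squared error: average of the squared differences
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse_example = sum(squared_errors) / len(squared_errors)

# Root mean squared error: square root of the MSE, in the units of the data
rmse_example = math.sqrt(mse_example)
print(mse_example, rmse_example)
```

Because RMSE is expressed in the same units as the price, it is usually the easier of the two to interpret.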

In [39]:
def classify(outcomes):
    
    if max(outcomes['direction']) == 1:
        
        if max(outcomes['correct_call']) == 0 and max(outcomes['ml_correct_call']) == 0:
            return 'tn'
        elif max(outcomes['correct_call']) == 1 and max(outcomes['ml_correct_call']) == 1:
            return 'tp'
        elif max(outcomes['correct_call']) == 0 and max(outcomes['ml_correct_call']) == 1:
            return 'fp'
        elif max(outcomes['correct_call']) == 1 and max(outcomes['ml_correct_call']) == 0:
            return 'fn'
        
    elif max(outcomes['direction']) == -1:
        
        if min(outcomes['correct_call']) == 0 and min(outcomes['ml_correct_call']) == 0:
            return 'tn'
        elif min(outcomes['correct_call']) == -1 and min(outcomes['ml_correct_call']) == -1:
            return 'tp'
        elif min(outcomes['correct_call']) == 0 and min(outcomes['ml_correct_call']) == -1:
            return 'fp'
        elif min(outcomes['correct_call']) == -1 and min(outcomes['ml_correct_call']) == 0:
            return 'fn'
        
    else:
        return 'ERROR'
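The labelling used by classify can be summarised as a two-way lookup: whether the price actually reached the target, and whether the model predicted that it would. The sketch below restates that mapping in isolation; `label_outcome` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def label_outcome(actual_hit, predicted_hit):
    """Map (price reached target, model predicted it would) to a confusion label."""
    if actual_hit and predicted_hit:
        return 'tp'  # model said yes and the target was reached
    if predicted_hit:
        return 'fp'  # model said yes but the target was not reached
    if actual_hit:
        return 'fn'  # model said no but the target was reached
    return 'tn'      # model said no and the target was not reached

# Print the full truth table
for actual_hit in (True, False):
    for predicted_hit in (True, False):
        print(actual_hit, predicted_hit, label_outcome(actual_hit, predicted_hit))
```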
    

## 3.4 Run Model

In [40]:

arima_results = []

for match in daily_pattern['pattern_end']:
    print(match, type(match))
    
    model_info = {"train":None,"test":None,"start":None,"end":None,"signal":None}

    
    results_dict = {'name':None,'pattern':None,'date':None,
                   'time_frame':None,'RMSE':None,
                   'MSE':None, 'classification':None}
    
    results_dict['name'] = 'arima-0-1-0' + str(match)
    results_dict['strategy'] = 'marubozu'
    results_dict['time_frame'] = 'daily'
    

    model_info = create_train_test_split(match, daily, model_info)

    if len(model_info['train']) < 10:
        continue

    results, predictions = train_arima(model_info)
    

    outcomes = create_results_outcomes_dataframe(model_info['test'], predictions)
    outcomes['ml_correct_call'] = outcomes.apply(ml_decision, axis=1)

    results_dict['RMSE'], results_dict['MSE'] = get_results(model_info)
    results_dict['classification'] = classify(outcomes)

    arima_results.append(results_dict)
    

2000-10-11 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>


KeyError: 'gold_euro'

In [None]:
# Check no errors
def check_no_errors(results_list):
    errors = 0
    for result in results_list:
        res = result['classification']
        if res == 'ERROR':
            errors+=1
    
    if errors == 0:
        print("All patterns recorded correctly")
    elif errors > 0:
        print(f"Warning: there were {errors} errors recorded")

In [None]:
check_no_errors(arima_results)

---

# 4.0 Results

In [None]:
def create_cm(arima_results):
    
    res_cm = [[0,0],
              [0,0]]
    
    for result in arima_results:
        res = result['classification']
        
        if res == 'tp':
            res_cm[0][0] += 1
        elif res == 'fp':
            res_cm[0][1] += 1
        elif res == 'fn':
            res_cm[1][0] += 1
        elif res == 'tn':
            res_cm[1][1] += 1
    
    return res_cm

In [None]:
cm = create_cm(arima_results)

In [None]:
cm_df = pd.DataFrame(cm, index=['pred_success', 'pred_non_success'], columns=['actual success', 'actual non_success'])
cm_df

In [None]:
def print_metrics(cm):
    # Accuracy - the proportion of all predictions the model got right
    # (true positives + true negatives) / total number of predictions
    acc = (cm[0][0]+cm[1][1])/(np.sum(cm))
    
    # Precision - the proportion of positive identifications that were actually correct
    # true positives / (true positives + false positives)
    prec = cm[0][0]/(cm[0][0]+cm[0][1])
    
    # Recall - the proportion of actual positives that were correctly identified
    # true positives / (true positives + false negatives)
    rec = cm[0][0]/(cm[0][0]+cm[1][0])

    print(f"Accuracy:\t{round(acc,2)}\nPrecision:\t{round(prec,2)}\nRecall:\t\t{round(rec,2)}")
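A worked example of the three metrics, using hypothetical counts rather than results from this notebook, is sketched below. It also guards against a zero denominator, which can occur if the model never predicts a success. `safe_ratio` and `example_cm` are illustrative names.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

# Hypothetical counts for illustration only (not results from this notebook).
# Layout matches create_cm: [[tp, fp], [fn, tn]]
example_cm = [[8, 2],
              [3, 7]]
(tp, fp), (fn, tn) = example_cm

accuracy = safe_ratio(tp + tn, tp + fp + fn + tn)  # 15 correct out of 20
precision = safe_ratio(tp, tp + fp)                # 8 of 10 predicted successes
recall = safe_ratio(tp, tp + fn)                   # 8 of 11 actual successes
print(accuracy, precision, recall)
```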


In [None]:
# Display the results
print_metrics(cm)

---

# 5.0 Observations