# 5.0 Modelling

---

# Contents
    
- [1.0 About ARIMAX Model](#1.0-About-ARIMAX-Model)
- [2.0 Load Data](#2.0-Load-Data)
    - [2.1 EUR/USD Data](#2.1-EUR/USD-Data)
    - [2.2 Pattern Data](#2.2-Pattern-Data)
- [3.0 Model](#3.0-Model)
    - [3.1 Train/Test Split](#3.1-Train/Test-Split)
    - [3.2 Fit Model](#3.2-Fit-Model)
    - [3.3 Calculate Results](#3.3-Calculate-Results)
    - [3.4 Run Model](#3.4-Run-Model)
- [4.0 Results](#4.0-Results)
- [5.0 Observations](#5.0-Observations)



In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import datetime
import calendar

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima_model import ARIMA, ARMA, ARMAResults, ARIMAResults
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import r2_score, mean_squared_error
from pmdarima import auto_arima
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")


from statsmodels.tools.eval_measures import rmse


In [3]:
pd.set_option('display.max_columns', None)

---

# 1.0 About ARIMAX Model

The ARIMA model has three components:

+ Differencing step (I, Integrated): differences the series until it is stationary
+ Autoregressive piece (AR): models longer-term dependence on the series' own past values
+ Moving average piece (MA): models sudden fluctuations through past forecast errors

Each component has an order parameter in the model: p, d and q.

+ d is the order of differencing, found using the Augmented Dickey-Fuller test.
+ p is the number of autoregressive terms in the model. The PACF is used to estimate it.
+ q is the number of moving average terms. The ACF is used to estimate it.
    + If the ACF has a sharp cut-off and the lag-1 autocorrelation is negative, choose q to be the lag at which the ACF cuts off.
    + If the ACF does not have a sharp cut-off, or the lag-1 autocorrelation is not negative, choose q = 0.
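
As a rough illustration of why differencing matters, the sketch below computes sample autocorrelations with plain NumPy for a simulated random walk and for its first difference. The series is synthetic and `sample_acf` is a hypothetical helper, not something the notebook defines. It is only meant to show the pattern that ACF charts reveal: very slow decay before differencing, and values near zero after it.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=1000))  # synthetic random walk (non-stationary)
diffed = np.diff(walk)                   # first difference, i.e. d = 1

acf_walk = sample_acf(walk, 5)
acf_diff = sample_acf(diffed, 5)
print("walk lag-1 ACF:", round(acf_walk[1], 3))
print("diff lag-1 ACF:", round(acf_diff[1], 3))
```

A lag-1 value close to 1 for the raw series and close to 0 for the differenced series is the kind of evidence that points towards d = 1 with few or no AR and MA terms.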

Therefore, based on the earlier charts, I will use:

+ p = 1
+ d = 1
+ q = 0

However, I will use auto_arima to help decide.

The X in ARIMAX stands for exogenous features: external variables that do not come from the series' own history. Pure time series models treat time as the key factor, which works well when time itself largely determines the value. For example, stock control is closely linked to time: the more time passes, the less stock is likely to remain.

The FOREX market has many external features that drive prices. I therefore added the gold price, volatility and two moving averages, so that the model can use the extra information to help predict the price.
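
One common textbook way to write a model of this kind is given below. It is a sketch, not necessarily the exact parameterisation a particular library uses. The notation is introduced here for illustration: $y_t$ is the price, $x_{k,t}$ the $k$-th exogenous feature, $\Delta^d$ the $d$-th difference, and $\varepsilon_t$ the error term.

$$
\Delta^d y_t = c + \sum_{k=1}^{K} \beta_k \, x_{k,t} + \sum_{i=1}^{p} \phi_i \, \Delta^d y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
$$

With $p = 0$, $d = 1$ and $q = 0$ this reduces to $y_t - y_{t-1} = c + \sum_{k} \beta_k \, x_{k,t} + \varepsilon_t$, so each day's price change is explained by the exogenous features plus noise.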

---

In [4]:
results = {'algo':'','name':'','date':'', 'time_frame':'','success':0,'RMSE':0, 'MSE':0, 'classification':'' }

# 2.0 Load Data

## 2.1 EUR/USD Data

In [5]:
daily = pd.read_csv('/Users/stuartdaw/Documents/Capstone_data/data/resampled/daily.csv', 
                    index_col='date', parse_dates=True)

In [13]:
### Get correct hyperparameters

In [14]:
## Arima
auto_arima(daily['close'].dropna(), seasonal=False).summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 4921 |
| Model: | SARIMAX(0, 1, 0) | Log Likelihood | 17044.75 |
| Date: | Thu, 30 Jul 2020 | AIC | -34087.501 |
| Time: | 18:50:49 | BIC | -34080.999 |
| Sample: | 0 - 4921 | HQIC | -34085.22 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| sigma2 | 5.732e-05 | 7.53e-07 | 76.145 | 0.000 | 5.58e-05 | 5.88e-05 |

| | | | |
|---|---|---|---|
| Ljung-Box (Q): | 30.36 | Jarque-Bera (JB): | 1507.98 |
| Prob(Q): | 0.86 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.69 | Skew: | 0.0 |
| Prob(H) (two-sided): | 0.0 | Kurtosis: | 5.71 |
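
The summary reports only `sigma2`, with no AR, MA or intercept coefficients, so the selected (0, 1, 0) model behaves like a random walk without drift: the point forecast for every future step is simply the last observed value. A minimal sketch with made-up prices (the function names are hypothetical, not part of the notebook):

```python
import numpy as np

def random_walk_forecast(history, steps):
    """Point forecasts of a driftless ARIMA(0,1,0): repeat the last observed value."""
    history = np.asarray(history, dtype=float)
    return np.full(steps, history[-1])

def sigma2_estimate(history):
    """Mean squared one-step change, the quantity sigma2 describes in this model."""
    diffs = np.diff(np.asarray(history, dtype=float))
    return float(np.mean(diffs ** 2))

closes = [1.1000, 1.1050, 1.1020, 1.1080]  # made-up prices for illustration
print(random_walk_forecast(closes, 5))
print(sigma2_estimate(closes))
```

This is why the exogenous features matter in the later sections: without them, a (0, 1, 0) forecast is flat.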


---

## 2.2 Pattern Data

In [15]:
daily_pattern = pd.read_csv('/Users/stuartdaw/Documents/Capstone_data/data/targets/daily_pattern.csv', 
                           parse_dates=True)

In [16]:
daily_pattern['pattern_end'] = pd.to_datetime(daily_pattern['pattern_end'])

In [17]:
daily_pattern.loc[1]

pattern_end   2001-04-04
Name: 1, dtype: datetime64[ns]

In [18]:
len(daily_pattern)

62

---

# 3.0 Model

In [107]:
# def create_train_test_split(date, time_series, model_info):
#     test_end_date = time_series.loc[date,'date+5']
    
#     train_test = time_series.loc[time_series.index <= test_end_date]
  
#     target_value = time_series.loc[time_series.index == date,'double_height'].item()
    
#     train_test.insert(0, 'target_price', target_value)
    
#     model_info['signal'] = time_series.loc[date,'marubozu']
    
#     train_test.insert(0, 'signal', model_info['signal'])
    
#     model_info['start'] = len(train_test)-5
#     model_info['end'] = len(train_test)-1
    
#     model_info['train'] = train_test.iloc[:model_info['start']]
#     model_info['test'] = train_test.iloc[model_info['start']:]

#     return model_info

## 3.1 Train/Test Split

In [110]:
def create_train_test_split(date, time_series, model_info):

    # Locate the pattern row; the slice end is exclusive, so + 6 keeps the
    # pattern row plus the 5 rows that follow it (the test window)
    test_end_loc = time_series.index.get_loc(date) + 6
    
    # Copy all rows up to the end of the test window so that the inserts
    # below do not touch the original frame
    train_test = time_series.iloc[:test_end_loc].copy()
    
    # Target value for this pattern
    target_value = time_series.loc[time_series.index == date, 'double_height'].item()
    
    # Add the target price to the dataset
    train_test.insert(0, 'target_price', target_value)
    
    # Signal tells us whether we expect the price to go up (1) or down (-1)
    model_info['signal'] = time_series.loc[date, 'marubozu']
    
    # Insert the signal into the dataset
    train_test.insert(0, 'signal', model_info['signal'])
    
    # Positions of the first and last test rows (the final 5 rows)
    model_info['start'] = len(train_test) - 5
    model_info['end'] = len(train_test) - 1
    
    # Create the train and test sets
    model_info['train'] = train_test.iloc[:model_info['start']]
    model_info['test'] = train_test.iloc[model_info['start']:]

    return model_info
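
The slicing logic is easier to see on a toy frame. The sketch below uses a hypothetical helper, `split_after_event`, and a made-up 12-row business-day series. It trains on everything up to and including the event row and tests on the following five rows.

```python
import pandas as pd

def split_after_event(frame, event_date, horizon=5):
    """Train on rows up to and including the event; test on the next `horizon` rows."""
    event_loc = frame.index.get_loc(event_date)     # integer position of the event
    window = frame.iloc[: event_loc + 1 + horizon]  # slice end is exclusive
    train = window.iloc[:-horizon]
    test = window.iloc[-horizon:]
    return train, test

# Toy business-day series standing in for the daily EUR/USD frame
dates = pd.bdate_range("2020-01-01", periods=12)
toy = pd.DataFrame({"close": range(12)}, index=dates)

train, test = split_after_event(toy, dates[4])
print(len(train), len(test))
print(train.index[-1].date(), test.index[0].date())
```

If fewer than `horizon` rows follow the event, the window is shorter than expected and the last five rows would overlap the event itself, so a length check may be worth adding in that case.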

## 3.2 Fit Model

In [111]:
def train_arima(model_info, p=0, d=1, q=0):
    train = model_info['train']

    # Exogenous features: the two moving averages, volatility and the gold price columns
    exog = np.column_stack([train['mnth_mv_av'],
                            train['wk_mv_av'],
                            train['vol'],
                            train['gold_euro'],
                            train['gold_usd']])

    # Model the low for a down signal (-1), otherwise the high
    target_col = 'low' if model_info['signal'] == -1 else 'high'
    model = ARIMA(train[target_col], exog=exog, order=(p, d, q))

    results = model.fit()

    # Dynamic forecast over the test window, returned on the original price scale
    predictions = results.predict(start=model_info['start'],
                                  end=model_info['end'], exog=exog,
                                  dynamic=True,
                                  typ='levels').rename(f'ARIMA-{p}-{d}-{q} Predictions')

    return results, predictions
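
In statsmodels, out-of-sample prediction with exogenous regressors is normally given the regressor values for the forecast window itself. A minimal sketch of preparing matching train and test matrices, assuming the test rows carry the same feature columns (`build_exog` and `EXOG_COLUMNS` are hypothetical names, and the data is random):

```python
import numpy as np
import pandas as pd

EXOG_COLUMNS = ["mnth_mv_av", "wk_mv_av", "vol", "gold_euro", "gold_usd"]

def build_exog(frame, columns=EXOG_COLUMNS):
    """Return the exogenous columns of `frame` as a 2-D float array."""
    return frame[columns].to_numpy(dtype=float)

# Random toy data standing in for model_info['train'] and model_info['test']
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(15, 5)), columns=EXOG_COLUMNS)
train, test = toy.iloc[:10], toy.iloc[10:]

exog_train = build_exog(train)  # one row per training observation
exog_test = build_exog(test)    # one row per forecast step
print(exog_train.shape, exog_test.shape)
```

Whether test-period values of features such as moving averages would really be known at prediction time is a separate design question.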

## 3.3 Calculate Results

In [112]:
def meet_threshold(row):
    if row['signal'] == -1 and row['low'] <= row['target_price']:
        return -1
    elif row['signal'] == 1 and row['high'] >= row['target_price']:
        return 1    
    else:
        return 0

In [113]:
def ml_decision(row):
    if row['direction'] == -1 and row['preds'] <= row['target_price']:
        return -1
    elif row['direction'] == 1 and row['preds'] >= row['target_price']:
        return 1    
    else:
        return 0
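
The two row-wise rules above share one shape: a down signal succeeds when the price series falls to the target, an up signal when it rises to it. As a sketch, the same rule can be written in vectorised form with `np.select`. The helper name `threshold_hits` and the toy prices are made up for illustration.

```python
import numpy as np
import pandas as pd

def threshold_hits(frame, direction_col, down_col, up_col, target_col="target_price"):
    """Return -1 / 1 where the target is reached in the signalled direction, else 0."""
    direction = frame[direction_col]
    hit_down = ((direction == -1) & (frame[down_col] <= frame[target_col])).to_numpy()
    hit_up = ((direction == 1) & (frame[up_col] >= frame[target_col])).to_numpy()
    labels = np.select([hit_down, hit_up], [-1, 1], default=0)
    return pd.Series(labels, index=frame.index)

toy = pd.DataFrame({
    "signal": [-1, -1, -1],
    "low": [1.10, 1.08, 1.05],
    "high": [1.12, 1.11, 1.09],
    "target_price": [1.06, 1.06, 1.06],
})
toy["correct_call"] = threshold_hits(toy, "signal", "low", "high")
print(toy["correct_call"].tolist())
```

Passing `"direction"` and `"preds"` for both price columns would reproduce the `ml_decision` rule in the same way.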

In [121]:
def create_results_outcomes_dataframe(test, predictions):    
    outcomes = pd.DataFrame()
    outcomes['low'] = test['low']
    outcomes['high'] = test['high']
    outcomes['preds'] = predictions.values
    outcomes['target_price'] = test['target_price']
    outcomes['direction'] = test['signal']
    outcomes['correct_call'] = test.apply(meet_threshold, axis=1)
    return outcomes

In [122]:
def print_chart(outcomes):
    # Plot the low for a down signal (-1), otherwise the high
    if outcomes['direction'].iloc[0] == -1:
        outcomes['low'].plot(legend=False, figsize=(12,8))
    else:
        outcomes['high'].plot(legend=False, figsize=(12,8))

    # Overlay the predictions and the target price
    outcomes['preds'].plot(legend=False)
    outcomes['target_price'].plot(legend=False)

In [123]:
def get_results(model_info):
    # Relies on the notebook-level `predictions` produced by train_arima

    # Score against the low for a down signal (-1), otherwise the high
    target_col = 'low' if model_info['signal'] == -1 else 'high'
    actual = model_info['test'][target_col]

    mse = mean_squared_error(actual, predictions)
    rmse_res = rmse(actual, predictions)

    return rmse_res, mse
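
For reference, MSE is the mean of the squared forecast errors and RMSE is its square root, which puts the error back in price units. A small NumPy sketch with made-up numbers follows. The helper `mse_and_rmse` is hypothetical and takes the predictions as an explicit argument.

```python
import numpy as np

def mse_and_rmse(actual, predicted):
    """Return (MSE, RMSE) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mse = float(np.mean((actual - predicted) ** 2))
    return mse, float(np.sqrt(mse))

actual_highs = [1.10, 1.12, 1.11]     # made-up test-window prices
predicted_highs = [1.10, 1.10, 1.10]  # made-up flat forecast
mse, rmse_value = mse_and_rmse(actual_highs, predicted_highs)
print(mse, rmse_value)
```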

In [124]:
def classify(outcomes):
    
    if max(outcomes['direction']) == 1:
        
        if max(outcomes['correct_call']) == 0 and max(outcomes['ml_correct_call']) == 0:
            return 'tn'
        elif max(outcomes['correct_call']) == 1 and max(outcomes['ml_correct_call']) == 1:
            return 'tp'
        elif max(outcomes['correct_call']) == 0 and max(outcomes['ml_correct_call']) == 1:
            return 'fp'
        elif max(outcomes['correct_call']) == 1 and max(outcomes['ml_correct_call']) == 0:
            return 'fn'
        
    elif max(outcomes['direction']) == -1:
        
        if min(outcomes['correct_call']) == 0 and min(outcomes['ml_correct_call']) == 0:
            return 'tn'
        elif min(outcomes['correct_call']) == -1 and min(outcomes['ml_correct_call']) == -1:
            return 'tp'
        elif min(outcomes['correct_call']) == 0 and min(outcomes['ml_correct_call']) == -1:
            return 'fp'
        elif min(outcomes['correct_call']) == -1 and min(outcomes['ml_correct_call']) == 0:
            return 'fn'
        
    else:
        return 'ERROR'
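
The classification above collapses each five-day window into two yes/no questions: did the price reach the target at any point, and did the prediction reach it at any point? A compact sketch of the same idea, assuming the per-day labels are only ever 0 or the signal's sign (`window_flags` and `classify_window` are hypothetical names):

```python
def window_flags(correct_call, ml_correct_call):
    """Collapse per-day labels (0 or +/-1) into two booleans for the window."""
    actual_hit = any(value != 0 for value in correct_call)
    predicted_hit = any(value != 0 for value in ml_correct_call)
    return actual_hit, predicted_hit

def classify_window(actual_hit, predicted_hit):
    """Map the two booleans to a confusion-matrix label."""
    if predicted_hit:
        return "tp" if actual_hit else "fp"
    return "fn" if actual_hit else "tn"

# Price reached the target on four days, the prediction never did
label = classify_window(*window_flags([0, -1, -1, -1, -1], [0, 0, 0, 0, 0]))
print(label)
```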
    

## 3.4 Run Model

In [131]:

arima_results = []

for match in daily_pattern['pattern_end']:

    model_info = {"train": None, "test": None, "start": None, "end": None, "signal": None}

    results_dict = {'name': 'arima-0-1-0-' + str(match),
                    'pattern': 'Marubozu',
                    'date': match,
                    'time_frame': 'daily',
                    'RMSE': None,
                    'MSE': None,
                    'classification': None}

    model_info = create_train_test_split(match, daily, model_info)

    # Skip patterns with too little history to fit a model
    if len(model_info['train']) < 10:
        continue

    results, predictions = train_arima(model_info)

    outcomes = create_results_outcomes_dataframe(model_info['test'], predictions)
    outcomes['ml_correct_call'] = outcomes.apply(ml_decision, axis=1)

    results_dict['RMSE'], results_dict['MSE'] = get_results(model_info)
    results_dict['classification'] = classify(outcomes)

    arima_results.append(results_dict)
    


In [128]:
# Check no errors
def check_no_errors(results_list):
    # Count results whose classification could not be determined
    errors = sum(1 for result in results_list if result['classification'] == 'ERROR')

    if errors == 0:
        print("All patterns recorded correctly")
    else:
        print(f"Warning: there were {errors} errors recorded")

In [129]:
check_no_errors(arima_results)

All patterns recorded correctly


---

# 4.0 Results

In [102]:
def create_cm(arima_results):
    
    res_cm = [[0,0],
              [0,0]]
    
    for result in arima_results:
        res = result['classification']
        
        if res == 'tp':
            res_cm[0][0] += 1
        elif res == 'fp':
            res_cm[0][1] += 1
        elif res == 'fn':
            res_cm[1][0] += 1
        elif res == 'tn':
            res_cm[1][1] += 1
    
    return res_cm
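
The matrix layout used here puts predictions in rows and actual outcomes in columns: `[[tp, fp], [fn, tn]]`. As an alternative sketch, the same tally can be built with `collections.Counter`. The helper `tally_confusion` and the sample labels are made up for illustration.

```python
from collections import Counter

def tally_confusion(labels):
    """Build [[tp, fp], [fn, tn]] from a sequence of classification labels."""
    counts = Counter(labels)  # missing labels count as 0
    return [[counts["tp"], counts["fp"]],
            [counts["fn"], counts["tn"]]]

sample_labels = ["tp", "fn", "tn", "tp", "fp"]
print(tally_confusion(sample_labels))
```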

In [103]:
cm = create_cm(arima_results)

In [104]:
cm_df = pd.DataFrame(cm, index=['pred_success', 'pred_non_success'], columns=['actual success', 'actual non_success'])
cm_df

| | actual success | actual non_success |
|---|---|---|
| pred_success | 14 | 6 |
| pred_non_success | 24 | 18 |


In [105]:
def print_metrics(cm):
    # Accuracy: how many did the model get right
    # (true positives + true negatives) / total number of predictions
    acc = (cm[0][0]+cm[1][1])/(np.sum(cm))
    
    # Precision: proportion of positive predictions that were actually correct
    # true positives / (true positives + false positives)
    prec = cm[0][0]/(cm[0][0]+cm[0][1])
    
    # Recall: proportion of actual positives that were correctly identified
    # true positives / (true positives + false negatives)
    rec = cm[0][0]/(cm[0][0]+cm[1][0])

    print(f"Accuracy:\t{round(acc,2)}\nPrecision:\t{round(prec,2)}\nRecall:\t\t{round(rec,2)}")


In [106]:
# Display the results
print_metrics(cm)

Accuracy:	0.52
Precision:	0.7
Recall:		0.37
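
The figures follow directly from the confusion matrix: accuracy is (14 + 18) / 62, precision is 14 / (14 + 6) and recall is 14 / (14 + 24). The sketch below repeats that arithmetic with a guard against empty denominators. The helper `summarise` is a hypothetical name.

```python
def summarise(cm):
    """Return (accuracy, precision, recall) for cm = [[tp, fp], [fn, tn]]."""
    (tp, fp), (fn, tn) = cm
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall

cm_values = [[14, 6], [24, 18]]  # counts from the table above
acc, prec, rec = summarise(cm_values)
print(round(acc, 2), round(prec, 2), round(rec, 2))
```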


---

# 5.0 Observations