# **Machine Learning Forex VWAP (Volume Weighted Average Price) Solution using scikit-learn**

This notebook demonstrates a machine learning solution that predicts the VWAP direction for a given currency pair from historical prices.

## Problem Formulation

In this example, we use the historical currency price dataset provided by the Forex Tester app, available here: https://forextester.com/data/datasources.

The dataset contains hourly pricing data for the EURUSD currency pair for April 2018 (201804). Initial development and proof of concept use EURUSD data, with a plan to extend to all major G10 currency pairs.

In [10]:
#tables and visualizations
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn import config_context
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

In [8]:
%ls /eurusd.csv


/eurusd.csv


## Load Data

Here we load the data into Python with pandas and read it into a DataFrame, which is the format used throughout the example.
The fx_volume data consists of pricing data at hourly intervals, with the following fields:
*   CurrencyPair Ticker
*   BUSINESSDATE AS DTYYYYMMDD
*   HOUR as TIME
*   OPEN PRICE
*   HIGH PRICE
*   LOW PRICE
*   CLOSING PRICE
*   VOLUME
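
Since the business date and the hour arrive as separate fields, it can help to combine them into a single timestamp before any time-based work. The snippet below is a minimal sketch on two illustrative rows laid out like the dataset; the `timestamp` column name is an assumption and not part of the original data.

```python
import pandas as pd

# Two illustrative rows using the same column layout as the dataset.
sample = pd.DataFrame({
    "<TICKER>": ["EURUSD", "EURUSD"],
    "<DTYYYYMMDD>": [20180401, 20180402],
    "<TIME>": ["21:00", "00:00"],
})

# Join the integer date and the HH:MM string, then parse with an explicit format.
# "timestamp" is a hypothetical column name chosen for this sketch.
sample["timestamp"] = pd.to_datetime(
    sample["<DTYYYYMMDD>"].astype(str) + " " + sample["<TIME>"],
    format="%Y%m%d %H:%M",
)
print(sample[["<TICKER>", "timestamp"]])
```

A parsed timestamp makes it straightforward to sort rows chronologically or to resample to a different interval.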


In [30]:
fx_volume = pd.read_csv("/eurusd.csv")
display(fx_volume.head())
fx_volume.info()

Unnamed: 0,<TICKER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>
0,EURUSD,20180401,21:00,1.23199,1.2323,1.23165,1.23204,469600000
1,EURUSD,20180401,22:00,1.23204,1.23217,1.23132,1.23172,4095050000
2,EURUSD,20180401,23:00,1.23172,1.23206,1.23124,1.23125,3291760001
3,EURUSD,20180402,00:00,1.23127,1.23217,1.23107,1.23217,5418240002
4,EURUSD,20180402,01:00,1.23219,1.23282,1.23214,1.23263,4164520003


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 507 entries, 0 to 506
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   <TICKER>      507 non-null    object 
 1   <DTYYYYMMDD>  507 non-null    int64  
 2   <TIME>        507 non-null    object 
 3   <OPEN>        507 non-null    float64
 4   <HIGH>        507 non-null    float64
 5   <LOW>         507 non-null    float64
 6   <CLOSE>       507 non-null    float64
 7   <VOL>         507 non-null    int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 31.8+ KB


VWAP stands for Volume Weighted Average Price.
The computation starts with two per-row quantities:
 
1.   avg_price: the average of High, Low, Open, and Close for each row
2.   PV: avg_price * volume




In [52]:
fx_volume['<AVG_PRICE>'] = (fx_volume['<HIGH>'] + fx_volume['<LOW>'] + fx_volume['<CLOSE>'] + fx_volume['<OPEN>'])/4 
fx_volume['<PV>'] = fx_volume['<VOL>'] * fx_volume['<AVG_PRICE>'] 
display(fx_volume.head())
fx_volume.info()

Unnamed: 0,<TICKER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,AVG_PRICE,<AVG_PRICE>,<PV>,<KEY>
0,EURUSD,20180401,21:00,1.23199,1.2323,1.23165,1.23204,469600000,1.231995,1.231995,578544900.0,EURUSD 0 20180401\n1 20180401\n2 ...
1,EURUSD,20180401,22:00,1.23204,1.23217,1.23132,1.23172,4095050000,1.231813,1.231813,5044334000.0,EURUSD 0 20180401\n1 20180401\n2 ...
2,EURUSD,20180401,23:00,1.23172,1.23206,1.23124,1.23125,3291760001,1.231567,1.231567,4054025000.0,EURUSD 0 20180401\n1 20180401\n2 ...
3,EURUSD,20180402,00:00,1.23127,1.23217,1.23107,1.23217,5418240002,1.23167,1.23167,6673484000.0,EURUSD 0 20180401\n1 20180401\n2 ...
4,EURUSD,20180402,01:00,1.23219,1.23282,1.23214,1.23263,4164520003,1.232445,1.232445,5132542000.0,EURUSD 0 20180401\n1 20180401\n2 ...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 507 entries, 0 to 506
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   <TICKER>      507 non-null    object 
 1   <DTYYYYMMDD>  507 non-null    int64  
 2   <TIME>        507 non-null    object 
 3   <OPEN>        507 non-null    float64
 4   <HIGH>        507 non-null    float64
 5   <LOW>         507 non-null    float64
 6   <CLOSE>       507 non-null    float64
 7   <VOL>         507 non-null    int64  
 8   AVG_PRICE     507 non-null    float64
 9   <AVG_PRICE>   507 non-null    float64
 10  <PV>          507 non-null    float64
 11  <KEY>         507 non-null    object 
dtypes: float64(7), int64(2), object(3)
memory usage: 47.7+ KB


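Model computation entails:
1. Cumulative PV: the sum of PV for each currency pair
2. Count: the number of records for each currency pair
By the standard definition, VWAP is cumulative PV divided by cumulative volume (the sum of VOL); dividing cumulative PV by the record count gives the mean PV per record, not a price.
VWAP direction can then be computed by comparing the current price against VWAP.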
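
The steps above can be sketched as follows. This is a minimal sketch that assumes the standard VWAP definition (cumulative PV divided by cumulative volume) and a running VWAP per ticker. The helper `add_vwap`, the columns `<VWAP>` and `<VWAP_UP>`, the choice of the close as the "current price", and the toy values are all assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def add_vwap(df: pd.DataFrame) -> pd.DataFrame:
    """Add a running VWAP and a binary direction flag per ticker (hypothetical helper)."""
    out = df.copy()
    # Per-row average of the four prices, then price * volume.
    out["<AVG_PRICE>"] = out[["<OPEN>", "<HIGH>", "<LOW>", "<CLOSE>"]].mean(axis=1)
    out["<PV>"] = out["<AVG_PRICE>"] * out["<VOL>"]
    # Running totals within each currency pair.
    grouped = out.groupby("<TICKER>")
    out["<VWAP>"] = grouped["<PV>"].cumsum() / grouped["<VOL>"].cumsum()
    # Assumed label: 1 when the close is above the running VWAP, else 0.
    out["<VWAP_UP>"] = (out["<CLOSE>"] > out["<VWAP>"]).astype(int)
    return out

# Toy values chosen for easy arithmetic; they are not real EURUSD prices.
toy = pd.DataFrame({
    "<TICKER>": ["EURUSD"] * 3,
    "<OPEN>": [1.0, 2.0, 3.0],
    "<HIGH>": [1.0, 2.0, 3.0],
    "<LOW>": [1.0, 2.0, 3.0],
    "<CLOSE>": [1.0, 2.0, 3.0],
    "<VOL>": [10, 30, 60],
})
result = add_vwap(toy)
print(result[["<CLOSE>", "<VWAP>", "<VWAP_UP>"]])
```

A flag like `<VWAP_UP>` is one candidate for a classification target, because it encodes the direction of the price relative to VWAP directly.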


In [62]:
grouped = fx_volume['<PV>'].groupby(fx_volume['<TICKER>']).agg(['sum','count'])
display(grouped.head())

<TICKER>,sum,count
EURUSD,5967096000000.0,507


In [14]:
fx_volume.isna().sum()

<TICKER>        0
<DTYYYYMMDD>    0
<TIME>          0
<OPEN>          0
<HIGH>          0
<LOW>           0
<CLOSE>         0
<VOL>           0
dtype: int64

## Data cleaning and EDA

We can now explore our data; we leave that exercise to the reader. The check above shows no NA values in any column, so nothing needs to be imputed for this dataset. Imputation is still included in the training pipeline so that the same code can handle data with gaps. The next cell drops any rows with a missing <VOL>, which leaves this dataframe unchanged.

In [72]:
fx_volume = fx_volume.dropna(subset=['<VOL>'])
fx_volume.shape

(507, 12)

In [73]:
class_column = '<DTYYYYMMDD>'
random_seed = 325

X_train, X_test, y_train, y_test = train_test_split(fx_volume.drop(columns=class_column), fx_volume[class_column],
                                                   test_size=0.25, random_state=random_seed, stratify=fx_volume[class_column])

Quick sanity check to make sure that everything seems correct:

In [74]:
# X Train
print('On X train: ')
print('X train dimensions: ', X_train.shape)
display(X_train.head())

# X test
print('\nOn X test: ')
print('X test dimensions: ', X_test.shape)
display(X_test.head())

On X train: 
X train dimensions:  (380, 11)


Unnamed: 0,<TICKER>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,AVG_PRICE,<AVG_PRICE>,<PV>,<KEY>
81,EURUSD,06:00,1.22751,1.22755,1.22586,1.2263,11351649998,1.226805,1.226805,13926260000.0,EURUSD 0 20180401\n1 20180401\n2 ...
495,EURUSD,12:00,1.2077,1.20896,1.20722,1.20872,14196780001,1.20815,1.20815,17151840000.0,EURUSD 0 20180401\n1 20180401\n2 ...
207,EURUSD,12:00,1.23209,1.23275,1.2312,1.23155,17497599999,1.231898,1.231898,21555250000.0,EURUSD 0 20180401\n1 20180401\n2 ...
181,EURUSD,10:00,1.23781,1.23845,1.2368,1.2375,11576440002,1.23764,1.23764,14327470000.0,EURUSD 0 20180401\n1 20180401\n2 ...
269,EURUSD,02:00,1.23807,1.23812,1.23757,1.23811,5157830001,1.237967,1.237967,6385226000.0,EURUSD 0 20180401\n1 20180401\n2 ...



On X test: 
X test dimensions:  (127, 11)


Unnamed: 0,<TICKER>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,AVG_PRICE,<AVG_PRICE>,<PV>,<KEY>
439,EURUSD,04:00,1.21775,1.21804,1.21743,1.21755,4521290000,1.217693,1.217693,5505541000.0,EURUSD 0 20180401\n1 20180401\n2 ...
31,EURUSD,04:00,1.23055,1.23099,1.23045,1.23095,3073420000,1.230735,1.230735,3782566000.0,EURUSD 0 20180401\n1 20180401\n2 ...
354,EURUSD,15:00,1.22789,1.22884,1.22706,1.22816,13534470000,1.227987,1.227987,16620160000.0,EURUSD 0 20180401\n1 20180401\n2 ...
94,EURUSD,19:00,1.22367,1.22382,1.22333,1.22349,5892560002,1.223577,1.223577,7210004000.0,EURUSD 0 20180401\n1 20180401\n2 ...
282,EURUSD,15:00,1.23462,1.23492,1.23364,1.23447,14305280001,1.234412,1.234412,17658620000.0,EURUSD 0 20180401\n1 20180401\n2 ...


In [75]:
# y train
print('On y train: ')
print('y train dimensions: ', y_train.shape)
display(y_train.head())

# y test
print('\nOn y test: ')
print('y test dimensions: ', y_test.shape)
display(y_test.head())

On y train: 
y train dimensions:  (380,)


81     20180405
495    20180430
207    20180412
181    20180411
269    20180417
Name: <DTYYYYMMDD>, dtype: int64


On y test: 
y test dimensions:  (127,)


439    20180426
31     20180403
354    20180420
94     20180405
282    20180417
Name: <DTYYYYMMDD>, dtype: int64
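
The split above shuffles rows at random. Because the rows are hourly observations in time order, one alternative worth considering is a chronological hold-out, where the most recent rows form the test set. The snippet below is a minimal sketch of that idea on toy data; the helper `chronological_split` and the `hour_index` column are assumptions for illustration, and the frame is assumed to be sorted by time already.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, test_fraction: float = 0.25):
    """Hold out the last `test_fraction` of rows (hypothetical helper).

    Assumes `df` is already sorted from oldest to newest.
    """
    cut = int(len(df) * (1 - test_fraction))
    return df.iloc[:cut], df.iloc[cut:]

# Toy frame with eight time-ordered rows; values are illustrative only.
toy = pd.DataFrame({
    "hour_index": range(8),
    "<CLOSE>": [1.0 + 0.01 * i for i in range(8)],
})
train, test = chronological_split(toy)
print(len(train), len(test))
```

With this kind of split, every test row is later than every training row, which avoids evaluating on hours that sit between training hours.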

## Establish the training pipeline

We can now establish the training pipeline for our models. Since this process is repeated several times, it is worth wrapping it in reusable pipeline objects so we do not rewrite the same code. Here, we impute any missing values and encode any categorical values. Note that these pipelines change with the model and methodology you choose, and also with the data types of the columns in your dataset.

In [76]:
#individual pipelines for differing datatypes
cat_pipeline = Pipeline(steps=[('cat_impute', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                               ('onehot_cat', OneHotEncoder(drop='if_binary'))])
num_pipeline = Pipeline(steps=[('impute_num', SimpleImputer(missing_values=np.nan, strategy='mean')),
                               ('scale_num', StandardScaler())])

In [77]:
#establish preprocessing pipeline by columns
preproc = ColumnTransformer([('cat_pipe', cat_pipeline, make_column_selector(dtype_include=object)),
                             ('num_pipe', num_pipeline, make_column_selector(dtype_include=np.number))],
                             remainder='passthrough')
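
To see what this kind of preprocessing produces, the following minimal sketch applies the same two sub-pipelines to a four-row toy frame with one missing value in each column. The names `toy`, `toy_cat`, `toy_num`, and `toy_preproc` and the toy values are assumptions for illustration; `sparse_threshold=0` is set here only so the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one categorical and one numeric column, each with a gap.
toy = pd.DataFrame({
    "<TIME>": pd.Series(["21:00", "22:00", np.nan, "21:00"], dtype=object),
    "<VOL>": [10.0, 20.0, np.nan, 30.0],
})

toy_cat = Pipeline(steps=[
    ("cat_impute", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("onehot_cat", OneHotEncoder(drop="if_binary")),
])
toy_num = Pipeline(steps=[
    ("impute_num", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scale_num", StandardScaler()),
])
toy_preproc = ColumnTransformer(
    [("cat_pipe", toy_cat, make_column_selector(dtype_include=object)),
     ("num_pipe", toy_num, make_column_selector(dtype_include=np.number))],
    remainder="passthrough",
    sparse_threshold=0,  # force a dense array for easy inspection
)

Xt = toy_preproc.fit_transform(toy)
print(Xt.shape)
print(Xt)
```

The categorical gap is filled with the most frequent hour and the two remaining categories collapse to a single indicator column because of `drop='if_binary'`; the numeric gap is filled with the column mean before scaling.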

In [78]:
#generate the whole modeling pipeline with preprocessing
pipe = Pipeline(steps=[('preproc', preproc),
                       ('mdl', LogisticRegression(penalty='elasticnet', solver='saga', tol=0.01))])

#visualization for steps
with config_context(display='diagram'):
    display(pipe)

## Cross-validation with hyperparameter tuning

Now that we have our pipelines, we can use them for cross-validation and hyperparameter tuning.

In [79]:
tuning_grid = {'mdl__l1_ratio' : np.linspace(0,1,5),
               'mdl__C': np.logspace(-1, 6, 3) }
grid_search = GridSearchCV(pipe, param_grid = tuning_grid, cv = 5, return_train_score=True)

In [80]:
tuning_grid

{'mdl__l1_ratio': array([0.  , 0.25, 0.5 , 0.75, 1.  ]),
 'mdl__C': array([1.00000000e-01, 3.16227766e+02, 1.00000000e+06])}
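
As a rough guide to the cost of the search, the grid size can be counted before fitting. The sketch below rebuilds the same grid and uses `ParameterGrid` to count candidates; the variable names `n_candidates`, `n_folds`, and `n_cv_fits` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as above: the "mdl__" prefix routes each value to the 'mdl' pipeline step.
tuning_grid = {"mdl__l1_ratio": np.linspace(0, 1, 5),
               "mdl__C": np.logspace(-1, 6, 3)}

n_candidates = len(ParameterGrid(tuning_grid))  # 5 l1_ratio values x 3 C values
n_folds = 5                                     # matches cv=5 in the grid search
n_cv_fits = n_candidates * n_folds              # excludes the final refit on all training data
print(n_candidates, n_cv_fits)
```

Each additional value in either list multiplies the number of fits, so it is usually worth keeping early grids coarse.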

In [81]:
grid_search.fit(X_train, y_train)

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/_scorer.py", line 418, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/usr/local/lib/python3.8/dist-packages/sklearn/pipeline.py", line 707, in score
    Xt = transform.transform(Xt)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/compose/_column_transformer.py", line 748, in transform
    Xs = self._fit_transform(
  File "/usr/local/lib/python3.8/dist-packages/sklearn/compose/_column_transformer.py", line 606, in _fit_transform
    return Parallel(n_jobs=self.n_jobs)(
  File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preproc',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('cat_pipe',
                                                                         Pipeline(steps=[('cat_impute',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot_cat',
                                                                                          OneHotEncoder(drop='if_binary'))]),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7fb8a2241580>),
                                                                        ('num_pipe',
                                                    

In [None]:
print(grid_search.best_score_)
grid_search.best_params_

In [None]:
pd.DataFrame(grid_search.cv_results_)

## Final fit

The final fit is already available from the grid search, because GridSearchCV refits the best parameter combination on the full training set by default (refit=True). If we want to look at the performance, we can do so. Here is the plain, not very informative, representation of the best model:

In [None]:
grid_search.best_estimator_

## Variable importance

Now we inspect the coefficients of the selected model to reveal any potential insights.

In [None]:
grid_search.classes_

In [None]:
vip = grid_search.best_estimator_['mdl'].coef_[0]
vip

In [None]:
#get names in correct preproc order
cat_names = grid_search.best_estimator_.named_steps['preproc'].transformers_[0][1].named_steps['onehot_cat'].get_feature_names_out()
num_names = grid_search.best_estimator_.named_steps['preproc'].transformers_[1][2]

#create df with vip info
coef_info = pd.DataFrame({'feat_names':np.hstack([cat_names, num_names]), 'vip': vip})

#get sign and magnitude information
coef_info = coef_info.assign(coef_mag = abs(coef_info['vip']),
                             coef_sign = np.sign(coef_info['vip']))

#sort and plot
coef_info = coef_info.set_index('feat_names').sort_values(by='coef_mag', ascending=False)
sns.barplot(y=coef_info.index, x='coef_mag', hue='coef_sign', data=coef_info, orient='h', dodge=False);

## Performance metrics on test data


Here we evaluate the selected model on the held-out test set. The classification report and the confusion matrix give more specific insight into per-class performance.

In [None]:
print(classification_report(y_test, grid_search.best_estimator_.predict(X_test)))

In [None]:
cm = confusion_matrix(y_test, grid_search.best_estimator_.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=grid_search.classes_)
disp.plot()

plt.show()

## Try it yourself!

Now that we've seen the power of pipelines in sklearn, let's now try implementing our own pipelines.

In [None]:
# Try implementing a pipeline where we use median imputation for numeric columns instead of mean imputation.

#individual pipelines for differing datatypes
cat_pipeline = Pipeline(steps=[('cat_impute', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                               ('onehot_cat', OneHotEncoder(drop='if_binary'))])
num_pipeline = Pipeline(steps=[('impute_num', SimpleImputer(missing_values=np.nan, strategy='median')),
                               ('scale_num', StandardScaler())])

#establish preprocessing pipeline by columns
preproc = ColumnTransformer([('cat_pipe', cat_pipeline, make_column_selector(dtype_include=object)),
                             ('num_pipe', num_pipeline, make_column_selector(dtype_include=np.number))],
                             remainder='passthrough')

With this new pipeline, now train a Random Forest model. Refer to the documentation for the parameters for the random forest classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Is the performance better? 

In [None]:
#generate the whole modeling pipeline with preprocessing
pipe = Pipeline(steps=[('preproc', preproc),
                       ('mdl', RandomForestClassifier())])

#visualization for steps
with config_context(display='diagram'):
    display(pipe)

Now perform cross-validation, tuning the n_estimators parameter over the values [100, 200, 500] and the max_depth parameter over the values [10, 15, 20] for the random forest classifier.

In [None]:
tuning_grid = {'mdl__n_estimators' : [100, 200 ,500],
               'mdl__max_depth': [10, 15, 20] }
grid_search = GridSearchCV(pipe, param_grid = tuning_grid, cv = 5, return_train_score=True)

In [None]:
tuning_grid

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
print(grid_search.best_score_)
grid_search.best_params_

In [None]:
pd.DataFrame(grid_search.cv_results_)

In [None]:
grid_search.best_estimator_

In [None]:
grid_search.classes_

In [None]:
vip = grid_search.best_estimator_['mdl'].feature_importances_
vip

In [None]:
#get names in correct preproc order
cat_names = grid_search.best_estimator_.named_steps['preproc'].transformers_[0][1].named_steps['onehot_cat'].get_feature_names_out()
num_names = grid_search.best_estimator_.named_steps['preproc'].transformers_[1][2]

#create df with vip info
coef_info = pd.DataFrame({'feat_names':np.hstack([cat_names, num_names]), 'vip': vip})

#get sign and magnitude information
coef_info = coef_info.assign(coef_mag = abs(coef_info['vip']),
                             coef_sign = np.sign(coef_info['vip']))

#sort and plot
coef_info = coef_info.set_index('feat_names').sort_values(by='coef_mag', ascending=False)
sns.barplot(y=coef_info.index, x='coef_mag', hue='coef_sign', data=coef_info, orient='h', dodge=False);

In [None]:
print(classification_report(y_test, grid_search.best_estimator_.predict(X_test)))

In [None]:
cm = confusion_matrix(y_test, grid_search.best_estimator_.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=grid_search.classes_)
disp.plot()

plt.show()