A Kettle plugin that compares different FOREX markets over a forecast window selected by the user. A Spark MLlib Random Forest model predicts each market's value. Features are generated through technical analysis, using TA-Lib.
- Maven, version 3+
- Java JDK 1.8
- This settings.xml in your ~/.m2 directory
This part refers to the installation of the forexprediction plugin in the PDI platform.
- Once the repository is cloned/downloaded, go to the extracted main folder to build the plugin
cd pdi-plugin-forexpredicton-master
mvn package
- After building the plugin, a zip file will be generated in the target folder. Go to that folder and unzip the created file.
cd assemblies/plugin/target/
unzip kettle-forexpredicton-plugin-8.0-SNAPSHOT.zip
After unzipping the generated plugin, copy the extracted contents into the plugins folder of the Pentaho Kettle PDI project.
The Forex Prediction plugin should now be available in the Spoon application as a Job Step under the Big Data plugin section.
As stated before, pdi-plugin-forexprediction is a forecasting plugin that can compare different FOREX markets. There are some specific guidelines to follow in order to run the plugin successfully. Each field of the plugin dialog is explained below:
Input folder path: The plugin receives N CSV files, one for each of the N markets that the user wants to evaluate. To provide this data as input, store all the CSV files in the same folder and specify that folder's path in the plugin dialog. The folder should contain only the CSV files that the plugin is going to use. The input CSV files should be formatted like the ones provided in the sample folder. The system does not accept non-numerical features, so check the values in each data row carefully.
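Because non-numerical features are rejected, it can help to check a CSV before handing it to the plugin. The following is a minimal sketch of such a check, not part of the plugin itself. It assumes the file has a header row, and the column names (`Date`, `Open`, `Close`) and the `skip_columns` parameter are hypothetical; the real layout is the one shown in the sample folder.

```python
import csv
import io


def find_non_numeric(csv_text, skip_columns=()):
    """Return (row_number, column_name, value) for every cell that is not a number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column in skip_columns:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                # TypeError covers missing or surplus cells, ValueError covers text
                problems.append((row_number, column, value))
    return problems


# Hypothetical sample data: the column names are assumptions, not the plugin's format.
sample = "Date,Open,Close\n2018-01-01,1.20,1.21\n2018-01-02,n/a,1.22\n"
print(find_non_numeric(sample, skip_columns=("Date",)))
```

An empty list means every checked cell parsed as a number. To check a real file, read its contents and pass the text to `find_non_numeric`.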
Output folder path: The user should also specify an output folder. This folder will contain N CSV files, one with the individual forecast of each market. An extra CSV that compares the markets is also written, along with a market evaluation .txt file.
Prediction Steps: This parameter specifies how many days the plugin is going to predict.
Save created models: If "Save created models" is checked, each trained model is saved in the selected output folder. Each model is saved as a folder with the name of the CSV that was used to train it.
Models file path: If the user already has pre-trained models for the input CSV files, the plugin can use them and skip the training phase. The mechanism is similar to the one used for the output folder path: the user provides the path to a folder where the pre-trained models are stored. Each model should have the same name as the corresponding CSV. Otherwise the system cannot associate them and will perform unnecessary training.
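The name-matching rule can be pictured with a short sketch. This is only an illustration, not the plugin's code: the function `match_models` is hypothetical, and it assumes that "same name" means the CSV file name without its extension.

```python
from pathlib import PurePath


def match_models(csv_names, model_names):
    """Split markets into those with a pre-trained model and those needing training."""
    available = set(model_names)
    reuse, train = [], []
    for name in csv_names:
        market = PurePath(name).stem  # assumption: the model folder drops ".csv"
        if market in available:
            reuse.append(market)
        else:
            train.append(market)
    return reuse, train


# Hypothetical market names, used only for illustration.
reuse, train = match_models(["EURUSD.csv", "GBPUSD.csv"], ["EURUSD"])
print("reuse:", reuse)
print("train:", train)
```

Under that assumption, a model folder whose name differs from its CSV ends up in the `train` list, which is the unnecessary training mentioned above.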
Feature config file: The model hyperparameters can be specified by providing the path to a simple Java Properties file. A sample file called config.properties is available at pdi-plugin-forexpredicton-master\assemblies\plugin\src\main\resources. The same file can be used to tweak the number of days that each Technical Indicator considers for its calculation. If the user does not specify a .properties file, the system uses pre-defined default values.
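The fallback to defaults can be sketched as follows. This is a simplified illustration and not the plugin's own loader: the keys `numTrees` and `maxDepth` and their values are hypothetical (see the sample config.properties for the real ones), and the parser handles only plain `key=value` lines and `#` or `!` comments, a subset of the Java Properties format.

```python
# Hypothetical keys and default values, chosen only for illustration.
DEFAULTS = {"numTrees": "20", "maxDepth": "5"}


def load_properties(text, defaults):
    """Overlay key=value pairs from a properties text on top of the defaults."""
    settings = dict(defaults)
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":  # skip blank lines and comments
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


user_file = "# user overrides\nnumTrees = 50\n"
print(load_properties(user_file, DEFAULTS))
```

Any key missing from the user's file keeps its default, and an empty file yields the defaults unchanged.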
The system produces several kinds of output files:
Output file: The output file compares the different markets. Each market is ranked according to its volatility over the forecasted days. For each market, the file shows the trend direction, the approximate price change, the volatility and the rank. These values are calculated from the results in each individual market output file, explained below.
Market output file: As stated previously, when the plugin finishes, an individual file is created with metrics for each market. This file is named ####OutputFile.csv, where #### is the name of the market, extracted from the original CSV name. It should be used as a complement to the general OutputFile.csv, where each market is ranked according to volatility. It contains daily information covering the prediction size selected by the user. The available metrics are the daily prediction, the market trend direction, the percentage of price change, and the volatility in percentage, this time with respect to the last 10 days. The number of rows in the file is always twice the prediction size selected by the user. The extra rows are used to evaluate the market.
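As a quick illustration of the naming and sizing rules, here is a minimal sketch. It assumes that the market name is the input file name without its extension and that the row count refers to data rows; both helper functions are hypothetical and not part of the plugin.

```python
from pathlib import PurePath


def market_output_name(input_csv):
    """Build the per-market output file name from the input CSV name."""
    return PurePath(input_csv).stem + "OutputFile.csv"


def expected_rows(prediction_steps):
    """Forecast rows plus the same number of extra rows used for evaluation."""
    return 2 * prediction_steps


# Hypothetical input: a market file named EURUSD.csv and 5 prediction steps.
print(market_output_name("EURUSD.csv"), expected_rows(5))
```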
Evaluation file: To give the user some information about how good the model predictions were, an evaluation .txt file is created. This file displays classical regression evaluation metrics: RMSE, MAE and MSE. It also provides the highest and lowest value of the market.
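For reference, the three metrics follow their standard definitions, sketched below in plain Python. This is not the plugin's implementation (which runs on Spark), and the function name and sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, MAE and RMSE between two equally long sequences of values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)   # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    return {"MSE": mse, "MAE": mae, "RMSE": math.sqrt(mse)}


# Made-up values, only to show the call.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 6.0])
print(metrics)
```

RMSE is the square root of MSE, so it is expressed in the same units as the market values, which makes it easier to compare against the highest and lowest values reported in the same file.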
There is also a Python script called plot.py, available at pdi-plugin-forexpredicton-master\assemblies\plugin\src\main\resources. This script can be used to visually inspect how well the model is performing in each market. The script asks for two CSV files: one market output file, and the original market file (the one without predictions). The plot should be similar to the one presented below. It is perfectly possible to use another plotting tool.
The orange and green lines represent the predictions made by the model: the actual forecast and the evaluation, respectively. The blue line represents the market values from the original CSV.
To be implemented
- Create a method to evaluate how good the market predictions are
- Create a cross validation mechanism to improve results
- Implement a hyperparameter search mechanism
- Provide automatic information extraction through an API instead of manually providing each CSV
- Make the system able to work with information in different time frames: second, weekly
- Add support for running in distributed mode over a Spark cluster
- Capability of having different settings for each market