# Machine Learning Engineer Nanodegree
## Capstone Project
Riccardo Rizzari  
May 2017

## I. Definition

### Project Overview
As stated in the capstone proposal, quantitative investing and the quest for an algorithm or a mathematical formula to beat the market have a long history. One of the earliest documented attempts is the famous 1967 book *Beat the Market: A Scientific Stock Market System* by E. O. Thorp and S. T. Kassouf.
In recent years, the number of firms that use algorithms to trade financial markets has grown steadily.

On the other hand, there is strong scepticism about the possibility of beating or predicting the market consistently.
There are also well-known theoretical results in Financial Economics on this topic, such as the **Efficient Market Hypothesis** (https://en.wikipedia.org/wiki/Efficient-market_hypothesis).
According to the Efficient Market Hypothesis (EMH), current asset prices in financial markets already reflect all available information. Therefore it should not be possible to profit from the market in a consistent way.

Although the EMH's implication that the market cannot be beaten has been criticized (for instance, by pointing out that the EMH does not explicitly say that you cannot beat the market), the financial industry has widely applied the principle of the *impossibility of arbitrage* opportunities in the market. One example is the *Black-Scholes model* (https://en.wikipedia.org/wiki/Black%E2%80%93Scholes_model) for option pricing (particularly in the assumption that discounted stock prices can be modelled as *martingales* under the risk-neutral measure).

As a commodity trader in the investment banking industry, I find it of paramount importance to deploy quantitative techniques to try to understand where the market is heading in the near future.
Because of the specific markets I trade, the analysis in this capstone project focuses on Commodity Futures markets.

### Problem Statement

#### Problem Definition
As we have seen in the previous section, the problem to be solved is whether or not it is possible
to consistently predict the market.

Specifically, a way of restricting the scope of the analysis is: **given the information we have today,
can I predict whether the market is going up or down tomorrow?**

As anticipated above, I will restrict the analysis to liquid Commodity Futures. I will start by considering one of the most liquid and heavily traded markets: the *West Texas Intermediate Crude Oil* futures market (https://en.wikipedia.org/wiki/West_Texas_Intermediate) (or 'WTI').

A simple way to picture the problem is the following: an investor is looking for an investment strategy in the Oil market. One of the very first steps that might be useful is to have a methodology to assess whether the market is going to trade higher or lower from the moment a position in the market is entered.

A natural question is: what information is available for this methodology? 
The answer is: all information available at present. 
<br> Information can be of a technical nature (such as technical analysis indicators) or of a fundamental nature (such as data regarding world Oil production). <br>

So, we can summarize the purpose of the capstone project as follows:

*Given a series of past data for a given commodity futures market, can we find a way to predict whether the next observation will show an upward or a downward move? Or equivalently: given the information we have today, can we predict whether the market is going up or down tomorrow?*

#### Datasets and Inputs

As input features, I will select a set of technical indicators together with some data on trading activity.
The dataset will be a time series, with every row being a collection of technical and fundamental indicators for a specific date in the past.

In particular, I will start by considering data for the Oil market. These are the columns of the dataset:

- 'Dates': given that we are working with time series, we store the dates in the first column
- 'y': the labels. `1` if the market on the corresponding date was up, `0` if it was down.
##### Price indicators ####
- 'CHG_PCT_1D': the percentage change of the market on that specific trading date
- 'CHG_PCT_5D': the percentage change of the market on the last five trading days
- 'PX_OPEN': the opening price
- 'PX_HIGH': the highest price during the trading day
- 'PX_LOW': the lowest price
- 'PX_VOLUME': the total volume of contracts traded
- 'PX LAST': the closing price
##### Technical indicators ####
- 'MOV_AVG_5D': the five-day moving average
- 'MOV_AVG_30D': the moving average calculated over the previous 30 days
- 'RSI_9D': the 9-day Relative Strength Index, which is defined according to the following formula:
$$ RSI = 100 - \frac{100}{1 + Avg_{Up} / Avg_{Down}} $$
where $ Avg_{Up} $ is the average of all day-on-day changes on days when the security closed up during the period, and
$ Avg_{Down} $ is the average of all day-on-day changes (taken in absolute value) on days when it closed down during the period.

- 'RSI_14D': the RSI for the last 14 trading days
- 'RSI_30D': the RSI for the last 30 trading days
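
As an illustration of the RSI formula above, here is a minimal sketch in pandas. It assumes simple (unsmoothed) rolling averages over the window, with days that do not match counted as zero and down moves taken in absolute value. The RSI values in the dataset come from Bloomberg, whose calculation may use a different smoothing. The function name `rsi` and the toy prices are hypothetical.

```python
import pandas as pd

def rsi(close: pd.Series, window: int = 9) -> pd.Series:
    """Relative Strength Index using simple rolling averages (an assumption)."""
    change = close.diff()                 # day-on-day change
    up = change.clip(lower=0.0)           # up moves, 0 on down days
    down = (-change).clip(lower=0.0)      # down moves as positive numbers
    avg_up = up.rolling(window).mean()
    avg_down = down.rolling(window).mean()
    return 100.0 - 100.0 / (1.0 + avg_up / avg_down)

# Toy closing prices (made up for illustration only)
prices = pd.Series([10.0, 11.0, 10.0, 12.0, 11.0])
rsi_4d = rsi(prices, window=4)
print(rsi_4d)
```

The first `window` entries are undefined because a full window of day-on-day changes is not yet available.
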
##### Volatility and market activity indicators
- 'VOLUME_TOTAL_CALL': the total daily volume traded in call options
- 'VOLUME_TOTAL_PUT': the total daily volume traded in put options
- 'VOLATILITY_10D': the 10-day realised volatility
- 'VOLATILITY_30D': the 30-day realised volatility
- '30DAY_IMPVOL_105.0%MNY_DF': the 105% strike implied volatility for options expiring in 30 days' time
- '30DAY_IMPVOL_100.0%MNY_DF': the 100% strike implied volatility for options expiring in 30 days' time
- '30DAY_IMPVOL_95.0%MNY_DF': the 95% strike implied volatility for options expiring in 30 days' time
- '30DAY_IMPVOL_110.0%MNY_DF': the 110% strike implied volatility for options expiring in 30 days' time
- '30DAY_IMPVOL_90.0%MNY_DF': the 90% strike implied volatility for options expiring in 30 days' time

##### Dataset size
I will have a separate dataset for each commodity futures market, each in CSV format. Each CSV file is approximately 70 KB in size and consists of 499 entries, one per trading day.
Rows of data which display any '#N/A' or error in the CSV file will be dropped after being loaded in Pandas.
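
The loading and cleaning step described above could look roughly like the sketch below. It reads a tiny in-memory stand-in instead of the real file; the values, the date format and the subset of columns are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for one of the market CSV files (values are made up)
csv_text = """Dates,y,CHG_PCT_1D,PX_VOLUME
2017-04-05,1,0.24,650000
2017-04-06,1,1.08,#N/A
2017-04-07,1,1.04,700000
2017-04-10,1,1.61,620000
"""

# '#N/A' cells are read as missing values, then any incomplete row is dropped
raw = pd.read_csv(io.StringIO(csv_text), na_values=["#N/A"], parse_dates=["Dates"])
clean = raw.dropna().reset_index(drop=True)
print(len(raw), "rows loaded,", len(clean), "rows kept")
```

Other spreadsheet error strings, if present, would need to be added to `na_values` in the same way.
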

##### Data source
I download the data and generate the CSV file for each market via an Excel spreadsheet (named *Bbg_Market_Data.xlsm*). This Excel spreadsheet has a size of 923 KB and contains Bloomberg Excel formulas that download data via the Bloomberg Excel API.
All the features are downloaded from Bloomberg. The labels are instead calculated in the spreadsheet: looking at the column 'CHG_PCT_1D', the label is `1` if the daily percentage change was positive, and `0` otherwise.
Once the dataset is loaded in Pandas, I shift the labels with the Pandas `shift` function so that, at every time step, the corresponding label is `1` if the **next** trading day showed a positive performance and `0` otherwise.
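
The label construction and the one-day shift can be sketched as follows. The column name `y_next` and the sample values are hypothetical; the project code may simply overwrite 'y' instead.

```python
import pandas as pd

df = pd.DataFrame({"CHG_PCT_1D": [0.5, -1.2, 0.3, -0.1, 0.8]})  # made-up values

# Same-day label, as computed in the spreadsheet: 1 if the day closed up
df["y"] = (df["CHG_PCT_1D"] > 0).astype(int)

# Next-day label: shift(-1) moves tomorrow's label onto today's row
df["y_next"] = df["y"].shift(-1)

# The most recent row has no future label yet, so it is dropped
df = df.iloc[:-1].copy()
df["y_next"] = df["y_next"].astype(int)
print(df)
```
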

#### Intended Solution
The problem of predicting the market trend on a given day, given the information available from the previous days, can be viewed as a binary classification problem.
In fact, for each dataset `X` representing all available information at time `t0` (this dataset `X` will be a time series), there will be a label `y` which is `1` if the market on the next day goes up, `0` otherwise.
We can therefore treat the market sentiment prediction problem with the tools from *supervised learning*.
My purpose is to compare different supervised learning algorithms, and calculate the accuracy score for each of the methods deployed.
If the accuracy is greater than 50% (the coin flip), we have a good candidate for a market predictor. We will then check whether this algorithm is robust enough to keep its accuracy above 50% consistently, i.e. as time moves forward and the algorithm faces data it has never seen before.
We will also have to check that the success of the algorithm does not depend on the particular market chosen, say, the oil market, but generalizes well to other markets, such as corn or gold.

The plan of the capstone is to start with a simple model, such as Gaussian Naive Bayes, and progressively increase the complexity of the model. I will explore `4` models in total. In ascending order of complexity, these models are:

- Gaussian Naive Bayes
- Random Forest
- Feed-Forward Multi-Layered Perceptron (MLP) Neural Network
- LSTM (Long Short-Term Memory) Neural Network

I will scale up model complexity from Naive Bayes to LSTM to see if there is a gain in performance when the model complexity is increased.
Ideally, I would expect to see better performance when a more complex model is finely tuned. As outlined in the section below, performance will be evaluated under the accuracy score.
This means that, say, if we get a 51% accuracy with Naive Bayes, I expect to be able to increase the accuracy towards, say, 55% with Random Forest and MLP and, hopefully, to get something in the 60% ballpark with LSTM.

For the first three models, I will check the performance with two different approaches: a) the Walk-Forward validation; and b) the Static Validation.

For LSTM instead I will only check the more realistic Walk-Forward validation.

In principle, the Walk-Forward validation should be more accurate: new information is made available immediately, so that the model can re-fit to it before predicting the next state. In the Static validation, by contrast, the model is fitted only once on the training set and that single fit is used to predict all labels in the test set. We can therefore view any over-performance of the Static validation as a failure of the model to take new information into account accurately as it moves along the time series.

#### Tasks Outline
In summary, the outline of the tasks of the project is the following:

1) We start with the Oil Market, loading the data in Pandas;

2) We explore the data, checking a few basic statistics, the composition of the dataset (number of rows and columns) and whether one or more features are skewed.

3) If any feature is skewed, we apply a log transformation to it (see the 'flatten_skew' function).

4) We adjust the data so that each label represents the market trend one step into the future; we drop the last (i.e. the most recent) data row of the dataset because we do not have a future label for it yet.

5) We define three benchmarks: a) the single trend follower; b) the coin flip; c) the Persistence Forecast model. The first two benchmarks do not actually depend on the data. The single trend follower simply assumes the market will follow only one trend in all the next outcomes, e.g. it always predicts the market to go higher; the coin flip instead randomly chooses one trend for each period (it literally flips a coin to predict whether the market will be higher or lower).
Finally, the Persistence Forecast model checks what the trend was in the previous period and predicts that the market will have the same trend in the next time step.

6) We define a 'train and test' function that will be called by each model except LSTM, which will have its own dedicated function for training and testing, given the specificity and complexity of this model.

7) We run Naive Bayes, Random Forest and MLP in both Walk-Forward and Static validation mode and check the accuracy score. We optimize Random Forest and extract the feature importances for the sets of parameters yielding the best and the worst accuracy, to gain additional information about feature selection.

8) We build and test the LSTM model in Walk-Forward Validation only.

9) We compare accuracies for all the models and introduce the 'modified_accuracy' function to adjust the accuracy for the actual size of the market move on each prediction. For example, if a classifier predicts `0`, i.e. a down-trend, and the percentage change for that sample was `+5%`, that prediction is much worse than if the classifier predicts `0` and the percentage move is only `+0.25%`.

10) We repeat all the steps above for all other markets we have data for (i.e. Corn, Gold, Wheat, Coffee and Soybeans) and analyze performances for each market.


### Metrics
As anticipated above, the main metric of choice will be the *accuracy score*. Only towards the end of the analysis will we make use of a *modified_accuracy* to take into account the amplitude of the market move behind each prediction; from a pure binary classification standpoint, the accuracy score is the simplest and best metric.

The reason why the accuracy score is the best metric for the binary classification problem of the capstone is that in such a trading environment we do not need to distinguish between type I and type II errors. We do not care whether an error is a false positive or a false negative because, ideally, there is no significant distinction between buying and selling a futures contract.
If the model predicts `1` (i.e. 'uptrend') and we invest '`X`' USD in a long futures position (i.e. we buy) based on this signal, and the market performs `-1%`, we lose `1%` of '`X`' (our initial investment). This is exactly the same outcome as when the model predicts `0` (lower trend), we short '`X`' USD of futures equivalent value and the market performs `+1%`: again, we lose `1%` of the invested amount. Because of this symmetry, we only care about the accuracy score, and not about the 'direction' of the error.
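
For reference, with $ TP $, $ TN $, $ FP $ and $ FN $ denoting the counts of true positives, true negatives, false positives and false negatives over $ N $ predictions (standard notation, not used elsewhere in this report), the accuracy score is the ratio below. Both kinds of error enter it with equal weight, which matches the argument made above.

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{N} $$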


## II. Analysis

### Data Exploration
The first market we investigate is the Crude Oil market. Daily data for trading days between 10Apr2015 and 10Apr2017 are stored in the dataCL.csv file.

I start by exploring some generic statistical properties of the dataset.
We have 499 rows of data. We will use the latest 249 data rows as the test set in both the Walk-Forward validation and the Static validation (`249` = `499 // 2` in Python).
Out of the 499 trading days, the market has trended higher on 233 days (i.e. `46.69%` of the time).
If we restrict the scope to the test set instead, the market has gone up `50.20%` of the time. This is extremely close to what we would get with a coin flip and justifies the use of the coin flip as one of the naive predictors.
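
The split sizes and the up-day shares quoted above can be reproduced with a few lines. The sketch below uses random stand-in labels rather than the real WTI data, so only the split arithmetic and the `233 / 499` ratio correspond to figures in this report.

```python
import numpy as np

n_rows = 499
n_test = n_rows // 2          # 249 rows in the test set
n_train = n_rows - n_test     # 250 rows in the initial training set

# Share of up days quoted above: 233 up days out of 499
up_share_reported = round(100 * 233 / 499, 2)

# Synthetic stand-in labels (random, NOT the real WTI data)
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=n_rows)

up_share_all = y.mean()              # fraction of up days over the whole sample
up_share_test = y[-n_test:].mean()   # fraction of up days in the test set only
print(n_train, n_test, up_share_reported)
```
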

### Algorithms and Techniques
As anticipated above, I will explore `4` models in total:

- Gaussian Naive Bayes
- Random Forest
- Feed-Forward Multi-Layered Perceptron (MLP) Neural Network
- LSTM (Long Short-Term Memory) Neural Network

The main reason behind the choice of these algorithms is the following.
<br> If the Efficient Market Hypothesis is correct, it should be impossible to achieve a forecast that beats the naive predictors listed above. In particular, any machine learning algorithm should perform just as poorly as a coin flip.
Therefore there should be no difference: whatever supervised learning algorithm we use, we should never get too far from a `50%` accuracy score. 
<br> To test this, I start from a simple Gaussian Naive Bayes, which relies on a strong assumption: all features are independent of each other given the class. This is clearly an over-simplification, as we can see that, for example, the total volume in call options tends to be higher whenever the total volume of put options is higher as well (and vice versa).
<br> Then I move to a Random Forest classifier. The reason for this choice is the ensemble nature of this classifier, which should make it less prone to overfitting.
<br> Then I move to Neural Networks, which in theory should be able to capture structural non-linearities in the data. I start with a simple feed-forward neural network. I will not tune its parameters in the project, but I will provide an already fine-tuned version of the MLP in both the Walk-Forward and the Static validation mode.

Finally, I move on to the model that should be best suited to handling time series: the LSTM (Long Short-Term Memory) neural network.

**INSERT EQUATION DESCRIBING THE MODEL AND THE PICTURE WHICH SHOWS THE BASIC PROCESS **

Again, if the Efficient Market Hypothesis holds, none of the models above should significantly outperform the others.

### Benchmark
As we already indicated above, in the market sentiment prediction problem, we could consider three benchmark models:

1) the random walk/coin flip, i.e. at every step when we try to predict the market trend, we flip a coin. At the end of the time series (*the walk*) we calculate the accuracy against the real trend we observed in the market, i.e. we calculate how many times the coin flip guessed the market trend correctly over the total number of steps in the time series.

2) the Naive predictor, i.e. a predictor which always predicts the same trend (for example, that the market is always rising)

However, 1) (the coin flip) seems to be a more plausible choice of benchmark. Under the Efficient Market Hypothesis, flipping a coin should perform no differently from any algorithm which tries to analyze the past to predict the future.
The Naive predictors (always up or always down) could be distorted, because the market could in fact be trending higher or lower for many consecutive days, and we want our algorithm to capture this effect. Conversely, we could have a market which trades roughly around the same levels for many days in a row. In this case, both Naive predictors would perform poorly, while the random walk should still provide a solid benchmark to beat.

3) Another interesting benchmark is the *Persistence Forecast model*: this model checks what was the trend in the previous period and predicts the market will have the same trend for the next single time step.
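
The three benchmarks described in this section can be sketched in a few lines of NumPy. The labels and the helper name `accuracy` are made up for illustration; the project's own benchmark functions may be organised differently.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the observed trend."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

y = np.array([1, 1, 0, 0, 1])        # made-up up/down labels, oldest first

# 1) coin flip: an independent random guess at every step
rng = np.random.default_rng(42)
coin_pred = rng.integers(0, 2, size=len(y))

# 2) naive predictor: the market is always predicted to rise
always_up_pred = np.ones_like(y)

# 3) persistence: the next step is predicted to repeat the current trend
persistence_pred = y[:-1]            # predictions for steps 1..n-1
persistence_true = y[1:]             # what actually happened at steps 1..n-1

acc_coin = accuracy(y, coin_pred)
acc_up = accuracy(y, always_up_pred)
acc_persist = accuracy(persistence_true, persistence_pred)
print(acc_coin, acc_up, acc_persist)
```

On a long series the coin flip should hover around 50%, while the other two depend on how strongly the market trends.
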


## III. Methodology

### Data Preprocessing
The first step is to check the skew across the features of the dataset. Skewed features could lead to poor convergence of the classifiers we use, and one way to prevent that is to apply a log transformation to those features.
In doing so, we have to take extra care if some of the feature values are close to `0`. To that end, in the 'flatten_skew' function, I set a critical skew level of +/- 0.9 (i.e. a log transformation is applied only to features whose skew is greater than 0.9 or less than -0.9). In addition, all the features that are log-transformed are printed at the end of the function, so that the user can visually check whether any of the features with values potentially close to zero has been transformed (above all the 'CHG_PCT_1D' and the 'CHG_PCT_5D' features).
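
A minimal sketch of what such a function could look like is given below. The signature is hypothetical, and the guard that only strictly positive columns are transformed is an added assumption: the approach described above instead relies on a visual check of the printed list.

```python
import numpy as np
import pandas as pd

def flatten_skew(df: pd.DataFrame, critical_skew: float = 0.9) -> pd.DataFrame:
    """Log-transform columns whose skew exceeds +/- critical_skew (sketch)."""
    out = df.copy()
    transformed = []
    for col in out.columns:
        skewed = abs(out[col].skew()) > critical_skew
        # Assumption: only strictly positive columns are transformed, which
        # leaves columns that cross zero (e.g. CHG_PCT_1D) untouched
        if skewed and (out[col] > 0).all():
            out[col] = np.log(out[col])
            transformed.append(col)
    print("Log-transformed features:", transformed)
    return out

# Made-up data: one right-skewed positive column, one roughly symmetric column
demo = pd.DataFrame({
    "PX_VOLUME": [1e3, 2e3, 1.5e3, 1.2e3, 5e5, 1.1e3],
    "CHG_PCT_1D": [0.5, -0.4, 0.1, -0.2, 0.3, -0.3],
})
flat = flatten_skew(demo)
```
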

The next preprocessing step is scaling. For every feature in the dataset, we scale all the data to lie in a given range.
This range is `[0,1]` by default, but for LSTM, for example, we use the range `[-1,1]`, as this is the range of the `tanh` function, which is used as the activation function in the input activation and output layer.

Feature scaling is very important: without it, the optimization behind some of the classifiers (in particular the gradient-based `MLP` and `LSTM`) may struggle to converge to a solution. Tree-based models such as `Random Forest` are largely insensitive to feature scaling, but they are trained on the same scaled data.

In the project, the scaling is performed via the `scale()` function, which is built upon the `MinMaxScaler` in `sklearn`.
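
A wrapper of this kind could be sketched as follows. The signature and the returned scaler are assumptions; only the use of `MinMaxScaler` and the two ranges come from the text above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def scale(X, feature_range=(0, 1)):
    """Rescale every column of X into feature_range; also return the scaler."""
    scaler = MinMaxScaler(feature_range=feature_range)
    return scaler.fit_transform(X), scaler

X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])   # made-up features
X_01, _ = scale(X)                          # default range [0, 1]
X_pm1, _ = scale(X, feature_range=(-1, 1))  # range [-1, 1], as used for LSTM
print(X_01)
print(X_pm1)
```

One design point worth considering: fitting the scaler on the training rows only, and reusing it on later rows, avoids leaking information from the test period into the features.
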

The last preprocessing step consists in removing the last (i.e. the most recent) row of the dataset, as we do not have a label for this observation (it lies in the future relative to the period we are considering).
We also shift the 'y' column of the dataset to obtain the set of labels. By shifting this array by one day, each row is paired with the outcome of the next trading day, which is what we are trying to predict.
The 'y' column is calculated in the spreadsheet from the sign of 'CHG_PCT_1D', so it reflects whether the market went up or down on a given trading day.

### Implementation
All of the algorithms except `LSTM` are implemented in `sklearn`. `LSTM` instead uses `Keras` with a `TensorFlow` backend.

The main step of the implementation is the `train_test_model` function. This function is used to train and test all classifiers except the `LSTM`, which uses its own train-and-test function. The `train_test_model` function has four main components: 1) it performs the `Walk-Forward` split, i.e. it starts from the initial training set (which for the Oil market is the first/oldest 250 trading days); 2) it fits a generic `sklearn` classifier to the training set; 3) it calculates the prediction for the test set, which is composed of only the next trading day (i.e. the training set represents the *past days*, and the test set represents *today*); and 4) it keeps looping one step (i.e. one trading day) ahead at a time, adding the newly acquired information to the training set, refitting the classifier and predicting the next day's outcome.
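
The four steps above can be sketched as a short loop. This is not the project's `train_test_model` itself: the function name `walk_forward`, its signature and the random stand-in data are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def walk_forward(clf, X, y, n_train):
    """Refit clf on all days before day i, then predict day i, for i >= n_train."""
    preds = []
    for i in range(n_train, len(X)):
        clf.fit(X[:i], y[:i])                      # training set = all past days
        preds.append(clf.predict(X[i:i + 1])[0])   # test set = today only
    return np.array(preds)

# Synthetic stand-in data (random, NOT the real market dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)

wf_preds = walk_forward(GaussianNB(), X, y, n_train=30)
wf_acc = accuracy_score(y[30:], wf_preds)
print(f"walk-forward accuracy: {wf_acc:.2f}")
```

Because the classifier is refitted at every step, the cost grows with the number of test days, which matters most for the slower models.
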

As I wrote above, each `sklearn` algorithm is run both in Walk-Forward and in Static Validation mode. The Static validation simply fits the model once and for all on the first half of the available data, and then tries to predict all the subsequent labels without incorporating any new information. The Static validation is unrealistic: we always want to make new information available to a classifier and, potentially, we would want the classifier to disregard this new information if it is not useful. We also check the performance of the algorithms in Static Validation as a possible confirmation of the unpredictability of market sentiment, in accordance with the Efficient Market Hypothesis.
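
For comparison, the Static validation reduces to a single fit followed by a single batch prediction. The sketch below uses random stand-in data and hypothetical variable names.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (random, NOT the real market dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)
n_train = len(X) - len(X) // 2       # oldest half of the rows for training

clf = GaussianNB().fit(X[:n_train], y[:n_train])   # fitted once, never refitted
static_preds = clf.predict(X[n_train:])            # all later days at once
static_acc = accuracy_score(y[n_train:], static_preds)
print(f"static accuracy: {static_acc:.2f}")
```
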

As anticipated, the `LSTM` model is implemented in Walk-Forward validation only. Its `Keras` implementation is essentially the same as the `train_test_model` for `sklearn` classifiers (i.e. it follows the same walk-forward process as the `train_test_model`  function).

As the `Random Forest` classifier is quite rich in terms of parameters, we perform an optimization over the parameter set and extract the feature importances for the best and the worst (in terms of accuracy score) Random Forest classifier.
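
The optimization described above can be sketched as a loop over a parameter grid, using the values listed for Random Forest in this section. Several simplifications are assumptions of this sketch: it uses a single static split and random stand-in data, `"sqrt"` replaces `auto` (the two were equivalent for classifiers, and `auto` is no longer accepted by recent scikit-learn releases), and the impurity threshold is left out because that parameter has also been removed from recent releases.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "n_estimators": [5, 10, 20],
    "criterion": ["gini", "entropy"],
    "max_features": [None, "sqrt", "log2"],   # "sqrt" stands in for "auto"
})

# Synthetic stand-in data with hypothetical feature names
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = rng.integers(0, 2, size=120)
names = [f"f{i}" for i in range(X.shape[1])]
X_tr, y_tr, X_te, y_te = X[:60], y[:60], X[60:], y[60:]

results = []
for params in grid:
    clf = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    results.append((clf.score(X_te, y_te), params, clf.feature_importances_))

results.sort(key=lambda r: r[0])          # ascending accuracy
worst, best = results[0], results[-1]

def top5(importances):
    """Names of the five features with the largest importance."""
    return [names[i] for i in np.argsort(importances)[::-1][:5]]

best_top5, worst_top5 = top5(best[2]), top5(worst[2])
print(best[0], best[1], best_top5)
```
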

Let's look at the implementation details more closely.

- **Gaussian Naive Bayes**. In this case we just use the standard `sklearn` implementation without specifying prior probabilities for the classes (this is justified by the fact that, on average, the number of up trends in the market was roughly the same as the number of down trends).


- **Random Forest**. We start from the basic `sklearn` implementation, which uses `10` estimators, the `gini` criterion as the loss to minimize (i.e. the algorithm will try to minimize the Gini impurity), `max_features = auto`, i.e. equal to `sqrt(number of features)`, and a `min_impurity_split` (the threshold for early stopping in the tree growth) of `1e-7`. 
<br> We later run a parameter optimization over: `n_estimators` (over the range `[5,10,20]`), `criterion` (`gini` vs `entropy`), `max_features` (over the range `[None, auto, log2]`) and `min_impurity_split` (either `1e-7` or `1e-4`). We then extract the five features which are the most important on average (i.e. across the test set) for the best classifier and for the worst classifier (see the 'Results' section).


- **MLP**. After some manual fine-tuning, I have settled on a network with 3 hidden layers, all of size `100`. I am using the `LBFGS` solver instead of `Stochastic Gradient Descent` or `Adam`, given that the size of the dataset is relatively small and the convergence is quite fast. `LBFGS` stands for **L**imited-memory **B**royden-**F**letcher-**G**oldfarb-**S**hanno algorithm; it is an optimizer of the quasi-Newton family. I am using the `relu` activation function.
In the Walk-Forward validation mode I have set `alpha=1e-2`, `tol=1e-4` and `warm_start=True`, so that the algorithm can start from the weights trained at the previous time step, which allows for faster convergence (it is also a way to keep some memory when moving along the time series).
In the static validation mode, I use `alpha=1e-4`, `tol=1e-3` and `warm_start=False`.
This choice of parameters comes from some manual trials, and I have settled on it based on the accuracy score (see the 'Results' section).

I did not investigate the MLP further because the main neural network we want to implement is the `LSTM`.
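
The two MLP configurations reported above translate into `sklearn` roughly as follows. The `random_state` value and the random stand-in data are assumptions added for reproducibility of the sketch; all other arguments are the ones listed in the text.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

common = dict(hidden_layer_sizes=(100, 100, 100), solver="lbfgs",
              activation="relu", random_state=0)

# Settings reported for the Walk-Forward mode
mlp_walk_forward = MLPClassifier(alpha=1e-2, tol=1e-4, warm_start=True, **common)
# Settings reported for the Static mode
mlp_static = MLPClassifier(alpha=1e-4, tol=1e-3, warm_start=False, **common)

# Synthetic stand-in data already scaled to [0, 1] (NOT the real dataset)
rng = np.random.default_rng(0)
X = rng.random(size=(40, 5))
y = rng.integers(0, 2, size=40)

mlp_walk_forward.fit(X[:30], y[:30])
mlp_walk_forward.fit(X[:31], y[:31])   # warm start: begins from previous weights
pred = mlp_walk_forward.predict(X[31:32])
print(pred)
```

With `warm_start=True`, each call to `fit` continues from the weights of the previous call instead of re-initialising them, which is what makes the repeated refits of the Walk-Forward loop cheaper.
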


- **LSTM**. For this method I am implementing a *stateful* model, i.e. the network carries its internal state over from one batch to the next across sequential iterations instead of resetting it (loosely similar in spirit to the `warm_start=True` feature of the `MLP`).
The main features to explore here are the combinations of the number of neurons (i.e. LSTM units used in the network) and the number of epochs used for training. One neuron is composed of four gates: 1) the Input activation gate; 2) the Input gate; 3) the Forget gate; and 4) the Output gate. The first three gates combine to produce a state, which is then combined with the Output gate to produce an output. Each epoch is one forward and backpropagation pass over the training data. 
I am exploring `30` combinations of number of neurons and epochs in total. For the number of neurons: `[1,2,3,10,50]`; for the number of epochs: `[1,5,10,100,1000,3000]`.

A `batch_size` of `1` is used given that the model is stateful: only `1` batch of data is used at each iteration, and the model retains information about that batch and learns by itself how much of it to keep.
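
To make the role of the four gates concrete, the sketch below implements a single forward step of a standard LSTM layer in plain NumPy and carries the state across a few time steps. It is an illustration of the mechanism only, not the `Keras` model used in the project; the weights are random and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of an LSTM layer; W, U, b hold the four gates' parameters."""
    a = np.tanh(W["a"] @ x + U["a"] @ h_prev + b["a"])   # input activation
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    c = a * i + f * c_prev      # new internal state from the first three gates
    h = np.tanh(c) * o          # output = squashed state times output gate
    return h, c

n_features, n_units = 4, 2
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(n_units, n_features)) for g in "aifo"}
U = {g: rng.normal(size=(n_units, n_units)) for g in "aifo"}
b = {g: np.zeros(n_units) for g in "aifo"}

h, c = np.zeros(n_units), np.zeros(n_units)
for x in rng.normal(size=(5, n_features)):   # state carried across 5 steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h, c)
```
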

### Refinement
A refinement has been introduced in the final part of the project, in terms of metrics. Although the accuracy_score is a good metric for the pure classification problem, in a financial environment we also care about the size of our errors: when we are wrong and, say, we predict the market will go up, we do not want the market to actually perform `-10%`, or vice versa.

We therefore introduced the *modified accuracy*, a metric that registers a score proportional to the realised percentage performance of the market on a given trading day. The `modified_accuracy` function implements this logic.
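
The exact formula is not spelled out here, so the sketch below shows only one possible reading: an accuracy in which each trading day is weighted by the absolute size of the market move. The signature and the sample values are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def modified_accuracy(y_true, y_pred, pct_change):
    """Accuracy with each day weighted by the size of the market move (one reading)."""
    weights = np.abs(np.asarray(pct_change, dtype=float))
    return accuracy_score(y_true, y_pred, sample_weight=weights)

y_true = [1, 0, 0]
y_pred = [0, 0, 0]          # the first prediction is wrong in both scenarios

plain = accuracy_score(y_true, y_pred)
big_miss = modified_accuracy(y_true, y_pred, [5.0, -1.0, -1.0])     # wrong on a +5% day
small_miss = modified_accuracy(y_true, y_pred, [0.25, -1.0, -1.0])  # wrong on a +0.25% day
print(plain, big_miss, small_miss)
```

Under this reading, the same wrong call costs far more when it falls on a large move than on a small one, while the plain accuracy is identical in both cases.
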


## IV. Results

### Model Evaluation and Validation






### Justification


## V. Conclusion

### Free-Form Visualization

### Reflection

### Improvement
