# The naivety of ML models and how to avoid the naive trap

**Keywords**
- forecast
- stock price
- machine learning
- data
- apple
- ai

## Overall Feedback

1. There's a mix of using I and we. Pick one.
2. Really great, would like to see a stronger message at the end.
3. Blog readers have a shorter attention span, so use headings & subheadings to break your article into sections. If a person only reads the headings, they should be able to get the gist of what you're trying to say.

(Intro)
### The naive trap
When forecasting stock price movements with fancy ML models, you want your models to learn something useful from the provided data that humans might not be aware of so you have a competitive edge. This article will show you examples of where those fancy ML models learn something dumb that looks intelligent and how to check if your models are doing the same.

I came across this problem when comparing the stock forecasting performance of various ML models to each other. After selecting the top-performing models, I made a quick comparison to a Naive benchmark model as a sanity check (the Naive model just uses the current price as a forecast for the next time step, so it's not an "intelligent" model that learns anything from historical data). After some investigation, I found that some of these ML models had learned the Naive forecasting strategy, which left me with a couple of fancy-looking models that were no more useful than a toddler saying tomorrow's price will be the same as today's price. I fell into what I now call the naive trap. If you would like to avoid falling into the same trap, read on...

(Body)

### It's easy to fool yourself

> Insidious confirmation bias creeps in fast as we spot the first signs of success in our beautiful money-printing ML creations and we forget to apply a rigorous scientific approach

Most people get very excited to use machine learning (ML). Using ML to solve problems is often romanticised (I’m also guilty here). There is a unique thrill accompanying the sight of a computer beating a human at some task. It gives you a slight existential shake and makes you wonder about your long-term relevance, all while satisfying your inner math geek and giving you a feeling similar to watching the first human landing on the moon. Because of this, we want our ML models to succeed. Insidious confirmation bias creeps in fast as we spot the first signs of success in our beautiful money-printing ML creations and we forget to apply a rigorous scientific approach. Consider the following example to see just how easily this can happen.



### An example
#### Overview: setting the stage
This example considers forecasting the next day's price of Apple stock. The performance of the following models was compared (they were selected arbitrarily for illustration, so feel free to plug in your own and rerun the analysis):
* Linear regression
* Random forest
* Multi-layer perceptron (MLP)
* Naive

A summary of the analysis is given below (the full analysis can be found in [this GitHub repo](https://github.com/ruankie/naive-ml)). 

#### Details: Data processing and setting the target variable
Firstly, the raw historical data was retrieved from [Yahoo Finance](https://finance.yahoo.com/quote/AAPL/). For data processing, some technical indicators were added as part of feature engineering. These features contain momentum, trend, volume, and volatility information and are commonly used as features in stock price forecasting. The data set features were also standardised so they were on a similar scale and appropriate for training ML models.
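To make the feature engineering and standardisation step more concrete, here is a minimal sketch. It does not use the `ta` library or the real Apple data. The three hand-rolled features (`momentum_5d`, `trend_20d`, `volatility_20d`) and the synthetic price series are assumptions chosen purely for illustration, and the sketch standardises over all rows for brevity, since the article does not say which rows the statistics were computed on.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (an assumption; the article uses real AAPL data).
rng = np.random.default_rng(0)
close = pd.Series(100.0 * np.cumprod(1.0 + rng.normal(0.0, 0.01, 300)), name="close")
returns = close.pct_change()

# Hand-rolled stand-ins for momentum, trend, and volatility indicators.
features = pd.DataFrame({
    "momentum_5d": close.pct_change(5),                   # 5-day price change
    "trend_20d": close / close.rolling(20).mean() - 1.0,  # distance from 20-day average
    "volatility_20d": returns.rolling(20).std(),          # 20-day std of daily returns
}).dropna()

# Standardise each column to zero mean and unit (sample) standard deviation.
standardised = (features - features.mean()) / features.std()
```

If you want to avoid leaking test-set information into training, one common option is to compute the mean and standard deviation on the training rows only and reuse them for the test rows.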

The data set was split into a training and a testing portion. As seen in the figure below, training was done on data from 2018-01-01 to 2021-05-31 (41 months) and testing was done on data from 2021-06-01 to 2021-12-01 (6 months).

![train-test-data](../figures/train-test-AAPL.png "train-test-data")
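A chronological split like this one can be sketched with pandas date slicing. The data frame below is a synthetic stand-in (one made-up `close` value per business day), not the real data set. Only the date boundaries come from the article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one row per business day.
dates = pd.bdate_range("2018-01-01", "2021-12-01")
data = pd.DataFrame({"close": np.linspace(40.0, 160.0, len(dates))}, index=dates)

# Label-based slicing on a DatetimeIndex includes both end points.
train = data.loc["2018-01-01":"2021-05-31"]
test = data.loc["2021-06-01":"2021-12-01"]
```

Splitting by date rather than shuffling keeps every test observation strictly after the training period, which matters for time series.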

In this supervised ML setting, the target was specified as the next day's return. A price forecast can be reconstructed from the returns forecast. To do this, first consider the definition of returns $r_{t+1}$ at time $t+1$:

\begin{align*}
    r_{t+1} &= \frac{p_{t+1} - p_{t}}{p_{t}} \\
            &= \frac{p_{t+1}}{p_{t}} -1
\end{align*}

Where $p_{t+1}$ is the price at time $t+1$ and $p_{t}$ is the price at time $t$. This can be rearranged to get an expression for the next day's price:

$$p_{t+1} = p_{t} \left( r_{t+1} + 1 \right)$$
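The two formulas above translate directly into a few lines of NumPy. This is a minimal sketch with made-up prices. The function names `returns_from_prices` and `next_price` are hypothetical helpers, not taken from the linked repo.

```python
import numpy as np

def returns_from_prices(prices):
    """Simple returns r_{t+1} = p_{t+1} / p_t - 1 for consecutive prices."""
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0

def next_price(current_price, forecast_return):
    """Price forecast p_{t+1} = p_t * (r_{t+1} + 1)."""
    current_price = np.asarray(current_price, dtype=float)
    forecast_return = np.asarray(forecast_return, dtype=float)
    return current_price * (forecast_return + 1.0)

prices = np.array([100.0, 102.0, 99.96])      # made-up prices
actual_returns = returns_from_prices(prices)  # roughly [0.02, -0.02]

# Feeding the actual returns back in recovers the actual next-day prices.
reconstructed = next_price(prices[:-1], actual_returns)

# The Naive strategy is a zero return forecast: tomorrow's price equals today's.
naive_prices = next_price(prices[:-1], np.zeros_like(actual_returns))
```

Note that a return forecast of exactly 0.0 gives back the current price, which is why a near-zero return forecast and the Naive strategy are the same thing.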

After training the ML models listed above, their performance was assessed on the test set using the root mean squared error $RMSE$:

$$ RMSE(r, \hat{r}) = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( r_i - \hat{r}_i \right)^2 \right]^{1/2} $$

where $r$ and $\hat{r}$ are the actual and forecasted returns of the test set respectively, each containing $N$ elements (the number of days in the test set). Because returns are used here, the result can be interpreted as a percentage price forecast error. The figure below shows the $RMSE$ obtained by each model.
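As a rough sketch, the $RMSE$ above can be computed with NumPy as follows. The return values are made up for illustration, and `rmse` is a hypothetical helper name. The last lines score the Naive strategy, whose return forecast is zero on every day.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error between two equally shaped arrays."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    if actual.shape != forecast.shape:
        raise ValueError("actual and forecast must have the same shape")
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

actual_returns = np.array([0.01, -0.02, 0.015, -0.005])  # made-up test returns

# The Naive benchmark forecasts a return of 0.0 for every day.
naive_forecast = np.zeros_like(actual_returns)
naive_rmse = rmse(actual_returns, naive_forecast)
```

For the Naive benchmark, the error reduces to the root mean square of the actual returns, so any model that scores close to it deserves a second look.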

#### Results: First appearances are deceiving
![rmse](../figures/errors-AAPL.png "rmse")

Notice that the errors of the Linear model and the Random Forest model are very close to that of the Naive model. At this point, you might be tempted to choose the Linear model as your top performer and happily put it into production, proudly claiming it makes an error of just over 1.2% in price forecasts. However, once you consider the strategy it learned to produce these forecasts, you might change your mind. Have a look at the following plot of return forecasts to see why this is the case:

#### Results: A closer look at what was learned
![return-forecast](../figures/return-forecast-AAPL.png "return-forecast")

A quick inspection reveals that the return forecast for the Linear model was almost always 0.0. This is the same as saying there will be no difference between today's price and tomorrow's price. If this rings a bell, well done spotting it! This is of course the Naive forecasting strategy. It is also interesting to see the similarity between the Random Forest's return forecast and the Naive model's. Even though it is not as close a match as the Linear model's, it still very closely resembles the Naive strategy. These return forecasts can be converted into price forecasts as shown above to reveal a more intuitive and familiar plot:

![price-forecast](../figures/price-forecast-AAPL.png "price-forecast")

This is a plot that more people will be familiar with. However, looking at this price forecast, it is not that obvious that the Linear and Random Forest models basically learned the Naive strategy. It is again easy to fall into the trap and conclude that the Linear model is a great solution.

#### A recipe for avoiding the naive trap
This is a great example of why data scientists shouldn't blindly chase a technical metric to optimise. As you have seen, from that point of view, you can easily end up with "solutions" that are not very useful. Instead, consider the bigger picture when deciding what would constitute a solution to a given problem. This might mean further interrogating possible solutions to see if they stand up to constraints not imposed by the original optimisation metric. In the case of stock price forecasting, consider adding the following steps to your process of finding a solution to avoid the naive trap:
1. Have a look at the returns forecast
2. Check how similar your ML models are to the Naive strategy (0.0 returns forecast)
3. Consider eliminating Naive models from your list of top-performers
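The steps above can be turned into a quick automated check. The sketch below is one possible way to do it, not the method used in the linked repo. The helper names, the `tolerance` of 0.001, and the `min_share` of 0.9 are all assumptions you would tune for your own data, and the forecasts are made up.

```python
import numpy as np

def naive_similarity(forecast_returns, tolerance=1e-3):
    """Return (share of forecasts within tolerance of 0.0, RMSE distance to a 0.0 forecast)."""
    forecast_returns = np.asarray(forecast_returns, dtype=float)
    near_zero_share = float(np.mean(np.abs(forecast_returns) <= tolerance))
    distance_to_naive = float(np.sqrt(np.mean(forecast_returns ** 2)))
    return near_zero_share, distance_to_naive

def looks_naive(forecast_returns, tolerance=1e-3, min_share=0.9):
    """Flag a model whose return forecasts are almost always close to 0.0."""
    share, _ = naive_similarity(forecast_returns, tolerance)
    return share >= min_share

# Made-up return forecasts from two hypothetical models.
forecasts = {
    "flat_model": np.array([0.0002, -0.0001, 0.0, 0.0003, -0.0002]),
    "active_model": np.array([0.012, -0.008, 0.015, -0.011, 0.009]),
}
flags = {name: looks_naive(values) for name, values in forecasts.items()}
```

A flagged model is not necessarily useless, but it is a strong hint to plot its return forecasts before trusting its error metric.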

### Conclusion
In conclusion, this article shows you how to avoid the trap of misinterpreting low price forecast errors and falling in love with your ML models before properly interrogating them. In the case of stock price forecasting, make sure that your complex ML models don't learn something of little use like the Naive strategy after optimising for your preferred error metric. Simply plotting the return forecasts can give you valuable insight into this particular problem. Feel free to have a look at [this GitHub repo](https://github.com/ruankie/naive-ml) for more details on the specifics of this analysis or to get you up and running with the same analysis quickly if you want to plug in your own models. Remember to always compare your ML models to simple benchmark models and happy forecasting!

## Resources
1. [Yahoo Finance](https://finance.yahoo.com/quote/AAPL/) for market data
1. [yfinance library](https://github.com/ranaroussi/yfinance) for market data
1. [ta library](https://github.com/bukosabino/ta) for technical indicators
1. [scikit-learn library](https://scikit-learn.org/stable/supervised_learning.html) for ML models