# Expected vs. Realized Move in Options Using Random Forest Regression

This educational project investigates the relationship between **expected** and **realized moves** ($EM$ and $RM$, respectively) in equity options, focusing on the challenge of forecasting how much an underlying asset will move based on option market signals.

Using a **Random Forest regression** approach—a common machine learning method—we examine how closely the **market’s expectations** ($EM$), reflected in an option contract’s price and its implied volatility, correspond to the **actual realized price movements** ($RM$). We then build a Random Forest Regressor to predict the magnitude of $RM$ for a fixed horizon ($h$) and analyze the conditions under which these forecasts provide meaningful insight.

This project aims to deepen our understanding of how predictive market-implied volatility is for future movements, advancing our knowledge of risk, price forecasting, and volatility management in financial decision-making and equity investing.


## Introduction to Stock Market Trading, Options & $EM$ vs. $RM$

Stock prices change continually as investors buy and sell shares. This movement is influenced by company news, economic data, and trader sentiment, leading to periods of rising (bull) or falling (bear) prices.
 
**Options** are financial contracts that give buyers the right, but not the obligation, to buy or sell an underlying stock at a specific price (the **strike price**, $K$) on or before a set date (the **expiry**). In other words, options let you bet on whether a stock’s price will go up or down, and at what level you can make that bet.
- **Call options:** Give the right to **buy** the stock at $K$. This is like betting that the stock price will go above a certain level. The payoff formula for a call option at expiry is: $$\max(S_T - K, 0)$$
Where,
    - $K$ is the **strike price**—the predetermined price at which the option holder can buy the stock.
    - $S_T$ is the **stock price at the time of expiry** ($T$).
    - The expression $(S_T - K)$ represents how much the stock price is above the strike price at expiry.
    - The $\max(\cdot, 0)$ function means you take the greater of $(S_T - K)$ or 0. This guarantees that if the option ends up "out-of-the-money" (i.e., $S_T < K$), the payoff cannot be negative. The option contract then simply expires worthless.

     Put simply: if you have a call option, you get paid the difference if the stock ends up higher than your agreed price; otherwise, you get nothing. You’d only use your option if you could buy cheaper than in the market. If not, you just let it expire and lose nothing beyond the price you paid for the option.


- **Put options:** Give the right to **sell** the stock at $K$. This is like betting that the stock price will fall below a certain level. The payoff formula for a put option at expiry is: $$\max(K - S_T, 0)$$
Where,
    - $K$ is the **strike price**—the predetermined price at which the option holder can sell the stock.
    - $S_T$ is the **stock price at the time of expiry** ($T$).
    - The expression $(K - S_T)$ represents how much the strike price is above the stock price at expiry.
    - The $\max(\cdot, 0)$ function means you take the greater of $(K - S_T)$ or 0. This ensures that if the option finishes "out-of-the-money" (i.e., $S_T > K$), the payoff cannot be negative. The put then simply expires worthless.

     Put simply (no pun intended): if you have a put option, you get paid the difference if the stock ends up lower than your agreed sell price; otherwise, you get nothing. You’d only use your option if you could sell higher than in the market. If not, you just let it expire and lose nothing beyond the price you paid for the option.
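
The two payoff formulas can be sketched in a few lines of Python. The function names and the strike of 100 are illustrative assumptions, not part of the project code, and the payoffs ignore the premium paid for the option.

```python
def call_payoff(s_t: float, k: float) -> float:
    """Payoff at expiry of a call: max(S_T - K, 0)."""
    return max(s_t - k, 0.0)


def put_payoff(s_t: float, k: float) -> float:
    """Payoff at expiry of a put: max(K - S_T, 0)."""
    return max(k - s_t, 0.0)


strike = 100.0  # illustrative strike, not taken from any dataset
for final_price in (90.0, 100.0, 110.0):
    # Print the stock price at expiry next to both payoffs.
    print(final_price, call_payoff(final_price, strike), put_payoff(final_price, strike))
```

With the stock at 110 the call pays 10 and the put pays nothing; with the stock at 90 the roles reverse.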
 
Option prices are influenced not just by the current stock price, but also by the expected amount the stock will move by expiry. This expectation is captured in the option's **implied volatility (IV)**, a measure derived from option prices that reflects the market's consensus on how much the stock might fluctuate.

Traders often use IV to estimate the **Expected Move ($EM$)**, which approximates how much the stock price is likely to change over a given period. The **expected move** over a horizon of $h$ days (using at-the-money implied volatility) is calculated as follows:

$$
\text{EM}_{t,h} = \sigma^{ATM}_t \sqrt{\frac{h}{252}}
$$

Where:
- $\text{EM}_{t,h}$ is the **Expected Move** over the next $h$ trading days, starting from time $t$.
- $\sigma^{ATM}_t$ is the **at-the-money implied volatility** at time $t$ (usually quoted as an annualized percentage, such as 20%, and entered in the formula as a decimal, 0.20).
- $h$ is the **number of trading days** in your forecast window (for example, $h=5$ for a week, $h=21$ for a month).
- 252 is the **typical number of trading days in a year** (used to annualize or de-annualize volatility).
- The square root component, $\sqrt{\frac{h}{252}}$, adjusts the annualized implied volatility to the shorter time frame of $h$ days. This reflects the idea from financial theory that volatility scales with the square root of time.

In plain language:
- The formula takes the market’s current “best guess” (from option prices) for how much the stock would fluctuate in a year, then converts it—using the square-root rule—to estimate the likely move over just $h$ days.
- For example, if you want to know how much the stock might move in the next week, plug in $h=5$.
- This gives you a market-implied estimate of the typical (roughly one-standard-deviation) size of the stock's move over $h$ days, not a hard limit on where the price can end up.
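
As a minimal sketch of the expected-move formula, assuming IV is passed as a decimal (0.20 for 20%) and using a hypothetical helper name `expected_move`:

```python
import math

TRADING_DAYS_PER_YEAR = 252


def expected_move(atm_iv: float, horizon_days: int) -> float:
    """EM_{t,h} = sigma_ATM * sqrt(h / 252), with IV given as a decimal."""
    return atm_iv * math.sqrt(horizon_days / TRADING_DAYS_PER_YEAR)


# Illustrative inputs: 20% annualized ATM IV over one week and one month.
weekly = expected_move(0.20, 5)
monthly = expected_move(0.20, 21)
print(f"1-week EM: {weekly:.2%}, 1-month EM: {monthly:.2%}")
```

For a 20% annualized IV this works out to roughly 2.8% over five trading days and roughly 5.8% over twenty-one.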

After the period ends, we can observe the **Realized Move ($RM$)**, which is the actual movement that occurred in the underlying stock over the horizon we’re measuring. The formula for realized move over $h$ days is:

$$\text{RM}_{t,h} = \left| \ln \left( \frac{P_{t+h}}{P_t} \right) \right|$$

Where,
- $\text{RM}_{t,h}$ is the **Realized Move** between time $t$ and $t+h$ (the *actual* observed change over $h$ days).
- $P_t$ is the **stock price at the start** (time $t$).
- $P_{t+h}$ is the **stock price at the end** of the $h$-day period.
- The fraction $\frac{P_{t+h}}{P_t}$ tells us how much the price has changed relative to its starting value.
- The natural logarithm $\ln(\cdot)$ is used to measure the size of the price change in a way that is fair for both upward and downward moves, and lets us easily compare changes across different prices or stocks.
- The absolute value $|\cdot|$ ensures that we are measuring the magnitude of the movement, regardless of whether the stock price went up or down.

In plain language:
- The realized move is the absolute size of the log-return from $t$ to $t+h$. It shows how much the stock actually changed in price (up or down) over the period, without regard to direction, just the size.
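
One way to compute $RM$ for every start date in a price history is sketched below with pandas. The helper name `realized_move` and the price series are made up for illustration; the project's actual data pipeline may differ.

```python
import numpy as np
import pandas as pd


def realized_move(prices: pd.Series, horizon_days: int) -> pd.Series:
    """RM_{t,h} = |ln(P_{t+h} / P_t)| for every start date t in the series."""
    future = prices.shift(-horizon_days)  # aligns P_{t+h} with row t
    return np.log(future / prices).abs()


# Hypothetical daily closing prices (illustrative numbers only).
closes = pd.Series([100.0, 101.0, 99.0, 102.0, 104.0, 103.0, 108.0])
rm_5d = realized_move(closes, 5)
print(rm_5d)
```

The last $h$ rows come out as `NaN` because $P_{t+h}$ is not yet known for them. Those rows would need to be dropped before training so that the target never uses information from beyond the available data.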

Suppose the stock price today is \$100. The options market is implying a 5% expected move ($EM$) over the next week. In this scenario, people expect the price could end up anywhere between \$95 and \$105. This is written as:
$$
\text{EM}_{t,5\text{D}} = 5\%
$$

Two possible outcomes can occur by the end of the week:

* **If the stock finishes at \$108:**
  The realized move ($RM$) is
  $$
  \text{RM} = \left| \ln\left( \frac{108}{100} \right) \right| \approx 7.7\% \approx 8\%
  $$
  This is larger than what the options market predicted, meaning the stock was more volatile than expected.

* **If the stock finishes at \$103:**
  The realized move ($RM$) is
  $$
  \text{RM} = \left| \ln\left( \frac{103}{100} \right) \right| \approx 3\%
  $$
  This is smaller than the expected move; the stock was calmer than the options market anticipated.

In this example, intuitively, if the market priced in a \$5 move but the stock ran up \$8, that indicates unexpected volatility; if it only moved \$3, things were quieter than expected.
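
The two outcomes can be checked numerically. This is only a sketch of the worked example above, with variable names chosen for illustration. Note that the log move for a finish at 108 is slightly below the simple 8% return.

```python
import math

expected_move = 0.05  # the 5% one-week EM from the example
start_price = 100.0

results = {}
for end_price in (108.0, 103.0):
    rm = abs(math.log(end_price / start_price))  # RM = |ln(P_end / P_start)|
    results[end_price] = rm
    verdict = "larger" if rm > expected_move else "smaller"
    print(f"finish at {end_price:.0f}: RM = {rm:.2%}, {verdict} than EM")
```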

If you calculated the realized move not just from the starting and ending prices but based on the **highest and lowest** prices during the week, you’d get a number that is at least as large, and usually larger, because this method captures the full swing between the extremes, not just where you started and finished.
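
A small sketch of this comparison, using a hypothetical one-week price path and a simple log high-low range as the range-based measure (one of several possible definitions, chosen here as an assumption):

```python
import math

# Hypothetical prices observed during one week (illustrative numbers only).
path = [100.0, 104.0, 97.0, 101.0, 106.0, 103.0]

# Close-to-close move: only the first and last prices matter.
close_to_close = abs(math.log(path[-1] / path[0]))

# Range-based move: log distance between the highest and lowest price.
high_low_range = math.log(max(path) / min(path))

print(f"close-to-close: {close_to_close:.2%}, high-low range: {high_low_range:.2%}")
```

Because the start and end prices both lie between the low and the high, the range-based figure can never be smaller than the close-to-close figure.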

Finally, if you look at realized moves over longer horizons (like a month instead of a week), daily ups and downs partly cancel each other out, so the close-to-close realized move typically grows more slowly than the number of days, in line with the square-root-of-time scaling used for $EM$.




## What is Regression? What is Random Forest Regression?



**Regression** is a supervised machine learning task that predicts a continuous numeric value from a set of features. In our case, we use regression to forecast the realized move ($RM$, actual volatility) of a stock, using inputs such as option prices and historical price patterns.

A **Random Forest** is an ensemble learning method that builds many decision trees and averages their predictions. Here’s how it works:

- Each **decision tree** is trained on a random sample of the data, drawn with replacement (also known as bootstrapping).
- At each split in a tree, only a random subset of features is considered, making every tree see a slightly different part of the data.
- The **final prediction** is the average (for regression) of all trees’ predictions.
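
The two sources of randomness in the list above can be illustrated with NumPy alone. The dataset shape is hypothetical, and considering a third of the features per split is a common rule of thumb for regression, used here as an assumption rather than a setting taken from this project.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples, n_features = 10_000, 12  # hypothetical dataset shape

# Bootstrap: draw n_samples row indices with replacement for one tree.
boot_idx = rng.integers(0, n_samples, size=n_samples)
unique_share = np.unique(boot_idx).size / n_samples

# Feature subsampling: candidate features for a single split.
n_candidates = max(1, n_features // 3)
split_features = rng.choice(n_features, size=n_candidates, replace=False)

print(f"share of distinct rows in the bootstrap sample: {unique_share:.3f}")
print(f"features considered at this split: {sorted(split_features.tolist())}")
```

On average a bootstrap sample contains about 63% of the distinct rows, so each tree leaves roughly a third of the data unseen.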

For a Random Forest, the prediction for an input $X$ is calculated as:
$$
\hat{y} = \frac{1}{N} \sum_{i=1}^{N} T_i(X)
$$
Where,

- $\hat{y}$ is the final prediction made by the Random Forest.
- $N$ represents the total number of decision trees in the forest.
- $T_i(X)$ is the prediction for $X$ made by the $i$-th decision tree.
- The sum $\sum_{i=1}^{N} T_i(X)$ means we add up the predictions from all the trees.
- Finally, we divide by $N$ to take the average, so the Random Forest output is the mean of all the tree predictions.

This process helps the Random Forest provide more stable and accurate predictions by pooling together the results from many different trees.
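
The averaging formula can be checked against scikit-learn's `RandomForestRegressor` on synthetic data. The features and target below are made up purely for illustration. The sketch relies on the fitted forest exposing its individual trees through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 3))                      # synthetic features
y = np.abs(X[:, 0]) + 0.1 * rng.normal(size=200)   # synthetic target

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# T_i(X) for each tree i, stacked into an array of shape (N, n_rows).
tree_preds = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# (1/N) * sum_i T_i(X): the mean over the tree axis.
manual_average = tree_preds.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X[:5])))
```

The manual mean over the trees should agree with `forest.predict` up to floating-point rounding.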

**We can use a Random Forest for this problem because:**
- It can model highly **nonlinear relationships**—which are common in finance.
- It’s highly effective for **tabular data** with many features (like ours).
- It’s relatively **robust to outliers and multicollinearity** (when features are correlated), although correlated features can blur feature-importance scores.
- By averaging many trees, it reduces variance and overfitting relative to a single tree (the **bias–variance trade-off**), and tree-based splits need little feature scaling or heavy preprocessing.

Random Forest also allows tuning parameters to improve performance:
- `n_estimators`: Number of trees (more trees give smoother, more stable predictions, at higher compute cost and with diminishing returns)
- `max_depth`, `max_features`: Control tree complexity and diversity
- `min_samples_split`, `min_samples_leaf`: Prevent trees from growing too fine (“overfitting” small blips)
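
A minimal sketch of how these parameters might be set in scikit-learn is shown below. The data is synthetic (an $EM$-like feature, a pure-noise feature, and an $RM$-like target that loosely tracks the first feature), and the parameter values are illustrative assumptions, not tuned settings from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(seed=42)
n = 1_000

# Synthetic stand-ins for real data (assumption: RM scales with EM).
em = rng.uniform(0.01, 0.08, size=n)
noise_feature = rng.normal(size=n)
rm = np.abs(em * rng.normal(size=n))
X = np.column_stack([em, noise_feature])

# Time-ordered split: train on the first 80%, test on the last 20%.
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = rm[:split], rm[split:]

model = RandomForestRegressor(
    n_estimators=300,      # number of trees
    max_depth=6,           # cap on tree depth
    max_features=1.0,      # fraction of features considered per split
    min_samples_split=10,  # minimum rows needed to split a node
    min_samples_leaf=5,    # minimum rows allowed in a leaf
    random_state=42,
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"test MAE: {mae:.4f}")
```

The split keeps the rows in order rather than shuffling them, which is the usual precaution with time-series targets so that the model is never evaluated on data that precedes its training window.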

Here is a visual example of a Random Forest Regression (sourced from https://www.spotfire.com/glossary/what-is-a-random-forest):
<img src="random-forest-diagram.svg" alt="Random Forest Diagram" style="width:40%; display:block; margin:auto;">
<em>
Figure Above: A Random Forest is an ensemble of decision trees, each trained on a bootstrapped sample of the dataset and a random subset of features. Each tree makes its own prediction independently, capturing different aspects of the data’s underlying patterns. The forest then aggregates these individual outputs, averaging them for regression tasks, to produce a more stable, accurate, and less overfitted final prediction.
</em>


The goal of this project is to model and predict $RM$ (the realized, or actual, volatility) using features including option market data, with a **Random Forest Regression** model.


