Let’s nail **R² (R-squared)** with maximum clarity 🔥

---

# 🔹 What is R²?

* R² (coefficient of determination) tells us **how well a regression model explains the variability of the data**.
* It’s a **goodness-of-fit** measure.

👉 Think of it as:
**“Out of all the variation in `y`, how much of it does my model explain?”**

---

# 🔹 Formula (but keep it intuitive)

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$

Where:

* $SS_{res}$ = Residual Sum of Squares = variation the model **didn’t** explain.
* $SS_{tot}$ = Total Sum of Squares = total variation in `y` (baseline).

👉 So, $\frac{SS_{res}}{SS_{tot}}$ = fraction of the variance left unexplained.
👉 Then, $1 -$ that = fraction of the variance explained by the model.

---

# 🔹 Minimal Example

Suppose we have data:

| Hours Studied (X) | Exam Score (Y) |
| ----------------- | -------------- |
| 1                 | 50             |
| 2                 | 60             |
| 3                 | 65             |
| 4                 | 70             |
| 5                 | 80             |

---

### Step 1: Compute Mean of Y

$$
\bar{y} = \frac{50+60+65+70+80}{5} = 65
$$

---

### Step 2: Total Variation (SS\_tot)

$$
SS_{tot} = \sum (y_i - \bar{y})^2
$$

\= (50-65)² + (60-65)² + (65-65)² + (70-65)² + (80-65)²
\= 225 + 25 + 0 + 25 + 225
\= **500**

👉 This is the **total spread of data around the mean**.

---

### Step 3: Fit a Simple Regression

Say our model predicts:

| X | Predicted ŷ |
| - | ----------- |
| 1 | 52          |
| 2 | 60          |
| 3 | 66          |
| 4 | 72          |
| 5 | 80          |

---

### Step 4: Residual Variation (SS\_res)

$$
SS_{res} = \sum (y_i - \hat{y}_i)^2
$$

\= (50-52)² + (60-60)² + (65-66)² + (70-72)² + (80-80)²
\= 4 + 0 + 1 + 4 + 0
\= **9**

👉 Only 9 units of variation are unexplained!

---

### Step 5: Compute R²

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{9}{500} = 0.982
$$

✅ **R² = 0.982 (98.2%)** → The model explains almost all the variation in exam scores.
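The five steps above can be checked in a few lines of plain Python. This is a minimal sketch with illustrative variable names; if scikit-learn is installed, `sklearn.metrics.r2_score(y_true, y_pred)` should give the same value.

```python
# Data from the worked example: actual exam scores and the model's predictions.
y_true = [50, 60, 65, 70, 80]
y_pred = [52, 60, 66, 72, 80]

y_mean = sum(y_true) / len(y_true)                          # Step 1: mean of y
ss_tot = sum((y - y_mean) ** 2 for y in y_true)             # Step 2: total variation
ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # Step 4: unexplained variation
r2 = 1 - ss_res / ss_tot                                    # Step 5: R^2

print(y_mean, ss_tot, ss_res, round(r2, 3))
```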

---

# 🔹 Intuition Recap

* If **R² = 0** → model explains nothing (no better than just using the mean).
* If **R² = 1** → perfect fit (model explains all variance).
* If **R² = 0.7** → model explains 70% of variance; the other 30% is left unexplained (noise or structure the model missed).
* R² can even be **negative** if your model is worse than just predicting the mean.
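The boundary cases in this list can be reproduced on the exam-score data. The sketch below uses a hypothetical helper `r_squared` (not a library function), and the "reversed" predictions are made up purely to show a model that is worse than the mean.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (illustrative helper)."""
    y_mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

y_true = [50, 60, 65, 70, 80]

r2_perfect = r_squared(y_true, y_true)                 # predictions equal the data -> 1.0
r2_mean = r_squared(y_true, [65] * 5)                  # always predict the mean   -> 0.0
r2_reversed = r_squared(y_true, [80, 70, 65, 60, 50])  # trend predicted backwards -> -3.0

print(r2_perfect, r2_mean, r2_reversed)
```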

---

# 🔹 Simple Analogy

Imagine trying to predict **how much water is in a glass**:

* **SS\_tot** = total mess of guesses if you just use the “average amount of water” every time.
* **SS\_res** = leftover mess after using your smart model.
* **R²** = how much better your model is at cleaning up the mess compared to just using the average.

---

✅ So R² is **not magic** — it’s just:
**“How much better is my model than just using the average?”**

---

`sklearn.metrics.r2_score`


# Mean Absolute Error

>Mean Absolute Error: this is an interpretable metric because it has the same unit of measurement as the initial series. Range: $[0, +\infty)$

$$
\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n}
$$

`sklearn.metrics.mean_absolute_error`
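As a quick check of the formula, here is a minimal pure-Python sketch. The helper name `mae` is illustrative, and the data is the exam-score example from the R² section.

```python
def mae(y_true, y_pred):
    """Mean of the absolute errors |y_i - y_hat_i|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [50, 60, 65, 70, 80]
y_pred = [52, 60, 66, 72, 80]

# Absolute errors are 2, 0, 1, 2, 0, so the result is in "exam points".
mae_value = mae(y_true, y_pred)
print(mae_value)  # 1.0
```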
 

# Median Absolute Error (MedAE)

$$
\mathrm{MedAE} = \mathrm{median}(|y_1 - \hat{y}_1|, \ldots, |y_n - \hat{y}_n|)
$$

`sklearn.metrics.median_absolute_error`
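One reason to look at the median rather than the mean is that a single bad prediction barely moves it. The sketch below illustrates this with a made-up outlier (the last prediction, 30, is invented for the example); `medae` and `mae` are illustrative helper names.

```python
from statistics import median

def medae(y_true, y_pred):
    """Median of the absolute errors."""
    return median(abs(y - p) for y, p in zip(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of the absolute errors, for comparison."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [50, 60, 65, 70, 80]
y_pred_outlier = [52, 60, 66, 72, 30]   # last prediction is badly off (made up)

# Absolute errors: 2, 0, 1, 2, 50
medae_value = medae(y_true, y_pred_outlier)   # median stays small
mae_value = mae(y_true, y_pred_outlier)       # mean is pulled up by the outlier
print(medae_value, mae_value)  # 2 11.0
```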

# Mean Squared Error

>Mean Squared Error: the most commonly used metric; it gives a higher penalty to large errors and a lower penalty to small ones. Range: $[0, +\infty)$

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
 

`sklearn.metrics.mean_squared_error`
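To see the "higher penalty for large errors" effect, the sketch below compares two made-up prediction sets with the same total absolute error: one spreads it evenly, the other concentrates it in a single point. The helper name `mse` and both prediction lists are illustrative.

```python
def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [50, 60, 65, 70, 80]
pred_even = [52, 62, 67, 72, 82]    # every prediction is off by 2
pred_spike = [50, 60, 65, 70, 90]   # one prediction is off by 10

mse_even = mse(y_true, pred_even)     # (4 * 5) / 5
mse_spike = mse(y_true, pred_spike)   # 100 / 5
print(mse_even, mse_spike)  # 4.0 20.0
```

Both sets have a mean absolute error of 2, but squaring makes the single large miss five times more costly.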

# Mean Squared Logarithmic Error

>Mean Squared Logarithmic Error: practically, this is the same as MSE, but we take the logarithm of the series. As a result, we give more weight to small mistakes as well. This is usually used when the data has exponential trends. Range: $[0, +\infty)$

$$
\mathrm{MSLE} = \frac{1}{n}\sum_{i=1}^{n} \left(\log(1 + y_i) - \log(1 + \hat{y}_i)\right)^2
$$
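A minimal sketch of the formula, using `math.log1p(x)` for $\log(1 + x)$. The helper name `msle` and the numbers are illustrative; they show that the same absolute miss of 10 counts for much more when the values themselves are small. scikit-learn offers `sklearn.metrics.mean_squared_log_error` for the same purpose.

```python
import math

def msle(y_true, y_pred):
    """Mean squared difference of log(1 + y) and log(1 + y_hat)."""
    return sum(
        (math.log1p(y) - math.log1p(p)) ** 2 for y, p in zip(y_true, y_pred)
    ) / len(y_true)

# Same absolute error (10), very different relative error.
msle_small_scale = msle([10], [20])       # predicted twice the true value
msle_large_scale = msle([1000], [1010])   # predicted 1% too high

print(msle_small_scale, msle_large_scale)
```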

# Mean Absolute Percentage Error

>Mean Absolute Percentage Error: this is the same as MAE but is computed as a percentage, which is very convenient when you want to explain the quality of the model to management. Range: $[0, +\infty)$

$$
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i}
$$
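A minimal sketch of the formula on the exam-score data from the R² section. The helper name `mape` is illustrative; note that the formula divides by $y_i$, so it is undefined when any true value is zero. scikit-learn's `sklearn.metrics.mean_absolute_percentage_error` computes the same quantity but returns a fraction rather than a percentage.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Requires non-zero y_true."""
    return 100 / len(y_true) * sum(
        abs(y - p) / y for y, p in zip(y_true, y_pred)
    )

y_true = [50, 60, 65, 70, 80]
y_pred = [52, 60, 66, 72, 80]

mape_value = mape(y_true, y_pred)
print(round(mape_value, 2))  # about 1.68 (percent)
```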

# Weighted Average

>Weighted average is a simple modification to the moving average. The weights sum up to 1 with larger weights assigned to more recent observations.

$$
\hat{y}_t = \sum_{n=1}^{k} \omega_n y_{t+1-n}
$$
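In the formula, $\omega_1$ multiplies the most recent observation $y_t$, $\omega_2$ the one before it, and so on. A minimal sketch, assuming the weights are passed most-recent-first; the helper name `weighted_average` and the weights are illustrative.

```python
def weighted_average(series, weights):
    """Weighted average of the last len(weights) observations.

    weights[0] applies to the most recent value; weights should sum to 1.
    """
    recent_first = series[::-1][:len(weights)]
    return sum(w * y for w, y in zip(weights, recent_first))

series = [50, 60, 65, 70, 80]
weights = [0.6, 0.3, 0.1]   # most recent observation gets the largest weight

# 0.6 * 80 + 0.3 * 70 + 0.1 * 65
smoothed = weighted_average(series, weights)
print(round(smoothed, 2))  # 75.5
```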

# Double exponential smoothing

>Up to now, the methods that we've discussed have been for a single future point prediction (with some nice smoothing). That is cool, but it is also not enough. Let's extend exponential smoothing so that we can predict two future points (of course, we will also include more smoothing).

>Series decomposition will help us: we obtain two components, intercept (i.e. level) $\ell$ and slope (i.e. trend) $b$. We have learnt to predict the intercept (or expected series value) with our previous methods; now, we will apply the same exponential smoothing to the trend by assuming that the future direction of the time series depends on the previous weighted changes. As a result, we get the following set of functions:

$$
\ell_x = \alpha y_x + (1-\alpha)(\ell_{x-1} + b_{x-1})
$$

$$
b_x = \beta(\ell_x - \ell_{x-1}) + (1-\beta) b_{x-1}
$$

$$
\hat{y}_{x+1} = \ell_x + b_x
$$

>The first one describes the intercept, which, as before, depends on the current value of the series. The second term is now split into previous values of the level and of the trend. The second function describes the trend, which depends on the level changes at the current step and on the previous value of the trend. In this case, the $\beta$ coefficient is a weight for exponential smoothing. The final prediction is the sum of the model values of the intercept and trend.



## See below explanation and example:

Let’s do this step by step, slowly, so the intuition sticks.

---

## 🔑 Why Do We Need Double Smoothing?

We just saw **why single exponential smoothing (SES) fails**:

* It assumes data wiggles around a stable mean.
* But when there’s a **trend**, it always lags — like chasing a moving car without knowing its speed.

👉 So, to fix this, we don’t just want to know **“where the car is now”** (the *level*) — we also want to know **“how fast it’s moving”** (the *trend*).

That’s the intuition behind **Double Exponential Smoothing (Holt’s Method)**.

---

## 🧩 The Building Blocks

Double smoothing splits the forecast into **two components**:

1. **Level (ℓ)** = Where is the series *right now*?

   * Like the car’s current position.
   * Updated each time with smoothing (just like SES did).

2. **Trend (b)** = How fast (and in which direction) is the series moving?

   * Like the car’s speed.
   * Also updated with smoothing, because trends can change over time.

Then, the forecast is simply:

$$
\text{Future} = \text{Level} + \text{Trend}
$$

👉 Position + Speed

That’s the *big picture*.

---

## 📜 The Formula (Don’t worry, we’ll unpack it gently)

1. **Update Level (ℓ):**

$$
\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})
$$

* Combines today’s actual value ($y_t$) and yesterday’s forecast (which already included yesterday’s level + trend).
* **Goal:** Keep track of “where we are now,” but accounting for the fact that data is moving.

2. **Update Trend (b):**

$$
b_t = \beta (\ell_t - \ell_{t-1}) + (1-\beta)b_{t-1}
$$

* Looks at how the level has changed ($\ell_t - \ell_{t-1}$) = the new slope.
* Mixes it with the old slope ($b_{t-1}$).
* **Goal:** Keep track of “how fast the series is moving,” smoothed to avoid overreacting to noise.

3. **Forecast:**

$$
\hat{y}_{t+1} = \ell_t + b_t
$$

* Just like “position + speed = next position.”
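The three formulas translate almost line for line into code. Below is a minimal sketch of a single update; the function name `holt_step` and the input numbers are illustrative, not taken from any library.

```python
def holt_step(y, level, trend, alpha, beta):
    """One update of Holt's method; returns (new_level, new_trend)."""
    # Level: blend the new observation with the previous one-step forecast.
    new_level = alpha * y + (1 - alpha) * (level + trend)
    # Trend: blend the latest change in level with the previous trend.
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    return new_level, new_trend

# Illustrative numbers: previous level 100, previous trend 2, new observation 105.
level, trend = holt_step(y=105, level=100, trend=2, alpha=0.5, beta=0.5)
next_forecast = level + trend   # forecast = level + trend

print(level, trend, next_forecast)  # 103.5 2.75 106.25
```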

---

If we only use neat numbers like 10, 12, 14, 16, the series looks too “clean” and the formula seems like magic. Let’s pick a slightly messy example where the intuition becomes more obvious.

---

## Example Series (uneven increments)

Suppose we observe daily sales:

$$
10, \; 13, \; 15, \; 18, \; 17
$$

This looks like:

* It’s generally trending **upward**,
* But not in a perfect straight line (some noise).

---

## Step 1: Initialize

We need a starting level and trend.

* $\ell_1 = 10$ (first observation)
* $b_1 = 13 - 10 = 3$ (difference between first two points)

Set smoothing parameters: α = 0.5, β = 0.5 (so we balance between old and new info).

---

## Step 2: Update step by step

### At $t=2$ (actual = 13)

* Level:

$$
\ell_2 = 0.5(13) + 0.5(10 + 3) = 6.5 + 6.5 = 13
$$

* Trend:

$$
b_2 = 0.5(13 - 10) + 0.5(3) = 1.5 + 1.5 = 3
$$

### At $t=3$ (actual = 15)

* Level:

$$
\ell_3 = 0.5(15) + 0.5(13 + 3) = 7.5 + 8 = 15.5
$$

* Trend:

$$
b_3 = 0.5(15.5 - 13) + 0.5(3) = 1.25 + 1.5 = 2.75
$$

### At $t=4$ (actual = 18)

* Level:

$$
\ell_4 = 0.5(18) + 0.5(15.5 + 2.75) = 9 + 9.125 = 18.125
$$

* Trend:

$$
b_4 = 0.5(18.125 - 15.5) + 0.5(2.75) = 1.3125 + 1.375 = 2.6875
$$

### At $t=5$ (actual = 17)

* Level:

$$
\ell_5 = 0.5(17) + 0.5(18.125 + 2.6875) = 8.5 + 10.40625 = 18.90625
$$

* Trend:

$$
b_5 = 0.5(18.90625 - 18.125) + 0.5(2.6875) = 0.390625 + 1.34375 = 1.734375
$$

---

## Step 3: Forecast next point (t=6)

$$
\hat{y}_6 = \ell_5 + b_5 = 18.90625 + 1.734375 \approx 20.64
$$

👉 Notice how this balances:

* The *actual last value* was 17 (a dip),
* But because the model knows there’s an *overall upward trend*, it doesn’t predict 17 again.
* Instead, it predicts \~20.6, respecting both the dip and the upward momentum.
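The whole hand calculation can be replayed with a short loop. This is a sketch under the same assumptions as the example (initial level = first observation, initial trend = first difference, α = β = 0.5); the function name `holt_forecast` is illustrative.

```python
def holt_forecast(series, alpha, beta):
    """Run Holt's method over a series; return the next-step forecast
    and the (level, trend) pair after each time step."""
    level = series[0]                 # initial level: first observation
    trend = series[1] - series[0]     # initial trend: first difference
    history = [(level, trend)]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        history.append((level, trend))
    return level + trend, history

forecast, history = holt_forecast([10, 13, 15, 18, 17], alpha=0.5, beta=0.5)

for t, (level, trend) in enumerate(history, start=1):
    print(f"t={t}: level={level}, trend={trend}")
print("forecast for t=6:", forecast)  # 20.640625
```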

---

## 🎯 Intuition

* **Level** keeps track of “where the series seems to be right now.”
* **Trend** adjusts for “the direction and speed it’s moving.”
* Together, they prevent the model from overreacting to single dips/spikes while still capturing the overall slope.

---



# Triple exponential smoothing a.k.a. Holt-Winters

We've looked at exponential smoothing and double exponential smoothing. This time, we're going into triple exponential smoothing.

As you could have guessed, the idea is to add a third component - seasonality. This means that we should not use this method if our time series is not expected to have seasonality. Seasonal components in the model will explain repeated variations around intercept and trend, and it will be specified by the length of the season, in other words by the period after which the variations repeat. For each observation in the season, there is a separate component; for example, if the length of the season is 7 days (a weekly seasonality), we will have 7 seasonal components, one for each day of the week.

With this, let's write out a new system of equations:

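$$
\ell_x = \alpha(y_x - s_{x-L}) + (1-\alpha)(\ell_{x-1} + b_{x-1})
$$

$$
b_x = \beta(\ell_x - \ell_{x-1}) + (1-\beta) b_{x-1}
$$

$$
s_x = \gamma(y_x - \ell_x) + (1-\gamma) s_{x-L}
$$

$$
\hat{y}_{x+m} = \ell_x + m b_x + s_{x-L+1+(m-1) \bmod L}
$$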

The intercept now depends on the current value of the series minus any corresponding seasonal component. Trend remains unchanged, and the seasonal component depends on the current value of the series minus the intercept and on the previous value of the component. Take into account that the component is smoothed through all the available seasons; for example, if we have a Monday component, then it will only be averaged with other Mondays. You can read more on how averaging works and how the initial approximation of the trend and seasonal components is done here. Now that we have the seasonal component, we can predict not just one or two steps ahead but an arbitrary $m$ future steps ahead, which is very encouraging.
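To make the indexing concrete, here is a minimal sketch of one additive Holt-Winters update and the $m$-step forecast. The function names, the season length of 3, and all numbers are illustrative assumptions; initialization of the components is not covered here.

```python
def holt_winters_step(y, level, trend, seasonal_prev, alpha, beta, gamma):
    """One update; seasonal_prev is s_{x-L}, the component from one season ago."""
    new_level = alpha * (y - seasonal_prev) + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    new_seasonal = gamma * (y - new_level) + (1 - gamma) * seasonal_prev
    return new_level, new_trend, new_seasonal

def holt_winters_forecast(level, trend, last_season, m):
    """Forecast m steps ahead; last_season holds s_{x-L+1}, ..., s_x."""
    season_length = len(last_season)
    # Index (m - 1) mod L picks the seasonal component matching step x + m.
    return level + m * trend + last_season[(m - 1) % season_length]

level, trend, seasonal = holt_winters_step(
    y=120, level=100, trend=2, seasonal_prev=10, alpha=0.5, beta=0.5, gamma=0.5
)
print(level, trend, seasonal)  # 106.0 4.0 12.0

# Hypothetical components for a season of length 3: s_{x-2}, s_{x-1}, s_x.
last_season = [-3.0, -9.0, seasonal]
forecast_1 = holt_winters_forecast(level, trend, last_season, m=1)  # uses s_{x-2}
forecast_4 = holt_winters_forecast(level, trend, last_season, m=4)  # wraps around to s_{x-2}
print(forecast_1, forecast_4)  # 107.0 119.0
```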

Below is the code for a triple exponential smoothing model, which is also known by the last names of its creators, Charles Holt and his student Peter Winters. Additionally, the Brutlag method was included in the model to produce confidence intervals:

$$
\hat{y}_{\max_x} = \ell_{x-1} + b_{x-1} + s_{x-T} + m \cdot d_{t-T}
$$

$$
\hat{y}_{\min_x} = \ell_{x-1} + b_{x-1} + s_{x-T} - m \cdot d_{t-T}
$$

$$
d_t = \gamma |y_t - \hat{y}_t| + (1-\gamma) d_{t-T},
$$

where $T$ is the length of the season and $d$ is the predicted deviation.
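The band formulas can be sketched as two small helpers. This assumes that $m$ here acts as a scaling factor for the band width (the text above does not define it) and that `y_hat` stands for the one-step prediction $\ell_{x-1} + b_{x-1} + s_{x-T}$. Function names and numbers are illustrative.

```python
def brutlag_deviation(y, y_hat, d_prev_season, gamma):
    """Update the predicted deviation d_t from the latest absolute error."""
    return gamma * abs(y - y_hat) + (1 - gamma) * d_prev_season

def brutlag_band(y_hat, d_prev_season, m):
    """Return (lower, upper) bounds around the prediction y_hat.

    m is assumed to be a band-width scaling factor.
    """
    return y_hat - m * d_prev_season, y_hat + m * d_prev_season

# Illustrative numbers: prediction 20, deviation from one season ago 2.0.
lower, upper = brutlag_band(y_hat=20, d_prev_season=2.0, m=2)
print(lower, upper)  # 16.0 24.0

# After observing y = 17, update the deviation for this point in the season.
d_new = brutlag_deviation(y=17, y_hat=20, d_prev_season=2.0, gamma=0.5)
print(d_new)  # 2.5
```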