# Normalization and Stationarity Analysis of S&P 500 (SPY) Time Series

## 📊 **Overview**
This notebook demonstrates two fundamental preprocessing techniques for financial time series data: **normalization** and **stationarity transformation**. These techniques are essential for preparing financial data for machine learning models and statistical analysis.

---

## 🎯 **Objectives**
1. **Data Normalization**: Scale S&P 500 price data to a standardized range (0-1) using MinMaxScaler
2. **Stationarity Transformation**: Remove trends and make the series stationary using differencing
3. **Statistical Validation**: Verify stationarity using the Augmented Dickey-Fuller (ADF) test

---

## 📈 **Dataset**
- **Source**: S&P 500 ETF (SPY) historical data
- **File**: `spy_historical_data.csv` from the data collection module
- **Features**: Datetime index and SPY closing prices
- **Purpose**: Demonstrate preprocessing techniques on real financial data

---

## 🔧 **Key Techniques Implemented**

### 1️⃣ **Normalization with MinMaxScaler**
- **Purpose**: Scale price data to range [0, 1] for improved model performance
- **Method**: `sklearn.preprocessing.MinMaxScaler`
- **Benefits**: 
  - Reduces impact of scale variations
  - Prevents larger values from dominating model training
  - Helps neural networks and other gradient-based algorithms train more stably
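
With the default `feature_range=(0, 1)`, the rule behind `MinMaxScaler` is `x' = (x - min) / (max - min)`. The sketch below applies that formula with NumPy to the first five SPY closes shown in this notebook's data preview. Because the minimum and maximum come from those five rows only, the values differ from the notebook's `SPY_close_normalized` column. The helper names `min_max_scale` and `min_max_invert` are illustrative and not part of scikit-learn, and the sketch assumes the input is not constant.

```python
import numpy as np

def min_max_scale(x):
    """Scale x to [0, 1]; also return the min and max needed to invert."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), lo, hi

def min_max_invert(scaled, lo, hi):
    """Map scaled values back to the original price units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

# First five SPY closes from this notebook's data preview
prices = np.array([451.56, 451.80, 453.19, 453.08, 451.46])
scaled, lo, hi = min_max_scale(prices)
restored = min_max_invert(scaled, lo, hi)
print(scaled.round(4))
```

Keeping `lo` and `hi` around is what makes it possible to convert model outputs back to prices later.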

### 2️⃣ **Stationarity via Differencing**
- **Purpose**: Remove the trend in the price level so the series has a roughly constant mean over time (differencing alone does not stabilize the variance)
- **Method**: First-order differencing (`df.diff()`)
- **Benefits**:
  - Removes the stochastic trend (unit root) typical of price levels
  - Makes time series suitable for ARIMA modeling
  - Focuses on price changes rather than absolute levels
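
As a small illustration of first-order differencing, the sketch below applies `Series.diff()` to the first five SPY closes from this notebook's data preview and then rebuilds the price levels from the changes. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

prices = pd.Series([451.56, 451.80, 453.19, 453.08, 451.46], name="SPY")

# First-order difference: price[t] - price[t-1]; the first entry is NaN
diffs = prices.diff()

# Differencing is invertible: the first price plus the cumulative
# changes gives back the original levels
rebuilt = prices.iloc[0] + diffs.fillna(0.0).cumsum()

print(diffs.round(2).tolist())
```

The round trip shows that differencing discards only the starting level, which has to be stored separately if forecasts of changes are to be converted back to prices.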

### 3️⃣ **Statistical Testing**
- **Test**: Augmented Dickey-Fuller (ADF) Test
- **Null Hypothesis**: Series has unit root (non-stationary)
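- **Interpretation**: A p-value below 0.05 rejects the null hypothesis at the 5% level, which is evidence that the series is stationary; a larger p-value means a unit root cannot be ruled out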
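
To make the test statistic less of a black box, here is a simplified Dickey-Fuller regression written with NumPy only. It is a sketch, not a substitute for `adfuller`: it omits the lagged-difference terms that make the test "augmented", it does not compute a p-value, and it compares against -2.86, the asymptotic 5% critical value for the constant-only case. The function name `dickey_fuller_tstat` and the synthetic series are assumptions made for illustration.

```python
import numpy as np

def dickey_fuller_tstat(y):
    """t-statistic on gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])  # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)    # ordinary least squares
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # OLS coefficient covariance
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
noise = rng.normal(size=500)   # stationary by construction
walk = np.cumsum(noise)        # random walk: has a unit root

CRITICAL_5PCT = -2.86          # asymptotic 5% value, constant-only regression
t_walk = dickey_fuller_tstat(walk)
t_noise = dickey_fuller_tstat(noise)
print(f"random walk: {t_walk:.2f}")
print(f"white noise: {t_noise:.2f}  reject unit root: {t_noise < CRITICAL_5PCT}")
```

A strongly negative statistic means the lagged level pulls the series back toward its mean, which is what rejecting the unit-root null amounts to.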

---

## 📊 **Expected Results**
- **Original SPY Data**: Non-stationary with upward trend over time
- **Normalized Data**: Values scaled between 0 and 1, maintaining original trend pattern
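- **Differenced Data**: Daily price changes, which are expected to be stationary
- **ADF Test**: Expected to reject a unit root in the differenced series; note that the run recorded at the end of this notebook reports a p-value of about 0.133, which does not reject the null at the 5% level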

---

## 🚀 **Applications**
- **Machine Learning**: Preprocessed data ready for ML models
- **Risk Management**: Stationary returns for volatility modeling
- **Algorithmic Trading**: Normalized features for strategy development
- **Statistical Analysis**: Foundation for time series forecasting

---

## 📚 **Key Learning Points**
1. **Why Normalize**: Features measured on very different scales can dominate scale-sensitive models; normalization puts them on a comparable range
2. **Why Stationarity**: Many classical time series models, such as ARIMA after differencing, assume a stationary series for valid inference
3. **Practical Implementation**: Real-world application of preprocessing techniques
4. **Statistical Validation**: Importance of testing assumptions before modeling

Let's dive into the implementation! 👇

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [2]:
import os
# Build the absolute path to the CSV produced by the data collection module
path = os.path.abspath(os.path.join("..", "01_get_the_data", "spy_historical_data.csv"))

In [3]:
# Load the CSV
df = pd.read_csv(path)

# Drop any rows that are completely NaN (e.g. 'Date' row)
df = df.dropna(how='all')

# Parse the 'datetime' column into pandas datetime values
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')

# Show the result
df.head()

|   | datetime | SPY |
|---|---|---|
| 0 | 2021-08-31 16:30:00 | 451.56 |
| 1 | 2021-09-01 16:30:00 | 451.80 |
| 2 | 2021-09-02 16:30:00 | 453.19 |
| 3 | 2021-09-03 16:30:00 | 453.08 |
| 4 | 2021-09-07 16:30:00 | 451.46 |


In [4]:
# Normalize the close prices column using MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
df['SPY_close_normalized'] = scaler.fit_transform(df[['SPY']])

df.head()

|   | datetime | SPY | SPY_close_normalized |
|---|---|---|---|
| 0 | 2021-08-31 16:30:00 | 451.56 | 0.329004 |
| 1 | 2021-09-01 16:30:00 | 451.80 | 0.329835 |
| 2 | 2021-09-02 16:30:00 | 453.19 | 0.334649 |
| 3 | 2021-09-03 16:30:00 | 453.08 | 0.334268 |
| 4 | 2021-09-07 16:30:00 | 451.46 | 0.328658 |
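
The cell above fits the scaler on the full series, so every scaled value depends on the minimum and maximum of the whole sample, including later dates. If the scaled series feeds a forecasting model, a common precaution is to estimate the range on a training window only and reuse it for later rows. The sketch below illustrates this with NumPy on the five rows shown above; the 3/2 split and the helper names `fit_min_max` and `apply_min_max` are assumptions for illustration. With scikit-learn the analogous pattern would be `scaler.fit` on the training rows followed by `scaler.transform` on the rest.

```python
import numpy as np

def fit_min_max(train):
    """Estimate the scaling range from the training window only."""
    train = np.asarray(train, dtype=float)
    return train.min(), train.max()

def apply_min_max(x, lo, hi):
    """Scale x with a previously fitted range; values may fall outside [0, 1]."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

prices = np.array([451.56, 451.80, 453.19, 453.08, 451.46])
train, test = prices[:3], prices[3:]   # hypothetical chronological split

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
print(train_scaled.round(4), test_scaled.round(4))
```

The last test value is lower than every training price, so it scales to a negative number. That is the expected cost of avoiding look-ahead: out-of-sample values are not guaranteed to stay inside [0, 1].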


In [None]:
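# First-order differencing: change in the SPY close from one row to the next.
# The first row has no previous value, so its difference is NaN;
# that NaN is dropped right before the ADF test below.
df['SPY_close_differenced'] = df['SPY'].diff()
df.head()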

|   | datetime | SPY | SPY_close_normalized | SPY_close_differenced |
|---|---|---|---|---|
| 0 | 2021-08-31 16:30:00 | 451.56 | 0.329004 | NaN |
| 1 | 2021-09-01 16:30:00 | 451.80 | 0.329835 | 0.24 |
| 2 | 2021-09-02 16:30:00 | 453.19 | 0.334649 | 1.39 |
| 3 | 2021-09-03 16:30:00 | 453.08 | 0.334268 | -0.11 |
| 4 | 2021-09-07 16:30:00 | 451.46 | 0.328658 | -1.62 |


In [None]:
from statsmodels.tsa.stattools import adfuller

# Check the differenced series for a unit root with the Augmented Dickey-Fuller test
result = adfuller(df['SPY_close_differenced'].dropna())
print(f'Test statistic: {result[0]}')
print(f'P-value: {result[1]}')

# Null hypothesis: the series has a unit root (is non-stationary).
# A p-value below 0.05 rejects the null at the 5% level.
if result[1] < 0.05:
    print("Unit root rejected: the series appears stationary.")
else:
    print("Unit root not rejected: no evidence of stationarity at the 5% level.")


Test statistic: -2.4331050121192876
P-value: 0.13260989493578812
