# Normalization and Stationarity Analysis of S&P 500 (SPY) Time Series

## 📊 **Overview**
This notebook demonstrates two fundamental preprocessing techniques for financial time series data: **normalization** and **stationarity transformation**. These techniques are essential for preparing financial data for machine learning models and statistical analysis.

---

## 🎯 **Objectives**
1. **Data Normalization**: Rescale S&P 500 price data to the range [0, 1] using MinMaxScaler
2. **Stationarity Transformation**: Remove trends and make the series stationary using differencing
3. **Statistical Validation**: Verify stationarity using the Augmented Dickey-Fuller (ADF) test

---

## 📈 **Dataset**
- **Source**: S&P 500 ETF (SPY) historical data
- **File**: `spy_historical_data.csv` from the data collection module
- **Features**: Datetime index and SPY closing prices
- **Purpose**: Demonstrate preprocessing techniques on real financial data

---

## 🔧 **Key Techniques Implemented**

### 1️⃣ **Normalization with MinMaxScaler**
- **Purpose**: Rescale price data to the range [0, 1] so that features share a common scale
- **Method**: `sklearn.preprocessing.MinMaxScaler`
- **Benefits**:
  - Puts features measured in different units on a comparable scale
  - Prevents features with larger magnitudes from dominating model training
  - Often helps neural networks and other gradient-based algorithms converge
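
As a minimal sketch of what the scaler computes, min-max scaling maps each value to `(x - min) / (max - min)`. The five prices below are only an illustrative sample (the first SPY closes in this dataset), so the scaled values differ from those obtained when the scaler is fitted on the full series.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative sample: the first five SPY closes in this dataset
prices = np.array([451.56, 451.80, 453.19, 453.08, 451.46])

# Min-max scaling by hand: (x - min) / (max - min)
manual = (prices - prices.min()) / (prices.max() - prices.min())

# Same transformation via scikit-learn (expects a 2D array: rows x features)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(prices.reshape(-1, 1)).ravel()

# Both approaches agree; the smallest price maps to 0 and the largest to 1
print(np.allclose(manual, scaled))
```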

### 2️⃣ **Stationarity via Differencing**
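- **Purpose**: Remove the trend so that the mean of the series is roughly constant over time
- **Method**: First-order differencing (`df.diff()`), i.e. subtracting the previous value from each observation
- **Benefits**:
  - Removes the unit-root (random-walk) component that makes price levels non-stationary
  - Makes the series suitable for models that assume stationarity, such as ARMA (differencing is the "I" in ARIMA)
  - Focuses on price changes rather than absolute levels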
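
A small sketch of what first-order differencing does, using the same five illustrative SPY closes: each output value is the current price minus the previous one, and the first element is `NaN` because it has no predecessor.

```python
import pandas as pd

# Illustrative sample: the first five SPY closes in this dataset
prices = pd.Series([451.56, 451.80, 453.19, 453.08, 451.46])

# First difference by hand: y_t = x_t - x_{t-1}
by_hand = prices - prices.shift(1)

# pandas' built-in equivalent
builtin = prices.diff()

print(builtin.round(2).tolist())
# → [nan, 0.24, 1.39, -0.11, -1.62]
```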

### 3️⃣ **Statistical Testing**
- **Test**: Augmented Dickey-Fuller (ADF) Test
- **Null Hypothesis**: Series has unit root (non-stationary)
- **Interpretation**: A p-value below 0.05 rejects the null hypothesis at the 5% level, which is evidence that the series is stationary

---

## 📊 **Expected Results**
- **Original SPY Data**: Non-stationary with upward trend over time
- **Normalized Data**: Values scaled between 0 and 1, maintaining original trend pattern
- **Differenced Data**: Stationary series representing daily price changes
- **ADF Test**: Rejects the unit-root null hypothesis for the differenced series, supporting stationarity

---

## 🚀 **Applications**
- **Machine Learning**: Preprocessed data ready for ML models
- **Risk Management**: Stationary price changes (or returns) as input for volatility modeling
- **Algorithmic Trading**: Normalized features for strategy development
- **Statistical Analysis**: Foundation for time series forecasting
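
This notebook works with absolute price changes, whereas volatility models are more commonly fed returns. As a hedged sketch (not part of the notebook's pipeline), the three quantities can be computed side by side from the same illustrative prices:

```python
import numpy as np
import pandas as pd

# Illustrative sample: the first five SPY closes in this dataset
prices = pd.Series([451.56, 451.80, 453.19, 453.08, 451.46])

price_change = prices.diff()          # absolute change, in price units
simple_return = prices.pct_change()   # (p_t - p_{t-1}) / p_{t-1}
log_return = np.log(prices).diff()    # ln(p_t / p_{t-1})

summary = pd.DataFrame({
    "price_change": price_change,
    "simple_return": simple_return,
    "log_return": log_return,
}).dropna()
print(summary)
```

For small daily moves the simple and log returns are nearly identical; they are related exactly by `simple = exp(log) - 1`.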

---

## 📚 **Key Learning Points**
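1. **Why Normalize**: Financial features often sit on very different scales (prices, volumes, ratios); normalization puts them on a common range so that no feature dominates purely because of its magnitude
2. **Why Stationarity**: Many classical time series models assume stationarity; applying them to a trending series can give spurious results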
3. **Practical Implementation**: Real-world application of preprocessing techniques
4. **Statistical Validation**: Importance of testing assumptions before modeling

Let's dive into the implementation! 👇

In [6]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [7]:
import os

# Construct the path to the CSV file
path = os.path.abspath(os.path.join("..", "01_get_the_data", "spy_historical_data.csv"))

# Validate file exists
if not os.path.exists(path):
    raise FileNotFoundError(f"Data file not found: {path}")
    
print(f"Data file located: {path}")

Data file located: c:\Users\calli\OneDrive\Programmazione\github\FromZeroToQuant\FromZeroToQuant-\01_get_the_data\spy_historical_data.csv


In [8]:
# Load the CSV with error handling
try:
    df = pd.read_csv(path)
    print(f"Successfully loaded {len(df)} rows of data")
except Exception as e:
    # Chain the original exception so the root cause stays in the traceback
    raise ValueError(f"Failed to load CSV file: {e}") from e

# Drop any rows that are completely NaN (e.g. 'Date' row)
df = df.dropna(how='all')

# Validate required columns exist
required_columns = ['datetime', 'SPY']
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
    raise ValueError(f"Missing required columns: {missing_columns}")

# Convert datetime column to proper datetime format
try:
    df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')
except (ValueError, TypeError):
    # Fall back to pandas' format inference if the explicit format does not match
    df['datetime'] = pd.to_datetime(df['datetime'])

# Show the result
print(f"Final dataset shape: {df.shape}")
df.head()

Successfully loaded 1000 rows of data
Final dataset shape: (1000, 2)


| | datetime | SPY |
|---|---|---|
| 0 | 2021-08-31 16:30:00 | 451.56 |
| 1 | 2021-09-01 16:30:00 | 451.8 |
| 2 | 2021-09-02 16:30:00 | 453.19 |
| 3 | 2021-09-03 16:30:00 | 453.08 |
| 4 | 2021-09-07 16:30:00 | 451.46 |


In [9]:
# Validate SPY data before normalization
if df['SPY'].isna().any():
    print(f"Warning: Found {df['SPY'].isna().sum()} NaN values in SPY column")
    df = df.dropna(subset=['SPY'])

# Normalize the close prices column using MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
df['SPY_close_normalized'] = scaler.fit_transform(df[['SPY']])

print(f"Normalization complete. Range: [{df['SPY_close_normalized'].min():.4f}, {df['SPY_close_normalized'].max():.4f}]")
df.head()

Normalization complete. Range: [0.0000, 1.0000]


| | datetime | SPY | SPY_close_normalized |
|---|---|---|---|
| 0 | 2021-08-31 16:30:00 | 451.56 | 0.329004 |
| 1 | 2021-09-01 16:30:00 | 451.8 | 0.329835 |
| 2 | 2021-09-02 16:30:00 | 453.19 | 0.334649 |
| 3 | 2021-09-03 16:30:00 | 453.08 | 0.334268 |
| 4 | 2021-09-07 16:30:00 | 451.46 | 0.328658 |
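
A fitted scaler keeps the minimum and maximum it saw, so scaled values can be mapped back to prices when needed, for example to report a model's output in dollars. The sketch below uses a small stand-in frame named `demo` (a hypothetical name, not the notebook's `df`) to show the round trip.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in frame with the same column name as the notebook's data
demo = pd.DataFrame({"SPY": [451.56, 451.80, 453.19, 453.08, 451.46]})

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(demo[["SPY"]])       # array of shape (n, 1)

# inverse_transform undoes the scaling using the fitted min and max
restored = scaler.inverse_transform(scaled).ravel()

print(scaler.data_min_, scaler.data_max_)          # values learned during fit
print(np.allclose(restored, demo["SPY"]))
```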


In [10]:
# Calculate first difference to achieve stationarity
df['SPY_close_differenced'] = df['SPY'].diff()

# Remove the first row which will be NaN due to differencing
df = df.dropna(subset=['SPY_close_differenced'])

print(f"Differencing complete. {len(df)} observations remaining after removing NaN values")
print(f"Mean of differenced series: {df['SPY_close_differenced'].mean():.6f}")
df.head()

Differencing complete. 999 observations remaining after removing NaN values
Mean of differenced series: 0.192167


| | datetime | SPY | SPY_close_normalized | SPY_close_differenced |
|---|---|---|---|---|
| 1 | 2021-09-01 16:30:00 | 451.8 | 0.329835 | 0.24 |
| 2 | 2021-09-02 16:30:00 | 453.19 | 0.334649 | 1.39 |
| 3 | 2021-09-03 16:30:00 | 453.08 | 0.334268 | -0.11 |
| 4 | 2021-09-07 16:30:00 | 451.46 | 0.328658 | -1.62 |
| 5 | 2021-09-08 16:30:00 | 450.91 | 0.326753 | -0.55 |
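
Differencing loses only the starting level: given the first price, the original series can be rebuilt as that price plus the running sum of the changes. A minimal sketch on the first six closes of this dataset:

```python
import numpy as np
import pandas as pd

# Illustrative sample: the first six SPY closes in this dataset
prices = pd.Series([451.56, 451.80, 453.19, 453.08, 451.46, 450.91])
diffs = prices.diff().dropna()

# Levels = starting price + cumulative sum of the changes
rebuilt = prices.iloc[0] + diffs.cumsum()

print(np.allclose(rebuilt, prices.iloc[1:]))
```

This is useful when a model forecasts changes but the quantity of interest is the price level.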


In [11]:
from statsmodels.tsa.stattools import adfuller

# Verify stationarity of the differenced series using Augmented Dickey-Fuller test
print("=== AUGMENTED DICKEY-FULLER STATIONARITY TEST ===")
print("Null Hypothesis: Series has unit root (non-stationary)")
print("Alternative Hypothesis: Series is stationary\n")

# Perform ADF test (no need for additional dropna as data is already cleaned)
result = adfuller(df['SPY_close_differenced'])

# Display detailed results
print(f'ADF Test Statistic: {result[0]:.6f}')
print(f'P-value: {result[1]:.6f}')
print('Critical Values:')
for key, value in result[4].items():
    print(f'\t{key}: {value:.6f}')

# Interpret results
print("\n=== INTERPRETATION ===")
if result[1] < 0.05:
    print("✅ CONCLUSION: The differenced series IS STATIONARY")
    print("   (p-value < 0.05: Reject null hypothesis)")
else:
    print("❌ CONCLUSION: The differenced series is NOT STATIONARY")
    print("   (p-value >= 0.05: Fail to reject null hypothesis)")
    
=== AUGMENTED DICKEY-FULLER STATIONARITY TEST ===
Null Hypothesis: Series has unit root (non-stationary)
Alternative Hypothesis: Series is stationary

ADF Test Statistic: -17.412612
P-value: 0.000000
Critical Values:
	1%: -3.436939
	5%: -2.864449
	10%: -2.568319

=== INTERPRETATION ===
✅ CONCLUSION: The differenced series IS STATIONARY
   (p-value < 0.05: Reject null hypothesis)
