# Class 3: Time Series Analysis with ML Approach

This notebook explores Machine Learning approaches for time series forecasting, focusing on Facebook Prophet and XGBoost. We will also discuss appropriate train-test split strategies for time series data and compare the results with traditional statistical methods.

## 0. Setup and Imports

In [None]:

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
from prophet import Prophet
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (14, 7)
plt.rcParams['axes.grid'] = True


## 1. Data Loading and Train-Test Split for Time Series


In [None]:

# Download 10 years of Apple stock data
start_date = '2015-01-01'
end_date = '2025-01-01'
ticker = 'AAPL'

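# Fetch data using yfinance
# auto_adjust=False keeps the 'Adj Close' column (newer yfinance releases adjust prices and drop it by default)
df = yf.download(ticker, start=start_date, end=end_date, auto_adjust=False)

# Newer yfinance releases return MultiIndex columns (field, ticker); flatten to the field names
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)

print(f"Downloaded {ticker} stock data from {start_date} to {end_date}")
print(f"Shape of data: {df.shape}")

# Display the first few rows
print("\nFirst 5 rows of the data:")
display(df.head())

# We'll focus on the 'Adj Close' price for our analysis
ts = df['Adj Close']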
print(f"\nFocusing on Adjusted Close price for {ticker}")

# Split data into training and testing sets (80% train, 20% test)
train_size = int(len(ts) * 0.8)
train_ts = ts[:train_size]
test_ts = ts[train_size:]
print(f"\nSplit data into training ({len(train_ts)} samples) and testing ({len(test_ts)} samples) sets")

# Plot the training and testing data
plt.figure(figsize=(14, 7))
plt.plot(train_ts, label='Training Data')
plt.plot(test_ts, label='Testing Data')
plt.title(f'{ticker} Stock Price - Train/Test Split')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.tight_layout()
plt.show()

print("\n--- 1. Loading Data ---")
print(f"Train set size: {len(train_ts)}")
print(f"Test set size: {len(test_ts)}")
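
A single 80/20 chronological split gives one estimate of out-of-sample error. A common extension is an expanding-window split, where each fold trains on everything before its test block. The sketch below is a minimal hand-rolled version on index positions only; the helper name `expanding_window_splits` and its parameters are hypothetical, and scikit-learn's `TimeSeriesSplit` offers similar behavior.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits=5, test_size=None):
    """Yield (train_idx, test_idx) pairs where each test block follows its training window."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    for k in range(n_splits):
        # Test blocks are laid out back to back, ending at the last observation
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = np.arange(0, test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx

# Assumption for illustration: a series of 500 ordered observations
folds = list(expanding_window_splits(n_samples=500, n_splits=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Because every training index precedes every test index within a fold, no future observation can leak into training.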


## 2. Facebook Prophet for Time Series Forecasting


In [None]:

print("\n--- 2. Fitting Prophet Model ---")

# Prepare data for Prophet (requires 'ds' for dates and 'y' for values)
prophet_train = pd.DataFrame({
    'ds': train_ts.index,
    'y': train_ts.values
})

# Create and fit Prophet model
try:
    model = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=True,
        daily_seasonality=False,
        changepoint_prior_scale=0.05  # Flexibility of the trend
    )
    model.fit(prophet_train)
    
    # Build the prediction frame from the actual trading dates (train + test).
    # A business-day calendar (freq='B') includes market holidays, so it would
    # not line up with the test index and would leave some test dates unforecast.
    future_dates = pd.DataFrame({'ds': train_ts.index.append(test_ts.index)})
    
    # Make predictions
    forecast = model.predict(future_dates)
    
    # Extract predictions for the test period, aligned to the test dates
    prophet_forecast = forecast.set_index('ds')['yhat'].reindex(test_ts.index).values
    
    # Calculate error metrics
    prophet_rmse = math.sqrt(mean_squared_error(test_ts, prophet_forecast))
    prophet_mae = mean_absolute_error(test_ts, prophet_forecast)
    print(f"Prophet RMSE: {prophet_rmse:.4f}")
    print(f"Prophet MAE: {prophet_mae:.4f}")
    
    # Plot the forecast
    plt.figure(figsize=(14, 7))
    plt.plot(train_ts, label='Training Data')
    plt.plot(test_ts, label='Actual Test Data')
    plt.plot(test_ts.index, prophet_forecast, label='Prophet Forecast', color='red')
    plt.title('Facebook Prophet Forecast')
    plt.xlabel('Date')
    plt.ylabel('Price ($)')
    plt.legend()
    plt.tight_layout()
    plt.savefig('plot_15_prophet_forecast.png')
    plt.show()
    
    # Plot Prophet components
    fig = model.plot_components(forecast)
    plt.savefig('plot_16_prophet_components.png')
    plt.show()
    
except Exception as e:
    print(f"Error fitting Prophet: {e}")
    prophet_rmse = float('nan')
    prophet_mae = float('nan')


**Interpretation:**


Facebook Prophet is designed for forecasting time series with strong seasonal patterns at several scales (for example, weekly and yearly).

Key Features:
1. Automatic decomposition into trend, seasonality, and holidays
2. Handles missing data and outliers robustly
3. Can incorporate holiday effects and special events
4. Provides uncertainty intervals for forecasts
5. Designed to be easy to use with minimal parameter tuning

The model components plot shows:
- Trend: The overall direction of the time series
- Yearly Seasonality: Patterns that repeat annually
- Weekly Seasonality: Patterns that repeat weekly
- Daily Seasonality (if enabled): Patterns that repeat daily
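
The additive idea behind these components can be illustrated without Prophet itself. The sketch below fits a straight-line trend plus a yearly Fourier-series seasonality by ordinary least squares on synthetic data. It is not Prophet's implementation (there are no changepoints, priors, or holiday effects), and the helper `fourier_features` and all numeric settings are assumptions chosen for illustration.

```python
import numpy as np

def fourier_features(t_days, period, order):
    """Return sin/cos columns describing a seasonal cycle of the given period (in days)."""
    cols = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t_days / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)

# Synthetic daily series: linear trend + yearly cycle + noise (illustrative only)
rng = np.random.default_rng(42)
t = np.arange(3 * 365, dtype=float)
true_trend = 50.0 + 0.02 * t
true_season = 5.0 * np.sin(2.0 * np.pi * t / 365.25)
y = true_trend + true_season + rng.normal(0.0, 1.0, size=t.size)

# Design matrix: intercept, linear trend, yearly Fourier terms
X = np.column_stack([np.ones_like(t), t, fourier_features(t, period=365.25, order=3)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Split the fitted values back into the two additive components
trend_hat = X[:, :2] @ coef[:2]
season_hat = X[:, 2:] @ coef[2:]

print(f"Estimated trend slope: {coef[1]:.4f}")
print(f"Estimated seasonal amplitude: {np.abs(season_hat).max():.2f}")
```

Plotting `trend_hat` and `season_hat` separately would give a rough analogue of the trend and yearly panels in the components plot.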

Prophet works well for:
- Business forecasting tasks with strong seasonality
- Time series with multiple seasonal patterns
- Data with missing values or outliers
- Forecasts that need to incorporate known future events

Note: For financial time series like stock prices, Prophet may not always outperform other models since stock prices often don't follow regular seasonal patterns and are influenced by many external factors.
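
One way to put this caveat into practice is to compare any model against a naive persistence forecast, which simply repeats the last observed training value. The sketch below uses a synthetic random walk as a stand-in for prices; the function names and the data are assumptions for illustration, not results from this notebook.

```python
import numpy as np

def persistence_forecast(history, horizon):
    """Repeat the last observed value for every step of the forecast horizon."""
    return np.full(horizon, history[-1], dtype=float)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Synthetic random walk standing in for a price series (illustrative only)
rng = np.random.default_rng(7)
prices = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=300))

split = int(len(prices) * 0.8)
train, test = prices[:split], prices[split:]

naive = persistence_forecast(train, len(test))
print(f"Persistence RMSE: {rmse(test, naive):.4f}")
print(f"Persistence MAE:  {mae(test, naive):.4f}")
```

If a seasonal model cannot beat this baseline on the test period, its seasonal structure is probably not adding predictive value for that series.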


## 3. XGBoost for Time Series Forecasting

In [None]:

print("\n--- 3. Fitting XGBoost Model ---")

# Feature engineering for XGBoost
def create_features(df):
    # Create time series features based on time series index
    df = df.copy()
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    # isocalendar() returns a nullable UInt32 column; cast to a plain integer dtype
    df['weekofyear'] = df.index.isocalendar().week.astype(int)
    return df

# Create a dataframe with the time series
df_full = pd.DataFrame({'y': ts})

# Add time-based features
df_full = create_features(df_full)

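# Add lag features. They are computed on the full series so that early test rows
# can look back into the training period; shift() only uses past values, so this
# introduces no lookahead.
for lag in [1, 5, 10, 21]:  # 1 day, 1 week, 2 weeks, 1 month
    df_full[f'lag_{lag}'] = df_full['y'].shift(lag)

# Add rolling mean features. shift(1) comes first so each window ends at t-1;
# without it the window would include the target value y[t] itself (target leakage).
for window in [5, 21]:  # 1 week, 1 month
    df_full[f'rolling_mean_{window}'] = df_full['y'].shift(1).rolling(window=window).mean()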

# Split into train and test sets
train_df = df_full[:train_size].copy()
test_df = df_full[train_size:].copy()

# Drop NaN values (created by lag and rolling features)
train_df = train_df.dropna()

# Define features and target
feature_columns = [col for col in train_df.columns if col != 'y']
X_train = train_df[feature_columns]
y_train = train_df['y']
X_test = test_df[feature_columns]
y_test = test_df['y']

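# Hold out the most recent 10% of the training period as a validation set,
# so early stopping is judged on data the trees were not fit on
val_size = max(1, int(len(X_train) * 0.1))
X_fit, y_fit = X_train.iloc[:-val_size], y_train.iloc[:-val_size]
X_val, y_val = X_train.iloc[-val_size:], y_train.iloc[-val_size:]

# Train XGBoost model
# (eval_metric and early_stopping_rounds are constructor arguments in XGBoost >= 1.6;
# passing them to fit() was removed in 2.0)
xgb_model = xgb.XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='rmse',
    early_stopping_rounds=50
)

# Early stopping monitors RMSE on the chronological validation set
xgb_model.fit(
    X_fit, y_fit,
    eval_set=[(X_val, y_val)],
    verbose=False
)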

# Make predictions (one-step-ahead: each test row's lag features come from actual prior prices)
xgb_forecast = xgb_model.predict(X_test)

# Calculate error metrics
xgb_rmse = math.sqrt(mean_squared_error(y_test, xgb_forecast))
xgb_mae = mean_absolute_error(y_test, xgb_forecast)
print(f"XGBoost RMSE: {xgb_rmse:.4f}")
print(f"XGBoost MAE: {xgb_mae:.4f}")

# Plot the forecast
plt.figure(figsize=(14, 7))
plt.plot(train_ts, label='Training Data')
plt.plot(test_ts, label='Actual Test Data')
plt.plot(test_ts.index, xgb_forecast, label='XGBoost Forecast', color='green')
plt.title('XGBoost Forecast')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.tight_layout()
plt.savefig('plot_17_xgboost_forecast.png')
plt.show()

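# Feature importance (pass our own axes; otherwise plot_importance creates a
# second figure and the one sized here stays empty)
fig, ax = plt.subplots(figsize=(12, 6))
xgb.plot_importance(xgb_model, max_num_features=10, ax=ax)
ax.set_title('XGBoost Feature Importance')
plt.tight_layout()
plt.show()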


**Interpretation:**


XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm that can be adapted for time series forecasting:

Key Features:
1. Gradient boosting framework that builds an ensemble of decision trees
2. Can capture complex non-linear relationships
3. Handles a mix of numerical and categorical features
4. Provides feature importance rankings
5. Regularization to prevent overfitting

For time series forecasting with XGBoost:
- Feature engineering is crucial (time-based features, lag features, rolling statistics)
- The model treats forecasting as a supervised learning problem
- Early stopping helps prevent overfitting
- Feature importance helps identify which factors most influence the predictions
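
The supervised-learning framing can be sketched as a small feature builder in which every column at time t depends only on values before t. The helper name `make_supervised` and the toy series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def make_supervised(series, lags=(1, 5), windows=(5,)):
    """Build a feature table in which every feature uses only values before time t."""
    frame = pd.DataFrame({'y': series})
    for lag in lags:
        frame[f'lag_{lag}'] = series.shift(lag)
    for window in windows:
        # shift(1) first, so the window ends at t-1 and excludes the target itself
        frame[f'rolling_mean_{window}'] = series.shift(1).rolling(window).mean()
    # The first rows lack enough history and contain NaN
    return frame.dropna()

# Toy series 0, 1, 2, ... on business days, so the features are easy to check by eye
idx = pd.bdate_range('2024-01-01', periods=30)
series = pd.Series(np.arange(30, dtype=float), index=idx)

features = make_supervised(series)
print(features.head())
```

On this toy series the first usable row has `y = 5`, `lag_1 = 4`, `lag_5 = 0`, and a rolling mean of 2 (the average of 0 through 4), confirming that no feature touches the value being predicted.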

XGBoost works well for:
- Complex time series with non-linear relationships
- Forecasting problems where external variables are important
- Cases where interpretability of feature importance is valuable
- Situations where traditional time series models underperform

By default, the feature importance plot ranks features by how often the trees split on them (the `weight` importance type). This indicates which inputs the model relies on, not which factors causally drive the price.


## 4. Comparing Statistical and ML Approaches


In [None]:

print("\n--- 4. ML Model Performance Comparison (Test Set) ---")
print(f"Prophet RMSE: {prophet_rmse:.4f}, MAE: {prophet_mae:.4f}")
print(f"XGBoost RMSE: {xgb_rmse:.4f}, MAE: {xgb_mae:.4f}")
print("Compare these metrics with those from Class 2 (Statistical Models).")

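# Note: In a real-world scenario, you would load the results from Class 2 models
# and create a comprehensive comparison here. For this notebook, we'll just
# compare the ML models we've implemented.
# Caveat: Prophet forecasts the whole test period from the training data alone,
# while XGBoost predicts one step ahead using actual lagged prices, so the two
# sets of metrics measure different tasks and are not directly comparable.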

# Determine the best ML model
if not np.isnan(prophet_rmse) and not np.isnan(xgb_rmse):
    best_model = "Prophet" if prophet_rmse < xgb_rmse else "XGBoost"
    print(f"\nBest performing ML model based on RMSE: {best_model}")
elif not np.isnan(xgb_rmse):
    print("\nXGBoost is the only successfully trained ML model")
elif not np.isnan(prophet_rmse):
    print("\nProphet is the only successfully trained ML model")
else:
    print("\nNo ML models were successfully trained and evaluated")

print("\nClass 3 Demonstrations Complete.")
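
When more than two models are involved, a small table is easier to read than chained `if` statements. The sketch below is one possible helper; the name `compare_forecasts` and the toy numbers are hypothetical and are not results from this notebook.

```python
import numpy as np
import pandas as pd

def compare_forecasts(y_true, forecasts):
    """Return a table of RMSE and MAE per model, sorted by RMSE (best first)."""
    y_true = np.asarray(y_true, dtype=float)
    rows = []
    for name, y_pred in forecasts.items():
        err = y_true - np.asarray(y_pred, dtype=float)
        rows.append({
            'model': name,
            'RMSE': float(np.sqrt(np.mean(err ** 2))),
            'MAE': float(np.mean(np.abs(err))),
        })
    return pd.DataFrame(rows).set_index('model').sort_values('RMSE')

# Toy values chosen so the two metrics disagree (illustrative only)
y_true = np.array([10.0, 11.0, 12.0, 13.0])
toy_forecasts = {
    'model_a': np.array([10.0, 11.0, 12.0, 16.0]),  # one large miss
    'model_b': np.array([11.0, 12.0, 13.0, 14.0]),  # small, consistent misses
}

table = compare_forecasts(y_true, toy_forecasts)
print(table)
```

In this toy case `model_b` has the lower RMSE while `model_a` has the lower MAE, which is why looking at more than one metric is worthwhile: RMSE penalizes the single large miss more heavily.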


**Comprehensive Comparison:**


Comparing Statistical and Machine Learning approaches for time series forecasting reveals important insights:

### Statistical Models (ARIMA/SARIMAX):
- **Strengths:**
  - Strong theoretical foundation based on time series properties
  - Interpretable parameters with clear statistical meaning
  - Effective for data with clear autocorrelation patterns
  - Relatively simple to implement for basic cases
  - Provide prediction intervals that are valid when the model's distributional assumptions hold

- **Limitations:**
  - Rely on stationarity assumptions
  - May struggle with complex non-linear patterns
  - Limited ability to incorporate many external variables
  - Manual order selection can be time-consuming
  - May not handle multiple seasonality well

### Machine Learning Models (Prophet/XGBoost):
- **Strengths:**
  - Can capture complex non-linear relationships
  - Flexible incorporation of many features and external variables
  - Prophet handles multiple seasonality patterns automatically
  - XGBoost provides feature importance for interpretability
  - Prophet often works with little manual tuning, although XGBoost usually benefits from a hyperparameter search

- **Limitations:**
  - May require more data for effective training
  - Risk of overfitting with too many features
  - Less theoretical foundation in time series properties
  - May be computationally more intensive
  - Feature engineering is crucial for good performance

### When to Use Each Approach:
- **Statistical Models:** When you have clear autocorrelation patterns, need interpretability, have limited data, or when the time series follows traditional patterns.
- **Prophet:** When dealing with time series with multiple seasonality patterns, missing data, or when you need an easy-to-use forecasting tool.
- **XGBoost:** When you have many potential predictive features, complex non-linear relationships, or when traditional models underperform.

### Hybrid Approaches:
- Combining statistical and ML models can leverage the strengths of both
- Use statistical models for baseline forecasts and ML for residual modeling
- Ensemble methods can combine predictions from multiple model types
- Feature engineering informed by statistical analysis can improve ML models
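
The residual-modeling idea can be sketched in two stages: fit a simple baseline, then fit a second model to whatever the baseline leaves unexplained. The example below uses a linear trend as the baseline and an AR(1) regression on its residuals, both on synthetic data. In practice the second stage might be a tree ensemble, and all settings here are assumptions for illustration.

```python
import numpy as np

# Synthetic series: linear trend plus autocorrelated AR(1) noise (illustrative only)
rng = np.random.default_rng(1)
n = 400
t = np.arange(n, dtype=float)
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.8 * noise[i - 1] + rng.normal(0.0, 1.0)
y = 20.0 + 0.05 * t + noise

# Stage 1: baseline model (straight-line trend)
slope, intercept = np.polyfit(t, y, deg=1)
baseline = intercept + slope * t
resid = y - baseline

# Stage 2: model the baseline's residuals from their own previous value
A = np.column_stack([np.ones(n - 1), resid[:-1]])
coef, *_ = np.linalg.lstsq(A, resid[1:], rcond=None)
c, phi = coef
resid_hat = c + phi * resid[:-1]

# Hybrid one-step-ahead fit = baseline + predicted residual
hybrid = baseline[1:] + resid_hat

rmse_baseline = float(np.sqrt(np.mean((y[1:] - baseline[1:]) ** 2)))
rmse_hybrid = float(np.sqrt(np.mean((y[1:] - hybrid) ** 2)))
print(f"Baseline RMSE: {rmse_baseline:.4f}")
print(f"Hybrid RMSE:   {rmse_hybrid:.4f}")
```

These are in-sample, one-step-ahead errors, so the improvement is an upper bound on what to expect out of sample; a proper evaluation would repeat the comparison on a chronological hold-out set.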

The best approach often depends on the specific characteristics of your time series data, the forecasting horizon, and your specific requirements for interpretability versus accuracy.


## Final Recommendations


Based on our exploration of both statistical and machine learning approaches for time series forecasting, here are key recommendations:

1. **Start Simple, Then Increase Complexity:**
   - Begin with simple models like ARIMA as a baseline
   - Add complexity only if simpler models don't perform adequately
   - Compare performance metrics across model types

2. **Proper Evaluation:**
   - Always use a proper time series train-test split (no future leakage)
   - Consider multiple error metrics (RMSE, MAE)
   - Evaluate models across different forecast horizons
   - Test model performance during different market conditions

3. **Feature Engineering:**
   - For ML models, feature engineering is crucial
   - Include calendar features (day of week, month, holidays)
   - Create lag features at appropriate intervals
   - Add rolling statistics (means, standard deviations)
   - Consider domain-specific external variables

4. **Model Selection Considerations:**
   - For data with clear seasonal patterns: SARIMAX or Prophet
   - For data with many potential predictive features: XGBoost
   - For volatility forecasting: GARCH models
   - For interpretability needs: Statistical models or simpler ML models

5. **Practical Implementation:**
   - Regularly retrain models as new data becomes available
   - Implement monitoring to detect when model performance degrades
   - Consider ensemble approaches combining multiple models
   - Balance complexity with maintainability for production systems

6. **For Stock Price Forecasting Specifically:**
   - Recognize the inherent unpredictability of financial markets
   - Focus on probabilistic forecasts rather than point estimates
   - Consider incorporating sentiment data and market indicators
   - Combine price forecasts with volatility forecasts for risk assessment
   - Remember that even the best models have limitations in financial forecasting
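
As a minimal sketch of a probabilistic forecast, the example below resamples historical log returns to simulate many future price paths and reads prediction bands off the percentiles. It assumes returns are independent and identically distributed, which ignores volatility clustering; the function name `bootstrap_price_paths` and the synthetic price history are hypothetical.

```python
import numpy as np

def bootstrap_price_paths(prices, horizon, n_paths=2000, seed=0):
    """Simulate future price paths by resampling historical log returns with replacement."""
    rng = np.random.default_rng(seed)
    log_returns = np.diff(np.log(prices))
    sampled = rng.choice(log_returns, size=(n_paths, horizon), replace=True)
    # Cumulative log returns applied to the last observed price
    return prices[-1] * np.exp(np.cumsum(sampled, axis=1))

# Synthetic price history standing in for real data (illustrative only)
rng = np.random.default_rng(3)
prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0005, 0.015, size=750)))

paths = bootstrap_price_paths(prices, horizon=21)
lower, median, upper = np.percentile(paths, [5, 50, 95], axis=0)

print(f"Last price: {prices[-1]:.2f}")
print(f"21-day 90% band: [{lower[-1]:.2f}, {upper[-1]:.2f}], median {median[-1]:.2f}")
```

The band widens with the horizon, which makes the growing uncertainty explicit in a way a single point forecast cannot.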

The most effective approach often combines elements from both statistical and machine learning methods, leveraging the strengths of each while mitigating their weaknesses.
