# Lab: PJM Hourly Energy Forecasting with SageMaker DeepAR

## Overview
In this lab, we will use the **PJM Hourly Energy Consumption Dataset** to predict future electricity demand. We will use **Amazon SageMaker's built-in DeepAR algorithm**, which is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using Recurrent Neural Networks (RNNs).

### Lab Objectives:
1. **Load and Explore Data**: Read the CSV data and visualize the energy consumption.
2. **Preprocess for DeepAR**: Convert the data into the JSON format required by DeepAR.
3. **Train a Model**: Use SageMaker Estimators to train a forecasting model.
4. **Deploy and Predict**: Deploy an endpoint and visualize predictions against actuals.
5. **Reproduce in Canvas**: Learn how to achieve the same results using the No-Code SageMaker Canvas interface.

---

## 1. Setup and Environment

First, we import the necessary libraries. We will use `boto3` for AWS services, `sagemaker` for model training and deployment, `pandas` for data manipulation, and `matplotlib` for plotting.

In [None]:
import sagemaker
import boto3
import pandas as pd
import matplotlib.pyplot as plt
import json
import os
from sagemaker import image_uris

# Set up SageMaker session and role
session = sagemaker.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = 'sagemaker/pjm-energy-forecasting'

print(f"Region: {region}")
print(f"S3 Bucket: {bucket}")
print(f"Role: {role}")

## 2. Load Sample Data

In a real scenario, you would have the full dataset in a folder named `data/hourly_energy_consumption`. 

**Note:** To ensure this notebook runs for you immediately, the cell below creates a dummy `DAYTON_hourly.csv` file if it doesn't exist. If you have the actual dataset, ensure it is placed in `data/hourly_energy_consumption/DAYTON_hourly.csv`.

In [None]:
# Create a directory for data
os.makedirs('data/hourly_energy_consumption', exist_ok=True)
file_path = 'data/hourly_energy_consumption/DAYTON_hourly.csv'

# If the file doesn't exist, create a synthetic one so the lab can still run
if not os.path.exists(file_path):
    print("Dataset not found. Generating sample data for the lab...")
    # Generate date range
    dates = pd.date_range(start='2004-12-01', end='2005-02-01', freq='H')
    import numpy as np
    np.random.seed(42)  # fixed seed so the synthetic data is reproducible
    # Generate synthetic load data (sine wave + random noise)
    values = 1500 + 300 * np.sin(np.linspace(0, 100, len(dates))) + np.random.normal(0, 50, len(dates))
    df_dummy = pd.DataFrame({'Datetime': dates, 'DAYTON_MW': values.astype(int)})
    df_dummy.to_csv(file_path, index=False)
    print(f"Created sample file at {file_path}")
else:
    print(f"Found dataset at {file_path}")

# Read the data
df = pd.read_csv(file_path)
print(f"Data Shape: {df.shape}")
df.head()

## 3. Exploratory Data Analysis (EDA)

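We need to clean the data and visualize it to understand its patterns (seasonality, trends).

1. Convert `Datetime` to a proper datetime object.
2. Set `Datetime` as the index.
3. Sort the index chronologically. Duplicate timestamps and missing hours are handled in the next section, when the series is resampled to a regular hourly frequency.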

In [None]:
# 1. Convert to Datetime
df['Datetime'] = pd.to_datetime(df['Datetime'])

# 2. Set Index
df = df.set_index('Datetime')

# 3. Sort index to ensure chronological order
df = df.sort_index()

# 4. Plot the full series
plt.figure(figsize=(15, 5))
plt.plot(df.index, df['DAYTON_MW'], label='Energy Consumption (MW)', color='blue', alpha=0.7)
plt.title('Hourly Energy Consumption - Dayton Region')
plt.ylabel('Megawatts (MW)')
plt.xlabel('Date')
plt.legend()
plt.grid(True)
plt.show()
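
Real hourly data can contain duplicate timestamps and missing hours. A minimal, library-free sketch of how both could be detected before resampling is shown below. The sample timestamps are made up for illustration, and the helper name `find_duplicates_and_gaps` is hypothetical; in the notebook itself, `df.index.duplicated()` would be the pandas counterpart of the duplicate check.

```python
from collections import Counter
from datetime import datetime, timedelta

def find_duplicates_and_gaps(timestamps, step=timedelta(hours=1)):
    """Return (duplicated timestamps, missing timestamps) for a regular grid."""
    counts = Counter(timestamps)
    duplicates = sorted(ts for ts, n in counts.items() if n > 1)
    present = set(counts)
    missing = []
    current, end = min(present), max(present)
    # Walk the expected grid and record every step that has no reading
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return duplicates, missing

# Made-up sample: 02:00 appears twice and 03:00 is absent
sample = [
    datetime(2005, 1, 1, 0), datetime(2005, 1, 1, 1),
    datetime(2005, 1, 1, 2), datetime(2005, 1, 1, 2),
    datetime(2005, 1, 1, 4),
]
duplicates, missing = find_duplicates_and_gaps(sample)
print("Duplicated:", duplicates)
print("Missing:", missing)
```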

### Preparing Data for DeepAR

In this lab, DeepAR reads its input as JSON Lines: one JSON object per line, each with a `start` timestamp and a `target` array. We also need to decide on:
- **Context Length**: How far back the model looks (e.g., 1 week = 168 hours).
- **Prediction Length**: How far forward we predict (e.g., 24 hours).

We will split the data into **Training** (the series with the final 24 hours removed) and **Testing** (the full series, whose final 24 hours DeepAR withholds and compares against its forecast).

In [None]:
freq = 'H' # Hourly data
prediction_length = 24 # Predict next 24 hours
context_length = 24 * 7 # Look back 1 week

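# Ensure regular frequency: average duplicate timestamps, then forward-fill missing hours
# (fillna(method='ffill') is deprecated in recent pandas; .ffill() is the replacement)
ts_data = df['DAYTON_MW'].resample(freq).mean().ffill()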

# Split Train and Test
# Train ends 24 hours before the end of the dataset
train_series = ts_data.iloc[:-prediction_length]
test_series = ts_data # Test includes the whole series (DeepAR uses the end to evaluate)

print(f"Train end date: {train_series.index[-1]}")
print(f"Test end date: {test_series.index[-1]}")
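
The split above holds out a single 24-hour window. One possible extension, sketched here on a made-up list of values, is to cut the series at several end points so that more than one window is evaluated. The function name `rolling_test_windows` and the choice of four windows are assumptions for illustration, not part of the lab.

```python
def rolling_test_windows(values, prediction_length, num_windows):
    """Return (train, test) pairs; each test series ends one window later."""
    windows = []
    for k in range(num_windows, 0, -1):
        # Drop the last (k - 1) windows so this test series ends earlier
        end = len(values) - (k - 1) * prediction_length
        test = values[:end]
        train = test[:-prediction_length]
        windows.append((train, test))
    return windows

values = list(range(240))  # stand-in for 10 days of hourly readings
windows = rolling_test_windows(values, prediction_length=24, num_windows=4)
for train, test in windows:
    print(f"train length = {len(train)}, test length = {len(test)}")
```

To avoid leaking held-out data into training, only the shortest `train` slice (the first pair) would be used for fitting; each `test` slice could then be written as its own line in the test file.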

In [None]:
# Helper function to convert a series to the DeepAR JSON Lines format
def write_json_dataset(series, filename):
    # DeepAR expects one JSON object per line: a start timestamp and a target array
    json_obj = {
        "start": str(series.index[0]),
        # tolist() yields native Python numbers, which json can always serialize
        "target": series.tolist()
    }
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(json.dumps(json_obj) + '\n')
    print(f"Created {filename}")

# Create local JSON files
write_json_dataset(train_series, 'train.json')
write_json_dataset(test_series, 'test.json')

# Upload to S3
train_path = session.upload_data('train.json', bucket=bucket, key_prefix=f'{prefix}/train')
test_path = session.upload_data('test.json', bucket=bucket, key_prefix=f'{prefix}/test')

print(f"Training data uploaded to: {train_path}")
print(f"Test data uploaded to: {test_path}")
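
If a series still contains missing values when it is written (for example, a leading gap that a forward fill cannot reach), `json.dumps` emits the bare token `NaN`, which is not strict JSON. The sketch below uses only the standard library to map missing values to `null` and then reads the line back to confirm it parses. The helper name `to_deepar_record` is hypothetical, and the assumption that DeepAR accepts `null` for missing targets should be checked against the DeepAR input format documentation.

```python
import json
import math

def to_deepar_record(start, values):
    """Build one JSON Lines record, mapping NaN to null (None in Python)."""
    target = [None if isinstance(v, float) and math.isnan(v) else v for v in values]
    # allow_nan=False makes json.dumps raise if a NaN slipped through
    return json.dumps({"start": start, "target": target}, allow_nan=False)

line = to_deepar_record("2004-12-01 00:00:00", [1500.0, float("nan"), 1510.5])
print(line)

# Round trip: each line must parse on its own as one JSON object
record = json.loads(line)
print(sorted(record))
```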

## 4. Train DeepAR Model

We retrieve the built-in DeepAR Docker image and configure the Estimator.

In [None]:
# Retrieve DeepAR Image URI
image_uri = image_uris.retrieve(region=region, framework='forecasting-deepar')

# Define the Estimator
estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge', # Cost-effective instance for training
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=session
)

# Set Hyperparameters
estimator.set_hyperparameters(
    time_freq=freq,
    context_length=str(context_length),
    prediction_length=str(prediction_length),
    epochs='20',           # Low epochs for quick lab execution
    early_stopping_patience='10',
    num_layers='2'
)

# Train the model
print("Starting training... this may take 5-10 minutes.")
estimator.fit(inputs={'train': train_path, 'test': test_path})
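
A check like the one below could run before `estimator.fit` to catch a series that is too short for the chosen hyperparameters. This is a sketch under stated assumptions: the helper name `check_deepar_inputs` is hypothetical, the thresholds of 300 total observations and a prediction length of at most 400 reflect DeepAR guidance as best recalled here and should be verified against the documentation, and the third rule is only a heuristic of this sketch.

```python
def check_deepar_inputs(num_observations, context_length, prediction_length):
    """Return a list of warnings; an empty list means no issue was found."""
    problems = []
    if num_observations < 300:  # assumed minimum, verify against the DeepAR docs
        problems.append("fewer than 300 training observations")
    if prediction_length > 400:  # assumed upper guideline, verify against the DeepAR docs
        problems.append("prediction_length above 400")
    if num_observations < context_length + prediction_length:  # heuristic of this sketch
        problems.append("series shorter than context_length + prediction_length")
    return problems

# The synthetic fallback has 1489 hourly points; 24 are held out for testing
ok = check_deepar_inputs(num_observations=1465, context_length=168, prediction_length=24)
too_short = check_deepar_inputs(num_observations=100, context_length=168, prediction_length=24)
print(ok)
print(too_short)
```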

## 5. Deploy and Predict

Once training is complete, we deploy the model to an endpoint for real-time inference.

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Deploy the model
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

print("Endpoint deployed.")

### Make Predictions
We send the time series data to the endpoint. DeepAR will look at the provided data and forecast the next 24 hours (as defined in `prediction_length`).

In [None]:
# Prepare request
# We send the 'train' series. DeepAR will predict the continuation.
request_data = {
    "instances": [
        {"start": str(train_series.index[0]), "target": train_series.tolist()}
    ],
    "configuration": {"num_samples": 50, "output_types": ["quantiles"], "quantiles": ["0.5", "0.9"]}
}

# Get prediction
prediction = predictor.predict(request_data)

# Extract forecasts for the first (and only) series in the request
forecast = prediction['predictions'][0]
p50 = forecast['quantiles']['0.5'] # Median prediction
p90 = forecast['quantiles']['0.9'] # 90th percentile (upper bound)

# Create a DataFrame for plotting
forecast_date_range = pd.date_range(start=train_series.index[-1] + pd.Timedelta(hours=1), periods=prediction_length, freq='H')

plt.figure(figsize=(15, 7))

# Plot actual historical data (Zoom in on last week for visibility)
zoom_start = -168 
plt.plot(test_series.index[zoom_start:], test_series.values[zoom_start:], label='Actual Data', color='black')

# Plot forecast
plt.plot(forecast_date_range, p50, label='Predicted (P50)', color='orange', linestyle='--')
plt.fill_between(forecast_date_range, p50, p90, color='orange', alpha=0.3, label='P50 to P90 range')

plt.title('PJM Energy Forecast: Actual vs DeepAR Prediction')
plt.ylabel('MW')
plt.legend()
plt.grid()
plt.show()
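
The plot gives a visual comparison; a numeric one can be computed from the same lists. The sketch below implements RMSE and a weighted quantile loss in plain Python on made-up numbers. The function names are hypothetical, and the wQL normalization used (twice the summed pinball loss divided by the sum of absolute actuals) is one common definition that may differ in detail from what SageMaker reports.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def weighted_quantile_loss(actual, predicted, q):
    """2 * sum of pinball losses at quantile q, divided by sum of |actual|."""
    pinball = sum(
        q * (a - p) if a >= p else (1 - q) * (p - a)
        for a, p in zip(actual, predicted)
    )
    return 2 * pinball / sum(abs(a) for a in actual)

# Made-up values standing in for the held-out actuals and the P50 forecast
actual = [100.0, 110.0, 120.0, 130.0]
p50_demo = [102.0, 108.0, 120.0, 126.0]

rmse_value = rmse(actual, p50_demo)
wql_value = weighted_quantile_loss(actual, p50_demo, q=0.5)
print(f"RMSE: {rmse_value:.3f}")
print(f"wQL(0.5): {wql_value:.4f}")
```

In the notebook, `actual` would correspond to `test_series.values[-prediction_length:]` and the prediction to `p50`.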

## 6. Clean-up

**Important:** Delete the endpoint to avoid incurring charges.

In [None]:
predictor.delete_endpoint()
print("Endpoint deleted.")

# Optional: Remove this lab's objects from S3 (everything under the prefix)
# boto3.resource('s3').Bucket(bucket).objects.filter(Prefix=prefix).delete()

## 7. Instructions for Reproducing in SageMaker Canvas

SageMaker Canvas allows you to build this model without writing any code. Follow these steps to reproduce the results:

### Step 1: Import Data
1. Open **SageMaker Canvas** from the AWS Console.
2. Navigate to **Data** -> **Import**.
3. Upload the `DAYTON_hourly.csv` file (you can download it from the `data/` folder in the Jupyter file browser).
4. Preview the data and choose **Import**.

### Step 2: Build a Model
1. Go to **My Models** -> **New Model**.
2. Name it `PJM-Energy-Forecast`.
3. Select **Predictive Analysis**.

### Step 3: Configure Training
1. Select your imported dataset.
2. **Target Column**: Select `DAYTON_MW`.
3. Canvas will automatically detect this is a Time Series problem. If not, click **Configure**.
    - **Item ID**: If you had multiple regions in one file, you would select the region name here. Since we only have one, add a column with a constant value (for example, `DAYTON`) and select it if Canvas requires an item ID.
    - **Time Stamp**: Select `Datetime`.
    - **Forecast Horizon**: Set to `24` (to match our code).
4. Click **Preview model** to see data quality insights.
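
If Canvas asks for an item ID and the file has none, a constant column can be added before upload. The sketch below does this with the standard `csv` module on a two-row in-memory sample; the column name `item_id` and the value `DAYTON` are assumptions chosen for illustration.

```python
import csv
import io

def add_constant_column(csv_text, column_name, value):
    """Return CSV text with one extra column holding the same value in every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [column_name]
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column_name] = value
        writer.writerow(row)
    return output.getvalue()

sample_csv = "Datetime,DAYTON_MW\n2004-12-01 00:00:00,1500\n2004-12-01 01:00:00,1512\n"
with_item_id = add_constant_column(sample_csv, "item_id", "DAYTON")
print(with_item_id)
```

With the full file loaded in pandas, assigning `df['item_id'] = 'DAYTON'` before calling `to_csv` would have the same effect.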

### Step 4: Train
1. Click **Standard Build** (takes 1-2 hours but is more accurate) or **Quick Build** (about 15 minutes).
2. Wait for training to complete.

### Step 5: Analyze and Predict
1. Once trained, view the **Overview** tab to see accuracy metrics (wQL, RMSE).
2. Go to the **Predict** tab.
3. You can generate a **Single Prediction** (what-if analysis) or verify against the held-out test set.