# Homework 3

Edit this `.ipynb` file by replacing the placeholders ("# Your code", "# Your answer", etc.), click "Restart & Run All" in Jupyter Notebook to generate your results, and download the notebook as an `.html` file. Submit the `.ipynb` and `.html` files on Moodle as two separate files, not as a `.zip` file. If you have questions about the homework, please email the TA, Saumil Shah (sashah8@ncsu.edu), or attend our office hours.

In this homework, we will implement [Tao's vanilla model](https://doi.org/10.1109/PES.2011.6038881) using `sklearn`, for which you need to explicitly create the $284$ predictor variables. Your results should be the same as in the lecture notes (where `statsmodels` is used).

First, we create a function to calculate the MAPE, and import the data:

In [1]:
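import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def mape(a, f):
    """Mean absolute percentage error of forecasts f against actuals a, returned as a fraction."""
    return np.mean(np.abs((a - f) / a))


# Load the hourly data, then add the trend and day-of-week predictors.
df = pd.read_csv('/content/bse_clean.csv', parse_dates=['Date'])
df['Trend'] = df.index  # linear trend: 0, 1, 2, ...
df['Day'] = df['Date'].dt.dayofweek  # Monday=0, ..., Sunday=6
df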

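| | Date | Hour | Load | T | Month | Trend | Day |
|---|---|---|---|---|---|---|---|
| 0 | 2003-03-01 | 1 | 12863.0 | 23 | 3 | 0 | 5 |
| 1 | 2003-03-01 | 2 | 12389.0 | 22 | 3 | 1 | 5 |
| 2 | 2003-03-01 | 3 | 12155.0 | 21 | 3 | 2 | 5 |
| 3 | 2003-03-01 | 4 | 12072.0 | 21 | 3 | 3 | 5 |
| 4 | 2003-03-01 | 5 | 12160.0 | 22 | 3 | 4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 51187 | 2008-12-31 | 20 | 18297.0 | 15 | 12 | 51187 | 2 |
| 51188 | 2008-12-31 | 21 | 17571.0 | 13 | 12 | 51188 | 2 |
| 51189 | 2008-12-31 | 22 | 16813.0 | 10 | 12 | 51189 | 2 |
| 51190 | 2008-12-31 | 23 | 15996.0 | 9 | 12 | 51190 | 2 |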
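
As a quick sanity check of the `mape` function defined above, the following minimal sketch evaluates it on made-up numbers (the values are illustrative assumptions, not taken from the data set). Note that the function returns a fraction rather than a percentage, so a value of about 0.05 corresponds to 5%.

```python
import numpy as np


def mape(a, f):
    # Same definition as in the notebook: mean of |actual - forecast| / |actual|.
    return np.mean(np.abs((a - f) / a))


# Hypothetical actual loads and forecasts; each forecast is off by 5% of the actual.
actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([105.0, 190.0, 420.0])

result = mape(actual, forecast)
print(result)  # a fraction close to 0.05, i.e. about 5%
```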


Your task is to create the `X` matrix, with $51192$ rows and $284$ columns. The $284$ columns (excluding the `Intercept` column, which you don't need to create) should exactly match the coefficients listed in the `results.summary()` output in the lecture notes (but the order and the column names can vary).
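
One way to see where the number $284$ comes from is to count the columns term by term. The sketch below assumes treatment (reference-level) coding, in which each categorical factor loses one level, and assumes the model contains a trend, a cubic polynomial in `T`, `Hour`, `Month`, and `Day` dummies, and the `Day:Hour`, `Month:T`, and `Hour:T` interactions. This layout is an assumption about how the lecture notes arrange the coefficients, so check it against the `results.summary()` output.

```python
# Number of levels of each categorical predictor in the data set.
n_hour, n_month, n_day = 24, 12, 7
n_poly = 3  # T, T^2, T^3

# Column counts per term, assuming one dropped (reference) level per factor.
counts = {
    "Trend": 1,
    "T polynomial": n_poly,
    "Hour": n_hour - 1,
    "Month": n_month - 1,
    "Day": n_day - 1,
    "Day:Hour": (n_day - 1) * (n_hour - 1),
    "Month:T polynomial": (n_month - 1) * n_poly,
    "Hour:T polynomial": (n_hour - 1) * n_poly,
}

total = sum(counts.values())
print(counts)
print(total)
```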

In [2]:
# Your code

In [3]:
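import warnings

# Suppress the pandas PerformanceWarning about DataFrame fragmentation,
# which is triggered by adding many columns one at a time in the loops below.
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)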

In [4]:
from sklearn.preprocessing import OneHotEncoder

# drop='first' omits the first level of each factor, which serves as the reference category.
encoder = OneHotEncoder(drop='first', sparse_output=False)

# 23 hour dummies
encoded_hour = encoder.fit_transform(df[['Hour']])
hour_df = pd.DataFrame(encoded_hour, columns=encoder.get_feature_names_out(['Hour']))

# 11 month dummies
encoded_month = encoder.fit_transform(df[['Month']])
month_df = pd.DataFrame(encoded_month, columns=encoder.get_feature_names_out(['Month']))

# 6 day-of-week dummies
encoded_day = encoder.fit_transform(df[['Day']])
day_df = pd.DataFrame(encoded_day, columns=encoder.get_feature_names_out(['Day']))

# Cubic polynomial in temperature
T_group = pd.DataFrame({'T': df['T'], 'T2': df['T'] ** 2, 'T3': df['T'] ** 3})
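
To illustrate what dropping the first level does, here is a small sketch on a hypothetical three-level column. It uses `pd.get_dummies(..., drop_first=True)`, a pandas function that likewise omits the first level as the reference category; it is shown as an alternative for illustration, not as the encoder used in this notebook, and the toy values are assumptions.

```python
import pandas as pd

# Hypothetical column with three levels: 1, 2, 3.
toy = pd.Series([1, 2, 3, 2, 1], name="Hour")

# Level 1 becomes the reference category, so only Hour_2 and Hour_3 remain.
dummies = pd.get_dummies(toy, prefix="Hour", drop_first=True, dtype=float)
print(dummies)
```

Rows where `Hour` equals the reference level 1 are all zeros, so their effect is absorbed by the intercept.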

In [5]:
# Month x temperature polynomial: 11 * 3 = 33 columns
month_T = pd.DataFrame()
for col_month in month_df.columns:
    for col_T in T_group.columns:
        month_T[f'{col_month}:{col_T}'] = month_df[col_month] * T_group[col_T]

# Hour x temperature polynomial: 23 * 3 = 69 columns
hour_T = pd.DataFrame()
for col_hour in hour_df.columns:
    for col_T in T_group.columns:
        hour_T[f'{col_hour}:{col_T}'] = hour_df[col_hour] * T_group[col_T]

# 6 day dummies x all 24 hours = 144 columns; together with hour_df,
# these span the Day, Hour, and Day:Hour terms of the model.
day_hour = pd.DataFrame()
for col_day in day_df.columns:
    for hour_value in range(1, 25):
        day_hour[f'{col_day}:Hour_{hour_value}'] = day_df[col_day] * (df['Hour'] == hour_value).astype(int)
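
The nested loops above add one column at a time, which is what triggers the pandas `PerformanceWarning`. As a possible alternative, the sketch below builds all pairwise products in one step with NumPy broadcasting. The helper name `interact` and the toy frames are hypothetical, introduced only for illustration; the column order matches the loops (outer factor first, inner factor second).

```python
import numpy as np
import pandas as pd


def interact(left, right):
    """Return all pairwise column products of two DataFrames with the same rows."""
    # (n, n_left, 1) * (n, 1, n_right) -> (n, n_left, n_right)
    prod = left.to_numpy()[:, :, None] * right.to_numpy()[:, None, :]
    names = [f"{l}:{r}" for l in left.columns for r in right.columns]
    # Flattening the last two axes keeps the same order as `names`.
    return pd.DataFrame(prod.reshape(len(left), -1), columns=names, index=left.index)


# Toy stand-ins for month_df and T_group (made-up values).
toy_month = pd.DataFrame({"Month_2": [0.0, 1.0, 0.0], "Month_3": [0.0, 0.0, 1.0]})
toy_T = pd.DataFrame({"T": [10, 20, 30], "T2": [100, 400, 900]})

toy_month_T = interact(toy_month, toy_T)
print(toy_month_T)
```

In the notebook, the same helper could in principle be called as `interact(month_df, T_group)` or `interact(hour_df, T_group)`.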

In [6]:
trend_df = df[['Trend']]
X = pd.concat([T_group, hour_df, month_df, hour_T, month_T, day_hour, trend_df], axis=1)

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51192 entries, 0 to 51191
Columns: 284 entries, T to Trend
dtypes: float64(280), int64(4)
memory usage: 110.9 MB
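
The `X` built here crosses the 6 day dummies with indicators for all 24 hours (144 columns), whereas a treatment-coded day-by-hour term would consist of 6 day main effects plus 6 × 23 interactions (also 144 columns). Both column sets span the same space, so the fitted values agree even though individual coefficients differ. The following minimal sketch checks this on synthetic data, using `numpy.linalg.lstsq` in place of `LinearRegression` and smaller, made-up factor sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_day, n_hour, n = 3, 4, 200           # toy sizes, not the real 7 and 24
day = rng.integers(0, n_day, size=n)
hour = rng.integers(0, n_hour, size=n)
y = rng.normal(size=n)                 # synthetic response

D = np.eye(n_day)[day][:, 1:]          # day dummies, first level dropped
H_full = np.eye(n_hour)[hour]          # indicators for every hour
H = H_full[:, 1:]                      # hour dummies, first level dropped
ones = np.ones((n, 1))                 # intercept column


def cross(a, b):
    # All pairwise column products of a and b.
    return (a[:, :, None] * b[:, None, :]).reshape(len(a), -1)


# Parameterization A: hour dummies + day dummies x all hours (as in this notebook).
XA = np.hstack([ones, H, cross(D, H_full)])
# Parameterization B: hour and day main effects + interactions with first levels dropped.
XB = np.hstack([ones, H, D, cross(D, H)])

fit_A = XA @ np.linalg.lstsq(XA, y, rcond=None)[0]
fit_B = XB @ np.linalg.lstsq(XB, y, rcond=None)[0]

same_shape = XA.shape == XB.shape
max_diff = np.max(np.abs(fit_A - fit_B))
print(same_shape, max_diff)
```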


Once `X` is created, run the following cell, which should reproduce the results in the lecture notes (in particular, the intercept and the MAPE).

In [8]:
# Train on 2004-2006 and test on 2007-2008.
train = (df['Date'].dt.year >= 2004) & (df['Date'].dt.year <= 2006)
test = (df['Date'].dt.year >= 2007) & (df['Date'].dt.year <= 2008)
X_train = X[train]
y_train = df['Load'][train]
X_test = X[test]
y_test = df['Load'][test]

# Fit ordinary least squares and report the intercept and the out-of-sample MAPE.
reg = LinearRegression().fit(X_train, y_train)
reg.intercept_, mape(y_test, reg.predict(X_test))

(15189.999119845836, 0.030606989429779293)