# Milestone 2: Feature Engineering & Baseline Model Development

## Objectives
- 🔧 **Feature Engineering**: Create lag features, rolling windows, and time-based features.
- 📊 **EDA**: Analyze correlations and feature importance.
- 🤖 **Baseline Modeling**: Train initial models (Linear Regression, Random Forest, LSTM).
- 📉 **Evaluation**: Assess model performance using RMSE, MAE, R².

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

In [2]:
# Load the processed data from Milestone 1
try:
    df = pd.read_csv('../Dataset/cleaned_household_power_consumption.csv', index_col='DateTime', parse_dates=True)
    print("✅ Loaded processed data successfully!")
    print(f"Shape: {df.shape}")
except FileNotFoundError:
    print("❌ Processed data not found. Please run Milestone 1 notebook first.")
    raise  # stop here; later cells need df to be defined

✅ Loaded processed data successfully!
Shape: (34589, 14)
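The lag features in the next section use `shift(1)` and `shift(24)`, which count rows rather than hours, so they only mean "1 hour" and "24 hours" if the index has no gaps. The following is a minimal sketch of how that assumption could be checked. It uses a small toy frame (`toy`, with one hour deliberately removed) in place of the real `df`, and the `asfreq` remedy is just one possible option, not something the notebook does.

```python
import pandas as pd

# Toy hourly index with the 20:00 timestamp removed, standing in for the real df.
idx = pd.date_range("2006-12-16 17:00", periods=6, freq="h").delete(3)
toy = pd.DataFrame({"Global_active_power": [4.2, 3.6, 3.4, 3.1, 2.9]}, index=idx)

# Time difference between consecutive rows; a regular hourly series has only 1-hour steps.
steps = toy.index.to_series().diff().dropna()
n_irregular = int((steps != pd.Timedelta(hours=1)).sum())
print(f"Irregular steps: {n_irregular}")

# One possible remedy (an assumption, not part of the notebook): reindex onto a
# complete hourly grid so that shift(k) always means k hours. Missing hours become NaN.
regular = toy.asfreq("h")
print(regular)
```

If the count of irregular steps on the real data is zero, the row-based shifts can be read as hour-based lags without further work.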


## 1️⃣ Feature Engineering

In [3]:
# Create temporal features
# (these duplicate the lowercase hour/dayofweek/month columns already in the loaded data)
df['Hour'] = df.index.hour
df['DayOfWeek'] = df.index.dayofweek
df['Month'] = df.index.month

# Create lag features (previous hour and same hour on the previous day)
# The raw data is minutely, but the processed data loaded above is resampled to hourly,
# so a 1-hour lag is shift(1) and a 24-hour lag is shift(24)
df['lag_1h'] = df['Global_active_power'].shift(1)
df['lag_24h'] = df['Global_active_power'].shift(24)

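# Rolling statistics (24-hour moving mean and standard deviation)
# Note: each window includes the current hour, so these features contain the target value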
df['rolling_mean_24h'] = df['Global_active_power'].rolling(window=24).mean()
df['rolling_std_24h'] = df['Global_active_power'].rolling(window=24).std()

# Drop rows with NaN values created by shifting and rolling
df.dropna(inplace=True)
print("✅ Feature Engineering Complete")
df.head()

✅ Feature Engineering Complete


| DateTime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | hour | day | month | ... | dayofweek | is_weekend | season | Hour | DayOfWeek | Month | lag_1h | lag_24h | rolling_mean_24h | rolling_std_24h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006-12-17 17:00:00 | 3.406767 | 0.166633 | 234.306167 | 14.51 | 0.0 | 0.466667 | 16.816667 | 17 | 17 | 12 | ... | 6 | 1 | 0 | 17 | 6 | 12 | 3.322167 | 3.9605 | 2.441256 | 0.72739 |
| 2006-12-17 18:00:00 | 3.650733 | 0.135067 | 234.607833 | 15.55 | 0.0 | 0.0 | 16.833333 | 18 | 17 | 12 | ... | 6 | 1 | 0 | 18 | 6 | 12 | 3.406767 | 3.504733 | 2.447339 | 0.737215 |
| 2006-12-17 19:00:00 | 2.908333 | 0.263733 | 233.376 | 12.506667 | 0.0 | 0.516667 | 16.683333 | 19 | 17 | 12 | ... | 6 | 1 | 0 | 19 | 6 | 12 | 3.650733 | 3.400233 | 2.426843 | 0.716106 |
| 2006-12-17 20:00:00 | 3.3615 | 0.2715 | 236.4265 | 14.276667 | 0.0 | 1.116667 | 17.116667 | 20 | 17 | 12 | ... | 6 | 1 | 0 | 20 | 6 | 12 | 2.908333 | 3.268567 | 2.430715 | 0.72109 |
| 2006-12-17 21:00:00 | 3.040767 | 0.267967 | 239.104167 | 12.716667 | 0.0 | 1.2 | 17.5 | 21 | 17 | 12 | ... | 6 | 1 | 0 | 21 | 6 | 12 | 3.3615 | 3.056467 | 2.430061 | 0.720504 |
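Because `rolling(window=24)` ends at the current row, the rolling mean and standard deviation computed above include the very `Global_active_power` value the model later tries to predict. The sketch below shows one possible way to restrict the window to past hours only, by shifting the series one step before rolling. The toy frame and the column name `rolling_mean_24h_past` are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

# Toy hourly series 0, 1, 2, ... so window contents are easy to verify by hand.
idx = pd.date_range("2006-12-16 17:00", periods=30, freq="h")
toy = pd.DataFrame({"Global_active_power": np.arange(30, dtype=float)}, index=idx)
power = toy["Global_active_power"]

# As in the notebook: the window covers hours t-23 .. t, including the target hour t.
toy["rolling_mean_24h"] = power.rolling(window=24).mean()

# Hypothetical alternative: shift first, so the window covers hours t-24 .. t-1 only.
toy["rolling_mean_24h_past"] = power.shift(1).rolling(window=24).mean()

print(toy.iloc[22:26])
```

With the shifted variant, the first valid value appears one row later, and every window contains only information that would be available before the target hour.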


## 2️⃣ Baseline Modeling

In [4]:
# Define Features and Target
features = ['lag_1h', 'lag_24h', 'rolling_mean_24h', 'rolling_std_24h', 'Hour', 'DayOfWeek', 'Month']
target = 'Global_active_power'

X = df[features]
y = df[target]

# Train-Test Split (Time-based split, not random)
split_index = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

print(f"Train Shape: {X_train.shape}, Test Shape: {X_test.shape}")

Train Shape: (27652, 7), Test Shape: (6913, 7)
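Slicing with `iloc` gives a time-based split only if the rows are already sorted by timestamp. The following sketch shows how that could be verified on a toy frame (`X_toy` is a stand-in, not the notebook's `X`): it checks that the index is increasing and that the last training timestamp comes before the first test timestamp.

```python
import numpy as np
import pandas as pd

# Toy feature frame with an hourly index, standing in for the notebook's X.
idx = pd.date_range("2006-12-17 17:00", periods=10, freq="h")
X_toy = pd.DataFrame({"lag_1h": np.arange(10, dtype=float)}, index=idx)

# Same 80/20 positional split as in the cell above.
split_index = int(len(X_toy) * 0.8)
X_train_toy, X_test_toy = X_toy.iloc[:split_index], X_toy.iloc[split_index:]

# The positional split is chronological only if the index is sorted.
is_sorted = bool(X_toy.index.is_monotonic_increasing)
no_overlap = bool(X_train_toy.index.max() < X_test_toy.index.min())

print(f"Sorted: {is_sorted}, train ends before test starts: {no_overlap}")
```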


In [5]:
# Linear Regression Baseline
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluation
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R² Score: {r2:.4f}")

RMSE: 0.5066
MAE: 0.3604
R² Score: 0.5017
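Scores like these are easier to interpret next to a naive reference. One common choice is a persistence baseline, which predicts each hour as equal to the previous hour (the `lag_1h` feature). The sketch below runs on synthetic data, not the household dataset, and computes the three metrics directly with NumPy so their definitions are visible. The toy series and the `*_naive` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (daily cycle plus noise); NOT the household data.
rng = np.random.default_rng(0)
idx = pd.date_range("2006-12-17 17:00", periods=200, freq="h")
hours = idx.hour.to_numpy()
power = 1.0 + 0.5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.1, len(idx))

toy = pd.DataFrame({"Global_active_power": power}, index=idx)
toy["lag_1h"] = toy["Global_active_power"].shift(1)
toy = toy.dropna()

# Same chronological 80/20 split as the notebook; only the test part is scored.
split_index = int(len(toy) * 0.8)
test = toy.iloc[split_index:]

y_true = test["Global_active_power"].to_numpy()
y_naive = test["lag_1h"].to_numpy()  # persistence: prediction = previous hour

err = y_true - y_naive
rmse_naive = float(np.sqrt(np.mean(err ** 2)))
mae_naive = float(np.mean(np.abs(err)))
r2_naive = float(1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))

print(f"Persistence RMSE: {rmse_naive:.4f}")
print(f"Persistence MAE: {mae_naive:.4f}")
print(f"Persistence R² Score: {r2_naive:.4f}")
```

Applied to the real test set, the same comparison would show whether the linear model adds anything beyond repeating the last observed value.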
