# 1.1 Load and inspect the dataset

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/Users/yecao/Downloads/FuelEconomy.csv')

# Display basic information
print("Column Names:", df.columns.tolist())
print("Dataset Shape:", df.shape)
print("\nSummary Statistics:")
display(df.describe())

# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing Values:\n", missing_values)

Column Names: ['Horse Power', 'Fuel Economy (MPG)']
Dataset Shape: (100, 2)

Summary Statistics:


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 213.67619 | 23.178501 |
| std | 62.061726 | 4.701666 |
| min | 50.0 | 10.0 |
| 25% | 174.996514 | 20.439516 |
| 50% | 218.928402 | 23.143192 |
| 75% | 251.706476 | 26.089933 |
| max | 350.0 | 35.0 |



Missing Values:
 Horse Power           0
Fuel Economy (MPG)    0
dtype: int64


# 1.2 Train/Test split (70% / 30% random)

In [2]:
from sklearn.model_selection import train_test_split

X = df[['Fuel Economy (MPG)']]
y = df['Horse Power']

# 70% Training / 30% Testing split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

Training set size: 70
Testing set size: 30


# 1.3-1.4 Model training: Linear + Polynomial regression + Model evaluation (train and test)

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

results = []

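def evaluate_model(name, model, X_tr, X_ts):
    """Fit `model` on X_tr and return train/test MSE, MAE, and R².

    Relies on the notebook-level y_train and y_test from the split cell above.
    """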
    model.fit(X_tr, y_train)
    y_tr_pred = model.predict(X_tr)
    y_ts_pred = model.predict(X_ts)
    
    return {
        "Model": name,
        "Train MSE": mean_squared_error(y_train, y_tr_pred),
        "Train MAE": mean_absolute_error(y_train, y_tr_pred),
        "Train R²": r2_score(y_train, y_tr_pred),
        "Test MSE": mean_squared_error(y_test, y_ts_pred),
        "Test MAE": mean_absolute_error(y_test, y_ts_pred),
        "Test R²": r2_score(y_test, y_ts_pred)
    }

# (a) Linear Regression
results.append(evaluate_model("Linear Regression", LinearRegression(), X_train, X_test))

# (b, c, d) Polynomial Regression (Degrees 2, 3, 4)
for degree in [2, 3, 4]:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    results.append(evaluate_model(f"Poly (deg={degree})", LinearRegression(), X_train_poly, X_test_poly))

# Present results
results_table = pd.DataFrame(results)
display(results_table)

| | Model | Train MSE | Train MAE | Train R² | Test MSE | Test MAE | Test R² |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 357.69918 | 16.061689 | 0.90632 | 318.561087 | 14.940628 | 0.912561 |
| 1 | Poly (deg=2) | 350.879731 | 15.995824 | 0.908106 | 331.105434 | 15.14833 | 0.909118 |
| 2 | Poly (deg=3) | 345.108668 | 15.746762 | 0.909618 | 318.404012 | 14.764973 | 0.912604 |
| 3 | Poly (deg=4) | 339.700171 | 15.508465 | 0.911034 | 313.798757 | 14.735471 | 0.913868 |


# 1.5 Discussion and interpretation

## Which model performs best on the test set and why?
The Polynomial Regression (degree 4) model performs best on the test set. It achieved the highest Test $R^{2}$ (0.9139) and the lowest Test MSE (313.80). The margin over Linear Regression is small, however (Test $R^{2}$ 0.9126, Test MSE 318.56), so the result suggests at most a mild non-linear component in the relationship between Fuel Economy (MPG) and Horse Power, which the higher-degree polynomial captures slightly better than a straight line.

## Does increasing polynomial degree always improve performance?
No, increasing the degree does not always lead to better performance. In this experiment, the Polynomial (degree 2) model performed worse than the Linear Regression model on the test set: the Linear model had a Test MSE of 318.56, while the degree 2 model's rose to 331.11. Degrees 3 and 4 then recovered (Test MSE 318.40 and 313.80), so test error is not monotonic in degree. This suggests that the quadratic curve fitted on the training data matched the trend in the test data less well than a straight line.

## If a model performs unexpectedly poorly, propose at least two plausible reasons.
For the degree 2 model, which performed worse than the linear model, two plausible reasons are:
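### Model Mismatch:
The quadratic term fits the training data slightly better than the linear model (Train MSE 350.88 vs. 357.70), so the degree 2 model is not underfitting in the usual sense. Instead, the curvature it learned from the 70 training rows does not match the trend in the 30 test rows, so it predicts them less accurately than a straight line.
### Outliers or Noise:
If the dataset contains outliers, the extra curvature of a degree 2 model can be pulled toward them more than a straight line would be. With only 30 test rows, a few such points can also shift Test MSE by several units, so part of the gap (331.11 vs. 318.56) may reflect this particular split rather than a real difference between the models.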
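
Because the test-set gaps between the four models are small (Test MSE 313.80 to 331.11 on 30 rows), a single random split may not rank them reliably. One way to check is repeated k-fold cross-validation. The sketch below is not part of the original experiment: it runs on synthetic stand-in data (`mpg_demo` and `hp_demo` are hypothetical, generated from an assumed linear trend plus noise), so its numbers say nothing about FuelEconomy.csv. To apply it, replace the synthetic arrays with `X` and `y` from section 1.2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for FuelEconomy.csv (assumption: NOT the real data).
rng = np.random.default_rng(0)
mpg_demo = rng.uniform(10, 35, size=100)
hp_demo = 500 - 12 * mpg_demo + rng.normal(0, 18, size=100)
X_demo = mpg_demo.reshape(-1, 1)

# 5 folds repeated 10 times gives 50 held-out MSE estimates per model.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

cv_summary = {}
for degree in [1, 2, 3, 4]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(
        model, X_demo, hp_demo, cv=cv, scoring="neg_mean_squared_error"
    )
    mse = -scores  # scikit-learn reports negated MSE so that higher is better
    cv_summary[degree] = (mse.mean(), mse.std())
    print(f"degree={degree}: mean MSE={mse.mean():.2f}, std={mse.std():.2f}")
```

If the spread (std) across folds is larger than the difference between the mean MSE values, the degrees cannot be meaningfully ranked on that data.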

# 2.1 Load and inspect the dataset

In [4]:
import pandas as pd
import numpy as np

# Load the dataset
df_elec = pd.read_csv('/Users/yecao/Downloads/electricity_consumption_based_weather_dataset.csv')

# Print basic information
print("Column Names:", df_elec.columns.tolist())
print("Dataset Shape:", df_elec.shape)
print("\nSummary Statistics:")
display(df_elec.describe())

# Identify dependent variable
print("\nDependent Variable: daily_consumption")

# Identify and handle missing values
missing_counts = df_elec.isnull().sum()
print("\nMissing Values per Column:\n", missing_counts)

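# Handle missing values: fill the 15 missing AWND entries with the column mean
# (the mean is computed over all rows, before the train/test split)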
df_elec['AWND'] = df_elec['AWND'].fillna(df_elec['AWND'].mean())
print("\nMissing values after handling:", df_elec.isnull().sum().sum())

Column Names: ['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption']
Dataset Shape: (1433, 6)

Summary Statistics:


| | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|
| count | 1418.0 | 1433.0 | 1433.0 | 1433.0 | 1433.0 |
| mean | 2.642313 | 3.800488 | 17.187509 | 9.141242 | 1561.078061 |
| std | 1.140021 | 10.973436 | 10.136415 | 9.028417 | 606.819667 |
| min | 0.0 | 0.0 | -8.9 | -14.4 | 14.218 |
| 25% | 1.8 | 0.0 | 8.9 | 2.2 | 1165.7 |
| 50% | 2.4 | 0.0 | 17.8 | 9.4 | 1542.65 |
| 75% | 3.3 | 1.3 | 26.1 | 17.2 | 1893.608 |
| max | 10.2 | 192.3 | 39.4 | 27.2 | 4773.386 |



Dependent Variable: daily_consumption

Missing Values per Column:
 date                  0
AWND                 15
PRCP                  0
TMAX                  0
TMIN                  0
daily_consumption     0
dtype: int64

Missing values after handling: 0
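
The mean fill above uses all 1,433 rows, so rows that later land in the test set contribute to the imputed value. With only 15 missing entries the effect is probably negligible, but a stricter alternative is to learn the fill value from the training rows only by placing the imputer inside a pipeline. The following is a minimal sketch under stated assumptions, not the notebook's method: the `demo` frame is synthetic, has only two of the four weather columns, and its coefficients are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the weather data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AWND": rng.gamma(shape=5.0, scale=0.5, size=200),
    "TMAX": rng.uniform(-5, 35, size=200),
})
demo["daily_consumption"] = (
    1500 + 3 * (demo["TMAX"] - 15) ** 2 + rng.normal(0, 200, size=200)
)
# Inject 10 missing AWND readings to mimic the gaps in the real column.
demo.loc[demo.sample(n=10, random_state=0).index, "AWND"] = np.nan

X_demo = demo[["AWND", "TMAX"]]
y_demo = demo["daily_consumption"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# The imputer learns the AWND mean from the training rows only and reuses
# that value for missing entries in any data passed to predict().
leak_free_model = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("regressor", LinearRegression()),
])
leak_free_model.fit(X_tr, y_tr)

train_awnd_mean = X_tr["AWND"].mean()
learned_awnd_mean = leak_free_model.named_steps["imputer"].statistics_[0]
print(f"AWND mean on training rows: {train_awnd_mean:.4f}")
print(f"Mean stored by the imputer: {learned_awnd_mean:.4f}")
print(f"Test R²: {leak_free_model.score(X_te, y_te):.3f}")
```

The two printed means match because the imputer never sees the test rows during fitting.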


# 2.2 Train/Test split (70% / 30% random)

In [5]:
from sklearn.model_selection import train_test_split

X_elec = df_elec[['AWND', 'PRCP', 'TMAX', 'TMIN']]
y_elec = df_elec['daily_consumption']

# 70% Training / 30% Testing split
X_train, X_test, y_train, y_test = train_test_split(X_elec, y_elec, test_size=0.30, random_state=42)

print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")

Training features shape: (1003, 4)
Testing features shape: (430, 4)


# 2.3-2.4 Model training: Linear + Polynomial regression + Model evaluation (train and test)

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

results_elec = []

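def run_experiment(name, model_obj):
    """Fit model_obj and return train/test MSE, MAE, and R².

    Uses the notebook-level X_train, X_test, y_train, y_test from the split cell above.
    """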
    # Fit the model
    model_obj.fit(X_train, y_train)
    
    # Predictions
    y_tr_pred = model_obj.predict(X_train)
    y_ts_pred = model_obj.predict(X_test)
    
    # Metrics
    return {
        "Model": name,
        "Train MSE": mean_squared_error(y_train, y_tr_pred),
        "Train MAE": mean_absolute_error(y_train, y_tr_pred),
        "Train R²": r2_score(y_train, y_tr_pred),
        "Test MSE": mean_squared_error(y_test, y_ts_pred),
        "Test MAE": mean_absolute_error(y_test, y_ts_pred),
        "Test R²": r2_score(y_test, y_ts_pred)
    }

# (a) Linear Regression
results_elec.append(run_experiment("Linear Regression", LinearRegression()))

# (b, c, d) Polynomial Regression (Degrees 2, 3, 4)
for d in [2, 3, 4]:
    poly_pipeline = Pipeline([
        ("poly_features", PolynomialFeatures(degree=d, include_bias=False)),
        ("regressor", LinearRegression())
    ])
    results_elec.append(run_experiment(f"Poly (deg={d})", poly_pipeline))

# Present results
results_df = pd.DataFrame(results_elec)
display(results_df)

| | Model | Train MSE | Train MAE | Train R² | Test MSE | Test MAE | Test R² |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 274826.312092 | 387.047361 | 0.272945 | 237216.888723 | 365.583197 | 0.311496 |
| 1 | Poly (deg=2) | 268041.444572 | 382.087143 | 0.290894 | 234831.853379 | 362.904541 | 0.318418 |
| 2 | Poly (deg=3) | 261191.080538 | 377.738782 | 0.309017 | 238445.641493 | 369.093689 | 0.307929 |
| 3 | Poly (deg=4) | 253602.686954 | 374.729584 | 0.329092 | 408466.181527 | 415.287189 | -0.185542 |


# 2.5 Discussion and interpretation

## Which model generalizes best, and what does that tell you about the relationship?
The Polynomial Regression (degree 2) model generalizes best on this dataset. It achieved the highest Test R² (0.318) and the lowest Test MSE (234,831.85). This suggests that the relationship between the weather features and daily electricity consumption has a mild nonlinear component: a quadratic model captures slightly more of the underlying pattern than a purely linear one, so consumption may respond non-proportionally to changes in temperature or other weather variables.

## Do polynomial models improve the fit compared to linear regression? If yes, why might electricity consumption have nonlinear dependence on weather?
Yes, but only marginally and only for degree 2. The degree 2 polynomial improved Test R² from 0.311 (Linear) to 0.318 and reduced Test MSE from 237,216.89 to 234,831.85. Degrees 3 and 4 did not improve generalization: both performed worse than linear regression on the test set (Test R² 0.308 and -0.186, respectively). Electricity consumption often has nonlinear dependence on weather because energy demand typically increases at both temperature extremes: high cooling loads during hot weather and high heating loads during cold weather. This creates a U-shaped (quadratic) relationship between temperature and consumption, which a degree 2 polynomial can partially capture.
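
To illustrate why a U-shaped dependence favors a quadratic term, here is a small sketch on synthetic data. Everything in it is an assumption made for illustration: the minimum-demand temperature of 15, the curvature, and the noise level are invented, and the errors are in-sample rather than test errors.

```python
import numpy as np

# Synthetic data (assumption): consumption is lowest near 15 degrees and rises
# on both sides, mimicking heating and cooling loads.
rng = np.random.default_rng(0)
temp = np.linspace(-10, 40, 200)
consumption = 1000 + 2.0 * (temp - 15) ** 2 + rng.normal(0, 100, size=temp.size)


def fit_mse(degree):
    """Least-squares polynomial fit; returns (coefficients, in-sample MSE)."""
    coefs = np.polyfit(temp, consumption, degree)
    residuals = consumption - np.polyval(coefs, temp)
    return coefs, float(np.mean(residuals ** 2))


linear_coefs, linear_mse = fit_mse(1)
quad_coefs, quad_mse = fit_mse(2)

# For a*t^2 + b*t + c, the minimum sits at t = -b / (2a).
a, b, _ = quad_coefs
vertex_temp = -b / (2 * a)

print(f"Linear MSE:    {linear_mse:.1f}")
print(f"Quadratic MSE: {quad_mse:.1f}")
print(f"Estimated temperature of minimum demand: {vertex_temp:.1f}")
```

A straight line cannot rise on both sides of the minimum, so its error stays large, while the quadratic recovers a positive curvature and a vertex close to the assumed 15 degrees.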

## If higher-degree models perform worse on the test set, explain this behavior using evidence from metrics.
As polynomial degree increases, Train MSE decreases steadily (from 274,826 for linear to 253,603 for degree 4) and Train R² rises from 0.273 to 0.329, showing that the model fits the training data better. Test MSE, however, rises beyond degree 2: from 234,832 to 238,446 at degree 3 and to 408,466 at degree 4, where Test R² becomes negative (-0.186). A negative R² means the model predicts the test set worse than simply predicting the mean, which is clear evidence that the degree 4 model, with 69 polynomial terms built from four features, has fit training noise rather than generalizable patterns.
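
The link between Test MSE and Test R² can be checked directly from the results table. Since R² = 1 - MSE / Var(y_test), each row implies a test-set variance of MSE / (1 - R²), and R² turns negative exactly when the MSE exceeds that variance. The sketch below uses only the values reported in the table; the variable names are illustrative.

```python
# Test MSE and Test R² copied from the results table in section 2.3-2.4.
test_metrics = {
    "Linear Regression": (237216.888723, 0.311496),
    "Poly (deg=2)": (234831.853379, 0.318418),
    "Poly (deg=3)": (238445.641493, 0.307929),
    "Poly (deg=4)": (408466.181527, -0.185542),
}

# R² = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R²)
implied_variance = {
    name: mse / (1.0 - r2) for name, (mse, r2) in test_metrics.items()
}

for name, (mse, r2) in test_metrics.items():
    var = implied_variance[name]
    verdict = "worse than the mean" if mse > var else "better than the mean"
    print(f"{name:18s} MSE={mse:12.1f}  implied Var(y_test)={var:12.1f}  {verdict}")
```

All four rows imply the same test-set variance (up to rounding), which confirms that the metrics are mutually consistent and that only the degree 4 model has an error larger than the variance of the target.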

## If none of the models achieve good test performance, provide at least two reasons supported by your outputs.
Even the best model (Poly deg=2) only achieves Test R² = 0.318, meaning it explains only ~32% of the variance in daily electricity consumption. Two plausible reasons for this limited performance:
### 1. Limited Feature Set / Unmodeled Drivers
The dataset includes only four weather features (AWND, PRCP, TMAX, TMIN). Electricity consumption is influenced by many factors not captured here, such as building occupancy and usage patterns, holiday effects, and industrial/commercial activity schedules. The low R² values across all models, on both the training set (at most 0.329) and the test set (at most 0.318), suggest that weather alone is insufficient to predict consumption accurately.
### 2. High Noise and Behavioral Variability
The target variable (daily_consumption) has high variability: its standard deviation (606.82) is large relative to its mean (1561.08), indicating substantial day-to-day fluctuation. Much of this variability likely stems from human behavioral factors that create noise uncorrelated with weather. The large residual errors (Test MAE ≈ 363 to 415) across all models support this interpretation: even with accurate weather data, predictions deviate from actual consumption by hundreds of units on average.
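
The scale of these errors can be put in relative terms with a few lines of arithmetic on the reported values. This is an illustrative calculation, not part of the original analysis. Note that the mean and standard deviation come from the full dataset summary, while the MAE and MSE are test-set metrics of the degree 2 model, so the comparison is approximate.

```python
import math

# Values copied from the summary statistics (section 2.1) and the
# degree 2 row of the results table (section 2.3-2.4).
target_mean = 1561.078061
target_std = 606.819667
best_test_mae = 362.904541
best_test_mse = 234831.853379

coefficient_of_variation = target_std / target_mean  # about 0.39
mae_share_of_mean = best_test_mae / target_mean      # about 0.23
best_test_rmse = math.sqrt(best_test_mse)

print(f"Coefficient of variation: {coefficient_of_variation:.3f}")
print(f"Best Test MAE as a share of mean consumption: {mae_share_of_mean:.3f}")
print(f"Best Test RMSE: {best_test_rmse:.1f} (target std: {target_std:.1f})")
```

The best model's typical error is still roughly a quarter of mean daily consumption, and its RMSE is only modestly below the spread of the target itself, which matches the low R² values.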