Import block with all imports used in this notebook

In [None]:
import pandas as pd
import numpy as np
import kagglehub
import os

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import matplotlib.pyplot as plt


Dataset download: the dataset is fetched with kagglehub

In [None]:
DATA_DIR = kagglehub.dataset_download(
    "ohiedulhaquemdasad/fuel-consumption-based-on-hp-linear-regression"
)

print("Downloaded to:", DATA_DIR)
print("Files:", os.listdir(DATA_DIR))
csv_path = os.path.join(DATA_DIR, "FuelEconomy.csv")
df = pd.read_csv(csv_path)


Display of all columns and summary statistics found in the data. Neither column has missing values, so no missing-value handling is needed.

In [None]:
print("Columns:", df.columns.tolist())

print("Summary Statistics")
display(df.describe(include="all"))

print("Shape:", df.shape)

print("Missing Values:")
display(df.isna().sum())

All functions called in the analysis are listed below with comments

In [None]:
# Returns a train/test split of the data. A fixed random_state (default 1) makes the split the same on every run.
# X is the input data and y is the dependent output; test_size defaults to 0.30.
def split_data(X, y, test_size=0.30, random_state=1):
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Plots the predicted values against the actual values to view accuracy.
# Inputs are y_test from the original split_data call, y_pred from the regression, a title for the graph,
# and max_points, which caps how many points are drawn (a random subset is plotted if there are more).
def plot_actual_vs_predicted_test(y_test, y_pred, title, max_points=300):
    y_test = np.array(y_test)
    y_pred = np.array(y_pred)

    n = len(y_test)
    if n > max_points:
        rng = np.random.default_rng(0)
        sel = rng.choice(n, size=max_points, replace=False)
        y_test = y_test[sel]
        y_pred = y_pred[sel]

    x = np.arange(len(y_test))

    plt.figure(figsize=(12, 4))
    plt.scatter(x, y_test, marker="o", alpha=0.8, label="Actual (Test)")
    plt.scatter(x, y_pred, marker="x", alpha=0.8, label="Predicted (Test)")
    plt.title(title)
    plt.xlabel("Test sample index (subset)")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.legend()
    plt.show()

# Computes standard metrics for assessing the accuracy of a model.
# Inputs are y_true (the observed values) and y_pred (the values predicted by the regression).
def compute_metrics(y_true, y_pred):
    """Return MSE, MAE, R^2."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R^2": r2_score(y_true, y_pred),
    }

Data is separated into the independent variable (fuel economy) and the dependent variable (horsepower)

In [None]:
x_fuel_economy = df[['Fuel Economy (MPG)']]
y_fuel_economy = df['Horse Power']

x_train, x_test, y_train, y_test = split_data(x_fuel_economy, y_fuel_economy)

Linear regression model is fit to data

In [None]:
model = LinearRegression()
model.fit(x_train, y_train)


yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

linear_stats_train = compute_metrics(y_train, yhat_train)
linear_stats_test = compute_metrics(y_test, yhat_test)

Polynomial regression of degree 2 is fit to data

In [None]:
model = Pipeline([
                ("poly", PolynomialFeatures(degree=2, include_bias=False)),
                ("lr", LinearRegression())
            ])

model.fit(x_train, y_train)

yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

degree_2_stats_train = compute_metrics(y_train, yhat_train)
degree_2_stats_test = compute_metrics(y_test, yhat_test)


Polynomial regression of degree 3 is fit to data

In [None]:
model = Pipeline([
                ("poly", PolynomialFeatures(degree=3, include_bias=False)),
                ("lr", LinearRegression())
            ])

model.fit(x_train, y_train)

yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

degree_3_stats_train = compute_metrics(y_train, yhat_train)
degree_3_stats_test = compute_metrics(y_test, yhat_test)


Polynomial regression of degree 4 is fit to data

In [None]:
model = Pipeline([
                ("poly", PolynomialFeatures(degree=4, include_bias=False)),
                ("lr", LinearRegression())
            ])

model.fit(x_train, y_train)

yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

degree_4_stats_train = compute_metrics(y_train, yhat_train)
degree_4_stats_test = compute_metrics(y_test, yhat_test)


Plots and stats are displayed below

In [None]:
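# yhat_test holds only the last (degree 4) model's predictions at this point,
# so each model is refit here and plotted with its own predictions.
# Degree 1 with include_bias=False is equivalent to plain linear regression.
for degree, title in [(1, "Linear"), (2, "Polynomial Degree 2"),
                      (3, "Polynomial Degree 3"), (4, "Polynomial Degree 4")]:
    plot_model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lr", LinearRegression()),
    ])
    plot_model.fit(x_train, y_train)
    plot_actual_vs_predicted_test(y_test, plot_model.predict(x_test), title)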

results = [
    {
        "Model": "Linear Regression",
        "Train MSE": linear_stats_train["MSE"],
        "Train MAE": linear_stats_train["MAE"],
        "Train R2":  linear_stats_train["R^2"],
        "Test MSE":  linear_stats_test["MSE"],
        "Test MAE":  linear_stats_test["MAE"],
        "Test R2":   linear_stats_test["R^2"],
    },
    {
        "Model": "Poly (deg=2)",
        "Train MSE": degree_2_stats_train["MSE"],
        "Train MAE": degree_2_stats_train["MAE"],
        "Train R2":  degree_2_stats_train["R^2"],
        "Test MSE":  degree_2_stats_test["MSE"],
        "Test MAE":  degree_2_stats_test["MAE"],
        "Test R2":   degree_2_stats_test["R^2"],
    },
    {
        "Model": "Poly (deg=3)",
        "Train MSE": degree_3_stats_train["MSE"],
        "Train MAE": degree_3_stats_train["MAE"],
        "Train R2":  degree_3_stats_train["R^2"],
        "Test MSE":  degree_3_stats_test["MSE"],
        "Test MAE":  degree_3_stats_test["MAE"],
        "Test R2":   degree_3_stats_test["R^2"],
    },
    {
        "Model": "Poly (deg=4)",
        "Train MSE": degree_4_stats_train["MSE"],
        "Train MAE": degree_4_stats_train["MAE"],
        "Train R2":  degree_4_stats_train["R^2"],
        "Test MSE":  degree_4_stats_test["MSE"],
        "Test MAE":  degree_4_stats_test["MAE"],
        "Test R2":   degree_4_stats_test["R^2"],
    },
]

results_df = pd.DataFrame(results)
display(results_df)

Discussion of results:

The linear regression model performs best on this dataset's test set. In the summary table its test R2 is higher than that of every other model, and its test MSE and MAE are the smallest of the set. These three comparisons lead to the conclusion that linear regression is the best model for this test set.
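
As a reference for reading the table, the three metrics compared above can be worked out by hand. The sketch below uses a tiny made-up set of values (not taken from the dataset) and plain Python, and should agree with what `compute_metrics` returns for the same inputs.

```python
# Illustrative values only; these numbers are assumptions, not dataset results.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean squared error: average squared residual (large misses are penalized more).
mse = sum(e ** 2 for e in errors) / n

# Mean absolute error: average residual size, in the same units as y.
mae = sum(abs(e) for e in errors) / n

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean of y_true).
mean_y = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse, "MAE:", mae, "R^2:", r2)
```

A lower MSE or MAE and a higher R2 both indicate a closer fit, which is why the three are read together when ranking the models.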

Increasing the polynomial degree does not always improve performance, as observed here. On the current test results the linear regression model does better than the polynomial models. Polynomial regression can overcomplicate the fit when a simpler approach would suffice, as appears to be the case here.

The models are not performing unexpectedly poorly, but they may be showing small signs of overfitting. At higher degrees they may be fitting some of the noise in the dataset, especially given its relatively small sample size. Overall, all models perform well, with R2 values over 0.85 and low errors.
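
Because a single 70/30 split on a small sample can favor one degree by chance, one possible follow-up is to compare the degrees with k-fold cross-validation. The sketch below is not part of the analysis above: it uses synthetic stand-in data (a noisy linear trend, an assumption made only so the example runs on its own), and the names `X_demo`, `y_demo`, and `cv_r2` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): one feature with a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(10, 35, size=(100, 1))
y_demo = 300 - 6 * X_demo[:, 0] + rng.normal(0, 10, size=100)

# Five shuffled folds; a fixed random_state keeps the folds reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=1)

cv_r2 = {}
for degree in (1, 2, 3, 4):
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lr", LinearRegression()),
    ])
    # Each fold is held out once; the score is R^2 on the held-out fold.
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
    cv_r2[degree] = (scores.mean(), scores.std())
    print(f"degree={degree}: mean R^2={scores.mean():.3f} (std {scores.std():.3f})")
```

In the notebook, `x_fuel_economy` and `y_fuel_economy` could be passed in place of the synthetic arrays. If the mean scores of the degrees differ by less than their spread across folds, the ranking seen on one split should be treated as tentative.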

Data is imported and read

In [None]:
DATA_DIR = kagglehub.dataset_download("sudhirsingh27/electricity-consumption-based-on-weather-data")

print("Downloaded to:", DATA_DIR)
print("Files:", os.listdir(DATA_DIR))
csv_path = os.path.join(DATA_DIR, "electricity_consumption_based_weather_dataset.csv")
df = pd.read_csv(csv_path)


Initial summary statistics are shown and missing values are noted

In [None]:
print("Columns:", df.columns.tolist())

print("Summary Statistics")
display(df.describe(include="all"))

print("Shape:", df.shape)

print("Missing Values:")
display(df.isna().sum())

Data is updated to exclude rows with null data

In [None]:
df = df.dropna()
print("Missing values after cleaning:")
display(df.isna().sum())

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

The dependent variable is separated from the 4 independent variables

In [None]:
x_weather_data = df[['AWND', 'PRCP', 'TMAX', 'TMIN']]
y_daily_consumption = df['daily_consumption']

x_train, x_test, y_train, y_test = split_data(x_weather_data, y_daily_consumption)


Linear regression model is fit to data

In [None]:
model = LinearRegression()
model.fit(x_train, y_train)


yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

linear_stats_train = compute_metrics(y_train, yhat_train)
linear_stats_test = compute_metrics(y_test, yhat_test)

Polynomial regression model of degree 2 is fit to data

In [None]:
model = Pipeline([
                ("poly", PolynomialFeatures(degree=2, include_bias=False)),
                ("lr", LinearRegression())
            ])

model.fit(x_train, y_train)

yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

degree_2_stats_train = compute_metrics(y_train, yhat_train)
degree_2_stats_test = compute_metrics(y_test, yhat_test)


Polynomial regression model of degree 3 is fit to data

In [None]:
model = Pipeline([
                ("poly", PolynomialFeatures(degree=3, include_bias=False)),
                ("lr", LinearRegression())
            ])

model.fit(x_train, y_train)

yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

degree_3_stats_train = compute_metrics(y_train, yhat_train)
degree_3_stats_test = compute_metrics(y_test, yhat_test)


Polynomial regression model of degree 4 is fit to data

In [None]:
model = Pipeline([
                ("poly", PolynomialFeatures(degree=4, include_bias=False)),
                ("lr", LinearRegression())
            ])

model.fit(x_train, y_train)

yhat_train = model.predict(x_train)
yhat_test  = model.predict(x_test)

degree_4_stats_train = compute_metrics(y_train, yhat_train)
degree_4_stats_test = compute_metrics(y_test, yhat_test)


Plots and stats are displayed below

In [None]:
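# yhat_test holds only the last (degree 4) model's predictions at this point,
# so each model is refit here and plotted with its own predictions.
# Degree 1 with include_bias=False is equivalent to plain linear regression.
for degree, title in [(1, "Linear"), (2, "Polynomial Degree 2"),
                      (3, "Polynomial Degree 3"), (4, "Polynomial Degree 4")]:
    plot_model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lr", LinearRegression()),
    ])
    plot_model.fit(x_train, y_train)
    plot_actual_vs_predicted_test(y_test, plot_model.predict(x_test), title)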

results = [
    {
        "Model": "Linear Regression",
        "Train MSE": linear_stats_train["MSE"],
        "Train MAE": linear_stats_train["MAE"],
        "Train R2":  linear_stats_train["R^2"],
        "Test MSE":  linear_stats_test["MSE"],
        "Test MAE":  linear_stats_test["MAE"],
        "Test R2":   linear_stats_test["R^2"],
    },
    {
        "Model": "Poly (deg=2)",
        "Train MSE": degree_2_stats_train["MSE"],
        "Train MAE": degree_2_stats_train["MAE"],
        "Train R2":  degree_2_stats_train["R^2"],
        "Test MSE":  degree_2_stats_test["MSE"],
        "Test MAE":  degree_2_stats_test["MAE"],
        "Test R2":   degree_2_stats_test["R^2"],
    },
    {
        "Model": "Poly (deg=3)",
        "Train MSE": degree_3_stats_train["MSE"],
        "Train MAE": degree_3_stats_train["MAE"],
        "Train R2":  degree_3_stats_train["R^2"],
        "Test MSE":  degree_3_stats_test["MSE"],
        "Test MAE":  degree_3_stats_test["MAE"],
        "Test R2":   degree_3_stats_test["R^2"],
    },
    {
        "Model": "Poly (deg=4)",
        "Train MSE": degree_4_stats_train["MSE"],
        "Train MAE": degree_4_stats_train["MAE"],
        "Train R2":  degree_4_stats_train["R^2"],
        "Test MSE":  degree_4_stats_test["MSE"],
        "Test MAE":  degree_4_stats_test["MAE"],
        "Test R2":   degree_4_stats_test["R^2"],
    },
]

results_df = pd.DataFrame(results)
display(results_df)

Discussion of results:

The linear regression model performs best on this dataset's test set. In the summary table its test R2 is higher than that of every other model, and its test MSE and MAE are the smallest of the set. These three comparisons lead to the conclusion that linear regression is the best model for this test set. This suggests that a linear relationship between the weather factors and electricity usage could be used for prediction. However, the R2 value and the higher MSE and MAE values on the test data suggest that further research is needed.

Polynomial models, specifically the higher-degree models, tended to perform worse on this dataset. The R2 values and the spiking MSE and MAE values suggest that some degree of overfitting may be occurring. The models may be too sensitive to noise from outlier data, leading to poor performance.
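
One reason higher degrees are more prone to overfitting with four inputs is how quickly the number of generated terms grows. With n inputs and degree d, a full polynomial expansion without the constant term has C(n + d, d) - 1 columns. The short sketch below works that out for the four weather variables; the helper name `n_poly_features` is hypothetical.

```python
from math import comb

def n_poly_features(n_inputs: int, degree: int) -> int:
    """Count polynomial terms up to `degree`, excluding the constant (bias) term."""
    return comb(n_inputs + degree, degree) - 1

# Four inputs, matching the number of weather variables used above.
for degree in (1, 2, 3, 4):
    print(f"degree={degree}: {n_poly_features(4, degree)} features")
```

For four inputs this gives 4, 14, 34, and 69 columns at degrees 1 through 4, so the degree 4 model estimates far more coefficients from the same training rows than the linear model does.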

None of the models achieves particularly strong performance, and this could be for a number of reasons. One is that the minimum and maximum temperature variables can be correlated with each other and so are not truly independent. Another is that household electricity consumption can depend on many more factors, such as number of occupants, season, region, and a location's HVAC capabilities. Without accounting for these factors, and with precipitation and wind speed being only mild drivers, the data is likely insufficient to build a system relating weather to electricity use. With other factors present, however, a better model could likely be created.
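
The concern about correlated temperature variables can be checked directly. The sketch below uses synthetic stand-in temperatures (an assumption, since the real values are not reproduced here) to show the calculation: the Pearson correlation between the two variables and, considering only these two predictors, the variance inflation factor 1 / (1 - r^2). The names `tmin`, `tmax`, `r`, and `vif` are hypothetical.

```python
import numpy as np

# Synthetic stand-in for daily temperatures (assumption, not the real dataset):
# the daily high is modeled as the daily low plus a noisy offset.
rng = np.random.default_rng(0)
tmin = rng.normal(5.0, 8.0, size=500)
tmax = tmin + rng.normal(8.0, 2.0, size=500)

# Pearson correlation between the two predictors.
r = np.corrcoef(tmax, tmin)[0, 1]

# Variance inflation factor when only these two predictors are considered.
vif = 1.0 / (1.0 - r ** 2)

print(f"correlation={r:.3f}, VIF={vif:.1f}")
```

On the notebook's frame the analogous check would be `df[['TMAX', 'TMIN']].corr()`. A correlation close to 1 would mean the two columns carry largely the same information, which makes individual coefficients unstable even when predictions are reasonable.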