# Multiple Linear Regression
The dataset consists of three columns:

* **Home Size (SqFt)**: the total area of the house in square feet.
* **Heating Type**: the primary heating system used in the household (e.g., Electric, Gas, Heat Pump, Solar Hybrid).
* **Monthly kWh**: the target variable, the monthly household energy consumption.

Because there are two explanatory variables and one target, we build a Multiple Linear Regression model.

In [40]:

%pip install pandas numpy matplotlib seaborn scikit-learn scipy statsmodels

Note: you may need to restart the kernel to use updated packages.


### 1. **Problem Understanding**

* **Goal**: The primary research task is to apply Multiple Linear Regression (MLR) to model the relationship between home size, heating type, and monthly energy consumption.
The goals of this work are as follows:
1. To analyze the influence of home size and heating type on monthly electricity usage.
2. To construct a regression equation that predicts energy consumption (ŷ) based on the explanatory variables.
3. To evaluate the model’s predictive performance using appropriate regression metrics such as R², Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
* **Type**: Supervised learning — Regression problem.
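
With one numeric predictor and a four-level categorical predictor, the regression equation in goal 2 can be written with dummy variables. The form below is a sketch that assumes Electric is the reference level (which is what dropping the first one-hot column in alphabetical order would give). The symbols $\beta_j$, $x_{\text{size}}$, and $D$ are notation introduced here for illustration.

$$
\hat{y} = \beta_0 + \beta_1\, x_{\text{size}} + \beta_2\, D_{\text{Gas}} + \beta_3\, D_{\text{HeatPump}} + \beta_4\, D_{\text{SolarHybrid}}
$$

Each $D$ equals 1 when the home uses that heating type and 0 otherwise, so $\beta_2$, $\beta_3$, and $\beta_4$ measure the shift in predicted kWh relative to an Electric home of the same size.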

### 2. **Exploratory Data Analysis (EDA)**

* Data shape and types
* Null values
* Descriptive statistics
* Correlation
* Outliers
* Visual trends
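
The checklist above could be worked through roughly as follows. This is a minimal sketch on a small hypothetical frame (`df_demo` and its values are made up for illustration; in the notebook the same calls would run on `df`). Outliers are flagged with the common 1.5 × IQR rule, which is an assumption, not necessarily the rule the author used. Plotting is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; in the notebook `df` comes from pd.read_csv.
df_demo = pd.DataFrame({
    "Home_Size_SqFt": [1250.0, np.nan, 1680.0, 950.0, 1850.0, 2100.0],
    "Heating_Type": ["Electric", "Gas", "Electric", "Heat_Pump", "Gas", np.nan],
    "Monthly_kWh": [875.0, 720.0, 1120.0, 580.0, np.nan, 5000.0],
})

# Data shape and types
shape = df_demo.shape
dtypes = df_demo.dtypes

# Null values per column
null_counts = df_demo.isnull().sum()

# Descriptive statistics for the numeric columns
summary = df_demo.describe()

# Correlation between the numeric predictor and the target
# (rows with a missing value in either column are skipped pairwise)
corr = df_demo[["Home_Size_SqFt", "Monthly_kWh"]].corr()

# Outliers in the target using the 1.5 * IQR rule
q1 = df_demo["Monthly_kWh"].quantile(0.25)
q3 = df_demo["Monthly_kWh"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df_demo["Monthly_kWh"] < q1 - 1.5 * iqr) | (
    df_demo["Monthly_kWh"] > q3 + 1.5 * iqr
)
outliers = df_demo[is_outlier]

print(shape)
print(null_counts)
print(outliers)
```

For the visual-trends item, a scatter plot of `Home_Size_SqFt` against `Monthly_kWh`, colored by `Heating_Type`, would be one natural choice.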

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline  # required by the preprocessing and model cells below

In [42]:


file_path = "electricityConsumption - electricityConsumption.csv"  
df = pd.read_csv(file_path)
df.head()

   Home_Size_SqFt Heating_Type  Monthly_kWh
0          1250.0     Electric        875.0
1             NaN          Gas        720.0
2          1680.0     Electric       1120.0
3           950.0    Heat_Pump        580.0
4          1850.0          Gas          NaN


### Exploratory Data Analysis (EDA) Summary



In [43]:
print("\nDataset Info:")
print(df.info())
print("\nMissing Values:")
print(df.isnull().sum())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Home_Size_SqFt  92 non-null     float64
 1   Heating_Type    97 non-null     object 
 2   Monthly_kWh     90 non-null     float64
dtypes: float64(2), object(1)
memory usage: 2.5+ KB
None

Missing Values:
Home_Size_SqFt     9
Heating_Type       4
Monthly_kWh       11
dtype: int64


In [44]:
## Step 3: Data Preprocessing
# 1. Drop rows where target variable (Monthly_kWh) is missing
df = df.dropna(subset=['Monthly_kWh'])

In [45]:
# Define numeric and categorical features
numeric_features = ['Home_Size_SqFt']
categorical_features = ['Heating_Type']


# Impute numeric with median, categorical with most frequent
numeric_transformer = SimpleImputer(strategy='median')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])


# Combine preprocessing
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

## Step 3: Data Preprocessing

1. Split Data into Train/Test

In [46]:

X = df[['Home_Size_SqFt', 'Heating_Type']]
y = df['Monthly_kWh']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
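
Because no seed is passed to `train_test_split`, the split above draws a different random partition on every run, so the coefficients and metrics reported later will vary between runs. A minimal sketch of a reproducible split follows; the arrays `X_demo` and `y_demo` and the seed value 42 are arbitrary choices for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Passing the same random_state yields the same partition each time
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# train_test_split returns (X_train, X_test, y_train, y_test)
same_test_rows = np.array_equal(split_a[1], split_b[1])
n_train, n_test = len(split_a[0]), len(split_a[1])

print(same_test_rows, n_train, n_test)
```

Fixing the seed makes a single run repeatable, but the reported numbers still depend on which rows happen to land in the test set.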


## Step 4: Linear Regression Model Training

In [75]:
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])
model.fit(X_train, y_train)
print('Model trained.')



Model trained.


## Step 6: Extract Coefficients

In [73]:
# Retrieve regression coefficients after preprocessing
model_reg = model.named_steps['regressor']
feature_names = numeric_features + list(
    model.named_steps['preprocessor']
    .named_transformers_['cat']
    .named_steps['encoder']
    .get_feature_names_out(categorical_features)
)
coef = pd.DataFrame({'Feature': feature_names, 'Coefficient': model_reg.coef_})
print("\nRegression Coefficients:")
print(coef)





Regression Coefficients:
                     Feature  Coefficient
0             Home_Size_SqFt     0.610914
1           Heating_Type_Gas  -120.053944
2     Heating_Type_Heat_Pump  -156.462037
3  Heating_Type_Solar_Hybrid  -252.643730
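
The coefficients printed above can be read as differences in predicted kWh. Electric does not appear among the dummy columns, so the sketch below assumes it is the reference level. The intercept is not printed, so only differences between two predictions are computed (the intercept cancels). The helper `predicted_difference` is a hypothetical name introduced here.

```python
# Coefficients copied from the printed output above
coef = {
    "Home_Size_SqFt": 0.610914,
    "Heating_Type_Gas": -120.053944,
    "Heating_Type_Heat_Pump": -156.462037,
    "Heating_Type_Solar_Hybrid": -252.643730,
}


def heating_offset(heating_type):
    """Shift relative to the assumed reference level, Electric."""
    if heating_type == "Electric":
        return 0.0
    return coef[f"Heating_Type_{heating_type}"]


def predicted_difference(extra_sqft, heating_from, heating_to):
    """Change in predicted monthly kWh between two homes.

    The intercept is the same in both predictions, so it drops out.
    """
    size_effect = coef["Home_Size_SqFt"] * extra_sqft
    return size_effect + heating_offset(heating_to) - heating_offset(heating_from)


# Effect of 100 extra square feet with the heating type unchanged
per_100_sqft = predicted_difference(100, "Electric", "Electric")

# Effect of switching from Gas to Solar Hybrid at the same home size
gas_to_solar = predicted_difference(0, "Gas", "Solar_Hybrid")

print(f"{per_100_sqft:.2f} kWh per 100 sq ft")
print(f"{gas_to_solar:.2f} kWh when switching Gas -> Solar_Hybrid")
```

Under this fit, each additional 100 square feet adds about 61 kWh per month, and a Solar Hybrid home is predicted to use about 133 kWh less than a Gas home of the same size. These values come from one random train/test split and may shift on a rerun.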


## Step 5: Model Evaluation

1. Predict monthly energy consumption (kWh) for the held-out test set
2. Compute key metrics:

   * R² Score
   * MAE (Mean Absolute Error)
   * MSE (Mean Squared Error)
   * RMSE (Root Mean Squared Error)

In [71]:
y_pred = model.predict(X_test)
R2 = r2_score(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))


print("\nModel Performance:")
print(f"R² Score: {R2:.4f}")
print(f"Mean Absolute Error: {MAE:.2f}")
print(f"Root Mean Squared Error: {RMSE:.2f}")




Model Performance:
R² Score: 0.9081
Mean Absolute Error: 59.32
Root Mean Squared Error: 88.42
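
To show what these three metrics measure, the sketch below computes them by hand with NumPy on four hypothetical values (`y_true` and `y_hat` are made up) and compares them with the scikit-learn functions used above. MAE is the mean absolute residual, RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted monthly kWh
y_true = np.array([875.0, 720.0, 1120.0, 580.0])
y_hat = np.array([900.0, 700.0, 1100.0, 600.0])

residuals = y_true - y_hat

# Mean Absolute Error: average size of the errors, in kWh
mae = np.mean(np.abs(residuals))

# Root Mean Squared Error: penalizes large errors more, also in kWh
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: share of the variance in y_true explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.4f}")

# The hand-computed values should agree with scikit-learn
matches_sklearn = (
    np.isclose(mae, mean_absolute_error(y_true, y_hat))
    and np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
    and np.isclose(r2, r2_score(y_true, y_hat))
)
```

Because MAE and RMSE are both in kWh, they can be compared directly with typical monthly consumption. An RMSE noticeably larger than the MAE, as in the notebook's output, suggests that a few test homes have comparatively large errors.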
