<h1 style="text-align:center">Build and Evaluate Time Series Models</h1>
<h2 style="text-align:center">House Pricing: House and Lot Size</h2>
<p style="text-align:center">Robert Evans</p>
<p style="text-align:center">School of Technology & Engineering, National University</p>
<p style="text-align:center">DDS-8555: Predictive Analysis</p>
<p style="text-align:center">Dr. Mohammad Yavarimanesh</p>
<p style="text-align:center">March 2, 2025</p>

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

## Load Data

In [2]:
kaggle = False

if kaggle:
    path = ""
else:
    path = "Week8_Data_Houses"

train = pd.read_csv(f"{path}/train.csv")
test = pd.read_csv(f"{path}/test.csv")

## Build Model

In [13]:
# Define features and target variable
features = ['LotFrontage', 'LotArea', 'TotalBsmtSF', 'GarageArea', 'GrLivArea']
target = 'SalePrice'

In [14]:
# Extract features and target from the training data
X = train[features]
y = train[target]

# Fill missing values in the numeric features with their respective medians from training
X = X.fillna(X.median())

In [7]:
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Add a constant to the training and testing sets to account for the intercept.
X_train_const = sm.add_constant(X_train, has_constant='add')
X_test_const = sm.add_constant(X_test, has_constant='add')

In [8]:
# Fit the OLS model on the training data.
model = sm.OLS(y_train, X_train_const).fit()
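As a rough illustration of what `sm.add_constant` and `sm.OLS(...).fit()` do together, the sketch below prepends a column of ones to a small synthetic feature matrix and solves the least-squares problem with NumPy. The data, the coefficient values, and the names `X_demo`, `y_demo`, and `X_design` are made up for illustration; this is not the notebook's model, only the same technique on toy inputs.

```python
import numpy as np

# Synthetic, noise-free data (illustrative only): y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept,
# which is the role of the 'const' column that add_constant creates.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Ordinary least squares: minimise ||X_design @ beta - y_demo||^2
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

intercept, slopes = beta[0], beta[1:]
print("intercept:", intercept)
print("slopes:", slopes)
```

Because the toy target has no noise, the recovered intercept and slopes should match the values used to generate it up to floating-point error.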

In [9]:
# Predict on the held-out 20% split (X_test), not on test.csv.
y_pred = model.predict(X_test_const)

In [10]:
# Evaluate model performance using R² and RMSE.
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Test Set Performance:")
print(f"R²: {r2:.4f}")
print(f"RMSE: {rmse:.2f}")

Test Set Performance:
R²: 0.7206
RMSE: 46295.22
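To make the two metrics concrete, here is a minimal sketch that computes them directly from their definitions, RMSE = sqrt(mean((y - ŷ)²)) and R² = 1 - SS_res / SS_tot. The helper names `rmse` and `r_squared` and the four prices are hypothetical and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Made-up prices: every prediction is off by exactly 10,000
y_true_demo = [100_000, 150_000, 200_000, 250_000]
y_hat_demo = [110_000, 140_000, 210_000, 240_000]

print("RMSE:", rmse(y_true_demo, y_hat_demo))
print("R²:", r_squared(y_true_demo, y_hat_demo))
```

RMSE is expressed in the same units as `SalePrice`, so it can be read as a typical size of the prediction error, while R² is unitless.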


In [11]:
# Print the detailed OLS regression summary (includes variable names, coefficients, and p-values).
print("\nOLS Regression Summary:")
print(model.summary())


OLS Regression Summary:
                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.643
Model:                            OLS   Adj. R-squared:                  0.642
Method:                 Least Squares   F-statistic:                     418.7
Date:                Tue, 04 Mar 2025   Prob (F-statistic):          6.01e-257
Time:                        11:46:19   Log-Likelihood:                -14201.
No. Observations:                1168   AIC:                         2.841e+04
Df Residuals:                    1162   BIC:                         2.844e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -1.007e+04   

## Kaggle Submission

In [15]:
# Extract features from test.csv
X_new = test[features]

# Fill missing values in the numeric features with the medians of test.csv itself
# (X_new.median() is computed on the test rows, not on the training data).
X_new = X_new.fillna(X_new.median())


# Add a constant term for the intercept to match the model's design matrix.
X_new = sm.add_constant(X_new, has_constant='add')

# Predict SalePrice using the trained OLS model.
predictions = model.predict(X_new)

print("Predicted SalePrice values for test.csv:")
predictions

Predicted SalePrice values for test.csv:


0       157622.023956
1       162565.870770
2       182910.325261
3       178166.446942
4       180868.110971
            ...      
1454     83710.841372
1455    114106.626803
1456    168520.193460
1457     88638.402835
1458    226938.704977
Length: 1459, dtype: float64
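The imputation in the cell above takes its fill values from `test.csv`. A common alternative is to compute the medians once on the training rows and reuse them for any new data, so the model sees new rows filled the same way as the rows it was fit on. The sketch below shows that pattern on two tiny made-up frames; `train_demo`, `test_demo`, and their values are hypothetical stand-ins, not the real files.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train.csv and test.csv (values are made up)
train_demo = pd.DataFrame({
    "LotFrontage": [60.0, 80.0, np.nan, 70.0],
    "GarageArea": [400.0, 500.0, 600.0, np.nan],
})
test_demo = pd.DataFrame({
    "LotFrontage": [np.nan, 90.0],
    "GarageArea": [450.0, np.nan],
})

# Learn the fill values once, from the training rows only
train_medians = train_demo.median()

# Reuse the same values for every dataset the model will see
train_filled = train_demo.fillna(train_medians)
test_filled = test_demo.fillna(train_medians)

print(test_filled)
```

Swapping this into the notebook would change the predictions for any test rows that have missing values, so the outputs shown here would no longer match exactly.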

In [16]:
# Create a DataFrame with the Id and predicted SalePrice.
# The column is named 'Id' (capitalised) to match the competition's submission header.
submission_df = pd.DataFrame({
    'Id': test['Id'],
    'SalePrice': predictions
})

# Save the submission DataFrame to a CSV file.
submission_df.to_csv("house_prediction_2.csv", index=False)
print("Submission file saved as house_prediction_2.csv")

Submission file saved as house_prediction_2.csv
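Before uploading, it can help to reload the CSV and run a few sanity checks. The sketch below is one possible checker, not part of the original workflow: the function name `check_submission` is hypothetical, the expected header `Id, SalePrice` is an assumption based on the competition's sample submission, and the demo rows are made up. It round-trips through an in-memory buffer so it runs without touching any files.

```python
import io
import pandas as pd

def check_submission(df, expected_rows, columns=("Id", "SalePrice")):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(df.columns) != list(columns):
        problems.append(f"columns are {list(df.columns)}, expected {list(columns)}")
    if len(df) != expected_rows:
        problems.append(f"{len(df)} rows, expected {expected_rows}")
    if df.isna().any().any():
        problems.append("missing values present")
    id_col = columns[0]
    if id_col in df.columns and df[id_col].duplicated().any():
        problems.append(f"duplicate {id_col} values")
    return problems

# Made-up rows standing in for the real submission
demo = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [150000.0, 160000.0, 180000.0]})

# Write to an in-memory CSV and read it back, as the upload would see it
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(check_submission(reloaded, expected_rows=3))
```

For the real file, the same function could be called on `pd.read_csv("house_prediction_2.csv")` with `expected_rows=len(test)`.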
