## MOOC Econometrics Case Project – House Prices

### Introduction
The case project focuses on modeling and predicting house prices using data from 546 residential properties. The dataset includes characteristics such as sale price, lot size, number of bedrooms and bathrooms, structural features, and location indicators. The analysis aims to understand how these factors influence house prices and to construct an accurate predictive model. Linear regression and log-linear transformations are applied to examine relationships, test linearity, and improve model specification.

### Objectives

* Explore the relationship between house characteristics and sale price.
* Test for linearity in the regression model using the RESET test.
* Examine whether the lot size should be included as a level or logarithmic variable.
* Investigate interaction effects between lot size and other explanatory variables.
* Apply a general-to-specific approach to refine the model.
* Assess potential endogeneity issues due to omitted variables like house condition.
* Evaluate the predictive power of the final model using out-of-sample data.

### Methodology

The analysis follows a structured approach to model and predict house prices:

* #### Data Preparation

  * Collected data for 546 observations including sale price, lot size, bedrooms, bathrooms, structural features, and location dummies.
  * Created new variables such as `log_sell` (logarithm of sale price) and `log_lot` (logarithm of lot size) to stabilize variance and linearize relationships.

* #### Model Estimation

  * Estimated a linear regression model with sale price as the dependent variable.
  * Applied a log-linear model using the log of sale price to improve model fit.
  * Tested inclusion of both lot size and log(lot) to determine which specification better explains house prices.

* #### Model Diagnostics

  * Conducted the RESET test to check for model linearity.
  * Evaluated statistical significance of individual coefficients and overall model fit using R-squared, t-tests, and F-tests.

* #### Interaction Effects

  * Created interaction terms between log_lot and other explanatory variables to capture varying effects of lot size on house prices.
  * Tested both individual and joint significance of interactions using t-tests and F-tests.

* #### Model Selection

  * Applied a general-to-specific approach, iteratively removing non-significant interaction terms while keeping main effects.
  * Ensured the final model retained only significant predictors and relevant interaction effects.

* #### Predictive Analysis

  * Split the data into a training set (first 400 observations) and test set (remaining 146 observations).
  * Estimated the main-effects log model (`log_lot` plus the other regressors, without interaction terms) on the training set and predicted log sale prices for the test set.
  * Measured predictive accuracy using Mean Absolute Error (MAE) relative to the standard deviation of log sale price.

* #### Consideration of Endogeneity

  * Discussed potential bias in coefficients for variables like central air due to omitted variables (e.g., house condition) affecting both house features and prices.

In [None]:
# CAPSTONE PROJECT: HOUSE PRICE ANALYSIS

# Import required libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset
import matplotlib.pyplot as plt
import seaborn as sns

# Load Dataset

file_path = r"E:\project.csv"   # Change path if needed
df = pd.read_csv(file_path)

print(df.head())
print(df.describe())

### Findings

#### (a) OLS Regression – Level Model

The initial OLS regression used the **house sale price (`sell`)** as the dependent variable. The model explained **67.3% of the variance** in sale prices (R² = 0.673).  

- **Significant variables (p < 0.05):** `lot`, `fb`, `sty`, `drv`, `rec`, `ffin`, `ghw`, `ca`, `gar`, `reg`  
- **Non-significant variables:** `bdms` (p = 0.081) and the constant term  

**Diagnostics:**  
- RESET test: p = 2.92e-07 → model fails linearity assumption.  
- Residuals vs fitted plot shows heteroscedasticity and non-linearity.  

**Conclusion:**  
The level model does not satisfy linearity, indicating that **transformations (e.g., log transformation)** are needed for better model performance.
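
To make the RESET logic concrete, the sketch below reproduces the idea by hand on simulated data (not the house price dataset): fit the linear model, add the squared fitted values as an extra regressor, and F-test that added term. The helper names `ols_rss` and `reset_f` and the data-generating process are illustrative assumptions. This mirrors the idea behind `linear_reset(..., power=2, use_f=True)` but is not the statsmodels implementation.

```python
import numpy as np

def ols_rss(X, y):
    """Return OLS coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

def reset_f(X, y):
    """RESET F-statistic using squared fitted values as the only added term."""
    beta, rss_restricted = ols_rss(X, y)
    fitted = X @ beta
    X_aug = np.column_stack([X, fitted ** 2])
    _, rss_unrestricted = ols_rss(X_aug, y)
    n, k = X_aug.shape
    # One restriction: the coefficient on fitted**2 equals zero
    return (rss_restricted - rss_unrestricted) / (rss_unrestricted / (n - k))

# Simulated data (assumption): one regressor, two possible truths
rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=200)
noise = rng.normal(0, 1, size=200)
X = np.column_stack([np.ones_like(x), x])

y_linear = 1 + 2 * x + noise                  # linear specification is correct
y_curved = 1 + 2 * x + 1.5 * x ** 2 + noise   # linear specification is wrong

f_linear = reset_f(X, y_linear)
f_curved = reset_f(X, y_curved)
print("RESET F, linear truth:", round(f_linear, 2))
print("RESET F, curved truth:", round(f_curved, 2))
```

A large F-statistic (small p-value) signals that powers of the fitted values still explain the dependent variable, which is what the level model shows here.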

In [None]:
# Linear model: sell ~ all other variables
y = df['sell']
X = df.drop(columns=['sell', 'obs'])
X = sm.add_constant(X)

model_a = sm.OLS(y, X).fit()
print(model_a.summary())

# Linear RESET Test
reset_test_a = linear_reset(model_a, power=2, use_f=True)
print("RESET Test (Level Model)")
print("F-statistic:", reset_test_a.fvalue)
print("p-value:", reset_test_a.pvalue)
if reset_test_a.pvalue < 0.05:
    print("Reject H0 → evidence of functional-form misspecification at the 5% level.")
else:
    print("Fail to reject H0 → no evidence against the linear specification at the 5% level.")

In [None]:
plt.figure(figsize=(6, 4), dpi=80)  # smaller size
plt.scatter(model_a.fittedvalues, model_a.resid, s=20)  # s=20 reduces marker size
plt.axhline(0, color='red', linestyle='--', linewidth=1)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted (Level Model)")
plt.tight_layout()  # prevents clipping of labels
plt.show()

#### (b) Log-Transformed Sale Price
A **log transformation of the dependent variable (`log_sell = log(sell)`)** was applied to stabilize variance and improve linearity.  

- **Model fit:** R² = 0.677. This is not directly comparable with the level model's R² of 0.673, because the dependent variable is now `log(sell)` rather than `sell`.  
- **Significant variables:** `lot`, `bdms`, `fb`, `sty`, `drv`, `rec`, `ffin`, `ghw`, `ca`, `gar`, `reg`  

**Diagnostics:**  
- RESET test: p = 0.603 → linearity assumption now satisfied.  
- Distribution of `log_sell` is approximately normal.  

**Conclusion:**  
Log transformation improved linearity and stabilized variance, and all main predictors, including `bdms`, are now significant.
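
With `log(sell)` as the dependent variable, coefficients are read as relative rather than absolute price changes. The sketch below converts a coefficient into the exact percentage change for a one-unit increase in a regressor. The function name `pct_effect` and the coefficient values are hypothetical and are not estimates from this project.

```python
import math

def pct_effect(coef):
    """Exact percentage change in price for a one-unit increase in a regressor
    when the dependent variable is log(price)."""
    return 100 * (math.exp(coef) - 1)

# Hypothetical coefficients, for illustration only
for b in (0.05, 0.10, 0.30):
    print(f"coef = {b:.2f} -> approx {100 * b:.0f}%, exact {pct_effect(b):.1f}%")
```

For small coefficients the shortcut "100 times the coefficient" is close to the exact value; the gap widens as the coefficient grows.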

In [None]:
# Log transformation of sale price
df['log_sell'] = np.log(df['sell'])
y_log = df['log_sell']
X_log = df.drop(columns=['sell','log_sell','obs'])
X_log = sm.add_constant(X_log)

model_b = sm.OLS(y_log, X_log).fit()
print(model_b.summary())

# RESET Test for log model
reset_test_b = linear_reset(model_b, power=2, use_f=True)
print("RESET Test (Log Model)")
print("F-statistic:", reset_test_b.fvalue)
print("p-value:", reset_test_b.pvalue)

In [None]:
plt.figure(figsize=(6, 4), dpi=80)  # smaller figure
sns.histplot(df['log_sell'], kde=True, bins=30, color='skyblue')
plt.title("Distribution of Log(Sale Price)")
plt.xlabel("Log(Sale Price)")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

#### (c) Including Lot and Log(Lot)
Both **`lot`** and **`log_lot`** were included to determine whether a linear or log specification better explains house prices.  

- `log_lot` is significant (p < 0.001)  
- `lot` is not significant once `log_lot` is included (p = 0.359)  
- Model fit improved: R² = 0.687  

**Conclusion:**  
The **log transformation of lot size** better explains variation in house prices than raw lot size.
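
When both price and lot size enter in logs, the coefficient on `log_lot` is an elasticity. A minimal sketch of that reading follows; `price_ratio` is a hypothetical helper and `beta_log_lot = 0.30` is a made-up value, not the estimate from `model_c`.

```python
def price_ratio(elasticity, lot_increase):
    """Multiplicative change in price when lot size grows by `lot_increase`
    (e.g. 0.10 for 10%) in a log-log specification."""
    return (1 + lot_increase) ** elasticity

beta_log_lot = 0.30  # hypothetical elasticity, not the estimated value
ratio = price_ratio(beta_log_lot, 0.10)
print(f"10% larger lot -> price changes by {100 * (ratio - 1):.2f}%")
```

The same percentage increase in lot size therefore has the same proportional effect on price for small and large lots, which a level specification of `lot` cannot express.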

In [None]:
# (c) Include lot and log(lot)

df['log_lot'] = np.log(df['lot'])

X_c = df[['lot', 'log_lot', 'bdms', 'fb', 'sty', 'drv',
          'rec', 'ffin', 'ghw', 'ca', 'gar', 'reg']]

X_c = sm.add_constant(X_c)

model_c = sm.OLS(df['log_sell'], X_c).fit()
print(model_c.summary())

print("Check significance of lot and log_lot to decide which to include.")

#### (d) Interaction Effects
Interaction terms between **`log_lot`** and other predictors were created to explore whether the effect of lot size varies with other house features.  

- Only `loglot_drv` (driveway) and `loglot_rec` (recreation area) were individually significant.  
- Model fit: R² = 0.695, showing a small improvement over the main-effects model.  

**Conclusion:**  
Interaction effects exist but are limited to specific features. Most interactions are not statistically significant.
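
An interaction such as `loglot_rec = log_lot * rec` lets the lot-size elasticity differ between houses with and without the feature. The sketch below shows the arithmetic, assuming `rec` is a 0/1 indicator. The coefficients and the helper `lot_elasticity` are hypothetical and chosen only to illustrate the mechanics.

```python
def lot_elasticity(b_log_lot, b_interaction, dummy):
    """Elasticity of price with respect to lot size when log_lot is
    interacted with a 0/1 feature."""
    return b_log_lot + b_interaction * dummy

# Hypothetical coefficients, not the estimates from model_d
b_log_lot, b_loglot_rec = 0.30, 0.10
without_rec = lot_elasticity(b_log_lot, b_loglot_rec, 0)
with_rec = lot_elasticity(b_log_lot, b_loglot_rec, 1)
print("Elasticity without feature:", without_rec)
print("Elasticity with feature:   ", with_rec)
```

A significant interaction coefficient means the two elasticities differ; an insignificant one means the data do not distinguish them.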

In [None]:
# (d) Interaction Effects

X_d = df[['log_lot', 'bdms', 'fb', 'sty', 'drv',
          'rec', 'ffin', 'ghw', 'ca', 'gar', 'reg']].copy()

# Create interaction terms
for col in X_d.columns:
    if col != 'log_lot':
        X_d[f'loglot_{col}'] = df['log_lot'] * df[col]

X_d = sm.add_constant(X_d)

model_d = sm.OLS(df['log_sell'], X_d).fit()
print(model_d.summary())

# Count individually significant interactions
interaction_cols = [col for col in X_d.columns if 'loglot_' in col]
significant = [col for col in interaction_cols if model_d.pvalues[col] < 0.05]

print("Number of individually significant interactions:", len(significant))
print("Significant interactions:", significant)

#### (e) Joint F-Test for Interactions
The joint significance of all interaction terms was tested.  

- **Result:** p = 0.424 → fail to reject null hypothesis  

**Conclusion:**  
Overall, the interaction terms **do not jointly improve the model significantly**, though some individual interactions remain meaningful.
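
The joint test compares the model with interactions (unrestricted) against the main-effects model (restricted). A minimal sketch of the F-statistic is given below. The observation and parameter counts follow the setup above (546 observations; a constant, 11 main effects, and 10 interactions), while the residual sums of squares and the helper `joint_f_stat` are hypothetical.

```python
def joint_f_stat(rss_restricted, rss_unrestricted, n_restrictions, n_obs, n_params):
    """F-statistic for H0: all restricted coefficients are zero."""
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    denominator = rss_unrestricted / (n_obs - n_params)
    return numerator / denominator

# RSS values are made up for illustration; only the counts come from the setup
f_stat = joint_f_stat(rss_restricted=120.0, rss_unrestricted=100.0,
                      n_restrictions=10, n_obs=546, n_params=22)
print("F =", round(f_stat, 2))
```

The statistic is compared with an F distribution with 10 and 524 degrees of freedom; the null imposes one restriction per interaction term, not a single restriction on their sum.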

In [None]:
# (e) Joint F-test for interaction effects

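# Joint null: every interaction coefficient equals zero (one restriction per term).
# Joining the names with ' + ' would instead test the single restriction that
# the coefficients sum to zero.
hypothesis = ', '.join(f'{col} = 0' for col in interaction_cols)
f_test = model_d.f_test(hypothesis)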

print("Joint F-test results:")
print(f_test)

#### (f) General-to-Specific Approach
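A **backward elimination of insignificant interaction terms** was applied to reach a parsimonious model: the interaction with the largest p-value was removed, the model was re-estimated, and the step was repeated until all remaining interactions were significant at the 5% level. Main effects were always kept.  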

- Final model retained **main effects + loglot_rec interaction**.  
- Model fit: R² = 0.689  

**Conclusion:**  
The simplified model improves interpretability while retaining predictive power by keeping only **meaningful variables**.

In [None]:
# (f) General-to-Specific approach

current_model = model_d
current_X = X_d.copy()
remaining = list(interaction_cols)  # copy, so interaction_cols from (d) stays intact

# Drop the least significant interaction, refit, and repeat until every
# remaining interaction is significant at the 5% level (main effects are kept).
while remaining:
    interaction_pvals = current_model.pvalues[remaining]
    if interaction_pvals.max() <= 0.05:
        break

    remove_var = interaction_pvals.idxmax()
    print("Removing:", remove_var)

    current_X = current_X.drop(columns=[remove_var])
    remaining.remove(remove_var)
    current_model = sm.OLS(df['log_sell'], current_X).fit()

print("Final Selected Model:")
print(current_model.summary())

#### (g) Potential Endogeneity
Some explanatory variables may be **endogenous** or influenced by **omitted factors** that affect house prices.  

- **Example:** The overall **condition of a house** is not directly measured but influences the sale price and correlates with other variables, such as `ca` (air conditioning).  
- **Implication:** The effect of air conditioning on log(sale price) could be **overestimated**, as it partially captures the impact of the omitted condition variable.  
- **Conclusion:** Recognizing potential **omitted variable bias** is essential for accurate interpretation of coefficients and causal inference.
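
The direction of the bias can be illustrated with a small simulation. The sketch below uses simulated data only: an unobserved `condition` variable raises both the chance of having air conditioning and the log price, and the `ca` coefficient is estimated with and without it. All numbers, including `true_ca_effect = 0.15`, are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated data (assumption): unobserved condition drives both ca and price
condition = rng.normal(0, 1, size=n)
ca = (condition + rng.normal(0, 1, size=n) > 0).astype(float)
true_ca_effect = 0.15
log_price = 11 + true_ca_effect * ca + 0.20 * condition + rng.normal(0, 0.2, size=n)

def ols(X, y):
    """OLS coefficients via least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

const = np.ones(n)
beta_short = ols(np.column_stack([const, ca]), log_price)             # condition omitted
beta_long = ols(np.column_stack([const, ca, condition]), log_price)   # condition included

print("ca coefficient, condition omitted: ", round(beta_short[1], 3))
print("ca coefficient, condition included:", round(beta_long[1], 3))
```

Because the omitted variable is positively related to both `ca` and price in this setup, leaving it out pushes the `ca` coefficient above its true value.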

#### (h) Predictive Power Analysis
The predictive ability of the main-effects log model (`log_lot` plus the other regressors, without interaction terms) was tested on a holdout sample.  

- **Data split:** 400 observations for training, 146 for testing  
- **Predictions on test set:**  
  - MAE = 0.128  
  - Standard deviation of log price = 0.288  
  - MAE / Std Dev = 0.444  

**Conclusion:**  
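Predictions are reasonably accurate: the mean absolute error is about **44% of the standard deviation** of log sale price, and an MAE of 0.128 in logs corresponds to a typical price error of roughly 13%. The ratio is not a share of unexplained variance. The model is useful for practical **house price estimation**.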
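
The ratio can be put in context with a simple benchmark. The sketch below computes MAE, the population standard deviation (matching NumPy's default `ddof=0`), and their ratio on toy numbers, then compares the reported ratio with the value a mean-only forecast would give if log prices were normally distributed. The helper `mae_ratio` and the toy data are illustrative, and the normality benchmark is an assumption rather than a property of this dataset.

```python
import math
import statistics

def mae_ratio(y_true, y_pred):
    """Return (MAE, population std of y_true, MAE / std)."""
    mae = statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))
    std = statistics.pstdev(y_true)
    return mae, std, mae / std

# Toy log prices, for illustration only
mae, std, ratio = mae_ratio([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print("Toy example:", mae, round(std, 3), round(ratio, 3))

# Values reported for the holdout sample
reported_ratio = 0.128 / 0.288

# Benchmark: MAE / std of a mean-only forecast under normality
normal_benchmark = math.sqrt(2 / math.pi)
print("Reported ratio:", round(reported_ratio, 3))
print("Mean-only benchmark:", round(normal_benchmark, 3))
```

A ratio well below the benchmark indicates that the regressors add predictive information beyond the sample mean.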

In [None]:
# (h) Predictive Power Analysis


X_final = df[['log_lot', 'bdms', 'fb', 'sty', 'drv',
              'rec', 'ffin', 'ghw', 'ca', 'gar', 'reg']]

X_final = sm.add_constant(X_final)
y_final = df['log_sell']

# Split data
X_train = X_final.iloc[:400]
y_train = y_final.iloc[:400]

X_test = X_final.iloc[400:]
y_test = y_final.iloc[400:]

# Fit model
model_h = sm.OLS(y_train, X_train).fit()

# Predictions
y_pred = model_h.predict(X_test)

# Calculate MAE
MAE = np.mean(np.abs(y_test - y_pred))
std_log_price = np.std(y_test)

print("MAE:", MAE)
print("Standard Deviation of log price:", std_log_price)
print("MAE / Std Dev:", MAE / std_log_price)

In [None]:
plt.figure(figsize=(6, 4), dpi=80)  # smaller size for GitHub
plt.plot(y_test.values, label="Actual", linewidth=1)
plt.plot(y_pred.values, label="Predicted", linewidth=1)
plt.xlabel("Observation")
plt.ylabel("Log(Sale Price)")
plt.title("Actual vs Predicted Log Prices")
plt.legend()
plt.tight_layout()  # prevent clipping
plt.show()

### Conclusion

This project analyzed the determinants of house prices using data from 546 residential properties. Linear and log-linear regression models were applied to examine the relationships between house characteristics and sale prices, test assumptions, and develop a predictive model.

#### Key Findings:

* **Linearity:** The original level model (sale price) failed the linearity test, while the log-transformed model satisfied linearity and stabilized variance.

* **Lot Size:** Logarithm of lot size (log_lot) is a better predictor than raw lot size, improving model fit and interpretability.

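* **Significant Predictors:** In the log model, lot size (`lot`), bedrooms (`bdms`), full bathrooms (`fb`), stories (`sty`), driveway (`drv`), recreational room (`rec`), finished basement (`ffin`), gas hot water heating (`ghw`), central air conditioning (`ca`), garage places (`gar`), and preferred neighbourhood (`reg`) all significantly affect house prices.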

* **Interaction Effects:** Only specific interactions (loglot_drv and loglot_rec) were individually significant; most interaction terms were not important, and joint tests confirmed limited collective effect.

* **Model Refinement:** Using the general-to-specific approach, a parsimonious model was selected that retained meaningful main effects and significant interaction terms, balancing simplicity and predictive power.

* **Endogeneity Consideration:** Omitted variables such as house condition may bias coefficients (e.g., effect of air conditioning may be overestimated).

* **Predictive Accuracy:** The main-effects log model predicts reasonably well out of sample: the mean absolute error is about 44% of the standard deviation of log sale price (MAE / Std Dev = 0.444), making it useful for practical price estimation.

The log-linear regression model with selected predictors and limited interaction effects effectively explains variation in house prices and provides reliable predictive ability. Transformations, careful variable selection, and awareness of potential endogeneity are crucial for robust modeling in real estate pricing analysis.