<a href="https://colab.research.google.com/github/yuri-spizhovyi-mit/housing-insights-risk-dashboard/blob/main/ml/notebooks/forecasting_analysis.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Housing Insights & Risk Dashboard
## Data Engineering & Forecasting Notebook  
### Models: ARIMA, Prophet, LSTM
#### Author: Yuri Spizhovyi
#### Environment: Google Colab + Python + Pandas + Statsmodels + TensorFlow
#### Objective:
- Load datasets (HPI, rent, demographics, macro, metrics)
- Explore trends, seasonality, missingness
- Define feature engineering strategy
- Prepare feature tables for ARIMA, Prophet, LSTM

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("seaborn-v0_8")
pd.set_option("display.max_columns", None)

In [None]:
!git clone https://github.com/yuri-spizhovyi-mit/housing-insights-risk-dashboard.git
%cd housing-insights-risk-dashboard/data/raw

In [None]:
# Load datasets
df_hpi = pd.read_csv("house_price_index.csv")
df_rent = pd.read_csv("rent_index.csv")
df_demo = pd.read_csv("demographics.csv")
df_macro = pd.read_csv("macro_economic_data.csv")
df_metrics = pd.read_csv("metrics.csv")

dfs = {
    "house_price_index": df_hpi,
    "rent_index": df_rent,
    "demographics": df_demo,
    "macro_economic": df_macro,
    "metrics": df_metrics,
}

for name, df in dfs.items():
    print(f"\n===== {name.upper()} =====")
    print(df.head())
    print(df.info())

In [None]:
for name, df in dfs.items():
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"])
        df.sort_values("date", inplace=True)
        dfs[name] = df

print("Date columns converted and sorted.")

In [None]:
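for name, df in dfs.items():
    print(f"\n{name}:")
    # Guard the date lookup: the conversion cell above only touches tables that have a "date" column
    if "date" in df.columns:
        print("Date range:", df["date"].min(), "→", df["date"].max())
    print("Missing values:\n", df.isna().sum())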

In [None]:
# Plot HPI trend
plt.figure(figsize=(12, 5))
sns.lineplot(data=df_hpi, x="date", y="benchmark_price")
plt.title("House Price Index Trend")
plt.show()

In [None]:
# Plot rent trend
plt.figure(figsize=(12, 5))
sns.lineplot(data=df_rent, x="date", y="rent_value")
plt.title("Rent Index Trend")
plt.show()

## Section 2: Exploratory Data Analysis (EDA) + Feature Engineering Plan

In [None]:
# Imports for EDA
%pip install statsmodels

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import STL


In [None]:
# Detect Stationarity (ADF Test) - ARIMA and partially Prophet

In [None]:
def adf_test(series, name="Series"):
    print(f"\nADF Test for {name}")
    result = adfuller(series.dropna())
    labels = ["ADF statistic", "p-value", "# lags used", "# observations"]
    out = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        out[f"critical value ({key})"] = val
    print(out)
    if result[1] <= 0.05:
        print("=> Stationary")
    else:
        print("=> Non-stationary")

adf_test(df_hpi["benchmark_price"], "House Price Index")
adf_test(df_rent["rent_value"], "Rent Index")


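Interpretation rules (the ADF null hypothesis is that the series has a unit root):

    p-value > 0.05 → cannot reject the unit root → treat as non-stationary → difference the series before fitting ARIMA

    p-value ≤ 0.05 → reject the unit root → treat as stationary → ARIMA can train on the series directly

In levels, housing prices and rents usually trend upward, so expect both series to test as non-stationary.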
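
As a minimal sketch of what differencing does, the snippet below builds a synthetic monthly series (an assumption for illustration, not the notebook's data) with a linear trend and a 12-month cycle, then applies a seasonal difference followed by a first difference using pandas.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month cycle (illustrative only).
dates = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(48)
series = pd.Series(100 + 2.0 * t + 5.0 * np.sin(2 * np.pi * t / 12), index=dates)

# Seasonal difference (lag 12) cancels the repeating yearly pattern;
# the linear trend collapses to a constant step of 2.0 * 12.
seasonal_diff = series.diff(12)

# First difference (lag 1) then removes that remaining constant level.
stationary = seasonal_diff.diff(1).dropna()

print(seasonal_diff.dropna().describe())
print(stationary.abs().max())
```

On real data the differenced series will not be constant. Passing it back through `adf_test` is one way to check whether this amount of differencing was enough.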

In [None]:
# Detect Seasonality (STL Decomposition)
from matplotlib import pyplot as plt

def stl_plot(df, value_col, title):
    df = df.set_index("date")
    df = df.asfreq("MS")  # monthly start frequency
    stl = STL(df[value_col], robust=True)
    res = stl.fit()

    fig = res.plot()
    fig.set_size_inches(12, 8)
    fig.suptitle(title, fontsize=16)
    plt.show()

# Use the same value columns as the trend plots and ADF tests above
stl_plot(df_hpi.copy(), "benchmark_price", "STL Decomposition – HPI")
stl_plot(df_rent.copy(), "rent_value", "STL Decomposition – Rent Index")


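The decomposition splits each series into:

- long-term trend
- seasonal cycle
- residual (noise)

Prophet models trend and seasonality as explicit components, and an LSTM can learn them from lagged inputs.
ARIMA needs them handled through differencing: regular differencing for the trend and 12-month seasonal differencing for the yearly cycle.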
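
One practical caveat with the `asfreq("MS")` step in the STL cell: it requires unique dates and inserts NaN rows for any missing month, and STL is not designed to work around missing values. The sketch below shows one way to regularize a series first. The helper name `to_monthly`, the toy data, and the choice of averaging duplicates and interpolating gaps are all assumptions for illustration; other fill strategies may suit the real data better.

```python
import pandas as pd

def to_monthly(df, value_col, date_col="date"):
    """Return a gap-free monthly series (hypothetical helper, not part of the notebook)."""
    return (
        df.groupby(date_col)[value_col]
        .mean()                      # collapse duplicate dates so the index is unique
        .asfreq("MS")                # insert NaN rows for any missing month
        .interpolate(method="time")  # fill interior gaps linearly along the time axis
    )

# Toy frame with January duplicated and March missing.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01", "2020-04-01"]),
    "value": [100.0, 102.0, 104.0, 110.0],
})
monthly = to_monthly(toy, "value")
print(monthly)
```

The resulting series has a regular monthly index, so it can be passed to `STL` directly.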

### Correlation With Macro & Demographics

In [None]:
# Merge macro + HPI on date. Clarify which variables may help LSTM & Prophet.
df_hpi_merged = df_hpi.merge(df_macro, on="date", how="left")

plt.figure(figsize=(14,6))
sns.heatmap(df_hpi_merged.corr(numeric_only=True), annot=False, cmap="coolwarm")
plt.title("Correlation Heatmap (HPI + Macro)")
plt.show()

In [None]:
# the same for rent + macro:
df_rent_merged = df_rent.merge(df_macro, on="date", how="left")

plt.figure(figsize=(14,6))
sns.heatmap(df_rent_merged.corr(numeric_only=True), annot=False, cmap="coolwarm")
plt.title("Correlation Heatmap (Rent + Macro)")
plt.show()


The heatmaps help shortlist candidate drivers such as:

- interest rate
- unemployment
- CPI
- GDP growth
- population growth
- migration flows

Variables that correlate strongly with the target are the first candidates for the LSTM feature set.
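
A heatmap is hard to read once the table has many columns, so it can help to rank variables by the strength of their correlation with the target. The following is a sketch on synthetic data; the column names and the helper `rank_correlations` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 60

# Synthetic merged table (illustrative): one variable tied to the target, one pure noise.
merged = pd.DataFrame({"target": np.linspace(100, 160, n)})
merged["interest_rate"] = 5.0 - 0.05 * merged["target"] + rng.normal(0, 0.05, n)
merged["unemployment"] = rng.normal(6.0, 0.5, n)

def rank_correlations(df, target_col):
    """Sort numeric columns by absolute Pearson correlation with the target."""
    corr = df.corr(numeric_only=True)[target_col].drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranked = rank_correlations(merged, "target")
print(ranked)
```

Because prices, rents, and many macro series all trend over time, correlations computed on levels can be inflated. Repeating the ranking on differenced or percent-change columns is a common cross-check.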

# Feature Engineering Strategy

## 1. ARIMA (Univariate Only)
- Uses only the target series (price or rent)
- Requires:
  - Stationarity (perform differencing)
  - Seasonal differencing (12-month)
- No external regressors recommended
- Table structure:
  | date | value |

## 2. Prophet (Supports Regressors)
- Good with:
  - seasonality
  - holidays
  - changepoints
- Regressors recommended:
  - interest_rate
  - inflation (CPI)
  - unemployment
- Table structure:
  | ds | y | regressor1 | regressor2 | ... |
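
A sketch of how such a table might be assembled with pandas follows. Prophet expects regressor values to be present for every training row (and for every future date at prediction time), so the example fills the gaps a left join leaves behind. The helper `build_prophet_frame`, the toy data, and the forward/backward fill policy are assumptions for illustration.

```python
import pandas as pd

def build_prophet_frame(target, regressors, date_col="date", value_col="value"):
    """Assemble a ds/y table with regressor columns (hypothetical helper)."""
    frame = target[[date_col, value_col]].rename(columns={date_col: "ds", value_col: "y"})
    for reg in regressors:
        frame = frame.merge(reg.rename(columns={date_col: "ds"}), on="ds", how="left")
    reg_cols = [c for c in frame.columns if c not in ("ds", "y")]
    if reg_cols:
        # Fill gaps left by the join so no regressor value is missing.
        frame[reg_cols] = frame[reg_cols].ffill().bfill()
    return frame

# Toy target (monthly) and a regressor that is only observed every other month.
target = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=4, freq="MS"),
    "value": [300.0, 305.0, 311.0, 318.0],
})
rates = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-03-01"]),
    "interest_rate": [0.25, 0.50],
})
frame = build_prophet_frame(target, [rates])
print(frame)
```

Each regressor column would then be registered on the model with `add_regressor` before fitting.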

## 3. LSTM (Multivariate, Deep Learning)
- Uses many features from:
  - macro data
  - demographics
  - listings activity
  - migration
  - risk anomalies
  - lag features
- Recommended features:
  - target lag-1, lag-3, lag-6, lag-12
  - moving averages
  - YoY % change
  - macro variables
- Table structure:
  | date | price | rent | interest_rate | CPI | population | deaths | ... |
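
The recommended lag, moving-average, and YoY features can be sketched with plain pandas as below. The helper `add_lag_features`, its default windows, and the toy price series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_lag_features(df, target_col, lags=(1, 3, 6, 12), ma_windows=(3, 12)):
    """Append lag, moving-average and YoY columns for one target (hypothetical helper)."""
    out = df.copy()
    for lag in lags:
        out[f"{target_col}_lag_{lag}"] = out[target_col].shift(lag)
    for w in ma_windows:
        # shift(1) first so the average only uses values known before the current month
        out[f"{target_col}_ma_{w}"] = out[target_col].shift(1).rolling(w).mean()
    # Year-over-year change in percent, relative to the same month one year earlier
    out[f"{target_col}_yoy_pct"] = (out[target_col] / out[target_col].shift(12) - 1) * 100
    return out

# Toy monthly price series rising by 1 each month.
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
prices = pd.DataFrame({"price": np.arange(100.0, 124.0)}, index=idx)

# The first 12 rows lack enough history and are dropped.
features = add_lag_features(prices, "price").dropna()
print(features.head())
```

Shifting before the rolling mean keeps the current month's value out of its own feature, which avoids leaking the target into the inputs.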


In [None]:
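# ARIMA feature table
def create_arima_features(df, value_col):
    # Keep only the target series and expose it under the generic name "value"
    feat = df[["date", value_col]].rename(columns={value_col: "value"})
    feat = feat.set_index("date").asfreq("MS")
    return feat

# Same HPI column as in the trend plot and ADF test above
arima_features = create_arima_features(df_hpi, "benchmark_price")
arima_features.head()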


In [None]:
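# Prophet feature table
def create_prophet_features(df, value_col, regressors=None):
    # Prophet expects the date column as "ds" and the target as "y"
    feat = df[["date", value_col]].rename(columns={"date": "ds", value_col: "y"})

    if regressors:
        # Each regressor frame must already carry a "ds" column to join on
        for reg_df in regressors.values():
            feat = feat.merge(reg_df, on="ds", how="left")

    return feat

# Same HPI column as in the trend plot and ADF test above
prophet_features = create_prophet_features(df_hpi, "benchmark_price")
prophet_features.head()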


In [None]:
# LSTM multivariate table
def create_lstm_features(df_price, df_rent, df_macro, df_demo):
    df = df_price.merge(df_rent, on="date", how="left", suffixes=("_price","_rent"))
    df = df.merge(df_macro, on="date", how="left")
    df = df.merge(df_demo, on="date", how="left")

    df = df.set_index("date").asfreq("MS")

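    # sample lag features on the HPI column used in the plots above
    # (the first rows are NaN until enough history exists)
    df["lag_1"] = df["benchmark_price"].shift(1)
    df["lag_12"] = df["benchmark_price"].shift(12)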

    return df

lstm_features = create_lstm_features(df_hpi, df_rent, df_macro, df_demo)
lstm_features.head()
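
A feature table like this still has to be reshaped before training, because Keras LSTM layers take 3D input of shape (samples, timesteps, features). The sketch below shows one common sliding-window approach with NumPy. The helper `make_windows`, the 12-month lookback, and the toy matrix are assumptions; in practice rows with NaN lag values would be dropped and the columns scaled first.

```python
import numpy as np

def make_windows(values, target_idx, lookback=12, horizon=1):
    """Slice a (time, features) array into LSTM inputs X and targets y."""
    values = np.asarray(values, dtype=np.float32)
    X, y = [], []
    for start in range(len(values) - lookback - horizon + 1):
        end = start + lookback
        X.append(values[start:end])                      # lookback rows, all features
        y.append(values[end + horizon - 1, target_idx])  # target `horizon` steps after the window
    return np.stack(X), np.array(y)

# Toy matrix: 20 months, 3 features; column 0 plays the role of the price target.
toy = np.arange(60, dtype=np.float32).reshape(20, 3)
X, y = make_windows(toy, target_idx=0, lookback=12, horizon=1)
print(X.shape, y.shape)
```

With a real table, `lstm_features.dropna().to_numpy()` would take the place of the toy matrix.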


# Summary of EDA Findings

- HPI and Rent series are non-stationary (ADF)
- Strong seasonal structure detected (STL)
- Macro variables such as interest rates, inflation, and unemployment show meaningful correlation with HPI and rent
- ARIMA requires differencing and univariate structure
- Prophet benefits from a few macro regressors
- LSTM requires a rich multivariate feature matrix
- Feature tables must be created separately for each model
