# Predicting GDP Using Regression Models in Python

This notebook demonstrates a regression workflow, inspired by your R code, that compares:
- **Linear Regression**
- **Ridge Regression**
- **Lasso Regression**
- **Elastic Net**

We'll compare these models and choose the best one based on cross-validation performance.

---

The target is **real GDP** (GDPC1), predicted from three economic indicators:
- Unemployment Rate (UNRATE)
- Inflation (FPCPITOTLZGUSA)
- Federal Funds Rate (FEDFUNDS)

Data sources (FRED):
* https://fred.stlouisfed.org/series/GDPC1
* https://fred.stlouisfed.org/series/UNRATE
* https://fred.stlouisfed.org/series/FPCPITOTLZGUSA
* https://fred.stlouisfed.org/series/FEDFUNDS




## 1. Import Required Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for modeling
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# For saving/loading the final model
import joblib

sns.set_style("whitegrid")


## 2. Load and Prepare Data

We'll assume you have four CSV files:
- `GDPC1.csv` containing columns `[observation_date, GDPC1]`
- `UNRATE.csv` containing columns `[observation_date, UNRATE]`
- `FPCPITOTLZGUSA.csv` containing columns `[observation_date, FPCPITOTLZGUSA]`
- `FEDFUNDS.csv` containing columns `[observation_date, FEDFUNDS]`

We will:
1. Read each CSV
2. Parse `observation_date` as datetime
3. Merge them into a single DataFrame (inner join on Date)
4. Drop missing values
5. Create train/test splits
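
The read, parse, and merge steps above follow the same pattern for every file. A minimal sketch of a helper that factors the pattern out, assuming each file has the layout described above (`observation_date` plus one value column). The name `load_fred_series` and the in-memory sample values are illustrative and not part of the original workflow:

```python
import io
from functools import reduce

import pandas as pd


def load_fred_series(source, rename=None):
    """Read one FRED-style CSV (path or file-like object), sorted by date."""
    df = pd.read_csv(source, parse_dates=["observation_date"])
    df = df.rename(columns={"observation_date": "Date", **(rename or {})})
    return df.sort_values("Date").reset_index(drop=True)


# Tiny in-memory stand-ins for the real CSV files (made-up values).
gdp_csv = io.StringIO(
    "observation_date,GDPC1\n1960-01-01,100.0\n1960-04-01,101.0\n1961-01-01,103.0\n"
)
unrate_csv = io.StringIO(
    "observation_date,UNRATE\n1960-01-01,5.0\n1960-02-01,4.8\n1961-01-01,6.0\n"
)

frames = [
    load_fred_series(gdp_csv, rename={"GDPC1": "GDP"}),
    load_fred_series(unrate_csv, rename={"UNRATE": "UnemploymentRate"}),
]

# The inner join keeps only the dates present in every series.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Date", how="inner"), frames
)
print(merged)
```

With real files, the `io.StringIO` objects would be replaced by paths such as `"GDPC1.csv"`.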

In [3]:
# -------------------------
# 1) Load GDP data
# -------------------------
gdp = pd.read_csv("GDPC1.csv", parse_dates=["observation_date"])
gdp.rename(columns={"observation_date": "Date", "GDPC1": "GDP"}, inplace=True)
gdp.sort_values("Date", inplace=True)

# -------------------------
# 2) Load UNRATE data
# -------------------------
unrate = pd.read_csv("UNRATE.csv", parse_dates=["observation_date"])
unrate.rename(columns={"observation_date": "Date"}, inplace=True)
unrate.sort_values("Date", inplace=True)

# -------------------------
# 3) Load Inflation data
# -------------------------
infl = pd.read_csv("FPCPITOTLZGUSA.csv", parse_dates=["observation_date"])
infl.rename(columns={"observation_date": "Date"}, inplace=True)
infl.sort_values("Date", inplace=True)

# -------------------------
# 4) Load FEDFUNDS data
# -------------------------
fedfunds = pd.read_csv("FEDFUNDS.csv", parse_dates=["observation_date"])
fedfunds.rename(columns={"observation_date": "Date"}, inplace=True)
fedfunds.sort_values("Date", inplace=True)

# -------------------------
# Merge everything
# -------------------------
df_merged = pd.merge(gdp, unrate, on="Date", how="inner")
df_merged = pd.merge(df_merged, infl, on="Date", how="inner")
df_merged = pd.merge(df_merged, fedfunds, on="Date", how="inner")

# Rename columns for clarity
df_merged.rename(columns={
    "UNRATE": "UnemploymentRate",
    "FPCPITOTLZGUSA": "Inflation",
    "FEDFUNDS": "FedFundsRate"
}, inplace=True)

# Sort final DataFrame
df_merged.sort_values("Date", inplace=True)
df_merged.reset_index(drop=True, inplace=True)

print("Data after merging:")
print(df_merged.head(10))
print("\nData shape:", df_merged.shape)

# Drop any rows with missing values
df_merged.dropna(inplace=True)
print("\nData shape after dropping NA:", df_merged.shape)

# Check final coverage
print("\nDate range:", df_merged["Date"].min(), "to", df_merged["Date"].max())


Data after merging:
        Date       GDP  UnemploymentRate  Inflation  FedFundsRate
0 1960-01-01  3517.181               5.2   1.457976          3.99
1 1961-01-01  3493.703               6.6   1.070724          1.45
2 1962-01-01  3758.147               5.8   1.198773          2.15
3 1963-01-01  3893.482               5.7   1.239669          2.92
4 1964-01-01  4135.553               5.6   1.278912          3.48
5 1965-01-01  4362.111               4.9   1.585169          3.90
6 1966-01-01  4731.888               4.0   3.015075          4.42
7 1967-01-01  4870.299               3.9   2.772786          4.94
8 1968-01-01  5057.553               3.7   4.271796          4.61
9 1969-01-01  5283.597               3.4   5.462386          6.30

Data shape: (64, 5)

Data shape after dropping NA: (64, 5)

Date range: 1960-01-01 00:00:00 to 2023-01-01 00:00:00


### 2.1 Feature and Target Selection

We will predict `GDP` using:
- `UnemploymentRate`
- `Inflation`
- `FedFundsRate`

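The inner join keeps only dates present in all four series, which here leaves one observation per year (64 rows, 1960 to 2023). In a real project, you would likely align the series' frequencies explicitly and do more feature engineering.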
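
One possible alignment step, sketched under the assumption that a monthly series should be averaged to annual frequency before it is merged with an annual series. The sample values and variable names are made up for illustration:

```python
import pandas as pd

# Hypothetical monthly series (two years of made-up values).
monthly = pd.DataFrame({
    "Date": pd.date_range("1960-01-01", periods=24, freq="MS"),
    "UnemploymentRate": [5.0 + 0.1 * i for i in range(24)],
})

# Hypothetical annual series, dated on January 1 of each year.
annual = pd.DataFrame({
    "Date": pd.to_datetime(["1960-01-01", "1961-01-01"]),
    "Inflation": [1.5, 1.1],
})

# Average the monthly values within each calendar year.
monthly_by_year = (
    monthly.assign(Year=monthly["Date"].dt.year)
    .groupby("Year", as_index=False)["UnemploymentRate"]
    .mean()
)

# Merge on the year instead of the exact date.
annual_by_year = annual.assign(Year=annual["Date"].dt.year).drop(columns="Date")
aligned = pd.merge(monthly_by_year, annual_by_year, on="Year", how="inner")
print(aligned)
```

Compared with an exact-date join, this uses all twelve monthly readings per year rather than only the January value.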

In [4]:
# Define target (y) and features (X)
X = df_merged[["UnemploymentRate", "Inflation", "FedFundsRate"]]
y = df_merged["GDP"]

# Quick look at descriptive statistics
print("Feature summary:")
print(X.describe())

print("\nTarget summary (GDP):")
print(y.describe())


Feature summary:
       UnemploymentRate  Inflation  FedFundsRate
count         64.000000  64.000000     64.000000
mean           5.881250   3.770332      4.712500
std            1.599988   2.762789      3.748391
min            3.400000  -0.355546      0.070000
25%            4.700000   1.876588      1.685000
50%            5.700000   3.021948      4.375000
75%            6.625000   4.399866      6.497500
max           10.400000  13.549202     19.080000

Target summary (GDP):
count       64.000000
mean     11319.421438
std       5624.681697
min       3493.703000
25%       6267.051250
50%      10141.910500
75%      16411.053750
max      22403.435000
Name: GDP, dtype: float64


## 3. Data Partition

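We use a simple random 70/30 train-test split, with a fixed seed (`random_state=222`) for reproducibility.
Because `train_test_split` shuffles rows by default, the test set is a random sample of years rather than the most recent period.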

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=222
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


Training set size: (44, 3)
Test set size: (20, 3)
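
Since the rows are yearly observations, an alternative is a chronological split that holds out the most recent years. A minimal sketch on synthetic data, passing `shuffle=False` so that the test set is the final block of rows. The data and variable names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 yearly rows, already sorted by year.
demo = pd.DataFrame({
    "Year": np.arange(2000, 2020),
    "x": np.linspace(0.0, 1.0, 20),
})
demo["y"] = 3.0 * demo["x"] + 1.0

X_demo, y_demo = demo[["x"]], demo["y"]

# shuffle=False keeps row order, so the last 30% of rows form the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=False
)

train_years = demo.loc[X_tr.index, "Year"]
test_years = demo.loc[X_te.index, "Year"]
print("Train years:", train_years.min(), "to", train_years.max())
print("Test years:", test_years.min(), "to", test_years.max())
```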


## 4. Model Training with Cross-Validation

We'll create pipelines for:
1. **Linear Regression** (no regularization)
2. **Ridge** (L2 regularization)
3. **Lasso** (L1 regularization)
4. **Elastic Net** (mix of L1 & L2)
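
For reference, these are the objectives as scikit-learn defines them, for $n$ samples, coefficient vector $w$, penalty strength $\alpha$ (`alpha`), and mixing parameter $\rho$ (`l1_ratio`). The intercept is omitted for brevity:

$$
\begin{aligned}
\text{Linear:} &\quad \min_w \; \lVert y - Xw \rVert_2^2 \\
\text{Ridge:} &\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} &\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} &\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Ridge's loss term is not divided by $2n$, so a given `alpha` value does not mean the same penalty strength for Ridge as it does for Lasso or Elastic Net.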

Each pipeline will include:
- `StandardScaler()` (scales features)
- The regression model

We'll use **5-fold cross-validation** to select the best hyperparameters
(e.g., `alpha`) for ridge, lasso, and elastic net.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV

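def evaluate_model(model, X, y, cv=5):
    """
    Compute cross-validated RMSE for a model.

    Returns the mean of the per-fold RMSEs. The grid searches below report
    sqrt(mean fold MSE) instead, which is never smaller, so the two figures
    are close but not strictly comparable.
    """
    neg_mse = cross_val_score(model, X, y,
                              scoring='neg_mean_squared_error',
                              cv=cv)
    rmse = np.sqrt(-neg_mse)
    return rmse.mean()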

# 1) Linear Regression
from sklearn.linear_model import LinearRegression
lr_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression())
])
lr_cv_rmse = evaluate_model(lr_pipeline, X_train, y_train, cv=5)
print("Linear Regression CV RMSE:", lr_cv_rmse)

# 2) Ridge Regression
from sklearn.linear_model import Ridge
ridge_params = {'ridge__alpha': [0.01, 0.1, 1, 10, 100]}
ridge_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge())
])
ridge_grid = GridSearchCV(ridge_pipeline, ridge_params, 
                          scoring='neg_mean_squared_error', 
                          cv=5)
ridge_grid.fit(X_train, y_train)
ridge_cv_rmse = np.sqrt(-ridge_grid.best_score_)
print("Best Ridge alpha:", ridge_grid.best_params_)
print("Ridge Regression CV RMSE:", ridge_cv_rmse)

# 3) Lasso Regression
from sklearn.linear_model import Lasso
lasso_params = {'lasso__alpha': [0.001, 0.01, 0.1, 1, 10]}
lasso_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(max_iter=10000))
])
lasso_grid = GridSearchCV(lasso_pipeline, lasso_params, 
                          scoring='neg_mean_squared_error', 
                          cv=5)
lasso_grid.fit(X_train, y_train)
lasso_cv_rmse = np.sqrt(-lasso_grid.best_score_)
print("Best Lasso alpha:", lasso_grid.best_params_)
print("Lasso Regression CV RMSE:", lasso_cv_rmse)

# 4) Elastic Net
from sklearn.linear_model import ElasticNet
en_params = {
    'en__alpha': [0.001, 0.01, 0.1, 1, 10],
    'en__l1_ratio': [0.2, 0.5, 0.8]
}
en_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("en", ElasticNet(max_iter=10000))
])
en_grid = GridSearchCV(en_pipeline, en_params, 
                       scoring='neg_mean_squared_error', 
                       cv=5)
en_grid.fit(X_train, y_train)
en_cv_rmse = np.sqrt(-en_grid.best_score_)
print("Best Elastic Net alpha & l1_ratio:", en_grid.best_params_)
print("Elastic Net CV RMSE:", en_cv_rmse)


Linear Regression CV RMSE: 5270.744709566352
Best Ridge alpha: {'ridge__alpha': 10}
Ridge Regression CV RMSE: 5200.923829669878
Best Lasso alpha: {'lasso__alpha': 10}
Lasso Regression CV RMSE: 5305.46654031659
Best Elastic Net alpha & l1_ratio: {'en__alpha': 1, 'en__l1_ratio': 0.5}
Elastic Net CV RMSE: 5190.645414731823
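
The linear baseline above averages per-fold RMSEs, while the grid searches take the square root of the mean fold MSE (`best_score_`). A minimal sketch on synthetic data showing that the two aggregations can differ. The data and variable names are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))

# Noise grows along the sample so that fold errors are uneven.
noise = rng.normal(size=40) * np.linspace(0.5, 5.0, 40)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + noise

neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_squared_error", cv=5,
)

mean_of_fold_rmse = np.sqrt(-neg_mse).mean()  # aggregation used by evaluate_model
rmse_of_mean_mse = np.sqrt(-neg_mse.mean())   # aggregation behind sqrt(-best_score_)

print("Mean of per-fold RMSE:", mean_of_fold_rmse)
print("RMSE from mean MSE:   ", rmse_of_mean_mse)
```

Because the square root is concave, the second figure is never smaller than the first. Using one aggregation for all four models would make the comparison strictly like-for-like.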


### 4.1 Compare Models

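We pick the model with the lowest cross-validated RMSE on the training data as our "best" model. The four scores above are within roughly 2% of one another, so the ranking is not decisive.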

In [7]:

models_cv = {
    "Linear": lr_cv_rmse,
    "Ridge": ridge_cv_rmse,
    "Lasso": lasso_cv_rmse,
    "ElasticNet": en_cv_rmse
}

best_model_name = min(models_cv, key=models_cv.get)
print("CV RMSE Scores:", models_cv)
print("Best model in cross-validation:", best_model_name)


CV RMSE Scores: {'Linear': np.float64(5270.744709566352), 'Ridge': np.float64(5200.923829669878), 'Lasso': np.float64(5305.46654031659), 'ElasticNet': np.float64(5190.645414731823)}
Best model in cross-validation: ElasticNet


## 5. Final Model and Testing

We refit the best model on the entire training set, then evaluate on the hold-out test set.


In [8]:
if best_model_name == "Linear":
    final_model = lr_pipeline
elif best_model_name == "Ridge":
    final_model = ridge_grid.best_estimator_
elif best_model_name == "Lasso":
    final_model = lasso_grid.best_estimator_
else:
    final_model = en_grid.best_estimator_

# The linear pipeline has not been fitted yet. The grid-search estimators were
# already refit on the full training set (refit=True), so fitting them again
# here is redundant but harmless.
final_model.fit(X_train, y_train)

# Predictions on test set
y_pred_test = final_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
print(f"Test RMSE ({best_model_name}):", test_rmse)


Test RMSE (ElasticNet): 5381.665213271924


### 5.1 Variable Importance

For linear, ridge, lasso, and elastic net, the fitted coefficients indicate each feature's relative influence.
Because the pipeline applies `StandardScaler`, each coefficient is the change in predicted GDP per one-standard-deviation increase in that feature.


In [9]:
if best_model_name == "Linear":
    reg_step = final_model.named_steps["lr"]
elif best_model_name == "Ridge":
    reg_step = final_model.named_steps["ridge"]
elif best_model_name == "Lasso":
    reg_step = final_model.named_steps["lasso"]
else:
    reg_step = final_model.named_steps["en"]

coefs = reg_step.coef_
feature_names = X_train.columns

coef_df = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": coefs,
    "AbsCoefficient": np.abs(coefs)
}).sort_values("AbsCoefficient", ascending=False)

print("Coefficients (standardized):")
print(coef_df)


Coefficients (standardized):
            Feature  Coefficient  AbsCoefficient
2      FedFundsRate -1838.736721     1838.736721
1         Inflation  -193.316744      193.316744
0  UnemploymentRate  -119.753631      119.753631
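
To express coefficients per original unit (for example, per percentage point) rather than per standard deviation, each one can be divided by the scaler's `scale_`. A minimal sketch on synthetic data. The step names mirror the pipelines above, but the data and feature names are made up:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "a": rng.normal(5.0, 2.0, size=50),
    "b": rng.normal(0.0, 10.0, size=50),
})
y_demo = 3.0 * X_demo["a"] - 0.5 * X_demo["b"] + rng.normal(size=50)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("en", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)),
]).fit(X_demo, y_demo)

scaler = pipe.named_steps["scaler"]
model = pipe.named_steps["en"]

# Coefficient per original unit = standardized coefficient / feature std.
coef_original = model.coef_ / scaler.scale_
intercept_original = model.intercept_ - np.sum(coef_original * scaler.mean_)

# The rescaled coefficients reproduce the pipeline's predictions.
manual_pred = X_demo.to_numpy() @ coef_original + intercept_original
print(pd.Series(coef_original, index=X_demo.columns))
print("Matches pipeline:", np.allclose(manual_pred, pipe.predict(X_demo)))
```

Standardized coefficients are better for ranking features against each other. Original-unit coefficients are easier to read as "change in the target per unit of the feature".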


## 6. Save the Final Model for Later Use

In [10]:
joblib.dump(final_model, "final_model.pkl")
print("Model saved to final_model.pkl")


Model saved to final_model.pkl


### 6.1 Load and Use the Saved Model


In [11]:
loaded_model = joblib.load("final_model.pkl")
y_pred_loaded = loaded_model.predict(X_test)
rmse_loaded = np.sqrt(mean_squared_error(y_test, y_pred_loaded))
print("RMSE using loaded model:", rmse_loaded)


RMSE using loaded model: 5381.665213271924
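
A loaded pipeline can also score new observations, provided the input has the same feature columns, in the same order, as the training data. A minimal sketch using a small stand-in pipeline. The temporary file path, the random training data, and the indicator values are all illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_cols = ["UnemploymentRate", "Inflation", "FedFundsRate"]

# Stand-in training data (random values, not real indicators).
rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.uniform(0.0, 10.0, size=(30, 3)), columns=feature_cols)
y_demo = 1000.0 - 50.0 * X_demo["FedFundsRate"] + rng.normal(size=30)

demo_model = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
demo_model.fit(X_demo, y_demo)

# Round-trip the fitted pipeline through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)
    reloaded = joblib.load(path)

# One new observation with the same column names and order as in training.
new_obs = pd.DataFrame([[4.0, 2.5, 5.0]], columns=feature_cols)
prediction = reloaded.predict(new_obs)
print("Predicted value:", prediction[0])
```

The scaler is saved inside the pipeline, so new inputs are passed in original units and standardized automatically.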


## 7. Conclusion

In this notebook, we:

1. **Merged** GDP, Unemployment, Inflation, and Fed Funds data with an inner join on exact dates, which left one observation per year (64 rows, 1960 to 2023).
2. **Split** the data into training (70%) and testing (30%).
3. **Trained** four regression models:
   - Linear Regression
   - Ridge
   - Lasso
   - Elastic Net  
   using **5-fold cross-validation**.
4. **Selected** the best model (Elastic Net) based on cross-validated RMSE and tested it on the hold-out set (test RMSE of about 5,382).
5. **Saved** the final model to disk for future use.

**Next Steps**:
- Align data frequencies properly (GDP is quarterly, the unemployment and federal funds rates are monthly, and this inflation series is annual).
- Add more features or transformations (e.g., lags, differencing).
- Conduct a more thorough hyperparameter search.
- Consider time-series-specific techniques if you need to account for autocorrelation and future predictions more rigorously.
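
As one example of the last point, scikit-learn's `TimeSeriesSplit` can replace plain 5-fold cross-validation so that each validation fold comes after its training data. A minimal sketch on synthetic, time-ordered data. The parameter grid mirrors the Ridge grid above, and everything else is illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic rows in time order, with an upward trend in the target.
rng = np.random.default_rng(3)
n = 60
X_demo = rng.normal(size=(n, 3))
y_demo = (
    X_demo @ np.array([1.5, -2.0, 0.5])
    + 0.1 * np.arange(n)
    + rng.normal(size=n)
)

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
tscv = TimeSeriesSplit(n_splits=5)

# Every split trains on earlier rows and validates on later ones.
for train_idx, val_idx in tscv.split(X_demo):
    assert train_idx.max() < val_idx.min()

grid = GridSearchCV(
    pipe,
    {"ridge__alpha": [0.01, 0.1, 1, 10, 100]},
    scoring="neg_mean_squared_error",
    cv=tscv,
)
grid.fit(X_demo, y_demo)
print("Best alpha:", grid.best_params_)
print("CV RMSE:", np.sqrt(-grid.best_score_))
```

For this to be meaningful on the real data, the rows passed to `fit` must be sorted by date, which a shuffled `train_test_split` does not preserve.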

**End of Notebook**

