# Macro to Micro Model Demo

## Introduction

#### Load Environment Variables

In [1]:
# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env

import pandas as pd

%matplotlib inline

**Connect to ValidMind Project**

In [2]:

import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "2494c3838f48efe590d531bfe225d90b",
  api_secret = "4f692f8161f128414fef542cab2a4e74834c75d01b3a8e088a1834f2afcfe838",
  project = "clk2jf1yy0005o5y6u8a30v6l"
)

2023-07-14 14:56:19,833 - INFO(validmind.api_client): Connected to ValidMind. Project: Macro-to-Micro Model - Initial Validation (clk2jf1yy0005o5y6u8a30v6l)


**Check Available Tests**

In [3]:
vm.test_plans.describe_plan("time_series_data_quality")

ID,Name,Description,Required Context,Tests
time_series_data_quality,TimeSeriesDataQuality,Test plan for data quality on time series datasets,['dataset'],TimeSeriesOutliers (ThresholdTest) TimeSeriesMissingValues (ThresholdTest) TimeSeriesFrequency (ThresholdTest)


## Data Description

#### Import Dataset

In [4]:
from validmind.datasets.regression import fred as fred

# Define target and feature columns
target_column = 'DRSFRMACBS'
feature_columns = ['GDPC1', 'CSUSHPISA', 'UNRATE', 'CPIAUCSL', 'FEDFUNDS']

# Load FRED data
df = fred.load_all_data()

# Select columns for analysis
df = df[[target_column] + feature_columns]

df.tail(10)

DATE,DRSFRMACBS,GDPC1,CSUSHPISA,UNRATE,CPIAUCSL,FEDFUNDS
2023-03-02,,,,,,
2023-03-09,,,,,,
2023-03-16,,,,,,
2023-03-23,,,,,,
2023-03-30,,,,,,
2023-04-01,,,299.715,,,
2023-04-06,,,,,,
2023-04-13,,,,,,
2023-04-20,,,,,,
2023-04-27,,,,,,


#### Missing Values

In [5]:
from validmind.vm_models.test_context import TestContext
from validmind.tests.data_validation.TimeSeriesMissingValues import TimeSeriesMissingValues

vm_df = vm.init_dataset(dataset=df)
test_context = TestContext(dataset=vm_df)

params = {"min_threshold": 2}

metric = TimeSeriesMissingValues(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

2023-07-14 14:56:19,944 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
2023-07-14 14:56:19,944 - INFO(validmind.vm_models.dataset): Inferring dataset types...


VBox(children=(HTML(value='\n            <h2>Time Series Missing Values ❌</h2>\n            <p>Test that the n…
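
As a rough illustration of what a missing-value check involves, the sketch below counts NaN entries per column with plain pandas and compares each count with a threshold. The toy frame, the `missing_value_report` helper, and the reading of `min_threshold` as "the count must stay below this value" are assumptions made for illustration; this is not ValidMind's implementation.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame, min_threshold: int) -> pd.DataFrame:
    """Count NaNs per column and flag columns whose count stays below the threshold."""
    counts = frame.isna().sum()
    return pd.DataFrame({"n_missing": counts, "passed": counts < min_threshold})


# Toy data standing in for the mixed-frequency FRED frame (hypothetical values)
toy = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

report = missing_value_report(toy, min_threshold=2)
print(report)
```

Series published at different frequencies leave many empty cells once they share one index, which is consistent with the failing result on the raw data above.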

#### Outliers 

In [6]:
from validmind.tests.data_validation.TimeSeriesOutliers import TimeSeriesOutliers

params = {"zscore_threshold": 3}

metric = TimeSeriesOutliers(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.


VBox(children=(HTML(value='\n            <h2>Time Series Outliers ❌</h2>\n            <p>Test that find outlie…
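
To make the `zscore_threshold` parameter concrete, here is a minimal z-score outlier sketch in plain pandas. The `zscore_outliers` helper and the sample values are hypothetical, and the ValidMind test may compute or report outliers differently.

```python
import pandas as pd


def zscore_outliers(series: pd.Series, zscore_threshold: float = 3.0) -> pd.Series:
    """Return the observations whose absolute z-score exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > zscore_threshold]


# Hypothetical series: 29 well-behaved points plus one spike at position 29
values = pd.Series([1.0] * 29 + [50.0])

outliers = zscore_outliers(values, zscore_threshold=3)
print(outliers)
```

Only the spike is flagged here; the remaining points sit well within three standard deviations of the mean.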

#### Frequency

In [7]:
from validmind.tests.data_validation.TimeSeriesFrequency import TimeSeriesFrequency

metric = TimeSeriesFrequency(test_context)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='\n            <h2>Time Series Frequency ❌</h2>\n            <p>Test that detect fre…
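
One way to see why mixed-frequency data can fail a frequency check is to ask pandas to infer a single frequency from the index. The sketch below uses `pd.infer_freq` on made-up dates; whether the ValidMind test relies on this function is an assumption.

```python
import pandas as pd

# Regular quarterly index: pandas can infer a single frequency
regular = pd.date_range("2015-01-01", periods=8, freq="QS")
regular_freq = pd.infer_freq(regular)

# Quarterly dates plus one stray weekly date: no single frequency fits
mixed = regular.append(pd.DatetimeIndex(["2015-01-08"])).sort_values()
mixed_freq = pd.infer_freq(mixed)

print(regular_freq, mixed_freq)
```

The regular index yields a quarter-start alias, while the mixed index yields `None`.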

## Data Preparation

In [8]:
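# Resample to quarterly frequency by averaging within each quarter
# ('QS-OCT' labels each quarter by its start date: Jan 1, Apr 1, Jul 1, Oct 1)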
df = df.resample('QS-OCT').mean()

# Remove all missing values
df = df.dropna()

# Take the first difference across all variables
df = df.diff().dropna()

# Remove data from 2020 onwards
df = df[df.index.year < 2020]
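
The effect of these steps is easier to see on a small synthetic frame. The sketch below applies the same resample, drop, and difference sequence to hypothetical daily data; the column name `x` and the date range are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering two years, with a short gap
idx = pd.date_range("2018-01-01", "2019-12-31", freq="D")
toy = pd.DataFrame({"x": np.arange(len(idx), dtype=float)}, index=idx)
toy.loc["2018-02-01":"2018-02-10", "x"] = np.nan

quarterly = toy.resample("QS-OCT").mean()  # one row per quarter, labelled at quarter start
quarterly = quarterly.dropna()             # drop quarters with no observations at all
differenced = quarterly.diff().dropna()    # first differencing loses the first row

print(quarterly.index[:4].tolist())
print(len(quarterly), len(differenced))
```

Because `mean()` skips NaN values, a quarter is dropped only when it has no observations at all, and differencing always costs one row.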

In [9]:
from validmind.vm_models.test_context import TestContext
from validmind.tests.data_validation.TimeSeriesMissingValues import TimeSeriesMissingValues

vm_df = vm.init_dataset(dataset=df)
test_context = TestContext(dataset=vm_df)

params = {"min_threshold": 2}

metric = TimeSeriesMissingValues(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

2023-07-14 14:56:21,581 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
2023-07-14 14:56:21,582 - INFO(validmind.vm_models.dataset): Inferring dataset types...


VBox(children=(HTML(value='\n            <h2>Time Series Missing Values ✅</h2>\n            <p>Test that the n…

In [10]:
from validmind.tests.data_validation.TimeSeriesOutliers import TimeSeriesOutliers

params = {"zscore_threshold": 3}

metric = TimeSeriesOutliers(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.


VBox(children=(HTML(value='\n            <h2>Time Series Outliers ❌</h2>\n            <p>Test that find outlie…

In [11]:
from validmind.tests.data_validation.TimeSeriesFrequency import TimeSeriesFrequency

metric = TimeSeriesFrequency(test_context)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='\n            <h2>Time Series Frequency ✅</h2>\n            <p>Test that detect fre…

## Data Sampling

#### Sampling Method

We split the data chronologically to create the training and test sets: observations before the split date form the training set, and observations on or after it form the test set. Keeping the temporal order preserves the serial dependence in the macroeconomic indicators and default rates, and it prevents information from the test period from leaking into training.

In [12]:
# Define the split date
split_date = '2018-01-01'

# Split data into train and test 
df_train = df.loc[df.index < split_date]
df_test = df.loc[df.index >= split_date]

# Split the train and test sets into X and y
X_train = df_train.drop(target_column, axis=1)
y_train = df_train[target_column]
X_test = df_test.drop(target_column, axis=1)
y_test = df_test[target_column]

# Concatenate X_train with y_train to form df_train
df_train = pd.concat([X_train, y_train], axis=1)

# Concatenate X_test with y_test to form df_test
df_test = pd.concat([X_test, y_test], axis=1)
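
A quick sanity check for a chronological split is to confirm that every training timestamp precedes every test timestamp and that no rows are lost. The sketch below runs that check on a hypothetical quarterly frame; the `toy_*` names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly frame standing in for the prepared data
idx = pd.date_range("2016-01-01", "2019-10-01", freq="QS")
toy = pd.DataFrame({"y": np.arange(len(idx), dtype=float)}, index=idx)

split_date = "2018-01-01"
toy_train = toy.loc[toy.index < split_date]
toy_test = toy.loc[toy.index >= split_date]

# Every training timestamp precedes every test timestamp, and no row is lost
no_leakage = toy_train.index.max() < toy_test.index.min()
complete = len(toy_train) + len(toy_test) == len(toy)

print(len(toy_train), len(toy_test), no_leakage, complete)
```

The same two checks can be applied to `df_train` and `df_test` directly.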

## Univariate Analysis

In [13]:
vm.test_plans.describe_plan("time_series_univariate")

ID,Name,Description,Required Context,Tests
time_series_univariate,TimeSeriesUnivariate,Test plan to perform time series univariate analysis.,['dataset'],TimeSeriesLinePlot (Metric) TimeSeriesHistogram (Metric) ACFandPACFPlot (Metric) SeasonalDecompose (Metric) AutoSeasonality (Metric) AutoStationarity (Metric) RollingStatsPlot (Metric) AutoAR (Metric) AutoMA (Metric)


In [14]:
from validmind.tests.data_validation.TimeSeriesLinePlot import TimeSeriesLinePlot

vm_df_train = vm.init_dataset(dataset=df_train)
test_context = TestContext(dataset=vm_df_train)

metric = TimeSeriesLinePlot(test_context)
metric.run()
# await metric.result.log()
metric.result.show()

2023-07-14 14:56:23,145 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
2023-07-14 14:56:23,145 - INFO(validmind.vm_models.dataset): Inferring dataset types...


VBox(children=(HTML(value='<p>Generates a visual analysis of time series data by plotting the raw time series.…

In [15]:
from validmind.tests.data_validation.TimeSeriesHistogram import TimeSeriesHistogram

metric = TimeSeriesHistogram(test_context)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of time series data by plotting the histogram. The i…

In [16]:
from validmind.tests.data_validation.ACFandPACFPlot import ACFandPACFPlot

metric = ACFandPACFPlot(test_context)
metric.run()
# await metric.result.log()
metric.result.show()



VBox(children=(HTML(value='<p>Plots ACF and PACF for a given time series dataset.</p>'), HTML(value='<h3>Plots…

In [17]:
from validmind.tests.data_validation.SeasonalDecompose import SeasonalDecompose

params = {"seasonal_model": 'additive'}

metric = SeasonalDecompose(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

2023-07-14 14:56:26,514 - INFO(validmind.tests.data_validation.SeasonalDecompose): Frequency of GDPC1: QS-OCT
2023-07-14 14:56:26,652 - INFO(validmind.tests.data_validation.SeasonalDecompose): Frequency of CSUSHPISA: QS-OCT
2023-07-14 14:56:26,891 - INFO(validmind.tests.data_validation.SeasonalDecompose): Frequency of UNRATE: QS-OCT
2023-07-14 14:56:27,088 - INFO(validmind.tests.data_validation.SeasonalDecompose): Frequency of CPIAUCSL: QS-OCT
2023-07-14 14:56:27,315 - INFO(validmind.tests.data_validation.SeasonalDecompose): Frequency of FEDFUNDS

VBox(children=(HTML(value='<p>Calculates seasonal_decompose metric for each of the dataset features</p>'), HTM…

In [18]:
from validmind.tests.data_validation.AutoSeasonality import AutoSeasonality

params = {"min_period": 1,
          "max_period": 3}

metric = AutoSeasonality(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Automatically detects the optimal seasonal order for a time series dataset using…

In [19]:
from validmind.tests.data_validation.AutoStationarity import AutoStationarity

params = {"max_order": 3,
          "threshold": 0.05}

metric = AutoStationarity(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Automatically detects stationarity for each time series in a DataFrame using the…

In [20]:
from validmind.tests.data_validation.RollingStatsPlot import RollingStatsPlot

params = {"window_size": 4}

metric = RollingStatsPlot(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()


No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.


VBox(children=(HTML(value='<p>This class provides a metric to visualize the stationarity of a given time serie…
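
For reference, rolling statistics with a window of four quarters can be computed directly in pandas. This is a sketch on made-up values, not the plotting code behind the metric above.

```python
import pandas as pd

# Hypothetical quarterly series
series = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=pd.date_range("2017-01-01", periods=6, freq="QS"),
)

window_size = 4  # four quarters, i.e. one year of quarterly data
rolling_mean = series.rolling(window=window_size).mean()
rolling_std = series.rolling(window=window_size).std()

print(pd.DataFrame({"mean": rolling_mean, "std": rolling_std}))
```

The first `window_size - 1` rows are NaN because a full window is not yet available. A roughly flat rolling mean and standard deviation is an informal sign of stationarity.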

In [21]:
from validmind.tests.data_validation.AutoAR import AutoAR

params = {"max_ar_order": 2}

metric = AutoAR(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()



VBox(children=(HTML(value='<p>Automatically detects the AR order of a time series using both BIC and AIC.</p>'…

In [22]:
from validmind.tests.data_validation.AutoMA import AutoMA

params = {"max_ma_order": 2}

metric = AutoMA(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

  warn('Non-invertible starting MA parameters found.'


VBox(children=(HTML(value='<p>Automatically detects the MA order of a time series using both BIC and AIC.</p>'…

## Multivariate Analysis

In [23]:
vm.test_plans.describe_plan("time_series_multivariate")

ID,Name,Description,Required Context,Tests
time_series_multivariate,TimeSeriesMultivariate,Test plan to perform time series multivariate analysis.,['dataset'],ScatterPlot (Metric) LaggedCorrelationHeatmap (Metric) EngleGrangerCoint (Metric) SpreadPlot (Metric)


In [24]:
from validmind.tests.data_validation.ScatterPlot import ScatterPlot

metric = ScatterPlot(test_context)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of data by plotting a scatter plot matrix for all co…

In [25]:
from validmind.tests.data_validation.LaggedCorrelationHeatmap import LaggedCorrelationHeatmap

params = {"target_col": target_column,
          "independent_vars": feature_columns}

metric = LaggedCorrelationHeatmap(test_context, params)
#metric.run()
# await metric.result.log()
#metric.result.show()

In [26]:
from validmind.tests.data_validation.EngleGrangerCoint import EngleGrangerCoint

params = {"threshold": 0.05}

metric = EngleGrangerCoint(test_context, params)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Test for cointegration between pairs of time series variables in a given dataset…

In [27]:
from validmind.tests.data_validation.SpreadPlot import SpreadPlot

metric = SpreadPlot(test_context)
metric.run()
# await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>This class provides a metric to visualize the spread between pairs of time serie…

## Feature Selection

## Feature Engineering

## Model Training

#### Fit Linear Regression Model

In [28]:
import statsmodels.api as sm

# Create X_train, y_train 
y_train = df_train[target_column]
X_train = df_train.drop(target_column, axis=1)

# Add constant to X_train for intercept term
X_train = sm.add_constant(X_train)
df_train = pd.concat([X_train, y_train], axis=1)

# Define the model
model = sm.OLS(y_train, X_train)

# Fit the model
model_fit = model.fit()

# Print out the statistics
print(model_fit.summary())


                            OLS Regression Results                            
Dep. Variable:             DRSFRMACBS   R-squared:                       0.570
Model:                            OLS   Adj. R-squared:                  0.549
Method:                 Least Squares   F-statistic:                     26.82
Date:                Fri, 14 Jul 2023   Prob (F-statistic):           3.65e-17
Time:                        15:11:33   Log-Likelihood:                 6.8781
No. Observations:                 107   AIC:                            -1.756
Df Residuals:                     101   BIC:                             14.28
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0929      0.043      2.181      0.0

#### Remove Non-Significant Features

In [29]:
features_to_drop = ['GDPC1', 'CPIAUCSL']
df_train.drop(columns = features_to_drop, inplace=True)

#### Update Model Fit 

In [30]:
# Create X_train and y_train
y_train = df_train[target_column]
X_train = df_train.drop(target_column, axis=1)

# Define the model
model = sm.OLS(y_train, X_train)

# Fit the model
model_fit = model.fit()

# Print out the statistics
print(model_fit.summary())

                            OLS Regression Results                            
Dep. Variable:             DRSFRMACBS   R-squared:                       0.569
Model:                            OLS   Adj. R-squared:                  0.556
Method:                 Least Squares   F-statistic:                     45.27
Date:                Fri, 14 Jul 2023   Prob (F-statistic):           9.62e-19
Time:                        15:15:12   Log-Likelihood:                 6.6709
No. Observations:                 107   AIC:                            -5.342
Df Residuals:                     103   BIC:                             5.350
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0923      0.026      3.531      0.0

#### Create ValidMind Models

In [None]:
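# Align the test set with the training design matrix:
# add the intercept column and drop the non-significant features
df_test = sm.add_constant(df_test, has_constant="add")
df_test = df_test.drop(columns=features_to_drop)

# Update VM datasets
vm_train_ds = vm.init_dataset(dataset=df_train, type="generic", target_column=target_column)
vm_test_ds = vm.init_dataset(dataset=df_test, type="generic", target_column=target_column)

# Create VM model
vm_model = vm.init_model(
    model=model_fit,
    train_ds=vm_train_ds,
    test_ds=vm_test_ds,
)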