# Time Series Data Validation Full Suite

## 1. Introduction

The Time Series Data Validation Demo notebook demonstrates how to apply data validation tests using the **ValidMind MRM Platform** and **Developer Framework**. Ensuring data quality and performing a robust exploratory analysis of time series data are essential for accurate model predictions and sound decision-making.

In this demo, we will walk through different **data validation suites of tests** tailored for time series data, showcasing how these tools can assist you in identifying potential issues and inconsistencies in the data. 



## 2. Setup 

Prepare the environment for our analysis. First, **import** all necessary libraries and modules required for our analysis. Next, **connect** to the ValidMind MRM platform, which provides a comprehensive suite of tools and services for model validation.

Finally, define and **configure** the specific use case we are working on by setting up any required parameters, data sources, or other settings that will be used throughout the analysis. 

### Import Libraries

In [1]:
# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env

# ValidMind libraries 
import validmind as vm
from validmind.datasets.regression import (
    identify_frequencies, 
    resample_to_common_frequency
)

### Connect to the ValidMind Library

In [2]:
vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  project = "clhhz04x40000wcy6shay2oco"
)

Connected to ValidMind. Project: Customer Churn Model - Initial Validation (clhhz04x40000wcy6shay2oco)


### Find All Test Suites and Plans Available in the Developer Framework

We can find all the **test suites** and **test plans** available in the developer framework by calling the following functions:

- All test suites: `vm.test_suites.list_suites()`
- All test plans: `vm.test_plans.list_plans()`
- Describe a test plan: `vm.test_plans.describe_plan("time_series_data_quality")`
- List all available tests: `vm.test_plans.list_tests()`

In [3]:
vm.test_suites.list_suites()

| ID | Name | Description | Test Plans |
|---|---|---|---|
| binary_classifier_full_suite | BinaryClassifierFullSuite | Full test suite for binary classification models. | tabular_dataset_description, tabular_data_quality, binary_classifier_metrics, binary_classifier_validation, binary_classifier_model_diagnosis |
| binary_classifier_model_validation | BinaryClassifierModelValidation | Test suite for binary classification models. | binary_classifier_metrics, binary_classifier_validation, binary_classifier_model_diagnosis |
| tabular_dataset | TabularDataset | Test suite for tabular datasets. | tabular_dataset_description, tabular_data_quality |
| time_series_dataset | TimeSeriesDataset | Test suite for time series datasets. | time_series_data_quality, time_series_univariate, time_series_multivariate |
| time_series_model_validation | TimeSeriesModelValidation | Test suite for time series model validation. | regression_model_performance, regression_models_comparison, time_series_forecast |


In [4]:
vm.test_plans.list_plans()

| ID | Name | Description |
|---|---|---|
| binary_classifier_metrics | BinaryClassifierMetrics | Test plan for sklearn classifier metrics |
| binary_classifier_validation | BinaryClassifierPerformance | Test plan for sklearn classifier models |
| binary_classifier_model_diagnosis | BinaryClassifierDiagnosis | Test plan for sklearn classifier model diagnosis tests |
| tabular_dataset_description | TabularDatasetDescription | Test plan to extract metadata and descriptive statistics from a tabular dataset |
| tabular_data_quality | TabularDataQuality | Test plan for data quality on tabular datasets |
| time_series_data_quality | TimeSeriesDataQuality | Test plan for data quality on time series datasets |
| time_series_univariate | TimeSeriesUnivariate | Test plan to perform time series univariate analysis. |
| time_series_multivariate | TimeSeriesMultivariate | Test plan to perform time series multivariate analysis. |
| time_series_forecast | TimeSeriesForecast | Test plan to perform time series forecast tests. |
| regression_model_performance | RegressionModelPerformance | Test plan for performance metric of regression model of statsmodels library |


## 3. Load Data

Configure your use case.

In [5]:
# from validmind.datasets.classification import lending_club as demo_dataset
from validmind.datasets.regression import fred as demo_dataset

target_column = demo_dataset.target_column
feature_columns = demo_dataset.feature_columns

# Load the full dataset (no train/test split is performed here)
df = demo_dataset.load_data()

## 4. Data Description

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3551 entries, 1947-01-01 to 2023-04-27
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MORTGAGE30US  2718 non-null   float64
 1   FEDFUNDS      825 non-null    float64
 2   GS10          841 non-null    float64
 3   UNRATE        903 non-null    float64
dtypes: float64(4)
memory usage: 138.7 KB
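
The non-null counts above differ sharply between columns, which is what one would expect when series published at different frequencies share a single date index. As a rough illustration on a small synthetic frame (the column names `weekly` and `monthly` are invented for this sketch and are not part of the demo dataset), the share of missing values per column can be summarized like this:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: a weekly and a monthly series on one shared daily index.
index = pd.date_range("2020-01-01", "2020-03-31", freq="D")
toy = pd.DataFrame(index=index, columns=["weekly", "monthly"], dtype=float)
toy.loc[index[::7], "weekly"] = 1.0               # one value every 7th day
toy.loc[index[index.day == 1], "monthly"] = 2.0   # one value on the 1st of each month

# Fraction of missing observations per column.
missing_share = toy.isna().mean()
print(missing_share.round(3))
```

A high missing share in this situation reflects the mismatch in sampling frequency rather than lost data, which is why the frequencies are handled before the missing values later in the notebook.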


## 5. Data Validation

### User Configuration of Test Suite 

Users can pass a configuration to a test suite using **`config`**, which lets them fine-tune the suite to their specific data requirements.

**Time Series Data Quality params**
- `time_series_outliers` is set to identify outliers using a specific Z-score threshold
- `time_series_missing_values` defines a minimum threshold to identify missing data points.
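
To make the z-score idea concrete, here is a minimal sketch of threshold-based outlier detection in plain pandas. It is not the ValidMind implementation of `time_series_outliers`; the helper name `zscore_outliers` and the synthetic series are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the observations whose absolute z-score exceeds `threshold`."""
    clean = series.dropna()
    z = (clean - clean.mean()) / clean.std(ddof=0)
    return clean[z.abs() > threshold]

# Bounded synthetic series with a single injected spike at position 50.
values = pd.Series(np.sin(np.arange(200)))
values.iloc[50] = 15.0

outliers = zscore_outliers(values, threshold=3)
print(outliers)
```

A threshold of 3 flags points more than three standard deviations from the mean. Note that a large spike inflates the standard deviation itself, so this simple approach can hide smaller outliers.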

**Time Series Univariate params**
- *Visualization*: `time_series_line_plot` and `time_series_histogram` are designed to generate line and histogram plots respectively for each column in a DataFrame.

- *Seasonality*:  `seasonal_decompose` and `auto_seasonality` are dedicated to analyzing the seasonal component of the time series. `seasonal_decompose` performs a seasonal decomposition of the data, while `auto_seasonality` aids in the automatic detection of seasonality.

- *Stationarity*: `rolling_stats_plot` takes `window_size`, the number of consecutive data points used to calculate the rolling mean and standard deviation. `auto_stationarity` is configured with `max_order` and `threshold`.

- *ARIMA*: `acf_pacf_plot`, `auto_ar`, and `auto_ma` are part of the ARIMA (Autoregressive Integrated Moving Average) model analysis. `acf_pacf_plot` generates autocorrelation and partial autocorrelation plots, `auto_ar` determines the order of the autoregressive part of the model, and `auto_ma` does the same for the moving average part.
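
As an illustration of what a rolling window of 12 observations computes, the sketch below uses plain pandas on a synthetic monthly series with a linear trend. It only mirrors the `window_size` idea and is not taken from the ValidMind tests.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a steady upward trend.
index = pd.date_range("2015-01-01", periods=60, freq="MS")
series = pd.Series(np.arange(60, dtype=float), index=index)

window_size = 12
rolling_mean = series.rolling(window=window_size).mean()
rolling_std = series.rolling(window=window_size).std()

# The first window_size - 1 values are NaN because the window is not yet full.
print(rolling_mean.dropna().head(3))
print(rolling_std.dropna().head(3))
```

A rolling mean that keeps drifting, as it does for this trending series, is a visual hint that the mean is not constant over time, which is one symptom of non-stationarity.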


**Time Series Multivariate params**
- *Visualization*: `scatter_plot` is used to create scatter plots for each column in the DataFrame, offering a visual tool to understand the relationship between different variables in the dataset.

- *Correlation*: `lagged_correlation_heatmap` creates a heatmap of the lagged correlation between the target column and the feature columns of the demo dataset. This provides a convenient way to examine the time-delayed correlation between different series.

- *Cointegration*: `engle_granger_coint` sets the threshold applied to the Engle-Granger cointegration test, a statistical method for detecting a long-run equilibrium relationship between two or more non-stationary time series.
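
The lagged correlation idea can be sketched with plain pandas by shifting a feature back by a number of periods before correlating it with the target. The helper `lagged_correlations` and the synthetic columns below are hypothetical and do not reproduce how `lagged_correlation_heatmap` is implemented.

```python
import numpy as np
import pandas as pd

def lagged_correlations(df, target_col, feature_col, max_lag=3):
    """Correlate the target with the feature shifted back by 0..max_lag periods."""
    return pd.Series(
        {
            lag: df[target_col].corr(df[feature_col].shift(lag))
            for lag in range(max_lag + 1)
        },
        name=f"{feature_col} -> {target_col}",
    )

# Synthetic example: the target copies the feature with a two-period delay.
rng = np.random.default_rng(42)
feature = pd.Series(rng.normal(size=200))
toy = pd.DataFrame({"feature": feature, "target": feature.shift(2)})

corrs = lagged_correlations(toy, "target", "feature", max_lag=3)
print(corrs.round(2))
```

The correlation peaks at the lag that matches the built-in delay. For cointegration, statsmodels offers an Engle-Granger test as `statsmodels.tsa.stattools.coint`, whose returned p-value could be compared against a threshold such as the 0.05 used in the configuration below.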

In [7]:
config = {
    # TIME SERIES DATA QUALITY PARAMS
    "time_series_outliers": {
        "zscore_threshold": 3,
    },
    "time_series_missing_values": {
        "min_threshold": 2,
    },

    # TIME SERIES UNIVARIATE PARAMS
    "rolling_stats_plot": {
        "window_size": 12,
    },
    "seasonal_decompose": {
        "seasonal_model": "additive",
    },
    "auto_seasonality": {
        "min_period": 1,
        "max_period": 3,
    },
    "auto_stationarity": {
        "max_order": 3,
        "threshold": 0.05,
    },
    "auto_ar": {
        "max_ar_order": 4,
    },
    "auto_ma": {
        "max_ma_order": 3,
    },

    # TIME SERIES MULTIVARIATE PARAMS
    "lagged_correlation_heatmap": {
        "target_col": demo_dataset.target_column,
        "independent_vars": demo_dataset.feature_columns,
    },
    "engle_granger_coint": {
        "threshold": 0.05,
    },
}

### Validation of Raw Dataset

#### **Run the Time Series Dataset Test Suite**

In [8]:
vm_dataset = vm.init_dataset(
    dataset=df,
    target_column=demo_dataset.target_column,
)

full_suite = vm.run_test_suite(
    "time_series_dataset",
    dataset=vm_dataset,
    config = config,
)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...



The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change tounadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
No frequency could be inferred for variable 'MORTGAGE30US'. Skipping seasonal decomposition and plots for this variable.


Frequency of FEDFUNDS: MS
Frequency of GS10: MS
Frequency of UNRATE: MS


iteritems is deprecated and will be removed in a future version. Use .items instead.
The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
Non-invertible starting MA parameters found. Using zeros as starting parameters.
No frequency information was provided, so inferred frequency MS will be used.

VBox(children=(HTML(value='<h2>Test Suite Results: <i style="color: #DE257E">Time Series Dataset</i></h2><hr>'…

### Handle Dataset Frequencies

Show the frequencies of each variable in the raw dataset.

In [9]:
frequencies = identify_frequencies(df)
display(frequencies)

| | Variable | Frequency |
|---|---|---|
| 0 | MORTGAGE30US | |
| 1 | FEDFUNDS | MS |
| 2 | GS10 | MS |
| 3 | UNRATE | MS |


Handle frequencies by resampling all variables to a common frequency.

In [10]:
preprocessed_df = resample_to_common_frequency(df, common_frequency=demo_dataset.frequency)
frequencies = identify_frequencies(preprocessed_df)
display(frequencies)

| | Variable | Frequency |
|---|---|---|
| 0 | MORTGAGE30US | MS |
| 1 | FEDFUNDS | MS |
| 2 | GS10 | MS |
| 3 | UNRATE | MS |
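
For readers who want a feel for what resampling to a common frequency involves, the following is a minimal pandas sketch on a synthetic weekly series. It is not the implementation of `resample_to_common_frequency`; in particular, aggregating with the monthly mean is an assumption made for this example, and the series name `weekly_rate` is invented.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series standing in for a higher-frequency variable.
weekly_index = pd.date_range("2021-01-07", periods=26, freq="W-THU")
weekly = pd.Series(np.linspace(2.5, 3.0, 26), index=weekly_index, name="weekly_rate")

# Aggregate to month-start ("MS") frequency by taking each month's mean.
monthly = weekly.resample("MS").mean()

# After resampling, pandas can infer a single regular frequency from the index.
print(pd.infer_freq(monthly.index))
print(monthly.head(3))
```

Other aggregations, such as the last observation in each month, are equally plausible. The right choice depends on how the variable is meant to be interpreted.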


#### **Run the Time Series Dataset Test Suite**

Run the same suite again after handling frequencies.     

In [11]:
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df,
    target_column=demo_dataset.target_column,
)

full_suite = vm.run_test_suite(
    "time_series_dataset",
    dataset=vm_dataset,
)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...



Frequency of MORTGAGE30US: MS
Frequency of FEDFUNDS: MS
Frequency of GS10: MS
Frequency of UNRATE: MS



VBox(children=(HTML(value='<h2>Test Suite Results: <i style="color: #DE257E">Time Series Dataset</i></h2><hr>'…

### Handle Missing Values

Handle the missing values by dropping all rows that contain `NaN` values.

In [12]:
preprocessed_df = preprocessed_df.dropna()
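
One side effect worth checking after `dropna()` is that removing rows can leave gaps in the date index. The sketch below, on a small synthetic frame with made-up columns `a` and `b`, counts how many rows survive and which dates disappear.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-01-01", periods=6, freq="MS")
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
        "b": [np.nan, np.nan, 0.3, 0.4, np.nan, 0.6],
    },
    index=index,
)

# dropna() keeps only rows where every column has a value.
complete = toy.dropna()
print(f"kept {len(complete)} of {len(toy)} rows")

# Dates that were present before but are missing after dropping rows.
missing_dates = index.difference(complete.index)
print(missing_dates)
```

This is one reason to re-run the frequency checks after dropping rows, as the next step does.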

#### **Run the Time Series Dataset Test Suite**

Run the same test suite to check there are no missing values and frequencies of all variables are the same.

In [13]:
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df,
    target_column=demo_dataset.target_column,
)

full_suite = vm.run_test_suite(
    "time_series_dataset",
    dataset=vm_dataset,
)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...



Frequency of MORTGAGE30US: MS
Frequency of FEDFUNDS: MS
Frequency of GS10: MS
Frequency of UNRATE: MS



VBox(children=(HTML(value='<h2>Test Suite Results: <i style="color: #DE257E">Time Series Dataset</i></h2><hr>'…

### Handle Stationarity

Address non-stationarity by taking the first difference of each series. Differencing leaves the first row as `NaN`, so that row is back-filled.

In [14]:
# First difference; back-fill the leading NaN row created by diff()
preprocessed_df = preprocessed_df.diff().bfill()
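
To see why first differencing helps, consider a random walk: its level wanders without a fixed mean, but its first difference is just the sequence of independent increments. The sketch below uses synthetic data only and is not tied to the demo dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
steps = rng.normal(size=300)
walk = pd.Series(np.cumsum(steps))   # random walk: non-stationary in levels

diffed = walk.diff()                 # first difference; the first value is NaN
filled = diffed.bfill()              # back-fill the leading NaN, as in the cell above

# Apart from the first row, differencing recovers the underlying increments.
print(np.allclose(diffed.iloc[1:], steps[1:]))
print(filled.head(3))
```

Back-filling duplicates the second observation into the first row. Dropping the first row instead would avoid that duplicate at the cost of one observation.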

#### **Run the Time Series Dataset Test Suite**

In [15]:
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df,
    target_column=demo_dataset.target_column,
)

full_suite = vm.run_test_suite(
    "time_series_dataset",
    dataset=vm_dataset,
)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...



Frequency of MORTGAGE30US: MS
Frequency of FEDFUNDS: MS
Frequency of GS10: MS
Frequency of UNRATE: MS



VBox(children=(HTML(value='<h2>Test Suite Results: <i style="color: #DE257E">Time Series Dataset</i></h2><hr>'…