# Time Series Data Validation Demo

## Introduction

This notebook demonstrates how to apply data validation tests to time series data using the **ValidMind MRM Platform** and **Developer Framework**. For model developers in the financial sector, the quality and reliability of time series data are essential to accurate model predictions and robust decision-making.

In this demo, we walk through the test plans tailored for time series data and show how they help you identify potential issues and inconsistencies in the data. Using the ValidMind MRM Platform and Developer Framework streamlines data validation, so you can focus on building and refining your models with confidence.

Let's get started! 

## Setup 

Prepare the environment for the analysis. First, import the required libraries and modules. Next, configure the use case by choosing the dataset and its target and feature columns. Finally, connect to the ValidMind MRM Platform, which provides a suite of tools and services for model validation.

### Import Libraries

In [1]:
# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env

# System libraries
import glob
import os
import pickle

# ML libraries
import pandas as pd

# ValidMind libraries 
import validmind as vm

### Use Case Configuration

In [2]:
from validmind.datasets.regression import fred
# Load the FRED dataset (the variable was previously misnamed `iris_df`)
fred_df = fred.load_data()

In [3]:
# Select the demo dataset: 'fred' or 'lending_club'
dataset = 'fred'

if dataset == 'lending_club':
    target_column = ['loan_rate_A']
    feature_columns = ['loan_rate_B', 'loan_rate_C', 'loan_rate_D']
    from validmind.datasets.regression import lending_club
    raw_df = lending_club.load_data()
elif dataset == 'fred':
    target_column = ['MORTGAGE30US']
    feature_columns = ['FEDFUNDS', 'GS10', 'UNRATE']
    from validmind.datasets.regression import fred
    raw_df = fred.load_data()
    # Keep only the target and feature columns
    selected_cols = target_column + feature_columns
    raw_df = raw_df[selected_cols]
else:
    raise ValueError(f"Unknown dataset: {dataset}")

### Connect to ValidMind MRM Platform

In [4]:
vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  project = "clhhz04x40000wcy6shay2oco"
)

Connected to ValidMind. Project: Customer Churn Model - Initial Validation (clhhz04x40000wcy6shay2oco)


## Data Description

In [5]:
display(raw_df)

DATE,MORTGAGE30US,FEDFUNDS,GS10,UNRATE
1947-01-01,,,,
1947-02-01,,,,
1947-03-01,,,,
1947-04-01,,,,
1947-05-01,,,,
...,...,...,...,...
2023-04-01,,,3.46,
2023-04-06,6.28,,,
2023-04-13,6.27,,,
2023-04-20,6.39,,,


In [6]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3551 entries, 1947-01-01 to 2023-04-27
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MORTGAGE30US  2718 non-null   float64
 1   FEDFUNDS      825 non-null    float64
 2   GS10          841 non-null    float64
 3   UNRATE        903 non-null    float64
dtypes: float64(4)
memory usage: 138.7 KB


## Data Preparation

### List of Available Test Plans

The `vm.test_plans.list_plans()` function lists the test plans available in the ValidMind (`vm`) library. Test plans are pre-built sets of tests for automated data and model validation, covering areas such as data quality, exploratory data analysis, and model performance.

In [7]:
vm.test_plans.list_plans()

ID,Name,Description
sklearn_classifier_metrics,SKLearnClassifierMetrics,Test plan for sklearn classifier metrics
sklearn_classifier_validation,SKLearnClassifierPerformance,Test plan for sklearn classifier models
sklearn_classifier_model_diagnosis,SKLearnClassifierDiagnosis,Test plan for sklearn classifier model diagnosis tests
sklearn_classifier,SKLearnClassifier,Test plan for sklearn classifier models that includes  both metrics and validation tests
tabular_dataset,TabularDataset,Test plan for generic tabular datasets
tabular_dataset_description,TabularDatasetDescription,Test plan to extract metadata and descriptive  statistics from a tabular dataset
tabular_data_quality,TabularDataQuality,Test plan for data quality on tabular datasets
normality_test_plan,NormalityTestPlan,Test plan to perform normality tests.
autocorrelation_test_plan,AutocorrelationTestPlan,Test plan to perform autocorrelation tests.
seasonality_test_plan,SesonalityTestPlan,Test plan to perform seasonality tests.


### Data Quality

#### Run Data Quality Test Plan

Use the ValidMind (`vm`) library to run data quality tests on a time series dataset. First, describe the `time_series_data_quality` test plan to see which tests it contains.

Next, initialize a ValidMind dataset object, `vm_dataset`, from the raw DataFrame. The test plan configuration sets the z-score threshold for outlier detection and the minimum threshold for the missing values test.

Finally, run the test plan with `vm.run_test_plan()`, passing the initialized dataset and the configuration. The function applies each test to the dataset and reports on the quality of the time series data based on the configured parameters.
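To make the `zscore_threshold` setting concrete, here is a minimal pandas sketch of z-score outlier flagging. It is an illustration under assumptions, not ValidMind's implementation: the helper name `zscore_outliers`, the use of the sample standard deviation, and the toy data are all hypothetical.

```python
import pandas as pd


def zscore_outliers(df, threshold=3.0):
    """Flag observations whose absolute z-score exceeds `threshold`.

    Hypothetical helper for illustration; ValidMind's own test may differ.
    """
    rows = []
    for column in df.columns:
        series = df[column].dropna()
        std = series.std()  # sample standard deviation (ddof=1)
        if not std > 0:
            # Skip empty, single-value, or constant series
            continue
        z = (series - series.mean()) / std
        flagged = z[z.abs() > threshold]
        for date, value in flagged.items():
            rows.append(
                {
                    "Variable": column,
                    "z-score": float(value),
                    "Threshold": threshold,
                    "Date": date,
                }
            )
    return pd.DataFrame(rows, columns=["Variable", "z-score", "Threshold", "Date"])


# Toy monthly series with one obvious spike in the final month
idx = pd.date_range("2000-01-01", periods=24, freq="MS")
demo = pd.DataFrame({"rate": [5.0] * 23 + [50.0]}, index=idx)
outliers = zscore_outliers(demo, threshold=3)
print(outliers)
```

Only the spike is flagged, because it is the only point more than three standard deviations from the series mean.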

In [8]:
vm.test_plans.describe_plan("time_series_data_quality")

Attribute,Value
ID,time_series_data_quality
Name,TimeSeriesDataQuality
Description,Test plan for data quality on time series datasets
Required Context,['dataset']
Tests,"TimeSeriesOutliers (ThresholdTest), TimeSeriesMissingValues (ThresholdTest), TimeSeriesFrequency (ThresholdTest)"
Test Plans,[]


In [9]:
vm_dataset = vm.init_dataset(
    dataset=raw_df
)

config = {
    "time_series_outliers": {
        "zscore_threshold": 3,
    },
    "time_series_missing_values": {
        "min_threshold": 2,
    },
}

vm.run_test_plan("time_series_data_quality", dataset=vm_dataset, config=config)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...


   Variable   z-score  Threshold       Date
0  FEDFUNDS  3.707038          3 1981-05-01


                                                                                                                                       

TimeSeriesDataQuality(test_context=TestContext(dataset=Dataset(raw_dataset=            MORTGAGE30US  FEDFUNDS  GS10  UNRATE
DATE                                            
1947-01-01           NaN       NaN   NaN     NaN
1947-02-01           NaN       NaN   NaN     NaN
1947-03-01           NaN       NaN   NaN     NaN
1947-04-01           NaN       NaN   NaN     NaN
1947-05-01           NaN       NaN   NaN     NaN
...                  ...       ...   ...     ...
2023-04-01           NaN       NaN  3.46     NaN
2023-04-06          6.28       NaN   NaN     NaN
2023-04-13          6.27       NaN   NaN     NaN
2023-04-20          6.39       NaN   NaN     NaN
2023-04-27          6.43       NaN   NaN     NaN

[3551 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan},

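Handle frequencies. The series in the raw data are not all recorded at the same frequency, so first identify the frequency of each series.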

In [10]:
def identify_frequencies(df):
    """
    Identify the frequency of each series in the DataFrame.

    :param df: Time-series DataFrame with a DatetimeIndex
    :return: DataFrame with two columns: 'Variable' and 'Frequency'
    """
    frequencies = []
    for column in df.columns:
        # Infer the frequency from the dates on which the series has a value
        series = df[column].dropna()
        # pd.infer_freq needs at least three timestamps
        freq = pd.infer_freq(series.index) if len(series) >= 3 else None
        if freq is None:
            label = None
        elif freq.startswith('M'):  # e.g. 'M', 'MS', 'ME'
            label = 'Monthly'
        elif freq.startswith('Q'):  # e.g. 'Q-DEC', 'QS-OCT'
            label = 'Quarterly'
        elif freq.startswith(('A', 'Y')):  # e.g. 'A-DEC', 'YE-DEC'
            label = 'Yearly'
        else:
            label = freq

        frequencies.append({'Variable': column, 'Frequency': label})

    return pd.DataFrame(frequencies)

In [11]:
frequencies = identify_frequencies(raw_df)
display(frequencies)

,Variable,Frequency
0,MORTGAGE30US,
1,FEDFUNDS,Monthly
2,GS10,Monthly
3,UNRATE,Monthly
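The empty `Frequency` cell for `MORTGAGE30US` corresponds to `pd.infer_freq` returning `None` for that series. The short sketch below, on toy indexes that are not taken from the dataset, shows the two outcomes: a regular month-start index is recognized, while an index with uneven gaps is not.

```python
import pandas as pd

# A regular month-start index: the frequency can be inferred
monthly = pd.date_range("2022-01-01", periods=6, freq="MS")
monthly_freq = pd.infer_freq(monthly)

# Toy dates with uneven gaps (5, 7, and 19 days): no single frequency fits
irregular = pd.DatetimeIndex(
    ["2022-01-01", "2022-01-06", "2022-01-13", "2022-02-01"]
)
irregular_freq = pd.infer_freq(irregular)

print(monthly_freq)    # 'MS'
print(irregular_freq)  # None
```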


Resample all series to a common month-start (`MS`) frequency, keeping the last observation in each month.

In [12]:
preprocessed_df = raw_df.resample('MS').last()
frequencies = identify_frequencies(preprocessed_df)
display(frequencies)

,Variable,Frequency
0,MORTGAGE30US,Monthly
1,FEDFUNDS,Monthly
2,GS10,Monthly
3,UNRATE,Monthly
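A small sketch of what `resample('MS').last()` does when series with different frequencies share one index. The column names and values are toy data chosen for illustration. Each calendar month becomes one row labeled with the first day of the month, and `last()` takes the last non-missing value in that month for each column.

```python
import numpy as np
import pandas as pd

# Toy frame: one column observed mid-month, one observed on the 1st only
idx = pd.to_datetime(
    ["2023-03-01", "2023-03-09", "2023-03-30", "2023-04-01", "2023-04-06"]
)
toy = pd.DataFrame(
    {
        "weekly": [np.nan, 1.0, 2.0, np.nan, 3.0],
        "monthly": [10.0, np.nan, np.nan, 20.0, np.nan],
    },
    index=idx,
)

# One row per month, holding the last non-missing value of each column
monthly = toy.resample("MS").last()
print(monthly)
```

After resampling, both columns have a value in every month, so the frame has a single regular monthly frequency.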


Run the data quality test plan again on the resampled data.

In [13]:
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df
)
vm.run_test_plan("time_series_data_quality", dataset=vm_dataset, config=config)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...


        Variable   z-score  Threshold       Date
0       FEDFUNDS  3.106442          3 1980-03-01
1       FEDFUNDS  3.212296          3 1980-04-01
2       FEDFUNDS  3.537417          3 1980-12-01
3       FEDFUNDS  3.582783          3 1981-01-01
4       FEDFUNDS  3.441645          3 1981-05-01
5       FEDFUNDS  3.587823          3 1981-06-01
6       FEDFUNDS  3.572701          3 1981-07-01
7       FEDFUNDS  3.265222          3 1981-08-01
8   MORTGAGE30US  3.246766          3 1981-09-01
9   MORTGAGE30US  3.271251          3 1981-10-01
10  MORTGAGE30US  3.011098          3 1982-01-01
11        UNRATE  5.011303          3 2020-04-01
12        UNRATE  4.128421          3 2020-05-01


                                                                                                                                       

TimeSeriesDataQuality(test_context=TestContext(dataset=Dataset(raw_dataset=            MORTGAGE30US  FEDFUNDS  GS10  UNRATE
DATE                                            
1947-01-01           NaN       NaN   NaN     NaN
1947-02-01           NaN       NaN   NaN     NaN
1947-03-01           NaN       NaN   NaN     NaN
1947-04-01           NaN       NaN   NaN     NaN
1947-05-01           NaN       NaN   NaN     NaN
...                  ...       ...   ...     ...
2022-12-01          6.42      4.10  3.62     3.5
2023-01-01          6.13      4.33  3.53     3.4
2023-02-01          6.50      4.57  3.75     3.6
2023-03-01          6.32      4.65  3.66     3.5
2023-04-01          6.43       NaN  3.46     NaN

[916 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, 

Remove rows with missing values, so that every remaining date has an observation for all four series.

In [14]:
preprocessed_df = preprocessed_df.dropna()

Run the data quality test plan once more on the cleaned data, this time specifying the target column.

In [15]:
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df,
    target_column=target_column
)
vm.run_test_plan("time_series_data_quality", dataset=vm_dataset, config=config)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...


        Variable   z-score  Threshold       Date
0       FEDFUNDS  3.106442          3 1980-03-01
1       FEDFUNDS  3.212296          3 1980-04-01
2       FEDFUNDS  3.537417          3 1980-12-01
3       FEDFUNDS  3.582783          3 1981-01-01
4       FEDFUNDS  3.441645          3 1981-05-01
5       FEDFUNDS  3.587823          3 1981-06-01
6       FEDFUNDS  3.572701          3 1981-07-01
7       FEDFUNDS  3.265222          3 1981-08-01
8   MORTGAGE30US  3.246766          3 1981-09-01
9   MORTGAGE30US  3.271251          3 1981-10-01
10  MORTGAGE30US  3.011098          3 1982-01-01
11        UNRATE  5.011303          3 2020-04-01
12        UNRATE  4.128421          3 2020-05-01


                                                                                                                                       

TimeSeriesDataQuality(test_context=TestContext(dataset=Dataset(raw_dataset=            MORTGAGE30US  FEDFUNDS  GS10  UNRATE
DATE                                            
1971-04-01          7.29      4.16  5.83     5.9
1971-05-01          7.46      4.63  6.39     5.9
1971-06-01          7.54      4.91  6.52     5.9
1971-07-01          7.69      5.31  6.73     6.0
1971-08-01          7.69      5.57  6.58     6.1
...                  ...       ...   ...     ...
2022-11-01          6.58      3.78  3.89     3.6
2022-12-01          6.42      4.10  3.62     3.5
2023-01-01          6.13      4.33  3.53     3.4
2023-02-01          6.50      4.57  3.75     3.6
2023-03-01          6.32      4.65  3.66     3.5

[624 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': 7.29, 'FEDFUNDS': 4.16, 'GS10': 5.83, 'UNRATE': 5.9

## Exploratory Data Analysis

### Univariate Analysis

#### Run Time Series Univariate Test Plan

In [16]:
vm.test_plans.describe_plan("time_series_univariate")

Attribute,Value
ID,time_series_univariate
Name,TimeSeriesUnivariate
Description,Test plan to perform time series univariate analysis.
Required Context,['dataset']
Tests,"TimeSeriesLinePlot (Metric), TimeSeriesHistogram (Metric), ACFandPACFPlot (Metric), SeasonalDecompose (Metric), AutoSeasonality (Metric), AutoStationarity (Metric), RollingStatsPlot (Metric), AutoAR (Metric), AutoMA (Metric)"
Test Plans,[]


In [17]:
test_plan_config = {
    "time_series_line_plot": {
        "columns": target_column + feature_columns
    },
    "time_series_histogram": {
        "columns": target_column + feature_columns
    },
    "acf_pacf_plot": {
        "columns": target_column + feature_columns
    },
    "auto_ar": {
        "max_ar_order": 3
    },
    "auto_ma": {
        "max_ma_order": 3
    },
    "seasonal_decompose": {
        "seasonal_model": 'additive',
        "fig_size": (40, 30)
    },
    "auto_seasonality": {
        "min_period": 1,
        "max_period": 3
    },
    "auto_stationarity": {
        "max_order": 3,
        "threshold": 0.05
    },
    "rolling_stats_plot": {
        "window_size": 12
    },
}

vm_dataset = vm.init_dataset(
    dataset=preprocessed_df
)
vm.run_test_plan("time_series_univariate", config=test_plan_config, dataset=vm_dataset)

Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...


The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change tounadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
Non-invertible starting MA parameters found. Using zeros as starting parameters.

TimeSeriesUnivariate(test_context=TestContext(dataset=Dataset(raw_dataset=            MORTGAGE30US  FEDFUNDS  GS10  UNRATE
DATE                                            
1971-04-01          7.29      4.16  5.83     5.9
1971-05-01          7.46      4.63  6.39     5.9
1971-06-01          7.54      4.91  6.52     5.9
1971-07-01          7.69      5.31  6.73     6.0
1971-08-01          7.69      5.57  6.58     6.1
...                  ...       ...   ...     ...
2022-11-01          6.58      3.78  3.89     3.6
2022-12-01          6.42      4.10  3.62     3.5
2023-01-01          6.13      4.33  3.53     3.4
2023-02-01          6.50      4.57  3.75     3.6
2023-03-01          6.32      4.65  3.66     3.5

[624 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': 7.29, 'FEDFUNDS': 4.16, 'GS10': 5.83, 'UNRATE': 5.9}
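The `rolling_stats_plot` entry in the configuration sets `window_size` to 12, which on monthly data is a one-year window. The sketch below shows the kind of rolling mean and standard deviation such a plot would typically draw. It uses plain pandas on a toy series and is an assumption about the computation, not ValidMind's implementation.

```python
import numpy as np
import pandas as pd

# Toy monthly series: a straight upward trend over three years
idx = pd.date_range("2000-01-01", periods=36, freq="MS")
series = pd.Series(np.arange(36, dtype=float), index=idx, name="toy")

window_size = 12  # same value as in the configuration above

# Each point summarizes the current month and the 11 months before it;
# the first window_size - 1 points are NaN because the window is incomplete
rolling_mean = series.rolling(window=window_size).mean()
rolling_std = series.rolling(window=window_size).std()

print(rolling_mean.dropna().head())
print(rolling_std.dropna().head())
```

A rolling mean that drifts over time, as it does for this trending toy series, is a visual hint that the series is not stationary.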

### Multivariate Analysis

#### Run Time Series Multivariate Test Plan

In [18]:
vm.test_plans.describe_plan("time_series_multivariate")

Attribute,Value
ID,time_series_multivariate
Name,TimeSeriesMultivariate
Description,Test plan to perform time series multivariate analysis.
Required Context,['dataset']
Tests,"ScatterPlot (Metric), LaggedCorrelationHeatmap (Metric), SpreadPlot (Metric)"
Test Plans,[]


In [19]:
test_plan_config = {
    "scatter_plot": {
        "columns": target_column + feature_columns
    },
    "lagged_correlation_heatmap": {
        "target_col": target_column,
        "independent_vars": feature_columns
    },
    "engle_granger_coint": {
        "threshold": 0.05
    },
}

vm.run_test_plan("time_series_multivariate", config=test_plan_config, dataset=vm_dataset)

                                                                                                                                  

TimeSeriesMultivariate(test_context=TestContext(dataset=Dataset(raw_dataset=            MORTGAGE30US  FEDFUNDS  GS10  UNRATE
DATE                                            
1971-04-01          7.29      4.16  5.83     5.9
1971-05-01          7.46      4.63  6.39     5.9
1971-06-01          7.54      4.91  6.52     5.9
1971-07-01          7.69      5.31  6.73     6.0
1971-08-01          7.69      5.57  6.58     6.1
...                  ...       ...   ...     ...
2022-11-01          6.58      3.78  3.89     3.6
2022-12-01          6.42      4.10  3.62     3.5
2023-01-01          6.13      4.33  3.53     3.4
2023-02-01          6.50      4.57  3.75     3.6
2023-03-01          6.32      4.65  3.66     3.5

[624 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': 7.29, 'FEDFUNDS': 4.16, 'GS10': 5.83, 'UNRATE': 5.
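The `lagged_correlation_heatmap` configuration names a target column and a list of independent variables. As a rough sketch of what a lagged correlation table contains, the code below correlates a target with each variable shifted by 0 to `max_lag` periods. The helper `lagged_correlations`, its `max_lag` parameter, and the toy data are hypothetical and not part of ValidMind; the helper also takes the target as a single column name rather than a list.

```python
import numpy as np
import pandas as pd


def lagged_correlations(df, target_col, independent_vars, max_lag=3):
    """Correlate the target with each variable shifted by 0..max_lag periods.

    Hypothetical helper for illustration only.
    Returns a DataFrame with one row per variable and one column per lag.
    """
    table = {}
    for var in independent_vars:
        table[var] = [
            df[target_col].corr(df[var].shift(lag)) for lag in range(max_lag + 1)
        ]
    return pd.DataFrame(table, index=range(max_lag + 1)).T


# Toy data: y is an exact copy of x delayed by two periods
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=50))
toy = pd.DataFrame({"y": x.shift(2), "x": x})

table = lagged_correlations(toy, target_col="y", independent_vars=["x"])
print(table)
```

In this toy case the correlation peaks at lag 2, which is the delay built into the data. On real series the peak lag suggests how far a feature leads the target.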