In [None]:
%matplotlib inline

Multivariate Drift {#plot_tabular_multivariate_drift}
==================

This notebook provides an overview of how to use and understand the
multivariate drift check.

**Structure:**

-   [What Is Multivariate Drift?](#what-is-multivariate-drift)
-   [Loading the Data](#loading-the-data)
-   [Run the Check](#run-the-check)
-   [Define a Condition](#define-a-condition)

What Is Multivariate Drift?
---------------------------

Drift is a change in the distribution of data over time, and it is one
of the top reasons why a machine learning model's performance degrades
over time.

A multivariate drift is a drift that occurs in more than one feature at
a time and may even affect the relationships between those features.
Changes in those relationships are undetectable by univariate drift
methods. The multivariate drift check tries to detect multivariate drift
between the two input datasets.

For more information on drift, please visit our
`drift guide </user-guide/general/drift_guide>`{.interpreted-text
role="doc"}.

### How Deepchecks Detects Dataset Drift

This check detects multivariate drift by using
`a domain classifier <drift_detection_by_domain_classifier>`{.interpreted-text
role="ref"}. Other methods to detect drift include
`univariate measures <drift_detection_by_univariate_measure>`{.interpreted-text
role="ref"}, which are used in other checks, such as the
`Train Test Feature Drift check </checks_gallery/tabular/train_test_validation/plot_train_test_feature_drift>`{.interpreted-text
role="doc"}.
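
To make the domain classifier idea concrete, here is a minimal sketch. It is not Deepchecks' actual implementation: it assumes scikit-learn and synthetic data, the helper name `domain_classifier_drift` is hypothetical, and the rescaling `max(2 * AUC - 1, 0)` from AUC to a drift score is an assumption made for illustration. The two samples share identical marginal distributions and differ only in the correlation between their features, which is the kind of change a univariate measure cannot see.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def domain_classifier_drift(train: np.ndarray, test: np.ndarray, seed: int = 0) -> float:
    """Hypothetical helper: score how separable `train` and `test` are."""
    # Stack both samples and label each row by its origin (0 = train, 1 = test).
    X = np.vstack([train, test])
    y = np.concatenate([np.zeros(len(train)), np.ones(len(test))])

    # Out-of-fold probabilities avoid rewarding a classifier that memorises rows.
    clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=seed)
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

    # An AUC of 0.5 means the two domains are indistinguishable; rescale to [0, 1].
    auc = roc_auc_score(y, proba)
    return max(2 * auc - 1, 0.0)


rng = np.random.default_rng(0)
independent = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.9], [0.9, 1.0]]

# Each feature is N(0, 1) in every sample; only the correlation differs.
reference = rng.multivariate_normal([0, 0], independent, size=2000)
same_dist = rng.multivariate_normal([0, 0], independent, size=2000)
shifted = rng.multivariate_normal([0, 0], correlated, size=2000)

no_drift_score = domain_classifier_drift(reference, same_dist)
drift_score = domain_classifier_drift(reference, shifted)
print(f"no drift: {no_drift_score:.2f}, correlation drift: {drift_score:.2f}")
```

The first score should stay close to zero, while the second should be clearly larger even though no single feature changed on its own.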


Loading the Data
================

The dataset is the adult dataset, which can be downloaded from the UCI
Machine Learning Repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
\[<http://archive.ics.uci.edu/ml>\]. Irvine, CA: University of
California, School of Information and Computer Science.


In [8]:
# from urllib.request import urlopen

from pathlib import Path
import shutil
from typing import List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, Normalizer

from deepchecks.tabular import Dataset
from deepchecks.tabular.datasets.classification import adult

In [17]:

N_CHUNKS = 20
# COLORS = ["red", "blue"]
# COLORS = ["red", "blue", "green", "yellow", "black", "orange", "purple", "pink", "brown", "gray"]
COLORS = sns.color_palette("hls", N_CHUNKS)

COL_1 = 4  # prev_plan_cpu_1
COL_1_STR = "prev_plan_cpu_1"
COL_2 = 8  # prev_plan_mem_1
COL_2_STR = "prev_plan_mem_1"

# COL_2 = 12  # prev_instance_num_1
# COL_2_STR = "prev_instance_num_1"

non_feature_columns = [
    "name",
    # "task_type",
    "status",
    "start_time",
    "end_time",
    # "instance_num",
    # "plan_cpu",
    # "plan_mem",
    "instance_name",
    "instance_name.1",
    "instance_start_time",
    "instance_end_time",
    "machine_id",
    "seq_no",
    "total_seq_no",
    # "instance_name",
    "cpu_avg",
    # "cpu_max",
    "mem_avg",
    "mem_max",
]


In [18]:

def create_output_dir(path: str, clean: bool = True) -> Path:
    output_dir = Path(path)

    if clean and output_dir.exists() and output_dir.is_dir():
        shutil.rmtree(output_dir)

    # output_dir.unlink(missing_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir

Create Dataset
==============


In [19]:
def load_raw_data(path: Path | str):
    raw_data = pd.read_csv(path)

    raw_data = (
        raw_data.sort_values(by=["instance_start_time"])
        .drop(columns=non_feature_columns)
        .dropna()
    )

    raw_data = raw_data[
        (raw_data.plan_cpu > 0) & (raw_data.plan_mem > 0)
    ]

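    # append_prev_feature is not defined in this notebook and must be provided
    # beforehand; it is assumed to add 4 lagged columns per feature in place.
    append_prev_feature(raw_data, 4, "plan_cpu")
    append_prev_feature(raw_data, 4, "plan_mem")
    append_prev_feature(raw_data, 4, "instance_num")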

    raw_data = raw_data.dropna()


    cpu_max_pred = pd.cut(
        raw_data.cpu_max,
        bins=4,
        labels=[0, 1, 2, 3],
    )
    raw_data = raw_data.assign(
        cpu_max_pred=cpu_max_pred,
    )

    # cpu_max_pred (the binned cpu_max label) is now a regular column of raw_data.

    # scaler = StandardScaler()
    scaler = Normalizer()
    raw_data[raw_data.columns] = scaler.fit_transform(
        raw_data[raw_data.columns]
    )
    return raw_data


def generate_data_chunks(
    path: Path | str, out_path: Path | str
) -> List[pd.DataFrame]:
    path = Path(path)
    out_path = create_output_dir(out_path, clean=False)

    subsets: List[pd.DataFrame] = []
    for i in range(N_CHUNKS):
        if (out_path / f"chunk-{i}.csv").exists():
            print(f"Chunk {i} already exists, skip generating...")
            subsets.append(pd.read_csv(out_path / f"chunk-{i}.csv"))

    if len(subsets) == N_CHUNKS:
        return subsets

    # Only part of the cache exists: discard it so chunks are not duplicated below.
    subsets = []
    print("Generating data chunks...")

    raw_data = load_raw_data(path)

    size = len(raw_data)
    split_size = size // N_CHUNKS

    for i in range(N_CHUNKS):
        if i == N_CHUNKS - 1:
            data = raw_data.iloc[i * split_size :]
        else:
            data = raw_data.iloc[
                i * split_size : (i + 1) * split_size
            ]
        data.to_csv(out_path / f"chunk-{i}.csv", index=False)
        subsets.append(data)

    return subsets
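
The cell above calls `append_prev_feature`, which is not defined anywhere in this notebook. Judging only from the column names used earlier (`prev_plan_cpu_1`, `prev_plan_mem_1`), it plausibly adds lagged copies of a column in place. The following is a hypothetical stand-in written under that assumption, not the original helper:

```python
import pandas as pd


def append_prev_feature(df: pd.DataFrame, n_prev: int, column: str) -> None:
    """Hypothetical: add `prev_<column>_<k>` lag columns to `df` in place."""
    for k in range(1, n_prev + 1):
        # shift(k) moves values down by k rows, so the first k rows become NaN.
        df[f"prev_{column}_{k}"] = df[column].shift(k)


demo = pd.DataFrame({"plan_cpu": [100, 200, 300, 400]})
append_prev_feature(demo, 2, "plan_cpu")
print(demo)
```

Under this reading, the rows that lack a full history contain `NaN` in the lag columns, which would explain the second `dropna()` call in `load_raw_data`.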

In [12]:
subsets = generate_data_chunks(
    "/lcrc/project/FastBayes/rayandrew/trace-utils/generated-task/chunk-0.csv",
    "./cov-shift/alibaba/chunks-norm-new",
)

Chunk 0 already exists, skip generating...
Chunk 1 already exists, skip generating...
Chunk 2 already exists, skip generating...
Chunk 3 already exists, skip generating...
Chunk 4 already exists, skip generating...
Chunk 5 already exists, skip generating...
Chunk 6 already exists, skip generating...
Chunk 7 already exists, skip generating...
Chunk 8 already exists, skip generating...
Chunk 9 already exists, skip generating...
Chunk 10 already exists, skip generating...
Chunk 11 already exists, skip generating...
Chunk 12 already exists, skip generating...
Chunk 13 already exists, skip generating...
Chunk 14 already exists, skip generating...
Chunk 15 already exists, skip generating...
Chunk 16 already exists, skip generating...
Chunk 17 already exists, skip generating...
Chunk 18 already exists, skip generating...
Chunk 19 already exists, skip generating...


In [None]:
# label_name = 'income'
# train_ds, test_ds = adult.load_data()
# encoder = LabelEncoder()
# train_ds.data[label_name] = encoder.fit_transform(train_ds.data[label_name])
# test_ds.data[label_name] = encoder.transform(test_ds.data[label_name])

In [None]:
# train_ds.label_name

Run the Check
=============


In [14]:
# from deepchecks.tabular.checks import MultivariateDrift

# check = MultivariateDrift()
# check.run(train_dataset=subsets[0], test_dataset=subsets[1])

We can see that there is almost no drift found between the train and the
test set of the raw adult dataset. In addition to the drift score, the
check displays the top features that contributed to the data drift.

Introduce Drift to the Dataset
==============================

Now, let's add drift to the data manually by sampling a biased portion
of the training data.
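
A minimal sketch of biased sampling on synthetic data follows. The frame and its column names (`group`, `value`) are made up for illustration. Half of the training sample is drawn only from one category, which shifts that category's share relative to an unbiased test sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population: `group` is roughly balanced, `value` is unrelated noise.
population = pd.DataFrame({
    "group": rng.choice(["a", "b"], size=20_000),
    "value": rng.normal(size=20_000),
})

# Unbiased test sample.
test_df = population.sample(5_000, random_state=0)

# Biased train sample: half drawn at random, half drawn only from group "a".
train_df = pd.concat([
    population.sample(2_500, random_state=1),
    population[population["group"] == "a"].sample(2_500, random_state=1),
])

train_share = (train_df["group"] == "a").mean()
test_share = (test_df["group"] == "a").mean()
print(f"share of group 'a': train={train_share:.2f}, test={test_share:.2f}")
```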


In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from deepchecks.tabular import Dataset

In [16]:
model = Pipeline([
    # ('handle_cat', ColumnTransformer(
    #     transformers=[
    #         ('num', 'passthrough',
    #          ['numeric_with_drift', 'numeric_without_drift']),
    #         ('cat',
    #          Pipeline([
    #              ('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    #          ]),
    #          ['categorical_with_drift', 'categorical_without_drift'])
    #     ]
    # )),
    ('model', DecisionTreeClassifier(random_state=0, max_depth=2)),
])

In [None]:
# sample_size = 10000
# random_seed = 0

In [None]:
# train_drifted_df = pd.concat([train_ds.data.sample(min(sample_size, train_ds.n_samples) - 5000, random_state=random_seed), 
#                              train_ds.data[train_ds.data['sex'] == ' Female'].sample(5000, random_state=random_seed)])
# test_drifted_df = test_ds.data.sample(min(sample_size, test_ds.n_samples), random_state=random_seed)

# train_drifted_ds = Dataset(train_drifted_df, label=label_name, cat_features=train_ds.cat_features)
# test_drifted_ds = Dataset(test_drifted_df, label=label_name, cat_features=test_ds.cat_features)

In [None]:
# check = MultivariateDrift()
# check.run(train_dataset=train_drifted_ds, test_dataset=test_drifted_ds)

As expected, the check detects a multivariate drift between the train
and the test sets. It also displays the distribution of the sex feature,
which contributed the most to that drift. This is reasonable since the
sampling was biased on that feature.

Define a Condition
==================

Now, we define a condition that requires the multivariate drift score to
be below 0.1. A condition is Deepchecks' way to validate model and data
quality, and it lets you know if anything goes wrong.
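
Conceptually, a condition of this kind is a pass/fail rule applied to a check's result. The sketch below is plain Python rather than Deepchecks' internals; `ConditionOutcome` and `drift_less_than` are hypothetical names used only to illustrate a threshold rule on a drift score.

```python
from typing import Callable, NamedTuple


class ConditionOutcome(NamedTuple):
    passed: bool
    details: str


def drift_less_than(max_drift: float) -> Callable[[float], ConditionOutcome]:
    """Hypothetical condition factory: pass only if the score is below `max_drift`."""

    def condition(drift_score: float) -> ConditionOutcome:
        passed = drift_score < max_drift
        verdict = "below" if passed else "not below"
        details = f"drift score {drift_score:.2f} is {verdict} {max_drift}"
        return ConditionOutcome(passed, details)

    return condition


condition = drift_less_than(0.1)
low = condition(0.03)
high = condition(0.42)
print(low)
print(high)
```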


In [None]:
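from deepchecks.tabular.checks import MultivariateDrift

# train_drifted_ds and test_drifted_ds are built in the commented-out cell above.
check = MultivariateDrift()
check.add_condition_overall_drift_value_less_than(0.1)
check.run(train_dataset=train_drifted_ds, test_dataset=test_drifted_ds)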

As we can see, the condition correctly detects that the drift score is
above the defined threshold.
