![](https://raw.githubusercontent.com/wandb/wandb/508982e50e82c54cbf0dd464a9959fee0e1740ad/.github/wb-logo-lightbg.png)
<!--- @wandbcode{dataval-course-02} -->

In [None]:
from dataval.dataset import WeatherDataset
from dataval.train import CatBoostTrainer
from dataval import dataset_extensions

import os
import matplotlib.pyplot as plt
import pandas as pd

import wandb

os.environ["WANDB_QUIET"] = "true" # Let's keep the output clean

# Let's start a new W&B run to track our work
run = wandb.init(project="ml-dataval-tutorial")

# Explore Corruptions

We will introduce several kinds of corruption into our data, at varying corruption rates. The purpose of this exercise is to see that corruptions vary in:

* How hard they are to catch
* How much they actually impact downstream accuracy

We will use the continual training pipeline from the first notebook, and the corruption utility functions in `dataval/dataset.py`.

In [None]:
# Load dataset

ds = WeatherDataset(os.path.join(os.getcwd(), "canonical-paritioned-dataset"), sample_frac=0.2)

## Obvious Corruptions

Data is obviously corrupted if it:

* Is denoted with a missing value (i.e., NaN)
* Violates nonnegativity constraints (e.g., a negative pressure value)
* Fails a type check

We will corrupt columns in the same sensor group.
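
To make the three checks above concrete, here is a minimal sketch of how they could be expressed with pandas. The function `find_obvious_corruptions` and the toy column names are hypothetical and are not part of the `dataval` package.

```python
import numpy as np
import pandas as pd


def find_obvious_corruptions(df, nonnegative_cols):
    """Hypothetical helper: mark cells that fail an obvious validity check."""
    flagged = pd.DataFrame(False, index=df.index, columns=df.columns)
    for col in df.columns:
        # Type check: values that cannot be parsed as numbers become NaN,
        # so missing values and non-numeric values are flagged together.
        numeric = pd.to_numeric(df[col], errors="coerce")
        flagged[col] = numeric.isna()
        if col in nonnegative_cols:
            # Nonnegativity check: comparisons with NaN evaluate to False.
            flagged[col] = flagged[col] | (numeric < 0)
    return flagged


toy_df = pd.DataFrame(
    {
        "pressure": [1013.2, -5.0, np.nan, 1009.8],
        "humidity": [0.41, 0.39, "n/a", 0.44],
    }
)
flags = find_obvious_corruptions(toy_df, nonnegative_cols=["pressure", "humidity"])
print(flags)
```

Each check needs only the schema (types and sign constraints), which is what makes these corruptions "obvious".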

In [None]:
train_df = ds.load(ds.get_partition_keys()[0])
test_df = ds.load(ds.get_partition_keys()[1])

In [None]:
# Establish baseline MSE with clean train and test data

t, _, _ = ds.train_and_test(train_df, test_df)

print()
print(t.get_feature_importance().head(5))

### Missing Value Corruption

First, we corrupt just 5% of the test data for the `cmc` sensor group. Note how much worse the test performance is!
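
As a rough illustration of what a missing-value corruption at a given rate can look like, the sketch below nulls out a group of columns in a random 5% of rows. `inject_nulls` and the toy columns are hypothetical; the actual `ds.corrupt_null` implementation may choose rows and columns differently.

```python
import numpy as np
import pandas as pd


def inject_nulls(df, columns, corruption_rate, seed=0):
    """Hypothetical helper: set `columns` to NaN in a random fraction of rows."""
    rng = np.random.default_rng(seed)
    corrupted = df.copy()
    n_rows = int(round(len(df) * corruption_rate))
    # Sample row positions without replacement.
    rows = rng.choice(len(df), size=n_rows, replace=False)
    col_positions = [df.columns.get_loc(c) for c in columns]
    corrupted.iloc[rows, col_positions] = np.nan
    return corrupted


toy_df = pd.DataFrame(
    {
        "cmc_a": np.arange(100.0),
        "cmc_b": np.arange(100.0),
        "other": np.arange(100.0),
    }
)
null_df = inject_nulls(toy_df, ["cmc_a", "cmc_b"], corruption_rate=0.05)
# Fraction of missing values per column
print(null_df.isna().mean())
```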

In [None]:
corruption_results = []
corruption_results_by_feature = []

In [None]:
corrupted_test_df, _ = ds.corrupt_null(test_df, "cmc", corruption_rate=0.05)

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results.append({"name": "missing_value_0.05", "train_mse": train_mse, "test_mse": test_mse})

In [None]:
corrupted_test_df, _ = ds.corrupt_null_by_feature(test_df, "cmc", corruption_rate=0.05)

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results_by_feature.append({"name": "missing_value_0.05", "train_mse": train_mse, "test_mse": test_mse})

It gets even worse when increasing the corruption rate to 20%.

In [None]:
corrupted_test_df, _ = ds.corrupt_null(test_df, "cmc", corruption_rate=0.20)

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results.append({"name": "missing_value_0.2", "train_mse": train_mse, "test_mse": test_mse})

In [None]:
corrupted_test_df, _ = ds.corrupt_null_by_feature(test_df, "cmc", corruption_rate=0.20)

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results_by_feature.append({"name": "missing_value_0.2", "train_mse": train_mse, "test_mse": test_mse})

Instead of corrupting the test data, we can corrupt the train data. Note how the feature importance values change, and how the test performance is still worse than when training on clean data!

In [None]:
corrupted_train_df, _ = ds.corrupt_null_by_feature(train_df, "cmc", corruption_rate=0.2)

t, train_mse, test_mse = ds.train_and_test(corrupted_train_df, test_df)
print()
print(t.get_feature_importance().head(5))

### Violating Nonnegativity

We corrupt 5% of the test data for the `cmc` sensor group. Test performance is similarly bad.
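
A nonnegativity violation can be simulated by flipping the sign of values in a sample of rows. The sketch below is one way to do that under our own assumptions; `inject_negatives` is a hypothetical name, and `ds.corrupt_nonnegative` may behave differently.

```python
import numpy as np
import pandas as pd


def inject_negatives(df, columns, corruption_rate, seed=0):
    """Hypothetical helper: force `columns` negative in a random fraction of rows."""
    corrupted = df.copy()
    rows = corrupted.sample(frac=corruption_rate, random_state=seed).index
    # Adding 1 before negating keeps zeros from staying nonnegative.
    corrupted.loc[rows, columns] = -(corrupted.loc[rows, columns].abs() + 1.0)
    return corrupted


toy_df = pd.DataFrame(
    {
        "cmc_a": np.arange(100.0),
        "cmc_b": np.arange(100.0),
        "other": np.arange(100.0),
    }
)
negative_df = inject_negatives(toy_df, ["cmc_a", "cmc_b"], corruption_rate=0.05)
# A single comparison is enough to find the corrupted rows.
print(negative_df[(negative_df[["cmc_a", "cmc_b"]] < 0).any(axis=1)])
```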

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_nonnegative(test_df, "cmc", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results.append({"name": "violate_nonnegative", "train_mse": train_mse, "test_mse": test_mse})

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_nonnegative_by_feature(test_df, "cmc", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results_by_feature.append({"name": "violate_nonnegative", "train_mse": train_mse, "test_mse": test_mse})

### Violating Type Checks

We corrupt 5% of the test data for the `cmc` sensor group. Test performance is not as bad.
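
One way a type-check violation can arise is when some numeric values arrive as strings. The sketch below simulates that; `inject_type_errors` is a hypothetical helper, and we do not know which type change `ds.corrupt_typecheck` actually applies.

```python
import numpy as np
import pandas as pd


def inject_type_errors(df, columns, corruption_rate, seed=0):
    """Hypothetical helper: store values as strings in a random fraction of rows."""
    corrupted = df.copy()
    rows = corrupted.sample(frac=corruption_rate, random_state=seed).index
    for col in columns:
        # Cast to object first so one column can hold floats and strings.
        as_object = corrupted[col].astype(object)
        as_object.loc[rows] = [str(v) for v in as_object.loc[rows]]
        corrupted[col] = as_object
    return corrupted


toy_df = pd.DataFrame({"cmc_a": np.arange(100.0), "other": np.arange(100.0)})
typed_df = inject_type_errors(toy_df, ["cmc_a"], corruption_rate=0.05)
# The corrupted column no longer has a numeric dtype.
print(typed_df.dtypes)
n_strings = sum(isinstance(v, str) for v in typed_df["cmc_a"])
print(f"String-typed values in cmc_a: {n_strings}")
```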

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_typecheck(test_df, "cmc", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results.append({"name": "violate_typecheck", "train_mse": train_mse, "test_mse": test_mse})

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_typecheck_by_feature(test_df, "cmc", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results_by_feature.append({"name": "violate_typecheck", "train_mse": train_mse, "test_mse": test_mse})

## Subtle Corruptions

This is by no means an exhaustive list, but we will explore what the following corruptions do to the model performance:

* Changing units (e.g., wind speed in m/s to km/s)
* Averaging values within a sensor group for a row (e.g., making all gfs sensors return the same value)
* Pinning a sensor's value for a fraction of rows (e.g., setting climate_pressure to its 5th percentile)

### Changing `gfs_temperature` from Celsius to Fahrenheit

Suppose the units corruption changes the `gfs_temperature` sensor values from Celsius to Fahrenheit for 5% of rows. We can see that MSE gets noticeably worse.
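
The conversion itself is $F = C \cdot 9/5 + 32$. The sketch below applies it to a random 5% of rows in a toy column; `inject_unit_change` and the toy data are hypothetical and only approximate what `ds.corrupt_units` might do.

```python
import numpy as np
import pandas as pd


def celsius_to_fahrenheit(celsius):
    return celsius * 9.0 / 5.0 + 32.0


def inject_unit_change(df, column, corruption_rate, seed=0):
    """Hypothetical helper: convert `column` to Fahrenheit in a fraction of rows."""
    corrupted = df.copy()
    rows = corrupted.sample(frac=corruption_rate, random_state=seed).index
    corrupted.loc[rows, column] = celsius_to_fahrenheit(corrupted.loc[rows, column])
    return corrupted


toy_df = pd.DataFrame({"gfs_temperature": np.linspace(-10.0, 30.0, 200)})
units_df = inject_unit_change(toy_df, "gfs_temperature", corruption_rate=0.05)
n_changed = int((units_df["gfs_temperature"] != toy_df["gfs_temperature"]).sum())
print(f"Rows changed: {n_changed}")
# No value is missing or of the wrong type, so the obvious checks stay silent.
print(units_df["gfs_temperature"].describe())
```

This is what makes the corruption subtle: every corrupted value is still a well-formed number.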

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_units(test_df, "gfs_temperature", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results.append({"name": "corrupt_units", "train_mse": train_mse, "test_mse": test_mse})

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_units_by_feature(test_df, "gfs_temperature", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results_by_feature.append({"name": "corrupt_units", "train_mse": train_mse, "test_mse": test_mse})

### Averaging sensor values for some rows

Suppose we average `gfs` sensor values for 5% of rows. MSE also gets noticeably worse.
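
To illustrate, the sketch below replaces every column in a sensor group with the row-wise group mean for a random 5% of rows. The helper `average_sensor_group`, the prefix-based grouping, and the toy columns are our assumptions, not the `dataval` implementation.

```python
import numpy as np
import pandas as pd


def average_sensor_group(df, prefix, corruption_rate, seed=0):
    """Hypothetical helper: overwrite a sensor group with its row-wise mean."""
    group_cols = [c for c in df.columns if c.startswith(prefix)]
    corrupted = df.copy()
    rows = corrupted.sample(frac=corruption_rate, random_state=seed).index
    # Compute the means once, before any column is overwritten.
    row_means = corrupted.loc[rows, group_cols].mean(axis=1)
    for col in group_cols:
        corrupted.loc[rows, col] = row_means
    return corrupted, group_cols


base = np.arange(100.0)
toy_df = pd.DataFrame(
    {
        "gfs_a": base,
        "gfs_b": base + 1000.0,
        "gfs_c": base + 2000.0,
        "other": base,
    }
)
averaged_df, group_cols = average_sensor_group(toy_df, "gfs_", corruption_rate=0.05)
# Corrupted rows have identical values across the whole sensor group.
identical = averaged_df[group_cols].nunique(axis=1) == 1
print(averaged_df[identical])
```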

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_average(test_df, "gfs", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results.append({"name": "average_values", "train_mse": train_mse, "test_mse": test_mse})

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_average_by_feature(test_df, "gfs", corruption_rate=0.05)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results_by_feature.append({"name": "average_values", "train_mse": train_mse, "test_mse": test_mse})

### Pinned Value Corruption

Suppose we pin `gfs` sensor values to 1.00 for 5% of rows. MSE still increases, but not as much as with the other corruptions.
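
A pinned-value corruption can be sketched as overwriting a group of columns with one constant in a sample of rows. `pin_values` and the toy columns below are hypothetical; `ds.corrupt_pinned` may select rows or columns differently.

```python
import numpy as np
import pandas as pd


def pin_values(df, columns, corruption_rate, pinned_value, seed=0):
    """Hypothetical helper: set `columns` to a constant in a fraction of rows."""
    corrupted = df.copy()
    rows = corrupted.sample(frac=corruption_rate, random_state=seed).index
    corrupted.loc[rows, columns] = pinned_value
    return corrupted


toy_df = pd.DataFrame(
    {
        "gfs_a": np.linspace(100.0, 200.0, 100),
        "gfs_b": np.linspace(300.0, 400.0, 100),
        "other": np.linspace(500.0, 600.0, 100),
    }
)
pinned_df = pin_values(toy_df, ["gfs_a", "gfs_b"], corruption_rate=0.05, pinned_value=1.00)
n_pinned = int((pinned_df[["gfs_a", "gfs_b"]] == 1.00).all(axis=1).sum())
print(f"Rows pinned: {n_pinned}")

# A data-dependent pin, such as the 5th percentile of a column, is harder to
# spot because it falls inside the column's observed range.
print(toy_df["gfs_a"].quantile(0.05))
```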

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_pinned(test_df, "gfs", corruption_rate=0.05, pinned_value=1.00)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results.append({"name": "pin_values", "train_mse": train_mse, "test_mse": test_mse})

In [None]:
corrupted_test_df, corrupted_cols = ds.corrupt_pinned_by_feature(test_df, "gfs", corruption_rate=0.05, pinned_value=1.00)

print(f"Corrupted columns: {corrupted_cols}")
print()

t, train_mse, test_mse = ds.train_and_test(train_df, corrupted_test_df)
print()
print(t.get_feature_importance().head(5))

corruption_results_by_feature.append({"name": "pin_values", "train_mse": train_mse, "test_mse": test_mse})

## Takeaways

How can we guard against the corruptions demonstrated above? We'll want to run various data validation methods. The challenge is designing methods that (1) work _without knowledge of_ the specific corruption, since we can't anticipate and enumerate all possible corruptions, and (2) flag all corruptions precisely (i.e., with no false positives). We'll explore this later.

In [None]:
# Combine the by-sample and by-feature results into one table
by_feature_df = pd.DataFrame(corruption_results_by_feature)
results_df = pd.DataFrame(corruption_results).rename(columns={"train_mse": "by_sample_train_mse", "test_mse": "by_sample_test_mse"})
results_df["by_feature_train_mse"] = by_feature_df["train_mse"]
results_df["by_feature_test_mse"] = by_feature_df["test_mse"]
# Log results to W&B table
run.log({"corruption_results": wandb.Table(dataframe=results_df)})
plt.plot(results_df["name"], results_df["by_sample_train_mse"], label="train_mse by sample")
plt.plot(results_df["name"], results_df["by_feature_train_mse"], label="train_mse by feature")
plt.plot(results_df["name"], results_df["by_sample_test_mse"], label="test_mse by sample")
plt.plot(results_df["name"], results_df["by_feature_test_mse"], label="test_mse by feature")
plt.xticks(rotation=90)
plt.legend()
plt.title("MSEs for different corruptions")
# Log plot to W&B
run.log({"corruption_plot": wandb.Image(plt)})
plt.show()

In [None]:
# We can now finish the W&B run
run.finish()