Imputation quality is worse than generation quality #8

Closed
calvinmccarter opened this issue Dec 24, 2023 · 1 comment

@calvinmccarter

I've noticed that the quality of imputed data is worse than that of generated data. Below is a minimal reproducible example using the Two Moons dataset. I generated an N=200 dataset, then created a ForestDiffusionModel with N=400 samples, comprising the N=200 dataset and a modified copy of that dataset with the second dimension set to NaN.

Sampling from the model produced nice-looking results, but the imputations for the samples with NaNs were much noisier:
[Figure: scatter plots of the original data, generated samples, fast imputations, and REPAINT imputations]

Code below:

import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets as skd

from ForestDiffusion import ForestDiffusionModel
from sklearn.utils import check_random_state

rix = 0
rng = check_random_state(rix)
n_upper = 100
n_lower = 100
n = n_upper + n_lower
data, labels = skd.make_moons(
    (n_upper, n_lower), shuffle=False, noise=0.1, random_state=rix)

# Copy of the data with the second dimension masked out for imputation.
data4impute = data.copy()
data4impute[:, 1] = np.nan

# Fit on the observed data stacked with its NaN-masked copy (2n rows total).
model = ForestDiffusionModel(
    X=np.concatenate([data, data4impute], axis=0),
    n_t=50, duplicate_K=100, diffusion_type='vp',
    bin_indexes=[], cat_indexes=[], int_indexes=[], n_jobs=-1)
data_fake = model.generate(batch_size=data.shape[0])

nimp = 1  # number of imputations needed
data_imputefast = model.impute(k=nimp)  # regular (fast)
data_impute = model.impute(repaint=True, r=10, j=5, k=nimp)  # REPAINT (slow, but better)

# Plot the original data, the generated samples, and the imputed rows
# (the last n rows of the imputed arrays are the NaN-masked copies).
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(7, 5))
axes[0, 0].scatter(data[:, 0], data[:, 1])
axes[0, 0].set_title('original')
axes[0, 1].scatter(data_fake[:, 0], data_fake[:, 1])
axes[0, 1].set_title('generated')
axes[1, 0].scatter(data_imputefast[n:, 0], data_imputefast[n:, 1])
axes[1, 0].set_title('imputed')
axes[1, 1].scatter(data_impute[n:, 0], data_impute[n:, 1])
axes[1, 1].set_title('imputed - repainted')
plt.tight_layout()
plt.show()
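
As a rough quantitative check (not part of the original report), one could also compare the mean nearest-neighbour distance from the generated and imputed points to the original data. The snippet below reuses the variables from the script above and treats that distance as a crude proxy for noisiness:

from sklearn.neighbors import NearestNeighbors

# Crude proxy for noisiness: mean distance from each point to its nearest
# neighbour in the original Two Moons data (an assumption, not a formal metric).
nn = NearestNeighbors(n_neighbors=1).fit(data)
for name, pts in [('generated', data_fake),
                  ('imputed', data_imputefast[n:]),
                  ('imputed - repainted', data_impute[n:])]:
    dist, _ = nn.kneighbors(pts)
    print(f'{name}: mean distance to nearest original point = {dist.mean():.3f}')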

I get similar results if I instead do the following, so that the rows to impute no longer exactly match the original data on the first dimension:

data4impute = data.copy()
data4impute[:, 1] = np.nan
# Replace the first dimension with fresh uniform draws over the data range,
# so the rows to impute no longer coincide with observed points.
data4impute[:, 0] = rng.uniform(-1.2, 2.2, size=data4impute.shape[0])
@AlexiaJM
Collaborator

Hi Calvin,

We also observed worse results for imputation. In our paper, you can see that MissForest is the best imputation method, while our method ends up far from first place. We are not quite sure why imputation is so much worse than generation, considering that it works fine for images. I have tried a lot of things, but nothing improves imputation performance. Our method is best used for generation.

Alexia
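
For readers who want to try the MissForest comparison on the Two Moons example above, a minimal sketch is below. It uses scikit-learn's IterativeImputer with random-forest estimators as a stand-in for the original missForest package; the hyperparameters are illustrative assumptions, and IterativeImputer returns a single point estimate rather than a stochastic draw, so it is only an approximate baseline.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# MissForest-style baseline: iterative imputation with random-forest
# regressors, applied to the same concatenated matrix used above.
# Hyperparameters are illustrative, not tuned values.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
data_rf_impute = rf_imputer.fit_transform(
    np.concatenate([data, data4impute], axis=0))

plt.scatter(data_rf_impute[n:, 0], data_rf_impute[n:, 1])
plt.title('imputed - MissForest-style baseline')
plt.show()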
