Imputation quality is worse than generation quality #8

Closed
calvinmccarter opened this issue Dec 24, 2023 · 1 comment

@calvinmccarter

I've noticed that the quality of imputed data is worse than that of generated data. Below is a minimal reproducible example using the Two Moons dataset. I generated an N=200 dataset, then created a ForestDiffusionModel with N=400 samples, comprising the N=200 dataset and a modified copy of that dataset with the second dimension set to NaN.

Sampling from the model produced nice-looking results, but the imputations for the samples with NaNs were much noisier:
[Figure: scatter plots of the original data, generated samples, fast imputations, and REPAINT imputations]

Code below:

import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets as skd

from ForestDiffusion import ForestDiffusionModel
from sklearn.utils import check_random_state

rix = 0
rng = check_random_state(rix)
n_upper = 100
n_lower = 100
n = n_upper + n_lower
data, labels = skd.make_moons(
    (n_upper, n_lower), shuffle=False, noise=0.1, random_state=rix)

# Copy of the data with the second dimension masked out for imputation.
data4impute = data.copy()
data4impute[:, 1] = np.nan

# Fit on the observed data stacked with its NaN-masked copy (2n rows total).
model = ForestDiffusionModel(
    X=np.concatenate([data, data4impute], axis=0),
    n_t=50, duplicate_K=100, diffusion_type='vp',
    bin_indexes=[], cat_indexes=[], int_indexes=[], n_jobs=-1)
data_fake = model.generate(batch_size=data.shape[0])

nimp = 1  # number of imputations needed
data_imputefast = model.impute(k=nimp)  # regular (fast)
data_impute = model.impute(repaint=True, r=10, j=5, k=nimp)  # REPAINT (slow, but better)

# Plot the original data, the generated samples, and the imputed rows
# (the last n rows of the imputed arrays are the NaN-masked copies).
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(7, 5))
axes[0, 0].scatter(data[:, 0], data[:, 1])
axes[0, 0].set_title('original')
axes[0, 1].scatter(data_fake[:, 0], data_fake[:, 1])
axes[0, 1].set_title('generated')
axes[1, 0].scatter(data_imputefast[n:, 0], data_imputefast[n:, 1])
axes[1, 0].set_title('imputed')
axes[1, 1].scatter(data_impute[n:, 0], data_impute[n:, 1])
axes[1, 1].set_title('imputed - repainted')
plt.tight_layout()
plt.show()
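
As a rough quantitative check (not part of the original report), one could also compare the mean nearest-neighbour distance from the generated and imputed points to the original data. The snippet below reuses the variables from the script above and treats that distance as a crude proxy for noisiness:

from sklearn.neighbors import NearestNeighbors

# Crude proxy for noisiness: mean distance from each point to its nearest
# neighbour in the original Two Moons data (an assumption, not a formal metric).
nn = NearestNeighbors(n_neighbors=1).fit(data)
for name, pts in [('generated', data_fake),
                  ('imputed', data_imputefast[n:]),
                  ('imputed - repainted', data_impute[n:])]:
    dist, _ = nn.kneighbors(pts)
    print(f'{name}: mean distance to nearest original point = {dist.mean():.3f}')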

I get similar results if I instead do the following, so that the rows to impute no longer exactly match the original data on the first dimension:

data4impute = data.copy()
data4impute[:, 1] = np.nan
# Replace the first dimension with fresh uniform draws over the data range,
# so the rows to impute no longer coincide with observed points.
data4impute[:, 0] = rng.uniform(-1.2, 2.2, size=data4impute.shape[0])
@AlexiaJM
Collaborator

Hi Calvin,

We also observed worse results for imputation. In our paper, you can see that MissForest is the best imputation method, while our method ends up far from first place. We are not quite sure why imputation is so much worse than generation, considering that it works fine for images. I have tried a lot of things, but nothing improves imputation performance. Our method is best used for generation.

Alexia
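
For readers who want to try the MissForest comparison on the Two Moons example above, a minimal sketch is below. It uses scikit-learn's IterativeImputer with random-forest estimators as a stand-in for the original missForest package; the hyperparameters are illustrative assumptions, and IterativeImputer returns a single point estimate rather than a stochastic draw, so it is only an approximate baseline.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# MissForest-style baseline: iterative imputation with random-forest
# regressors, applied to the same concatenated matrix used above.
# Hyperparameters are illustrative, not tuned values.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
data_rf_impute = rf_imputer.fit_transform(
    np.concatenate([data, data4impute], axis=0))

plt.scatter(data_rf_impute[n:, 0], data_rf_impute[n:, 1])
plt.title('imputed - MissForest-style baseline')
plt.show()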
