
[MRG] sample from a truncated normal instead of clipping samples from a normal #12177

Merged: 20 commits into scikit-learn:iterativeimputer, Oct 5, 2018

Conversation
benlawson (Contributor) commented Sep 26, 2018

Reference Issues/PRs

Related to "Merge iterativeimputer branch into master" #11977

What does this implement/fix? Explain your changes.

When sampling from the posterior with boundary values given, the current implementation clips values sampled from a normal distribution. This can lead to undesired oversampling of the boundary values. For example, with boundaries [0, 2], a mean of 0, and a standard deviation of 1, the difference between clipping samples from a normal and sampling from a truncated normal is shown below:

from scipy.stats import norm, truncnorm
import matplotlib.pyplot as plt
import numpy as np

# mean 0, std 1; since loc=0 and scale=1, the standardized bounds a=0, b=2
# coincide with the raw boundaries [0, 2]
norm_dist = norm(loc=0, scale=1)
trunc_dist = truncnorm(a=0, b=2, loc=0, scale=1)
truncs = trunc_dist.rvs(10000)
norms = norm_dist.rvs(10000)
norms = np.clip(norms, 0, 2)  # clip the unconstrained draws to the boundaries
plt.hist(norms, histtype='step')
plt.hist(truncs, histtype='step')
plt.legend(["clipped normal", "truncated normal"])
plt.show()

(figure: overlaid histograms of the clipped-normal and truncated-normal samples; the clipped normal shows spikes at the boundaries)

When sampling from the posterior, this PR samples from a truncated normal distribution instead of clipping values that have been sampled from a normal distribution.
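
For concreteness, here is a minimal sketch of the approach (the sample_truncated helper is hypothetical, not the PR's code, and it assumes scipy >= 0.17 so that rvs accepts random_state). The key detail is that truncnorm takes its bounds a and b in standard-deviation units relative to loc and scale:

import numpy as np
from scipy.stats import truncnorm

def sample_truncated(mu, sigma, min_value, max_value, random_state, size=1):
    # truncnorm wants standardized bounds, i.e. (bound - loc) / scale,
    # not the raw min/max values
    a = (min_value - mu) / sigma
    b = (max_value - mu) / sigma
    return truncnorm(a=a, b=b, loc=mu, scale=sigma).rvs(
        size=size, random_state=random_state)

rng = np.random.RandomState(0)
samples = sample_truncated(mu=0.0, sigma=1.0, min_value=0.0, max_value=2.0,
                           random_state=rng, size=10000)
assert samples.min() >= 0.0 and samples.max() <= 2.0  # all draws respect the bounds

Unlike clipping, no probability mass piles up at the boundaries; it is redistributed over the interior of the interval.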

Any other comments?

To impute values within boundaries, the MICE R package appears to clip values [1], while the mi R package samples from a truncated normal [2] within the user-given range. Open to discussing which approach is best.

[1] page 149, https://cran.r-project.org/web/packages/mice/mice.pdf
[2] page 21, https://cran.r-project.org/web/packages/mi/mi.pdf

jnothman (Member) commented Sep 27, 2018

This looks nice :)

I assume there's no simple way to write a non-regression test :\ ?

ogrisel (Member) commented Sep 27, 2018

Some tests fail with older versions of scipy. I think we might need a backport in sklearn.utils.fixes:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/fixes.py
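
For reference, the usual pattern in sklearn/utils/fixes.py is a version-gated import. A hedged sketch (not the exact code; the fallback module name follows the file benlawson describes adding later in this thread) might look like:

import scipy
from distutils.version import LooseVersion

if LooseVersion(scipy.__version__) >= LooseVersion('0.17'):
    # recent scipy: truncnorm.rvs accepts random_state, so just re-export it
    from scipy.stats import truncnorm
else:
    # older scipy: fall back to a vendored copy
    from sklearn.utils._scipy_truncnorm_backport import truncnorm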

@benlawson benlawson changed the title sample from a truncated normal instead of clipping samples from a normal [WIP] sample from a truncated normal instead of clipping samples from a normal Sep 27, 2018

sklearn-lgtm commented Sep 27, 2018

This pull request introduces 2 alerts when merging 21d7238 into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import

sklearn-lgtm commented Sep 27, 2018

This pull request introduces 4 alerts when merging 451b80d into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Missing call to __init__ during object initialization
  • 1 for Module-level cyclic import

sklearn-lgtm commented Sep 27, 2018

This pull request introduces 5 alerts when merging f5b5d94 into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Missing call to __init__ during object initialization
  • 1 for Module-level cyclic import
  • 1 for First argument to super() is not enclosing class

sklearn-lgtm commented Sep 27, 2018

This pull request introduces 3 alerts when merging 97e2d37 into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import

sklearn-lgtm commented Sep 28, 2018

This pull request introduces 3 alerts when merging bf0e49f into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import


@@ -591,6 +592,28 @@ def test_iterative_imputer_clip():
    assert_allclose(Xt[X != 0], X[X != 0])


def test_iterative_imputer_normal_posterior():
    rng = np.random.RandomState(0)

jnothman (Member), Sep 29, 2018:

Add a comment here on what, intuitively, this test ensures

    # generate multiple imputations for the single missing value
    imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
    mu, sigma = imputations.mean(), imputations.std()
    ks_statistic, p_value = kstest((imputations-mu)/sigma, 'norm')

jnothman (Member), Sep 29, 2018:

Please put spaces around binary operators

sklearn-lgtm commented Oct 1, 2018

This pull request introduces 3 alerts when merging 2edb84c into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import

benlawson (Contributor, author) commented Oct 1, 2018

For the backport, I just took the relevant scipy code and added it to a new file, sklearn/utils/_scipy_truncnorm_backport.py.

  1. Is this the proper way to do backports?
  2. If so, should I add the corresponding tests from scipy for the code-coverage check?
sklearn-lgtm commented Oct 1, 2018

This pull request introduces 3 alerts when merging 00a4dae into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import

ogrisel (Member) commented Oct 1, 2018

Actually, based on the consensus emerging from the discussion in #12184, I don't think we need to backport: scikit-learn 0.21 will require scipy >= 0.17.0.

So please remove the backport and instead skip the tests that fail because of the lack of scipy.stats.truncnorm:

# TODO remove the skipif marker as soon as scipy < 0.17 support is dropped
@pytest.mark.skipif(not hasattr(scipy.stats, 'truncnorm'),
                    reason='needs scipy.stats.truncnorm')
def test_something():
    ...

Ben Lawson added some commits Oct 1, 2018

benlawson (Contributor, author) commented Oct 2, 2018

Cool, sounds good. I used the following to skip tests instead, since scipy 0.14 has truncnorm but it doesn't accept the random_state parameter:

def test_something():
    pytest.importorskip("scipy", minversion="0.17.0")
    ...

@benlawson benlawson changed the title [WIP] sample from a truncated normal instead of clipping samples from a normal [MRG] sample from a truncated normal instead of clipping samples from a normal Oct 2, 2018

jnothman (Member) left a comment

Please add your name to the credits for IterativeImputer in doc/whats_new/v0.21.rst

Multiple modules
................

Changes to estimator checks
---------------------------

These changes mostly affect library developers.
These changes mo:tabstly affect library developers.

@amueller (comment minimized)

benlawson (author), Oct 3, 2018:

whoops, sorry about that!

ogrisel (Member) left a comment

Here are some comments.

The new test is nice, but it doesn't really check that truncated normal != clipped normal, right? But I guess this is fine. I am not sure what we could do to test it better.

def test_iterative_imputer_normal_posterior():
    # test that the values that are imputed using `sample_posterior=True`
    # with boundaries (`min_value` and `max_value` are not None) are drawn
    # from a distribution that looks gaussian via the Kolmogorov Smirnov test

ogrisel (Member), Oct 5, 2018:

"that does not look Gaussian", right?

If the KS test p-value is larger than 0.1 the null hypothesis that the posterior samples are sampled from a standard normal distribution is rejected.

And this is expected because of the truncation by min_value and max_value that cut the tails of the posterior distribution.

benlawson (author), Oct 5, 2018:

Perhaps it should read "that looks roughly Gaussian."

We would want to fail to reject the null hypothesis: we don't want the truncation to make the posterior no longer look roughly Gaussian.

ogrisel (Member), Oct 5, 2018:

Oh alright sorry, it's just me having trouble with double negation reasoning.

# we want to fail to reject the null hypothesis
# null hypothesis: distributions are the same
assert ks_statistic < 0.15 or p_value > 0.1, \
    "The posterior does not appear to be normal"

ogrisel (Member), Oct 5, 2018:

Where does the 0.15 threshold for the KS-statistic come from? Isn't p_value > 0.1 enough?

benlawson (author), Oct 5, 2018:

I made this assert statement based on my understanding of this line from scipy.stats.ks_2samp docs: "If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same."

Empirically, a slightly higher threshold might be needed (0.2). The p-values don't seem to differ much between the two branches (this branch / the iterativeimputer branch). The threshold change is based on a mistake I made generating the test data: the values of variable X were supposed to be normally distributed, centered at 0, but I mistakenly used .random_sample, which is uniform over [0, 1), so it was unlikely min_value=0 would do much.

(figures: "Distribution of KS stats" and "Distribution of p-values", boxplots comparing the truncated-normal and clipped-normal branches, generated by the code below)

Code used (experiment):
from tqdm import tqdm
import joblib

import numpy as np
from scipy.stats import kstest
from sklearn.impute import IterativeImputer

stats, ps = [], []
for seed in tqdm(range(0, 100)):
    #  test that the values that are imputed using `sample_posterior=True`
    #  with boundaries (`min_value` and `max_value` are not None) are drawn
    #  from a distribution that looks gaussian via the Kolmogorov Smirnov test
    rng = np.random.RandomState(seed)

    X = rng.normal(size=(5, 5))
    X[0][0] = np.nan

    imputer = IterativeImputer(min_value=0,
                               max_value=0.5,
                               sample_posterior=True,
                               random_state=rng)

    imputer.fit_transform(X)
    # generate multiple imputations for the single missing value
    imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
    mu, sigma = imputations.mean(), imputations.std()
    if sigma == 0:
        sigma += 1e-12
    ks_statistic, p_value = kstest((imputations - mu) / sigma, 'norm')
    stats.append(ks_statistic)
    ps.append(p_value)
joblib.dump((stats, ps), 'master_branch.joblib')
Code used (plotting):
import joblib
import numpy as np
import matplotlib.pyplot as plt

clip_stats, clip_ps = joblib.load("./master_branch.joblib")
truc_stats, truc_ps = joblib.load("./new_branch.joblib")

clip_ps = np.array(clip_ps)[~np.isnan(clip_ps)]
clip_stats = np.array(clip_stats)[~np.isnan(clip_stats)]
truc_ps = np.array(truc_ps)[~np.isnan(truc_ps)]
truc_stats = np.array(truc_stats)[~np.isnan(truc_stats)]

plt.boxplot([truc_stats, clip_stats], labels=["Truncated normal", "Clipped normal"])
plt.hlines(0.15, 0, 3, linestyles='dotted', colors='r')
plt.title("Distribution of KS stats")
plt.savefig("ksstats.png")
plt.close()
plt.boxplot([truc_ps, clip_ps], labels=["Truncated normal", "Clipped normal"])
plt.title("Distribution of p-values")
plt.savefig("pvalues.png")

ogrisel (Member), Oct 5, 2018:

Thank you very much for the details, this is very helpful.

@@ -591,6 +592,32 @@ def test_iterative_imputer_clip():
    assert_allclose(Xt[X != 0], X[X != 0])


def test_iterative_imputer_normal_posterior():

ogrisel (Member), Oct 5, 2018:

I would rename this to test_iterative_imputer_truncated_normal_posterior

loc=mus[good_sigmas],
scale=sigmas[good_sigmas])
imputed_values[good_sigmas] = truncated_normal.rvs(
random_state=self.random_state_)

ogrisel (Member), Oct 5, 2018:

nitpick: this indentation does not seem to follow PEP8. I wonder why it wasn't caught by our CI.

The following should work:

            imputed_values[good_sigmas] = truncated_normal.rvs(
                random_state=self.random_state_)
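
For context, here is a self-contained sketch of the vectorized sampling step this fragment belongs to (the array setup, the bounds, and the handling of zero-sigma entries are illustrative assumptions, with rng standing in for self.random_state_):

import numpy as np
from scipy.stats import truncnorm

rng = np.random.RandomState(0)
mus = np.array([0.1, -0.3, 0.4])    # posterior means for each missing entry
sigmas = np.array([0.5, 0.0, 1.2])  # posterior standard deviations
min_value, max_value = 0.0, 0.5

good_sigmas = sigmas > 0
# entries with sigma == 0 are deterministic: keep the mean, within bounds
imputed_values = np.clip(mus, min_value, max_value)

# truncnorm's a/b are standardized bounds: (bound - loc) / scale
a = (min_value - mus[good_sigmas]) / sigmas[good_sigmas]
b = (max_value - mus[good_sigmas]) / sigmas[good_sigmas]
truncated_normal = truncnorm(a=a, b=b,
                             loc=mus[good_sigmas],
                             scale=sigmas[good_sigmas])
imputed_values[good_sigmas] = truncated_normal.rvs(random_state=rng)
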
@@ -48,6 +48,10 @@ Support for Python 3.4 and below has been officially dropped.
  function of other features in a round-robin fashion. :issue:`8478` by
  :user:`Sergey Feldman <sergeyf>`.

- |Enhancement| :class:`impute.IterativeImputer` now samples from a truncated normal
  distrubtion instead of a clipped normal distribution when ``sample_posterior=True``.

ogrisel (Member), Oct 5, 2018:

typo: distribution

ogrisel (Member) commented Oct 5, 2018

LGTM, merging.

ogrisel merged commit 09a9a21 into scikit-learn:iterativeimputer Oct 5, 2018

8 checks passed

  • LGTM analysis: Python (no alert changes)
  • ci/circleci: deploy (tests passed on CircleCI)
  • ci/circleci: python2 (tests passed on CircleCI)
  • ci/circleci: python3 (tests passed on CircleCI)
  • codecov/patch (96% of diff hit, target 92.85%)
  • codecov/project (absolute coverage decreased by 0.01%; relative coverage increased by 3.14% compared to a4f2a89)
  • continuous-integration/appveyor/pr (AppVeyor build succeeded)
  • continuous-integration/travis-ci/pr (Travis CI build passed)

MLopez-Ibanez pushed a commit to MLopez-Ibanez/scikit-learn that referenced this pull request Feb 9, 2019
