# [MRG] sample from a truncated normal instead of clipping samples from a normal #12177

Merged

Merged 20 commits into scikit-learn:iterativeimputer on Oct 5, 2018

## Conversation

5 participants
Contributor

### benlawson commented Sep 26, 2018

#### Reference Issues/PRs

Related to "Merge iterativeimputer branch into master" #11977

#### What does this implement/fix? Explain your changes.

When sampling from the posterior with boundary values given, the current implementation clips values that are sampled from a normal distribution. This can lead to undesired oversampling of the boundary values. For example, with boundaries [0, 2], a mean of 0, and a standard deviation of 1, the difference between clipping samples from a normal and sampling from a truncated normal is shown here:

```
from scipy.stats import norm, truncnorm
import matplotlib.pyplot as plt
import numpy as np

norm_dis = norm(loc=0, scale=1)
truc_dis = truncnorm(a=0, b=2, loc=0, scale=1)
trucs = truc_dis.rvs(10000)
norms = norm_dis.rvs(10000)
norms = np.clip(norms, 0, 2)
plt.hist(norms, histtype='step')
plt.hist(trucs, histtype='step')
plt.legend(["clipped normal", "truncated normal"])
plt.show()
```

When sampling from the posterior, this PR samples from a truncated normal distribution instead of clipping values that have been sampled from a normal distribution.

To impute values within boundaries, the `MICE` R package appears to clip values[1], while the `mi` R package samples from a truncated normal[2] within the user-given range. Open to discussing which approach is best.
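For illustration only (this is not the PR's actual code): `scipy.stats.truncnorm` expresses its `a`/`b` bounds in standard-deviation units relative to `loc`/`scale`, so raw user-given bounds have to be rescaled before sampling. A minimal sketch, with a hypothetical helper name:

```python
from scipy.stats import truncnorm


def sample_truncated(mu, sigma, low, high, size, random_state=None):
    # hypothetical helper, not the PR's code: truncnorm's a/b are
    # expressed in units of sigma relative to mu, so rescale first
    a, b = (low - mu) / sigma, (high - mu) / sigma
    return truncnorm(a=a, b=b, loc=mu, scale=sigma).rvs(
        size=size, random_state=random_state)
```

Every draw then lands inside `[low, high]` by construction, rather than piling up at the boundaries the way clipped normal samples do.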

``` sample from the truncated normal instead of normal ```
``` e3e05ea ```
Member

### jnothman commented Sep 27, 2018

 This looks nice :) I assume there's no simple way to write a non-regression test :\ ?
Member

### ogrisel commented Sep 27, 2018 • edited

 Some tests fail with older versions of scipy. I think we might need a backport in `sklearn.utils.fixes`: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/fixes.py
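A minimal sketch of the kind of version guard `sklearn.utils.fixes` uses, assuming a simple (major, minor) comparison suffices; the names and structure here are illustrative, not the actual fix that was merged:

```python
import scipy

# compare (major, minor) of the installed scipy version; illustrative
# sketch only, not the actual sklearn.utils.fixes code
_scipy_version = tuple(int(p) for p in scipy.__version__.split('.')[:2])

if _scipy_version >= (0, 17):
    # recent enough: truncnorm.rvs accepts a random_state parameter
    from scipy.stats import truncnorm
else:
    # older scipy: a vendored backport would be bound here instead
    truncnorm = None
```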

### Ben Lawson added some commits Sep 27, 2018

``` add non-regression test for sampling from truncated norm ```
``` 0465f3a ```
``` use random number generator for random sample ```
``` 46c7c4f ```
``` add backport for scipy versions earlier than 0.17 ```
``` 21d7238 ```

### sklearn-lgtm commented Sep 27, 2018

This pull request introduces 2 alerts when merging 21d7238 into a4f2a89 (view on LGTM.com).

New alerts:

- 2 for Unused import

*Comment posted by LGTM.com*

### Ben Lawson added some commits Sep 27, 2018

``` check scipy version (for common tests)" ```
``` 451b80d ```
``` reduce scipy version to reflect when their api changed ```
``` 37db243 ```

### sklearn-lgtm commented Sep 27, 2018

This pull request introduces 4 alerts when merging 451b80d into a4f2a89 (view on LGTM.com).

New alerts:

- 2 for Unused import
- 1 for Missing call to __init__ during object initialization
- 1 for Module-level cyclic import

*Comment posted by LGTM.com*

### Ben Lawson added some commits Sep 27, 2018

``` fix super() call for python2 support ```
``` 441e2e2 ```
``` fix the scipy version; 0.14 introduced the new api not 0.16 ```
``` 03b5578 ```
``` fix flake8 ```
``` f5b5d94 ```

### sklearn-lgtm commented Sep 27, 2018

This pull request introduces 5 alerts when merging f5b5d94 into a4f2a89 (view on LGTM.com).

New alerts:

- 2 for Unused import
- 1 for Missing call to __init__ during object initialization
- 1 for Module-level cyclic import
- 1 for First argument to super() is not enclosing class

*Comment posted by LGTM.com*
``` fix call to super, was using the wrong object ```
``` 97e2d37 ```

### sklearn-lgtm commented Sep 27, 2018

This pull request introduces 3 alerts when merging 97e2d37 into a4f2a89 (view on LGTM.com).

New alerts:

- 2 for Unused import
- 1 for Module-level cyclic import

*Comment posted by LGTM.com*
``` random_state wasn't being used in the backport, because of missing 'random_state' parameter in freeze ```
``` bf0e49f ```

### sklearn-lgtm commented Sep 28, 2018

This pull request introduces 3 alerts when merging bf0e49f into a4f2a89 (view on LGTM.com).

New alerts:

- 2 for Unused import
- 1 for Module-level cyclic import

*Comment posted by LGTM.com*

### jnothman reviewed Sep 29, 2018

```
@@ -591,6 +592,28 @@ def test_iterative_imputer_clip():
     assert_allclose(Xt[X != 0], X[X != 0])


 def test_iterative_imputer_normal_posterior():
     rng = np.random.RandomState(0)
```

#### jnothman Sep 29, 2018

Member

Add a comment here on what, intuitively, this test ensures

```
    # generate multiple imputations for the single missing value
    imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
    mu, sigma = imputations.mean(), imputations.std()
    ks_statistic, p_value = kstest((imputations-mu)/sigma, 'norm')
```

#### jnothman Sep 29, 2018

Member

Please put spaces around binary operators

``` add description to test and fix whitespace for code style ```
``` 2edb84c ```

### sklearn-lgtm commented Oct 1, 2018

This pull request introduces 3 alerts when merging 2edb84c into a4f2a89 (view on LGTM.com).

New alerts:

- 2 for Unused import
- 1 for Module-level cyclic import

*Comment posted by LGTM.com*
``` remove non-ascii characters for python2 support ```
``` 00a4dae ```
Contributor Author

### benlawson commented Oct 1, 2018

For the backport, I just took the relevant scipy code and added it to a new file, `sklearn/utils/_scipy_truncnorm_backport.py`. Is this the proper way to do backports? If so, should I add the corresponding tests from scipy for the code-coverage check?

### sklearn-lgtm commented Oct 1, 2018

This pull request introduces 3 alerts when merging 00a4dae into a4f2a89 (view on LGTM.com).

New alerts:

- 2 for Unused import
- 1 for Module-level cyclic import

*Comment posted by LGTM.com*
Member

### ogrisel commented Oct 1, 2018 • edited

Actually, based on the consensus emerging from the discussion in #12184, I don't think we need to backport: scikit-learn 0.21 will require scipy >= 0.17.0. So please remove the backport and instead skip the tests that fail because of the lack of `scipy.stats.truncnorm`:

```
# TODO remove the skipif marker as soon as scipy < 0.17 support is dropped
@pytest.mark.skipif(not hasattr(scipy.stats, 'truncnorm'),
                    reason='need scipy.stats.truncnorm')
def test_something():
    ...
```

### Ben Lawson added some commits Oct 1, 2018

``` add description to test and fix whitespace for code style ```
``` 14060a9 ```
``` add skips for calls to IterativeImputer with `sample_posterior=True` and remove non-ascii character in description for python2 support ```
``` d056204 ```
``` removing truncnorm backport ```
``` 287286b ```
Contributor Author

### benlawson commented Oct 2, 2018

Cool, sounds good. I used the following to skip the tests, as scipy 0.14 has `truncnorm` but it doesn't accept the `random_state` parameter:

```
def test_something():
    pytest.importorskip("scipy", minversion="0.17.0")
    ...
```

Member

### jnothman left a comment

``` adding credits ```
``` 3a44f4e ```

### amueller reviewed Oct 3, 2018

```
Multiple modules
................

Changes to estimator checks
---------------------------

These changes mostly affect library developers.
These changes mo:tabstly affect library developers.
```

Member

?

#### benlawson Oct 3, 2018

Author Contributor

``` fix typo ```
``` c95d7a1 ```

Member

### ogrisel left a comment

Here are some comments. The new test is nice, but it doesn't really check that truncated normal != clipped normal, right? But I guess this is fine. I am not sure what we could do to test better.
```
def test_iterative_imputer_normal_posterior():
    # test that the values that are imputed using `sample_posterior=True`
    # with boundaries (`min_value` and `max_value` are not None) are drawn
    # from a distribution that looks gaussian via the Kolmogorov Smirnov test
```

#### ogrisel Oct 5, 2018

Member

"that does not look Gaussian", right?

If the KS test p-value is larger than 0.1 the null hypothesis that the posterior samples are sampled from a standard normal distribution is rejected.

And this is expected because of the truncation by `min_value` and `max_value` that cut the tails of the posterior distribution.

#### benlawson Oct 5, 2018

Author Contributor

Perhaps it should read "that looks roughly Gaussian."

We would want to fail to reject the null hypothesis. We don't want the truncation to make the posterior no longer look roughly Gaussian.

#### ogrisel Oct 5, 2018

Member

Oh alright sorry, it's just me having trouble with double negation reasoning.

```
    # we want to fail to reject null hypothesis
    # null hypothesis: distributions are the same
    assert ks_statistic < 0.15 or p_value > 0.1, \
        "The posterior does appear to be normal"
```

#### ogrisel Oct 5, 2018

Member

Where does the 0.15 threshold for the KS-statistic come from? Isn't `p_value > 0.1` enough?

#### benlawson Oct 5, 2018

Author Contributor

I made this assert statement based on my understanding of this line from scipy.stats.ks_2samp docs: "If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same."

Empirically, a slightly higher threshold might be needed (`0.2`). The p-values don't seem to differ much between the two branches (this branch / the iterativeimputer branch). The threshold change is based on a mistake I made when generating the test data: the values of the variable `X` were supposed to be normally distributed and centered at 0, but I mistakenly used `.random_sample`, which is uniformly distributed over [0, 1), so it was unlikely that `min_value=0` would do much.
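As a quick sanity check of that criterion (my own illustration, not part of the PR): standardized samples that really are normal give a small KS statistic against the `'norm'` reference distribution.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.RandomState(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)

# standardize, then compare against the standard normal CDF
stat, p_value = kstest((x - x.mean()) / x.std(), 'norm')
# stat stays well below the 0.15 threshold used in the assert above
```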

**Code used**

Experiment:
```
from tqdm import tqdm
import joblib

import numpy as np
from scipy.stats import kstest
from sklearn.impute import IterativeImputer

stats, ps = [], []
for seed in tqdm(range(0, 100)):
    # test that the values that are imputed using `sample_posterior=True`
    # with boundaries (`min_value` and `max_value` are not None) are drawn
    # from a distribution that looks gaussian via the Kolmogorov Smirnov test
    rng = np.random.RandomState(seed)

    X = rng.normal(size=(5, 5))
    X[0][0] = np.nan

    imputer = IterativeImputer(min_value=0,
                               max_value=0.5,
                               sample_posterior=True,
                               random_state=rng)

    imputer.fit_transform(X)

    # generate multiple imputations for the single missing value
    imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
    mu, sigma = imputations.mean(), imputations.std()
    if sigma == 0:
        sigma += 1e-12
    ks_statistic, p_value = kstest((imputations - mu) / sigma, 'norm')
    stats.append(ks_statistic)
    ps.append(p_value)

joblib.dump((stats, ps), 'master_branch.joblib')
```
Plotting (the loads at the top are my reconstruction: `master_branch.joblib` is the file dumped above, and the `truncnorm_branch.joblib` filename for this branch's run is assumed):

```
import joblib
import numpy as np
import matplotlib.pyplot as plt

# load the (stats, ps) pairs dumped by the experiment script;
# 'truncnorm_branch.joblib' is an assumed filename for this branch's run
clip_stats, clip_ps = joblib.load('master_branch.joblib')
truc_stats, truc_ps = joblib.load('truncnorm_branch.joblib')

clip_ps = np.array(clip_ps)[~np.isnan(clip_ps)]
clip_stats = np.array(clip_stats)[~np.isnan(clip_stats)]
truc_ps = np.array(truc_ps)[~np.isnan(truc_ps)]
truc_stats = np.array(truc_stats)[~np.isnan(truc_stats)]

plt.boxplot([truc_stats, clip_stats], labels=["Truncated normal", "Clipped normal"])
plt.hlines(0.15, 0, 3, linestyles='dotted', colors='r')
plt.title("Distribution of KS stats")
plt.savefig("ksstats.png")
plt.close()

plt.boxplot([truc_ps, clip_ps], labels=["Truncated normal", "Clipped normal"])
plt.title("Distribution of p-values")
plt.savefig("pvalues.png")
```

#### ogrisel Oct 5, 2018

Member

Thank you very much for the details, this is very helpful.

```
@@ -591,6 +592,32 @@ def test_iterative_imputer_clip():
     assert_allclose(Xt[X != 0], X[X != 0])


 def test_iterative_imputer_normal_posterior():
```

#### ogrisel Oct 5, 2018

Member

I would rename this to `test_iterative_imputer_truncated_normal_posterior`

```
                loc=mus[good_sigmas], scale=sigmas[good_sigmas])
            imputed_values[good_sigmas] = truncated_normal.rvs(
                    random_state=self.random_state_)
```

#### ogrisel Oct 5, 2018

Member

nitpick: this indentation does not seem to follow PEP8. I wonder why it wasn't caught by our CI.

The following should work:

```
            imputed_values[good_sigmas] = truncated_normal.rvs(
                random_state=self.random_state_)
```
```
@@ -48,6 +48,10 @@ Support for Python 3.4 and below has been officially dropped.
   function of other features in a round-robin fashion. :issue:`8478` by
   :user:`Sergey Feldman `.

- |Enhancement| :class:`impute.IterativeImputer` now samples from a truncated
  normal distrubtion instead of a clipped normal distribution when
  ``sample_posterior=True``.
```

#### ogrisel Oct 5, 2018

Member

typo: distribution

### Ben Lawson added some commits Oct 5, 2018

``` fix logistical comments (pep8, typos, renaming) ```
``` 2d93667 ```
``` adjust threshold and use normal to generate data ```
``` 55304dd ```
Member

### ogrisel commented Oct 5, 2018

 LGTM, merging.

### ogrisel merged commit `09a9a21` into scikit-learn:iterativeimputer on Oct 5, 2018

#### 8 checks passed

- LGTM analysis: Python (no alert changes)
- ci/circleci: deploy (your tests passed on CircleCI!)
- ci/circleci: python2 (your tests passed on CircleCI!)
- ci/circleci: python3 (your tests passed on CircleCI!)
- codecov/patch (96% of diff hit, target 92.85%)
- codecov/project (absolute coverage decreased by 0.01%, but relative coverage increased by 3.14% compared to a4f2a89)
- continuous-integration/appveyor/pr (AppVeyor build succeeded)
- continuous-integration/travis-ci/pr (the Travis CI build passed)

### MLopez-Ibanez pushed a commit to MLopez-Ibanez/scikit-learn that referenced this pull request Feb 9, 2019

``` [MRG] sample from a truncated normal instead of clipping samples from a normal (scikit-learn#12177) ```
``` dc71a86 ```