Missing values imputation implementation #712
Conversation
@fonnesbeck We should add a meaty missing values example. What should we add?
This is awesome. Hope to be able to take a look soon.
I've added a version of the
Also, do we have a better example for missing values than the disasters or ecology models? It seems like in these cases it's exactly the same as removing the data, so it's not really adding anything. Do we have a nice snappy example where it actually adds something?
I came up with this variant of the ecology model that actually uses missing values for something. Can you guys think of models where other RVs actually depend on missing values? Not sure if it makes a ton of sense: Also, this makes me realize that we need a nice way of referring to the whole variable that has missing values.
This looks different than my ZIP model. What do the two series of counts represent? Regarding referencing the whole variable, that's one thing that is nice about the PyMC 2 implementation: the node with missing values could just be treated as an unobserved stochastic with respect to plotting and summarization; you don't have to refer to the missing part of the node explicitly.
I can get you a better missing data example. Not sure what you mean by "adding something" to the model. In most cases, we are interested in imputing missing data because it avoids having to perform complete case analysis, which can induce bias in the results. So, it's not a matter of adding anything so much as avoiding having to throw things away.
In this model the first count series is your original series of counts for 1 species. The second one I intended to be a second series, where we're trying to predict the second counts from the first series but we have some missing data. By "adding something to the model" I meant that having the missing data in the model needs to affect the inferences we make somewhat. In the disasters model it seems like if we removed the missing data, the inferences we make about the parameters would be completely unchanged. If we have a bunch of predictors and some of them are missing for some of the rows, that would totally work. I could make a model like that, but it would be nice to have a nice story around it.
Re: referencing the whole variable. I agree the PyMC 2 way is better. I think I figured out a nice way to handle both cases this afternoon. We can have two classes: one for the standard case where there's just one array that goes into `observed`, and one for the case with multiple observed arrays. I'll probably implement it in the near future.
I see what you mean. Yes, this kind of additional information is available if there are additional (non-missing) covariates that would have been thrown out if the data were not imputed (the imputed values cannot add information). At any rate, I have a nice dataset (200 or so observations) with missing values that a good example can be built around. I am just getting permission from the data's owners, and then I will pass it along.
Fabulous!
So, here's the idea:

```python
test_scores = pd.read_csv(get_data_file('pymc3.examples', 'data/test_scores.csv')).fillna(-1)

(score, male, siblings, disability,
 age, mother_hs, early_ident) = test_scores[['score', 'male', 'sib',
                                             'synd_or_disab', 'age_test',
                                             'mother_hs', 'ident_by_3']].astype(float).values.T

with Model() as model:
    # Impute missing values (encoded as -1 by the fillna above)
    sib_mean = Exponential('sib_mean', 1)
    siblings_imp = Poisson('siblings_imp', sib_mean,
                           observed=masked_values(siblings, value=-1))

    p_disab = Beta('p_disab', 1, 1)
    disability_imp = Bernoulli('disability_imp', p_disab,
                               observed=masked_values(disability, value=-1))

    p_mother = Beta('p_mother', 1, 1)
    mother_imp = Bernoulli('mother_imp', p_mother,
                           observed=masked_values(mother_hs, value=-1))

    s = HalfCauchy('s', 5)
    beta = Laplace('beta', 0, 100, shape=7)

    expected_score = (beta[0] + beta[1]*male + beta[2]*siblings_imp +
                      beta[3]*disability_imp + beta[4]*age +
                      beta[5]*mother_imp + beta[6]*early_ident)

    observed_score = Normal('observed_score', expected_score, s ** -2, observed=score)
```

The problem here is that I'm not allowed to multiply my imputed data with a Tensor:
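For reference, the masking mechanism in the model above can be sketched with NumPy alone, assuming `masked_values` here is `numpy.ma.masked_values` (or a thin wrapper around it): it flags every element equal to the sentinel as missing. The sibling counts below are made up purely for illustration.

```python
import numpy as np

# Hypothetical sibling counts where -1 marks a missing entry, mirroring the
# fillna(-1) + masked_values(..., value=-1) pattern in the model above.
siblings = np.array([0., 2., -1., 1., -1.])

# masked_values flags every element (approximately) equal to -1 as missing.
siblings_masked = np.ma.masked_values(siblings, value=-1)

print(siblings_masked.mask)          # positions 2 and 4 are masked
print(siblings_masked.compressed())  # only the observed entries remain
```

The likelihood then only sees `compressed()`-style observed entries, while the masked positions get imputed as unobserved random variables.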
Yeah, that makes sense. I'll fix that up soon the way I described. Can you send me the data?
Yes. I will also email you the data (since I can't post it yet) so that you can run the model.
With respect to
Wouldn't you rather just use the
I would! I think this is a concrete benefit of the approach @twiecki suggested above, and it sounds like you are planning to have both work. But here is my nudge to just do the pandas null values, and keep things simple. (So cool to see so much activity on pymc3, btw!)
Does this seem urgent to you? My plan was to do this after finishing up on the chapter (due monday). I was planning on adding very basic missing value imputation in the ecology example in the chapter but not this more sophisticated stuff.
Nope. Chris Fonnesbeck
Turns out a reason to support Masked Array is that pandas does not support NaN for integer columns. http://pandas.pydata.org/pandas-docs/dev/gotchas.html#support-for-integer-na Might be possible to work around this if we get clever though. |
Correct. Integers will show up as
I've run into some issues recently with values being of the wrong dtype. This was because `pandas.DataFrame({'x': [1, 2, 3, np.nan]})` is a float column.
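The gotcha discussed in the last two comments can be reproduced in a couple of lines (a standalone illustration, not code from the PR; note that much later pandas versions added a nullable `Int64` dtype, which did not exist at the time of this thread):

```python
import numpy as np
import pandas as pd

# A complete integer column keeps an integer dtype...
ints = pd.Series([1, 2, 3])
print(ints.dtype)      # an integer dtype (int64 on most platforms)

# ...but a single NaN forces an upcast to float, because NaN is a float value.
with_na = pd.Series([1, 2, 3, np.nan])
print(with_na.dtype)   # a float dtype
```

This is why a masked array (which stores the mask separately and keeps the original dtype) was attractive here.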
With the introduction of missing variables it becomes important to be able to use the value of an ObservedRV. However, this was not possible because ObservedRV didn't subclass TensorVariable: ObservedRV could also handle multiple observed arrays, so it wouldn't always be a single tensor. This splits the two cases into ObservedRV, for single observed arrays, and MultiObservedRV, for multiple observed arrays. ObservedRV now inherits from TensorVariable.
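The class split described above can be sketched in a few lines of plain Python. This is a toy illustration of the design idea, not PyMC3's actual implementation: a single observed array can itself act as a tensor, while a bag of several arrays cannot.

```python
class TensorVariable:
    """Stand-in for theano.tensor.TensorVariable (toy sketch only)."""
    def __init__(self, value):
        self.value = value


class ObservedRV(TensorVariable):
    """A single observed array: subclasses TensorVariable, so it can be
    used directly in downstream expressions (e.g. beta * siblings_imp)."""
    def __init__(self, name, data):
        super().__init__(data)
        self.name = name


class MultiObservedRV:
    """Several observed arrays at once: there is no single tensor value,
    so this deliberately does not subclass TensorVariable."""
    def __init__(self, name, arrays):
        self.name = name
        self.arrays = list(arrays)
```

With this split, anything that needs tensor semantics can require an `ObservedRV`, and the multi-array case stays out of its way.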
@twiecki @fonnesbeck could one of you take a final look here?
Not allowed to include the dataset yet. |
Oh, the data I added is fake, I just made it up in Excel. I didn't even look at the real data. Can you add it back?
This reverts commit b1bdda5.
Ah, sorry. Well, it works fine on the real data too ... will just have a quick look at the changes. |
I think you reverted "increase samples" instead of reverting "removed test_scores.csv". |
It runs more than an order of magnitude slower than the equivalent PyMC 2 model, but of course it's a better sample. I'll try to put together a proper benchmark based on effective sample size.
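A benchmark like the one proposed can be built on a simple effective-sample-size estimate. The sketch below is illustrative only, not PyMC code: it uses the crudest possible truncation rule (stop the autocorrelation sum at the first non-positive lag) rather than the more careful initial-sequence estimators that real diagnostics use.

```python
import numpy as np

def effective_sample_size(x):
    """Rough ESS: n divided by the integrated autocorrelation time,
    truncating the autocorrelation sum at the first non-positive lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    # Autocorrelation at lags 0..n-1 via a direct correlation.
    acf = np.correlate(centered, centered, mode="full")[n - 1:]
    acf = acf / acf[0]
    tau = 1.0  # integrated autocorrelation time
    for k in range(1, n):
        if acf[k] <= 0:
            break
        tau += 2.0 * acf[k]
    return n / tau

rng = np.random.RandomState(0)
n = 5000
iid = rng.normal(size=n)   # independent draws: ESS should be near n
ar = np.empty(n)           # sticky AR(1) chain: ESS should be far below n
ar[0] = 0.0
for t in range(1, n):
    ar[t] = 0.95 * ar[t - 1] + rng.normal()

print(effective_sample_size(iid), effective_sample_size(ar))
```

Comparing samplers by ESS per second, rather than raw iterations per second, is what makes the "slower but better samples" trade-off quantifiable.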
How fast does it run vs just Metropolis? Also, PyMC2 still has a more sophisticated adaptive Metropolis, no? I wonder if we should invest in porting that over.
We definitely have to fix that.
When I run with
@fonnesbeck how do you feel about merging this at this point? Speed is definitely an issue for PyMC3 and I think one of the next things to tackle, but I think it doesn't have a lot to do with missing values per se, and so something we should talk about in a separate issue.
agreed with @jsalvatier's points.
PyMC 2 will block-update vector-valued stochastics only, unless you merge individual variables into containers. I agree. This stuff works now, so let's merge it.
Closed with #743 |
Right now I have missing values having a prior distribution from `NoDistribution`, which is a little weird to me, but it seems to fit naturally. Thoughts? Also need to implement pandas missing values still.