Skip to content

Conversation

jsalvatier
Copy link
Member

Right now I have missing values having a pior distribution from NoDistribution, which is a little weird to me, but it seems to fit naturally. Thoughts?

Also need to implement pandas missing values still.

@jsalvatier
Copy link
Member Author

@fonnesbeck We should add a meaty missing values example. What should we add?

@twiecki
Copy link
Member

twiecki commented May 25, 2015

This is awesome. Hope to be able to take a look soon.

@fonnesbeck
Copy link
Member

I've added a version of the disasters model with missing data to the branch. Though it runs just fine, it does not impute the missing values. A summary of disasters_missing results in estimates of zero with no variance.

@jsalvatier
Copy link
Member Author

Also, do we have a better example for missing values than the disasters or ecology models? It seems like in these cases its exactly the same as removing the data, so its not really adding anything. Do we have a nice snappy example where it actually adds something?

@jsalvatier
Copy link
Member Author

I came up with this variant of the ecology model that actually uses missing values for something. Can you guys think of models where other RVs actually depend on missing values? Not sure if it makes a ton of sense:

Also, this makes me realize that we need a nice way of referring to the whole variable that has missing values.

import numpy as np
y = np.ma.masked_values([0, 2, 1, 0, 4, 2, 0, 0, -999, 0, 0, 0, 0, 0, -999, 0, 0, 6, 0, 0, 0, 2, 1,
       2, 0, 0, 0, 1, 0, 0, 0, 4, 2, 0, 0, 0, 1, 0, 2, 4, 0, 0, 1, 0, 0, 0,
       0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0,
       0, 0, 3, 0, 2, 0, 1, 2, 2, 2, 2, 3, 0, 0, 0, 0, 1, 0, 3, 1, 0, 0, 0,
       0, 0, 2, 0, 0, 1, 0, 0], value = -999)

x = np.array([0, 1, 1, 0, 2, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0,
 1, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 2, 0,
 1, 0, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0])


pl.hist(y, bins=range(7))

from pymc3 import Beta, Bernoulli, ZeroInflatedPoisson, Uniform, Poisson

with Model() as zip_model:

    # Estimated occupancy
    p = Beta('p', 1, 1)

    # Latent variable for occupancy
    z = Bernoulli('z', p, y.shape)

    # Estimated mean count
    theta = Uniform('theta', 0, 100)

    # Poisson likelihood
    yd = ZeroInflatedPoisson('y', theta, z, observed=y)

    m = Normal('m', 1, 1)
    b = Normal('b', 1, 1)

    xd = Poisson('x', yd.data[0]*m+b, observed=x)

from pymc3 import Metropolis, BinaryMetropolis, sample 

with zip_model:

    start = {'p': 0.5, 'z': (y > 0).filled(1).astype(int), 'theta': 5, 'yd_missing': np.array([1,1])}


    step1 = Metropolis([theta, p, m, b] + yd.missing_values)

    step2 = BinaryMetropolis([z])

    trace = sample(10000, [step1, step2], start)

from pymc3 import traceplot, summary
traceplot(trace, ['p', 'theta', 'm', 'b']);

@fonnesbeck
Copy link
Member

This looks different than my ZIP model. What do the two series of counts represent?

Regarding referencing the whole variable, that's one thing that is nice about the PyMC 2 implementation: the node with missing values could just be treated as an unobserved stochastic with respect to plotting and summarization; you don't have to refer to the missing part of the node explicitly.

@fonnesbeck
Copy link
Member

I can get you a better missing data example.

Not sure what you mean by "adding something" to the model. In most cases, we are interested in imputing missing data because it avoids having to perform complete case analysis, which can induce bias in the results. So, its not a matter of adding anything so much as avoiding having to throw things away.

@jsalvatier
Copy link
Member Author

In this model the first count series is your original series of counts for 1 species. The second one I intended to be a second series, where we're trying to predict the second counts from the first series but we have some missing data.

By "adding something to the model" I meant that having the missing data in the model needs to affect the inferences we make somewhat. In the disasters model it seems like if we removed the missing data, the inferences about the parameters we make would be completely unchanged.

If we have a bunch of predictors and some of them are missing for some of the rows, that would totally work. I could make a model like that, but it would be nice to have a nice story around it.

@jsalvatier
Copy link
Member Author

Re: referencing the whole variable. I agree the pymc2 way is better. I think I figured out a nice way to handle both cases this afternoon. We can have two classes, one for the standard case where there's just one array that goes into observed ObservedRV which will function like pymc2. And a second for distributions with arrays that go into observed and thus should return multiple parts ObservedTupleRV or something. The second one can be indexed or unapplied mean, N, stdev = SummaryNormal('summary', mu, sd, observed={'mean' : a_mean, 'N' : n, 'stdev' : a_stdev }).

I'll probably implement it in the near future.

@fonnesbeck
Copy link
Member

I see what you mean. Yes, this kind of additional information is available if there are additional (non-missing) covariates that would have been thrown out if the data were not imputed (the imputed values cannot add information).

At any rate, I have a nice dataset (200 or so observations) with missing values that a good example can be built around. I am just getting permission from the data's owners, and then I will pass it along.

@jsalvatier
Copy link
Member Author

Fabulous!

On Thu, May 28, 2015 at 1:58 PM, Chris Fonnesbeck notifications@github.com
wrote:

I see what you mean. Yes, this kind of additional information is available
if there are additional (non-missing) covariates that would have been
thrown out if the data were not imputed (the imputed values cannot add
information).

At any rate, I have a nice little dataset with missing values that a good
example can be built around. I am just getting permission from the data's
owners, and then I will pass it along.


Reply to this email directly or view it on GitHub
#712 (comment).

@fonnesbeck
Copy link
Member

So, here's the idea:

test_scores = pd.read_csv(get_data_file('pymc3.examples', 'data/test_scores.csv')).fillna(-1)
(score, male, siblings, disability, 
    age, mother_hs, early_ident) = test_scores[['score', 'male', 'sib', 
                                                'synd_or_disab', 'age_test',
                                                'mother_hs', 'ident_by_3']].astype(float).values.T

with Model() as model:

    # Impute missing values
    sib_mean = Exponential('sib_mean', 1)
    siblings_imp = Poisson('siblings_imp', sib_mean, observed=masked_values(siblings, value=-1))

    p_disab = Beta('p_disab', 1, 1)
    disability_imp = Bernoulli('disability_imp', p_disab, observed=masked_values(disability, value=-1))

    p_mother = Beta('p_mother', 1, 1)
    mother_imp = Bernoulli('mother_imp', p_mother, observed=masked_values(mother_hs, value=-1))

    s = HalfCauchy('s', 5)
    beta = Laplace('beta', 0, 100, shape=7)

    p = (beta[0] + beta[1]*male + beta[2]*siblings_imp + beta[3]*disability_imp + 
        beta[4]*age + beta[5]*mother_imp + beta[6]*early_ident)

    observed_score = Normal('observed_score', expected_score, s ** -2, observed=score)

The problem here is that I'm not allowed to multiply my imputed data with a Tensor:

Traceback (most recent call last):
  File "/Users/fonnescj/GitHub/pymc3/pymc3/examples/lasso_missing.py", line 28, in <module>
    beta[4]*age + beta[5]*mother_imp + beta[6]*early_ident)
TypeError: unsupported operand type(s) for *: 'TensorVariable' and 'ObservedRV'

@jsalvatier
Copy link
Member Author

Yeah, that makes sense. I'll fix that up soon the way I described. Can you
push it to a branch so I can pull it when I work on this?

On Thu, May 28, 2015 at 3:03 PM, Chris Fonnesbeck notifications@github.com
wrote:

So, here's the idea:

test_scores = pd.read_csv(get_data_file('pymc3.examples', 'data/test_scores.csv')).fillna(-1)
(score, male, siblings, disability,
age, mother_hs, early_ident) = test_scores[['score', 'male', 'sib',
'synd_or_disab', 'age_test',
'mother_hs', 'ident_by_3']].astype(float).values.T
with Model() as model:

# Impute missing values
sib_mean = Exponential('sib_mean', 1)
siblings_imp = Poisson('siblings_imp', sib_mean, observed=masked_values(siblings, value=-1))

p_disab = Beta('p_disab', 1, 1)
disability_imp = Bernoulli('disability_imp', p_disab, observed=masked_values(disability, value=-1))

p_mother = Beta('p_mother', 1, 1)
mother_imp = Bernoulli('mother_imp', p_mother, observed=masked_values(mother_hs, value=-1))

s = HalfCauchy('s', 5)
beta = Laplace('beta', 0, 100, shape=7)

p = (beta[0] + beta[1]*male + beta[2]*siblings_imp + beta[3]*disability_imp +
    beta[4]*age + beta[5]*mother_imp + beta[6]*early_ident)

observed_score = Normal('observed_score', expected_score, s ** -2, observed=score)

The problem here is that I'm not allowed to multiply my imputed data with
a Tensor:

Traceback (most recent call last):
File "/Users/fonnescj/GitHub/pymc3/pymc3/examples/lasso_missing.py", line 28, in
beta[4]_age + beta[5]_mother_imp + beta[6]*early_ident)
TypeError: unsupported operand type(s) for *: 'TensorVariable' and 'ObservedRV'


Reply to this email directly or view it on GitHub
#712 (comment).

@fonnesbeck
Copy link
Member

Yes. I will also email you the data (since I can't post it yet) so that you can run the model.

@aflaxman
Copy link
Contributor

With respect to

test_scores = pd.read_csv(get_data_file('pymc3.examples', 'data/test_scores.csv')).fillna(-1)
(score, male, siblings, disability, 
    age, mother_hs, early_ident) = test_scores[['score', 'male', 'sib', 
                                                'synd_or_disab', 'age_test',
                                                'mother_hs', 'ident_by_3']].astype(float).values.T
...

   siblings_imp = Poisson('siblings_imp', sib_mean, observed=masked_values(siblings, value=-1))

Wouldn't you rather just use the pandas.Series with its built-in coding of missing data? E.g.

test_scores = pd.read_csv(get_data_file('pymc3.examples', 'data/test_scores.csv'))

...

    siblings_imp = Poisson('siblings_imp', sib_mean, observed=test_scores.sib)

I would! I think this is a concrete benefit of the approach @twiecki suggested above, and it sounds like you are planning to have both work. But here is my nudge to just do the pandas null values, and keep things simple.

(So cool to see so much activity on pymc3, btw!)

@jsalvatier
Copy link
Member Author

Does this seem urgent to you?

My plan was to do this after finishing up on the chapter (due monday). I was planning on adding very basic missing value imputation in the ecology example in the chapter but not this more sophisticated stuff.

@fonnesbeck
Copy link
Member

Nope.


Chris Fonnesbeck
Assistant Professor of Biostatistics
Department of Biostatistics
Vanderbilt Center for Quantitative Sciences
Vanderbilt University School of Medicine

On Thu, May 28, 2015 at 3:26 PM -0700, "John Salvatier" <notifications@github.commailto:notifications@github.com> wrote:

Does this seem urgent to you?

My plan was to do this after finishing up on the chapter (due monday). I was planning on doing very basic missing value imputation in the ecology example.


Reply to this email directly or view it on GitHubhttps://github.com//pull/712#issuecomment-106617335.

@jsalvatier
Copy link
Member Author

Turns out a reason to support Masked Array is that pandas does not support NaN for integer columns.

http://pandas.pydata.org/pandas-docs/dev/gotchas.html#support-for-integer-na

Might be possible to work around this if we get clever though.

@fonnesbeck
Copy link
Member

Correct. Integers will show up as object arrays if they contain missing values. Does this matter though? I assume we would immediately get the values from the Pandas data structures as soon as they enter PyMC objects. I would hate to have to accommodate Pandas indexing behavior all over the place.

@jsalvatier
Copy link
Member Author

I've run into some issues recently with values being of the wrong dtype.
I've seen floats show up in traces for discrete variables (which messes up
plotting). I've also had a Poisson missing variable treated as effectively
continuous.

This was because pandas.DataFrame({'x' : [1,2,3. np.nan]}) is a float
column and passing a float column and Poisson doesn't seem to have enough
checks to prevent the variable from actually just sampling as a continuous
variable.

#722

On Sun, May 31, 2015 at 7:46 PM, Chris Fonnesbeck notifications@github.com
wrote:

Correct. Integers will show up as object arrays if they contain missing
values. Does this matter though? I assume we would immediately get the
values from the Pandas data structures as soon as they enter PyMC
objects. I would hate to have to accommodate Pandas indexing behavior all
over the place.


Reply to this email directly or view it on GitHub
#712 (comment).

With the introduction of missing variables it becomes important to be
able to use the value of an ObservedRV. However, this was not possible
because ObservedRV didn't subclass TensorVariable because ObservedRV
could also handle multiple observed arrays so wouldn't always be a
single tensor.

This splits up the two cases into ObservedRV, for single observed
arrays, and MultiObservedRV for multiple observed arrays. ObservedRV
now inherits from TensorVariable.
@jsalvatier
Copy link
Member Author

@twiecki @fonnesbeck could one of you take a final look here?

@fonnesbeck
Copy link
Member

Not allowed to include the dataset yet.

@jsalvatier
Copy link
Member Author

Oh, the data I added is fake, I just made it up in excel. I didn't even look at the real data. Can you add it back?

@fonnesbeck
Copy link
Member

Ah, sorry. Well, it works fine on the real data too ... will just have a quick look at the changes.

@jsalvatier
Copy link
Member Author

I think you reverted "increase samples" instead of reverting "removed test_scores.csv".

Chris Fonnesbeck added 2 commits June 5, 2015 15:40
@fonnesbeck
Copy link
Member

It runs more than an order of magnitude slower than the equivalent PyMC 2 model, but of course its a better sample. I'll try to put together a proper benchmark based on effective sample size.

@jsalvatier
Copy link
Member Author

How fast does it run vs just Metropolis?

Also, PyMC2 still has a more sophisticated adaptive Metropolis, no? I wonder if we should invest in porting that over.

@fonnesbeck
Copy link
Member

Closer, but still much slower

benchmarks

@jsalvatier
Copy link
Member Author

We definitely have to fix that.

@jsalvatier
Copy link
Member Author

When I run with Metropolis(blocked=True) it takes 4 seconds for me. Is
PyMC2 doing something close to blocked=True or blocked=False here?

On Fri, Jun 5, 2015 at 2:50 PM, Chris Fonnesbeck notifications@github.com
wrote:

Closer, but still much slower

[image: benchmarks]
https://camo.githubusercontent.com/e52c848ef68ac727a6b3893eebda42ece654162b/687474703a2f2f642e70722f692f3138624a672b


Reply to this email directly or view it on GitHub
#712 (comment).

@jsalvatier
Copy link
Member Author

@fonnesbeck how do you feel about merging this at this point? Speed is definitely an issue for PyMC3 and I think one of the next things to tackle, but I think it doesn't have a lot to do with missing values per se, and so something we should talk about in a separate issue.

@twiecki
Copy link
Member

twiecki commented Jun 7, 2015

agreed with @jsalvatier's points.

@fonnesbeck
Copy link
Member

PyMC 2 will block-update vector-valued stochastics only, unless you merge individual variables into containers.

I agree. This stuff works now, so let's merge it.

@jsalvatier
Copy link
Member Author

Closed with #743

@jsalvatier jsalvatier closed this Jun 7, 2015
@jsalvatier jsalvatier deleted the missing branch June 7, 2015 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants