Missing values imputation implementation #712
Conversation
@fonnesbeck We should add a meaty missing values example. What should we add?
This is awesome. Hope to be able to take a look soon.
I've added a version of the
Also, do we have a better example for missing values than the disasters or ecology models? It seems like in these cases it's exactly the same as removing the data, so it's not really adding anything. Do we have a nice snappy example where it actually adds something?
I came up with this variant of the ecology model that actually uses missing values for something. Can you guys think of models where other RVs actually depend on missing values? Not sure if it makes a ton of sense: Also, this makes me realize that we need a nice way of referring to the whole variable that has missing values.
This looks different than my ZIP model. What do the two series of counts represent? Regarding referencing the whole variable, that's one thing that is nice about the PyMC 2 implementation: the node with missing values could just be treated as an unobserved stochastic with respect to plotting and summarization; you don't have to refer to the missing part of the node explicitly.
I can get you a better missing data example. Not sure what you mean by "adding something" to the model. In most cases, we are interested in imputing missing data because it avoids having to perform complete case analysis, which can induce bias in the results. So, it's not a matter of adding anything so much as avoiding having to throw things away.
In this model the first count series is your original series of counts for 1 species. The second one I intended to be a second series, where we're trying to predict the second counts from the first series but we have some missing data. By "adding something to the model" I meant that having the missing data in the model needs to affect the inferences we make somewhat. In the disasters model it seems like if we removed the missing data, the inferences we make about the parameters would be completely unchanged. If we have a bunch of predictors and some of them are missing for some of the rows, that would totally work. I could make a model like that, but it would be nice to have a nice story around it.
Re: referencing the whole variable. I agree the PyMC 2 way is better. I think I figured out a nice way to handle both cases this afternoon. We can have two classes: one for the standard case where there's just one array that goes into `observed`, and one for the case with multiple observed arrays. I'll probably implement it in the near future.
I see what you mean. Yes, this kind of additional information is available if there are additional (non-missing) covariates that would have been thrown out if the data were not imputed (the imputed values cannot add information). At any rate, I have a nice dataset (200 or so observations) with missing values that a good example can be built around. I am just getting permission from the data's owners, and then I will pass it along.
Fabulous!
So, here's the idea:

```python
test_scores = pd.read_csv(get_data_file('pymc3.examples', 'data/test_scores.csv')).fillna(-1)

(score, male, siblings, disability,
 age, mother_hs, early_ident) = test_scores[['score', 'male', 'sib',
                                             'synd_or_disab', 'age_test',
                                             'mother_hs', 'ident_by_3']].astype(float).values.T

with Model() as model:
    # Impute missing values (encoded as -1 by the fillna above)
    sib_mean = Exponential('sib_mean', 1)
    siblings_imp = Poisson('siblings_imp', sib_mean,
                           observed=masked_values(siblings, value=-1))

    p_disab = Beta('p_disab', 1, 1)
    disability_imp = Bernoulli('disability_imp', p_disab,
                               observed=masked_values(disability, value=-1))

    p_mother = Beta('p_mother', 1, 1)
    mother_imp = Bernoulli('mother_imp', p_mother,
                           observed=masked_values(mother_hs, value=-1))

    s = HalfCauchy('s', 5)
    beta = Laplace('beta', 0, 100, shape=7)

    expected_score = (beta[0] + beta[1]*male + beta[2]*siblings_imp +
                      beta[3]*disability_imp + beta[4]*age +
                      beta[5]*mother_imp + beta[6]*early_ident)

    observed_score = Normal('observed_score', expected_score, s ** -2, observed=score)
```

The problem here is that I'm not allowed to multiply my imputed data with a Tensor:
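For reference, the masking mechanism in the model above can be sketched with NumPy alone, assuming `masked_values` here is `numpy.ma.masked_values` (or a thin wrapper around it): it flags every element equal to the sentinel as missing. The sibling counts below are made up purely for illustration.

```python
import numpy as np

# Hypothetical sibling counts where -1 marks a missing entry, mirroring the
# fillna(-1) + masked_values(..., value=-1) pattern in the model above.
siblings = np.array([0., 2., -1., 1., -1.])

# masked_values flags every element (approximately) equal to -1 as missing.
siblings_masked = np.ma.masked_values(siblings, value=-1)

print(siblings_masked.mask)          # positions 2 and 4 are masked
print(siblings_masked.compressed())  # only the observed entries remain
```

The likelihood then only sees `compressed()`-style observed entries, while the masked positions get imputed as unobserved random variables.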
Yeah, that makes sense. I'll fix that up soon the way I described. Can you send me the data?
Yes. I will also email you the data (since I can't post it yet) so that you can run the model.
With respect to
Wouldn't you rather just use the
I would! I think this is a concrete benefit of the approach @twiecki suggested above, and it sounds like you are planning to have both work. But here is my nudge to just do the pandas null values, and keep things simple. (So cool to see so much activity on pymc3, btw!)
Does this seem urgent to you? My plan was to do this after finishing up on the chapter (due monday). I was planning on adding very basic missing value imputation in the ecology example in the chapter but not this more sophisticated stuff.
Nope. Chris Fonnesbeck
Turns out a reason to support Masked Array is that pandas does not support NaN for integer columns. http://pandas.pydata.org/pandas-docs/dev/gotchas.html#support-for-integer-na Might be possible to work around this if we get clever though. |
Correct. Integers will show up as
I've run into some issues recently with values being of the wrong dtype. This was because `pandas.DataFrame({'x': [1, 2, 3, np.nan]})` is a float column.
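The gotcha discussed in the last two comments can be reproduced in a couple of lines (a standalone illustration, not code from the PR; note that much later pandas versions added a nullable `Int64` dtype, which did not exist at the time of this thread):

```python
import numpy as np
import pandas as pd

# A complete integer column keeps an integer dtype...
ints = pd.Series([1, 2, 3])
print(ints.dtype)      # an integer dtype (int64 on most platforms)

# ...but a single NaN forces an upcast to float, because NaN is a float value.
with_na = pd.Series([1, 2, 3, np.nan])
print(with_na.dtype)   # a float dtype
```

This is why a masked array (which stores the mask separately and keeps the original dtype) was attractive here.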
With the introduction of missing variables it becomes important to be able to use the value of an ObservedRV. However, this was not possible because ObservedRV didn't subclass TensorVariable: ObservedRV could also handle multiple observed arrays, so it wouldn't always be a single tensor. This splits the two cases into ObservedRV, for single observed arrays, and MultiObservedRV, for multiple observed arrays. ObservedRV now inherits from TensorVariable.
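The class split described above can be sketched in a few lines of plain Python. This is a toy illustration of the design idea, not PyMC3's actual implementation: a single observed array can itself act as a tensor, while a bag of several arrays cannot.

```python
class TensorVariable:
    """Stand-in for theano.tensor.TensorVariable (toy sketch only)."""
    def __init__(self, value):
        self.value = value


class ObservedRV(TensorVariable):
    """A single observed array: subclasses TensorVariable, so it can be
    used directly in downstream expressions (e.g. beta * siblings_imp)."""
    def __init__(self, name, data):
        super().__init__(data)
        self.name = name


class MultiObservedRV:
    """Several observed arrays at once: there is no single tensor value,
    so this deliberately does not subclass TensorVariable."""
    def __init__(self, name, arrays):
        self.name = name
        self.arrays = list(arrays)
```

With this split, anything that needs tensor semantics can require an `ObservedRV`, and the multi-array case stays out of its way.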
@twiecki @fonnesbeck could one of you take a final look here?
Not allowed to include the dataset yet. |
Oh, the data I added is fake, I just made it up in Excel. I didn't even look at the real data. Can you add it back?
This reverts commit b1bdda5.
Ah, sorry. Well, it works fine on the real data too ... will just have a quick look at the changes. |
I think you reverted "increase samples" instead of reverting "removed test_scores.csv". |
It runs more than an order of magnitude slower than the equivalent PyMC 2 model, but of course it's a better sample. I'll try to put together a proper benchmark based on effective sample size.
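A benchmark like the one proposed can be built on a simple effective-sample-size estimate. The sketch below is illustrative only, not PyMC code: it uses the crudest possible truncation rule (stop the autocorrelation sum at the first non-positive lag) rather than the more careful initial-sequence estimators that real diagnostics use.

```python
import numpy as np

def effective_sample_size(x):
    """Rough ESS: n divided by the integrated autocorrelation time,
    truncating the autocorrelation sum at the first non-positive lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    # Autocorrelation at lags 0..n-1 via a direct correlation.
    acf = np.correlate(centered, centered, mode="full")[n - 1:]
    acf = acf / acf[0]
    tau = 1.0  # integrated autocorrelation time
    for k in range(1, n):
        if acf[k] <= 0:
            break
        tau += 2.0 * acf[k]
    return n / tau

rng = np.random.RandomState(0)
n = 5000
iid = rng.normal(size=n)   # independent draws: ESS should be near n
ar = np.empty(n)           # sticky AR(1) chain: ESS should be far below n
ar[0] = 0.0
for t in range(1, n):
    ar[t] = 0.95 * ar[t - 1] + rng.normal()

print(effective_sample_size(iid), effective_sample_size(ar))
```

Comparing samplers by ESS per second, rather than raw iterations per second, is what makes the "slower but better samples" trade-off quantifiable.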
How fast does it run vs just Metropolis? Also, PyMC2 still has a more sophisticated adaptive Metropolis, no? I wonder if we should invest in porting that over.
We definitely have to fix that.
When I run with
@fonnesbeck how do you feel about merging this at this point? Speed is definitely an issue for PyMC3 and I think one of the next things to tackle, but I think it doesn't have a lot to do with missing values per se, and so something we should talk about in a separate issue.
agreed with @jsalvatier's points.
PyMC 2 will block-update vector-valued stochastics only, unless you merge individual variables into containers. I agree. This stuff works now, so let's merge it.
Closed with #743 |
Right now I have missing values having a prior distribution from `NoDistribution`, which is a little weird to me, but it seems to fit naturally. Thoughts? Also need to implement pandas missing values still.