
Submit BEST #1517

Closed · wants to merge 15 commits into from

Conversation

@ericmjl (Member) commented Nov 10, 2016

Referencing #1502, here's the PR for it. I've provided:

  • BEST object with .fit() and .plot_posterior() functions
  • Jupyter notebook showing how to use it.

@ColCarroll (Member)

Really cool! Looks good to me.

@@ -0,0 +1,141 @@
from ..distributions import StudentT, Exponential, Uniform, HalfCauchy
Member

Why call the dir modules and not models?

@ericmjl (Member Author) Nov 10, 2016

That's because there's a models.py present in the pymc3/ directory. I'm not sure what a better name would be, though. Do you have a suggestion?

@ericmjl (Member Author)

Well... it looks like I'm blind when coding at 3 a.m. (in Vienna). I just realized it's model.py, not models.py, so I'll change the directory name to models.

self.data['indices'] = self.data[self.sample_col].apply(
    lambda x: sample_names[x])

# print(self.data)
Member

stray debug

@ericmjl (Member Author)

Fixed.

@twiecki (Member) commented Nov 10, 2016

This is a great start. It warrants thinking deeply about the API, choice of inference, etc., but since we might not be able to fully specify those at this point, we can move ahead here.

One key trick is to use summary statistics for the test_vals to help with initialization. See e.g. https://github.com/quantopian/pyfolio/blob/master/pyfolio/bayesian.py#L194
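The summary-statistics trick can be sketched with pandas alone; the group means and standard deviations computed below are what would seed the model's test values (e.g. `testval=mean_test` on the group-mean priors). Column names and values here are illustrative, not from the PR:

```python
import pandas as pd

# Tidy data: one row per observation (illustrative values).
data = pd.DataFrame({
    'sample': ['a'] * 3 + ['b'] * 3,
    'output': [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
})

# Empirical per-group summary statistics, mirroring the PR's
# groupby computation over the output column.
mean_test = data.groupby('sample')['output'].mean().values
sd_test = data.groupby('sample')['output'].std().values

# These arrays would then initialize the model, e.g.
#   pm.Normal('group_mean', ..., shape=2, testval=mean_test)
print(mean_test, sd_test)
```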

@ericmjl (Member Author) commented Nov 10, 2016

@twiecki thanks for the feedback!

On the API design: would it be worthwhile inserting a comment block at the top of the module to clarify how the API is currently designed and its rationale, and leaving a note for future contributors that they'd be free to modify the API if they see a better fit? Not sure what PyMC3's current practice is for this, so I'd love to get feedback.

@ericmjl (Member Author) commented Nov 11, 2016

ping @twiecki: any comment on leaving a note for future contributors?

@springcoil (Contributor)

Hi Eric,
Leaving a note is a good idea :)


@ericmjl (Member Author) commented Nov 11, 2016

@springcoil all done. Thanks for the feedback! The final commit was just adding a docstring at the top of the BEST file, so all tests should pass.

@twiecki (Member) commented Nov 12, 2016

@ericmjl This looks great. The main thing is that I think this should use NUTS, not ADVI. I have a PR that I think would fit well here; can you hold tight a bit?

@twiecki (Member) commented Nov 12, 2016

@ericmjl I think #1523 could make inference here more concise.

@ferrine (Member) commented Nov 12, 2016

I want to implement some GLM models too, and we surely need a base class to share essential functionality; see #1524.

@ericmjl (Member Author) commented Nov 12, 2016

@ferrine I put a note outlining the API design at the top of the best.py file. Could you comment on what you think of it, please?

To repeat it here briefly, I thought first about how the data should be structured in a Pandas dataframe in order to make the code most concise, and then designed it around that. From my limited experience, statisticians like it best when data are provided structured and cleaned, with one row representing one "observation". Maybe that'd be a starting point?

Actually, I'll move this to the other thread, we can discuss there.
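The "one row per observation" layout described above is the tidy-data format; a wide table can be reshaped into it with pandas. A small sketch (column names invented for illustration):

```python
import pandas as pd

# Wide layout: one column per sample group.
wide = pd.DataFrame({
    'drug': [101.0, 100.0, 102.0],
    'placebo': [99.0, 101.0, 100.0],
})

# Tidy layout: one row per observation, which is the shape a
# sample_col / output_col interface like the PR's expects.
tidy = wide.melt(var_name='sample', value_name='output')
print(tidy)
```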

@ferrine (Member) left a comment

I see that there is some work left. I tried to do my best in reviewing.

mean_test = self.data.groupby('indices').mean()[self.output_col].values
sd_test = self.data.groupby('indices').std()[self.output_col].values

with Model() as model:
Member

This adds limitations to usage: it will not be possible to use this model within another model. Am I right, @twiecki?

self.params = params
self.model = model

def plot_posterior(self, rotate_xticks=False):
Member

It should also return the axes for further user modification.

@ericmjl (Member Author)

Great point, I'll update the PR with a fix for this.
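Returning the axes could look roughly like this (a sketch: the function name matches the PR, but the plotting internals are simplified and the input format is invented):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt


def plot_posterior(means, sds, rotate_xticks=False):
    """Plot posterior estimates and return the Axes for further tweaking."""
    fig, ax = plt.subplots()
    ax.errorbar(x=range(len(means)), y=means, yerr=sds, fmt='o')
    if rotate_xticks:
        plt.setp(ax.get_xticklabels(), rotation=90)
    return ax  # caller can now set titles, labels, limits, ...


ax = plot_posterior([2.0, 11.0], [0.5, 0.6])
ax.set_title('Posterior means')
```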

self.data['indices'] = self.data[self.sample_col].apply(
    lambda x: sample_names[x])

def fit(self, n_steps=500000):
Member

It is not possible to customize the ADVI fit :(

@ericmjl (Member Author)

I plan to change this to use @twiecki's sample_init function when it's merged in. I'll then wrap the kwargs.

Contributor

I think that's a good idea; it is, after all, what @twiecki's sample_init function is designed for.
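One way to make the fit customizable in the meantime is to forward keyword arguments to the inference call; a minimal sketch (the class and attribute names are hypothetical):

```python
class BESTSketch:
    """Hypothetical stand-in showing the **kwargs pass-through idea."""

    def __init__(self):
        self.fit_options = None

    def fit(self, n_steps=500000, **kwargs):
        # Record and forward extra options to the inference routine,
        # e.g. pm.sample(n_steps, **kwargs) once sample_init lands.
        self.fit_options = dict(n_steps=n_steps, **kwargs)
        return self


best = BESTSketch().fit(n_steps=1000, tune=500)
print(best.fit_options)
```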


self._convert_to_indices()

def _convert_to_indices(self):
Member

I think it is better to write a static function without side effects. By the way, it seems to modify the original dataframe, which is bad for the user:

import pandas as pd

class O:
    def __init__(self, data):
        self.data = data  # keeps a reference, not a copy

    def meth(self):
        self.data['a'] = 1  # mutates the caller's dataframe

data = pd.DataFrame()  # empty
o = O(data)
o.meth()
data  # no longer empty: it now has a column 'a'

Contributor

Agreed, the side effects here concern me.
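A defensive copy in the constructor avoids the mutation demonstrated above; a minimal sketch of the fix (class and method names hypothetical):

```python
import pandas as pd


class BESTNoSideEffects:
    """Hypothetical sketch: copy the input so the caller's frame is safe."""

    def __init__(self, data):
        # .copy() ensures later column additions never leak back
        # into the caller's dataframe.
        self.data = data.copy()

    def convert_to_indices(self):
        self.data['indices'] = range(len(self.data))


data = pd.DataFrame({'output': [1.0, 2.0]})
model = BESTNoSideEffects(data)
model.convert_to_indices()
print('indices' in data.columns)  # False: the caller's frame is untouched
```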

import pandas as pd


class BEST(object):
Member

As for the base class, we should consider using Lasagne's base ideas. I'll do a sketch for that soon.

sample_names = dict()
for i, name in enumerate(
        list(np.unique(self.data[self.sample_col].values))):
    print('Sample name {0} has the index {1}'.format(name, i))
Member

I suggest using the logging facility for this so that it can be turned off (by default) and on as needed.
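The logging alternative might look like this (a sketch; the logger name is illustrative, not from the PR):

```python
import logging

logger = logging.getLogger('pymc3.models.best')  # hypothetical name


def report_sample_indices(sample_names):
    # DEBUG-level messages are silent by default and can be enabled
    # with logging.basicConfig(level=logging.DEBUG) when needed.
    for name, index in sample_names.items():
        logger.debug('Sample name %s has the index %s', name, index)


report_sample_indices({'treatment': 0, 'control': 1})  # silent by default
```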

@ferrine (Member) commented Nov 12, 2016

@ericmjl I added a branch user_model in my fork; shall we try to work as a pair on implementing BEST, GLM, and a base class?

@ferrine ferrine mentioned this pull request Nov 12, 2016
@fonnesbeck (Member) commented Nov 12, 2016

As slick as the sklearn-like API is, it is not consistent with the other fitting methods in PyMC3, which are functions (sample and advi) rather than methods. Why can't BEST simply go ahead and fit the model, as advi does? I don't see the advantage of having a BEST object kicking around. In any case, I think we should maintain consistency here.

@ericmjl (Member Author) commented Nov 13, 2016

@ferrine: let's see how the discussion pans out here; I can see the reasons behind @fonnesbeck's concerns. If the reception here looks good, I'm happy to work on it; otherwise, let's respect the leads and work on the idea in a separate repo instead. How does that sound?

@twiecki (Member) commented Nov 14, 2016

The sklearn API suggestion was only an idea, feel free to abandon it.

@fonnesbeck (Member)

It's not a bad idea at all, but my inclination is to keep the interface consistent across models unless there is a good reason to deviate from it. If the sklearn API is something we like, then we should look for ways of moving to it across the package, and not just for new fitting methods.

@springcoil (Contributor)

I agree with @fonnesbeck about consistency across the API, in terms of the interface across models. I'm a huge fan of the sklearn API, but would consider this to be too different.

@burnpanck (Contributor)

I am just starting to join the discussion, so please feel free to tell me if I'm stating obvious or irrelevant things. I was looking for ways to build models in a modular fashion from simpler components (see #1553), and I do like the approach here. In particular, with respect to @fonnesbeck's comment, I see value in the object-oriented ("lasagne") approach. I do not care that much where the fitting function is found; for consistency, that could very well be the usual PyMC3 functions. But I do like the concept of an object that encapsulates the related variables of a particular (sub-)model, which can be fitted to data and linked to other models. I imagine that such a specialized model class is a stand-in for a set of related random variables, while instances describe a particular binding of those random variables to actual data and possibly to external random variables. The object provides convenient access to these variables and possibly offers additional related functionality, like domain-specific insight in picking start values.

@twiecki (Member) commented Nov 28, 2016

@ericmjl Now that we merged #1525 (into the 3.1 branch) I think it's worthwhile to try and port BEST to it to make use of the new API. Maybe @ferrine could help if he wants.

@twiecki (Member) commented Dec 5, 2016

@ericmjl Just wondering if you are interested in refactoring this PR.

@ericmjl (Member Author) commented Dec 5, 2016

@twiecki yep, still interested, just lacking time right now. If you need to clean up PRs that aren't active, I'm cool with closing it for now.

@twiecki (Member) commented Dec 5, 2016

No, that's fine; I was just wondering if it was still active.

@springcoil (Contributor)

Wondering @ericmjl if you've time to get this over the line?

@ericmjl (Member Author) commented Apr 7, 2017

@springcoil I think it's low on my priority list right now - thesis defence, job hunt, prep for PyCon. I'll go ahead and close.

@ericmjl ericmjl closed this Apr 7, 2017

7 participants