Using big data (50GB) sets with Stan on a Logistic Regression model #2221
Comments
Thanks for asking. Stan builds an expression graph for the log density, which requires about 40 bytes per subexpression, so I don't think it'll fit in 256GB. The only way to make this fly would be to build a custom logistic regression C++ function with analytic gradients (not that hard). But at the point where you have 50GB of data and you're fitting 100 covariates, you probably don't need Bayesian methods. Just use a stochastic gradient method that doesn't keep all the data in memory; Vowpal Wabbit will probably work. I'm going to close this issue, because we limit issues to technical feature specifications that have a clear path to implementation. We discuss bigger issues on the mailing list and preliminary designs on the wiki. This is related to stochastic methods and data-distributed methods, which we are thinking about.
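A back-of-the-envelope check makes the memory claim concrete. The ~40 bytes/node figure comes from the comment above; the row size and one-node-per-covariate-per-row count below are illustrative assumptions, not Stan internals:

```python
BYTES_PER_NODE = 40       # rough autodiff node size quoted above
N_COVARIATES = 100

data_bytes = 50 * 1024**3            # ~50 GB of raw data
bytes_per_row = 8 * N_COVARIATES     # 100 doubles per row (assumed layout)
n_rows = data_bytes // bytes_per_row # ~67M rows

# Each row contributes on the order of one graph node per covariate
# (the multiply-adds in the linear predictor), so:
graph_gb = n_rows * N_COVARIATES * BYTES_PER_NODE / 1024**3
print(round(graph_gb))  # ~250 GB for the expression graph alone
```

Under these assumptions the graph alone is on the order of 250GB, before the data, the sampler state, or the OS get any memory, which is why it won't fit in 256GB.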
Dear Bob, I have no problem writing a custom LR function as I am fluent in C++, but I would like to understand your comment that "you probably don't need Bayesian methods". Can you please elaborate? The whole point, from my perspective, was to try this directly in Stan; I have a fully working solution using Spark, but that does not involve priors or any Bayesian methods. Best,
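For concreteness, the analytic gradient Bob mentions is short for logistic regression. A minimal NumPy sketch, assuming a normal(0, prior_sd) prior on each coefficient (the function names and prior choice are illustrative, not the Stan math API):

```python
import numpy as np

def lr_log_posterior(beta, X, y, prior_sd=1.0):
    """Bernoulli-logit log likelihood plus a normal(0, prior_sd)
    log prior, up to a constant. Illustrative sketch only."""
    z = X @ beta
    ll = np.sum(y * z - np.log1p(np.exp(z)))
    return ll - 0.5 * np.sum(beta**2) / prior_sd**2

def lr_log_posterior_grad(beta, X, y, prior_sd=1.0):
    """Analytic gradient: X'(y - sigmoid(X @ beta)) - beta / prior_sd^2."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p) - beta / prior_sd**2
```

Hand-coding the gradient like this is what avoids materializing an autodiff node for every data point, which is the memory bottleneck discussed above.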
Sure. But none of this is Stan related.
The posterior will converge to a delta function around the
penalized maximum likelihood estimate (where the prior defines
the penalty). So full Bayes (which uses posterior estimation
uncertainty in posterior inference) doesn't buy you much over
just plugging in a point estimate.
- Bob
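The delta-function intuition shows up even in a toy conjugate model. A sketch using a Beta-Bernoulli posterior as a stand-in for the regression posterior (an illustrative assumption, since the logistic posterior has no closed form):

```python
from math import sqrt

def beta_posterior_sd(successes, n, a=1.0, b=1.0):
    """Posterior sd of theta under a Beta(a, b) prior and n Bernoulli
    observations. A conjugate toy stand-in for the regression posterior."""
    a_post, b_post = a + successes, b + (n - successes)
    total = a_post + b_post
    mean = a_post / total
    return sqrt(mean * (1.0 - mean) / (total + 1.0))

# Posterior sd shrinks like 1/sqrt(n): ~0.016 at n = 1,000 but
# ~7e-5 at n = 50,000,000, i.e. effectively a point mass at 50GB scale.
```

At that scale the posterior mass sits in a vanishingly small neighborhood of the penalized MLE, which is the sense in which full Bayes buys little over the point estimate.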
More importantly, by the time you've fit hundreds of thousands of data points, the posterior variances will shrink below the hidden bias from assuming that everyone in your giant sample behaves _exactly_ the same. Then you'll need to build something more elaborate, like a hierarchical logistic regression, which will cause your parameter count to explode from hundreds to millions, even with just hundreds of thousands of data points. Spark isn't going to help with that. NUTS will still be your best bet, but it'll be on the edge of feasibility.
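The parameter blow-up is easy to quantify. Assuming one coefficient vector per group plus population-level means and scales (a common hierarchical layout; the exact structure is an assumption for illustration):

```python
def hierarchical_param_count(n_groups, n_covariates):
    """Parameters in a varying-slopes hierarchical logistic regression:
    one slope vector per group, plus population means and scales."""
    return n_groups * n_covariates + 2 * n_covariates

# 10,000 groups x 100 covariates is already over a million parameters:
print(hierarchical_param_count(10_000, 100))  # 1000200
```

So even a modest grouping structure pushes the model from 100 parameters into the millions, which is what puts NUTS on the edge of feasibility here.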
+1. This is a super important point that often goes unmentioned in discussions like this one.
@betanalpha do you have or know of any resources where I could read more about this?
Do you mean exchangeability / the IID assumption?
The IID assumption gives you the typical logistic regression. Exchangeability is a weaker assumption that is consistent with heterogeneity in the population, but it gives you hierarchical logistic regression, not regular logistic regression.
Summary:
Using big data (50GB) sets with Stan on a logistic regression model
Description:
I am currently working with PySpark and PyMC3 on a binary classification problem. I would like to use Stan for Bayesian logistic regression on large data sets on the order of ~50GB with around 100 covariates/features.
Questions:
Many thanks,
Shlomo.