
Using big data (50GB) sets with Stan on a Logistic Regression model #2221

Closed
ghost opened this issue Feb 1, 2017 · 7 comments
@ghost commented Feb 1, 2017

Summary:

Using big (50GB) data sets with Stan on a logistic regression model

Description:

I am currently working with PySpark and PyMC3 on a binary classification problem. I would like to use Stan for Bayesian logistic regression on large data sets, on the order of ~50GB, with around 100 covariates/features.

Questions:

  1. Has anyone had success with such large data sets?
  2. How did you overcome the memory limitations of passing a data set from R and/or Python to Stan, and the fact that Stan is single-threaded?
  3. Assuming I have a machine with 256GB of memory, can Stan read data directly from Hadoop/S3 using C++ without going through R or Python first?

Many thanks,
Shlomo.

@bob-carpenter (Contributor) commented:

Thanks for asking. Stan builds an expression graph for the log density, which requires about 40 bytes per subexpression. So I don't think it'll fit in 256GB.
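
To see why, here is a rough back-of-envelope sketch; the dense double-precision layout and the per-observation node count are illustrative assumptions, not measurements:

```python
# Back-of-envelope estimate of the autodiff expression-graph memory for a
# plain logistic regression log density. All constants are rough guesses
# for illustration only.
BYTES_PER_NODE = 40          # ~40 bytes per subexpression (see above)
DATA_BYTES = 50e9            # ~50 GB of raw data
N_COVARIATES = 100
BYTES_PER_DOUBLE = 8

# If the data are dense doubles, the row count is roughly:
n_rows = DATA_BYTES / (N_COVARIATES * BYTES_PER_DOUBLE)   # ~62.5 million

# Assume on the order of one subexpression per covariate (the dot product)
# plus a handful for the link function and log likelihood per row.
nodes_per_row = N_COVARIATES + 5
graph_bytes = n_rows * nodes_per_row * BYTES_PER_NODE

print(f"rows ~ {n_rows:.2e}")                              # ~6.2e7
print(f"expression graph ~ {graph_bytes / 1e9:.0f} GB")    # ~260 GB, on top of the data itself
```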

The only way to make this fly would be to build a custom logistic regression C++ function with analytic gradients (not that hard).
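
For reference, one common way to write the log density and its analytic gradient for a Bernoulli-logit model with design matrix $X$, outcomes $y \in \{0,1\}^N$, and coefficients $\beta$ is

$$
\log p(y \mid X, \beta) = \sum_{n=1}^{N} \Big[ y_n \, x_n^\top \beta - \log\!\big(1 + \exp(x_n^\top \beta)\big) \Big],
\qquad
\nabla_\beta \log p(y \mid X, \beta) = X^\top \big( y - \operatorname{logit}^{-1}(X \beta) \big),
$$

so a custom C++ function needs only a matrix-vector product in each direction per evaluation, with no per-observation graph nodes.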

But at the point you have 50GB of data and you're fitting 100 covariates, you probably don't need Bayesian methods. Just use a stochastic gradient method that doesn't keep all the data in memory; Vowpal Wabbit will probably work.
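
As a sketch of what such an out-of-core fit looks like (this is illustrative plain minibatch SGD, not Vowpal Wabbit; the file path, column layout, chunk size, and learning-rate schedule are all assumptions):

```python
import numpy as np
import pandas as pd

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def csv_chunks(path, chunk_rows=100_000):
    """Stream (X, y) minibatches from a CSV too large to load at once.
    Assumes a column named 'label' holds the 0/1 outcome."""
    for chunk in pd.read_csv(path, chunksize=chunk_rows):
        y = chunk.pop("label").to_numpy(dtype=float)
        yield chunk.to_numpy(dtype=float), y

def sgd_logistic(chunks, n_features, lr=0.1, l2=1e-6):
    """One pass of minibatch SGD; only one chunk is ever held in memory."""
    beta = np.zeros(n_features)
    for t, (X, y) in enumerate(chunks, start=1):
        # Gradient of the (L2-penalized) negative log likelihood on this chunk.
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y) + l2 * beta
        beta -= (lr / np.sqrt(t)) * grad      # simple decaying step size
    return beta

# Hypothetical usage:
# beta_hat = sgd_logistic(csv_chunks("features.csv"), n_features=100)
```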

I'm going to close this issue, because we limit issues to technical feature specifications that have a clear path to implementation. We discuss bigger questions on the mailing list and preliminary designs on the Wiki. This is related to the stochastic and data-distributed methods we are thinking about.

@ghost (Author) commented Feb 1, 2017

Dear Bob,
Thanks for the prompt reply. I do understand that the mailing list is a better place for this thread; should I open the discussion there?

I have no problem writing a custom LR function, as I am fluent in C++; however, I would like to understand your comment that "you probably don't need Bayesian methods". Can you please elaborate? The whole point, from my perspective, was to try this directly in Stan; I have a fully working solution using Spark, but it does not involve priors or any Bayesian methods.

Best,

@bob-carpenter (Contributor) commented Feb 1, 2017 via email

@betanalpha (Contributor) commented Feb 1, 2017 via email

@jgabry (Member) commented Feb 1, 2017

> More importantly, by the time you’ve fit on hundreds of thousands of data the posterior variances will shrink below the hidden bias from assuming that everyone in your giant sample behaves exactly the same.

+1. This is a super important point that often goes unmentioned in discussions like this one.

@seantalts modified the milestone: v2.15.0 on Apr 14, 2017
@statwonk (Contributor) commented Jul 26, 2018

@betanalpha do you have or know of any resources where I could read more about this?

> the hidden bias from assuming that everyone in your giant sample behaves exactly the same

Do you mean exchangeability / the iid assumption?

@betanalpha (Contributor) commented:

The IID assumption gives you the typical logistic regression. Exchangeability is a weaker assumption that is consistent with heterogeneity in the population, but that gives you hierarchical logistic regression, not regular logistic regression.
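
Concretely, writing it one common way: the IID model pools everyone,

$$
y_n \sim \mathrm{Bernoulli}\!\big(\operatorname{logit}^{-1}(x_n^\top \beta)\big),
$$

while an exchangeable, hierarchical version lets groups $j[n]$ (e.g. regions, cohorts, or users) differ while sharing a population distribution,

$$
y_n \sim \mathrm{Bernoulli}\!\big(\operatorname{logit}^{-1}(\alpha_{j[n]} + x_n^\top \beta)\big),
\qquad
\alpha_j \sim \mathrm{Normal}(\mu_\alpha, \tau_\alpha).
$$

The varying-intercept structure here is just one illustrative choice; slopes can be made hierarchical in the same way.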
