
Using big data (50GB) sets with Stan on a Logistic Regression model #2221

Closed
ghost opened this issue Feb 1, 2017 · 7 comments
@ghost commented Feb 1, 2017

Summary:

Using big (50GB) data sets with Stan on a logistic regression model

Description:

I am currently working with PySpark and PyMC3 on a binary classification problem. I would like to use Stan for Bayesian logistic regression on large data sets, on the order of ~50GB, with around 100 covariates/features.

Questions:

  1. Has anyone had success with such large data sets?
  2. How did you overcome the memory limitations of passing a data set from R and/or Python to Stan, and the fact that Stan is single-threaded?
  3. Assuming I have a machine with 256GB of memory, can Stan read data directly from Hadoop/S3 using C++ without going through R or Python first?

Many thanks,
Shlomo.

@bob-carpenter (Contributor) commented:

Thanks for asking. Stan builds an expression graph for the log density, which requires about 40 bytes per subexpression. So I don't think it'll fit in 256GB.
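
To see why, here is a rough back-of-envelope sketch; the dense double-precision layout and the per-observation node count are illustrative assumptions, not measurements:

```python
# Back-of-envelope estimate of the autodiff expression-graph memory for a
# plain logistic regression log density. All constants are rough guesses
# for illustration only.
BYTES_PER_NODE = 40          # ~40 bytes per subexpression (see above)
DATA_BYTES = 50e9            # ~50 GB of raw data
N_COVARIATES = 100
BYTES_PER_DOUBLE = 8

# If the data are dense doubles, the row count is roughly:
n_rows = DATA_BYTES / (N_COVARIATES * BYTES_PER_DOUBLE)   # ~62.5 million

# Assume on the order of one subexpression per covariate (the dot product)
# plus a handful for the link function and log likelihood per row.
nodes_per_row = N_COVARIATES + 5
graph_bytes = n_rows * nodes_per_row * BYTES_PER_NODE

print(f"rows ~ {n_rows:.2e}")                              # ~6.2e7
print(f"expression graph ~ {graph_bytes / 1e9:.0f} GB")    # ~260 GB, on top of the data itself
```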

The only way to make this fly would be to build a custom logistic regression C++ function with analytic gradients (not that hard).
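
For reference, one common way to write the log density and its analytic gradient for a Bernoulli-logit model with design matrix $X$, outcomes $y \in \{0,1\}^N$, and coefficients $\beta$ is

$$
\log p(y \mid X, \beta) = \sum_{n=1}^{N} \Big[ y_n \, x_n^\top \beta - \log\!\big(1 + \exp(x_n^\top \beta)\big) \Big],
\qquad
\nabla_\beta \log p(y \mid X, \beta) = X^\top \big( y - \operatorname{logit}^{-1}(X \beta) \big),
$$

so a custom C++ function needs only a matrix-vector product in each direction per evaluation, with no per-observation graph nodes.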

But at the point you have 50GB of data and you're fitting 100 covariates, you probably don't need Bayesian methods. Just use a stochastic gradient method that doesn't keep all the data in memory; Vowpal Wabbit will probably work.
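
As a sketch of what such an out-of-core fit looks like (this is illustrative plain minibatch SGD, not Vowpal Wabbit; the file path, column layout, chunk size, and learning-rate schedule are all assumptions):

```python
import numpy as np
import pandas as pd

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def csv_chunks(path, chunk_rows=100_000):
    """Stream (X, y) minibatches from a CSV too large to load at once.
    Assumes a column named 'label' holds the 0/1 outcome."""
    for chunk in pd.read_csv(path, chunksize=chunk_rows):
        y = chunk.pop("label").to_numpy(dtype=float)
        yield chunk.to_numpy(dtype=float), y

def sgd_logistic(chunks, n_features, lr=0.1, l2=1e-6):
    """One pass of minibatch SGD; only one chunk is ever held in memory."""
    beta = np.zeros(n_features)
    for t, (X, y) in enumerate(chunks, start=1):
        # Gradient of the (L2-penalized) negative log likelihood on this chunk.
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y) + l2 * beta
        beta -= (lr / np.sqrt(t)) * grad      # simple decaying step size
    return beta

# Hypothetical usage:
# beta_hat = sgd_logistic(csv_chunks("features.csv"), n_features=100)
```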

I'm going to close this issue, because we limit issues to technical feature specifications that have a clear path to implementation. We discuss bigger questions on the mailing list and preliminary designs on the Wiki. This is related to the stochastic and data-distributed methods we are thinking about.

@ghost (Author) commented Feb 1, 2017

Dear Bob,
Thanks for the prompt reply. I do understand that the mailing list is a better place for this thread; should I open the discussion there?

I have no problem writing a custom LR function, as I am fluent in C++; however, I would like to understand your comment that "you probably don't need Bayesian methods". Can you please elaborate? The whole point, from my perspective, was to try this directly in Stan; I have a fully working solution using Spark, but it does not involve priors or any Bayesian methods.

Best,

@bob-carpenter (Contributor) commented Feb 1, 2017 via email

@betanalpha (Contributor) commented Feb 1, 2017 via email

@jgabry (Member) commented Feb 1, 2017

> More importantly, by the time you’ve fit on hundreds of thousands of data the posterior variances will shrink below the hidden bias from assuming that everyone in your giant sample behaves exactly the same.

+1. This is a super important point that often goes unmentioned in discussions like this one.

@seantalts modified the milestone: v2.15.0 on Apr 14, 2017
@statwonk (Contributor) commented Jul 26, 2018

@betanalpha do you have or know of any resources where I could read more about this?

> the hidden bias from assuming that everyone in your giant sample behaves exactly the same

Do you mean exchangeability / the iid assumption?

@betanalpha (Contributor) commented:

The IID assumption gives you the typical logistic regression. Exchangeability is a weaker assumption that is consistent with heterogeneity in the population, but that gives you hierarchical logistic regression, not regular logistic regression.
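
Concretely, writing it one common way: the IID model pools everyone,

$$
y_n \sim \mathrm{Bernoulli}\!\big(\operatorname{logit}^{-1}(x_n^\top \beta)\big),
$$

while an exchangeable, hierarchical version lets groups $j[n]$ (e.g. regions, cohorts, or users) differ while sharing a population distribution,

$$
y_n \sim \mathrm{Bernoulli}\!\big(\operatorname{logit}^{-1}(\alpha_{j[n]} + x_n^\top \beta)\big),
\qquad
\alpha_j \sim \mathrm{Normal}(\mu_\alpha, \tau_\alpha).
$$

The varying-intercept structure here is just one illustrative choice; slopes can be made hierarchical in the same way.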
