wordfish text model blows up when coercing sparse to dense matrix #124

cschwem2er · 2016-04-21T09:47:54Z

Hi,

I'm currently trying to fit scaling models on a larger dfm (length 5084391279, ~190,000 documents), which results in an error:

wf <- textmodel(myDfm, model = "wordfish")


Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Do you have an idea what's going on here?

On a sidenote, I'm pretty impressed how fast sparse dfm matrices can be generated with quanteda. I discovered you out of frustration with tm :-)

Cheers,
Carsten

kbenoit · 2016-04-21T10:03:06Z

Hi Carsten,

Before the matrix is sent to wordfish, it gets coerced to a (dense) matrix. The problem is not in wordfish itself but in that coercion to dense. See line 156 of textmodel-wordfish.R.

Just try

as.matrix(myDfm)

and you should get the same error.
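A rough way to check the size of the dense copy before attempting the coercion is sketched below. `dense_cost()` is a hypothetical helper, not part of quanteda or Matrix, and the small random matrix only stands in for a real dfm. The cell count reported above is larger than R's maximum integer, which is probably what CHOLMOD is complaining about, though that reading is an assumption.

```r
library(Matrix)

# Hypothetical helper: what would a dense copy of a sparse matrix cost?
dense_cost <- function(x) {
  n_cells <- prod(dim(x))               # prod() returns a double, so no overflow
  list(
    cells = n_cells,
    gib = n_cells * 8 / 1024^3,         # 8 bytes per double-precision cell
    exceeds_int_max = n_cells > .Machine$integer.max
  )
}

# Small stand-in for a dfm: 1000 documents by 500 features, about 1% non-zero
set.seed(1)
m <- rsparsematrix(1000, 500, density = 0.01)
cost <- dense_cost(m)

# The cell count reported in this issue
issue_cells <- 5084391279
issue_too_big <- issue_cells > .Machine$integer.max   # TRUE
issue_gib <- issue_cells * 8 / 1024^3                 # about 37.9 GiB
```

So even with enough RAM for roughly 38 GiB, a dense matrix indexed with 32-bit integers cannot hold that many cells.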

So: we need to rewrite wordfish.cpp to compute the quantities directly from the sparse matrix. On the list…

Glad you like the package! Any suggestions welcome.

cschwem2er · 2016-04-21T11:36:06Z

Hi Ken, thanks for the fast response :)

I investigated a little bit and I'm not sure whether you need SVD-related computations for wordfish. But if so, you could maybe work with the irlba package.

kbenoit · 2016-05-24T14:40:57Z

@lauderdale See http://gallery.rcpp.org/articles/armadillo-sparse-matrix/ for a description of how to send a sparse matrix to C++. A dfm object is a column-sparse matrix, conceptually a bit trickier than a simple triplet, but more efficient. And it's super easy and fast to convert from dgCMatrix to the triplet dgTMatrix in R if that helps. I think that is what you need for Armadillo's SpMat object type.
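The conversion mentioned above can be sketched with the Matrix package alone. The toy matrix is made up for illustration; `as(m, "TsparseMatrix")` is used here because it yields a dgTMatrix for a numeric dgCMatrix. Note that the `i` and `j` slots are zero-based.

```r
library(Matrix)

# Toy 3 x 3 count matrix in column-compressed form (dgCMatrix)
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(2, 1, 4), dims = c(3, 3))

# Convert to triplet form; slots i and j hold zero-based row and column indices
trip <- as(m, "TsparseMatrix")

# One-based (row, col, count) triplets, one row per non-zero cell
triplets <- data.frame(row = trip@i + 1L, col = trip@j + 1L, count = trip@x)

# Round trip: rebuilding from the triplets gives the original matrix back
rebuilt <- sparseMatrix(i = triplets$row, j = triplets$col, x = triplets$count,
                        dims = dim(m))
same <- all(rebuilt == m)
```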

conjugateprior · 2016-05-24T14:51:42Z

I took a look into this on the plane. I think the right thing to do is extract the initialization code from the current wordfish C++ source and allow it to take a vector of starting thetas. Then you'd construct those starting thetas from an implicitly restarted Lanczos routine for just the top singular vector. The svds function in RSpectra seems to be a reasonable implementation. It should take hardly any memory and also be quicker. Underneath it just calls ARPACK. I have some code for this and can put together a pull request, probably this weekend.
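To illustrate the idea of getting a top singular vector without ever densifying, here is a minimal sketch that swaps in plain power iteration for the Lanczos routine, so that it depends only on the Matrix package. In practice something like `RSpectra::svds(x, k = 1)` would be the natural choice. The function name and the random count matrix are hypothetical, and whether the counts should be centred or rescaled first is left open.

```r
library(Matrix)

# Power iteration on x %*% t(x), using only sparse matrix-vector products.
# Returns an approximation to the top left singular vector of x.
top_left_singular_vector <- function(x, iters = 200, tol = 1e-10) {
  u <- rep(1 / sqrt(nrow(x)), nrow(x))       # uniform unit-length start
  for (k in seq_len(iters)) {
    v <- as.vector(crossprod(x, u))          # t(x) %*% u, length ncol(x)
    u_new <- as.vector(x %*% v)              # x %*% v, length nrow(x)
    u_new <- u_new / sqrt(sum(u_new^2))      # renormalise to unit length
    converged <- sum((u_new - u)^2) < tol
    u <- u_new
    if (converged) break
  }
  u
}

# Random sparse count matrix standing in for a dfm (documents in rows)
set.seed(42)
x <- rsparsematrix(200, 50, density = 0.1,
                   rand.x = function(n) rpois(n, 2) + 1)
theta_start <- top_left_singular_vector(x)
```

Only vectors of length `nrow(x)` and `ncol(x)` are allocated, so memory use stays close to that of the sparse matrix itself.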

kbenoit · 2016-05-24T15:07:37Z

@conjugateprior sounds great!

kbenoit · 2017-01-04T12:45:11Z

@conjugateprior any update on this? Should I get @HaiyanLW working on it instead?

lauderdale · 2017-01-04T12:48:00Z

No update likely in near future, so yes, please do!

kbenoit · 2017-02-27T18:07:21Z

@conjugateprior @lauderdale We have working code (thanks to @HaiyanLW) for this now but are stuck at a sparse method for CA used to obtain starting values. Because this operates on the residual matrix, that matrix tends to be dense, so unless we arbitrarily zero some residuals below a threshold (which we have tried), we are still coercing it to dense.

Is there a plausible alternative to SVD on the residual matrix (aka CA) to get starting values, that would allow us to remain in sparse-land?
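A small sketch of why the residual matrix fills in: in correspondence analysis the standardized residual of a cell is (p_ij - r_i c_j) / sqrt(r_i c_j), so a cell with a zero count still gets the non-zero value -sqrt(r_i c_j). The random matrix below is made up for illustration and is kept small enough that a dense copy is harmless.

```r
library(Matrix)

set.seed(7)
counts <- rsparsematrix(50, 40, density = 0.2,
                        rand.x = function(n) rpois(n, 2) + 1)
# Drop empty rows and columns so that no row or column mass is zero
counts <- counts[rowSums(counts) > 0, colSums(counts) > 0, drop = FALSE]

P <- as.matrix(counts) / sum(counts)   # correspondence matrix (dense on purpose)
row_mass <- rowSums(P)
col_mass <- colSums(P)
E <- outer(row_mass, col_mass)         # expected proportions under independence
S <- (P - E) / sqrt(E)                 # standardized residuals

share_nonzero_counts <- nnzero(counts) / prod(dim(counts))
share_nonzero_resid  <- mean(S != 0)
```

Here `share_nonzero_counts` is about 0.2 by construction, while essentially every cell of `S` is non-zero.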

lauderdale · 2017-02-28T01:50:00Z

I think we should just use less good starting values! I would recommend extracting a small sub-matrix that is relatively non-sparse: for example, the 100 most chatty documents and the 2000 most common features. Do CA on those. Plug those parameters into the relevant spots in the starting values, and then set all the missing document and feature parameters to zero for their start values.

Cheers,
Ben
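The suggestion above could look roughly like the following in R. This is a sketch under assumptions, not the code that ended up in the package: `ca_start_values()` is a hypothetical name, the CA step uses the first singular vectors of the standardized residuals without any further rescaling, and the random matrix stands in for a real dfm.

```r
library(Matrix)

# Hypothetical helper: CA on a small dense block, zeros everywhere else
ca_start_values <- function(x, n_docs = 100, n_feats = 2000) {
  # Longest documents and most frequent features
  doc_idx  <- head(order(rowSums(x), decreasing = TRUE), n_docs)
  feat_idx <- head(order(colSums(x), decreasing = TRUE), n_feats)

  # The block is small, so a dense copy is affordable
  sub <- as.matrix(x[doc_idx, feat_idx, drop = FALSE])
  P <- sub / sum(sub)
  E <- outer(rowSums(P), colSums(P))
  S <- (P - E) / sqrt(E)               # standardized residuals
  S[!is.finite(S)] <- 0                # guard against empty rows or columns

  dec <- svd(S, nu = 1, nv = 1)        # first CA dimension only

  # Everything outside the block starts at zero
  theta <- numeric(nrow(x))
  beta  <- numeric(ncol(x))
  theta[doc_idx] <- dec$u[, 1]
  beta[feat_idx] <- dec$v[, 1]
  list(theta = theta, beta = beta)
}

set.seed(3)
x <- rsparsematrix(500, 3000, density = 0.01,
                   rand.x = function(n) rpois(n, 2) + 1)
start <- ca_start_values(x)
```

The dense block here is at most 100 by 2000, regardless of how large the full matrix is.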

kbenoit · 2017-02-28T09:14:53Z

Good idea. @HaiyanLW we could do this in R and send the starting values to the C++ function (we would not change the user-exposed function signature for this, just send them through). We will need to be careful about how we select the features, however, since just selecting the most common will probably produce mostly stopwords. I can help with this part.
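One possible way to avoid picking mostly stopwords, offered purely as an assumption since the thread does not say which rule was used, is to rank features by total count but exclude those that occur in nearly every document. `select_features()` and its `max_doc_share` cutoff are hypothetical.

```r
library(Matrix)

# Hypothetical rule: most frequent features, excluding near-universal ones
select_features <- function(x, n_feats = 2000, max_doc_share = 0.9) {
  doc_share <- colSums(x > 0) / nrow(x)   # share of documents containing the feature
  total <- colSums(x)                     # total count of the feature
  eligible <- which(doc_share <= max_doc_share)
  eligible[head(order(total[eligible], decreasing = TRUE), n_feats)]
}

# Toy example: feature 1 appears in every document with large counts
# (stopword-like); features 2 and 3 are rarer
toy <- sparseMatrix(i = c(1, 2, 3, 4, 1, 2, 3),
                    j = c(1, 1, 1, 1, 2, 2, 3),
                    x = c(50, 60, 70, 80, 5, 3, 2),
                    dims = c(4, 3))
selected <- select_features(toy, n_feats = 2, max_doc_share = 0.9)
```

In the toy example feature 1 is skipped despite having by far the largest count, and features 2 and 3 are returned.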

kbenoit changed the title ~~Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105~~ wordfish text model blows up when coercing sparse to dense matrix Apr 21, 2016

kbenoit assigned HaiyanLW and lauderdale Nov 6, 2016

kbenoit added enhancement performance labels Nov 6, 2016

cschwem2er mentioned this issue Jan 19, 2017

textmodel_wordfish raises error for large dfm #482

Closed

kbenoit modified the milestone: v1.0 Mar 16, 2017

kbenoit closed this as completed in b8201cb Mar 23, 2017
