wordfish text model blows up when coercing sparse to dense matrix #124

Closed
cschwem2er opened this issue Apr 21, 2016 · 10 comments
@cschwem2er

cschwem2er commented Apr 21, 2016

Hi,

I'm currently trying to fit scaling models on a larger dfm (length 5084391279, ~190,000 documents), which results in an error:

wf <- textmodel(myDfm, model = "wordfish")


Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Do you have an idea what's going on here?

On a side note, I'm pretty impressed by how fast sparse dfm matrices can be generated with quanteda. I discovered the package out of frustration with tm :-)

Cheers,
Carsten

@kbenoit
Collaborator

kbenoit commented Apr 21, 2016

Hi Carsten,

Before the matrix is sent to wordfish, it gets coerced to a (dense) matrix. The problem is not in wordfish itself but in that coercion. See line 156 of textmodel-wordfish.R.

Just try

as.matrix(myDfm)

and you should get the same error.
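A rough back-of-the-envelope check of why the coercion fails. This is only a sketch: it assumes the reported length of 5084391279 is the total number of cells, that a dense matrix stores each cell as an 8-byte double, and (inferred from the error message, not confirmed in this thread) that the dense allocation is limited to a cell count that fits in a 32-bit integer.

```r
# Number of cells reported for the dfm in the original post
n_cells <- 5084391279

# Memory a dense double-precision matrix of that size would need, in GiB
dense_gib <- n_cells * 8 / 1024^3

# Assumption: the dense allocation requires the cell count to fit in a
# 32-bit signed integer; R exposes that limit as .Machine$integer.max
exceeds_int <- n_cells > .Machine$integer.max

print(round(dense_gib, 2))
print(exceeds_int)
```

Under these assumptions the dense copy would need close to 38 GiB and has more cells than a 32-bit index can address, so the failure is expected regardless of how sparse the dfm is.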

So: we need to rewrite wordfish.cpp to compute the quantities directly from the sparse matrix. It's on the list…

Glad you like the package! Any suggestions welcome.

@kbenoit kbenoit changed the title Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105 wordfish text model blows up when coercing sparse to dense matrix Apr 21, 2016
@cschwem2er
Author

Hi Ken, thanks for the fast response :)

I investigated a little bit and I'm not sure whether you need SVD-related computations for wordfish. But if so, the irlba package might work.

@kbenoit
Collaborator

kbenoit commented May 24, 2016

@lauderdale See http://gallery.rcpp.org/articles/armadillo-sparse-matrix/ for a description of how to send a sparse matrix to C++. A dfm object is a column-sparse matrix, conceptually a bit trickier than a simple triplet, but more efficient. And it's super easy and fast to convert from dgCMatrix to the triplet dgTMatrix in R if that helps. I think that is what you need for Armadillo's SpMat object type.
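For illustration, a minimal sketch of the dgCMatrix-to-triplet conversion mentioned above, using the Matrix package. The toy matrix `m` stands in for a real dfm and is purely hypothetical; whether the triplet form is actually what the C++ side needs is left open here, as in the comment.

```r
library(Matrix)

# Toy matrix standing in for a dfm (3 documents x 4 features)
m <- sparseMatrix(i = c(1, 2, 3, 3), j = c(1, 2, 2, 4), x = c(2, 1, 5, 3))
print(class(m))  # column-compressed storage

# Coerce to triplet form: one explicit (row, column, value) entry per non-zero
tm <- as(m, "TsparseMatrix")
print(class(tm))

# The i and j slots are 0-based, which matches C++ indexing
triplets <- data.frame(row = tm@i, col = tm@j, value = tm@x)
print(triplets)
```

The coercion only re-indexes the non-zero entries, so it never allocates the dense matrix.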

@conjugateprior
Collaborator

I took a look into this on the plane. I think the right thing to do is extract the initialization code from the current wordfish C++ source and allow it to take a vector of starting thetas. Then you'd construct those starting thetas from an implicitly restarted Lanczos routine for just the top singular vector. The svds function in RSpectra seems to be a reasonable implementation. It should take hardly any memory and also be quicker. Underneath it just calls ARPACK. I have some code for this and can put together a pull request, probably this weekend.
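A minimal sketch of the idea above, assuming the RSpectra and Matrix packages are installed. The random sparse matrix `m` is a hypothetical stand-in for a dfm, and `theta_start` is an illustrative name; how the wordfish C++ code would accept these values is not shown and is not settled in this thread.

```r
library(Matrix)
library(RSpectra)

set.seed(1)
# Random sparse matrix standing in for a dfm (200 documents x 50 features)
m <- rsparsematrix(200, 50, density = 0.05)

# Compute only the leading singular triplet; m stays sparse throughout
top <- svds(m, k = 1)

# Leading left singular vector: one value per document,
# usable as candidate starting values for theta
theta_start <- top$u[, 1]
print(length(theta_start))
```

Because only one singular vector is requested, the solver needs just matrix-vector products with the sparse matrix, which is why memory use stays low.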

@kbenoit
Collaborator

kbenoit commented May 24, 2016

@conjugateprior sounds great!

@kbenoit
Collaborator

kbenoit commented Jan 4, 2017

@conjugateprior any update on this? Should I get @HaiyanLW working on it instead?

@lauderdale
Collaborator

lauderdale commented Jan 4, 2017 via email

@kbenoit
Collaborator

kbenoit commented Feb 27, 2017

@conjugateprior @lauderdale We have working code (thanks to @HaiyanLW) for this now but are stuck on a sparse method for the CA used to obtain starting values. Because CA operates on the residual matrix, and that matrix tends to be dense, we are still coercing to dense unless we arbitrarily zero out residuals below some threshold (which we have tried).

Is there a plausible alternative to SVD on the residual matrix (aka CA) to get starting values, that would allow us to remain in sparse-land?

@lauderdale
Collaborator

lauderdale commented Feb 28, 2017 via email

@kbenoit
Collaborator

kbenoit commented Feb 28, 2017

Good idea. @HaiyanLW we could do this in R and send the starting values to the C++ function (we would not change the user-exposed function signature for this, just pass them through). We will need to be careful about how we select the features, however, since just selecting the most common ones will probably produce mostly stopwords. I can help with this part.

@kbenoit kbenoit modified the milestone: v1.0 Mar 16, 2017