wordfish text model blows up when coercing sparse to dense matrix #124
Hi Carsten, the problem is not in wordfish itself: before the matrix is sent to wordfish, it gets coerced to a (dense) matrix, and it is that coercion that fails. See line 156 of textmodel-wordfish.R. Just try as.matrix(myDfm) and you should get the same error. So: we need to rewrite the …

Glad you like the package! Any suggestions welcome.
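A rough illustration of why the dense coercion blows up (a sketch in Python/scipy standing in for the R code; the cell count is the dfm length reported in this thread, and the small matrix sizes are arbitrary):

```python
from scipy import sparse

# ~5.08e9 cells, the dfm length reported in this thread
cells = 5_084_391_279
dense_gb = cells * 8 / 1e9               # 8 bytes per double once densified
print(f"dense size: {dense_gb:.1f} GB")  # roughly 40 GB of RAM

# A sparse representation stores only the nonzero entries:
m = sparse.random(1000, 2000, density=0.01, format="csr")
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
dense_bytes = 1000 * 2000 * 8
print(sparse_bytes < dense_bytes)  # True: orders of magnitude smaller
```

So any allocation of the full dense matrix is doomed regardless of what wordfish does afterwards.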
Hi Ken, thanks for the fast response :) I investigated a little, but I'm not sure whether you need SVD-related computations for wordfish. If so, you could maybe work with the irlba package.
@lauderdale See http://gallery.rcpp.org/articles/armadillo-sparse-matrix/ for a description of how to send a sparse matrix to C++.
I took a look at this on the plane. I think the right thing to do is extract the initialization code from the current wordfish C++ source and allow it to take a vector of starting thetas. Then you'd construct those starting thetas from an iteratively restarted Lanczos routine for just the top singular vector.
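The truncated-SVD idea can be sketched like this (Python's scipy.sparse.linalg.svds stands in here for irlba in R; both use implicitly restarted Lanczos/Arnoldi iterations and never form a dense copy; the standardization at the end is an assumption about how the starting thetas would be scaled):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# Toy stand-in for a sparse document-feature matrix
dfm = sparse.random(500, 3000, density=0.02, format="csr", random_state=rng)

# k=1: only the top singular triplet is computed, iteratively, on the sparse matrix
u, s, vt = svds(dfm, k=1)

# Standardized document scores as starting thetas (scaling is an assumption)
theta_start = (u[:, 0] - u[:, 0].mean()) / u[:, 0].std()
print(theta_start.shape)  # one starting value per document
```

The key point is that only matrix-vector products with the sparse matrix are needed, so memory stays proportional to the nonzeros.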
@conjugateprior sounds great!
@conjugateprior any update on this? Should I get @HaiyanLW working on it instead? |
No update likely in near future, so yes, please do!
---
Benjamin Lauderdale
Associate Professor, Department of Methodology
London School of Economics
---
@conjugateprior @lauderdale We have working code (thanks to @HaiyanLW) for this now, but are stuck on a sparse method for the CA used to obtain starting values. Because this operates on the residual matrix, which tends to be dense, we are still coercing to dense, unless we arbitrarily zero out residuals below some threshold (which we have tried). Is there a plausible alternative to SVD on the residual matrix (aka CA) for getting starting values that would allow us to remain in sparse-land?
I think we should just use less good starting values! I would recommend extracting a small sub-matrix that is relatively non-sparse. For example, the 100 most chatty documents and the 2000 most common features. Do CA on those. Plug those parameters into the relevant spots in the starting values, and then set all the missing document and feature parameters to zero for their start values.
Cheers,
Ben
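Ben's suggestion might look roughly like this (a sketch in Python/numpy for illustration; the real implementation would be in R, and the submatrix sizes and the +0.5 smoothing constant are arbitrary choices for the toy data):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
# Toy stand-in for a large sparse document-feature matrix
dfm = sparse.random(5000, 20000, density=0.01, format="csr", random_state=rng)

n_docs, n_feats = 100, 2000
top_docs = np.argsort(-np.asarray(dfm.sum(axis=1)).ravel())[:n_docs]    # chattiest docs
top_feats = np.argsort(-np.asarray(dfm.sum(axis=0)).ravel())[:n_feats]  # most common feats

# Small enough to densify; +0.5 smoothing keeps the toy margins nonzero
sub = dfm[top_docs][:, top_feats].toarray() + 0.5

# Correspondence analysis: SVD of the standardized residual matrix
P = sub / sub.sum()
r = P.sum(axis=1, keepdims=True)
c = P.sum(axis=0, keepdims=True)
resid = (P - r @ c) / np.sqrt(r @ c)
u, s, vt = np.linalg.svd(resid, full_matrices=False)

theta = np.zeros(dfm.shape[0])   # zeros for every omitted document
theta[top_docs] = u[:, 0]        # first CA dimension as starting values
```

Only the small dense submatrix ever exists in memory; everything outside it starts at zero, as Ben proposes.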
Good idea. @HaiyanLW we could do this in R and send the starting values to the C++ function (we would not change the user-exposed function signature for this, just pass them through). We will need to be careful about how we select the features, however, since just taking the most common ones will probably produce mostly stopwords. I can help with this part.
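One plausible selection rule, purely as a sketch (not necessarily what was implemented): rank features by total count, but first drop near-ubiquitous terms, which are mostly stopwords. The 0.5 document-frequency cutoff and the sizes below are arbitrary assumptions:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)
# Toy stand-in for a sparse document-feature matrix
dfm = sparse.random(1000, 5000, density=0.02, format="csr", random_state=rng)

# Fraction of documents each feature appears in, and total counts
doc_freq = np.asarray((dfm > 0).sum(axis=0)).ravel() / dfm.shape[0]
total = np.asarray(dfm.sum(axis=0)).ravel()

candidates = np.flatnonzero(doc_freq < 0.5)          # drop near-ubiquitous terms
top = candidates[np.argsort(-total[candidates])[:2000]]
print(top.shape)  # indices of the 2000 selected features
```

In practice a stopword list or tf-idf-style weighting could replace the crude document-frequency cutoff.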
Hi,
I'm currently trying to fit scaling models on a large dfm (length 5,084,391,279; ~190,000 documents), which results in an error:
Do you have an idea what's going on here?
On a side note, I'm pretty impressed by how fast sparse dfm matrices can be generated with quanteda. I discovered you out of frustration with tm :-)

Cheers,
Carsten