Add min threshold to textstat_dist()? #1210
Not exactly a clear feature request, but if you mean an option to return a reduced-size dist object based on the dist values, this is not really workable, since the threshold would differ wildly across the matrix. But maybe I've misunderstood. Are there examples of this for other dist operations (from proxy, for instance)?
I did not have time to explain until now. I noticed that people struggle to calculate pairwise similarity between large numbers of sentences or n-grams to detect reuse of texts, because it generates a large dense matrix. We can make it more efficient if we floor values lower than a threshold to zero. Flooring does not have any impact on the analysis, because people are usually only interested in very high similarity. I haven't really looked at the code yet, but it could be done either in
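To make the flooring idea concrete, here is a minimal language-agnostic sketch in Python (the names `pairwise_floored` and `min_simil` are illustrative, not quanteda's API): similarities below the threshold are simply never stored, so the output grows with the number of strong pairs rather than with n².

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_floored(rows, min_simil):
    """Return only the (i, j, simil) triplets at or above min_simil.

    Values below the threshold are never stored, so the output
    size grows with the number of *strong* pairs, not with n^2.
    """
    out = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            s = cosine(rows[i], rows[j])
            if s >= min_simil:
                out.append((i, j, s))
    return out

docs = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
# With min_simil = 0.9, only the near-duplicate pair (0, 1) survives
```

With a high threshold, the result stays small even when the full similarity matrix would be dense.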
This is where the flooring should happen: Line 218 in 76e2cbf
OK, I see. So in the row/column loop in C++ we would just return a zero if the distance were below the minimum, and that would be one less cell in the sparse object returned by the C++ functions, so a smaller object is returned to R. The current return to R is a dist object, which is an efficient lower-diagonal vector of a distance matrix: basically a vector with a separate record of the object dimensions. But a zero in that object would still be a numeric zero, so we would have to change the return format. The question is whether this is really needed. Is the number of zero cells caused by removing distances below a minimum significant enough to warrant this solution? The current object format for an fcm is a sparse lower-diagonal matrix. The current object format for dist is the same, except that it records small values too. What would the "sparsity" level of a typical dist return be? Finally, are there precedents for doing this with distance computations? Something whose practice we could emulate?
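For reference, the compactness of a dist object comes from storing only the strict lower triangle of the symmetric matrix, column-wise, in a flat vector plus the dimension. A sketch of the index arithmetic (0-based here; R's dist uses the same column-wise layout, 1-based):

```python
def dist_index(i, j, n):
    """Position of element (i, j), i > j, of an n x n symmetric
    distance matrix inside the flat lower-triangle vector
    (column-wise, 0-based). R's dist object stores the same
    layout, 1-based."""
    if i < j:
        i, j = j, i  # the matrix is symmetric
    return n * j - j * (j + 1) // 2 + (i - j - 1)

# For n = 4 the vector holds the pairs in the order
# (1,0) (2,0) (3,0) (2,1) (3,1) (3,2)
```

The vector has n(n-1)/2 cells regardless of how many of them are (near-)zero, which is why flooring alone does not shrink this format.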
Nicolas Merz is facing this issue right now, so we can ask him to test whether flooring helps. I also have a plan to perform a huge pairwise distance calculation in the near future for "fake news". Analysis of text reuse has been restricted by the lack of an efficient tool, but the new argument might create a new trend in text analysis.
It seems that not many functions that take a sparse matrix as input return a dense one.
After being asked about how to do large-N pairwise similarity computation, I am keen on adding this. Lines 122 to 164 in 57a3921
Let's look at some other packages to see how they deal with this issue, e.g. scikit-learn: see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html, or http://na-o-ys.github.io/others/2015-11-07-sparse-vector-similarities.html.
I saw these, but there is no indication that they have a flooring option or something similar. This is why many people are in trouble.
Understood, but I'd like to see some other approach that implements the floor, or a paper justifying it, so that we fully understand the substantive consequences of zeroing small values for all of the metrics that we implement. This could be a test suite, for instance. But it's possible that not all measures are indifferent to near-zero values being treated as zero. The R implementation formerly known as Revolution Analytics uses a parallelized and optimised version of LAPACK for speed, and it has a better
This implementation takes the top n for a similar purpose:
This is the 4th case in two months. We need to address this problem.
I'm fine with seeking solutions to this, but let's base it on something rather than engineer fixes whose statistical or mathematical basis is unproven. (Although I would be OK with demonstrations that this works approximately the same.) I found: And a paper:
@kasperwelbers please share your thoughts with us |
So for my two cents, I agree that you would need to have some form of flooring, either by using a threshold or a top_n. Many documents will have at least one term overlapping, so without flooring you will often be close to n.x * n.y * 8 bytes. What I was aiming for is to simply let people determine a threshold themselves, so that if they are only interested in strong similarities (e.g., recommender systems, duplicates), big comparisons are feasible.

As the proof is in the pudding, I tried testing how much this helps at the meeting, but didn't manage to fix it in time to present it. I just finished it in the airplane (no pun intended) and put it up here: https://github.com/kasperwelbers/matsim. This implementation uses Eigen and is a bit different from quanteda's, but the idea is simply to not store (i.e. floor) certain values. Using this on 50,000 documents goes pretty smoothly, and I'm quite certain it could pull off much larger numbers given a high enough threshold or by using top_n. Of course, computation time still increases exponentially. If I understood Dmitriy correctly, there are ways around this that rely on approximations, but that's beyond my comprehension (also, I dislike approximations).

Regarding threshold versus top_n: I like thresholds, but top n can be nice for things like recommender systems. Dmitriy informed me of his implementation, which is probably very fast (though not on CRAN):
Ken, just for my understanding: when you say "not all measures are indifferent to near-zero values being treated as zero", are you referring to measures that are based on some function of all the document similarity scores, like corpus-level similarity? If it's about measures for document-pair similarity, I think it should be possible for most if not all common measures. But perhaps I'm totally missing something here.
I like rsparse and matsim. I hope Dmitriy will upload his package to CRAN, but he said he is not planning to do so. We could integrate matsim into quanteda or make it a single-function package to publish on CRAN. In either case, I can contribute by parallelizing the computation.
A separate package that only implements various similarity/distance measures could be useful. But if it could be integrated into quanteda (rewritten in Armadillo), that would also work for me. There are already some sparse matrix multiplication implementations in quanteda, but if I'm right, these return a non-sparse matrix, and I'm curious how the way they iterate over the columns compares to the approach used in matsim in terms of speed. We could use the matsim GitHub repo to benchmark some implementations, and then see whether a separate package makes sense, or whether some tweaks in quanteda could fix it. If it's alright with you, I'll copy one of the quanteda parallelized implementations to matsim and rewrite it to be similar to the matsim implementation but in Armadillo. If that doesn't work well, we could try parallelizing the Eigen implementation.
All the similarity measures operate on a sparse document-feature matrix, but they do not return a sparse similarity matrix. Line 218 in 76e2cbf
But popular measures like cosine are computed using Lines 122 to 137 in 919e012
As for parallel computation, using triplets seems to be the most efficient way to build a sparse matrix, as in. The advantage of integrating the sparse similarity function is that quanteda has all the configurations and objects for the parallel computation. The disadvantage is that we have to make the look and feel consistent with existing functions. I am sure that @kbenoit has something to say about this.
Right, so it could be implemented in the current form without major changes, but it would only be useful if the similarity matrix is returned as a sparse matrix, and I suppose there is a reason for not doing that in the first place (and otherwise, it would be a breaking change). Perhaps there is some virtue in only returning the sparse matrix if a threshold is used. That would make it more explicit that with a threshold it is no longer a pure similarity matrix, thus making the substantive consequences of using a threshold the user's own responsibility.
The functions currently return not a sparse matrix but R's dist object. We can just make
Let's first check how many packages are using dist objects. We could probably just change their output object to a matrix and run
Partly, but let's also explore what functions in other packages take dist and simil class objects, since it's not only a compatibility issue but also an extensibility issue. However, we will provide. I'd also like to test those methods when we have values close to zero versus equal to zero, since the zeros can cause special problems in methods that use cross-products or element-wise multiplication, or can cause division by zero.
I finally found time to work on this and created textstat_simil2.
@kasperwelbers, please give it a try. |
@nicmer, please try whether it works with ngrams.
With optimization, it became as fast as the classic version.
Nice work! I noted one minor point of elegance that in some cases might also help in terms of speed and memory efficiency. The output of textstat_simil2 can currently still contain zeros if no limit is used. Since the output is a sparse matrix anyway, these might as well be dropped. For instance:
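The point about explicit zeros can be shown on a triplet-form sparse matrix; in R, `Matrix::drop0()` does this for real sparse matrix objects. A minimal Python sketch of the same idea:

```python
def drop0(triplets):
    """Drop explicitly stored zeros from a triplet-form sparse
    matrix [(row, col, value), ...]. The Matrix package's drop0()
    does the equivalent for dgCMatrix objects in R."""
    return [(i, j, v) for (i, j, v) in triplets if v != 0]

m = [(0, 0, 1.0), (0, 1, 0.0), (1, 1, 0.73)]
# drop0(m) keeps only the two nonzero cells
```

Explicitly stored zeros cost the same memory as any other stored value, so dropping them shrinks the object without changing what it represents.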
The advantage of the current functions is that they work with the default. For the truncation via
Thanks @kasperwelbers for the input. With further optimization, it became twice as fast and as compact as the old similarity function!
A rank argument has been added to get the top n items.
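Whatever the argument is called in the final API, the rank idea boils down to keeping only the n largest values per column. A hedged Python sketch (names and data structures are illustrative):

```python
import heapq

def top_n_per_column(columns, n):
    """Keep only the n largest similarity values in each column.

    `columns` maps a column name to {row_name: similarity}; the
    result has at most n entries per column, regardless of how
    many pairs were computed.
    """
    return {
        col: dict(heapq.nlargest(n, cells.items(), key=lambda kv: kv[1]))
        for col, cells in columns.items()
    }

sims = {"doc1": {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.5}}
# top_n_per_column(sims, 2)["doc1"] keeps only "a" and "c"
```

Unlike a fixed threshold, this bounds the output size at n entries per column even when many pairs are strongly similar.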
I have added all the other distance and similarity measures in the same C++ code. Since they are essentially the same, we can make
I did some stress tests and noted that while the memory issue is solved by the new sparse output, it does not yet scale well in terms of speed. The reason seems to be that the current implementation of simil_mt essentially still uses a dense matrix multiplication algorithm. To see what can be gained, I implemented a sparse solution (see gist) in your script that scales better and is much faster. I haven't made this a pull request, because there are two important trade-offs that you'd need to consider (also, I'm not confident in my parallel processing skills).
To test how it scales, I made three random sparse matrices with an increasing number of columns and the same density. Naturally, this data is not an accurate reflection of a real dtm, which makes this a sloppy benchmark.
Current implementation
Alternative
So, there is much to be gained, but it comes with a substantial amount of hassle, plus the memory trade-off. If you think it's useful, here or for another function (e.g. deduplication), I can lend a hand, since I did something similar in RNewsflow (but more focused on doc similarity and without the cool parallel processing).
@koheiw the package I was trying to think of on Wednesday was **qlcMatrix** at https://CRAN.R-project.org/package=qlcMatrix. This package operates on sparse matrices (which we do already) but also returns a sparse matrix format. I'm not suggesting we use this package, since our code is already pretty optimised for sparse inputs, but it would be interesting to compare performance.
@kasperwelbers thanks. Your code is useful and inspiring. I will try to integrate your sparse computation without introducing the drawbacks you pointed out.
@kasperwelbers I am studying your code, but I don't know why you need to transpose the matrix. Can you tell me why?
@koheiw As you know, a sparse matrix is either column- or row-major. In a column-major matrix you can iterate over the non-zero cells for a given column, but not for a given row. In my code, I use the transpose to iterate over the non-zero cells for a given row. The current dense implementation is:
Effectively, this way all values in m.col(i) are compared to all values in each m.col(j). But if we use the dot product, we actually only need to look at the rows for which m.col(i) is nonzero. For the sake of convenience, let's say that we can iterate over both the columns and the rows. We can then do:

This way, we (1) only ever touch the nonzero values in m, and (2) only use the nonzero values in rows where i is nonzero.
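The row-wise trick described above can be written out in a few lines; here a dict-of-dicts stands in for row-major sparse storage (a real implementation would use CSR/CSC arrays, but the access pattern is the same):

```python
def sparse_dot_columns(col_i, rows):
    """Dot product of column i with every other column, touching
    only nonzero cells.

    `col_i` is column i as {row: value}; `rows` is the transposed
    (row-major) view of the matrix as {row: {col: value}}. For each
    nonzero row of column i we visit only the columns that are
    nonzero in that row -- the trick described above.
    """
    res = {}
    for r, vi in col_i.items():                # nonzero rows of column i
        for j, vj in rows.get(r, {}).items():  # nonzero cols in row r
            res[j] = res.get(j, 0.0) + vi * vj
    return res

# Column 0 of a tiny matrix, and the row-major view of that matrix:
col0 = {0: 2.0, 2: 1.0}
rows = {0: {0: 2.0, 1: 3.0}, 1: {1: 5.0}, 2: {0: 1.0}}
# sparse_dot_columns(col0, rows) -> {0: 5.0, 1: 6.0}
```

Note that row 1 is never touched, because column 0 is zero there; on a very sparse matrix this is where the speedup comes from.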
I should emphasize that this is only useful for very sparse matrices, such as very large DTMs. One of the very nice things about your current implementation is that it would also work on a dense matrix, such as word embeddings. I think it would be hard to achieve this without the transpose (or row-major copy), because you simply need a way to iterate over the nonzero values for a given row.
@kasperwelbers Thanks for the comments. I now better understand your row-by-row approach and agree that it is a clever way to deal with sparse matrices (the transposition makes extraction of rows easier). That said, I am still trying to improve the performance of the col-by-col approach by skipping zero rows, because it is easier to compute other proximity measures with two columns. Lines 64 to 70 in d21e5cd
The non-zero flag improves performance, but not dramatically. I also suspect that arma::mat(mt.col(i)) (conversion to a dense matrix) is taking time, but it becomes considerably slower if I don't do that.
@kasperwelbers, I ended up separating the functions for cosine and correlation from the others to use linear algebra. It also uses a transposed matrix, so it is close to your code. It is faster than the current serial R code and my parallel column-by-column code. I guess that Armadillo multiplies sparse matrices in a similar way to yours, but if you think you can improve speed, why not incorporate your code? It is not fast with dense matrices, but we can switch algorithms based on sparsity. Lines 72 to 76 in 8158298
I ran the function on my large Guardian corpus, and it completed the computation of similarity between 204K^2 pairs on 186K dimensions in 5 hours with near-constant memory usage!

```r
> system.time({
+   out <- textstat_simil2(mt_gur, margin = "documents", min_simil = 0.9)
+ })
      user     system    elapsed
132295.474    326.599  17019.246
> print(object.size(out), units = "Mb")
2578.3 Mb
> dim(mt_gur)
[1] 204061 186021
```
That looks really good! It's likely that arma already has stellar sparse matrix multiplication performance, so I don't think I'd be able to improve on that, but I'll try giving it a shot somewhere next week for the fun of it. If the performance is comparable, one advantage of not relying on arma's matrix multiplication is that it's not restricted to the product sum, and thus could support more similarity/distance metrics.
@kbenoit

```r
> syno = list()
> for (f in head(colnames(out2), 10)) {
+   syno[[f]] = head(sort(out2[, f], decreasing = TRUE), 10)
+ }
> syno
$`fellow-citizens`
fellow-citizens    extensive     elective    executive  examination
      1.0000000    0.9157805    0.8962582    0.8723449    0.8711806
         latter         mode respectively      however  departments
      0.8637614    0.8600896    0.8450003    0.8419395    0.8400058

$of
       of       the        to        by        in       its      from       and        it
1.0000000 0.9922908 0.9761524 0.9623809 0.9616900 0.9575118 0.9489750 0.9463495 0.9447675
     with
0.9377452

$the
      the        of        to        by        in        it       its      from        be
1.0000000 0.9922908 0.9814107 0.9696001 0.9695980 0.9584249 0.9579936 0.9534339 0.9502737
      and
0.9486306

$senate
       senate       genuine   temptations        medium      immunity dispassionate
    1.0000000     0.8854220     0.8834522     0.8834522     0.8834522     0.8834522
  controlling   examination        latter     declining
    0.8765231     0.8466417     0.8268856     0.8115027

$and
      and         ,        to         a         .        in       the       for        of
1.0000000 0.9622326 0.9617753 0.9558603 0.9511407 0.9504642 0.9486306 0.9485077 0.9463495
     with
0.9439638
```

I don't see any reason to define an original class for sparse proximity objects, because there are no methods for them.
Here is a comparison between the new and old functions at different levels of sparsity:

```r
mt1pc   = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 0.01)))
mt10pc  = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 0.1)))
mt50pc  = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 0.5)))
mt100pc = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 1)))
microbenchmark::microbenchmark(
    textstat_simil(mt1pc, margin = "features"),
    textstat_simil(mt10pc, margin = "features"),
    textstat_simil(mt50pc, margin = "features"),
    textstat_simil(mt100pc, margin = "features"),
    textstat_simil2(mt1pc, margin = "features"),
    textstat_simil2(mt10pc, margin = "features"),
    textstat_simil2(mt50pc, margin = "features"),
    textstat_simil2(mt100pc, margin = "features"),
    times = 1
)
```

The new function is much faster when the matrix is sparse, and it is still faster even when the matrix is 100% dense.
If you set |
Your use of
We've agreed this makes sense for similarity, which ranges from -1.0 to 1.0 in the measures we implement (only correlation uses the negative part of that range), but not for distance, since many distance measures are unbounded or not on the same scale.
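A quick numeric illustration of that asymmetry: cosine similarity is bounded and scale-invariant, so a fixed min_simil cutoff transfers across matrices, while Euclidean distance grows with the magnitude of the vectors, so no single min/max threshold is meaningful in general. (Plain-Python sketch; the vectors are made up.)

```python
import math

def cosine(u, v):
    """Cosine similarity: bounded in [-1, 1] and scale-invariant."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean(u, v):
    """Euclidean distance: unbounded, grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

a, b = [1.0, 0.0], [0.0, 1.0]
big_a, big_b = [100.0, 0.0], [0.0, 100.0]

# cosine(a, b) == cosine(big_a, big_b) == 0.0, so one cutoff works
# euclidean(a, b) is ~1.41 but euclidean(big_a, big_b) is ~141.4,
# so a fixed distance cutoff does not transfer across matrices
```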