Add min threshold to textstat_dist()? #1210

Closed
koheiw opened this issue Feb 1, 2018 · 45 comments · Fixed by #1694

Comments

@koheiw
Collaborator

koheiw commented Feb 1, 2018

No description provided.

@kbenoit
Collaborator

kbenoit commented Feb 1, 2018

Not exactly a clear feature request, but if you mean an option to return a reduced size dist object based on the dist values, this is not really workable since the threshold would differ wildly across the matrix.

But maybe I've misunderstood. Are there examples of this for other dist operations? (from proxy for instance?)

@koheiw
Collaborator Author

koheiw commented Feb 3, 2018

I did not have time to explain until now. I noticed that people are struggling to calculate pairwise similarities between a large number of sentences or n-grams to detect text reuse, because it generates a large dense matrix.

We can make it more efficient if we floor values lower than a threshold to zero. Flooring does not have any impact on the analysis, because people are usually only interested in very high similarities. I haven't really looked at the code yet, but it could be done either in textstat_simil() or textstat_dist().
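
For illustration, a minimal R sketch of the flooring idea (floor_simil is a hypothetical helper, not part of quanteda, and it assumes a dense similarity matrix as input):

library(Matrix)

# hypothetical helper: zero out small similarities and store the result sparsely
floor_simil <- function(simil, min_simil = 0.8) {
    simil[simil < min_simil] <- 0    # floor values below the threshold to zero
    Matrix(simil, sparse = TRUE)     # keep only the surviving cells
}

Only cells at or above the threshold survive, which is what makes very large pairwise comparisons fit in memory.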

@koheiw
Collaborator Author

koheiw commented Feb 3, 2018

This is where the flooring should happen:

rmat(j,i) = (double)sum(bb) * (double)ncol/(double)non_zeros.n_elem;

@kbenoit
Collaborator

kbenoit commented Feb 3, 2018

OK, I see. So in the row/column loop in C++ we would just return a zero if the distance were below the minimum, and that would be one less cell in the sparse object returned by the C++ functions, so it is a smaller object returned to R.

The current return to R is a dist object, which is an efficient lower-diagonal vector of a distance matrix, basically a vector with a separate record of the object dimensions. But a zero in that object would still be a numeric zero, so we would have to change the return from textstat_dist() to something like what the fcm class currently is. We would also need an as.dist() coercion method for that.
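
For reference, a quick base-R illustration of that dist format (just to show what dist stores; this is not quanteda code):

d <- dist(matrix(1:6, nrow = 3))   # 3 observations -> 3 pairwise distances
unclass(d)                         # the underlying lower-triangle vector
attr(d, "Size")                    # the recorded number of observations (3)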

The question is whether this is really needed. Is the number of zero cells caused by removing distances below a minimum significant enough to warrant this solution? The current object format for an fcm is a sparse lower-diagonal matrix. The current object format for dist is the same, except that it records small values too. What would be the "sparsity" level of a typical dist return?

Finally, are there precedents for doing this with distance computations? Something whose practice we could emulate?

@koheiw
Collaborator Author

koheiw commented Feb 3, 2018

Nicolas Merz is facing this issue right now, so we can ask him to test whether flooring helps. I also have a plan to perform a huge pairwise distance calculation in the near future for "fake news". Analysis of text reuse has been restricted by the lack of an efficient tool, and the new argument might create a new trend in text analysis.

wordspace::dist.matrix returns a matrix, but can be coerced to a dist object.
https://www.rdocumentation.org/packages/wordspace/versions/0.2-0/topics/dist.matrix

spam::nearestdist has delta as a threshold.
https://www.rdocumentation.org/packages/spam/versions/2.1-1/topics/nearestdist

It seems that few functions that take a sparse matrix as input return a dense dist object as output.

@koheiw koheiw self-assigned this Mar 20, 2018
@koheiw koheiw added this to the v1.5 milestone Mar 20, 2018
@koheiw
Collaborator Author

koheiw commented Mar 21, 2018

After being asked how to perform large-N pairwise similarity computations, I am keen on adding dist_floor or something like that. However, I noticed that the most popular measures (cosine and correlation) are calculated with Matrix::crossprod(), so we would need to write its equivalent in parallel C++ with a floor argument.

quanteda/R/textstat_simil.R

Lines 122 to 164 in 57a3921

cosine_simil <- function(x, y = NULL, margin = 1) {
    if (!(margin %in% 1:2)) stop("margin can only be 1 (rows) or 2 (columns)")
    if (margin == 1) x <- t(x)
    S <- rep(1, nrow(x))
    N <- Matrix::Diagonal(x = sqrt(colSums(x ^ 2)) ^ -1)
    x <- x %*% N
    if (!is.null(y)) {
        if (margin == 1) y <- t(y)
        N <- Matrix::Diagonal(x = sqrt(colSums(y ^ 2)) ^ -1)
        y <- y %*% N
        return(as.matrix(Matrix::crossprod(x, y)))
    } else
        return(as.matrix(Matrix::crossprod(x)))
}

# Pearson correlation
correlation_simil <- function(x, y = NULL, margin = 1) {
    if (!(margin %in% 1:2)) stop("margin can only be 1 (rows) or 2 (columns)")
    func_cp <- if (margin == 2) Matrix::crossprod else Matrix::tcrossprod
    func_tcp <- if (margin == 2) Matrix::tcrossprod else Matrix::crossprod
    func_sum <- if (margin == 2) colSums else rowSums
    n <- if (margin == 2) nrow(x) else ncol(x)
    mux <- if (margin == 2) colMeans(x) else rowMeans(x)
    if (!is.null(y)) {
        stopifnot(if (margin == 2) nrow(x) == nrow(y) else ncol(x) == ncol(y))
        muy <- if (margin == 2) colMeans(y) else rowMeans(y)
        covmat <- (as.matrix(func_cp(x, y)) - n * tcrossprod(mux, muy)) / (n - 1)
        sdvecX <- sqrt((func_sum(x ^ 2) - n * mux ^ 2) / (n - 1))
        sdvecY <- sqrt((func_sum(y ^ 2) - n * muy ^ 2) / (n - 1))
        cormat <- covmat / tcrossprod(sdvecX, sdvecY)
    } else {
        covmat <- (as.matrix(func_cp(x)) - drop(n * tcrossprod(mux))) / (n - 1)
        sdvec <- sqrt(diag(covmat))
        cormat <- covmat / tcrossprod(sdvec)
    }
    cormat
}

@kbenoit
Collaborator

kbenoit commented Mar 21, 2018

@koheiw
Collaborator Author

koheiw commented Mar 21, 2018

I saw these, but there is no indication that they have a flooring option or anything similar. This is why so many people are running into trouble.

@kbenoit
Collaborator

kbenoit commented Mar 21, 2018

Understood, but I'd like to see some other approach that implements the floor, or a paper justifying it, so that we fully understand the substantive consequences of zeroing small values for all of the metrics that we implement. This could be a test suite, for instance. But it's possible that not all measures are indifferent to near-zero values being treated as zero.

The R implementation formerly known as Revolution Analytics uses a parallelized and optimised version of LAPACK for speed, and it has a better crossprod implementation, I believe. We cannot use that directly but we can learn from the source. See http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html.

@koheiw
Collaborator Author

koheiw commented Mar 21, 2018

@koheiw
Collaborator Author

koheiw commented Mar 24, 2018

This is the fourth case in two months. We need to address this problem.
https://stackoverflow.com/questions/49437226/similarity-between-documents-in-big-dataset-in-r

@kbenoit
Collaborator

kbenoit commented Mar 25, 2018

I’m fine with seeking solutions to this, but let’s base it on something rather than engineer fixes whose statistical or mathematical basis is unproven. (Although I would be ok with demonstrations that this works approximately the same.)

I found:
https://stackoverflow.com/questions/40900608/cosine-similarity-on-large-sparse-matrix-with-numpy
https://stats.stackexchange.com/questions/61085/cosine-similarity-on-sparse-matrix
http://text2vec.org/similarity.html
https://stanford.edu/~rezab/classes/cme323/S15/notes/lec10.pdf

And a paper:
https://www.cs.cmu.edu/~wcohen/postscript/ecai2010-pic-final.pdf

@koheiw
Collaborator Author

koheiw commented Apr 22, 2018

@kasperwelbers please share your thoughts with us

@kasperwelbers

For my two cents, I agree that you would need to have some form of flooring, either by using a threshold or a top_n. Many documents will have at least one overlapping term, so without flooring you will often be close to n.x * n.y * 8 bytes. What I was aiming for is to simply let people determine a threshold themselves, so that if they are only interested in strong similarities (e.g., recommender systems, duplicates), big comparisons are feasible.

As the proof is in the pudding, I tried testing how much this helps at the meeting, but didn't manage to fix it in time to present it. I just finished it on the airplane (no pun intended) and put it up here: https://github.com/kasperwelbers/matsim. This implementation uses Eigen and is a bit different from quanteda's, but the idea is simply to not store (i.e. floor) certain values. Using this on 50,000 documents goes pretty smoothly, and I'm quite certain it could pull off much larger numbers given a high enough threshold or by using top_n. Of course, computation time still increases quadratically with the number of documents. If I understood Dmitriy correctly, there are ways around this that rely on approximations, but that's beyond my comprehension (also, I dislike approximations).

Regarding threshold versus top_n: I like thresholds, but top_n can be nice for things like recommender systems. Dmitriy informed me of his implementation, which is probably very fast (though not on CRAN):
https://github.com/dselivanov/rsparse/blob/f488decc4d3cf3d2735a19ada96ba4559250a8d4/R/utils.R#L30. An additional advantage of the top_n approach is that you know the max required capacity (n.x * top_n).
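
To illustrate the top_n idea, here is a small R sketch (keep_top_n is a hypothetical helper and assumes a dense similarity matrix): keep the top_n largest values per column and zero out the rest, so the sparse result needs at most ncol(simil) * top_n cells (ties aside).

library(Matrix)

# hypothetical helper: retain only the n largest similarities in each column
keep_top_n <- function(simil, top_n = 10) {
    filtered <- apply(simil, 2, function(v) {
        cutoff <- sort(v, decreasing = TRUE)[min(top_n, length(v))]
        v[v < cutoff] <- 0    # drop everything below the n-th largest value
        v
    })
    Matrix(filtered, sparse = TRUE)
}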

@kasperwelbers

Ken, just for my understanding. When you say:

"not all measures are indifferent to near-zero values being treated as zero"

are you referring to measures that are based on some function of all the document similarity scores, like corpus-level similarity? If it's about measures for document pair similarity, I think it should be possible for most if not all common measures. But perhaps I'm totally missing something here.

@koheiw
Collaborator Author

koheiw commented Apr 27, 2018

I like rsparse and matsim. I hope Dmitriy will upload his package to CRAN, but he said he is not planning to do so. We could integrate matsim into quanteda or make it a single-function package aimed at publication on CRAN. In either case, I can contribute to parallelizing the computation.

@kasperwelbers

A separate package that only implements various similarity/distance measures could be useful. But if it could be integrated into quanteda (rewritten in Armadillo) that would also work for me.

So, there are already some sparse matrix multiplication implementations in quanteda, but if I'm right, these return a non-sparse matrix, and I'm curious how the way in which they iterate over the columns compares to the approach used in matsim in terms of speed. We could use the matsim GitHub repo to benchmark some implementations, and then see whether a separate package makes sense, or whether some tweaks in quanteda could fix it.

If it's alright with you, I'll copy one of the quanteda parallelized implementations to matsim and rewrite it to be similar to the matsim implementation but in Armadillo. If that doesn't work well, we could try parallelizing the Eigen implementation.

@koheiw
Collaborator Author

koheiw commented Apr 27, 2018

All of the similarity measures take a sparse document-feature matrix as input, but none of them return a sparse similarity matrix.
Some of the similarity measures are computed in a C++ loop, where we can easily floor small values. For example,

rmat(j,i) = (double)sum(bb) * (double)ncol/(double)non_zeros.n_elem;

But popular measures like cosine are computed using crossprod(), so we would need to rewrite this part in C++.

quanteda/R/textstat_simil.R

Lines 122 to 137 in 919e012

cosine_simil <- function(x, y = NULL, margin = 1) {
    if (!(margin %in% 1:2)) stop("margin can only be 1 (rows) or 2 (columns)")
    if (margin == 1) x <- t(x)
    S <- rep(1, nrow(x))
    N <- Matrix::Diagonal(x = sqrt(colSums(x ^ 2)) ^ -1)
    x <- x %*% N
    if (!is.null(y)) {
        if (margin == 1) y <- t(y)
        N <- Matrix::Diagonal(x = sqrt(colSums(y ^ 2)) ^ -1)
        y <- y %*% N
        return(as.matrix(Matrix::crossprod(x, y)))
    } else
        return(as.matrix(Matrix::crossprod(x)))
}

As for parallel computation, using triplets seems to be the most efficient way to make a sparse matrix, as in fcm(). It is basically TBB's concurrent_vector. We used to use Armadillo's sparse matrix as an output object, but it kept crashing when the data was large.

The advantage of integrating the sparse similarity function is that quanteda already has all the configuration and objects for the parallel computation. The disadvantage is that we have to make the look and feel consistent with the existing functions. I am sure that @kbenoit has something to say about this.

@kasperwelbers

Right, so it could be implemented in the current form without major changes, but it would only be useful if the similarity matrix is returned as a sparse matrix, and I suppose there is a reason for not doing that in the first place (and otherwise, it would be a breaking change).

Perhaps there is some virtue in only returning the sparse matrix if a threshold is used. That would make it more explicit that, by using a threshold, it is no longer a pure similarity matrix, thus making the substantive consequences of using a threshold the user's own responsibility.

@koheiw
Collaborator Author

koheiw commented Apr 30, 2018

The functions currently return not a sparse matrix but R's dist object. We can simply provide as.dist() for conversion, however. Changing the API specification is the trickiest part, so we should aim for the next milestone release, probably v1.5.

@koheiw
Collaborator Author

koheiw commented May 2, 2018

Let's first check how many packages are using textstat_simil() or textstat_dist(), to decide whether making a separate package is a safer option. I guess there aren't many.

We could probably just change their output object to a matrix and run revdep_check().

@kbenoit
Collaborator

kbenoit commented May 2, 2018

We could probably just change their output object to a matrix and run revdep_check().

Partly, but let's also explore what functions in other packages take dist and simil class objects, since it's not only a compatibility issue but also an extensibility issue. However, we will provide as.dist() and as.simil() methods, so those should take care of that problem.

I'd also, however, like to test those methods when we have values close to zero versus equal to zero, since zeros can cause special problems in methods that use cross-products or element-wise multiplication, or that can lead to division by zero.
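
A hedged sketch of the kind of check this implies, comparing a downstream method on exact versus floored similarities (mt stands for any dfm; the 0.5 floor is arbitrary):

s_full  <- as.matrix(textstat_simil(mt, method = "cosine"))
s_floor <- s_full
s_floor[s_floor < 0.5] <- 0              # treat small similarities as exact zeros
h_full  <- hclust(as.dist(1 - s_full))
h_floor <- hclust(as.dist(1 - s_floor))
all.equal(h_full$merge, h_floor$merge)   # do the cluster trees still agree?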

@koheiw
Collaborator Author

koheiw commented Sep 18, 2018

I have finally found time to work on this and created textstat_simil2() with a min_simil argument in the dev-simil branch. The classic linear algebra approach is faster, but the new output object (dsCMatrix) is much smaller than the dist object when a threshold is applied: for example, with min_simil = 0.5 it is only 5% of the size of the dist object.

> quanteda_options(threads = 8)
> #mt <- dfm(c("a b c", "a b d", "b d e", "e"))
> #mt <- dfm(corpus_reshape(data_corpus_inaugural, "sentence"))
> mt <- dfm(data_corpus_inaugural)
> 
> # cosine ---------------
> 
> microbenchmark::microbenchmark(
+     textstat_simil2(mt, margin = "features"),
+     textstat_simil2(mt, margin = "features", min_simil = 0.5),
+     textstat_simil(mt, margin = "features", method = "cosine"),
+     times = 10
+ )
Unit: seconds
                                                       expr      min       lq     mean   median       uq      max neval
                   textstat_simil2(mt, margin = "features") 6.871598 6.934440 7.272729 7.253378 7.567087 7.952743    10
  textstat_simil2(mt, margin = "features", min_simil = 0.5) 3.183344 3.288676 3.344054 3.323054 3.423212 3.482286    10
 textstat_simil(mt, margin = "features", method = "cosine") 2.027438 2.094070 2.459622 2.233746 2.667914 3.540729    10

> cos_new1 <- textstat_simil2(mt, margin = "feature", method = "cosine")
> cos_new2 <- textstat_simil2(mt, margin = "feature", method = "cosine", min_simil = 0.5)
> cos_old <- textstat_simil(mt, margin = "feature", method = "cosine")
> 
> print(object.size(cos_new1), units = "MB")
502.2 Mb
> print(object.size(cos_new2), units = "MB")
17.1 Mb
> print(object.size(cos_old), units = "MB")
334.5 Mb

@koheiw
Collaborator Author

koheiw commented Sep 18, 2018

@kasperwelbers, please give it a try.

@koheiw
Collaborator Author

koheiw commented Sep 18, 2018

@nicmer, please check whether it works with ngrams.

@koheiw
Collaborator Author

koheiw commented Sep 18, 2018

After optimization, it has become as fast as the classic version.

Unit: seconds
                                                       expr      min       lq     mean   median       uq      max neval
                   textstat_simil2(mt, margin = "features") 4.300584 4.410055 4.479112 4.445247 4.545219 4.865179    10
  textstat_simil2(mt, margin = "features", min_simil = 0.5) 2.448227 2.579752 2.608285 2.629392 2.672302 2.692456    10
  textstat_simil2(mt, margin = "features", min_simil = 0.8) 2.460991 2.490255 2.562686 2.546120 2.629364 2.684760    10
 textstat_simil(mt, margin = "features", method = "cosine") 4.525663 4.636489 4.830470 4.831995 4.994409 5.124913    10

@kasperwelbers

kasperwelbers commented Sep 18, 2018

Nice work!

I noted one minor point of elegance that in some cases might also help in terms of speed and memory efficiency. The output of textstat_simil2 can currently still contain zeros if no limit is used. Since the output is a sparse matrix anyway, these might as well be dropped. For instance:
if (simil != 0 && simil >= limit) { simil_tri_temp.push_back(std::make_tuple(i, j, simil)); }

@kbenoit
Collaborator

kbenoit commented Sep 18, 2018

The advantage of the current functions is that they work with the default dist()/simil() functions and produce objects of the same classes. The disadvantage of that format (as we have stated here) is that it's dense (even if it's only a lower diagonal).

For the truncation via min_simil we would want to keep it fully sparse and therefore not record the zeros, since that is inefficient, but then we would need to structure this in a way that easily converts to a dist class object (via e.g. as.dist()). We would probably want to create our own sparse dist class that behaves more or less the same as dist does, but without recording zeros.

@koheiw
Collaborator Author

koheiw commented Sep 18, 2018

Thanks @kasperwelbers for the input. With further optimization, it has become twice as fast and twice as compact as the old similarity function!

Unit: seconds
                                                       expr      min       lq     mean   median       uq      max neval
                   textstat_simil2(mt, margin = "features") 2.254619 2.276864 2.338764 2.333284 2.376038 2.454407    10
  textstat_simil2(mt, margin = "features", min_simil = 0.5) 1.885950 1.894984 1.926533 1.912955 1.951385 2.007386    10
  textstat_simil2(mt, margin = "features", min_simil = 0.8) 1.841840 1.854850 1.896156 1.873338 1.950067 1.980754    10
 textstat_simil(mt, margin = "features", method = "cosine") 4.607828 4.662541 4.842706 4.767185 5.024243 5.310173    10
> cos_new1 <- textstat_simil2(mt, margin = "feature", method = "cosine")
> cos_new2 <- textstat_simil2(mt, margin = "feature", method = "cosine", min_simil = 0.5)
> cos_old <- textstat_simil(mt, margin = "feature", method = "cosine")
> print(object.size(cos_new1), units = "MB")
156.6 Mb
> print(object.size(cos_new2), units = "MB")
22.5 Mb
> print(object.size(cos_old), units = "MB")
334.5 Mb

@koheiw
Collaborator Author

koheiw commented Sep 19, 2018

A rank argument has been added to get the top n items.

> textstat_simil2(mt, rank = 5)[,1:3]
58 x 3 sparse Matrix of class "dgTMatrix"
                1789-Washington 1793-Washington 1797-Adams
1789-Washington       1.0000000       .          .        
1793-Washington       .               1.0000000  .        
1797-Adams            .               .          1.0000000
1801-Jefferson        .               .          0.9573024
1805-Jefferson        .               .          0.9453655
1809-Madison          .               .          .        
1813-Madison          .               .          .        
1817-Monroe           .               .          .        
1821-Monroe           0.9457271       .          .        
1825-Adams            .               0.8509694  .        
1829-Jackson          .               .          .        
1833-Jackson          .               .          .        
1837-VanBuren         .               .          .        
1841-Harrison         0.9470858       0.8549139  .        
1845-Polk             .               .          .        
1849-Taylor           .               0.8408378  .        
1853-Pierce           0.9437326       .          0.9415191
1857-Buchanan         .               .          .        
1861-Lincoln          .               .          .        
1865-Lincoln          .               .          .        
1869-Grant            .               .          .        
1873-Grant            .               .          .        
1877-Hayes            0.9510748       0.8392405  .        
1881-Garfield         .               .          .        
1885-Cleveland        .               .          .        
1889-Harrison         .               .          .        
1893-Cleveland        .               .          .        
1897-McKinley         .               .          0.9378949
1901-McKinley         .               .          .        
1905-Roosevelt        .               .          .        
1909-Taft             .               .          .        
1913-Wilson           .               .          .        
1917-Wilson           .               .          .        
1921-Harding          .               .          .        
1925-Coolidge         .               .          .        
1929-Hoover           .               .          .        
1933-Roosevelt        .               .          .        
1937-Roosevelt        .               .          .        
1941-Roosevelt        .               .          .        
1945-Roosevelt        .               .          .        
1949-Truman           .               .          .        
1953-Eisenhower       .               .          .        
1957-Eisenhower       .               .          .        
1961-Kennedy          .               .          .        
1965-Johnson          .               .          .        
1969-Nixon            .               .          .        
1973-Nixon            .               .          .        
1977-Carter           .               .          .        
1981-Reagan           .               .          .        
1985-Reagan           .               .          .        
1989-Bush             .               .          .        
1993-Clinton          .               .          .        
1997-Clinton          .               .          .        
2001-Bush             .               .          .        
2005-Bush             .               .          .        
2009-Obama            .               .          .        
2013-Obama            .               .          .        
2017-Trump            .               .          .        

@koheiw
Collaborator Author

koheiw commented Sep 19, 2018

We can actually convert the output object to a dist object just by:

d <- as.dist(as.matrix(textstat_simil2(mt, min_simil = 0.8)))
plot(hclust(d))

[Rplot: dendrogram from plot(hclust(d))]

@koheiw
Collaborator Author

koheiw commented Sep 20, 2018

I have added all of the other distance and similarity measures to the same C++ code. Since they are essentially the same, we can make textstat_proxy() with type = c("dist", "simil"). textstat_simil() and textstat_dist() will use the new function internally to return simil and dist objects.
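
A rough sketch of that wrapper idea (the argument names, such as min_proxy, are illustrative only and not the final API):

# illustrative only: the user-facing functions become thin wrappers
textstat_simil <- function(x, method = "cosine", min_simil = NULL, ...) {
    textstat_proxy(x, type = "simil", method = method, min_proxy = min_simil, ...)
}
textstat_dist <- function(x, method = "euclidean", ...) {
    textstat_proxy(x, type = "dist", method = method, ...)
}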

@kasperwelbers

I did some stress tests and noted that while the memory issue is solved by the new sparse output, it does not yet scale well in terms of speed. The reason seems to be that the current implementation of simil_mt essentially still uses a dense matrix multiplication algorithm.

To see what can be gained, I implemented a sparse solution (see gist) in your script that scales better and is much faster. I haven't made this a pull request, because there are two important trade-offs that you'd need to consider (also, I'm not confident in my parallel processing skills).

    • the implementation requires a copy (transpose, or row-major) of the input matrix, which is less memory efficient.
    • calculating certain similarity measures becomes much less elegant, and some might not be (easily) possible. I've now only added cosine and correlation.

To test how it scales, I made three random sparse matrices with an increasing number of columns and the same density. Naturally, this data is not an accurate reflection of a natural dtm, which makes this a sloppy benchmark.

> mt1000 = abs(Matrix::rsparsematrix(10000, 1000, density=0.02))
> mt10000 = abs(Matrix::rsparsematrix(10000, 10000, density=0.02))
> mt100000 = abs(Matrix::rsparsematrix(10000, 100000, density=0.02))

> mt1000 = as.dfm(mt1000)
> mt10000 = as.dfm(mt10000)
> mt100000 = as.dfm(mt100000)


Current implementation
For the current implementation, I could only compute the first two, as the third one took forever on my laptop.

> microbenchmark::microbenchmark(
+   textstat_simil2(mt1000, margin = "features", min_simil = 0.5),
+   textstat_simil2(mt10000, margin = "features", min_simil = 0.5),
+   #textstat_simil2(mt100000, margin = "features", min_simil = 0.5),
+   times = 1
+ )

Unit: seconds
                                                            expr        min         lq       mean     median         uq        max neval
  textstat_simil2(mt1000, margin = "features", min_simil = 0.5)   4.811078   4.811078   4.811078   4.811078   4.811078   4.811078     1
 textstat_simil2(mt10000, margin = "features", min_simil = 0.5) 512.290270 512.290270 512.290270 512.290270 512.290270 512.290270     1

Alternative
The alternative implementation is much faster. Do note that the gain depends on sparsity. On the inaugural dtm (same size as mt1000, but less sparse) it's only about 3 times as fast. The important thing is that it scales better.

Unit: seconds
                                                             expr          min           lq         mean       median           uq          max neval
   textstat_simil2(mt1000, margin = "features", min_simil = 0.5)   0.05013938   0.05013938   0.05013938   0.05013938   0.05013938   0.05013938     1
  textstat_simil2(mt10000, margin = "features", min_simil = 0.5)   2.71005598   2.71005598   2.71005598   2.71005598   2.71005598   2.71005598     1
 textstat_simil2(mt100000, margin = "features", min_simil = 0.5) 216.37871924 216.37871924 216.37871924 216.37871924 216.37871924 216.37871924     1

So, there is much to be gained, but it comes with a substantial amount of hassle, plus the memory trade-off. If you think it's useful, here or for another function (e.g. deduplication), I can lend a hand, since I did something similar in RNewsflow (but more focused on doc similarity and without the cool parallel processing).

@kbenoit
Collaborator

kbenoit commented Sep 20, 2018

@koheiw the package I was trying to think of on Wednesday was qlcMatrix at https://CRAN.R-project.org/package=qlcMatrix. This package operates on sparse matrices (which we already do) but also returns a sparse matrix format.

For instance qlcMatrix::cosSparse() computes sparse cosine matrices and returns a sparse object, so it might be worth studying that return object format.

I'm not suggesting we use this package since our code is already pretty optimised for sparse inputs, but it would be interesting to compare performance.
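
For comparison, a minimal usage sketch (this assumes the qlcMatrix package is installed and that mt is a dfm; cosSparse() works column-wise, hence the transpose for document-document similarities):

library(qlcMatrix)
m  <- as(mt, "dgCMatrix")   # a dfm is a sparse dgCMatrix underneath
cs <- cosSparse(t(m))       # cosine similarities between documents
class(cs)                   # inspect the sparse return format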

@koheiw
Collaborator Author

koheiw commented Sep 24, 2018

@kasperwelbers thanks. Your code is useful and inspiring. I will try to integrate your sparse computation without introducing the drawbacks that you pointed out.

@koheiw
Collaborator Author

koheiw commented Sep 25, 2018

@kasperwelbers I am studying your code, but I don't know why you need to transpose the matrix. Can you tell me why?

@kasperwelbers

kasperwelbers commented Sep 25, 2018

@koheiw As you know, a sparse matrix is either column- or row-major. In a column-major matrix, you can iterate over the non-zero cells for a given column, but not for a given row. In my code, I use the transpose to iterate over the non-zero cells for a given row.

The current dense implementation is:

for i in 1:ncols:
    outcome = vector of length ncols
    for j in 1:ncols:
        outcome[j] = dot(m.col(i), m.col(j))

Effectively, this way all values in m.col(i) are compared to all values in each m.col(j). But if we use the dot product we actually only need to look at the rows for which m.col(i) is nonzero.

For the sake of convenience, let's say that we can iterate over both the columns and the rows. We can then do:

for i in 1:ncols:
    outcome = vector of length ncols with each value 0
    for nonzero_i in m.col(i):
        row = row index of nonzero_i
        for nonzero_j in m.row(row):
            column = column index of nonzero_j
            outcome[column] += nonzero_i.value * nonzero_j.value

This way, we (1) only ever touch the nonzero values in m, and (2) only use the nonzero values in rows where i is nonzero.
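
A hedged R sketch of the same idea for a single query column (sparse_col_simil is an illustrative name; for clarity each row is handled as a dense vector here, whereas the C++ version would also iterate only over the non-zero entries within each row):

library(Matrix)

# cosine similarities of column i against all columns, touching only the
# rows where column i is non-zero
sparse_col_simil <- function(m, i) {
    m   <- as(m, "CsparseMatrix")
    tm  <- t(m)                           # row access to m via its transpose
    out <- numeric(ncol(m))
    col_i <- m[, i]
    for (r in which(col_i != 0)) {        # non-zero rows of column i
        out <- out + col_i[r] * tm[, r]   # row r of m, read as a column of the transpose
    }
    out / (sqrt(sum(col_i ^ 2)) * sqrt(colSums(m ^ 2)))   # cosine normalisation
}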

@kasperwelbers

I should emphasize that this is only useful for very sparse matrices, such as very large DTMs. One of the very nice things about your current implementation is that it would also work on a dense matrix, such as word embeddings.

I think it would be hard to achieve this without the transpose (or row-major) copy, because you simply need a way to iterate over the nonzero values for a given row.

@koheiw
Collaborator Author

koheiw commented Sep 26, 2018

@kasperwelbers Thanks for the comments. I now better understand your row-by-row approach and agree that it is a clever way to deal with sparse matrices (the transposition makes extraction of rows from sp_mat much faster).

That said, I am still trying to improve the performance of the column-by-column approach by skipping zero rows, because it is easier to compute other proximity measures from two columns.

double simil_cosine2(const arma::sp_mat& col_i, const arma::sp_mat& col_j,
                     arma::umat& nz, double magni_i, double magni_j) {
    double p = 0;
    for (arma::umat::iterator it = nz.begin(); it != nz.end(); ++it) {
        p += col_i[*it] * col_j[*it];
    }
    return p / (magni_i * magni_j);
}

The non-zero flag improves performance, but not dramatically. I also suspect that arma::mat(mt.col(i)) (conversion to a dense column) is taking time, but the function becomes considerably slower if I don't do that.

@koheiw
Collaborator Author

koheiw commented Sep 29, 2018

@kasperwelbers, I ended up separating the functions for cosine and correlation from the others so they can use linear algebra. They also use a transposed matrix, so they are close to your code. They are faster than the current serial R code and my parallel column-by-column code. I guess that Armadillo multiplies sparse matrices in a similar way to yours, but if you think you can improve the speed, why don't you incorporate your code? It is not fast with dense matrices, but we can switch algorithms based on sparsity.

simils = to_vector(trans(mt2 * mt1.col(i)) / (square * square[i]));
} else {
rowvec v1 = rowvec(trans(mt2 * mt1.col(i)));
rowvec v2 = center * center[i] * nrow;
simils = to_vector(((v1 - v2) / nrow) / (square * square[i]));

I ran the function on my large Guardian corpus, and it completed the computation of similarities between 204K^2 pairs on 186K dimensions in 5 hours with near-constant memory usage!

> system.time({
+ out <- textstat_simil2(mt_gur, margin = "documents", min_simil = 0.9)
+ })
      user     system    elapsed 
132295.474    326.599  17019.246 
> print(object.size(out), units = "Mb")
2578.3 Mb
> dim(mt_gur)
[1] 204061 186021

@kasperwelbers

That looks really good!

It's likely that arma already has stellar sparse matrix multiplication performance, so I don't think I'd be able to improve on that, but I'll try giving it a shot somewhere next week for the fun of it. If the performance is comparable, one advantage of not relying on arma's matrix multiplication is that it's not restricted to product sum, and thus could support more similarity/distance metrics.

@koheiw
Collaborator Author

koheiw commented Sep 29, 2018

@kbenoit qlcMatrix::cosSparse() only returns a dsCMatrix (sparse symmetric column-oriented) matrix. textstat_simil2 (or textstat_proxy()) returns a dsTMatrix, because conversion from a T to a C matrix slows things down by around 20%. However, users would extract columns as below, and for that a C matrix is more suitable.

> syno = list()
> for (f in head(colnames(out2), 10)) {
+     syno[[f]] = head(sort(out2[,f], decreasing = TRUE), 10)
+ }
> syno
$`fellow-citizens`
fellow-citizens       extensive        elective       executive     examination 
      1.0000000       0.9157805       0.8962582       0.8723449       0.8711806 
         latter            mode    respectively         however     departments 
      0.8637614       0.8600896       0.8450003       0.8419395       0.8400058 

$of
       of       the        to        by        in       its      from       and        it 
1.0000000 0.9922908 0.9761524 0.9623809 0.9616900 0.9575118 0.9489750 0.9463495 0.9447675 
     with 
0.9377452 

$the
      the        of        to        by        in        it       its      from        be 
1.0000000 0.9922908 0.9814107 0.9696001 0.9695980 0.9584249 0.9579936 0.9534339 0.9502737 
      and 
0.9486306 

$senate
       senate       genuine   temptations        medium      immunity dispassionate 
    1.0000000     0.8854220     0.8834522     0.8834522     0.8834522     0.8834522 
  controlling   examination        latter     declining 
    0.8765231     0.8466417     0.8268856     0.8115027 

$and
      and         ,        to         a         .        in       the       for        of 
1.0000000 0.9622326 0.9617753 0.9558603 0.9511407 0.9504642 0.9486306 0.9485077 0.9463495 
     with 
0.9439638 

I don't see any reason to define our own class for sparse proximity objects, because there are no methods for them.

@koheiw
Collaborator Author

koheiw commented Oct 4, 2018

Here is a comparison between the new and the old functions at different levels of sparsity:

mt1pc = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density=0.01)))
mt10pc = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density=0.1)))
mt50pc = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density=0.5)))
mt100pc = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density=1)))

microbenchmark::microbenchmark(
    textstat_simil(mt1pc, margin = "features"),
    textstat_simil(mt10pc, margin = "features"),
    textstat_simil(mt50pc, margin = "features"),
    textstat_simil(mt100pc, margin = "features"),
    textstat_simil2(mt1pc, margin = "features"),
    textstat_simil2(mt10pc, margin = "features"),
    textstat_simil2(mt50pc, margin = "features"),
    textstat_simil2(mt100pc, margin = "features"),
    times = 1
)

The new function is much faster when the matrix is sparse, but still faster when it is 100% dense.

Unit: seconds
                                          expr        min         lq       mean     median         uq        max neval
    textstat_simil(mt1pc, margin = "features")   7.691266   7.691266   7.691266   7.691266   7.691266   7.691266     1
   textstat_simil(mt10pc, margin = "features")  16.863189  16.863189  16.863189  16.863189  16.863189  16.863189     1
   textstat_simil(mt50pc, margin = "features")  66.698753  66.698753  66.698753  66.698753  66.698753  66.698753     1
  textstat_simil(mt100pc, margin = "features") 217.061203 217.061203 217.061203 217.061203 217.061203 217.061203     1
   textstat_simil2(mt1pc, margin = "features")   1.775254   1.775254   1.775254   1.775254   1.775254   1.775254     1
  textstat_simil2(mt10pc, margin = "features")   8.876443   8.876443   8.876443   8.876443   8.876443   8.876443     1
  textstat_simil2(mt50pc, margin = "features")  50.062099  50.062099  50.062099  50.062099  50.062099  50.062099     1
 textstat_simil2(mt100pc, margin = "features") 180.592381 180.592381 180.592381 180.592381 180.592381 180.592381     1

If you set min_simil, it becomes even faster because the resulting matrix becomes smaller.

@kbenoit
Collaborator

kbenoit commented Oct 5, 2018

Your use of = as an assignment operator tells me you've been programming too much in C++ recently! 😀

@kbenoit
Collaborator

kbenoit commented Jun 1, 2019

We've agreed this makes sense for similarity, which ranges from -1.0 to 1.0 in the measures we implement (and falls in the (-1, 0) range only for correlation), but not for distance, since many distance measures are not bounded or not on the same scale.

@kbenoit kbenoit closed this as completed Jun 1, 2019