Add min threshold to textstat_dist()? #1210
Not exactly a clear feature request, but if you mean an option to return a reduced-size dist object based on the dist values, this is not really workable, since the threshold would differ wildly across the matrix. But maybe I've misunderstood. Are there examples of this for other dist operations (from proxy, for instance)?
I did not have time to explain until now. I noticed that people struggle to calculate pairwise similarity between large numbers of sentences or n-grams to detect reuse of texts, because it generates a large dense matrix. We can make it more efficient if we floor values lower than a threshold to zero. Flooring does not have any impact on the analysis, because people are usually only interested in very high similarity. I haven't really looked at the code yet, but it could be done either in
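To make the flooring idea concrete, here is a minimal language-agnostic sketch in Python (the names `pairwise_floored` and `min_simil` are illustrative, not quanteda's API): similarities below the threshold are simply never stored, so the output grows with the number of strong pairs rather than with n².

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_floored(rows, min_simil):
    """Return only the (i, j, simil) triplets at or above min_simil.

    Values below the threshold are never stored, so the output
    size grows with the number of *strong* pairs, not with n^2.
    """
    out = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            s = cosine(rows[i], rows[j])
            if s >= min_simil:
                out.append((i, j, s))
    return out

docs = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
# With min_simil = 0.9, only the near-duplicate pair (0, 1) survives
```

With a high threshold, the result stays small even when the full similarity matrix would be dense.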
This is where the flooring should happen: Line 218 in 76e2cbf
OK, I see. So in the row/column loop in C++ we would just return a zero if the distance were below the minimum, and that would be one less cell in the sparse object returned by the C++ functions, so a smaller object is returned to R. The current return to R is a dist object, which is an efficient lower-diagonal vector of a distance matrix: basically a vector with a separate record of the object dimensions. But a zero in that object would still be a numeric zero, so we would have to change the return format. The question is whether this is really needed. Is the number of zero cells caused by removing distances below a minimum significant enough to warrant this solution? The current object format for an fcm is a sparse lower-diagonal matrix. The current object format for dist is the same, except that it records small values too. What would the "sparsity" level of a typical dist return be? Finally, are there precedents for doing this with distance computations? Something whose practice we could emulate?
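For reference, the compactness of a dist object comes from storing only the strict lower triangle of the symmetric matrix, column-wise, in a flat vector plus the dimension. A sketch of the index arithmetic (0-based here; R's dist uses the same column-wise layout, 1-based):

```python
def dist_index(i, j, n):
    """Position of element (i, j), i > j, of an n x n symmetric
    distance matrix inside the flat lower-triangle vector
    (column-wise, 0-based). R's dist object stores the same
    layout, 1-based."""
    if i < j:
        i, j = j, i  # the matrix is symmetric
    return n * j - j * (j + 1) // 2 + (i - j - 1)

# For n = 4 the vector holds the pairs in the order
# (1,0) (2,0) (3,0) (2,1) (3,1) (3,2)
```

The vector has n(n-1)/2 cells regardless of how many of them are (near-)zero, which is why flooring alone does not shrink this format.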
Nicolas Merz is facing this issue right now, so we can ask him to test whether flooring helps. I also have a plan to perform a huge pairwise distance calculation in the near future for "fake news". Analysis of text reuse has been restricted by the lack of an efficient tool, but the new argument might create a new trend in text analysis.
It seems that not many functions that take a sparse matrix as input return a dense one.
After being asked about how to do large-N pairwise similarity computation, I am keen on adding this. Lines 122 to 164 in 57a3921
Let's look at some other packages to see how they deal with this issue, e.g. scikit-learn: see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html, or http://na-o-ys.github.io/others/2015-11-07-sparse-vector-similarities.html.
I saw these, but there is no indication that they have a flooring option or something similar. This is why many people are in trouble.
Understood, but I'd like to see some other approach that implements the floor, or a paper justifying it, so that we fully understand the substantive consequences of zeroing small values for all of the metrics that we implement. This could be a test suite, for instance. But it's possible that not all measures are indifferent to near-zero values being treated as zero. The R implementation formerly known as Revolution Analytics uses a parallelized and optimised version of LAPACK for speed, and it has a better
This implementation takes the top n for a similar purpose:
This is the 4th case in two months. We need to address this problem.
I'm fine with seeking solutions to this, but let's base it on something rather than engineer fixes whose statistical or mathematical basis is unproven. (Although I would be OK with demonstrations that this works approximately the same.) I found: And a paper:
@kasperwelbers please share your thoughts with us |
So for my two cents, I agree that you would need to have some form of flooring, either by using a threshold or a top_n. Many documents will have at least one term overlapping, so without flooring you will often be close to n.x * n.y * 8 bytes. What I was aiming for is to simply let people determine a threshold themselves, so that if they are only interested in strong similarities (e.g., recommender systems, duplicates), big comparisons are feasible.

As the proof is in the pudding, I tried testing how much this helps at the meeting, but didn't manage to fix it in time to present it. I just finished it in the airplane (no pun intended) and put it up here: https://github.com/kasperwelbers/matsim. This implementation uses Eigen and is a bit different from quanteda's, but the idea is simply to not store (i.e. floor) certain values. Using this on 50,000 documents goes pretty smoothly, and I'm quite certain it could pull off much larger numbers given a high enough threshold or by using top_n. Of course, computation time still increases exponentially. If I understood Dmitriy correctly, there are ways around this that rely on approximations, but that's beyond my comprehension (also, I dislike approximations).

Regarding threshold versus top_n: I like thresholds, but top n can be nice for things like recommender systems. Dmitriy informed me of his implementation, which is probably very fast (though not on CRAN):
Ken, just for my understanding: when you say "not all measures are indifferent to near-zero values being treated as zero", are you referring to measures that are based on some function of all the document similarity scores, like corpus-level similarity? If it's about measures for document-pair similarity, I think it should be possible for most if not all common measures. But perhaps I'm totally missing something here.
I like rsparse and matsim. I hope Dmitriy will upload his package to CRAN, but he said he is not planning to do so. We could integrate matsim into quanteda or make it a single-function package to publish on CRAN. In either case, I can contribute by parallelizing the computation.
A separate package that only implements various similarity/distance measures could be useful. But if it could be integrated into quanteda (rewritten in Armadillo), that would also work for me. There are already some sparse matrix multiplication implementations in quanteda, but if I'm right, these return a non-sparse matrix, and I'm curious how the way they iterate over the columns compares to the approach used in matsim in terms of speed. We could use the matsim GitHub repo to benchmark some implementations, and then see whether a separate package makes sense, or whether some tweaks in quanteda could fix it. If it's alright with you, I'll copy one of the quanteda parallelized implementations to matsim and rewrite it to be similar to the matsim implementation but in Armadillo. If that doesn't work well, we could try parallelizing the Eigen implementation.
All the similarity measures operate on a sparse document-feature matrix, but they do not return a sparse similarity matrix. Line 218 in 76e2cbf
But popular measures like cosine are computed using Lines 122 to 137 in 919e012
As for parallel computation, using triplets seems to be the most efficient way to build a sparse matrix, as in. The advantage of integrating the sparse similarity function is that quanteda has all the configurations and objects for the parallel computation. The disadvantage is that we have to make the look and feel consistent with existing functions. I am sure that @kbenoit has something to say about this.
Right, so it could be implemented in the current form without major changes, but it would only be useful if the similarity matrix is returned as a sparse matrix, and I suppose there is a reason for not doing that in the first place (and otherwise, it would be a breaking change). Perhaps there is some virtue in only returning the sparse matrix if a threshold is used. That would make it more explicit that with a threshold it is no longer a pure similarity matrix, thus making the substantive consequences of using a threshold the user's own responsibility.
The functions currently return not a sparse matrix but R's dist object. We can just make
Let's first check how many packages are using dist objects. We could probably just change their output object to a matrix and run
Partly, but let's also explore what functions in other packages take dist and simil class objects, since it's not only a compatibility issue but also an extensibility issue. However, we will provide. I'd also like to test those methods when we have values close to zero versus equal to zero, since the zeros can cause special problems in methods that use cross-products or element-wise multiplication, or can cause division by zero.
I finally found time to work on this and created textstat_simil2.
@kasperwelbers, please give it a try. |
@nicmer, please try whether it works with ngrams.
With optimization, it became as fast as the classic version.
Nice work! I noted one minor point of elegance that in some cases might also help in terms of speed and memory efficiency. The output of textstat_simil2 can currently still contain zeros if no limit is used. Since the output is a sparse matrix anyway, these might as well be dropped. For instance:
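The point about explicit zeros can be shown on a triplet-form sparse matrix; in R, `Matrix::drop0()` does this for real sparse matrix objects. A minimal Python sketch of the same idea:

```python
def drop0(triplets):
    """Drop explicitly stored zeros from a triplet-form sparse
    matrix [(row, col, value), ...]. The Matrix package's drop0()
    does the equivalent for dgCMatrix objects in R."""
    return [(i, j, v) for (i, j, v) in triplets if v != 0]

m = [(0, 0, 1.0), (0, 1, 0.0), (1, 1, 0.73)]
# drop0(m) keeps only the two nonzero cells
```

Explicitly stored zeros cost the same memory as any other stored value, so dropping them shrinks the object without changing what it represents.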
The advantage of the current functions is that they work with the default. For the truncation via
Thanks @kasperwelbers for the input. With further optimization, it became twice as fast and as compact as the old similarity function!
A rank argument has been added to get the top n items.
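Whatever the argument is called in the final API, the rank idea boils down to keeping only the n largest values per column. A hedged Python sketch (names and data structures are illustrative):

```python
import heapq

def top_n_per_column(columns, n):
    """Keep only the n largest similarity values in each column.

    `columns` maps a column name to {row_name: similarity}; the
    result has at most n entries per column, regardless of how
    many pairs were computed.
    """
    return {
        col: dict(heapq.nlargest(n, cells.items(), key=lambda kv: kv[1]))
        for col, cells in columns.items()
    }

sims = {"doc1": {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.5}}
# top_n_per_column(sims, 2)["doc1"] keeps only "a" and "c"
```

Unlike a fixed threshold, this bounds the output size at n entries per column even when many pairs are strongly similar.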
I have added all the other distance and similarity measures in the same C++ code. Since they are essentially the same, we can make
I did some stress tests and noted that while the memory issue is solved by the new sparse output, it does not yet scale well in terms of speed. The reason seems to be that the current implementation of simil_mt essentially still uses a dense matrix multiplication algorithm. To see what can be gained, I implemented a sparse solution (see gist) in your script that scales better and is much faster. I haven't made this a pull request, because there are two important trade-offs that you'd need to consider (also, I'm not confident in my parallel processing skills).
To test how it scales, I made three random sparse matrices with an increasing number of columns and the same density. Naturally, this data is not an accurate reflection of a real dtm, which makes this a sloppy benchmark.
Current implementation
Alternative
So, there is much to be gained, but it comes with a substantial amount of hassle, plus the memory trade-off. If you think it's useful, here or for another function (e.g. deduplication), I can lend a hand, since I did something similar in RNewsflow (but more focused on doc similarity and without the cool parallel processing).
@koheiw the package I was trying to think of on Wednesday was **qlcMatrix** at https://CRAN.R-project.org/package=qlcMatrix. This package operates on sparse matrices (which we do already) but also returns a sparse matrix format. I'm not suggesting we use this package, since our code is already pretty optimised for sparse inputs, but it would be interesting to compare performance.
@kasperwelbers thanks. Your code is useful and inspiring. I will try to integrate your sparse computation without introducing the drawbacks you pointed out.
@kasperwelbers I am studying your code, but I don't know why you need to transpose the matrix. Can you tell me why?
@koheiw As you know, a sparse matrix is either column- or row-major. In a column-major matrix you can iterate over the non-zero cells for a given column, but not for a given row. In my code, I use the transpose to iterate over the non-zero cells for a given row. The current dense implementation is:
Effectively, this way all values in m.col(i) are compared to all values in each m.col(j). But if we use the dot product, we actually only need to look at the rows for which m.col(i) is nonzero. For the sake of convenience, let's say that we can iterate over both the columns and the rows. We can then do:

This way, we (1) only ever touch the nonzero values in m, and (2) only use the nonzero values in rows where i is nonzero.
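The row-wise trick described above can be written out in a few lines; here a dict-of-dicts stands in for row-major sparse storage (a real implementation would use CSR/CSC arrays, but the access pattern is the same):

```python
def sparse_dot_columns(col_i, rows):
    """Dot product of column i with every other column, touching
    only nonzero cells.

    `col_i` is column i as {row: value}; `rows` is the transposed
    (row-major) view of the matrix as {row: {col: value}}. For each
    nonzero row of column i we visit only the columns that are
    nonzero in that row -- the trick described above.
    """
    res = {}
    for r, vi in col_i.items():                # nonzero rows of column i
        for j, vj in rows.get(r, {}).items():  # nonzero cols in row r
            res[j] = res.get(j, 0.0) + vi * vj
    return res

# Column 0 of a tiny matrix, and the row-major view of that matrix:
col0 = {0: 2.0, 2: 1.0}
rows = {0: {0: 2.0, 1: 3.0}, 1: {1: 5.0}, 2: {0: 1.0}}
# sparse_dot_columns(col0, rows) -> {0: 5.0, 1: 6.0}
```

Note that row 1 is never touched, because column 0 is zero there; on a very sparse matrix this is where the speedup comes from.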
I should emphasize that this is only useful for very sparse matrices, such as very large DTMs. One of the very nice things about your current implementation is that it would also work on a dense matrix, such as word embeddings. I think it would be hard to achieve this without the transpose (or row-major copy), because you simply need a way to iterate over the nonzero values for a given row.
@kasperwelbers Thanks for the comments. I now better understand your row-by-row approach and agree that it is a clever way to deal with sparse matrices (the transposition makes extraction of rows easier). That said, I am still trying to improve the performance of the col-by-col approach by skipping zero rows, because it is easier to compute other proximity measures with two columns. Lines 64 to 70 in d21e5cd
The non-zero flag improves performance, but not dramatically. I also suspect that arma::mat(mt.col(i)) (conversion to a dense matrix) is taking time, but it becomes considerably slower if I don't do that.
@kasperwelbers, I ended up separating the functions for cosine and correlation from the others to use linear algebra. It also uses a transposed matrix, so it is close to your code. It is faster than the current serial R code and my parallel column-by-column code. I guess that Armadillo multiplies sparse matrices in a similar way to yours, but if you think you can improve speed, why not incorporate your code? It is not fast with dense matrices, but we can switch algorithms based on sparsity. Lines 72 to 76 in 8158298
I ran the function on my large Guardian corpus, and it completed the computation of similarity between 204K^2 pairs on 186K dimensions in 5 hours with near-constant memory usage!

```r
> system.time({
+   out <- textstat_simil2(mt_gur, margin = "documents", min_simil = 0.9)
+ })
      user     system    elapsed
132295.474    326.599  17019.246
> print(object.size(out), units = "Mb")
2578.3 Mb
> dim(mt_gur)
[1] 204061 186021
```
That looks really good! It's likely that arma already has stellar sparse matrix multiplication performance, so I don't think I'd be able to improve on that, but I'll try giving it a shot somewhere next week for the fun of it. If the performance is comparable, one advantage of not relying on arma's matrix multiplication is that it's not restricted to the product sum, and thus could support more similarity/distance metrics.
@kbenoit

```r
> syno = list()
> for (f in head(colnames(out2), 10)) {
+   syno[[f]] = head(sort(out2[, f], decreasing = TRUE), 10)
+ }
> syno
$`fellow-citizens`
fellow-citizens    extensive     elective    executive  examination
      1.0000000    0.9157805    0.8962582    0.8723449    0.8711806
         latter         mode respectively      however  departments
      0.8637614    0.8600896    0.8450003    0.8419395    0.8400058

$of
       of       the        to        by        in       its      from       and        it
1.0000000 0.9922908 0.9761524 0.9623809 0.9616900 0.9575118 0.9489750 0.9463495 0.9447675
     with
0.9377452

$the
      the        of        to        by        in        it       its      from        be
1.0000000 0.9922908 0.9814107 0.9696001 0.9695980 0.9584249 0.9579936 0.9534339 0.9502737
      and
0.9486306

$senate
       senate       genuine   temptations        medium      immunity dispassionate
    1.0000000     0.8854220     0.8834522     0.8834522     0.8834522     0.8834522
  controlling   examination        latter     declining
    0.8765231     0.8466417     0.8268856     0.8115027

$and
      and         ,        to         a         .        in       the       for        of
1.0000000 0.9622326 0.9617753 0.9558603 0.9511407 0.9504642 0.9486306 0.9485077 0.9463495
     with
0.9439638
```

I don't see any reason to define an original class for sparse proximity objects, because there are no methods for them.
Here is a comparison between the new and old functions at different levels of sparsity:

```r
mt1pc   = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 0.01)))
mt10pc  = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 0.1)))
mt50pc  = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 0.5)))
mt100pc = as.dfm(abs(Matrix::rsparsematrix(1000, 10000, density = 1)))
microbenchmark::microbenchmark(
    textstat_simil(mt1pc, margin = "features"),
    textstat_simil(mt10pc, margin = "features"),
    textstat_simil(mt50pc, margin = "features"),
    textstat_simil(mt100pc, margin = "features"),
    textstat_simil2(mt1pc, margin = "features"),
    textstat_simil2(mt10pc, margin = "features"),
    textstat_simil2(mt50pc, margin = "features"),
    textstat_simil2(mt100pc, margin = "features"),
    times = 1
)
```

The new function is much faster when the matrix is sparse, and it is still faster even when the matrix is 100% dense.
If you set |
Your use of
We've agreed this makes sense for similarity, which ranges from -1.0 to 1.0 in the measures we implement (only correlation uses the negative part of that range), but not for distance, since many distance measures are unbounded or not on the same scale.
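A quick numeric illustration of that asymmetry: cosine similarity is bounded and scale-invariant, so a fixed min_simil cutoff transfers across matrices, while Euclidean distance grows with the magnitude of the vectors, so no single min/max threshold is meaningful in general. (Plain-Python sketch; the vectors are made up.)

```python
import math

def cosine(u, v):
    """Cosine similarity: bounded in [-1, 1] and scale-invariant."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean(u, v):
    """Euclidean distance: unbounded, grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

a, b = [1.0, 0.0], [0.0, 1.0]
big_a, big_b = [100.0, 0.0], [0.0, 100.0]

# cosine(a, b) == cosine(big_a, big_b) == 0.0, so one cutoff works
# euclidean(a, b) is ~1.41 but euclidean(big_a, big_b) is ~141.4,
# so a fixed distance cutoff does not transfer across matrices
```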