
textreuse #20

Closed
lmullen opened this issue Sep 18, 2015 · 20 comments


@lmullen
Member

commented Sep 18, 2015

    1. What does this package do? (explain in 50 words or less)

This package detects document similarity, and implements the minhash/lsh algorithms.

    2. Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).

```
Package: textreuse
Type: Package
Title: Detect Text Reuse and Document Similarity
Version: 0.0.1.9001
Date: 2015-09-17
Authors@R: c(person("Lincoln", "Mullen", role = c("aut", "cre"),
    email = "lincoln@lincolnmullen.com"))
Description: Tools for measuring similarity among documents and detecting
    passages which have been reused. Implements shingled n-gram, skip n-gram,
    and other tokenizers; similarity/dissimilarity functions; pairwise
    comparisons; and minhash and locality sensitive hashing algorithms.
License: MIT + file LICENSE
LazyData: TRUE
URL: https://github.com/lmullen/textreuse
BugReports: https://github.com/lmullen/textreuse/issues
VignetteBuilder: knitr
Depends: R (>= 3.1.2)
Imports: assertthat (>= 0.1),
    digest (>= 0.6.8),
    hash (>= 2.2.6),
    NLP (>= 0.1.8),
    Rcpp (>= 0.12.0),
    stringr (>= 1.0.0)
Suggests: testthat (>= 0.10.0),
    knitr (>= 1.11),
    rmarkdown (>= 0.8)
LinkingTo: Rcpp,
    BH
```
    3. URL for the package (the development repository, not a stylized html page)

https://github.com/lmullen/textreuse/

    4. What data source(s) does it work with (if applicable)?

This package anticipates that the user has documents in plain text. Future versions could provide, for example, XML readers as the tm package does, but I think that probably does not belong in this package.

    5. Who is the target audience?

Detecting document similarity is a common problem when working with natural language, so I anticipate that this package will be broadly useful to anyone working in NLP.

    6. Are there other R packages that accomplish the same thing? If so, what is different about yours?

No, there are no other R packages that implement minhash/locality-sensitive hashing. The tm package does implement some document similarity measures, but these are similarity in terms of content rather than in terms of actual borrowing of text. In other words, it would mark two documents that both talked about football as being similar, even if they had no shared text.

That said, this package extends classes from the NLP and tm packages, so it is intended to play nice with other R NLP packages.
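To make the distinction concrete, here is a toy sketch in Python (illustrative only; the package itself is written in R, and the function names below are invented for the example) of how Jaccard similarity over shingled word n-grams flags actual borrowed text rather than shared subject matter:

```python
def shingles(text, n=3):
    """Split text into the set of overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

original   = "the quick brown fox jumps over the lazy dog"
copied     = "as they say, the quick brown fox jumps over the lazy dog"
same_topic = "a fast auburn vulpine leaped above a sleepy canine"

# A verbatim borrowing shares most of its shingles with the original...
assert jaccard(shingles(original), shingles(copied)) == 0.7
# ...while a same-topic paraphrase shares none, despite describing
# the same thing. Content-similarity measures (as in tm) would still
# score the paraphrase as related; shingle overlap does not.
assert jaccard(shingles(original), shingles(same_topic)) == 0.0
```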

    7. Check the box next to each policy below, confirming that you agree. These are mandatory.
  • This package does not violate the Terms of Service of any service it interacts with.
  • The repository has continuous integration with Travis and/or another service
  • The package contains a vignette
  • The package contains a reasonably complete readme with devtools install instructions
  • The package contains unit tests
  • The package only exports functions to the NAMESPACE that are intended for end users
    8. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.

Yes, I comply with all those guidelines. The exception is that I have named classes such as TextReuseTextDocument, following the precedent set by the NLP package. I don't like the naming convention any more than you do, but it is the established style for those packages.

  • Are there any package dependencies not on CRAN?
  • Do you intend for this package to go on CRAN?
  • Does the package have a CRAN accepted license?
  • Did devtools::check() produce any errors or warnings? If so paste them below.
    9. Please add explanations below for any exceptions to the above:
    10. If this is a resubmission following rejection, please explain the change in circumstances.
@lmullen

Member Author

commented Sep 22, 2015

Just a note to whoever ends up reviewing this package: I've made some performance improvements/bug fixes, but these are all on master now (with tags).

@sckott

Member

commented Sep 22, 2015

thanks for the heads up, still seeking reviewer

@lmullen

Member Author

commented Sep 22, 2015

I'm used to waiting 6 to 9 months. :-)


@sckott

Member

commented Sep 22, 2015

@noamross assigned

@noamross

Collaborator

commented Oct 16, 2015

General Comments

👏👏👏

This package is so f*ing elegant. It's classy, really goddam classy and I barely have anything to add and kind of want to delete everything on my hard drive in response.

Some things I really, really like:

  • Right-sized: The package implements a few awesome methods for a particular task,
    and doesn't try too much else.
  • Excellent test coverage
  • Constant use of assertions
  • Well organized, readable code
  • Just the right amount of S3
  • Well-written, easy to follow, multiple short vignettes
  • progress = interactive() as default for all long-running functions.

I wasn't familiar with the particular methods described here, but they were easy to learn from the vignettes, and I was able to apply them to a corpus of 3000 documents in no time.

The package is also a testament to how much recent tooling such as testthat, assertthat, Rcpp, and such have improved the quality of R packages.

I really couldn't find anything in the code to complain about. The following is mostly a bunch of tiny feature and documentation requests.

Great job, @lmullen!

Tests

  • Tests run successfully on OSX Yosemite, R 3.2.2
  • All code in README, vignettes and examples ran as expected

Documentation

  • In the intro vignettes, there isn't text specifically explaining the meta() function
  • ?textreuse only defines the datasets. I find having a short description of the package and pointers to vignettes or key functions helpful here.
  • The documentation discusses how the user may use or write other tokenizing, hash, or dissimilarity functions. It would be helpful for examples to include a case or two of using a function from another package. Also, if you hope for others to do so, you might provide a template in a CONTRIBUTING.md file.
  • In the README as well as the minhashing vignette you mention that pairwise comparisons grow exponentially with the size of the corpus. I think the term geometrically is correct here. (Can you see that I'm really reaching for things?)
  • In minhash.Rmd, line 63: "same four", but this identifies only 3.
  • In rehash.R, line 5: tokenens --> tokens
  • In rehash.R, line 5: which --> when
  • In lsh_compare.R, line 6: "each of the documents" --> "each of the candidate pairs"
  • I recommend adding a code of conduct to the package. (devtools:::use_code_of_conduct())
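On the growth-of-comparisons point above: the count of pairwise comparisons for n documents is C(n, 2) = n(n - 1)/2, which a quick sketch (Python, purely illustrative) makes concrete:

```python
from math import comb

# All-pairs similarity checking requires C(n, 2) = n * (n - 1) / 2
# comparisons, so doubling the corpus roughly quadruples the work.
# This is why candidate-pair methods such as minhash/LSH are worth
# the trouble on large corpora.
for n in (10, 100, 1000, 10000):
    print(n, comb(n, 2))
```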

Functionality

  • I would like to be able to create a TextReuseCorpus from a character vector rather than just a directory of files, as I commonly am loading a bunch of other files (e.g., html) and processing them to extract just the relevant text. In this case one would have to provide a vector of ids or use the vector names as ids, which could default to just using the index number as id.
  • When loading in a large corpus of documents, I got the error Error: n not less than length(words), which was ultimately traced to assert_that(n < length(words)) in tokenize_ngrams(). I found this was because I had a very short document, but also because the document was all whitespace characters. I wonder if it would make sense to check for empty documents.
  • Perhaps it makes sense to give pairwise candidate dataframes their own class for assertion testing, rather than testing on column names, as in lsh_compare().
  • One needs to use the same minhashing function when adding new documents to a hash. Perhaps the hash could store the minhash function and bands used to create it, and check against this for new documents.
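Regarding the n not less than length(words) error above, one possible guard (sketched in Python for illustration; tokenize_ngrams here is a stand-in, not the package's R function) is to reject empty or too-short documents with an explicit message:

```python
def tokenize_ngrams(text, n=5):
    """Word n-grams with explicit guards for degenerate documents.

    A document that is all whitespace splits into zero words, so a bare
    assertion like n < length(words) fails with a confusing message;
    checking that case separately makes the error actionable.
    """
    words = text.split()  # splitting on whitespace drops empty tokens
    if not words:
        raise ValueError("document is empty or contains only whitespace")
    if len(words) < n + 1:  # mirrors the n < length(words) assertion
        raise ValueError(
            f"document has only {len(words)} words; "
            f"need more than {n} for {n}-grams"
        )
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

A corpus loader could catch this error, warn with the offending document's ID, and skip the document rather than aborting the whole run.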
@noamross

Collaborator

commented Oct 16, 2015

(P.S., sorry this is late.)

@lmullen

Member Author

commented Oct 16, 2015

@noamross Thanks so much for this very thorough review. I really appreciate the level of detail, which gives me a lot more confidence in the package. I'll respond point by point as necessary as I work on implementing your suggestions. But in general, I think everything that you have suggested is worth doing, and I plan on implementing it before sending the package to CRAN. In particular, I like the idea of giving the pairwise data frames a class: much more elegant. Also, I ran into the n not less than length(words) error when using this for my own analysis. I'll have to think about the best way to either document or work around that: what I've been doing is filtering the corpus with wordcount().

One other thing. I've noticed a performance problem using the hash package (ugh---so many kinds of hashes in this package!). With a corpus of 71K documents, the lsh_compare() function ran for 3 days, got to about 30% completion, and wouldn't budge from there before I quit. The problem is that while hash tables are very fast for reading data, they don't seem to have any advantage when writing data. In particular, I don't think the implementation of hash tables in the hash package lets you preallocate the size of the hash table, which I couldn't do anyway because there is no way of knowing in advance how many collisions there will be with the hashes. Maybe I'm doing something naive with the hash table and just don't know it. But I've got an implementation (external to the package, example here) that uses dplyr to calculate the matches. It took less than 90 minutes for the 71K document corpus. This will entail dropping the dependency on the hash package and adding a dependency on dplyr. I think that will be simpler for users, since the hash table is an unfamiliar data structure. And while dplyr is a somewhat heavier dependency, in the long run it will permit using a database backend so that the package can be used with out-of-memory data.
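The group-then-join shape of that approach can be sketched in a few lines of Python (illustrative only; the actual implementation uses dplyr in R, and the data here is invented). Instead of writing each document into a hash table bucket by bucket, collect (bucket, document) rows in one pass and emit every within-bucket pair as a candidate:

```python
from collections import defaultdict
from itertools import combinations

# Toy LSH banding output: (band_signature, doc_id) rows.
rows = [
    ("b1", "doc_a"), ("b1", "doc_b"),   # doc_a and doc_b collide in band b1
    ("b2", "doc_a"), ("b3", "doc_b"),
    ("b4", "doc_c"), ("b1", "doc_c"),   # doc_c also lands in band b1
]

# Group rows by bucket in a single pass...
buckets = defaultdict(list)
for bucket, doc in rows:
    buckets[bucket].append(doc)

# ...then every pair of documents sharing a bucket is a candidate pair.
candidates = sorted({tuple(sorted(pair))
                     for docs in buckets.values()
                     for pair in combinations(docs, 2)})
print(candidates)
# → [('doc_a', 'doc_b'), ('doc_a', 'doc_c'), ('doc_b', 'doc_c')]
```

The same shape maps directly onto a grouped summarize/join over a data frame, which is what makes the dplyr version both fast and, eventually, portable to a database backend.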

Thanks, Noam!

@noamross

Collaborator

commented Oct 16, 2015

I'm happy to review a new implementation.

@karthik

Member

commented Oct 16, 2015

👏

@lmullen

Member Author

commented Oct 18, 2015

I've implemented almost all of your suggestions, @noamross. They are now on the master branch of the repository, or you can get the most recent version tagged v0.0.1.9004. Here is a list of the most significant changes:

  • As mentioned above, LSH is now implemented using dplyr. For my ATS corpus I got exactly the same results with the new implementation as with the old implementation, and I tested an earlier version on a different corpus, so I'm pretty confident that it is correct.
  • You can now create a corpus using a character vector of documents, and if the character vector is named, the names will be used as the document IDs.
  • Creating a document or a corpus now checks whether the document has at least n + 1 words, where n is the length of the n-grams. (And by extension, it checks for empty documents.) When creating a corpus, you get a warning with the ID of empty or short documents, but the corpus just skips over them.

The one suggestion that you made that I did not implement is somehow storing or at least validating the minhash function when adding new documents to the LSH buckets/cache. I agree in principle that this should be done, but I couldn't think of a good way to do it. I don't think it is a big issue for the first version of the package, since the package can only deal with in-memory data at this point. That is to say, the biggest corpus that I'm likely to use it on has 71K documents, and that isn't even close to using up the memory on my laptop. In those cases, it is unlikely that someone would add to the cache instead of simply doing it all in one batch. For the future I'm thinking about how this could be extended for out-of-memory data, which the new dplyr implementation will permit. I think I'll have to figure out how to implement your suggestion then.

Thanks again for your detailed review. It has improved the package a great deal.

@noamross

Collaborator

commented Oct 19, 2015

Great, let me know when is a good point to do an updated review.

@lmullen

Member Author

commented Oct 19, 2015

I think it's ready whenever is convenient for you.

@noamross

Collaborator

commented Oct 19, 2015

k, I should be able to slot it in in the next week or so.

@lmullen

Member Author

commented Oct 22, 2015

I've added one last feature (I promise this time!). It is a local alignment function that implements a version of the Smith-Waterman algorithm from protein sequence alignment, adapted for use with natural language. It is a single function, local_align(); there is a vignette, textreuse-alignment, to go with it; and the new tag is v0.0.1.9005.
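For readers unfamiliar with Smith-Waterman, here is the textbook word-level dynamic program sketched in Python (illustrative only; this is not the package's local_align(), and unlike textreuse's version, which records a substitution as a deletion plus an insertion, the classic form below allows substitutions directly). It uses a match score of 2 and a penalty of -1 for mismatches and gaps:

```python
def smith_waterman(a_words, b_words, match=2, penalty=-1):
    """Word-level Smith-Waterman local alignment score.

    H[i][j] is the best score of a local alignment ending at
    a_words[i-1] / b_words[j-1]; negative cells are clipped to zero,
    which is what makes the alignment local rather than global.
    """
    m, n = len(a_words), len(b_words)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a_words[i - 1] == b_words[j - 1] else penalty
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,    # match / substitution
                          H[i - 1][j] + penalty,    # gap in b
                          H[i][j - 1] + penalty)    # gap in a
            best = max(best, H[i][j])
    return best

a = "the quick brown fox jumps over the lazy dog".split()
b = "one day the quick brown fox leaped over the lazy dog".split()
# Eight matching words (+16) and one substitution (-1):
print(smith_waterman(a, b))  # → 15
```

Replacing the word tokens with single characters A, C, T, G is what makes the bioinformatics comparison the reviewer proposes below possible.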

@noamross

Collaborator

commented Oct 28, 2015

Note: Github's "compare" functionality is super helpful for doing this review. I just did https://github.com/lmullen/textreuse/compare/f9d3555...HEAD to find the updated changes. Yes, I realize I could do this locally.

  • All tests pass on R 3.2.2, OSX El Capitan

  • Code of conduct added; 👍

  • Loading text via character vector works great; 👍

  • Nice package-level documentation and documentation updates; 👍

  • New LSH algorithm looks good; 👍

  • When loading from a directory of files including some to be skipped via skip_short = TRUE, the warning messages do not give an ID for the document. Thus, you can't tell which might be the problematic file. I see when I load a named character vector there is an ID in the warning message. I believe this is because the ID is assigned after the check for the short document in TextReuseTextDocument().

  • I believe that for your Rcpp function sw_matrix, you want to define score, deletion, insertion, and value before the for loop, and only assign to them... ah, it's easier to just do a PR: ropensci/textreuse#59

  • I'm still grokking the new Smith-Waterman algorithm, but I wonder if we could test by comparing results with a bioinformatics package that implements it. If we created a test file where the only words are space-separated "A", "C", "T", and "G" we should expect identical results, no? So far I haven't been able to do this with Biostrings::pairwiseAlignment(), but I'm not sure how to set the score assignments in that properly.

    Perhaps @vsbuffalo, @blahah, or one of our other bioinfo friends could chime in with the right usage for expected results? I'm trying to do Biostrings::pairwiseAlignment() with type="local", a match score of 2, and mismatch/gap penalties of -1.

@lmullen

Member Author

commented Oct 28, 2015

Thanks for doing the second thorough review, @noamross. Much appreciated!

Here are the responses to your queries/notes:

  • The missing ID was definitely a bug. Fixed in this commit, with tests added.
  • Thanks for the PR for sw_matrix(). I've merged it in. (Related: @Ironholds sent a PR to improve hash_string().)
  • The Biostrings and textreuse implementations of Smith-Waterman can't be compared exactly, since Biostrings permits substitutions, whereas I've opted to mark a substitution as a deletion followed by an insertion. The reason is that Biostrings works with a limited vocabulary, but there is an unlimited vocabulary for natural language. But I've prepared this document which demonstrates that they get the same results.

In addition, I've cleared out any remaining issues in the repository.

@sckott, @karthik: Anything else to be done before this package can be part of rOpenSci?

@sckott

Member

commented Oct 28, 2015

Looks good to me. Any thoughts, @karthik?

@sckott

Member

commented Oct 29, 2015

LGTM, approved

@lmullen I think you can move pkgs into ropensci already

@lmullen

Member Author

commented Oct 30, 2015

Okay, I've transferred the repository and added the rOpenSci footer, etc.

@sckott

Member

commented Oct 30, 2015

🚀 thanks!

@sckott sckott closed this Oct 30, 2015