
Subset quanteda corpus based on token count #795

Closed
EricHe98 opened this issue Jun 13, 2017 · 9 comments

@EricHe98

Hi,

I have a corpus object with 306,108 documents that I would like to subset based on the word count of each document. The summary() command on a corpus already lists the number of tokens in each document, but I cannot access that token count from within the corpus_subset() command.

I believe I could add a docvars() variable giving the number of tokens of each document using the ntoken() function and then call corpus_subset() on that variable, but this seems like an unnecessary extra step. Is there a more efficient way to compute this subset?
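
For concreteness, something like the sketch below is what I have in mind (using the built-in data_corpus_inaugural as a stand-in for my corpus; the docvar name n_tokens and the 100-token cutoff are just placeholders):

library(quanteda)

corp <- data_corpus_inaugural                      # stand-in for my 306,108-document corpus
docvars(corp, "n_tokens") <- ntoken(corp)          # store each document's token count as a docvar
subcorp <- corpus_subset(corp, n_tokens >= 100)    # then subset on that docvar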

Thanks!

@kbenoit (Collaborator) commented Jun 13, 2017

Try:

subcorpus <- corpus_subset(wholecorpus, ntoken(wholecorpus) > threshold)

to select only the documents whose token count exceeds threshold. Note that unless you supply options to ntoken(), the defaults for tokens() will be used, which include punctuation.
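
For example, to exclude punctuation from the count, a tokeniser option can be passed straight through ntoken() (a sketch using the same wholecorpus and threshold objects as above; remove_punct is the relevant tokens() argument in recent quanteda releases):

# sketch: pass a tokens() option through ntoken() so punctuation is not counted
subcorpus <- corpus_subset(wholecorpus, ntoken(wholecorpus, remove_punct = TRUE) > threshold)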

@EricHe98 (Author)

Hi Ken,

The command did work, but it used up my entire 16 GB of RAM plus some swap space to do the computation on a corpus that is only about 2 GB. Is there anything I can do to reduce the memory usage?

Thanks again.

@koheiw (Collaborator) commented Jun 14, 2017

@kbenoit the underlying function of ntoken() is still the classic one:

ntoken.character <- function(x, ...) {
    ntoken(tokenize(x, ...))
}

We could also use stringi::stri_count_boundaries() for a rough estimate on corpus objects.

@kbenoit (Collaborator) commented Jun 14, 2017

@koheiw that's a very good suggestion - I will change the code for ntoken.character() now.

@kbenoit (Collaborator) commented Jun 14, 2017

Actually, we cannot simply count boundaries inside ntoken(), since it passes its options through to tokens(x, ...), and a boundary count would not give the user that full control.

However @JumpyAF I suggest you try @koheiw's suggestion as:

subcorpus <- corpus_subset(wholecorpus, stringi::stri_count_boundaries(texts(wholecorpus)) > threshold)

@EricHe98 (Author)

Hi all,

I tried @koheiw's command, and while the computation took a few seconds, it used hardly any memory at all. However, the word counts it gives differ from (are strictly less than) those from the ntoken() command and the figures shown by summary(wholecorpus). I am wondering why there is not simply a quick command to extract the token counts, which summary() seems to have already computed and displayed.

Best.

@EricHe98 (Author) commented Jun 19, 2017 via email

@kbenoit (Collaborator) commented Jun 19, 2017

The reason is that currently a corpus has not yet been tokenised, so counting tokens has to be done on the fly. Also, we leave it to the user to define what a "token" is, so there is no universal answer. That is why you are getting different results with stri_count_boundaries(), but they will be close.
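
As a rough illustration on a toy string (a sketch; the exact numbers depend on the tokeniser options and on the text):

library(quanteda)
library(stringi)

txt <- "Hello, world! This is a short example."
ntoken(txt)                  # quanteda tokeniser: punctuation counted as tokens by default
stri_count_boundaries(txt)   # boundary-based estimate with the default break iterator: close, but not identical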

In a future version of the package we will index tokens in the corpus, while still allowing the user to define what counts as a token. That is not done yet, though.

@EricHe98 (Author)

All right, thanks for all your help Ken. Your package is much appreciated!
