
Subset quanteda corpus based on token count #795

Closed
EricHe98 opened this issue Jun 13, 2017 · 9 comments

@EricHe98

Hi,

I have a corpus object with 306,108 documents that I would like to subset based on the word count of each document. The summary() command on a corpus already lists the number of tokens in each document, but I cannot access that token count from within the corpus_subset() command.

I believe I could add a docvars() variable giving the number of tokens of each document using the ntoken() function and then call corpus_subset() on that variable, but this seems like an unnecessary extra step. Is there a more efficient way to compute this subset?
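
For concreteness, something like the sketch below is what I have in mind (using the built-in data_corpus_inaugural as a stand-in for my corpus; the docvar name n_tokens and the 100-token cutoff are just placeholders):

library(quanteda)

corp <- data_corpus_inaugural                      # stand-in for my 306,108-document corpus
docvars(corp, "n_tokens") <- ntoken(corp)          # store each document's token count as a docvar
subcorp <- corpus_subset(corp, n_tokens >= 100)    # then subset on that docvar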

Thanks!

@kbenoit (Collaborator) commented Jun 13, 2017

Try:

subcorpus <- corpus_subset(wholecorpus, ntoken(wholecorpus) > threshold)

to select only the documents whose token count exceeds threshold. Note that unless you supply options to ntoken(), the defaults for tokens() will be used, which include punctuation.
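
For example, to exclude punctuation from the count, a tokeniser option can be passed straight through ntoken() (a sketch using the same wholecorpus and threshold objects as above; remove_punct is the relevant tokens() argument in recent quanteda releases):

# sketch: pass a tokens() option through ntoken() so punctuation is not counted
subcorpus <- corpus_subset(wholecorpus, ntoken(wholecorpus, remove_punct = TRUE) > threshold)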

@EricHe98 (Author)

Hi Ken,

The command did work, but it used up my entire 16 GB of RAM plus some swap space to do the computation on a corpus that is only about 2 GB. Is there anything I can do to reduce the memory usage?

Thanks again.

@koheiw (Collaborator) commented Jun 14, 2017

@kbenoit the underlying function of ntoken() is still the classic one:

ntoken.character <- function(x, ...) {
    ntoken(tokenize(x, ...))
}

We could also use stringi::stri_count_boundaries() for a rough estimate on corpus objects.

@kbenoit (Collaborator) commented Jun 14, 2017

@koheiw that's a very good suggestion - I will change the code for ntoken.character() now.

@kbenoit (Collaborator) commented Jun 14, 2017

Actually, we cannot simply count boundaries inside ntoken(), since it passes its options through to tokens(x, ...), and a boundary count would not give the user that full control.

However @JumpyAF I suggest you try @koheiw's suggestion as:

subcorpus <- corpus_subset(wholecorpus, stringi::stri_count_boundaries(texts(wholecorpus)) > threshold)

@EricHe98 (Author)

Hi all,

I tried @koheiw's command, and while the computation took a few seconds, it used hardly any memory at all. However, the word counts it gives differ from (are strictly less than) those from the ntoken() command and the figures shown by summary(wholecorpus). I am wondering why there is not simply a quick command to extract the token counts, which summary() seems to have already computed and displayed.

Best.

@EricHe98 (Author) commented Jun 19, 2017 via email

@kbenoit (Collaborator) commented Jun 19, 2017

The reason is that currently a corpus has not yet been tokenised, so counting tokens has to be done on the fly. Also, we leave it to the user to define what a "token" is, so there is no universal answer. That is why you are getting different results with stri_count_boundaries(), but they will be close.
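
As a rough illustration on a toy string (a sketch; the exact numbers depend on the tokeniser options and on the text):

library(quanteda)
library(stringi)

txt <- "Hello, world! This is a short example."
ntoken(txt)                  # quanteda tokeniser: punctuation counted as tokens by default
stri_count_boundaries(txt)   # boundary-based estimate with the default break iterator: close, but not identical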

In a future version of the package we will index tokens in the corpus, while still allowing the user to define what counts as a token. That is not done yet, though.

@EricHe98 (Author)

All right, thanks for all your help Ken. Your package is much appreciated!
