# Subset quanteda corpus based on token count #795

Hi,
I have a corpus object with 306,108 documents that I would like to subset based on the word count of each document. The corpus `summary()` command already lists the number of tokens in each document, but I am unable to access those token counts from within `corpus_subset()`.
I believe I could add a `docvars()` variable giving the number of tokens of each document using the `ntoken()` function and then use `corpus_subset()` to subset on that variable, but this seems unnecessarily redundant (a sketch of that workaround follows below). Is there a more efficient way to compute this subset?
Thanks!
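For reference, a minimal sketch of the `docvars()` workaround described above, assuming a quanteda corpus named `wholecorpus` and a hypothetical cutoff `threshold`:

```r
library(quanteda)

threshold <- 100  # hypothetical minimum token count

# Store each document's token count as a document variable,
# then subset on that variable.
docvars(wholecorpus, "ntokens") <- ntoken(wholecorpus)
subcorpus <- corpus_subset(wholecorpus, ntokens > threshold)
```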
Try:

```r
subcorpus <- corpus_subset(wholecorpus, ntoken(wholecorpus) > threshold)
```

to select only the documents whose token count exceeds the threshold.
Hi Ken, The command did work, but it ate up my entire 16 GB of RAM and some swap space while doing the computation on the corpus, which is 2 GB. Is there anything I can do to reduce my memory usage? Thanks again.
@kbenoit the underlying function is

```r
ntoken.character <- function(x, ...) {
    ntoken(tokenize(x, ...))
}
```

Also we can use `stringi::stri_count_boundaries()`, which is much faster.
@koheiw that's a very good suggestion - I will change the code for `ntoken()` accordingly.
Actually, we cannot just count boundaries in `ntoken()`, because what counts as a token depends on the tokenizer options, so the two will not always agree. However @JumpyAF I suggest you try @koheiw's suggestion as:

```r
subcorpus <- corpus_subset(wholecorpus, stringi::stri_count_boundaries(texts(wholecorpus)) > threshold)
```
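A variant of the same idea, sketched under the assumption of the same `wholecorpus` and `threshold` names: compute the fast approximate counts once and cache them as a document variable, so the condition does not have to be recomputed on later subsets.

```r
library(quanteda)
library(stringi)

# Cache the cheap boundary-based counts as a document variable.
docvars(wholecorpus, "n_boundaries") <- stri_count_boundaries(texts(wholecorpus))
subcorpus <- corpus_subset(wholecorpus, n_boundaries > threshold)
```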
Hi all, I tried @koheiw's command, and while the computation took a few seconds, it did not use up any memory at all. However, the word counts it gives are different from (strictly less than) those from the `ntoken()` command and the information given by `summary(wholecorpus)`. I am wondering why there is not just a quick command to extract the token counts, which seem to be already computed and displayed by the corpus. Best.
The reason is that currently, a corpus has not yet been tokenised, so to count tokens this has to be done on the fly. Also, we leave it to the user to define what a "token" is, so there is no universal answer. You are getting different results with the `stri_count_boundaries()` approach because it applies a single fixed set of boundary rules, whereas `ntoken()` uses quanteda's tokenizer and whatever options you pass it. In a future version of the package we will index tokens in the corpus, and still allow the user to define what a token is when counting them. Not done yet though.
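To illustrate the point (my example, not from the thread), the two counting methods can disagree even on a short text, and `ntoken()` itself changes with the options supplied:

```r
library(quanteda)
library(stringi)

txt <- "Dr. Smith isn't here; she left at 5 p.m."

# quanteda's tokenizer: the count depends on the tokenizer options
ntoken(txt)                       # default tokenization
ntoken(txt, remove_punct = TRUE)  # fewer tokens once punctuation is dropped

# stringi applies one fixed set of ICU boundary rules, so its count
# will generally differ from ntoken()'s
stri_count_boundaries(txt)
```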
All right, thanks for all your help, Ken. Your package is much appreciated!