
dfm.corpus(x, verbose = TRUE) never completes the verbose output #1894

Closed
kbenoit opened this issue Feb 27, 2020 · 3 comments · Fixed by #1902

kbenoit commented Feb 27, 2020

Unlike the tokens() dispatch sequence, which always ends with tokens.tokens(), the dfm() sequence does not end with dfm.dfm(). As a result, when verbose = TRUE, we never see the closing lines of the verbose output.
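
A minimal base-R sketch of the dispatch pattern described above (hypothetical names — this is not quanteda's actual implementation): each input class converts its argument and re-dispatches, so the method for the object's own class always runs last and can print the closing message.

```r
# Hypothetical illustration: a generic whose methods funnel every input
# into the method for the final class, which finishes the verbose output.
dfm_demo <- function(x, verbose = FALSE) UseMethod("dfm_demo")

dfm_demo.corpus <- function(x, verbose = FALSE) {
  if (verbose) message("Creating a dfm from a corpus input...")
  # convert and re-dispatch, so dfm_demo.dfm() always runs last
  dfm_demo(structure(x, class = "dfm"), verbose = verbose)
}

dfm_demo.dfm <- function(x, verbose = FALSE) {
  if (verbose) message("   ... complete.")  # closing lines belong here
  x
}

out <- dfm_demo(structure(list(), class = "corpus"), verbose = TRUE)
```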

It should look like this:

> dfm(data_dfm_lbgexample, remove = "A*", verbose = TRUE)
Creating a dfm from a dfm input...
   ... removed 1 feature
   ... lowercasing
   ... created a 6 x 36 sparse dfm
   ... complete. 
Elapsed time: 0.011 seconds.

but instead the final two or three lines never appear:

> dfmat <- dfm(data_corpus_inaugural, remove = stopwords("en"), verbose = TRUE)
Creating a dfm from a corpus input...
   ... lowercasing
   ... found 58 documents, 9,399 features
   ... removed 136 features
@kbenoit kbenoit added this to the v2.0.1 bugfixes milestone Feb 27, 2020

koheiw commented Feb 27, 2020

Do we still need the elapsed time?

Elapsed time: 0.011 seconds.

I want to remove these lines:

quanteda/R/dfm.R

Lines 142 to 143 in d360d82

dfm_env <- new.env()
dfm_env$START_TIME <- NULL

If users are interested in timing, they can just use system.time().
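
For example, a minimal sketch of timing a dfm() call with base R instead of relying on the verbose output (the quanteda call is assumed from the examples above and shown commented out):

```r
# With quanteda loaded, the timing would be measured like this:
# timing <- system.time(
#   dfmat <- dfm(data_corpus_inaugural, remove = stopwords("en"))
# )

# system.time() works on any expression; a stand-in computation here:
timing <- system.time({
  x <- rowSums(matrix(rnorm(1e5), nrow = 100))
})
print(timing["elapsed"])  # elapsed wall-clock time in seconds
```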


kbenoit commented Feb 27, 2020

Yes, I very much like the elapsed time and the closing message. They tell us something about the timing and the sequence of operations, which is good for diagnostics, and useful for users who want to know the precise sequence and/or timing without wrapping the function themselves.

Works nicely for tokens() too.

> tokens(data_corpus_inaugural, remove_punct = TRUE, 
+        remove_symbols = TRUE, remove_numbers = TRUE,
+        verbose = TRUE)
Creating a tokens object from a corpus input...
...starting tokenization
...preserving hyphens
...preserving social media tags (#, @)
...tokenizing 1 of 1 blocks
...segmenting tokens
...serializing tokens 10062 unique types
...removing separators, punctuation, symbols, numbers 
...total elapsed:  0.406 seconds.
Finished constructing tokens from 58 texts.


koheiw commented Feb 28, 2020

Have you ever really used the elapsed time for diagnostics? I doubt it, because the elapsed time of a single execution doesn't mean much. When we want to find slow functions, we use profvis::profvis().

I like this part

...starting tokenization
...preserving hyphens
...preserving social media tags (#, @)
...tokenizing 1 of 1 blocks
...segmenting tokens
...serializing tokens 10062 unique types
...removing separators, punctuation, symbols, numbers

but not this

...total elapsed:  0.406 seconds.
