
dfm.corpus(x, verbose = TRUE) never completes the verbose output #1894

Closed
kbenoit opened this issue Feb 27, 2020 · 3 comments · Fixed by #1902

kbenoit commented Feb 27, 2020

Unlike the tokens() dispatch sequence, which always ends with tokens.tokens(), the dfm() sequence does not end with dfm.dfm(). As a result, when verbose = TRUE, we never see the closing lines of the verbose output.
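
A minimal base-R sketch of the dispatch pattern described above (hypothetical names — this is not quanteda's actual implementation): each input class converts its argument and re-dispatches, so the method for the object's own class always runs last and can print the closing message.

```r
# Hypothetical illustration: a generic whose methods funnel every input
# into the method for the final class, which finishes the verbose output.
dfm_demo <- function(x, verbose = FALSE) UseMethod("dfm_demo")

dfm_demo.corpus <- function(x, verbose = FALSE) {
  if (verbose) message("Creating a dfm from a corpus input...")
  # convert and re-dispatch, so dfm_demo.dfm() always runs last
  dfm_demo(structure(x, class = "dfm"), verbose = verbose)
}

dfm_demo.dfm <- function(x, verbose = FALSE) {
  if (verbose) message("   ... complete.")  # closing lines belong here
  x
}

out <- dfm_demo(structure(list(), class = "corpus"), verbose = TRUE)
```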

It should look like this:

> dfm(data_dfm_lbgexample, remove = "A*", verbose = TRUE)
Creating a dfm from a dfm input...
   ... removed 1 feature
   ... lowercasing
   ... created a 6 x 36 sparse dfm
   ... complete. 
Elapsed time: 0.011 seconds.

but instead the final two or three lines never appear:

> dfmat <- dfm(data_corpus_inaugural, remove = stopwords("en"), verbose = TRUE)
Creating a dfm from a corpus input...
   ... lowercasing
   ... found 58 documents, 9,399 features
   ... removed 136 features
@kbenoit kbenoit added this to the v2.0.1 bugfixes milestone Feb 27, 2020

koheiw commented Feb 27, 2020

Do we still need the elapsed time?

Elapsed time: 0.011 seconds.

I want to remove these lines:

quanteda/R/dfm.R

Lines 142 to 143 in d360d82

dfm_env <- new.env()
dfm_env$START_TIME <- NULL

If users are interested in timing, they can just use system.time().
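
For example, a minimal sketch of timing a dfm() call with base R instead of relying on the verbose output (the quanteda call is assumed from the examples above and shown commented out):

```r
# With quanteda loaded, the timing would be measured like this:
# timing <- system.time(
#   dfmat <- dfm(data_corpus_inaugural, remove = stopwords("en"))
# )

# system.time() works on any expression; a stand-in computation here:
timing <- system.time({
  x <- rowSums(matrix(rnorm(1e5), nrow = 100))
})
print(timing["elapsed"])  # elapsed wall-clock time in seconds
```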


kbenoit commented Feb 27, 2020

Yes, I very much like the elapsed time and the closing message. They tell us something about the timing and the sequence of operations, which is good for diagnostics, and useful for users who want to know the precise sequence and/or timing without wrapping the function themselves.

Works nicely for tokens() too.

> tokens(data_corpus_inaugural, remove_punct = TRUE, 
+        remove_symbols = TRUE, remove_numbers = TRUE,
+        verbose = TRUE)
Creating a tokens object from a corpus input...
...starting tokenization
...preserving hyphens
...preserving social media tags (#, @)
...tokenizing 1 of 1 blocks
...segmenting tokens
...serializing tokens 10062 unique types
...removing separators, punctuation, symbols, numbers 
...total elapsed:  0.406 seconds.
Finished constructing tokens from 58 texts.


koheiw commented Feb 28, 2020

Have you ever really used the elapsed time for diagnostics? I doubt it, because the elapsed time of a single execution doesn't mean much. When we want to find slow functions, we use profvis::profvis().

I like this part

...starting tokenization
...preserving hyphens
...preserving social media tags (#, @)
...tokenizing 1 of 1 blocks
...segmenting tokens
...serializing tokens 10062 unique types
...removing separators, punctuation, symbols, numbers

but not this

...total elapsed:  0.406 seconds.
