improve dfm_trim() options #1254

kbenoit · 2018-03-04T19:38:29Z

dfm_trim() currently accepts two sets of min/max arguments, for count and docfreq respectively. Each also accepts a fractional value, currently interpreted as a percentile of the features by their total frequencies. We want to make this interpretation both more consistent and more flexible by adding arguments to dfm_trim().

Proposal

dfm_trim(x,
    min_count = 1, max_count = NULL, count_fun = c("sum", "prop", "pctile", "invrank"),
    min_docfreq = 1, max_docfreq = NULL, docfreq_fun = c("boolean", "prop", "pctile", "invrank"),
    sparsity = NULL,
    verbose = quanteda_options("verbose"))

where

count_fun refers to how a feature's counts are aggregated across documents, whether the dfm is weighted or not. Applied to the global set of feature counts, the four values mean the total, the proportion, the percentile, and the inverse rank of each feature, respectively. (Inverse rank means keeping the min_count most frequent features.)
docfreq_fun refers to how the document frequency threshold is applied across documents. "boolean" is the default: the document frequency of a feature is the number of documents in which it has a weighted value > 0. The other values work as they do for counts.
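
To make the proposed values concrete, here is a minimal base-R sketch on a toy count matrix. It is not the quanteda implementation: the matrix `m` and its feature names are invented, and the use of `ecdf()` for "pctile" and the tie handling for "invrank" are assumptions about how the proposal might be read.

```r
# Toy document-feature counts: 3 documents (rows) x 4 features (columns).
# The values are invented for illustration only.
m <- matrix(c(5, 0, 1, 0,
              3, 2, 0, 0,
              2, 1, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("a", "b", "c", "d")))

total   <- colSums(m)                         # "sum": total count per feature
prop    <- total / sum(total)                 # "prop": share of all feature counts
pctile  <- ecdf(total)(total)                 # "pctile": share of features with total <= this one
invrank <- rank(-total, ties.method = "min")  # "invrank": 1 = most frequent

# "boolean" document frequency: number of documents where the feature is > 0
docfreq <- colSums(m > 0)

# Assumed reading of min_count = 2, count_fun = "invrank":
# keep the 2 most frequent features
keep <- names(total)[invrank <= 2]
```

Here `keep` holds the two features with the largest column sums, and `docfreq` counts documents rather than occurrences.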

Examples

counts

# remove any features occurring fewer than 5 times in total
dfm_trim(x, min_count = 5)

# remove all but the top 200 most frequent features
dfm_trim(x, min_count = 200, count_fun = "invrank")

# remove any features not occurring at a rate of 10% of total feature count
dfm_trim(x, min_count = .10, count_fun = "prop")

# keep only features occurring at the median total frequency or above
dfm_trim(x, min_count = .50, count_fun = "pctile")

document frequency

# remove any features occurring in fewer than 5 documents each
dfm_trim(x, min_docfreq = 5)

# keep only the most frequent 200 features in terms of document frequency
dfm_trim(x, min_docfreq = 200, docfreq_fun = "invrank")

# remove any features not accounting for at least 5% of the total document frequency
dfm_trim(x, min_docfreq = .05, docfreq_fun = "prop")

# keep only features with the median document frequency or higher
dfm_trim(x, min_docfreq = .50, docfreq_fun = "pctile")

koheiw · 2018-03-05T09:19:19Z

I like the functionality, so I can start implementing, but I am not sure about the labels. "invrank" could be just "rank" because it is unlikely that people think this is for the least frequent features. "pctile" should be "perc" because "prop" is the first four letters of proportion.

kbenoit · 2018-03-05T09:22:36Z

OK, maybe "toprank"? I don't favour "perc" because I think this is too easily confused with percent.

One of my earlier ideas was for the user to supply a function for aggregation, such as sum, min, max, quantile, etc., with options passed through ..., but for the function rank they would need to reverse it using a user function. Probably too difficult, and unnecessary.

koheiw · 2018-03-05T10:32:53Z

I support your proposal to add count_fun instead of taking functions through .... Rank, percentile and proportion are all thresholds that can be applied to select either the top or the bottom features, so the others would have to be named topprop and topperc if rank is toprank. I think this is redundant. As for percentile, how about quan (or quant), from quantile()? But then min_count = 0.05 would mean keeping the top 5%, which also reverses the behavior of the base function.

Sticking to the behavior of base functions is possible, but we need to ask users for a bit more coding. For top 10 features:

dfm_trim(some_dfm, min_count = nfeat(some_dfm) - 10, count_fun = "rank")
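
The reversal discussed here comes from base R's rank(), which assigns rank 1 to the smallest value. A minimal sketch with made-up totals follows; whether the threshold comparison would be > or >= was not settled in this thread, so the strict > used below is an assumption.

```r
# Made-up total frequencies for five features
total <- c(the = 50, of = 30, data = 12, model = 7, zipf = 1)

# Base rank(): 1 = least frequent, length(total) = most frequent
r <- rank(total, ties.method = "min")

# Keeping the top n features under base-rank semantics means thresholding
# at length(total) - n (assuming a strict ">" comparison)
n <- 2
top_n <- names(total)[r > length(total) - n]
```

With an inverted ranking the same selection would be just `rank(-total) <= n`, which is the trade-off between following base behavior and asking users for less code.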

Relatedly, we also need to decide which is more appropriate, min_count or min_freq. If the input value is not a raw count, min_freq sounds more appropriate. We currently have both:

textplot_network(x, min_freq = 0.5, ...)
textplot_wordcloud(x, min_size = 0.5, max_size = 4, min_count = 3, ...)

koheiw · 2018-03-05T10:52:18Z

I also think that the default values for count_fun and docfreq_fun should be raw because count_fun is applied to colSums() and docfreq() uses boolean values always.

koheiw · 2018-03-05T11:35:16Z

A consistent design here would be to use larger min_count or max_count values for more frequent features, and to set the default value min_count = 0 regardless of count_fun. In this way, we can avoid the inversion problem.

kbenoit · 2018-03-05T17:34:42Z

Can you collect the above into an alternate set of examples and syntax, similar to the first suggestion above?

Briefly:

What you suggest about "raw" to mean sum makes sense, but "prop" as I suggested above is not a proportion of column sums, but rather a proportion of the column sum from all features. So if the feature counts were very unevenly distributed (as they always are because of Zipf's law etc) then using that method could result in lots of features being trimmed. But I suppose that since this still involves column sums it still makes sense within the scheme whose default is "raw".
I think we can just use "rank" and make it clear that we mean the top-ranked features where top means most frequent - we can specify that in the Rd. For percentile and percent those are naturally from lowest to highest so I don't think we need to qualify those. e.g. 95th percentile always means the top 5% most frequent features.
I don't think we have an inconsistency between min_count and min_freq since the latter can be for features or documents, and is usually weighted (and hence not summing counts).
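
As a quick check of the percentile reading above, a minimal base-R sketch with invented totals shows that thresholding at the 95th percentile keeps the most frequent 5% of features. The vector `total` and the >= comparison are assumptions for illustration, not the package's code.

```r
# Made-up total frequencies for 20 features: 1, 2, ..., 20
total <- setNames(1:20, paste0("f", 1:20))

# Value at the 95th percentile of the feature totals (default quantile type)
cutoff <- quantile(total, probs = 0.95, names = FALSE)

# Features at or above the cutoff are the most frequent ones
kept <- names(total)[total >= cutoff]
```

One feature out of twenty survives, i.e. the top 5% by total frequency.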

koheiw · 2018-03-05T21:32:13Z

It would be like these:

counts

# remove any features occurring fewer than 5 times in total
dfm_trim(x, min_count = 5, count_fun = "raw") 

# remove all but the top 200 most frequent features
dfm_trim(x, min_count = nfeat(x) - 200, count_fun = "rank")

# keep all features
dfm_trim(x, min_count = 1, count_fun = "rank")

# remove any features not occurring at a rate of 10% of total feature count
dfm_trim(x, min_count = .10, count_fun = "prop")

# keep only features occurring at the median total frequency or above
dfm_trim(x, min_count = .50, count_fun = "quant")

# keep only 5% most frequent features
dfm_trim(x, min_count = .95, count_fun = "quant")

document frequency

# remove any features occurring in fewer than 5 documents each
dfm_trim(x, min_docfreq = 5, docfreq_fun = "raw") 

# keep only the most frequent 200 features in terms of document frequency
dfm_trim(x, min_docfreq = nfeat(x) - 200, docfreq_fun = "rank")

# remove any features not accounting for at least 5% of the total document frequency
dfm_trim(x, min_docfreq = .05, docfreq_fun = "prop")

# keep only features with the median document frequency or higher
dfm_trim(x, min_docfreq = .50, docfreq_fun = "quant")

kbenoit · 2018-03-06T16:37:57Z

Updated thinking:

dfm_trim(x, 
    min_termfreq = NULL, max_termfreq = NULL, termfreq_type = c("count", "rank", "quantile"),
    min_docfreq = NULL, max_docfreq = NULL, docfreq_type = c("count", "rank", "quantile"),
    sparsity = NULL,
    verbose = quanteda_options("verbose"))

This changes the behaviour: at the moment, calling just dfm_trim(x) returns a dfm containing only features with total frequency counts >= 1. With the new signature, supplying only x with no additional arguments would return the original dfm unchanged.

We should trap min_count and max_count arguments and issue a warning if they are used.
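
One way the trapping could look is sketched below. `trim_sketch()` is a hypothetical stand-in that works on a plain count matrix, not the dfm_trim() method itself, and the warning text is invented. It also shows the NULL defaults leaving the input unchanged.

```r
# Hypothetical stand-in for dfm_trim() that works on a plain count matrix.
# It only illustrates trapping the old argument names through `...`.
trim_sketch <- function(x, min_termfreq = NULL, max_termfreq = NULL, ...) {
  old <- list(...)
  if (!is.null(old$min_count)) {
    warning("min_count is deprecated; use min_termfreq", call. = FALSE)
    if (is.null(min_termfreq)) min_termfreq <- old$min_count
  }
  if (!is.null(old$max_count)) {
    warning("max_count is deprecated; use max_termfreq", call. = FALSE)
    if (is.null(max_termfreq)) max_termfreq <- old$max_count
  }
  total <- colSums(x)
  keep <- rep(TRUE, ncol(x))
  if (!is.null(min_termfreq)) keep <- keep & total >= min_termfreq
  if (!is.null(max_termfreq)) keep <- keep & total <= max_termfreq
  x[, keep, drop = FALSE]
}

m <- matrix(c(5, 0, 1,
              3, 2, 0), nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

unchanged <- trim_sketch(m)  # no thresholds supplied: x is returned as is
trimmed <- suppressWarnings(trim_sketch(m, min_count = 2))  # old name still works
```

Forwarding the old value rather than ignoring it keeps existing scripts working while still nudging users toward the new names.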

HolgersID · 2018-04-06T12:37:04Z

Hi,
thank you for the continuous improvement of quanteda!

Sorry, I am a bit confused about the status of dfm_trim. In the current version of quanteda (1.1.6, last commit 8a15a33) there seem to be two issues regarding dfm_trim:

dfm_trim breaks backward compatibility
In v1.1.1:

(myDfm <- dfm(data_corpus_inaugural[1:5]))
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
(dfm_trim (myDfm, min_docfreq=.4))
## Document-feature matrix of: 5 documents, 569 features (44.2% sparse).

In 1.1.6:

(myDfm <- dfm(data_corpus_inaugural[1:5]))
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
dfm_trim (myDfm, min_docfreq=.4)
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).

In order to get the original result one has to change the call of dfm_trim to:

dfm_trim (myDfm, min_docfreq=.4, docfreq_type="prop")
## Document-feature matrix of: 5 documents, 569 features (44.2% sparse).

What is your strategy regarding backward compatibility?

The value "prop" of argument docfreq_type is not documented in the 1.1.6 version.

Sorry if this is an obsolete comment because you are already aware of the issues.

kbenoit · 2018-04-06T12:53:25Z

Good eye @HolgersID! We changed dfm_trim() slightly, to make it more consistent and more logical. Our policy towards backward compatibility is to try to achieve this whenever possible, usually by trapping and supporting deprecated arguments, but issuing a note when they are used. In this case, we changed the behaviour.

@koheiw I see two things we missed. First, the signature for the generic does not match that of dfm_trim.dfm(), since it omits "prop" from the list of docfreq_type values. Second, we should issue a warning when someone uses a min_/max_ value < 1 while the *_type is "count" and the @weightTf for the dfm is also "count". (Remember we allowed fractional values because some dfms are weighted.)
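
The second check might be sketched roughly as follows. `check_fractional()` is a hypothetical helper, not a quanteda function; its arguments, the warning wording, and the choice to treat only values strictly between 0 and 1 as fractional are all assumptions.

```r
# Hypothetical helper, not part of quanteda: warn when a fractional threshold
# is combined with type "count" on a dfm whose weighting is also "count".
check_fractional <- function(value, type, weight = "count") {
  if (is.null(value)) return(invisible(FALSE))
  # Assumption: "fractional" means strictly between 0 and 1
  suspicious <- value > 0 && value < 1 && type == "count" && weight == "count"
  if (suspicious) {
    warning("fractional threshold used with type = \"count\"; ",
            "did you mean type = \"prop\"?", call. = FALSE)
  }
  invisible(suspicious)
}

ok      <- check_fractional(0.4, type = "prop")                     # no warning
flagged <- suppressWarnings(check_fractional(0.4, type = "count"))  # warns
```

A weighted dfm (any `weight` other than "count") passes silently, which matches the reason fractional values were allowed in the first place.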

kbenoit added enhancement dfm labels Mar 4, 2018

kbenoit assigned kbenoit and koheiw Mar 4, 2018

kbenoit mentioned this issue Mar 4, 2018

how to select top most frequent features from a dfm #1253

Closed

koheiw added a commit that referenced this issue Mar 8, 2018

Add extra arguments for #1254 but needs more specification

08da571

kbenoit closed this as completed Mar 10, 2018

kbenoit reopened this Apr 6, 2018

koheiw added a commit that referenced this issue Apr 6, 2018

Address #1254

f342ffc

- Add warnings of fractional termfreq
- Correct generic signature

kbenoit added a commit that referenced this issue Apr 7, 2018

Merge pull request #1300 from quanteda/issue-1254

fe41500

Address #1254

kbenoit closed this as completed Apr 7, 2018

danilovcorrea mentioned this issue Oct 14, 2019

How to keep the most frequent (95%) n-grams tokens? #1749

Closed
