improve dfm_trim() options #1254

Closed
kbenoit opened this issue Mar 4, 2018 · 10 comments
@kbenoit
Collaborator

kbenoit commented Mar 4, 2018

dfm_trim() currently accepts two sets of min/max arguments, for count and docfreq respectively. Each also accepts a fractional value, currently interpreted as a percentile of the features by their total frequencies. We want to make this interpretation both more consistent and more flexible by adding arguments to dfm_trim().

Proposal

dfm_trim(x,
    min_count = 1, max_count = NULL, count_fun = c("sum", "prop", "pctile", "invrank"),
    min_docfreq = 1, max_docfreq = NULL, docfreq_fun = c("boolean", "prop", "pctile", "invrank"),
    sparsity = NULL,
    verbose = quanteda_options("verbose"))

where

  • count_fun refers to the aggregation across documents for the feature count, whether this is weighted or not. Applied to the global set of feature counts, the values mean the total, the proportion, the percentile, and the inverse rank of each feature, respectively. (Inverse rank means selecting the top min_count features by frequency.)
  • docfreq_fun refers to the method of applying the document frequency threshold across documents. boolean is the default, which means count the document frequency of a feature as the number of documents in which it has some weighted value > 0. The other values are the same as applied for counts.
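
The four aggregations and the boolean document frequency can be sketched in base R on a toy matrix. This is only an illustration of the quantities described above, not the dfm_trim() implementation: the matrix m and all variable names are made up, and using ecdf() for the percentile is an assumption, since the proposal does not fix a percentile definition.

```r
# Toy document-feature matrix: 3 documents x 4 features
# (a plain base-R matrix, not a quanteda dfm).
m <- matrix(c(2, 0, 1, 0,
              3, 1, 0, 0,
              5, 0, 2, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("d", 1:3), c("a", "b", "c", "d")))

total   <- colSums(m)                         # "sum": total count of each feature
prop    <- total / sum(total)                 # "prop": share of all feature counts
pctile  <- ecdf(total)(total)                 # "pctile": share of features with total <= this one (assumed definition)
invrank <- rank(-total, ties.method = "min")  # "invrank": 1 = most frequent feature

docfreq <- colSums(m > 0)                     # "boolean": number of documents containing the feature
```

Because the counts are so uneven, prop for feature "a" is far larger than for the others, which is the Zipf-type skew the thresholds have to cope with.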

Examples

counts

# remove any features occurring fewer than 5 times in total
dfm_trim(x, min_count = 5)

# remove all but the top 200 most frequent features
dfm_trim(x, min_count = 200, count_fun = "invrank")

# remove any features not occurring at a rate of 10% of total feature count
dfm_trim(x, min_count = .10, count_fun = "prop")

# keep only features occurring at the median total frequency or above
dfm_trim(x, min_count = .50, count_fun = "pctile")

document frequency

# remove any features occurring in fewer than 5 documents each
dfm_trim(x, min_docfreq = 5)

# keep only the most frequent 200 features in terms of document frequency
dfm_trim(x, min_docfreq = 200, docfreq_fun = "invrank")

# remove any features not reaching at least 5% of the total document frequency
dfm_trim(x, min_docfreq = .05, docfreq_fun = "prop")

# keep only features with the median document frequency or higher
dfm_trim(x, min_docfreq = .50, docfreq_fun = "pctile")

@koheiw
Collaborator

koheiw commented Mar 5, 2018

I like the functionality, so I can start implementing, but I am not sure about the labels. "invrank" could be just "rank" because it is unlikely that people think this is for the least frequent features. "pctile" should be "perc" because "prop" is the first four letters of proportion.

@kbenoit
Collaborator Author

kbenoit commented Mar 5, 2018

OK, maybe "toprank"? I don't favour "perc" because I think this is too easily confused with percent.

One of my earlier ideas was for the user to supply a function for aggregation, such as sum, min, max, quantile, etc with options passed through ... but for the function rank they would need to reverse it using a user function. Probably too difficult, and unnecessary.

@koheiw
Collaborator

koheiw commented Mar 5, 2018

I support your proposal to add count_fun instead of taking functions through .... Rank, percentile and proportion are all thresholds that can be applied to select either top or bottom features, so if rank is named toprank, the others should be named topprop and topperc as well. I think this is redundant. As for percentile, how about quan (or quant) from quantile()? But then min_count = 0.05 would mean keeping the top 5%, which is also the reverse of the base function's behavior.

Sticking to the behavior of base functions is possible, but we need to ask users for a bit more coding. For top 10 features:

dfm_trim(some_dfm, min_count = nfeat(some_dfm) - 10, count_fun = "rank")
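
The direction problem can be seen with base R's rank() alone. The following is a minimal sketch with a made-up named vector of total feature counts; whether dfm_trim() would compare with > or >= is not settled in this thread, so the strict comparison here is an assumption.

```r
# Hypothetical total counts for five features.
total <- c(the = 50, of = 30, data = 12, model = 7, zipf = 1)
k <- 2  # number of top features to keep

# Base-R direction: rank 1 is the LEAST frequent feature,
# so the top k features are those ranked above n - k.
asc <- rank(total, ties.method = "first")
top_base <- names(total)[asc > length(total) - k]

# Inverted direction: rank 1 is the MOST frequent feature,
# so the top k features are simply those ranked <= k.
desc <- rank(-total, ties.method = "first")
top_inv <- names(total)[desc <= k]
```

Both selections pick the same features; the difference is only whether the user has to subtract from the number of features.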

Relatedly, we also need to decide which is more appropriate, min_count or min_freq. If the input value is not a raw count, min_freq sounds more appropriate. We currently have both:

textplot_network(x, min_freq = 0.5, ...)
textplot_wordcloud(x, min_size = 0.5, max_size = 4, min_count = 3, ...)

@koheiw
Collaborator

koheiw commented Mar 5, 2018

I also think that the default values for count_fun and docfreq_fun should be raw because count_fun is applied to colSums() and docfreq() uses boolean values always.

@koheiw
Collaborator

koheiw commented Mar 5, 2018

A consistent design here would be to use larger min_count or max_count values for frequent features, and to set the default value min_count = 0 regardless of count_fun. In this way, we can avoid the inversion problem.

@kbenoit
Collaborator Author

kbenoit commented Mar 5, 2018

Can you collect the above into an alternate set of examples and syntax, similar to the first suggestion above?

Briefly:

  • What you suggest about "raw" to mean sum makes sense, but "prop" as I suggested above is not a proportion of column sums, but rather a proportion of the column sum from all features. So if the feature counts were very unevenly distributed (as they always are because of Zipf's law etc) then using that method could result in lots of features being trimmed. But I suppose that since this still involves column sums it still makes sense within the scheme whose default is "raw".
  • I think we can just use "rank" and make it clear that we mean the top-ranked features where top means most frequent - we can specify that in the Rd. For percentile and percent those are naturally from lowest to highest so I don't think we need to qualify those. e.g. 95th percentile always means the top 5% most frequent features.
  • I don't think we have an inconsistency between min_count and min_freq since the latter can be for features or documents, and is usually weighted (and hence not summing counts).

@koheiw
Collaborator

koheiw commented Mar 5, 2018

It would be like these:

counts

# remove any features occurring fewer than 5 times in total
dfm_trim(x, min_count = 5, count_fun = "raw") 

# remove all but the top 200 most frequent features
dfm_trim(x, min_count = nfeat(x) - 200, count_fun = "rank")

# keep all features
dfm_trim(x, min_count = 1, count_fun = "rank")

# remove any features not occurring at a rate of 10% of total feature count
dfm_trim(x, min_count = .10, count_fun = "prop")

# keep only features occurring at the median total frequency or above
dfm_trim(x, min_count = .50, count_fun = "quant")

# keep only 5% most frequent features
dfm_trim(x, min_count = .95, count_fun = "quant")

document frequency

# remove any features occurring in fewer than 5 documents each
dfm_trim(x, min_docfreq = 5, docfreq_fun = "raw") 

# keep only the most frequent 200 features in terms of document frequency
dfm_trim(x, min_docfreq = nfeat(x) - 200, docfreq_fun = "rank")

# remove any features not reaching at least 5% of the total document frequency
dfm_trim(x, min_docfreq = .05, docfreq_fun = "prop")

# keep only features with the median document frequency or higher
dfm_trim(x, min_docfreq = .50, docfreq_fun = "quant")

@kbenoit
Collaborator Author

kbenoit commented Mar 6, 2018

Updated thinking:

dfm_trim(x, 
    min_termfreq = NULL, max_termfreq = NULL, termfreq_type = c("count", "rank", "quantile"),
    min_docfreq = NULL, max_docfreq = NULL, docfreq_type = c("count", "rank", "quantile"),
    sparsity = NULL,
    verbose = quanteda_options("verbose"))

This changes the behaviour: at the moment, calling just dfm_trim(x) returns a dfm containing only features with total frequency counts >= 1. With the new signature, supplying only x with no additional arguments would return the original dfm, unchanged.

We should trap min_count and max_count arguments and issue a warning if they are used.
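
A minimal sketch of both points, in base R on a named vector of total counts. trim_sketch() is a hypothetical stand-in, not the real dfm_trim(): it only shows NULL defaults leaving the input unchanged and an old argument name being trapped through ... with a warning.

```r
# Hypothetical stand-in for dfm_trim(), working on total feature counts.
trim_sketch <- function(total, min_termfreq = NULL, max_termfreq = NULL, ...) {
  dots <- list(...)
  # Trap the old argument name, warn, and map it to the new one.
  if (!is.null(dots[["min_count"]])) {
    warning("min_count is deprecated; use min_termfreq instead")
    if (is.null(min_termfreq)) min_termfreq <- dots[["min_count"]]
  }
  if (!is.null(dots[["max_count"]])) {
    warning("max_count is deprecated; use max_termfreq instead")
    if (is.null(max_termfreq)) max_termfreq <- dots[["max_count"]]
  }
  # With NULL thresholds nothing is filtered, so the input comes back unchanged.
  keep <- rep(TRUE, length(total))
  if (!is.null(min_termfreq)) keep <- keep & total >= min_termfreq
  if (!is.null(max_termfreq)) keep <- keep & total <= max_termfreq
  total[keep]
}

total <- c(a = 10, b = 1, c = 3)
unchanged <- trim_sketch(total)  # no thresholds supplied
```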

@HolgersID

Hi,
thank you for the continuous improvement of quanteda!

Sorry, I am a bit confused about the status of dfm_trim. In the current version of quanteda (1.1.6, last commit 8a15a33) there seem to be two issues regarding dfm_trim:

  1. dfm_trim breaks backward compatibility

In v1.1.1:

(myDfm <- dfm(data_corpus_inaugural[1:5]))
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
(dfm_trim (myDfm, min_docfreq=.4))
## Document-feature matrix of: 5 documents, 569 features (44.2% sparse).

In 1.1.6:

(myDfm <- dfm(data_corpus_inaugural[1:5]))
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
dfm_trim (myDfm, min_docfreq=.4)
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).

In order to get the original result one has to change the call of dfm_trim to:

dfm_trim (myDfm, min_docfreq=.4, docfreq_type="prop")
## Document-feature matrix of: 5 documents, 569 features (44.2% sparse).

What is your strategy regarding backward compatibility?

  2. The value "prop" of argument docfreq_type is not documented in the 1.1.6 version.
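
The difference reported in the first issue can be reproduced in miniature with base R. This sketch assumes that "prop" means document frequency divided by the number of documents; the matrix and feature names are made up, and it does not call quanteda.

```r
# Toy 5-document matrix (base R, not a quanteda dfm); columns are features.
m <- matrix(c(1, 1, 0,
              2, 0, 0,
              1, 1, 0,
              3, 0, 1,
              1, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("d", 1:5), c("the", "union", "zeal")))

df_count <- colSums(m > 0)       # number of documents containing each feature
df_prop  <- df_count / nrow(m)   # assumed meaning of "prop": share of documents

# Threshold 0.4 read as a count: every feature that occurs at all passes.
keep_count <- names(df_count)[df_count >= 0.4]

# Threshold 0.4 read as a proportion: the feature must be in at least 2 of 5 documents.
keep_prop <- names(df_prop)[df_prop >= 0.4]
```

Read as a count, 0.4 removes nothing, which matches the unchanged 1,948 features reported for 1.1.6.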

Sorry if this is an obsolete comment because you are already aware of the issues.

@kbenoit kbenoit reopened this Apr 6, 2018
@kbenoit
Collaborator Author

kbenoit commented Apr 6, 2018

Good eye @HolgersID! We changed dfm_trim() slightly, to make it more consistent and more logical. Our policy towards backward compatibility is to try to achieve this whenever possible, usually by trapping and supporting deprecated arguments, but issuing a note when they are used. In this case, we changed the behaviour.

@koheiw I see two things we missed. First, the signature for the generic does not match that of dfm_trim.dfm(), since it omits "prop" from the list of docfreq_type values. Second, we should issue a warning when someone uses a min_/max_ value < 1 while the *_type is "count" and the @weightTf for the dfm is also "count". (Remember we allowed fractional values because some dfms are weighted.)
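
The warning condition could look roughly like the following. check_fractional() and its arguments are hypothetical names used only for illustration; the actual check added in the referenced commit may differ.

```r
# Hypothetical helper: warn when a fractional threshold is combined with
# type "count" on a dfm that holds unweighted counts.
check_fractional <- function(threshold, type = "count", weighting = "count") {
  suspicious <- !is.null(threshold) && threshold < 1 &&
    type == "count" && weighting == "count"
  if (suspicious) {
    warning("fractional threshold used with type 'count' on unweighted counts; ",
            "did you mean type = 'prop'?")
  }
  invisible(suspicious)
}

flag_count <- suppressWarnings(check_fractional(0.4))  # fractional threshold, count type
flag_prop  <- check_fractional(0.4, type = "prop")     # proportion requested explicitly
flag_whole <- check_fractional(5)                      # whole-number threshold
```

Only the first call meets all three conditions; a weighted dfm (weighting other than "count") would also pass silently, since fractional values are legitimate there.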

koheiw added a commit that referenced this issue Apr 6, 2018
- Add warnings of fractional termfreq
- Correct generic signiture
kbenoit added a commit that referenced this issue Apr 7, 2018
@kbenoit kbenoit closed this as completed Apr 7, 2018