rank and frequency #1634

Closed
ManfredBernhard opened this issue Mar 2, 2019 · 5 comments

@ManfredBernhard

Dear Quanteda,
I realised that identical frequencies in a dfm receive different ranks as shown below.

txtfreq <- textstat_frequency(froschk_dfm)
txtfreq
   feature frequency rank docfreq group
1      und        64    1       1   all
2      der        37    2       1   all
3      sie        35    3       1   all
4      die        30    4       1   all
5       du        20    5       1   all
6   frosch        20    6       1   all
7       er        20    7       1   all
8       in        19    8       1   all
9      als        19    9       1   all
10     ich        17   10       1   all
11     war        15   11       1   all
12     ihr        15   12       1   all
13      es        15   13       1   all
14      da        15   14       1   all
15    aber        14   15       1   all
16      so        14   16       1   all
17     ein        13   17       1   all
18     dem        13   18       1   all

Can you please fix this bug?
Best,
ManfredBernhard

@jiongweilua
Collaborator

Hi @ManfredBernhard ,

Do you mean that in this section of the output, all three features should be ranked 5 rather than 5, 6, and 7?

  feature frequency rank docfreq group
5      du        20    5       1   all
6  frosch        20    6       1   all
7      er        20    7       1   all

@kbenoit
Collaborator

kbenoit commented Mar 2, 2019

Good point. It would be better to replace https://github.com/quanteda/quanteda/blob/master/R/textstat_frequency.R#L92-L93 with a call to data.table::frank() and add ties.method as an argument. Our existing method is basically the same as ties.method = "first".

However, it's easy to override this (although it can be a bit trickier, but not much, if you have used the groups argument). Just overwrite the rank column of the output with one you've computed on your own using alternative ties.method arguments to rank().

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.9000
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dfmat <- dfm(c("a a b c d d", "a b b c d"))
dfmat
## Document-feature matrix of: 2 documents, 4 features (0.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##        features
## docs    a b c d
##   text1 2 1 1 2
##   text2 1 2 1 1

tstat <- textstat_frequency(dfmat)
tstat
##   feature frequency rank docfreq group
## 1       a         3    1       2   all
## 2       b         3    2       2   all
## 3       d         3    3       2   all
## 4       c         2    4       2   all

rank(tstat[["frequency"]], ties.method = "last") %>%
  rev()
## [1] 1 2 3 4
rank(tstat[["frequency"]], ties.method = "average") %>%
  rev()
## [1] 1 3 3 3
set.seed(1)
rank(tstat[["frequency"]], ties.method = "random") %>%
  rev()
## [1] 1 4 3 2

So using any of the rank() calls above, you could reassign their return values to tstat[["rank"]] to get the type of tie handling you prefer.
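A hedged aside on the rev() pattern above: rank() assigns ranks in ascending order, and reversing its result does not in general line the ranks up with the rows of the descending table when frequencies are tied. For tie methods such as "min" or "average", ranking the negated frequencies may be the safer route. A minimal sketch, using a hand-built data frame (an assumption standing in for the textstat_frequency() output above, so that it runs without quanteda):

```r
# Hand-built stand-in for the textstat_frequency() output shown above
# (an assumption for illustration; not produced by quanteda here)
tstat <- data.frame(
  feature   = c("a", "b", "d", "c"),
  frequency = c(3, 3, 3, 2),
  stringsAsFactors = FALSE
)

# rank() sorts ascending, so negate the counts to give the largest count rank 1;
# the result is already in row order, so no rev() is needed
tstat$rank <- rank(-tstat$frequency, ties.method = "min")
tstat$rank
## [1] 1 1 1 4

# "average" assigns tied features the mean of the ranks they span
rank(-tstat$frequency, ties.method = "average")
## [1] 2 2 2 4
```

With "min", the three features tied at frequency 3 all share rank 1 and the next feature gets rank 4.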

@ManfredBernhard
Author

Dear All,
thanks for the fast response, though I am not sure whether I have been understood properly. Please see my response below.
Imagine the following situation: three runners run a 100-meter race, and all of them cross the finish line at exactly the same time of 10 seconds. This means that all three receive a gold medal, not a gold, a silver, and a bronze medal, respectively. In my "unwanted" solution of the frequency/rank calculation for the tokens of the fairy tale "Froschkönig" (The Frog Prince), the tokens "du", "frosch", and "er" each have a frequency of 20 and must therefore all be placed at rank 5, not at ranks 5, 6, and 7. Is there a solution for this?
By the way, is this the workflow that you are suggesting?
tstat_froschk <- textstat_frequency(froschk_dfm)

head(tstat_froschk)
  feature frequency rank docfreq group
1     und        64    1       1   all
2     der        37    2       1   all
3     sie        35    3       1   all
4     die        30    4       1   all
5      du        20    5       1   all
6  frosch        20    6       1   all
####### "du" and "frosch" should have rank 5!
return.average <- rank(tstat_froschk[["frequency"]], ties.method = "average") %>%
  rev()

return.average
  [1] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
 [27] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
 [53] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
 [79] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
[105] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
I hope I have just misunderstood your solution.
Best,
ManfredBernhard

@kbenoit
Collaborator

kbenoit commented Mar 3, 2019

Yes, in pull request #1636, we allow you to control this via ties.method. (See ?data.table::frank.)

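For now, you can reassign the rank column as explained above, but computing it as rank(-tstat_froschk[["frequency"]], ties.method = "min"). Ranking the negated frequencies gives the highest frequency rank 1 and keeps the result in row order, so no rev() is needed.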

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

times <- c(10, 10, 10)

# the way textstat_frequency works in version <= 1.4.1
rank(times, ties.method = "first")
## [1] 1 2 3

# the way you want it to work
rank(times, ties.method = "min")
## [1] 1 1 1
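Applied to the frequencies reported at the top of this issue, the same idea could look like the sketch below. The data frame is typed in by hand from the first 18 rows of that table (an assumption standing in for the real textstat_frequency() output), and the grouped table at the end is purely hypothetical, included only to illustrate ranking within groups with ave().

```r
# Frequencies copied by hand from the first 18 rows of the reported table
froschk <- data.frame(
  feature   = c("und", "der", "sie", "die", "du", "frosch", "er", "in", "als",
                "ich", "war", "ihr", "es", "da", "aber", "so", "ein", "dem"),
  frequency = c(64, 37, 35, 30, 20, 20, 20, 19, 19,
                17, 15, 15, 15, 15, 14, 14, 13, 13),
  stringsAsFactors = FALSE
)

# Negate so that the most frequent feature gets rank 1; ties share the lowest rank
froschk$rank <- rank(-froschk$frequency, ties.method = "min")
froschk$rank
##  [1]  1  2  3  4  5  5  5  8  8 10 11 11 11 11 15 15 17 17

# Hypothetical grouped table: compute the ranks separately within each group
grouped <- data.frame(
  group     = c("g1", "g1", "g1", "g2", "g2"),
  frequency = c(5, 5, 2, 7, 7),
  stringsAsFactors = FALSE
)
grouped$rank <- ave(-grouped$frequency, grouped$group,
                    FUN = function(x) rank(x, ties.method = "min"))
grouped$rank
## [1] 1 1 3 1 1
```

Here "du", "frosch", and "er" all receive rank 5 and the next feature receives rank 8, which matches the medal analogy above.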

@ManfredBernhard
Author

ManfredBernhard commented Mar 21, 2019 via email
