rank and frequency #1634

ManfredBernhard · 2019-03-02T07:24:35Z

Dear Quanteda,
I realised that identical frequencies in a dfm receive different ranks as shown below.

txtfreq<-textstat_frequency(froschk_dfm)
txtfreq
feature frequency rank docfreq group
1 und 64 1 1 all
2 der 37 2 1 all
3 sie 35 3 1 all
4 die 30 4 1 all
5 du 20 5 1 all
6 frosch 20 6 1 all
7 er 20 7 1 all
8 in 19 8 1 all
9 als 19 9 1 all
10 ich 17 10 1 all
11 war 15 11 1 all
12 ihr 15 12 1 all
13 es 15 13 1 all
14 da 15 14 1 all
15 aber 14 15 1 all
16 so 14 16 1 all
17 ein 13 17 1 all
18 dem 13 18 1 all

Can you please fix this bug?
Best,
ManfredBernhard

jiongweilua · 2019-03-02T10:19:53Z

Hi @ManfredBernhard ,

Do you mean that in this section of the output, they should all be ranked 5 rather than 5, 6, and 7?

feature frequency rank docfreq group
5 du 20 5 1 all
6 frosch 20 6 1 all
7 er 20 7 1 all

kbenoit · 2019-03-02T22:37:33Z

Good point. It would be better to replace https://github.com/quanteda/quanteda/blob/master/R/textstat_frequency.R#L92-L93 with a call to data.table::frank() and add ties.method as an argument. Our existing method is basically the same as ties.method = "first": tied features are simply numbered in the order in which they appear.

However, it's easy to override this (it can be slightly trickier if you have used the groups argument, but not by much). Just overwrite the rank column of the output with one you've computed yourself, using an alternative ties.method argument to rank().

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.9000
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dfmat <- dfm(c("a a b c d d", "a b b c d"))
dfmat
## Document-feature matrix of: 2 documents, 4 features (0.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##        features
## docs    a b c d
##   text1 2 1 1 2
##   text2 1 2 1 1

tstat <- textstat_frequency(dfmat)
tstat
##   feature frequency rank docfreq group
## 1       a         3    1       2   all
## 2       b         3    2       2   all
## 3       d         3    3       2   all
## 4       c         2    4       2   all

rank(tstat[["frequency"]], ties.method = "last") %>%
  rev()
## [1] 1 2 3 4
rank(tstat[["frequency"]], ties.method = "average") %>%
  rev()
## [1] 1 3 3 3
set.seed(1)
rank(tstat[["frequency"]], ties.method = "random") %>%
  rev()
## [1] 1 4 3 2

So using any of the rank() calls above, you could assign the result to tstat[["rank"]] to get the type of ties you prefer.
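A minimal sketch of that override in base R. The data frame tstat_demo is a hand-built stand-in for the textstat_frequency() output above (the frequencies are copied from the example; the object itself is an assumption for illustration, so quanteda does not need to be loaded). It ranks the negated frequencies, which puts the most frequent feature at rank 1 and keeps every rank aligned with its own row; reversing the rank vector does not guarantee that alignment when ties are present.

```r
# Hypothetical stand-in for the textstat_frequency() result shown above.
tstat_demo <- data.frame(
  feature   = c("a", "b", "d", "c"),
  frequency = c(3, 3, 3, 2),
  rank      = 1:4,
  stringsAsFactors = FALSE
)

# Negating the frequencies makes the largest frequency the "smallest" value,
# so it receives rank 1; ties.method = "min" gives tied features the same rank.
tstat_demo[["rank"]] <- rank(-tstat_demo[["frequency"]], ties.method = "min")

tstat_demo[["rank"]]
## [1] 1 1 1 4
```

The three features with frequency 3 share rank 1, and the next feature is ranked 4 because three features precede it.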

ManfredBernhard · 2019-03-03T06:56:37Z

Dear All,
thanks for the fast response, though I am not sure I have been understood properly. Please see my response below.
Imagine the following situation: three runners run a 100-metre race and cross the finish line at exactly the same time, 10 seconds. All three of them then receive a gold medal, not a gold, a silver, and a bronze medal respectively. In my "unwanted" result of the frequency/rank calculation for the tokens of the fairy tale "Froschkönig" (The Frog Prince), the tokens "du", "frosch", and "er" each have a frequency of 20 and must therefore all be placed at rank 5 in the table, not at ranks 5, 6, and 7. Is there a solution for this?
By the way, is this the workflow that you are suggesting?
tstat_froschk<-textstat_frequency(froschk_dfm)

head(tstat_froschk)
feature frequency rank docfreq group
1 und 64 1 1 all
2 der 37 2 1 all
3 sie 35 3 1 all
4 die 30 4 1 all
5 du 20 5 1 all
6 frosch 20 6 1 all
#######”du” and „frosch” should have rank „5”!
return.average <- rank(tstat_froschk[["frequency"]], ties.method = "average") %>%
  rev()
return.average
[1] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
[27] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
[53] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
[79] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
[105] 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5 139.5
Hoping, I just misunderstood your solution.
Best,
ManfredBernhard

kbenoit · 2019-03-03T07:27:44Z

Yes, in pull request #1636, we allow you to control this via ties.method. (See ?data.table::frank.)

For now, you can reassign the rank column as explained above, but with rank(-tstat_froschk[["frequency"]], ties.method = "min"). The minus sign makes the highest frequency rank first, so no rev() is needed.

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

times <- c(10, 10, 10)

# the way textstat_frequency works in versions <= 1.4.1
rank(times, ties.method = "first")
## [1] 1 2 3

# the way you want it to work
rank(times, ties.method = "min")
## [1] 1 1 1
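
Applied to the top of the Froschkönig table from the first post, the difference can be sketched as follows. This is base R only; the vector freq is typed in by hand from the posted output (rows 1 to 9), and the "dense" variant built with match() is an illustrative addition that was not discussed in the thread.

```r
# Frequencies of und, der, sie, die, du, frosch, er, in, als (copied from the first post).
freq <- c(64, 37, 35, 30, 20, 20, 20, 19, 19)

# Ties numbered in order of appearance: every feature gets its own rank.
rank_first <- rank(-freq, ties.method = "first")
rank_first
## [1] 1 2 3 4 5 6 7 8 9

# Ties share the lowest rank of their group; the next rank skips ahead.
rank_min <- rank(-freq, ties.method = "min")
rank_min
## [1] 1 2 3 4 5 5 5 8 8

# Dense ranking: ties share a rank and no rank numbers are skipped.
rank_dense <- match(freq, sort(unique(freq), decreasing = TRUE))
rank_dense
## [1] 1 2 3 4 5 5 5 6 6
```

With "min", the tokens du, frosch, and er all sit at rank 5, which is the gold-medal behaviour asked for; whether the next token should then be ranked 8 or 6 is the choice between "min" and dense ranking.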

ManfredBernhard · 2019-03-21T07:38:09Z

Dear Quanteda team,
thank you very much for your help with Quanteda.
Best,
Manfred B. Sellner

kbenoit mentioned this issue Mar 3, 2019

Add ties.method to textstat_frequency() #1636

Merged

stefan-mueller closed this as completed Mar 20, 2019
