The scale for Dale.Chall readability is wrong #1410
Comments
Thanks @cl50803, you're completely right. I have located a bug and am working on it.
@kbenoit Thank you!
It turns out that solving this is considerably harder than I first thought, even with the original bug fixed. The reason is that matching observed terms to the Dale-Chall list involves applying a set of rules far more complex than fixed matching. For details, see
This would need to involve:
This is complex enough that we might consider moving this and the other readability rules out of quanteda and into their own package.
@kbenoit just fyi, I looked at the
# load the package
require(koRpus)
# get the DC list, for use in a minute
require(quanteda)
dc.word.list <- quanteda::data_char_wordlists$dalechall
# the relevant text files (only 1 in this case)
# are here --
txt.file <- "C:/projects/temp/korp/"
# english tokenization
tok <- tokenize(txt.file, lang = "en")
# Warning message:
# In readLines(txt, encoding = fileEncoding) :
# incomplete final line found on 'C:/projects/temp/korp//alaska.txt'
# fit dale chall to it.
dc.out <- dale.chall(tok, word.list = dc.word.list)
# Warning message:
# Text is relatively short (<100 tokens), results are probably not reliable!
print(dc.out)
#
# Dale-Chall Readability Formula
# Parameters: custom
# Not on list: 32%
# Raw value: 23.53
# Grade: 11-12
# Age: 16-18
# Text language: en
# looks wrong but this site reports grades 7-8, which isn't quite right either, but is in the correct ballpark
Some of the rules from the Dale and Chall (1948) paper above, and how to implement them:
I am still getting the incorrect scale when I use Dale.Chall (1995). The values being returned range from -50 to 50. I am using package version 1.4.5.
@acholonu can you supply an example? The classic 0-10 Dale-Chall measure is 64 - (0.95 * 100 * Nwd / Nw) - (0.69 * ASL), where ASL is the average sentence length. This means that if the number of difficult words (Nwd) is high relative to the number of words (Nw), the amount subtracted from 64 will exceed 64 and hence make the value negative. You can see here that in a few examples, this did in fact happen:
> textstat_readability(data_corpus_inaugural,
                       measure = c("Dale.Chall", "Dale.Chall.old", "Dale.Chall.PSK"))
document Dale.Chall Dale.Chall.old Dale.Chall.PSK
1 1789-Washington -3.0285325 10.727912 9.905231
2 1793-Washington 18.1939815 9.053315 8.016478
3 1797-Adams -2.9004890 10.673285 9.877219
4 1801-Jefferson 14.3477936 9.142939 8.278966
5 1805-Jefferson 8.0248421 9.807390 8.903467
6 1809-Madison -0.4205207 10.695995 9.738012
7 1813-Madison 15.1448217 9.368313 8.315611
8 1817-Monroe 21.3385642 8.910271 7.775835
9 1821-Monroe 16.6481936 9.248561 8.181395
10 1825-Adams 10.0630138 10.036684 8.867665
11 1829-Jackson 7.2643631 10.129913 9.069118
12 1833-Jackson 12.8306914 9.499814 8.502493
13 1837-VanBuren 9.3647603 10.083897 8.926866
14 1841-Harrison 13.1137137 9.471400 8.475035
15 1845-Polk 16.8707271 9.422886 8.233200
16 1849-Taylor 2.5274606 10.626246 9.536430
17 1853-Pierce 16.1573058 9.497595 8.303565
18 1857-Buchanan 18.5969938 9.110467 8.013807
19 1861-Lincoln 23.7848381 8.566274 7.501781
20 1865-Lincoln 27.7788494 7.902038 7.016048
21 1869-Grant 22.3991802 8.710708 7.638199
22 1873-Grant 22.0088684 8.587630 7.615510
23 1877-Hayes 12.1381387 9.508565 8.547001
24 1881-Garfield 22.8218709 8.728127 7.619528
25 1885-Cleveland 11.9786791 9.787563 8.660617
26 1889-Harrison 20.9526121 8.968627 7.820596
27 1893-Cleveland 12.7842301 9.883460 8.648434
28 1897-McKinley 18.4059212 9.224108 8.067596
29 1901-McKinley 24.0677567 8.830068 7.583380
30 1905-Roosevelt 24.9740941 8.172431 7.283980
31 1909-Taft 16.3979675 9.321642 8.223569
32 1913-Wilson 28.0343697 7.981492 7.030483
33 1917-Wilson 28.3486904 7.875422 6.972181
34 1921-Harding 22.4535805 9.074016 7.770544
35 1925-Coolidge 26.0672234 8.594128 7.376260
36 1929-Hoover 23.0311531 8.974120 7.698869
37 1933-Roosevelt 25.9024231 8.528420 7.361552
38 1937-Roosevelt 29.7397547 8.094971 6.971278
39 1941-Roosevelt 33.3299010 7.446845 6.515607
40 1945-Roosevelt 34.7195617 7.103884 6.304861
41 1949-Truman 26.7756917 8.543749 7.315269
42 1953-Eisenhower 28.4906115 8.194168 7.082684
43 1957-Eisenhower 34.0379518 7.442143 6.471688
44 1961-Kennedy 27.1667406 8.047584 7.106816
45 1965-Johnson 39.0458324 6.741381 5.911947
46 1969-Nixon 36.7109985 6.830073 6.084086
47 1973-Nixon 30.3161378 7.507482 6.717706
48 1977-Carter 27.1976653 8.218925 7.168918
49 1981-Reagan 32.7731994 7.588069 6.601463
50 1985-Reagan 32.9864964 7.430830 6.530080
51 1989-Bush 39.8287269 6.584037 5.806606
52 1993-Clinton 33.9828704 7.340028 6.436859
53 1997-Clinton 33.8129309 7.388557 6.465090
54 2001-Bush 36.0627972 7.216451 6.266881
55 2005-Bush 31.8166518 7.622865 6.671411
56 2009-Obama 32.4781966 7.456305 6.569856
57 2013-Obama 29.3080331 7.845061 6.903722
58 2017-Trump 38.6682695 6.777431 5.947885
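To make the arithmetic behind the formula quoted above concrete, here is a minimal sketch in base R. The function `dale_chall_score()` and its argument names are hypothetical, written only for illustration; this is not quanteda's implementation, and it assumes ASL is computed as words per sentence.

```r
# Hypothetical helper (not part of quanteda): evaluates
# 64 - (0.95 * 100 * Nwd / Nw) - (0.69 * ASL) from raw counts.
dale_chall_score <- function(n_difficult, n_words, n_sentences) {
  stopifnot(n_words > 0, n_sentences > 0)
  pct_difficult <- 100 * n_difficult / n_words  # 100 * Nwd / Nw
  asl <- n_words / n_sentences                  # assumed: words per sentence
  64 - 0.95 * pct_difficult - 0.69 * asl
}

# 10% difficult words, 20 words per sentence: 64 - 9.5 - 13.8 = 40.7
dale_chall_score(n_difficult = 10, n_words = 100, n_sentences = 5)

# 60% difficult words, 20 words per sentence: 64 - 57 - 13.8 = -6.8
dale_chall_score(n_difficult = 60, n_words = 100, n_sentences = 5)
```

With sentence length held fixed, the score falls by 0.95 for every extra percentage point of difficult words, so a large enough share of words missing from the list pushes the result below zero.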
Describe the bug
The scale for Dale-Chall readability is wrong. It should range from 0 to 10 and should not be negative.
Reproducible code
Expected behavior
document Dale.Chall
1 text1 -38.245
Please explain what you expected to happen.
The score should be positive and lie in the range [0, 10].
## System information
Please run
sessionInfo()
and paste the output.