The scale for Dale.Chall readability is wrong #1410

Closed
cl50803 opened this issue Aug 2, 2018 · 7 comments · Fixed by #1425
cl50803 commented Aug 2, 2018

Describe the bug

The scale for Dale-Chall readability is wrong. It should range from 0 to 10 and should not be negative.

Reproducible code

library(quanteda)
dale.chall <- textstat_readability(
    "The scale for Dale-Chall readability is wrong. It should be ranged from 0 to 10 and should not be negative",
    measure = "Dale.Chall"
)

Expected behavior

Actual output:

  document Dale.Chall
1    text1    -38.245

The score should be positive and lie in the range [0, 10].
System information

Output of sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] koRpus_0.10-2     data.table_1.11.4 quanteda_1.3.4    doParallel_1.0.11
 [5] iterators_1.0.10  foreach_1.4.4     lubridate_1.7.4   rlist_0.4.6.1    
 [9] jsonlite_1.5      rcompanion_1.13.2 moments_0.14      MASS_7.3-50      
[13] pequod_0.0-5      car_3.0-0         carData_3.0-1     sentimentr_2.3.2 
[17] bindrcpp_0.2.2    jtools_1.0.0      ggplot2_3.0.0     sjmisc_2.7.3     
[21] sjPlot_2.5.0      psych_1.8.4       magrittr_1.5      dplyr_0.7.6      

loaded via a namespace (and not attached):
  [1] readxl_1.0.0           backports_1.1.2        fastmatch_1.1-0       
  [4] miscTools_0.6-22       BSDA_1.2.0             plyr_1.8.4            
  [7] lazyeval_0.2.1         TMB_1.7.14             splines_3.4.4         
 [10] TH.data_1.0-9          digest_0.6.15          htmltools_0.3.6       
 [13] fansi_0.2.3            cluster_2.0.6          openxlsx_4.1.0        
 [16] modelr_0.1.2           RcppParallel_4.4.1     textshape_1.6.0       
 [19] sandwich_2.4-0         colorspace_1.3-2       haven_1.1.2           
 [22] crayon_1.3.4           lme4_1.1-17            bindr_0.1.1           
 [25] survival_2.41-3        zoo_1.8-3              glue_1.3.0            
 [28] stopwords_0.9.0        gtable_0.2.0           emmeans_1.2.3         
 [31] sjstats_0.16.0         spacyr_0.9.91          maxLik_1.3-4          
 [34] mlapi_0.1.0            abind_1.4-5            scales_0.5.0          
 [37] futile.options_1.0.1   mvtnorm_1.0-8          qdapRegex_0.7.2       
 [40] ggeffects_0.4.0        Rcpp_0.12.18           xtable_1.8-2          
 [43] foreign_0.8-69         text2vec_0.5.1         textclean_0.9.3       
 [46] stats4_3.4.4           prediction_0.3.6       survey_3.33-2         
 [49] RColorBrewer_1.1-2     modeltools_0.2-22      pkgconfig_2.0.1       
 [52] reshape_0.8.7          manipulate_1.0.1       nnet_7.3-12           
 [55] multcompView_0.1-7     utf8_1.1.4             tidyselect_0.2.4      
 [58] labeling_0.3           rlang_0.2.1            munsell_0.5.0         
 [61] cellranger_1.1.0       tools_3.4.4            cli_1.0.0             
 [64] ade4_1.7-11            sjlabelled_1.0.12      broom_0.5.0           
 [67] ggridges_0.5.0         evaluate_0.11          EMT_1.1               
 [70] stringr_1.3.1          arm_1.10-1             knitr_1.20            
 [73] zip_1.0.0              RVAideMemoire_0.9-69-3 purrr_0.2.5           
 [76] coin_1.2-2             WRS2_0.10-0            nlme_3.1-131.1        
 [79] formatR_1.5            compiler_3.4.4         bayesplot_1.5.0       
 [82] rstudioapi_0.7         curl_3.2               e1071_1.6-8           
 [85] syuzhet_1.0.4          tibble_1.4.2           DescTools_0.99.24     
 [88] stringi_1.2.4          hermite_1.1.2          futile.logger_1.4.3   
 [91] forcats_0.3.0          lattice_0.20-35        Matrix_1.2-12         
 [94] nloptr_1.0.4           vegan_2.5-2            permute_0.9-4         
 [97] effects_4.0-2          stringdist_0.9.5.1     pillar_1.3.0          
[100] mc2d_0.1-18            pwr_1.2-2              lmtest_0.9-36         
[103] ucminf_1.1-4           estimability_1.3       R6_2.2.2              
[106] ordinal_2018.4-19      rio_0.5.10             lexicon_1.0.0         
[109] codetools_0.2-15       lambda.r_1.2.3         boot_1.3-20           
[112] assertthat_0.2.0       rprojroot_1.3-2        withr_2.1.2           
[115] nortest_1.0-4          mnormt_1.5-5           multcomp_1.4-8        
[118] mgcv_1.8-23            expm_0.999-2           hms_0.4.2             
[121] grid_3.4.4             tidyr_0.8.1            coda_0.19-1           
[124] glmmTMB_0.2.2.0        class_7.3-14           minqa_1.2.4           
[127] rmarkdown_1.10         snakecase_0.9.1        pbdZMQ_0.3-3

kbenoit self-assigned this Aug 3, 2018
kbenoit added a commit that referenced this issue Aug 3, 2018
kbenoit (Collaborator) commented Aug 3, 2018

Thanks @cl50803, you're completely right. I have located a bug and am working on it.

cl50803 (Author) commented Aug 4, 2018

@kbenoit Thank you!

kbenoit (Collaborator) commented Aug 4, 2018

It turns out that solving this is considerably harder than I thought at first, even with the original bug fixed. The reason is that matching observed terms to the Dale-Chall list involves applying a set of rules far more complex than fixed matching.

For details, see

Edgar Dale and Jeanne S. Chall (1948). "A Formula for Predicting Readability: Instructions." Educational Research Bulletin 27(2, Feb. 18, 1948): 37-54. https://www.jstor.org/stable/1473669

This would need to involve:

  1. Matching the Dale-Chall terms using their rules to a large dictionary of English words, and then using those matches for the new fixed match list. By these rules, the familiar word sharp will match sharply but not sharpen. (There are many other, more complicated examples; see the sketch after this list.)
  2. Filtering those matches to meet the rules in the above source (Dale and Chall 1948).
  3. Implementing all of the other D&C rules.
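
To illustrate what rule-based matching could look like, here is a minimal R sketch that expands each familiar word with a hand-picked set of inflectional endings. The expand_familiar helper and its suffix set are hypothetical; the actual Dale-Chall rules are considerably more nuanced:

    expand_familiar <- function(words, suffixes = c("s", "es", "ed", "ing", "ly")) {
      # every familiar word, plus each word with one allowed ending appended
      unique(c(words, as.vector(outer(words, suffixes, paste0))))
    }

    familiar <- expand_familiar(c("sharp", "clip"))
    "sharply" %in% familiar  # TRUE: the -ly ending is in the allowed set
    "sharpen" %in% familiar  # FALSE: -en derivations are never generated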

This is complex enough that we might consider moving this and the other readability measures out of quanteda and into their own package.

ArthurSpirling commented Aug 14, 2018

@kbenoit just FYI, I looked at the koRpus implementation for one of the toy examples from Dale-Chall (1995) (which they say is grades 5-6): alaska.txt

# load the package
require(koRpus)

# get the Dale-Chall word list, for use in a minute
require(quanteda)
dc.word.list <- quanteda::data_char_wordlists$dalechall

# the relevant text files (only 1 in this case) are here --
txt.file <- "C:/projects/temp/korp/"

# English tokenization
tok <- tokenize(txt.file, lang = "en")
# Warning message:
# In readLines(txt, encoding = fileEncoding) :
#   incomplete final line found on 'C:/projects/temp/korp//alaska.txt'

# fit Dale-Chall to it
dc.out <- dale.chall(tok, word.list = dc.word.list)
# Warning message:
# Text is relatively short (<100 tokens), results are probably not reliable!
print(dc.out)
#
# Dale-Chall Readability Formula
#   Parameters: custom
#  Not on list: 32%
#    Raw value: 23.53
#        Grade: 11-12
#          Age: 16-18
#
# Text language: en
# looks wrong

but this site reports grades 7-8, which isn't quite right either but is in the correct ballpark

kbenoit (Collaborator) commented Aug 18, 2018

Some of the rules from the Dale and Chall (1948) paper above, and how to implement them:

  1. "Names of persons and places Japan, Smith, and so on, are familiar, even though they do not appear on the word list."
    Could be implemented using spacyr's named-entity tags (requires a spaCy backend), e.g.

    spacyr::spacy_parse(txt) %>% 
        subset(subset = grepl("^(PERSON|GPE)", .[["entity"]]), select = "token")
  2. U.S., A.M., and P.M. should be treated as familiar words.

  3. An abbreviation which is used several times within a 100-word sample is counted as two unfamiliar words only.

  4. Adjectives formed by adding n to a proper noun are familiar. For example, American, Austrian.
    Could add NORP to the filter list, as above.

  5. Only consider hyphenated words as familiar if all elements of the hyphenated word appear on the familiar list.

  6. Count a word unfamiliar if two or more endings are added to a word on the list. clippings is considered unfamiliar, although clip is on the list.

  7. Numerals like 1947, 18, and so on, are considered familiar. (A sketch of rules 5 and 7 follows this list.)
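
For rules 5 and 7, a minimal sketch; the is_familiar helper is hypothetical, not quanteda API, and assumes a character vector of tokens plus a familiar-word list:

    is_familiar <- function(tokens, familiar) {
      # Rule 7: numerals such as 1947 or 18 count as familiar
      numeral <- grepl("^[0-9]+$", tokens)
      # Rule 5: a hyphenated word is familiar only if every element is familiar
      hyphen_ok <- vapply(strsplit(tokens, "-", fixed = TRUE),
                          function(parts) all(parts %in% familiar), logical(1))
      numeral | tokens %in% familiar | hyphen_ok
    }

    is_familiar(c("1947", "milk-white", "milk-quartz"), c("milk", "white"))
    # TRUE TRUE FALSE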

acholonu commented May 29, 2019

I am still getting the incorrect scale when I use Dale.Chall (1995). The values being returned range from -50 to +50. I am using package version 1.4.5.

kbenoit (Collaborator) commented May 30, 2019

@acholonu can you supply an example? The classic 0-10 Dale-Chall measure is measure = "Dale.Chall.old". The one used for "Dale.Chall" is the "new" D-C measure, computed as

64 - (0.95 * 100 * Nwd / Nw) - (0.69 * ASL)

meaning that if the number of difficult words (Nwd) is high relative to the total number of words (Nw), the amount subtracted from 64 can exceed 64 and hence make the value negative.
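
For instance, with purely made-up counts: a single 20-word sentence in which 14 of the 20 words are unfamiliar already produces a negative score:

    Nwd <- 14   # difficult (unfamiliar) words, hypothetical count
    Nw  <- 20   # total words
    ASL <- 20   # average sentence length (one 20-word sentence)
    64 - (0.95 * 100 * Nwd / Nw) - (0.69 * ASL)
    # [1] -16.3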

You can see here that in a few examples, this did in fact happen:

> textstat_readability(data_corpus_inaugural, 
       measure = c("Dale.Chall", "Dale.Chall.old", "Dale.Chall.PSK"))

          document Dale.Chall Dale.Chall.old Dale.Chall.PSK
1  1789-Washington -3.0285325      10.727912       9.905231
2  1793-Washington 18.1939815       9.053315       8.016478
3       1797-Adams -2.9004890      10.673285       9.877219
4   1801-Jefferson 14.3477936       9.142939       8.278966
5   1805-Jefferson  8.0248421       9.807390       8.903467
6     1809-Madison -0.4205207      10.695995       9.738012
7     1813-Madison 15.1448217       9.368313       8.315611
8      1817-Monroe 21.3385642       8.910271       7.775835
9      1821-Monroe 16.6481936       9.248561       8.181395
10      1825-Adams 10.0630138      10.036684       8.867665
11    1829-Jackson  7.2643631      10.129913       9.069118
12    1833-Jackson 12.8306914       9.499814       8.502493
13   1837-VanBuren  9.3647603      10.083897       8.926866
14   1841-Harrison 13.1137137       9.471400       8.475035
15       1845-Polk 16.8707271       9.422886       8.233200
16     1849-Taylor  2.5274606      10.626246       9.536430
17     1853-Pierce 16.1573058       9.497595       8.303565
18   1857-Buchanan 18.5969938       9.110467       8.013807
19    1861-Lincoln 23.7848381       8.566274       7.501781
20    1865-Lincoln 27.7788494       7.902038       7.016048
21      1869-Grant 22.3991802       8.710708       7.638199
22      1873-Grant 22.0088684       8.587630       7.615510
23      1877-Hayes 12.1381387       9.508565       8.547001
24   1881-Garfield 22.8218709       8.728127       7.619528
25  1885-Cleveland 11.9786791       9.787563       8.660617
26   1889-Harrison 20.9526121       8.968627       7.820596
27  1893-Cleveland 12.7842301       9.883460       8.648434
28   1897-McKinley 18.4059212       9.224108       8.067596
29   1901-McKinley 24.0677567       8.830068       7.583380
30  1905-Roosevelt 24.9740941       8.172431       7.283980
31       1909-Taft 16.3979675       9.321642       8.223569
32     1913-Wilson 28.0343697       7.981492       7.030483
33     1917-Wilson 28.3486904       7.875422       6.972181
34    1921-Harding 22.4535805       9.074016       7.770544
35   1925-Coolidge 26.0672234       8.594128       7.376260
36     1929-Hoover 23.0311531       8.974120       7.698869
37  1933-Roosevelt 25.9024231       8.528420       7.361552
38  1937-Roosevelt 29.7397547       8.094971       6.971278
39  1941-Roosevelt 33.3299010       7.446845       6.515607
40  1945-Roosevelt 34.7195617       7.103884       6.304861
41     1949-Truman 26.7756917       8.543749       7.315269
42 1953-Eisenhower 28.4906115       8.194168       7.082684
43 1957-Eisenhower 34.0379518       7.442143       6.471688
44    1961-Kennedy 27.1667406       8.047584       7.106816
45    1965-Johnson 39.0458324       6.741381       5.911947
46      1969-Nixon 36.7109985       6.830073       6.084086
47      1973-Nixon 30.3161378       7.507482       6.717706
48     1977-Carter 27.1976653       8.218925       7.168918
49     1981-Reagan 32.7731994       7.588069       6.601463
50     1985-Reagan 32.9864964       7.430830       6.530080
51       1989-Bush 39.8287269       6.584037       5.806606
52    1993-Clinton 33.9828704       7.340028       6.436859
53    1997-Clinton 33.8129309       7.388557       6.465090
54       2001-Bush 36.0627972       7.216451       6.266881
55       2005-Bush 31.8166518       7.622865       6.671411
56      2009-Obama 32.4781966       7.456305       6.569856
57      2013-Obama 29.3080331       7.845061       6.903722
58      2017-Trump 38.6682695       6.777431       5.947885
