The scale for Dale.Chall readability is wrong #1410

Closed
cl50803 opened this issue Aug 2, 2018 · 7 comments · Fixed by #1425
cl50803 commented Aug 2, 2018

Describe the bug

The scale for Dale-Chall readability is wrong. It should range from 0 to 10 and should not be negative.

Reproducible code

library(quanteda)
dale.chall <- textstat_readability(
    "The scale for Dale-Chall readability is wrong. It should be ranged from 0 to 10 and should not be negative",
    measure = "Dale.Chall"
)

Expected behavior

Actual output:

  document Dale.Chall
1    text1    -38.245

The score should be positive and lie in the range [0, 10].
System information

Output of sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] koRpus_0.10-2     data.table_1.11.4 quanteda_1.3.4    doParallel_1.0.11
 [5] iterators_1.0.10  foreach_1.4.4     lubridate_1.7.4   rlist_0.4.6.1    
 [9] jsonlite_1.5      rcompanion_1.13.2 moments_0.14      MASS_7.3-50      
[13] pequod_0.0-5      car_3.0-0         carData_3.0-1     sentimentr_2.3.2 
[17] bindrcpp_0.2.2    jtools_1.0.0      ggplot2_3.0.0     sjmisc_2.7.3     
[21] sjPlot_2.5.0      psych_1.8.4       magrittr_1.5      dplyr_0.7.6      

loaded via a namespace (and not attached):
  [1] readxl_1.0.0           backports_1.1.2        fastmatch_1.1-0       
  [4] miscTools_0.6-22       BSDA_1.2.0             plyr_1.8.4            
  [7] lazyeval_0.2.1         TMB_1.7.14             splines_3.4.4         
 [10] TH.data_1.0-9          digest_0.6.15          htmltools_0.3.6       
 [13] fansi_0.2.3            cluster_2.0.6          openxlsx_4.1.0        
 [16] modelr_0.1.2           RcppParallel_4.4.1     textshape_1.6.0       
 [19] sandwich_2.4-0         colorspace_1.3-2       haven_1.1.2           
 [22] crayon_1.3.4           lme4_1.1-17            bindr_0.1.1           
 [25] survival_2.41-3        zoo_1.8-3              glue_1.3.0            
 [28] stopwords_0.9.0        gtable_0.2.0           emmeans_1.2.3         
 [31] sjstats_0.16.0         spacyr_0.9.91          maxLik_1.3-4          
 [34] mlapi_0.1.0            abind_1.4-5            scales_0.5.0          
 [37] futile.options_1.0.1   mvtnorm_1.0-8          qdapRegex_0.7.2       
 [40] ggeffects_0.4.0        Rcpp_0.12.18           xtable_1.8-2          
 [43] foreign_0.8-69         text2vec_0.5.1         textclean_0.9.3       
 [46] stats4_3.4.4           prediction_0.3.6       survey_3.33-2         
 [49] RColorBrewer_1.1-2     modeltools_0.2-22      pkgconfig_2.0.1       
 [52] reshape_0.8.7          manipulate_1.0.1       nnet_7.3-12           
 [55] multcompView_0.1-7     utf8_1.1.4             tidyselect_0.2.4      
 [58] labeling_0.3           rlang_0.2.1            munsell_0.5.0         
 [61] cellranger_1.1.0       tools_3.4.4            cli_1.0.0             
 [64] ade4_1.7-11            sjlabelled_1.0.12      broom_0.5.0           
 [67] ggridges_0.5.0         evaluate_0.11          EMT_1.1               
 [70] stringr_1.3.1          arm_1.10-1             knitr_1.20            
 [73] zip_1.0.0              RVAideMemoire_0.9-69-3 purrr_0.2.5           
 [76] coin_1.2-2             WRS2_0.10-0            nlme_3.1-131.1        
 [79] formatR_1.5            compiler_3.4.4         bayesplot_1.5.0       
 [82] rstudioapi_0.7         curl_3.2               e1071_1.6-8           
 [85] syuzhet_1.0.4          tibble_1.4.2           DescTools_0.99.24     
 [88] stringi_1.2.4          hermite_1.1.2          futile.logger_1.4.3   
 [91] forcats_0.3.0          lattice_0.20-35        Matrix_1.2-12         
 [94] nloptr_1.0.4           vegan_2.5-2            permute_0.9-4         
 [97] effects_4.0-2          stringdist_0.9.5.1     pillar_1.3.0          
[100] mc2d_0.1-18            pwr_1.2-2              lmtest_0.9-36         
[103] ucminf_1.1-4           estimability_1.3       R6_2.2.2              
[106] ordinal_2018.4-19      rio_0.5.10             lexicon_1.0.0         
[109] codetools_0.2-15       lambda.r_1.2.3         boot_1.3-20           
[112] assertthat_0.2.0       rprojroot_1.3-2        withr_2.1.2           
[115] nortest_1.0-4          mnormt_1.5-5           multcomp_1.4-8        
[118] mgcv_1.8-23            expm_0.999-2           hms_0.4.2             
[121] grid_3.4.4             tidyr_0.8.1            coda_0.19-1           
[124] glmmTMB_0.2.2.0        class_7.3-14           minqa_1.2.4           
[127] rmarkdown_1.10         snakecase_0.9.1        pbdZMQ_0.3-3

kbenoit self-assigned this Aug 3, 2018
kbenoit added a commit that referenced this issue Aug 3, 2018
kbenoit (Collaborator) commented Aug 3, 2018

Thanks @cl50803, you're completely right. I have located a bug and am working on it.

cl50803 (Author) commented Aug 4, 2018

@kbenoit Thank you!

kbenoit (Collaborator) commented Aug 4, 2018

It turns out that solving this is considerably harder than I thought at first, even with the original bug fixed. The reason is that matching observed terms to the Dale-Chall list involves applying a set of rules far more complex than fixed matching.

For details, see

Edgar Dale and Jeanne S. Chall (1948). "A Formula for Predicting Readability: Instructions." Educational Research Bulletin 27(2, Feb. 18, 1948): 37-54. https://www.jstor.org/stable/1473669

This would need to involve:

  1. Matching the Dale-Chall terms using their rules to a large dictionary of English words, and then using those matches for the new fixed match list. By these rules, the familiar word sharp will match sharply but not sharpen. (There are many other, more complicated examples; see the sketch after this list.)
  2. Filtering those matches to meet the rules in the above source (Dale and Chall 1948).
  3. Implementing all of the other D&C rules.
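
To illustrate what rule-based matching could look like, here is a minimal R sketch that expands each familiar word with a hand-picked set of inflectional endings. The expand_familiar helper and its suffix set are hypothetical; the actual Dale-Chall rules are considerably more nuanced:

    expand_familiar <- function(words, suffixes = c("s", "es", "ed", "ing", "ly")) {
      # every familiar word, plus each word with one allowed ending appended
      unique(c(words, as.vector(outer(words, suffixes, paste0))))
    }

    familiar <- expand_familiar(c("sharp", "clip"))
    "sharply" %in% familiar  # TRUE: the -ly ending is in the allowed set
    "sharpen" %in% familiar  # FALSE: -en derivations are never generated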

This is complex enough that we might consider moving this and the other readability measures out of quanteda and into their own package.

ArthurSpirling commented Aug 14, 2018

@kbenoit just FYI, I looked at the koRpus implementation for one of the toy examples from Dale-Chall (1995) (which they say is grades 5-6): alaska.txt

# load the package
require(koRpus)

# get the Dale-Chall word list, for use in a minute
require(quanteda)
dc.word.list <- quanteda::data_char_wordlists$dalechall

# the relevant text files (only 1 in this case) are here --
txt.file <- "C:/projects/temp/korp/"

# English tokenization
tok <- tokenize(txt.file, lang = "en")
# Warning message:
# In readLines(txt, encoding = fileEncoding) :
#   incomplete final line found on 'C:/projects/temp/korp//alaska.txt'

# fit Dale-Chall to it
dc.out <- dale.chall(tok, word.list = dc.word.list)
# Warning message:
# Text is relatively short (<100 tokens), results are probably not reliable!
print(dc.out)
#
# Dale-Chall Readability Formula
#   Parameters: custom
#  Not on list: 32%
#    Raw value: 23.53
#        Grade: 11-12
#          Age: 16-18
#
# Text language: en
# looks wrong

but this site reports grades 7-8, which isn't quite right either but is in the correct ballpark

kbenoit (Collaborator) commented Aug 18, 2018

Some of the rules from the Dale and Chall (1948) paper above, and how to implement them:

  1. "Names of persons and places Japan, Smith, and so on, are familiar, even though they do not appear on the word list."
    Could be implemented using spacyr's named-entity tags (requires a spaCy backend), e.g.

    spacyr::spacy_parse(txt) %>% 
        subset(subset = grepl("^(PERSON|GPE)", .[["entity"]]), select = "token")
  2. U.S., A.M., and P.M. should be treated as familiar words.

  3. An abbreviation which is used several times within a 100-word sample is counted as two unfamiliar words only.

  4. Adjectives formed by adding n to a proper noun are familiar. For example, American, Austrian.
    Could add NORP to the filter list, as above.

  5. Only consider hyphenated words as familiar if all elements of the hyphenated word appear on the familiar list.

  6. Count a word unfamiliar if two or more endings are added to a word on the list. clippings is considered unfamiliar, although clip is on the list.

  7. Numerals like 1947, 18, and so on, are considered familiar. (A sketch of rules 5 and 7 follows this list.)
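
For rules 5 and 7, a minimal sketch; the is_familiar helper is hypothetical, not quanteda API, and assumes a character vector of tokens plus a familiar-word list:

    is_familiar <- function(tokens, familiar) {
      # Rule 7: numerals such as 1947 or 18 count as familiar
      numeral <- grepl("^[0-9]+$", tokens)
      # Rule 5: a hyphenated word is familiar only if every element is familiar
      hyphen_ok <- vapply(strsplit(tokens, "-", fixed = TRUE),
                          function(parts) all(parts %in% familiar), logical(1))
      numeral | tokens %in% familiar | hyphen_ok
    }

    is_familiar(c("1947", "milk-white", "milk-quartz"), c("milk", "white"))
    # TRUE TRUE FALSE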

acholonu commented May 29, 2019

I am still getting the incorrect scale when I use Dale.Chall (1995). The values being returned range from -50 to +50. I am using package version 1.4.5.

kbenoit (Collaborator) commented May 30, 2019

@acholonu can you supply an example? The classic 0-10 Dale-Chall measure is measure = "Dale.Chall.old". The one used for "Dale.Chall" is the "new" D-C measure, computed as

64 - (0.95 * 100 * Nwd / Nw) - (0.69 * ASL)

meaning that if the number of difficult words (Nwd) is high relative to the total number of words (Nw), the amount subtracted from 64 can exceed 64 and hence make the value negative.
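
For instance, with purely made-up counts: a single 20-word sentence in which 14 of the 20 words are unfamiliar already produces a negative score:

    Nwd <- 14   # difficult (unfamiliar) words, hypothetical count
    Nw  <- 20   # total words
    ASL <- 20   # average sentence length (one 20-word sentence)
    64 - (0.95 * 100 * Nwd / Nw) - (0.69 * ASL)
    # [1] -16.3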

You can see here that in a few examples, this did in fact happen:

> textstat_readability(data_corpus_inaugural, 
       measure = c("Dale.Chall", "Dale.Chall.old", "Dale.Chall.PSK"))

          document Dale.Chall Dale.Chall.old Dale.Chall.PSK
1  1789-Washington -3.0285325      10.727912       9.905231
2  1793-Washington 18.1939815       9.053315       8.016478
3       1797-Adams -2.9004890      10.673285       9.877219
4   1801-Jefferson 14.3477936       9.142939       8.278966
5   1805-Jefferson  8.0248421       9.807390       8.903467
6     1809-Madison -0.4205207      10.695995       9.738012
7     1813-Madison 15.1448217       9.368313       8.315611
8      1817-Monroe 21.3385642       8.910271       7.775835
9      1821-Monroe 16.6481936       9.248561       8.181395
10      1825-Adams 10.0630138      10.036684       8.867665
11    1829-Jackson  7.2643631      10.129913       9.069118
12    1833-Jackson 12.8306914       9.499814       8.502493
13   1837-VanBuren  9.3647603      10.083897       8.926866
14   1841-Harrison 13.1137137       9.471400       8.475035
15       1845-Polk 16.8707271       9.422886       8.233200
16     1849-Taylor  2.5274606      10.626246       9.536430
17     1853-Pierce 16.1573058       9.497595       8.303565
18   1857-Buchanan 18.5969938       9.110467       8.013807
19    1861-Lincoln 23.7848381       8.566274       7.501781
20    1865-Lincoln 27.7788494       7.902038       7.016048
21      1869-Grant 22.3991802       8.710708       7.638199
22      1873-Grant 22.0088684       8.587630       7.615510
23      1877-Hayes 12.1381387       9.508565       8.547001
24   1881-Garfield 22.8218709       8.728127       7.619528
25  1885-Cleveland 11.9786791       9.787563       8.660617
26   1889-Harrison 20.9526121       8.968627       7.820596
27  1893-Cleveland 12.7842301       9.883460       8.648434
28   1897-McKinley 18.4059212       9.224108       8.067596
29   1901-McKinley 24.0677567       8.830068       7.583380
30  1905-Roosevelt 24.9740941       8.172431       7.283980
31       1909-Taft 16.3979675       9.321642       8.223569
32     1913-Wilson 28.0343697       7.981492       7.030483
33     1917-Wilson 28.3486904       7.875422       6.972181
34    1921-Harding 22.4535805       9.074016       7.770544
35   1925-Coolidge 26.0672234       8.594128       7.376260
36     1929-Hoover 23.0311531       8.974120       7.698869
37  1933-Roosevelt 25.9024231       8.528420       7.361552
38  1937-Roosevelt 29.7397547       8.094971       6.971278
39  1941-Roosevelt 33.3299010       7.446845       6.515607
40  1945-Roosevelt 34.7195617       7.103884       6.304861
41     1949-Truman 26.7756917       8.543749       7.315269
42 1953-Eisenhower 28.4906115       8.194168       7.082684
43 1957-Eisenhower 34.0379518       7.442143       6.471688
44    1961-Kennedy 27.1667406       8.047584       7.106816
45    1965-Johnson 39.0458324       6.741381       5.911947
46      1969-Nixon 36.7109985       6.830073       6.084086
47      1973-Nixon 30.3161378       7.507482       6.717706
48     1977-Carter 27.1976653       8.218925       7.168918
49     1981-Reagan 32.7731994       7.588069       6.601463
50     1985-Reagan 32.9864964       7.430830       6.530080
51       1989-Bush 39.8287269       6.584037       5.806606
52    1993-Clinton 33.9828704       7.340028       6.436859
53    1997-Clinton 33.8129309       7.388557       6.465090
54       2001-Bush 36.0627972       7.216451       6.266881
55       2005-Bush 31.8166518       7.622865       6.671411
56      2009-Obama 32.4781966       7.456305       6.569856
57      2013-Obama 29.3080331       7.845061       6.903722
58      2017-Trump 38.6682695       6.777431       5.947885
