# Reading level metrics

Oct 30, 2018

In [0]:
%precision 2

import nltk
nltk.download('brown', quiet=True)

brown = nltk.corpus.brown

The Brown corpus consists of samples drawn from [15 different genres](http://www.hit.uib.no/icame/brown/bcm.html#t2) of written American English:

In [2]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

We can access parts of the corpus by genre using the `categories=` argument:

In [3]:
brown.words(categories='adventure')

['Dan', 'Morgan', 'told', 'himself', 'he', 'would', ...]

The adventure texts have 69,342 tokens:

In [4]:
len(brown.words(categories='adventure'))

69342

and 4,637 sentences:

In [5]:
len(brown.sents(categories='adventure'))

4637

Let's compute the ARI ([Automated Readability Index](https://en.wikipedia.org/wiki/Automated_readability_index)) score for the adventure texts.  The formula is:

$$ARI=4.71\times\frac{\text{#chars}}{\text{#words}}+0.5\times\frac{\text{#words}}{\text{#sents}}-21.43$$

So, we'll need the number of sentences, the number of words, and the number of characters.  When counting words and characters, we'll skip punctuation.  The original developers of the ARI included punctuation in their character counts but not in their word counts (see pg. 7 of [their report](http://www.dtic.mil/dtic/tr/fulltext/u2/667273.pdf)), but concluded that it didn't make much difference either way.
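To see what skipping punctuation with `str.isalpha()` does in practice, here is a small sketch on a made-up token list (the tokens are invented for illustration and are not taken from the corpus). Note that `isalpha()` drops every token that contains any non-letter character, so numbers, contractions, and hyphenated words are excluded along with punctuation.

```python
# Toy token list (made up for illustration, not taken from the corpus).
tokens = ['Dan', 'Morgan', 'told', 'himself', ',', "didn't", '1961', 'well-known', '.']

# str.isalpha() is True only when every character in the token is a letter.
kept = [w for w in tokens if w.isalpha()]

n_words = len(kept)                  # number of tokens that count as words
n_chars = sum(len(w) for w in kept)  # number of letters in those words

print(kept)
print(n_words, n_chars)
```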



In [0]:
sents = len(brown.sents(categories='adventure'))
words = len([w for w in brown.words(categories='adventure') if w.isalpha()])
chars = sum([len(w) for w in brown.words(categories='adventure') if w.isalpha()])

In [7]:
sents, words, chars

(4637, 56658, 240611)

In [8]:
4.71*(chars/words) + 0.5*(words/sents) - 21.43

4.68
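Since the same expression is about to be reused for other genres, it can help to wrap it in a function. The sketch below uses a hypothetical helper name, `ari_score`, and plugs in the adventure counts printed above so it runs without downloading the corpus.

```python
def ari_score(chars, words, sents):
    """Automated Readability Index from raw counts (hypothetical helper)."""
    return 4.71 * (chars / words) + 0.5 * (words / sents) - 21.43

# Counts for the adventure genre, copied from the output above.
adventure_ari = ari_score(chars=240611, words=56658, sents=4637)
print(f'{adventure_ari:.2f}')
```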

And we can repeat the calculation for learned texts:

In [9]:
sents = len(brown.sents(categories='learned'))
words = len([w for w in brown.words(categories='learned') if w.isalpha()])
chars = sum([len(w) for w in brown.words(categories='learned') if w.isalpha()])

4.71*(chars/words) + 0.5*(words/sents) - 21.43

12.19

---

Now let's calculate ARI for all of the genres:

In [10]:
for cat in brown.categories():
  sents = len(brown.sents(categories=cat))
  words = len([w for w in brown.words(categories=cat) if w.isalpha()])
  chars = sum([len(w) for w in brown.words(categories=cat) if w.isalpha()])
  ari = 4.71*(chars/words) + 0.5*(words/sents) - 21.43
  print(cat, ari)

adventure 4.681417252022399
belles_lettres 11.129747928075162
editorial 9.745045633649696
fiction 5.5424122647009035
government 12.518139691080023
hobbies 9.173238223692124
humor 8.10919887928964
learned 12.188958374204795
lore 10.378538563176782
mystery 4.434945188853199
news 10.273322179798463
religion 10.578692001404171
reviews 10.868638015288823
romance 4.903643843572333
science_fiction 5.844049004266761


A slight variation, using f-strings for formatting, makes the table easier to read:

In [11]:
for cat in brown.categories():
  sents = len(brown.sents(categories=cat))
  words = len([w for w in brown.words(categories=cat) if w.isalpha()])
  chars = sum([len(w) for w in brown.words(categories=cat) if w.isalpha()])
  ari = 4.71*(chars/words) + 0.5*(words/sents) - 21.43
  print(f'{cat:15} {ari:5.2f}')

adventure        4.68
belles_lettres  11.13
editorial        9.75
fiction          5.54
government      12.52
hobbies          9.17
humor            8.11
learned         12.19
lore            10.38
mystery          4.43
news            10.27
religion        10.58
reviews         10.87
romance          4.90
science_fiction  5.84
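The format specifications used above can be tried in isolation. In this sketch, `{cat:15}` pads the genre name to 15 characters (strings are left-aligned by default) and `{ari:5.2f}` prints the score in a field 5 characters wide with 2 decimal places. The `>` alignment flag in the header line is an extra shown for illustration; it right-aligns a column title over the numbers.

```python
# One row of the table, using the value printed earlier for the news genre.
cat, ari = 'news', 10.273322179798463

row = f'{cat:15} {ari:5.2f}'      # left-aligned name, fixed-width number
header = f'{"":15} {"ARI":>5}'    # empty name column, right-aligned title

print(header)
print(row)
```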


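Next, add the Coleman-Liau index, where $L$ is the average number of letters per 100 words and $S$ is the average number of sentences per 100 words: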

$$CL=0.0588\times L-0.296\times S-15.8$$

In [12]:
for cat in brown.categories():
  sents = len(brown.sents(categories=cat))
  words = len([w for w in brown.words(categories=cat) if w.isalpha()])
  chars = sum([len(w) for w in brown.words(categories=cat) if w.isalpha()])
  ari = 4.71*(chars/words) + 0.5*(words/sents) - 21.43
  cl = 0.0588*(chars/words*100) - 0.296*(sents/words*100) - 15.8
  print(f'{cat:15} {ari:5.2f} {cl:5.2f}')

adventure        4.68  6.75
belles_lettres  11.13 10.51
editorial        9.75 10.45
fiction          5.54  7.28
government      12.52 12.74
hobbies          9.17 10.24
humor            8.11  8.79
learned         12.19 12.04
lore            10.38 10.27
mystery          4.43  6.54
news            10.27 10.86
religion        10.58 10.31
reviews         10.87 10.84
romance          4.90  6.76
science_fiction  5.84  8.12
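As a quick check on the Coleman-Liau column, the adventure row can be reproduced from the counts computed earlier (4637 sentences, 56658 words, 240611 characters). The function name `coleman_liau` is a hypothetical helper, not part of NLTK.

```python
def coleman_liau(chars, words, sents):
    """Coleman-Liau index from raw counts (hypothetical helper)."""
    L = chars / words * 100  # average letters per 100 words
    S = sents / words * 100  # average sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# Counts for the adventure genre, copied from the earlier output.
adventure_cl = coleman_liau(chars=240611, words=56658, sents=4637)
print(f'{adventure_cl:.2f}')
```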


SMOG and FK (Flesch-Kincaid) both require syllable counts, so we'll reuse the syllable count function from the last homework:

$$SMOG=1.0430\times\sqrt{\text{#polysyllables}\times\frac{30}{\text{#sents}}}+3.1291$$

$$FK=0.39\times\frac{\text{#words}}{\text{#sents}}+11.8\times\frac{\text{#syllables}}{\text{#words}}-15.59$$
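Both formulas translate directly into small functions of the counts. The sketch below uses hypothetical helper names (`smog_score`, `fk_score`) and made-up counts chosen only to keep the arithmetic easy to follow; they are not corpus statistics.

```python
from math import sqrt

def smog_score(poly, sents):
    """SMOG grade from polysyllable and sentence counts (hypothetical helper)."""
    return 1.0430 * sqrt(poly * (30 / sents)) + 3.1291

def fk_score(words, sents, sylls):
    """Flesch-Kincaid grade level from raw counts (hypothetical helper)."""
    return 0.39 * (words / sents) + 11.8 * (sylls / words) - 15.59

# Made-up counts: 30 polysyllabic words in 30 sentences,
# and 100 words with 150 syllables in 10 sentences.
toy_smog = smog_score(poly=30, sents=30)
toy_fk = fk_score(words=100, sents=10, sylls=150)

print(f'{toy_smog:.2f} {toy_fk:.2f}')
```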

In [0]:
nltk.download('cmudict', quiet=True)

import re

cmudict = nltk.corpus.cmudict.dict()

def stress(pron):
  return [char for phone in pron for char in phone if char.isdigit()]

def syllables(text):
  return sum([w_syllables(w) for w in text if w.isalpha()])

def w_syllables(word):
  word = word.lower()
  if word in cmudict:
    return len(stress(cmudict[word][0]))
  else:
    s = len(re.findall(r'[aeiou]+', word))
    if re.search(r'[aeiou][^aeiou]+e$', word):
      s = s - 1
    return s
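The two pieces of `w_syllables` can be tried without loading the dictionary. In this sketch, `stress` is copied from the cell above, `fallback_syllables` is a hypothetical name for the `else` branch pulled out on its own, and the pronunciation is an ARPAbet-style list written by hand for illustration.

```python
import re

def stress(pron):
    # Vowel phones carry a stress digit (0, 1, or 2), so there is one digit per syllable.
    return [char for phone in pron for char in phone if char.isdigit()]

def fallback_syllables(word):
    # Same heuristic as the else branch of w_syllables: count vowel groups,
    # then subtract one for a final 'e' that follows a vowel and at least one consonant.
    word = word.lower()
    s = len(re.findall(r'[aeiou]+', word))
    if re.search(r'[aeiou][^aeiou]+e$', word):
        s = s - 1
    return s

pron = ['HH', 'AH0', 'L', 'OW1']  # hand-written ARPAbet-style pronunciation
print(stress(pron), len(stress(pron)))

for w in ['make', 'banana', 'tree', 'table', 'rhythm']:
    print(w, fallback_syllables(w))
```

The heuristic is only approximate: it treats the final "e" of "table" as silent and reports one syllable, and because "y" is not counted as a vowel it reports zero for "rhythm". This only matters for words that are missing from the dictionary.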

In [0]:
from math import sqrt

In [15]:
print(f'{"":15} {"ARI":>5} {"CL":>5} {"SMOG":>5} {"FK":>5}')
for cat in brown.categories():
  sents = len(brown.sents(categories=cat))
  words = len([w for w in brown.words(categories=cat) if w.isalpha()])
  chars = sum([len(w) for w in brown.words(categories=cat) if w.isalpha()])
  poly = len([w for w in brown.words(categories=cat) if w_syllables(w) > 2])
  sylls = syllables(brown.words(categories=cat))
  
  ari = 4.71*(chars/words) + 0.5*(words/sents) - 21.43
  cl = 0.0588*(chars/words*100) - 0.296*(sents/words*100) - 15.8
  smog = 1.0430*sqrt(poly*(30/sents)) + 3.1291
  fk = 0.39*(words/sents) + 11.8*(sylls/words) - 15.59
  
  print(f'{cat:15} {ari:5.2f} {cl:5.2f} {smog:5.2f} {fk:5.2f}')

                  ARI    CL  SMOG    FK
adventure        4.68  6.75  8.28  4.96
belles_lettres  11.13 10.51 13.36 11.05
editorial        9.75 10.45 12.76  9.92
fiction          5.54  7.28  8.94  5.78
government      12.52 12.74 14.91 12.41
hobbies          9.17 10.24 12.19  9.21
humor            8.11  8.79 11.29  8.37
learned         12.19 12.04 14.56 12.19
lore            10.38 10.27 12.78 10.23
mystery          4.43  6.54  8.32  4.94
news            10.27 10.86 12.68 10.10
religion        10.58 10.31 13.07 10.70
reviews         10.87 10.84 13.34 10.70
romance          4.90  6.76  8.70  5.39
science_fiction  5.84  8.12  9.75  6.33
