# Reading level metrics

Oct 25, 2018

In [1]:
%precision 2

import nltk
nltk.download('brown', quiet=True)

True

The Brown corpus consists of samples drawn from [15 different genres](http://www.hit.uib.no/icame/brown/bcm.html#t2) of written American English:

In [2]:
nltk.corpus.brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

We can access parts of the corpus by genre using the `categories=` argument:

In [3]:
nltk.corpus.brown.words(categories='adventure')

['Dan', 'Morgan', 'told', 'himself', 'he', 'would', ...]

The adventure texts have 69,342 tokens:

In [4]:
len(nltk.corpus.brown.words(categories='adventure'))

69342

and 4,637 sentences:

In [5]:
len(nltk.corpus.brown.sents(categories='adventure'))

4637

Let's compute the ARI ([Automated Readability Index](https://en.wikipedia.org/wiki/Automated_readability_index)) score for the adventure texts. The formula is:

$$\mathrm{ARI}=4.71\times\frac{\#\text{chars}}{\#\text{words}}+0.5\times\frac{\#\text{words}}{\#\text{sents}}-21.43$$

So we'll need the number of sentences, the number of words, and the number of characters. When counting words and characters, we'll skip punctuation. The original developers of the ARI included punctuation in their character counts but not in their word counts (as described on p. 7 of [their report](http://www.dtic.mil/dtic/tr/fulltext/u2/667273.pdf)), but concluded that it didn't make much difference either way.
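One detail worth seeing on a small example is what an `isalpha()` filter actually keeps. The sketch below uses a made-up token list (not taken from the Brown corpus) and compares it with a looser, hypothetical rule that keeps any token containing a letter or digit.

```python
# Hypothetical tokens for illustration only; not drawn from the Brown corpus.
tokens = ['He', 'told', 'himself', ',', "didn't", 'he', '?', '1950', 'well-known']

# Strict filter: keep only purely alphabetic tokens.
alpha_tokens = [t for t in tokens if t.isalpha()]

# Looser alternative (an assumption, not the rule used in this post):
# keep any token with at least one letter or digit.
alnum_tokens = [t for t in tokens if any(c.isalnum() for c in t)]

print(alpha_tokens)
print(alnum_tokens)
```

The strict filter drops punctuation, but it also drops numbers and tokens containing an apostrophe or hyphen, so it slightly undercounts words compared with the looser rule.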



In [6]:
# Load the tokens once, keeping only purely alphabetic ones (this drops punctuation)
alpha_words = [w for w in nltk.corpus.brown.words(categories='adventure') if w.isalpha()]

sents = len(nltk.corpus.brown.sents(categories='adventure'))  # number of sentences
words = len(alpha_words)                                      # number of words
chars = sum(len(w) for w in alpha_words)                      # number of letters

In [7]:
sents, words, chars

(4637, 56658, 240611)

In [8]:
4.71*(chars/words) + 0.5*(words/sents) - 21.43

4.68
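Since the same expression gets reused, it may be convenient to wrap the formula in a small function. This is a minimal sketch; the name `ari` and its parameter names are introduced here for illustration, and the counts passed in are the ones reported above for the adventure texts.

```python
def ari(n_chars, n_words, n_sents):
    """Automated Readability Index computed from raw counts."""
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43

# Counts reported above for the adventure texts.
score = ari(n_chars=240611, n_words=56658, n_sents=4637)
print(round(score, 2))  # 4.68
```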

And we can repeat the calculation for learned texts:

In [9]:
# Same counts as before, now for the 'learned' category
alpha_words = [w for w in nltk.corpus.brown.words(categories='learned') if w.isalpha()]
sents = len(nltk.corpus.brown.sents(categories='learned'))
words = len(alpha_words)
chars = sum(len(w) for w in alpha_words)

4.71*(chars/words) + 0.5*(words/sents) - 21.43

12.19
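To compare more genres, the counting and the formula could be folded into one function that takes tokenized sentences. The sketch below is one possible arrangement, not part of NLTK: `ari_from_sents` is a hypothetical helper, and the toy sentences are made up so the block runs without the corpus.

```python
def ari_from_sents(sents):
    """ARI for an iterable of tokenized sentences (lists of string tokens).

    Only purely alphabetic tokens count as words, as in the cells above.
    """
    n_sents = n_words = n_chars = 0
    for sent in sents:
        n_sents += 1
        for tok in sent:
            if tok.isalpha():
                n_words += 1
                n_chars += len(tok)
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43

# Toy input for illustration: 2 sentences, 5 words, 16 letters.
toy = [['The', 'cat', 'sat', '.'], ['It', 'slept', '.']]
toy_score = ari_from_sents(toy)
print(toy_score)
```

With the corpus downloaded, calling `ari_from_sents(nltk.corpus.brown.sents(categories=c))` for each `c` in `nltk.corpus.brown.categories()` should give one score per genre under the same counting rules.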