<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/28_Readability_Formulas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text readability**

What makes a text more or less difficult to read? The answer to this question has important pedagogical implications. Knowing the difficulty level of a text means that teachers and educators can select the texts and readings most appropriate for a particular reading population. These considerations can be based on grade level (e.g., primary vs. high school), but also on language knowledge (e.g., first language vs. second language speakers).

How can we computationally measure the difficulty of a text? Using what we know so far, the easiest method involves some sort of calculation based on the words in a text. For several decades now, researchers and scientists have been developing **readability formulas** for the automatic assessment of text readability. The most well-known readability formulas are probably the Flesch-Kincaid formulas. In fact, if you have ever used Microsoft Word, you may have seen that the software includes these formulas as a way to assess the overall difficulty of a text. There are a number of other formulas, such as Dale-Chall and the Gunning Fog Index. [If you are interested, you can read all about the different formulas on *Wikipedia*.](https://en.wikipedia.org/wiki/Readability)


These formulas have been used in many applications and are often accepted uncritically as valid measures of text readability. However, they have also been criticised because they do not capture cognitive aspects of the reading process. Some of my own research has involved working with teams to develop new readability formulas that might better model the reading process. [You can read about this topic in detail in one of our first articles on this topic.](https://www.tandfonline.com/doi/full/10.1080/0163853X.2017.1296264) However, some of our other research suggests that sometimes the [older readability formulas can function just as well to predict text readability.](https://onlinelibrary.wiley.com/doi/full/10.1111/1467-9817.12283)

While work continues to develop in this area, we can take this opportunity to calculate some of these classic readability metrics in Python. We will start with the Flesch-Kincaid readability formulas.


## **Flesch-Kincaid**

There are two [Flesch-Kincaid formulas](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests): reading ease and grade level. What do the formulas calculate? *Wikipedia* gives us these formulas:


**Flesch Reading Ease**
<br>
> <img src = 'https://wikimedia.org/api/rest_v1/media/math/render/svg/bd4916e193d2f96fa3b74ee258aaa6fe242e110e'>

<br>

**Flesch-Kincaid Grade Level**
<br>
><img src = 'https://wikimedia.org/api/rest_v1/media/math/render/svg/8e68f5fc959d052d1123b85758065afecc4150c3'>
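
As a quick sanity check on the two formulas, here is a small worked example using counts made by hand for the single sentence "The cat sat on the mat." The variable names are only illustrative, and the counts (6 words, 1 sentence, 6 syllables) are my own tally rather than the output of any tool.

```python
# Hand counts for the sentence "The cat sat on the mat."
total_words = 6
total_sentences = 1
total_syllables = 6  # every word in this sentence has one syllable

words_per_sentence = total_words / total_sentences
syllables_per_word = total_syllables / total_words

# Flesch Reading Ease: higher scores mean easier text
reading_ease = 206.835 - (1.015 * words_per_sentence) - (84.6 * syllables_per_word)

# Flesch-Kincaid Grade Level: higher scores mean harder text
grade_level = (0.39 * words_per_sentence) + (11.8 * syllables_per_word) - 15.59

print(round(reading_ease, 3))
print(round(grade_level, 3))
```

The reading ease comes out at roughly 116 and the grade level at roughly -1.45. Values outside the usual ranges like these are typical for very short, very simple inputs, so the formulas are more meaningful on longer texts.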


We can see that these measures are really not that complicated. The main things we need to calculate are the total number of words and sentences in a text, plus the total number of syllables. We already know that getting the words and sentences in a text is quite easy using `nltk.word_tokenize()` and `nltk.sent_tokenize()`, or we could even use `.split()` and `.split('\n')` as a rough approximation. The one new measure that we need to grapple with is the number of syllables in a word. First of all, let's make sure we understand what a syllable is. Here is the definition from Google dictionary:



 <img src = "https://i.imgur.com/1djNlRb.png">

The crucial part of this definition is that a syllable is associated with a vowel *sound*. Well, most of the analyses we have been doing do not really take into account *sounds*, but we can at least identify vowels in a language. So, can we write a function to identify syllables in an English word?



## **Hunting syllables**

Is finding the syllables in a word as simple as counting the number of vowels? Let's try some examples. First let's make a function which just equates the number of vowels with the number of syllables.


In [None]:
def syllable_v1(word):
  vowels = 'aieouy'
  vowel_count = 0
  for char in word.lower():
    if char in vowels:
      vowel_count += 1
  print(f'{word} has {vowel_count} syllables')

Let's test this function out.

In [None]:
syllable_v1('cat')

In [None]:
syllable_v1('banana')

The function seems to work fine so far, at least until we find words such as `ate`, which are one syllable but contain two vowels. Clearly, this equation between number of vowels and syllables doesn't work. We actually knew this from the start, because a syllable is associated with a vowel ***sound***, not a written vowel.
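To make the mismatch concrete, here is a small sketch that returns the vowel-letter count instead of printing it, so the guess can be compared with the number of syllables a speaker would count. The function name `count_vowel_letters` and the expected counts in `examples` are my own choices for illustration.

```python
def count_vowel_letters(word):
    """Count vowel letters (not sounds); y is treated as a vowel, as in syllable_v1."""
    return sum(1 for char in word.lower() if char in 'aeiouy')

# (word, syllables a speaker would count) pairs chosen by hand for illustration
examples = [('cat', 1), ('banana', 3), ('ate', 1), ('queue', 1)]

for word, expected in examples:
    guess = count_vowel_letters(word)
    verdict = 'ok' if guess == expected else 'wrong'
    print(f'{word}: guessed {guess}, expected {expected} ({verdict})')
```

The guesses for `cat` and `banana` match, but `ate` and `queue` are overcounted because silent and doubled vowel letters are each counted as a syllable.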

If only there were an existing resource which recorded the *sounds* of text input. And of course, there is, and it exists within NLTK: the CMU Pronunciation Dictionary! Let's load that resource in and recall what it does.

In [None]:
# import CMU Pronunciation Dictionary
import nltk
nltk.download('cmudict')
from nltk.corpus import cmudict

# create the dictionary
cmu = cmudict.dict()

Look at the output for `ate`. We are given all of the sounds in the word, as well as numbers indicating vowel stress. [You might want to skim their website again.](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) If a syllable is a part of a word associated with a vowel sound, then we can explore the properties of the different phones in the dictionary output to figure out what the vowel sounds are.

In [None]:
cmu['ate']

In [None]:
cmu['banana']

In [None]:
cmu['victoria']

If we look at the output, we can see that the vowel sounds all end in a number, which marks the stress on that vowel (`0` for no stress, `1` for primary stress, `2` for secondary stress). So, we could count the number of times such phones occur in the entry for any one word, and in turn get a good approximation of the number of syllables in that word. Let's create a function to do so. Because a word can have more than one possible pronunciation, each entry stores its pronunciations as a list of lists. We will simplify the problem by always choosing the first pronunciation from any entry using the index `[0]`.

There are various ways to check whether a sound has a number in it. The way that pops up when you search for this question is to take the last character of each sound and check whether it is a digit using `.isdigit()`. I follow that approach in the function below. This function could be written in one line, but I like to break it apart so we all understand what is going on.
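To see the digit check in isolation, here is a minimal sketch that runs without downloading anything. The phone lists are typed by hand in the style of CMU entries; I am assuming they match `cmu['ate'][0]` and `cmu['banana'][0]`, so compare them with the real output above. The helper name `count_vowel_phones` is hypothetical.

```python
# Phone lists typed by hand in the style of CMU dictionary entries
# (assumed to match the first pronunciation of each word; check the real output).
sample_entries = {
    'ate': ['EY1', 'T'],
    'banana': ['B', 'AH0', 'N', 'AE1', 'N', 'AH0'],
}

def count_vowel_phones(phones):
    """Count phones whose final character is a stress digit (0, 1, or 2)."""
    return len([phone for phone in phones if phone[-1].isdigit()])

for word, phones in sample_entries.items():
    print(word, phones, count_vowel_phones(phones))
```

Consonant phones such as `T` or `N` have no digit, so only the vowel phones are counted: one for `ate` and three for `banana`.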

In [None]:
def count_syllables(word):

  if word in cmu.keys():
    # get the entry from the dictionary, and slice the first set of sounds
    phones = cmu[word][0]

    # using a list comprehension, extract only the sounds that end in a digit
    vowel_sounds= [sound for sound in phones if sound[-1].isdigit()]

    # the number of syllables is equal to the number of vowel sounds
    syllables = len(vowel_sounds)

    print(f'The word {word} has {syllables} syllables.')
    return syllables
  else:
    print(f'sorry, {word} is not in the CMU Pronunciation Dictionary')
    return 0

Test this function on our words:

In [None]:
count_syllables('ear')

In [None]:
count_syllables('syllable')

In [None]:
count_syllables('extraordinary')

In [None]:
count_syllables('hairy')

In [None]:
# darn, this word isn't in the dictionary.
count_syllables('associational')

The problem, of course, is what to do with words that are not in the CMU Dictionary. For these we have to fall back on some attempt at counting vowels in the spelling. This problem is widely discussed, because computationally counting the sounds in an English word based on its *spelling* is a difficult task. [Look at the discussion here](https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word) and you will see various attempts using regular expressions as well as the CMU dictionary. Our best friend [ChatGPT gives the same answer](https://chat.openai.com/share/95538d86-c30a-4187-8562-5a200a27d9ce). Note how the ChatGPT code has a few tricks for preprocessing, such as using `filter` with `str.isalpha` to remove anything that is not a letter, and `.lower()` to lowercase what remains.

Line 10 of the code below includes a list comprehension which does the same thing as the `count_syllables` function above, in a less readable way. The one-liner also uses `max()` to get around the issue that words have more than one pronunciation: it keeps the largest syllable count across all of a word's pronunciations.

Line 11 begins the fallback algorithm in case the word is not in the CMU dictionary. Can you understand the code? It keeps track of each vowel instance, and does not double count consecutive vowels in words like `heel`.
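One way to read the fallback rule is that it counts runs of consecutive vowel letters. The sketch below expresses that same idea with a regular expression. The name `count_vowel_groups` is hypothetical, and this is meant as an equivalent restatement of the rule, not an improvement on it.

```python
import re

def count_vowel_groups(word):
    """Count runs of consecutive vowel letters (a restatement of the fallback rule)."""
    return len(re.findall(r'[aeiouy]+', word.lower()))

for word in ['heel', 'banana', 'ate']:
    print(word, count_vowel_groups(word))
```

The doubled `ee` in `heel` is counted once, but `ate` still comes out as two because the silent final `e` forms its own run. The rule is therefore only a rough approximation for words missing from the dictionary.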

How well does the function work?

In [None]:
# code from ChatGPT
# Create a function to count syllables in a word
def count_syllables_GPT(word):
    # Remove any non-alphabetic characters and convert to lowercase
    word = ''.join(filter(str.isalpha, word)).lower()

    # Use the CMU Pronouncing Dictionary to count syllables
    d = cmudict.dict()
    if word in d:
        return max([len(list(y for y in x if y[-1].isdigit())) for x in d[word]])
    else:
        # If the word is not found in the dictionary, use a simple rule
        # based on the number of vowel letters
        vowels = "aeiouy"
        count = 0
        prev_char_is_vowel = False
        for char in word:
            if char in vowels:
                if not prev_char_is_vowel:
                    count += 1
                prev_char_is_vowel = True
            else:
                prev_char_is_vowel = False
        return count

Test out the function by building a helper function to print the output from the ChatGPT code.

In [None]:
def syllables(word):
  syllable_count = count_syllables_GPT(word)
  print(f"The word '{word}' has {syllable_count} syllables.")

In [None]:
syllables('ear')

In [None]:
# seems to perform okay on unseen words.
syllables('associational')

## Back to Flesch-Kincaid

Well, after sorting out the syllable issue, we can now start to calculate the Flesch-Kincaid readability metrics for our text input.

Let's start by building a function which will calculate the basic information we need:

1. The total number of words in a text
2. The total number of sentences in a text
3. The total number of syllables in a text

We need to download some nltk functions first and then build the program.

In [None]:
import nltk
import re
nltk.download('punkt')
# newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')

We will also make a tweaked version of the ChatGPT-based syllable finder which moves the creation of the CMU dictionary outside the function. The ChatGPT version runs `cmudict.dict()` every time it is called, so it takes a very long time to run on whole texts.
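The cost of building the dictionary inside the function can be illustrated without NLTK at all. In this sketch, `load_pronunciations` is a hypothetical stand-in for `cmudict.dict()` that simply records how many times it is called. All names here are made up for the illustration.

```python
calls = {'loads': 0}

def load_pronunciations():
    """Hypothetical stand-in for cmudict.dict(); records each time it is called."""
    calls['loads'] += 1
    return {'cat': [['K', 'AE1', 'T']]}

def slow_lookup(word):
    # rebuilds the dictionary on every call, like count_syllables_GPT
    return word in load_pronunciations()

words = ['cat', 'dog', 'cat', 'bird']
results_slow = [slow_lookup(w) for w in words]
loads_when_inside = calls['loads']

calls['loads'] = 0
pronunciations = load_pronunciations()  # built once, outside the function

def fast_lookup(word):
    # reuses the dictionary that already exists
    return word in pronunciations

results_fast = [fast_lookup(w) for w in words]
loads_when_outside = calls['loads']

print(loads_when_inside, loads_when_outside)
```

Both versions give the same answers, but the first loads the data once per word and the second loads it once in total. With a real text of thousands of tokens, that difference adds up quickly.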

In [None]:
from nltk.corpus import cmudict

# create the dictionary
cmu = cmudict.dict()

def count_syllables_v2(word):

    if word in cmu.keys():
      # get the entry from the dictionary, and slice the first set of sounds
      phones = cmu[word][0]

      # using a list comprehension, extract only the sounds that end in a digit
      vowel_sounds= [sound for sound in phones if sound[-1].isdigit()]

      # the number of syllables is equal to the number of vowel sounds
      syllables = len(vowel_sounds)

    else:
      # If the word is not found in the dictionary, use a simple rule
      # based on the number of vowel letters
      vowels = "aeiouy"
      syllables = 0
      prev_char_is_vowel = False
      for char in word:
          if char in vowels:
              if not prev_char_is_vowel:
                  syllables += 1
              prev_char_is_vowel = True
          else:
              prev_char_is_vowel = False

    #print(f'The word {word} has {syllables} syllables.')
    return syllables

In [None]:
def text_info(text):
  """
  Args:
    text: a string
  Returns:
    ...
  """
  # lowercase the text
  text = text.lower()

  # extract tokens, removing any that are just punctuation
  tokens = [token.lower() for token in nltk.word_tokenize(text) if token.isalpha()]

  # extract sentences
  sentences = [sentence for sentence in nltk.sent_tokenize(text)]

  # extract syllables
  syllables = 0

  for token in tokens:
    syllables += count_syllables_v2(token)
  print(f'this text has {len(tokens)} words, {len(sentences)} sentences, and {syllables} syllables.')
  return len(tokens), len(sentences), syllables

Read in a text to test out the function.

In [None]:
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/tmoom.txt'

In [None]:
tmoom = open('tmoom.txt').read()

In [None]:
text_info(tmoom)

### **Flesch Reading Ease Function**

Now that we can extract the necessary information, we can develop functions to calculate the different readability metrics.


In [None]:
def flesch_reading_ease(text):
  """
  calculate Flesch Reading Ease
  206.835 - 1.015(total words / total sentences) - 84.6(total syllables / total words)
  """
  words, sents, sylls = text_info(text)

  word_sents = words/sents
  syll_words = sylls/words

  reading_ease_score = 206.835 - (1.015 * word_sents) - (84.6 * syll_words)

  print(f'Flesch Reading Ease Score: {reading_ease_score}')

In [None]:
# this text scores an 86, which means it is highly readable.
flesch_reading_ease(tmoom)

In [None]:
# try a more complicated text.

!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/conversation_satire.txt'

In [None]:
conversation = open('conversation_satire.txt').read()

In [None]:
# A 47.7 means it is less readable.

flesch_reading_ease(conversation)
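
To help interpret scores like the two above, here is a rough lookup sketch. The cut-offs and labels are my paraphrase of the bands commonly quoted for Flesch Reading Ease, so treat them as assumptions and check them against a source you trust. The function name `reading_ease_band` is hypothetical.

```python
def reading_ease_band(score):
    """Map a Flesch Reading Ease score to a rough descriptive band (assumed cut-offs)."""
    bands = [
        (90, 'very easy'),
        (80, 'easy'),
        (70, 'fairly easy'),
        (60, 'standard'),
        (50, 'fairly difficult'),
        (30, 'difficult'),
    ]
    # walk down from the easiest band and stop at the first lower bound we reach
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return 'very difficult'

print(reading_ease_band(86))
print(reading_ease_band(47.7))
```

Under these assumed bands, a score of 86 falls in the "easy" range and 47.7 in the "difficult" range, which agrees with the informal reading of the two texts.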

### Flesch Grade Level

It should be relatively simple to now create the other formula to calculate Flesch Grade Level. [We can find at least one resource explaining the scales here.](https://readable.com/readability/flesch-reading-ease-flesch-kincaid-grade-level/)

In [None]:
def flesch_grade_level(text):
  """
  calculate Flesch grade level
  0.39 * (total words / total sentences) +  11.8 * (total syllables / total words) - 15.59
  """
  words, sents, sylls = text_info(text)

  word_sents = words/sents
  syll_words = sylls/words

  reading_grade_level = (0.39 * word_sents) + (11.8 * syll_words) - 15.59

  print(f'Flesch Grade Level: {reading_grade_level}')

In [None]:
# 2.8 means this text is appropriate for basic readers
flesch_grade_level(tmoom)

In [None]:
# 12.17 means this text is for average to skilled readers
flesch_grade_level(conversation)

## Gunning Fog Readability

Another of the many formulas is Gunning Fog. I chose this one because it is relatively simple to calculate:
<br>
> <img src = 'https://wikimedia.org/api/rest_v1/media/math/render/svg/84cd504cf61d43230ef59fbd0ecf201796e5e577'>

In this formula, a complex word is a word that contains three or more syllables (the full definition also excludes proper nouns, compound words, and some other exceptions that we will ignore for now).

To create this formula, we only need to get a measure of how many words are "complex" in a text. We only need to make a small change to the text information function to record this data.
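Before changing the function, a quick worked example may help show how the pieces of the formula combine. The counts below are made up purely for illustration, and the variable names are my own.

```python
# Made-up counts for illustration only
total_words = 100
total_sentences = 5
complex_word_count = 10  # words with three or more syllables

# average sentence length in words
average_sentence_length = total_words / total_sentences

# percentage of words that are complex
percent_complex = 100 * (complex_word_count / total_words)

fog_index = 0.4 * (average_sentence_length + percent_complex)
print(fog_index)
```

With 20 words per sentence and 10 percent complex words, the index is 0.4 * (20 + 10), which is 12. On the usual interpretation of the index, that corresponds to about twelve years of schooling.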

In [None]:
def text_info(text):
  """
  Args:
    text: a string
  Returns:
    ...
  """
  # lowercase the text
  text = text.lower()

  # extract tokens, removing any that are just punctuation
  tokens = [token.lower() for token in nltk.word_tokenize(text) if token.isalpha()]

  # extract sentences
  sentences = [sentence for sentence in nltk.sent_tokenize(text)]

  # extract syllables
  syllables = 0
  # add complex words counter
  complex_words = 0

  # adjust these calculations so that the new variable, complex_words, is incremented
  for token in tokens:
    num_sylls = count_syllables_v2(token)
    syllables += num_sylls
    if num_sylls > 2:
      complex_words += 1

  print(f'this text has {len(tokens)} words, {len(sentences)} sentences, {syllables} syllables, and {complex_words} complex words.')
  return len(tokens), len(sentences), syllables, complex_words

In [None]:
def gunning_fog(text):
  """
  calculate Gunning Fog
  0.4 * ((words / sentences) + 100 * (complex words / words))
  """
  words, sents, sylls, complex_words = text_info(text)

  word_sents = words/sents
  complexWords_words = complex_words/words

  # use a separate name so the result does not shadow the function name
  fog_index = 0.4 * ((word_sents) + (complexWords_words*100))

  print(f'Gunning Fog Index: {fog_index}')

Test out the function. Here is the Gunning Fog score interpretation to compare the results against. What do you think?

Fog Index|Reading level by grade
:-:|:-:
17|	College graduate
16|	College senior
15|	College junior
14|	College sophomore
13|	College freshman
12|	High school senior
11|	High school junior
10|	High school sophomore
9|	High school freshman
8|	Eighth grade
7|	Seventh grade
6|	Sixth grade
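
The table can also be turned into a small lookup, sketched below. The names `FOG_LEVELS` and `fog_reading_level` are hypothetical, and two choices are my own assumptions rather than part of the index: rounding to the nearest whole number, and clamping values outside the table to its first or last row.

```python
# Rows copied from the interpretation table above
FOG_LEVELS = {
    17: 'College graduate',
    16: 'College senior',
    15: 'College junior',
    14: 'College sophomore',
    13: 'College freshman',
    12: 'High school senior',
    11: 'High school junior',
    10: 'High school sophomore',
    9: 'High school freshman',
    8: 'Eighth grade',
    7: 'Seventh grade',
    6: 'Sixth grade',
}

def fog_reading_level(fog_index):
    """Look up the table row for a fog index, rounded to the nearest whole number."""
    nearest = round(fog_index)
    # clamp to the table's range (an assumption: the table stops at 6 and 17)
    nearest = max(6, min(17, nearest))
    return FOG_LEVELS[nearest]

print(fog_reading_level(12.0))
print(fog_reading_level(8.4))
```

Note that Python's `round()` rounds exact halves to the nearest even number, so a value such as 12.5 maps to 12 rather than 13.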

In [None]:
gunning_fog(tmoom)

In [None]:
gunning_fog(conversation)

## **Your Turn**

I have shown you a number of formulas you can calculate. There are even more, [which you can see here.](https://en.wikipedia.org/wiki/Readability#Popular_readability_formulas)

Find some texts you think are easy, and some you think are difficult. Do the readability formulas accurately categorise these texts based on your expectations?

Then, consider extending what is already here. Can you add more readability formulas? Can you also make a single function which calculates all of these formulas for a text? Can you develop a regular expression that estimates syllables from vowels better than the ChatGPT approach of counting non-consecutive vowels in a word?