# References
- https://pypi.org/project/textstat/
- https://www.kaggle.com/ruchi798/commonlit-readability-prize-eda-baseline
- https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236626

# Introduction

How can you evaluate "**Readability**"?

**Textstat** is an easy-to-use library for calculating statistics from text.
It helps determine readability, complexity, and grade level.

For example, the *Flesch Reading Ease Score* is a readability score that indicates how difficult an English passage is to understand.
The score is interpreted as follows.

|Score|Difficulty|
|---|---|
|90-100|Very Easy|
|80-89|Easy|
|70-79|Fairly Easy|
|60-69|Standard|
|50-59|Fairly Difficult|
|30-49|Difficult|
|0-29|Very Confusing|


To get this score, call textstat:
```
textstat.flesch_reading_ease(text)
```
If the returned value is <code>45.8</code>, the text is *Difficult*.
If it is <code>77.03</code>, the text is *Fairly Easy*.
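
The table lookup can be sketched in plain Python. The helper name `flesch_difficulty` is hypothetical (it is not part of textstat), and treating scores above 100 as *Very Easy* and scores below 0 as *Very Confusing* is an assumption, since the table only covers 0 to 100.

```python
def flesch_difficulty(score: float) -> str:
    """Map a Flesch Reading Ease score to the difficulty label in the table."""
    # Lower bound of each band, checked from easiest to hardest.
    bands = [
        (90, "Very Easy"),
        (80, "Easy"),
        (70, "Fairly Easy"),
        (60, "Standard"),
        (50, "Fairly Difficult"),
        (30, "Difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    # Assumption: everything below 30, including negative scores.
    return "Very Confusing"


print(flesch_difficulty(45.8))   # Difficult
print(flesch_difficulty(77.03))  # Fairly Easy
```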


textstat supports the following readability scores:
- The Flesch Reading Ease formula
- The Flesch-Kincaid Grade Level
- The Fog Scale (Gunning FOG Formula)
- The SMOG Index
- Automated Readability Index
- The Coleman-Liau Index
- Linsear Write Formula
- Dale-Chall Readability Score
- Readability Consensus based upon all the above tests

You can also use textstat to count:
- Syllables
- Words (lexicon)
- Sentences


# Load data

In [None]:
import numpy as np
import pandas as pd

In [None]:
train = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')
sub = pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')

train

In [None]:
# The row with the min target value
train_min = train.loc[train['target'].idxmin()]
excerpt_min = train_min['excerpt']
print(train_min)
print()
print(excerpt_min)

In [None]:
# The row with the max target value
train_max = train.loc[train['target'].idxmax()]
excerpt_max = train_max['excerpt']
print(train_max)
print()
print(excerpt_max)

# Textstat
[Textstat](https://pypi.org/project/textstat/) is an easy-to-use library for calculating statistics from text.  
It helps determine readability, complexity, and grade level.

We will use:
- textstat.flesch_reading_ease(test_data)
- textstat.smog_index(test_data)
- textstat.flesch_kincaid_grade(test_data)
- textstat.coleman_liau_index(test_data)
- textstat.automated_readability_index(test_data)
- textstat.dale_chall_readability_score(test_data)
- textstat.difficult_words(test_data)
- textstat.linsear_write_formula(test_data)
- textstat.gunning_fog(test_data)
- textstat.text_standard(test_data)

The following functions are designed specifically for Spanish.  
They can be used on non-Spanish texts, although this is not recommended.  
- textstat.fernandez_huerta(test_data)
- textstat.szigriszt_pazos(test_data)
- textstat.gutierrez_polini(test_data)
- textstat.crawford(test_data)

In [None]:
!pip install textstat

In [None]:
import textstat

## List of Functions

### syllable_count
Returns the number of syllables present in the given text.  
Uses the Python module [Pyphen](https://github.com/Kozea/Pyphen) for syllable calculation.

In [None]:
print(textstat.syllable_count(excerpt_min))
print(textstat.syllable_count(excerpt_max))

### lexicon_count
Calculates the number of words present in the text. The optional <code>removepunct</code> argument specifies whether punctuation is removed before counting. It defaults to <code>True</code>, which removes punctuation before counting lexicon items.

In [None]:
print(textstat.lexicon_count(excerpt_min, removepunct=True))
print(textstat.lexicon_count(excerpt_max, removepunct=True))

### sentence_count
Returns the number of sentences present in the given text.

In [None]:
print(textstat.sentence_count(excerpt_min))
print(textstat.sentence_count(excerpt_max))

### Flesch Reading Ease
Returns the Flesch Reading Ease Score.  
The following table helps interpret the score.  
The ranges are only indicative: the maximum score is 121.22, but there is no limit on how low the score can be, and a negative score is valid.

|Score|Difficulty|
|---|---|
|90-100|Very Easy|
|80-89|Easy|
|70-79|Fairly Easy|
|60-69|Standard|
|50-59|Fairly Difficult|
|30-49|Difficult|
|0-29|Very Confusing|

Formula:
$$
206.835
-1.015 \left(\frac{\rm total ~ words}{\rm total ~ sentences}\right)
-84.6 \left(\frac{\rm total ~ syllables}{\rm total ~ words}\right)
$$

Further reading on [Wikipedia](https://simple.wikipedia.org/wiki/Flesch_Reading_Ease)
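
As a rough check of the formula, here is a minimal sketch that evaluates it from raw counts. The function name `flesch_reading_ease_from_counts` and its parameters are hypothetical. The value from textstat may differ, because textstat does its own word, sentence, and syllable counting.

```python
def flesch_reading_ease_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Evaluate the Flesch Reading Ease formula from precomputed counts."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# A one-word sentence with one syllable gives the maximum score.
print(round(flesch_reading_ease_from_counts(words=1, sentences=1, syllables=1), 2))  # 121.22
```

This reproduces the maximum of 121.22 mentioned above: longer sentences and longer words can only lower the score.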

In [None]:
print(textstat.flesch_reading_ease(excerpt_min))
print(textstat.flesch_reading_ease(excerpt_max))

### The Flesch-Kincaid Grade Level

Returns the Flesch-Kincaid Grade of the given text.  
This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Formula:
$$
0.39 \left(\frac{\rm total ~ words}{\rm total ~ sentences}\right)
+11.8 \left(\frac{\rm total ~ syllables}{\rm total ~ words}\right)
-15.59
$$

Further reading on [Wikipedia](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests)
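
The grade formula can be evaluated by hand in the same way. This is a sketch with a hypothetical function name and made-up counts, not textstat's implementation.

```python
def flesch_kincaid_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Evaluate the Flesch-Kincaid Grade Level formula from precomputed counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts: 100 words, 5 sentences, 150 syllables.
print(round(flesch_kincaid_grade_from_counts(100, 5, 150), 2))  # 9.91
```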

In [None]:
print(textstat.flesch_kincaid_grade(excerpt_min))
print(textstat.flesch_kincaid_grade(excerpt_max))

### Gunning fog index
Returns the FOG index of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Formula:
$$
0.4
\left[
\left(
\frac{\rm words}{\rm sentences}
\right)
+100
\left(
\frac{\rm complex~words}{\rm words}
\right)
\right]
$$

★ What are *complex words*? In the standard definition of the Gunning fog index, they are words of three or more syllables.

Further reading on [Wikipedia](https://en.wikipedia.org/wiki/Gunning_fog_index)
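
A minimal sketch of the formula, assuming the number of complex words has already been counted. The function name and the sample counts are hypothetical, and textstat's own counting of complex words may give different numbers.

```python
def gunning_fog_from_counts(words: int, sentences: int, complex_words: int) -> float:
    """Evaluate the Gunning fog formula from precomputed counts."""
    average_sentence_length = words / sentences
    percent_complex = 100 * (complex_words / words)
    return 0.4 * (average_sentence_length + percent_complex)


# Illustrative counts: 100 words, 5 sentences, 10 complex words.
print(round(gunning_fog_from_counts(100, 5, 10), 2))  # 12.0
```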

In [None]:
# gunning_fog by textstat
print(textstat.gunning_fog(excerpt_min))
print(textstat.gunning_fog(excerpt_max))

### SMOG
Returns the SMOG index of the given text.   
This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.  
Texts of fewer than 30 sentences are statistically invalid, because the SMOG formula was normed on 30-sentence samples. textstat requires at least 3 sentences for a result.


Formula:
$$
{\rm grade}=1.0430 \sqrt{{\rm number ~ of ~ polysyllables}\times \frac{30}{{\rm number ~ of ~  sentences}}}+3.1291
$$

- Take a sample of sentences (at least 30) and count them.  
- In those sentences, count the polysyllables (words of 3 or more syllables).

Further reading on [Wikipedia](https://en.wikipedia.org/wiki/SMOG)
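
The two counting steps feed directly into the formula. The sketch below assumes both counts are already available. The function name is hypothetical, and it does not enforce the 30-sentence minimum.

```python
import math


def smog_from_counts(polysyllables: int, sentences: int) -> float:
    """Evaluate the SMOG grade formula from precomputed counts."""
    # Scale the polysyllable count to a 30-sentence sample.
    scaled_polysyllables = polysyllables * (30 / sentences)
    return 1.0430 * math.sqrt(scaled_polysyllables) + 3.1291


# Illustrative counts: 100 polysyllables in a 30-sentence sample.
print(round(smog_from_counts(100, 30), 4))  # 13.5591
```

With no polysyllables at all, the grade bottoms out at the constant 3.1291.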

In [None]:
# smog_index by textstat
print(textstat.smog_index(excerpt_min))
print(textstat.smog_index(excerpt_max))

### Automated readability index
Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text.

For example, if the ARI is 6.5, then the grade level needed to comprehend the text is 6th to 7th grade.

Formula:
$$
4.71
\left(
\frac{{\rm characters}}{{\rm words}}
\right)
+0.5
\left(
\frac{{\rm words}}{{\rm sentences}}
\right)
-21.43
$$

Further reading on [Wikipedia](https://en.wikipedia.org/wiki/Automated_readability_index)
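
Unlike the syllable-based scores, ARI only needs character, word, and sentence counts. A minimal sketch follows, with a hypothetical function name and made-up counts. textstat's result for a real text may differ because of how it counts characters.

```python
def ari_from_counts(characters: int, words: int, sentences: int) -> float:
    """Evaluate the Automated Readability Index formula from precomputed counts."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Illustrative counts: 500 characters, 100 words, 5 sentences.
print(round(ari_from_counts(500, 100, 5), 2))  # 12.12
```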


In [None]:
print(textstat.automated_readability_index(excerpt_min))
print(textstat.automated_readability_index(excerpt_max))

### The Coleman-Liau Index

Returns the grade level of the text using the Coleman-Liau Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Formula:
$$
CLI = 0.0588L - 0.296S - 15.8
$$
- $L$ : the average number of letters per 100 words
- $S$ : the average number of sentences per 100 words.

Further reading on [Wikipedia](https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index)
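
To make $L$ and $S$ concrete, the sketch below derives both from raw counts before applying the formula. The function name and the sample counts are hypothetical.

```python
def coleman_liau_from_counts(letters: int, words: int, sentences: int) -> float:
    """Evaluate the Coleman-Liau formula from precomputed counts."""
    letters_per_100_words = letters / words * 100      # L
    sentences_per_100_words = sentences / words * 100  # S
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Illustrative counts: 450 letters, 100 words, 5 sentences.
print(round(coleman_liau_from_counts(450, 100, 5), 2))  # 9.18
```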

In [None]:
print(textstat.coleman_liau_index(excerpt_min))
print(textstat.coleman_liau_index(excerpt_max))

### Linsear Write Formula
Returns the grade level using the Linsear Write Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Formula:  
The standard Linsear Write metric $Lw$ runs on a 100-word sample:

1. For each "easy word", defined as a word with 2 syllables or fewer, add 1 point.
2. For each "hard word", defined as a word with 3 syllables or more, add 3 points.
3. Divide the points by the number of sentences in the 100-word sample.
4. Adjust the provisional result $r$:
   - If $r > 20$, $Lw = r / 2$
   - If $r \le 20$, $Lw = r / 2 - 1$

The result is a "grade level" measure, reflecting the estimated years of education needed to read the text fluently.

Further reading on [Wikipedia](https://en.wikipedia.org/wiki/Linsear_Write)
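
The four steps can be sketched as follows, assuming the easy and hard words in the 100-word sample have already been counted. The function name is hypothetical, and this is not textstat's implementation.

```python
def linsear_write_from_counts(easy_words: int, hard_words: int, sentences: int) -> float:
    """Apply the Linsear Write steps to counts taken from a 100-word sample."""
    # Steps 1 and 2: 1 point per easy word, 3 points per hard word.
    points = easy_words * 1 + hard_words * 3
    # Step 3: divide by the number of sentences in the sample.
    provisional = points / sentences
    # Step 4: adjust the provisional result.
    if provisional > 20:
        return provisional / 2
    return provisional / 2 - 1


print(linsear_write_from_counts(easy_words=90, hard_words=10, sentences=5))   # 12.0
print(linsear_write_from_counts(easy_words=95, hard_words=5, sentences=10))   # 4.5
```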

In [None]:
print(textstat.linsear_write_formula(excerpt_min))
print(textstat.linsear_write_formula(excerpt_max))

### Dale–Chall readability formula
This test differs from the others in that it uses a lookup table of the 3000 most commonly used English words. It returns the grade level using the New Dale-Chall Formula.

|Score|Understood by|
|---|---|
|4.9 or lower|average 4th-grade student or lower|
|5.0–5.9|average 5th or 6th-grade student|
|6.0–6.9|average 7th or 8th-grade student|
|7.0–7.9|average 9th or 10th-grade student|
|8.0–8.9|average 11th or 12th-grade student|
|9.0–9.9|average 13th to 15th-grade (college) student|

Formula:
$$
0.1579
\left(
\frac{\rm difficult ~ words}{\rm words}
\times 100
\right)
+0.0496
\left(
\frac{\rm words}{\rm sentences}
\right)
$$

★ What are *difficult words*? They are words that do not appear in the lookup table of commonly used English words.

Further reading on [Wikipedia](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula)
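
A minimal sketch of the formula from precomputed counts, with a hypothetical function name. Note one addition that the formula above does not show: in the standard Dale-Chall formula, 3.6365 is added to the raw score when more than 5% of the words are difficult. The sketch includes that adjustment. textstat's output for a real text may differ because of its own word list and counting.

```python
def dale_chall_from_counts(difficult_words: int, words: int, sentences: int) -> float:
    """Evaluate the Dale-Chall formula from precomputed counts."""
    percent_difficult = difficult_words / words * 100
    score = 0.1579 * percent_difficult + 0.0496 * (words / sentences)
    # Standard adjustment (not shown in the formula above):
    # add 3.6365 when more than 5% of the words are difficult.
    if percent_difficult > 5:
        score += 3.6365
    return score


# Illustrative counts: 10 difficult words, 100 words, 5 sentences.
print(round(dale_chall_from_counts(10, 100, 5), 4))  # 6.2075
```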

In [None]:
print(textstat.dale_chall_readability_score(excerpt_min))
print(textstat.dale_chall_readability_score(excerpt_max))

In [None]:
# difficult_words
print(textstat.difficult_words(excerpt_min))
print(textstat.difficult_words(excerpt_max))

### Readability Consensus based upon all the above tests
Based upon all the above tests, returns the estimated school grade level required to understand the text.
The optional <code>float_output</code> argument returns the score as a <code>float</code>. It defaults to <code>False</code>.

In [None]:
print(textstat.text_standard(excerpt_min))
print(textstat.text_standard(excerpt_max))