# Classical methods for text readability
In this notebook, I compute classical readability scores for each text with the [readability](https://github.com/andreasvc/readability/) library.

In [None]:
!pip install https://github.com/andreasvc/readability/tarball/master
!pip install syntok

In [None]:
import readability

import numpy as np
import pandas as pd 
import os

import syntok.segmenter as segmenter

from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
train_df = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')

The readability library expects sentence-segmented, tokenized text: one sentence per line, with tokens separated by spaces.
So I tokenize each excerpt with the syntok library first.



In [None]:
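def _calc_readability(text):
    # syntok yields paragraphs -> sentences -> tokens. Rebuild the text with
    # space-separated tokens, one sentence per line, and a blank line between
    # paragraphs, which is the layout readability expects.
    tokenized = '\n\n'.join(
        '\n'.join(' '.join(token.value for token in sentence)
                  for sentence in paragraph)
        for paragraph in segmenter.analyze(text))
    return readability.getmeasures(tokenized, lang='en')

# One dict of measures per excerpt.
train_df['readability'] = train_df['excerpt'].map(_calc_readability)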
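
To make the target layout concrete, here is a minimal sketch that uses only the standard library. The helper `naive_tokenize` is hypothetical and is not part of syntok or readability; its regex splitting is much cruder than syntok and is only meant to show what the tokenized string looks like.

```python
import re

def naive_tokenize(text):
    """Return text with one sentence per line and space-separated tokens.

    Assumption: a simple regex splitter stands in for syntok here.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r'\n\s*\n', text) if p.strip()]
    blocks = []
    for paragraph in paragraphs:
        # Split after sentence-final punctuation that is followed by whitespace.
        sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
        # Separate words from punctuation marks and join tokens with spaces.
        blocks.append('\n'.join(
            ' '.join(re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence))
            for sentence in sentences))
    # A blank line marks the paragraph boundary.
    return '\n\n'.join(blocks)

sample = "The cat sat. It purred!\n\nDogs bark."
print(naive_tokenize(sample))
```

Each printed line is one sentence, punctuation is split off as its own token, and the blank line marks the paragraph break.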

readability returns several groups of measures for each document, for example readability grades, sentence info such as the number of words, and word usage.

Here, I use only the readability grades.


In [None]:
def _extract_feat(measures):
    # Keep only the 'readability grades' group as a flat {grade name: value} dict.
    # An empty dict is returned if the group is missing.
    return dict(measures.get('readability grades', {}))

# One row per excerpt, one column per readability grade.
readability_grades_df = pd.DataFrame(train_df['readability'].map(_extract_feat).tolist())
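
For intuition about what such a grade measures, here is a rough sketch of one classical formula, the Automated Readability Index (ARI): 4.71 * characters/words + 0.5 * words/sentences - 21.43. The function below is my own illustration, not the library's implementation, and it assumes that any token containing a letter or digit counts as a word. The library's counting rules may differ, so the values need not match its output exactly.

```python
def automated_readability_index(tokenized):
    """Approximate ARI for text with one sentence per line, space-separated tokens."""
    sentences = [line.split() for line in tokenized.splitlines() if line.strip()]
    # Assumption: punctuation-only tokens are not counted as words.
    words = [tok for sent in sentences for tok in sent
             if any(ch.isalnum() for ch in tok)]
    chars = sum(len(word) for word in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

easy = "The cat sat .\nIt purred !"
hard = "Meteorological observations unequivocally demonstrate considerable variability ."
print(automated_readability_index(easy))
print(automated_readability_index(hard))
```

Longer words and longer sentences both push the score up, which is why the second sample scores much higher than the first.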

I check how each readability grade correlates with the target value.

In [None]:
readability_grades_df['target'] = train_df['target']

In [None]:
readability_grades_df.corr().T.style.background_gradient(cmap='viridis', axis=0)
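
By default, `DataFrame.corr()` computes the Pearson correlation coefficient for every pair of columns. As a minimal sketch of what a single cell of that matrix contains, the function below computes the same coefficient with the standard library only. The `grade` and `target` values are made-up toy numbers, not taken from the training data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances up to a common factor, which cancels in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: the grade rises while the target falls.
grade = [4.0, 6.0, 9.0, 12.0]
target = [0.5, -0.2, -1.1, -2.0]
print(pearson(grade, target))
```

A value near -1 or +1 indicates a strong linear relationship, and the sign only tells the direction.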

In [None]:
sns.pairplot(readability_grades_df)

The strength of a correlation matters more here than its sign, so let's also look at the absolute value of each correlation coefficient.

In [None]:
readability_grades_df.corr().abs().T.style.background_gradient(cmap='viridis', axis=0)