# Introduction
The task of assessing the readability of a specific text has a long history of rule-based approaches. 
Many readability indices exist, developed for purposes ranging from school reading assignments to legal requirements that data collection policies and disclaimers be understandable to ordinary readers. Most of them rely on features that are easy to extract from the text, such as average sentence length or average number of syllables per word.
This notebook constructs those features collected from various readability indices in the wild in the hope of improving the performance of a predictive model. 
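
As an illustration of how such indices combine these features, here is a minimal sketch of two well-known formulas, Flesch Reading Ease and Flesch-Kincaid Grade Level. The function names and the word, sentence and syllable counts are made up for the example; they are not taken from the competition data.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    asl = words / sentences   # average sentence length
    asw = syllables / words   # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    asl = words / sentences
    asw = syllables / words
    return 0.39 * asl + 11.8 * asw - 15.59


# Hypothetical counts for a 100-word passage with 5 sentences and 150 syllables
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both formulas use only ASL and ASW, which is why those two ratios are among the first features built below.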

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from matplotlib import pyplot
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import nltk
import spacy
nlp = spacy.load("en_core_web_sm")

!pip install textstat
import textstat
from textstat.textstat import textstatistics
from tabulate import tabulate
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
data.head()

# EDA
Let's see what the hardest and the easiest texts from our data look like. Also, what is it like in the middle of that scale?

In [None]:
# view hardest and easiest texts
n = 2 # number of texts from each side

sorted_inds = np.argsort(data['target'].values) # row positions sorted by ascending target
min_inds = sorted_inds[:n] # n lowest targets (hardest texts)
max_inds = sorted_inds[::-1][:n] # n highest targets (easiest texts)
mid = len(sorted_inds) // 2
mid_inds = sorted_inds[mid - n // 2 : mid - n // 2 + n] # n texts around the median target

# print out results
print('Easiest:')
for ind in max_inds:
    print('Score: ', data.iloc[ind]['target'])
    print(data.iloc[ind]['excerpt'])
print('Hardest:')
for ind in min_inds:
    print('Score: ', data.iloc[ind]['target'])
    print(data.iloc[ind]['excerpt'])   
print('Medium:')
for ind in mid_inds:
    print('Score: ', data.iloc[ind]['target'])
    print(data.iloc[ind]['excerpt'])


With a helpful resource - https://readabilityformulas.com/freetests/six-readability-formulas.php 
we can check how various readability metrics evaluate the easiest text according to our training data target:

Readability metrics for the text with the highest target score:

Flesch Reading Ease score: 70.7 (text scale)
Flesch Reading Ease scored your text: fairly easy to read.

Gunning Fog: 8.6 (text scale)
Gunning Fog scored your text: fairly easy to read.

Flesch-Kincaid Grade Level: 6.2
Grade level: Sixth Grade.

The Coleman-Liau Index: 12
Grade level: Twelfth Grade

The SMOG Index: 7.1
Grade level: Seventh Grade

Automated Readability Index: 7.9
Grade level: 12-14 yrs. old (Seventh and Eighth graders)

Linsear Write Formula : 6
Grade level: Sixth Grade.

Surprisingly, the highest target score is not the easiest according to this metric:
The text with target score 1.7113 is judged as 'fairly easy to read', while the one with score 1.583 is 'very easy to read'.

Intuitively, I would agree, as the 1.7113 text includes a difficult word "Paleontologists", while the 1.583 one has nothing of the sort.


# Feature Engineering

First, we will find parameters that are often used in the readability metrics:

1. ACW - average characters per word = Char / Word (used as a measurement of word difficulty)

In [None]:
data['chars'] = data['excerpt'].apply(lambda x: len(x))
data['words'] = data['excerpt'].apply(lambda x: len(x.split()))
data['ACW'] = data['chars'] / data['words']
data.head()

2. ASL - average sentence length = Word / Sent (used as a measurement of syntactic complexity)

For that we process the text with spaCy to split it into sentences

In [None]:
# Returns number of sentences in the text
def sentence_count(text):
    doc = nlp(text)
    sentence_tokens = [sent for sent in doc.sents]
    return len(sentence_tokens)

data['sent'] = data['excerpt'].apply(sentence_count)
data['ASL'] = data['words'] / data['sent']
data.head()

3. ASW - average number of syllables per word = Syl / Word (used as a measurement of morphological complexity)

We use textstat (already imported) - a library to calculate statistics from text 

In [None]:
def syllables_count(word):
    return textstatistics().syllable_count(word)

data['syl'] = data['excerpt'].apply(syllables_count)
data['ASW'] = data['syl'] / data['words']
data.head()

4. PHW - percentage of hard words (words of three or more syllables, excluding affixes, proper nouns, compound words) 

We use a stemmer from nltk (spaCy does not have a stemmer, only a lemmatizer) and a part-of-speech tagger

Note that compound words are not removed in this implementation

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english')

from nltk.tag import pos_tag

In [None]:
def poly_syllable_count(text):
    count = 0
    words = []
    
    # remove proper nouns
    # (tag the original text: lowercasing first would hide the capitalisation
    # cues the tagger relies on to recognise proper nouns)
    tokens = nltk.word_tokenize(text)
    tagged = pos_tag(tokens)
    new_text = [word.lower() for word, tag in tagged if tag not in ('NNP', 'NNPS')]
    
    # remove affixes
    for word in new_text:
        words.append(stemmer.stem(word))

    # count syllables in words

    for word in words:
        syllable_count = syllables_count(word)
        if syllable_count >= 3:
            #print('found!')
            count += 1
    return count

data['poly_syl'] = data['excerpt'].apply(poly_syllable_count)
data['PHW'] = data['poly_syl'] / data['words']
data.head()
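
The hard-word definition above is the one used by the Gunning Fog index, which combines it with average sentence length. A minimal sketch of that formula follows; the function name and the counts are hypothetical and only illustrate how ASL and PHW fit together.

```python
def gunning_fog(words, sentences, hard_words):
    """Gunning Fog index: estimated years of schooling needed to read the text."""
    asl = words / sentences    # average sentence length
    phw = hard_words / words   # fraction of hard (3+ syllable) words
    return 0.4 * (asl + 100 * phw)


# Hypothetical passage: 100 words, 5 sentences, 10 hard words
print(gunning_fog(100, 5, 10))
```

Note that the index expects PHW as a percentage, which is why the fraction is multiplied by 100 inside the formula.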

We also use several other measurements of syntactic complexity.
The following measurements are taken from  https://www.semanticscholar.org/paper/Classification-into-Readability-Levels-%3A-and-Larsson/04ebb3c027bc3ba74ae1c4fac1047010ed0037a8

The NQ and n/pn measurements indicate how much information there is in a text.

Nominal quotient (NQ) is the number of nouns, prepositions and participles divided by the number of pronouns, adverbs and verbs, calculated per document. The normal NQ value is 1.0.

Noun to pronoun quotient (n/pn) is the number of nouns divided by the number of pronouns in the text, calculated per document. Nouns are a part of speech with high information value, while pronouns often repeat previous information.

In [None]:
#retrieve each part of speech count 
from collections import defaultdict
from operator import itemgetter

def pos_count(text, mode='nq'): 
    sentence = nltk.word_tokenize(text)
    sent = pos_tag(sentence)
    counts = defaultdict(int)
    for (word, tag) in sent:
        counts[tag] += 1

    #sorted(counts.items(), key=itemgetter(1), reverse=True)
    
    noun = counts['NNP']+counts['NNS']+counts['NN']+counts['NNPS']
    pronoun = counts['PRP']+counts['WP']+counts['PRP$']+counts['WP$']
    preposition = counts['IN']
    participle = counts['VBG']  # the original key 'VBG ' had a trailing space and never matched
    adverb = counts['RB']+counts['RBR']+counts['RBS']
    verb = counts['VB']+counts['VBD']+counts['VBP']+counts['VBZ']
    
    if mode == 'nq':
        # nominal quotient
        nq = (noun+preposition+participle) / (pronoun+adverb+verb)
        return nq
    
    # noun to pronoun quotient
    noun_pron = noun / (pronoun + 0.001)
    
    return noun_pron

In [None]:
data['nq'] = data['excerpt'].apply(pos_count)
data['n/pn'] = data['excerpt'].apply(pos_count, mode='n/pn')  # any mode other than 'nq' returns the noun/pronoun quotient
data.head()
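
To make the NQ definition concrete, here is a small sketch that computes it from a list of Penn Treebank tags, without running a tagger. The tag groups mirror the ones used in `pos_count`; the helper name `nominal_quotient` and the hand-assigned tags for the example sentence are assumptions made for illustration. Unlike `pos_count`, it returns NaN instead of raising when the denominator is zero.

```python
from collections import Counter

NOMINAL_TAGS = {"NN", "NNS", "NNP", "NNPS", "IN", "VBG"}   # nouns, prepositions, participles
VERBAL_TAGS = {"PRP", "PRP$", "WP", "WP$",                  # pronouns
               "RB", "RBR", "RBS",                          # adverbs
               "VB", "VBD", "VBP", "VBZ"}                   # verbs


def nominal_quotient(tags):
    """Nominal quotient for a sequence of POS tags (NaN if the denominator is 0)."""
    counts = Counter(tags)
    nominal = sum(counts[t] for t in NOMINAL_TAGS)
    verbal = sum(counts[t] for t in VERBAL_TAGS)
    return nominal / verbal if verbal else float("nan")


# Hand-assigned tags for "The cat sat on the mat"
tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]
print(nominal_quotient(tags))
```

Here two nouns and one preposition are divided by a single verb, so the sentence is strongly nominal compared with the normal value of 1.0.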

The number of definite articles (def_art) provides a measurement of how abstract the text is. Abstract texts have fewer definite nouns and articles. It is counted per text and divided by the number of words.

In [None]:
def def_art(text):
    definite = 0
    text = text.lower()
    for word in text.split():
        if word == 'the':
            definite  = definite + 1
    return definite 

data['def_art'] = data['excerpt'].apply(def_art) / data['words']
data.head()
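
The whitespace split used above misses articles that are attached to punctuation, such as an opening quote followed by "The". A possible alternative, sketched here with a simple regular-expression tokenizer, is shown below. The function name `definite_article_ratio` is hypothetical, and its denominator is the number of regex tokens rather than `data['words']`, so the values are not directly interchangeable with `def_art`.

```python
import re


def definite_article_ratio(text):
    """Share of tokens equal to 'the', ignoring case and surrounding punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count("the") / len(tokens)


# Both occurrences are found even though the first one follows a quote mark
print(definite_article_ratio('"The answer," she said, "is the sea."'))
```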

The average numbers of conjunctions and of other tags identifying prepositional phrases and subordinate clauses are counted as supplementary measurements of syntactic complexity.

Sentences with prepositional phrases and subordinate clauses increase text ambiguity. 

These are counted with spaCy

In [None]:
# subordinating conjunctions
def sconj(text):
    sconj = 0
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "SCONJ":
            sconj += 1
    return sconj

data['sconj'] = data['excerpt'].apply(sconj) / data['words']

#prepositional modifier
def prep(text):
    prep = 0
    doc = nlp(text)
    for token in doc:
        if token.dep_ == "prep":  # the token itself is the prepositional modifier
            prep += 1
    return prep

data['prep'] = data['excerpt'].apply(prep) / data['sent']


# adverbial clause modifier 
def advcl(text):
    advcl = 0
    doc = nlp(text)
    for token in doc:
        if token.dep_ == "advcl":  # the token itself heads the adverbial clause
            advcl += 1
    return advcl

data['advcl'] = data['excerpt'].apply(advcl) / data['sent']


# conjunctions
def conj(text):
    conj = 0
    doc = nlp(text)
    for token in doc:
        if token.dep_ == "conj":  # the token itself is the conjunct
            conj += 1
    return conj

data['conj'] = data['excerpt'].apply(conj) / data['sent']


# clausal complement
def ccomp(text):
    ccomp = 0
    doc = nlp(text)
    for token in doc:
        if token.dep_ == "ccomp":  # the token itself heads the clausal complement
            ccomp += 1
    return ccomp
data['ccomp'] = data['excerpt'].apply(ccomp) / data['sent']


# relative clause modifier (whose, who)
def relcl(text):
    relcl = 0
    doc = nlp(text)
    for token in doc:
        if token.dep_ == "relcl":  # the token itself heads the relative clause
            relcl += 1
    return relcl

data['relcl'] = data['excerpt'].apply(relcl) / data['sent']
data.head()

Syntactic depth is the maximum depth of each sentence's dependency tree; we average it over all sentences of a text.

Complex sentences are less readable and increase text ambiguity.

In [None]:
# sentence depth

def tree_height(root):
    
    if not list(root.children):
        return 1
    else:
        return 1 + max(tree_height(x) for x in root.children)
    
def get_average_heights(text):
    
    doc = nlp(text)
    roots = [sent.root for sent in doc.sents]
    #print(text)
    #print(roots)
    #print([tree_height(root) for root in roots])
    return np.mean([tree_height(root) for root in roots])

data['sen_depth'] = data['excerpt'].apply(get_average_heights)
data.head()
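
To show what the depth measures without loading a parser, here is a small sketch on a hand-written dependency tree. The tree for "The cat sat on the mat" is an assumed parse written out as a plain dictionary, and `height_of` is a hypothetical helper. It uses the same convention as `tree_height` (a leaf has height 1) but walks the tree level by level instead of recursively.

```python
def height_of(children, root):
    """Number of nodes on the longest path from `root` down to a leaf."""
    height = 0
    frontier = [root]
    while frontier:
        height += 1
        # move one level down: collect the children of every node in the frontier
        frontier = [child for node in frontier for child in children.get(node, [])]
    return height


# Assumed parse: sat -> (cat -> The), (on -> (mat -> the))
children = {
    "sat": ["cat", "on"],
    "cat": ["The"],
    "on": ["mat"],
    "mat": ["the"],
}
print(height_of(children, "sat"))
```

The longest chain is sat, on, mat, the, so nesting a prepositional phrase makes the sentence deeper than its subject alone would.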

Now that we have constructed all features commonly related to readability, let's check how they correlate with the target readability score. 

In [None]:
names = ['target', 'standard_error', 'words', 'ACW', 'sent', 'ASL', 'syl', 'ASW', 'poly_syl','PHW','nq','n/pn',
        'def_art', 'sconj','prep', 'advcl','conj', 'ccomp', 'relcl', 'sen_depth' ]
        
correlations = data[names].corr()
# plot correlation matrix
fig = pyplot.figure(figsize=(20,10))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)

ticks = np.arange(len(names))  # one tick per feature, so the labels stay aligned if names changes
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names) 
ax.set_yticklabels(names)

pyplot.show()
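
Each cell of the matrix above is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. A minimal pure-Python sketch of that coefficient is given below; the sentence lengths and target values are invented for the example and are not taken from the data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values: longer sentences paired with lower (harder) target scores
asl_values = [8, 12, 15, 20, 25]
target_values = [1.0, 0.5, 0.0, -1.0, -2.0]
print(pearson(asl_values, target_values))
```

A value close to -1 or 1 means the feature moves almost linearly with the target; values near 0 suggest little linear relationship.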

In [None]:
# select relevant features and construct a dataset that consist only of engineered features
selected_names = ['syl','ACW','ASL','ASW','PHW','nq','n/pn','def_art','sconj', 'prep','advcl','conj','ccomp','relcl','sen_depth']  # 'ACW' was listed twice; the duplicate is replaced with 'ASW'

y = data['target']
X = data[selected_names]
X.head()

In [None]:
# define score metrics (RMSE, lower is better)
from sklearn.metrics import mean_squared_error as mse
rmse = lambda y_true, y_pred: np.sqrt(mse(y_true, y_pred))
rmse_loss = lambda Estimator, X, y: rmse(y, Estimator.predict(X))
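
For reference, the root mean squared error used by these lambdas can be written out by hand. The sketch below uses a hypothetical helper `rmse_plain` and made-up numbers; it is only meant to show what is being averaged.

```python
import math


def rmse_plain(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two perfect predictions and one that is off by 2
print(rmse_plain([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.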

In [None]:
# check performance of a simple Linear Regression on engineered features
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import mean_squared_error as mse

model = LinearRegression()

val_score = cross_val_score(
    model, 
    X, 
    y, 
    scoring=rmse_loss
).mean()

print(f'Cross-validated RMSE for Linear Regression: {val_score}')

In [None]:
# normalize features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
X_norm = scaler.transform(X)

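# split the normalized features (the original split used the unscaled X, leaving X_norm unused here)
Xtrain, Xtest, ytrain, ytest = train_test_split(X_norm, y, test_size=0.2, random_state=0)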

In [None]:
# Compare different algorithms
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

import matplotlib.pyplot as plt

# prepare models
models = []
models.append(('LR', LinearRegression())) 
models.append(('Ridge', Ridge()))
models.append(('BayesRidge', BayesianRidge()))
models.append(('SVM', SVR(C=0.5)))


# evaluate each model in turn
results = []
names = []
scoring = rmse_loss
for name, model in models:
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    cv_results = cross_val_score(model, Xtrain, ytrain, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()) 
    print(msg)

# boxplot algorithm comparison
fig = plt.figure() 
fig.suptitle('Algorithm Comparison') 
ax = fig.add_subplot(111) 
plt.boxplot(results) 
ax.set_xticklabels(names) 
ax.set_ylabel('RMSE') 
plt.show()

We can see that using only the features we constructed, without the text itself, yields reasonable regression performance. Now we will vectorize the text and extract some additional features to expand our training data.

In [None]:
# vectorization
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

vectorizer = TfidfVectorizer()
X_vect = vectorizer.fit_transform(data['excerpt'])
X_vect.shape
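
To give an idea of what the vectorizer produces, here is a small pure-Python sketch of TF-IDF weighting. It follows what I understand to be the scikit-learn defaults (smoothed idf and L2-normalised rows) but uses a plain whitespace split instead of the library's tokenizer, so it is an approximation rather than a replacement. The function name `tfidf_rows` and the toy documents are made up.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(["the cat sat", "the dog barked"])
print(rows[0])
```

Terms that occur in every document, such as "the", receive a lower weight than terms specific to one document.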

In [None]:
# use dimensionality reduction 
from sklearn.decomposition import TruncatedSVD

pca = TruncatedSVD(n_components=1000)
fit = pca.fit(X_vect)

# summarize components
plt.plot(fit.explained_variance_ratio_) 
#plt.yscale('log')
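
One possible way to read such a curve is to accumulate the ratios and look for the smallest number of components that reaches a chosen share of the variance. The sketch below shows that idea on made-up ratios; `components_for_variance` and the thresholds are hypothetical and not part of the notebook's pipeline.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Smallest k whose first k ratios sum to at least `threshold`, or None."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return None


# Made-up explained-variance ratios, sorted from largest to smallest
ratios = [0.5, 0.25, 0.125, 0.0625]
print(components_for_variance(ratios, 0.8))
print(components_for_variance(ratios, 0.99))
```

With the real model, `fit.explained_variance_ratio_` could be passed in place of the made-up list.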

In [None]:
pca = TruncatedSVD(n_components=300)
fit = pca.fit(X_vect)
X_vect_r = pca.transform(X_vect)

In [None]:
# combine vectorized text data with feature-engineered data
X_vect_feat = np.hstack([X_vect_r ,X_norm])
Xtrain, Xtest, ytrain, ytest = train_test_split(X_vect_feat, y, test_size=0.2, random_state=0)

Finally, we will train a neural network model on the resulting dataset.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler, ReduceLROnPlateau

n_features = Xtrain.shape[1]
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),  # shape must be a tuple, hence the trailing comma
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(32, activation='relu'), 
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1)
])


# Default learning rate for the Adam optimizer is 0.001
# Let's reduce the learning rate by a factor of 10.
learning_rate = 0.0001
model.compile(loss='mean_squared_error',optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=[RootMeanSquaredError()])
model.summary()
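
The parameter counts reported by the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add none. The sketch below assumes 315 input features (300 SVD components plus 15 engineered features); the actual number depends on `n_features`, and `dense_params` is a hypothetical helper.

```python
def dense_params(layer_sizes):
    """Total weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per unit
    return total


# Assumed input width of 315, followed by the 64, 32 and 1 unit layers
print(dense_params([315, 64, 32, 1]))
```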


In [None]:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_root_mean_squared_error', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)


early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=5, # how many epochs to wait before stopping
    restore_best_weights=True,
)
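
The interplay of `patience` and `min_delta` can be illustrated with a simplified simulation. This is a sketch of the general idea, not the Keras implementation: it only tracks a list of validation losses, and the function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.001):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training never stops early."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # large enough improvement: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# The loss keeps changing after epoch 1, but by less than min_delta
losses = [1.0, 0.9, 0.8995, 0.8996, 0.8997]
print(early_stopping_epoch(losses, patience=2))
```

With `restore_best_weights=True`, the weights from the best epoch would be the ones kept after stopping.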

In [None]:
num_epochs = 50
history = model.fit(Xtrain, ytrain, epochs=num_epochs,
                    callbacks=[early_stopping,learning_rate_reduction],
                    validation_split=0.1)

In [None]:
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()


In [None]:
plot_graphs(history, "root_mean_squared_error")
plot_graphs(history, "loss")