In this notebook, I will be making observations regarding the data provided in the [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize/overview).

I have made some submissions to the competition:
- [Decision Tree with score 0.941](https://www.kaggle.com/aniketsharma00411/commonlit-readability-decision-tree)
- [Random Forest with score 0.780](https://www.kaggle.com/aniketsharma00411/commonlit-readability-random-forest)

I will use the observations made in this notebook to improve these models and to build better ones with other regression techniques.

As in the other notebooks, I will be using the [readability](https://pypi.org/project/readability/) Python package to create features from excerpts.

# Initialization

I am using the [readability](https://pypi.org/project/readability/) and [syntok](https://pypi.org/project/syntok/) packages to evaluate the readability of each excerpt, and [textblob](https://pypi.org/project/textblob/) for sentiment analysis.

In [None]:
! pip install -q /kaggle/input/readability/readability-0.3.1-py3-none-any.whl
! pip install -q /kaggle/input/syntok/syntok-1.3.1-py3-none-any.whl
from textblob import TextBlob
import readability
import syntok.segmenter as segmenter

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test_data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')

In [None]:
train_data.info()
train_data.head()

In [None]:
test_data.info()
test_data.head()

In [None]:
pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')

# Functions

In [None]:
def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity

In [None]:
def tokenize(text):
    """Tokenizing and creating excerpts in the format suggested in the README of readability project."""
    return '\n\n'.join(
        '\n'.join(
            ' '.join(token.value for token in sentence)
            for sentence in paragraph)
        for paragraph in segmenter.analyze(text))

# Creating Features

In [None]:
train_data.loc[:,'readability_object'] = train_data.apply(lambda row: readability.getmeasures(tokenize(row.excerpt), lang='en'), axis=1)

In [None]:
train_data.info()
train_data.head()

In [None]:
train_data.loc[0, 'readability_object']['readability grades'].keys()

The readability module provides 9 readability grades. None of them is better than the others by definition.

So, to decide which one to use, we will calculate the correlation matrix of these grades with our target value and then take the grade with the strongest correlation (the largest absolute value, since a grade can be negatively correlated with the target).

In [None]:
grade_names = ['Kincaid', 'ARI', 'Coleman-Liau', 'FleschReadingEase', 'GunningFogIndex',
               'LIX', 'SMOGIndex', 'RIX', 'DaleChallIndex']

readability_grades = pd.DataFrame(train_data['id'])
# Pull each readability grade out of the per-row result into its own column.
for grade in grade_names:
    readability_grades[grade] = train_data['readability_object'].apply(
        lambda measures, grade=grade: measures['readability grades'][grade])
readability_grades['target'] = train_data['target']

In [None]:
readability_grades.info()
readability_grades.head()

In [None]:
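# Drop the non-numeric 'id' column first; recent pandas versions raise an error on it.
read_corr = readability_grades.drop(columns='id').corr()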
read_corr.info()
read_corr

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
plt.title("Correlation Plot")
sns.heatmap(read_corr,
            # np.bool was removed from NumPy; the builtin bool is equivalent here.
            mask=np.zeros_like(read_corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
plt.show()

As expected, the readability grades are highly correlated with each other, so using several or all of them would only add redundancy. Therefore, we will choose a single readability grade: the one most strongly correlated with the target variable.

In [None]:
read_corr.target

**SMOGIndex** is the grade most strongly correlated with the target variable, so we will use it as the readability feature for training.
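
The selection above can also be done programmatically instead of reading the values by eye. The following is a minimal sketch, assuming a Series shaped like `read_corr.target`; the helper name `strongest_feature` and the toy numbers are hypothetical and are not results from the competition data. It compares absolute values so that a strongly negative correlation is not overlooked.

```python
import pandas as pd

# Toy stand-in for read_corr.target; the values are made up for illustration only.
toy_target_corr = pd.Series(
    {"Kincaid": -0.50, "FleschReadingEase": 0.52, "SMOGIndex": -0.56, "target": 1.0}
)


def strongest_feature(target_corr, target_name="target"):
    """Return the label whose correlation with the target has the largest magnitude."""
    # The target's correlation with itself is always 1, so exclude it first.
    return target_corr.drop(target_name).abs().idxmax()


best = strongest_feature(toy_target_corr)
print(best)
```

In the notebook, the same call would presumably be `strongest_feature(read_corr.target)`.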

Now, we will take all the features the readability module can give us and then prune them based on their correlation with each other and with the target variable.

In [None]:
sentence_info_keys = ['characters_per_word', 'syll_per_word', 'words_per_sentence',
                      'sentences_per_paragraph', 'type_token_ratio', 'characters',
                      'syllables', 'words', 'wordtypes', 'sentences', 'long_words',
                      'complex_words', 'complex_words_dc']
word_usage_keys = ['tobeverb', 'auxverb', 'conjunction', 'pronoun', 'preposition',
                   'nominalization']

X = pd.DataFrame(train_data['id'])
X['readability'] = train_data['readability_object'].apply(
    lambda measures: measures['readability grades']['SMOGIndex'])
X['sentiment'] = train_data['excerpt'].apply(sentiment_analysis)
# Copy every statistic from the 'sentence info' and 'word usage' groups into its own column.
for group, keys in (('sentence info', sentence_info_keys), ('word usage', word_usage_keys)):
    for key in keys:
        X[key] = train_data['readability_object'].apply(
            lambda measures, group=group, key=key: measures[group][key])

In [None]:
X.info()
X.head()

In [None]:
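# Drop the non-numeric 'id' column first; recent pandas versions raise an error on it.
corr = X.drop(columns='id').corr()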
corr.info()
corr

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
plt.title("Correlation Plot")
sns.heatmap(corr,
            # np.bool was removed from NumPy; the builtin bool is equivalent here.
            mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
plt.show()

We can see that the following groups of features are highly correlated:
 - characters_per_word and syll_per_word
 - characters and syllables
 - long_words, complex_words and complex_words_dc
 - sentences and sentences_per_paragraph

So, from each of these four groups, we need to keep only one feature.
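
Keeping one feature per correlated group can be sketched as follows. This is only an illustration under stated assumptions: the helper `drop_correlated`, the `0.9` threshold, and the toy data are hypothetical, and the rule used here (keep whichever column comes first) is arbitrary. Keeping the member most correlated with the target would be an equally reasonable choice.

```python
import numpy as np
import pandas as pd


def drop_correlated(features, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and a column is never compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Toy data (made up): 'syllables' is almost proportional to 'characters'.
toy = pd.DataFrame({
    "characters": [100, 200, 300, 400, 500],
    "syllables": [30, 61, 90, 121, 150],
    "pronoun": [5, 1, 4, 2, 3],
})
reduced, dropped = drop_correlated(toy)
print(dropped)
```

Applied to the notebook's data, the input would presumably be `X.drop(columns='id')`.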

In [None]:
# Join the target onto the numeric features (dropping 'id') and keep only the target row.
tar_corr = X.drop(columns='id').join(train_data['target']).corr().loc['target']
tar_corr

The correlations of the features with the target are **not very high**, but still reasonable.

We can still remove features whose correlation with the target is very low, such as sentiment.
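
One way to make "very low" concrete is a cutoff on the absolute correlation. The sketch below assumes a Series shaped like `tar_corr`; the helper `weak_features`, the `0.1` cutoff, and the toy values are hypothetical choices for illustration, not measurements from this dataset.

```python
import pandas as pd


def weak_features(target_corr, min_abs_corr=0.1, target_name="target"):
    """List the features whose |correlation| with the target is below min_abs_corr."""
    # Ignore the target's own entry if it is present in the Series.
    strength = target_corr.drop(target_name, errors="ignore").abs()
    return strength[strength < min_abs_corr].index.tolist()


# Toy stand-in for tar_corr; the values are made up for illustration only.
toy_tar_corr = pd.Series(
    {"readability": -0.55, "sentiment": 0.02, "words_per_sentence": -0.40, "target": 1.0}
)
weak = weak_features(toy_tar_corr)
print(weak)
```

The returned names could then be passed to `X.drop(columns=...)` before training.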