# CommonLit Readability Prize EDA
<img src="https://library.ucf.edu/wp-content/uploads/sites/5/2016/03/sort-by-color.jpg" alt="title">

## Introduction
The goal of this competition is to build a model that estimates the readability score of short text excerpts. Traditionally, this is done with readability formulas based on linguistic features (e.g. average number of syllables per word, average sentence length, etc.). These features are often highly correlated with reading difficulty; however, a general measure of readability is more complex, as it also involves the level of abstraction, the use of images and difficult concepts, active and passive voice, and so on. Therefore, a data-driven approach combined with recent developments in Natural Language Processing can be the next step towards improving current estimates.
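
For reference, two widely used classical formulas of this kind are the Flesch Reading Ease (FRE) and the Flesch-Kincaid Grade Level (FKGL). They are shown here only to illustrate what formula-based scoring looks like and are not used elsewhere in this notebook.

$$
\text{FRE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
$$

$$
\text{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
$$

A higher FRE indicates easier text, whereas FKGL approximates a US school grade level. Both depend only on average sentence length and average syllables per word.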

Good models will significantly help teachers, administrators, students, and literacy curriculum developers.

# Importing packages and loading data

In [None]:
!pip install syllables
!pip install rich

import numpy as np # linear algebra
import pandas as pd # data processing
pd.set_option("display.max_colwidth", None)  # setting the maximum width in characters when displaying pandas column. "None" value means unlimited.

import os      
import syllables
import matplotlib.pyplot as plt
import seaborn as sns
import random
from wordcloud import WordCloud
from rich import theme, console

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_data = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
test_data = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

# Basic Exploratory Data Analysis

In [None]:
print(f"Total entries in train.csv: {len(train_data)}")
print(f"Total entries in test.csv: {len(test_data)}")

In [None]:
train_data.head(3)

In [None]:
test_data.head(3)

In [None]:
train_data.info()

In [None]:
test_data.info()

To get clearer, styled output in the terminal, we can configure printing with the *Console* and *Theme* classes from the Python **rich** library (see [the official repository](https://github.com/willmcgugan/rich)).

In [None]:
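# Import the classes directly so that the `console` variable below does not
# overwrite the `rich.console` module (which would break re-running this cell)
from rich.console import Console
from rich.theme import Theme

custom_theme = Theme({
    "info" : "cyan",
    "warning": "magenta",
    "danger": "bold red"
})

console = Console(theme=custom_theme)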

Now, we can print out a random text passage together with the corresponding target and the standard error.

In [None]:
i = random.randrange(train_data.shape[0])   # selecting a random row from the dataframe

console.print(train_data.iloc[i].excerpt+'\n',style='warning')
console.print(f"Target (standard error) --> {train_data.iloc[i].target} ({train_data.iloc[i].standard_error})", style = 'info')

# Advanced exploratory data analysis
In this section, we are going to display some graphs and figures using the raw data in *train.csv*.

First, we can visualize the distribution of target values and standard errors using *seaborn* and *matplotlib*.

In [None]:
train_target = train_data['target'].values
train_se = train_data['standard_error'].values

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,6))

# hardcoding the appearance of the plots

sns.kdeplot(train_target, ax=ax1, label='Train', lw=5, alpha=0.6)
ax1.axvline(train_target.mean(),linestyle='--', linewidth=2)
ax1.annotate('mean', fontsize=18, xy=(train_target.mean(), 0.2), 
            xytext=(0.8, 0.9), textcoords='axes fraction',
            arrowprops=dict(facecolor='black', shrink=0.02, connectionstyle="arc3,rad=-0.3",),
             bbox=dict(boxstyle="square", fc="w", ec="k")
            )
ax1.tick_params(axis='both', which='major', labelsize=18)
ax1.set_xlabel('target', fontsize=18)
ax1.set_ylabel('density', fontsize=18)


sns.kdeplot(train_se, ax=ax2, label='Train', lw=5, alpha=0.6)
ax2.axvline(train_se.mean(),linestyle='--', linewidth=2)
ax2.annotate('mean', fontsize=18, xy=(train_se.mean(), 6),  
            xytext=(0.4, 0.9), textcoords='axes fraction',
            arrowprops=dict(facecolor='black', shrink=0.02, connectionstyle="arc3,rad=0.3",),
             bbox=dict(boxstyle="square", fc="w", ec="k")
            )
ax2.tick_params(axis='both', which='major', labelsize=18)
ax2.set_xlabel('standard_error', fontsize=18)
ax2.set_ylabel('density', fontsize=18)


plt.suptitle('Distribution of targets and standard errors in train.csv', fontsize = 24)
plt.show()

Then, we render a scatter plot to reveal the relationship between targets and standard errors.

In [None]:
sns.set_theme(style="whitegrid")

fig, ax = plt.subplots(figsize=(8,6))
sns.scatterplot(y=train_target, x=train_se, color = 'blue', alpha = 0.3, s=50)
ax.tick_params(axis='both', which='major', labelsize=18)
ax.set_xlabel('standard error', fontsize=18)
ax.set_ylabel('target', fontsize=18)
plt.show()

It looks like one entry has unusual values: a single point sits apart from the rest, with a standard error of zero. Let's inspect the raw dataframe for confirmation.

In [None]:
train_data.sort_values(by='standard_error').head(3)

Indeed, the excerpt with id=*436ce79fe* has both target and standard error equal to zero. However, as mentioned in [this discussion](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236403), this sample serves as the baseline for all other samples.
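
If this baseline row should not influence later statistics, it can be filtered out. The snippet below is a minimal sketch on a toy dataframe: the name `toy` and all values except the zero/zero baseline row are made up for illustration. Whether to drop the row at all is a modelling choice, not something the competition prescribes.

```python
import pandas as pd

# Toy stand-in for train_data; only the zero/zero baseline row mirrors the real data.
toy = pd.DataFrame({
    "id": ["aaa", "436ce79fe", "bbb"],
    "target": [-1.2, 0.0, 0.4],
    "standard_error": [0.48, 0.0, 0.51],
})

# Keep every row whose standard error is strictly positive.
filtered = toy[toy["standard_error"] > 0].reset_index(drop=True)
print(len(toy), "->", len(filtered))
```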

We can also combine the two previous figures into one using *jointplot* from the **seaborn** package.

In [None]:
sns.jointplot(y=train_data['target'], x=train_data['standard_error'], 
              kind='hex', 
              edgecolor='tab:blue', 
              marginal_kws=dict(bins=25), 
              height=8)
plt.show()

Finally, let's display word clouds. A word cloud is a technique for showing which words are the most frequent in a given text.
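
Conceptually, a word cloud starts from a word-frequency table, and more frequent words are drawn larger. A minimal sketch of that counting step, using only the standard library, is shown below. The sample sentence and the tiny stop-word set are assumptions for illustration and are not what `WordCloud` uses internally.

```python
import re
from collections import Counter

text = "The cat sat on the mat. The mat was warm, and the cat was happy."
stop_words = {"the", "on", "and", "was"}  # hypothetical, deliberately tiny

# Lowercase the text, keep alphabetic tokens only, and drop stop words.
tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
freq = Counter(tokens)

# The most frequent words are the ones a word cloud would draw largest.
print(freq.most_common(3))
```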

In [None]:
console.print("Word cloud generated from the whole dataset:", style = "info")

plt.figure(figsize=(10,10))
wordcloud = WordCloud( background_color='white',
                        width=600,
                        height=500).generate(" ".join(train_data['excerpt']))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
i = random.randrange(train_data.shape[0])

console.print(f"Word cloud generated from the {i}-th excerpt:", style = "info")

plt.figure(figsize=(10,10))
wordcloud = WordCloud( background_color='white',
                        width=600,
                        height=500).generate(train_data.iloc[i].excerpt)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# Feature Engineering
In the final section, we will construct some extra linguistic features on top of the raw data:
- letter_count - the total number of characters (the string length, so spaces and punctuation are counted too)
- word_count - the total number of words 
- sentence_count - the total number of sentences
- syllable_count - the total number of syllables (for this, we can use the Python **syllables** library, see [this reference](https://github.com/prosegrinder/python-syllables))
- avg_word_len - the average length of words (letter_count divided by word_count)
- avg_sentence_len - the average length of sentences, measured in words
- avg_word_syll - the average number of syllables per word
- avg_sen_syll - the average number of syllables per sentence
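
To make the definitions above concrete, here is a minimal sketch that computes the same quantities for a single non-empty string with the standard library only. The helper names `naive_syllables` and `basic_features` are hypothetical, and the vowel-group syllable counter is a crude stand-in for the **syllables** package, so its numbers will differ from the ones produced in this notebook.

```python
import re

def naive_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def basic_features(text):
    """Return the raw counts and averages listed above for one non-empty excerpt."""
    words = text.split()
    chars = len(text)  # every character, including spaces and punctuation
    # Count runs of sentence-ending punctuation; never report fewer than 1.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    sylls = sum(naive_syllables(w) for w in words)
    return {
        "letter_count": chars,
        "word_count": len(words),
        "sentence_count": sentences,
        "syllable_count": sylls,
        "avg_word_len": chars / len(words),
        "avg_sentence_len": len(words) / sentences,
        "avg_word_syll": sylls / len(words),
        "avg_sen_syll": sylls / sentences,
    }

features = basic_features("The cat sat. It was happy!")
print(features)
```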

In [None]:
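# Note: str.len() counts every character, including spaces and punctuation
train_data['letter_count'] = train_data['excerpt'].str.len()
train_data['word_count'] = train_data['excerpt'].str.split().str.len()
# Count runs of sentence-ending punctuation; clip to 1 to avoid division by zero later
train_data['sentence_count'] = train_data['excerpt'].str.count(r'[.!?]+').clip(lower=1)
# syllables.estimate() is meant for single words, so estimate per word and sum
train_data['syllable_count'] = train_data['excerpt'].apply(lambda x: sum(syllables.estimate(w) for w in x.lower().split()))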

In [None]:
train_data['avg_word_len'] = train_data['letter_count']/train_data['word_count']
train_data['avg_sentence_len'] = train_data['word_count']/train_data['sentence_count']
train_data['avg_word_syll'] = train_data['syllable_count']/train_data['word_count']
train_data['avg_sen_syll'] = train_data['syllable_count']/train_data['sentence_count']

This time, we print out a random text passage together with the corresponding target, the standard error, and some newly generated linguistic features.

In [None]:
i = random.randrange(train_data.shape[0])

console.print(train_data.iloc[i].excerpt+'\n',style='warning')
console.print(f"Target (standard error) --> {train_data.iloc[i].target} ({train_data.iloc[i].standard_error})", style = 'info')
console.print(f"Total sentences: {train_data.iloc[i].sentence_count}", style = 'info')
console.print(f"Total words: {train_data.iloc[i].word_count}", style = 'info')
console.print(f"Total syllables: {train_data.iloc[i].syllable_count}", style = 'info')
console.print(f"Total letters: {train_data.iloc[i].letter_count}", style = 'info')

To see the relationships between the numeric columns of the training dataset, we display the correlation matrix.
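
Each cell of that matrix is a Pearson correlation coefficient (the default method of `DataFrame.corr`). As a sketch of what a single cell means, the coefficient can be computed by hand. The function name `pearson_r` and the toy numbers are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data with a negative relationship between the two columns.
avg_word_syll = [1.2, 1.3, 1.5, 1.7]
target = [0.5, 0.1, -0.9, -1.6]
r = pearson_r(avg_word_syll, target)
print(round(r, 3))
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.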

In [None]:
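corr = train_data.select_dtypes(include='number').corr()  # numeric columns only; text columns such as the excerpt cannot be correlated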

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True  # so that we do not duplicate values in the correlation matrix

fig, ax = plt.subplots(figsize=(8,8))
ax = sns.heatmap(corr, ax=ax, annot=True, mask=mask, square=True, fmt = '.2f', annot_kws={"fontsize":15}, cbar=False)
ax.set_xticklabels(ax.get_xmajorticklabels(), fontsize = 18)
ax.set_yticklabels(ax.get_ymajorticklabels(), fontsize = 18)
plt.title('Correlation matrix', fontsize=20)
plt.show()

For example, as one can see, there is a strong correlation between the target and the average number of syllables per word (which is not a surprise). So, let's draw a joint plot for these two columns.

In [None]:
sns.jointplot(y=train_data['target'], x=train_data['avg_word_syll'], 
              kind='hex', 
              edgecolor='tab:blue', 
              marginal_kws=dict(bins=25), 
              height=8)
plt.show()

It is also possible to display all correlations and distributions using *pairplot* from the Python **seaborn** library.

In [None]:
sns.pairplot(train_data, diag_kind='kde')

To conclude, we have performed exploratory data analysis on our training dataset. This helps us to understand what kind of input we will be working with when building a readability model.