# Exploratory Data Analysis 
#### [CommonLit Readability Prize competition](https://www.kaggle.com/c/commonlitreadabilityprize)

---

# <a id=contents>Table of Contents</a>

---

#### [About](#about)
- [The Problem Statement](#statement)
- [Evaluation Metric](#metric)
- [Data Description](#data)

#### [Overview of the Data](#overview)
- [Training Set](#train)
- [Test Set](#test)
- [Target and Standard Error](#target)
- [Text Type and Complexity Level](#text_level)
- [Domain Names](#domain)
- [Licenses](#licenses)
- [Excerpts](#excerpts)

#### [Further Exploration](#further)
- [Readability, Complexity, and Grade Level](#readability)
- [Text Preprocessing Pipeline](#prep)
- [Word Frequency Diagram](#word_freq)
- [Word Clouds](#word_cloud)
- [Ngrams Analysis](#ngrams)
- [Corpus Statistics](#corpus_stats)
- [Correlation Matrix](#corr)

#### [Simple Baseline](#baseline)
- [Feature Scaling and Transformation](#trans)
- [Sentence Embedding](#embeddings)
- [Performance with Extended Features](#improved)

#### [Final Thoughts](#final)

#### [References](#ref)

# <a id=about>About</a>
    
---

This is an Exploratory Data Analysis (EDA) for the [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize), which I hope will provide you with meaningful insights and help you create better machine learning solutions.

## <a id=statement>The Problem Statement</a>
    
---

The aim of this Natural Language Processing (NLP) challenge is to improve readability rating methods in order to make the process of learning English more effective and engaging. This is a code competition, and our specific goal is to build an algorithm that can predict the complexity level of a text while meeting the following requirements:
    
- CPU Notebook not more than 3 hours run-time
- GPU Notebook not more than hours run-time
- Internet access disabled
- Freely and publicly available external data is allowed, including pre-trained models
- Submission file must be named `submission.csv`

This is a regression problem in which we need to extract features from the text excerpts and predict the continuous target values. 

## <a id=metric>Evaluation Metric</a>
    
---    

The evaluation metric is the root mean squared error (RMSE), which is defined as:
    
$$RMSE = \sqrt{\frac{1}{n}{\sum_{i=1}^{n}(y_i-\hat{y_i})^2}}$$
    
where $\hat{y_i}$ is the predicted value, $y_i$ is the original value, and $n$ is the number of rows in the test data.
    
In practice, we can write our own function or just use `mean_squared_error` from the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html), setting the `squared` parameter to **False** (if True, it returns the MSE value; if False, the RMSE value):
    
```python
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(y_true, y_pred, squared=False)
```
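
For the "write our own function" option, a minimal NumPy sketch could look like the one below. The function name `rmse` and the toy values are my own and are not part of the competition data. Note also that recent scikit-learn releases deprecate the `squared` argument in favor of a separate `root_mean_squared_error` function, so the call above may need adjusting depending on the installed version.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only
print(rmse([0.0, -1.0, 1.0], [0.0, -1.0, 3.0]))
```

A hand-written version like this does not depend on the scikit-learn version, which can be convenient in a notebook that runs without Internet access.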
    
## <a id=data>Data Description</a>
    
--- 
    
#### Files

- **train.csv** - the training set
- **test.csv** - the test set 
- **sample_submission.csv** - a sample submission file in the correct format

#### Columns

- `id` - unique ID for excerpt
- `url_legal` - URL of source - this is blank in the test set.
- `license` - license of source material - this is blank in the test set.
- `excerpt` - text to predict reading ease of
- `target` - reading ease
- `standard_error` - measure of spread of scores among multiple raters for each excerpt. Not included for test data.
    
As mentioned in the [data description](https://www.kaggle.com/c/commonlitreadabilityprize/data), the texts are taken from various domains and include passages from several time periods with a wide range of reading ease scores.

#### Additional Information
    
From one of the competition hosts, [Scott Crossley](https://www.kaggle.com/cookiecutters), we also know that:
    
>The target value is the result of a Bradley-Terry analysis of more than 111,000 pairwise comparisons between excerpts. Teachers spanning grades 3-12 (a majority teaching between grades 6-10) served as the raters for these comparisons.

>Standard error is included as an output of the Bradley-Terry analysis because individual raters saw only a fraction of the excerpts, while every excerpt was seen by numerous raters. The test and train sets were split after the target scores and s.e. were computed.

As Scott Crossley said in the [same discussion](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423), teachers grading passages were only asked one question: **Which text is easier for the students to understand?** 
    
You can find more about the [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) analysis and pairwise comparisons in [this notebook](https://www.kaggle.com/shaz13/code-how-bradley-terry-model-works/).
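
To get a feeling for how pairwise comparisons turn into a continuous score, here is a rough sketch of a Bradley-Terry fit using the classic iterative (minorization-maximization) update. This is only an illustration under my own assumptions: the function name `bradley_terry`, the `n_iter` parameter, and the toy comparisons are made up, and the hosts' actual fitting procedure is not known to me. I count the text judged easier as the "winner", so a higher score means an easier text.

```python
import numpy as np


def bradley_terry(n_items, comparisons, n_iter=200):
    """Fit Bradley-Terry log-strengths from (winner, loser) index pairs."""
    wins = np.zeros((n_items, n_items))
    for winner, loser in comparisons:
        wins[winner, loser] += 1

    games = wins + wins.T          # number of comparisons for each pair
    total_wins = wins.sum(axis=1)  # number of wins for each item

    p = np.ones(n_items)
    for _ in range(n_iter):
        # Update: p_i = W_i / sum_j n_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom
        p /= p.sum()               # fix the scale, strengths are relative
    return np.log(p)


# Toy comparisons: text 0 is usually judged easier than 1, and 1 than 2
toy = (
    [(0, 1)] * 3 + [(1, 0)]
    + [(1, 2)] * 3 + [(2, 1)]
    + [(0, 2)] * 3 + [(2, 0)]
)
scores = bradley_terry(3, toy)
print(scores.round(3))
```

The sketch assumes every item wins and loses at least once; otherwise the estimates are not finite.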

**[Back to Table of Contents](#contents)**

So let’s start!

# <a id=overview>Overview of the Data</a>

---



In [None]:
%%capture
!pip install contractions
!pip install textstat

In [None]:
import numpy as np
import pandas as pd
import regex as re
import os

import contractions
import urllib.parse

from tabulate import tabulate
from collections import defaultdict, Counter
from termcolor import colored, cprint

import nltk
from nltk.util import ngrams
from wordcloud import WordCloud
import textstat

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

np.random.seed(0)

In [None]:
# Set font size
plt.rcParams.update({'font.size': 12})

# Set seaborn plot axis style
sns.set_style('ticks')

# Set color palette
PAL = sns.color_palette('BrBG', 9)
sns.palplot(PAL)

## <a id="train">Training Set</a>

---

In [None]:
train_df = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
train_df.head()

In [None]:
train_df.info()

In [None]:
# A summary for numerical columns
train_df.describe().T

In [None]:
# A summary for nonnumerical columns
train_df.select_dtypes(include=[object]).describe().T

In [None]:
def tab_data(df):
    headers = ['Column', 'Null Count', 'Unique Count']
    meta_list = []
    cols = [i for i in df.columns]
    for col in cols:
        temp = []
        temp.append(col)
        temp.append(df[col].isna().sum())
        temp.append(df[col].nunique())
        meta_list.append(temp)
    print(tabulate(meta_list, headers, tablefmt='rst'))

print('Train Set: Missing and Unique Values Summary')
tab_data(train_df)

In [None]:
# Check if the url_legal and license columns share the same rows with missing values
url_no_license = len(train_df.loc[train_df['url_legal'].notna() & train_df['license'].isna()])
print(f'Number of URLs without a license: {url_no_license}')

In [None]:
# Check whether several excerpts refer to the same URL
url_count = train_df.groupby(['url_legal']).agg({'url_legal': 'count'})
url_count.columns = ['url_count']
url_count.loc[url_count.url_count > 1].drop_duplicates(subset='url_count')

In [None]:
# Check the rows with the same values in target and standard_error column
train_df.loc[train_df['target'] == train_df['standard_error']]

## Training Set Overview Summary

The training set is represented by 2834 records in the 6 columns shown in the table above. 

### Missing Values 

Two columns, `url_legal` and `license`, each contain 2004 missing values, and these fall in the same rows. All 830 URLs are provided with licenses.

Next, I want to quote a fragment from the data description that mentions the nature of the texts:

> Note that the test set includes a slightly larger proportion of **modern** texts (the type of texts we want to generalize to) than the training set.

We have no explanation and can only guess what is meant by **modern** texts. But let's step back and think about what could cause the missing values in the dataset.

The first thing that comes to mind is that the missing values reflect a different way of collecting or processing the data. For example, they could be excerpts from classical literature, school textbooks, or other teaching materials that were manually collected and included in the dataset. Thus, such passages are, so to speak, **"classic"**, and the texts scraped from websites are **"modern"**. 

On the other hand, we have a warning:

> Also note that while licensing information is provided for the public test set (because the associated excerpts are available for display / use), **the hidden private test set includes only blank license / legal information.**

So as far as I understand, we have neither licenses nor URLs in the hidden test set. That means the information in the `url_legal` and `license` columns cannot be used as a feature at inference time. 

For the sake of completeness, we'll look at these columns in the EDA and even go a little deeper, but our assumption about the nature of **"modern"** and **"classic"** texts does not seem promising in this case. At least, simple filtering by source will not be enough to sort the texts.

### Unique Values

Among the 2834 rows of the training dataset we have 667 unique URLs, covered by 15 types of licenses. The values in the remaining columns are all unique, including the `target` and `standard_error` columns, which hold continuous numerical values. Given what we already know about how the information was collected and ranked, this seems reasonable. 

667 unique URLs out of 830 URLs in total means that some excerpts share a URL, and we might expect their targets to be very close to each other. But after testing this assumption, we found that only two excerpts (`195bb7384`, `7a1723dd0`) refer to the same web page. Another 164 texts refer to https://www.africanstorybook.org/, which can be explained by how the API of this site works (all books and the pages within them are linked to the same URL).

Finally, in the training dataset, we have one row in which the target and standard error values are the same and equal to zero. As [was mentioned by one of the competition hosts](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236403), it’s nothing but the baseline for all other comparisons. 

**[Back to Table of Contents](#contents)**

## <a id="test">Test Dataset</a>

---

In [None]:
test_df = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
test_df.head()

In [None]:
test_df.info()

In [None]:
print('Test Set: Missing and Unique Values Summary')
tab_data(test_df)

## Test Dataset Summary

---

The public test set has only 7 entries with 4 columns (`id`, `url_legal`, `license`, `excerpt`), but this shouldn't mislead you: the size of the hidden test set, as [was mentioned by the competition host](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236335), is around 2000.

**[Back to Table of Contents](#contents)**

## <a id="target">Target and Standard Error</a>

---


In [None]:
def plot_error_vs_target(df, x, y, ax):
    x_label = x.replace('_', ' ').title()
    y_label = y.replace('_', ' ').title()
    # Pass x and y by keyword: recent seaborn versions reject positional vectors
    sns.scatterplot(
        x=df[x],
        y=df[y],
        color=PAL[8],
        alpha=.7,
        ax=ax
    )
    sns.kdeplot(
        x=df[x],
        y=df[y],
        levels=4,
        color=PAL[0],
        linewidths=2,
        ax=ax
    )
    ax.set_title(
        f'{y_label} vs {x_label}', 
        fontsize=20, 
        fontweight='bold', 
        fontfamily='serif'
    )
    ax.set_xlabel(f'{x_label}')
    ax.set_ylabel(f'{y_label}')
    
    if y == 'standard_error':
        ax.set_ylim([0.4, None])
    if x == 'target':
        limit_x = [
            round(df[x].min(), 2),
            round(df[x].max(), 2)
        ]
        ax.set_xticks(list(ax.get_xticks()) + limit_x)
        ax.set_xlim([-4, 2])

def plot_grid(df, feats, n_rows=3, n_cols=2, clr=(0, 1, 7, 8)):
    
    fig = plt.figure(figsize=(n_cols*7, n_rows*7))
    fig.patch.set_facecolor(PAL[4])
    
    grid = plt.GridSpec(n_rows, n_cols)
    
    means = {feat: round(df[feat].mean(), 2) for feat in feats}
    
    for col, (feat, mean) in enumerate(means.items()):
        # Reverse the color order for the second column on a copy, so the
        # default argument is never mutated between calls
        clr = sorted(clr, reverse=True) if col == 1 else list(clr)
        for row in range(n_rows):
            ax = plt.subplot(grid[row, col])
            label = feat.replace('_', ' ')
            if row == 0:
                sns.histplot(
                    df[feat],
                    color=PAL[clr[3]],
                    alpha=0.8,
                    bins=20,
                    ax=ax,
                    label=label,
                    kde=True,
                )
                ax.axvline(
                    mean,
                    color=PAL[clr[0]],
                    linestyle='--',
                    linewidth=2,
                    label=f'mean: {mean}',
                )
                ax.legend(loc='upper left')
                ax.set_xlabel(label.title())
                ax.set_title(
                    f'{label.title()} Distribution',
                    fontsize=20,
                    fontweight='bold',
                    fontfamily='serif'
                )
            elif row == 1:
                sns.boxplot(
                    df[feat],
                    orient='h',
                    color=PAL[clr[2]],
                    ax=ax
                )
                ax.set_xlabel(label.title())
                ax.set_title(
                    f'Box Plot {label.title()}',
                    fontsize=20, 
                    fontweight='bold', 
                    fontfamily='serif'
                )
            elif row == 2:
                ax =  plt.subplot(grid[row, :])
                plot_error_vs_target(df, feats[0], feats[1], ax)
    
    plt.show()


In [None]:
plot_grid(train_df, ['target', 'standard_error'])

## Target and Standard Error Summary

---

The target distribution is very close to normal, with values ranging from -3.68 to 1.71 and a mean of -0.96. Most of the values in the standard error distribution range from 0.43 to 0.57 with a mean of 0.49, and, as the box plot shows, there are quite a few outliers.

Assuming that the number of pairwise comparisons is the same for all passages, the outliers in the standard error distribution can be explained by the distribution of the evaluators. As mentioned earlier, excerpts were assessed by teachers in grades 3–12. Most of them teach in grades 6-10, while a minority work with students in grades 3-5 and 11-12. Disagreement between these two minority subgroups may cause the outliers, which correspond to texts at both extremes of complexity.
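
The box plot marks as outliers the points lying more than 1.5 interquartile ranges away from the quartiles. A small sketch of the same rule, so that such rows can be selected and inspected, is shown below. The helper name `iqr_outliers` and the toy numbers are my own; in the notebook the input would be `train_df['standard_error']`.

```python
import numpy as np


def iqr_outliers(values, whis=1.5):
    """Boolean mask of points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)


# Made-up standard errors, for illustration only
toy_se = [0.45, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.65]
mask = iqr_outliers(toy_se)
print(np.asarray(toy_se)[mask])
```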

**[Back to Table of Contents](#contents)**

## <a id="text_level">Text Type and Complexity Level</a>

---

In [None]:
# Binning the target to three levels
target_bins = [i for i in range(-4, 3, 2)]
target_labels = ['complex', 'medium', 'simple']
train_df['level'] = pd.cut(
    train_df['target'], bins=target_bins, labels=target_labels
).astype('str')

# Set text types
train_df['text_type'] = 'classic'
train_df.loc[train_df['url_legal'].notna(), 'text_type'] = 'modern'

# Count level by text type and merge with train dataframe
level_count = train_df.groupby(['text_type', 'level']).agg({'level': 'count'})
level_count.columns = ['level_count']
train_df = train_df.merge(level_count, how='left', on=['text_type', 'level'])

# Create plot supporting dataframe
level_df = train_df[['level', 'text_type', 'level_count']].copy()
level_df = level_df.drop_duplicates().reset_index(drop=True)
level_df = level_df.pivot(index='text_type', columns='level', values='level_count')

# Get percentage values
totals = level_df.sum(axis=1).tolist()
levels = [i for i in level_df]
for level in levels:
    level_df[f'{level}_%'] = (level_df[level] / totals * 100).astype('float16')

#level_df.head()

In [None]:
# Create figure
fig = plt.figure(figsize=(15, 15))
fig.patch.set_facecolor(PAL[4])

# Define grid and subplots
grid = plt.GridSpec(2, 2)
ax1 = plt.subplot(grid[0, 0])
ax2 = plt.subplot(grid[0, 1])
ax3 = plt.subplot(grid[1, :])

count_plot = level_df[['simple', 'medium', 'complex']].plot(
    kind='bar',
    stacked=True,
    color=[PAL[i] for i in [8, 1, 0]],
    alpha=0.8,
    rot=0,
    xlabel='Text Type',
    ylabel='Count',
    ax=ax1,
)

for container in count_plot.containers:
    count_plot.bar_label(
        container,
        label_type='center',
        fontsize=12,
    )

ax1.set_title(
    'Level Count vs Text Type',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif',
)
ax1.legend(title='Text Level')

percent_plot = level_df[['simple_%', 'medium_%', 'complex_%']].plot(
    kind='bar',
    stacked=True,
    color=[PAL[i] for i in [8, 1, 0]],
    alpha=0.7,
    rot=0,
    xlabel='Text Type',
    ylabel='Percent',
    ax=ax2,
)

for container in percent_plot.containers:
    percent_plot.bar_label(
        container,
        label_type='center',
        fontsize=12,
        fmt='%.2f%%'
    )

ax2.set_title(
    'Level Percentage vs Text Type',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif',
)
ax2.get_legend().remove()


# Plot target distribution by text type
sns.kdeplot(
    x=train_df['target'],
    hue=train_df['text_type'],
    fill=True,
    palette=([PAL[i] for i in [1, 7]]),
    lw=0.1,
    alpha=0.8,
    ax=ax3,
)

ax3.set_title(
    'Text Type Distributions Comparison',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif',
)
ax3.set_xlabel('Target')
ax3.get_legend().set_title('Text Type')

plt.show()

## Text Type and Complexity Level Summary

---

In the graphs above, the excerpts are filtered by source existence and sorted by level from simple to complex. The texts without sources are labeled "classic", and the passages with URLs are called "modern". This is just an assumption that needs to be refined and tested during the prototyping phase before it can be used for feature generation.

In the "modern" text distribution (excerpts from the Internet), there are almost half as many complex passages and almost twice as many simple ones as in the "classic" text distribution (excerpts without a source). To find a possible reason for this, let's take a closer look at the `url_legal` column, extract the domain names, and examine the complexity level of the texts from each domain.
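
As a side note, the per-type percentages can also be computed more compactly with `pd.crosstab`. The sketch below runs on a tiny made-up frame (the rows are invented for illustration); with the real data one would pass `train_df['text_type']` and `train_df['level']` instead.

```python
import pandas as pd

# Invented rows, only to show the shape of the computation
toy = pd.DataFrame({
    'text_type': ['classic'] * 4 + ['modern'] * 4,
    'level': [
        'complex', 'complex', 'medium', 'simple',
        'complex', 'medium', 'simple', 'simple',
    ],
})

# Row-normalized counts: the share of each level within each text type
share = pd.crosstab(toy['text_type'], toy['level'], normalize='index') * 100
print(share.round(1))
```

With `normalize='index'` each row sums to 100, which matches the stacked percentage bars above.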

**[Back to Table of Contents](#contents)**

## <a id="domain">Domain Names</a>

---


In [None]:
# Get domain name
def get_domain(url):
    parsed_url = urllib.parse.urlparse(url)
    domain = parsed_url.netloc
    return domain


train_df.loc[train_df['url_legal'].notna(), 'domain'] = train_df.loc[
    train_df['url_legal'].notna(), 'url_legal'
].apply(get_domain)

# Domain count
domain_count = train_df['domain'].value_counts()
domain_dc = domain_count.to_dict()

train_df['domain_count'] = train_df['domain'].map(domain_dc)

print(tabulate(domain_dc.items(), headers=['Domain Name', 'Count']))

In [None]:
wiki = ['simple.wikipedia.org', 'en.wikipedia.org', 'en.wikibooks.org']
kids = ['kids.frontiersin.org', 'www.africanstorybook.org', 'freekidsbooks.org']
train_df['wiki_kids'] = train_df['domain']
train_df['wiki_kids'] = train_df['wiki_kids'].replace(wiki, 'Wikipedia')
train_df['wiki_kids'] = train_df['wiki_kids'].replace(kids, 'Literature for Kids')

In [None]:
# Create figure
fig = plt.figure(figsize=(14, 21))
fig.patch.set_facecolor(PAL[4])

# Define grid and subplots
grid = plt.GridSpec(3, 2)
ax1 = plt.subplot(grid[0, :])
ax2 = plt.subplot(grid[1, :])
ax3 = plt.subplot(grid[2, :])

# Domain plot
domain_plot = domain_count.plot(kind='barh', color=PAL[8], alpha=0.8, ax=ax1)
ax1.set_title(
    'Domain Names',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif'
)

ax1.set_xlabel('Count')

for container in domain_plot.containers:
    domain_plot.bar_label(
        container,
        fontsize=12,
        color=PAL[0],
        padding=4,
    )

ax1.invert_yaxis()

# Most common domains vs target
sns.kdeplot(
    x=train_df['target'],
    hue=train_df.loc[train_df['domain_count'] > 100]['domain'],
    fill=True,
    palette=([PAL[i] for i in [1, 0, 7, 8]]),
    lw=0.1,
    alpha=0.8,
    ax=ax2,
)

ax2.set_title(
    'Target Distributions of the Most Common Domain Names',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif',
)
ax2.set_xlabel('Target')
ax2.get_legend().set_title('Domain Names')


# Wikipedia and Kids Literature vs target
sns.kdeplot(
    x=train_df['target'],
    hue=train_df.loc[train_df['wiki_kids'].isin(['Wikipedia', 'Literature for Kids'])]['wiki_kids'],
    fill=True,
    palette=([PAL[i] for i in [1, 8]]),
    lw=0.1,
    alpha=0.8,
    ax=ax3,
)

ax3.set_title(
    'Wikipedia and Literature for Kids',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif',
)
ax3.set_xlabel('Target')
ax3.get_legend().set_title('Domain Groups')

plt.show()

## Domain Names Summary

---

We can see that most of the URLs (over 87%) come from the top four domain names. In addition, among the domains, we can distinguish two groups of three domains each, together representing 91% of all texts with URLs. The Wikipedia group ("simple.wikipedia.org", "en.wikipedia.org", "en.wikibooks.org") provides a total of 380 excerpts, and its target distribution almost repeats that of the whole training dataset. The other three domains ("kids.frontiersin.org", "www.africanstorybook.org", "freekidsbooks.org"), which appear to be sources of children's literature, give us another 374 texts with a complexity level shifted towards simple. The large amount of children's literature can explain the previously observed differences in the distribution of "modern" and "classic" texts.
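
Shares like "the top four domains cover over 87% of the URLs" can be computed with `value_counts(normalize=True)`. Here is a small sketch on placeholder URLs (the `example.org` addresses are invented); with the real data the input would be the non-missing values of `train_df['url_legal']`.

```python
import urllib.parse

import pandas as pd

# Placeholder URLs, not taken from the dataset
urls = pd.Series([
    'https://a.example.org/page1',
    'https://a.example.org/page2',
    'https://a.example.org/page3',
    'https://b.example.org/page1',
    'https://b.example.org/page2',
    'https://c.example.org/page1',
])

domains = urls.map(lambda u: urllib.parse.urlparse(u).netloc)

# Fraction of URLs per domain, sorted from most to least common
domain_share = domains.value_counts(normalize=True)

# Combined share of the two most common domains
top_share = domain_share.head(2).sum()
print(f'Top 2 domains cover {top_share:.1%} of the URLs')
```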

**[Back to Table of Contents](#contents)**

## <a id="licenses">Licenses</a>

---


In [None]:
license_count = train_df['license'].value_counts()
license_dc = license_count.to_dict()
train_df['license_count'] = train_df['license'].map(license_dc)

print(tabulate(license_dc.items(), headers=['License', 'Count']))

In [None]:
# Create figure
fig = plt.figure(figsize=(15, 15))
fig.patch.set_facecolor(PAL[4])

# Define grid and subplots
grid = plt.GridSpec(2, 1)
ax1 = plt.subplot(grid[0, 0])
ax2 = plt.subplot(grid[1, 0])

# Licenses plot
license_plot = license_count.plot(
    kind='barh',
    color=PAL[0],
    alpha=0.8,
    ax=ax1,
)

ax1.set_title(
    'Licenses',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif'
)

ax1.set_xlabel('Count')

for container in license_plot.containers:
    license_plot.bar_label(
        container,
        fontsize=12,
        color=PAL[8],
        padding=4,
    )

ax1.invert_yaxis()


# Most common licenses vs target
sns.kdeplot(
    x=train_df['target'],
    hue=train_df.loc[train_df['license_count'] > 100]['license'],
    fill=True,
    palette=([PAL[i] for i in [7, 0, 8]]),
    lw=0.1,
    alpha=0.8,
    ax=ax2,
)

ax2.set_title(
    'Target Distributions of the Most Common Licenses',
    fontsize=20,
    fontweight='bold',
    fontfamily='serif',
)
ax2.set_xlabel('Target')

plt.show()

In [None]:
domain_4_dc = train_df[train_df.license == 'CC BY 4.0']['domain'].value_counts().to_dict()
print(tabulate(domain_4_dc.items(), headers=['Domain', 'CC BY 4.0']))

## License Findings

---

Three licenses cover almost 94% of all excerpts provided with legal information. The most common is CC BY 4.0, with a distribution very similar to that of the "Literature for Kids" domain group we examined earlier, and a quick check of the domains under this license confirms the connection. However, legal information does not appear in the hidden test set, so it is even less useful than URLs and our search for a modern/classic pattern.

## So What About Modern and Classic Excerpts?

---

After exploring the basics of the training dataset, we can admit that continuing to search in this direction is not the best idea, and here's why:

Most "modern" excerpts come from just a few websites, so the target distribution is biased, and its shift towards simple texts is not a useful finding. Even if all texts in the hidden test set had URLs and those URLs matched the ones we observed in the training set, it is unlikely that they would follow the same distribution. More likely, we would also get excerpts from rare sources such as osu.edu, ck12.org, etc., and this part of the so-called "modern" excerpts would be different.

In the next step, we'll finally take a close look at the excerpts themselves and use the concept of modern/classic text types one last time.

**[Back to Table of Contents](#contents)**

## <a id="excerpts">Excerpts</a>

___

First, let's define a function to generate random samples with different text types and complexity levels, and extract basic metadata from the excerpts to build statistics.

In [None]:
def get_samples(df, col_1='level', col_2='text_type'):
    samples = {}
    level = df[col_1].unique().tolist()
    level.sort(reverse=True)
    text_type = df[col_2].unique().tolist()
    for i in level:
        for j in text_type:
            idx = df.loc[(df[col_1] == i) & (df[col_2] == j)].sample().index[0]
            target = round(df['target'][idx], 2)
            excerpt = df['excerpt'][idx]
            samples[f'{i} {j}'] = [target, excerpt]
    return samples

## Get Metadata

---

In [None]:
# Extract basic metadata from excerpts
def get_meta(df, col):
    # Sentences are approximated by counting periods
    df['sentences_per_excerpt'] = df[col].apply(lambda x: x.count('.'))
    df['words_per_excerpt'] = df[col].str.split().map(len)
    df['characters_per_excerpt'] = df[col].apply(len)

    # Split on periods and drop the trailing fragment after the last period
    df['words_per_sentence'] = df[col].str.split('.').apply(
        lambda x: np.mean([len(i.split()) for i in x[:-1]])
    )
    df['characters_per_sentense'] = df[col].str.split('.').apply(
        lambda x: np.mean([len(i) for i in x[:-1]])
    )
    df['characters_per_word'] = df[col].str.split().apply(
        lambda x: np.mean([len(i) for i in x])
    )
    return df

train_df = get_meta(train_df, 'excerpt')

## <a id="reading">Viewing Excerpts</a>

---

To study the passages, we will generate and print random samples of texts of different types and complexity levels. Each time you re-run the cell below, you will receive a new set of excerpts and can compare them one by one. 

In [None]:
samples = get_samples(train_df)
n_words = train_df['words_per_excerpt'].tolist()

print(
    colored(
        f'Train contains {train_df.shape[0]}' 
        + f' excerpts, ranging from {min(n_words)} to {max(n_words)}' 
        + f' (avg {round(np.mean(n_words))}) words long.',
        'yellow',
        attrs=['bold']
    )
)

for level, (target, text) in samples.items():
    if target > 0:
        color = 'green'
    elif target < -2:
        color = 'red'
    else:
        color = 'yellow'
    print(colored('---' * 10, color))
    level = colored(level.upper(), color, attrs=['bold'])
    target = colored(target, color, attrs=['bold'])
    print(f'Train sample of {level} text with the target value: {target}')
    print(colored('---' * 10, color))
    print(text)

## Plot Extracted Metadata

---

In [None]:
def plot_stats(df, features, target, clr=[0,8]):

    # Mean values
    means = {}
    for feat in features:
        means[feat] = round(df[feat].mean())

    # Create figure
    n_rows = len(features)
    n_cols = 2
    fig = plt.figure(figsize=(n_cols * 7, n_rows * 7))
    fig.patch.set_facecolor(PAL[4])

    # Define grid and subplots
    grid = plt.GridSpec(n_rows, n_cols)
    for row, (feat, mean) in enumerate(means.items()):
        title = feat.replace('_', ' ').title().replace(' Per ', ' per ')
        for col in range(n_cols):
            ax = plt.subplot(grid[row, col])
            if col == 0:
                sns.histplot(
                    df[feat],
                    color=PAL[clr[1]],
                    alpha=0.8,
                    bins=10,
                    ax=ax,
                    label=title.lower(),
                    kde=True,
                )
                ax.axvline(
                    mean,
                    color=PAL[clr[0]],
                    linestyle='--',
                    linewidth=2,
                    label=f'mean: {mean}',
                )
                ax.legend(loc='upper right')
                ax.set_xlabel(title)
                ax.set_ylabel('Excerpt Count')
                ax.set_title(
                    title,
                    fontsize=20,
                    fontweight='bold',
                    fontfamily='serif'
                )

            elif col == 1:
                # Pass x and y by keyword for compatibility with recent seaborn
                sns.regplot(
                    x=df[feat],
                    y=df[target],
                    order=1,
                    color=PAL[clr[1]],
                    line_kws={'color': PAL[clr[0]]},
                    scatter_kws={'alpha': 0.3},
                    ax=ax,
                )
                ax.set_xlabel(title)
                ax.set_ylabel(target.capitalize())
                ax.set_title(
                    f'{target.capitalize()} vs {title}',
                    fontsize=20,
                    fontweight='bold',
                    fontfamily='serif',
                )

            else:
                print(f'Column with value {col} is out of plot grid')

    plt.show()

In [None]:
excerpt_features = [i for i in train_df.columns[-6:]]
plot_stats(train_df, excerpt_features, 'target')

## Excerpts Summary 

---

By **comparing passages by hand**, we can confirm that there are visible readability boundaries between the groups of texts divided by complexity level. As for the assumptions about text types, we do not find great differences between the so-called "modern" and "classic" texts within the same groups.

We can also note that the `target` values are not always reliable. Sometimes texts with different target values are hard to distinguish from each other; on the other hand, some pairs with nearly the same target scores clearly differ in ease of reading. Of course, this is a subjective observation, but from time to time we meet obvious cases. For example, look at the excerpts `a666c1db9` and `b55026bd9` with almost the same target scores:

<div class="alert alert-block alert-info">
<b>Target: 0.2316</b> 
<p>In capitalism, people may sell or lend their property, and other people may buy or borrow it. If one person wants to buy, and another person wants to sell to them, they do not need to get permission from higher power. People can have a market (buying and selling with each other) without anyone else telling them to. People who own capital are sometimes called capitalists (people who support capitalism are called capitalists, too). They can hire anyone who wants to work in their factories, shops or lands for them for the pay they offer.The word capital can be used to mean things that produce more things or money. For example, lands, factories, shops, tools and machines are capital. If someone has money that can be invested, that money is capital too. In capitalist systems, many people are workers (or proletarians). They are employed to earn money for living. People can choose to work for anyone who will hire them in a free market.</p>
</div>

<div class="alert alert-block alert-warning">
<b>Target: 0.2303</b> 
<p>This is Cat. This is Dog. Cat and Dog live in a house. A house with a door. A house with a roof. Cat and Dog have a ball. The ball is red and blue and green. Cat and Dog play with the ball. Cat throws the ball to Dog. Dog catches the ball. Dog throws the ball to Cat. Cat catches the ball. Then Cat throws the ball very high. Oh! oh! The ball is on the roof. The ball is on the roof of the house. Cat and Dog can see the ball. Cat and Dog cannot get to the ball. Cat and Dog cry. Then Elephant comes by. Elephant is big. Elephant can see the ball. Elephant can get to the ball. Elephant gets the ball from the roof. Elephant takes the ball from the roof of the house. Elephant gives the ball to Cat and Dog. Cat and Dog smile. Elephant smiles. Cat and Dog and Elephant smile.</p>
</div>

Hopefully, these cases are not widespread and can be considered noise. But we need to think about how to identify and handle such errors, taking into account that the same noise is present in the test dataset.

**As for the statistics** we've extracted from the raw texts, almost all of them correlate with the target and can serve as signals of text complexity. At first glance, some relationships are not obvious, such as `Sentences per Excerpt`, where more sentences in a passage go together with easier reading. But if we take a closer look at related features such as `Words per Excerpt`, `Words per Sentence`, etc., everything is consistent. In our case, fewer sentences in a passage means longer and more complex sentences; therefore, such texts are less readable and more difficult for students.

**[Back to Table of Contents](#contents)**

# <a id="further">Further Exploration</a>

---



## <a id="readability">Readability, Complexity, and Grade Level</a>

---

For this task we'll use the [Textstat](https://pypi.org/project/textstat/) library, which helps us determine the readability, complexity, and grade level of the excerpts. Here is a brief description of the features we'll use:

- `flesch_reading_ease`: returns the [Flesch Reading Ease Score](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease). While the maximum score is 121.22, there is no limit on how low the score can be. A negative score is valid;
- `flesch_kincaid_grade`: returns the [Flesch-Kincaid Grade](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level) of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document;
- `gunning_fog`: returns the [FOG index](https://en.wikipedia.org/wiki/Gunning_fog_index) of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document;
- `smog_index`: returns the [SMOG index](https://en.wikipedia.org/wiki/SMOG) of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. Texts of fewer than 30 sentences are statistically invalid, because the SMOG formula was normed on 30-sentence samples; textstat requires at least 3 sentences for a result;
- `automated_readability_index`: returns the [ARI (Automated Readability Index)](https://en.wikipedia.org/wiki/Automated_readability_index) which outputs a number that approximates the grade level needed to comprehend the text. For example if the ARI is 6.5, then the grade level to comprehend the text is 6th to 7th grade;
- `coleman_liau_index`: returns the grade level of the text using the [Coleman-Liau Formula](https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index). This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document;
- `linsear_write_formula`: returns the grade level using the [Linsear Write Formula](https://en.wikipedia.org/wiki/Linsear_Write). This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document;
- `dale_chall_readability_score`: different from other tests, since it uses a lookup table of the most commonly used 3000 English words. Thus it returns the grade level using the [New Dale-Chall Formula](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula);
- `text_standard`: based upon all the above tests, returns the estimated school grade level required to understand the text. Optional float_output allows the score to be returned as a float. Defaults to False: *textstat.text_standard(text, float_output=False)*.
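
To make the first two scores less of a black box, here is a minimal sketch of the published Flesch Reading Ease and Flesch-Kincaid Grade formulas. The helpers `naive_syllables` and `naive_counts` are crude stand-ins invented for this illustration and are not how textstat counts words, sentences, or syllables, so the numbers will differ from the library's output.

```python
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease computed from raw counts."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level computed from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def naive_syllables(word):
    # Crude heuristic (assumption): one syllable per group of consecutive vowels
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))


def naive_counts(text):
    # Crude heuristics (assumption): split sentences on ., ! and ?
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return len(words), len(sentences), syllables


easy = "This is Cat. This is Dog. Cat and Dog live in a house."
counts = naive_counts(easy)
print('words, sentences, syllables:', counts)
print('reading ease:', round(flesch_reading_ease(*counts), 2))
print('grade level: ', round(flesch_kincaid_grade(*counts), 2))
```

With one-syllable words in one-word sentences the ease formula gives 206.835 - 1.015 - 84.6 = 121.22, which is where the maximum quoted above comes from.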

In [None]:
def get_stat(df, col):
    df['flesch_reading_ease'] = df[col].apply(lambda x: textstat.flesch_reading_ease(x))
    df['flesch_kincaid_grade'] = df[col].apply(lambda x: textstat.flesch_kincaid_grade(x))
    df['gunning_fog'] = df[col].apply(lambda x: textstat.gunning_fog(x))
    df['smog_index'] = df[col].apply(lambda x: textstat.smog_index(x))
    df['automated_readability_index'] = df[col].apply(lambda x: textstat.automated_readability_index(x))
    df['coleman_liau_index'] = df[col].apply(lambda x: textstat.coleman_liau_index(x))
    df['linsear_write_formula'] = df[col].apply(lambda x: textstat.linsear_write_formula(x))
    df['dale_chall_readability_score'] = df[col].apply(lambda x: textstat.dale_chall_readability_score(x))
    df['text_standard'] = df[col].apply(lambda x: textstat.text_standard(x, float_output=True))
    return df

In [None]:
train_df = get_stat(train_df, 'excerpt')
textstat_features = train_df.columns[-9:].tolist()
plot_stats(train_df, textstat_features, 'target', clr=[1, 7])

**[Back to Table of Contents](#contents)**

## <a id="prep">Text Preprocessing Pipeline</a>

---

For further analysis, we need to preprocess our excerpts. First, we'll expand contractions (this reduces data redundancy and makes later processing computationally cheaper), then convert the strings to lowercase, tokenize them, and finally remove all stopwords. For EDA purposes this should be sufficient, so our simple preprocessing pipeline consists of four steps.

In [None]:
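# Tokenize: keep tokens that contain at least one letter.
# Note: \p{L} is not supported by the standard `re` module, so this pattern
# assumes the third-party `regex` package is imported as `re`.
def tokenize(text):
    return re.findall(r'[\w-]*\p{L}[\w-]*', text)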


# Remove stop words (a set makes membership tests fast)
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stopwords]


# Pipeline
pipeline = [contractions.fix, str.lower, tokenize, remove_stopwords]


# Preprocess
def preprocess(text, pipeline):
    tokens = text
    for transform in pipeline:
        tokens = transform(tokens)
    return tokens

train_df['tokens'] = train_df['excerpt'].apply(preprocess, pipeline=pipeline)
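
To see what each step of the pipeline does to a single sentence, here is a small self-contained sketch. `CONTRACTIONS`, `STOPWORDS`, and `expand_contractions` are tiny hypothetical stand-ins for `contractions.fix` and the NLTK stop word list, and the tokenizer uses `[^\W\d_]` as a standard-library approximation of `\p{L}`.

```python
import re

# Hypothetical stand-ins for contractions.fix and the NLTK stop word list
CONTRACTIONS = {"it's": "it is", "can't": "cannot"}
STOPWORDS = {"it", "is", "on", "the", "and", "to"}
CONTRACTION_RE = re.compile('|'.join(map(re.escape, CONTRACTIONS)), flags=re.IGNORECASE)


def expand_contractions(text):
    # Replace every known contraction with its expanded form
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def tokenize_demo(text):
    # [^\W\d_] matches a single letter, so every token contains at least one letter
    return re.findall(r'[\w-]*[^\W\d_][\w-]*', text)


def remove_stopwords_demo(tokens):
    return [token for token in tokens if token not in STOPWORDS]


demo_pipeline = [expand_contractions, str.lower, tokenize_demo, remove_stopwords_demo]

# Apply the steps one by one and print the intermediate results
result = "It's on the roof, and Cat can't get to the ball!"
for step in demo_pipeline:
    result = step(result)
    print(f'{step.__name__:>22}: {result}')
```

The first two steps return strings and the last two return lists of tokens, which is why the order of the steps in the pipeline matters.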

## <a id="word_freq">Word Frequency Diagram</a>

---


In [None]:
def count_words(df, column='tokens', preprocess=None, min_freq=2):

    # Process tokens and update counter
    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(tokens)

    # Create counter and run through all data
    counter = Counter()
    df[column].map(update)

    # Transform counter into data frame
    freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['frequency'])
    freq_df = freq_df.query('frequency >= @min_freq')
    freq_df.index.name = column
    
    return freq_df.sort_values('frequency', ascending=False)

In [None]:
freq_df = count_words(train_df)

In [None]:
def plot_freq(df, n_items=20):
    
    fig = plt.figure(figsize=(15, 10))
    fig.patch.set_facecolor(PAL[4])
    ax = plt.subplot()

    diagram = df.head(n_items).plot(
        kind='barh',
        color=PAL[8],
        alpha=0.7,
        ax=ax
    )
    
    title = df.index.name.title()
    ax.set_title(
             f'{n_items} Most Common {title}', 
             fontsize=20, 
             fontweight='bold', 
             fontfamily='serif'
    )

    ax.set_xlabel('Count')
    ax.set_ylabel(title)

    for container in diagram.containers:
        diagram.bar_label(
            container,
            fontsize=12,
            color=PAL[0],
            padding=4,
        )
    
    ax.invert_yaxis()
    plt.show()

In [None]:
plot_freq(freq_df)

**[Back to Table of Contents](#contents)**

## <a id="word_cloud">Word Clouds</a>

---

In [None]:
def plot_wordcloud(word_freq, max_words=200):
    
    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color='white',
        max_font_size=150,
        max_words=max_words
    )
    
    wordcloud.generate_from_frequencies(word_freq)
    
    fig = plt.figure(figsize=(15, 7))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

In [None]:
plot_wordcloud(freq_df.frequency.to_dict(), max_words=200)

## <a id="ngrams">Ngrams Analysis</a>

---


In [None]:
def ngrams(tokens, n=2):
    return [' '.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]
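
The one-liner above works by zipping the token list with shifted copies of itself. A short illustration on a made-up token list (the `demo_` names exist only for this example):

```python
def ngrams(tokens, n=2):
    return [' '.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]


demo_tokens = ['cat', 'dog', 'play', 'ball']

# For n=2 the token list is paired with a copy shifted by one position;
# zip stops at the shortest copy, so no incomplete n-gram is produced
demo_shifted = [demo_tokens[i:] for i in range(2)]
print(demo_shifted)

demo_bigrams = ngrams(demo_tokens, n=2)
demo_trigrams = ngrams(demo_tokens, n=3)
print(demo_bigrams)
print(demo_trigrams)
```

A list of `k` tokens yields `k - n + 1` n-grams, and an empty list when it holds fewer than `n` tokens.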

In [None]:
train_df['bigrams'] = train_df['tokens'].apply(ngrams, n=2)
bigrams = count_words(train_df, 'bigrams')

In [None]:
plot_freq(bigrams)

## Bigrams Wordcloud

In [None]:
plot_wordcloud(bigrams.frequency.to_dict(), max_words=150)

In [None]:
train_df['trigrams'] = train_df['tokens'].apply(ngrams, n=3)
trigrams = count_words(train_df, 'trigrams')

In [None]:
plot_freq(trigrams)

## Trigrams Wordcloud

In [None]:
plot_wordcloud(trigrams.frequency.to_dict(), max_words=150)

**[Back to Table of Contents](#contents)**

## <a id="corpus_stats">Corpus Statistics</a>

---


In [None]:
freq_dict = freq_df.frequency.to_dict()

rare_words = {k: v for k, v in freq_dict.items() if v <= 5}
print('Examples of Rare Words:', list(rare_words.keys())[:5])

common_words = {k: v for k, v in freq_dict.items() if v > 100}
print('Examples of Common Words:', list(common_words.keys())[:5])

long_df = count_words(
    train_df,
    column='excerpt',
    preprocess=lambda text: re.findall(r'\w{6,}', text),
)
long_words = long_df.frequency.to_dict()
print('Examples of Long Words:', list(long_words.keys())[:5])

In [None]:
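def count_freq(corpus, vocab=rare_words):
    # Count how many distinct vocabulary words occur in the token list;
    # converting the tokens to a set keeps each lookup fast
    corpus = set(corpus)
    counter = 0
    for word in vocab:
        if word in corpus:
            counter += 1
    return counter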

In [None]:
%%time
def get_tokenstat(df, col):
    df['tokens_per_corpus'] = df[col].map(len)
    df['characters_per_corpus'] = df[col].apply(
        lambda x: sum([len(i) for i in x])
    )
    df['characters_per_token'] = df[col].apply(
        lambda x: [len(i) for i in x]).map(
        lambda x: np.mean(x)
    )
    df['rare_tokens'] = df[col].apply(count_freq)
    df['common_tokens'] = df[col].apply(count_freq, vocab=common_words)
    df['long_tokens'] = df[col].apply(count_freq, vocab=long_words)
    
    return df

train_df = get_tokenstat(train_df, 'tokens')

In [None]:
token_features = train_df.columns[-6:].tolist()
plot_stats(train_df, token_features, 'target', clr=[8,0])

## Corpus Statistics Summary

---

**[Back to Table of Contents](#contents)**

## <a id="corr">Correlation Matrix</a>

---

In [None]:
features = excerpt_features + textstat_features + token_features

In [None]:
# Compute the correlation matrix
corr = train_df[features + ['target']].corr()

# Create figure
fig, ax = plt.subplots(figsize=(15, 15))

# Generate colormap
cmap = sns.color_palette('BrBG', 20, as_cmap=True)

# Draw and configure the heatmap
sns.heatmap(corr, cmap=cmap, annot=True, fmt='.2f', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.5})

plt.show()

**[Back to Table of Contents](#contents)**

# <a id="baseline">Simple Baseline</a>

---

So far we have twenty-one features that correlate with the target, so we can create our first baseline model. It could be one of the regressors from the standard scikit-learn library. Here we'll train and compare four of them: two linear (LinearRegression, Ridge) and two Bayesian (BayesianRidge, ARDRegression) regressors.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge, ARDRegression

In [None]:
X = train_df[features].values
y = train_df.target.values  # 1-D target, the shape the Bayesian regressors expect
X.shape, y.shape

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
models = [LinearRegression(), Ridge(), BayesianRidge(), ARDRegression()]
performance = {}
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    # RMSE is the square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
    performance[model] = rmse
print(tabulate(performance.items(), headers=['Model', 'RMSE']))
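
For reference, RMSE, the competition metric, is the square root of the mean squared error, so it is expressed in the same units as the target. A minimal pure-Python sketch of the metric (the `rmse` helper and the toy values are illustrative only and are not part of the notebook):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Every prediction is off by exactly 2, so the RMSE is 2 while the MSE is 4
toy_true = [0.0, 0.0, 0.0, 0.0]
toy_pred = [2.0, -2.0, 2.0, -2.0]
print(rmse(toy_true, toy_pred))
```

Because of the square root, an RMSE and an MSE computed on the same predictions are not directly comparable unless both equal 0 or 1.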

## <a id="trans">Feature Scaling and Transformation</a>

---

To increase the number of features, we can apply some basic preprocessing techniques such as scaling, binning, and linear dimensionality reduction. Here we'll use just a few of them; for more information on how they work, feel free to explore the scikit-learn documentation.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion

In [None]:
# Transformers 
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('mas', MaxAbsScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=20, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))

# Concatenate the results
fu = FeatureUnion(transforms)

In [None]:
trans_features = fu.fit_transform(train_df[features])
trans_features.shape
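
`FeatureUnion` fits every transformer on the same input and stacks the outputs side by side, so the width of the result is the sum of the individual output widths. A minimal sketch on random data, assuming scikit-learn and NumPy are installed (`X_demo` and `demo_union` are stand-ins, not the notebook's features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the feature matrix: 50 rows, 21 columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 21))

# Each scaler keeps the 21 input columns; PCA contributes n_components columns
demo_union = FeatureUnion([
    ('mms', MinMaxScaler()),
    ('ss', StandardScaler()),
    ('pca', PCA(n_components=7)),
])
demo_features = demo_union.fit_transform(X_demo)

# Expected width: 21 + 21 + 7 columns
print(demo_features.shape)
```

By the same arithmetic, if `features` holds the twenty-one columns mentioned earlier, the eight transformers above should produce 6 × 21 + 7 + 7 = 140 columns.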

**[Back to Table of Contents](#contents)**

## <a id="embeddings">Sentence Embeddings</a>

---

Another way to extend the feature space is to generate a sentence embedding for each excerpt in the dataset, which gives us several hundred new features. One handy library for this is Sentence Transformers. We'll use a small model here, which is more than enough for our purpose, but you can go ahead, play around, and choose the best one from over a hundred [pre-trained models](https://www.sbert.net/docs/pretrained_models.html) available. Some models are general-purpose, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`. More useful information can be found on [GitHub](https://github.com/UKPLab/sentence-transformers) and in the [Sentence Transformers documentation](https://www.sbert.net/docs/quickstart.html).

In [None]:
%%capture
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
%%time
train_embeddings = model.encode(train_df['excerpt'].tolist())

In [None]:
train_embeddings.shape

## <a id="improved">Performance with Extended Features</a>

---


In [None]:
# Concatenate obtained features with original
X = np.concatenate((trans_features, train_embeddings, X), axis=1)
y = train_df.target.values  # 1-D target, the shape the Bayesian regressors expect
X.shape, y.shape

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
models = [LinearRegression(), Ridge(), BayesianRidge(), ARDRegression()]
performance = {}
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    # RMSE is the square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
    performance[model] = rmse
print(tabulate(performance.items(), headers=['Model', 'RMSE']))

**[Back to Table of Contents](#contents)**

# <a id="final">Final Thoughts</a>

---

### Work in progress

**[Back to Table of Contents](#contents)**

# <a id="ref">References</a>

---

1. [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize)
2. [LaTeX Syntax](https://en.wikibooks.org/wiki/LaTeX/Mathematics)
3. [Mean Squared Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
4. [Additional Information About Dataset](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423)
5. [Bradley-Terry Model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
6. [How Bradley-Terry Model Works](https://www.kaggle.com/shaz13/code-how-bradley-terry-model-works/)
7. [Contractions](https://github.com/kootenpv/contractions)
8. [WordCloud API](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html#wordcloud.WordCloud)
9. [Textstat - Readability, Complexity, and Grade Level](https://pypi.org/project/textstat/)
10. [Sentence Transformers Documentation](https://www.sbert.net/docs/quickstart.html)

**[Back to Table of Contents](#contents)**