In [None]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")

# General idea

The dataset for this competition gives the source of each text. Annotated excerpts are extracted from a longer article, book, etc. When the source is available, general language models could be pre-trained on it (as proposed in other notebooks).

This notebook explores the available information to gain insight into the problem, and downloads the dataset sources.

## Summary
- general exploration
- (WIP) data download

# General exploration

Input data is distributed in two files, train.csv and test.csv. Both give the source of each excerpt in the column "url_legal":

In [None]:
train_data = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
train_data.info()

In [None]:
test_data = pd.read_csv("../input/commonlitreadabilityprize/test.csv")
test_data.info()

Only about 30% of training examples have a source URL, but that might be a good start for pretraining.
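
The share of sourced rows can be checked directly from the "url_legal" column. The sketch below uses a small hypothetical frame (`toy`, with made-up rows) in place of `train_data` so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for train_data: two of four rows have a source URL.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "url_legal": [
        "https://en.wikipedia.org/wiki/Moon",
        None,
        "https://simple.wikipedia.org/wiki/Sun",
        None,
    ],
})

# notna() marks rows with a source; the mean of that boolean mask is the share.
sourced_share = toy["url_legal"].notna().mean()
print(f"{sourced_share:.0%} of rows have a source URL")
```

On the real data, the same expression applied to `train_data["url_legal"]` should give the corresponding figure.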

Judging by the input data tab in the Kaggle notebook editor, the input data seems to come from a limited list of sources:

In [None]:
from urllib.parse import urlparse
train_data['base_url'] = train_data.apply(lambda row: urlparse(row.url_legal).netloc if not pd.isna(row.url_legal) else 'NA', axis=1)

In [None]:
fig = plt.figure(figsize=(25,10))
g = sns.countplot(x="base_url", data=train_data, palette="Set3")
# Rotate the labels seaborn already assigned instead of overwriting them
g.tick_params(axis="x", labelrotation=30)
sns.despine(left=True)
plt.show()

Excluding excerpts without a source URL:

In [None]:
fig = plt.figure(figsize=(25,10))
g = sns.countplot(x="base_url", data=train_data[train_data.base_url != "NA"], palette="Set3")
# Rotate the labels seaborn already assigned instead of overwriting them
g.tick_params(axis="x", labelrotation=30)
sns.despine(left=True)
plt.show()

Judging by their names, the sources might be correlated with the target score:

In [None]:
fig = plt.figure(figsize=(25,10))
g = sns.violinplot(x="base_url", y="target", data=train_data, palette="Set3", linewidth=1, scale="width")
# Rotate the labels seaborn already assigned instead of overwriting them
g.tick_params(axis="x", labelrotation=30)
sns.despine(left=True, bottom=True)
plt.show()

Comparing the target distributions per source:

In [None]:
fig = plt.figure(figsize=(25,10))
sns.ecdfplot(x="target", hue="base_url", data=train_data, palette="Set3", linewidth=2)
sns.despine(left=True, bottom=True)
plt.show()

Unsurprisingly, the data source seems to give some insight into readability.
Every data source is therefore needed for pretraining.
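
The impression given by the plots can also be checked numerically with a per-source summary of the target. This is a minimal sketch on a hypothetical frame (`toy`, with made-up scores); the column names "base_url" and "target" match the ones used above.

```python
import pandas as pd

# Hypothetical stand-in for train_data with made-up target values.
toy = pd.DataFrame({
    "base_url": ["en.wikipedia.org", "en.wikipedia.org",
                 "kids.frontiersin.org", "NA"],
    "target": [-1.0, -2.0, 0.5, 0.0],
})

# Count, mean and spread of the target for each source, lowest mean first.
summary = (
    toy.groupby("base_url")["target"]
    .agg(["count", "mean", "std"])
    .sort_values("mean")
)
print(summary)
```

Sources with a single excerpt get a missing standard deviation, so the "count" column is worth reading alongside the mean.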

Could a masked language model be trained on known sites with known readability, and readability then be estimated from the error between the masked-token probabilities and the ground truth?
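
One way to make this idea concrete is a pseudo-log-likelihood score: mask each token in turn, read the probability the model assigns to the true token, and average the negative log-probabilities. The sketch below only shows the scoring step. `mean_masked_nll` is a hypothetical helper, and the probabilities are made-up numbers standing in for the output of a masked language model, which is not run here.

```python
import math

def mean_masked_nll(true_token_probs):
    """Average negative log-probability of the ground-truth tokens.

    true_token_probs: probabilities (0 < p <= 1) that a masked language
    model assigned to the real token at each masked position
    (assumed to be precomputed elsewhere).
    """
    if not true_token_probs:
        raise ValueError("need at least one probability")
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

# Made-up probabilities for two excerpts scored by the same hypothetical model.
well_predicted = [0.9, 0.8, 0.95]
poorly_predicted = [0.2, 0.1, 0.3]

print(mean_masked_nll(well_predicted))
print(mean_masked_nll(poorly_predicted))
```

A lower score means the excerpt looks more like the text the model was trained on. Whether that score tracks the readability target would still have to be tested.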

# Data download for pretraining

This part downloads the source data into the output folder of this notebook for model pretraining.

Some data cannot be retrieved:
- Africanstorybooks.org sources are not linked here and cannot be retrieved.
- digitallibrary.io returns a 404 error.

[Work in progress.]
Some PDFs are present in the dataset. PDF parsing will come in the future.
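
Until PDF parsing is added, such links could at least be flagged so they can be skipped or counted separately. The sketch below is a simple heuristic based on the URL path; `looks_like_pdf` is a hypothetical helper and the URLs are illustrative only.

```python
from urllib.parse import urlparse

def looks_like_pdf(url):
    """Heuristic: True when the URL path ends with '.pdf' (case-insensitive)."""
    return urlparse(url).path.lower().endswith(".pdf")

# Illustrative URLs, not taken from the dataset.
print(looks_like_pdf("https://example.org/books/story.PDF"))
print(looks_like_pdf("https://en.wikipedia.org/wiki/PDF"))
```

Checking the `Content-Type` header of the response would likely be more reliable, since a PDF can be served from a URL without a `.pdf` suffix, but it requires a network request.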

In [None]:
!pip install beautifulsoup4

In [None]:
import requests
import os
from bs4 import BeautifulSoup

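# Clean up before running: remove only the text files written by a previous run
# (removing every entry would fail on directories and delete unrelated files)
for f in os.listdir("./"):
    if f.startswith("train_") and f.endswith(".txt"):
        os.remove(os.path.join("./", f))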

def extract_data(row):
    if not pd.isna(row.url_legal):
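        try:
            # A timeout keeps one unresponsive host from stalling the whole run
            response = requests.get(row.url_legal, timeout=30)
        except requests.RequestException:
            # No HTTP status is available when the request itself fails
            return None
        if response.status_code == 200: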
            soup = BeautifulSoup(response.text, "html.parser")
            text = ""

            if row.base_url in ["simple.wikipedia.org", "en.wikipedia.org", "en.wikibooks.org"]:
                contents = soup.find_all(id="mw-content-text")
                for content in contents:
                    paragraphs = content.find_all('p')
                    for paragraph in paragraphs:
                        text = text + paragraph.get_text()
            elif row.base_url in ["kids.frontiersin.org"]:
                contents = soup.find_all('div', class_="fulltext-content")
                for content in contents:
                    paragraphs = content.find_all('p')
                    for paragraph in paragraphs:
                        text = text + paragraph.get_text()
            #elif row.base_url in ["www.commonlit.org"]:
                # Not allowed ! see : https://www.kaggle.com/c/commonlitreadabilityprize/discussion/245665
                #contents = soup.find_all('div', class_="cl-text__excerpt-line-container")
                #for content in contents:
                #    paragraphs = content.find_all('p')
                #    for paragraph in paragraphs:
                #        text = text + paragraph.get_text()
                
            if len(text) > 0:
                # Explicit encoding so non-ASCII characters are written consistently
                with open(f'train_{row.base_url}_{row.id}.txt', 'w', encoding='utf-8') as file:
                    file.write(text)
                    
        return response.status_code
    return 404

train_data['data_status'] = train_data.apply(extract_data, axis=1)
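
Once the download has run, the "data_status" column can be summarised to see how many requests succeeded. The sketch below uses a hypothetical frame (`toy`) with made-up status codes so that it runs without network access.

```python
import pandas as pd

# Hypothetical stand-in for train_data['data_status'] with made-up codes.
toy = pd.DataFrame({"data_status": [200, 200, 404, 404, 404, 403]})

# Number of rows per HTTP status code, most frequent first.
status_counts = toy["data_status"].value_counts()
print(status_counts)
```

Note that `extract_data` also returns 404 for rows without a "url_legal" value, so on the real data the 404 count mixes missing URLs with pages that were actually not found.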