<h1>Generate extra data with web scraping</h1>

This is the first competition in which I've used transformers, and I've learned a lot from many Kagglers and researchers. One of the most inspiring sources has been this paper: <a href='https://arxiv.org/abs/1905.05583'>How to Fine-Tune BERT for Text Classification?</a>. In it, the authors show that further pretraining BERT on task-specific or domain-specific data can improve results when fine-tuning on a downstream task, especially when the dataset for that task is small. That is the case in this competition, which has only 2834 instances in the training set.

In this notebook, I'll show how to extract additional excerpts by web scraping <a href='https://simple.wikipedia.org/wiki/Main_Page'>simple.wikipedia.org</a>, a source already present in the training set, using two Python packages: <a href='https://pypi.org/project/requests/'>requests</a> and <a href='https://pypi.org/project/beautifulsoup4/'>bs4</a>.

In [None]:
# Imports

import bs4
from bs4 import BeautifulSoup
import requests
import json
import urllib
import pandas as pd
import os
import time
import re
import numpy as np

from tqdm.auto import tqdm
tqdm.pandas()

pd.options.display.max_colwidth = None

In [None]:
# Constants

INPUT_DIR = '../input/commonlitreadabilityprize'

# Min and max lengths (in characters) of the excerpts in the training set.
MIN_P_LENGTH = 669
MAX_P_LENGTH = 1341

LICENSE = 'CC BY-SA 3.0 and GFDL'     # License reference
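
The two length constants above are described as the minimum and maximum excerpt lengths in the training set. A minimal sketch of how such bounds could be measured, using a tiny made-up DataFrame in place of `train.csv` (the `toy` frame and its contents are illustrative assumptions, not competition data):

```python
import pandas as pd

# Toy stand-in for train.csv; only the 'excerpt' column matters here.
toy = pd.DataFrame({'excerpt': ['short text',
                                'a somewhat longer excerpt',
                                'mid-sized one']})

# Character length of every excerpt
lengths = toy['excerpt'].str.len()

# Smallest and largest lengths, as plain Python ints
min_len, max_len = int(lengths.min()), int(lengths.max())
print(min_len, max_len)
```

Running the same two lines on the real `excerpt` column should give the values to hard-code in `MIN_P_LENGTH` and `MAX_P_LENGTH`.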

There are already 196 excerpts in the training set from Simple Wikipedia. I'll filter those into a dataset for further processing.

In [None]:
data = pd.read_csv(os.path.join(INPUT_DIR, 'train.csv'))
data['url_legal'] = data['url_legal'].fillna('')    # Nulls have to be converted to '' for str.contains to work
simple_wiki = data[data.url_legal.str.contains('simple.wiki', regex=False)]    # Literal match, so '.' is not treated as a regex wildcard
print(f'There are {len(simple_wiki)} URLs from Simple Wikipedia')
simple_wiki.head()

<h2>Utilities</h2>

The next cell presents the two main utilities I use to scrape the pages.

In [None]:
# Join consecutive paragraphs greedily so their combined length is between MIN_P_LENGTH and MAX_P_LENGTH
def join_paragraphs(paragraphs):
    
    lengths = [len(p) for p in paragraphs]
    
    def candidate_length(i, j):
        return sum(lengths[i:j])
    
    result = []
    L = len(lengths)
    i = 0
    j = 1
    # The candidate is paragraphs[i:j]; j is exclusive, so it may reach L to include the last paragraph
    while j <= L:
        length = candidate_length(i, j)
        if MIN_P_LENGTH <= length <= MAX_P_LENGTH:
            result.append('\r\n'.join(paragraphs[i:j]))
            i = j
            j = j+1
        elif length > MAX_P_LENGTH:
            # Adding more paragraphs can only make it longer, so restart from the next paragraph
            i = i+1
            j = i+1
        else:
            j = j+1
    return result

# Get a list of the paragraphs and references to extra articles of an article (raw HTML, parsed here with BeautifulSoup)
def get_paragraphs_and_links(article):
    
    soup = BeautifulSoup(article, 'html.parser')
    
    # Find the article content div, and get all the paragraphs and links contained in it
    content_text = soup.find('div', id='mw-content-text')
    content = content_text.find('div', class_='mw-parser-output') if content_text is not None else None
    if content is None:
        return [], []

    paragraphs = content.find_all('p', recursive=False)

    # Extract text and remove reference markers such as [1]
    paragraphs = [re.sub(r'\[\d+\]', '', p.text) for p in paragraphs]

    # Join paragraphs to be in the length range
    paragraphs = join_paragraphs(paragraphs)

    # Find extra links to simple wiki articles
    links = content.find_all('a', href=re.compile(r'^/wiki/[A-Za-z0-9_-]+$'))

    return paragraphs, links
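
To make the greedy joining easier to follow, here is a small self-contained sketch on toy data. The function name `join_greedy`, the explicit `min_len`/`max_len` parameters, and the toy bounds are assumptions for illustration. Restarting from the next paragraph when a window grows too long is one possible choice, not necessarily the only sensible one. As in the utility above, only paragraph characters are counted, not the `\r\n` separators.

```python
def join_greedy(paragraphs, min_len, max_len):
    """Greedily merge consecutive paragraphs into chunks of min_len..max_len characters."""
    lengths = [len(p) for p in paragraphs]
    chunks = []
    i, j = 0, 1                      # current window is paragraphs[i:j]
    while j <= len(paragraphs):
        total = sum(lengths[i:j])
        if min_len <= total <= max_len:
            chunks.append('\r\n'.join(paragraphs[i:j]))
            i, j = j, j + 1          # start a new window after the accepted chunk
        elif total > max_len:
            i, j = i + 1, i + 2      # window too long: retry from the next paragraph
        else:
            j += 1                   # window too short: add one more paragraph
    return chunks

# Toy paragraphs with lengths 4, 2, 12, 3, 3 and bounds of 5..8 characters
toy = ['aaaa', 'bb', 'cccccccccccc', 'ddd', 'eee']
chunks = join_greedy(toy, min_len=5, max_len=8)
print(chunks)
```

The first two paragraphs are merged (6 characters), the 12-character paragraph is skipped because it alone exceeds the upper bound, and the last two are merged (6 characters).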

<h2>Main section</h2>

In [None]:
# First run to grab extra excerpts from the same articles
url_legal = []
excerpts = []
extra_articles = []
for link in tqdm(simple_wiki.url_legal):
    article = requests.get(link).text
    paragraphs, links = get_paragraphs_and_links(article)
    url_legal.extend([link]*len(paragraphs))
    excerpts.extend(paragraphs)
    extra_articles.extend(links)

print(f'There are {len(excerpts)} excerpts and {len(extra_articles)} links to extra articles')

extra = pd.DataFrame(data = {'url_legal': url_legal, 'license': LICENSE, 'excerpt': excerpts})
extra.head()

I won't run it here because it would take a while (more than an hour on my machine), but a second run could be done over the extra articles, which would retrieve an additional <b>16000</b> excerpts in the same way. The code is shown below.

In [None]:
# Second run with extra articles

extra_run = False

if extra_run:
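    for anchor in tqdm(extra_articles):
        # Each element is an <a> tag; build the absolute URL from its relative href
        link = f"https://simple.wikipedia.org{anchor['href']}"
        article = requests.get(link).text
        paragraphs, _ = get_paragraphs_and_links(article)
        url_legal.extend([link]*len(paragraphs))
        excerpts.extend(paragraphs)

    # Rebuild the DataFrame so it includes the new excerpts
    extra = pd.DataFrame(data = {'url_legal': url_legal, 'license': LICENSE, 'excerpt': excerpts})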
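
The links collected in the first run can contain the same article more than once, so a second run may benefit from deduplicating them before fetching. A minimal sketch, assuming plain href strings rather than BeautifulSoup tags (the `hrefs` sample and the `unique_article_urls` helper are hypothetical). It reuses the link pattern from `get_paragraphs_and_links`, which rejects namespaced pages such as `/wiki/File:...` because `:` is not in the allowed character set.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://simple.wikipedia.org'
ARTICLE_HREF = re.compile(r'^/wiki/[A-Za-z0-9_-]+$')   # same pattern as in the utilities cell

def unique_article_urls(hrefs):
    """Keep article-like hrefs, drop duplicates (preserving order), and make them absolute."""
    seen = set()
    urls = []
    for href in hrefs:
        if ARTICLE_HREF.match(href) and href not in seen:
            seen.add(href)
            urls.append(urljoin(BASE_URL, href))
    return urls

# Hypothetical sample: one duplicate, one file page, one in-page anchor
hrefs = ['/wiki/Earth', '/wiki/File:Earth.jpg', '/wiki/Earth',
         '/wiki/Solar_System', '#cite_note-1']
urls = unique_article_urls(hrefs)
print(urls)
```

With the tags in `extra_articles`, the input could be built as `[a['href'] for a in extra_articles]`.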