### Scrape URLs for CommonLit Readability Prize competition 

1. There are over 600 URLs to scrape. I scraped only ~570 of them, from 3-4 separate domains.
2. Wikipedia was the most annoying to scrape cleanly.
3. There may be some undetected artifacts in the text, so use with caution.
4. You should perform your own exploratory data analysis to discover any remaining artifacts that occurred during scraping.
5. I created a [notebook](https://www.kaggle.com/teeyee314/readability-external-data-eda) with additional preparation for use with competition training.

You're welcome :)

In [1]:
!pip install -q beautifulsoup4

In [2]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
import requests
import re
import warnings
warnings.filterwarnings("ignore")

BASE_DIR = '../input/commonlitreadabilityprize'

print(os.listdir(BASE_DIR))

['sample_submission.csv', 'train.csv', 'test.csv']


In [3]:
train = pd.read_csv(os.path.join(BASE_DIR, 'train.csv'))

In [4]:
train

Unnamed: 0,id,url_legal,license,excerpt,target,standard_error
0,c12129c31,,,When the young people returned to the ballroom...,-0.340259,0.464009
1,85aa80a4c,,,"All through dinner time, Mrs. Fayre was somewh...",-0.315372,0.480805
2,b69ac6792,,,"As Roger had predicted, the snow departed as q...",-0.580118,0.476676
3,dd1000b26,,,And outside before the palace a great garden w...,-1.054013,0.450007
4,37c1b32fb,,,Once upon a time there were Three Bears who li...,0.247197,0.510845
...,...,...,...,...,...,...
2829,25ca8f498,https://sites.ehe.osu.edu/beyondpenguins/files...,CC BY-SA 3.0,When you think of dinosaurs and where they liv...,1.711390,0.646900
2830,2c26db523,https://en.wikibooks.org/wiki/Wikijunior:The_E...,CC BY-SA 3.0,So what is a solid? Solids are usually hard be...,0.189476,0.535648
2831,cd19e2350,https://en.wikibooks.org/wiki/Wikijunior:The_E...,CC BY-SA 3.0,The second state of matter we will discuss is ...,0.255209,0.483866
2832,15e2e9e7a,https://en.wikibooks.org/wiki/Geometry_for_Ele...,CC BY-SA 3.0,Solids are shapes that you can actually touch....,-0.215279,0.514128


In [5]:
# select rows that have urls; copy so the new column below is not written to a view of train
has_text = train[~train['url_legal'].isnull()].copy()

# grab the domain name
has_text['domain'] = has_text['url_legal'].apply(lambda x: x.split('/')[2])

In [6]:
# list all reference urls by frequency in descending order
has_text['url_legal'].apply(lambda x: x.split('/')[2]).value_counts()

simple.wikipedia.org          196
kids.frontiersin.org          191
en.wikipedia.org              176
www.africanstorybook.org      164
www.commonlit.org              41
freekidsbooks.org              19
www.digitallibrary.io          19
en.wikibooks.org                8
static.ehe.osu.edu              6
drive.google.com                3
ukuqonda.co.za                  2
www.ck12.org                    2
emedia.uen.org                  1
sites.ehe.osu.edu               1
beyondpenguins.ehe.osu.edu      1
Name: url_legal, dtype: int64
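
The `split('/')[2]` approach above works because every `url_legal` value starts with a scheme such as `https://`. A slightly more defensive sketch uses the standard library's `urllib.parse.urlparse` instead. The helper name `domain_of` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Return the host part of a URL, or '' if none can be parsed (hypothetical helper)."""
    return urlparse(url).netloc

print(domain_of('https://en.wikipedia.org/wiki/Big_data'))  # en.wikipedia.org
```

For URLs that include a scheme, `has_text['url_legal'].map(domain_of)` should give the same values as the lambda used above.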

| Count | Domain |
|--- | --- |
| 196 | simple.wikipedia.org |
| 191 | kids.frontiersin.org |
| 176 | en.wikipedia.org |
| 8 | en.wikibooks.org |
| 571 | Total |
| 95 | Missing |

In [7]:
def show_html(text):
    soup = BeautifulSoup(text, 'html.parser')

    words = []

    for paragraph in soup.find_all('p'):
        if paragraph.sup:
            for support in paragraph.find_all('sup'):
                support.decompose()
        words.append(paragraph.get_text())

    return words

def clean_newline(soup=''):
    return re.sub(r'\n', '', soup)

def clean_http(soup=''):
    # drop empty paragraphs and any paragraph that contains a link
    return [x for x in soup if x != '' and 'http' not in x]

def clean_frontiersin(soup=''):
    soup = list(map(lambda x: '0' if re.search('\n', x) else x, soup))
    soup = list(map(lambda y: '1' if re.search('↑', y) else y, soup))
    soup = list(filter(lambda x: x != '0', soup))
    soup = list(map(clean_brackets, soup))
    soup = list(map(remove_http_url, soup))
    try:
        # cut everything from the first '↑' marker onward
        soup = soup[:soup.index('1')]
    except ValueError:
        # no marker on this page; keep the full list
        pass
    
    return soup

def remove_copyright(soup=''):
    text = ['The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.',
            'The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.']
    for t in text:
        try:
            soup = soup[:soup.index(t)]
        except ValueError:
            # statement not present on this page
            pass
    return soup

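# remove some artifacts not present in competition data
def clean_brackets(text):
    # citation markers such as [1] or [a]
    cleaned = re.sub(r'\[[a-zA-Z0-9]+\]', '', text)
    # figure references such as (Figure 1) or (Figures 2A,B)
    cleaned = re.sub(r'\(Figures? [^)]+\)', '', cleaned)
    cleaned = re.sub(r'\(see Figure [^)]+\)', '', cleaned)
    # any other bracketed span; stops at the first ']' so text between two brackets is kept
    cleaned = re.sub(r'\[[^\]]+\]', '', cleaned)
    return cleaned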

def remove_http_url(soup):
    soup = re.sub(r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))", '', soup)
    return soup
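
To make the truncation step in `clean_frontiersin` concrete, here is a small standalone sketch of the same idea without the `'0'`/`'1'` sentinel strings: skip paragraphs that contain a newline, then stop at the first paragraph containing `↑`. That marker presumably indicates the start of the reference list on kids.frontiersin.org pages, which is an assumption. The name `truncate_at_marker` and the sample paragraphs are made up for illustration.

```python
def truncate_at_marker(paragraphs, marker='↑'):
    """Keep single-line paragraphs up to (not including) the first one containing `marker`."""
    kept = []
    for p in paragraphs:
        if '\n' in p:
            continue  # multi-line paragraphs are dropped, as in clean_frontiersin
        if marker in p:
            break     # everything from here on is treated as back matter
        kept.append(p)
    return kept

sample = [
    'Intro paragraph.',
    'Caption\nwith a line break',
    'Main text.',
    '↑ Smith, J. 2019.',
    'Trailing note.',
]
print(truncate_at_marker(sample))  # ['Intro paragraph.', 'Main text.']
```

Checking for the newline first mirrors the original order of operations, where a multi-line paragraph is discarded before the marker test is applied.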

# kids.frontiersin.org

In [8]:
frontier = has_text[has_text['domain'] == 'kids.frontiersin.org'].reset_index(drop=True)

In [9]:
%%time
frontier_text = frontier['url_legal'].map(requests.get)

CPU times: user 3.33 s, sys: 254 ms, total: 3.59 s
Wall time: 1min 53s
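
Mapping `requests.get` over the column, as above, sets no timeout and does not check whether a request failed. The sketch below shows a more defensive loop. It takes the fetch function as a parameter so it can be tried offline; `fetch_all`, its `delay` and `timeout` parameters, `FakeResponse`, and `fake_get` are all hypothetical names introduced for this example.

```python
import time

def fetch_all(urls, get, delay=0.5, timeout=10):
    """Fetch each URL with `get` and return {url: page text, or None on failure}.

    `get` is expected to behave like requests.get: it accepts a `timeout`
    keyword and returns an object with `.status_code` and `.text`.
    """
    results = {}
    for url in urls:
        try:
            response = get(url, timeout=timeout)
            results[url] = response.text if response.status_code == 200 else None
        except Exception:
            # record network errors as missing instead of aborting the whole run
            results[url] = None
        time.sleep(delay)  # pause between requests to be polite to the server
    return results


class FakeResponse:
    """Stand-in for a response object, used only for this offline demo."""
    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

def fake_get(url, timeout=None):
    if 'missing' in url:
        return FakeResponse(404, '')
    return FakeResponse(200, '<p>hello</p>')

pages = fetch_all(['https://example.org/a', 'https://example.org/missing'], fake_get, delay=0)
print(pages)
```

In the notebook this could be called as `fetch_all(frontier['url_legal'], requests.get)`, after which rows whose value is `None` can be retried or dropped.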


In [10]:
frontier_soup = frontier_text.apply(lambda x: x.text)
frontier_soup = frontier_soup.map(show_html)
frontier_soup = frontier_soup.map(clean_frontiersin)
frontier_soup = frontier_soup.map(remove_copyright)
frontier_soup = frontier_soup.map(lambda x: '\n'.join(x))

In [11]:
frontier['external_text'] = frontier_soup

# en.wikibooks.org

In [12]:
wikibooks = has_text[has_text['domain'] == 'en.wikibooks.org'].reset_index(drop=True)

In [13]:
%%time
wikibooks_text = wikibooks['url_legal'].map(requests.get)

CPU times: user 115 ms, sys: 9.97 ms, total: 125 ms
Wall time: 3.16 s


In [14]:
wikibooks_soup = wikibooks_text.apply(lambda x: x.text)
wikibooks_soup = wikibooks_soup.map(show_html)

In [15]:
wikibooks_soup[0] = list(filter(lambda x: x != '\n', clean_http(wikibooks_soup[0])[:-5]))
wikibooks_soup[1] = list(filter(lambda x: x != '\n',clean_http(wikibooks_soup[1])[:-1]))
wikibooks_soup[2] = list(filter(lambda x: x != '\n', wikibooks_soup[2]))
wikibooks_soup[6] = wikibooks_soup[6][:7] + wikibooks_soup[6][9:]
wikibooks_soup = wikibooks_soup.map(lambda x: ''.join(x))

In [16]:
wikibooks['external_text'] = wikibooks_soup

# simple.wikipedia.org

In [17]:
def show_html_wiki(text):
    soup = BeautifulSoup(text, 'html.parser')
    words = []
    
    # remove tables
    for table in soup.find_all('table'):
        table.decompose()
    
    # remove spans
    for span in soup.find_all('span'):
        span.decompose()
        
    # remove un-ordered lists
    for ul in soup.find_all('ul'):
        ul.decompose()
        
    # remove ordered lists
    for ol in soup.find_all('ol'):
        ol.decompose()

    for paragraph in soup.find_all('p'):
        # remove sup tags
        if paragraph.sup:
            for support in paragraph.find_all('sup'):
                support.decompose()
        cleaned = remove_ufeff(paragraph.get_text())
        cleaned = remove_xa0(cleaned)
        words.append(cleaned)
    
    return words

# remove artifact from using requests library on wikipedia
def remove_ufeff(text):
    return re.sub(r'\ufeff', '', text)

# replace non-breaking spaces with regular spaces so adjacent words are not fused
def remove_xa0(text):
    return re.sub(r'\xa0', ' ', text)

def filter_newline(text):
    text = text.split('\n')
    return '\n'.join(list(filter(lambda x: x != "", text)))
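
The helpers above handle `\ufeff` and `\xa0` one character at a time. As an alternative sketch (not the notebook's method), `unicodedata.normalize('NFKC', ...)` from the standard library maps the non-breaking space to a regular space, along with other compatibility characters. The byte-order mark still has to be removed explicitly. The name `normalize_text` is hypothetical.

```python
import unicodedata

def normalize_text(text):
    """Normalize compatibility characters and drop the byte-order mark (hypothetical helper)."""
    text = unicodedata.normalize('NFKC', text)  # U+00A0 becomes an ordinary space
    return text.replace('\ufeff', '')           # U+FEFF is unchanged by NFKC, so remove it here

print(repr(normalize_text('10\xa0km\ufeff')))  # '10 km'
```

Note that NFKC also rewrites other characters (for example ligatures and superscript digits), so it is worth comparing a few pages before and after to confirm it suits this data.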

In [18]:
simple_wiki = has_text[has_text['domain'] == 'simple.wikipedia.org'].reset_index(drop=True)

In [19]:
%%time
simple_wiki_text = simple_wiki['url_legal'].map(requests.get)

CPU times: user 3.18 s, sys: 181 ms, total: 3.36 s
Wall time: 1min 27s


In [20]:
simple_wiki_soup = simple_wiki_text.apply(lambda x: x.text)
simple_wiki_soup = simple_wiki_soup.map(show_html_wiki)
simple_wiki_soup = simple_wiki_soup.map(lambda x: ''.join(x))
simple_wiki_soup = simple_wiki_soup.map(filter_newline)

In [21]:
simple_wiki['external_text'] = simple_wiki_soup

# en.wikipedia.org

In [22]:
wiki = has_text[has_text['domain'] == 'en.wikipedia.org'].reset_index(drop=True)

In [23]:
%%time
wiki_text = wiki['url_legal'].map(requests.get)

CPU times: user 3.24 s, sys: 207 ms, total: 3.45 s
Wall time: 1min 30s


In [24]:
wiki_soup = wiki_text.apply(lambda x: x.text)
wiki_soup = wiki_soup.map(show_html_wiki)
wiki_soup = wiki_soup.map(lambda x: ''.join(x))
wiki_soup = wiki_soup.map(filter_newline)

In [25]:
wiki['external_text'] = wiki_soup

In [26]:
external = pd.concat([wiki, simple_wiki, wikibooks, frontier])
external.to_csv('external.csv', index=False)
external

Unnamed: 0,id,url_legal,license,excerpt,target,standard_error,domain,external_text
0,0d3a8f33b,https://en.wikipedia.org/wiki/Big_data,CC BY-SA 3.0,Big data is a term for data sets that are so l...,-1.634185,0.517197,en.wikipedia.org,Big data is a field that treats ways to analyz...
1,7073d1ef3,https://en.wikipedia.org/wiki/Biodiesel,CC BY-SA 3.0,Biodiesel can also be used as a heating fuel i...,-2.492674,0.521320,en.wikipedia.org,Biodiesel is a form of diesel fuel derived fro...
2,e83e2cc69,https://en.wikipedia.org/wiki/Biodiversity,CC BY-SA 3.0,"Biodiversity, a contraction of ""biological div...",-1.719896,0.473570,en.wikipedia.org,Biodiversity is the biological variety and var...
3,e4d810c98,https://en.wikipedia.org/wiki/Biotechnology,CC BY-SA 3.0,Although not normally what first comes to mind...,-2.112264,0.504813,en.wikipedia.org,"Biotechnology is a broad area of biology, invo..."
4,688e3c808,https://en.wikipedia.org/wiki/Bitcoin,CC BY-SA 3.0,Bitcoin is a digital asset and a payment syste...,-1.561801,0.468540,en.wikipedia.org,Bitcoin (₿) is a decentralized digital currenc...
...,...,...,...,...,...,...,...,...
186,465d65831,https://kids.frontiersin.org/article/10.3389/f...,CC BY 4.0,Researchers around the world are spending more...,0.574607,0.508896,kids.frontiersin.org,Is it important to have friends? Why do we enj...
187,6fc3d6ef3,https://kids.frontiersin.org/article/10.3389/f...,CC BY 4.0,There are numerous experimental studies using ...,-0.842189,0.533197,kids.frontiersin.org,"When I was a kid, I thought of going to school..."
188,389343d57,https://kids.frontiersin.org/article/10.3389/f...,CC BY 4.0,Described in the scientific literature more th...,-0.634987,0.490213,kids.frontiersin.org,"Have you ever felt left out, isolated, rejecte..."
189,9eea14ccb,https://kids.frontiersin.org/article/10.3389/f...,CC BY 4.0,Just as wildebeest are the main grazers of the...,-2.459246,0.502968,kids.frontiersin.org,Copepods are amongst the most abundant animals...
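
Since some artifacts may remain in `external_text`, a quick pattern scan can help with the exploratory check suggested at the top. This is a minimal sketch: the pattern list and the name `find_artifacts` are hypothetical, and the patterns are not exhaustive.

```python
import re

# Hypothetical patterns for leftovers this kind of scrape tends to produce
ARTIFACT_PATTERNS = {
    'bracket_citation': re.compile(r'\[\d+\]'),
    'url': re.compile(r'https?://\S+'),
    'nbsp_or_bom': re.compile('[\xa0\ufeff]'),
    'html_tag': re.compile(r'</?[a-zA-Z][^>]*>'),
}

def find_artifacts(text):
    """Return {pattern_name: match_count} for every pattern that occurs in `text`."""
    counts = {}
    for name, pattern in ARTIFACT_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            counts[name] = len(hits)
    return counts

report = find_artifacts('Big data[1] is large.\xa0See https://example.org <b>now</b>')
print(report)
```

Applied to the scraped column, `external['external_text'].map(find_artifacts)` would give one dictionary per row, and rows with a non-empty dictionary are the ones worth inspecting by hand.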
