# Scraping for data augmentation

In this notebook we use web scraping to increase the size of our dataset.

**How is it done?**

The train data contains URLs that point to the source of each excerpt. Every text in the train data comes from a larger article, so we can scrape the entire article and divide it into paragraphs with a similar target.

For example, we can take the URL https://en.wikipedia.org/wiki/Battle_of_Britain and scrape the complete page. The train data contains only one paragraph of ~175 words from this article. We make the **assumption** that if the target of this paragraph is -0.165467, the whole article will have a similar target.
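
The assumption can be made concrete with a minimal sketch. The function name `propagate_target`, the fixed chunk size and the toy article are illustrative assumptions, not code from this notebook; the splitting used later draws the chunk size at random instead.

```python
def propagate_target(article_text, target, words_per_chunk=175):
    """Split an article into fixed-size word chunks that all inherit one target."""
    words = article_text.split()
    chunks = []
    for start in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[start:start + words_per_chunk])
        chunks.append((chunk, target))
    return chunks


# Toy article of 400 words standing in for a scraped page
toy_article = " ".join(f"word{i}" for i in range(400))
labelled = propagate_target(toy_article, target=-0.165467)
print(len(labelled))  # 3 chunks: 175 + 175 + 50 words
```

Every chunk carries the target of the single labelled excerpt, which is exactly the hypothesis that the rest of the notebook relies on.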

**Let's rock it!**

# 1. Import libraries and load data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install scrapy
!pip install beautifulsoup4

#Scraping libraries
import requests
from scrapy.selector import Selector
from bs4 import BeautifulSoup
import warnings

import re
from collections import deque
from time import sleep

from tqdm import tqdm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import truncnorm

In [None]:
train = pd.read_csv("/kaggle/input/commonlitreadabilityprize/train.csv")

# 2. Create utils to scrape

Here we create three classes: a parent class called `Extractor` with the attributes that every extractor must have, and two child classes that inherit from `Extractor`.

**Each child class scrapes a different site**: `WikipediaExtractor` scrapes Wikipedia and `FrontiersinExtractor` scrapes Frontiersin.

These two extractors **cover ~80%** of the URLs that appear in the dataset. You can create other extractors if you want to scrape other sites.
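
A coverage figure like this can be checked by counting URLs per host. The sketch below is only an illustration: `domain_counts` is a hypothetical helper and the URLs are toy values (in the notebook they would come from `train.url_legal`).

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count URLs per host, skipping missing values (NaN floats, None)."""
    hosts = [urlparse(u).netloc for u in urls if isinstance(u, str)]
    return Counter(hosts)


# Toy URLs for illustration only; the frontiersin and example.org paths are made up
toy_urls = [
    "https://en.wikipedia.org/wiki/Battle_of_Britain",
    "https://en.wikipedia.org/wiki/Photosynthesis",
    "https://kids.frontiersin.org/article/toy-article",
    "https://example.org/some-story",
    float("nan"),
]
counts = domain_counts(toy_urls)
supported = sum(n for host, n in counts.items()
                if "wikipedia" in host or "frontiersin" in host)
print(counts.most_common())
print(f"supported share: {supported / sum(counts.values()):.0%}")
```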

In [None]:
#Parent class

class Extractor:
    """
    Parent class with the attributes that every Extractor must have

    Parameters
    ----------
    url (str):
        The url to scrape

    errors (bool), default False:
        If True, raise a ValueError when the response status code is not 2xx.
        If False, continue without creating a selector.

    Attributes
    ----------
    response: reason phrase of the HTTP response ("OK", "Not Found", etc.)

    status_code: HTTP status code of the response (200, 301, 404, etc.)

    selector: scrapy Selector used to parse the page (only set for 2xx responses)
    """

    def __init__(self, url, errors=False) -> None:
        self.url = url
        rq = requests.get(url)
        self.response = rq.reason
        self.status_code = rq.status_code

        # Accept only 2xx codes; a substring test for "20" would also accept 520
        if 200 <= self.status_code < 300:
            self.selector = Selector(text=rq.text)
        elif errors:
            raise ValueError(f"The page {url} did not return a 2xx status code")

In [None]:
#Child classes

class WikipediaExtractor(Extractor):
    """
    Inherits from Extractor; scrapes the content of Wikipedia pages

    Methods
    -------
    parse_text: return the text of the Wikipedia page without HTML markup.
    """

    def __init__(self, url, errors=False) -> None:
        super().__init__(url, errors=errors)

    def parse_text(self):
        #Select the paragraphs of the article body
        text_html = self.selector.xpath("//div[@class='mw-parser-output']/p").extract()

        text_parsed = []

        #Clean HTML and other stuff from wikipedia
        for paragraph in text_html:
            text_souped = BeautifulSoup(paragraph, "lxml").get_text()
            soup_cleaned = re.sub(r"\[.*?\]", "", text_souped) #remove bracketed citation markers such as [1]
            text_parsed.append(soup_cleaned)

        #Join the paragraphs into a single line of text
        self.text = "\n".join(text_parsed).replace("\n", " ").replace("  ", " ").strip()
        return self.text


class FrontiersinExtractor(Extractor):
    """
    Inherits from Extractor; scrapes the content of Frontiersin pages

    Methods
    -------
    parse_text: return the text of the Frontiersin page without HTML markup.
    """

    def __init__(self, url, errors=False) -> None:
        super().__init__(url, errors=errors)

    def parse_text(self):
        #Clean HTML and other stuff from Frontiersin
        text_html = self.selector.xpath("//div[@class='size size-small fulltext-content']").extract()
        text_souped = BeautifulSoup(text_html[0], "lxml").get_text()
        text_clean = re.sub(r"\[.*?\]", "", text_souped)
        text_clean = re.sub(r"\s+", " ", text_clean)

        #We don't want the closing sections (conflict of interest statement, references)
        #to be part of our data, so keep only the text before "Conflict of Interest".
        text_without_references = text_clean.partition("Conflict of Interest")[0].strip()

        self.text = text_without_references
        return self.text
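
When more extractors are added, choosing one by host name rather than by a substring of the whole URL may be more robust. The following is a minimal sketch under that assumption; `EXTRACTOR_BY_HOST` and `pick_extractor_name` are hypothetical names, and the mapping stores class names as strings only so that the snippet runs on its own.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host suffix to the name of an extractor class
EXTRACTOR_BY_HOST = {
    "wikipedia.org": "WikipediaExtractor",
    "kids.frontiersin.org": "FrontiersinExtractor",
}


def pick_extractor_name(url):
    """Return the extractor name for a URL, or None if no extractor applies."""
    host = urlparse(url).netloc.lower()
    for suffix, name in EXTRACTOR_BY_HOST.items():
        if host == suffix or host.endswith("." + suffix):
            return name
    return None


print(pick_extractor_name("https://en.wikipedia.org/wiki/Battle_of_Britain"))
print(pick_extractor_name("https://example.org/wikipedia"))
```

The second URL contains the word "wikipedia" in its path but is served by another host, so no extractor is selected for it.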

In [None]:
def urls_to_scrape(urls, targets):
    """
    Return the urls we can scrape together with their targets.

    Warning: urls and targets must be in the same order.

    Parameters
    ----------
    urls: (list(str))
        A list of urls to scrape

    targets: (list(float))
        The target of each url
    """

    selected_urls = []
    selected_targets = []

    for url, target in zip(urls, targets):
        # Skip missing values (non-strings); only wikipedia and frontiersin have an extractor
        if isinstance(url, str) and ("wikipedia" in url or "frontiersin" in url):
            selected_urls.append(url)
            selected_targets.append(target)

    return selected_urls, selected_targets

In [None]:
def scraper(urls_to_scrape):
    """
    Scrape the text of all the wikipedia and frontiersin urls

    Parameters
    ----------
    urls_to_scrape: (list(str))

    Returns
    -------
    list: one parsed text per url, or np.nan when the url can't be scraped
    """

    texts_parsed = []
    for url in tqdm(urls_to_scrape):
        sleep(0.1)  # small pause between requests
        if "wikipedia" in url:
            extractor_class = WikipediaExtractor
        elif "kids.frontiersin" in url:
            extractor_class = FrontiersinExtractor
        else:
            extractor_class = None

        try:
            if extractor_class is None:
                raise ValueError("no extractor available")
            texts_parsed.append(extractor_class(url, True).parse_text())
        except Exception:
            # Append NaN so that texts_parsed stays aligned with the list of targets
            warnings.warn(f"The url {url} can't be scraped")
            texts_parsed.append(np.nan)

    return texts_parsed
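
Network requests can fail temporarily, so a retry with a growing pause is a common addition to a scraping loop. The helper below is a sketch and not part of this notebook: `with_retries`, its parameters and the fake fetcher are all assumptions made for illustration.

```python
import time


def with_retries(fetch, url, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller handle the error
            sleep(base_delay * 2 ** attempt)


# Fake fetcher that fails twice and then succeeds (stands in for a network call)
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>{url}</html>"

delays = []  # record the pauses instead of actually sleeping
page = with_retries(flaky_fetch, "https://en.wikipedia.org/wiki/Battle_of_Britain",
                    sleep=delays.append)
print(page, delays)
```

Passing `sleep=delays.append` keeps the demonstration instant; in real use the default `time.sleep` would apply the pauses.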

# 3. Scrape the data

Now that we have all we need, we can start the scraping and get the data!

In [None]:
train

In [None]:
# Extract the urls from the dataframe
urls, target = urls_to_scrape(train.url_legal.values, train.target.values)
urls[:5]

In [None]:
#Scrape
texts_parsed = scraper(urls)

In [None]:
#Create a DataFrame
df_text_parsed = pd.DataFrame(texts_parsed)
df_text_parsed.columns = ["excerpt"]
df_text_parsed["target"] = target


df_text_parsed = df_text_parsed.dropna().reset_index(drop=True)

#Save the data
df_text_parsed.to_csv("./data_augmentation_raw.csv", index=False)
df_text_parsed

# 4. Prepare and clean the text

The last step is to **clean the new texts thoroughly and make them similar to our train data**.

I made the following decisions:

1. Clean the text completely.

2. The scraped articles have thousands of words, but each excerpt in our train data is ~175 words long on average. We therefore need to split the articles into parts. The part lengths are drawn **from a truncated normal distribution** so that each new paragraph is around 175 words long. This means that one article of 1000 words gives ~5 new samples.

3. To **avoid identical targets**, we shift the target of each part by a small random amount (drawn uniformly between -0.10 and 0 in `gen_data`).
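
Decisions 2 and 3 can be sketched with the standard library alone. This is an illustration rather than the notebook's implementation, which uses `scipy.stats.truncnorm`: here the truncation is approximated by rejection sampling, and the names `sample_chunk_length` and `jitter_target` and the seed are assumptions. The offset range mirrors the one used by `gen_data`.

```python
import random


def sample_chunk_length(rng, mean=180, sd=17, low=135, high=205):
    """Draw a chunk length from a normal distribution truncated to [low, high]."""
    while True:
        value = rng.gauss(mean, sd)
        if low <= value <= high:  # reject draws outside the allowed range
            return int(value)


def jitter_target(rng, target, low=-0.10, high=0.0):
    """Shift a target by a uniform random offset in [low, high]."""
    return target + rng.uniform(low, high)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
lengths = [sample_chunk_length(rng) for _ in range(1000)]
mean_length = sum(lengths) / len(lengths)
print(min(lengths), max(lengths), mean_length)

# A 1000-word article split into parts of about 180 words gives 5 parts
n_parts = 1000 // 180
jittered = jitter_target(rng, -0.165467)
print(n_parts, jittered)
```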

**First we create the functions used in this part:**

In [None]:
def truncated_normal(mean=180, sd=17, low=135, high=205):
    """
    Return a random number drawn from a normal distribution truncated to [low, high]

    Parameters
    ----------
    mean: (int/float)
        Mean of the underlying normal distribution

    sd: (int/float)
        Standard deviation of the underlying normal distribution

    low: (int/float)
        Lowest number of the distribution

    high: (int/float)
        Highest number of the distribution
    """
    return truncnorm((low - mean) / sd, (high - mean) / sd, loc=mean, scale=sd).rvs()


def clean_references(sentence):
    "Drop everything from the first occurrence of the word 'references' or 'reference' onwards"

    sentence_split = sentence.split(" ")

    new_sentence = []

    for word in sentence_split:
        normalize_word = word.lower()
        if normalize_word == "references" or normalize_word == "reference":
            break
        else:
            new_sentence.append(word)

    return " ".join(new_sentence)

def clean_simbols(sentence):
    """
    Delete everything inside brackets and parentheses, plus other symbols that
    frequently appear in wikipedia and frontiersin
    """
    sentence = re.sub(r"([\(\[]).*?([\)\]])", "", sentence) #clean brackets and parentheses
    sentence = re.sub(r"http\S+", "", sentence) #clean urls
    sentence = sentence.replace("↑", "") #clean this symbol
    sentence = re.sub(r"\s+", " ", sentence) #collapse repeated whitespace
    return sentence


def gen_data(data_augmented_raw):
    """
    Build the augmented data from a DataFrame with a text column 'excerpt' and a 'target' column.

    Returns one list per row, each holding [text, target] pairs.
    """

    all_data_aumented = []

    for row_idx in range(data_augmented_raw.shape[0]):
        rand_div = truncated_normal()  # words per part for this article
        dq_data_raw = deque(data_augmented_raw["excerpt"][row_idx].split(" "))
        # Number of parts for the current article
        n_divs = len(dq_data_raw) / rand_div
        differents_text = []
        for i in range(int(n_divs)):
            div = []
            for j in range(int(rand_div)):
                try:
                    div.append(dq_data_raw.popleft())
                except IndexError:
                    break

            length_div = len(div)
            if length_div > 125:
                div = " ".join(div)
                target_row = data_augmented_raw["target"][row_idx]
                # Shift the target by a random amount in [-0.10, 0]
                min_target = target_row - 0.10
                final_target = np.random.uniform(low=min_target, high=target_row)
                differents_text.append([div, final_target])

        all_data_aumented.append(differents_text)
    return all_data_aumented

## 4.1 Check the claim made at the beginning of section 4

In [None]:
#The train excerpts have a mean length of 171.65 words
length_excerpt = train.excerpt.map(lambda x: len(x.split(" "))).values
print(length_excerpt.mean())

#distribution
plt.hist(length_excerpt)

## 4.2 Complete cleaning of the text

In [None]:
df_text_parsed["excerpt"] = df_text_parsed["excerpt"].map(clean_references)
df_text_parsed["excerpt"] = df_text_parsed["excerpt"].map(clean_simbols)
df_text_parsed
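
A quick check on a toy string shows what the patterns in `clean_simbols` remove. The sample sentence is made up, and the helper `strip_noise` only repeats the same substitutions so that the snippet runs on its own.

```python
import re


def strip_noise(sentence):
    """Apply the same substitutions as clean_simbols, in the same order."""
    sentence = re.sub(r"([\(\[]).*?([\)\]])", "", sentence)  # bracketed or parenthesised text
    sentence = re.sub(r"http\S+", "", sentence)               # urls
    sentence = sentence.replace("↑", "")                      # arrow symbol
    sentence = re.sub(r"\s+", " ", sentence)                  # collapse whitespace
    return sentence


sample = "The battle [12] began in 1940 (see map) ↑ https://example.org/x and lasted months."
cleaned = strip_noise(sample)
print(cleaned)  # The battle began in 1940 and lasted months.
```

Note that the first pattern removes every parenthesised remark, not only citation markers, so some genuine content is dropped as well.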

## 4.3 Generate the new data

In [None]:
final_data_augment = gen_data(df_text_parsed)
final_data_augment = [data for groups_data in final_data_augment for data in groups_data]


df_final_data_augment = pd.DataFrame(final_data_augment, columns=["excerpt", "target"])
df_final_data_augment.to_csv("data_augmented_clean.csv", index=False)
df_final_data_augment

## 4.4 Check that the length and the distribution are similar to the train data

In [None]:
#Compare with the train mean length of 171.65 words
length_excerpt_augment = df_final_data_augment.excerpt.map(lambda x: len(x.split(" "))).values
print(length_excerpt_augment.mean())

#distribution
plt.hist(length_excerpt_augment)
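
Besides comparing the histograms by eye, the two length distributions can be compared through summary statistics. The sketch below uses toy texts in place of `train.excerpt` and `df_final_data_augment.excerpt`; `length_summary` is a hypothetical helper.

```python
from statistics import mean, stdev


def length_summary(texts):
    """Return (mean, standard deviation) of the word counts of a list of texts."""
    lengths = [len(t.split()) for t in texts]
    return mean(lengths), stdev(lengths)


# Toy texts standing in for the real excerpts
toy_train = ["a " * 170, "b " * 175, "c " * 180]
toy_augmented = ["x " * 160, "y " * 178, "z " * 190]

train_mean, train_sd = length_summary(toy_train)
aug_mean, aug_sd = length_summary(toy_augmented)
print(f"train:     mean={train_mean:.1f} sd={train_sd:.1f}")
print(f"augmented: mean={aug_mean:.1f} sd={aug_sd:.1f}")
```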

# 5. Conclusions

We created around 2000 new samples that can be added to our training data!

1. First of all, remember that the main hypothesis rests on the assumption that the target of the excerpt is representative of the full scraped article.

2. This notebook was made to test this data augmentation concept. The functions and the cleaning can be improved, and other extractors can be added.