# 1. Data fetching

This brief notebook automates the download of the protocols used in the *Protocols* track of the Hercules Challenge. If the protocols have already been downloaded and placed in the _data/protocols_ folder, you can skip this notebook and go directly to notebook _2. Data Exploration_.

## Setup

To download the protocols, we start from a list of protocol URLs. This list was obtained from the official Hercules Challenge documentation and saved to the file _protocol\_urls.txt_ in the data folder.

In [1]:
%run __init__.py

INFO:root:Starting logger


## Getting the protocol URLs

In the following cells we read every protocol URL from the file described above:

In [2]:
PROTOCOL_URLS_FILE = os.path.join(DATA_DIR, 'protocol_urls.txt')

def get_protocols_urls(file_name):
    """Read one protocol URL per line, skipping blank lines."""
    with open(file_name, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]
    return urls


In [3]:
urls = get_protocols_urls(PROTOCOL_URLS_FILE)
len(urls)

100

As we can see, 100 protocols will be used for this track.

In [4]:
urls[0]

'https://bio-protocol.org/e16'
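
Before fetching anything, it can help to check that every line read from the file looks like a usable URL. The following is a minimal sketch of such a check, not part of the original pipeline: the helper name `split_valid_urls` is hypothetical, and the malformed entries in `sample` are made up purely for illustration.

```python
from urllib.parse import urlparse

def split_valid_urls(candidates):
    """Split candidates into (valid, rejected) lists.

    A URL counts as valid when it has an http(s) scheme and a host.
    Duplicates of a valid URL are dropped, keeping the first occurrence.
    """
    valid, rejected, seen = [], [], set()
    for url in candidates:
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            if url not in seen:
                seen.add(url)
                valid.append(url)
        else:
            rejected.append(url)
    return valid, rejected

# Illustrative input: one real-looking URL, an empty line, a URL without
# a scheme, and a duplicate of the first entry.
sample = [
    "https://bio-protocol.org/e16",
    "",
    "bio-protocol.org/e16",
    "https://bio-protocol.org/e16",
]
valid, rejected = split_valid_urls(sample)
print(valid)
print(rejected)
```

In this notebook the same helper could be applied to the `urls` list loaded above, for example to confirm that `rejected` is empty before starting the download.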

## Fetching the protocols

Now that every protocol URL is loaded, we define a simple class in charge of fetching the pages from the [bio-protocol](https://bio-protocol.org/) website:

In [5]:
import os
import time

import requests
from tqdm import tqdm

BASE_URL = "https://bio-protocol.org/"

class BioProtocolScrapper():
    def __init__(self, output_dir, throttle_time=.5,
                 username=None, password=None):
        self.output_dir = output_dir
        self.throttle_time = throttle_time
        # A shared session keeps any login cookies for the later requests
        self.session = requests.Session()
        if username and password:
            self._login(username, password)

    def fetch_urls(self, url_list):
        pbar = tqdm(url_list)
        for url in pbar:
            pbar.set_description("Processing %s" % url)
            self.fetch_url(url)
            # Pause between requests to avoid overloading the server
            time.sleep(self.throttle_time)

    def fetch_url(self, url):
        # Use the last path segment (e.g. 'e16') as the file name
        filename = url.rstrip('/').split('/')[-1] + ".html"
        response = self.session.get(url)
        # Fail loudly instead of saving an error page as a protocol
        response.raise_for_status()
        with open(os.path.join(self.output_dir, filename), 'w', encoding='utf-8') as f:
            f.write(response.text)

    def _login(self, user, password):
        payload = {'txtEmail': user, 'txtPassword': password}
        # BASE_URL already ends with '/', so no extra separator is needed
        url = f'{BASE_URL}ifrlogin.aspx/?sign=in&p=4'
        self.session.post(url, data=payload)


Finally, we use this class to fetch every URL and save the resulting _html_ to the _data/protocols_ folder. In the next notebook we will load and parse this data for our initial exploration of the dataset.

In [6]:
scrapper = BioProtocolScrapper(PROTOCOLS_DIR)
scrapper.fetch_urls(urls)

Processing https://bio-protocol.org/e3436: 100%|██████████| 100/100 [03:59<00:00,  2.39s/it]
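
Re-running the cell above downloads all the pages again, which takes about four minutes. If a run was interrupted, one option is to fetch only the pages that are still missing. The sketch below assumes the same file naming as the scraper (last URL segment plus `.html`); the helper name `pending_urls` is hypothetical and the demonstration uses a temporary directory instead of the real _data/protocols_ folder.

```python
import os
import tempfile

def pending_urls(url_list, output_dir):
    """Return the URLs whose '<id>.html' file is missing or empty in output_dir."""
    pending = []
    for url in url_list:
        # Same naming rule as the scraper: last path segment plus '.html'
        filename = url.rstrip('/').split('/')[-1] + ".html"
        path = os.path.join(output_dir, filename)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            pending.append(url)
    return pending

# Self-contained demonstration: pretend that 'e16' was already downloaded.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "e16.html"), "w", encoding="utf-8") as f:
        f.write("<html></html>")
    demo = pending_urls(
        ["https://bio-protocol.org/e16", "https://bio-protocol.org/e3436"],
        tmp_dir,
    )
print(demo)
```

In the notebook itself, the equivalent call would be `scrapper.fetch_urls(pending_urls(urls, PROTOCOLS_DIR))`, which skips every protocol that already has a non-empty file on disk.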
