# Scraping & Normalization

## Copyright notice

This version (c) 2020 Fabian Offert, [MIT License](LICENSE).

## Colab setup

Only run this cell if you are running this notebook via Google Colab!

In [None]:
!wget https://zentralwerkstatt.org/files/wga.zip
!unzip wga.zip
import sys
!git clone https://github.com/zentralwerkstatt/HUJI
!pip install lap
sys.path.append('HUJI/lib/')

## Imports

We are using the `BeautifulSoup` library to find specific tags on websites and the `requests` library to download, i.e. "request", websites.

In [9]:
import sys
sys.path.append('lib/')
from vc_toolbox import *

from numpy.random import choice as random_choice
from bs4 import BeautifulSoup
import requests
import csv
from shutil import move as movefile

## Our toy dataset: WGA-small (2200 samples)

This is a (for ML) tiny dataset, scraped from the Web Gallery of Art and consisting of 2x1100 high-quality images in two classes: "portrait" and "landscape" paintings. 1000 images of each class are reserved for *training*, 100 images of each class are reserved for *validating and testing* our machine learning classifier. It is available as part of the workshop repository in the `wga` folder. The code below presents randomly picked samples from the dataset.

In [None]:
folder = 'wga' # Relative path
img_files = get_all_files(folder, extension='.jpg')
print(f'{len(img_files)} files found')
random_img_files = random_choice(img_files, 1)
for img_file in random_img_files:
    show_img(img_file)

## Scraping a dataset: MoMA example 

The collection of the Museum of Modern Art (MoMA) in New York City consists of almost 200,000 works, 81,000 of which are available online. Some datasets are harder to scrape than others; the MoMA website is a particularly easy example. Generally, the process is always the same: inspect the URL and source code of the website to see how it presents a single work/image file, then automate this process.

![](img/moma-back.jpg)

![](img/moma-front.jpg)
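
To make the "inspect, then automate" step concrete, here is a minimal offline sketch of what looking for a specific image tag amounts to. It uses Python's built-in `html.parser` as a stand-in so that it runs without network access or extra packages; the notebook itself uses `BeautifulSoup` for this. The HTML snippet, the `ImgSrcCollector` class, and the `/media/example.jpg?sha=0123abcd` path are made up for illustration; only the class name `picture__img--scale--focusable` is taken from the scraping code further down.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attrs = dict(attrs)
        # A tag can carry several space-separated classes
        classes = (attrs.get('class') or '').split()
        if self.css_class in classes and attrs.get('src'):
            self.sources.append(attrs['src'])


# Hypothetical page source, standing in for what "view source" would show
SAMPLE_HTML = '''
<html><body>
  <img class="logo" src="/logo.png">
  <img class="picture__img--scale--focusable" src="/media/example.jpg?sha=0123abcd">
</body></html>
'''

collector = ImgSrcCollector('picture__img--scale--focusable')
collector.feed(SAMPLE_HTML)
print(collector.sources)
```

Only the second image matches, because the lookup is keyed on the class that marks the main artwork image rather than on the `img` tag alone.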

We need a directory to save the downloaded images.

In [None]:
folder = 'moma'
total_pages = 1000000 # Unclear what the limit is, getting 404s/no images is "cheap" enough though to brute-force
new_dir(folder)

A function to save an image file from a direct image URL, specific to the MoMA website. This version also has the option to save some metadata into a CSV file.

In [None]:
def save(url, meta=None):
    data = requests.get(url).content
    name = url.split('?sha=')[-1] # SHA as name
    file = f'{folder}/{name}.jpg'
    with open(file, 'wb') as f:
        f.write(data)
    if meta:
        meta.append(file) # Also write the name of the local file
        with open('meta.csv', 'a', newline='', encoding='utf-8') as f: # newline='' avoids blank rows on Windows
            writer = csv.writer(f)
            writer.writerow(meta)
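
The local file name comes from the `sha` part of the image URL. As a sketch of that one step in isolation, the helper below extracts it with `urllib.parse` instead of a string split. `name_from_url`, its `fallback` parameter, and the example URLs are hypothetical and only serve to illustrate the idea; they are not part of the notebook or the MoMA website.

```python
from urllib.parse import urlparse, parse_qs


def name_from_url(url, fallback='unnamed'):
    """Return the value of the 'sha' query parameter, or a fallback name."""
    query = parse_qs(urlparse(url).query)
    values = query.get('sha')
    return values[0] if values else fallback


# Hypothetical URLs in the shape the save function expects
with_sha = 'https://www.moma.org/media/example.jpg?sha=0123abcd'
without_sha = 'https://www.moma.org/media/example.jpg'

print(name_from_url(with_sha))
print(name_from_url(without_sha))
```

The fallback matters because `url.split('?sha=')[-1]` returns the entire URL when no `sha` parameter is present, and a full URL is not a usable file name.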

A function to process one page on the MoMA website.

In [None]:
def process_page(page):
    url = f'https://www.moma.org/collection/works/{page}'  
        
    response = requests.get(url)
    if response.status_code == 200: # If we get a positive response from the server...
        
        soup = BeautifulSoup(response.content, 'html.parser') # Parse the page
        imgs = soup.find_all('img', 'picture__img--scale--focusable') # Find a specific class of the img tag
        
        # Find the metadata on the page
        # We know that it is the second 'meta' tag with the name 'stitle' that we want
        # We also know the format is 'author. work. year.' so we can split by '. '
        meta = soup.find_all('meta', {'name':'stitle'})
        content = None # Stays None if the page has no usable metadata
        if len(meta) > 1 and meta[1].get('content'):
            content = meta[1].get('content').split('. ')
            if len(content) > 3: # The year itself contained a '. ', so glue it back together
                content = [content[0], content[1], content[2]+'. '+content[3]]
              
        if imgs:
            src = imgs[0].get('src') # Get the URL of the first found image
            save(f'https://www.moma.org{src}', meta=content) # Save the image
            return True # Only return true if image was downloaded
        else:
            print(f'No image links on page {page}')
            return False
        
    else: 
        print(f'Response {response.status_code} for page {page}')
        return False
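
The metadata step relies on the 'author. work. year.' convention described in the comments above. The sketch below isolates that splitting logic so it can be tried without downloading anything. `split_stitle` is a hypothetical helper, and the example strings are invented rather than taken from the MoMA website. Unlike the notebook code, it glues all surplus pieces back into the last field, not just the fourth one.

```python
def split_stitle(stitle):
    """Split an 'author. work. year' string into at most three fields."""
    parts = stitle.split('. ')
    if len(parts) > 3:
        # Anything after the second separator belongs to the last field
        parts = [parts[0], parts[1], '. '.join(parts[2:])]
    return parts


# Invented examples: a plain year, and a year that itself contains '. '
print(split_stitle('Jane Doe. Untitled. 1950'))
print(split_stitle('Jane Doe. Untitled. c. 1950'))
```

Titles or artist names that contain '. ' themselves would still be split in the wrong place, so the resulting CSV is best treated as a rough first pass.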

Start scraping!

In [None]:
# Every work has a unique page number, starting (for some reason) with 200000 - found by trial and error
n = 0
for page in range(200000, total_pages):
    n += process_page(page) # Keep track of nr. of downloaded images
    if page % 20 == 0: # Print status every 20 pages
        print(f'{n} images downloaded so far...')

Please note that this process can be sped up massively by using multiple threads to download the images. An implementation is provided as a [Python script](moma-scraper.py).
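
As a rough idea of what a threaded version could look like, the sketch below distributes page numbers over a thread pool from the standard library. It is not the implementation in `moma-scraper.py`, which may work differently. `scrape_pages` and `fake_process_page` are hypothetical names; the fake worker stands in for `process_page` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_pages(pages, worker, max_workers=8):
    """Run worker(page) on a thread pool and count the truthy results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(bool(ok) for ok in pool.map(worker, pages))


def fake_process_page(page):
    # Hypothetical stand-in for process_page: pretend that only
    # even-numbered pages have a downloadable image
    return page % 2 == 0


n = scrape_pages(range(200000, 200010), fake_process_page)
print(f'{n} images downloaded')
```

With the real `process_page` as the worker, several threads would append to `meta.csv` at the same time, so the write in `save` would need to be protected, for example with a `threading.Lock`.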

## Cleaning our dataset

There is always the chance that we will end up with data that is at least partially corrupted. It is thus a good idea to check the data before we attempt to do anything else with it (like feeding it to a machine learning classifier, for instance).

In [None]:
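import PIL.Image # Explicit import in case vc_toolbox does not expose PIL

files = get_all_files(folder)
new_dir('rejects')
total = len(files)
removed = 0
for file in files:
    try: 
        with PIL.Image.open(file) as img: # If PIL can't open it we don't want it
            img.verify() # Let PIL check the file for damage without decoding it
    except Exception: # Do not swallow KeyboardInterrupt like a bare except would
        movefile(file, 'rejects')
        removed += 1
print(f'{total} found, {total-removed} kept')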
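
A common failure mode when brute-forcing URLs is that an error page gets saved under a `.jpg` name. As a cheap first filter, and only as a complement to the PIL check above, one can look at the first bytes of each file: JPEG files start with the bytes `FF D8 FF`. The sketch below assumes that all scraped images are JPEGs; `looks_like_jpeg` is a hypothetical helper, and the two test files are created on the fly in a temporary folder.

```python
import tempfile
from pathlib import Path

JPEG_MAGIC = b'\xff\xd8\xff'  # First three bytes of every JPEG file


def looks_like_jpeg(path):
    """Return True if the file starts with the JPEG signature."""
    with open(path, 'rb') as f:
        return f.read(3) == JPEG_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'good.jpg'
    bad = Path(tmp) / 'bad.jpg'
    good.write_bytes(JPEG_MAGIC + b'\xe0' + bytes(16))  # Signature plus filler bytes
    bad.write_bytes(b'<html>404 Not Found</html>')  # An error page saved as .jpg
    results = {p.name: looks_like_jpeg(p) for p in (good, bad)}

print(results)
```

A file that passes this check can still be truncated or otherwise damaged, which is why it does not replace opening the file with PIL.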

## <font color='red'>Exercises</font>

Find an online image dataset that you would like to scrape. It does not have to be gigantic, nor does it have to be high-quality. Start by looking at the page source to see if the scraping could be automated. Look for possible tags to find images. Finally, try to adapt the script above for your dataset.