DSC160 Data Science and the Arts - Twomey - Spring 2020 - [dsc160.roberttwomey.com](http://dsc160.roberttwomey.com)

## Example Scraping Code

This example uses the Beautiful Soup library and other Python modules to scrape a set of paintings from WikiArt. The notebook downloads the work of abstract expressionist Lee Krasner. For Exercise 1, you will work with the paintings of Mark Rothko.

First, we import the necessary libraries.

In [1]:
from bs4 import BeautifulSoup
import os
import requests

Set up our data paths and URL templates.

In [2]:
DATA_DIR = '../data/'
ARTIST_URL = 'https://www.wikiart.org/en/{artist}/all-works/text-list'
PAINTING_URL = 'https://www.wikiart.org{painting_path}'

Set up file storage for downloaded images.

In [3]:
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

### Get list of paintings for artist

In [4]:
artist_name = 'lee-krasner'

In [5]:
url_query = ARTIST_URL.format(artist=artist_name)
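
As a quick check of what the template produces, the sketch below fills in `ARTIST_URL` by hand. The template is copied from the setup cell. The `mark-rothko` slug is an assumption that the artist for Exercise 1 follows the same naming pattern as `lee-krasner`.

```python
# URL template copied from the setup cell above
ARTIST_URL = 'https://www.wikiart.org/en/{artist}/all-works/text-list'

# fill in the artist slug to get the list-of-works URL
url_query = ARTIST_URL.format(artist='lee-krasner')
print(url_query)

# assumed slug for Exercise 1; check it against the WikiArt site before use
rothko_query = ARTIST_URL.format(artist='mark-rothko')
print(rothko_query)
```

Printing the URL and opening it in a browser is an easy way to confirm the slug before sending any requests from code.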

In [6]:
artist_page = requests.get(url_query)

The request above fetches the WikiArt list-of-works page. Check the response for an HTTP error before parsing it.

In [None]:
# check for request error
try:
    artist_page.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("Error trying to retrieve {}".format(artist_page.url))
    raise e

In [None]:
soup = BeautifulSoup(artist_page.text, 'lxml')

Create the image storage directory for the artist, `../data/artist-name`, if it doesn't exist.

In [None]:
IMAGE_DIR = os.path.join(DATA_DIR, artist_name)
if not os.path.exists(IMAGE_DIR):
    os.makedirs(IMAGE_DIR)

Collect the path of every painting page linked from the list-of-works page.

In [None]:
painting_paths = []

# retrieve all rows in painting-list
for li in soup.find_all('li', {'class': 'painting-list-text-row'}):

    # retrieve all links in the current row
    for link in li.find_all('a'):
        href = link.get('href')
        # store in list
        painting_paths.append(href)

print(len(painting_paths))
# painting_paths

### Download Paintings

In [None]:
def download_and_save(painting_url):
    """Download each zoomable image on a painting page into IMAGE_DIR."""
    r_painting_page = requests.get(painting_url)
    soup = BeautifulSoup(r_painting_page.text, 'lxml')
    for img in soup.find_all('img', {'class': 'ms-zoom-cursor'}):
        img_url = img['src']
        # drop anything after '!' (e.g. a size suffix) from the image URL
        img_url = img_url.split('!')[0]
        # use the last path segment as the local filename
        filename = img_url.split('/')[-1]

        outfile = os.path.join(IMAGE_DIR, filename)
        # only download files that are not already on disk
        if not os.path.exists(outfile):
            print("downloading {}: {}".format(filename, img_url))
            r = requests.get(img_url)
            with open(outfile, 'wb') as f:
                f.write(r.content)
        else:
            #print("skipping {}".format(filename))
            pass
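
The two `split` calls in `download_and_save` are easier to follow on a concrete string. The sketch below runs them on a made-up image URL. The host, path, and `!Large.jpg` suffix are assumptions chosen for illustration, not values taken from WikiArt.

```python
# illustrative image URL; host, path, and '!Large.jpg' suffix are assumptions
img_src = 'https://uploads.example.org/images/lee-krasner/example-painting.jpg!Large.jpg'

# keep everything before the first '!'
img_url = img_src.split('!')[0]

# the last path segment becomes the local filename
filename = img_url.split('/')[-1]

print(img_url)
print(filename)

# a URL without '!' passes through the first split unchanged
plain_url = 'https://uploads.example.org/images/plain.jpg'
unchanged = plain_url.split('!')[0]
print(unchanged == plain_url)
```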

In [None]:
for path in painting_paths:
    painting_path = PAINTING_URL.format(painting_path=path)
    download_and_save(painting_path)
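
The loop above sends requests back to back. One optional variation, sketched below, is to pause between pages. The helper name `visit_all`, its parameters, and the one-second default are all hypothetical; the source does not specify a rate limit, so consult the WikiArt terms of use listed under Reference.

```python
import time

def visit_all(paths, visit, delay_seconds=1.0):
    """Call visit(path) for each path, sleeping between consecutive calls."""
    visited = 0
    for i, path in enumerate(paths):
        # no need to wait before the very first request
        if i > 0:
            time.sleep(delay_seconds)
        visit(path)
        visited += 1
    return visited

# stand-in for download_and_save so the demo makes no network requests
seen = []
count = visit_all(['/en/a', '/en/b'], seen.append, delay_seconds=0)
print(count, seen)
```

In the notebook, `visit` could be a small function that formats `PAINTING_URL` and then calls `download_and_save`.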

## Extensions
- Scrape painting metadata and store it along with the painting images as a text file on disk. These metadata will be useful for further analysis.
- Store URLs and all available metadata in a pandas dataframe. 
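
As a starting point for the first extension, here is a minimal sketch that writes one JSON text file per image. The helper `save_metadata`, the choice of JSON, and the field names in the example record are all hypothetical; which fields are actually available depends on what you scrape from each painting page.

```python
import json
import os
import tempfile

def save_metadata(image_dir, image_filename, metadata):
    """Write metadata as JSON next to the image, e.g. foo.jpg -> foo.json."""
    base, _ = os.path.splitext(image_filename)
    outfile = os.path.join(image_dir, base + '.json')
    with open(outfile, 'w', encoding='utf-8') as f:
        json.dump(metadata, f, indent=2, ensure_ascii=False)
    return outfile

# placeholder record; these field names and values are hypothetical
example = {
    'title': 'Example Painting',
    'artist': 'lee-krasner',
    'source_path': '/en/lee-krasner/example-painting',
}

# write into a temporary directory so the demo leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_metadata(tmp_dir, 'example-painting.jpg', example)
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

print(os.path.basename(path))
print(loaded == example)
```

For the second extension, a list of such dictionaries can be passed directly to `pandas.DataFrame` to get one row per painting.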

## Reference
- Lee Krasner biography: [https://www.biography.com/artist/lee-krasner](https://www.biography.com/artist/lee-krasner)
- Beautiful Soup documentation: [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- WikiArt terms of use: [https://www.wikiart.org/en/terms-of-use](https://www.wikiart.org/en/terms-of-use)