# MoMA Online Archive Scrape

![img](img/banner.png)

This project uses Python's `urllib` and the `BeautifulSoup` library to scrape data from the Museum of Modern Art's [online collection](https://www.moma.org/collection/works?&with_images=1&on_view=1).

## Structure of Response Data

The HTML section we want to extract data from has the following structure:

```
<li> - class "grid-item--work"
└── <a> tag - link to artwork page
    ├── <div> 
    │   └── <picture> tag
    │       └── <img> tag
    └── <div> with artwork data
        └── <h3>
            ├── <span> - artist name
            ├── <span> - artwork name
            └── <span> - artwork date
```

The goal is to pull the artist name, artwork name, and artwork date from the three `span` tags inside the `h3` tag.
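
To make the tree above concrete, here is a minimal sketch that parses a hand-written fragment with the same shape. The fragment, its `href`, and the image name are placeholders invented for illustration; the real MoMA markup may carry extra attributes or wrappers.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mirrors the tree above; real MoMA markup may differ.
SAMPLE_HTML = """
<li class="grid-item--work">
  <a href="/collection/works/0000">
    <div><picture><img src="thumb.jpg"></picture></div>
    <div>
      <h3>
        <span>Antoni Gaudí</span>
        <span><em>Prayer Bench</em></span>
        <span>1898</span>
      </h3>
    </div>
  </a>
</li>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

works = []
for item in soup.find_all("li", class_="grid-item--work"):
    # The three <span> tags inside <h3> hold artist, title, and date, in that order.
    spans = item.find("h3").find_all("span")
    artist, title, date = (span.get_text(strip=True) for span in spans)
    works.append({"artist": artist, "title": title, "date": date})

print(works)
# [{'artist': 'Antoni Gaudí', 'title': 'Prayer Bench', 'date': '1898'}]
```

`get_text(strip=True)` is used here because it reads through a nested `<em>` and trims the surrounding whitespace in one step.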

In [8]:
# Each artist's archive has an associated ID.
# Create a dictionary that contains a mapping between the artist name and their ID
ARTISTS = {
    "le corbusier": {
        "name": "Le Corbusier",
        "id": 3426
    },
    "mies van der rohe": {
        "name": "Ludwig Mies van der Rohe",
        "id": 7166
    },
    "antoni gaudí": {
        "name": "Antoni Gaudí",
        "id": 2096
    },
}
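
Each ID slots into the artist page URL that the scraper requests (`https://www.moma.org/artists/{id}?locale=en&page=1`). The sketch below shows that mapping on its own; `artist_url` and its `page` parameter are hypothetical names introduced for illustration, and the project itself always requests page 1.

```python
# Subset of the ARTISTS mapping above, repeated so the snippet runs on its own.
ARTISTS = {
    "le corbusier": {"name": "Le Corbusier", "id": 3426},
    "antoni gaudí": {"name": "Antoni Gaudí", "id": 2096},
}

def artist_url(key, page=1):
    """Return the MoMA artist page URL for a key in ARTISTS (hypothetical helper)."""
    artist_id = ARTISTS[key]["id"]
    return f"https://www.moma.org/artists/{artist_id}?locale=en&page={page}"

print(artist_url("antoni gaudí"))
# https://www.moma.org/artists/2096?locale=en&page=1
```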

In [9]:
import urllib.request  # `import urllib` alone does not load the `request` submodule
import re
from bs4 import BeautifulSoup
import numpy as np

class ExtraSoup(BeautifulSoup):
    """Child class of BeautifulSoup with additional methods"""
    
    # --- HTML response retrieval helper functions ---
    
    def get_artist_info(self, artist_info):
        """Initialize artist_name and artist_id attributes"""
        self.artist_name = artist_info['name']
        self.artist_id = artist_info['id']
    
    def build_url(self, query):
        """Build the URL of an artist's MoMA page; `query` is the artist ID"""
        return f'https://www.moma.org/artists/{query}?locale=en&page=1'

    def get_url_response(self, url):
        """Get HTML response from url"""
        req = urllib.request.Request(url)
        with urllib.request.urlopen(req) as response:
            return response.read()

    def make_soup(self):
        """Create a BeautifulSoup object from the HTML response"""
        self.html = BeautifulSoup(
            self.get_url_response(self.build_url(self.artist_id)),
            'html.parser'
        )
        
    # --- HTML parsing helper functions ---
    
    def get_artwork_year(self, string):
        """Extract the year from a date string.
        Year data may come in various forms like "1932-34", "1923-1924", "c.1934".
        Not a perfect solution, but in this case take the first four-digit run
        (i.e. "1923-1924" yields 1923). Returns None if no year is found.
        
        Reference: https://developers.google.com/edu/python/regular-expressions
        """
        # Guard against missing strings and strings without a year,
        # where calling .group() on a failed match would raise AttributeError
        match = re.search(r'\d{4}', string) if string else None
        return int(match.group(0)) if match else None

    def validate_artist(self, metadata):
        """Validate whether artwork was made by artist we are querying for,
        since search results may return a variety of artists.
        """
        if metadata:
            if self.artist_name.lower() in metadata.lower():
                return True
        return False

    def validate_string(self, metadata):
        """Validate whether string exists and return a cleaned string"""
        # Some titles are wrapped in an <em> tag inside span
        if metadata.find("em"):
            return metadata.find("em").string
        elif metadata.string is not None:
            # Span tag contents are listed with lots of spaces and newlines
            return metadata.string.strip()
        return None

    def get_artwork_metadata(self):
        """Parse through HTML tags and extract artwork metadata"""
        artist_works = self.html.find_all('li', 'grid-item--work')
        self.artworks = []
        for artwork in artist_works:
            # get span tags that contain artwork data
            artwork_metadata = artwork.find_all("span")
            if len(artwork_metadata) < 3:
                # skip items that lack the artist/title/date spans
                continue
            artist = self.validate_string(artwork_metadata[0])
            if self.validate_artist(artist):
                # only keep works by the queried artist, so no empty
                # dictionaries end up in self.artworks
                self.artworks.append({
                    'artist': artist,
                    'title': self.validate_string(artwork_metadata[1]),
                    'year': self.get_artwork_year(
                        self.validate_string(artwork_metadata[2])
                    ),
                })
    
    # --- Functions for analysis ---
    
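    def get_artwork_year_median(self):
        """Get the median year of all the artworks that have a year"""
        # Skip entries without a usable year so the median does not fail on them
        years = [work['year'] for work in self.artworks
                 if work.get('year') is not None]
        return int(np.median(years))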
    
    def get_artwork_count(self):
        """Get a count of how many artworks were found in the online collection"""
        return len(self.artworks)
    
    # TODO: Add export functions (to JSON, CSV)
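
The year rule described in `get_artwork_year` (take the first four-digit run) can be tried in isolation. This is a standalone sketch rather than the class method: `first_year` is a hypothetical name, the `"n.d."` input is an invented example of an undated string, and returning `None` for it is a choice made for this sketch.

```python
import re

def first_year(text):
    """Return the first four-digit run in text as an int, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return int(match.group(0)) if match else None

samples = ["1932-34", "1923-1924", "c.1934", "n.d."]
results = {raw: first_year(raw) for raw in samples}
print(results)
# {'1932-34': 1932, '1923-1924': 1923, 'c.1934': 1934, 'n.d.': None}
```

Note that `\d{4}` matches any four consecutive digits, so a longer digit run (for example an accession number) would also produce a "year"; a stricter pattern may be worth considering if such strings appear.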

## Le Corbusier

<img src="img/cb_savoye.jpg" width="500"/>

In [10]:
le_corbusier = ExtraSoup()
le_corbusier.get_artist_info(ARTISTS['le corbusier'])
le_corbusier.make_soup()
le_corbusier.get_artwork_metadata()
le_corbusier.artworks

[{'artist': 'Le Corbusier (Charles-Édouard Jeanneret)',
  'title': 'Bol, Pipes, et Papiers Enroulés',
  'year': 1919},
 {'artist': 'Le Corbusier (Charles-Édouard Jeanneret)',
  'title': 'Still Life',
  'year': 1920},
 {'artist': 'Le Corbusier (Charles-Édouard Jeanneret)',
  'title': "L'Esprit Nouveau letterhead (Letter to Naum Gabo from Paul Dermée)",
  'year': 1920},
 {'artist': 'Le Corbusier (Charles-Édouard Jeanneret)',
  'title': 'Still Life with Objects',
  'year': 1923},
 {'artist': 'Alfred Roth, Le Corbusier (Charles-Édouard Jeanneret)',
  'title': 'Bed Frame',
  'year': 1927},
 {'artist': 'Le Corbusier (Charles-Édouard Jeanneret), Pierre Jeanneret',
  'title': 'Les Terrasses, Villa Stein-de-Monzie, Garches, France',
  'year': 1926},
 {'artist': 'Le Corbusier (Charles-Édouard Jeanneret), Pierre Jeanneret, Charlotte Perriand',
  'title': 'Chaise Longue (LC/4)',
  'year': 1928},
 {'artist': 'Le Corbusier (Charles-Édouard Jeanneret), Pierre Jeanneret, Charlotte Perriand',
  'title'

## Ludwig Mies van der Rohe

<img src="img/mies_barcelona.jpg" width="500"/>

In [11]:
mies_van_der_rohe = ExtraSoup()
mies_van_der_rohe.get_artist_info(ARTISTS['mies van der rohe'])
mies_van_der_rohe.make_soup()
mies_van_der_rohe.get_artwork_metadata()
mies_van_der_rohe.artworks

[{'artist': 'Ludwig Mies van der Rohe',
  'title': 'Bismarck Monument Project, Bingen, Germany, Elevation',
  'year': 1910},
 {'artist': 'Ludwig Mies van der Rohe',
  'title': 'Bismarck Monument Project, Bingen, Germany (Perspective)',
  'year': 1910},
 {'artist': 'Ludwig Mies van der Rohe',
  'title': 'Bismarck Monument, project, Bingen, Germany, Perspective view of courtyard',
  'year': 1910},
 {'artist': 'Ludwig Mies van der Rohe',
  'title': 'Werner House, Berlin-Zehlendorf,  Germany, Four elevations',
  'year': 1913},
 {'artist': 'Ludwig Mies van der Rohe',
  'title': 'Werner House, Berlin-Zehlendorf, Germany (Four elevations, section)',
  'year': 1913},
 {'artist': 'Ludwig Mies van der Rohe',
  'title': 'Werner House, Berlin-Zehlendorf, Germany, Four plans',
  'year': 1913},
 {'artist': 'Ludwig Mies van der Rohe',
  'title': 'Werner House, Berlin-Zehlendorf, Germany, Four elevations, section',
  'year': 1913},
 {'artist': 'Ludwig Mies van der Rohe',
  'title': 'Werner House, Berl

## Antoni Gaudí 

<img src="img/gaudi_sagrada.jpg" width="500"/>

In [12]:
antoni_gaudi = ExtraSoup()
antoni_gaudi.get_artist_info(ARTISTS['antoni gaudí'])
antoni_gaudi.make_soup()
antoni_gaudi.get_artwork_metadata()
antoni_gaudi.artworks

[{'artist': 'Antoni Gaudí',
  'title': 'Floor Tiles from the Casa Mila, Barcelona',
  'year': 1905},
 {'artist': 'Antoni Gaudí',
  'title': 'Grille from the Casa Milá (La Pedrera), Barcelona, Spain',
  'year': 1906},
 {'artist': 'Antoni Gaudí', 'title': 'Prayer Bench', 'year': 1898},
 {'artist': 'Antoni Gaudí',
  'title': 'Church of the Sagrada Familia, Barcelona, Spain (Model of a column)',
  'year': 1883},
 {'artist': 'Antoni Gaudí',
  'title': 'Church of the Sagrada Familia, Barcelona, Spain (Model of a nave window)',
  'year': 1883},
 {'artist': 'Antoni Gaudí', 'title': 'Side Chair', 'year': 1905},
 {'artist': 'Aurèlia Muñoz, Antoni Gaudí',
  'title': 'Study of a catenary arch for the Gaudí crypt at Colonia Güell',
  'year': 1996}]
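
A list of dictionaries like the one above maps directly onto JSON or CSV. The export functions noted as a to-do in the class could be sketched along these lines, using only the standard library; `to_json` and `to_csv` are hypothetical names, and the two records are copied from the output above.

```python
import csv
import io
import json

# Two records copied from the output above.
artworks = [
    {"artist": "Antoni Gaudí", "title": "Prayer Bench", "year": 1898},
    {"artist": "Antoni Gaudí", "title": "Side Chair", "year": 1905},
]

def to_json(records):
    """Serialize records to a JSON string, keeping accented characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)

def to_csv(records, fields=("artist", "title", "year")):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

print(to_json(artworks))
print(to_csv(artworks))
```

Both functions return strings, so saving to disk is a matter of writing the result to a file opened with `encoding="utf-8"`. `ensure_ascii=False` keeps names such as "Gaudí" unescaped in the JSON output.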

## Summary

In [13]:
def print_artwork_stats(artist):
    print(f'{artist.artist_name}: \n\
          1. Total artworks found in the collection: {artist.get_artwork_count()} \n\
          2. Median year of the artist\'s artworks in the collection: {artist.get_artwork_year_median()}')

In [14]:
artists = [le_corbusier, mies_van_der_rohe, antoni_gaudi]
for artist in artists:
    print_artwork_stats(artist)

Le Corbusier: 
          1. Total artworks found in the collection: 77 
          2. Median year of the artist's artworks in the collection: 1959
Ludwig Mies van der Rohe: 
          1. Total artworks found in the collection: 100 
          2. Median year of the artist's artworks in the collection: 1925
Antoni Gaudí: 
          1. Total artworks found in the collection: 7 
          2. Median year of the artist's artworks in the collection: 1905
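
As a sanity check, the Antoni Gaudí figures can be reproduced by hand from the seven years listed in that artist's output earlier in this document. The sketch uses the standard library's `statistics` module instead of `numpy`, which is a substitution made only to keep the snippet dependency-free.

```python
from statistics import mean, median

# Years copied from the Antoni Gaudí output earlier in this document.
gaudi_years = [1905, 1906, 1898, 1883, 1883, 1905, 1996]

print(len(gaudi_years))          # 7
print(sorted(gaudi_years))       # [1883, 1883, 1898, 1905, 1905, 1906, 1996]
print(int(median(gaudi_years)))  # 1905
print(round(mean(gaudi_years)))  # 1911
```

The 1996 entry, a later study credited jointly with Aurèlia Muñoz, pulls the mean to roughly 1911 but leaves the median at 1905, which is one reason the median is a reasonable summary here.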
