# More Explorations with MoMA Scrape

![image](img/gallery.webp)

This notebook further explores web data scraped from the MoMA collection.

The [previous notebook](https://github.com/tnakatani/python2_ccac/blob/master/wk_8/moma_scrape/moma_scrape.ipynb) explored scraped data of specific artists.  This time, the script was revised to randomly sample artworks from the collection in order to explore trends in the larger collection.

The [moma_scrape.py](moma_scrape.py) script created an artwork dataset with the following steps:
1. NumPy's [`randint`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html?highlight=randint#numpy.random.randint) method creates a list of random integers.
2. The `ArtworkSoup` class contains a method, `scrape`, which takes the integers as a parameter.  For each integer, the class builds a URL and makes an HTTP request.  Both successful and unsuccessful HTTP calls are logged in a separate `scrape.log` file.
3. If the HTTP request is successful, the class instantiates a `BeautifulSoup` object, extracts relevant artwork data from the HTML and builds a data structure from it.
4. Once all HTTP calls are made, the resulting data structure is dumped to a CSV file.
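The four steps above can be sketched roughly as follows. This is not the actual `ArtworkSoup` implementation: the URL pattern, the function names, and the `fetch`/`parse` callables are all assumptions made for illustration. The standard library's `random` and `csv` stand in for NumPy and the real CSV dump, and stub functions stand in for the HTTP client and the `BeautifulSoup` parsing so the sketch runs offline.

```python
import csv
import io
import logging
import random

# Assumed URL pattern; the notebook only says a URL is built from the integer.
BASE_URL = "https://www.moma.org/collection/works/{}"
logger = logging.getLogger("moma_sketch")  # hypothetical logger name


def sample_ids(n, low, high, seed=None):
    """Step 1: draw n random integers in [low, high), with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(low, high) for _ in range(n)]


def scrape(ids, fetch, parse):
    """Steps 2-3: request each URL, log the status, parse successful pages.

    fetch(url) must return (status_code, html); parse(html) must return a
    dict of artwork fields. Both are injected so they can be swapped out.
    """
    rows = []
    for artwork_id in ids:
        url = BASE_URL.format(artwork_id)
        status, html = fetch(url)
        logger.info("%s %s", status, url)  # successes and failures alike
        if status == 200:
            rows.append({"id": artwork_id, **parse(html)})
    return rows


def write_csv(rows, fileobj):
    """Step 4: dump the records to CSV; fields missing from a row stay blank."""
    # Union of all keys in first-seen order, since records carry different fields.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def fake_fetch(url):
    """Stub HTTP client: even IDs "exist", odd IDs return 404."""
    artwork_id = int(url.rsplit("/", 1)[1])
    if artwork_id % 2 == 0:
        return 200, f"<h1>Work {artwork_id}</h1>"
    return 404, ""


def fake_parse(html):
    """Stub parser: pull the text out of the <h1> tag of the stub page."""
    return {"title": html[len("<h1>"):-len("</h1>")]}


rows = scrape([2, 3, 4], fake_fetch, fake_parse)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Passing `fetch` and `parse` in as arguments keeps the control flow separate from the network and HTML details, which is also what makes the loop easy to try out without hitting the MoMA site.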

One flaw in this method: a random integer is not guaranteed to correspond to a valid artwork page.  It was not clear to me how the MoMA site assigns artwork URL IDs, so I requested a large sample of 30,000 random integers to see how many valid responses I would receive.  Using the scraping logs, I calculated the rate of 404 errors:

```python
with open('scrape.log', 'r') as f:
    count = 0      # total log lines (assumes one line per HTTP request)
    count_404 = 0  # lines that record a 404 response
    for line in f:
        count += 1
        if 'INFO:root:404' in line:
            count_404 += 1

print(f'Total HTTP requests resulting in 404 response: {count_404}')
print(f'Total HTTP requests: {count}')
print(f'Share of requests resulting in 404: {count_404/count*100:.2f}%')

# Total HTTP requests resulting in 404 response: 17567
# Total HTTP requests: 30000
# Share of requests resulting in 404: 58.56%
```

Thus, more than half of the 30,000 requests resulted in a 404 error!
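The same calculation can be generalized to tally every status code rather than only 404s. The sketch below assumes each request writes one log line containing `INFO:root:<status>`, which is only known to hold for the 404 lines; the helper names and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumption: every request logs "INFO:root:<status>", as the 404 lines do.
# Lines in any other shape are ignored.
STATUS_RE = re.compile(r"INFO:root:(\d{3})")


def tally_statuses(lines):
    """Count the HTTP status codes found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def share(counts, status):
    """Percentage of tallied requests that returned the given status."""
    total = sum(counts.values())
    return 100 * counts[status] / total if total else 0.0


# Hypothetical log lines, for illustration only.
sample_lines = [
    "INFO:root:404",
    "INFO:root:200",
    "INFO:root:404",
    "WARNING:root:timeout",
]
counts = tally_statuses(sample_lines)
print(counts, f"{share(counts, '404'):.2f}%")
```

Against the real log, `tally_statuses(open('scrape.log'))` would give the per-status counts in one pass, and `share` avoids a division by zero when the log is empty.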

```python
import pandas as pd
import numpy as np

df = pd.read_csv('artwork_data.csv', index_col=0)
df.head()
```

|  | id | artist | title | date | manufacturer | medium | dimensions | credit | object number | department | ... | delineator | associated work | type | periodical | architectural firm | fabricator | producer | design firm | designer | editor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1166 | Bruno Munari | Maldive Tray | 1960 | danese s.r.l., italy | silver | 1 3/4 x 5 3/4 x 5 3/4" (4.5 x 14.6 x 14.6 cm) | gift of the manufacturer | 2272.2001 | architecture and design | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 6060 | Joe Scorsone | Moholy Nagy Unpublished Images and Documents | 1974 |  | offset lithograph | 17 x 22" (43.2 x 55.8 cm) | gift of the designer | 282.1980 | architecture and design | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 4242 | IBM Corporation | Ferrite Memory Core | c. 1955 | ibm, east fishkill, ny | copper wiring, ferrite, and plastic | 10 3/8 x 10 3/8" (26.4 x 26.4 cm) | gift of the manufacturer | 497.1990.1 | architecture and design | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 27282 | Fernand Léger | Plate (page 26) fromCirque(Circus) | 1950 |  | one from an illustrated book with eighty-three... | composition (irreg.): 14 1/16 × 9 15/16" (35.7... | the louis e. stern collection | 890.1964.23 | drawings and prints | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 7248 | Leonetto Cappiello | Chocolat Frigor | 1929 |  | lithograph | 50 5/8 x 35 3/4" (128.5 x 90.6 cm) | acquired by exchange | 493.1983 | architecture and design | ... |  |  |  |  |  |  |  |  |  |  |


```python
df.keys()
```

```
Index(['id', 'artist', 'title', 'date', 'manufacturer', 'medium', 'dimensions',
       'credit', 'object number', 'department', 'author', 'publisher',
       'printer', 'edition', 'copyright', 'illustrated book', 'portfolio',
       'collaborating artist', 'model maker', 'state/variant', 'impresssion',
       'delineator', 'associated work', 'type', 'periodical',
       'architectural firm', 'fabricator', 'producer', 'design firm',
       'designer', 'editor'],
      dtype='object')
```

```python
cols = ['artist', 'title', 'medium', 'credit']
for c in cols:
    print(c.upper())
    print(df[c].value_counts().head(10), '\n')
```

```
ARTIST
Pablo Picasso               675
Thomas Bewick               311
Joan Miró                   291
Pierre Bonnard              210
Sol LeWitt                  200
Unknown Artist              164
E. McKnight Kauffer         146
Fernand Léger               120
Henri Matisse               120
Ludwig Mies van der Rohe    118
Name: artist, dtype: int64 

TITLE
Vase                                                                          92
Bowl                                                                          77
Untitled fromFound Masks 1975–1978                                            37
Fabric Sample                                                                 29
Untitled fromTwenty-Four Personal Attitudes Related to Closed Spaces          28
The Manhattan Transcripts Project, New York, New York, Episode 1: The Park    24
Armchair                                                                      23
Plate fromLe Surréalisme en 1947                                       
```

```python
df['artist'].value_counts().head()
```

```
Pablo Picasso     675
Thomas Bewick     311
Joan Miró         291
Pierre Bonnard    210
Sol LeWitt        200
Name: artist, dtype: int64
```

```python
df['title'].value_counts().head()
```

```
Vase                                                                    92
Bowl                                                                    77
Untitled fromFound Masks 1975–1978                                      37
Fabric Sample                                                           29
Untitled fromTwenty-Four Personal Attitudes Related to Closed Spaces    28
Name: title, dtype: int64
```