*This notebook scrapes the images of the paintings in MoMA's online collection. For this project I use the collection of 2,000+ paintings to generate my own modern art.*

In [None]:
from bs4 import BeautifulSoup
import requests
from fake_useragent import UserAgent
import time, os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
import pandas as pd

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

*The URL here links to all the paintings in the online collection. Clicking the 'Show more results' button changes the URL, and the page it leads to no longer includes the HTML for the first 40 images, so the first 40 have to be scraped separately.*

In [None]:
url = 'https://www.moma.org/collection/?utf8=%E2%9C%93&q=&classifications=9&date_begin=Pre-1850&date_end=2020'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

In [None]:
os.makedirs('moma_plus', exist_ok=True) # make sure the output folder exists before saving into it
grid = soup.find(class_='grid') # grid contains all the images
lis = grid.find_all('li') # each image is stored in an li within the grid, along with its title and artist
for li in lis:
    # the h3 contains artist, title and year, each in a separate span; collect the three parts in a list
    title = [span.text for span in li.find('h3').find_all('span')]
    name = '_'.join([re.sub(' ', '-', bit.strip()) for bit in title])+'.jpeg' # join the list into a string and give it a .jpeg extension
    name = re.sub('/', '-', name) # a slash would be read as a folder separator and break the save path
    try:
        href = li.find('picture').find('img')['src']
        response = requests.get('https://www.moma.org'+href)
        with open(f"moma_plus/{name}", "wb") as file: # the with block closes the file even if the write fails
            file.write(response.content)
    except Exception:
        print(name) # report any work whose image could not be found or saved
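
*The naming rule used above can be pulled out into a small helper, which makes it easier to check on its own. This is only a sketch: `make_filename` is a name I made up for illustration, and the input below is an illustrative example rather than text copied from the MoMA page.*

```python
def make_filename(parts, extension=".jpeg"):
    """Join artist, title and year into one file name that is safe to save.

    Hypothetical helper mirroring the loop above: spaces become hyphens,
    the parts are joined with underscores, and slashes are replaced so
    they are not read as folder separators.
    """
    cleaned = [part.strip().replace(" ", "-") for part in parts]
    name = "_".join(cleaned) + extension
    return name.replace("/", "-")


print(make_filename(["Vincent van Gogh", "The Starry Night", "1889"]))
```

*Keeping the rule in one place also means both download loops in this notebook would name files the same way.*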

In [None]:
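# No Selenium driver has been started at this point in the notebook, so only quit one if it exists
if 'driver' in globals():
    driver.quit()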

*The above code only obtains the initial 40 images. To get the other 2,000+, I use the code below. Selenium is needed to click the button repeatedly until all the images are loaded onto the page.*

In [None]:
url = 'https://www.moma.org/collection/?utf8=%E2%9C%93&q=&classifications=9&date_begin=Pre-1850&date_end=2020&page=2&direction=fwd'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
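
*The two URLs used in this notebook share the same query string; the second one only adds `page=2` and `direction=fwd`. As a sketch, both can be built from one dictionary of parameters with the standard library. The function name `collection_url` is my own, and the parameter values are simply the ones that appear in the URLs above.*

```python
from urllib.parse import urlencode

BASE = "https://www.moma.org/collection/"


def collection_url(page=None):
    """Build the collection search URL; page=None gives the first page."""
    params = {
        "utf8": "\u2713",           # check mark, encoded as %E2%9C%93
        "q": "",
        "classifications": 9,       # value taken from the URLs above
        "date_begin": "Pre-1850",
        "date_end": 2020,
    }
    if page is not None:
        # these two parameters only appear in the 'Show more results' URL
        params["page"] = page
        params["direction"] = "fwd"
    return BASE + "?" + urlencode(params)


print(collection_url())
print(collection_url(page=2))
```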

In [None]:
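import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
while True:
    try: # every time the button is clicked, another 40 or so images are loaded
        driver.find_element(By.XPATH, '//*[contains(text(), "Show more results")]').click()
    except NoSuchElementException:
        break # the button can no longer be found, so stop clicking
    time.sleep(8) # the 8 seconds of sleep allow the page to load before trying to click again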
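
*A click loop like this is easier to trust with a hard cap on the number of clicks. Below is a minimal sketch of that idea, written against a hypothetical `click_once` callable so it can be tried without a browser. The helper name, the default of 100 clicks and the `sleep` argument are all my own assumptions, not part of Selenium.*

```python
import time


def click_until_done(click_once, pause=8.0, max_clicks=100, sleep=time.sleep):
    """Call click_once() until it returns False or max_clicks is reached.

    click_once is a hypothetical callable: it should click the button and
    return True, or return False once the button can no longer be found.
    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        if not click_once():
            break
        clicks += 1
        sleep(pause)  # give the page time to load the next batch
    return clicks


# Stand-in for the real button: it can be clicked three times, then it is gone
answers = iter([True, True, True, False])
print(click_until_done(lambda: next(answers), pause=0))
```

*With roughly 40 images per click and 2,000+ paintings, the loop needs somewhere over 50 clicks, so the cap should sit comfortably above that.*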

*Once all images are loaded, I parse the page source with BeautifulSoup and follow the same process as before.*

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
driver.quit()

In [None]:
os.makedirs('moma_plus', exist_ok=True) # make sure the output folder exists before saving into it
grid = soup.find(class_='grid') # grid contains all the images
lis = grid.find_all('li') # each image is stored in an li within the grid, along with its title and artist
for li in lis:
    # the h3 contains artist, title and year, each in a separate span; collect the three parts in a list
    title = [span.text for span in li.find('h3').find_all('span')]
    name = '_'.join([re.sub(' ', '-', bit.strip()) for bit in title])+'.jpeg' # join the list into a string and give it a .jpeg extension
    name = re.sub('/', '-', name) # a slash would be read as a folder separator and break the save path
    try:
        href = li.find('picture').find('img')['src']
        response = requests.get('https://www.moma.org'+href)
        with open(f"moma_plus/{name}", "wb") as file: # the image is saved according to its official name
            file.write(response.content)
    except Exception:
        print(name) # report any work whose image could not be found or saved
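
*Because each file is named after its artist, title and year, two works that share all three would be written to the same path, and the later one would replace the earlier one. One possible way around this, sketched here with a made-up helper called `unique_path`, is to add a counter to the name when the file already exists.*

```python
from pathlib import Path


def unique_path(folder, name):
    """Return a path inside folder that does not exist yet.

    Hypothetical helper: if name is already taken, a counter is inserted
    before the extension (a.jpeg, a-1.jpeg, a-2.jpeg, ...).
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    stem, suffix = Path(name).stem, Path(name).suffix
    candidate = folder / name
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem}-{counter}{suffix}"
        counter += 1
    return candidate
```

*In the loop above, `open(f"moma_plus/{name}", "wb")` could then become `open(unique_path("moma_plus", name), "wb")`.*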