#Stage One - The Data Collection
#Assessing What Makes a Painting Display-Worthy?
#Predicting Likelihood of Painting Display at The Metropolitan Museum of Art, rather than being left in storage

This notebook collects painting metadata from The Met's open-access dataset and then enriches it with API data.

#Steps:
1. Download The Met Museum's full collection CSV from GitHub (roughly 485,000 objects in total, including paintings but also sculpture, prints, etc.)
2. Filter to paintings only using the classification column
3. Enrich this data with the API for fields not in the CSV itself (structured measurements, tags)

# Why this approach?

The Met Museum's search API does not support filtering by object classification as a parameter, so a query such as `q=*&classification=Paintings` returns unreliable results. The GitHub
CSV gives me the full collection with a reliable classification column to filter on.

#Data source: The Met Open Access on GitHub - https://github.com/metmuseum/openaccess
#API docs: The Met Collection API - https://metmuseum.github.io/
#Licence: Creative Commons Zero

#One - The Initial Setup

In [1]:
import requests
import pandas as pd
import json
import time
import os
from tqdm import tqdm  # progress bar — install with: pip install tqdm

BASE_URL = "https://collectionapi.metmuseum.org/public/collection/v1"

# Create data directories
os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/clean", exist_ok=True)

print(f"Working directory: {os.getcwd()}")
print(f"data/raw exists: {os.path.exists('data/raw')}")

print("Setup Is Now Complete.")

Working directory: /Users/rosswilson/Projects/ironhack/Final Project
data/raw exists: True
Setup Is Now Complete.


#Two - Downloading the Full Met Museum Collection as a CSV

Note that the Met publishes a regularly updated CSV of its entire collection on GitHub. This is much more
reliable than using the search endpoint for bulk retrieval of this many objects, and gives me immediate access
to most of the metadata needed for this project.

#Note - This file is very large at roughly 300MB, so it takes some time to download and then load. Saving intermediate results at each step guards against having to repeat slow work.
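
Because the file is large, one option is to stream it row by row and keep only the rows of interest, rather than loading everything into memory at once. The sketch below uses Python's built-in `csv` module and assumes the `Classification` and `Object ID` column names seen in this dataset. `iter_classification` is a hypothetical helper for illustration, not part of this notebook's pipeline, and the sample rows are made up.

```python
import csv
import io


def iter_classification(lines, wanted="Paintings"):
    """Yield rows (as dicts) whose Classification equals `wanted`.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.DictReader(lines)
    for row in reader:
        if row.get("Classification") == wanted:
            yield row


# Tiny in-memory stand-in for MetObjects.csv (values are made up)
sample = io.StringIO(
    "Object ID,Classification,Title\n"
    "1,Paintings,Example A\n"
    "2,Prints,Example B\n"
    "3,Paintings,Example C\n"
)
painting_ids = [row["Object ID"] for row in iter_classification(sample)]
print(painting_ids)
```

With the real file, the same function could be given a handle opened with `open(CSV_PATH, newline="", encoding="utf-8-sig")`. The encoding is an assumption; `utf-8-sig` simply tolerates a byte-order mark if one is present.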

In [2]:
CSV_URL = "https://media.githubusercontent.com/media/metmuseum/openaccess/master/MetObjects.csv"
CSV_PATH = "data/raw/MetObjects.csv"

if not os.path.exists(CSV_PATH):
    print("Downloading the Met Museum Objects")
    response = requests.get(CSV_URL, stream=True)
    response.raise_for_status()
    with open(CSV_PATH, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Downloaded to {CSV_PATH}")
else:
    print(f"Using cached file: {CSV_PATH}")

# Load the downloaded CSV
df_all = pd.read_csv(CSV_PATH, low_memory=False)
print(f"\nFull collection: {len(df_all):,} objects")
print(f"Columns: {len(df_all.columns)}")

Using cached file: data/raw/MetObjects.csv

Full collection: 484,956 objects
Columns: 54


#Three - Now Filter to Paintings (focus of the research project)

The CSV has a classification column that I use to filter directly.

In [3]:
# Now check all classification values that contain "paint" (case-insensitive)
paint_classes = df_all["Classification"].dropna().unique()
paint_related = [c for c in paint_classes if "paint" in c.lower()]
print("Classifications containing 'paint':")
for c in sorted(paint_related):
    count = len(df_all[df_all["Classification"] == c])
    print(f"  {c}: {count:,}")

Classifications containing 'paint':
  Aerophone-Organ|Paintings: 1
  Bark-Paintings: 351
  Barkcloth|Paintings|Textiles-Painted: 7
  Ceramics-Paintings: 6
  Chordophone and Aerophone-struck piano/ free reed keyboard|Paintings: 1
  Chordophone-Zither-plucked-harpsichord|Paintings: 4
  Chordophone-Zither-plucked-virginal|Paintings: 1
  Costumes-Printed and Painted: 6
  Drawings|Paintings: 4
  Drawings|Paintings|Photographs: 1
  Enamels-Painted: 175
  Fans|Paintings: 25
  Glass-Painted: 109
  Jewelry|Paintings: 1
  Metalwork-Gold and Platinum|Paintings: 2
  Metalwork-Silver In Combination|Paintings: 2
  Miscellaneous-Paintings: 76
  Miscellaneous-Paintings & Portraits: 9
  Musical instruments|Paintings: 1
  Painted Canvases: 4
  Paintings: 9,005
  Paintings-Canvas: 3
  Paintings-Decorative: 96
  Paintings-Decorative|Paintings|Woodwork: 3
  Paintings-Fresco: 15
  Paintings-Frescoes: 1
  Paintings-Icons: 2
  Paintings-Panels: 34
  Paintings|Drawings: 5
  Paintings|Fans: 2
  Paintings|Frames

In [4]:
# Filter to the main Paintings classification
# Adjust this filter if the cell above reveals other relevant classifications to include
df_paintings = df_all[df_all["Classification"] == "Paintings"].copy()
print(f"Paintings found: {len(df_paintings):,}")

Paintings found: 9,005


In [5]:
# Assess what columns are available in the CSV
print("Available columns and quantity rates:")
for col in df_paintings.columns:
    non_null = df_paintings[col].notna().sum()
    pct = (non_null / len(df_paintings)) * 100
    print(f"  {col}: {non_null:,} non-null ({pct:.0f}%)")

Available columns and quantity rates:
  Object Number: 9,005 non-null (100%)
  Is Highlight: 9,005 non-null (100%)
  Is Timeline Work: 9,005 non-null (100%)
  Is Public Domain: 9,005 non-null (100%)
  Object ID: 9,005 non-null (100%)
  Gallery Number: 967 non-null (11%)
  Department: 9,005 non-null (100%)
  AccessionYear: 8,967 non-null (100%)
  Object Name: 8,918 non-null (99%)
  Title: 6,775 non-null (75%)
  Culture: 4,424 non-null (49%)
  Period: 2,731 non-null (30%)
  Dynasty: 0 non-null (0%)
  Reign: 0 non-null (0%)
  Portfolio: 0 non-null (0%)
  Constituent ID: 8,021 non-null (89%)
  Artist Role: 8,021 non-null (89%)
  Artist Prefix: 8,021 non-null (89%)
  Artist Display Name: 8,021 non-null (89%)
  Artist Display Bio: 8,021 non-null (89%)
  Artist Suffix: 8,021 non-null (89%)
  Artist Alpha Sort: 8,021 non-null (89%)
  Artist Nationality: 8,021 non-null (89%)
  Artist Begin Date: 8,021 non-null (89%)
  Artist End Date: 8,021 non-null (89%)
  Artist Gender: 765 non-null (8%)
  Ar

#Four - Check What the CSV Provides

Before making the API calls, I assess what is there and what is missing.  
The CSV seems to include most metadata but lacks structured measurements of paintings (e.g. numeric height/width dimensions) and detailed tags.

In [6]:
df_paintings.head(3)

Unnamed: 0,Object Number,Is Highlight,Is Timeline Work,Is Public Domain,Object ID,Gallery Number,Department,AccessionYear,Object Name,Title,...,River,Classification,Rights and Reproduction,Link Resource,Object Wikidata URL,Metadata Date,Repository,Tags,Tags AAT URL,Tags Wikidata URL
29856,2009.224,False,True,True,35155,374.0,Arms and Armor,2009,Painting,"Guidobaldo II della Rovere, Duke of Urbino (15...",...,,Paintings,,http://www.metmuseum.org/art/collection/search...,https://www.wikidata.org/wiki/Q56042443,,"Metropolitan Museum of Art, New York, NY",Armor|Men|Portraits,http://vocab.getty.edu/page/aat/300226591|http...,https://www.wikidata.org/wiki/Q20793164|https:...
30297,09.3,False,False,True,35968,,Asian Art,1909,Pictorial map,清 佚名 台南地區荷蘭城堡|Forts Zeelandia and Provintia ...,...,,Paintings,,http://www.metmuseum.org/art/collection/search...,https://www.wikidata.org/wiki/Q79003782,,"Metropolitan Museum of Art, New York, NY",Maps|Houses|Cities|Boats|Ships,http://vocab.getty.edu/page/aat/300028094|http...,https://www.wikidata.org/wiki/Q4006|https://ww...
30298,12.37.135,False,False,False,35969,,Asian Art,1912,Hanging scroll,,...,,Paintings,,http://www.metmuseum.org/art/collection/search...,,,"Metropolitan Museum of Art, New York, NY",,,


In [7]:
#Select the key fields
key_fields = [
    "Object ID", "Title", "Artist Display Name", "Artist Nationality",
    "Object Begin Date", "Object End Date", "Medium", "Dimensions",
    "Department", "Culture", "Period", "Credit Line", "AccessionYear",
    "Is Public Domain", "Is Highlight", "Gallery Number", "Classification",
    "Tags"
]

print("Key fields are:")
for field in key_fields:
    if field in df_paintings.columns:
        non_null = df_paintings[field].notna().sum()
        pct = (non_null / len(df_paintings)) * 100
        print(f"  ✓ {field}: {non_null:,} non-null ({pct:.0f}%)")
    else:
        print(f"  ✗ {field}: NOT IN CSV")

Key fields are:
  ✓ Object ID: 9,005 non-null (100%)
  ✓ Title: 6,775 non-null (75%)
  ✓ Artist Display Name: 8,021 non-null (89%)
  ✓ Artist Nationality: 8,021 non-null (89%)
  ✓ Object Begin Date: 9,005 non-null (100%)
  ✓ Object End Date: 9,005 non-null (100%)
  ✓ Medium: 9,001 non-null (100%)
  ✓ Dimensions: 8,991 non-null (100%)
  ✓ Department: 9,005 non-null (100%)
  ✓ Culture: 4,424 non-null (49%)
  ✓ Period: 2,731 non-null (30%)
  ✓ Credit Line: 9,000 non-null (100%)
  ✓ AccessionYear: 8,967 non-null (100%)
  ✓ Is Public Domain: 9,005 non-null (100%)
  ✓ Is Highlight: 9,005 non-null (100%)
  ✓ Gallery Number: 967 non-null (11%)
  ✓ Classification: 9,005 non-null (100%)
  ✓ Tags: 7,211 non-null (80%)


#Five - Enriching Existing Data with API (for Measurements and Tags)

The CSV has a Dimensions column with text, e.g. *"25 x 30 in. (63.5 x 76.2 cm)"*,
but the analysis needs numeric height and width values. The API returns structured `measurements` data that
is much easier to parse. It also returns structured tags (i.e. subject matter keywords).

I will therefore call the API for each painting to get these additional fields. 
With 9,005 paintings and a 10 ms delay between requests (the `DELAY` value set in the fetch loop), the delay itself adds about 90 seconds; the total run time depends mainly on network latency.

Note: if the CSV already contains Tags with good coverage, the API may only be needed for measurements,
so I will check the output of the key fields cell above to decide.
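
For rows where the API returns no structured measurements, the free-text Dimensions value could serve as a fallback. The following is a minimal sketch, assuming the metric part is written in parentheses as "(H x W cm)" with height first. That ordering is an assumption, and many records use other layouts that this pattern will not match. `parse_dimensions_cm` is a hypothetical helper, not something the notebook currently uses.

```python
import re

# Matches the metric part, e.g. "(63.5 x 76.2 cm)"; also accepts the "×" sign
_CM_PATTERN = re.compile(
    r"\(\s*(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*cm\s*\)"
)


def parse_dimensions_cm(text):
    """Return (height_cm, width_cm) from a Dimensions string, or (None, None)."""
    if not isinstance(text, str):
        return (None, None)
    match = _CM_PATTERN.search(text)
    if match is None:
        return (None, None)
    return (float(match.group(1)), float(match.group(2)))


print(parse_dimensions_cm("25 x 30 in. (63.5 x 76.2 cm)"))  # (63.5, 76.2)
print(parse_dimensions_cm("Diam. 12 in."))                  # (None, None)
```

Only the first parenthesised pair is read, so strings listing several elements (for example an image size followed by a mount size) would need extra handling.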

In [8]:
def fetch_object_with_retry(object_id, max_retries=3):
    """Fetch object with automatic retry on failure."""
    url = f"{BASE_URL}/objects/{object_id}"
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=15)
            
            # If rate limited, wait and retry
            if response.status_code == 429:
                wait_time = 10 * (attempt + 1)
                print(f"  Rate limited on {object_id}. Waiting {wait_time}s...")
                time.sleep(wait_time)
                continue
            
            response.raise_for_status()
            return response.json()
            
        except requests.exceptions.RequestException as e:
            if attempt < max_retries - 1:
                time.sleep(5 * (attempt + 1))
                continue
            return {"objectID": object_id, "error": str(e)}
    
    return {"objectID": object_id, "error": "Max retries exceeded"}


def extract_api_fields(obj):
    """Extract painting measurements and tags from an API response."""
    if "error" in obj:
        return {"object_id": obj["objectID"], "height_cm": None, "width_cm": None, "tags_api": "", "gallery_number_api": ""}
    
    # Extract painting height and width from measurements
    height = None
    width = None
    if obj.get("measurements"):
        for m in obj["measurements"]:
            if m.get("elementName") == "Overall":
                dims = m.get("elementMeasurements", {})
                height = dims.get("Height")
                width = dims.get("Width")
                break
        # Fallback: use first available measurement if Overall not found
        # Some Met records use different element naming conventions
        if height is None and obj["measurements"]:
            dims = obj["measurements"][0].get("elementMeasurements", {})
            height = dims.get("Height")
            width = dims.get("Width")
    
    # Extract tag terms
    tags = []
    if obj.get("tags"):
        tags = [t["term"] for t in obj["tags"] if "term" in t]
    
    return {
        "object_id": obj.get("objectID"),
        "height_cm": height,
        "width_cm": width,
        "tags_api": "|".join(tags) if tags else "",
        "gallery_number_api": obj.get("GalleryNumber", ""),
    }
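
The retry pattern in `fetch_object_with_retry` waits longer after each failed attempt (5 s, then 10 s for connection errors). The sketch below isolates that idea in a generic, hypothetical helper called `call_with_retry`, with the sleep function passed in so the wait schedule can be checked without actually pausing. It is an illustration of linear backoff, not a replacement for the function above.

```python
import time


def call_with_retry(func, max_retries=3, base_wait=5, sleep=time.sleep):
    """Call func() up to max_retries times, waiting base_wait * attempt number between tries.

    Returns func()'s result, or re-raises the last exception.
    `sleep` is injectable so the schedule can be inspected without waiting.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_wait * (attempt + 1))


# Demonstration with a function that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)  # ok [5, 10]
```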

In [9]:
# Check for existing checkpoint to allow resuming
CHECKPOINT_PATH = "data/raw/api_enrichment_checkpoint.json"
painting_ids = df_paintings["Object ID"].tolist()

if os.path.exists(CHECKPOINT_PATH):
    with open(CHECKPOINT_PATH, "r") as f:
        api_records = json.load(f)
    fetched_ids = {r["object_id"] for r in api_records}
    remaining_ids = [pid for pid in painting_ids if pid not in fetched_ids]
    print(f"Resuming from checkpoint: {len(api_records)} already fetched, {len(remaining_ids)} remaining.")
else:
    api_records = []
    remaining_ids = painting_ids
    print(f"Starting fresh. {len(remaining_ids)} paintings to fetch from API.")

Resuming from checkpoint: 9269 already fetched, 0 remaining.


In [10]:
#Fetching API data with progress bar plus checkpointing and retry logic for issues
CHECKPOINT_EVERY = 500
DELAY = 0.01  # reduced from first attempt of 0.05 which would have taken two hours
              # network latency already provides enough spacing

failed_ids = []

for i, object_id in enumerate(tqdm(remaining_ids, desc="Enriching paintings")):
    result = fetch_object_with_retry(object_id)
    extracted = extract_api_fields(result)
    
    if "error" in result:
        failed_ids.append(object_id)
    
    api_records.append(extracted)
    
    # Save checkpoint periodically
    if (i + 1) % CHECKPOINT_EVERY == 0:
        with open(CHECKPOINT_PATH, "w") as f:
            json.dump(api_records, f)
        tqdm.write(f"  Checkpoint saved at {len(api_records)} paintings.")
    
    time.sleep(DELAY)

# Final checkpoint save
with open(CHECKPOINT_PATH, "w") as f:
    json.dump(api_records, f)

# Save failed IDs separately for review
if failed_ids:
    with open("data/raw/api_failed_ids.json", "w") as f:
        json.dump(failed_ids, f)
    print(f"Failed IDs saved to data/raw/api_failed_ids.json")

print(f"\nDone. Enriched {len(api_records)} paintings.")
print(f"Failed requests: {len(failed_ids)}")
if failed_ids:
    print(f"Failed IDs (first 20): {failed_ids[:20]}")

Enriching paintings: 0it [00:00, ?it/s]


Done. Enriched 9269 paintings.
Failed requests: 0
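
The loop above rewrites the checkpoint file in place, so an interruption in the middle of a write could leave a truncated JSON file. One common safeguard, sketched here under the assumption that the temporary file and the checkpoint sit on the same filesystem, is to write to a temporary file first and then swap it into place with `os.replace`. `save_checkpoint_atomic` is a hypothetical helper and the demo record is made up.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(records, path):
    """Write records as JSON to a temp file, then swap it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(records, f)
        os.replace(tmp_path, path)  # replaces the old checkpoint in one step
    except BaseException:
        # Clean up the partial temp file, then let the error propagate
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "checkpoint.json")
    save_checkpoint_atomic([{"object_id": 1, "height_cm": 10.0}], demo_path)
    with open(demo_path) as f:
        loaded = json.load(f)

print(loaded)
```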





#Six - Merging CSV Data and the API Data

Combining the CSV base (which holds the vast majority of metadata) with the API enrichment (which has additional aspects such as measurements and tags).

In [11]:
#Creating the API Enrichment DataFrame
df_api = pd.DataFrame(api_records)
print(f"API enrichment records: {len(df_api):,}")

#Merging on object ID
df = df_paintings.merge(df_api, left_on="Object ID", right_on="object_id", how="left")
print(f"Merged DataFrame: {len(df):,} rows")

API enrichment records: 9,269
Merged DataFrame: 9,269 rows
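
The merge returns 9,269 rows for 9,005 paintings, which suggests that some `object_id` values appear more than once in the API records; a left merge produces one output row for every matching record. A minimal sketch of one way to collapse such repeats before merging is shown below. It keeps the first non-empty value per field for each ID. `collapse_records` is a hypothetical helper, the sample values are made up, and whether collapsing is the right treatment depends on why the repeats exist.

```python
def collapse_records(records):
    """Merge records sharing an object_id, keeping the first non-empty value per field."""
    merged = {}
    for record in records:
        target = merged.setdefault(record["object_id"], {})
        for field, value in record.items():
            if target.get(field) in (None, "") and value not in (None, ""):
                target[field] = value          # fill a gap with a real value
            else:
                target.setdefault(field, value)  # keep what is already there
    return list(merged.values())


records = [
    {"object_id": 1, "height_cm": 110.6, "width_cm": None},
    {"object_id": 1, "height_cm": None, "width_cm": 80.0},  # repeated ID
    {"object_id": 2, "height_cm": 170.2, "width_cm": 96.5},
]
unique = collapse_records(records)
print(len(unique), unique[0])
```

The sketch treats only `None` and empty strings as missing; NaN values read back from a JSON checkpoint would need an extra check.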


#Seven - Creating Target Variable and Select Columns

Create the `is_on_display` target variable and retain only the columns needed

In [12]:
#Creating target variable from live API gallery number only
#Not falling back to CSV Gallery Number as that data is up to three years old
#Display status changes regularly, so a CSV fallback would risk mislabelling paintings whose status has since changed
df["is_on_display"] = df["gallery_number_api"].apply(
    lambda x: 1 if pd.notna(x) and str(x).strip() not in ["", "nan"] else 0
)

#Keep gallery_number_api as gallery_number for reference in EDA and Tableau
df["gallery_number_clean"] = df["gallery_number_api"].fillna("")

#Using API tags if available, fall back to CSV Tags column
if "Tags" in df.columns:
    df["tags_combined"] = df["tags_api"].where(df["tags_api"] != "", df["Tags"].fillna(""))
else:
    df["tags_combined"] = df["tags_api"]

print(f"Target variable created from live API data only")
print(f"On display: {df['is_on_display'].sum():,} ({df['is_on_display'].mean()*100:.1f}%)")

Target variable created from live API data only
On display: 1,388 (15.0%)


In [13]:
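#Selecting and renaming columns in order to build a clean working dataset
#Note: df already has an "object_id" column from the API merge, so renaming "Object ID" yields two "object_id" columns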
columns_map = {
    "Object ID": "object_id",
    "Title": "title",
    "Artist Display Name": "artist_name",
    "Artist Nationality": "artist_nationality",
    "Artist Begin Date": "artist_begin_date",
    "Artist End Date": "artist_end_date",
    "Object Begin Date": "object_begin_date",
    "Object End Date": "object_end_date",
    "Medium": "medium",
    "Dimensions": "dimensions",
    "Department": "department",
    "Culture": "culture",
    "Period": "period",
    "Credit Line": "credit_line",
    "AccessionYear": "accession_year",
    "Is Public Domain": "is_public_domain",
    "Is Highlight": "is_highlight",
    "Object URL": "object_url",
}

df_clean = df.rename(columns=columns_map)

#Building the final column list
final_cols = [v for v in columns_map.values() if v in df_clean.columns]
final_cols += ["height_cm", "width_cm", "gallery_number_clean", "is_on_display", "tags_combined"]
final_cols = [c for c in final_cols if c in df_clean.columns]

df_clean = df_clean[final_cols]

#Tidying up column names
df_clean = df_clean.rename(columns={
    "gallery_number_clean": "gallery_number",
    "tags_combined": "tags"
})

print(f"Final dataset for analysis is: {df_clean.shape[0]:,} rows x {df_clean.shape[1]} columns")
print(f"\nColumns are: {list(df_clean.columns)}")

Final dataset for analysis is: 9,269 rows x 23 columns

Columns are: ['object_id', 'object_id', 'title', 'artist_name', 'artist_nationality', 'artist_begin_date', 'artist_end_date', 'object_begin_date', 'object_end_date', 'medium', 'dimensions', 'department', 'culture', 'period', 'credit_line', 'accession_year', 'is_public_domain', 'is_highlight', 'height_cm', 'width_cm', 'gallery_number', 'is_on_display', 'tags']


#Eight - Reviewing the Initial Data

Assessing what has been collected overall before saving.

In [14]:
df_clean.head()

Unnamed: 0,object_id,object_id.1,title,artist_name,artist_nationality,artist_begin_date,artist_end_date,object_begin_date,object_end_date,medium,...,period,credit_line,accession_year,is_public_domain,is_highlight,height_cm,width_cm,gallery_number,is_on_display,tags
0,35155,35155,"Guidobaldo II della Rovere, Duke of Urbino (15...",,,,,1555,1610,Oil on copper,...,,"Purchase, Arthur Ochs Sulzberger Gift, 2009",2009,True,False,,,374.0,1,Armor|Men|Portraits
1,35155,35155,"Guidobaldo II della Rovere, Duke of Urbino (15...",,,,,1555,1610,Oil on copper,...,,"Purchase, Arthur Ochs Sulzberger Gift, 2009",2009,True,False,,,374.0,1,Armor|Men|Portraits
2,35968,35968,清 佚名 台南地區荷蘭城堡|Forts Zeelandia and Provintia ...,Unidentified artist,,,,1800,1899,Wall hanging; ink and color on deerskin,...,,"Gift of J. Pierpont Morgan, 1909",1909,True,False,110.6,,,0,Houses|Cities|Boats|Ships|Maps
3,35968,35968,清 佚名 台南地區荷蘭城堡|Forts Zeelandia and Provintia ...,Unidentified artist,,,,1800,1899,Wall hanging; ink and color on deerskin,...,,"Gift of J. Pierpont Morgan, 1909",1909,True,False,,0.317501,,0,Houses|Cities|Boats|Ships|Maps
4,35969,35969,,Jin Zunnian,Chinese,1700.0,1800.0,1732,1732,Hanging scroll; ink and color on silk,...,Qing dynasty (1644–1911),"Rogers Fund, 1912",1912,False,False,170.2,96.5,,0,Parrots


In [15]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9269 entries, 0 to 9268
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   object_id           9269 non-null   int64  
 1   object_id           9269 non-null   int64  
 2   title               7006 non-null   object 
 3   artist_name         8270 non-null   object 
 4   artist_nationality  8270 non-null   object 
 5   artist_begin_date   8270 non-null   object 
 6   artist_end_date     8270 non-null   object 
 7   object_begin_date   9269 non-null   int64  
 8   object_end_date     9269 non-null   int64  
 9   medium              9265 non-null   object 
 10  dimensions          9255 non-null   object 
 11  department          9269 non-null   object 
 12  culture             4688 non-null   object 
 13  period              2922 non-null   object 
 14  credit_line         9264 non-null   object 
 15  accession_year      9231 non-null   object 
 16  is_pub

In [16]:
#Target variable distribution
display_counts = df_clean["is_on_display"].value_counts()
display_pct = df_clean["is_on_display"].value_counts(normalize=True) * 100

print("Target variable distribution:")
print(f"  In storage (0): {display_counts.get(0, 0):,} ({display_pct.get(0, 0):.1f}%)")
print(f"  On display (1): {display_counts.get(1, 0):,} ({display_pct.get(1, 0):.1f}%)")

Target variable distribution:
  In storage (0): 7,881 (85.0%)
  On display (1): 1,388 (15.0%)


In [17]:
#Missing values summary
missing = df_clean.isnull().sum()
missing_pct = (missing / len(df_clean)) * 100
missing_df = pd.DataFrame({"missing_count": missing, "missing_pct": missing_pct.round(1)})
print("Missing values:")
print(missing_df[missing_df["missing_count"] > 0].sort_values("missing_pct", ascending=False))

Missing values:
                    missing_count  missing_pct
period                       6347         68.5
culture                      4581         49.4
title                        2263         24.4
width_cm                     1950         21.0
height_cm                    1936         20.9
artist_name                   999         10.8
artist_nationality            999         10.8
artist_begin_date             999         10.8
artist_end_date               999         10.8
accession_year                 38          0.4
dimensions                     14          0.2
credit_line                     5          0.1
medium                          4          0.0


In [18]:
#Sorting paintings by art department
print("Paintings by department:")
print(df_clean["department"].value_counts())

Paintings by department:
department
Asian Art                                    4620
European Paintings                           2290
Modern and Contemporary Art                  1983
Robert Lehman Collection                      285
Islamic Art                                    49
Photographs                                    26
Musical Instruments                             6
Arms and Armor                                  5
Drawings and Prints                             3
European Sculpture and Decorative Arts          1
Arts of Africa, Oceania, and the Americas       1
Name: count, dtype: int64


In [19]:
#Display rate by art department
print("\nDisplay rate by department:")
display_by_dept = df_clean.groupby("department")["is_on_display"].agg(["sum", "count"])
display_by_dept["display_rate"] = (display_by_dept["sum"] / display_by_dept["count"] * 100).round(1)
display_by_dept.columns = ["on_display", "total", "display_rate_%"]
print(display_by_dept.sort_values("total", ascending=False))


Display rate by department:
                                           on_display  total  display_rate_%
department                                                                  
Asian Art                                          70   4620             1.5
European Paintings                               1108   2290            48.4
Modern and Contemporary Art                        87   1983             4.4
Robert Lehman Collection                          115    285            40.4
Islamic Art                                         4     49             8.2
Photographs                                         0     26             0.0
Musical Instruments                                 1      6            16.7
Arms and Armor                                      2      5            40.0
Drawings and Prints                                 0      3             0.0
Arts of Africa, Oceania, and the Americas           1      1           100.0
European Sculpture and Decorative Arts         

#Nine - Saving Raw Dataset

Saving the merged and selected dataset as CSV for use in the next notebook.

In [20]:
output_path = "data/raw/met_paintings_raw.csv"
df_clean.to_csv(output_path, index=False)
print(f"Saved {len(df_clean):,} paintings to {output_path}")
print(f"File size: {os.path.getsize(output_path) / (1024*1024):.1f} MB")

Saved 9,269 paintings to data/raw/met_paintings_raw.csv
File size: 2.9 MB


#Overall Summary

In this notebook I have:
1. Downloaded The Met's full collection CSV from GitHub (484,956 objects)
2. Filtered to paintings using the Classification column
3. Enriched with live API data for structured measurements (height/width in cm) and subject tags
4. Created the target variable (is_on_display) from live API gallery number data only -- not from the CSV, which is up to three years old and would risk incorrect labels for paintings whose display status has changed
5. Performed an initial overview of the dataset
6. Saved the raw paintings dataset as CSV

Next: `02_cleaning_and_features.ipynb` - data wrangling and feature engineering

In [21]:
# Final sense check
print(f"Failed requests: {len(failed_ids)}")
print()
print(f"height_cm non-null: {df_clean['height_cm'].notna().sum():,} / {len(df_clean):,} ({df_clean['height_cm'].notna().mean()*100:.1f}%)")
print(f"width_cm non-null: {df_clean['width_cm'].notna().sum():,} / {len(df_clean):,} ({df_clean['width_cm'].notna().mean()*100:.1f}%)")
print()
print(df_clean["is_on_display"].value_counts())
print(f"Display rate: {df_clean['is_on_display'].mean()*100:.1f}%")

Failed requests: 0

height_cm non-null: 7,333 / 9,269 (79.1%)
width_cm non-null: 7,319 / 9,269 (79.0%)

is_on_display
0    7881
1    1388
Name: count, dtype: int64
Display rate: 15.0%


In [22]:
# Checking it saved - so I don't have to wait 120 minutes again if I need to re-run the notebook
# (os is already imported in the setup cell)

files_to_check = [
    "data/raw/MetObjects.csv",
    "data/raw/api_enrichment_checkpoint.json",
    "data/raw/met_paintings_raw.csv"
]

for f in files_to_check:
    if os.path.exists(f):
        size_mb = os.path.getsize(f) / (1024 * 1024)
        print(f"{f}: {size_mb:.1f} MB")
    else:
        print(f"{f}: NOT FOUND")

data/raw/MetObjects.csv: 302.9 MB
data/raw/api_enrichment_checkpoint.json: 1.0 MB
data/raw/met_paintings_raw.csv: 2.9 MB


In [23]:
# Validation check - what share of CSV paintings have a structured height from the API
# A non-null height_cm is used here as a proxy for successful enrichment
total_in_csv = len(df_paintings)  # paintings identified in the CSV (9,005)
total_api_success = df['height_cm'].notna().sum()  # merged rows with a non-null height_cm
total_api_fail = total_in_csv - total_api_success

print(f"Total paintings in CSV: {total_in_csv}")
print(f"Successfully retrieved from API: {total_api_success}")
print(f"Not found or failed: {total_api_fail}")
print(f"API coverage rate: {total_api_success / total_in_csv * 100:.1f}%")

Total paintings in CSV: 9005
Successfully retrieved from API: 7333
Not found or failed: 1672
API coverage rate: 81.4%


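7,333 rows have a structured height from the live API, which is 81.4% of the 9,005 paintings identified in a CSV that is up to three years old. I take this as reasonable validation that the GitHub CSV remains a reliable base for this kind of analysis. The Met's core paintings collection is relatively stable and the project can proceed. 
Two caveats apply. A non-null `height_cm` is only a proxy for successful retrieval, since the API can return a painting without structured measurements. The merged dataset also has 9,269 rows rather than 9,005, so some object IDs appear more than once and the 7,333 count is per row, not per painting; against all 9,269 rows the rate is 79.1%. 
The remaining rows have missing height and width data, which may reflect records without structured measurements, potential deaccessions since the CSV was published, or API retrieval failures. 
Analysis will proceed on this dataset. Repeated object IDs will need to be resolved, and missing numeric values are expected to be filled with medians, in subsequent notebooks.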
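
When a dataset may contain repeated IDs, a coverage rate is easier to interpret if each object is counted once in both the numerator and the denominator. The sketch below shows that calculation on plain Python records. `measurement_coverage` is a hypothetical helper and the sample rows are made up; it is not how the figures above were produced.

```python
def measurement_coverage(records, field="height_cm"):
    """Return (ids_with_value, total_ids, percent), counting each object_id once."""
    has_value = {}
    for record in records:
        key = record["object_id"]
        present = record.get(field) is not None
        # An ID counts as covered if any of its records has the field
        has_value[key] = has_value.get(key, False) or present
    total = len(has_value)
    covered = sum(has_value.values())
    percent = round(covered / total * 100, 1) if total else 0.0
    return covered, total, percent


rows = [
    {"object_id": 1, "height_cm": 110.6},
    {"object_id": 1, "height_cm": None},  # repeated ID
    {"object_id": 2, "height_cm": None},
    {"object_id": 3, "height_cm": 170.2},
    {"object_id": 4, "height_cm": 55.0},
]
print(measurement_coverage(rows))  # (3, 4, 75.0)
```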

Summary: The is_on_display target variable was derived from gallery number data pulled live from the Met API at the time of data collection, rather than from the GitHub CSV. The CSV was used only for painting IDs and stable metadata fields such as medium and accession year. The display status therefore reflects the collection at the point the data was collected, not the age of the source file.

Limitations: The target variable is a snapshot from the point of data collection. Display status changes regularly as paintings move on and off gallery walls, so the model reflects the collection at one moment in time. That said, the is_on_display field was derived from live API data rather than the static GitHub CSV, so it is not subject to the age of that file.

Next step: Cleaning the data