# A Notebook Preparing the MoMA Dataset Step-by-Step

In [1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load data
moma = pd.read_csv('../data/moma/Artworks.csv')
moma.head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjUyNzc3MCJd...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjUyNzM3NCJd...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjUyNzM3NSJd...,,,,34.3,,,31.8,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjUyNzQ3NCJd...,,,,50.8,,,50.8,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjUyNzQ3NSJd...,,,,38.4,,,19.1,,


In [3]:
# Have column names handy
moma.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

## 0. Preparation

### 0.1. The Question and Approach

What I want to know is this:
- How old was the artwork (i.e., how long after its completion) when it was acquired by MoMA?
- If the artist was alive when a given work was acquired, how old was he/she?
- If the artist was deceased when a given work was acquired, how many years after his/her death did this event occur?

These are not currently features of the dataset, so I'll have to engineer them as follows:
- `artwork_age`: `year_acquired` - `completed_year`
- `artist_age`: `year_acquired` - `birth_year`
- `years_posthumous`: `year_acquired` - `death_year`

The inputs to these calculations aren't features either, so they will have to be parsed from the original dataset:
- `year_acquired`: parse from `DateAcquired`
- `completed_year`: parse and extract from `Date`
- `birth_year`: parse from `BeginDate`
- `death_year`: parse from `EndDate`

Thus, to simplify our dataset moving forward, I'll filter for the following features:
- `Title`
- `Artist`
- `BeginDate`
- `EndDate`
- `Date`
- `DateAcquired`

### 0.2. Dealing with Multiple Artist Records

Before jumping in, there's one additional issue, which is that sometimes a single record has more than one artist associated with it.

This makes sense in the case of collaborations, for example--presumably less common in certain mediums/classifications (e.g., painting, sculpture) than others (e.g., architecture, books).

Before continuing, let's get a better handle on how pervasive this issue is, and what kinds of artworks it tends to affect most. That way we'll be in a better position to know how to deal with them.

In [4]:
print("There are {:,} records in this dataset".format(len(moma)))

There are 140,848 records in this dataset


We can see evidence of multiple artists in the following features:
- Names in `Artist` are comma-separated
- Each bio in `ArtistBio` is contained in its own set of parentheses
- Each nationality in `Nationality` is contained in its own set of parentheses
- Each artist birth year in `BeginDate` is contained in its own set of parentheses
- Each artist death year in `EndDate` is contained in its own set of parentheses
- Each artist gender in `Gender` is contained in its own set of parentheses

I'll go through each of these and ensure that they are all pointing to the same number of multi-artist works. We want to drop these.
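
As a quick illustration of the parenthesis check, here is a minimal sketch that uses Python's `re.match` as a stand-in for pandas' `str.match` (both anchor the pattern at the start of the string). The helper name `has_multiple_groups` and the sample values are my own and purely illustrative; they are not rows taken from the dataset.

```python
import re

# A closing parenthesis that is later followed by an opening one,
# i.e. at least two parenthesised groups with something between them.
MULTI_GROUP = re.compile(r'.+?\).+?\(')

def has_multiple_groups(value):
    """Return True when `value` holds more than one parenthesised group."""
    return MULTI_GROUP.match(value) is not None

# Illustrative values only (hypothetical, not copied from the data).
samples = {
    '(Male)': False,           # one artist
    '(Male) (Female)': True,   # two artists
    '()': False,               # one artist, value not recorded
    '() ()': True,             # two artists, neither value recorded
}

for value, expected in samples.items():
    assert has_multiple_groups(value) == expected
    print(repr(value), has_multiple_groups(value))
```

Note that the pattern needs at least one character between the groups, so it assumes the groups are separated by a space, as they appear to be in this dataset.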

In [5]:
multi_artist = moma['Artist'].str.contains(',').fillna(False)
multi_artistbio = moma['ArtistBio'].str.match(r'.+?\).+?\(').fillna(False)
multi_begindate = moma['BeginDate'].str.match(r'.+?\).+?\(').fillna(False)
multi_enddate = moma['EndDate'].str.match(r'.+?\).+?\(').fillna(False)
multi_gender = moma['Gender'].str.match(r'.+?\).+?\(').fillna(False)
multi_nationality = moma['Nationality'].str.match(r'.+?\).+?\(').fillna(False)

print("`Artist` multiples: {:,}".format(len(moma[multi_artist])))
print("`ArtistBio` multiples: {:,}".format(len(moma[multi_artistbio])))
print("`BeginDate` multiples: {:,}".format(len(moma[multi_begindate])))
print("`EndDate` multiples: {:,}".format(len(moma[multi_enddate])))
print("`Gender` multiples: {:,}".format(len(moma[multi_gender])))
print("`Nationality` multiples: {:,}".format(len(moma[multi_nationality])))

`Artist` multiples: 8,291
`ArtistBio` multiples: 6,986
`BeginDate` multiples: 7,741
`EndDate` multiples: 7,741
`Gender` multiples: 7,741
`Nationality` multiples: 7,741


In [6]:
print("Equality of `multi_begindate` and `multi_enddate`: {}"
      .format(multi_begindate.equals(multi_enddate)))
print("Equality of `multi_begindate` and `multi_gender`: {}"
      .format(multi_begindate.equals(multi_gender)))
print("Equality of `multi_begindate` and `multi_nationality`: {}"
      .format(multi_begindate.equals(multi_nationality)))

Equality of `multi_begindate` and `multi_enddate`: True
Equality of `multi_begindate` and `multi_gender`: True
Equality of `multi_begindate` and `multi_nationality`: True


We can see that the indicators of multiple artists in `BeginDate`, `EndDate`, `Gender`, and `Nationality` are consistent, but I'm curious about the discrepancy with `Artist` indicators in particular.

In [7]:
moma[multi_artist & ~ multi_gender]

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
902,"The Atheneum, New Harmony, Indiana","Richard Meier & Associates, Architects",22754,(founded 1963),(),(1963),(0),(),1975-79,Styrene,...,http://www.moma.org/media/W1siZiIsIjIyOTI2MSJd...,,75.5652,,24.800000,,,104.20,,
922,"Mediatheque, Sendai, Miyagi Prefecture, Japan ...","Toyo Ito & Associates, Architects",8987,"(Japan, established 1971)",(Japanese),(1971),(0),(),1995–2001,Acrylic,...,http://www.moma.org/media/W1siZiIsIjIxMTI5MyJd...,,74.0000,,27.000000,,,80.00,,
926,"Federal Building and United States Courthouse,...","Richard Meier & Partners, Architects",22753,(founded 1963),(),(1963),(0),(),1993-2000,Wood,...,http://www.moma.org/media/W1siZiIsIjIyOTU3NSJd...,,66.7000,,86.400000,,,127.70,,
944,"Shimosuwa Municipal Museum, Shimosuwa-machi, N...","Toyo Ito & Associates, Architects",8987,"(Japan, established 1971)",(Japanese),(1971),(0),(),1990–1993,Plexiglass and aluminum,...,http://www.moma.org/media/W1siZiIsIjUyNzY5MyJd...,,60.0076,,19.367539,,,120.00,,
969,Battery Jar,"Corning Glass Works, Corning, NY",1249,(est. 1851),(American),(1851),(0),(),1920s,Pyrex glass,...,,,,30.4801,60.700000,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
134482,"56 Leonard Street, New York, New York, USA","Herzog & de Meuron, Basel",7567,(est. 1978),(Swiss),(1978),(0),(),2006–2008,Acrylic,...,,,5.0000,,24.000000,,,4.00,,
134483,"56 Leonard Street, New York, New York, USA","Herzog & de Meuron, Basel",7567,(est. 1978),(Swiss),(1978),(0),(),2006–2008,Plexiglass,...,,,5.0000,,30.000000,,,6.00,,
134484,"56 Leonard Street, New York, New York, USA","Herzog & de Meuron, Basel",7567,(est. 1978),(Swiss),(1978),(0),(),2006–2008,Acrylic,...,,,5.0000,,30.000000,,,5.00,,
138641,"Untitled (Galeria de la Plaza, 613 N Main Stre...","Plaza Gallery, Los Angeles",133909,,(),(0),(0),(),c. 1900,Collodion silver print,...,http://www.moma.org/media/W1siZiIsIjQ5OTQ5MiJd...,,,,14.000000,,,9.70,,


The issue is commas in the `Artist` field: sometimes they separate artist names, but sometimes they are part of a single firm's name or location (e.g., "Herzog & de Meuron, Basel"), which presumably applies mostly to architecture and design.

In [8]:
moma[multi_artist & ~ multi_gender]['Classification'].value_counts()

Design              408
Architecture         74
Photograph           48
Print                14
Illustrated Book      2
Video                 2
Installation          1
(not assigned)        1
Name: Classification, dtype: int64

In [9]:
is_photograph = (moma['Classification'] == 'Photograph')
moma[multi_artist & ~ multi_gender]['Artist'].value_counts()

Herzog & de Meuron, Basel                                                                67
Henry Wessel, Jr.                                                                        23
Coors Porcelain Co., Golden, CO                                                          22
Daum Frères, Nancy, France                                                               22
Department of Publications and Urban Design, Organizing Committee of the XIX Olympiad    19
                                                                                         ..
Hämmerli, Ltd., Lenzburg, Switzerland                                                     1
Slazengers Ltd., England                                                                  1
Inoue Pleats Co., Ltd., Fukui, Japan                                                      1
The Custanite Corp., Brooklyn, NY                                                         1
Van Cleave, Axtell, KS                                                          

Because indicators of multiple authorship are clearly less reliable/consistent in the `Artist` field, I'll ignore those in favor of the `Gender`/`BeginDate`/`EndDate`/`Nationality` indicators.

In [10]:
multi_artist = multi_gender

moma_solo = moma[~ multi_artist].copy()

print("{:,} total records of single-artist works".format(len(moma_solo)))

133,107 total records of single-artist works


## 1. Parsing Features and Extracting Relevant Data

Next is to parse these columns of interest and extract what we need.

### 1.1. Generating `year_acquired` from `DateAcquired`
Here we parse the `DateAcquired` feature and extract `year_acquired` from it.

In [11]:
# Convert `DateAcquired` to datetime object and extract year
moma_solo['year_acquired'] = pd.to_datetime(moma_solo['DateAcquired']).dt.year.astype(float)

# Preview
moma_solo[['DateAcquired', 'year_acquired']].sample(20)

Unnamed: 0,DateAcquired,year_acquired
107988,1990-10-23,1990.0
44978,1974-04-02,1974.0
57214,1962-01-09,1962.0
126004,2015-10-19,2015.0
86386,2005-01-12,2005.0
136294,2019-11-12,2019.0
74874,1959-03-02,1959.0
54128,1996-12-10,1996.0
2550,1961-11-08,1961.0
83777,2005-05-10,2005.0
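
To see how missing acquisition dates behave under this conversion, here is a small sketch on toy data. The series `dates` is hypothetical; only the chain of calls is the same as in the cell above.

```python
import pandas as pd

# Toy values in the same ISO format as `DateAcquired`; None stands in for
# a missing acquisition date.
dates = pd.Series(['1990-10-23', '1974-04-02', None])

# Same chain of calls as above: parse, take the year, store as float.
years = pd.to_datetime(dates).dt.year.astype(float)

print(years.tolist())
print(years.dtype)
```

The missing date becomes NaN rather than raising an error. Since NaN cannot be stored in a plain integer column, keeping the years as floats is a simple way to carry the gaps along.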


### 1.2. Parse `Date` Feature
The `Date` column holds the date attributed to the artwork. This seems like it should be straightforward, but it is complicated by artworks dated with a range or an estimated date. So I'll extract both a `begun_year` feature and a `completed_year` feature, the latter being what we actually want for this analysis. As an intermediate step, I'll create a `date_stripped` feature, which eliminates some of the extraneous details and reduces the value to a YYYY, YYYY-YY, or YYYY-YYYY format.

#### 1.2.1. Sampling Non-Standard Dates

To get to know the data better, I want to look for values that don't match an expected format, namely something standard (YYYY, preceded or followed by anything non-numerical) or hyphenated (YYYY-YY, YYYY-YYYY, or any variation that uses a hyphen, an en dash, a slash, etc., and again preceded or followed by anything non-numerical).

I also want to look for values that are wholly non-numerical, since these are going to be NaN.
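
Before running these filters on the whole column, it can help to try the patterns on a handful of strings. The sketch below applies the same three regular expressions used in the cells that follow, with `re.match` standing in for pandas' `str.match`. The `classify` helper and its labels are my own naming, not part of the notebook's code.

```python
import re

# The three patterns used in this section.
NON_NUMERICAL = re.compile(r'^[^\d]+?$')
STANDARD = re.compile(r'^.*?\d{4}[^\d]*?$')
HYPHENATED = re.compile(r'^[^\d]*?\d{4} ?[-–/] ?\d{2,4}[^\d]*?$')

def classify(date):
    """Return the set of format labels that `date` matches."""
    labels = set()
    if NON_NUMERICAL.match(date):
        labels.add('non_numerical')
    if STANDARD.match(date):
        labels.add('standard')
    if HYPHENATED.match(date):
        labels.add('hyphenated')
    # Anything matching none of the three is what I call non-standard.
    return labels or {'non_standard'}

for date in ['n.d.', 'c. 1965', '1993–94', '1966-1968', '1870s-80s']:
    print(f'{date!r:>12} -> {sorted(classify(date))}')
```

One thing this makes visible is that the standard and hyphenated filters overlap: a full YYYY-YYYY range such as `1966-1968` matches both, because it ends in a four-digit year.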

In [12]:
# Filter for non-numerical values
non_numerical = (moma_solo['Date'].str.match(r'^[^\d]+?$').fillna(False))

moma_solo[non_numerical]['Date'].value_counts()

n.d.                                                                                596
Unknown                                                                             239
(n.d.)                                                                              112
unknown                                                                              21
(London?, published in aid of the Comforts Fund  for Women and Children of Sovie     10
no date                                                                               4
n.d                                                                                   3
TBC                                                                                   3
New York                                                                              2
TBD                                                                                   2
Various                                                                               1
Unkown                          

In [13]:
# Filter for standard format dates
standard_format = (moma_solo['Date'].str.match(r'^.*?\d{4}[^\d]*?$').fillna(False))

moma_solo[standard_format].sample(10)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.),year_acquired
102140,Schlafender Mann (Pechstein) [Sleeping Man (Pe...,Erich Heckel,2569,"(German, 1883–1970)",(German),(1883),(1970),(Male),1910,One from an exhibition catalogue with twenty w...,...,,,,16.8,,,11.0,,,2010.0
47471,Car Glass,Brett Weston,6327,"(American, 1911–1993)",(American),(1911),(1993),(Male),1939,Gelatin silver print\r\n,...,,,,24.4,,,19.7,,,1941.0
108297,Modernist dollhouse furniture,Unidentified Designer,6011,(Nationality unknown),(Nationality unknown),(0),(0),(),1930s,"Painted metal, wire, fabric, and wood",...,,5.7,,10.6,,,12.6,,,2012.0
76802,Balboa Terminals. General view of Coaling Poin...,Unidentified photographer,8595,,(),(0),(0),(),"September, 1915",Gelatin silver print,...,,,,17.7,,,24.2,,,1971.0
16803,Plate 2 from POEMS FROM THE CANTO GENERAL,David Alfaro Siqueiros,5454,"(Mexican, 1896–1974)",(Mexican),(1896),(1974),(Male),1966-1968,,...,,,,60.0,,,104.0,,,1969.0
88465,"detail (sculpture, Carinthia 1945/1970, wood a...",Ernst Strouhal,31057,,(),(0),(0),(Male),"(newspaper published July 20, 2000)","Lithograph, offset printed",...,,,,,,,,,,2006.0
124708,13 Essential Rules for Understanding the World,Basim Magdy,47821,"(Egyptian, born 1977)",(Egyptian),(1977),(0),(Male),2011,"Super 8mm film transferred to video (color, so...",...,,,,,,,,,316.0,2015.0
60147,Three Girls Before the Mirror (Drei Mädchen vo...,Otto Mueller,4140,"(German, 1874–1930)",(German),(1874),(1930),(Male),(c. 1922),Lithograph,...,,,,35.2,,,25.3,,,1957.0
39058,Ancien Château de Gaillon. XVIe siècle. École ...,Eugène Atget,229,"(French, 1857–1927)",(French),(1857),(1927),(Male),1921,Albumen silver print,...,,,,,,,,,,1968.0
38863,PORTAIL LATÉRAL. ÉGLISE SAINT LAURENT RUE SIBOUR,Eugène Atget,229,"(French, 1857–1927)",(French),(1857),(1927),(Male),1908,Albumen silver print,...,,,,,,,,,,1968.0


In [14]:
# Filter for hyphenated format dates
hyphenated_format = (moma_solo['Date'].str.match(r'^[^\d]*?\d{4} ?[-–/] ?\d{2,4}[^\d]*?$').fillna(False))

moma_solo[hyphenated_format].sample(10)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.),year_acquired
135567,Untitled,Unidentified photographer,8595,,(),(0),(0),(),c. 1930-40,Gelatin silver print,...,,,,7.8,,,5.2,,,2019.0
9108,"THE PATH ALONG THE TREES, plate 8 (folio 21) f...",Markus Lüpertz,3640,"(German, born 1941)",(German),(1941),(0),(Male),1988-1989,,...,,,,20.9,,,15.7,,,1990.0
85179,Untitled Xerox Cut-Out (Betty Ford/Alcoholism),Cady Noland,7817,"(American, born 1956)",(American),(1956),(0),(Female),1993–94,Cut printed paper in artist's frame,...,,,,27.940056,,,38.100076,,,2005.0
128685,Untitled,Miguel Rio Branco,49585,"(Brazilian, born 1946)",(Brazilian),(1946),(0),(Male),1970-72,Gelatin silver print,...,,,,24.447549,,,15.24003,,,2017.0
50027,Untitled,Doris Ulmann,6004,"(American, 1884–1934)",(American),(1884),(1934),(Female),1929-31,Photogravure,...,,,,21.3,,,16.2,,,1974.0
104055,Museum Dinner Service,Eva Zeisel,6556,"(American, born Hungary. 1906–2011)",(American),(1906),(2011),(Female),c. 1942-45,Glazed porcelain,...,,,,,,,,,,
125358,Davidson Wayside Markets Project (Ground-floor...,Frank Lloyd Wright,6459,"(American, 1867–1959)",(American),(1867),(1959),(Male),1932–1933,"Pencil, colored pencil, and ink on paper",...,,0.0,,28.257557,,,35.560071,,,
14614,"SATAN III, plate VIII (page 53) from LES FLEUR...",Georges Rouault,5053,"(French, 1871–1958)",(French),(1871),(1958),(Male),1925-66,"Etching and drypoint over photogravure, printe...",...,,,,35.5,,,26.2,,,1967.0
112368,Hidden Structures 5,Dóra Maurer,42457,"(Hungarian, born 1937)",(Hungarian),(1937),(0),(Female),1977-80,Pencil on folded paper,...,,,,49.8476,,,64.77013,,,2012.0
26496,"CROWS IN WINTER (in-text plate, volume II, pag...",Aristide Maillol,3697,"(French, 1861–1944)",(French),(1861),(1944),(Male),1908-1950,,...,,,,10.7,,,11.6,,,1964.0


Now let's sample the `Date` feature for values that are non-standard and non-hyphenated, but still have numbers in there somewhere.

In [15]:
non_standard = (~hyphenated_format & ~standard_format & ~non_numerical)

print("There are {:,} non-standard `Date` values."
      .format(len(moma_solo[non_standard])))

print()

print(moma_solo[non_standard]['Date'].dropna().sample(20))

There are 2,073 non-standard `Date` values.

135797                                            1870s-80s
125211                                           1951/52–55
132057                           1958, assembled c. 1965-66
109158          1919 (reproduced drawings executed 1908–09)
64027                               1921 (executed 1920-21)
132042                           1962, assembled c. 1965-66
132051                           1961, assembled c. 1965-66
132003                           1964, assembled c. 1965-66
39886                                 Mars 1926, 9 h. matin
8335                          1912 (print executed 1908-09)
132009                           1964, assembled c. 1965-66
135811                                            1870s-80s
135615                                            1970s-80s
132004                           1964, assembled c. 1965-66
62900                            1915–16, published 1922–23
108954          1893 (reproduced drawings executed 1891

These could be dealt with individually, since there are clear patterns here. But since there are so few of these kinds of formats (2,073 of 133,107 single-artist records, or roughly 1.6%), I'm going to be a little faster and looser with extracting dates.

#### 1.2.2. Extracting Stripped Dates

In [16]:
# Simplify `Date`
moma_solo['date_stripped'] = moma_solo['Date'].str.extract(r'^.*?(\d{4} ?[-–/]? ?\d{0,4})')

moma_solo[['Date', 'date_stripped']].sample(10)

Unnamed: 0,Date,date_stripped
125736,July 1969,1969
67960,1962–66,1962–66
57628,1994,1994
50322,c. 1965,1965
96566,1965-66,1965-66
130990,"1961, assembled c. 1976",1961
96864,1964,1964
47826,c. 1962,1962
45481,1981,1981
58351,"1974, published 1975",1974


That's working as expected, but let's see what's happening to the non-standard examples from above:

In [17]:
moma_solo[non_standard][['Date', 'date_stripped']].dropna().sample(10)

Unnamed: 0,Date,date_stripped
63975,1910–11 (recto); 1917–18 (verso),1910–11
73536,1911? (dated on painting 1911-12),1911
21386,1952. (Commissioned by Vollard; plates execut...,1952
62342,"1933, printed 1939-49",1933
132060,"1964, assembled c. 1965-66",1964
13869,"(1950, print executed 1949-50)",1950
131889,"1960–62, assembled 1964–65",1960–62
62886,"1918, published 1922–23",1918
65502,"1915–16, published 1916–17",1915–16
132013,"1964, assembled c. 1965-66",1964


This is a compromise I'm willing to live with. We're losing a little bit of granularity and specificity, since we're ending up only with the date a work was conceived but not the date it was produced, or vice-versa. But we'll still end up with a good approximation--fine for our purposes here.
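
To make the compromise concrete, here is a minimal sketch of the extraction on individual strings. It uses `re.search` with the same pattern as the `str.extract` call above. The helper name `strip_date` is hypothetical, and the inputs are `Date` values that appear in the outputs of this notebook.

```python
import re

# Same pattern as the `str.extract` call that builds `date_stripped`.
STRIP = re.compile(r'^.*?(\d{4} ?[-–/]? ?\d{0,4})')

def strip_date(date):
    """Return the first year or year range in `date`, or None if absent."""
    found = STRIP.search(date)
    return found.group(1) if found else None

examples = [
    'c. 1965',
    '1962–66',
    '1961, assembled c. 1976',
    '1910–11 (recto); 1917–18 (verso)',
    '1921 (executed 1920-21)',   # the optional space is kept in the result
    'n.d.',
]

for date in examples:
    print(f'{date!r} -> {strip_date(date)!r}')
```

Only the first year or range survives, so later dates (publication, assembly, casting) are discarded. The sketch also suggests that a year followed directly by a space comes back with that space attached, which may be worth keeping in mind when parsing `date_stripped` further.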

#### 1.2.3. Extracting `begun_year` and `completed_year` from `date_stripped`
With a cleaner field to work with, we can now extract start and end dates for each artwork.

In [18]:
# Extract start year from `date_stripped`
moma_solo['begun_year'] = moma_solo['date_stripped'].str[:4].astype(float)

# Extract end year from `date_stripped`: century digits of the last
# four-digit year plus the final two digits of the string
moma_solo['completed_year'] = (
    moma_solo['date_stripped'].str.extract(r'.*(\d{2})\d{2}')
    + moma_solo['date_stripped'].str.extract(r'(\d{2})[-–/]?$')
).astype(float)

# Preview results
moma_solo[['Date', 'date_stripped', 'begun_year', 'completed_year']].sample(20, random_state=123)

Unnamed: 0,Date,date_stripped,begun_year,completed_year
121803,1962,1962,1962.0,1962.0
86326,,,,
49187,1948,1948,1948.0,1948.0
117585,1973,1973,1973.0,1973.0
18770,"1923, published 1977",1923,1923.0,1923.0
42425,1932,1932,1932.0,1932.0
18250,"1941, published 1943",1941,1941.0,1941.0
115693,1916-23,1916-23,1916.0,1923.0
33947,1912,1912,1912.0,1912.0
65814,1947,1947,1947.0,1947.0


And again, let's just have a look at the non-standard ones:

In [19]:
moma_solo[non_standard][['Date', 'date_stripped', 'begun_year', 'completed_year']].dropna().sample(10)

Unnamed: 0,Date,date_stripped,begun_year,completed_year
135805,1870s-80s,1870,1870.0,1870.0
132049,"1961, assembled c. 1965-66",1961,1961.0,1961.0
132015,"1964, assembled c. 1965-66",1964,1964.0,1964.0
57432,"1959, printed 1963–64",1959,1959.0,1959.0
132020,"1961, assembled c. 1965-66",1961,1961.0,1961.0
62489,"(1922, executed 1920-21)",1922,1922.0,1922.0
75777,1965-66 (cast 1967-68),1965-66,1965.0,1966.0
110642,(c. 1910s-30s),1910,1910.0,1910.0
13865,"(1950, print executed 1949-50)",1950,1950.0,1950.0
71898,"1947, published 1952–53",1947,1947.0,1947.0
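
The `completed_year` logic is compact, so here is a scalar sketch of what the two `str.extract` calls do when combined. It is a pure-Python mirror written for illustration, not the code used in the notebook. The last two inputs are assumptions of mine: a stripped value with a trailing space, and a hypothetical range that crosses a century.

```python
import re

def completed_year(date_stripped):
    """Century digits of the last four-digit year + the final two digits."""
    century = re.search(r'.*(\d{2})\d{2}', date_stripped)
    ending = re.search(r'(\d{2})[-–/]?$', date_stripped)
    if century is None or ending is None:
        return None  # the pandas version yields NaN in this case
    return float(century.group(1) + ending.group(1))

print(completed_year('1962'))       # 1962.0
print(completed_year('1916-23'))    # 1923.0
print(completed_year('1927-1930'))  # 1930.0
print(completed_year('1921 '))      # None: the trailing space blocks the `$` match
print(completed_year('1998–02'))    # 1902.0: hypothetical range crossing a century
```

If this mirror is faithful, two edge cases follow: stripped values ending in a space produce no completed year (and would be hidden by `.dropna()` in the preview above), and a two-digit end year always inherits the century of the start year.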


### 1.3. Parsing `BeginDate`/`EndDate` and Extracting `birth_year`/`death_year`

Next, I'll extract the artist's birth year and death year.

In [20]:
# Create new features for artist birth year and death year
moma_solo['birth_year'] = moma_solo['BeginDate'].str.extract(r'\((\d+?)\)').astype(float)
moma_solo['death_year'] = moma_solo['EndDate'].str.extract(r'\((\d+?)\)').astype(float)

# Preview
moma_solo[['BeginDate', 'birth_year', 'EndDate', 'death_year']].sample(20)

Unnamed: 0,BeginDate,birth_year,EndDate,death_year
119913,(1949),1949.0,(0),0.0
29093,(1871),1871.0,(1958),1958.0
50322,(1928),1928.0,(1984),1984.0
9075,(1930),1930.0,(1982),1982.0
105768,(1924),1924.0,(1976),1976.0
94346,(1911),1911.0,(2010),2010.0
25277,(1866),1866.0,(1944),1944.0
132018,(1923),1923.0,(2006),2006.0
5399,(0),0.0,(0),0.0
84852,(1955),1955.0,(0),0.0


Make sure that there are no weird birth years.

In [21]:
moma_solo['birth_year'].sort_values().unique()

array([   0., 1730., 1731., 1746., 1753., 1765., 1772., 1782., 1787.,
       1789., 1792., 1795., 1796., 1797., 1798., 1799., 1800., 1801.,
       1802., 1804., 1808., 1809., 1810., 1811., 1812., 1813., 1814.,
       1815., 1816., 1817., 1818., 1819., 1820., 1821., 1822., 1823.,
       1824., 1825., 1826., 1827., 1828., 1829., 1830., 1831., 1832.,
       1833., 1834., 1835., 1836., 1837., 1838., 1839., 1840., 1841.,
       1842., 1843., 1844., 1845., 1846., 1847., 1848., 1849., 1850.,
       1851., 1852., 1853., 1854., 1855., 1856., 1857., 1858., 1859.,
       1860., 1861., 1862., 1863., 1864., 1865., 1866., 1867., 1868.,
       1869., 1870., 1871., 1872., 1873., 1874., 1875., 1876., 1877.,
       1878., 1879., 1880., 1881., 1882., 1883., 1884., 1885., 1886.,
       1887., 1888., 1889., 1890., 1891., 1892., 1893., 1894., 1895.,
       1896., 1897., 1898., 1899., 1900., 1901., 1902., 1903., 1904.,
       1905., 1906., 1907., 1908., 1909., 1910., 1911., 1912., 1913.,
       1914., 1915.,

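Replace 0 with NaN, since a birth year of 0 is a placeholder for an unknown value. (`death_year` keeps its 0 values for now, because the `living` flag in section 2.2 checks for them.)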

In [22]:
null_birthyear = (moma_solo['birth_year'] == 0)

moma_solo.loc[null_birthyear, 'birth_year'] = np.nan
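
The different treatment of 0 in the two columns is easy to lose track of, so here is a scalar sketch of the parsing. The function `parse_year` and its `zero_is_missing` flag are hypothetical names used only for this illustration.

```python
import re

def parse_year(value, zero_is_missing):
    """Pull the digits out of a value such as '(1841)'."""
    found = re.search(r'\((\d+?)\)', value)
    if found is None:
        return None
    year = float(found.group(1))
    if zero_is_missing and year == 0:
        return None
    return year

# Birth year: 0 is treated as missing.
print(parse_year('(1841)', zero_is_missing=True))   # 1841.0
print(parse_year('(0)', zero_is_missing=True))      # None

# Death year: 0 is left in place.
print(parse_year('(0)', zero_is_missing=False))     # 0.0
```

In the rows previewed earlier, an `EndDate` of `(0)` goes together with bios such as "born 1944", so it appears to mean "no recorded death year" rather than a truly unknown value.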

### 1.4. Standardize `Gender`

In [23]:
# Standardize `Gender`
gender_map = {
    '(Male)': 'Male',
    '(male)': 'Male',
    '(Female)': 'Female',
    '(female)': 'Female',
    '(Non-Binary)': 'Non-Binary',
    '(Non-binary)': 'Non-Binary'
}

moma_solo['Gender'] = moma_solo['Gender'].map(gender_map)

### 1.5. Review
Here's how our dataset now looks, focusing on columns of interest:

In [24]:
cols = [
    'Title', 'Artist', 'BeginDate', 'EndDate', 'Date', 'DateAcquired', 
    'year_acquired', 'date_stripped', 'begun_year', 'completed_year', 
    'birth_year', 'death_year'
]

moma_solo[cols].sample(10)

Unnamed: 0,Title,Artist,BeginDate,EndDate,Date,DateAcquired,year_acquired,date_stripped,begun_year,completed_year,birth_year,death_year
100974,Mechanical for various Fluxus projects,,,,,2008-10-08,2008.0,,,,,
134087,Panel from Let's Take Back Our Space: 'Female'...,Marianne Wex,(1937),(2020),1977,2018-10-24,2018.0,1977,1977.0,1977.0,1937.0,2020.0
81066,Light on Water C,Richard Tuttle,(1941),(0),2003,2003-11-20,2003.0,2003,2003.0,2003.0,1941.0,0.0
113668,"Urban Proposal with Multi Thin-Shell Capsules,...",Daniel Grataloup,(1937),(0),1970,2012-10-15,2012.0,1970,1970.0,1970.0,1937.0,0.0
50049,Untitled,Doris Ulmann,(1884),(1934),1929-31,1974-10-01,1974.0,1929-31,1929.0,1931.0,1884.0,1934.0
74265,Tournament,Adolph Gottlieb,(1903),(1974),1951,1984-12-11,1984.0,1951,1951.0,1951.0,1903.0,1974.0
103959,"Hermann Lange House, Krefeld, Germany, Site pl...",Ludwig Mies van der Rohe,(1886),(1969),c.1927-1930,,,1927-1930,1927.0,1930.0,1886.0,1969.0
110173,Studien zu Passtücken (Studies on Adaptives),Franz West,(1947),(2012),1980-87/2006,2012-02-16,2012.0,1980-87,1980.0,1987.0,1947.0,2012.0
73117,Seamstress,Raphael Soyer,(1899),(1987),1956-60,1961-01-10,1961.0,1956-60,1956.0,1960.0,1899.0,1987.0
75637,Synthesis,Alexander Liberman,(1912),(1999),n.d.,1995-12-12,1995.0,,,,1912.0,1999.0


## 2. Computing New Features

Now we're ready to compute the remaining features we need:
- `artwork_age`: `year_acquired` - `completed_year`
- `artist_age`: `year_acquired` - `birth_year`
- `years_posthumous`: `year_acquired` - `death_year`
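
As a worked example of these three formulas, here is a plain-Python sketch using two rows from the review table in section 1.5. The `derive` helper is hypothetical. It also ignores missing values and the `0` placeholder for artists with no recorded death year, both of which the pandas cells below have to handle.

```python
# Values copied from two rows of the review table in section 1.5.
rows = [
    # title, birth_year, death_year, completed_year, year_acquired
    ('Tournament', 1903, 1974, 1951, 1984),   # Adolph Gottlieb
    ('Seamstress', 1899, 1987, 1960, 1961),   # Raphael Soyer
]

def derive(birth_year, death_year, completed_year, year_acquired):
    """Return (artwork_age, artist_age, years_posthumous) for one work."""
    artwork_age = year_acquired - completed_year
    alive = year_acquired < death_year
    artist_age = year_acquired - birth_year if alive else None
    years_posthumous = None if alive else year_acquired - death_year
    return artwork_age, artist_age, years_posthumous

for title, *years in rows:
    print(title, derive(*years))
```

"Tournament" was acquired 33 years after completion and 10 years after Gottlieb's death, while "Seamstress" was acquired 1 year after completion, when Soyer was 62.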

### 2.1. Compute `artwork_age` for all works

In [25]:
moma_solo['artwork_age'] = moma_solo['year_acquired'] - moma_solo['completed_year']

# Preview
moma_solo[['completed_year', 'year_acquired', 'artwork_age']].sample(10, random_state=111)

Unnamed: 0,completed_year,year_acquired,artwork_age
94488,1930.0,,
51741,1976.0,2000.0,24.0
41005,1922.0,1968.0,46.0
50725,1955.0,1998.0,43.0
22006,1933.0,1964.0,31.0
37951,1925.0,1968.0,43.0
105652,1996.0,2011.0,15.0
29527,1930.0,1964.0,34.0
14045,1911.0,1966.0,55.0
89781,1932.0,1974.0,42.0


### 2.2. Engineer `living` feature to categorize whether artist was alive or deceased at acquisition

In [26]:
moma_solo['living'] = (
    np.where((moma_solo['year_acquired'] < moma_solo['death_year']) | (moma_solo['death_year'] == 0), 1, 0)
)

# Preview
moma_solo[['year_acquired', 'death_year', 'living']].sample(10, random_state=111)

Unnamed: 0,year_acquired,death_year,living
94488,,1969.0,0
51741,2000.0,0.0,1
41005,1968.0,1927.0,0
50725,1998.0,1984.0,0
22006,1964.0,1979.0,1
37951,1968.0,1927.0,0
105652,2011.0,0.0,1
29527,1964.0,1974.0,1
14045,1966.0,1957.0,0
89781,1974.0,1969.0,0
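
The first row of this preview has no `year_acquired` and still gets `living = 0`. The sketch below is a scalar version of the `np.where` condition, written in plain Python to show why. The function name `living_flag` is my own, and the inputs are taken from rows shown in this notebook.

```python
import math

def living_flag(year_acquired, death_year):
    """Scalar version of the condition used to build `living`."""
    return 1 if (year_acquired < death_year) or (death_year == 0) else 0

nan = math.nan
print(living_flag(2000.0, 0.0))      # 1: no recorded death year
print(living_flag(1964.0, 1979.0))   # 1: acquired before the artist died
print(living_flag(1968.0, 1927.0))   # 0: acquired after the artist died
print(living_flag(nan, 1969.0))      # 0: any comparison with NaN is False
print(living_flag(2008.0, nan))      # 0: same when the death year is missing
```

So `living == 0` is perhaps best read as "not known to be alive at acquisition" rather than "known to be deceased". This is harmless for the features computed next, since subtracting with a NaN still gives NaN.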


### 2.3. Compute `artist_age` for Artists Alive at Acquisition

In [27]:
moma_solo['artist_age'] = (
    np.where(moma_solo['living'] == 1, moma_solo['year_acquired'] - moma_solo['birth_year'], np.nan)
)

# Preview
moma_solo[['year_acquired', 'birth_year', 'death_year', 'living', 'artist_age']].sample(10, random_state=111)

Unnamed: 0,year_acquired,birth_year,death_year,living,artist_age
94488,,1886.0,1969.0,0,
51741,2000.0,1934.0,0.0,1,66.0
41005,1968.0,1857.0,1927.0,0,
50725,1998.0,1928.0,1984.0,0,
22006,1964.0,1898.0,1979.0,1,66.0
37951,1968.0,1857.0,1927.0,0,
105652,2011.0,1946.0,0.0,1,65.0
29527,1964.0,1884.0,1974.0,1,80.0
14045,1966.0,1871.0,1957.0,0,
89781,1974.0,1886.0,1969.0,0,


### 2.4. Compute `years_posthumous` for Artists Deceased at Acquisition

In [28]:
moma_solo['years_posthumous'] = (
    np.where(moma_solo['living'] == 0, moma_solo['year_acquired'] - moma_solo['death_year'], np.nan)
)

# Preview
moma_solo[['year_acquired', 'birth_year', 'death_year', 'living', 'artist_age', 'years_posthumous']].sample(10, random_state=111)

Unnamed: 0,year_acquired,birth_year,death_year,living,artist_age,years_posthumous
94488,,1886.0,1969.0,0,,
51741,2000.0,1934.0,0.0,1,66.0,
41005,1968.0,1857.0,1927.0,0,,41.0
50725,1998.0,1928.0,1984.0,0,,14.0
22006,1964.0,1898.0,1979.0,1,66.0,
37951,1968.0,1857.0,1927.0,0,,41.0
105652,2011.0,1946.0,0.0,1,65.0,
29527,1964.0,1884.0,1974.0,1,80.0,
14045,1966.0,1871.0,1957.0,0,,9.0
89781,1974.0,1886.0,1969.0,0,,5.0


Looking good!

## 3. Test Cleaning Script

I've incorporated all the above steps into `art_stats_utils.py`. Now I want to confirm that the results are identical.

In [29]:
from art_stats_utils import prepare_dataset

df = pd.read_csv('../data/moma/Artworks.csv')

df = prepare_dataset(df)

In [30]:
df.equals(moma_solo)

True
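
Had this check returned `False`, `DataFrame.equals` would not say where the two frames differ. One option in that case is `pandas.testing.assert_frame_equal`, sketched here on toy frames. The frames `left` and `right` are hypothetical stand-ins for `df` and `moma_solo`.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Toy frames standing in for `df` and `moma_solo`.
left = pd.DataFrame({'year_acquired': [1990.0, np.nan], 'living': [1, 0]})
right = left.copy()

# Both checks treat NaNs in matching positions as equal.
print(left.equals(right))
assert_frame_equal(left, right)  # silent on success

# A single differing value makes assert_frame_equal raise with details.
right.loc[0, 'living'] = 0
try:
    assert_frame_equal(left, right)
except AssertionError as err:
    print(str(err).splitlines()[0])
```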

All set!