# Cleaning and preparing Museum of Modern Art Data in Python for analysis

---

We will be working with a [data set](https://data.world/moma/collection) from data.world: data on artworks and artists maintained by the [Museum of Modern Art](https://www.moma.org/) (MoMA) in New York.
I'll implement techniques to tidy up the data, and then tinker with and analyze it. Throughout this project I will use Python's built-in `csv` module along with some simple functions and loops.
There are great packages (`numpy`, `pandas`, etc.) for data analysis, but in this project I stick to the basics.

In [1]:
import csv
import re


---
##  A quick preview of the header and a few rows of the data in `artworks_3.csv`.

Below is a quick glance at the header and the first row.

In [2]:
# simple printing function
def print_preview(dataset, start=0, end=1): 
    '''
    Print the rows in dataset[start:end] (only the first row by default).
    '''
    for i in dataset[start:end]:
        print('\n\n', i)

with open('/home/serge/.dw/cache/moma/collection/latest/data/artworks_3.csv') as file:
    artworks = list(csv.reader(file))
    header = artworks[0]
    artworks = artworks[1:]
print(header)
print_preview(artworks)

['title', 'artist', 'constituentid', 'artistbio', 'nationality', 'begindate', 'enddate', 'gender', 'date', 'medium', 'dimensions', 'creditline', 'accessionnumber', 'classification', 'department', 'dateacquired', 'cataloged', 'objectid', 'url', 'thumbnailurl', 'circumference_cm', 'depth_cm', 'diameter_cm', 'height_cm', 'length_cm', 'weight_kg', 'width_cm', 'seat_height_cm', 'duration_sec']


 ['Ferdinandsbrücke Project, Vienna, Austria, Elevation, preliminary version', 'Otto Wagner', '6210', '(Austrian, 1841–1918)', '(Austrian)', '(1841)', '(1918)', '(Male)', '1896', 'Ink and cut-and-pasted painted pages on paper', '19 1/8 x 66 1/2" (48.6 x 168.9 cm)', 'Fractional and promised gift of Jo Carole and Ronald S. Lauder', '885.1996', 'Architecture', 'Architecture & Design', '1996-04-09', 'Y', '2', 'http://www.moma.org/collection/works/2', 'http://www.moma.org/media/W1siZiIsIjU5NDA1Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=137b8455b1ec6167', '', '', '', '48.6', '', ''

---

# Cleaning the dataset:
In the cells below we'll apply a series of cleaning steps:

- Check the length of each row and delete outliers to avoid indexing problems.
- Because many artworks (rows) carry several values per field (for example, multiple artists), convert the artist-related fields (index 1 through index 7) to lists.
- Strip extra characters from the values and convert them to integers where possible.
- Fill empty fields (`nationality`, `gender`, `date`, etc.) with appropriate placeholder values.

---
To avoid problems later, I'll start by finding, inspecting, and removing outliers. Rows worth keeping (non-outliers) will at minimum contain:

- `constituentid`
- `objectid`
- `url`
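
As a minimal sketch of that rule, the check below looks up the three columns by name and reports which of them a row is too short to contain. The helper name `missing_required` and the tiny sample rows are made up for illustration; they are not part of the MoMA files. Whether a blank value should also count as missing is a judgment call, so this sketch only checks that the column exists.

```python
# Hypothetical helper: report required columns that a row cannot reach.
REQUIRED = ('constituentid', 'objectid', 'url')

def missing_required(row, header, required=REQUIRED):
    """Return the required column names that the row is too short to contain."""
    return [name for name in required if header.index(name) >= len(row)]

# Made-up sample data, only to show the behaviour.
sample_header = ['title', 'constituentid', 'objectid', 'url']
sample_rows = [
    ['Some title', '6210', '2', 'http://www.moma.org/collection/works/2'],
    ['New York City Transit Authority'],
]
for row in sample_rows:
    print(row[0], '->', missing_required(row, sample_header))
```

Looking columns up with `header.index(name)` avoids hard-coding positions, which helps if the column order ever changes.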

In [3]:
# compare the number of fields in each row to the header
for i, k in enumerate(artworks):
    if len(k) < len(header):
        print(f"At row: {i} {k} is an outlier")

# rebuild the list instead of removing items while looping over it
artworks = [k for k in artworks if len(k) >= len(header)]

At row: 81142 ['New York City Transit Authority'] is an outlier


Only one row is out of place: it contains nothing but a name, so it was removed.
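
A caveat worth knowing when dropping rows from a list: removing items while looping over the same list makes the loop skip the element that follows each removal. With a single outlier the result is the same either way, but two short rows next to each other would not both be caught. The toy example below uses made-up data, not MoMA rows, to show the effect next to a filter-based alternative.

```python
# Toy rows: two short rows sit next to each other.
rows = [['a', 'b', 'c'], ['x'], ['y'], ['d', 'e', 'f']]
width = 3

# Removing while iterating: the row after each removal is skipped.
in_place = [list(r) for r in rows]
for r in in_place:
    if len(r) < width:
        in_place.remove(r)

# Building a new list checks every row exactly once.
filtered = [r for r in rows if len(r) >= width]

print(in_place)   # ['y'] survives because the loop never visited it
print(filtered)
```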

---
Below I'll make functions and apply them in a series of list comprehensions to:

- Remove the parentheses separating artist information.
- Convert to integers where possible.
- "listify" data points.

For artist `begindate` and `enddate`:

- Year is sufficient. Month and day will be removed.

Index 8, `date`, records when the artwork was completed by the artist(s). These dates are free text and not standardized, so the function below:

- Ignores non-digit characters and whitespace and keeps the first four-digit year it finds (ex: `1976-82` becomes 1976).
- Returns 0 when no four-digit year is present (ex: `c. 99-04`).
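
For range dates, one way to get a midpoint is sketched below. `midpoint_year` is a hypothetical helper, not one of the functions used in this notebook. It assumes a range starts with a four-digit year and ends with either a two- or four-digit year, so inputs like `c. 99-04` still fall through to 0.

```python
import re

def midpoint_year(text):
    """Hypothetical helper: midpoint of a year range, the single year, or 0."""
    # Four-digit start year, optionally followed by a hyphen or en dash
    # and a two- or four-digit end year.
    match = re.search(r'(\d{4})(?:\s*[-\u2013]\s*(\d{4}|\d{2})(?!\d))?', text)
    if not match:
        return 0
    start = int(match.group(1))
    end_text = match.group(2)
    if end_text is None:
        return start
    end = int(end_text)
    if len(end_text) == 2:
        # Borrow the century from the start year, e.g. 1976-82 -> 1982.
        end += start - start % 100
        if end < start:
            end += 100  # range crosses a century, e.g. 1998-02 -> 2002
    return (start + end) // 2

for sample in ['1976-82', 'c. 1960-1964', '1935', 'Unknown']:
    print(repr(sample), '->', midpoint_year(sample))
```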

In [4]:
def artist_fix(artist):
    artist = artist.strip()
    if artist == '':
        return 'Unknown Artist'
    return artist

def constituentid_fix(num):
    if not num:
        return 0
    elif '.' in num: # a few constituentid values have periods
        return float(num)
    return int(num)

def bio_fix(bio):
    bio = bio.strip()
    if bio == '':
        return 'Unknown Biography'
    return bio

def nationality_fix(nat):
    nat = nat.strip()
    if nat == '':
        return 'Unknown Nationality'
    return nat

def artist_date_fix(date):
    date = date.strip()
    if not date:
        return 0
    elif '-' in date:
        year = date.split('-', 1)[0] # keep the year, drop month and day
        return int(year)
    return int(date)

def gender_fix(gender):
    check = ['male', 'female', 'unknown/other gender']
    gender = gender.strip()
    if not gender:
        return 'Unknown/Other Gender'
    elif gender.lower() not in check:
        return 'Unknown/Other Gender'
    return gender

def artpiece_date_fix(date):
    """
    Returns the first 4-digit year found in the date text. Returns 0 if there is none.
    """
    pattern = re.compile(r"(\d\d\d\d)")
    match = pattern.search(date)
    if match:
        return int(match.group())
    return 0

def cleanrows(dataset):
    """
    Returns the dataset, modified in place.
    Strips and splits the fields at index 1-7, turning them into lists,
    and reduces index 8 (date) to a single year.
    """
    for row in dataset: # turn fields into lists
        row[1] = [artist_fix(artist) for artist in row[1].split(',')]
        row[2] = [constituentid_fix(num) for num in row[2].split(',')]
        row[3] = [bio_fix(bio) for bio in row[3][:-1].replace('(','').split(')')]
        row[4] = [nationality_fix(nat) for nat in row[4][:-1].replace('(','').split(')')]
        row[5] = [artist_date_fix(number) for number in row[5][:-1].replace('(','').strip().split(')')]
        row[6] = [artist_date_fix(number) for number in row[6][:-1].replace('(', '').strip().split(')')]
        row[7] = [gender_fix(gender) for gender in row[7][:-1].replace('(', '').strip().split(')')]
        row[8] = artpiece_date_fix(row[8])

    return dataset

artworks = cleanrows(artworks)
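
To make the slice, replace, and split chain in `cleanrows` concrete, here it is on a single field, next to a roughly equivalent `re.findall` version. The input string is an assumed example of a two-artist value, not copied from the file.

```python
import re

field = '(Austrian) (American)'  # assumed two-artist value

# Same chain as in cleanrows: drop the last ')', remove every '(', split on ')'.
by_split = [part.strip() for part in field[:-1].replace('(', '').split(')')]

# Alternative: capture whatever sits between each pair of parentheses.
by_regex = [part.strip() for part in re.findall(r'\(([^)]*)\)', field)]

print(by_split)
print(by_regex)
```

The two approaches differ on a completely empty field: the split version yields `['']`, which the `*_fix` functions then turn into an "Unknown" placeholder, while `re.findall` yields an empty list.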

---

The sample print-out below shows that the functions and list comprehensions worked as intended.

In [5]:
print(header)
print_preview(artworks,132033, 132035) # sample slice

['title', 'artist', 'constituentid', 'artistbio', 'nationality', 'begindate', 'enddate', 'gender', 'date', 'medium', 'dimensions', 'creditline', 'accessionnumber', 'classification', 'department', 'dateacquired', 'cataloged', 'objectid', 'url', 'thumbnailurl', 'circumference_cm', 'depth_cm', 'diameter_cm', 'height_cm', 'length_cm', 'weight_kg', 'width_cm', 'seat_height_cm', 'duration_sec']


 ['Anima 2, performed during Concert No. 1, Fully Guaranteed 12 Fluxus Concerts, Fluxhall, 359 Canal Street, New York, April 11, 1964', ['Takehisa Kosugi'], [3227], ['Japanese, born 1938'], ['Japanese'], [1938], [0], ['Male'], 1964, 'Gelatin silver print', 'sheet: 9 15/16 × 8" (25.3 × 20.3 cm)', 'The Gilbert and Lila Silverman Fluxus Collection Gift', '3734.2008.C02.x1-x3', 'Photograph', 'Fluxus Collection', '', 'N', '273247', '', '', '', '', '', '25.3', '', '', '20.3', '', '']


 ['Carrot Chew, performed during Concert No. 1, Fully Guaranteed 12 Fluxus Concerts, Fluxhall, 359 Canal Street, New York

---

# Analyze the Data

Next we'll:

- Calculate how old the artist was when they created their artwork
- Count artist frequency in `artworks`
- Analyze and interpret the distribution of artist ages
- Create functions which summarize our data
- Print summaries in an easy-to-read way
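
For the age-distribution step, one simple approach is to bucket ages by decade. The sketch below assumes ages come as one list per artwork with 0 meaning unknown, and it uses assumed plausibility bounds (above 0 and below 120) to drop bad values. `age_decades` and the sample numbers are made up for illustration.

```python
from collections import Counter

def age_decades(ages_per_row):
    """Count ages by decade, skipping the 0 placeholder and implausible values."""
    buckets = Counter()
    for ages in ages_per_row:
        for age in ages:
            if 0 < age < 120:  # assumed plausibility bounds
                buckets[(age // 10) * 10] += 1
    return dict(sorted(buckets.items()))

# Made-up ages: one inner list per artwork, one age per credited artist.
sample = [[55], [43], [27], [36], [27], [32], [0], [-5, 61]]
for decade, count in age_decades(sample).items():
    print(f"{decade}s: {count}")
```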

---

In [6]:
def age_at_artwork_finish(dataset):
    """
    Returns a list with one entry per row: a list of approximate artist ages
    at the time the artwork was finished (completion year minus birth year,
    or 0 when either year is unknown).
    """
    final_ages = []

    for row in dataset:
        art_done_date = row[8]
        dob_list = row[5]
        pre_age = []

        for dob in dob_list:
            if (not dob) or (not art_done_date):
                pre_age.append(0)
            else:
                pre_age.append(art_done_date - dob)
        final_ages.append(pre_age)
    return final_ages

def artist_freq(dataset):
    """
    Return dict of artist name frequencies in the whole dataset  
    """
    artist_freq_counter = dict()

    for row in dataset:
        artist_name = row[1]

        for artist in artist_name:

            if artist not in artist_freq_counter:
                artist_freq_counter[artist] = 1
            else:
                artist_freq_counter[artist] += 1
    return artist_freq_counter

def top_artist_freq(freq, n=20):
    """Print the n most frequent artists, most frequent first."""
    for k, v in sorted(freq.items(), key=lambda x: x[1], reverse=True)[:n]:
        print(f"{k} : {v}")
        
print(f"Sample artist ages during work completion: {age_at_artwork_finish(artworks[:10])}\n") # sample slice
print("Top 20 artists by frequency\n" )
top_artist_freq(artist_freq(artworks))

Sample artist ages during work completion: [[55], [43], [27], [36], [27], [32], [32], [32], [32], [32]]

Top 20 artists by frequency

Eugène Atget : 5050
Louise Bourgeois : 3327
Ludwig Mies van der Rohe : 2617
Unknown Artist : 2199
Unknown photographer : 1755
Jean Dubuffet : 1437
Lee Friedlander : 1335
Pablo Picasso : 1323
Marc Chagall : 1174
Henri Matisse : 1069
George Maciunas : 1016
Pierre Bonnard : 909
Lilly Reich : 841
Frank Lloyd Wright : 827
Various Artists : 766
August Sander : 750
Harry Shunk : 653
János Kender : 653
Georges Rouault : 633
Émile Bernard : 631
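
The same tally can be written more compactly with `collections.Counter` from the standard library. This is a sketch of an alternative, not the code that produced the output above; `artist_counts` and the sample rows are made up for illustration and assume the cleaned row shape, where index 1 holds a list of artist names.

```python
from collections import Counter

def artist_counts(dataset, artist_index=1):
    """Count how many rows credit each artist (row[artist_index] is a list of names)."""
    return Counter(name for row in dataset for name in row[artist_index])

# Made-up rows in the cleaned shape: title, then a list of artist names.
sample = [
    ['Work A', ['Artist One']],
    ['Work B', ['Artist One', 'Artist Two']],
    ['Work C', ['Artist Two']],
    ['Work D', ['Artist One']],
]
for name, count in artist_counts(sample).most_common(2):
    print(f"{name} : {count}")
```

`most_common(n)` returns the `n` highest counts already sorted, so no separate `sorted` call or slice is needed.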


---

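Below we do a simple gender count. Each artist credit on an artwork is counted once, so the totals are credits per gender rather than unique artists.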

In [7]:
def gender_freq(dataset):
    """Returns dictionary with key=Genders, value=Counts
    """
    freq = dict()
    for row in dataset:
        gender = row[7]
        for i in gender:
            i = i.lower()
            if i not in freq:
                freq[i] = 1
            else:
                freq[i] += 1
    return freq

gender_counts = gender_freq(artworks)

print("MoMA frequencies by gender: \n")
for k, v in gender_counts.items():
    print(f"    {v:,} {k.capitalize()} artists.")

MoMA frequencies by gender: 

    115,856 Male artists.
    19,881 Female artists.
    11,037 Unknown/other gender artists.
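
Raw counts are easier to compare as shares of the total. The sketch below reuses the three counts printed above and converts them to percentages; the variable names are illustrative.

```python
# Counts copied from the printed output above.
gender_counts = {'male': 115856, 'female': 19881, 'unknown/other gender': 11037}

total = sum(gender_counts.values())
shares = {gender: count / total for gender, count in gender_counts.items()}

for gender, share in shares.items():
    print(f"    {share:.1%} {gender.capitalize()}")
```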
