## **MoMA Collection Data Processing**
### Table of Contents
1. [General EDA & Initial Cleaning](#general)   
2. [Formatting review](#formatting-review)
3. [Standardizing Values](#standardizing-within)   
a. [Removing punctuation & special characters](#punctuation)   
b. [Cleaning Dates](#dates)
4. [Extracting from Biographies](#bio-extraction)   
a. [Extracting birthplace](#birthplace)   
b. [Imputing nationalities](#impute-nationalities)   
c. [Imputing Deceased Year & Creating `living` flag](#living)   
5. [Data Type Corrections](#datatypes)
6. [Title EDA](#title-eda)

For the purposes of this project, the data is limited to records that list an `Artist` and identify a single artist as the creator of the artwork.   
All work done to process data for multi-artist pieces has been moved to [The Graveyard](#graveyard):   
a. [Evaluating `nulls`](#nulls)  
b. [Deduping within values](#deduping-within)   


### Data Type Corrections <a class='anchor' id='datatypes'></a>

### General Cleaning & EDA <a class='anchor' id='general'></a>
#### Dropped:
- Records missing `Artist` (1,216 records representing 0.8% of the entire catalogue)
- Records with multiple artists listed in `Artist` (9,577 records representing 6.8% of the entire catalogue)
  - Includes design groups/firms
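
A minimal sketch of the two drop rules above, applied to one `Artist` value at a time. The pattern is the one used in this notebook's filter; `is_single_artist` and the sample names are illustrative only and are not part of `cleaners.moma`.

```python
import re

# Same pattern as the notebook's filter: a comma or the words
# "Associate"/"Architect" are treated as signs of a group or multi-artist credit.
MULTI_ARTIST = re.compile(r',|Associate|Architect')

def is_single_artist(artist) -> bool:
    """Return True only for a non-empty name with no multi-artist marker."""
    if not isinstance(artist, str) or not artist.strip():
        return False  # missing Artist (None / NaN / empty string)
    return MULTI_ARTIST.search(artist) is None

# Made-up sample values
for name in ['Pablo Picasso', 'Charles Eames, Ray Eames', 'Example Associates', None]:
    print(repr(name), is_single_artist(name))
```

Because the rule is purely textual, a single artist whose name happens to contain a comma is dropped as well.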

#### Remaining dataset used contains **131,271 artworks** from **10,875 artists**

In [3]:
import re
import pandas as pd
from cleaners.moma import *  # project-specific helpers

original_catalogue = pd.read_csv('./MoMA Data/Artworks.csv')
# any values without an artist listed will not be useful, setting those aside for potential investigation in the future
missing_artists = original_catalogue[original_catalogue.Artist.isnull()].copy()
artists_listed = original_catalogue.dropna(axis=0, subset=['Artist'])
# print(len(missing_artists)/len(original_catalogue)), print(len(missing_artists))
# removing works with multiple artists/groups credited; .copy() so later edits do not act on a view of artists_listed
catalogue = artists_listed[(artists_listed.Artist.str.contains(',|Associate|Architect', regex=True) == False)].copy()
# print((len(original_catalogue)-len(catalogue))/len(original_catalogue)), print(len(original_catalogue)-len(catalogue))

movements = pd.read_csv('./WikiArt Data/movements_by_artist.csv')

catalogue = catalogue.drop(columns=['URL','ThumbnailURL','Circumference (cm)','Depth (cm)','Diameter (cm)','Weight (kg)','Seat Height (cm)']) # dropping unneeded columns
# [print(f"| {i} | {catalogue[i].dtype} |  |") for i in catalogue.columns]

### Content & Formatting Review <a class='anchor' id='formatting-review'></a>
### Artist Information
| Column | DataType | Notes |
| --- | --- | --- |
| ConstituentID | object | `int` |
| Artist | object | Investigate false non-nulls (e.g. "Unknown" / "Unidentified"...) |
| ArtistBio | object | Extract `birthplace` & `birthyear`; create `living` flag |
| Gender | object | Clean punctuation & dedupe |
| Nationality | object | Clean punctuation & dedupe |
### Artwork Information
| Column | DataType | Notes |
| --- | --- | --- |
| AccessionNumber | object |  |
| ObjectID | int64 |  |
| Title | object | Create `untitled` flag |
| Medium | object | Investigate overlap w/`Classification` |
| Classification | object | Investigate overlap w/`Medium` |
| Dimensions | object |  |
| Height (cm) | object |  |
| Length (cm) | object |  |
| Width (cm) | object |  |
| Duration (sec.) | object | `int` |
### Dates
| Column | DataType | Notes |
| --- | --- | --- |
| BeginDate | object | `int`; Clean punctuation & standardize |
| EndDate | object | `int`; Clean punctuation & standardize |
| Date | object | `int`; Clean punctuation & standardize |
### Institutional Data
| Column | DataType | Notes |
| --- | --- | --- |
| CreditLine | object | Investigate potentially meaningful nulls |
| Department | object |  |
| DateAcquired | object | `int`; Clean punctuation & standardize |
| Cataloged | object | `bool` |
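
The `int` and `bool` notes in the tables above could be handled along these lines. This is only a sketch on a made-up frame: `to_nullable_int` is a hypothetical helper, and using pandas' nullable `Int64` / `boolean` dtypes (so that unparseable values become `<NA>` instead of forcing floats) is an assumption, not a decision this notebook has made.

```python
import pandas as pd

# Toy frame standing in for a few object-typed columns; all values are made up.
df = pd.DataFrame({
    'ConstituentID': ['6210', '7470', None],
    'BeginDate': ['1876', '-', '1930'],
    'Cataloged': ['Y', 'N', 'Y'],
})

def to_nullable_int(series: pd.Series) -> pd.Series:
    """Parse strings as integers; anything unparseable becomes <NA>."""
    return pd.to_numeric(series, errors='coerce').astype('Int64')

for col in ['ConstituentID', 'BeginDate']:
    df[col] = to_nullable_int(df[col])

# Y/N flag to a nullable boolean column
df['Cataloged'] = df['Cataloged'].map({'Y': True, 'N': False}).astype('boolean')

print(df.dtypes)
```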

In [4]:
# low hanging fruit
catalogue['Cataloged'] = catalogue['Cataloged'].replace({'Y':True,'N':False}) # converting Cataloged to boolean
catalogue['Untitled'] = catalogue.Title.str.contains('Untitled', na=False) # creating Untitled flag; a missing title is not flagged

### Standardizing Within Rows <a class="anchor" id="standardizing-within"></a>
#### Cleaning Punctuation <a class="anchor" id="punctuation"></a>

In [10]:
# raw strings keep the regex escapes intact
strip_punct(catalogue,['BeginDate','EndDate','Date'],r'[^0-9\-]')
strip_punct(catalogue,'Nationality',r'[^A-Za-z0-9\,\-\s]')
strip_punct(catalogue,'ArtistBio',r'[\(\)]')
strip_punct(catalogue,'Gender',r'[^A-Za-z\-\s]')

catalogue['Gender'] = catalogue['Gender'].str.capitalize() # .str accessor skips nulls instead of raising
catalogue.fillna('-', inplace=True) # filling nulls for easier review

### Cleaning Dates <a class="anchor" id="dates"></a>
- 3,148 records (2%) are missing a `Date` value; see [date imputation](#date-imputation)
- A large share of the remaining non-standard values can be captured with a handful of formats:
  - *YYYYYYYY*
  - *YYYY-YYYY*
  - *YYYY-YY*
  - *YYYY*, noticed and added later, which improved completeness by a further 2%
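
The formats above can be written as regular expressions. The sketch below is an illustration rather than this notebook's implementation: `split_date` is a hypothetical helper that returns the leading year plus whatever follows it, in the spirit of the `cleanedDate` / `Date2` split, and it assumes punctuation other than `-` has already been stripped.

```python
import re

# One pattern per format; fullmatch() means the whole string must fit.
DATE_PATTERNS = [
    re.compile(r'(\d{4})'),          # YYYY
    re.compile(r'(\d{4})(\d{4})'),   # YYYYYYYY
    re.compile(r'(\d{4})-(\d{4})'),  # YYYY-YYYY
    re.compile(r'(\d{4})-(\d{2})'),  # YYYY-YY
]

def split_date(raw: str):
    """Return (start_year, remainder), or None if no format matches."""
    for pattern in DATE_PATTERNS:
        match = pattern.fullmatch(raw)
        if match:
            groups = match.groups()
            remainder = groups[1] if len(groups) > 1 else None
            return int(groups[0]), remainder
    return None

for value in ['1964', '19641965', '1964-1965', '1964-66', '1910-111912', '-']:
    print(value, '->', split_date(value))
```

Values that match none of the patterns come back as `None`, so they stay visible for manual review instead of being silently truncated.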

In [54]:
# adding properly formatted dates to new column
catalogue['cleanedDate'] = 0
clean_dates = catalogue[(catalogue.Date.str.len()==4)&(catalogue.Date.str.contains(r'\D')==False)].index
catalogue.loc[clean_dates,'cleanedDate'] = catalogue.loc[clean_dates,'Date']
# integrity review printout
print('{:,} total rows \n{:,} rows have dates correctly formatted'.format(len(catalogue), len(catalogue[catalogue.Date.str.len()==4])))
print(f'completeness of Date {len(clean_dates)/(len(catalogue)):.0%}\n')

catalogue[catalogue.Date.str.len()!=4]['Date'].value_counts()[0:2]; #missing values
catalogue[catalogue.Date.str.len()!=4]['Date'].value_counts()[2:30]; # identifying most common, alternate Date formatting

131,271 total rows 
98,541 rows have dates correctly formatted
completeness of Date 75%



In [55]:
# cleaning date ranges yyyyyyyy and yyyyyy
# create index reference
eight_digits = catalogue[((catalogue.Date.str.len()==6)|(catalogue.Date.str.len()==8))&(catalogue.Date.str.contains(r'\D',regex=True)==False)].index
# apply processing function to rows of given index
# store formatted year in cleanedDate, remaining data in Date2 for later review
catalogue.loc[eight_digits,'Date2'] = catalogue.loc[eight_digits,'Date'].apply(lambda x: str(x)[4:])
catalogue.loc[eight_digits,'cleanedDate'] = catalogue.loc[eight_digits,'Date'].apply(lambda x: str(x)[0:4])
print(f'YYYYYYYY & YYYYYY formatting: completeness of Date improved {len(eight_digits)/(len(catalogue)):.0%}')

# pulling all dates with ranges yyyy-yyyy | yyyy-yy
# create index reference
dash_ranges = catalogue[((catalogue.Date.str.len()==7)|(catalogue.Date.str.len()==9))&(catalogue.Date.str.contains(r'\d{4}\-\d{1,4}',regex=True))].index
# apply processing function to rows of given index
# store formatted year in cleanedDate, remaining data in Date2 for later review
catalogue.loc[dash_ranges,'Date2'] = catalogue.loc[dash_ranges,'Date'].apply(lambda x: x.split('-')[1])
catalogue.loc[dash_ranges,'cleanedDate'] = catalogue.loc[dash_ranges,'Date'].apply(lambda x: x.split('-')[0])

# integrity improvements printout
print(f'YYYY-YYYY & YYYY-YY formatting: completeness of Date improved {len(dash_ranges)/(len(catalogue)):.0%}')
print(f'remaining: {len(catalogue[catalogue.cleanedDate==0])/len(catalogue):.0%}')

YYYYYYYY & YYYYYY formatting: completeness of Date improved 6%
YYYY-YYYY & YYYY-YY formatting: completeness of Date improved 14%
remaining: 5%


In [56]:
# cleaning date ranges yyyyyyyyyy
multirange = catalogue[catalogue.Date.str.contains(r'\d{10}', regex=True)].index # ten consecutive digits
# stashing [6:]
catalogue.loc[multirange,'Date2'] = catalogue.loc[multirange,'Date'].apply(lambda x: str(x)[6:])
# keeping [0:4]
catalogue.loc[multirange,'cleanedDate'] = catalogue.loc[multirange,'Date'].apply(lambda x: str(x)[0:4])
# discarding remainder [4:6]

print(f'YYYYYYYYYY formatting: completeness of Date improved {len(multirange)/(len(catalogue)):.0%}')
print(f'total improvements: {(len(catalogue[catalogue.cleanedDate!=0])-len(clean_dates))/(len(catalogue)):.0%}')

YYYYYYYYYY formatting: completeness of Date improved 2%
total improvements: 21%


In [58]:
daterange = catalogue[(catalogue.Date.str.len()==7)&(catalogue.Date.str.contains('-')==True)].index
catalogue.loc[daterange,'cleanedDate'] = [i.split('-')[0] for i in catalogue.loc[daterange,'Date']]
catalogue.loc[daterange,'Date2'] = [i.split('-')[1] for i in catalogue.loc[daterange,'Date']]
print(f'completeness of Date improved {len(daterange)/(len(catalogue)):.0%}')
print(f'total improvements: {(len(catalogue[catalogue.cleanedDate!=0])-len(clean_dates))/(len(catalogue)):.0%}')
print(f'remaining: {len(catalogue[catalogue.cleanedDate==0])/len(catalogue):.0%}')

completeness of Date improved 6%
total improvements: 21%
remaining: 4%


In [61]:
catalogue[(catalogue.cleanedDate==0)].Date.value_counts()[2:40]

2015-            41
51877-221894     30
1910-111912      27
1958-641964      25
19641965-66      25
18-231966        25
11970            21
19501949-50      21
31916            20
19721971-1972    20
19271925-1927    18
-1991            17
71925            17
91907            16
61966-251967     15
17-191969        15
1923-241925      14
19611965-66      14
13-191970        12
1978-791995      11
-1925            11
1914-211925      10
1999-             9
8-131970          9
19171907-08       9
2-101969          9
10-151970         9
21-251963         9
31915             8
-1914             8
29-241967         8
19581965-66       8
11886             8
92000             7
19221920-21       7
1973-19741973     7
19621965-66       7
20-291967         7
Name: Date, dtype: int64

### Determining opportunities to impute Date <a class="anchor" id="date-imputation"></a>

## Extracting From Biographies <a class="anchor" id="bio-extraction"></a>
### [Extracting birthplace](#birthplace), if listed
### [Imputing nationalities](#impute-nationalities)
### [Creating `living` flag](#living)

### Extracting Birthplace <a class="anchor" id="birthplace"></a>
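
One way to pull a birthplace out of a biography string, shown on single values. It assumes bios contain phrases such as `born Germany` and reuses the pattern from the cell below. `extract_birthplace` and the sample bios are made up for illustration, and keeping every distinct match (comma-separated) is a choice made for this sketch.

```python
import re

# The word that follows "born", optionally after a year, e.g. "born Germany"
BORN_PATTERN = re.compile(r'born\s\d{0,}\s{0,1}([A-Za-z]+)')

def extract_birthplace(bio: str) -> str:
    """Return the place(s) named after 'born', or 'N/A' if none is found."""
    matches = sorted(set(BORN_PATTERN.findall(bio)))
    if not matches:
        return 'N/A'
    # Keep every distinct candidate so ambiguous bios stay reviewable
    return ', '.join(matches)

for bio in ['American, born Germany. 1886-1969', 'French, 1881-1955', 'American, born 1950']:
    print(bio, '->', extract_birthplace(bio))
```

The pattern captures a single word, so a multi-word place such as `born New York` would come back as `New`.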

In [None]:
bio2ref = catalogue[catalogue.ArtistBio.str.contains(',')==True].index

catalogue.loc[:,'NationalityBio'] = catalogue.loc[:,'ArtistBio'].apply(lambda x: x.split(',')[0])
catalogue.loc[bio2ref,'Bio2'] = catalogue.loc[bio2ref,'ArtistBio'].apply(lambda x: x.split(',')[1])

In [None]:
catalogue[catalogue.Bio2.str.contains('born')==True].ArtistBio.value_counts()

In [None]:
bp_ref = catalogue[catalogue.ArtistBio.str.contains('born')==True].index

for i in bp_ref:
    # distinct words that follow "born" (optionally after a year) in this bio
    output = sorted(set(re.findall(r'born\s\d{0,}\s{0,1}([A-Za-z]+)', catalogue.loc[i,'ArtistBio'])))

    if len(output) == 0:
        catalogue.loc[i,'Birthplace'] = 'N/A'
    else:
        # one match is stored as-is; with several, take the first of the sorted
        # matches so the result is reproducible (set order is not)
        catalogue.loc[i,'Birthplace'] = output[0]

catalogue['Birthplace'] = catalogue['Birthplace'].fillna('N/A')

In [None]:
[' '.join(i) for i in output][0]

In [None]:
catalogue.Birthplace.value_counts()[20:40]

In [None]:
# Answer
primes = [] # Set a list to catch prime values

for i in range(2, 2000):  # start at 2, the first prime; starting at 3 lets 4 into the list
    # All statement evaluates to true if all of the iterables satisfy the criteria
    # If i divided by the existing primes(x) never has a remainder of 0
    if all(i % x != 0 for x in primes):
        # Append this number to the primes list
        primes.append(i)

sum(primes)

In [None]:
catalogue[catalogue.Birthplace=="['Uruguay', 'American', 'Argentine']"]

In [None]:
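# `== None` never matches in pandas; nulls were also filled with '-' earlier
catalogue[catalogue.Nationality.isnull() | (catalogue.Nationality=='-')]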

In [None]:
catalogue['Nationality'].unique()

In [None]:
catalogue['Bio2'].unique()[0:100]

In [None]:
catalogue[catalogue.Birthplace.str.contains(r'\[')==True].Birthplace.unique()

### Impute Nationalities <a class="anchor" id="impute-nationalities"></a>
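
A possible imputation rule, sketched on single values. It assumes the first comma-separated token of `ArtistBio` is a nationality (as `NationalityBio` does above) and that `-` marks a missing value after the earlier `fillna('-')`. `impute_nationality` is a hypothetical helper, and rejecting tokens that are not purely alphabetic is a guess at a reasonable safeguard.

```python
def impute_nationality(nationality: str, bio: str, missing: str = '-') -> str:
    """Keep an existing nationality; otherwise fall back to the bio's first token."""
    if nationality and nationality != missing:
        return nationality
    if not bio or bio == missing:
        return missing
    candidate = bio.split(',')[0].strip()
    # Accept only alphabetic tokens, so "1881-1955" or "born 1950" is not imputed
    if candidate.replace(' ', '').isalpha():
        return candidate
    return missing

# Made-up examples
print(impute_nationality('-', 'American, born Germany. 1886-1969'))
print(impute_nationality('French', 'American, 1900-1980'))
print(impute_nationality('-', 'born 1950'))
```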

In [None]:
# checking mismatches
len(catalogue[catalogue.Nationality!=catalogue.NationalityBio])

In [None]:
# pulling top 10 donors, by volume
top_donors = catalogue.CreditLine.value_counts()[0:10]
for donor, n_items in top_donors.items():
    print('{:,} items from {}'.format(n_items, donor))

In [None]:
# pulling remaining, most common formatting issues
catalogue[catalogue.cleanedDate==0].Date.value_counts()[0:10]

In [None]:
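# `Date` holds strings, so `Date==0` never matches; unparsed rows are marked by cleanedDate==0
catalogue[(catalogue.cleanedDate==0)&(catalogue.BeginDate.str.len()==4)]

In [None]:
catalogue[(catalogue.cleanedDate==0)|(catalogue.Date.isnull())]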

In [None]:
# for i in catalogue.loc[ref, 'Date'].index:
#     f = catalogue.loc[i,'Date'][:4]
#     s = catalogue.loc[i,'Date'][4:]
    
#     catalogue.loc[i,'Date'] = f
#     catalogue.loc[i,'Date2'] = s

### Impute Deceased Year & `living` flag <a class="anchor" id="living"></a>

## Title EDA & Imputation <a class='anchor' id='title-eda'></a>

# Graveyard <a class='anchor' id='graveyard'></a>
## Turn back now!

[Evaluating `nulls`](#nulls) is no longer relevant, as the dataset was limited to single-artist artworks. It is kept in the graveyard for potential future use.

In [None]:
null_summary = pd.DataFrame(round(catalogue.isnull().sum()/len(catalogue),2), columns=['pct'])
null_summary = null_summary.reset_index(drop=False).rename(columns={'index':'Column'})
# [print(f"| {null_summary.loc[i,'Column']} | {null_summary.loc[i,'pct']} |") for i in null_summary[null_summary.pct==0].index]
# [print(f"| {null_summary.loc[i,'Column']} | {null_summary.loc[i,'pct']} |") for i in null_summary[null_summary.pct>0.1].index];
# [print(f"| {null_summary.loc[i,'Column']} | {null_summary.loc[i,'pct']} |") for i in null_summary[(null_summary.pct<=0.1)&(null_summary.pct!=0)].index];

### Evaluating `nulls` <a class='anchor' id='nulls'></a>
#### Complete Columns
- Artist (intervention)
- Title (further explored in [Title EDA](#title-eda))
- AccessionNumber (internal)
- Classification (internal)
- Department (internal)
- ObjectID (internal)
- Cataloged (will be converted to bool)

#### Looking at those columns with high frequency of nulls (>10%)...
#### _Items to remove_
| Column | % Null | Notes |
| --- | --- | --- |
| URL | 0.33 | Not needed |
| ThumbnailURL | 0.40 | Not needed |
| Weight (kg) | 1.0 | Not needed |
| Seat Height (cm) | 1.0 | Not needed |
#### _Remaining_
| Column | % Null | Notes |
| --- | --- | --- |
| Circumference (cm) | 1.0 | Likely related to `Medium` |
| Depth (cm) | 0.89 | Likely related to `Medium` |
| Diameter (cm) | 0.99 | Likely related to `Medium` |
| Height (cm) | 0.12 | Possible imputation w/`Dimensions` |
| Length (cm) | 0.99 | Possible imputation w/`Dimensions` |
| Width (cm) | 0.13 | Possible imputation w/`Dimensions` |
| Duration (sec.) | 0.99 | Dependent on `Medium` |
#### Looking at those columns with low frequency nulls (<10%)...
#### _Linked to `null` Artist value_
| Column | % Null | Notes |
| --- | --- | --- |
| Artist | 0.01 | Removed. Represented .008 / 0.8% of all records. |
| ConstituentID | 0.01 | Dropped w/null artist |
| Nationality | 0.01 | Dropped w/null artist |
| BeginDate | 0.01 | Dropped w/null artist |
| EndDate | 0.01 | Dropped w/null artist |
| Gender | 0.01 | Dropped w/null artist |
#### _Not linked to `null` Artist value_
| Column | % Null | Notes |
| --- | --- | --- |
| ArtistBio | 0.03 | Will investigate further at later stage |
| CreditLine | 0.01 | `null` may have significance/meaning (acquisition) |
| Date | 0.01 | Will investigate overlap w/nulls in `BeginDate` and `EndDate` |
| DateAcquired | 0.05 | Internal, will follow up as possibly useful for further analysis |
| Dimensions | 0.06 | Possibly linked to `Medium` type |
| Medium | 0.07 | Will investigate further at later stage |

#### Deduping within rows <a class="anchor" id="deduping-within"></a>
_Many of our features are formatted in different ways, with duplicate values within, like so:_
- Gender: `(Male) (Male) (Male) (Male) (Male) (Male) (Male) (Male) (Male) (Female) (Male) (Male) ()`
- Nationality: `(Spanish) (Cuban) (Spanish) (Spanish)`

`strip_punct` handled step 1. Results:
- Gender: `Male Female`
- Nationality: `Spanish Cuban`

The last step is to dedupe within values.
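
A sketch of both steps on shortened versions of the examples above. `strip_chars` and `dedupe_tokens` are hypothetical stand-ins written for this illustration; they are not the `strip_punct` and `dedupe` helpers from `cleaners.moma`, whose implementations are not shown here.

```python
import re

def strip_chars(value: str, pattern: str) -> str:
    """Step 1: remove every character matched by the regex pattern."""
    return re.sub(pattern, '', value)

def dedupe_tokens(value: str) -> str:
    """Step 2: drop repeated whitespace-separated tokens, keeping first-seen order."""
    return ' '.join(dict.fromkeys(value.split()))

gender = strip_chars('(Male) (Male) (Female) (Male) ()', r'[^A-Za-z\-\s]')
nationality = strip_chars('(Spanish) (Cuban) (Spanish) (Spanish)', r'[^A-Za-z0-9,\-\s]')

print(dedupe_tokens(gender))       # Male Female
print(dedupe_tokens(nationality))  # Spanish Cuban
```

Splitting on whitespace treats a multi-word value as separate tokens, so this approach only suits single-word categories.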

In [None]:
standard = ['Male','Female']
nonstandard = ~catalogue['Gender'].isin(standard)
has_comma = catalogue['Artist'].str.contains(',', na=False)
len(catalogue[nonstandard & has_comma]), len(catalogue[nonstandard & ~has_comma])

In [None]:
catalogue[(catalogue['Gender']!='Female')& (catalogue['Gender']!='Male')&(catalogue['Artist'].str.contains(',')==False)]

In [None]:
# gender requires deduping within row values
catalogue['Gender'] = catalogue['Gender'].apply(lambda x: dedupe(x))

Parsing nationalities

In [None]:
# split on ") (", drop the outer parentheses, then dedupe the pieces
pieces = [i.replace(')','').replace('(','') for i in re.split(r'\)\s\(',"(American) (American) (French)") if len(i) > 0]
output = ' '.join(sorted(set(pieces)))
print(output)


In [None]:
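def parse_multi_nat(value: str) -> str:
    # split on ") (", then strip parentheses and surrounding whitespace from each piece
    pieces = [i.replace(')','').replace('(','').strip() for i in re.split(r'\)\s\(', value)]
    # dict.fromkeys dedupes while keeping first-seen order; empty pieces are dropped
    return ' '.join(dict.fromkeys(p for p in pieces if p))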

In [None]:
# # testing
# parse_multi_nat('(American) (American) (Brazilian) (French) () (American)')
# parse_multi_nat('(American) (American) () (American)')
parse_multi_nat('(German) (Swedish) (German)	')

In [None]:
multi_nationality = catalogue[catalogue.Nationality.str.contains(r'\) ')].index
catalogue.loc[multi_nationality,'cleanedNationality'] = catalogue.loc[multi_nationality,'Nationality'].apply(parse_multi_nat)

In [None]:
clean_nats = catalogue[catalogue.cleanedNationality.isnull()].index
catalogue.loc[clean_nats,'cleanedNationality'] = catalogue.loc[clean_nats,'Nationality']
# catalogue.loc[:,'cleanedNationality'] = catalogue.loc[:,'cleanedNationality'].str.replace(' Nationality unkown ','')

In [None]:
catalogue.cleanedNationality.unique()

In [None]:
# # improved method
# #processing dates formatted YYYYYYYY
# nodash = catalogue[(catalogue.Date.str.len()==8)&(catalogue.Date.str.contains('-')==False)].index
# catalogue.loc[nodash,'cleanedDate'] = [int(i[0:4]) for i in catalogue.loc[nodash,'Date']]
# catalogue.loc[nodash,'Date2'] = [int(i[4:]) for i in catalogue.loc[nodash,'Date']]

# print(f'completeness of Date improved {len(nodash)/(len(catalogue)):.0%}')
# print(f'total improvements: {(len(catalogue[catalogue.cleanedDate!=0])-len(clean_dates))/(len(catalogue)):.0%}')
