# Data Merge & Cleaning
In this notebook, I merge all of the data sources and check what is missing.
I also use this section to work out what preprocessing is needed and how to extract the text information.

In [2]:
import pandas as pd
import numpy as np

import pickle

In [3]:
ls PKL

 Volume in drive D has no label.
 Volume Serial Number is 7684-49A2

 Directory of D:\Projects\art_title_generator\PKL

09/16/2020  04:14 PM    <DIR>          .
09/16/2020  04:14 PM    <DIR>          ..
09/16/2020  08:12 AM         4,044,325 raw_data_Harvard_1.pkl
09/16/2020  08:08 AM         1,099,606 raw_data_RISD_1.pkl
09/16/2020  11:41 AM         1,766,377 raw_data_RISD_2.pkl
               3 File(s)      6,910,308 bytes
               2 Dir(s)  581,283,987,456 bytes free


In [5]:
harvard = pd.read_pickle('PKL/raw_data_Harvard_1.pkl')

In [7]:
risd1 = pd.read_pickle('PKL/raw_data_RISD_1.pkl')

In [8]:
risd2 = pd.read_pickle('PKL/raw_data_RISD_2.pkl')

## Harvard Data
Let's look at the Harvard data first.

### Missing Image
Drop every row that is missing an image URL.

In [107]:
harvard = harvard.dropna(subset = ['primaryimageurl'])
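
As a quick illustration of this step, here is a minimal sketch on a made-up frame (the `toy` data and its values are hypothetical, not taken from the Harvard records). It counts missing values per column and then drops only the rows without an image URL, leaving rows with other gaps in place.

```python
import pandas as pd

# Toy frame standing in for the real records; all values are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "primaryimageurl": ["https://example.org/a.jpg", None, "https://example.org/c.jpg"],
    "title": ["A", "B", None],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Drop rows only when the image URL is missing; a missing title is tolerated.
with_image = toy.dropna(subset=["primaryimageurl"])
print(missing_counts)
print(with_image)
```

Passing `subset` restricts the null check to the listed columns, so row 3 survives even though its title is missing.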

In [29]:
# keep only the columns needed downstream; everything else is dropped
cols = ['id', 'period', 'images', 'worktypes', 'accessionyear', 'classification',
        'primaryimageurl', 'style', 'commentary', 'technique', 'description', 'medium',
        'title', 'colors', 'provenance', 'dated', 'department', 'dateend', 'people', 'url',
        'century', 'labeltext', 'datebegin', 'culture']

In [30]:
harvard = harvard[cols]

In [109]:
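# keys to pull out of each nested column
imageinfo = ['description', 'technique', 'alttext', 'publiccaption']  # from 'images'
worktypesinfo = 'worktype'  # from 'worktypes'
colorinfo = 'hue'  # from 'colors'
peopleinfo = ['displayname']  # from 'people'; role must be Artist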

The `images`, `worktypes`, `colors`, and `people` columns each hold a list of dictionaries, so the information has to be extracted from those nested entries. A `worktypes` value looks like this:

In [None]:
[{'worktypeid': '9', 'worktype': 'album leaf'}, {'worktypeid': '249', 'worktype': 'painting'}]
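
The `people` column carries the extra condition noted above: only entries whose role is Artist should contribute a name. A minimal sketch of that filter follows. The helper name `artist_names`, the `role` key, and the sample entries are assumptions for illustration; only `displayname` and the Artist condition come from the notes above.

```python
def artist_names(people):
    """Return the display names of entries whose role is 'Artist', or None.

    Assumes each entry is a dict with 'role' and 'displayname' keys.
    """
    if not isinstance(people, list):
        return None
    names = []
    for person in people:
        if person.get("role") == "Artist":
            name = person.get("displayname")
            if name is not None and name not in names:
                names.append(name)
    return ", ".join(names) if names else None


# Hypothetical entries; real records may carry more keys.
sample_people = [
    {"role": "Artist", "displayname": "Artist A"},
    {"role": "Previous owner", "displayname": "Collector B"},
]
artist_result = artist_names(sample_people)
print(artist_result)
```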


In [234]:
def extract_info(x, name):
    """Collect the unique, non-null values stored under `name` in a list of dicts.

    Returns the single value, a comma-separated string of several values,
    or None when `x` is not a list or holds no usable value.
    """
    if not isinstance(x, list):
        return None
    inst = []
    for entry in x:
        val = entry[name]
        if val is not None and val not in inst:
            inst.append(val)
    if not inst:
        return None
    if len(inst) == 1:
        return inst[0]
    return ', '.join(inst)
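
To make the behavior concrete, the sketch below applies the same logic to the `worktypes` sample shown earlier. The function is reproduced in compact form only so the snippet runs on its own; the missing-value case uses `float('nan')` as a stand-in for an empty cell, which is an assumption about how gaps appear in the column.

```python
def extract_info(x, name):
    # Compact equivalent of the function above, repeated for a standalone run.
    if not isinstance(x, list):
        return None
    inst = []
    for entry in x:
        val = entry[name]
        if val is not None and val not in inst:
            inst.append(val)
    if not inst:
        return None
    return inst[0] if len(inst) == 1 else ', '.join(inst)


sample = [{'worktypeid': '9', 'worktype': 'album leaf'},
          {'worktypeid': '249', 'worktype': 'painting'}]

joined = extract_info(sample, 'worktype')        # several distinct values are joined
single = extract_info(sample[1:], 'worktype')    # one entry returns the bare value
missing = extract_info(float('nan'), 'worktype') # non-list input returns None
print(joined, single, missing)
```

Values are joined in the order they appear in the list, which is why the output column contains both "album leaf, painting" and "painting, album leaf".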

In [235]:
for item in imageinfo: 
    harvard[f'img_{item}'] = harvard['images'].apply(lambda x: extract_info(x, item))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harvard[f'img_{item}'] = harvard['images'].apply(lambda x: extract_info(x, item))


In [236]:
harvard['worktype'] = harvard['worktypes'].apply(lambda x: extract_info(x, 'worktype'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harvard['worktype'] = harvard['worktypes'].apply(lambda x: extract_info(x, 'worktype'))
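
The message above is pandas' warning about assigning a column to a frame that was derived from another frame by filtering or column selection. One common way to avoid it is to take an explicit `.copy()` at the point where the subset is created. The sketch below shows this on invented data; the `raw` and `subset` names and their values are hypothetical.

```python
import pandas as pd

# Invented stand-in for the raw records.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "primaryimageurl": ["u1", None, "u3"],
    "worktypes": [[{"worktype": "painting"}], None, [{"worktype": "album leaf"}]],
})

# .copy() makes the filtered frame an independent object, so assigning a
# new column to it is unambiguous and does not touch `raw`.
subset = raw.dropna(subset=["primaryimageurl"]).copy()
subset["worktype"] = subset["worktypes"].apply(
    lambda x: x[0]["worktype"] if isinstance(x, list) else None
)
print(subset)
```

In this notebook the equivalent change would be to append `.copy()` to the `dropna` and column-selection steps before adding the new columns.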


In [237]:
harvard

Unnamed: 0,id,period,images,worktypes,accessionyear,classification,primaryimageurl,style,commentary,technique,...,url,century,labeltext,datebegin,culture,img_description,img_technique,img_alttext,img_publiccaption,worktype
0,47769,"Qing dynasty, 1644-1911","[{'date': '2005-04-25', 'copyright': 'Presiden...","[{'worktypeid': '9', 'worktype': 'album leaf'}...",1985.0,Paintings,https://nrs.harvard.edu/urn-3:HUAM:INV004865_d...,,,,...,https://www.harvardartmuseums.org/collections/...,17th century,,1644,Chinese,,,,,"album leaf, painting"
1,47969,"Qing dynasty, 1644-1911","[{'date': '2005-04-25', 'copyright': 'Presiden...","[{'worktypeid': '249', 'worktype': 'painting'}...",1985.0,Paintings,https://nrs.harvard.edu/urn-3:HUAM:INV004859_d...,,,,...,https://www.harvardartmuseums.org/collections/...,17th century,,1644,Chinese,,,,,"painting, album leaf"
2,48085,"Qing dynasty, 1644-1911","[{'date': '2005-04-26', 'copyright': 'Presiden...","[{'worktypeid': '9', 'worktype': 'album leaf'}...",1985.0,Paintings,https://nrs.harvard.edu/urn-3:HUAM:INV004947_d...,,,,...,https://www.harvardartmuseums.org/collections/...,,,0,Chinese,,,,,"album leaf, painting"
3,48126,"Qing dynasty, 1644-1911","[{'date': '2016-10-17', 'copyright': 'Presiden...","[{'worktypeid': '249', 'worktype': 'painting'}...",1985.0,Paintings,https://nrs.harvard.edu/urn-3:HUAM:761725,,,,...,https://www.harvardartmuseums.org/collections/...,19th century,,1800,Chinese,,Make:Hasselblad;Model:Hasselblad H5D-50c MS;Or...,,,"painting, album leaf"
4,48128,"Qing dynasty, 1644-1911","[{'date': '2016-10-17', 'copyright': 'Presiden...","[{'worktypeid': '9', 'worktype': 'album leaf'}...",1985.0,Paintings,https://nrs.harvard.edu/urn-3:HUAM:761710,,,,...,https://www.harvardartmuseums.org/collections/...,19th century,,1800,Chinese,,Make:Hasselblad;Model:Hasselblad H5D-50c MS;Or...,,,"album leaf, painting"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1323,330479,,"[{'date': '2009-09-22', 'copyright': None, 'im...","[{'worktypeid': '249', 'worktype': 'painting'}...",2018.0,Paintings,https://nrs.harvard.edu/urn-3:huam:LEG005443_d...,,,,...,https://www.harvardartmuseums.org/collections/...,20th century,,1967,Chinese,,,,,"painting, wall scroll"
1324,331245,,"[{'date': '2012-06-11', 'copyright': 'Presiden...","[{'worktypeid': '249', 'worktype': 'painting'}]",2009.0,Paintings,https://nrs.harvard.edu/urn-3:HUAM:DDC230662_d...,,,,...,https://www.harvardartmuseums.org/collections/...,17th century,,1635,"Italian, Venetian",,,,,painting
1327,332872,,"[{'date': '2014-08-25', 'copyright': 'Copyrigh...","[{'worktypeid': '249', 'worktype': 'painting'}]",2009.0,Paintings,https://nrs.harvard.edu/urn-3:HUAM:LEG253387,,,,...,https://www.harvardartmuseums.org/collections/...,21st century,,0,American,,,,,painting
1328,336321,,"[{'date': '2011-08-22', 'copyright': None, 'im...","[{'worktypeid': '249', 'worktype': 'painting'}]",2011.0,Paintings,https://nrs.harvard.edu/urn-3:huam:LEG009311_d...,,,,...,https://www.harvardartmuseums.org/collections/...,18th century,,1775,"Italian, Emilian, Bolognese",,,,,painting
