# Data Collection

Here I collect data from the RISD Museum Collection API.  
Documentation can be found here: https://risdmuseum.org/art-design/projects-publications/articles/risd-museum-collection-api

## Import data

In [4]:
import pandas as pd
import numpy as np
import requests
import json
import nltk

%matplotlib inline

In [9]:
# The API returns at most 25 items per request.
# The collection holds about 3,900 works matching the search term 'painting',
# so we request about 156 pages (3900 / 25).

url = "https://risdmuseum.org/api/v1/collection" #RISD Museum collection

db = []

for i in range(156):
    resp = requests.get(url, params={'search_api_fulltext': 'painting',
                                     'items_per_page': 25,
                                     'page': i})
    resp.raise_for_status()  # stop on an HTTP error instead of parsing an error body
    db.extend(resp.json())
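
The loop above hardcodes 156 pages, which only holds while the collection stays near 3,900 matches. As a minimal sketch of an alternative, the paging logic can stop as soon as a page comes back short. The names `fetch_all`, `fetch_page`, and `fake_fetch_page` are hypothetical, and the stopping rule assumes the API returns a short or empty list past the last page, which the documentation quoted here does not confirm. The fake fetcher stands in for the HTTP call so the example runs offline.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 25  # per-request maximum stated in the notebook


def fetch_all(fetch_page: Callable[[int], List[Dict]], max_pages: int = 1000) -> List[Dict]:
    """Collect records page by page until a page is shorter than PAGE_SIZE."""
    records: List[Dict] = []
    for page in range(max_pages):
        batch = fetch_page(page)
        records.extend(batch)
        if len(batch) < PAGE_SIZE:  # short or empty page: assumed to be the last one
            break
    return records


def fake_fetch_page(page: int) -> List[Dict]:
    """Stand-in for the HTTP call: serves 60 fake records in pages of PAGE_SIZE."""
    data = [{"id": n} for n in range(60)]
    return data[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]


works = fetch_all(fake_fetch_page)
print(len(works))  # 60
```

With the real API, `fetch_page` could wrap the `requests.get(...).json()` call from the cell above, passing the page number through the `page` parameter.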


In [30]:
df = pd.DataFrame(db)
len(df)

3895

Since we are interested in descriptions, we will drop data without descriptions.

In [33]:
# drop items without descriptions
df = df[df.description != ""]
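
The comparison above only removes descriptions that are exactly the empty string. As a hedged sketch, the raw records in `db` could instead be filtered before building the dataframe, which also catches missing, `None`, or whitespace-only descriptions. The helper `has_description` and the sample records are made up for illustration; whether the API ever returns such values is an assumption.

```python
def has_description(record: dict) -> bool:
    """True if the record has a non-blank string under 'description'."""
    text = record.get("description")
    return isinstance(text, str) and text.strip() != ""


# Illustrative records only, not actual API output.
sample_db = [
    {"id": 1, "description": "A portrait of a seated woman."},
    {"id": 2, "description": ""},
    {"id": 3, "description": "   "},
    {"id": 4, "description": None},
    {"id": 5},
]

kept = [r for r in sample_db if has_description(r)]
print([r["id"] for r in kept])  # [1]
```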

## Data Normalizing
Lowercase each description and split it into a list of words.

In [34]:
# we will run cleaning on the copy of the original dataframe.
clean_df = df.copy() 

In [27]:
import re
#nltk.download('stopwords')  # uncomment on first run
#nltk.download('wordnet')    # needed by WordNetLemmatizer

In [35]:
def normalizing(string):
    """
    Input: string 
    Return: list of lower case keywords with special characters removed

    """
    # replace every run of non-letter characters with a space, lowercase, then split into words
    # (non-ASCII letters, such as accented characters, are replaced as well)
    return re.sub(r'[^A-Za-z]+', ' ', string).lower().split() 
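
To make the effect of the pattern concrete, here is a small worked example on a made-up sentence. It also shows why accented names end up split or truncated: `[^A-Za-z]` treats `É` as a special character. The helper `fold_accents` is hypothetical and not part of the notebook; it is one possible way to keep the base letter, sketched with the standard `unicodedata` module.

```python
import re
import unicodedata


def normalizing(string):
    """Same rule as the notebook: keep ASCII letters only, lowercase, split."""
    return re.sub(r'[^A-Za-z]+', ' ', string).lower().split()


def fold_accents(string):
    """Hypothetical helper: decompose accented letters, then drop every
    non-ASCII character (including the detached accent marks)."""
    decomposed = unicodedata.normalize('NFKD', string)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


text = "Édouard Manet's painting, 1862!"  # illustrative input
print(normalizing(text))                # ['douard', 'manet', 's', 'painting']
print(normalizing(fold_accents(text)))  # ['edouard', 'manet', 's', 'painting']
```

Note that the possessive `'s` survives as a stray `s` token in both cases.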

## Removing stopwords
Remove common words that carry little meaning (stopwords).

In [16]:
# Importing stopwords
from nltk.corpus import stopwords

In [17]:
# Stop words corpus
# We take the English list from the NLTK package and add a few single letters
sw = stopwords.words('english')
sw += ['p', 'r', 'l', 'x', 'e']

In [18]:
def remove_stop(list_):
    """
    Input: list of words
    Return: list of words excluding stopwords
    """
    return [x for x in list_ if x not in sw]
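
A quick illustration of the filter follows. To keep it runnable without downloading the NLTK corpus, `sw` here is a tiny hand-picked stand-in for the real list, which is an assumption of this sketch. It is stored as a set, since membership tests on a set take constant time on average, whereas a list is scanned element by element.

```python
# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
sw = {'the', 'of', 'a', 'in', 'and', 'is'}
sw |= {'p', 'r', 'l', 'x', 'e'}  # the extra single letters added in the notebook


def remove_stop(list_):
    """Return the words of list_ that are not in the stopword set."""
    return [x for x in list_ if x not in sw]


words = ['the', 'portrait', 'of', 'a', 'king', 'p', 'x']
print(remove_stop(words))  # ['portrait', 'king']
```

With the real corpus, the same idea would be `sw = set(stopwords.words('english'))`.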

## Stemming & Lemmatizing 
We apply Porter stemming first and then WordNet lemmatizing to each word. 
(For details see Harvard_data file)

In [23]:
wnl = nltk.WordNetLemmatizer()
porter = nltk.PorterStemmer()

def make_keywords(string):
    """
    Input: string of words
    Return: list of words, normalized, with stopwords removed, then stemmed and lemmatized
    """
    wordslist = remove_stop(normalizing(string))
    # stem each word first, then lemmatize the stem
    return [wnl.lemmatize(porter.stem(x)) for x in wordslist]


In [37]:
# finally, run on our data
clean_df.description = clean_df.description.apply(make_keywords)

In [38]:
clean_df.description

14      [subject, portrait, august, vestri, son, king,...
86      [studi, apostl, agostino, paint, last, supper,...
101     [draw, preliminari, design, bracquemond, etch,...
122     [douard, manet, complet, paint, eleg, parisian...
139     [plump, nake, child, known, putti, appear, wes...
                              ...                        
3230       [munk, csi, paint, new, york, public, librari]
3338    [pierr, narciss, gu, rin, paint, harvard, museum]
3381                [solario, paint, mu, du, louvr, pari]
3476          [crowd, turkish, men, outsid, build, paint]
3504          [raoux, paint, mu, de, beau, art, de, tour]
Name: description, Length: 69, dtype: object

In [39]:
clean_df[(clean_df.description.apply(lambda x: 'abstract' in x))]

Unnamed: 0,id,collection,credit,culture,dating,datingYearFrom,datingYearTo,description,dimensions,edition,...,publicDomain,recentAcquisition,referenceNumber,relatedObjects,state,support,technique,title,type,url
1043,980991,,Helen M. Danforth Acquisition Fund,,1890,1890,1890,"[support, artist, career, armand, guillaumin, ...",59.7 x 73 cm (23 1/2 x 28 3/4 inches),,...,True,False,,[],,[canvas],[],The Road Mender (Le Cantonnier),[Paintings],https://risdmuseum.org/art-design/collection/r...
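
One thing to keep in mind with this kind of lookup: the descriptions now hold stemmed tokens, so a search term only matches when it is identical to a stored token. The sketch below uses two token lists copied from the output above; `rows_with` is a hypothetical helper and a plain dictionary stands in for the dataframe column.

```python
# Token lists taken from the clean_df.description output above (rows 3476 and 3381).
descriptions = {
    3476: ['crowd', 'turkish', 'men', 'outsid', 'build', 'paint'],
    3381: ['solario', 'paint', 'mu', 'du', 'louvr', 'pari'],
}


def rows_with(term):
    """Return the row labels whose token list contains term exactly."""
    return [idx for idx, tokens in descriptions.items() if term in tokens]


print(rows_with('paint'))     # [3476, 3381]
print(rows_with('painting'))  # []
print(rows_with('outside'))   # []
```

One way to avoid misses like the last two would be to pass the search term through `make_keywords` first, so that the query and the stored tokens go through the same processing.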
