## Data set

We will use two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture. Make sure that you have downloaded both repositories.

#### Install `git-lfs` on Macs 

In [None]:
!brew install git-lfs

Updating Homebrew...


#### Install `git-lfs` on WSL Ubuntu

In [None]:
!sudo apt-get install software-properties-common

In [None]:
!sudo add-apt-repository ppa:git-core/ppa

In [None]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

In [None]:
!sudo apt-get install git-lfs

#### Install lfs (both systems)

In [6]:
!git lfs install

Updated git hooks.
Git LFS initialized.


#### Clone the collection repo

In [8]:
!rm -rf ./data/MoMA_collection/

In [9]:
!git clone https://github.com/MuseumofModernArt/collection.git ./data/MoMA_collection

Cloning into './data/MoMA_collection'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 334 (delta 3), reused 24 (delta 0), pack-reused 306
Receiving objects: 100% (334/334), 36.84 MiB | 2.94 MiB/s, done.
Resolving deltas: 100% (75/75), done.


In [11]:
!ls -al ./data/MoMA_collection/

total 348392
drwxr-xr-x@  9 bn8210wy  WINONA\Domain Users        288 Mar  4 16:43 .
drwxr-xr-x@ 30 bn8210wy  WINONA\Domain Users        960 Mar  4 16:42 ..
drwxr-xr-x@ 13 bn8210wy  WINONA\Domain Users        416 Mar  4 16:43 .git
-rw-r--r--@  1 bn8210wy  WINONA\Domain Users         85 Mar  4 16:42 .gitattributes
-rw-r--r--@  1 bn8210wy  WINONA\Domain Users    1034713 Mar  4 16:42 Artists.csv
-rw-r--r--@  1 bn8210wy  WINONA\Domain Users    3567550 Mar  4 16:42 Artists.json
-rw-r--r--@  1 bn8210wy  WINONA\Domain Users   56801077 Mar  4 16:43 Artworks.csv
-rw-r--r--@  1 bn8210wy  WINONA\Domain Users  116953711 Mar  4 16:43 Artworks.json
-rw-r--r--@  1 bn8210wy  WINONA\Domain Users       4358 Mar  4 16:42 README.md
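
The file sizes above suggest the real data was downloaded. As a hedged sketch, the check below looks for files that are still Git LFS pointer stubs, which is what a clone leaves behind when `git-lfs` is not set up. It assumes the large files in this repository are tracked with Git LFS (the notebook installs `git-lfs` but does not say so explicitly). The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical, not part of any library.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer rather than real data."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(directory, pattern="*.csv"):
    """List files in `directory` matching `pattern` that are still LFS pointers."""
    return sorted(p for p in Path(directory).glob(pattern) if is_lfs_pointer(p))
```

Calling `find_lfs_pointers("./data/MoMA_collection")` should return an empty list when the CSV files were fetched in full. Any path it does return is a pointer stub, so re-run the `git lfs install` step and clone again.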


#### Clone the exhibitions repo

In [5]:
!git clone https://github.com/MuseumofModernArt/exhibitions.git ./data/MoMA_exhibitions

fatal: destination path './data/MoMA_exhibitions' already exists and is not an empty directory.
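
The `fatal` message above only means the repository was already cloned on an earlier run. One way to make the cell safe to re-run is to skip the clone when the destination already holds files. This is a minimal sketch: `clone_if_missing` is a hypothetical helper, and it assumes `git` is on the `PATH`.

```python
import subprocess
from pathlib import Path


def clone_if_missing(repo_url, destination):
    """Clone `repo_url` into `destination` unless it already holds files.

    Returns True if a clone was run, False if it was skipped.
    """
    dest = Path(destination)
    if dest.is_dir() and any(dest.iterdir()):
        return False  # non-empty directory: git would refuse to clone into it
    dest.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
    return True
```

For example, `clone_if_missing("https://github.com/MuseumofModernArt/exhibitions.git", "./data/MoMA_exhibitions")` returns `False` without touching the network when the directory is already populated. Unlike the `rm -rf` approach used for the collection repo, this never deletes existing data.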


## The exhibition file gives encoding errors by default

In [45]:
exhibitions = pd.read_csv('./data/MoMA_exhibitions/MoMAExhibitions1929to1989.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte

## Switching encodings fixes the problem

* See [this Stack Overflow question](https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python)
* More details on [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
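
To see why the default fails, here is a small illustration that does not depend on the MoMA file. In ISO-8859-1, `é` is the single byte `0xe9`. UTF-8 treats `0xe9` as the start of a multi-byte sequence, so the plain ASCII letter that follows it is rejected as an invalid continuation byte. The variable names are only for illustration.

```python
# "é" is one byte (0xe9) in ISO-8859-1, so this is b'C\xe9zanne'.
raw = "Cézanne".encode("ISO-8859-1")

message = None
try:
    raw.decode("utf-8")  # 0xe9 announces a multi-byte sequence, but "z" follows
except UnicodeDecodeError as err:
    message = str(err)   # the same kind of complaint pandas reported

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("ISO-8859-1")
print(message)
print(decoded)
```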

In [47]:
dat_cols = ['ExhibitionBeginDate', 'ExhibitionEndDate']
exhibitions = pd.read_csv('./data/MoMA_exhibitions/MoMAExhibitions1929to1989.csv', 
                          encoding="ISO-8859-1",
                          parse_dates=dat_cols)
exhibitions.head(2)

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,ConstituentID,ConstituentType,DisplayName,AlphaSort,FirstName,MiddleName,LastName,Suffix,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Curator,Director,9168.0,Individual,"Alfred H. Barr, Jr.",Barr Alfred H. Jr.,Alfred,H.,Barr,Jr.,,American,1902.0,1981.0,"American, 19021981",Male,109252853.0,Q711362,500241556.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,1053.0,Individual,Paul Cézanne,Cézanne Paul,Paul,,Cézanne,,,French,1839.0,1906.0,"French, 18391906",Male,39374836.0,Q35548,500004793.0,moma.org/artists/1053


In [48]:
artists = pd.read_csv("./data/MoMA_collection/Artists.csv")
artists.head(2)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,


In [49]:
artists_schema = get_spark_types(artists, keys=['ConstituentID'])

artists_spark = spark.createDataFrame(artists, schema=artists_schema)
(artists_spark
 .take(5)) >> to_pandas

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500028000.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


In [50]:
from more_dfply import fix_names
artwork = (pd.read_csv("./data/MoMA_collection/Artworks.csv")
           >> fix_names
           >> mutate(id = X.index + 1)
          )
artwork.head(2)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,ThumbnailURL,Circumference_cm,Depth_cm,Diameter_cm,Height_cm,Length_cm,Weight_kg,Width_cm,Seat_Height_cm,Duration_sec,id
0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,"19 1/8 x 66 1/2"" (48.6 x 168.9 cm)",Fractional and promised gift of Jo Carole and ...,885.1996,Architecture,Architecture & Design,1996-04-09,Y,2,http://www.moma.org/collection/works/2,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,,1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,"16 x 11 3/4"" (40.6 x 29.8 cm)",Gift of the architect in honor of Lily Auchinc...,1.1995,Architecture,Architecture & Design,1995-01-17,Y,3,http://www.moma.org/collection/works/3,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,,2


In [7]:
artwork_schema = get_spark_types(artwork, keys=['id'])

artwork_spark = spark.createDataFrame(artwork, schema=artwork_schema)
(artwork_spark
 .take(2)) >> to_pandas

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference_cm,Depth_cm,Diameter_cm,Height_cm,Length_cm,Weight_kg,Width_cm,Seat_Height_cm,Duration_sec,id
0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,,,,48.599998,,,168.899994,,,1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,,,,40.640099,,,29.8451,,,2
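
Note that `Height_cm` and `Width_cm` come back from Spark as `48.599998` and `168.899994` rather than `48.6` and `168.9`. A likely explanation, stated here as an assumption because the notebook does not show what `get_spark_types` produces, is that these columns were mapped to a 32-bit float type. The sketch below reproduces the same rounding with only the standard library. `as_float32` is a hypothetical helper.

```python
import struct


def as_float32(value):
    """Round-trip a Python float through 32-bit storage and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


for original in (48.6, 168.9):
    stored = as_float32(original)
    # Python floats are 64-bit, so the 32-bit rounding error becomes visible.
    print(f"{original!r} -> {stored:.6f}")
```

If exact agreement with the pandas values matters, a 64-bit type such as Spark's `DoubleType` would avoid this loss of precision.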
