# Step 1 : "Objects.csv" --> "Object_2D.csv"

## 1.1: Load raw "01_0_objects.csv" in

In [104]:
import numpy as np
import pandas as pd

pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 200

df1 = pd.read_csv('01_0_objects.csv', encoding = 'utf-8', low_memory = False)

#df1 #(overview of the table)
df1.shape

(136928, 18)

Pandas allows adjusting the table display configuration:
* https://pandas.pydata.org/docs/user_guide/options.html

NGA data are published in .csv format using UTF-8 encoding  
* https://www.nga.gov/open-access-images/open-data.html
* MySQL character sets: `latin1`, UTF-8 (`utf8mb4`, `utf8mb3`)

You can get a slice of a DataFrame by using a colon `:`

* Format: `[start_index:end_index]` 
* start_index and end_index are both optional 
* `start_index` is the index of the first value (included in slice)  
* `end_index` is the index of the last value (not included in slice) (i.e. from `start` up to but not including `end`)   
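
The slicing rules above can be sketched on a toy frame. The `demo` DataFrame and its values are hypothetical stand-ins for `df1`, not data from the NGA file.

```python
import pandas as pd

# Toy frame standing in for df1 (hypothetical values).
demo = pd.DataFrame({"objectID": [10, 11, 12, 13, 14]})

first_two = demo[:2]   # start omitted: rows at positions 0 and 1
middle = demo[1:4]     # rows at positions 1, 2, 3 (position 4 is excluded)
tail = demo[3:]        # end omitted: rows from position 3 to the end
```

Plain `[]` slicing with integers works by row position, which is why `df1[100:200]` returns 100 rows.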

In [105]:
df1.head(10)
#df1[100:200]
df1.columns

Index(['objectID', 'title', 'displayDate', 'beginYear', 'endYear', 'timeSpan',
       'medium', 'dimensions', 'inscription', 'markings',
       'attributionInverted', 'attribution', 'classification', 'parentID',
       'portfolio', 'series', 'volume', 'watermarks'],
      dtype='object')

## 1.2: Gather all the paintings, drawings, prints 
**Raw Counts: Total = 109773 ==> Print = 69432, Drawing = 36098, Painting = 4243**

In [106]:
painting = df1["classification"] == "painting"
#df1[painting]
df1[painting].shape # 4243 counts of all the "paintings"
#df1[painting]

(4243, 18)

In [107]:
drawing = df1["classification"] == "drawing"
# df1[drawing]
df1[drawing].shape # counts of all the "drawings"

(36098, 18)

In [108]:
prints = df1["classification"] == "print"
# df1[prints]
df1[prints].shape # counts of all the "prints"

(69432, 18)

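### Use `.append()` (rather than `pd.concat()`) 
* to join tuples of the types: ("painting" + "drawing" + "print"), all together
* note: `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so this cell only runs on older pandas versions; `pd.concat()` is the replacement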

https://pandas.pydata.org/pandas-docs/version/0.20/merging.html#concatenating-using-append


* Full Columns: 'objectID', 'title', 'displayDate', 'beginYear', 'endYear', 'timeSpan', 'medium', 'dimensions', 'inscription', 'markings','attributionInverted', 'attribution', 'classification', 'parentID','portfolio', 'series', 'volume', 'watermarks'
* Kept Columns: 'objectID', 'title', 'displayDate', 'beginYear', 'endYear', 'timeSpan', 'medium', 'dimensions', 'attributionInverted', 'attribution', 'classification', 'parentID','portfolio', 'series', 'volume'

In [109]:
df_object_2D = df1[painting].append([df1[drawing], df1[prints]])
df_object_2D.columns
df_object_2D = df_object_2D[['objectID', 'title', 'displayDate', 'beginYear', 'endYear', 'timeSpan', 'medium', 'dimensions', 'attributionInverted', 'attribution', 'classification', 'parentID','portfolio', 'series', 'volume']]
df_object_2D.shape # (109773 x 15)
#df_object_2D #109773 distinct objectIDs

(109773, 15)
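
On pandas 2.0 and later `DataFrame.append()` no longer exists, so the stacking step needs a different call. The sketch below shows two possible equivalents on a toy frame: `pd.concat()` over the three filtered pieces, and a single `isin()` filter. The `demo` frame and its values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for df1 (hypothetical values).
demo = pd.DataFrame({
    "objectID": [1, 2, 3, 4],
    "classification": ["painting", "sculpture", "print", "drawing"],
})

# Option A: stack the three filtered frames, in the same order as the notebook.
parts = [demo[demo["classification"] == c] for c in ("painting", "drawing", "print")]
stacked = pd.concat(parts)

# Option B: one boolean mask; keeps the rows in their original file order.
mask = demo["classification"].isin(["painting", "drawing", "print"])
filtered = demo[mask]
```

Both options select the same rows; they differ only in row order, which matters if later cells rely on positional indexing.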

## 1.3: Output Paintings + Drawings + Prints as .csv

### Output dataframe to .csv file

https://sparkbyexamples.com/pandas/pandas-write-dataframe-to-csv-file  

https://stackoverflow.com/questions/16923281/writing-a-pandas-dataframe-to-csv-file

### UTF-8 Encoding Concern  
* this Jupyter notebook runs Python 3, where strings are Unicode and `pd.read_csv()` assumes UTF-8 by default
* so when reading a .csv file, the encoding does not need to be specified
* when outputting a .csv file, it is good practice to enforce UTF-8 encoding explicitly
https://stackoverflow.com/questions/36462852/how-to-read-utf-8-files-with-pandas

### NULL value outputs (UNSOLVED......)
https://stackoverflow.com/questions/50890989/pandas-changing-the-format-of-nan-values-when-saving-to-csv
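
One option that may help here is the `na_rep` parameter of `DataFrame.to_csv()`, which sets the text written for missing values. This is a minimal sketch on a toy frame; the choice of `"NULL"` as the marker is an assumption and should match whatever the downstream loader expects.

```python
import io
import pandas as pd

# Toy frame with one missing parentID (hypothetical values).
demo = pd.DataFrame({"objectID": [1, 2], "parentID": [34.0, None]})

# Write to an in-memory buffer so the result can be inspected directly.
buf = io.StringIO()
demo.to_csv(buf, index=False, na_rep="NULL")  # "NULL" is an assumed marker
csv_text = buf.getvalue()
lines = csv_text.splitlines()
```

Without `na_rep`, the missing value is written as an empty field (`2,`).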

In [110]:
df_object_2D.to_csv("01_1_objects_2D.csv", encoding = 'utf-8', index = False)
# CHECKED output .csv file is INTACT (i.e. 109773 tuples)

# Step 2: "07_object_images.csv"

## 2.1: Read in "object_images.csv"

In [111]:
df7 = pd.read_csv('07_0_objects_images_clean.csv', encoding = 'utf-8', low_memory = False)
df7.shape # 103227 digital images available (x 7 attributes)
df7.columns
df7.head(5)

Unnamed: 0,uuid,URL,thumbURL,width,height,maxpixels,objectID
0,00004dec-8300-4487-8d89-562d0126b6a1,https://api.nga.gov/iiif/00004dec-8300-4487-8d89-562d0126b6a1,"https://api.nga.gov/iiif/00004dec-8300-4487-8d89-562d0126b6a1/full/!200,200/0/default.jpg",2623,4000,640.0,11975
1,00007f61-4922-417b-8f27-893ea328206c,https://api.nga.gov/iiif/00007f61-4922-417b-8f27-893ea328206c,"https://api.nga.gov/iiif/00007f61-4922-417b-8f27-893ea328206c/full/!200,200/0/default.jpg",3365,4332,,17387
2,0000bd8c-39de-4453-b55d-5e28a9beed38,https://api.nga.gov/iiif/0000bd8c-39de-4453-b55d-5e28a9beed38,"https://api.nga.gov/iiif/0000bd8c-39de-4453-b55d-5e28a9beed38/full/!200,200/0/default.jpg",3500,4688,,19245
3,0000e5a4-7d32-4c2a-97c6-a6b571c9fd71,https://api.nga.gov/iiif/0000e5a4-7d32-4c2a-97c6-a6b571c9fd71,"https://api.nga.gov/iiif/0000e5a4-7d32-4c2a-97c6-a6b571c9fd71/full/!200,200/0/default.jpg",2252,3000,,153987
4,0001668a-dd1c-48e8-9267-b6d1697d43c8,https://api.nga.gov/iiif/0001668a-dd1c-48e8-9267-b6d1697d43c8,"https://api.nga.gov/iiif/0001668a-dd1c-48e8-9267-b6d1697d43c8/full/!200,200/0/default.jpg",3446,4448,,23830


## To use IIIF image framework
https://iiif.io/api/image/2.1/

`{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`

default image:  
https://api.nga.gov/iiif/00004dec-8300-4487-8d89-562d0126b6a1/full/!200,200/0/default.jpg  
`00004dec-8300-4487-8d89-562d0126b6a1` is the image identifier (the `uuid` column)

**Image Rotation**  
* NGA's IIIF image: currently implemented rotation angles are 0, 90, 180 and 270 degrees

**IIIF API DEMO**
* full region, scaled to fit within 210 width x 320 height (`!210,320`), 90 degree rotation, gray image quality 
    * https://api.nga.gov/iiif/00004dec-8300-4487-8d89-562d0126b6a1/full/!210,320/90/gray.jpg
* square region, scaled to fit within 210 width x 320 height (`!210,320`), 270 degree rotation, color image quality 
    * https://api.nga.gov/iiif/00004dec-8300-4487-8d89-562d0126b6a1/square/!210,320/270/color.jpg
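
The URL template above can be filled in with a small helper. This is a sketch only: `iiif_url` and its parameter names are hypothetical, and the default values simply mirror the thumbnail URLs found in the `thumbURL` column.

```python
def iiif_url(identifier, region="full", size="!200,200", rotation=0,
             quality="default", fmt="jpg",
             base="https://api.nga.gov/iiif"):
    """Assemble {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


image_id = "00004dec-8300-4487-8d89-562d0126b6a1"

# Same shape as the thumbURL values in the table.
thumb = iiif_url(image_id)

# Same shape as the first demo URL: full region, !210,320, rotated 90, gray.
rotated_gray = iiif_url(image_id, size="!210,320", rotation=90, quality="gray")
```

The helper only builds strings; whether the server accepts a given combination of region, size, rotation, and quality still has to be checked against the service.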
    


## 2.2: Gather all the matching images of paintings, drawings, prints 
**Counts: Total = 81937  ==>  Painting = 3788, Drawing = 32155, Print = 45994**

In [112]:
df_2D_7 = pd.merge(left = df_object_2D, right = df7, how = "inner", left_on = "objectID", right_on = "objectID")
df_2D_7.shape 
# 81937 digital images matched with Artwork-Objects of interest (i.e. Paintings, Drawings, Prints)

(81937, 21)
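
A toy version of the inner join shows why the joined table can have more rows than distinct objects. The two small frames are hypothetical; `validate="one_to_many"` is an optional check that raises an error if `objectID` is not unique on the left side.

```python
import pandas as pd

# Hypothetical stand-ins for df_object_2D and df7.
objects = pd.DataFrame({"objectID": [0, 1, 2], "title": ["A", "B", "C"]})
images = pd.DataFrame({
    "uuid": ["u1", "u2", "u3", "u4"],
    "objectID": [0, 0, 1, 9],   # object 0 has two images; 9 has no matching object
})

joined = pd.merge(objects, images, how="inner", on="objectID",
                  validate="one_to_many")

n_rows = len(joined)                       # one row per (object, image) pair
n_objects = joined["objectID"].nunique()   # distinct objects that have an image
```

Object 2 (no image) and image `u4` (no matching object) both drop out of the inner join, while object 0 appears twice.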

In [113]:
df_2D_7.columns # 21 columns as following:

Index(['objectID', 'title', 'displayDate', 'beginYear', 'endYear', 'timeSpan',
       'medium', 'dimensions', 'attributionInverted', 'attribution',
       'classification', 'parentID', 'portfolio', 'series', 'volume', 'uuid',
       'URL', 'thumbURL', 'width', 'height', 'maxpixels'],
      dtype='object')

In [114]:
df_2D_7.head(5)

Unnamed: 0,objectID,title,displayDate,beginYear,endYear,timeSpan,medium,dimensions,attributionInverted,attribution,classification,parentID,portfolio,series,volume,uuid,URL,thumbURL,width,height,maxpixels
0,0,Saint James Major,c. 1310,1310.0,1310.0,1300 to 1400,tempera on panel,painted surface (top of gilding): 62.2 × 34.8 cm (24 1/2 × 13 11/16 in.)\r\npainted surface (including painted border): 64.8 × 34.8 cm (25 1/2 × 13 11/16 in.)\r\noverall: 66.7 × 36.7 × 1.2 cm (26 ...,Grifo di Tancredi,Grifo di Tancredi,painting,34.0,,,,7b170a4c-9d44-475c-b294-cee6f43d88af,https://api.nga.gov/iiif/7b170a4c-9d44-475c-b294-cee6f43d88af,"https://api.nga.gov/iiif/7b170a4c-9d44-475c-b294-cee6f43d88af/full/!200,200/0/default.jpg",2846,5153,
1,1,Saint Paul and a Group of Worshippers,1333,1333.0,1333.0,1300 to 1400,tempera on panel,painted surface: 224.8 × 77 cm (88 1/2 × 30 5/16 in.)\r\noverall: 233.53 × 88.8 × 5.3 cm (91 15/16 × 34 15/16 × 2 1/16 in.),"Daddi, Bernardo",Bernardo Daddi,painting,,,,,7bbcfd01-e774-46e7-96d1-a3b03598cd8a,https://api.nga.gov/iiif/7bbcfd01-e774-46e7-96d1-a3b03598cd8a,"https://api.nga.gov/iiif/7bbcfd01-e774-46e7-96d1-a3b03598cd8a/full/!200,200/0/default.jpg",6004,15544,
2,2,Saint Andrew and Saint Benedict with the Archangel Gabriel [left panel],shortly before 1387,1387.0,1387.0,1300 to 1400,tempera on poplar panel,overall: 197 × 80 cm (77 9/16 × 31 1/2 in.),"Gaddi, Agnolo",Agnolo Gaddi,painting,206122.0,,,,e8a1acb4-f60a-477a-9bfe-61fa5072c514,https://api.nga.gov/iiif/e8a1acb4-f60a-477a-9bfe-61fa5072c514,"https://api.nga.gov/iiif/e8a1acb4-f60a-477a-9bfe-61fa5072c514/full/!200,200/0/default.jpg",2152,4827,
3,18,The Annunciation,c. 1423/1424,1383.0,1435.0,1300 to 1400,tempera (and possibly oil glazes) on panel,overall: 148.8 x 115.1 cm (58 9/16 x 45 5/16 in.)\r\nframed: 181 x 165.1 x 11.1 cm (71 1/4 x 65 x 4 3/8 in.),Masolino da Panicale,Masolino da Panicale,painting,,,,,e6497b39-66a9-4b6b-bf2b-c1633d20c0b6,https://api.nga.gov/iiif/e6497b39-66a9-4b6b-bf2b-c1633d20c0b6,"https://api.nga.gov/iiif/e6497b39-66a9-4b6b-bf2b-c1633d20c0b6/full/!200,200/0/default.jpg",2333,2996,
4,19,Portrait of a Man,c. 1450,1450.0,1450.0,1401 to 1500,tempera on panel,painted surface: 54.2 x 40.4 cm (21 5/16 x 15 7/8 in.)\r\nsupport: 55.5 x 41.2 cm (21 7/8 x 16 1/4 in.)\r\nframed: 86.4 x 74.9 x 8.9 cm (34 x 29 1/2 x 3 1/2 in.),Andrea del Castagno,Andrea del Castagno,painting,,,,,a21cc457-7ddf-4c4d-9934-8039bc919864,https://api.nga.gov/iiif/a21cc457-7ddf-4c4d-9934-8039bc919864,"https://api.nga.gov/iiif/a21cc457-7ddf-4c4d-9934-8039bc919864/full/!200,200/0/default.jpg",5395,7270,


In [119]:
df_2D_7["parentID"].nunique() #1149 parentIDs

1149

**Search a column using the `Series.str.contains()` method**
* https://stackoverflow.com/questions/11350770/filter-pandas-dataframe-by-substring-criteria
* also see CIT591-M6-Slide "Computations – sum()"
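
A minimal sketch of the substring filter on a toy frame (hypothetical rows). `na=False` treats missing `series` values as non-matches, and `regex=False` makes the search a literal substring match.

```python
import pandas as pd

# Toy frame with one missing series value (hypothetical rows).
demo = pd.DataFrame({
    "objectID": [1, 2, 3],
    "series": ["The Birds of America", None, "Recueil de douze Bacchanales"],
})

# Boolean mask: True where the series name contains the literal substring.
mask = demo["series"].str.contains("Birds of America", na=False, regex=False)
hits = demo.loc[mask]
```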

In [86]:
# Get the Artwork Series named "The Birds of America"
# df_2D_7.loc[df_2D_7["series"].str.contains("The Birds of America", na=False)]

### Check for unique objectID with `df.nunique()` or `df.value_counts()`
* https://www.geeksforgeeks.org/python-pandas-index-value_counts/  
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html  

In [87]:
df_2D_7["objectID"].nunique()
# 81868 distinct objectID

81868

## 2.3 Issue: Does any artwork-object have more than one corresponding image?
* 81937 total tuples (when inner-joining objects with object-images)
* 81868 distinct objectID among 81937 total object+image combinations
* 81937 - 81868 = 69 surplus tuples, so some artwork-objects have more than one image

**`df.duplicated()` method: Check for all the duplicates** (i.e. only return the duplicates)
* https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
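
The difference between "surplus rows" and "objects with more than one image" can be sketched on a toy frame. The objectIDs are borrowed from the table for readability, but the row counts here are invented.

```python
import pandas as pd

# Toy joined table: 46054 has 3 images, 41641 has 2, 19 has 1 (invented counts).
demo = pd.DataFrame({
    "objectID": [46054, 46054, 46054, 19, 41641, 41641],
    "uuid": ["a", "b", "c", "d", "e", "f"],
})

# keep=False marks every row of a repeated objectID, not just the later copies.
dups = demo[demo["objectID"].duplicated(keep=False)]

surplus_rows = len(demo) - demo["objectID"].nunique()   # rows beyond one per object
objects_with_many = dups["objectID"].nunique()          # objects having 2+ images
images_per_object = demo.groupby("objectID").size()     # image count per object
```

Here there are 3 surplus rows but only 2 objects with several images, which is the same kind of gap as between the surplus-row count and the number of distinct duplicated objectIDs.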

In [120]:
KeyAttrs = ["objectID","title","classification","thumbURL"] # only pulling out these columns to check for digital image validity for each artwork-object

df_dup = df_2D_7[df_2D_7["objectID"].duplicated(keep = False)]
df_dup[KeyAttrs]
df_dup
# df_dup["objectID"].nunique() # 59 distinct objectIDs (out of 81868 objects) have more than one image

Unnamed: 0,objectID,title,displayDate,beginYear,endYear,timeSpan,medium,dimensions,attributionInverted,attribution,classification,parentID,portfolio,series,volume,uuid,URL,thumbURL,width,height,maxpixels
364,46054,Four-Panel Screen,c. 1475/1500,1475.0,1500.0,1401 to 1500,oil on panel,"overall size: 222 x 286.6 cm (87 3/8 x 112 13/16 in.)\r\noverall (Saint Dionysius, painted surface): 151.5 x 54.5 cm (59 5/8 x 21 7/16 in.)\r\noverall (Saint Dionysius, painted surface and frame):...",Portuguese 15th Century,Portuguese 15th Century,painting,,,,,9d7fd726-6565-49b5-ae42-fd8f570fbf08,https://api.nga.gov/iiif/9d7fd726-6565-49b5-ae42-fd8f570fbf08,"https://api.nga.gov/iiif/9d7fd726-6565-49b5-ae42-fd8f570fbf08/full/!200,200/0/default.jpg",401,1200,
365,46054,Four-Panel Screen,c. 1475/1500,1475.0,1500.0,1401 to 1500,oil on panel,"overall size: 222 x 286.6 cm (87 3/8 x 112 13/16 in.)\r\noverall (Saint Dionysius, painted surface): 151.5 x 54.5 cm (59 5/8 x 21 7/16 in.)\r\noverall (Saint Dionysius, painted surface and frame):...",Portuguese 15th Century,Portuguese 15th Century,painting,,,,,a3bbffa3-3e50-465a-832a-fff278cd945c,https://api.nga.gov/iiif/a3bbffa3-3e50-465a-832a-fff278cd945c,"https://api.nga.gov/iiif/a3bbffa3-3e50-465a-832a-fff278cd945c/full/!200,200/0/default.jpg",394,1200,
366,46054,Four-Panel Screen,c. 1475/1500,1475.0,1500.0,1401 to 1500,oil on panel,"overall size: 222 x 286.6 cm (87 3/8 x 112 13/16 in.)\r\noverall (Saint Dionysius, painted surface): 151.5 x 54.5 cm (59 5/8 x 21 7/16 in.)\r\noverall (Saint Dionysius, painted surface and frame):...",Portuguese 15th Century,Portuguese 15th Century,painting,,,,,b862fdfb-bb90-45fe-9456-5f02a84bc6bc,https://api.nga.gov/iiif/b862fdfb-bb90-45fe-9456-5f02a84bc6bc,"https://api.nga.gov/iiif/b862fdfb-bb90-45fe-9456-5f02a84bc6bc/full/!200,200/0/default.jpg",383,1200,
367,46054,Four-Panel Screen,c. 1475/1500,1475.0,1500.0,1401 to 1500,oil on panel,"overall size: 222 x 286.6 cm (87 3/8 x 112 13/16 in.)\r\noverall (Saint Dionysius, painted surface): 151.5 x 54.5 cm (59 5/8 x 21 7/16 in.)\r\noverall (Saint Dionysius, painted surface and frame):...",Portuguese 15th Century,Portuguese 15th Century,painting,,,,,ce0ccbec-f4e6-4d7e-af41-e0b568f726eb,https://api.nga.gov/iiif/ce0ccbec-f4e6-4d7e-af41-e0b568f726eb,"https://api.nga.gov/iiif/ce0ccbec-f4e6-4d7e-af41-e0b568f726eb/full/!200,200/0/default.jpg",386,1200,
1997,41641,The Rule of Bacchus [left panel],c. 1535,1535.0,1535.0,1501 to 1550,oil on hardboard transferred from panel,left panel: 39 x 15.9 cm (15 3/8 x 6 1/4 in.),"Altdorfer, Albrecht, Workshop of",Workshop of Albrecht Altdorfer,painting,,,,,6d7a3f41-6c87-4055-bd05-a96006f9bd74,https://api.nga.gov/iiif/6d7a3f41-6c87-4055-bd05-a96006f9bd74,"https://api.nga.gov/iiif/6d7a3f41-6c87-4055-bd05-a96006f9bd74/full/!200,200/0/default.jpg",4500,2827,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76678,134411,Putti and Fauns Climbing a Grapevine,1650s,1617.0,1665.0,1601 to 1650,etching with engraving on laid paper,plate: 18.5 x 16.5 cm (7 5/16 x 6 1/2 in.)\r\nsheet: 38.5 x 26.8 cm (15 3/16 x 10 9/16 in.),"Dorigny, Michel",Michel Dorigny,print,134155.0,Recueil de Douze Bacchantes,Recueil de douze Bacchanales,,8d1aba03-3d45-4604-aa53-c9c72de76957,https://api.nga.gov/iiif/8d1aba03-3d45-4604-aa53-c9c72de76957,"https://api.nga.gov/iiif/8d1aba03-3d45-4604-aa53-c9c72de76957/full/!200,200/0/default.jpg",2506,4000,
77638,134423,Frontispiece,1650s,1610.0,1686.0,1601 to 1650,etching with engraving on laid paper,plate: 7.6 x 16.1 cm (3 x 6 5/16 in.)\r\nsheet: 38.5 x 26.8 cm (15 3/16 x 10 9/16 in.),"Cochin, Nicolas",Nicolas Cochin,print,134155.0,Recueil de douze Bacchanales,Recueil de douze Bacchanales,,2108023a-c1e2-49f8-94ec-a6f8eb6c01f2,https://api.nga.gov/iiif/2108023a-c1e2-49f8-94ec-a6f8eb6c01f2,"https://api.nga.gov/iiif/2108023a-c1e2-49f8-94ec-a6f8eb6c01f2/full/!200,200/0/default.jpg",2506,4000,
77639,134423,Frontispiece,1650s,1610.0,1686.0,1601 to 1650,etching with engraving on laid paper,plate: 7.6 x 16.1 cm (3 x 6 5/16 in.)\r\nsheet: 38.5 x 26.8 cm (15 3/16 x 10 9/16 in.),"Cochin, Nicolas",Nicolas Cochin,print,134155.0,Recueil de douze Bacchanales,Recueil de douze Bacchanales,,f771f5a1-666a-4f10-bb0a-bb6648dd75c9,https://api.nga.gov/iiif/f771f5a1-666a-4f10-bb0a-bb6648dd75c9,"https://api.nga.gov/iiif/f771f5a1-666a-4f10-bb0a-bb6648dd75c9/full/!200,200/0/default.jpg",4000,1956,
77662,147751,Woolworth Building June Night,1916,1916.0,1916.0,1901 to 1925,halftone offset lithograph,sheet: 140 x 89 mm,"Elmer, Rachael Robinson",Rachael Robinson Elmer,print,,Post Cards: New York Series I,,,27ab5df3-d94b-4d43-b8fe-17b4d3997269,https://api.nga.gov/iiif/27ab5df3-d94b-4d43-b8fe-17b4d3997269,"https://api.nga.gov/iiif/27ab5df3-d94b-4d43-b8fe-17b4d3997269/full/!200,200/0/default.jpg",7319,11421,


In [89]:
df_dup[KeyAttrs][0:4]

Unnamed: 0,objectID,title,classification,thumbURL
364,46054,Four-Panel Screen,painting,"https://api.nga.gov/iiif/9d7fd726-6565-49b5-ae42-fd8f570fbf08/full/!200,200/0/default.jpg"
365,46054,Four-Panel Screen,painting,"https://api.nga.gov/iiif/a3bbffa3-3e50-465a-832a-fff278cd945c/full/!200,200/0/default.jpg"
366,46054,Four-Panel Screen,painting,"https://api.nga.gov/iiif/b862fdfb-bb90-45fe-9456-5f02a84bc6bc/full/!200,200/0/default.jpg"
367,46054,Four-Panel Screen,painting,"https://api.nga.gov/iiif/ce0ccbec-f4e6-4d7e-af41-e0b568f726eb/full/!200,200/0/default.jpg"


In [90]:
df_dup[KeyAttrs].iloc[0:4,3]
# this is slicing for cells @:
# column-index = 3 (i.e. thumbURL), and 
# row positions 0 to 3 (i.e. the 4 components of artwork titled "Four-Panel Screen")

364    https://api.nga.gov/iiif/9d7fd726-6565-49b5-ae42-fd8f570fbf08/full/!200,200/0/default.jpg
365    https://api.nga.gov/iiif/a3bbffa3-3e50-465a-832a-fff278cd945c/full/!200,200/0/default.jpg
366    https://api.nga.gov/iiif/b862fdfb-bb90-45fe-9456-5f02a84bc6bc/full/!200,200/0/default.jpg
367    https://api.nga.gov/iiif/ce0ccbec-f4e6-4d7e-af41-e0b568f726eb/full/!200,200/0/default.jpg
Name: thumbURL, dtype: object

In [91]:
df_2D_paint = df_2D_7[ df_2D_7["classification"] == "painting"]
df_2D_paint.shape # 3788 counts of painting-images
df_2D_paint["objectID"].nunique() # 3782 counts of distinct paintings

3782

In [92]:
df_2D_draw = df_2D_7[ df_2D_7["classification"] == "drawing"]
df_2D_draw.shape # 32155 counts of drawing-images
df_2D_draw["objectID"].nunique() # 32135 counts of distinct drawings

32135

In [93]:
df_2D_print = df_2D_7[ df_2D_7["classification"] == "print"]
df_2D_print.shape # 45994 counts of print-images
df_2D_print["objectID"].nunique() # 45951 counts of distinct prints

45951

In [94]:
# Images for Paintings
df_2D_paint[KeyAttrs]

# to check a specific artwork-object for its digital image URL:
df_2D_paint[KeyAttrs].iloc[0,3] 
# get the cell with row_index = 0 (i.e. objectID = 0), column_index = 3 (i.e. attribute named "thumbURL")
# SYNTAX: dataframe.iloc[row_index, column_index]

'https://api.nga.gov/iiif/7b170a4c-9d44-475c-b294-cee6f43d88af/full/!200,200/0/default.jpg'

In [95]:
# Images for Drawings
df_2D_draw[KeyAttrs]

df_2D_draw[KeyAttrs].iloc[32154,3] # objectID=165301, title = "The Ballet"

'https://api.nga.gov/iiif/2303ecb9-b192-4833-8d33-29c0ccb7a07f/full/!200,200/0/default.jpg'

In [96]:
# Images for Prints
df_2D_print[KeyAttrs]

df_2D_print[KeyAttrs].iloc[45993,3] # objectID = 32572, title = "American Flamingo"

'https://api.nga.gov/iiif/bd772828-3571-450c-a79e-bb547c8ec9c1/full/!200,200/0/default.jpg'

### How to get the value of a specific cell in a table / dataframe

`.iloc[]` indexer (position-based)
* https://www.w3schools.com/python/pandas/ref_df_iloc.asp
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

To select multiple tuples from multiple columns
* https://www.geeksforgeeks.org/how-to-select-multiple-columns-in-a-pandas-dataframe/
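
A short sketch of position-based versus label-based cell access on a toy frame. The index labels 364 to 366 imitate a filtered frame whose labels no longer start at 0; the URLs are placeholders.

```python
import pandas as pd

# Toy frame whose index labels do not start at 0, as after filtering.
demo = pd.DataFrame(
    {
        "objectID": [46054, 46054, 46054],
        "title": ["Four-Panel Screen"] * 3,
        "thumbURL": ["url-a", "url-b", "url-c"],   # placeholder values
    },
    index=[364, 365, 366],
)

cell_by_position = demo.iloc[0, 2]          # first row, third column
cell_by_label = demo.loc[364, "thumbURL"]   # same cell, addressed by labels
block = demo.iloc[0:2, [0, 2]]              # rows 0-1, columns objectID and thumbURL
```

`.iloc[]` ignores the index labels entirely, so `demo.iloc[0, 2]` works even though no row is labelled 0.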

## 2.4: Output 2 Tables: "objects_cleaned.csv" && "objects_images_cleaned.csv"

**Pandas Create New DataFrame By Selecting Specific Columns**  
* https://sparkbyexamples.com/pandas/pandas-create-new-dataframe-by-selecting-specific-columns/

**Drop the duplicated tuples**
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
* ALSO see CIT591-M6-Slide14
* By default, this will look at all columns to identify and drop duplicate rows in the data.
* To drop duplicates based on a specific column, use the `subset=` parameter to name the column that is compared when identifying duplicates.
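
A minimal sketch of `drop_duplicates(subset=...)` followed by an index reset, on a toy frame with invented rows. By default the first occurrence of each `objectID` is kept.

```python
import pandas as pd

# Toy joined table: objectID 46054 appears twice (invented rows).
demo = pd.DataFrame({
    "objectID": [46054, 46054, 19],
    "title": ["Four-Panel Screen", "Four-Panel Screen", "Portrait of a Man"],
    "uuid": ["a", "b", "c"],
})

# Keep one row per objectID (the first), then renumber the rows 0..n-1.
deduped = demo.drop_duplicates(subset="objectID").reset_index(drop=True)
```

Without `reset_index(drop=True)` the surviving rows keep their old labels (0 and 2 here), which is how the row index can stop matching the row count.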

row index not matching with row counts  
https://stackoverflow.com/questions/53135481/pandas-dataframe-index-length-doesnt-match-number-of-rows

Solve DtypeWarning: `Columns have mixed types. Specify dtype option on import or set low_memory=False`
https://www.roelpeters.be/solved-dtypewarning-columns-have-mixed-types-specify-dtype-option-on-import-or-set-low-memory-in-pandas/

In [121]:
# --------- 1) Extract the "objects.csv" part of the JOINT table ----------------
df1_clean_raw = df_2D_7[['objectID', 'title', 'displayDate', 'beginYear', 'endYear', 'timeSpan',
       'medium', 'dimensions','attributionInverted', 'attribution', 'classification', 'parentID',
       'portfolio', 'series', 'volume']].copy()

#df1_clean_raw # 81937 tuples with duplicates
df1_clean = df1_clean_raw.drop_duplicates(subset="objectID")
df1_clean = df1_clean.reset_index(drop=True) # reset row indexing

# ---- 1.2) check if any duplicated objectID exist ----------------------
#df1_dup = df1_clean[df1_clean["objectID"].duplicated(keep = False)]
#df1_dup

#df1_clean.head(5) # 81937 tuples with duplicates ==> 81868 tuples without duplicates
df1_clean.shape # 81868 tuples x 15 columns
df1_clean["objectID"].nunique() # 81868 distinct objectIDs

81868

In [98]:
# --------- 2) Extract the "objects_images.csv" part of the JOINT table ----------------
df7_clean = df_2D_7[['uuid', 'objectID', 'URL', 'thumbURL', 'width', 'height', 'maxpixels']].copy()
df7_clean # 81937 tuples

df7_clean = df7_clean.drop_duplicates(subset="uuid")
df7_clean = df7_clean.reset_index(drop=True)
df7_clean # 81937 tuples, there are no duplicated uuid/images

#df7_clean.head(5)
#df7_clean.shape # 81937 tuples x 7 columns
# 81868 objectID --> 81937 images

Unnamed: 0,uuid,objectID,URL,thumbURL,width,height,maxpixels
0,7b170a4c-9d44-475c-b294-cee6f43d88af,0,https://api.nga.gov/iiif/7b170a4c-9d44-475c-b294-cee6f43d88af,"https://api.nga.gov/iiif/7b170a4c-9d44-475c-b294-cee6f43d88af/full/!200,200/0/default.jpg",2846,5153,
1,7bbcfd01-e774-46e7-96d1-a3b03598cd8a,1,https://api.nga.gov/iiif/7bbcfd01-e774-46e7-96d1-a3b03598cd8a,"https://api.nga.gov/iiif/7bbcfd01-e774-46e7-96d1-a3b03598cd8a/full/!200,200/0/default.jpg",6004,15544,
2,e8a1acb4-f60a-477a-9bfe-61fa5072c514,2,https://api.nga.gov/iiif/e8a1acb4-f60a-477a-9bfe-61fa5072c514,"https://api.nga.gov/iiif/e8a1acb4-f60a-477a-9bfe-61fa5072c514/full/!200,200/0/default.jpg",2152,4827,
3,e6497b39-66a9-4b6b-bf2b-c1633d20c0b6,18,https://api.nga.gov/iiif/e6497b39-66a9-4b6b-bf2b-c1633d20c0b6,"https://api.nga.gov/iiif/e6497b39-66a9-4b6b-bf2b-c1633d20c0b6/full/!200,200/0/default.jpg",2333,2996,
4,a21cc457-7ddf-4c4d-9934-8039bc919864,19,https://api.nga.gov/iiif/a21cc457-7ddf-4c4d-9934-8039bc919864,"https://api.nga.gov/iiif/a21cc457-7ddf-4c4d-9934-8039bc919864/full/!200,200/0/default.jpg",5395,7270,
...,...,...,...,...,...,...,...
81932,f457a3e8-5531-4e4a-a0b9-29f23d9a4ff3,222954,https://api.nga.gov/iiif/f457a3e8-5531-4e4a-a0b9-29f23d9a4ff3,"https://api.nga.gov/iiif/f457a3e8-5531-4e4a-a0b9-29f23d9a4ff3/full/!200,200/0/default.jpg",10693,8178,
81933,039df406-4517-4f75-8195-d0b668289b2d,222956,https://api.nga.gov/iiif/039df406-4517-4f75-8195-d0b668289b2d,"https://api.nga.gov/iiif/039df406-4517-4f75-8195-d0b668289b2d/full/!200,200/0/default.jpg",10578,8074,
81934,412ed7cb-5244-42f5-96d6-7264928ba491,222965,https://api.nga.gov/iiif/412ed7cb-5244-42f5-96d6-7264928ba491,"https://api.nga.gov/iiif/412ed7cb-5244-42f5-96d6-7264928ba491/full/!200,200/0/default.jpg",10470,7953,
81935,9c5398e2-3be4-4d1d-b293-72830423d0db,32452,https://api.nga.gov/iiif/9c5398e2-3be4-4d1d-b293-72830423d0db,"https://api.nga.gov/iiif/9c5398e2-3be4-4d1d-b293-72830423d0db/full/!200,200/0/default.jpg",4950,7350,


### 2.4 Issue: the output .csv file has 81874 tuples (should be 81868), i.e. 6 extra tuples

#### SOLUTION: output dataframe as `.xlsx` file, then use EXCEL software to save as `.csv` file

* Write pandas DataFrame to CSV file. The result gets extra rows  https://stackoverflow.com/questions/68372151/write-pandas-dataframe-to-csv-file-the-result-gets-extra-rows
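
One way to check whether extra rows are real records or only an artifact of how the file is counted is to round-trip the CSV through pandas and compare record counts with physical line counts. The sketch below uses a toy frame; the assumption (not confirmed for this dataset) is that multi-line text fields, such as `dimensions` values containing `\r\n`, are what make line-based tools report more rows.

```python
import io
import pandas as pd

# Toy frame: one dimensions value contains an embedded line break.
demo = pd.DataFrame({
    "objectID": [0, 1],
    "dimensions": [
        "painted surface: 62.2 x 34.8 cm\r\noverall: 66.7 x 36.7 cm",
        "overall: 197 x 80 cm",
    ],
})

# Write to an in-memory CSV, then inspect it two ways.
buf = io.StringIO()
demo.to_csv(buf, index=False)
text = buf.getvalue()

physical_lines = len(text.splitlines())          # what a line counter sees
records = len(pd.read_csv(io.StringIO(text)))    # what a CSV parser sees
```

If `records` matches the DataFrame length while the line count is higher, the file itself is intact and the extra "rows" come from quoted line breaks.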

**`DataFrame.to_excel()` method**
* https://sparkbyexamples.com/pandas/pandas-read-excel-with-examples/


`isnull()`  
* https://www.geeksforgeeks.org/python-pandas-isnull-and-notnull/  
* https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

## 2.5 Output cleaned `objects.xlsx` & `objects_images.csv`

Output `.csv`/`.xlsx` file to a different directory:
* https://dataindependent.com/pandas/pandas-write-to-csv-pd-dataframe-to_csv/

In [99]:
# --------- 1) objects.csv ---------------------
# df1_clean.to_csv("01_2_objects_cleaned.csv", encoding = 'utf-8', index = False) # ISSUE: produced extra 6 tuples
df1_clean.to_excel("01_2_objects_cleaned.xlsx", index = False) # `encoding` argument dropped: pandas 2.0 removed it from to_excel()
# ----------to `Ready` folder (Database Ready) ------------
df1_clean.to_excel("../Ready/01_objects.xlsx", index = False)
# CHECKED output .xlsx file is INTACT (i.e. 81868 tuples)


# ----------2) objects_images.csv file -----------
df7_clean.to_csv("07_1_objects_images_cleaned.csv", encoding = 'utf-8', index = False)
# ----------to `Ready` folder (Database Ready) ------------ 
df7_clean.to_csv("../Ready/07_objects_images.csv", encoding = 'utf-8', index = False)

In [100]:
# --------- Testing Converting .xlsx to .csv file -------------
#df1t = pd.read_excel('01_2_objects_cleaned.xlsx')
#df1t
#df1t.to_csv("01_2_objects_cleaned.csv", encoding = 'utf-8', index = False)
# OBSERVATION: still hits the same issue (6 extra tuples written out)
# USE the Excel app to convert .xlsx to .csv directly

If file paths need to be manipulated: `os.path.join()` method
* https://www.geeksforgeeks.org/python-os-path-join-method/
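
A small sketch of building the `Ready` output path with `os.path.join()` instead of a hard-coded string. The variable names are hypothetical; the folder and file names are the ones used in the cells above.

```python
import os

# Build "../Ready/07_objects_images.csv" piece by piece.
ready_dir = os.path.join("..", "Ready")
out_path = os.path.join(ready_dir, "07_objects_images.csv")

# The file name can be recovered from the full path when needed.
file_name = os.path.basename(out_path)
```

The resulting `out_path` can be passed straight to `to_csv()`, and the separator adapts to the operating system.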

## 2.6 Use Excel-app to Convert `objects.xlsx` to `objects.csv`