# Mexican Broadsides Data Analysis

For the upcoming Mexican Broadsides collection, we currently have an Excel sheet with call numbers and some metadata. These data will need to be converted, cleaned, and analyzed in order to produce quality metadata for the DAMS.

## First step: data conversion
The first step was to convert the primary Excel sheet to a CSV file with `csvkit`. The command was:

```
in2csv --sheet 'Mexican Broadsides Batch 1' Mexican_broadsides_batch_1.xlsx > broadsides.csv
```

In [1]:
import pandas as pd
import os

In [2]:
df1 = pd.read_csv('broadsides.csv')

In [3]:
df1.head()

| Unnamed: 0 | BIB | CALL | LOC | AUTHOR | TITLE | PUBLISHED | GENRE | STATUS | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b35481328 | BX1427 .A35 1680z | cpn | Cofradía de San Benito de Palerma | Patente de la Cofradia y Hermandad de la Coron... | [Mexico] : [La Cofradia], [168-] | Broadsides -- Mexico -- 1680-1689. rbgenr | In Aeon | |
| 1 | b35484287 | BX1427 .A35 1767 | cpn | Cofradia del Santissimo Sacramento y Nuestra S... | Patente de la Cofradia del Santissimo Sacramen... | [Mexico] : [La Cofradia], [1767?] | Broadsides -- Mexico -- 1760-1769. rbgenr | In Aeon | |
| 2 | b35484718 | BX1427 .A35 1770z | cpn | Cofradía de San Dimas y Nuestra Señora de la C... | Patente de la Cofradia y Ermandad de el glorio... | [Mexico] : [La Cofradia], [177-] | Broadsides -- Mexico -- 1770-1779. rbgenr | In Aeon | |
| 3 | b35484676 | BX1427 .A35 1780z | cpn | Cofradia de San Dimas y Nuestra Señora de la C... | Patente de hermano de la ilustre Cofradia del ... | [Mexico] : [La Cofradia], [178-] | Broadsides -- Mexico -- 1780-1789. rbgenr | In Aeon | |
| 4 | b35581864 | BX1430.G7 C38 1845 | cpn | Catholic Church. Diocese of Guadalajara (Mexic... | El Ilmo. Sr. Dr. D. Diego Aranda, dignisimo ob... | [Guadalajara, Mexico] : Imprenta de Rodriguez,... | Broadsides -- Mexico -- 1840-1849. rbgenr | In Aeon | |


In [16]:
bib = df1['BIB']

In [17]:
bibs = []
for line in bib:
    bibs.append(line.strip())

In [28]:
print(bibs)
print('\n',len(bibs))

['b35481328', 'b35484287', 'b35484718', 'b35484676', 'b35581864', 'b23614614', 'b66771638', 'b35485140', 'b32169991', 'b35323954', 'b35411545', 'b35567028', 'b35575542', 'b35575943', 'b35382752', 'b35562560', 'b35571275', 'b35572620', 'b35571652', 'b35574987', 'b35571196', 'b35571378', 'b35517323', 'b35573314', 'b35581736', 'b35586436', 'b35608080', 'b48687273', 'b29429821', 'b35581827', 'b35355062', 'b35354963', 'b35354811', 'b35509247', 'b35562730', 'b35572930', 'b33814508', 'b23614791', 'b35573740', 'b35567004', 'b34096279', 'b92452498', 'b92452383', 'b26409124', 'b35847499', 'b35382636', 'b57981590', 'b35485486', 'b35474166', 'b3549444x', 'b3547449x', 'b35514048', 'b35561440', 'b33844483', 'b35514954', 'b35517529', 'b35370725', 'b35498833', 'b35498808', 'b35563370', 'b35568549', 'b35568409', 'b35513172', 'b35515971', 'b35459116', 'b35572784', 'b3551100x', 'b35517311', 'b35519952', 'b35476709', 'b35513366', 'b35519332', 'b33022161', 'b23614560', 'b23614699', 'b35353053', 'b35352942'

So the spreadsheet lists 155 lines, or objects. Now let's look at the actual file directories to compare.

In [23]:
os.chdir('/mnt/digital-staging/Mexican-Broadsides/batch1/Working_Files')
folders = [name for name in os.listdir(".") if os.path.isdir(name)]

In [53]:
print(folders[0:30])
print('\n',len(folders), "total folders")

['b35571652', 'b35411545', 'b35576418', 'b92452966', 'b35515971', 'b35577605', 'b92453156', 'b92849179', 'b34079567', 'b35459098', 'b92645847', 'b35352942', 'b35353922', 'b92472801', 'b92452383', 'b35589711', 'b92758587', 'b35484676', 'b35407773', 'b35474166', 'b34088465', 'b92826428', 'b35353053', 'b92478396', 'b92675153', 'b3551274x', 'b23614560', 'b3560122x', 'b35517529', 'b92850510']

 223 total folders


Hmmmm. The spreadsheet listed only 155 objects, but the directory has 223 folders, a difference of 68.
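One way to see where the two counts diverge is to compare the two ID lists as sets. The sketch below uses a handful of IDs copied from the outputs above as stand-ins; `sample_bibs` and `sample_folders` are hypothetical names, and in the notebook the full `bibs` and `folders` lists would take their place. Because the printed lists are partial, the sample result does not say which IDs are actually missing from either side.

```python
# Stand-in samples copied from the printed outputs above; in the notebook,
# the full `bibs` and `folders` lists would be used instead.
sample_bibs = ['b35481328', 'b35484676', 'b35571652', 'b35411545']
sample_folders = ['b35571652', 'b35411545', 'b35576418', 'b35484676']

bib_set = set(sample_bibs)
folder_set = set(sample_folders)

# Folders on disk with no matching row in the (sample) spreadsheet list
missing_from_sheet = sorted(folder_set - bib_set)

# Spreadsheet rows with no matching folder in the (sample) directory list
missing_from_disk = sorted(bib_set - folder_set)

# IDs present in both places
in_both = sorted(bib_set & folder_set)

print(len(missing_from_sheet), "folders not in the sheet:", missing_from_sheet)
print(len(missing_from_disk), "sheet rows without a folder:", missing_from_disk)
print(len(in_both), "IDs in both")
```

Sets ignore order and duplicates, so it may also be worth comparing `len(bibs)` with `len(set(bibs))` to check whether any bib ID is repeated in the spreadsheet.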

Because we can have 'complex objects', that is, objects with more than one file, we should explore what that structure looks like.

In [61]:
files = []
for f in folders:
    # list the contents of each object folder (relative to the current working directory)
    files.append(os.listdir(f))

In [77]:
print(files[0:15])
print('\n',len(files), "total objects")

[['Thumbs.db', 'b35571652_1.tif'], ['b35411545_1.tif', 'Thumbs.db'], ['Thumbs.db', 'b35576418_1.tif'], ['b92452966_1.tif', 'Thumbs.db'], ['b35515971_1.tif', 'Thumbs.db'], ['b35577605_1.tif', 'Thumbs.db'], ['b92453156_1.tif', 'Thumbs.db'], ['Thumbs.db', 'b92849179_1.tif'], ['b34079567_2.tif', 'b34079567_1.tif', 'Thumbs.db'], ['b35459098_1.tif', 'Thumbs.db'], ['b92645847_1.tif', 'Thumbs.db'], ['b35352942_1.tif', 'Thumbs.db'], ['b35353922_1.tif', 'b35353922_2.tif', 'Thumbs.db'], ['b92472801_1.tif', 'Thumbs.db', 'b92472801_2.tif'], ['b92452383_2.tif', 'b92452383_1.tif']]

 223 total objects


Notice the brackets around each group. We have made a "list of lists": each group is its own list inside the larger list. This is key to knowing how to filter it later.

So, we see that within our 223 objects, we have some that have 2 or more .tif files, and thankfully, the objects with only 1 .tif still maintain a `_1` suffix.  
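To make the list-of-lists structure concrete, the following sketch indexes into a small sample and picks out the complex objects. `sample_files` is a hypothetical stand-in built from a few of the entries printed above; the same expressions should work on the full `files` list.

```python
# A few entries copied from the output above
sample_files = [
    ['Thumbs.db', 'b35571652_1.tif'],
    ['b34079567_2.tif', 'b34079567_1.tif', 'Thumbs.db'],
    ['b92452383_2.tif', 'b92452383_1.tif'],
]

# The outer index selects an object; the inner index selects a file within it
first_object = sample_files[0]
first_file = sample_files[0][1]

# Count only the .tif files in each object, ignoring anything else
tif_counts = [
    sum(1 for name in group if name.endswith('.tif'))
    for group in sample_files
]

# Objects with more than one .tif are the "complex objects"
complex_objects = [
    group for group, n in zip(sample_files, tif_counts) if n > 1
]

print(first_file)
print(tif_counts)
print(len(complex_objects), "complex objects")
```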

But we also have the junk file 'Thumbs.db', which we will want to filter out of each inner list. We can use the `.remove` method, which removes the first matching item from a list.

In [93]:
for file_list in files:
    # .remove drops one match at a time, so loop until none are left
    while 'Thumbs.db' in file_list:
        file_list.remove('Thumbs.db')
print(files[0:15])
print('\n',len(files), "total objects")

[['b35571652_1.tif'], ['b35411545_1.tif'], ['b35576418_1.tif'], ['b92452966_1.tif'], ['b35515971_1.tif'], ['b35577605_1.tif'], ['b92453156_1.tif'], ['b92849179_1.tif'], ['b34079567_2.tif', 'b34079567_1.tif'], ['b35459098_1.tif'], ['b92645847_1.tif'], ['b35352942_1.tif'], ['b35353922_1.tif', 'b35353922_2.tif'], ['b92472801_1.tif', 'b92472801_2.tif'], ['b92452383_2.tif', 'b92452383_1.tif']]

 223 total objects
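An alternative, sketched below, builds a new filtered list with a list comprehension instead of removing items in place. This version is an assumption about what might be wanted rather than what the notebook does: it keeps anything ending in `.tif` (so any other stray files would be dropped as well, not only 'Thumbs.db') and sorts each group. `sample_files` and `cleaned` are hypothetical names.

```python
# A few entries copied from the unfiltered output above
sample_files = [
    ['Thumbs.db', 'b35571652_1.tif'],
    ['b34079567_2.tif', 'b34079567_1.tif', 'Thumbs.db'],
]

# Keep only .tif files and sort each group; the original lists are left untouched
cleaned = [
    sorted(name for name in group if name.lower().endswith('.tif'))
    for group in sample_files
]

print(cleaned)
```

Note that the sort is alphabetical, so in an object with ten or more files `_10.tif` would sort before `_2.tif`; a numeric sort key would be needed in that case.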


So now we have a clean file directory list for our objects. At the very beginning, we made a list of 'bib' IDs from the spreadsheet, and each folder is named with the same kind of ID. That makes the bib ID a natural key for grouping the directory listing: we can make a Python dictionary with the bib ID as the key and that object's files as the value. Because `files` was built by looping over `folders` in order, the two lists line up position by position.

In [94]:
listing_dict = dict(zip(folders, files))
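As a sketch of how such a dictionary might then be used, the example below looks up the files for one bib ID and checks which spreadsheet IDs have no folder. The sample data and the names `sample_listing` and `sample_bibs` are hypothetical stand-ins for `listing_dict` and `bibs`, so the result for the sample says nothing about the real collection.

```python
# Stand-in data copied from entries shown elsewhere in this notebook
sample_folders = ['b92452383', 'b35352942', 'b35847499']
sample_files = [
    ['b92452383_2.tif', 'b92452383_1.tif'],
    ['b35352942_1.tif'],
    ['b35847499_1.tif'],
]

# zip pairs each folder name with the file list at the same position
sample_listing = dict(zip(sample_folders, sample_files))

# Look up the files for one object by its bib ID
print(sample_listing['b92452383'])

# .get returns a default instead of raising KeyError when an ID has no folder
sample_bibs = ['b35352942', 'b35481328']
for bib_id in sample_bibs:
    print(bib_id, sample_listing.get(bib_id, []))

# Spreadsheet IDs with no entry in the (sample) directory listing
unmatched = [bib_id for bib_id in sample_bibs if bib_id not in sample_listing]
print(unmatched)
```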

In [95]:
print(listing_dict)

{'b92675153': ['b92675153_1.tif', 'b92675153_2.tif'], 'b92698505': ['b92698505_2.tif', 'b92698505_1.tif'], 'b9275868x': ['b9275868x_1.tif'], 'b35352942': ['b35352942_1.tif'], 'b92592892': ['b92592892_1.tif'], 'b92453156': ['b92453156_1.tif'], 'b35514954': ['b35514954_1.tif'], 'b34074600': ['b34074600_1.tif'], 'b92848266': ['b92848266_2.tif', 'b92848266_1.tif'], 'b92713555': ['b92713555_1.tif'], 'b33814508': ['b33814508_1.tif'], 'b35601516': ['b35601516_1.tif'], 'b35519551': ['b35519551_1.tif'], 'b92478396': ['b92478396_1.tif'], 'b92452383': ['b92452383_2.tif', 'b92452383_1.tif'], 'b35600408': ['b35600408_1.tif'], 'b35608808': ['b35608808_1.tif'], 'b35601176': ['b35601176_1.tif'], 'b35583149': ['b35583149_1.tif', 'b35583149_2.tif'], 'b35847499': ['b35847499_1.tif'], 'b92592533': ['b92592533_1.tif'], 'b35605558': ['b35605558_1.tif'], 'b35608596': ['b35608596_1.tif'], 'b35382636': ['b35382636_1.tif', 'b35382636_2.tif'], 'b35517323': ['b35517323_1.tif', 'b35517323_2.tif'], 'b35589711': ['b