# Importing very large CSVs into `pandas` for the CIL project 

`pandas` can handle very large files, but some `read_csv` options may have to be set. Let's see how this works:

In [1]:
import pandas as pd
import timeit

In [2]:
df_basic = pd.read_csv('cil_excel_object_input.csv')

  interactivity=interactivity, compiler=compiler, result=result)


We get an interesting warning. It doesn't make much sense that `pandas` is complaining about mixed data types, but perhaps that's not really the issue. It _does_ recommend setting the flag `low_memory=False`, so let's try that.

In [6]:
%%time
df_opt = pd.read_csv('cil_excel_object_input.csv', low_memory=False)

CPU times: user 1.68 s, sys: 96.8 ms, total: 1.78 s
Wall time: 1.8 s


It appears to have worked. Now let's look at a preview.

**NOTE:** For whatever reason, when we used the `%%time` magic, the DataFrame created in that cell was not kept. We first have to recreate the DataFrame before we move on.
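If `%%time` does not keep the variables assigned in the cell in your environment, one workaround is to time the call by hand. The following is a minimal sketch, not part of the original notebook: it uses a tiny in-memory CSV (`csv_text`, a hypothetical stand-in for the real file) and `time.perf_counter()` from the standard library.

```python
import io
import time

import pandas as pd

# Hypothetical stand-in for the real CSV file, so the sketch runs anywhere.
csv_text = "a,b\n1,x\n2,y\n"

# Time the read manually; the DataFrame stays in the normal namespace.
start = time.perf_counter()
df_timed = pd.read_csv(io.StringIO(csv_text), low_memory=False)
elapsed = time.perf_counter() - start

print(f"read_csv took {elapsed:.4f} s")
print(df_timed.shape)
```

With the real data, the `io.StringIO(csv_text)` argument would be replaced by the path to the CSV file.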

In [9]:
df_opt = pd.read_csv('cil_excel_object_input.csv', low_memory=False)
df_opt[0:24]

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Type of Resource.1,Language,Title,Date:creation,Date:issued,...,Subject:topic.11,Subject:topic.12,Subject:topic.13,Subject:topic.14,Subject:topic.15,Subject:topic.16,Subject:topic.17,Subject:topic.18,Subject:topic.19,Subject:topic.20
0,12154,Object,,,data,still image,zxx - No linguistic content; Not applicable,"CIL:12154, Gallus gallus gallus, memory B cell.",,2019.0,...,,,,,,,,,,
1,12154,Component,12154.jpg,image-source,,,,Jpeg format,,,...,,,,,,,,,,
2,12154,Component,12154.tif,image-source,,,,OME_tif format,,,...,,,,,,,,,,
3,12154,Component,12154.zip,data-service,,,,Zip format,,,...,,,,,,,,,,
4,12154,Component,12154.json,data-service,,,,CIL source metadata (JSON),,,...,,,,,,,,,,
5,35710,Object,,,data,still image,zxx - No linguistic content; Not applicable,"CIL:35710, Saccharomyces cerevisiae.",,2019.0,...,,,,,,,,,,
6,35710,Component,35710.tif,image-source,,,,OME_tif format,,,...,,,,,,,,,,
7,35710,Component,35710.jpg,image-source,,,,Jpeg format,,,...,,,,,,,,,,
8,35710,Component,35710.zip,data-service,,,,Zip format,,,...,,,,,,,,,,
9,35710,Component,35710.json,data-service,,,,CIL source metadata (JSON),,,...,,,,,,,,,,
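An alternative to `low_memory=False` is to tell `read_csv` which type each ambiguous column should have via its `dtype` parameter. The sketch below assumes a hypothetical miniature CSV in which `Date:issued` mixes numbers and text; which columns actually trigger the message in the real file is not shown above, so the column choice here is only an illustration.

```python
import io

import pandas as pd

# Hypothetical miniature file: 'Date:issued' mixes numeric and text values.
csv_text = "Object Unique ID,Date:issued\n2,2020\n123,circa 2019\n701,\n"

# Declare the column types up front so pandas does not have to guess them.
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Object Unique ID": "int64", "Date:issued": str},
)

print(df_typed.dtypes)
print(df_typed)
```

Reading the ambiguous column as text keeps every value exactly as written in the file; it can be converted later once the odd values have been cleaned up.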


## Making scripts to subset the data

For CIL, we would like to subset the huge CSV, both for manageability and by copyright status and license. Let's import the data again.

In [118]:
df_cil = pd.read_csv('../test/cil_excel_object_input.csv', low_memory=False)

In [119]:
df_cil.shape

(51327, 29)

So the CSV has 29 columns and more than 51,000 rows. Let's look at a brief snippet.

In [120]:
df_cil.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,del,Title,Date:creation,Date:issued,...,Person:researcher,Related resource:related,Subject:anatomy,Subject:scientific name,Subject:series,Subject:topic,Access granted,Copyright status,Copyright holder,CC license
0,2,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:2, Mus musculus, fibroblast.","CIL:2, Mus musculus, fibroblast",,2020.0,...,"Parysek, Linda | Aebig, Trudy",Source Record in the Cell Image Library @ http...,fibroblast | mitochondrion | type III intermed...,Mus musculus,,intermediate filament-based process | mitochon...,The world - metadata and files,Public Domain,,
1,2,Component,2.tif,image-source,,,OME_tif format,OME_tif format,,,...,,,,,,,,,,
2,2,Component,2.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,,,
3,2,Component,2.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,,,
4,2,Component,2.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,,,


A first hurdle we encounter is that, due to the object structure, component rows are blank/NaN in the columns we are interested in: `[Copyright status]` and `[CC license]`. We may need to fill down these values (in `pandas`, this is done via `ffill()` or `fillna()`). But it will need to be done intelligently. A 'forward fill' always works off the _last_ valid value, which will unfortunately fill down into objects we don't want it to.

First, renaming these columns will help us with our operations.

In [126]:
df_cil.rename({'Copyright status': 'copyright_status', 'CC license': 'cc_license'}, axis=1, inplace=True)

In [127]:
df_cil.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,del,Title,Date:creation,Date:issued,...,Person:researcher,Related resource:related,Subject:anatomy,Subject:scientific name,Subject:series,Subject:topic,Access granted,copyright_status,Copyright holder,cc_license
0,2,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:2, Mus musculus, fibroblast.","CIL:2, Mus musculus, fibroblast",,2020.0,...,"Parysek, Linda | Aebig, Trudy",Source Record in the Cell Image Library @ http...,fibroblast | mitochondrion | type III intermed...,Mus musculus,,intermediate filament-based process | mitochon...,The world - metadata and files,Public Domain,,
1,2,Component,2.tif,image-source,,,OME_tif format,OME_tif format,,,...,,,,,,,,,,
2,2,Component,2.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,,,
3,2,Component,2.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,,,
4,2,Component,2.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,,,


Now, to do our "intelligent" fill down for public domain objects, we will have to group by the `Object Unique ID` column, in order to fill down object by object.

In [128]:
# forward-fill within each object only, so values never cross object boundaries
df_cil['copyright_status'] = df_cil.groupby('Object Unique ID')['copyright_status'].ffill()
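To see why the fill has to be grouped by object, here is a small sketch on a hypothetical toy frame (not the CIL data): object 1 is public domain, object 2 has no copyright status at all. A plain forward fill leaks object 1's value into object 2, while the grouped fill does not.

```python
import pandas as pd

# Hypothetical toy data: object 1 is public domain, object 2 has no status.
toy = pd.DataFrame({
    "Object Unique ID": [1, 1, 1, 2, 2],
    "Level": ["Object", "Component", "Component", "Object", "Component"],
    "copyright_status": ["Public Domain", None, None, None, None],
})

# Plain forward fill: the last valid value spills over into object 2.
naive = toy["copyright_status"].ffill()

# Grouped forward fill: each object is filled independently.
grouped = toy.groupby("Object Unique ID")["copyright_status"].ffill()

print(naive.tolist())
print(grouped.isna().tolist())
```

In the grouped result, the rows of object 2 stay missing, which is what we want when filtering by copyright status afterwards.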

Let's apply a filter to get just the public domain objects and their components.

In [135]:
# filter rows that are public domain
cil_pd = df_cil[df_cil['copyright_status']=='Public Domain']
cil_pd

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,del,Title,Date:creation,Date:issued,...,Person:researcher,Related resource:related,Subject:anatomy,Subject:scientific name,Subject:series,Subject:topic,Access granted,copyright_status,Copyright holder,cc_license
0,2,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:2, Mus musculus, fibroblast.","CIL:2, Mus musculus, fibroblast",,2020.0,...,"Parysek, Linda | Aebig, Trudy",Source Record in the Cell Image Library @ http...,fibroblast | mitochondrion | type III intermed...,Mus musculus,,intermediate filament-based process | mitochon...,The world - metadata and files,Public Domain,,
1,2,Component,2.tif,image-source,,,OME_tif format,OME_tif format,,,...,,,,,,,,Public Domain,,
2,2,Component,2.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,Public Domain,,
3,2,Component,2.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,Public Domain,,
4,2,Component,2.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,Public Domain,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51278,50660,Component,50660.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,Public Domain,,
51279,50661,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:50661, Homo sapiens, Umbilical vein endoth...","CIL:50661, Homo sapiens, Umbilical vein endoth...",,2020.0,...,"Inala, Ashwin | Shiuan, Eileen",Source Record in the Cell Image Library @ http...,Vascular endothelial cadherin | Umbilical vein...,Homo sapiens,,,The world - metadata and files,Public Domain,,
51280,50661,Component,50661.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,Public Domain,,
51281,50661,Component,50661.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,Public Domain,,


We now have a dataset of public domain objects and their components, and it has 27,000 rows.

Next we need to do a similar fill down for Creative Commons licensed objects, since their component rows are also blank.
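Because each object is followed by several component rows, the row count of a subset is larger than the number of objects in it. A minimal sketch on a hypothetical toy frame (the IDs are borrowed from the preview above, the rest is made up) shows how both numbers can be checked:

```python
import pandas as pd

# Hypothetical toy subset: two objects with their component rows.
toy_pd = pd.DataFrame({
    "Object Unique ID": [2, 2, 2, 50661, 50661],
    "Level": ["Object", "Component", "Component", "Object", "Component"],
    "copyright_status": ["Public Domain"] * 5,
})

n_rows = len(toy_pd)                                   # objects + components
n_objects = toy_pd["Object Unique ID"].nunique()       # distinct objects
n_object_rows = int((toy_pd["Level"] == "Object").sum())  # rows at object level

print(n_rows, n_objects, n_object_rows)
```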

In [137]:
# forward-fill the license within each object only
df_cil['cc_license'] = df_cil.groupby('Object Unique ID')['cc_license'].ffill()

In [139]:
# filter out null on [CC license] to see how many CC licenses there are
df_cc = df_cil[df_cil.cc_license.notnull()]
df_cc

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,del,Title,Date:creation,Date:issued,...,Person:researcher,Related resource:related,Subject:anatomy,Subject:scientific name,Subject:series,Subject:topic,Access granted,copyright_status,Copyright holder,cc_license
20,123,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:123, Gallus gallus gallus, erythrocyte.","CIL:123, Gallus gallus gallus, erythrocyte",,2020.0,...,"Woodcock, Christopher",Source Record in the Cell Image Library @ http...,"chromosome, telomeric region | erythrocyte",Gallus gallus gallus,,structural constituent of chromatin | chromati...,The world - metadata and files,,UC Regents,Attribution-NonCommercial-ShareAlike
21,123,Component,123.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,,,Attribution-NonCommercial-ShareAlike
22,123,Component,123.tif,image-source,,,OME_tif format,OME_tif format,,,...,,,,,,,,,,Attribution-NonCommercial-ShareAlike
23,123,Component,123.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,,,Attribution-NonCommercial-ShareAlike
24,123,Component,123.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,,,Attribution-NonCommercial-ShareAlike
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51322,50680,Component,50680.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,,,Attribution
51323,50681,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:50681, FIB-SEM Dataset of anti-PKHD1L1 Imm...","CIL:50681, FIB-SEM Dataset of anti-PKHD1L1 Imm...",,2020.0,...,"Ivanchenko, Maryna | Indzhykulian, Artur | Cor...",Nature article @ https://www.nature.com/articl...,Hair cell stereocilia | Polycystic Kidney and ...,Mus musculus,Cell Image Library Group ID: 20000,Native tissue at postnatal day 4,The world - metadata and files,,UC Regents,Attribution
51324,50681,Component,50681.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,,,Attribution
51325,50681,Component,50681.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,,,Attribution
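To see how the rows and objects are distributed over the license values, `value_counts()` and a grouped `nunique()` are one option. The sketch below runs on a hypothetical toy frame that only reuses two license strings visible in the output above; the counts are not those of the real data.

```python
import pandas as pd

# Hypothetical toy data using two license values seen in the output above.
toy_cc = pd.DataFrame({
    "Object Unique ID": [123, 123, 701, 701, 701],
    "Level": ["Object", "Component", "Object", "Component", "Component"],
    "cc_license": ["Attribution-NonCommercial-ShareAlike"] * 2
                  + ["Attribution"] * 3,
})

# Number of rows (objects plus components) per license value.
rows_per_license = toy_cc["cc_license"].value_counts()

# Number of distinct objects per license value.
objects_per_license = toy_cc.groupby("cc_license")["Object Unique ID"].nunique()

print(rows_per_license)
print(objects_per_license)
```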


### Further filtering by CC license type
We need not only to separate public domain objects from CC licensed objects; we also need to sort the CC licensed objects further by license type (attribution or CC-BY, non-derivative or CC-BY-ND, etc.).

We can pass simple filters based on values to get these.

In [144]:
cc_by = df_cc[df_cc['cc_license']=='Attribution']
cc_by

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,del,Title,Date:creation,Date:issued,...,Person:researcher,Related resource:related,Subject:anatomy,Subject:scientific name,Subject:series,Subject:topic,Access granted,copyright_status,Copyright holder,cc_license
634,701,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:701, Rattus, multipolar neuron.","CIL:701, Rattus, multipolar neuron",,2020.0,...,"Withers, Ginger",Source Record in the Cell Image Library @ http...,dendritic branch | dendrite | axon | multipola...,Rattus,,dendrite development | establishment or mainte...,The world - metadata and files,,UC Regents,Attribution
635,701,Component,701.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,,,Attribution
636,701,Component,701.tif,image-source,,,OME_tif format,OME_tif format,,,...,,,,,,,,,,Attribution
637,701,Component,701.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,,,Attribution
638,701,Component,701.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,,,Attribution
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51322,50680,Component,50680.json,data-service,,,CIL source metadata (JSON),CIL source metadata (JSON),,,...,,,,,,,,,,Attribution
51323,50681,Object,,,data | still image,zxx - No linguistic content; Not applicable,"CIL:50681, FIB-SEM Dataset of anti-PKHD1L1 Imm...","CIL:50681, FIB-SEM Dataset of anti-PKHD1L1 Imm...",,2020.0,...,"Ivanchenko, Maryna | Indzhykulian, Artur | Cor...",Nature article @ https://www.nature.com/articl...,Hair cell stereocilia | Polycystic Kidney and ...,Mus musculus,Cell Image Library Group ID: 20000,Native tissue at postnatal day 4,The world - metadata and files,,UC Regents,Attribution
51324,50681,Component,50681.jpg,image-source,,,Jpeg format,Jpeg format,,,...,,,,,,,,,,Attribution
51325,50681,Component,50681.zip,data-service,,,Zip format,Zip format,,,...,,,,,,,,,,Attribution


We can then do the same for the rest of the CC license values.
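Rather than writing one filter per license by hand, the subsets can also be produced in a single pass with `groupby`. This is a sketch on a hypothetical toy frame, and the output file name in the comment is made up for illustration; it is not how the project actually names its files.

```python
import pandas as pd

# Hypothetical toy data using two license values seen in the output above.
toy_cc = pd.DataFrame({
    "Object Unique ID": [123, 123, 701, 701, 701],
    "Level": ["Object", "Component", "Object", "Component", "Component"],
    "cc_license": ["Attribution-NonCommercial-ShareAlike"] * 2
                  + ["Attribution"] * 3,
})

# One DataFrame per license value, keyed by the license string.
subsets = {
    license_name: group
    for license_name, group in toy_cc.groupby("cc_license")
}

for license_name, group in subsets.items():
    print(license_name, len(group))
    # To save each subset, something like the following could be used
    # (hypothetical file name):
    # group.to_csv(f"cil_{license_name}.csv", index=False)
```

Each entry in `subsets` is equivalent to the corresponding value filter, for example `toy_cc[toy_cc['cc_license'] == 'Attribution']`.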