# iDigBio load

We will need labeled images for training our classifier models. iDigBio has many multi-modal records that we can use as training data.

We first need to go to the [iDigBio site](https://www.idigbio.org/portal/search) and download a snapshot of the data we can use for this project. The downloads can be large and are often over 75 GB zipped. The snapshot contains several CSV files; we are interested in three of them:
- multimedia.csv: Has links to the image files
- occurrence.csv: Contains most of the data about taxonomy and locations
- occurrence_raw.csv: Holds the raw data that we will harvest for determining if an image shows flowers, etc.

Our strategy for getting this data is:
1. Get all multimedia records. This is the smallest file, and restricting the other tables to records that match it will greatly reduce the DB size.
1. Then get all occurrence and occurrence_raw records that have a matching multimedia record. We will also remove any records that are for fossils or records without a genus.
1. Then we will filter the records in another script to harvest Angiosperms.
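
The first two steps above can be sketched on toy data. This is only an illustration, not the project's loader: the CSV rows are made up, the table layouts are simplified, and the fossil test (a substring check on `dwc:basisOfRecord`) is an assumption about how fossils might be recognized.

```python
import csv
import io
import sqlite3

# Made-up stand-ins for the snapshot CSVs (illustration only).
MULTIMEDIA_CSV = """coreid,ac:accessURI
a1,https://example.org/a1.jpg
b2,https://example.org/b2.jpg
c3,https://example.org/c3.jpg
"""

OCCURRENCE_CSV = """coreid,dwc:basisOfRecord,dwc:genus
a1,preservedspecimen,quercus
b2,fossilspecimen,ginkgo
c3,preservedspecimen,
d4,preservedspecimen,acer
"""


def load_sketch(cxn):
    # Step 1: load every multimedia record.
    cxn.execute('create table multimedia (coreid text, accessURI text)')
    rows = csv.DictReader(io.StringIO(MULTIMEDIA_CSV))
    cxn.executemany(
        'insert into multimedia values (?, ?)',
        [(r['coreid'], r['ac:accessURI']) for r in rows],
    )
    coreids = {r[0] for r in cxn.execute('select coreid from multimedia')}

    # Step 2: keep only occurrences that have a matching multimedia record,
    # skipping fossils (assumed test) and records without a genus.
    cxn.execute(
        'create table occurrence (coreid text, basisOfRecord text, genus text)'
    )
    keep = []
    for r in csv.DictReader(io.StringIO(OCCURRENCE_CSV)):
        if r['coreid'] not in coreids:
            continue
        if 'fossil' in r['dwc:basisOfRecord'].lower():
            continue
        if not r['dwc:genus'].strip():
            continue
        keep.append((r['coreid'], r['dwc:basisOfRecord'], r['dwc:genus']))
    cxn.executemany('insert into occurrence values (?, ?, ?)', keep)


cxn = sqlite3.connect(':memory:')
load_sketch(cxn)
kept = [r[0] for r in cxn.execute('select coreid from occurrence')]
print(kept)
```

In this toy data only `a1` survives: `b2` is a fossil, `c3` has no genus, and `d4` has no multimedia record.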


## Setup

Add the parent directory to the path so the notebook can find this project's library files.

In [1]:
import sys

sys.path.append('..')

In [2]:
import sqlite3

Rather than duplicate script code in this notebook, I will call functions from the script to show what is going on. The module imported below is the one I use to load iDigBio data into the SQLite3 database.

In [3]:
from pathlib import Path

from herbarium.pylib.idigbio import idigbio_utils

## Examine the structure of the iDigBio snapshot

As I said, this file is very large and I don't expect you to download it yourself, but I did want to show what's in it.

This is the list of the files it contains.

In [4]:
DATA_DIR = Path('..') / 'data'
ZIP_FILE = DATA_DIR / 'iDigBio_snapshot_2021-02.zip'

In [5]:
idigbio_utils.show_csv_files(ZIP_FILE)

['occurrence.csv',
 'multimedia_raw.csv',
 'multimedia.csv',
 'occurrence_raw.csv',
 'records.citation.txt',
 'mediarecords.citation.txt',
 'meta.xml']
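
The source of `idigbio_utils` is not shown here, so the following is only a guess at how a zipped snapshot can be inspected without extracting it. It uses the standard `zipfile` and `csv` modules. The helper names `list_zip_members` and `read_csv_header` are hypothetical, and a tiny in-memory archive stands in for the real snapshot.

```python
import csv
import io
import zipfile


def list_zip_members(zip_path):
    """Return the names of the files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as zippy:
        return zippy.namelist()


def read_csv_header(zip_path, csv_name):
    """Read only the header row of one CSV inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zippy:
        with zippy.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            return next(csv.reader(text))


# Build a tiny stand-in archive in memory so the sketch runs
# without the real snapshot.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zippy:
    zippy.writestr(
        'multimedia.csv',
        'coreid,ac:accessURI\na1,https://example.org/a1.jpg\n',
    )
    zippy.writestr('meta.xml', '<archive/>')

names = list_zip_members(buffer)
header = read_csv_header(buffer, 'multimedia.csv')
print(names)
print(header)
```

Reading only the first row matters with an archive this size, because the member is streamed and decompressed lazily rather than loaded whole.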

### multimedia.csv

Here are the fields in `multimedia.csv`.

In [7]:
fields = idigbio_utils.show_csv_headers(ZIP_FILE, 'multimedia.csv')
' '.join(fields)

'ac:accessURI ac:licenseLogoURL ac:tag coreid dc:type dcterms:format dcterms:modified dcterms:rights exif:PixelXDimension exif:PixelYDimension idigbio:dataQualityScore idigbio:dateModified idigbio:etag idigbio:flags idigbio:hasSpecimen idigbio:mediaType idigbio:recordIds idigbio:records idigbio:recordsets idigbio:uuid idigbio:version xmpRights:WebStatement'

We're only interested in the two fields listed below. The columns get renamed in the database by dropping any prefix before the colon. So, `ac:accessURI` becomes `accessURI`.

In [9]:
list(idigbio_utils.MULTIMEDIA)

['coreid', 'ac:accessURI']
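
The renaming rule is simple enough to sketch in a few lines. The function name `drop_prefix` is hypothetical and may not match what the loading script actually does.

```python
def drop_prefix(column):
    """Drop any namespace prefix before the first colon."""
    return column.split(':', 1)[-1]


fields = ['coreid', 'ac:accessURI']
renamed = [drop_prefix(f) for f in fields]
print(renamed)
```

A column without a colon, such as `coreid`, passes through unchanged.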

### occurrence.csv

In [11]:
fields = idigbio_utils.show_csv_headers(ZIP_FILE, 'occurrence.csv')
' '.join(fields)

'coreid dwc:basisOfRecord dwc:bed dwc:catalogNumber dwc:class dwc:collectionCode dwc:collectionID dwc:continent dwc:coordinateUncertaintyInMeters dwc:country dwc:county dwc:earliestAgeOrLowestStage dwc:earliestEonOrLowestEonothem dwc:earliestEpochOrLowestSeries dwc:earliestEraOrLowestErathem dwc:earliestPeriodOrLowestSystem dwc:eventDate dwc:family dwc:fieldNumber dwc:formation dwc:genus dwc:geologicalContextID dwc:group dwc:higherClassification dwc:highestBiostratigraphicZone dwc:individualCount dwc:infraspecificEpithet dwc:institutionCode dwc:institutionID dwc:kingdom dwc:latestAgeOrHighestStage dwc:latestEonOrHighestEonothem dwc:latestEpochOrHighestSeries dwc:latestEraOrHighestErathem dwc:latestPeriodOrHighestSystem dwc:lithostratigraphicTerms dwc:locality dwc:lowestBiostratigraphicZone dwc:maximumDepthInMeters dwc:maximumElevationInMeters dwc:member dwc:minimumDepthInMeters dwc:minimumElevationInMeters dwc:municipality dwc:occurrenceID dwc:order dwc:phylum dwc:recordNumber dwc:reco

In [14]:
list(idigbio_utils.OCCURRENCE)

['coreid',
 'dwc:basisOfRecord',
 'dwc:kingdom',
 'dwc:phylum',
 'dwc:class',
 'dwc:order',
 'dwc:family',
 'dwc:genus',
 'dwc:specificEpithet',
 'dwc:scientificName',
 'dwc:eventDate',
 'dwc:continent',
 'dwc:country',
 'dwc:stateProvince',
 'dwc:county',
 'dwc:locality',
 'idigbio:geoPoint']

### occurrence_raw.csv

In [13]:
fields = idigbio_utils.show_csv_headers(ZIP_FILE, 'occurrence_raw.csv')
' '.join(fields)

'aec:associatedTaxa coreid dc:rights dcterms:accessRights dcterms:bibliographicCitation dcterms:language dcterms:license dcterms:modified dcterms:references dcterms:rights dcterms:rightsHolder dcterms:source dcterms:type dwc:Identification dwc:MeasurementOrFact dwc:ResourceRelationship dwc:VerbatimEventDate dwc:acceptedNameUsage dwc:accessRights dwc:associatedMedia dwc:associatedOccurrences dwc:associatedOrganisms dwc:associatedReferences dwc:associatedSequences dwc:associatedTaxa dwc:basisOfRecord dwc:bed dwc:behavior dwc:catalogNumber dwc:class dwc:classs dwc:collectionCode dwc:collectionID dwc:continent dwc:coordinatePrecision dwc:coordinateUncertaintyInMeters dwc:country dwc:countryCode dwc:county dwc:dataGeneralizations dwc:datasetID dwc:datasetName dwc:dateIdentified dwc:day dwc:decimalLatitude dwc:decimalLongitude dwc:disposition dwc:dynamicProperties dwc:earliestAgeOrLowestStage dwc:earliestEonOrLowestEonothem dwc:earliestEpochOrLowestSeries dwc:earliestEraOrLowestErathem dwc:e

In [15]:
list(idigbio_utils.OCCURRENCE_RAW)

['coreid',
 'dwc:reproductiveCondition',
 'dwc:occurrenceRemarks',
 'dwc:dynamicProperties',
 'dwc:fieldNotes']

## How big are the data files and database?

You can't see the files, so these listings give you an idea of their sizes.

In [16]:
!ls -lh ../data/idigbio_2021-02.sqlite

-rw-r--r-- 1 rafe rafe 20G Oct 28 15:06 ../data/idigbio_2021-02.sqlite


In [17]:
!ls -lh ../../misc/idigbio/iDigBio_snapshot_2021-02.zip

-rw-rw-r-- 1 rafe rafe 56G Feb 16  2021 ../../misc/idigbio/iDigBio_snapshot_2021-02.zip


## How many records are in the database?

In [18]:
DB = DATA_DIR / 'idigbio_2021-02.sqlite'

In [19]:
with sqlite3.connect(DB) as cxn:
    for table in ['multimedia', 'occurrence', 'occurrence_raw']:
        sql = f'select count(*) from {table}'
        result = cxn.execute(sql)
        count = result.fetchone()[0]
        print(f'{table}: {count}')

multimedia: 40907454
occurrence: 34982246
occurrence_raw: 35007636


## Next we need to filter out all non-Angiosperm records

I'll do this in another notebook.