In [1]:
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# GBIF - The Global Biodiversity Information Facility

This notebook shows how to access biological occurrence data from GBIF, a wonderful resource for biologists and anyone interested in biological data. 


## API Interface
We'll start by making API calls to the Occurrence API, which holds records of where and when a particular organism was observed.

In [2]:
base_url = "https://api.gbif.org/v1/occurrence/search?"

Many additional parameters are available. They are listed below, and each one is described at the bottom of this page: https://www.gbif.org/developer/occurrence.

__Optional Parameters:__ q, basisOfRecord, catalogNumber, collectionCode, continent, country, datasetKey, decimalLatitude, decimalLongitude, depth, elevation, eventDate, geometry, hasCoordinate, hasGeospatialIssue, institutionCode, issue, lastInterpreted, mediaType, month, occurrenceId, organismId, protocol, license, publishingCountry, publishingOrg, crawlId, recordedBy, recordNumber, scientificName, locality, stateProvince, waterBody, taxonKey, kingdomKey, phylumKey, classKey, orderKey, familyKey, genusKey, subGenusKey, speciesKey, year, establishmentMeans, repatriated, typeStatus, facet, facetMincount, facetMultiselect, paging

For this example, I'll use the genusKey, hasCoordinate, and hasGeospatialIssue parameters. To find the genusKey for my genus of interest, I first have to search with 'q', which is a general full-text query parameter.
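
As a rough sketch, the query string for these parameters can be assembled from a dictionary rather than by joining strings by hand. The names `SEARCH_URL`, `build_query_url`, and `query_params` are made up for this illustration and do not appear elsewhere in the notebook. The sketch only builds the URL and does not contact GBIF.

```python
from urllib.parse import urlencode

# Endpoint from this notebook, written here without the trailing "?"
SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"

def build_query_url(url, params):
    """Return the URL with params appended as a percent-encoded query string."""
    return f"{url}?{urlencode(params)}"

# Parameter values mirror the exploratory request made in this notebook
query_params = {
    "limit": 300,
    "q": "Ayenia",
    "hasCoordinate": "true",
    "hasGeospatialIssue": "false",
}

query_url = build_query_url(SEARCH_URL, query_params)
print(query_url)
```

`requests.get` also accepts a dictionary through its `params` argument and encodes it in the same way, so building the string manually is optional.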

In [20]:
#base_url already ends in '?', so the first parameter doesn't need a leading '&'
response = requests.get(base_url+'limit=300'+'&q=Ayenia'+'&hasCoordinate=true'+'&hasGeospatialIssue=false')

In [21]:
response.status_code

200

In [22]:
trial = response.json()

In [23]:
trial.keys()

dict_keys(['offset', 'limit', 'endOfRecords', 'count', 'results', 'facets'])

In [24]:
len(trial['results'])

300

In [27]:
trial['endOfRecords']

False

In [43]:
#Get the genusKey for Ayenia
trial['results'][0]['genusKey']

3152178
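
Because 'q' is a full-text search, the first record may not always belong to the genus we searched for. A slightly more cautious sketch is to take the most frequent genusKey among records whose `genus` field matches. The function name `find_genus_key` and the toy records are hypothetical; only the key 3152178 comes from the output above.

```python
from collections import Counter

def find_genus_key(results, genus):
    """Return the most frequent genusKey among records of the given genus, or None."""
    keys = [
        rec["genusKey"]
        for rec in results
        if rec.get("genus") == genus and "genusKey" in rec
    ]
    if not keys:
        return None
    return Counter(keys).most_common(1)[0][0]

# Toy records standing in for trial['results']
sample_results = [
    {"genus": "Ayenia", "genusKey": 3152178},
    {"genus": "Ayenia", "genusKey": 3152178},
    {"genus": "Examplegenus", "genusKey": 1111111},  # made-up genus and key
    {"scientificName": "record without genus fields"},
]

print(find_genus_key(sample_results, "Ayenia"))
```

With real data, the call would be `find_genus_key(trial['results'], 'Ayenia')`.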

Based on these few lines of exploratory code, we can now write a script that will download all of the data for this genus. The responses to the query are paginated, meaning we get at most 300 records per request. The `'offset'` parameter gives the zero-based position of the first record on the page. The `'endOfRecords'` field tells us whether the last record of the query is included in the response. Knowing this, we want to construct a script that will keep downloading new pages until `'endOfRecords'` is `True`. Each time, we'll concatenate the records that are stored in `'results'` to a Pandas DataFrame.
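
The paging logic can be tried offline before touching the API. In this sketch, `fake_fetch` is a hypothetical stand-in for one request; it returns a dictionary with the same `'offset'`, `'limit'`, `'endOfRecords'`, and `'results'` keys we saw in the real response. `fetch_all` is likewise an illustrative name.

```python
def fake_fetch(offset, limit, total=7):
    """Stand-in for one API call: returns a page shaped like a search response."""
    records = list(range(total))[offset:offset + limit]
    return {
        "offset": offset,
        "limit": limit,
        "endOfRecords": offset + limit >= total,
        "results": records,
    }

def fetch_all(fetch_page, limit):
    """Request pages until endOfRecords is True and return every record."""
    all_records = []
    offset = 0
    end_of_records = False
    while not end_of_records:
        page = fetch_page(offset, limit)
        if not page["results"]:
            break  # guard against looping forever on an empty page
        all_records.extend(page["results"])
        end_of_records = page["endOfRecords"]
        # Offsets are zero-based: the next page starts right after the last record received
        offset += len(page["results"])
    return all_records

print(fetch_all(fake_fetch, limit=3))
```

Advancing the offset by the number of records actually received means no record is skipped or fetched twice, even when the last page is shorter than the limit.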

In [45]:
def get_GBIF_response(base_url, offset, params, df):
    """Performs an API call to the base URL with the additional parameters listed in 'params'.
    Returns 'df' with the new records appended, the 'endOfRecords' flag, and the status code."""
    #Construct the query URL (base_url already ends in '?')
    query = base_url+f'offset={offset}'
    for each in params:
        query = query+'&'+each
    #Call API
    response = requests.get(query)
    #If the call fails, return df unchanged and flag the end of records so the calling loop stops
    if response.status_code != 200:
        print(f"API call failed at offset {offset} with a status code of {response.status_code}.")
        return df, True, response.status_code
    #If call is successful, add data to df
    result = response.json()
    df_concat = pd.concat([df, pd.DataFrame.from_dict(result['results'])], axis = 0, ignore_index = True, sort = True)
    endOfRecords = result['endOfRecords']
    return df_concat, endOfRecords, response.status_code
    

In [44]:
params = ['limit=300', 'genusKey=3152178', 'hasCoordinate=true', 'hasGeospatialIssue=false']

In [46]:
df = pd.DataFrame()
endOfRecords = False
offset = 0
status = 200

while not endOfRecords and status == 200:
    df, endOfRecords, status = get_GBIF_response(base_url, offset, params, df)
    #Offsets are zero-based, so the next page starts at the number of records downloaded so far
    offset = len(df)


## Data Cleaning

Now that we have our DataFrame with all the records in it, I'll first save the full DataFrame as a CSV file and then start cleaning up the data. A lot of the cleaning I'm going to do is based on my *very specialized* knowledge of this genus. 

In [54]:
#index = False keeps the row index from being written as an extra unnamed column
df.to_csv("../data/Ayenia_full_dataframe.csv", index = False)

In [2]:
#Read back in csv file
df = pd.read_csv("../data/Ayenia_full_dataframe.csv")
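
By default, `to_csv` also writes the row index, and `read_csv` then reads it back as a column named `Unnamed: 0`. Here is a small sketch of the difference that `index=False` makes, using a toy DataFrame and an in-memory buffer instead of a file. The variable names are made up for the illustration.

```python
import io
import pandas as pd

# Toy DataFrame standing in for the full GBIF download
toy = pd.DataFrame({"species": ["Ayenia tomentosa"], "country": ["Brazil"]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

If a file was already saved with its index, passing `index_col=0` to `read_csv` is another way to avoid the extra column.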


In [51]:
trial['results'][0].keys()

dict_keys(['key', 'datasetKey', 'publishingOrgKey', 'installationKey', 'publishingCountry', 'protocol', 'lastCrawled', 'lastParsed', 'crawlId', 'extensions', 'basisOfRecord', 'taxonKey', 'kingdomKey', 'phylumKey', 'classKey', 'orderKey', 'familyKey', 'genusKey', 'speciesKey', 'acceptedTaxonKey', 'scientificName', 'acceptedScientificName', 'kingdom', 'phylum', 'order', 'family', 'genus', 'species', 'genericName', 'specificEpithet', 'taxonRank', 'taxonomicStatus', 'decimalLongitude', 'decimalLatitude', 'year', 'month', 'day', 'eventDate', 'issues', 'modified', 'lastInterpreted', 'references', 'license', 'identifiers', 'media', 'facts', 'relations', 'geodeticDatum', 'class', 'countryCode', 'recordedByIDs', 'identifiedByIDs', 'country', 'identifier', 'recordedBy', 'created', 'locality', 'gbifID', 'occurrenceID', 'associatedSequences', 'taxonID', 'higherClassification'])

Let's drop the records whose names aren't real species or that are missing a species-level identification.

In [3]:
df_cleanedspecies = df.drop(index = df[(df['species'] == 'Ayenia villicocca') 
                                       | (df['species'] == 'Ayenia echinococca') 
                                       | (df['species'].isna())
                                       | (df['specificEpithet'] == 'villicocca')
                                       | (df['specificEpithet'] == 'echinococca')
                                       | (df['specificEpithet'].isna())
                                       | (df['scientificName'] == 'Ayenia L.')].index)
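
The same kind of filter can be written more compactly with `isin` and a boolean mask. This is only a sketch on toy data: the DataFrame `toy` and the names `excluded_species`, `excluded_epithets`, and `drop_mask` are invented for the illustration, while the excluded names and the `'Ayenia L.'` value come from the cell above.

```python
import pandas as pd

# Toy rows standing in for the GBIF records
toy = pd.DataFrame({
    "species": ["Ayenia tomentosa", "Ayenia villicocca", None, "Ayenia echinococca"],
    "specificEpithet": ["tomentosa", "villicocca", None, "echinococca"],
    "scientificName": ["Ayenia tomentosa", "Ayenia villicocca", "Ayenia L.", "Ayenia echinococca"],
})

excluded_species = ["Ayenia villicocca", "Ayenia echinococca"]
excluded_epithets = ["villicocca", "echinococca"]

# True for every row that should be removed
drop_mask = (
    toy["species"].isin(excluded_species)
    | toy["specificEpithet"].isin(excluded_epithets)
    | toy["species"].isna()
    | toy["specificEpithet"].isna()
    | (toy["scientificName"] == "Ayenia L.")
)

# Keep the remaining rows and renumber them from zero
kept = toy[~drop_mask].reset_index(drop=True)
print(kept["species"].tolist())
```

Keeping the excluded names in lists makes it easier to add another name later without extending the chain of comparisons.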

In [4]:
#drop = True discards the old index instead of adding it as a new column
df_cleanedspecies.reset_index(drop = True, inplace = True)

Next, I want to drop the records that aren't based on a real specimen preserved in a museum. GBIF includes observations from iNaturalist, another really cool biological data source based on people uploading pictures of organisms they see. However, observations on iNaturalist may not be checked by specialists in the field, so the species identification may not be right.

In [5]:
df_cleanedobservations = df_cleanedspecies.drop(index = df_cleanedspecies[df_cleanedspecies['basisOfRecord'] != 'PRESERVED_SPECIMEN'].index)
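
Before filtering, it can help to see how many records fall under each `basisOfRecord` value. Below is a sketch on toy data, where the DataFrame `toy` and the names `counts` and `specimens` are made up for the illustration. It counts the values and then keeps only the preserved specimens with a boolean mask, which has the same effect as dropping the non-matching rows by index.

```python
import pandas as pd

# Toy rows standing in for df_cleanedspecies
toy = pd.DataFrame({
    "species": ["Ayenia tomentosa"] * 3,
    "basisOfRecord": ["PRESERVED_SPECIMEN", "HUMAN_OBSERVATION", "PRESERVED_SPECIMEN"],
})

# How many records of each kind are there?
counts = toy["basisOfRecord"].value_counts()

# Keep only museum specimens and renumber the rows from zero
specimens = toy[toy["basisOfRecord"] == "PRESERVED_SPECIMEN"].reset_index(drop=True)

print(counts.to_dict())
print(len(specimens))
```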

In [6]:
#drop = True discards the old index instead of adding it as a new column
df_cleanedobservations.reset_index(drop = True, inplace = True)

Finally, let's just keep a few of the columns that we'll need for making a biodiversity heatmap. There's a lot of extra stuff in here that we just don't need.

In [7]:
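#.copy() makes df_cleaned an independent DataFrame rather than a slice of df_cleanedobservations
df_cleaned = df_cleanedobservations[['species', 'decimalLongitude', 'decimalLatitude', 'country']].copy()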

In [8]:
df_cleaned.head()

| | species | decimalLongitude | decimalLatitude | country |
|---|---|---|---|---|
| 0 | Ayenia tomentosa | -40.700556 | -19.53 | Brazil |
| 1 | Ayenia tomentosa | -40.700556 | -19.53 | Brazil |
| 2 | Ayenia tomentosa | -38.646389 | -8.592222 | Brazil |
| 3 | Ayenia tomentosa | -38.023611 | -8.105278 | Brazil |
| 4 | Ayenia tomentosa | -38.113056 | -8.592222 | Brazil |


In [78]:
df_cleaned.to_csv("../data/Ayenia_cleaned_dataframe.csv")

Now that we have our cleaned data, we can try mapping it next! We'll move to a new notebook for that. 