# DATA271 Final Project – Impact of Weather Patterns on Swainson Hawk Migration in California
---

## Research Details
---

### Introducing the Problem
This project performs a statistical investigative process to explore and analyze the Swainson hawk tracking data from GBIF and Movebank, alongside meteorological datasets across multiple years. The goal in this notebook will be to identify environmental factors that influence their movement patterns across California. We aim to understand how temperature, precipitation, and other atmospheric conditions affect **when** and **where** these birds migrate—and whether these patterns are shifting in response to broader climate variability or relationships with their natural prey. The articles below come from reputable sources which address this concern in more detail:

- **[National Environmental Education Foundation article: “How Climate Change Is Changing Animal Habits”](https://www.neefusa.org/story/climate-change/how-climate-change-changing-animal-habits)** – NEEF is the nation’s leading organization in lifelong environmental learning, creating opportunities for people of all ages to experience and learn about the environment in ways that improve their lives and the health of the planet.  
- **[BioMed Central publication: “Climate-change impacts on animal migration”](https://climatechangeresponses.biomedcentral.com/articles/10.1186/s40665-015-0013-9)** – BMC publishes 250+ open-access journals, advancing progress in biology, health sciences, and medicine.  
- **[University of Maryland Division of Research publication: “Global-scale Animal Ecology Reveals Behavioral Changes in Response to Climate Change”](https://research.umd.edu/articles/global-scale-animal-ecology-reveals-behavioral-changes-response-climate-change)** – Highlights the UMD research enterprise’s contribution to understanding climate-driven behavioral shifts.  
- **[National Institutes of Health article: “Crossing regimes of temperature dependence in animal movement”](https://pubmed.ncbi.nlm.nih.gov/26854767/)** – NIH research linking temperature, metabolism, and movement dynamics.
- **[UC Santa Cruz news article about climate shifts and animal relocation](https://www.scientificamerican.com/article/climate-change-is-driving-animal-migration/)** - UCSC article on climate shifts and the impact it has on animal relocation 

Animals movements are mainly attributed to **food and water**, **safety**, and broader **migratory cycles**, all of which are influenced by weather and climate signals that can disrupt internal timing mechanisms. With climate change becoming a larger discussion, we hope to identify discrepancies in typical flight patterns that could potentially be caused by environmental variable variation.

Again, the species I’ll be investigating in particular are *Swainson’s Hawks (Buteo swainsoni)* and their primary food source, *Grasshoppers (Acrididae)*. I chose them because (i) high-quality data are available within the target time frame, (ii) both are extensively tracked in California, and (iii) birds provide direct insight into atmospheric effects like flight behavior and cover long distances quickly, which can help recognize flight patterns. I sourced the tracking data from Movebank’s digital library of archived studies, along with GBIF's repository of occurrence data from indiviudals and machines.

I'll list a couple of resources in case you're curious about the subjects of my analysis, both contain data related to the species and details worth keeping in mind while conducting my analysis on their movements.

- **[California Department of Fish and Wildlife report on Swainson Hawks](https://wildlife.ca.gov/Conservation/Birds/Swainsons-Hawk)**
- **[UC Statewide IPM PDF on Grasshoppers](https://ipm.ucanr.edu/legacy_assets/PDF/PESTNOTES/pngrasshoppers.pdf)**

Below is the Movebank study I'll be siphoning the data from, since researchers are allowed to send their data to Movebank for peer review and publishing to their site, which will provide public biodiversity data to the general public.

- **[Space use by Swainson's hawk (Buteo swainsoni) in the Natomas Basin, California](https://datarepository.movebank.org/entities/datapackage/fb9f260b-b3fa-4c3b-a059-847341c43998)** – GPS-tracked Swainson’s Hawks ranged widely (87–172 km²) but focused activity in grassland and alfalfa; space-use intensity was unaffected by sex, offspring stage, or nest success.  

For context, Movebank is a free, web-based platform (launched 2007) where researchers upload, manage, analyze, share, and archive animal-tracking data. Hosted by the Max Planck Institute of Animal Behavior with U.S. and German partners, it offers live data feeds, Env-DATA environmental annotation, MoveApps no-code workflows, REST/R APIs, and a DOI-granting repository. As of early 2025 it hosts **9 000+ studies**, **1 500+ species**, and roughly **8 billion** location records (~11 million new positions daily).

We'll also be using the **Global Biodiversity Information Facility (GBIF)** to provide species and occurrence data, adding to our repository of geospatial tracking data overtime. The Global Biodiversity Information Facility (GBIF) is an international open-data infrastructure funded by governments worldwide. Its mission is to provide free and open access to biodiversity data collected across the globe, helping researchers, policymakers, and conservationists better understand and protect life on Earth.

In regard to the climatological and meterological data we'll be pairing with our animal tracking records, I'll be using the **National Oceanic and Atmospheric Administration (NOAA)** to gather values associated with weather readings that are collected across multiple sites in California. The National Oceanic and Atmospheric Administration (NOAA) is a scientific agency within the U.S. Department of Commerce. It focuses on the conditions of the oceans, major waterways, and the atmosphere, and plays a vital role in monitoring environmental systems and delivering real-time data for public and scientific use.

By integrating Movebank tracks with data from the CDFW and records from the NOAA, this project seeks to uncover how atmospheric anomalies correlate with migration timing, routes, and habitat choices.

---

### Addressing the Problem
Our spatiotemporal analysis will join datasets on geographic coordinates and dates, exploring time-series trends in weather and movement (e.g., earlier arrivals during warmer winters or retreats from drought-affected regions). If possible, we’ll also test whether predator–prey interactions confound purely weather-driven explanations.

---

### Analysis Breakdown
We will ask:

- _Which atmospheric variables (temperature, precipitation, wind) correlate most strongly with migration timing or intensity?_
- _Are there climate thresholds or seasonal shifts that predict migratory events?_
- _Can spatial–temporal mapping reveal significant trends over time?_
- _How might continued climate variability affect future migration predictability?_
- _Do predator–prey relationships better explain some shifts than weather alone?_

Stages:

1. **Explore individual datasets** – Clean and summarize; generate baseline stats and visuals.  
2. **Analyze combined datasets** – Merge movement and climate data; create layered time-series plots and heat maps.  
3. **Evaluate significance and limits** – Assess correlation strength versus ecological plausibility.  
4. **Answer research questions** – Compare findings against initial hypotheses.  
5. **Recommendations & further exploration** – Suggest conservation actions and identify data gaps.

---

### Dataset Sources
1. [NOAA National Centers for Environmental Information](https://www.ncei.noaa.gov/) – Historical and real-time climate data  
2. [California Department of Fish and Wildlife](https://data-cdfw.opendata.arcgis.com/search?q=barn%20owl&tags=species) – Public CDFW spatial data  
3. [Movebank](https://www.movebank.org/) – Open animal-movement data across species  

---

### Libraries & Modules
- **Pandas** – Time-series wrangling, dataset merges, and general data manipulation  
- **Numpy** – Scientific computing and vectorized calculations  
- **Matplotlib & Seaborn** – Temporal and distributional visualizations (static plots)  
- **Plotly** – Interactive plots and geographic data visualization  
- **Plotnine** – Grammar of graphics for structured, layered visualizations (ggplot2-style)  
- **Geopandas** – Spatial joins and mapping of migration paths and weather zones  
- **pygbif** – Access to GBIF species, taxonomy, and occurrence data via API  
- **Requests** – HTTP requests for accessing web APIs (e.g., NOAA, GBIF)  
- **HTTPBasicAuth** – Authentication for APIs requiring secure access  
- **StringIO** – Parsing text-based API responses into pandas DataFrames  
- **datetime / timedelta** – Handling time deltas, date slicing, and datetime construction  
- **time / sleep** – Rate-limiting requests, simulating pauses for API pagination or throttling   

---

### Project Resources
- [GitHub Repository](https://github.com/toritotony/Data271FinalProject)


## Collecting Data

In [1]:
%%capture
!pip install plotnine geopandas pygbif

In [2]:
%%capture
!pip --upgrade pandas

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import numpy as np
from plotnine import *
import geopandas
import plotly
import requests
from time import *
from requests.auth import HTTPBasicAuth
from io import StringIO
import time
from pygbif import species, registry, occurrences
from datetime import timedelta

Let's start by exploring the data available to us from GBIF, which we can hopefully use to gather data on the species we'll be studying, and their occurrences across California. First we'll look at the datasets; a GBIF dataset provides occurrence data, which will drive our analysis by allowing us to track their movements overtime. 

Publishing organizations register datasets in this Registry, and the data they reference is retrieved and indexed in GBIF's occurrence store on a regular schedule. We'll ideally be pairing this with Movebanks to have a diverse dataset with less bias towards a certain source.

Before that however, let's retrieve the taxon ids for our subjects

In [4]:
def get_taxon_key(scientific_name):
    response = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": scientific_name}
    )
    data = response.json()
    return data.get("usageKey")

hawk_gbif_key = get_taxon_key("Buteo swainsoni")
acrididae_gbif_key = get_taxon_key("Acrididae")

print(f"Swainson's Hawk taxonKey: {hawk_gbif_key}")
print(f"Acrididae taxonKey: {acrididae_gbif_key}")

Swainson's Hawk taxonKey: 2480562
Acrididae taxonKey: 9394


Now that we have the taxonKeys, let's use it for a dataset search on occurrences for Swainson Hawks

In [5]:
hawk_gbif_datasets = registry.dataset_search(taxonKey=hawk_gbif_key, type='OCCURRENCE')
for ds in hawk_gbif_datasets['results']:
    print(f"{ds['title']} (UUID: {ds['key']}) \n")

It looks like nothing comes up, let's try using the query parameter to account for datasets that have data but maybe didn't include the taxonKey

In [6]:
hawk_gbif_datasets = registry.dataset_search(q="Swainson Hawk", type='OCCURRENCE', publishingCountry="US")
for ds in hawk_gbif_datasets['results']:
    print(f"{ds['title']} (UUID: {ds['key']})")

Records of Hawk Moths (Sphingidae) from Vermont, USA (UUID: 97f3fa73-3195-4af9-a14e-07030f00db96)
Hawk Migration Association of North America - HawkCount (UUID: 8648fdbe-f762-11e1-a439-00145eb45e9a)
Ecological Baseline Studies of the U.S. Outer Continental Shelf Option Year 2 (UUID: 8bfd90da-4ab9-40a2-8c96-bd7c2c9e9fc5)
Ecological Baseline Studies of the U.S. Outer Continental Shelf Option Year 1 (UUID: 874ff2f9-f515-435b-afb1-bc282ef3fa67)


Let's do the same for grasshoppers, searching for occurrence data sets that include the name to an extent, or something semantically similar

In [7]:
acrididae_gbif_datasets = registry.dataset_search(q="Grasshoppers", type='OCCURRENCE', publishingCountry="US")
for ds in acrididae_gbif_datasets['results']:
    print(f"{ds['title']} (UUID: {ds['key']})")

University of Michigan Museum of Zoology, Division of Insects (UUID: 13e7869e-0c76-473a-a227-53d6e3d6fbf2)
William F. Barr Entomological Museum (UUID: d4aea86c-bd2c-4526-ab57-c6a98dc057e5)
American Museum of Natural History (AMNH) Terrestrial Polyneoptera Collection (UUID: d475b3b1-1e5a-4d56-a547-f7afb810c4c5)


Now that we can confirm there are datasets containing occurrences for hawks and grasshoppers, we can start looking at the species information available from GBIF. Using the taxonKeys, their scientific name and common name, we can retrieve relevant information to the species that could help us later on

In [8]:
hawk_gbif_species = species.name_usage(key=hawk_gbif_key, language="eng")
print(hawk_gbif_species)

{'key': 2480562, 'nubKey': 2480562, 'nameKey': 1742843, 'taxonID': 'gbif:2480562', 'sourceTaxonKey': 172742067, 'kingdom': 'Animalia', 'phylum': 'Chordata', 'order': 'Accipitriformes', 'family': 'Accipitridae', 'genus': 'Buteo', 'species': 'Buteo swainsoni', 'kingdomKey': 1, 'phylumKey': 44, 'classKey': 212, 'orderKey': 7191147, 'familyKey': 2877, 'genusKey': 2480517, 'speciesKey': 2480562, 'datasetKey': 'd7dddbf4-2cf0-4f39-9b2a-bb099caae36c', 'constituentKey': '7ddf754f-d193-4cc9-b351-99906754a03b', 'parentKey': 2480517, 'parent': 'Buteo', 'scientificName': 'Buteo swainsoni Bonaparte, 1838', 'canonicalName': 'Buteo swainsoni', 'vernacularName': "Swainson's Hawk", 'authorship': 'Bonaparte, 1838', 'nameType': 'SCIENTIFIC', 'rank': 'SPECIES', 'origin': 'SOURCE', 'taxonomicStatus': 'ACCEPTED', 'nomenclaturalStatus': [], 'remarks': '', 'publishedIn': 'Bonaparte, C.L.J.L. (1838). A geographical and comparative list of the birds of Europe and North America.', 'numDescendants': 0, 'lastCrawle

Let's do the same for grasshoppers.

In [9]:
acrididae_gbif_species = species.name_usage(key=acrididae_gbif_key, language="eng")
print(acrididae_gbif_species)

{'key': 9394, 'nubKey': 9394, 'nameKey': 174293, 'taxonID': 'gbif:9394', 'sourceTaxonKey': 189946194, 'kingdom': 'Animalia', 'phylum': 'Arthropoda', 'order': 'Orthoptera', 'family': 'Acrididae', 'kingdomKey': 1, 'phylumKey': 54, 'classKey': 216, 'orderKey': 1458, 'familyKey': 9394, 'datasetKey': 'd7dddbf4-2cf0-4f39-9b2a-bb099caae36c', 'constituentKey': '7ddf754f-d193-4cc9-b351-99906754a03b', 'parentKey': 1458, 'parent': 'Orthoptera', 'scientificName': 'Acrididae', 'canonicalName': 'Acrididae', 'vernacularName': 'short-horned grasshopper', 'authorship': '', 'nameType': 'SCIENTIFIC', 'rank': 'FAMILY', 'origin': 'SOURCE', 'taxonomicStatus': 'ACCEPTED', 'nomenclaturalStatus': [], 'remarks': '', 'publishedIn': 'MacLeay, W.S. (1821) In Horae Entomologicae or Essays on the Annulose Animals. S. Bagster, London. Vol. 2. Available from http://www.biodiversitylibrary.org/item/103259#page/7/mode/1up', 'numDescendants': 10576, 'lastCrawled': '2023-08-22T23:20:59.545+00:00', 'lastInterpreted': '2023

It might be smart to define the bounds of California to narrow the data we're looking to extract

In [10]:
california_wkt = 'POLYGON((-124.48 32.53, -114.13 32.53, -114.13 42.01, -124.48 42.01, -124.48 32.53))'

Now, with the information suggesting that occurrence data is available for these animals, and given our CA bounds and the taxon keys, we can search for those occurrences and take a look. We'll be appending the data retrieved into a csv to save locally (to avoid rerunning this computationally heavy cell) and then import it as a dataframe in the next cell

In [11]:
def fetch_occurrences_to_csv(taxon_key, geometry, start_year, end_year, out_csv):
    offset = 0
    limit = 300
    first_write = True

    while True:
        response = occurrences.search(
            taxonKey=taxon_key,
            geometry=geometry,
            year=f"{start_year},{end_year}",
            hasCoordinate=True,
            limit=limit,
            offset=offset
        )

        results = response.get("results", [])
        if not results:
            break

        df_chunk = pd.json_normalize(results)
        df_chunk.to_csv(out_csv, mode='a', header=first_write, index=False)
        first_write = False

        offset += limit
        print(f"Wrote {len(df_chunk)} records at offset {offset}")
        time.sleep(1)

    print(f"Finished writing CSV to {out_csv}")

Now we'll call on this function for the Swainson Hawks, which will generate a pdf, which will be imported as a dataframe to avoid using the above function again (this cell takes roughly an hour to complete)

In [29]:
hawk_csv_path = "GBIF/SwainsonHawkGBIF.csv"
#fetch_occurrences_to_csv(hawk_gbif_key, california_wkt, 2013, 2023, hawk_csv_path)
hawk_df = pd.read_csv(hawk_csv_path, on_bad_lines='skip', low_memory=False)

Let's take a quick look at the head of the dataframe to make sure it was processed correctly. After collecting the datasets for our grasshoppers, we'll clean these and append them into a dataset to later incorporate in a singular dataset containing all occurrences or tracking updates for the animals we're investigating

In [30]:
hawk_df.head()

Unnamed: 0,key,datasetKey,publishingOrgKey,installationKey,hostingOrganizationKey,publishingCountry,protocol,lastCrawled,lastParsed,crawlId,...,georeferenceVerificationStatus,locality,municipality,identificationVerificationStatus,verbatimIdentification,individualCount,taxonConceptID,county,informationWithheld,lifeStage
0,4011915312,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T10:20:29.885+00:00,530,...,,,,,,,,,,
1,4028968442,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T10:22:46.194+00:00,530,...,,,,,,,,,,
2,4028827735,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T04:40:59.287+00:00,530,...,,,,,,,,,,
3,4103303596,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T04:37:20.170+00:00,530,...,,,,,,,,,,
4,4090593311,6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b,0c6d40e3-5d96-4a2d-9342-b02833aaa766,b07988d2-1b3d-4fdb-931a-2e852ddac2dd,0c6d40e3-5d96-4a2d-9342-b02833aaa766,GB,EML,2025-01-22T04:59:47.418+00:00,2025-02-01T20:15:49.245+00:00,11,...,Verified by data custodian,Pleasant Grove Waste Water Treatment Plant,Sabre City,Accepted,Buteo swainsoni,,,,,


We should split these by dataset keys, since we'll want to standardize them before combining them into a single GBIF dataset. We'll also be dividing these by hawk and grasshopper data, assuming that some datasets will contain records on occurrences for both species. Mainly, we'll want to make sure all the eventDates and eventTimes are in some standard before combining all hawk and grasshopper data from GBIF, so that errors cause a loss of data.

In [59]:
iNat_hawk_df = hawk_df.loc[hawk_df.datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7"]
iNat_hawk_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
0,4011915312,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.813019,-116.525763,2023-01-02T10:52,10:52:00-08:00,61.0,PRESENT
1,4028968442,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.995642,-118.096413,2023-01-26T09:00,09:00:00-08:00,483.0,PRESENT
2,4028827735,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.995499,-118.096327,2023-01-26T10:58,10:58:00-08:00,166.0,PRESENT
3,4103303596,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",38.570301,-121.730326,2023-01-24T15:02,15:02:00-08:00,4.0,PRESENT
72,4039458396,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",37.181739,-120.601336,2023-02-16T11:24,11:24:00-08:00,449.0,PRESENT


In [60]:
iNat_acrid_df = acrididae_df.loc[acrididae_df.datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7"]
iNat_acrid_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
0,4011493188,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Schistocerca nitens (Thunberg, 1815)",33.771625,-116.449113,2023-01-01T12:33:32,12:33:32-08:00,20.0,PRESENT
1,4014953736,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Schistocerca nitens (Thunberg, 1815)",34.732208,-117.336281,2023-01-02T08:08:41,08:08:41-08:00,5347.0,PRESENT
2,4011801343,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Schistocerca nitens (Thunberg, 1815)",34.28539,-119.274894,2023-01-01T13:13:45,13:13:45-08:00,227.0,PRESENT
3,4014816815,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Schistocerca nitens (Thunberg, 1815)",32.874442,-117.225288,2023-01-03T15:50:15,15:50:15-08:00,21.0,PRESENT
4,4847044924,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Trimerotropis pallidipennis (Burmeister, 1838)",33.600468,-114.520485,2023-01-03T10:08,10:08:00-07:00,,PRESENT


In [76]:
birda_hawk_df = hawk_df.loc[hawk_df.datasetKey == "6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b"]
birda_hawk_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
4,4090593311,6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b,"Buteo swainsoni Bonaparte, 1838",38.794622,-121.370722,2023-01-17T22:49:54.309Z,,,PRESENT
256,4090559924,6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b,"Buteo swainsoni Bonaparte, 1838",38.034858,-121.733744,2023-03-25T00:03:40.209Z,,,PRESENT
19488,4091315614,6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b,"Buteo swainsoni Bonaparte, 1838",-121.296766,NORTH_AMERICA,274,,California,PRESENT


In [63]:
birda_acrid_df = acrididae_df.loc[acrididae_df.datasetKey == "6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b"]
birda_acrid_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus


In [73]:
ebird_hawk_df = hawk_df.loc[hawk_df.datasetKey == "4fa7b334-ce0d-4e88-aaae-2e0c138d049e"]
ebird_hawk_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
5,4690323559,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",38.188446,-121.97644,2023-01-01,,,PRESENT
6,4773821493,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",38.188446,-121.97644,2023-01-01,,,PRESENT
7,4676490068,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",38.146843,-121.981544,2023-01-01,,,PRESENT
8,4658782557,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",33.81128,-116.51996,2023-01-02,,,PRESENT
9,4757165580,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",38.38116,-121.54239,2023-01-01,,,PRESENT


In [65]:
ebird_acrid_df = acrididae_df.loc[acrididae_df.datasetKey == "4fa7b334-ce0d-4e88-aaae-2e0c138d049e"]
ebird_acrid_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus


In [66]:
observOrg_hawk_df = hawk_df.loc[hawk_df.datasetKey == "8a863029-f435-446a-821e-275f4f641165"]
observOrg_hawk_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
41052,4056060496,8a863029-f435-446a-821e-275f4f641165,"Buteo swainsoni Bonaparte, 1838",-121.280704,NORTH_AMERICA,153,Stichting Observation International,California (CA),PRESENT
45208,2840528531,8a863029-f435-446a-821e-275f4f641165,"Buteo swainsoni Bonaparte, 1838",-121.452389,NORTH_AMERICA,116,,California (CA),PRESENT
59698,2839984961,8a863029-f435-446a-821e-275f4f641165,"Buteo swainsoni Bonaparte, 1838",-121.705517,NORTH_AMERICA,174,,California (CA),PRESENT


In [67]:
observOrg_acrid_df = acrididae_df.loc[acrididae_df.datasetKey == "8a863029-f435-446a-821e-275f4f641165"]
observOrg_acrid_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus


In [74]:
xeno_hawk_df = hawk_df.loc[hawk_df.datasetKey == "b1047888-ae52-4179-9dd5-5448ea342a24"]
xeno_hawk_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
53719,2243763610,b1047888-ae52-4179-9dd5-5448ea342a24,"Buteo swainsoni Bonaparte, 1838",-121.5624,NORTH_AMERICA,126,https://data.biodiversitydata.nl/xeno-canto/ob...,,PRESENT
53720,2243830160,b1047888-ae52-4179-9dd5-5448ea342a24,"Buteo swainsoni Bonaparte, 1838",-121.637,NORTH_AMERICA,129,https://data.biodiversitydata.nl/xeno-canto/ob...,,PRESENT


In [77]:
xeno_acrid_df = acrididae_df.loc[acrididae_df.datasetKey == "b1047888-ae52-4179-9dd5-5448ea342a24"]
xeno_acrid_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus


CHECK LOWEST KEY VALUE AND LOWEST DATE VALUE TO MAKE ASSUMPTION ABOUT HOW THEY TRACK DAYS SINCE START OF STUDY.

Create a function that takes each dataframe containing dataset-specific records across both species, and does the following:
- sorts them by date and then by key
- if the date 

In [31]:
acrididae_csv_path = "GBIF/AcrididaeGBIF.csv"
#fetch_occurrences_to_csv(acrididae_gbif_key, california_wkt, 2013, 2023, acrididae_csv_path)
acrididae_df = pd.read_csv(acrididae_csv_path, on_bad_lines='skip', low_memory=False)

Likewise, let's also take a peak at the head of our Acrididae occurrence dataframe.

In [32]:
acrididae_df.head()

Unnamed: 0,key,datasetKey,publishingOrgKey,installationKey,hostingOrganizationKey,publishingCountry,protocol,lastCrawled,lastParsed,crawlId,...,gadm.level1.gid,gadm.level1.name,gadm.level2.gid,gadm.level2.name,lifeStage,occurrenceRemarks,infraspecificEpithet,informationWithheld,sex,identificationRemarks
0,4011493188,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T04:26:05.332+00:00,530,...,USA.5_1,California,USA.5.33_1,Riverside,,,,,,
1,4014953736,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T04:26:09.006+00:00,530,...,USA.5_1,California,USA.5.36_1,San Bernardino,,,,,,
2,4011801343,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T10:20:30.305+00:00,530,...,USA.5_1,California,USA.5.56_1,Ventura,,,,,,
3,4014816815,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T04:26:15.819+00:00,530,...,USA.5_1,California,USA.5.37_1,San Diego,,,,,,
4,4847044924,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,US,DWC_ARCHIVE,2025-04-09T17:43:33.531+00:00,2025-04-16T04:26:20.761+00:00,530,...,USA.3_1,Arizona,USA.3.7_1,La Paz,,,,,,


Now that we've collected both the Swainson Hawk and Grasshopper occurrences, let's observe some properties of the data and clean it up before appending both to a main table containing both of their occurrences.

First we'll take a look at their column properties, including datatypes and the number of null values in each.

In [33]:
hawk_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72250 entries, 0 to 72249
Columns: 106 entries, key to lifeStage
dtypes: int64(10), object(96)
memory usage: 58.4+ MB


In [34]:
acrididae_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9900 entries, 0 to 9899
Data columns (total 99 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   key                                                 9900 non-null   int64  
 1   datasetKey                                          9900 non-null   object 
 2   publishingOrgKey                                    9900 non-null   object 
 3   installationKey                                     9900 non-null   object 
 4   hostingOrganizationKey                              9900 non-null   object 
 5   publishingCountry                                   9900 non-null   object 
 6   protocol                                            9900 non-null   object 
 7   lastCrawled                                         9900 non-null   object 
 8   lastParsed                                          9900 non-null   object 
 9

There's lots of columns, and we probably don't need all of them. Let's take a look at the sum of null values for both dataframes without their being an output limit hopefully. Otherwise, we'll devise a function to remove the ones with many null values, or use our knowledge of the analysis to determine which will be most helpful. We'll be changing pandas options to see all the columns, since there are too many for it to include by default.

In [35]:
# set options for pandas to include full output
pd.set_option('display.max_rows', None)     # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)        # Don't break lines
pd.set_option('display.max_colwidth', None) # Don't truncate column values
hawk_df.isna().sum()

key                                                       0
datasetKey                                                0
publishingOrgKey                                          0
installationKey                                           0
hostingOrganizationKey                                    0
publishingCountry                                         0
protocol                                                  0
lastCrawled                                               0
lastParsed                                                0
crawlId                                                   0
projectId                                              2407
basisOfRecord                                          2385
occurrenceStatus                                          0
taxonKey                                                  0
kingdomKey                                                0
phylumKey                                                 0
classKey                                

In [36]:
acrididae_df.isna().sum()

key                                                      0
datasetKey                                               0
publishingOrgKey                                         0
installationKey                                          0
hostingOrganizationKey                                   0
publishingCountry                                        0
protocol                                                 0
lastCrawled                                              0
lastParsed                                               0
crawlId                                                  0
projectId                                             2542
basisOfRecord                                            0
occurrenceStatus                                       259
taxonKey                                              1100
kingdomKey                                               0
phylumKey                                                0
classKey                                                

In [37]:
# Reset options set previously
pd.reset_option('display.max_columns')
pd.reset_option('display.max_rows')
pd.reset_option('display.width')
pd.reset_option('display.max_colwidth')

Now that we know what columns exist and the number of null values, and after some consideration, we decided to keep the following columns that seem most relevant to our analysis, listed below in a Python list

In [38]:
columns_to_keep = [
    "key", "datasetKey","scientificName", "decimalLatitude", "decimalLongitude", "eventDate", "eventTime", "coordinateUncertaintyInMeters", "occurrenceStatus"
]

Let's now filter these dataframes to only include these columns, and reduce them further 

In [39]:
hawk_df = hawk_df.loc[:, columns_to_keep]
acrididae_df = acrididae_df.loc[:, columns_to_keep]

Now that we have two dataframes containing both occurrences of Swainson Hawks and Grasshoppers, we should combine both into a main dataset to later combine with any other tracking data we retrieve from Movebank

In [40]:
gbif_df = pd.concat([hawk_df, acrididae_df], ignore_index=True)
gbif_df.head()

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
0,4011915312,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.813019,-116.525763,2023-01-02T10:52,10:52:00-08:00,61.0,PRESENT
1,4028968442,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.995642,-118.096413,2023-01-26T09:00,09:00:00-08:00,483.0,PRESENT
2,4028827735,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.995499,-118.096327,2023-01-26T10:58,10:58:00-08:00,166.0,PRESENT
3,4103303596,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",38.570301,-121.730326,2023-01-24T15:02,15:02:00-08:00,4.0,PRESENT
4,4090593311,6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b,"Buteo swainsoni Bonaparte, 1838",38.794622,-121.370722,2023-01-17T22:49:54.309Z,,,PRESENT


In [51]:
gbif_df.head(900)

Unnamed: 0,key,datasetKey,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
0,4011915312,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.813019,-116.525763,2023-01-02T10:52,10:52:00-08:00,61.0,PRESENT
1,4028968442,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.995642,-118.096413,2023-01-26T09:00,09:00:00-08:00,483.0,PRESENT
2,4028827735,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",33.995499,-118.096327,2023-01-26T10:58,10:58:00-08:00,166.0,PRESENT
3,4103303596,50c9509d-22c7-4a22-a47d-8c48425ef4a7,"Buteo swainsoni Bonaparte, 1838",38.570301,-121.730326,2023-01-24T15:02,15:02:00-08:00,4.0,PRESENT
4,4090593311,6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b,"Buteo swainsoni Bonaparte, 1838",38.794622,-121.370722,2023-01-17T22:49:54.309Z,,,PRESENT
...,...,...,...,...,...,...,...,...,...
895,4667655572,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",-121.08899,NORTH_AMERICA,76,,California,PRESENT
896,4667655593,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",-120.626495,NORTH_AMERICA,75,,California,PRESENT
897,4699503557,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",-121.14873,NORTH_AMERICA,69,,California,PRESENT
898,4718088442,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,"Buteo swainsoni Bonaparte, 1838",-118.18041,NORTH_AMERICA,81,,California,PRESENT


Now let's explore properties of this main dataset, and change datatypes if necessary, or fill in column NaNs with values if possible. Essentially, let's clean this up so we can gather some insights from this dataset before moving onto Movebank.

In [24]:
gbif_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82150 entries, 0 to 82149
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   key                            82150 non-null  int64 
 1   scientificName                 82150 non-null  object
 2   decimalLatitude                82146 non-null  object
 3   decimalLongitude               81451 non-null  object
 4   eventDate                      82150 non-null  object
 5   eventTime                      11901 non-null  object
 6   coordinateUncertaintyInMeters  80310 non-null  object
 7   occurrenceStatus               81891 non-null  object
dtypes: int64(1), object(7)
memory usage: 5.0+ MB


Let's remove any rows that contain NaN eventDates, lat/lng pairs, and convert them into appropriate datatypes such as datetimes and floats. For coordinate uncertainty, we can fill those in with zeros and default the occurrenceStatus to PRESENT if any are NaN (we can change this if data implicates an animal death or something else)

In [25]:
gbif_df['decimalLatitude'] = pd.to_numeric(gbif_df['decimalLatitude'], errors='coerce')
gbif_df['decimalLongitude'] = pd.to_numeric(gbif_df['decimalLatitude'], errors='coerce')
gbif_df['coordinateUncertaintyInMeters'] = gbif_df['coordinateUncertaintyInMeters'].fillna(0)
gbif_df['occurrenceStatus'] = gbif_df['occurrenceStatus'].fillna("PRESENT")
#gbif_df['eventDate'] = pd.to_datetime(gbif_df['eventDate'], errors='coerce', format='mixed', utc=True)
gbif_df = gbif_df.dropna(subset=['decimalLatitude', 'decimalLongitude', 'eventDate'])
gbif_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 79150 entries, 0 to 82149
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   key                            79150 non-null  int64  
 1   scientificName                 79150 non-null  object 
 2   decimalLatitude                79150 non-null  float64
 3   decimalLongitude               79150 non-null  float64
 4   eventDate                      79150 non-null  object 
 5   eventTime                      11094 non-null  object 
 6   coordinateUncertaintyInMeters  79150 non-null  object 
 7   occurrenceStatus               79150 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 5.4+ MB


There appears to be discrepancies across the eventDate column, so what we'll do is use the gbif library to find the datasets being used, and then hopefully those datasets contain metadata on how they standardized their column values.

In [44]:
help(registry)

Help on package pygbif.registry in pygbif:

NAME
    pygbif.registry - GBIF registry APIs methods

DESCRIPTION
    * `organizations`: Organizations metadata
    * `nodes`: Nodes metadata
    * `networks`: Networks metadata
    * `installations`: Installations metadata
    * `datasets`: Search for datasets and dataset metadata
    * `dataset_metrics`: Get details/metrics on a GBIF dataset
    * `dataset_suggest`: Search that returns up to 20 matching datasets
    * `dataset_search`: Full text search across all datasets

PACKAGE CONTENTS
    datasets
    installations
    networks
    nodes
    organizations

FILE
    /opt/conda/lib/python3.10/site-packages/pygbif/registry/__init__.py




In [69]:
unique_keys = gbif_df['datasetKey'].dropna().unique()

metadata_rows = []

for i in gbif_df.datasetKey.unique():
    try:
        response = requests.get("https://api.gbif.org/v1/dataset/" + i)
        response.raise_for_status() 
        data = response.json()       
        metadata_rows.append(data)
        print(i)
    except Exception as e:
        metadata_rows.append({'datasetKey': key, 'error': str(e)})
    
gbif_datasets = pd.json_normalize(metadata_rows)
gbif_datasets

50c9509d-22c7-4a22-a47d-8c48425ef4a7
6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b
4fa7b334-ce0d-4e88-aaae-2e0c138d049e
8a863029-f435-446a-821e-275f4f641165
b1047888-ae52-4179-9dd5-5448ea342a24


Unnamed: 0,key,installationKey,publishingOrganizationKey,doi,external,numConstituents,type,title,description,language,...,samplingDescription.sampling,additionalInfo,project.title,project.contacts,project.funding,project.awards,project.studyAreaDescription,project.designDescription,project.relatedProjects,project.abstract
0,50c9509d-22c7-4a22-a47d-8c48425ef4a7,997448a8-f762-11e1-a439-00145eb45e9a,28eb1a3f-1c15-4a95-931a-4af90ecb574d,10.15468/ab3s5x,False,0,OCCURRENCE,iNaturalist Research-grade Observations,<p>\n Observations from iNaturalist.org...,eng,...,,,,,,,,,,
1,6ff8b3b0-ef0f-4f79-a310-5a5615c6aa0b,b07988d2-1b3d-4fdb-931a-2e852ddac2dd,0c6d40e3-5d96-4a2d-9342-b02833aaa766,10.15468/6kud7x,False,0,OCCURRENCE,Birda - Global Observation Dataset,Occurrences of Animalia Chordata Aves recorded...,eng,...,Field observation by users of the Birda mobile...,,,,,,,,,
2,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,7182d304-b0a2-404b-baba-2086a325c221,e2e717bf-551a-4917-bdc9-4fa0f342c530,10.15468/aomfnb,False,0,OCCURRENCE,EOD – eBird Observation Dataset,eBird is a collective enterprise that takes a ...,eng,...,,Data released annually. These data are made a...,,,,,,,,
3,8a863029-f435-446a-821e-275f4f641165,a0e05292-3d09-4eae-9f83-02ae3516283c,c8d737e0-2ff8-42e8-b8fc-6b805d26fc5f,10.15468/5nilie,False,0,OCCURRENCE,"Observation.org, Nature data from around the W...",<p>This dataset contains occurrence data of fl...,eng,...,,,Observation.org data 1900-,"[{'type': 'ADMINISTRATIVE_POINT_OF_CONTACT', '...",Naturalis Biodiversity Center,[],,,[],
4,b1047888-ae52-4179-9dd5-5448ea342a24,85ccfd1a-a837-48d6-9a87-96c99f6fe012,1f00d75c-f6fc-4224-a595-975e82d7689c,10.15468/qv0ksn,False,0,OCCURRENCE,Xeno-canto - Bird sounds from around the world,<p>This dataset covers the sounds of the Bird ...,eng,...,,,,"[{'type': 'ADMINISTRATIVE_POINT_OF_CONTACT', '...",,[],,,[],


In [27]:
gbif_df.head(900)

Unnamed: 0,key,scientificName,decimalLatitude,decimalLongitude,eventDate,eventTime,coordinateUncertaintyInMeters,occurrenceStatus
0,4011915312,"Buteo swainsoni Bonaparte, 1838",33.813019,33.813019,2023-01-02T10:52,10:52:00-08:00,61.0,PRESENT
1,4028968442,"Buteo swainsoni Bonaparte, 1838",33.995642,33.995642,2023-01-26T09:00,09:00:00-08:00,483.0,PRESENT
2,4028827735,"Buteo swainsoni Bonaparte, 1838",33.995499,33.995499,2023-01-26T10:58,10:58:00-08:00,166.0,PRESENT
3,4103303596,"Buteo swainsoni Bonaparte, 1838",38.570301,38.570301,2023-01-24T15:02,15:02:00-08:00,4.0,PRESENT
4,4090593311,"Buteo swainsoni Bonaparte, 1838",38.794622,38.794622,2023-01-17T22:49:54.309Z,,0,PRESENT
...,...,...,...,...,...,...,...,...
895,4667655572,"Buteo swainsoni Bonaparte, 1838",-121.088990,-121.088990,76,,California,PRESENT
896,4667655593,"Buteo swainsoni Bonaparte, 1838",-120.626495,-120.626495,75,,California,PRESENT
897,4699503557,"Buteo swainsoni Bonaparte, 1838",-121.148730,-121.148730,69,,California,PRESENT
898,4718088442,"Buteo swainsoni Bonaparte, 1838",-118.180410,-118.180410,81,,California,PRESENT


In [70]:
'''
gbif_df['eventDate'] = gbif_df['eventDate'].astype(str).str.slice(0, 10)

# Step 2: Convert the date strings to datetime.date
gbif_df['eventDate'] = pd.to_datetime(gbif_df['eventDate'], errors='coerce').dt.date

# Step 3: Convert eventTime to timedelta64[ns]
gbif_df['eventTime'] = pd.to_timedelta(gbif_df['eventTime'], errors='coerce')

# Step 4: Get rows where eventTime is missing
missing_time_mask = gbif_df['eventTime'].isna()
gbif_missing = gbif_df[missing_time_mask].copy().sort_values('eventID')

# Step 5: Create synthetic time intervals (e.g., every 5 minutes)
interval = timedelta(minutes=5)
synthetic_times = [i * interval for i in range(len(gbif_missing))]

# Step 6: Assign synthetic times back to original DataFrame
gbif_df.loc[missing_time_mask, 'eventTime'] = synthetic_times

# Step 7: Combine cleaned eventDate with eventTime to get full datetime
gbif_df['eventDateTime'] = pd.to_datetime(gbif_df['eventDate'].astype(str)) + gbif_df['eventTime']

# Optional: Overwrite eventDate with full datetime
gbif_df['eventDate'] = gbif_df['eventDateTime']
''';

Let's move onto collecting data from Movebank, where we'll retrieve data related to animal movements which was collected during  studies in the past. These studies were provided to Movebank to make publicly accessible. 

In [60]:
movebank_base_url = "https://www.movebank.org/movebank/service/"
mb_email = "aw399@humboldt.edu"
mb_username = "aw399"
mb_password = "MQG3xrBg8SWKmyR"

What attributes are there so we know how we can filter down the studies request before requesting the associated tracking data. 

In [61]:
attributes_endpoint = movebank_base_url + "direct-read?attributes"
mb_attr_response = requests.get(attributes_endpoint, auth=HTTPBasicAuth(mb_username, mb_password))
mb_attr_response.text

'Specify one of the following entity types: deployment, deployment, event, event_reduced, individual, individual, sensor, study, study_attribute, tag, tag, tag_type, taxon\nOptional parameter: header-format=underscore\nWhen connecting via https you must specify the parameters user=... and password=...\nSpecify output attributes and filter conditions depending on the entity type.\n\nentity-type=study\nOutput attributes: access_profile_download_id, access_profile_id, acknowledgements, citation, contact_person_id, contact_person_name, default_profile_eventdata_id, default_profile_refdata_id, external_id, external_id_namespace_id, go_public_date, grants_used, has_quota, i_am_collaborator, i_am_owner, i_can_see_data, i_have_download_access, id, is_test, license_terms, license_type, main_location_lat, main_location_long, name, number_of_deployed_locations, number_of_deployments, number_of_individuals, number_of_tags, principal_investigator_address, principal_investigator_email, principal_inv

Let's collect studies where I can see the data and have download access, authenticate with our username and password, converting it into a CSV and then a pandas dataframe, and filter it down by timestamps, main location long/lat, and whether it includes any information about California or its native animals

In [63]:
mb_studies_endpoint = movebank_base_url + "direct-read"

mb_params = {
    "entity_type": "study",
    "i_can_see_data": "true",
    "i_have_download_access": "true"
}

mb_studies_response = requests.get(
    mb_studies_endpoint,
    params=mb_params,
    auth=HTTPBasicAuth(mb_username, mb_password)
)

mb_df_studies = pd.read_csv(StringIO(mb_studies_response.text))

mb_df_studies["timestamp_first_deployed_location"] = pd.to_datetime(mb_df_studies["timestamp_first_deployed_location"], errors='coerce')
mb_df_studies["timestamp_last_deployed_location"] = pd.to_datetime(mb_df_studies["timestamp_last_deployed_location"], errors='coerce')

mb_start = pd.to_datetime("2013-01-01")
mb_end = pd.to_datetime("2023-12-31")

mb_df_filtered = mb_df_studies[
    (mb_df_studies["timestamp_last_deployed_location"] >= mb_start) &
    (mb_df_studies["timestamp_first_deployed_location"] <= mb_end)
]

mb_df_filtered = mb_df_filtered[
    (mb_df_filtered["main_location_lat"] >= 32.0) &
    (mb_df_filtered["main_location_lat"] <= 42.0) &
    (mb_df_filtered["main_location_long"] >= -125.0) &
    (mb_df_filtered["main_location_long"] <= -114.0)
]

mb_df_filtered = mb_df_filtered[mb_df_filtered["name"].str.contains("california|sierra|bay area|america|us|ca|CA|California", case=False, na=False)]

mb_df_filtered.head()

# Download to analyze and filter down the studies we'll be using and animals with data we can work with
# mb_df_filtered.to_csv("movebank_studies.csv")

Unnamed: 0,acknowledgements,citation,go_public_date,grants_used,has_quota,i_am_owner,id,is_test,license_terms,license_type,...,there_are_data_which_i_cannot_see,i_have_download_access,i_am_collaborator,study_permission,timestamp_first_deployed_location,timestamp_last_deployed_location,number_of_deployed_locations,taxon_ids,sensor_type_ids,contact_person_name
258,,"Huysman AE, CastaÃ±eda XA, Johnson MD. 2021. D...",,,True,False,1426277950,False,,CC_BY,...,False,True,False,na,2016-03-17 16:40:00,2018-07-18 13:09:00,34012.0,Tyto furcata,GPS,ahuysman (Allison Huysman)
384,Support for this study was provided by the S. ...,"BA Barbaree, ME Reiter, CM Hickey, GW Page. 2...",,Grant to the Migratory Bird Conservation Partn...,True,False,1419917362,False,,CC_BY,...,False,True,False,na,2013-02-08 00:00:00,2013-04-23 00:00:00,212.0,"Calidris alpina,Limnodromus scolopaceus",Radio Transmitter,auoptimo (Blake Barbaree)
455,,A more complete version of this dataset is sto...,,,True,False,217784323,False,,CC_BY,...,False,True,False,na,2003-11-05 15:00:00,2016-12-08 14:04:04,679402.0,"Cathartes aura,Coragyps atratus",GPS,dbarber (David Barber)
511,C. G. Putnam provided the template for Figure ...,"Barbaree BA, Reiter ME, Hickey CM, Page GW. 20...",,Support for this study was provided by the S. ...,True,False,1437646021,False,,CC_BY,...,False,True,False,na,2012-08-19 07:00:00,2014-03-18 07:00:00,475.0,Limnodromus scolopaceus,Radio Transmitter,auoptimo (Blake Barbaree)
618,,"Bildstein KL, Barber D, Bechard MJ, GraÃ±a Gri...",,,True,False,481458,False,,CC_BY,...,False,True,False,na,2003-11-05 15:00:00,2025-04-24 00:23:36,2222231.0,"Cathartes aura,Coragyps atratus","GPS,Accessory Measurements",dbarber (David Barber)


We'll be focusing on tracking data starting from 2013 onward. Since not all studies span the full decade, we’ll later trim the NOAA weather dataset to align with the actual periods of movement data. Based on our metadata review, the species we’ll analyze include Barn Owls, Vultures, Red-tailed Hawks, Bobcats, Grey Foxes, and Blue & Fin Whales.

To simplify the data retrieval process, we’ll use pre-downloaded CSVs from Movebank’s archive rather than building a separate function to fetch each study. Below is a reference dictionary with the common animal names as keys and their corresponding Movebank study IDs (where applicable). We’ll load these files into dataframes for further analysis.

The selected Movebank studies reflect a diverse range of ecological and conservation-oriented research across different species in California and surrounding regions. After reviewing the abstracts and descriptions for each of these studies, it's clear that researchers had various objectives — many sought to understand how wildlife movements and mortality are influenced by human development and environmental pressures. 

Some studies focused on habitat fragmentation and the effects of urban sprawl (e.g., Barn Owls and Bobcats), while others monitored the impact of wind energy infrastructure on species like vultures and hawks. The whale study provides marine movement data in response to oceanographic and acoustic conditions, and Swainson's Hawk tracking adds migratory insight, though it was retrieved from a separate archive. Across all these projects, researchers aim to inform conservation planning, identify critical habitat areas, and explore the broader ecological consequences of human activity.

Let's collect tracking data for Blue and Fin whales, remember all of these datasets being read are from download csv files retrieved from Movebank's data archives.

In [71]:
movebank_hawk_df = pd.read_csv("./MovebankData/Space use by Swainson's Hawk (Buteo swainsoni) in the Natomas Basin, California.csv")
movebank_hawk_df.head()
#swainsonHawk_df.shape

Unnamed: 0,event-id,visible,timestamp,location-long,location-lat,ground-speed,heading,height-above-ellipsoid,migration-stage,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name
0,1583635119,True,2011-06-25 16:00:00.000,-121.62333,38.69267,5.1444,149.0,80.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
1,1583635120,True,2011-06-25 17:00:00.000,-121.624,38.69433,5.1444,150.0,160.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
2,1583635121,True,2011-06-25 18:00:00.000,-121.625,38.69383,5.1444,226.0,190.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
3,1583635122,True,2011-06-25 19:00:00.000,-121.623,38.69117,5.65884,210.0,280.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
4,1583635123,True,2011-06-25 20:00:00.000,-121.6255,38.69267,8.74548,16.0,220.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...


We check the information about the dataframe, converting datatypes where needed

In [72]:
movebank_hawk_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18828 entries, 0 to 18827
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   event-id                         18828 non-null  int64  
 1   visible                          18828 non-null  bool   
 2   timestamp                        18828 non-null  object 
 3   location-long                    18828 non-null  float64
 4   location-lat                     18828 non-null  float64
 5   ground-speed                     14289 non-null  float64
 6   heading                          18828 non-null  float64
 7   height-above-ellipsoid           18827 non-null  float64
 8   migration-stage                  18828 non-null  object 
 9   sensor-type                      18828 non-null  object 
 10  individual-taxon-canonical-name  18828 non-null  object 
 11  tag-local-identifier             18828 non-null  object 
 12  individual-local-i

In [73]:
movebank_hawk_df['timestamp'] = pd.to_datetime(movebank_hawk_df['timestamp'])

Now we check for null values amongst the columns, we probably should just get rid of any rows that have NaN long or lat values since we want to map location later on.

In [74]:
movebank_hawk_df.isna().sum()

event-id                              0
visible                               0
timestamp                             0
location-long                         0
location-lat                          0
ground-speed                       4539
heading                               0
height-above-ellipsoid                1
migration-stage                       0
sensor-type                           0
individual-taxon-canonical-name       0
tag-local-identifier                  0
individual-local-identifier           0
study-name                            0
dtype: int64

In [76]:
movebank_hawk_df = movebank_hawk_df.dropna(subset=["location-lat", "location-long"])
movebank_hawk_df.head()

Unnamed: 0,event-id,visible,timestamp,location-long,location-lat,ground-speed,heading,height-above-ellipsoid,migration-stage,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name
0,1583635119,True,2011-06-25 16:00:00,-121.62333,38.69267,5.1444,149.0,80.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
1,1583635120,True,2011-06-25 17:00:00,-121.624,38.69433,5.1444,150.0,160.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
2,1583635121,True,2011-06-25 18:00:00,-121.625,38.69383,5.1444,226.0,190.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
3,1583635122,True,2011-06-25 19:00:00,-121.623,38.69117,5.65884,210.0,280.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
4,1583635123,True,2011-06-25 20:00:00,-121.6255,38.69267,8.74548,16.0,220.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...


We know the common columns amongst them all and we can make a new species column to track for which animal this is, keeping identifiers for different animals being tracked in their respective studies.

Let's start collecting NOAA data for CA daily weather summaries. We should first check what datasets are available to us in California. They're likely many stations collecting weather summaries periodically since the state is quite large.

In [15]:
noaa_access_token = "sFsHrTCdOCOitEmsjGbCBLgKSdEbKgCP"
noaa_headers = {"token": noaa_access_token}
noaa_base_url = "https://www.ncei.noaa.gov/cdo-web/api/v2/"
CA_FIPS = "FIPS:06"
dataset_endpoint = noaa_base_url + "datasets"
CA_datasets_response = requests.get(dataset_endpoint, params = {"locationid": CA_FIPS}, headers=noaa_headers)
CA_datasets_response.json()

{'metadata': {'resultset': {'offset': 1, 'count': 11, 'limit': 25}},
 'results': [{'uid': 'gov.noaa.ncdc:C00861',
   'mindate': '1763-01-01',
   'maxdate': '2025-04-21',
   'name': 'Daily Summaries',
   'datacoverage': 1,
   'id': 'GHCND'},
  {'uid': 'gov.noaa.ncdc:C00946',
   'mindate': '1763-01-01',
   'maxdate': '2025-04-01',
   'name': 'Global Summary of the Month',
   'datacoverage': 1,
   'id': 'GSOM'},
  {'uid': 'gov.noaa.ncdc:C00947',
   'mindate': '1763-01-01',
   'maxdate': '2025-01-01',
   'name': 'Global Summary of the Year',
   'datacoverage': 1,
   'id': 'GSOY'},
  {'uid': 'gov.noaa.ncdc:C00345',
   'mindate': '1991-06-05',
   'maxdate': '2025-04-22',
   'name': 'Weather Radar (Level II)',
   'datacoverage': 0.95,
   'id': 'NEXRAD2'},
  {'uid': 'gov.noaa.ncdc:C00708',
   'mindate': '1994-05-20',
   'maxdate': '2025-04-18',
   'name': 'Weather Radar (Level III)',
   'datacoverage': 0.95,
   'id': 'NEXRAD3'},
  {'uid': 'gov.noaa.ncdc:C00821',
   'mindate': '2010-01-01',
   

Looks like we want to use the GHCND id for our future requests so we can pull CA data with DAILY weather summaries. It looks like data is available as early as 1763 and is still being updated as of April of 2025. We should check which data categories are available for daily weather so that we can take the readings that will likely matter more for our study. Also, we don't have unlimited time so it's good to start with a subset of presumingly relevant variables.

In [4]:
dataset_code = "GHCND"
datacategories_endpoint = noaa_base_url + "datacategories"
CA_datacat_response = requests.get(datacategories_endpoint, params = {"locationid": CA_FIPS, "datasetid": dataset_code, "limit": "1000"}, headers=noaa_headers).json()
CA_datacat_ids = [i['id'] for i in CA_datacat_response['results']]
CA_datacat_response
CA_datacat_ids

['EVAP', 'LAND', 'PRCP', 'SKY', 'SUN', 'TEMP', 'WATER', 'WIND', 'WXTYPE']

Now we know what data categories are there, we might just want all of them since we're searching for correlations or associations between weather and ecological patterns. Let's look at the datatypes associated with the list of data categories. After doing so, I created a list containing the most seemingly useful ones and gathered a list of tuples with their datatype codes and the description associated with each.

In [5]:
datatype_endpoint = noaa_base_url + "datatypes"
CA_datatype_response = requests.get(datatype_endpoint, params = {"locationid": CA_FIPS, "datasetid": dataset_code, "limit": "200"}, headers=noaa_headers).json()
# filter list of ids based on importance
CA_datatype_ids = [
    "TMIN",  # Min temperature
    "TMAX",  # Max temperature
    "AWND",  # Avg wind speed
    "TSUN",  # Total sunshine
    "WT08",  # Hail
    "TAVG",  # Avg temperature
    "PRCP",  # Precipitation
    "SNOW",  # Snowfall
    "SNWD",  # Snow depth
    ]
CA_datatype_info = [(i['id'], i['name']) for i in CA_datatype_response['results'] if i['id'] in CA_datatype_ids]
CA_datatype_info

[('AWND', 'Average wind speed'),
 ('PRCP', 'Precipitation'),
 ('SNOW', 'Snowfall'),
 ('SNWD', 'Snow depth'),
 ('TAVG', 'Average Temperature.'),
 ('TMAX', 'Maximum temperature'),
 ('TMIN', 'Minimum temperature'),
 ('TSUN', 'Total sunshine for the period'),
 ('WT08', 'Smoke or haze ')]

Now we know the datatypes available across our data categories for the dataset we found, time to grab some data across California, but first let's look at the location categories in case we need it later when searching for location-specific data or something to that effect.

In [6]:
locationcat_endpoint = noaa_base_url + "locationcategories"
CA_loccat_response = requests.get(locationcat_endpoint, params = {"locationid": CA_FIPS, "datasetid": dataset_code}, headers=noaa_headers).json()
CA_loccat_response

{'metadata': {'resultset': {'offset': 1, 'count': 12, 'limit': 25}},
 'results': [{'name': 'City', 'id': 'CITY'},
  {'name': 'Climate Division', 'id': 'CLIM_DIV'},
  {'name': 'Climate Region', 'id': 'CLIM_REG'},
  {'name': 'Country', 'id': 'CNTRY'},
  {'name': 'County', 'id': 'CNTY'},
  {'name': 'Hydrologic Accounting Unit', 'id': 'HYD_ACC'},
  {'name': 'Hydrologic Cataloging Unit', 'id': 'HYD_CAT'},
  {'name': 'Hydrologic Region', 'id': 'HYD_REG'},
  {'name': 'Hydrologic Subregion', 'id': 'HYD_SUB'},
  {'name': 'State', 'id': 'ST'},
  {'name': 'US Territory', 'id': 'US_TERR'},
  {'name': 'Zip Code', 'id': 'ZIP'}]}

I also want to know which stations are available in CA, since we might want more information later to map out the stations and how they correlate with tracking data we receive from Movebank later on. This will important considering the stations have distance between them, and so having a general location is helpful for mapping purposes.

In [7]:
station_endpoint = noaa_base_url + "stations"
CA_station_response = requests.get(station_endpoint, params={"locationid": CA_FIPS, "limit": "1000", "startdate": "2013-01-01", "enddate": "2023-12-31"}, headers=noaa_headers).json()
CA_station_response['results'][:2] # A few station records just so we can get an idea

[{'elevation': 26.5,
  'mindate': '1994-01-01',
  'maxdate': '2015-11-01',
  'latitude': 38.2177,
  'name': 'ACAMPO 5 NE, CA US',
  'datacoverage': 0.9469,
  'id': 'COOP:040010',
  'elevationUnit': 'METERS',
  'longitude': -121.2013},
 {'elevation': 863.5,
  'mindate': '1931-01-01',
  'maxdate': '2013-12-01',
  'latitude': 34.4938,
  'name': 'ACTON ESCONDIDO CANYON, CA US',
  'datacoverage': 0.8986,
  'id': 'COOP:040014',
  'elevationUnit': 'METERS',
  'longitude': -118.2713}]

Time to grab the meteorological or climatological data from NOAA across CA. I'll only be grabbing a decade's worth of data from 2013-2023, and we'll be searching for animal tracking records later that also fit within the range of this time frame.

We can only grab roughly a year's worth after requesting 10 years from API, as a result, we need to iteratively grab a response for each year and then extending our array with that json response result. 

In [None]:
data_endpoint = noaa_base_url + "data"

all_results = []

for year in range(2013, 2024): 
    startdate = f"{year}-01-01"
    enddate = f"{year}-12-31"
    
    params = {
        "startdate": startdate,
        "enddate": enddate,
        "datasetid": dataset_code, 
        "locationid": CA_FIPS,
        "datatypeid": CA_datatype_ids,
        "units":"standard", 
        "limit": "1000"
    }

    response = requests.get(data_endpoint, params=params, headers=noaa_headers)
    
    if response.status_code == 200:
        year_data = response.json().get("results", [])
        all_results.extend(year_data)
        print(f"Got data for {year}: {len(year_data)} records")
        sleep(3)
    else:
        print(f"Failed for {year} – {response.status_code}: {response.text}")

In [10]:
data_endpoint = noaa_base_url + "data"
all_results = []

for year in range(2013, 2024): 
    startdate = f"{year}-01-01"
    enddate = f"{year}-12-31"
    offset = 0
    year_data = []

    print(f"Starting retrieval for {year}...")

    while True:
        params = {
            "startdate": startdate,
            "enddate": enddate,
            "datasetid": dataset_code, 
            "locationid": CA_FIPS,
            "datatypeid": CA_datatype_ids,
            "units": "standard", 
            "limit": "1000",
            "offset": offset
        }

        response = requests.get(data_endpoint, params=params, headers=noaa_headers)

        if response.status_code == 200:
            page = response.json().get("results", [])
            print(f"Year {year}, Offset {offset}: Retrieved {len(page)} records")
            
            if not page:
                break

            year_data.extend(page)
            offset += 1000
            sleep(10)  # Respectful pause between calls
        else:
            print(f"Failed for {year} at offset {offset} – {response.status_code}: {response.text}")
            print(f"Moving on to next offset {offset+1000}")

    all_results.extend(year_data)
    print(f"Completed {year}: Total {len(year_data)} records\n")


Starting retrieval for 2013...
Year 2013, Offset 0: Retrieved 1000 records
Year 2013, Offset 1000: Retrieved 1000 records
Year 2013, Offset 2000: Retrieved 1000 records
Year 2013, Offset 3000: Retrieved 1000 records
Year 2013, Offset 4000: Retrieved 1000 records
Year 2013, Offset 5000: Retrieved 1000 records
Year 2013, Offset 6000: Retrieved 1000 records
Year 2013, Offset 7000: Retrieved 1000 records
Year 2013, Offset 8000: Retrieved 1000 records
Year 2013, Offset 9000: Retrieved 1000 records
Year 2013, Offset 10000: Retrieved 1000 records
Year 2013, Offset 11000: Retrieved 1000 records
Year 2013, Offset 12000: Retrieved 1000 records
Year 2013, Offset 13000: Retrieved 1000 records
Year 2013, Offset 14000: Retrieved 1000 records
Year 2013, Offset 15000: Retrieved 1000 records
Year 2013, Offset 16000: Retrieved 1000 records
Year 2013, Offset 17000: Retrieved 1000 records
Year 2013, Offset 18000: Retrieved 1000 records
Year 2013, Offset 19000: Retrieved 1000 records
Year 2013, Offset 2000

Year 2013, Offset 118000: Retrieved 1000 records
Year 2013, Offset 119000: Retrieved 1000 records
Year 2013, Offset 120000: Retrieved 1000 records
Year 2013, Offset 121000: Retrieved 1000 records
Year 2013, Offset 122000: Retrieved 1000 records
Year 2013, Offset 123000: Retrieved 1000 records
Year 2013, Offset 124000: Retrieved 1000 records
Failed for 2013 at offset 125000 – 503: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
</body></html>

Moving on to next offset 126000
Year 2013, Offset 125000: Retrieved 1000 records
Year 2013, Offset 126000: Retrieved 1000 records
Year 2013, Offset 127000: Retrieved 1000 records
Year 2013

KeyboardInterrupt: 

We've collected daily weather summary data for CA in standard units, for the provided datatype ids, between 2013 and 2023. We can now use pandas to transform this into a dataframe, where change datatypes, columns, and remove rows of data that have null values.

In [None]:
CA_NOAA_df = pd.json_normalize(all_results)
CA_NOAA_df.head()

To summarize, the data comes from [NOAA's NCEI API](https://www.ncdc.noaa.gov/cdo-web/webservices/v2) which provides archives for weather data across the United States. This data was collected for CA across 2013 to 2023, and the dataset comes with five variables: date, datatype, station, attributes, and values associated with each datatype. Attributes are divided by a code that's divided by a measurement flag, quality flag, source flag, and time of observation. I won't detail them here but you can find more information [here](https://docs.ropensci.org/rnoaa/articles/ncdc_attributes.html)

In [None]:
CA_NOAA_df.shape
CA_NOAA_df.isna().sum()
CA_NOAA_df.info()
CA_NOAA_df['date'] = pd.to_datetime(CA_NOAA_df['date'])
CA_NOAA_df['value'] = CA_NOAA_df['value'].astype(int)
CA_NOAA_df.value_counts("datatype") # Appears that there isn't some of the variables available that we requested such as average wind speed, sunshine, smoke/haze, average temperature, and hail
CA_NOAA_df.head()

## Gather Statistics

## Analyze Statistics 

## Use Above to Answer Questions using Inferential Statistics and Prediction

## Answer Questions and Conclude Findings

## References and Citations