# DATA271 Final Project - Weather and Animal Migration Patterns in California

---

## Research Details

---

### Introducing the Problem
This project uses a statistical investigative process to analyze animal migration data alongside meteorological datasets and identify environmental factors that influence the movement patterns of species across California. We aim to understand how temperature, precipitation, and other atmospheric conditions affect when and where animals migrate, and whether these patterns are shifting in response to broader climate variability.

By pairing animal tracking data from Movebank and species data from CDFW with detailed weather records from NOAA, this project investigates correlations between environmental change and behavioral shifts in migratory species. The goal is to use spatial and temporal analysis to uncover trends and stressors that can inform environmental monitoring, conservation planning, and scientific understanding of wildlife ecology.

---

### Addressing the Problem
The approach will involve joining datasets on both geographic coordinates and dates to conduct spatiotemporal analysis. We’ll investigate time series trends of both weather conditions and migration activity to determine if patterns emerge—such as species arriving earlier due to warming winters, or retreating from drought-affected areas.
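
The date-and-location join described above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's final pipeline: the column names `station`, `date`, and `TMAX` are hypothetical, and the sketch assumes each tracking fix has already been assigned to a weather station.

```python
import pandas as pd

# Hypothetical tracking fixes, each already assigned to a weather station.
tracks = pd.DataFrame({
    "individual": ["owl_1", "owl_1", "owl_2"],
    "timestamp": pd.to_datetime(
        ["2018-04-24 19:03", "2018-04-25 02:10", "2018-04-24 21:30"]
    ),
    "station": ["S1", "S1", "S2"],
})

# Hypothetical daily weather summaries, one row per station and day.
weather = pd.DataFrame({
    "station": ["S1", "S1", "S2"],
    "date": pd.to_datetime(["2018-04-24", "2018-04-25", "2018-04-24"]),
    "TMAX": [71.0, 68.0, 75.0],
})

# Reduce each fix to its calendar day so it can be matched to a daily summary.
tracks["date"] = tracks["timestamp"].dt.normalize()

# A left join keeps every fix, even when no weather record exists for that day.
joined = tracks.merge(weather, on=["station", "date"], how="left")
print(joined[["individual", "timestamp", "station", "TMAX"]])
```

A left join is used here so that gaps in weather coverage show up as missing values rather than silently dropping animal locations.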

Additionally, we'll explore the possibility of species migrations being driven not just by weather, but also by ecological relationships such as predator-prey dynamics. By cross-referencing species presence and timing, we can identify potential cases where weather-driven migration may be influenced—or confounded—by avoidance of other species.

This work contributes toward a better understanding of how climate variability and atmospheric anomalies impact wildlife ecosystems, particularly within a climate-sensitive and biodiversity-rich region like California.

---

### Analysis Breakdown
We ask the following questions before conducting our official exploratory data analysis:

- What weather patterns or atmospheric variables (e.g., temperature, precipitation, wind patterns) correlate most strongly with migration timing or intensity for species in California?
- Are there identifiable climate thresholds or seasonal shifts that serve as predictors for migratory events?
- Can we spatially and temporally map migration patterns alongside climate variables to detect meaningful trends or changes over time?
- How might ongoing climate variability affect the predictability and consistency of migration behaviors in the near future?
- Are any migratory shifts better explained by predator-prey relationships than by weather factors, and can predator presence be controlled for as a potential confounder when assessing causality?

Our analysis will be broken down into the following stages:

1. **Explore Individual Datasets**  
   Clean and summarize each dataset—checking for null values, date and location alignment, and variable consistency. Generate summary statistics and visualizations to understand baseline structure and behavior.

2. **Analyze Combined Datasets**  
   Merge animal movement and climate datasets on time and location. Create layered time series visualizations and heat maps to understand migratory behavior in relation to environmental variables.

3. **Evaluate Significance and Observational Limitations**  
   Evaluate both the correlation strength and ecological plausibility of observed patterns. Recognize the observational nature of the data and avoid overextending conclusions where causality cannot be proven.

4. **Answer Research Questions**  
   Revisit initial questions in light of findings. Highlight where results support or refute assumptions about how atmospheric conditions drive migration, and discuss the role of other ecological pressures.

5. **Recommendations & Further Exploration**  
   Suggest data-driven implications for conservation or climate adaptation strategies. Propose new directions for data collection (e.g., more granular predator presence data), and identify gaps or limitations in the current analysis.

---

### Datasets
1. [NOAA National Centers for Environmental Information](https://www.ncei.noaa.gov/) – historical and real-time climate data  
2. [California Department of Fish and Wildlife](https://data-cdfw.opendata.arcgis.com/search?q=barn%20owl&tags=species) – public CDFW spatial data and related applications for species-specific details  
3. [Movebank](https://www.movebank.org/) – open-source animal movement data across species  

---

### Libraries & Modules
- **Pandas:** For time series wrangling and merging datasets  
- **NumPy:** Core scientific computing and vectorized calculations  
- **Matplotlib & Seaborn:** Visualizations for temporal and distributional trends  
- **Plotly:** Interactive visualizations and geographic mapping  
- **GeoPandas:** Spatial joins and mapping of migration paths and weather zones  

---

### Project Resources
- [GitHub Repository](https://github.com/toritotony/Data271FinalProject)


## Collecting Data

In [1]:
!pip install plotnine geopandas


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import numpy as np
from plotnine import *
import geopandas
import plotly
import requests
from time import sleep
from requests.auth import HTTPBasicAuth
from io import StringIO

### Let's start collecting NOAA data for CA daily weather summaries. We should first check what datasets are available

In [3]:
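import os
# Read the API token from the environment instead of hardcoding it; NOAA_CDO_TOKEN is a hypothetical variable name
noaa_access_token = os.environ["NOAA_CDO_TOKEN"]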
noaa_headers = {"token": noaa_access_token}
noaa_base_url = "https://www.ncei.noaa.gov/cdo-web/api/v2/"
CA_FIPS = "FIPS:06"
dataset_endpoint = noaa_base_url + "datasets"
CA_datasets_response = requests.get(dataset_endpoint, params = {"locationid": CA_FIPS}, headers=noaa_headers)
CA_datasets_response.json()

{'metadata': {'resultset': {'offset': 1, 'count': 11, 'limit': 25}},
 'results': [{'uid': 'gov.noaa.ncdc:C00861',
   'mindate': '1763-01-01',
   'maxdate': '2025-04-05',
   'name': 'Daily Summaries',
   'datacoverage': 1,
   'id': 'GHCND'},
  {'uid': 'gov.noaa.ncdc:C00946',
   'mindate': '1763-01-01',
   'maxdate': '2025-03-01',
   'name': 'Global Summary of the Month',
   'datacoverage': 1,
   'id': 'GSOM'},
  {'uid': 'gov.noaa.ncdc:C00947',
   'mindate': '1763-01-01',
   'maxdate': '2025-01-01',
   'name': 'Global Summary of the Year',
   'datacoverage': 1,
   'id': 'GSOY'},
  {'uid': 'gov.noaa.ncdc:C00345',
   'mindate': '1991-06-05',
   'maxdate': '2025-04-06',
   'name': 'Weather Radar (Level II)',
   'datacoverage': 0.95,
   'id': 'NEXRAD2'},
  {'uid': 'gov.noaa.ncdc:C00708',
   'mindate': '1994-05-20',
   'maxdate': '2025-04-02',
   'name': 'Weather Radar (Level III)',
   'datacoverage': 0.95,
   'id': 'NEXRAD3'},
  {'uid': 'gov.noaa.ncdc:C00821',
   'mindate': '2010-01-01',
   

### Looks like we want to use the GHCND id for our future requests so we can pull CA data with daily weather summaries. We should check which data categories are available for daily weather.

In [4]:
dataset_code = "GHCND"
datacategories_endpoint = noaa_base_url + "datacategories"
CA_datacat_response = requests.get(datacategories_endpoint, params = {"locationid": CA_FIPS, "datasetid": dataset_code, "limit": "1000"}, headers=noaa_headers).json()
CA_datacat_ids = [i['id'] for i in CA_datacat_response['results']]
CA_datacat_response
CA_datacat_ids

['EVAP', 'LAND', 'PRCP', 'SKY', 'SUN', 'TEMP', 'WATER', 'WIND', 'WXTYPE']

### Now we know what data categories are there, we might just want all of them since we're searching for correlations or associations between weather and ecological patterns. Let's look at the datatypes

In [5]:
datatype_endpoint = noaa_base_url + "datatypes"
CA_datatype_response = requests.get(datatype_endpoint, params = {"locationid": CA_FIPS, "datasetid": dataset_code, "limit": "200"}, headers=noaa_headers).json()
# filter list of ids based on importance
CA_datatype_ids = [
    "TMIN",  # Min temperature
    "TMAX",  # Max temperature
    "AWND",  # Avg wind speed
    "TSUN",  # Total sunshine
    "WT08",  # Smoke or haze
    "TAVG",  # Avg temperature
    "PRCP",  # Precipitation
    "SNOW",  # Snowfall
    "SNWD",  # Snow depth
    ]
CA_datatype_info = [(i['id'], i['name']) for i in CA_datatype_response['results'] if i['id'] in CA_datatype_ids]
CA_datatype_info

[('AWND', 'Average wind speed'),
 ('PRCP', 'Precipitation'),
 ('SNOW', 'Snowfall'),
 ('SNWD', 'Snow depth'),
 ('TAVG', 'Average Temperature.'),
 ('TMAX', 'Maximum temperature'),
 ('TMIN', 'Minimum temperature'),
 ('TSUN', 'Total sunshine for the period'),
 ('WT08', 'Smoke or haze ')]
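
Because GHCND codes are terse, one low-cost safeguard is to build the code-to-name lookup from the API response itself rather than from memory. The sketch below assumes a `results` list shaped like the response above; `describe_datatypes` is a hypothetical helper, and `WT05` is included only to show what an ID missing from the response looks like.

```python
def describe_datatypes(requested, results):
    """Map each requested datatype ID to the official name in an API `results` list.

    IDs the API did not return are mapped to None so they stand out.
    """
    official = {item["id"]: item["name"].strip() for item in results}
    return {code: official.get(code) for code in requested}


# Subset of the response shown above, copied by hand for illustration.
sample_results = [
    {"id": "AWND", "name": "Average wind speed"},
    {"id": "TSUN", "name": "Total sunshine for the period"},
    {"id": "WT08", "name": "Smoke or haze "},
]

lookup = describe_datatypes(["AWND", "TSUN", "WT08", "WT05"], sample_results)
for code, name in lookup.items():
    print(f"{code}: {name if name is not None else 'not offered for this query'}")
```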

### Now we know the datatypes available across our data categories for the dataset we found, time to grab some data across CA, but first let's look at the location categories in case we need it later when looking through the different readings across stations

In [6]:
locationcat_endpoint = noaa_base_url + "locationcategories"
CA_loccat_response = requests.get(locationcat_endpoint, params = {"locationid": CA_FIPS, "datasetid": dataset_code}, headers=noaa_headers).json()
CA_loccat_response

{'metadata': {'resultset': {'offset': 1, 'count': 12, 'limit': 25}},
 'results': [{'name': 'City', 'id': 'CITY'},
  {'name': 'Climate Division', 'id': 'CLIM_DIV'},
  {'name': 'Climate Region', 'id': 'CLIM_REG'},
  {'name': 'Country', 'id': 'CNTRY'},
  {'name': 'County', 'id': 'CNTY'},
  {'name': 'Hydrologic Accounting Unit', 'id': 'HYD_ACC'},
  {'name': 'Hydrologic Cataloging Unit', 'id': 'HYD_CAT'},
  {'name': 'Hydrologic Region', 'id': 'HYD_REG'},
  {'name': 'Hydrologic Subregion', 'id': 'HYD_SUB'},
  {'name': 'State', 'id': 'ST'},
  {'name': 'US Territory', 'id': 'US_TERR'},
  {'name': 'Zip Code', 'id': 'ZIP'}]}

### I also want to know which stations are available in CA, since we might want more information later to map out the stations and how they correlate with tracking data we receive from Movebank later on

In [7]:
station_endpoint = noaa_base_url + "stations"
CA_station_response = requests.get(station_endpoint, params={"locationid": CA_FIPS, "limit": "1000", "startdate": "2013-01-01", "enddate": "2023-12-31"}, headers=noaa_headers).json()
CA_station_response

{'metadata': {'resultset': {'offset': 1, 'count': 2662, 'limit': 1000}},
 'results': [{'elevation': 26.5,
   'mindate': '1994-01-01',
   'maxdate': '2015-11-01',
   'latitude': 38.2177,
   'name': 'ACAMPO 5 NE, CA US',
   'datacoverage': 0.9469,
   'id': 'COOP:040010',
   'elevationUnit': 'METERS',
   'longitude': -121.2013},
  {'elevation': 863.5,
   'mindate': '1931-01-01',
   'maxdate': '2013-12-01',
   'latitude': 34.4938,
   'name': 'ACTON ESCONDIDO CANYON, CA US',
   'datacoverage': 0.8986,
   'id': 'COOP:040014',
   'elevationUnit': 'METERS',
   'longitude': -118.2713},
  {'elevation': 1280.8,
   'mindate': '1943-11-01',
   'maxdate': '2015-11-01',
   'latitude': 41.19334,
   'name': 'ADIN RANGER STATION, CA US',
   'datacoverage': 0.9931,
   'id': 'COOP:040029',
   'elevationUnit': 'METERS',
   'longitude': -120.94458},
  {'elevation': 516.6,
   'mindate': '1952-10-01',
   'maxdate': '2015-11-01',
   'latitude': 32.8358,
   'name': 'ALPINE, CA US',
   'datacoverage': 0.9644,
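
To relate stations to tracking data later, each animal location needs a nearby station. A minimal sketch of one way to do that, using the haversine great-circle distance with only the standard library, is shown below. The three stations are copied from the response above; the query point and the helper names `haversine_km` and `nearest_station` are illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (station dict, distance in km) for the station closest to the point."""
    best = min(
        stations,
        key=lambda s: haversine_km(lat, lon, s["latitude"], s["longitude"]),
    )
    return best, haversine_km(lat, lon, best["latitude"], best["longitude"])


stations = [
    {"id": "COOP:040010", "name": "ACAMPO 5 NE, CA US", "latitude": 38.2177, "longitude": -121.2013},
    {"id": "COOP:040014", "name": "ACTON ESCONDIDO CANYON, CA US", "latitude": 34.4938, "longitude": -118.2713},
    {"id": "COOP:040029", "name": "ADIN RANGER STATION, CA US", "latitude": 41.19334, "longitude": -120.94458},
]

# Illustrative query point in the Napa Valley area.
station, distance_km = nearest_station(38.2086, -122.2469, stations)
print(station["id"], round(distance_km, 1))
```

A linear scan like this is fine for a few thousand stations; for millions of tracking fixes, a spatial index or a GeoPandas nearest join would likely be a better fit.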
  

### Time to grab the meteorological or climatological data from NOAA across CA. I'll only be grabbing data from 2013 through 2023.

### Requesting the full range at once fails because the API only returns up to one year of data per request. As a result, we request each year separately and extend our list with the results from each JSON response.

In [8]:
data_endpoint = noaa_base_url + "data"

all_results = []

for year in range(2013, 2024): 
    startdate = f"{year}-01-01"
    enddate = f"{year}-12-31"
    
    params = {
        "startdate": startdate,
        "enddate": enddate,
        "datasetid": dataset_code, 
        "locationid": CA_FIPS,
        "datatypeid": CA_datatype_ids,
        "units":"standard", 
        "limit": "1000"
    }

    response = requests.get(data_endpoint, params=params, headers=noaa_headers)
    
    if response.status_code == 200:
        year_data = response.json().get("results", [])
        all_results.extend(year_data)
        print(f"Got data for {year}: {len(year_data)} records")
        sleep(3)
    else:
        print(f"Failed for {year} – {response.status_code}: {response.text}")

Got data for 2013: 1000 records
Got data for 2014: 1000 records
Got data for 2015: 1000 records
Got data for 2016: 1000 records
Got data for 2017: 1000 records
Got data for 2018: 1000 records
Got data for 2019: 1000 records
Got data for 2020: 1000 records
Got data for 2021: 1000 records
Got data for 2022: 1000 records
Got data for 2023: 1000 records
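
Every year returned exactly 1,000 records, which equals the `limit`, so each response is probably truncated. The `metadata.resultset` block seen in the earlier responses reports `offset`, `count`, and `limit`, which suggests paging through results with an `offset` parameter. The sketch below shows that loop with an injected `fetch_page` callable so it runs without network access; `fetch_all`, `fetch_page`, and `fake_fetch` are hypothetical names, and the in-memory rows stand in for the real API.

```python
def fetch_all(fetch_page, limit=1000):
    """Collect every record by advancing `offset` until `count` is reached.

    `fetch_page(offset, limit)` must return a dict shaped like the API's JSON:
    {"metadata": {"resultset": {"offset": ..., "count": ..., "limit": ...}},
     "results": [...]}
    """
    records = []
    offset = 1  # offsets start at 1, as in the metadata shown earlier
    while True:
        payload = fetch_page(offset, limit)
        page = payload.get("results", [])
        records.extend(page)
        total = payload.get("metadata", {}).get("resultset", {}).get("count", 0)
        offset += limit
        # Stop on an empty page or once the next offset is past the last record.
        if not page or offset > total:
            break
    return records


# Stand-in for the real request, so the sketch runs offline.
FAKE_ROWS = [{"value": i} for i in range(2500)]


def fake_fetch(offset, limit):
    start = offset - 1
    return {
        "metadata": {"resultset": {"offset": offset, "count": len(FAKE_ROWS), "limit": limit}},
        "results": FAKE_ROWS[start:start + limit],
    }


all_rows = fetch_all(fake_fetch)
print(len(all_rows))
```

With the real API, `fetch_page` would wrap the existing `requests.get(data_endpoint, ...)` call with `offset` and `limit` added to `params`, plus a pause between calls to stay within the rate limit.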


### We've collected daily weather summary data for CA in standard units, for the provided datatype ids, between 2013 and 2023. We can now use pandas to transform this into a dataframe, where we change datatypes and columns, and remove rows of data that have null values.

In [10]:
CA_NOAA_df = pd.json_normalize(all_results)
CA_NOAA_df.head()

Unnamed: 0,date,datatype,station,attributes,value
0,2013-01-01T00:00:00,PRCP,GHCND:US1CAAL0001,",,N,0700",0.0
1,2013-01-01T00:00:00,SNOW,GHCND:US1CAAL0001,",,N,0700",0.0
2,2013-01-01T00:00:00,PRCP,GHCND:US1CAAL0003,",,N,0800",0.0
3,2013-01-01T00:00:00,SNOW,GHCND:US1CAAL0003,",,N,0800",0.0
4,2013-01-01T00:00:00,PRCP,GHCND:US1CAAL0004,",,N,0700",0.0


### To summarize, the data comes from [NOAA's NCEI API](https://www.ncdc.noaa.gov/cdo-web/webservices/v2), which provides archived weather data across the United States. This data was collected for CA from 2013 to 2023, and the dataset has five variables: date, datatype, station, attributes, and the value associated with each datatype. The attributes column is a comma-separated code made up of a measurement flag, quality flag, source flag, and time of observation. I won't detail them here, but you can find more information [here](https://docs.ropensci.org/rnoaa/articles/ncdc_attributes.html).
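
As a rough illustration of that layout, the attribute string can be split on commas into its four positions. The field order follows the description above; `parse_attributes` and the `Attributes` tuple are hypothetical helpers, and the sketch assumes the four-part layout holds for every row.

```python
from typing import NamedTuple, Optional


class Attributes(NamedTuple):
    measurement_flag: Optional[str]
    quality_flag: Optional[str]
    source_flag: Optional[str]
    observation_time: Optional[str]  # HHMM, e.g. "0700"


def parse_attributes(raw: str) -> Attributes:
    """Split a GHCND attribute string such as ',,N,0700' into named fields."""
    # Empty positions become None so they are easy to test for.
    parts = [p.strip() or None for p in raw.split(",")]
    # Pad so that shorter strings still fill all four fields.
    parts += [None] * (4 - len(parts))
    return Attributes(*parts[:4])


print(parse_attributes(",,N,0700"))
```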

In [11]:
CA_NOAA_df.shape

(11000, 5)

In [12]:
CA_NOAA_df.isna().sum()

date          0
datatype      0
station       0
attributes    0
value         0
dtype: int64

In [13]:
CA_NOAA_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11000 entries, 0 to 10999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        11000 non-null  object 
 1   datatype    11000 non-null  object 
 2   station     11000 non-null  object 
 3   attributes  11000 non-null  object 
 4   value       11000 non-null  float64
dtypes: float64(1), object(4)
memory usage: 429.8+ KB


In [14]:
CA_NOAA_df['date'] = pd.to_datetime(CA_NOAA_df['date'])
# Keep 'value' as float: casting to int would truncate fractional readings such as 0.12 in of rain
CA_NOAA_df['value'] = CA_NOAA_df['value'].astype(float)
CA_NOAA_df.head()

Unnamed: 0,date,datatype,station,attributes,value
0,2013-01-01,PRCP,GHCND:US1CAAL0001,",,N,0700",0.0
1,2013-01-01,SNOW,GHCND:US1CAAL0001,",,N,0700",0.0
2,2013-01-01,PRCP,GHCND:US1CAAL0003,",,N,0800",0.0
3,2013-01-01,SNOW,GHCND:US1CAAL0003,",,N,0800",0.0
4,2013-01-01,PRCP,GHCND:US1CAAL0004,",,N,0700",0.0


In [15]:
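# Some requested variables are missing from the results: average wind speed (AWND), total sunshine (TSUN), smoke/haze (WT08), and average temperature (TAVG).
# Each request above returned exactly 1,000 records (the limit), so these counts may cover only the first days of each year.
CA_NOAA_df.value_counts("datatype")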

datatype
PRCP    5794
SNOW    3728
TMAX     504
TMIN     501
SNWD     473
Name: count, dtype: int64
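
The API returns one row per station, date, and datatype. For joining against tracking data, a wide layout with one column per datatype is often easier to work with. A minimal sketch of that reshaping is shown below on a few illustrative rows; `long_df` and `wide_df` are hypothetical names and the values are made up.

```python
import pandas as pd

# A few rows in the same long layout as CA_NOAA_df (values are illustrative).
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01"] * 4),
    "station": ["A", "A", "B", "B"],
    "datatype": ["PRCP", "TMAX", "PRCP", "SNOW"],
    "value": [0.0, 61.0, 0.12, 0.0],
})

# One row per (station, date), one column per datatype; missing readings become NaN.
wide_df = long_df.pivot_table(
    index=["station", "date"],
    columns="datatype",
    values="value",
    aggfunc="first",
).reset_index()
wide_df.columns.name = None
print(wide_df)
```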

### Let's move on to collecting data from Movebank, where we'll retrieve data related to animal movements.

In [16]:
movebank_base_url = "https://www.movebank.org/movebank/service/"
mb_email = "aw399@humboldt.edu"
mb_username = "aw399"
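import os
# Read the password from the environment instead of hardcoding it; MOVEBANK_PASSWORD is a hypothetical variable name
mb_password = os.environ["MOVEBANK_PASSWORD"]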

### Let's check which attributes are available so we know how to filter the studies request before requesting the associated tracking data.

In [17]:
attributes_endpoint = movebank_base_url + "direct-read?attributes"
mb_attr_response = requests.get(attributes_endpoint, auth=HTTPBasicAuth(mb_username, mb_password))
mb_attr_response.text

'Specify one of the following entity types: deployment, deployment, event, event_reduced, individual, individual, sensor, study, study_attribute, tag, tag, tag_type, taxon\nOptional parameter: header-format=underscore\nWhen connecting via https you must specify the parameters user=... and password=...\nSpecify output attributes and filter conditions depending on the entity type.\n\nentity-type=study\nOutput attributes: access_profile_download_id, access_profile_id, acknowledgements, citation, contact_person_id, contact_person_name, default_profile_eventdata_id, default_profile_refdata_id, external_id, external_id_namespace_id, go_public_date, grants_used, has_quota, i_am_collaborator, i_am_owner, i_can_see_data, i_have_download_access, id, is_test, license_terms, license_type, main_location_lat, main_location_long, name, number_of_deployed_locations, number_of_deployments, number_of_individuals, number_of_tags, principal_investigator_address, principal_investigator_email, principal_inv

### Let's collect the studies where I can see the data and have download access. We authenticate with our username and password, read the CSV response into a pandas dataframe, and filter it by deployment timestamps, main location longitude/latitude, and whether the study name suggests it covers California or its native animals.

In [18]:
from datetime import datetime
from requests.auth import HTTPBasicAuth

mb_studies_endpoint = movebank_base_url + "direct-read"

mb_params = {
    "entity_type": "study",
    "i_can_see_data": "true",
    "i_have_download_access": "true"
}

mb_studies_response = requests.get(
    mb_studies_endpoint,
    params=mb_params,
    auth=HTTPBasicAuth(mb_username, mb_password)
)

mb_df_studies = pd.read_csv(StringIO(mb_studies_response.text))

mb_df_studies["timestamp_first_deployed_location"] = pd.to_datetime(mb_df_studies["timestamp_first_deployed_location"], errors='coerce')
mb_df_studies["timestamp_last_deployed_location"] = pd.to_datetime(mb_df_studies["timestamp_last_deployed_location"], errors='coerce')

mb_start = pd.to_datetime("2013-01-01")
mb_end = pd.to_datetime("2023-12-31")

mb_df_filtered = mb_df_studies[
    (mb_df_studies["timestamp_last_deployed_location"] >= mb_start) &
    (mb_df_studies["timestamp_first_deployed_location"] <= mb_end)
]

mb_df_filtered = mb_df_filtered[
    (mb_df_filtered["main_location_lat"] >= 32.0) &
    (mb_df_filtered["main_location_lat"] <= 42.0) &
    (mb_df_filtered["main_location_long"] >= -125.0) &
    (mb_df_filtered["main_location_long"] <= -114.0)
]

mb_df_filtered = mb_df_filtered[mb_df_filtered["name"].str.contains("california|sierra|bay area|america|us|ca", case=False, na=False)]

mb_df_filtered.head(20)

# Download to analyze and filter down the studies we'll be using and animals with data we can work with
# mb_df_filtered.to_csv("movebank_studies.csv")

Unnamed: 0,acknowledgements,citation,go_public_date,grants_used,has_quota,i_am_owner,id,is_test,license_terms,license_type,...,there_are_data_which_i_cannot_see,i_have_download_access,i_am_collaborator,study_permission,timestamp_first_deployed_location,timestamp_last_deployed_location,number_of_deployed_locations,taxon_ids,sensor_type_ids,contact_person_name
147,,"Huysman AE, Castañeda XA, Johnson MD. 2021. D...",,,True,False,1426277950,False,,CC_BY,...,False,True,False,na,2016-03-17 16:40:00,2018-07-18 13:09:00,34012.0,Tyto furcata,GPS,ahuysman (Allison Huysman)
290,Support for this study was provided by the S. ...,"BA Barbaree, ME Reiter, CM Hickey, GW Page. 2...",,Grant to the Migratory Bird Conservation Partn...,True,False,1419917362,False,,CC_BY,...,False,True,False,na,2013-02-08 00:00:00,2013-04-23 00:00:00,212.0,"Calidris alpina,Limnodromus scolopaceus",Radio Transmitter,auoptimo (Blake Barbaree)
371,,A more complete version of this dataset is sto...,,,True,False,217784323,False,,CC_BY,...,False,True,False,na,2003-11-05 15:00:00,2016-12-08 14:04:04,679402.0,"Cathartes aura,Coragyps atratus",GPS,dbarber (David Barber)
429,C. G. Putnam provided the template for Figure ...,"Barbaree BA, Reiter ME, Hickey CM, Page GW. 20...",,Support for this study was provided by the S. ...,True,False,1437646021,False,,CC_BY,...,False,True,False,na,2012-08-19 07:00:00,2014-03-18 07:00:00,475.0,Limnodromus scolopaceus,Radio Transmitter,auoptimo (Blake Barbaree)
570,,"Bildstein KL, Barber D, Bechard MJ, Graña Gri...",,,True,False,481458,False,,CC_BY,...,False,True,False,na,2003-11-05 15:00:00,2025-04-07 05:00:00,2217579.0,"Cathartes aura,Coragyps atratus","GPS,Accessory Measurements",dbarber (David Barber)
595,,Bloom PH (2015) Northward summer migration of ...,,,True,False,1073231887,False,,CC_BY,...,False,True,False,na,2007-08-22 00:00:00,2017-06-04 01:32:03,43006.0,Buteo jamaicensis,"GPS,Argos Doppler Shift",bloombio (Pete Bloom)
676,Work was conducted under permits to JA Estep (...,"Fleishman E, Anderson J, Dickson BG, Krolick D...",,Support for this work was provided by Brookfie...,True,False,164144882,False,These data have been published by the Movebank...,CUSTOM,...,False,True,False,na,2011-06-17 16:00:00,2013-08-28 06:00:00,18828.0,Buteo swainsoni,"GPS,Radio Transmitter",efleishman (Erica Fleishman)
697,We thank the Acopian Family for providing Hawk...,A more complete version of this dataset is sto...,,This research was supported in part by NASA un...,True,False,16880941,False,,CC_0,...,False,True,False,na,2003-11-14 16:00:00,2013-03-19 03:00:00,215719.0,Cathartes aura,GPS,dbarber (David Barber)
771,\tPlease include your email address when reque...,"Irvine LM, Palacios DM, Lagerquist BA, Mate BR...",,,True,False,943824007,False,,CC_BY,...,False,True,False,na,2014-08-04 18:44:30,2015-08-06 20:34:53,17150.0,"Balaenoptera physalus,Balaenoptera musculus",GPS,mmiwtg (Barb Lagerquist)
794,,"Serieys LEK, Matsushima SS, Wilmers CC. 2024. ...",,,True,False,5175345606,False,,CC_0,...,False,True,False,na,2017-06-03 21:00:27,2018-12-29 07:55:08,647712.0,"Lynx rufus,Urocyon cinereoargenteus","GPS,Acceleration",ucsc@bobcat (Laurel Serieys)
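
One caveat with the name filter: short tokens such as `us` and `ca` match inside longer words, so the keyword step passes almost any study that cleared the coordinate filter. A possible tightening, shown here with the standard `re` module, is to require word boundaries. The `strict` pattern and the two made-up study names are illustrative assumptions, not part of the original analysis.

```python
import re

# The pattern used in the filter above, applied case-insensitively.
loose = re.compile(r"california|sierra|bay area|america|us|ca", re.IGNORECASE)

# Same keywords, but each must appear as a whole word ("us" or "usa", "ca", ...).
strict = re.compile(r"\b(?:california|sierra|bay area|america|usa?|ca)\b", re.IGNORECASE)

names = [
    "Barn Owl Breeding Napa Valley, California",
    "Vultures Acopian Center USA 2003-2016",
    "Cactus wren dispersal",      # made-up name: "ca" appears only inside a word
    "Museum specimen telemetry",  # made-up name: "us" appears only inside a word
]

for name in names:
    print(f"{name!r}: loose={bool(loose.search(name))}, strict={bool(strict.search(name))}")
```

The same pattern can be passed to `Series.str.contains` with `case=False`, since the non-capturing group avoids the capture-group warning pandas would otherwise raise.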


### We'll be focusing on tracking data starting from 2013 onward. Since not all studies span the full decade, we’ll later trim the NOAA weather dataset to align with the actual periods of movement data. Based on our metadata review, the species we’ll analyze include Barn Owls, Vultures, Red-tailed Hawks, Bobcats, Grey Foxes, and potentially Blue and Fin Whales.

### To simplify the data retrieval process, we’ll use pre-downloaded CSVs from Movebank’s archive rather than building a separate function to fetch each study. Below is a reference dictionary with the common animal names as keys and their corresponding Movebank study IDs (where applicable). We’ll load these files into dataframes for further analysis.

In [19]:
# Map each animal (key) to its Movebank study ID (value); the data files and reference guides themselves were downloaded manually
mb_studies = {
    "BarnOwl": 1426277950,
    "Vultures": 217784323,
    "SwainsonHawk": "Found separately online through their archives",
    "Bobcat": 5175345606,
    "Whales": 1027467132  
}

### The selected Movebank studies reflect a diverse range of ecological and conservation-oriented research across different species in California and surrounding regions. After reviewing the abstracts and descriptions for each study, it's clear that researchers had various objectives; many sought to understand how wildlife movements and mortality are influenced by human development and environmental pressures.

Some studies focused on habitat fragmentation and the effects of urban sprawl (e.g., Barn Owls and Bobcats), while others monitored the impact of wind energy infrastructure on species like vultures and hawks. The whale study provides marine movement data in response to oceanographic and acoustic conditions, and Swainson's Hawk tracking adds migratory insight, though it was retrieved from a separate archive. Across all these projects, researchers aim to inform conservation planning, identify critical habitat areas, and explore the broader ecological consequences of human activity.

### Below are links and references to the studies for further reading:

- [Wing size but not wing shape is related to migratory behavior in a soaring bird](https://datarepository.movebank.org/entities/datapackage/21e96dfb-6323-4189-9039-557d6fbe34eb)
- [Habitat selection by a predator of rodent pests is resilient to wildfire in a vineyard agroecosystem](https://datarepository.movebank.org/entities/datapackage/128fb535-f0e2-4693-a225-531c97978dbc)
- [Space use by Swainson's hawk (Buteo swainsoni) in the Natomas Basin, California](https://datarepository.movebank.org/entities/datapackage/fb9f260b-b3fa-4c3b-a059-847341c43998)
- [Study "Bobcat habitat connectivity study in central California"](https://datarepository.movebank.org/entities/datapackage/ed70c4c7-6017-45c1-90b7-5fff24ea7d85)
- [Study "Blue and fin whales Southern California 2014-2015 - Argos data"](https://datarepository.movebank.org/entities/datapackage/4bd42b30-98f4-473d-b931-4c13b4ed3481)

In [20]:
# Reading the csv files for each study into their corresponding dataframe
blueFinWhales_df = pd.read_csv("./MovebankData/Blue and fin whales Southern California 2014-2015 - Argos data.csv")
blueFinWhales_df.head()
#blueFinWhales_df.shape

Unnamed: 0,event-id,visible,timestamp,location-long,location-lat,argos:best-level,argos:calcul-freq,argos:error-radius,argos:gdop,argos:iq,...,argos:sat-id,argos:semi-major,argos:semi-minor,height-above-ellipsoid,manually-marked-outlier,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name
0,13595136965,True,2014-08-04 21:16:56.000,-119.0726,33.99902,-126.0,401677402.3,1720.0,699.0,48.0,...,NP,8398.0,352.0,70.0,,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...
1,13595136966,True,2014-08-04 22:05:49.000,-119.07017,34.00611,-136.0,401677397.7,3456.0,2399.0,8.0,...,NN,5854.0,2040.0,91.0,,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...
2,13595136967,True,2014-08-04 22:58:10.000,-119.04418,34.00649,-128.0,401677400.7,3831.0,913.0,48.0,...,NP,13045.0,1125.0,150.0,,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...
3,13595136968,True,2014-08-04 23:07:29.000,-119.0372,33.96855,-136.0,401677403.8,3703.0,1997.0,0.0,...,NK,6311.0,2172.0,162.0,,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...
4,13595136969,True,2014-08-04 23:46:12.000,-119.01175,34.02623,-124.0,401677399.4,1716.0,1294.0,58.0,...,NN,18314.0,160.0,83.0,,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...


In [21]:
barnOwl_df = pd.read_csv("./MovebankData/Barn Owl Breeding Napa Valley California.csv")
barnOwl_df.head()
#barnOwl_df.shape

Unnamed: 0,event-id,visible,timestamp,location-long,location-lat,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name
0,17869541149,True,2018-04-24 19:03:00.000,-122.2469,38.208567,gps,Tyto furcata,URI20,Amc4,"Barn Owl Breeding Napa Valley, California"
1,17869541150,True,2018-04-24 19:04:00.000,-122.246933,38.20855,gps,Tyto furcata,URI20,Amc4,"Barn Owl Breeding Napa Valley, California"
2,17869541151,True,2018-04-24 19:05:00.000,-122.2469,38.208567,gps,Tyto furcata,URI20,Amc4,"Barn Owl Breeding Napa Valley, California"
3,17869541152,True,2018-04-24 19:06:00.000,-122.246883,38.208567,gps,Tyto furcata,URI20,Amc4,"Barn Owl Breeding Napa Valley, California"
4,17869541154,True,2018-04-24 19:40:13.000,,,gps,Tyto furcata,URI20,Amc4,"Barn Owl Breeding Napa Valley, California"


In [22]:
vultures_df = pd.read_csv("./MovebankData/Vultures Acopian Center USA 2003-2016.csv", low_memory=False)
vultures_df.head()
#vultures_df.shape

Unnamed: 0,event-id,visible,timestamp,location-long,location-lat,algorithm-marked-outlier,gps:hdop,gps:satellite-count,gps-time-to-fix,gps:vdop,...,height-raw,location-error-text,manually-marked-outlier,raptor-workshop:migration-state,vertical-error-numerical,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name
0,2165431108,True,2004-09-06 17:00:00.000,-75.28533,40.778,,,,,,...,,,,,,gps,Cathartes aura,52067,Irma,Vultures Acopian Center USA 2003-2016
1,2165431109,True,2004-09-06 18:00:00.000,-75.28533,40.77817,,,,,,...,,,,,,gps,Cathartes aura,52067,Irma,Vultures Acopian Center USA 2003-2016
2,2165431110,True,2004-09-06 19:00:00.000,-75.28933,40.77433,,,,,,...,,,,,,gps,Cathartes aura,52067,Irma,Vultures Acopian Center USA 2003-2016
3,2165431111,True,2004-09-06 20:00:00.000,-75.289,40.77433,,,,,,...,,,,,,gps,Cathartes aura,52067,Irma,Vultures Acopian Center USA 2003-2016
4,2165431112,True,2004-09-07 00:00:00.000,-75.289,40.77417,,,,,,...,,,,,,gps,Cathartes aura,52067,Irma,Vultures Acopian Center USA 2003-2016


In [23]:
swainsonHawk_df = pd.read_csv("./MovebankData/Space use by Swainson's Hawk (Buteo swainsoni) in the Natomas Basin, California.csv")
swainsonHawk_df.head()
#swainsonHawk_df.shape

Unnamed: 0,event-id,visible,timestamp,location-long,location-lat,ground-speed,heading,height-above-ellipsoid,migration-stage,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name
0,1583635119,True,2011-06-25 16:00:00.000,-121.62333,38.69267,5.1444,149.0,80.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
1,1583635120,True,2011-06-25 17:00:00.000,-121.624,38.69433,5.1444,150.0,160.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
2,1583635121,True,2011-06-25 18:00:00.000,-121.625,38.69383,5.1444,226.0,190.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
3,1583635122,True,2011-06-25 19:00:00.000,-121.623,38.69117,5.65884,210.0,280.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...
4,1583635123,True,2011-06-25 20:00:00.000,-121.6255,38.69267,8.74548,16.0,220.0,nestling,gps,Buteo swainsoni,105921,105921,Space use by Swainson's Hawk (Buteo swainsoni)...


In [24]:
bobcat_df = pd.read_csv("./MovebankData/Bobcat habitat connectivity study in central California-gps.csv", low_memory=False, nrows=500000)
bobcat_df.head()
#bobcat_df.shape

Unnamed: 0,event-id,visible,timestamp,location-long,location-lat,data-decoding-software,eobs:battery-voltage,eobs:fix-battery-voltage,eobs:horizontal-accuracy-estimate,eobs:key-bin-checksum,...,eobs:used-time-to-get-fix,ground-speed,heading,height-above-ellipsoid,manually-marked-outlier,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name
0,36972065122,True,2018-06-11 20:00:23.000,-121.578366,36.884591,,3608,3480,7.68,3066544741,...,22,0.01,0.0,144.4,,gps,Lynx rufus,6387,B27M,Bobcat habitat connectivity study in central C...
1,36972026092,True,2018-06-11 20:10:12.000,-121.578344,36.884537,,3607,3480,8.19,1388873481,...,11,0.11,339.48,147.9,,gps,Lynx rufus,6387,B27M,Bobcat habitat connectivity study in central C...
2,36972035576,True,2018-06-11 20:15:14.000,-121.578374,36.884538,,3603,3460,12.29,1637057194,...,13,0.19,0.0,138.6,,gps,Lynx rufus,6387,B27M,Bobcat habitat connectivity study in central C...
3,36972051901,True,2018-06-11 20:45:18.000,-121.578323,36.884473,,3604,3475,34.3,894991202,...,17,0.2,0.0,147.4,,gps,Lynx rufus,6387,B27M,Bobcat habitat connectivity study in central C...
4,36972064911,True,2018-06-11 21:00:23.000,-121.578321,36.884493,,3602,3448,32.0,672027911,...,22,0.44,340.79,135.8,,gps,Lynx rufus,6387,B27M,Bobcat habitat connectivity study in central C...


### We inspect each dataframe's structure, converting datatypes where needed

In [25]:
barnOwl_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53560 entries, 0 to 53559
Data columns (total 10 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   event-id                         53560 non-null  int64  
 1   visible                          53560 non-null  bool   
 2   timestamp                        53560 non-null  object 
 3   location-long                    34012 non-null  float64
 4   location-lat                     34012 non-null  float64
 5   sensor-type                      53560 non-null  object 
 6   individual-taxon-canonical-name  53560 non-null  object 
 7   tag-local-identifier             53560 non-null  object 
 8   individual-local-identifier      53560 non-null  object 
 9   study-name                       53560 non-null  object 
dtypes: bool(1), float64(2), int64(1), object(6)
memory usage: 3.7+ MB


In [34]:
barnOwl_df['timestamp'] = pd.to_datetime(barnOwl_df['timestamp'])

In [28]:
swainsonHawk_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18828 entries, 0 to 18827
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   event-id                         18828 non-null  int64  
 1   visible                          18828 non-null  bool   
 2   timestamp                        18828 non-null  object 
 3   location-long                    18828 non-null  float64
 4   location-lat                     18828 non-null  float64
 5   ground-speed                     14289 non-null  float64
 6   heading                          18828 non-null  float64
 7   height-above-ellipsoid           18827 non-null  float64
 8   migration-stage                  18828 non-null  object 
 9   sensor-type                      18828 non-null  object 
 10  individual-taxon-canonical-name  18828 non-null  object 
 11  tag-local-identifier             18828 non-null  object 
 12  individual-local-i

In [35]:
swainsonHawk_df['timestamp'] = pd.to_datetime(swainsonHawk_df['timestamp'])

In [32]:
blueFinWhales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4303 entries, 0 to 4302
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   event-id                         4303 non-null   int64  
 1   visible                          4303 non-null   bool   
 2   timestamp                        4303 non-null   object 
 3   location-long                    4284 non-null   float64
 4   location-lat                     4284 non-null   float64
 5   argos:best-level                 4297 non-null   float64
 6   argos:calcul-freq                4303 non-null   float64
 7   argos:error-radius               4284 non-null   float64
 8   argos:gdop                       4284 non-null   float64
 9   argos:iq                         4284 non-null   float64
 10  argos:lat1                       4284 non-null   float64
 11  argos:lat2                       4284 non-null   float64
 12  argos:lc            

In [36]:
blueFinWhales_df['timestamp'] = pd.to_datetime(blueFinWhales_df['timestamp'])

In [37]:
vultures_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686382 entries, 0 to 686381
Data columns (total 23 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   event-id                         686382 non-null  int64  
 1   visible                          686382 non-null  bool   
 2   timestamp                        686382 non-null  object 
 3   location-long                    684217 non-null  float64
 4   location-lat                     684217 non-null  float64
 5   algorithm-marked-outlier         6970 non-null    object 
 6   gps:hdop                         116035 non-null  float64
 7   gps:satellite-count              113809 non-null  float64
 8   gps-time-to-fix                  16559 non-null   float64
 9   gps:vdop                         116035 non-null  float64
 10  ground-speed                     532574 non-null  float64
 11  heading                          511712 non-null  float64
 12  he

In [39]:
vultures_df['timestamp'] = pd.to_datetime(vultures_df['timestamp'])

In [40]:
bobcat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   event-id                           500000 non-null  int64  
 1   visible                            500000 non-null  bool   
 2   timestamp                          500000 non-null  object 
 3   location-long                      493776 non-null  float64
 4   location-lat                       493776 non-null  float64
 5   data-decoding-software             61395 non-null   float64
 6   eobs:battery-voltage               500000 non-null  int64  
 7   eobs:fix-battery-voltage           500000 non-null  int64  
 8   eobs:horizontal-accuracy-estimate  493776 non-null  float64
 9   eobs:key-bin-checksum              500000 non-null  int64  
 10  eobs:speed-accuracy-estimate       493776 non-null  float64
 11  eobs:start-timestamp               5000

In [42]:
bobcat_df['timestamp'] = pd.to_datetime(bobcat_df['timestamp'])
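The five conversions above repeat the same call, so they could be folded into one helper. The sketch below is only an illustration: `parse_timestamps` and the toy frames are hypothetical names, and the explicit format string is an assumption based on the `YYYY-MM-DD HH:MM:SS.fff` pattern visible in the raw Movebank previews. With `errors="coerce"`, unparseable values become `NaT` instead of raising, so they can be counted afterwards.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (the second bobcat value is deliberately invalid).
toy_frames = {
    "hawk": pd.DataFrame({"timestamp": ["2011-06-25 16:00:00.000", "2011-06-25 17:00:00.000"]}),
    "bobcat": pd.DataFrame({"timestamp": ["2018-06-11 20:00:23.000", "not a date"]}),
}

def parse_timestamps(df, column="timestamp"):
    """Return a copy of df with `column` parsed to datetimes (hypothetical helper)."""
    out = df.copy()
    out[column] = pd.to_datetime(
        out[column],
        format="%Y-%m-%d %H:%M:%S.%f",  # assumed format, matching the previews above
        errors="coerce",                # invalid strings become NaT instead of raising
    )
    return out

parsed = {name: parse_timestamps(df) for name, df in toy_frames.items()}

# Report the resulting dtype and how many values failed to parse
for name, df in parsed.items():
    print(name, df["timestamp"].dtype, int(df["timestamp"].isna().sum()))
```

In the notebook, the same helper could be applied to each of the real dataframes in place of the individual `pd.to_datetime` cells.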

### Next, we check each column for null values. Rows with a missing longitude or latitude should be dropped, since we want to map locations later on.

In [48]:
barnOwl_df.isna().sum()

event-id                               0
visible                                0
timestamp                              0
location-long                      19548
location-lat                       19548
sensor-type                            0
individual-taxon-canonical-name        0
tag-local-identifier                   0
individual-local-identifier            0
study-name                             0
dtype: int64

In [49]:
vultures_df.isna().sum()

event-id                                0
visible                                 0
timestamp                               0
location-long                        2165
location-lat                         2165
algorithm-marked-outlier           679412
gps:hdop                           570347
gps:satellite-count                572573
gps-time-to-fix                    669823
gps:vdop                           570347
ground-speed                       153808
heading                            174670
height-above-ellipsoid             685518
height-raw                         254976
location-error-text                669823
manually-marked-outlier            686372
raptor-workshop:migration-state    486815
vertical-error-numerical           686332
sensor-type                             0
individual-taxon-canonical-name         0
tag-local-identifier                    0
individual-local-identifier             0
study-name                              0
dtype: int64

In [50]:
bobcat_df.isna().sum()

event-id                                  0
visible                                   0
timestamp                                 0
location-long                          6224
location-lat                           6224
data-decoding-software               438605
eobs:battery-voltage                      0
eobs:fix-battery-voltage                  0
eobs:horizontal-accuracy-estimate      6224
eobs:key-bin-checksum                     0
eobs:speed-accuracy-estimate           6224
eobs:start-timestamp                      0
eobs:status                               0
eobs:temperature                          0
eobs:type-of-fix                          0
eobs:used-time-to-get-fix                 0
ground-speed                           6224
heading                                6224
height-above-ellipsoid                 6224
manually-marked-outlier              499971
sensor-type                               0
individual-taxon-canonical-name           0
tag-local-identifier            

In [51]:
blueFinWhales_df.isna().sum()

event-id                              0
visible                               0
timestamp                             0
location-long                        19
location-lat                         19
argos:best-level                      6
argos:calcul-freq                     0
argos:error-radius                   19
argos:gdop                           19
argos:iq                             19
argos:lat1                           19
argos:lat2                           19
argos:lc                             27
argos:lon1                           19
argos:lon2                           19
argos:nb-mes                          0
argos:nb-mes-120                      6
argos:nopc                           19
argos:sat-id                          0
argos:semi-major                     19
argos:semi-minor                     19
height-above-ellipsoid               19
manually-marked-outlier            4296
sensor-type                           0
individual-taxon-canonical-name       0


In [52]:
swainsonHawk_df.isna().sum()

event-id                              0
visible                               0
timestamp                             0
location-long                         0
location-lat                          0
ground-speed                       4539
heading                               0
height-above-ellipsoid                1
migration-stage                       0
sensor-type                           0
individual-taxon-canonical-name       0
tag-local-identifier                  0
individual-local-identifier           0
study-name                            0
dtype: int64

In [53]:
# Drop rows missing either coordinate, since they cannot be placed on a map
blueFinWhales_df = blueFinWhales_df.dropna(subset=["location-lat", "location-long"])
barnOwl_df = barnOwl_df.dropna(subset=["location-lat", "location-long"])
vultures_df = vultures_df.dropna(subset=["location-lat", "location-long"])
swainsonHawk_df = swainsonHawk_df.dropna(subset=["location-lat", "location-long"])
bobcat_df = bobcat_df.dropna(subset=["location-lat", "location-long"])
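Dropping rows is not free: the barn owl data, for instance, loses 19,548 of its 53,560 rows (about 36%) because of missing coordinates. A minimal sketch of how the loss could be reported while dropping is shown below. `drop_missing_coords` and the toy dataframe are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy dataframe with two rows missing at least one coordinate
toy_df = pd.DataFrame({
    "event-id": [1, 2, 3, 4],
    "location-long": [-121.62, np.nan, -121.63, np.nan],
    "location-lat": [38.69, np.nan, 38.70, 38.71],
})

def drop_missing_coords(df, coord_cols=("location-lat", "location-long")):
    """Drop rows lacking any coordinate and return (cleaned copy, rows dropped)."""
    cleaned = df.dropna(subset=list(coord_cols)).copy()
    return cleaned, len(df) - len(cleaned)

cleaned_df, n_dropped = drop_missing_coords(toy_df)
print(f"dropped {n_dropped} of {len(toy_df)} rows ({n_dropped / len(toy_df):.1%})")
```

Printing the share of rows lost per species makes it easier to judge whether a dataset is still large enough for the later mapping steps.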

### The five dataframes share a set of common columns. We add a new species column to record which animal each row belongs to, and keep the identifiers that distinguish the individual animals tracked in each study.

In [55]:
# Define common columns
common_cols = [
    "timestamp", "location-lat", "location-long",
    "event-id", "sensor-type", "individual-taxon-canonical-name",
    "tag-local-identifier", "individual-local-identifier",
    "study-name", "visible"
]

# Label every row with a readable species name
blueFinWhales_df["species"] = "Blue/Fin Whale"
barnOwl_df["species"] = "Barn Owl"
vultures_df["species"] = "Vulture"
swainsonHawk_df["species"] = "Swainson's Hawk"
bobcat_df["species"] = "Bobcat"

# Keep only the shared columns plus the new species label
blueFinWhales_df = blueFinWhales_df[common_cols + ["species"]]
barnOwl_df = barnOwl_df[common_cols + ["species"]]
vultures_df = vultures_df[common_cols + ["species"]]
swainsonHawk_df = swainsonHawk_df[common_cols + ["species"]]
bobcat_df = bobcat_df[common_cols + ["species"]]
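The tag-then-select pattern above could also be written as one small function that fails loudly when a shared column is missing. This is a sketch under assumptions: `tag_and_select` is a hypothetical helper, and `COMMON_COLS` is shortened to four columns for the toy example (the notebook's `common_cols` has ten).

```python
import pandas as pd

# Shortened, illustrative column list; the notebook's common_cols has ten entries
COMMON_COLS = ["timestamp", "location-lat", "location-long", "event-id"]

def tag_and_select(df, species, cols=COMMON_COLS):
    """Return a new dataframe with only `cols` plus a constant species label."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"{species}: missing columns {missing}")
    # assign() returns a new dataframe, so the input is left unmodified
    return df[list(cols)].assign(species=species)

# Toy row based on the first Swainson's Hawk record shown earlier
toy_hawk = pd.DataFrame({
    "timestamp": ["2011-06-25 16:00:00.000"],
    "location-lat": [38.69267],
    "location-long": [-121.62333],
    "event-id": [1583635119],
    "ground-speed": [5.1444],
})

toy_hawk_tagged = tag_and_select(toy_hawk, "Swainson's Hawk")
print(toy_hawk_tagged.columns.tolist())
```

Because the helper returns a new dataframe rather than overwriting its input, the original per-study columns such as `ground-speed` stay available for later use.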

### Let's now combine them into one dataframe, leaving us with two datasets to work with in total. The per-species dataframes are still available if needed, and if the studies overlap in time it would be interesting to visualize their movements together.

In [56]:
all_animals_df = pd.concat([
    blueFinWhales_df,
    barnOwl_df,
    vultures_df,
    swainsonHawk_df,
    bobcat_df
], ignore_index=True)

all_animals_df.head()

Unnamed: 0,timestamp,location-lat,location-long,event-id,sensor-type,individual-taxon-canonical-name,tag-local-identifier,individual-local-identifier,study-name,visible,species
0,2014-08-04 21:16:56,33.99902,-119.0726,13595136965,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...,True,Blue/Fin Whale
1,2014-08-04 22:05:49,34.00611,-119.07017,13595136966,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...,True,Blue/Fin Whale
2,2014-08-04 22:58:10,34.00649,-119.04418,13595136967,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...,True,Blue/Fin Whale
3,2014-08-04 23:07:29,33.96855,-119.0372,13595136968,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...,True,Blue/Fin Whale
4,2014-08-04 23:46:12,34.02623,-119.01175,13595136969,argos-doppler-shift,Balaenoptera musculus,2014CA-MK10-05644,2014CA-Bmu-05644,Blue and fin whales Southern California 2014-2...,True,Blue/Fin Whale


In [57]:
all_animals_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1235117 entries, 0 to 1235116
Data columns (total 11 columns):
 #   Column                           Non-Null Count    Dtype         
---  ------                           --------------    -----         
 0   timestamp                        1235117 non-null  datetime64[ns]
 1   location-lat                     1235117 non-null  float64       
 2   location-long                    1235117 non-null  float64       
 3   event-id                         1235117 non-null  int64         
 4   sensor-type                      1235117 non-null  object        
 5   individual-taxon-canonical-name  1235117 non-null  object        
 6   tag-local-identifier             1235117 non-null  object        
 7   individual-local-identifier      1235117 non-null  object        
 8   study-name                       1235117 non-null  object        
 9   visible                          1235117 non-null  bool          
 10  species                       
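Two quick checks follow from the outputs above. First, the combined row count should equal the sum of the per-species counts after the coordinate drop. Second, whether any studies overlap in time can be read off each species' first and last fix. The sketch below assumes nothing beyond pandas. The row counts are taken from the `info()` and `isna()` outputs, while the toy timestamps and the `time_ranges_overlap` helper are hypothetical.

```python
import pandas as pd

# Rows remaining per species: total rows minus rows with missing coordinates
rows_after_drop = {
    "Blue/Fin Whale": 4303 - 19,
    "Barn Owl": 53560 - 19548,
    "Vulture": 686382 - 2165,
    "Swainson's Hawk": 18828 - 0,
    "Bobcat": 500000 - 6224,
}
expected_total = sum(rows_after_drop.values())
print(expected_total)  # 1235117, matching the RangeIndex of all_animals_df

# Toy combined frame; these date ranges are illustrative, not the real study periods
toy_all = pd.DataFrame({
    "species": ["Swainson's Hawk", "Swainson's Hawk", "Bobcat", "Bobcat",
                "Barn Owl", "Barn Owl"],
    "timestamp": pd.to_datetime([
        "2011-06-25", "2011-09-01",
        "2018-06-11", "2018-12-01",
        "2018-08-01", "2019-02-01",
    ]),
})

# First and last fix per species
spans = toy_all.groupby("species")["timestamp"].agg(first_fix="min", last_fix="max")

def time_ranges_overlap(spans, a, b):
    """True if species a and b have at least one shared instant in their ranges."""
    return bool(
        spans.loc[a, "first_fix"] <= spans.loc[b, "last_fix"]
        and spans.loc[b, "first_fix"] <= spans.loc[a, "last_fix"]
    )

print(time_ranges_overlap(spans, "Bobcat", "Barn Owl"))
print(time_ranges_overlap(spans, "Swainson's Hawk", "Bobcat"))
```

Running the same `groupby` on `all_animals_df` would show which of the real studies share a time window and are therefore candidates for a joint visualization.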

### With our datasets ready, we can begin some basic visualizations and gather descriptive statistics to get a grasp on what we're working with. We should also attempt to build an interactive map showing the change in weather variables alongside animal locations, the two things we'll be seeking correlations between.
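As one possible starting point for the descriptive statistics, the sketch below summarizes each tracked individual by number of fixes, median time between fixes, and latitude/longitude extent. It is an illustration rather than the notebook's method: the toy rows reuse the first hawk and bobcat records shown earlier, and the `fix_interval_min` column is a name introduced here.

```python
import pandas as pd

# Toy rows taken from the first Swainson's Hawk and bobcat records shown earlier
toy = pd.DataFrame({
    "species": ["Swainson's Hawk"] * 3 + ["Bobcat"] * 3,
    "individual-local-identifier": ["105921"] * 3 + ["B27M"] * 3,
    "timestamp": pd.to_datetime([
        "2011-06-25 16:00:00", "2011-06-25 17:00:00", "2011-06-25 18:00:00",
        "2018-06-11 20:00:23", "2018-06-11 20:10:12", "2018-06-11 20:15:14",
    ]),
    "location-lat": [38.69267, 38.69433, 38.69383, 36.884591, 36.884537, 36.884538],
    "location-long": [-121.62333, -121.624, -121.625, -121.578366, -121.578344, -121.578374],
})

# Minutes elapsed since the previous fix of the same individual
toy = toy.sort_values(["individual-local-identifier", "timestamp"])
toy["fix_interval_min"] = (
    toy.groupby("individual-local-identifier")["timestamp"].diff().dt.total_seconds() / 60
)

# One summary row per tracked individual
per_individual = toy.groupby(["species", "individual-local-identifier"]).agg(
    n_fixes=("timestamp", "count"),
    median_interval_min=("fix_interval_min", "median"),
    lat_min=("location-lat", "min"),
    lat_max=("location-lat", "max"),
    long_min=("location-long", "min"),
    long_max=("location-long", "max"),
)
print(per_individual)
```

The sampling interval matters for the later weather join: hourly hawk fixes and bobcat fixes a few minutes apart would need different time tolerances when matched against weather observations.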

### I want to note that our [CDFW](https://data-cdfw.opendata.arcgis.com) data analysis won't become relevant until later in the notebook. It will mostly serve as supporting evidence if we find results suggesting that changes in the atmosphere have been impacting migratory species. It should also help us understand the species better, given that I'm not an ecologist or zoologist.

## Gather Statistics

## Analyze Statistics 

## Use Above to Answer Questions using Inferential Statistics and Prediction

## Answer Questions and Conclude Findings

## References and Citations