# 01 - Filter and resample eBird data for the area of interest

This notebook creates a sample of eBird data for the area of interest, to be analysed in the workflow.

### Description

The goal of this notebook is to prepare a sample dataset to be analysed by the workflow:

- select eBird occurrences that fall inside the polygons of the areas of interest (Salinas Valley and California Valley)
- keep only records that come from bird sampling observation lists

### Input

- Parquet file with eBird dataset for US-CA (prepared using [pre01_convert_ebird_to_parquet](pre01_convert_ebird_to_parquet.ipynb))
- spatial polygons of the Salinas and California valleys
- GeoPackage with polygons of the California Agricultural Valley (source: https://data.cnra.ca.gov/dataset/statewide-crop-mapping). The area contains two subregions: California Valley and Salinas Valley.

### Processing

A spatial polygon layer was created for both the Salinas and California agricultural valleys. Land use areas of type `urban`, identified from the Statewide Crop Mapping data of the California Natural Resources Agency (source: https://data.cnra.ca.gov/dataset/statewide-crop-mapping), were removed from these polygons by a spatial difference operation.

In this notebook, we create a table of bird occurrence points that fall inside
the Salinas Valley and another for the California Valley, and finally
merge both into one dataset.

In [1]:
# import modules
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import geopandas as gpd
from pyproj import CRS  # only CRS is used below
%matplotlib inline

In [2]:
# set the input parquet file containing eBird dataset
in_path = '../external_data/ebd_US-CA_relJun-2022/ebd_US-CA_relJun-2022_2.parquet'

## 1. Process data for Salinas valley

Create a table with eBird occurrences for the Salinas valley.

In [3]:
# read the area polygon and calculate the bounding box - Salinas valley

salinas_area = gpd.read_file("../external_data/gis_california_valley/CA_agricultural_valley.gpkg", 
                             layer='area_salinas__difference')

gdf_salinas_bounds = pd.concat([salinas_area, salinas_area.bounds], axis=1)
gdf_salinas_bounds

Unnamed: 0,id,area,perimeter,geometry,minx,miny,maxx,maxy
0,0,0.256222,3.029087,"MULTIPOLYGON (((-120.88368 35.96933, -120.9033...",-121.893763,35.969284,-120.868632,37.036615


### Select records

Only eBird records that are classified in accordance with the following criteria are included in the bird dataset:

Selection criteria:
- counts: different from X (not presence-only type)
- protocol type: P21, P22
- duration of sampling event: <= 30 min
- Locality type: Personal
- EFFORT DISTANCE KM <= 1 or `NULL` (for cases of P21 locations)
- month of sampling: between April and June, inclusive
- ALL SPECIES REPORTED: 1
- APPROVED: 1

Some of these criteria are applied as filters while reading the Parquet file; the remaining ones (month of sampling and effort distance) are applied in section 4.

In [4]:
# create filter based on the bounding box coordinates of the area polygon and additional filters
filters = [('LATITUDE','>=',35.969284), ('LATITUDE', '<=', 37.036615), # coordinates of the bounding box area
        ('LONGITUDE','>=',-121.893763),('LONGITUDE','<=',-120.868632),
        ('OBSERVATION COUNT','!=','X'),                                # remove records that only indicate presence
        ('LOCALITY TYPE','=','P'),                                     # only consider location of type personal
        ('PROTOCOL CODE','in',['P21', 'P22']),                         # the observation protocol type
        ('DURATION MINUTES','<=',30),                                  # the observation time
        ('ALL SPECIES REPORTED','=',1),                                # full bird lists are created
        ('APPROVED','=',1),                                            # record passes eBird data quality
        ] 
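
The same filter list is typed again for the California valley further down, with only the coordinates changed. A minimal sketch of how the bounding-box part could be generated instead of hard-coded is shown below. The helper `bbox_filters` and the name `protocol_filters` are hypothetical and not part of the original notebook; the coordinates are the Salinas bounds printed in the table above.

```python
def bbox_filters(minx, miny, maxx, maxy, extra=None):
    """Build (column, operator, value) filter tuples for a bounding box.

    Hypothetical helper, not part of the original notebook.
    """
    filters = [
        ('LATITUDE', '>=', miny), ('LATITUDE', '<=', maxy),
        ('LONGITUDE', '>=', minx), ('LONGITUDE', '<=', maxx),
    ]
    # append any additional, area-independent filters
    return filters + list(extra or [])


# sampling protocol filters shared by both areas (copied from the cell above)
protocol_filters = [
    ('OBSERVATION COUNT', '!=', 'X'),
    ('LOCALITY TYPE', '=', 'P'),
    ('PROTOCOL CODE', 'in', ['P21', 'P22']),
    ('DURATION MINUTES', '<=', 30),
    ('ALL SPECIES REPORTED', '=', 1),
    ('APPROVED', '=', 1),
]

# bounds of the Salinas polygon: minx, miny, maxx, maxy
salinas_filters = bbox_filters(-121.893763, 35.969284, -120.868632, 37.036615,
                               extra=protocol_filters)
```

In the notebook itself, the four numbers could presumably be taken from `salinas_area.total_bounds`, which geopandas returns in the order minx, miny, maxx, maxy.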

In [5]:
%%time
# read the parquet file - this may take a while, about 2 minutes
dataset = pq.read_table(in_path, filters=filters)

CPU times: user 13min 55s, sys: 8.25 s, total: 14min 3s
Wall time: 1min 16s


In [6]:
# convert to a pandas dataframe
pdf_salinas = dataset.to_pandas()

In [7]:
# show rows and columns number in the dataset
pdf_salinas.shape


(274177, 41)

There are 274177 records inside the bounding box of the Salinas valley that fulfil the filters set above.

In [8]:
# create a Geopandas DataFrame from eBird occurrence points - Salinas valley

crs = CRS('EPSG:4326')
gdf_salinas = gpd.GeoDataFrame(
    pdf_salinas, geometry=gpd.points_from_xy(pdf_salinas['LONGITUDE'], pdf_salinas['LATITUDE']), crs=crs)

In [9]:
%%time
# clip observation points with the area polygon

salinas_points = gpd.clip(gdf_salinas, salinas_area)

CPU times: user 276 ms, sys: 12 ms, total: 288 ms
Wall time: 383 ms


In [10]:
# show rows and columns number in the dataset
salinas_points.shape

(68873, 42)

There are 68873 records that fall inside the Salinas agricultural valley.

In [11]:
# save points to a csv file
salinas_points.to_csv('../process_data/salinas_points.csv')

## 2. Process data for California valley

Repeat the process, but for the bigger California valley.

In [12]:
# read the area polygon and calculate the bounding box - California Valley

valley_area = gpd.read_file("../external_data/gis_california_valley/CA_agricultural_valley.gpkg", 
                             layer='area_valley__difference')
gdf_valley_bounds = pd.concat([valley_area, valley_area.bounds], axis=1)
gdf_valley_bounds

Unnamed: 0,id,area,perimeter,geometry,minx,miny,maxx,maxy
0,0,6.402203,13.305924,"MULTIPOLYGON (((-119.27752 35.01807, -119.3224...",-122.468536,34.940458,-118.735043,40.215179


In [13]:
# create filter based on the bounding box coordinates of the area polygon
filters = [('LATITUDE','>=',34.940458), ('LATITUDE', '<=', 40.215179), # coordinates of the bounding box area
        ('LONGITUDE','>=',-122.468536),('LONGITUDE','<=',-118.735043),
        ('OBSERVATION COUNT','!=','X'),                                # remove records that only indicate presence
        ('LOCALITY TYPE','=','P'),                                     # only consider location of type personal
        ('PROTOCOL CODE','in',['P21', 'P22']),                         # the observation protocol type
        ('DURATION MINUTES','<=',30),                                  # the observation time
        ('ALL SPECIES REPORTED','=',1),                                # full bird lists are created
        ('APPROVED','=',1)                                             # record passes eBird data quality
       ] 


In [14]:
%%time
# read the parquet file - this may take a while, about 2 minutes.

dataset1 = pq.read_table(in_path, filters=filters)

CPU times: user 14min 39s, sys: 6.18 s, total: 14min 45s
Wall time: 1min 15s


In [15]:
# convert to a pandas dataframe
pdf_valley = dataset1.to_pandas()

In [16]:
# show rows and columns number in the dataset
pdf_valley.shape

(3275482, 41)

There are 3275482 records inside the bounding box of the California valley that fulfil the filters set above.

In [17]:
# create a Geopandas DataFrame from eBird occurrence points - California valley

crs = CRS('EPSG:4326')
gdf_valley = gpd.GeoDataFrame(
    pdf_valley, geometry=gpd.points_from_xy(pdf_valley['LONGITUDE'], pdf_valley['LATITUDE']), crs=crs)

In [18]:
%%time
# clip observation points with the area polygon

valley_points = gpd.clip(gdf_valley, valley_area)

CPU times: user 4.86 s, sys: 205 ms, total: 5.07 s
Wall time: 5.07 s


In [19]:
# show rows and columns number in the dataset
valley_points.shape

(763120, 42)

There are 763120 records that fall inside the California agricultural valley.

In [20]:
# save points to a csv file

valley_points.to_csv('../process_data/valley_points.csv')

## 3. Merge data from both valleys

Merge both tables in one dataset.

In [21]:
# merge the two tables and save
cal_points = pd.concat([salinas_points, valley_points])

In [22]:
# show rows and columns number in the dataset
cal_points.shape

(831993, 42)

There are 831993 records that fall inside the Salinas and California agricultural valleys.
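
`pd.concat` stacks the two tables without removing duplicates, so the merged row count is simply the sum of the inputs (68873 + 763120 = 831993). Whether the two polygons overlap is not stated here; if they did, an occurrence could appear twice. The sketch below shows, on made-up toy tables, how such duplicates could be counted and dropped using the `GLOBAL UNIQUE IDENTIFIER` column.

```python
import pandas as pd

# toy stand-ins for salinas_points and valley_points (values are made up)
salinas_toy = pd.DataFrame({
    'GLOBAL UNIQUE IDENTIFIER': ['OBS1', 'OBS2'],
    'COMMON NAME': ['Mourning Dove', 'Lark Sparrow'],
})
valley_toy = pd.DataFrame({
    'GLOBAL UNIQUE IDENTIFIER': ['OBS2', 'OBS3'],
    'COMMON NAME': ['Lark Sparrow', 'Wood Duck'],
})

merged = pd.concat([salinas_toy, valley_toy])

# number of rows whose identifier already appeared earlier in the table
n_dup = int(merged['GLOBAL UNIQUE IDENTIFIER'].duplicated().sum())

# keep the first row for each identifier
deduped = merged.drop_duplicates(subset='GLOBAL UNIQUE IDENTIFIER')
```

If `n_dup` is zero on the real data, the deduplication step is a no-op and can be left out.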

In [23]:
# see a sample of the dataframe
cal_points.head()

Unnamed: 0,GLOBAL UNIQUE IDENTIFIER,LAST EDITED DATE,TAXONOMIC ORDER,CATEGORY,TAXON CONCEPT ID,COMMON NAME,SCIENTIFIC NAME,SUBSPECIES COMMON NAME,SUBSPECIES SCIENTIFIC NAME,EXOTIC CODE,...,DURATION MINUTES,EFFORT DISTANCE KM,EFFORT AREA HA,NUMBER OBSERVERS,ALL SPECIES REPORTED,GROUP IDENTIFIER,APPROVED,REVIEWED,REASON,geometry
272647,URN:CornellLabOfOrnithology:EBIRD:OBS1446126022,2022-06-02 18:22:45.475308,26451,species,avibase-603194D3,Bewick's Wren,Thryomanes bewickii,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
270014,URN:CornellLabOfOrnithology:EBIRD:OBS1449752068,2022-06-02 18:29:15.579465,999,species,avibase-F93AC929,California Quail,Callipepla californica,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
256272,URN:CornellLabOfOrnithology:EBIRD:OBS1449752064,2022-06-02 18:29:15.579465,2369,species,avibase-00124D98,Mourning Dove,Zenaida macroura,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
255571,URN:CornellLabOfOrnithology:EBIRD:OBS1446126020,2022-06-02 18:22:45.475308,31941,species,avibase-78509A5D,Lark Sparrow,Chondestes grammacus,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
255752,URN:CornellLabOfOrnithology:EBIRD:OBS1449752057,2022-06-02 18:29:15.579465,11494,species,avibase-20C2214E,American Kestrel,Falco sparverius,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)


In [24]:
# save to a file
cal_points.to_csv('../process_data/cal_points.csv')

## 4. Apply further filters on eBird sampling protocol

We need to apply the filters that were not included when reading the parquet file:
- observation between April and June
- effort distance less than or equal to 1 km, or `NULL` (for cases of P21 locations)

In [25]:
# convert the column with dates to a datetime type
cal_points['OBSERVATION DATE'] = pd.to_datetime(cal_points['OBSERVATION DATE'], errors='coerce')

In [26]:
# keep records sampled within a 1 km effort distance (or with no distance recorded),
# and observed between April and June (growing season)

eBird_sample = cal_points[
    (cal_points['EFFORT DISTANCE KM'].isnull() | (cal_points['EFFORT DISTANCE KM'] <= 1))
    & cal_points['OBSERVATION DATE'].dt.month.between(4, 6)
]

In [27]:
eBird_sample.shape

(186486, 42)

There are 186486 occurrences in the area of interest that fulfil the sampling protocol requirements and the remaining criteria defined above.
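
To make the behaviour of this filter explicit, the sketch below applies the same two conditions to a small made-up table. It illustrates that a missing effort distance is accepted, while a date that cannot be parsed (coerced to `NaT`) is rejected because its month is missing.

```python
import pandas as pd

# toy data: values are made up to exercise each branch of the filter
toy = pd.DataFrame({
    'OBSERVATION DATE': ['2021-03-31', '2021-04-01', '2021-06-30',
                         '2021-07-01', 'not a date'],
    'EFFORT DISTANCE KM': [None, 0.5, 2.0, None, 0.2],
})
toy['OBSERVATION DATE'] = pd.to_datetime(toy['OBSERVATION DATE'], errors='coerce')

# April to June, inclusive; NaT dates give a missing month and are excluded
in_season = toy['OBSERVATION DATE'].dt.month.between(4, 6)

# effort distance up to 1 km, or missing
short_effort = toy['EFFORT DISTANCE KM'].isna() | (toy['EFFORT DISTANCE KM'] <= 1)

sample = toy[in_season & short_effort]
```

Only the second row (1 April, 0.5 km) passes both conditions.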

In [28]:
eBird_sample.to_csv('../process_data/eBird_sample.csv')

In [29]:
# determine how many sampling points, grouping records by sampling event identifier

cal_group = eBird_sample.groupby('SAMPLING EVENT IDENTIFIER').first()
cal_group.shape

(20942, 41)

There are 20942 sampling events (*lists*, in eBird terminology) in the area of interest that fulfil the sampling protocol requirements and the remaining criteria defined above.
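
Note that `groupby(...).first()` returns the first non-null value of each column within a group, so a resulting row can combine values from different occurrences of the same list. This is harmless for counting events or for event-level columns such as coordinates, but not for species-level columns. The sketch below, on made-up toy data, contrasts it with two alternatives: `nunique()` for the count and `drop_duplicates()` for keeping one intact row per event.

```python
import pandas as pd

# toy occurrences: two records in list S1, one in list S2
toy = pd.DataFrame({
    'SAMPLING EVENT IDENTIFIER': ['S1', 'S1', 'S2'],
    'COMMON NAME': ["Anna's Hummingbird", 'Rock Pigeon', 'Black Phoebe'],
    'SUBSPECIES COMMON NAME': [None, 'Rock Pigeon (Feral Pigeon)', None],
})

# number of sampling events, without building a grouped table
n_events = toy['SAMPLING EVENT IDENTIFIER'].nunique()

# first non-null value per column: mixes the two S1 records
first_non_null = toy.groupby('SAMPLING EVENT IDENTIFIER').first()

# first complete row per event: values stay together
first_row = (toy.drop_duplicates(subset='SAMPLING EVENT IDENTIFIER')
                .set_index('SAMPLING EVENT IDENTIFIER'))
```

In `first_non_null`, list S1 pairs the hummingbird's common name with the pigeon's subspecies name, whereas `first_row` leaves the subspecies empty.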

In [30]:
# see a preview of the dataframe
cal_group.head()

SAMPLING EVENT IDENTIFIER,GLOBAL UNIQUE IDENTIFIER,LAST EDITED DATE,TAXONOMIC ORDER,CATEGORY,TAXON CONCEPT ID,COMMON NAME,SCIENTIFIC NAME,SUBSPECIES COMMON NAME,SUBSPECIES SCIENTIFIC NAME,EXOTIC CODE,...,DURATION MINUTES,EFFORT DISTANCE KM,EFFORT AREA HA,NUMBER OBSERVERS,ALL SPECIES REPORTED,GROUP IDENTIFIER,APPROVED,REVIEWED,REASON,geometry
S100923939,URN:CornellLabOfOrnithology:EBIRD:OBS1320749039,2022-01-17 11:52:36.03516,4699,species,avibase-42393721,Anna's Hummingbird,Calypte anna,Rock Pigeon (Feral Pigeon),Columba livia (Feral Pigeon),N,...,10.0,,,34.0,1,G7732261,1,0,,POINT (-121.46581 36.52371)
S101565234,URN:CornellLabOfOrnithology:EBIRD:OBS1328634976,2022-01-28 00:21:09.891672,16330,species,avibase-B182DED2,Black Phoebe,Sayornis nigricans,,,N,...,30.0,,,1.0,1,,1,0,,POINT (-121.96756 38.47047)
S101565449,URN:CornellLabOfOrnithology:EBIRD:OBS1328629006,2022-01-28 00:26:02.930986,7445,species,avibase-CAA5E106,White-tailed Kite,Elanus leucurus,,,,...,10.0,,,1.0,1,,1,0,,POINT (-121.96756 38.47047)
S101565463,URN:CornellLabOfOrnithology:EBIRD:OBS1328623740,2022-01-28 00:26:57.633822,8094,species,avibase-536A5157,Red-tailed Hawk,Buteo jamaicensis,,,,...,20.0,,,1.0,1,,1,0,,POINT (-121.96756 38.47047)
S101623296,URN:CornellLabOfOrnithology:EBIRD:OBS1329328826,2022-01-29 00:58:28.375498,406,species,avibase-27B2749A,Wood Duck,Aix sponsa,,,,...,5.0,,,1.0,1,,1,0,,POINT (-121.36947 38.62569)


## 5. Represent points on a map

Do a quick representation of the points on a map.

In [31]:
# build a GeoDataFrame with one point per sampling event (from the grouped table)

points = gpd.GeoDataFrame(
    cal_group, geometry=gpd.points_from_xy(cal_group['LONGITUDE'], cal_group['LATITUDE']), crs=crs)

points['geometry'].explore()

## 6. Create a list of the bird species occurring in the area of interest

Create a list of species occurring in the area of interest. This will be combined with the trait data provided by AVONET database. This table will be used to define which species are native to the area of study.

In [32]:
sps_cal = eBird_sample[['TAXON CONCEPT ID', 'SCIENTIFIC NAME', 'TAXONOMIC ORDER']].drop_duplicates()

In [33]:
sps_cal.head()

Unnamed: 0,TAXON CONCEPT ID,SCIENTIFIC NAME,TAXONOMIC ORDER
272647,avibase-603194D3,Thryomanes bewickii,26451
270014,avibase-F93AC929,Callipepla californica,999
256272,avibase-00124D98,Zenaida macroura,2369
255571,avibase-78509A5D,Chondestes grammacus,31941
255752,avibase-20C2214E,Falco sparverius,11494


In [34]:
# determine number of rows and columns
sps_cal.shape

(442, 3)

There are 442 distinct taxa (unique combinations of taxon concept ID, scientific name and taxonomic order) in the area of interest. We will join this table with the taxonomic data from eBird to obtain the taxonomic order of each species.

In [35]:
# read eBird taxonomic data
path = '../external_data/ebird_taxonomy/'
tax_file = path + 'eBird_Taxonomy_v2021.csv'
pdf_t = pd.read_csv(tax_file)

In [36]:
# merge species in eBird occurrences with taxonomy table

sps_cal = pd.merge(sps_cal, pdf_t, left_on=['TAXONOMIC ORDER'], right_on=['TAXON_ORDER'])

In [37]:
# add Order column to the table

sps_cal = sps_cal[['TAXON CONCEPT ID', 'SCIENTIFIC NAME', 'TAXONOMIC ORDER', 'ORDER1']]

In [38]:
# convert IDs to upper case; needed because AVONET uses upper-case IDs
sps_cal['TAXON CONCEPT ID'] = sps_cal['TAXON CONCEPT ID'].str.upper()

Read the transformed AVONET database. This table expanded the AVONET database with the following information:
- ForagingNiche, from [Pigot et al (2020)](https://doi.org/10.1038/s41559-019-1070-4)
- ForagingNicheReclass, extended classification of Foraging Niche by the authors for species not classified by Pigot et al.(2020)
- Annual_crops, use of annual crops by species as feeding area
- Permanent_crops, use of permanent crops by species as feeding area
- Proportion_invertebrates_diet, the proportion of the diet composed by invertebrates

In [39]:
# read eBird Avonet data with reclassified traits.

path='../external_data/AVONET/'
avonet_file = path + 'AvibaseApr2023_reclass_v1.csv'
df_a = pd.read_csv(avonet_file)

In [40]:
# preview AVONET table
df_a.head()

Unnamed: 0,Species2,Family2,Order2,Avibase.ID2,Total.individuals,Female,Male,Unknown,Complete.measures,Beak.Length_Culmen,...,Habitat.Density,Migration,Trophic.Level,Trophic.Niche,Primary.Lifestyle,ForagingNiche,ForagingNicheReclass,Annual_crops,Permanent_crops,Proportion_invertebrates_diet
0,Busarellus nigricollis,Accipitridae,Accipitriformes,AVIBASE-37148B74,6,1,2,3,5.0,40.8,...,1,1.0,Carnivore,Aquatic predator,Insessorial,Aquatic perch,,,,
1,Buteogallus aequinoctialis,Accipitridae,Accipitriformes,AVIBASE-478C48D0,4,1,0,3,4.0,36.1,...,2,1.0,Carnivore,Aquatic predator,Insessorial,Aquatic perch,,,,
2,Buteogallus anthracinus,Accipitridae,Accipitriformes,AVIBASE-97FDBB06,5,2,3,0,4.0,39.6,...,2,1.0,Carnivore,Aquatic predator,Insessorial,Aquatic perch,,,,
3,Haliaeetus albicilla,Accipitridae,Accipitriformes,AVIBASE-5A3D91D3,5,2,3,0,4.0,74.1,...,3,2.0,Carnivore,Aquatic predator,Aerial,Aquatic aerial,,,,
4,Haliaeetus humilis,Accipitridae,Accipitriformes,AVIBASE-B7B03CB8,5,1,4,0,4.0,49.5,...,3,1.0,Carnivore,Aquatic predator,Insessorial,Aquatic perch,,,,


Although both eBird and AVONET use Avibase IDs for species, it is not always possible to find matches between them. For this reason, we will also try matches based on the scientific name.

In [41]:
# test matches based on ID

sps_cal1 = pd.merge(sps_cal, df_a, 
                     left_on=['TAXON CONCEPT ID'], right_on=['Avibase.ID2'], how='inner', indicator=True)

In [42]:
sps_cal1

Unnamed: 0,TAXON CONCEPT ID,SCIENTIFIC NAME,TAXONOMIC ORDER,ORDER1,Species2,Family2,Order2,Avibase.ID2,Total.individuals,Female,...,Migration,Trophic.Level,Trophic.Niche,Primary.Lifestyle,ForagingNiche,ForagingNicheReclass,Annual_crops,Permanent_crops,Proportion_invertebrates_diet,_merge
0,AVIBASE-603194D3,Thryomanes bewickii,26451,Passeriformes,Thryomanes bewickii,Troglodytidae,Passeriformes,AVIBASE-603194D3,5,2,...,2.0,Carnivore,Invertivore,Generalist,Invertivore ground,Invertivore ground,1.0,1.0,1.0,both
1,AVIBASE-F93AC929,Callipepla californica,999,Galliformes,Callipepla californica,Odontophoridae,Galliformes,AVIBASE-F93AC929,5,2,...,1.0,Herbivore,Omnivore,Terrestrial,,,,,,both
2,AVIBASE-00124D98,Zenaida macroura,2369,Columbiformes,Zenaida macroura,Columbidae,Columbiformes,AVIBASE-00124D98,15,5,...,1.0,Herbivore,Granivore,Terrestrial,Granivore ground,,,,,both
3,AVIBASE-78509A5D,Chondestes grammacus,31941,Passeriformes,Chondestes grammacus,Passerellidae,Passeriformes,AVIBASE-78509A5D,5,2,...,2.0,Herbivore,Granivore,Generalist,Granivore ground,,,,,both
4,AVIBASE-20C2214E,Falco sparverius,11494,Falconiformes,Falco sparverius,Falconidae,Falconiformes,AVIBASE-20C2214E,7,1,...,3.0,Carnivore,Omnivore,Insessorial,,Omnivore Insessorial,0.0,1.0,0.3,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,AVIBASE-15EE0D36,Mareca penelope,454,Anseriformes,Mareca penelope,Anatidae,Anseriformes,AVIBASE-15EE0D36,10,5,...,3.0,Herbivore,Omnivore,Terrestrial,,,,,,both
293,AVIBASE-CEA5B6AA,Seiurus aurocapilla,32843,Passeriformes,Seiurus aurocapilla,Parulidae,Passeriformes,AVIBASE-CEA5B6AA,21,5,...,3.0,Carnivore,Invertivore,Terrestrial,Invertivore ground,Invertivore ground,1.0,1.0,1.0,both
294,AVIBASE-11512CF4,Egretta caerulea,7233,Pelecaniformes,Egretta caerulea,Ardeidae,Pelecaniformes,AVIBASE-11512CF4,11,2,...,1.0,Carnivore,Aquatic predator,Terrestrial,Aquatic ground,,,,,both
295,AVIBASE-D4540F88,Plegadis falcinellus,7346,Pelecaniformes,Plegadis falcinellus,Threskiornithidae,Pelecaniformes,AVIBASE-D4540F88,8,4,...,2.0,Carnivore,Aquatic predator,Terrestrial,Aquatic ground,,,,,both


The ID-based merge returns only 296 rows (row labels 0 to 295), so not every taxon is matched. We will also check matches based on the scientific name.

In [43]:
# test merge based on scientific name

sps_cal_tax = pd.merge(sps_cal, df_a, 
                     left_on=['SCIENTIFIC NAME'], right_on=['Species2'], how='inner', indicator=True)

In [44]:
sps_cal_tax

Unnamed: 0,TAXON CONCEPT ID,SCIENTIFIC NAME,TAXONOMIC ORDER,ORDER1,Species2,Family2,Order2,Avibase.ID2,Total.individuals,Female,...,Migration,Trophic.Level,Trophic.Niche,Primary.Lifestyle,ForagingNiche,ForagingNicheReclass,Annual_crops,Permanent_crops,Proportion_invertebrates_diet,_merge
0,AVIBASE-603194D3,Thryomanes bewickii,26451,Passeriformes,Thryomanes bewickii,Troglodytidae,Passeriformes,AVIBASE-603194D3,5,2,...,2.0,Carnivore,Invertivore,Generalist,Invertivore ground,Invertivore ground,1.0,1.0,1.0,both
1,AVIBASE-2A34DC8D,Thryomanes bewickii,26459,Passeriformes,Thryomanes bewickii,Troglodytidae,Passeriformes,AVIBASE-603194D3,5,2,...,2.0,Carnivore,Invertivore,Generalist,Invertivore ground,Invertivore ground,1.0,1.0,1.0,both
2,AVIBASE-F93AC929,Callipepla californica,999,Galliformes,Callipepla californica,Odontophoridae,Galliformes,AVIBASE-F93AC929,5,2,...,1.0,Herbivore,Omnivore,Terrestrial,,,,,,both
3,AVIBASE-00124D98,Zenaida macroura,2369,Columbiformes,Zenaida macroura,Columbidae,Columbiformes,AVIBASE-00124D98,15,5,...,1.0,Herbivore,Granivore,Terrestrial,Granivore ground,,,,,both
4,AVIBASE-78509A5D,Chondestes grammacus,31941,Passeriformes,Chondestes grammacus,Passerellidae,Passeriformes,AVIBASE-78509A5D,5,2,...,2.0,Herbivore,Granivore,Generalist,Granivore ground,,,,,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
372,AVIBASE-15EE0D36,Mareca penelope,454,Anseriformes,Mareca penelope,Anatidae,Anseriformes,AVIBASE-15EE0D36,10,5,...,3.0,Herbivore,Omnivore,Terrestrial,,,,,,both
373,AVIBASE-CEA5B6AA,Seiurus aurocapilla,32843,Passeriformes,Seiurus aurocapilla,Parulidae,Passeriformes,AVIBASE-CEA5B6AA,21,5,...,3.0,Carnivore,Invertivore,Terrestrial,Invertivore ground,Invertivore ground,1.0,1.0,1.0,both
374,AVIBASE-11512CF4,Egretta caerulea,7233,Pelecaniformes,Egretta caerulea,Ardeidae,Pelecaniformes,AVIBASE-11512CF4,11,2,...,1.0,Carnivore,Aquatic predator,Terrestrial,Aquatic ground,,,,,both
375,AVIBASE-D4540F88,Plegadis falcinellus,7346,Pelecaniformes,Plegadis falcinellus,Threskiornithidae,Pelecaniformes,AVIBASE-D4540F88,8,4,...,2.0,Carnivore,Aquatic predator,Terrestrial,Aquatic ground,,,,,both


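In this case, 376 rows matched (row labels 0 to 375). Note that one scientific name can match more than one eBird taxon, as for *Thryomanes bewickii* in the table above. We will maximize the number of eBird taxa linked to the AVONET table by combining both sets of matches.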

In [45]:
sps_cal2 = pd.concat([sps_cal1, sps_cal_tax])
sps_cal2 = sps_cal2.drop_duplicates()

In [46]:
sps_cal2.shape

(377, 41)
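
The two-stage matching can be illustrated on a small made-up example. The species names and IDs below are placeholders, not real taxa; the column names follow the tables above. The sketch also lists the taxa that neither merge could link, which may be useful for manual review.

```python
import pandas as pd

# toy eBird species list (placeholder names and IDs)
species = pd.DataFrame({
    'TAXON CONCEPT ID': ['AVIBASE-A', 'AVIBASE-B', 'AVIBASE-C', 'AVIBASE-D'],
    'SCIENTIFIC NAME': ['Genus alpha', 'Genus beta', 'Genus gamma', 'Genus delta'],
})

# toy trait table: B has a different ID, C has a different name, D is absent
traits = pd.DataFrame({
    'Avibase.ID2': ['AVIBASE-A', 'AVIBASE-X', 'AVIBASE-C'],
    'Species2': ['Genus alpha', 'Genus beta', 'Genus renamed'],
    'Trophic.Niche': ['Invertivore', 'Granivore', 'Omnivore'],
})

# stage 1: match on Avibase ID
by_id = pd.merge(species, traits,
                 left_on='TAXON CONCEPT ID', right_on='Avibase.ID2', how='inner')

# stage 2: match on scientific name
by_name = pd.merge(species, traits,
                   left_on='SCIENTIFIC NAME', right_on='Species2', how='inner')

# union of both match sets; rows found by both routes are kept once
combined = pd.concat([by_id, by_name]).drop_duplicates()

# taxa that neither route could link to the trait table
unmatched = species[~species['TAXON CONCEPT ID'].isin(combined['TAXON CONCEPT ID'])]
```

Here the ID route finds A and C, the name route finds A and B, and the union holds three taxa, leaving D unmatched.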

### Filter species based on traits

Based on trait data, include only species that contribute to the control of crop pests. These have the following trait values:
- Trophic.Level: Carnivore, Omnivore
- Trophic.Niche: Invertivore, Omnivore
- Primary.Lifestyle: Aerial, Generalist, Insessorial, Terrestrial

In [47]:
# filter species list
sps_cal2 = sps_cal2[ \
            (sps_cal2['Trophic.Level'].isin(['Carnivore', 'Omnivore'])) & \
            (sps_cal2['Trophic.Niche'].isin(['Invertivore', 'Omnivore'])) & \
            (sps_cal2['Primary.Lifestyle'].isin(['Aerial', 'Generalist', 'Insessorial', 'Terrestrial']))
             ]

In [48]:
sps_cal2.shape

(172, 41)

There are 172 species that provide pest control services in crops.
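
As a small illustration of how the three criteria combine, the sketch below applies them to five species whose trait values are copied from the AVONET previews shown earlier. Keeping the allowed values in a dictionary (the name `criteria` is an assumption, not from the original notebook) is one way to make the rule set easier to read and change.

```python
import pandas as pd

# trait values as shown in the merged tables above
toy = pd.DataFrame({
    'Species2': ['Thryomanes bewickii', 'Zenaida macroura', 'Falco sparverius',
                 'Callipepla californica', 'Egretta caerulea'],
    'Trophic.Level': ['Carnivore', 'Herbivore', 'Carnivore', 'Herbivore', 'Carnivore'],
    'Trophic.Niche': ['Invertivore', 'Granivore', 'Omnivore', 'Omnivore',
                      'Aquatic predator'],
    'Primary.Lifestyle': ['Generalist', 'Terrestrial', 'Insessorial', 'Terrestrial',
                          'Terrestrial'],
})

# allowed values per trait column (same values as in the list above)
criteria = {
    'Trophic.Level': ['Carnivore', 'Omnivore'],
    'Trophic.Niche': ['Invertivore', 'Omnivore'],
    'Primary.Lifestyle': ['Aerial', 'Generalist', 'Insessorial', 'Terrestrial'],
}

# a species is kept only if every trait is in its allowed set
mask = pd.Series(True, index=toy.index)
for column, allowed in criteria.items():
    mask &= toy[column].isin(allowed)

kept = toy[mask]
```

Because `isin` returns `False` for missing values, species without trait data in any of the three columns are excluded as well.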

In [49]:
sps_cal2.to_csv('../process_data/sps_california.csv')

### Filter species based on distribution

Only species native to the area (nesting and present during the growing season) should be considered. Therefore, species are filtered based on this criterion. For the current example, this classification was performed by experts. In the
future it will be based on the BirdLife database.


In [50]:
# read file indicating whether each species is native to the area

native_class = pd.read_csv('../process_data/sps_california_native_class.csv')


In [51]:
# merge based on the taxon concept ID

sps_cal3 = pd.merge(sps_cal2, native_class[['TAXON CONCEPT ID', 'NativeToValCal']], 
                     left_on=['TAXON CONCEPT ID'], right_on=['TAXON CONCEPT ID'], how='left')

In [52]:
sps_cal4 = sps_cal3[(sps_cal3['NativeToValCal'] == 1)]

In [53]:
# Create list of unique values for species names

sps_unique = sps_cal4.drop_duplicates(subset=['Species2'])
sps_unique.shape

(63, 42)

There are 63 species that are native to the area of interest.
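
Because the classification is joined with a left merge, species that are missing from the expert table get a missing `NativeToValCal` value and are silently dropped by the `== 1` test. The sketch below, on made-up placeholder data, shows that behaviour and how unclassified species could be listed for review.

```python
import pandas as pd

# toy species list and expert classification (placeholder names and IDs)
species = pd.DataFrame({
    'TAXON CONCEPT ID': ['AVIBASE-A', 'AVIBASE-B', 'AVIBASE-C'],
    'Species2': ['Genus alpha', 'Genus beta', 'Genus gamma'],
})
native_toy = pd.DataFrame({
    'TAXON CONCEPT ID': ['AVIBASE-A', 'AVIBASE-B'],
    'NativeToValCal': [1, 0],
})

# left merge keeps every species; C has no classification
merged = pd.merge(species, native_toy, on='TAXON CONCEPT ID', how='left')

# species classified as native
native = merged[merged['NativeToValCal'] == 1]

# species the expert table does not cover
unclassified = merged[merged['NativeToValCal'].isna()]
```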

## 7. Filter occurrence list with species

Finally, filter eBird occurrences by species that fulfil the trait and distribution requirements.

In [54]:
# preview the eBird occurrences before merging them with the species list
eBird_sample.head()

Unnamed: 0,GLOBAL UNIQUE IDENTIFIER,LAST EDITED DATE,TAXONOMIC ORDER,CATEGORY,TAXON CONCEPT ID,COMMON NAME,SCIENTIFIC NAME,SUBSPECIES COMMON NAME,SUBSPECIES SCIENTIFIC NAME,EXOTIC CODE,...,DURATION MINUTES,EFFORT DISTANCE KM,EFFORT AREA HA,NUMBER OBSERVERS,ALL SPECIES REPORTED,GROUP IDENTIFIER,APPROVED,REVIEWED,REASON,geometry
272647,URN:CornellLabOfOrnithology:EBIRD:OBS1446126022,2022-06-02 18:22:45.475308,26451,species,avibase-603194D3,Bewick's Wren,Thryomanes bewickii,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
270014,URN:CornellLabOfOrnithology:EBIRD:OBS1449752068,2022-06-02 18:29:15.579465,999,species,avibase-F93AC929,California Quail,Callipepla californica,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
256272,URN:CornellLabOfOrnithology:EBIRD:OBS1449752064,2022-06-02 18:29:15.579465,2369,species,avibase-00124D98,Mourning Dove,Zenaida macroura,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
255571,URN:CornellLabOfOrnithology:EBIRD:OBS1446126020,2022-06-02 18:22:45.475308,31941,species,avibase-78509A5D,Lark Sparrow,Chondestes grammacus,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)
255752,URN:CornellLabOfOrnithology:EBIRD:OBS1449752057,2022-06-02 18:29:15.579465,11494,species,avibase-20C2214E,American Kestrel,Falco sparverius,,,,...,6.0,,,2.0,1,G8497992,1,0,,POINT (-121.29641 36.26400)


In [55]:
eBird_sample2 = pd.merge(eBird_sample, sps_unique, left_on=['SCIENTIFIC NAME'], right_on=['Species2'], how='inner')

In [56]:
eBird_sample2.shape

(78942, 84)

In [57]:
eBird_sample2.groupby('SAMPLING EVENT IDENTIFIER').first().shape

(18783, 83)

There are 78942 bird occurrences that correspond to 18783 sampling points.
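
This join relies on each scientific name appearing only once in the species table; otherwise occurrences would be duplicated. One way to guard against that, sketched below on toy data, is the `validate` argument of `pd.merge`, which raises an error when the right-hand keys are not unique. Using it here is a suggestion, not something the original notebook does.

```python
import pandas as pd

# toy occurrences and species table
occ = pd.DataFrame({
    'SCIENTIFIC NAME': ['Falco sparverius', 'Falco sparverius', 'Zenaida macroura'],
    'SAMPLING EVENT IDENTIFIER': ['S1', 'S2', 'S2'],
})
sp = pd.DataFrame({'Species2': ['Falco sparverius'], 'Trophic.Niche': ['Omnivore']})

# many occurrences per species, one row per species: passes validation
joined = pd.merge(occ, sp, left_on='SCIENTIFIC NAME', right_on='Species2',
                  how='inner', validate='many_to_one')

# a species table with a repeated name fails validation
sp_dup = pd.concat([sp, sp])
try:
    pd.merge(occ, sp_dup, left_on='SCIENTIFIC NAME', right_on='Species2',
             how='inner', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
```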

In [58]:
eBird_sample2.to_csv('../process_data/eBird_sample.csv')