<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
    
<center><h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background-color:#1E90FF; border:0; color:#FFF5EE' role="tab" aria-controls="home">Content</h2></center>
    
In this notebook I explore which birds were in the right place at the right time to have a chance of making it into a test set recording. In the [competition notebook](https://www.kaggle.com/stefankahl/birdclef2021-exploring-the-data), Stefan Kahl points out that all birds were likely to be observed at the test sites, but were they there at the right time? To answer this question, each site is assigned a circular region with a 200 km radius. If a bird was previously recorded inside that region in the same month as the test recordings at that site, it could potentially be present in the test recordings.

In [None]:
import folium
import warnings
import pandas as pd
import geopandas as gpd
import geopy.distance
from shapely.geometry import Point
import numpy as np
import os
from datetime import datetime, timedelta

warnings.filterwarnings(action='ignore')

First, let's load the training metadata.

In [None]:
df_meta = pd.read_csv('../input/birdclef-2021/train_metadata.csv')
df_meta.head()

We will also need the description of test site parameters, such as location and date.

In [None]:
files = []
for dirname, _, filenames in os.walk('/kaggle/input/birdclef-2021/test_soundscapes'):
    files += [os.path.join(dirname, filename) for filename in filenames]
files

Let's draw a 200 km region around each point and treat it as a site. If a bird has ever been recorded within that circle, it could potentially appear in the test recordings.

In [None]:
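# Collect one record per location file, then build the frame in a single step
# (DataFrame.append was removed in pandas 2.0).
# Note: files[1:] assumes the first entry in `files` is the recording-dates CSV.
records = []
for file in files[1:]:
    with open(file, 'r') as f:
        lines = f.readlines()
    records.append({"name": lines[0].strip(),
                    "radius": 200,  # km
                    "latitude": float(lines[-2].split()[-1]),
                    "longitude": float(lines[-1].split()[-1]),
                    "alias": file.split("/")[-1].split("_")[0]})

# Alphabetical column order matches what the repeated .append() calls produced
df_sites = pd.DataFrame(records, columns=["alias", "latitude", "longitude", "name", "radius"])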

df_sites
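The parsing rule used above (site name on the first line, latitude and longitude as the last token of the final two lines, alias taken from the file name) can be tried without the competition data. This is a minimal sketch: `parse_site_file`, the sample text, and the `XYZ` file name are all made up for illustration and do not reproduce the real location files.

```python
def parse_site_file(text, filename):
    """Extract site fields from the text of one recording-location file."""
    lines = text.strip().splitlines()
    return {
        "name": lines[0].strip(),
        "radius": 200,  # km, the region size used in this notebook
        "latitude": float(lines[-2].split()[-1]),   # last token of second-to-last line
        "longitude": float(lines[-1].split()[-1]),  # last token of last line
        "alias": filename.split("/")[-1].split("_")[0],
    }


# Hypothetical file content, invented for illustration only
sample_text = """Example Site Name
Some descriptive line
Latitude: 10.5
Longitude: -84.25
"""

site = parse_site_file(sample_text, "/some/dir/XYZ_recording_location.txt")
print(site)
```

The rule depends only on line position, so it fails with a `ValueError` if the last two lines of a file do not end in a number.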

In [None]:
m = folium.Map(location=[21.612581945168355, -79.0603262312263], tiles="cartodbpositron", zoom_start=4)

for _, row in df_sites.iterrows():
    folium.Circle(location=[row["latitude"], row["longitude"]], popup=row["alias"],
                  fill_color='#00CED1', radius=row["radius"] * 1000,  # km to metres
                  weight=2, color="#000").add_to(m)

m

Next, let's retrieve the months in which the test set recordings took place. Combined with location, this lets us check which species were observed in the right place at the right time.

In [None]:
df_dates = pd.read_csv(files[0])
df_dates["date"] = pd.to_datetime(df_dates["date"].astype(str), format="%Y%m%d")

# Test dates are parsed datetimes; training dates are "YYYY-MM-DD" strings
df_dates["month"] = df_dates["date"].dt.month
df_meta['month'] = df_meta['date'].str.split("-").str[1].astype(int)

df_dates.head()
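The two tables store dates differently: the test dates are compact `YYYYMMDD` values, while the training metadata uses dashed `YYYY-MM-DD` strings. The stdlib-only sketch below shows the month extraction for both. The helper names and sample dates are made up for illustration.

```python
from datetime import datetime


def month_from_compact(date_str):
    """Month of a 'YYYYMMDD' string, parsed strictly as a calendar date."""
    return datetime.strptime(date_str, "%Y%m%d").month


def month_from_dashed(date_str):
    """Month of a 'YYYY-MM-DD' string, taken as the middle field without validation."""
    return int(date_str.split("-")[1])


print(month_from_compact("20190611"))
print(month_from_dashed("2019-06-11"))
print(month_from_dashed("2019-00-00"))
```

Splitting on `-` does not validate the date, so a placeholder such as `2019-00-00` (a hypothetical value here) yields month 0 instead of raising an error, and month 0 can never match a test month.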

In [None]:
# Collect the distinct recording months for each test site
site_params = {
    site: {"months": sorted(group["month"].unique().tolist())}
    for site, group in df_dates.groupby("site")
}

# Attach each site's coordinates and radius (km), looked up by column name
for _, row in df_sites.iterrows():
    site_params[row["alias"]]["latlon"] = (row["latitude"], row["longitude"])
    site_params[row["alias"]]["R"] = row["radius"]
    
site_params

In [None]:
def right_place_time(lat, lon, month):
    """
    Return True if an observation at (lat, lon) made in the given month falls
    within the radius and the recording months of at least one test site.
    """
    for params in site_params.values():
        # Check the month first: it is cheaper than the geodesic distance
        if month in params["months"] and geopy.distance.distance(params["latlon"], (lat, lon)).km < params["R"]:
            return True
    return False

# Smoke test: month 0 never matches a recording month, so this returns False
right_place_time(42.3005, -72.5877, 0)
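To see the place-and-time check in isolation, here is a minimal dependency-free sketch. It swaps the geodesic distance from `geopy` for the haversine great-circle formula on a spherical Earth. That is an approximation and can differ from the ellipsoidal result by a fraction of a percent, so borderline points near the 200 km limit may be classified differently. `toy_sites`, its coordinates, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical site table in the same shape as site_params
toy_sites = {
    "AAA": {"months": [5, 6], "latlon": (10.0, -84.0), "R": 200},
}


def in_place_and_time(lat, lon, month, sites):
    """True if the point lies within any site's radius in one of its months."""
    return any(
        month in p["months"]
        and haversine_km(p["latlon"][0], p["latlon"][1], lat, lon) < p["R"]
        for p in sites.values()
    )


print(in_place_and_time(10.5, -84.5, 6, toy_sites))  # nearby, matching month
print(in_place_and_time(10.5, -84.5, 1, toy_sites))  # nearby, wrong month
print(in_place_and_time(40.0, -84.0, 6, toy_sites))  # matching month, far away
```

Only the first call satisfies both conditions. The other two each fail one of them, which is the behaviour the notebook's filter relies on.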

In [None]:
df_meta["right_place_time"] = df_meta.apply(lambda r: right_place_time(r['latitude'], r['longitude'], r["month"]), axis=1)

print("Percentage of records within test sites at matching times of year: {:.2f}%".format(100*len(df_meta[df_meta["right_place_time"]])/len(df_meta)))

In [None]:
print("Of {} species {} were observed within sites at the same time of year".format(df_meta["primary_label"].nunique(), df_meta[df_meta["right_place_time"]]["primary_label"].nunique()))
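The two summary numbers measure different things: the percentage counts individual recordings, whereas the species count includes a species as soon as any one of its recordings matches. A toy sketch with made-up labels and flags illustrates the difference.

```python
# Hypothetical (species label, right_place_time flag) pairs
records = [("sp_a", True), ("sp_a", False), ("sp_b", False), ("sp_c", True)]

# Share of individual recordings that match a test site and month
share = 100 * sum(flag for _, flag in records) / len(records)

# A species is a candidate if at least one of its recordings matches
all_species = {label for label, _ in records}
candidate_species = {label for label, flag in records if flag}

print("{:.2f}% of records match".format(share))
print("Of {} species {} are candidates".format(len(all_species), len(candidate_species)))
```

Here half of the recordings match, yet two of the three species remain candidates, because `sp_a` qualifies through a single matching recording.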

<a id="1"></a><center><h2 style='background-color:#1E90FF; border:0; color:#FFF5EE'>Summary</h2></center>

With a 200 km region around each test site, 273 species have been recorded in the same month as the test recordings. This narrows down the range of suspects :-) I hope you find this notebook helpful; please consider upvoting it if you did.