# BirdCLEF: Migration Patterns & Morning vs Night Birds

I went through the introductory notebooks from @stefankahl and explored the ideas he presented there a bit further.
Source: https://www.kaggle.com/stefankahl/birdclef2021-exploring-the-data

* First, I removed the low-rating recordings from the dataset.
* Second, I checked for migration patterns.
* Third, I checked the ratio of morning to night birds. The idea was to ensure an even distribution of both categories in the training and validation sets.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

import glob
import ntpath
import time
from tqdm import tqdm
from math import radians, cos, sin, asin, sqrt
from collections import defaultdict

### Load CSV Data

In [None]:
PATH = '../input/birdclef-2021/'
#load csv files
train_metadata = pd.read_csv(PATH + 'train_metadata.csv')
train_soundscape_labels = pd.read_csv(PATH + 'train_soundscape_labels.csv')
test = pd.read_csv(PATH + 'test.csv')
test_dates = pd.read_csv(PATH + 'test_soundscapes/test_set_recording_dates.csv')
sample_submission = pd.read_csv(PATH + 'sample_submission.csv')

### 1. Remove Low-Rating Recordings

In [None]:
#select only the columns that I need for further work
train = train_metadata[['primary_label', 'secondary_labels', 'type','date','latitude','longitude','time','rating','filename']]
print(f'Train set size BEFORE cleaning: {len(train_metadata)}')
#Remove low-rating recordings (keep ratings below 0.5 or at least 2.0) and reset the index
train = train.loc[(train.rating < 0.5) | (train.rating >= 2.0)].reset_index(drop=True)
print(f'Train set size AFTER cleaning: {len(train)}')

Removing ~1k examples is not a big deal, so let's stick with it.

### 2. Migration Patterns

Before analyzing migration patterns, we should extract the week or month number of each recording in both the train and test sets.

#### 2.1 Extract Date-Related Features

In [None]:
#Get month or week number
MONTH = True

if MONTH:

    #1. TRAIN SHORT RECORDINGS
    #convert to datetime
    train['month'] = pd.to_datetime(train.date.astype('str').apply(lambda x: x[5:-3]), format='%m', errors='coerce')
    #check how many null values there are
    print(f'Number of null values in Date column: {len(train.loc[train.month.isnull()])}')
    #drop rows with null values (in any column) and reset index
    train = train.dropna().reset_index(drop=True)
    #extract month number from the resulting datetime
    train['month'] = train.month.dt.month

    #2. TRAIN SOUNDSCAPES
     
    #train_soundscape_labels does not contain date-related information, but it
    #can be extracted from the filenames in the train_soundscapes directory
    #get the file names (without extension) from the directory
    #note: str.strip('.ogg') strips a set of characters, not a suffix, so splitext is used instead
    filenames = [ntpath.splitext(ntpath.basename(path))[0] for path in glob.glob(PATH+'train_soundscapes/*.ogg')]

    #map audio_id to recording date
    id_to_date = dict()
    for i in range(len(filenames)):
        s = filenames[i].split('_')
        id_to_date[int(s[0])] = s[-1]

    #assign the date column
    train_soundscape_labels['date'] = [id_to_date[idx] for idx in train_soundscape_labels.audio_id]
    #convert the date column into datetime
    train_soundscape_labels.date = pd.to_datetime(train_soundscape_labels.date, format='%Y%m%d')
    #get the month number
    train_soundscape_labels['month'] = train_soundscape_labels.date.dt.month

    #3. TEST DATA
    #convert test data's date column into datetime
    test_dates.date = pd.to_datetime(test_dates.date.astype('str'))
    #extract the month only
    test_dates['month'] = test_dates.date.dt.month
    
else: #WEEK
    
    #1. TRAIN SHORT RECORDINGS
    #convert to datetime
    train.date = pd.to_datetime(train.date, format='%Y-%m-%d', errors='coerce')
    #check how many null values there are
    print(f'Number of null values in Date column: {len(train.loc[train.date.isnull()])}')
    #drop rows with null values (in any column) and reset index
    train = train.dropna().reset_index(drop=True)
    #extract ISO week number (Series.dt.week was removed in pandas 2.0)
    train['week'] = train.date.dt.isocalendar().week.astype(int)
    
    #2. TRAIN SOUNDSCAPES
    #get the file names (without extension) from the directory
    filenames = [ntpath.splitext(ntpath.basename(path))[0] for path in glob.glob(PATH+'train_soundscapes/*.ogg')]

    #map audio_id to recording date
    id_to_date = dict()
    for i in range(len(filenames)):
        s = filenames[i].split('_')
        id_to_date[int(s[0])] = s[-1]

    #assign the date column
    train_soundscape_labels['date'] = [id_to_date[idx] for idx in train_soundscape_labels.audio_id]
    #convert the date column into datetime
    train_soundscape_labels.date = pd.to_datetime(train_soundscape_labels.date, format='%Y%m%d')
    #get the ISO week number (Series.dt.week was removed in pandas 2.0)
    train_soundscape_labels['week'] = train_soundscape_labels.date.dt.isocalendar().week.astype(int)

    #3. TEST DATA
    #convert test data's date column into datetime
    test_dates.date = pd.to_datetime(test_dates.date.astype('str'))
    #extract the ISO week only (Series.dt.week was removed in pandas 2.0)
    test_dates['week'] = test_dates.date.dt.isocalendar().week.astype(int)

Now we have the week or month number in which every recording was made, for both the training and test sets. Time to do some feature engineering.
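As a standalone check of the soundscape filename parsing, here is a minimal sketch using only the standard library. It assumes filenames of the form `<audio_id>_<site>_<YYYYMMDD>.ogg`, which is what the splitting code above implies. The example path and the helper name `parse_soundscape_name` are made up for illustration.

```python
import ntpath
from datetime import datetime

def parse_soundscape_name(path):
    """Return (audio_id, site, date) from a path like '.../123_ABC_20190611.ogg'."""
    # Drop the directory and the extension, keeping only the file stem
    stem, _ext = ntpath.splitext(ntpath.basename(path))
    parts = stem.split('_')
    audio_id = int(parts[0])                          # first token: numeric id
    site = parts[1]                                   # second token: site code
    date = datetime.strptime(parts[-1], '%Y%m%d')     # last token: recording date
    return audio_id, site, date

# Hypothetical file name, for illustration only
audio_id, site, date = parse_soundscape_name('../input/train_soundscapes/123_ABC_20190611.ogg')
month = date.month                 # calendar month, 1..12
week = date.isocalendar()[1]       # ISO week number

print(audio_id, site, date.date(), month, week)

# Pitfall: str.strip removes any of the characters '.', 'o', 'g' from both ends,
# so it is not a safe way to drop the '.ogg' extension
print(repr('dog.ogg'.strip('.ogg')))       # 'd'
print(repr(ntpath.splitext('dog.ogg')[0])) # 'dog'
```

For names that start and end in digits the `strip` shortcut happens to give the right stem, but `splitext` does not rely on that coincidence.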

Here are 2 main approaches I came up with:

1. Check where each bird vocalization was observed during the given week/month and assign it the name of the closest site. While this approach may capture the migration patterns of mid- to long-distance migrants, it is likely to fail for year-round birds. For example, take a look at [House Sparrow](https://ebird.org/science/status-and-trends/houspa/abundance-map-weekly).
2. Assign the site name to each recording based on where the largest number of vocalizations of the given species was observed during the week/month in which the recording was made. I think this is the better approach.
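The difference between the two approaches can be sketched on toy data with the standard library alone. The site coordinates are the ones used later in this notebook. The three recordings, the species code `bird_a`, and the helper names `closest_site` and `busiest_site` are hypothetical. This is a simplified illustration, not the implementation used in the cells below.

```python
from collections import Counter
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Site name -> (lat, lon)
SITES = {'COL': (5.57, -75.85), 'COR': (10.12, -84.51),
         'SNE': (38.49, -119.95), 'SSW': (42.47, -76.45)}

# Hypothetical recordings: (species, month, lat, lon)
records = [
    ('bird_a', 5, 42.0, -76.0),   # close to SSW
    ('bird_a', 5, 43.0, -77.0),   # close to SSW
    ('bird_a', 5, 9.5, -84.0),    # close to COR
]

def closest_site(lat, lon):
    """Approach 1: the site with the smallest distance to a single recording."""
    return min(SITES, key=lambda name: haversine_km(lat, lon, *SITES[name]))

def busiest_site(records, species, month, radius_km=500.0):
    """Approach 2: the site with the most recordings of `species` within `radius_km` in `month`."""
    counts = Counter()
    for sp, m, lat, lon in records:
        if sp != species or m != month:
            continue
        for name, (slat, slon) in SITES.items():
            if haversine_km(lat, lon, slat, slon) <= radius_km:
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None

approach_1 = [closest_site(lat, lon) for _, _, lat, lon in records]
approach_2 = busiest_site(records, 'bird_a', 5)
print(approach_1)   # one site per recording
print(approach_2)   # one site per (species, month)
```

Approach 1 labels each recording individually, so the third recording gets a different site than the first two. Approach 2 gives every recording of the species in that month the same, most frequent site.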

In addition, before trying either approach, one must be careful about which data is used: it makes no sense to calculate the distance between a site and a recording made somewhere in Europe or Asia when all the recording sites are located in the Americas. To avoid such problems, let's work with data from the Americas only. I know that by this point I have already removed around 5k recordings from the data, but it's a trade-off I have to accept to get more precise features.


In [None]:
#select only the recordings made in the Americas (west of the 25°W meridian)
print(f'Train set length BEFORE cleaning: {len(train)}')
train = train.loc[train.longitude < -25.0].reset_index(drop=True)
print(f'Train set length AFTER cleaning: {len(train)}')

Minus another 5k recordings... Let's see how it affected the class distribution:

In [None]:
# Code adapted from https://www.kaggle.com/shahules/bird-watch-complete-eda-fe and 
# https://www.kaggle.com/stefankahl/birdclef2021-exploring-the-data

# Unique eBird codes
species1 = train_metadata['primary_label'].value_counts()
species2 = train['primary_label'].value_counts()

# Make bar chart
fig = go.Figure(data=[go.Bar(y=species1.values, x=species1.index,name="Before Cleaning"),
                      go.Bar(y=species2.values, x=species2.index, name="After Cleaning")],
                layout=go.Layout(margin=go.layout.Margin(l=0, r=0, b=10, t=50)))
fig.update_layout(title='Number of training samples per species')


# Show charts
fig.show()

We can see that our class distribution got even more imbalanced. We also lost 2 species:

In [None]:
#get lost species during data cleaning
lost_species = np.setdiff1d(train_metadata.primary_label.unique(),train.primary_label.unique())

#get species that have low number of records and combine with lost species
scarce = list(train.groupby("primary_label").filter(lambda x: len(x) < 25).primary_label.unique()) + list(lost_species)

print(f'Lost Species: {len(lost_species)}')
print(f'Scarce Species (<25 recordings): {len(scarce)}')

My data cleaning approach is not perfect, but I think I will still stick to it. **I would also love to get your advice on how to make it better!**
Here are some possible ways to preserve more recordings:

1. When processing the date column, extract the month instead of the week number.
2. Include bird recordings from all over the world. But then my migration feature engineering approaches (which you can explore in the next code cells) will not work.
3. Get the data for scarce bird species from the previous competition's data. I have tried it, but it covers only a limited number of classes.
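Point 1 can be illustrated without pandas: a date with an unknown day still has a usable month, but no week number. The sketch below mirrors the notebook's idea of slicing the `MM` part out of a `YYYY-MM-DD` string. The date strings and helper names are hypothetical, and whether the dataset actually contains such partial dates is an assumption.

```python
from datetime import datetime

def parse_month(date_str):
    """Month taken from the 'MM' part of 'YYYY-MM-DD', or None if it is not in 1..12."""
    try:
        month = int(date_str[5:7])
    except ValueError:
        return None
    return month if 1 <= month <= 12 else None

def parse_iso_week(date_str):
    """ISO week number, or None if the full date is not a valid calendar date."""
    try:
        return datetime.strptime(date_str, '%Y-%m-%d').isocalendar()[1]
    except ValueError:
        return None

# Hypothetical date strings: a full date, one with an unknown day, one with unknown month and day
dates = ['2019-06-11', '2015-06-00', '2012-00-00']
parsed = [(d, parse_month(d), parse_iso_week(d)) for d in dates]
for row in parsed:
    print(row)
```

Only the first string yields a week number, while the first two yield a month, so a month-based feature would keep the second recording and a week-based one would drop it.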

#### 2.2 Migration Feature Engineering

#### Approach 1

In [None]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    
    Source: https://stackoverflow.com/questions/42686300/how-to-check-if-coordinate-inside-certain-area-python
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
    
def get_closest_site(df, sites):
    '''
        Get the site which is closest to each bird vocalization recording. Returns a pandas.Series containing the site name assigned to each recording.
        
        Attributes:
            df(pandas.DataFrame): df containing vocalization recordings; must have a default 0..n-1 index, since rows are looked up as lat[i]
            sites(list of dicts): list of dictionaries containing the coordinates (lat, long) and name of each site
    '''
    
    distances = defaultdict(list)
    #get the coordinates of all records
    lat = df.latitude
    long = df.longitude
    
    for recording_site in sites:
        #calculate the distance between the site and the vocalization recordings with haversine function
        for i in range(len(df)):

            lat1 = recording_site['lat']
            lon1 = recording_site['long']
            lat2 = lat[i]
            lon2 = long[i]

            distance = haversine(lon1, lat1, lon2, lat2)
            
            distances[i].append(distance)
    #get the name of the site which is the closest to the vocalization recording
    for i in range(len(distances)):
        idx =  distances[i].index(min(distances[i]))
        distances[i] = sites[idx]['name']
    
    #materialize the dict values as a list before building the Series
    return pd.Series(list(distances.values()))

#Coordinates of the recording locations taken from the competition's txt data
COL = {'lat': 5.57, 'long': -75.85, 'name': 'COL'} # Colombia
COR = {'lat': 10.12, 'long': -84.51, 'name': 'COR'} # Costa Rica 
SNE = {'lat': 38.49, 'long': -119.95, 'name': 'SNE'} # Sierra Nevada
SSW = {'lat': 42.47, 'long': -76.45, 'name': 'SSW'} # Sapsucker Woods
sites = [COL,COR,SNE,SSW]

#Assign to the train dataframe
train['site'] = get_closest_site(train,sites)
#display a sample of data
train.head()

#### Approach 2

In [None]:
def check_proximity(data, birdcode, time_period, recording_site, radius=250.0):
    '''
        Count the vocalization recordings of the given bird that are within a certain radius of the recording site during the given time period.
        
        Attributes:
            data(pandas.DataFrame): training data
            birdcode(str): eBird code
            time_period(tuple): ('week'/'month', number)
            recording_site(dict): dictionary with coordinates(lat,long) and name of the site
            radius(float): radius in kilometers
    '''
    
    if time_period[0] == 'week':
        #get the data only for specific week
        df = data.loc[data.week == time_period[1]]
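    elif time_period[0] == 'month':
        #get the data only for specific month
        df = data.loc[data.month == time_period[1]]
    else:
        #fail early instead of hitting an undefined df below
        raise ValueError("time_period[0] must be 'week' or 'month'")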
    
    freq = 0
    #pick the records of the given bird
    df = df.loc[df.primary_label == birdcode].reset_index().drop('index',axis=1)
    lat = df.latitude
    long = df.longitude
    #calculate the distance between the site and the vocalization recordings with haversine function
    for i in range(len(df)):
        
        lat1 = recording_site['lat']
        lon1 = recording_site['long']
        lat2 = lat[i]
        lon2 = long[i]
        
        distance = haversine(lon1, lat1, lon2, lat2)
        #check whether the vocalization is within the radius and update the frequency counter
        if distance <= radius:
            freq += 1
    return freq

#Coordinates of the recording locations taken from the competition's txt data
COL = {'lat': 5.57, 'long': -75.85, 'name': 'COL'} # Colombia
COR = {'lat': 10.12, 'long': -84.51, 'name': 'COR'} # Costa Rica 
SNE = {'lat': 38.49, 'long': -119.95, 'name': 'SNE'} # Sierra Nevada
SSW = {'lat': 42.47, 'long': -76.45, 'name': 'SSW'} # Sapsucker Woods
sites = [COL,COR,SNE,SSW]
#pick the radius
radius = 500.0
time_measure = 'month' if MONTH else 'week'

freqs = dict()
for t in tqdm(sorted(train[time_measure].unique())):
    site_freqs = defaultdict(dict)
    #loop over sites
    for site in sites:
        #loop over bird species
        for b in train.primary_label.unique():
            site_freqs[site['name']][b] = check_proximity(train,b,(time_measure,t),site,radius)
    #for each species pick the site with the highest count (ties, including all zeros, go to the first site in `sites`)
    freqs[t] = dict(pd.DataFrame(site_freqs).idxmax(axis=1))
    
# assign new column to the train df
train['site'] = [freqs[train[time_measure][idx]][train.primary_label[idx]] for idx in range(len(train))]
train.head()

### 3. Morning vs Night Birds

In [None]:
# Check the ratio between morning and night birds

# Convert time column into datetime and remove rows with missing values in it
train['time'] = pd.to_datetime(train['time'], errors='coerce')
train = train.dropna(subset=['time']).reset_index(drop=True)
#get only the hour
train['time'] = train['time'].dt.hour.astype('int')
#categorize each recording: 0 = morning (hours 5 to 13), 1 = night (all other hours)
train['time'] = train['time'].apply(lambda x: 0 if 4 < x <= 13 else 1)

def get_btype(bcodes):
    
    btype = {}
    for b in bcodes:

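        #counts of morning (0) and night (1) recordings for this species
        v = train.loc[train.primary_label == b].time.value_counts()
        if len(v)<2:
            #only one category present, so take it
            btype[b] = v.keys()[0]

        elif v[0] > v[1]:
            btype[b] = 0 # morning
        else:
            btype[b] = 1 # night (also chosen on ties)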
    return btype

btype = get_btype(train.primary_label.unique())
train['btype'] = [btype[p] for p in train.primary_label]
#plot
fig = px.bar(train.btype.value_counts(), title = 'Morning vs Night Birds')
fig.show()

It is clear that the majority of the bird vocalizations were observed during the **morning hours**. This may be because the people who record these vocalizations typically go out during the morning/day hours, or because the birds in this dataset really do sing mostly in the mornings. Both may be true, but it does not actually matter. The initial idea was to ensure an even distribution of morning/night birds in the training and validation sets, but given this result, it is not that important in this case.
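Should the split matter later after all, keeping the morning/night ratio equal in both sets is a stratified split. Here is a minimal standard-library sketch of that idea. The `btype` list is toy data, and `stratified_split` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so that each label keeps roughly the same share in both parts."""
    # Group row indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        # Shuffle within each label, then cut off the validation share
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy btype column: 0 = morning, 1 = night
btype = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(btype)

night_share_train = sum(btype[i] for i in train_idx) / len(train_idx)
night_share_val = sum(btype[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), night_share_train, night_share_val)
```

In practice, scikit-learn's `StratifiedKFold` does the same job with cross-validation folds and would be the more usual choice.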

## Excited to see what I will discover next!