In [None]:
import os
import pandas as pd
import matplotlib

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# Check whether the script is running locally or as a Kaggle notebook and define paths accordingly
if os.getcwd() == "/kaggle/working":
    TRAIN_CSV_PATH = "../input/rfcx-species-audio-detection/"
else:
    TRAIN_CSV_PATH = "data/rfcx-species-audio-detection/"
    matplotlib.use("QT5Agg")
TRAIN_TP_CSV_PATH = os.path.join(TRAIN_CSV_PATH, "train_tp.csv")
TRAIN_FP_CSV_PATH = os.path.join(TRAIN_CSV_PATH, "train_fp.csv")

# Build the overall train_df based on the TP/FP CSV files
train_tp_df = pd.read_csv(TRAIN_TP_CSV_PATH)
train_tp_df["is_tp"] = True
train_fp_df = pd.read_csv(TRAIN_FP_CSV_PATH)
train_fp_df["is_tp"] = False
train_df = pd.concat([train_tp_df, train_fp_df], axis=0, ignore_index=True)


**What is the dataframe structure?**

In [None]:
print(train_df.info())

**How are training instances distributed over species_id and songtype_id? Is the dataset balanced?**

In [None]:
print(train_df.groupby(["species_id", "songtype_id"])["is_tp"].describe())

Regarding the distribution of labels over (species_id, songtype_id), the dataset looks balanced: there are about 350 instances for each pair. Within each pair, however, only about 50 of the roughly 350 instances are TP, so there are approximately 6 times as many FP instances as TP instances. The baseline accuracy of a "Zero Rule" classifier, which always predicts the majority class (FP), would therefore be about 300/350 ≈ 85.7%.
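The baseline figure follows directly from the class ratio: the accuracy of a Zero Rule classifier equals the share of the majority class. The cell below is a minimal sketch on a toy frame; the 50/300 split is an illustrative assumption that mirrors the approximate counts quoted above and is not read from the CSV files.

```python
import pandas as pd

# Toy stand-in for one (species_id, songtype_id) group: 50 TP and 300 FP rows.
# These counts are illustrative assumptions, not values read from the dataset.
toy_df = pd.DataFrame({"is_tp": [True] * 50 + [False] * 300})

# A Zero Rule classifier always predicts the most frequent class,
# so its accuracy is the relative frequency of that class.
class_shares = toy_df["is_tp"].value_counts(normalize=True)
zero_rule_accuracy = class_shares.max()

# Ratio of FP rows to TP rows.
fp_to_tp_ratio = (~toy_df["is_tp"]).sum() / toy_df["is_tp"].sum()

print("Zero Rule accuracy: {:.1f}%".format(zero_rule_accuracy * 100))
print("FP/TP ratio: {:.1f}".format(fp_to_tp_ratio))
```

The same two expressions can be applied to `train_df["is_tp"]` to check the estimate against the real data.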

**How many species are associated with two song-types (values 1 and 4)? How are they distributed statistically?**

In [None]:
print("Song-type ID values for each species ID")
print(train_df.groupby("species_id")["songtype_id"].unique())

Species IDs 17 and 23 have records with both song-type values (1 and 4). To use the FP information properly for these species as well, classes 17 and 23 should be split into four classes defined by (species_id, songtype_id), namely (17, 1), (17, 4), (23, 1), and (23, 4).
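One possible way to derive such classes is to number the distinct (species_id, songtype_id) pairs. The cell below is a minimal sketch: the column name `class_id` and the toy rows are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy rows (illustrative only): species 17 and 23 have both song types,
# species 3 has a single one.
toy_df = pd.DataFrame({
    "species_id": [3, 17, 17, 23, 23, 3],
    "songtype_id": [1, 1, 4, 1, 4, 1],
})

# ngroup() numbers each distinct (species_id, songtype_id) pair in sorted key
# order, so (17, 1) and (17, 4) end up in different classes.
# "class_id" is a hypothetical column name.
toy_df["class_id"] = toy_df.groupby(["species_id", "songtype_id"]).ngroup()

print(toy_df)
```

Species with a single song-type keep one class each, so only the classes of species 17 and 23 are split.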

**Are the time and frequency locations of each label relevant for training?**

In [None]:
# True where a (recording_id, species_id, songtype_id) group mixes TP and FP labels
tp_fp_discordance_df = train_df.groupby(["recording_id", "species_id", "songtype_id"])[
    "is_tp"].nunique().ne(1).sort_values()
print(tp_fp_discordance_df)

The last two rows of the output above (recordings a2441a74b and 178b835e3) have both TP and FP annotations for the same (species_id, songtype_id) in different time frames:


In [None]:
print(train_df[(train_df["recording_id"] == "a2441a74b")])
print(train_df[(train_df["recording_id"] == "178b835e3")])


Hence, associating every TP and FP label with the whole spectrogram would be incorrect in these few cases. Possible actions are:
1. remove these specific labels from the dataset, so that they are ignored;
2. set up the pipeline to feed only the [t_min, t_max] slices of the spectrogram to the training process.
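
Both options can be sketched on toy data. The recording IDs, the value of `frames_per_second`, the spectrogram shape, and the helper name `slice_spectrogram` are hypothetical and introduced only for illustration; this is not the pipeline used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy labels (illustrative): recording "r1" has both a TP and an FP label
# for the same (species_id, songtype_id); recording "r2" does not.
toy_df = pd.DataFrame({
    "recording_id": ["r1", "r1", "r2"],
    "species_id": [17, 17, 5],
    "songtype_id": [4, 4, 1],
    "t_min": [1.0, 20.0, 3.0],
    "t_max": [4.0, 23.0, 6.0],
    "is_tp": [True, False, True],
})

# Option 1: drop every label that belongs to a discordant group.
keys = ["recording_id", "species_id", "songtype_id"]
is_concordant = toy_df.groupby(keys)["is_tp"].transform("nunique") == 1
clean_df = toy_df[is_concordant]


# Option 2: keep all labels, but cut out only the labelled time slice.
def slice_spectrogram(spec, t_min, t_max, frames_per_second):
    """Return the columns of `spec` (freq bins x time frames) within [t_min, t_max]."""
    start = int(np.floor(t_min * frames_per_second))
    stop = int(np.ceil(t_max * frames_per_second))
    return spec[:, start:stop]


# Fake spectrogram: 128 frequency bins, 60 s at an assumed 100 frames per second.
spec = np.zeros((128, 6000))
row = toy_df.iloc[0]
patch = slice_spectrogram(spec, row["t_min"], row["t_max"], frames_per_second=100)

print(clean_df)
print(patch.shape)
```

Option 1 discards a few labels but keeps whole-spectrogram labelling valid; option 2 keeps every label at the cost of a pipeline that works on time slices.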

In [None]:
print(train_df.groupby(["species_id", "songtype_id"])[["f_min", "f_max"]].nunique())


For some (species_id, songtype_id) values, f_min and f_max each take a single value. In other cases, e.g. (species_id, songtype_id) = (14, 1), there are up to 3 distinct f_min and f_max values:


In [None]:
print(train_df[train_df["species_id"] == 14].head(5))


**What is the distribution of the labelled time interval duration values? What is their minimum/maximum?**


In [None]:
t_deltas = (train_df["t_max"]-train_df["t_min"])
print(t_deltas.describe())
t_deltas.hist(bins=20)
print("Percentage of time deltas less than 4.5 s: {:.1f}%".format(
    t_deltas[t_deltas < 4.5].count()/t_deltas.count()*100))

All labelled time intervals last between about 2.6 s and 7.9 s; however, only about one tenth of them last more than 4.5 s.


**What are the minimum value of f_min and the maximum value of f_max?**

In [None]:
print("Minimum value of f_min: {:0.1f} Hz".format(train_df["f_min"].min()))
print("Maximum value of f_max: {:0.1f} Hz".format(train_df["f_max"].max()))