# Download Data
The data is downloaded with the Kaggle API. Uncomment the following cell to run the download.

In [231]:
#!kaggle competitions download -c birdclef-2025
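
The Kaggle CLI normally saves a competition download as a single zip archive, which has to be unpacked before the CSV files can be read. A minimal sketch of that step, assuming the archive is called `birdclef-2025.zip` and should be unpacked into `birdclef-2025/` (the folder used by the paths later in this notebook). `extract_archive` is a hypothetical helper, not part of the Kaggle API.

```python
from pathlib import Path
import zipfile


def extract_archive(zip_path, out_dir):
    """Unpack zip_path into out_dir and return the names of the extracted members."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Assumed file and folder names; adjust them to wherever the download was saved.
archive_path = Path("birdclef-2025.zip")
if archive_path.exists():
    members = extract_archive(archive_path, "birdclef-2025")
    print(f"Extracted {len(members)} files")
```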

# Libraries

In [232]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
from ydata_profiling import ProfileReport

# Understanding Data

The **BirdCLEF+ 2025** competition on Kaggle focuses on identifying species (birds, amphibians, mammals, and insects) from audio recordings made in the El Silencio Natural Reserve, Colombia. This challenge is essential for improving the monitoring of animal populations for conservation purposes.

**Files provided:**

- **train_audio/**: Short recordings of individual sounds from various species, contributed by users of xeno-canto.org, iNaturalist, and the Colombian Sound Archive (CSA) of the Humboldt Institute.

- **test_soundscapes/**: Approximately 700 recordings, each 1 minute long, in ogg format at 32 kHz, used to evaluate the submitted solutions.

- **train.csv**: Metadata for the training recordings, including:
  - **primary_label**: Species code.
  - **secondary_labels**: Other species present in the recording.
  - **latitude & longitude**: Coordinates of the recording.
  - **author**: User who provided the recording.
  - **filename**: Name of the audio file.
  - **rating**: Quality of the recording (1 to 5). The CSA and iNat rows previewed later in this notebook have a rating of 0.0, which suggests that 0 marks a missing rating.
  - **collection**: Source of the recording (XC, iNat, or CSA).

- **sample_submission.csv**: Example of the prediction format, where each row represents an audio segment and the columns correspond to the target species.

- **taxonomy.csv**: Information about the species, including the taxon ID and class (Aves, Amphibia, Mammalia, Insecta).

- **recording_location.txt**: Details about the recording location (El Silencio Natural Reserve).


# Unifying data and fixing some problems

## train.csv and taxonomy.csv

Here I import the datasets and drop the columns that are irrelevant for prediction. The `url` column is saved separately in `urls` before it is dropped.

In [233]:
# Import train.csv and taxonomy.csv as dataframe
train = pd.read_csv("birdclef-2025/train.csv")
urls = train["url"]
train = train.drop(["author", "license", "url"], axis=1)

taxonomy = pd.read_csv("birdclef-2025/taxonomy.csv")

I create a function to split the scientific name, which follows binomial nomenclature: the `genus` comes first with an initial capital letter, followed by the lowercase `species`. Each part is stored in its own column of the dataframe.

In [234]:
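def separate_name(name):
    """Split a scientific name into (genus, species); species is None if absent."""
    # A capitalised word, optionally followed by a lowercase word
    patron = re.compile(r'\b([A-Z][a-z]+)(?:\s+([a-z]+))?\b')
    resultado = patron.search(name)
    if resultado:
        return resultado.group(1), resultado.group(2)
    return None, None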

train[['genus', 'species']] = pd.DataFrame(
    [separate_name(name) for name in train['scientific_name']], 
    index=train.index
)
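
To see how the pattern behaves on one-word names, here is a small sketch on three names taken from the dataset. It uses `Series.str.extract`, a vectorised pandas alternative to calling the function row by row (this is a swapped-in technique, not the cell above). The variable names `names`, `pattern`, and `parts` are only illustrative.

```python
import pandas as pd

# Toy names from the dataset: two binomial names and one single-word name.
names = pd.Series(["Ragoniella pulchella", "Tettigoniidae", "Amazona ochrocephala"])

# Same regular expression as in separate_name; the second group is optional.
pattern = r"\b([A-Z][a-z]+)(?:\s+([a-z]+))?\b"

# One column per capture group; an unmatched optional group becomes NaN.
parts = names.str.extract(pattern)
parts.columns = ["genus", "species"]
print(parts)
```

A name with a single word, such as `Tettigoniidae`, fills the `genus` column and leaves `species` missing, which is worth keeping in mind when grouping by these columns later.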


Because both dataframes describe the same labels, I merge them into a single dataframe, which also brings in `inat_taxon_id` and `class_name`. In addition, some `longitude` values are positive, which is inconsistent with Colombia's position in the Western Hemisphere, where longitude is always negative, so I flip the sign of those values.

In [235]:
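# Left merge keeps every training row and adds the taxonomy columns
taxotrain = train.merge(
    taxonomy[["primary_label", "inat_taxon_id", "class_name"]],
    on="primary_label",
    how="left",
)

# Make every longitude negative (positive values are treated as sign errors)
taxotrain["longitude"] = -taxotrain["longitude"].abs()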
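
Before trusting the merge and the sign flip, it can help to measure what they actually change. The sketch below runs on toy frames with illustrative values, not the real files. It uses the `validate` and `indicator` options of `DataFrame.merge` to check that the taxonomy keys are unique and that every row found a match, then counts the positive longitudes before flipping them. If some recordings were made outside the Americas, a positive longitude could be genuine, so inspecting those rows first may be worthwhile.

```python
import pandas as pd

# Toy stand-ins for train.csv and taxonomy.csv (values are illustrative only).
train_toy = pd.DataFrame({
    "primary_label": ["1139490", "ywcpar", "unknown"],
    "longitude": [-73.7128, 80.3379, None],
})
taxonomy_toy = pd.DataFrame({
    "primary_label": ["1139490", "ywcpar"],
    "inat_taxon_id": [1139490, 19003],
    "class_name": ["Insecta", "Aves"],
})

# validate= raises if taxonomy has duplicate keys; indicator= adds a `_merge`
# column that says whether each training row found a taxonomy match.
merged = train_toy.merge(
    taxonomy_toy,
    on="primary_label",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(f"Rows without a taxonomy match: {len(unmatched)}")

# Count positive longitudes first, so the size of the correction is known.
n_positive = int((merged["longitude"] > 0).sum())
print(f"Positive longitudes: {n_positive}")

# Force every longitude to be non-positive; missing values stay missing.
merged["longitude"] = -merged["longitude"].abs()
```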


Export the new dataframe.

In [236]:
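# index=False keeps the row index out of the file, so reloading does not add an extra unnamed column
taxotrain.to_csv("birdclef-2025/taxotrain.csv", index=False)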
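
By default `to_csv` also writes the row index, and `read_csv` then loads it back as a column named `Unnamed: 0`. A small sketch of the difference, using an in-memory buffer and a toy dataframe instead of the real file:

```python
import io

import pandas as pd

toy = pd.DataFrame({"primary_label": ["ywcpar"], "class_name": ["Aves"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'primary_label', 'class_name']
print(list(reloaded_without_index.columns))  # ['primary_label', 'class_name']
```

If a file has already been written with the index, passing `index_col=0` to `read_csv` is another way to avoid the extra column.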

I use the `taxotrain` dataframe for a first, more basic analysis.

# Analysis by Class, Genus and Species

In [229]:
# taxotrain = pd.read_csv("birdclef-2025/taxotrain.csv")
taxotrain

|  | primary_label | secondary_labels | type | filename | collection | rating | latitude | longitude | scientific_name | common_name | genus | species | inat_taxon_id | class_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139490 | [''] | [''] | 1139490/CSA36385.ogg | CSA | 0.0 | 7.3206 | -73.7128 | Ragoniella pulchella | Ragoniella pulchella | Ragoniella | pulchella | 1139490 | Insecta |
| 1 | 1139490 | [''] | [''] | 1139490/CSA36389.ogg | CSA | 0.0 | 7.3206 | -73.7128 | Ragoniella pulchella | Ragoniella pulchella | Ragoniella | pulchella | 1139490 | Insecta |
| 2 | 1192948 | [''] | [''] | 1192948/CSA36358.ogg | CSA | 0.0 | 7.3791 | -73.7313 | Oxyprora surinamensis | Oxyprora surinamensis | Oxyprora | surinamensis | 1192948 | Insecta |
| 3 | 1192948 | [''] | [''] | 1192948/CSA36366.ogg | CSA | 0.0 | 7.2800 | -73.8582 | Oxyprora surinamensis | Oxyprora surinamensis | Oxyprora | surinamensis | 1192948 | Insecta |
| 4 | 1192948 | [''] | [''] | 1192948/CSA36373.ogg | CSA | 0.0 | 7.3791 | -73.7313 | Oxyprora surinamensis | Oxyprora surinamensis | Oxyprora | surinamensis | 1192948 | Insecta |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28559 | ywcpar | [''] | [''] | ywcpar/iNat77392.ogg | iNat | 0.0 | 7.6921 | -80.3379 | Amazona ochrocephala | Yellow-crowned Parrot | Amazona | ochrocephala | 19003 | Aves |
| 28560 | ywcpar | [''] | [''] | ywcpar/iNat78624.ogg | iNat | 0.0 | 8.9918 | -79.4877 | Amazona ochrocephala | Yellow-crowned Parrot | Amazona | ochrocephala | 19003 | Aves |
| 28561 | ywcpar | [''] | [''] | ywcpar/iNat789234.ogg | iNat | 0.0 | 9.2316 | -70.2041 | Amazona ochrocephala | Yellow-crowned Parrot | Amazona | ochrocephala | 19003 | Aves |
| 28562 | ywcpar | [''] | [''] | ywcpar/iNat819873.ogg | iNat | 0.0 | 10.5838 | -66.8545 | Amazona ochrocephala | Yellow-crowned Parrot | Amazona | ochrocephala | 19003 | Aves |
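
A first look by class can be sketched with a named aggregation. The toy rows below are shaped like `taxotrain` and reuse values from the preview above. The output column names `recordings`, `labels`, and `genera` are my own choice for illustration.

```python
import pandas as pd

# Toy rows shaped like taxotrain (a small subset of its columns).
toy = pd.DataFrame({
    "primary_label": ["1139490", "1139490", "1192948", "ywcpar"],
    "genus": ["Ragoniella", "Ragoniella", "Oxyprora", "Amazona"],
    "class_name": ["Insecta", "Insecta", "Insecta", "Aves"],
})

# Per class: number of recordings, distinct labels, and distinct genera.
summary = toy.groupby("class_name").agg(
    recordings=("primary_label", "size"),
    labels=("primary_label", "nunique"),
    genera=("genus", "nunique"),
)
print(summary)
```

Running the same aggregation on the full `taxotrain` dataframe would show how unevenly the recordings are spread across the four classes.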


In [230]:
taxotrain[taxotrain["species"].isna()]

|  | primary_label | secondary_labels | type | filename | collection | rating | latitude | longitude | scientific_name | common_name | genus | species | inat_taxon_id | class_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 458 | 48124 | [''] | [''] | 48124/CSA03598.ogg | CSA | 0.0 | 5.6521 | -73.5077 | Tettigoniidae | Katydids | Tettigoniidae | NaN | 48124 | Insecta |
| 459 | 48124 | [''] | [''] | 48124/CSA18785.ogg | CSA | 0.0 | 5.7892 | -73.5504 | Tettigoniidae | Katydids | Tettigoniidae | NaN | 48124 | Insecta |
| 460 | 48124 | [''] | [''] | 48124/CSA18795.ogg | CSA | 0.0 | 5.7892 | -73.5504 | Tettigoniidae | Katydids | Tettigoniidae | NaN | 48124 | Insecta |
| 461 | 48124 | [''] | [''] | 48124/CSA18798.ogg | CSA | 0.0 | 5.7892 | -73.5504 | Tettigoniidae | Katydids | Tettigoniidae | NaN | 48124 | Insecta |
| 462 | 48124 | [''] | [''] | 48124/CSA34485.ogg | CSA | 0.0 | 3.5732 | -76.5809 | Tettigoniidae | Katydids | Tettigoniidae | NaN | 48124 | Insecta |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 600 | 52884 | [''] | [''] | 52884/CSA35533.ogg | CSA | 0.0 | 3.5637 | -76.6194 | Gryllidae | True Crickets | Gryllidae | NaN | 52884 | Insecta |
| 601 | 52884 | [''] | [''] | 52884/CSA35537.ogg | CSA | 0.0 | 3.5264 | -76.7371 | Gryllidae | True Crickets | Gryllidae | NaN | 52884 | Insecta |
| 602 | 52884 | [''] | [''] | 52884/CSA35592.ogg | CSA | 0.0 | 4.0058 | -69.8986 | Gryllidae | True Crickets | Gryllidae | NaN | 52884 | Insecta |
| 603 | 52884 | [''] | [''] | 52884/CSA36342.ogg | CSA | 0.0 | 7.3791 | -73.7313 | Gryllidae | True Crickets | Gryllidae | NaN | 52884 | Insecta |
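
The rows without a `species` carry one-word names such as `Tettigoniidae` and `Gryllidae`. Zoological family names conventionally end in "-idae", so these labels appear to be identified only to family level, and the value stored in `genus` is really a family name. A minimal sketch for counting and flagging such rows on toy data follows. The `family_level` column and the suffix check are my own heuristic, not a taxonomic lookup.

```python
import pandas as pd

# Toy rows shaped like taxotrain (a small subset of its columns).
toy = pd.DataFrame({
    "scientific_name": ["Tettigoniidae", "Tettigoniidae", "Gryllidae", "Amazona ochrocephala"],
    "genus": ["Tettigoniidae", "Tettigoniidae", "Gryllidae", "Amazona"],
    "species": [None, None, None, "ochrocephala"],
})

# How many recordings per name have no species?
missing = toy[toy["species"].isna()]
counts = missing["scientific_name"].value_counts()
print(counts)

# Heuristic (an assumption): no species and a name ending in "idae"
# means the label is a family rather than a genus.
toy["family_level"] = toy["species"].isna() & toy["scientific_name"].str.endswith("idae")
print(toy[["scientific_name", "family_level"]])
```

Keeping the flag as a separate column leaves the original `genus` values untouched, so any genus-level analysis can simply filter these rows out.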
