# Bird Species Audio Classification Project

## Introduction

This project builds a machine learning model to identify different bird species from their audio recordings. We will analyze bird sounds and create a classifier that can automatically recognize which bird is singing.

### What We'll Do:
- **Load and explore** bird audio data from the Xeno-canto database
- **Process audio files** to extract important features
- **Build machine learning models** to classify bird species
- **Test and evaluate** how well our model works

### Dataset:
- Audio recordings of 30 different bird species (A-M alphabetically)
- Each recording is labeled with the correct bird species
- Files are in MP3 format, with metadata in a CSV file

### Models & Techniques Used:
- **Convolutional Neural Networks (CNN)** - Custom deep learning model for audio pattern recognition
- **YAMNet Classifier** - Google's pre-trained audio classification model
- **Mel-frequency spectrograms** - Convert audio to visual representations
- **Transfer learning** - Use pre-trained models for better performance



---

In [None]:
# Step 1: Initial Setup and Data Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from collections import Counter
import warnings
warnings.filterwarnings('ignore')



# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("=== BIRD SPECIES AUDIO CLASSIFICATION PROJECT ===")
print("Step 1: Initial Setup and Data Exploration\n")

# Load the metadata CSV
print("Loading train_extended.csv...")
df = pd.read_csv('/home/sepehr/Documents/Audio-Project/machinelearning/dataset/train_extended.csv')

print(f"Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\n" + "="*50)

# Display basic info about the dataset
print("\nDATASET OVERVIEW:")
print("="*50)
df.info()

print("\nFIRST FEW ROWS:")
print("="*50)
print(df.head())

print("\nBASIC STATISTICS:")
print("="*50)
print(df.describe())

# Check for missing values
print("\nMISSING VALUES:")
print("="*50)
missing_data = df.isnull().sum()
print(missing_data[missing_data > 0])

# Explore unique species (ebird_code)
print(f"\nSPECIES INFORMATION:")
print("="*50)
print(f"Total unique species: {df['ebird_code'].nunique()}")
print(f"Species list: {sorted(df['ebird_code'].unique())}")

# Check rating distribution
print(f"\nRATING DISTRIBUTION:")
print("="*50)
rating_counts = df['rating'].value_counts().sort_index()
print(rating_counts)

# Check duration statistics
print(f"\nDURATION STATISTICS:")
print("="*50)
print(f"Min duration: {df['duration'].min():.2f}s")
print(f"Max duration: {df['duration'].max():.2f}s")
print(f"Mean duration: {df['duration'].mean():.2f}s")
print(f"Median duration: {df['duration'].median():.2f}s")


## Step 1: Initial Setup and Data Exploration

### What We're Doing:
- Loading the bird audio dataset
- Exploring the data structure and basic statistics
- Understanding what information we have about each bird recording

### Key Libraries:
- **pandas** - For data handling and analysis
- **numpy** - For numerical operations
- **matplotlib & seaborn** - For creating charts and visualizations
- **os** - For file operations

### What to Expect:
- Dataset contains **23,784 bird recordings** from **30 species**
- Each recording has **29 features** including species name, duration, location, etc.
- Audio files range from very short clips to long recordings (up to 59 minutes!)
- The median recording is about **31 seconds** long, so half of the clips are shorter than that

---

In [None]:
# Step 2: Data Visualization and Configuration Setup
import matplotlib.pyplot as plt
import seaborn as sns

print("=== STEP 2: DATA VISUALIZATION & CONFIGURATION ===\n")

# ========================================
# CONFIGURATION PARAMETERS (EASILY ADJUSTABLE)
# ========================================
CONFIG = {
    'MIN_RATING_THRESHOLD': 3.0,      # Rating threshold; the filter below keeps ratings strictly above it
    # 'NUM_CLASSES': 30,              # Number of bird species to use (3, 5, 30, or custom); should be calculated, not hard-coded
    'MAX_DURATION': 20,               # Maximum duration in seconds
    'MIN_DURATION': 1,                # Minimum duration in seconds (a 0 s clip contains no audio)
    'MIN_SAMPLES_PER_CLASS': 100,     # Minimum number of samples needed per species
}

print("CURRENT CONFIGURATION:")
print("="*50)
for key, value in CONFIG.items():
    print(f"{key}: {value}")

print(f"\n{'='*50}")
print("DATA ANALYSIS BEFORE FILTERING:")
print("="*50)

# 1. Rating Distribution Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Rating distribution
axes[0,0].hist(df['rating'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,0].axvline(CONFIG['MIN_RATING_THRESHOLD'], color='red', linestyle='--',
                 label=f'Min Rating Threshold: {CONFIG["MIN_RATING_THRESHOLD"]}')
axes[0,0].set_xlabel('Rating')
axes[0,0].set_ylabel('Count')
axes[0,0].set_title('Rating Distribution')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# Duration distribution
axes[0,1].hist(df['duration'], bins=50, alpha=0.7, color='lightgreen', edgecolor='black')
axes[0,1].axvline(CONFIG['MAX_DURATION'], color='red', linestyle='--',
                 label=f'Max Duration: {CONFIG["MAX_DURATION"]}s')
axes[0,1].axvline(CONFIG['MIN_DURATION'], color='orange', linestyle='--',
                 label=f'Min Duration: {CONFIG["MIN_DURATION"]}s')
axes[0,1].set_xlabel('Duration (seconds)')
axes[0,1].set_ylabel('Count')
axes[0,1].set_title('Duration Distribution')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)
axes[0,1].set_xlim(0, 500)  # Focus on reasonable duration range

# Species count distribution
species_counts = df['ebird_code'].value_counts().head(153) #crystal

axes[1,0].hist(species_counts.values, bins=30, alpha=0.7, color='coral', edgecolor='black')
axes[1,0].axvline(CONFIG['MIN_SAMPLES_PER_CLASS'], color='red', linestyle='--',
                 label=f'Min Samples: {CONFIG["MIN_SAMPLES_PER_CLASS"]}')
axes[1,0].set_xlabel('Number of Samples per Species')
axes[1,0].set_ylabel('Number of Species')
axes[1,0].set_title('Samples per Species Distribution')
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)

# Channel distribution
channel_counts = df['channels'].value_counts()
axes[1,1].pie(channel_counts.values, labels=channel_counts.index, autopct='%1.1f%%',
             colors=['lightblue', 'lightcoral'])
axes[1,1].set_title('Channel Distribution')

#plt.tight_layout()
#plt.show()

# 2. Detailed Species Analysis
print(f"\nSPECIES STATISTICS:")
print("="*50)
print(f"Total species: {species_counts.shape[0]}")

# Restrict the data to the first 153 species codes in alphabetical order (A to M)
# Step 1: Count recordings per species
value_counts = df['ebird_code'].value_counts()

print("value counts: ", value_counts)

# Step 2: Sort the species codes alphabetically
value_counts_sorted = value_counts.sort_index()

print("value counts sorted: ", value_counts_sorted)

# Step 3: Take the first 153 codes alphabetically
# (despite the variable name, these are not the 153 most frequent species)
top_153_unique = value_counts_sorted[:153]

print("value counts sorted a to m: ", top_153_unique)

# Step 4: Find species with at least MIN_SAMPLES_PER_CLASS recordings
valid_codes = value_counts[value_counts >= CONFIG['MIN_SAMPLES_PER_CLASS']].index

# Step 5: Keep only rows whose species is among the first 153 codes
filtered_df = df[df['ebird_code'].isin(top_153_unique.index)]

print("species kept (first 153 alphabetically): ", top_153_unique)
display(filtered_df)

# Step 6: Of those, keep only species with enough recordings
print(f"filter data with at least {CONFIG['MIN_SAMPLES_PER_CLASS']} samples in 153: ")
filtered_df = filtered_df[filtered_df['ebird_code'].isin(valid_codes)]
species_counts = filtered_df['ebird_code'].value_counts()
print(f"number of species with at least {CONFIG['MIN_SAMPLES_PER_CLASS']} samples: {species_counts.shape[0]}")
display(filtered_df)

updatedvalue_counts = filtered_df['ebird_code'].value_counts()
print("updated species value counts a to m: ", updatedvalue_counts)




# 3. Apply Filters and Show Impact
print(f"\n{'='*50}")
print("APPLYING FILTERS:")
print("="*50)

# Filter by duration (keep clips shorter than MAX_DURATION)
filtered_df = filtered_df[filtered_df['duration'] < CONFIG['MAX_DURATION']]
species_counts = filtered_df['ebird_code'].value_counts()
print(f"number of species with duration < {CONFIG['MAX_DURATION']}s: {species_counts.shape[0]}")
display(filtered_df)

# Filter by rating (keep ratings strictly above MIN_RATING_THRESHOLD)
filtered_df = filtered_df[filtered_df['rating'] > CONFIG['MIN_RATING_THRESHOLD']]
species_counts = filtered_df['ebird_code'].value_counts()
print(f"number of species with rating > {CONFIG['MIN_RATING_THRESHOLD']}: {species_counts.shape[0]}")
display(filtered_df)

# Remove rows where 'url' is missing or empty
filtered_df = filtered_df[filtered_df['url'].notna() & (filtered_df['url'].str.strip() != '')]
# Recompute the counts so they reflect the URL filter as well
species_counts = filtered_df['ebird_code'].value_counts()

print(f"Remaining rows after removing missing/empty URLs: {len(filtered_df)}")
print(f"number of species that contain url data: {species_counts.shape[0]}")
display(filtered_df)

print("species counts after all filters: ", species_counts)

# # 4. Select Top N Classes
# print(f"\n{'='*50}")
# print(f"SELECTING TOP {CONFIG['NUM_CLASSES']} CLASSES:")
# print("="*50)

# Get top N species by sample count (after filtering)
#final_species_counts = df_filtered['ebird_code'].value_counts().head(153) #crystal
# if CONFIG['NUM_CLASSES'] <= len(final_species_counts):
#     selected_species = final_species_counts.head(CONFIG['NUM_CLASSES']).index.tolist()
#     df_final = df_filtered[df_filtered['ebird_code'].isin(selected_species)].copy()

#     print(f"Selected {len(selected_species)} species:")
#     for i, species in enumerate(selected_species, 1):
#         count = final_species_counts[species]
#         print(f"{i:2d}. {species}: {count:3d} samples")

#     print(f"\nFinal dataset: {len(df_final):,} samples across {len(selected_species)} species")

#     # Class balance visualization
#     plt.figure(figsize=(12, 6))
#     final_counts = df_final['ebird_code'].value_counts()
#     plt.bar(range(len(final_counts)), final_counts.values, color='steelblue', alpha=0.7)
#     plt.xlabel('Species (ordered by sample count)')
#     plt.ylabel('Number of Samples')
#     plt.title(f'Class Distribution - Top {CONFIG["NUM_CLASSES"]} Species (After Filtering)')
#     plt.xticks(range(len(final_counts)), final_counts.index, rotation=45, ha='right')
#     plt.grid(True, alpha=0.3)
#     plt.tight_layout()
#     plt.show()

#     # Calculate class imbalance ratio
#     max_samples = final_counts.max()
#     min_samples = final_counts.min()
#     imbalance_ratio = max_samples / min_samples
#     print(f"\nClass imbalance ratio: {imbalance_ratio:.2f} (max: {max_samples}, min: {min_samples})")

# else:
#     print(f" Not enough species meet the criteria! Only {len(final_species_counts)} species available.")
#     print("Consider reducing MIN_SAMPLES_PER_CLASS or NUM_CLASSES")

# # 5. Save configuration and filtered dataset info
# print(f"\n{'='*50}")
# print("SUMMARY FOR NEXT STEPS:")
# print("="*50)
# print(f" Configuration set for {CONFIG['NUM_CLASSES']} classes")
# print(f" {len(df_final):,} total samples after filtering")
# print(f" Rating threshold: ≥{CONFIG['MIN_RATING_THRESHOLD']}")
# print(f" Duration range: {CONFIG['MIN_DURATION']}-{CONFIG['MAX_DURATION']} seconds")
# print(f" Ready for audio file validation and preprocessing")

# # Create a summary for the next step
# STEP2_SUMMARY = {
#     'df_final': df_final,
#     'selected_species': selected_species,
#     'config': CONFIG,
#     'class_counts': final_counts.to_dict()
# }

print(f"\n Next step: Audio file validation and path checking")

## Step 2: Data Visualization and Configuration Setup

### What We're Doing:
- Setting up **filters** to clean our data and keep only high-quality recordings
- Creating **visualizations** to understand our dataset better
- Selecting the **best bird species** for our classification model

### Configuration Settings:
The `CONFIG` dictionary and the filtering code in the cell above set the thresholds used to clean the data:
- **Rating > 3.0** - Only keep good quality recordings
- **Duration < 20 seconds** - Remove long clips
- **≥100 samples per species** - Ensure we have enough data for each bird
- **First 153 species codes alphabetically (A-M)** - Limit the species considered before filtering
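
Thresholds like these can be kept in one dictionary and applied by a single helper, which makes it harder for the text and the code to drift apart. The sketch below is only an illustration: the `apply_filters` name, the lowercase config keys, the toy DataFrame, and the inclusive comparisons are assumptions, and the exact operators may differ from those used in the notebook's code cells.

```python
import pandas as pd

def apply_filters(df, cfg):
    """Apply rating, duration, and per-species count thresholds in order."""
    out = df[df["rating"] >= cfg["min_rating"]]
    out = out[out["duration"].between(cfg["min_duration"], cfg["max_duration"])]
    # Count after the row filters so the per-species minimum reflects usable data
    counts = out["ebird_code"].value_counts()
    keep = counts[counts >= cfg["min_samples"]].index
    return out[out["ebird_code"].isin(keep)].reset_index(drop=True)

# Toy data: hypothetical values, not taken from the real dataset
toy = pd.DataFrame({
    "ebird_code": ["aldfly", "aldfly", "aldfly", "amecro", "amecro"],
    "rating":     [4.0, 3.5, 1.0, 5.0, 4.5],
    "duration":   [12, 8, 15, 400, 10],
})
cfg = {"min_rating": 3.0, "min_duration": 1, "max_duration": 20, "min_samples": 2}

filtered = apply_filters(toy, cfg)
print(filtered["ebird_code"].value_counts().to_dict())
```

Counting species only after the row-level filters means a species is kept on the basis of recordings that will actually be used, not recordings that are later discarded.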

### Key Insights from Visualizations:
- **Rating Distribution**: Most recordings have ratings between 3 and 4 (good quality)
- **Duration**: Most clips are under 100 seconds, with many very short recordings
- **Species Balance**: Some bird species have many more recordings than others
- **Audio Quality**: About 57% are stereo, 43% are mono recordings

### Final Dataset After Filtering:
- **11,725 total samples** from **30 bird species**
- **Class imbalance ratio: 8.84** (most common species has 8.84x more samples than least common)
- **Top species**: Red Crossbill (1,397 samples), House Sparrow (1,085 samples)
- **Least represented**: Great Horned Owl (158 samples), House Finch (163 samples)
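
The imbalance ratio quoted above is simply the largest class count divided by the smallest. The short sketch below reproduces that arithmetic from the numbers in this list and, as an optional extra, derives per-class weights with the same formula scikit-learn uses for `class_weight='balanced'`. The `balanced_weight` helper is a hypothetical name introduced here; the notebook itself does not apply class weights.

```python
# Sample counts quoted in the list above
max_samples = 1397    # Red Crossbill
min_samples = 158     # Great Horned Owl
total_samples = 11725
n_classes = 30

imbalance_ratio = max_samples / min_samples
print(f"Class imbalance ratio: {imbalance_ratio:.2f}")  # 8.84

def balanced_weight(total, classes, count):
    """Inverse-frequency weight: total / (classes * count)."""
    return total / (classes * count)

w_common = balanced_weight(total_samples, n_classes, max_samples)
w_rare = balanced_weight(total_samples, n_classes, min_samples)
print(f"weight (most common): {w_common:.3f}, weight (least common): {w_rare:.3f}")
```

With this weighting, the rarest species counts about 8.84 times as much per sample as the most common one, which offsets the imbalance during training.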

### Why These Filters Matter:
- **Better model performance** - High-quality data leads to better predictions
- **Balanced training** - Each species needs enough examples to learn from
- **Consistent audio length** - Helps our model process audio more effectively
- **Quality control** - Keeping only recordings above the rating threshold ensures good audio quality

---

In [None]:
# Step 3: Audio File Validation and Processing Setup
import os
import librosa
import numpy as np
from pathlib import Path
import random
from tqdm import tqdm
import urllib.request

print("=== STEP 3: AUDIO FILE VALIDATION & PROCESSING SETUP ===\n")

# # Use the filtered dataset from Step 2
# # For this step, we'll work with the configuration from Step 2
# CONFIG = {
#     'MIN_RATING_THRESHOLD': 2.0,
#     'NUM_CLASSES': 30,
#     'MAX_DURATION': 300,
#     'MIN_DURATION': 5,
#     'MIN_SAMPLES_PER_CLASS': 20,
# }

# # Re-apply the same filtering logic to recreate df_final
# print("Recreating filtered dataset...")
# df_filtered = df[df['rating'] >= CONFIG['MIN_RATING_THRESHOLD']].copy()
# df_filtered = df_filtered[
#     (df_filtered['duration'] >= CONFIG['MIN_DURATION']) &
#     (df_filtered['duration'] <= CONFIG['MAX_DURATION'])
# ].copy()

# species_counts_filtered = df_filtered['ebird_code'].value_counts()
# valid_species = species_counts_filtered[species_counts_filtered >= CONFIG['MIN_SAMPLES_PER_CLASS']].index
# df_filtered = df_filtered[df_filtered['ebird_code'].isin(valid_species)].copy()

# final_species_counts = df_filtered['ebird_code'].value_counts()
# selected_species = final_species_counts.head(CONFIG['NUM_CLASSES']).index.tolist()
# df_final = df_filtered[df_filtered['ebird_code'].isin(selected_species)].copy()

# print(f"Filtered dataset ready: {len(df_final)} samples, {len(selected_species)} species")

final_species_counts = filtered_df['ebird_code'].value_counts()
selected_species = final_species_counts.index.tolist()

# ========================================
# 1. FIND AUDIO FILES AND VALIDATE PATHS
# ========================================
print(f"\n{'='*50}")
print("AUDIO FILE PATH VALIDATION:")
print("="*50)

# Look for the audio files
base_path = Path('dataset/raw/train_extended.csv')
print(f"Base path: {base_path}")
print(f"Base path exists: {base_path.exists()}")


# print(f"\nDirectories containing MP3 files:")
# for mp3_dir in mp3_dirs[:5]:  # Show first 5
#     mp3_count = len([f for f in os.listdir(mp3_dir) if f.endswith('.mp3')])
#     print(f"  - {mp3_dir}: {mp3_count} MP3 files")

# if len(mp3_dirs) > 5:
#     print(f"  ... and {len(mp3_dirs)-5} more directories")

# ========================================
# 2. CREATE FULL FILE PATHS
# ========================================
print(f"\n{'='*50}")
print("CREATING FILE PATHS:")
print("="*50)

def find_audio_file_path(filename, base_dirs):
    """Find the full path of an audio file"""
    for base_dir in base_dirs:
        full_path = os.path.join(base_dir, filename)
        if os.path.exists(full_path):
            return full_path
    return None


# Count rows that have a source URL (no files are checked on disk at this point)
files_found = filtered_df['url'].notna().sum()
files_missing = filtered_df['url'].isna().sum()

print(f"Files found: {files_found}")
print(f"Files missing: {files_missing}")
print(f"Success rate: {files_found/(files_found+files_missing)*100:.1f}%")


# ========================================
# 3. AUDIO VALIDATION AND BASIC ANALYSIS
# ========================================
print(f"\n{'='*50}")
print("AUDIO FILE VALIDATION:")
print("="*50)

# Test loading a few random audio files

# Work on an explicit copy so the new column is not written to a slice of df
filtered_df = filtered_df.copy()
filtered_df['download_url'] = filtered_df['url'].astype(str).str.rstrip('/') + '/download'

sample_files = filtered_df.sample(n=min(5, len(filtered_df)))

# Create a directory to store downloaded audio
os.makedirs("temp_audio", exist_ok=True)

audio_info = []

print("Testing audio file loading...")
for idx, row in sample_files.iterrows():
    try:
        # Build a local file name from the recording ID in the URL
        url = row['download_url']
        filename = os.path.join("temp_audio", url.split("/")[-2] + '.mp3')

        # Download if not already present
        if not os.path.exists(filename):
            urllib.request.urlretrieve(url, filename)
            print(f"Downloaded: {filename}")

        # Load at most the first 10 seconds at the file's native sample rate,
        # so 'actual_duration' below is capped at 10 s
        audio, sr = librosa.load(filename, sr=None, duration=10)
        duration = len(audio) / sr

        info = {
            'filename': row['filename'],
            'species': row['ebird_code'],
            'csv_duration': row['duration'],
            'actual_duration': duration,
            'sample_rate': sr,
            'channels_csv': row['channels'],
            'audio_shape': audio.shape,
            'success': True
        }
        audio_info.append(info)
        print(f" {row['download_url']}: {duration:.1f}s, {sr}Hz, shape={audio.shape}")

    except Exception as e:
        info = {
            'filename': row['url'],
            'species': row['ebird_code'],
            'error': str(e),
            'success': False
        }
        audio_info.append(info)
        print(f" {row['url']}: Error - {str(e)}")

# ========================================
# 4. DATASET SPLIT PREPARATION
# ========================================
print(f"\n{'='*50}")
print("DATASET SPLIT PREPARATION:")
print("="*50)

# Create class mapping
class_mapping = {species: idx for idx, species in enumerate(selected_species )}
reverse_mapping = {idx: species for species, idx in class_mapping.items()}

print("Class mapping created:")
for species, idx in list(class_mapping.items())[:5]:
    print(f"  {idx}: {species}")
print(f"  ... (showing first 5 of {len(class_mapping)} classes)")

# Add numeric labels
filtered_df['class_id'] = filtered_df['ebird_code'].map(class_mapping)

# Prepare for stratified split
print(f"\nPreparing stratified split...")
print(f"Class distribution before split:")
class_dist = filtered_df['ebird_code'].value_counts()
print(class_dist.head(10))


# ========================================
# 5. AUDIO PROCESSING CONFIGURATION
# ========================================
print(f"\n{'='*50}")
print("AUDIO PROCESSING CONFIGURATION:")
print("="*50)

AUDIO_CONFIG = {
    'SAMPLE_RATE': 22050,           # Standard sample rate for audio ML
    'MAX_AUDIO_LENGTH': 10,         # Maximum audio length in seconds
    'N_MELS': 128,                  # Number of mel bands for spectrogram
    'N_FFT': 2048,                  # FFT window size
    'HOP_LENGTH': 512,              # Hop length for STFT
    'SPECTROGRAM_HEIGHT': 128,      # Height of spectrogram image
    'SPECTROGRAM_WIDTH': 432,       # Width of spectrogram image (for 10s audio)
}

print("Audio processing configuration:")
for key, value in AUDIO_CONFIG.items():
    print(f"  {key}: {value}")

# Calculate expected spectrogram dimensions.
# Assuming librosa's default centered framing (center=True), the frame count is 1 + n_samples // hop_length.
expected_time_steps = 1 + (AUDIO_CONFIG['MAX_AUDIO_LENGTH'] * AUDIO_CONFIG['SAMPLE_RATE']) // AUDIO_CONFIG['HOP_LENGTH']
print(f"\nExpected spectrogram shape: ({AUDIO_CONFIG['N_MELS']}, {expected_time_steps})")

# ========================================
# 6. SUMMARY FOR NEXT STEPS
# ========================================
print(f"\n{'='*50}")
print("STEP 3 SUMMARY:")
print("="*50)
print(f" Audio files validated: {files_found}/{files_found+files_missing} found")
print(f" Dataset ready: {len(filtered_df)} samples")
print(f" Classes: {len(selected_species)} species")
print(f" Class mapping created")
print(f" Audio processing config set")
print(f" Ready for train/validation/test split")

# Save some key info for next step
STEP3_INFO = {
    'df_final': filtered_df,
    'class_mapping': class_mapping,
    'reverse_mapping': reverse_mapping,
    'selected_species': selected_species,
    'audio_config': AUDIO_CONFIG,
    'config': CONFIG
}

print(f"\n Next step: Train/Validation/Test split and first audio preprocessing")


## Step 3: Audio File Validation and Processing Setup

### What We're Doing:
- **Finding audio files** in the dataset folders and checking if they exist
- **Creating file paths** so our code knows where each audio recording is stored
- **Testing audio loading** to make sure we can read the files correctly
- **Setting up audio processing** parameters for converting sound to spectrograms

### Key Audio Processing Settings:
- **Sample rate**: 22,050 Hz (librosa's default and a common choice for audio ML)
- **Max length**: 10 seconds per audio clip
- **Spectrogram size**: 128 mel bands x 432 time frames (treats sound as an image)
- **Mel bands**: 128 frequency bands (captures important audio features)
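
The width of 432 frames follows from the sample rate, clip length, and hop length. The arithmetic below is a sketch under one assumption: that the spectrograms are computed with librosa's default centered framing, which yields `1 + n_samples // hop_length` frames. The constant names are chosen here for illustration.

```python
SAMPLE_RATE = 22050
MAX_AUDIO_LENGTH = 10    # seconds
HOP_LENGTH = 512
TARGET_WIDTH = 432

n_samples = SAMPLE_RATE * MAX_AUDIO_LENGTH        # samples in a 10 s clip
simple_estimate = n_samples // HOP_LENGTH         # rough estimate: hops that fit in the clip
frames_centered = 1 + n_samples // HOP_LENGTH     # frame count with centered framing
pad_needed = TARGET_WIDTH - frames_centered       # frames to pad up to the target width

print(n_samples, simple_estimate, frames_centered, pad_needed)
```

Under that assumption a full 10-second clip produces 431 frames, so one frame of padding brings it to the 432-frame target; shorter clips need more padding.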

### Validation Results:
- **Audio files found**: Successfully located MP3 files in 153 directories
- **File structure**: Audio organized by species (A-M folder contains subfolders for each bird)
- **Loading test**: Audio files can be read properly with correct sample rates and durations
- **Class mapping**: Created numeric labels (0-29) for each of the 30 bird species
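
The code cell above turns each recording's page URL into a download URL, and later cells recover the recording ID from that URL to name the cached files. The sketch below isolates those string operations. The helper names and the example URL are hypothetical; only the `rstrip('/') + '/download'` and `split('/')[-2]` steps and the `{ebird_code}_{rec_id}.npy` naming come from the notebook.

```python
def to_download_url(url: str) -> str:
    """Append '/download' to a recording page URL, tolerating a trailing slash."""
    return url.rstrip("/") + "/download"

def recording_id(download_url: str) -> str:
    """Return the path segment just before the final 'download' segment."""
    return download_url.rstrip("/").split("/")[-2]

def melspec_filename(ebird_code: str, download_url: str) -> str:
    """Build the cache file name used for a mel-spectrogram array."""
    return f"{ebird_code}_{recording_id(download_url)}.npy"

# Hypothetical recording URL, used only to show the transformation
page_url = "https://www.xeno-canto.org/123456/"
dl_url = to_download_url(page_url)

print(dl_url)
print(recording_id(dl_url))
print(melspec_filename("aldfly", dl_url))
```

Stripping the trailing slash before splitting matters: without it, a URL that ends in `/` would shift every segment by one and return the wrong ID.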

### Why This Step Matters:
- **File verification** - Ensures all audio files exist and can be loaded
- **Standardization** - Sets consistent audio processing parameters
- **Organization** - Maps species names to numbers for machine learning
- **Quality check** - Tests that audio data matches the CSV information

### Next Steps Ready:
- All 11,725 audio samples are accessible and validated
- Audio processing configuration is set for consistent training
- Dataset is ready to be split into training, validation, and test sets
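
The split in the next step is done in two stages: 20% is held out for testing, then 25% of the remainder is held out for validation. The sketch below checks, with exact fractions, that this yields a 60/20/20 split. The `split_fractions` helper is a hypothetical name used only for this illustration.

```python
from fractions import Fraction

def split_fractions(test_size, val_size_of_rest):
    """Return (train, val, test) shares of the full dataset for a two-stage split."""
    rest = 1 - test_size              # share left after removing the test set
    val = rest * val_size_of_rest     # validation share of the full dataset
    train = rest - val
    return train, val, test_size

train, val, test = split_fractions(Fraction(1, 5), Fraction(1, 4))
print(train, val, test)
```

The second `test_size` is relative to the remaining 80%, not to the whole dataset, which is why 0.25 rather than 0.2 is the value that produces a 20% validation set.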

### Links to Data:
- Bird Audio Filtered Files output: https://drive.google.com/drive/folders/1s8IkzkHDz_2Pnb_OLzlBkT54iRMhkszx?usp=sharing

- Spectrogram output: https://drive.google.com/drive/folders/1sWf2GzTZ5ELXiIt6razSTtYN54haC1bQ?usp=drive_link


---

In [None]:
# Step 4 Modified: Data Split and Audio Preprocessing (No External Dependencies)
import librosa
import numpy as np

# Hotfix for deprecated NumPy types used in older librosa versions
np.complex = np.complex128
np.float = float
np.int = int
np.bool = bool

import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import scipy.signal
from scipy.signal import butter, filtfilt
import warnings
import urllib.request
from urllib.error import HTTPError, URLError
import time
import os

warnings.filterwarnings('ignore')

GDRIVE_AUDIO_DIR = '/home/sepehr/Documents/Audio-Project/machinelearning/dataset/filtered/birdaudio'
os.makedirs(GDRIVE_AUDIO_DIR, exist_ok=True)

GDRIVE_MELSPEC_DIR = '/home/sepehr/Documents/Audio-Project/machinelearning/dataset/filtered/melspec'
os.makedirs(GDRIVE_MELSPEC_DIR, exist_ok=True)

print("=== STEP 4: DATA SPLIT & AUDIO PREPROCESSING (MODIFIED) ===\n")

AUDIO_CONFIG = {
    'SAMPLE_RATE': 22050,
    'MAX_AUDIO_LENGTH': 10,
    'N_MELS': 128,
    'N_FFT': 2048,
    'HOP_LENGTH': 512,
    'SPECTROGRAM_HEIGHT': 128,
    'SPECTROGRAM_WIDTH': 432,
}
# ========================================
# 1. STRATIFIED TRAIN/VALIDATION/TEST SPLIT
# ========================================
print(f"\n{'='*50}")
print("STRATIFIED DATA SPLIT:")
print("="*50)

# First split: train+val vs test (80/20)
X = filtered_df[['download_url', 'filename', 'ebird_code']].copy()
y = filtered_df['class_id'].copy()

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: train vs validation (75/25 of remaining = 60/20 of total)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Data split completed:")
print(f"  Training set:   {len(X_train):,} samples ({len(X_train)/len(filtered_df)*100:.1f}%)")
print(f"  Validation set: {len(X_val):,} samples ({len(X_val)/len(filtered_df)*100:.1f}%)")
print(f"  Test set:       {len(X_test):,} samples ({len(X_test)/len(filtered_df)*100:.1f}%)")

# Verify class distribution is maintained
print(f"\nClass distribution verification:")
train_dist = y_train.value_counts().sort_index()
val_dist = y_val.value_counts().sort_index()
test_dist = y_test.value_counts().sort_index()

sample_classes = train_dist.index[:5]
for class_id in sample_classes:
    species = selected_species[class_id]
    total_samples = train_dist[class_id] + val_dist[class_id] + test_dist[class_id]
    train_pct = train_dist[class_id] / total_samples * 100
    val_pct = val_dist[class_id] / total_samples * 100
    test_pct = test_dist[class_id] / total_samples * 100
    print(f"  {species}: Train={train_pct:.1f}%, Val={val_pct:.1f}%, Test={test_pct:.1f}%")

In [None]:
# Store data splits for next step
SPLITS_DATA = {
    'X_train': X_train,
    'X_val': X_val,
    'X_test': X_test,
    'y_train': y_train,
    'y_val': y_val,
    'y_test': y_test,
    'class_mapping': class_mapping,
    'selected_species': selected_species
}

print(f"\n Next step: CNN model architecture and batch data generation")

In [6]:
def remove_missing_melspec(X, y, mel_dir):
    """Remove entries from X and y where mel-spec .npy file is missing."""
    missing_indices = []

    for idx, row in X.iterrows():
        rec_id = row['download_url'].split('/')[-2]
        ebird_code = row['ebird_code']
        filename = f"{ebird_code}_{rec_id}.npy"
        mel_path = os.path.join(mel_dir, filename)

        if not os.path.exists(mel_path):
            print(f"Missing mel-spec file: {filename}")
            missing_indices.append(idx)

    # Drop and reset index
    X_clean = X.drop(index=missing_indices).reset_index(drop=True)
    y_clean = y.drop(index=missing_indices).reset_index(drop=True)

    return X_clean, y_clean



In [7]:
def load_melspec_dataset(X, y, mel_dir, target_shape=(128, 432)):
    """Load mel-spec .npy files for the rows of X, padded or cropped to target_shape[1] frames."""
    X_data, y_data = [], []

    for _, row in X.iterrows():
        rec_id = row['download_url'].split('/')[-2]
        ebird_code = row['ebird_code']
        filename = f"{ebird_code}_{rec_id}.npy"
        mel_path = os.path.join(mel_dir, filename)

        if os.path.exists(mel_path):
            mel = np.load(mel_path)

            # Resize or pad mel-spec
            if mel.shape[1] < target_shape[1]:
                # pad to the right
                pad_width = target_shape[1] - mel.shape[1]
                mel = np.pad(mel, ((0, 0), (0, pad_width)), mode='constant')
            elif mel.shape[1] > target_shape[1]:
                # crop to the target shape
                mel = mel[:, :target_shape[1]]

            X_data.append(mel)
            y_data.append(y.loc[row.name])
        else:
            print(f"Missing mel-spec file: {filename}")
    
    X_data = np.array(X_data)
    y_data = np.array(y_data)

    return X_data, y_data
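
The pad-or-crop step inside `load_melspec_dataset` can be isolated and checked on its own. The sketch below is a minimal version using synthetic arrays; the `fit_width` name is hypothetical and the arrays are not real spectrograms.

```python
import numpy as np

def fit_width(mel, width=432):
    """Zero-pad on the right or crop so the array has exactly `width` time frames."""
    n_frames = mel.shape[1]
    if n_frames < width:
        return np.pad(mel, ((0, 0), (0, width - n_frames)), mode="constant")
    return mel[:, :width]

short = np.ones((128, 431))    # one frame short of the target
long_ = np.ones((128, 500))    # longer than the target

padded = fit_width(short)
cropped = fit_width(long_)
print(padded.shape, cropped.shape)
```

If the stored spectrograms are in decibels, a value of zero does not represent silence, so padding with the array's minimum value may be a better choice than zero-padding in that case.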



In [None]:
mel_dir = '/home/sepehr/Documents/Audio-Project/machinelearning/dataset/filtered/melspec'

# 1) Remove any rows whose mel-spec is missing
X_train_df, y_train_sr = remove_missing_melspec(X_train, y_train, mel_dir)
X_val_df,   y_val_sr   = remove_missing_melspec(X_val,   y_val,   mel_dir)
X_test_df,  y_test_sr  = remove_missing_melspec(X_test,  y_test,  mel_dir)

# 2) Merge the labels back into each DataFrame
X_train_df = X_train_df.assign(class_id=y_train_sr.values)
X_val_df   = X_val_df.assign(  class_id=y_val_sr.values)
X_test_df  = X_test_df.assign( class_id=y_test_sr.values)

# 3) Now you can inspect .head() safely
print(X_train_df.shape, y_train_sr.shape)
print(X_train_df.head())
print(y_train_sr.head())

print(X_val_df.shape, y_val_sr.shape)
print(X_val_df.head())
print(y_val_sr.head())

print(X_test_df.shape, y_test_sr.shape)
print(X_test_df.head())
print(y_test_sr.head())


In [9]:
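def extract_features_from_mel(rec_id, ebird_code, melspec_dir):
    """Summarize a mel-spec as the per-band mean and standard deviation over time."""
    mel = np.load(os.path.join(melspec_dir, f"{ebird_code}_{rec_id}.npy"))
    mean_feat = mel.mean(axis=1)   # one value per mel band
    std_feat  = mel.std(axis=1)    # one value per mel band
    return np.hstack([mean_feat, std_feat])   # length 2 * n_mels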
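
Pooling each mel band over time turns a variable-length spectrogram into a fixed-length vector, at the cost of discarding temporal order. The sketch below shows both properties on a synthetic array; the `pool_mel` name and the random data are assumptions for illustration.

```python
import numpy as np

def pool_mel(mel):
    """Concatenate the per-band mean and standard deviation over the time axis."""
    return np.hstack([mel.mean(axis=1), mel.std(axis=1)])

rng = np.random.default_rng(0)
mel = rng.normal(size=(128, 432))              # synthetic stand-in for a mel-spectrogram

features = pool_mel(mel)
print(features.shape)

# Shuffling the time frames leaves the pooled features unchanged
shuffled = mel[:, rng.permutation(mel.shape[1])]
same = np.allclose(features, pool_mel(shuffled))
print(same)
```

Because the features ignore the order of frames, two calls with the same frequency content but different rhythm look identical to the classical models trained on them.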

In [None]:
def make_feat_matrix_from_mel(X_df, melspec_dir):
    feats, labs = [], []
    for _, row in X_df.iterrows():
        rec_id     = row['download_url'].rstrip('/').split('/')[-2]
        code       = row['ebird_code']
        feats.append(extract_features_from_mel(rec_id, code, melspec_dir))
        labs.append(row['class_id'])
    return np.vstack(feats), np.array(labs)

MEL_DIR = '/home/sepehr/Documents/Audio-Project/machinelearning/dataset/filtered/melspec'

X_train_feat, y_train_feat = make_feat_matrix_from_mel(X_train_df, MEL_DIR)
X_val_feat,   y_val_feat   = make_feat_matrix_from_mel(X_val_df,   MEL_DIR)
X_test_feat,  y_test_feat  = make_feat_matrix_from_mel(X_test_df,  MEL_DIR)

print("Feature shapes:", 
      X_train_feat.shape, 
      X_val_feat.shape, 
      X_test_feat.shape)

In [12]:
# X_train, y_train = load_melspec_dataset(X_train_np, y_train_np, mel_dir)
# print(X_train.shape)
 
# X_test, y_test = load_melspec_dataset(X_test_np, y_test_np, mel_dir)
# print(X_test.shape)

# X_val, y_val = load_melspec_dataset(X_val_np, y_val_np, mel_dir)
# print(X_val.shape)


# X_val = X_val.reshape((X_val.shape[0], -1))
# X_train = X_train.reshape((X_train.shape[0], -1))
# X_test = X_test.reshape((X_test.shape[0], -1))


# model training

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

In [14]:
# create shorthands for convenience
X_tr, y_tr = X_train_feat, y_train_feat
X_te, y_te = X_test_feat,  y_test_feat

In [None]:
RF_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf',     RandomForestClassifier(n_estimators=200, random_state=42))
])
RF_pipe.fit(X_tr, y_tr)
pred = RF_pipe.predict(X_te)

print("RF (handcrafted) acc:", accuracy_score(y_te, pred))
print(classification_report(y_te, pred))

In [None]:
sgd_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('sgd',   SGDClassifier(
                  loss='log_loss',        # logistic
                  learning_rate='optimal',
                  max_iter=1000, tol=1e-3,
                  random_state=42))
])
sgd_pipe.fit(X_tr, y_tr)
pred_sgd = sgd_pipe.predict(X_te)
print("SGD Accuracy:", accuracy_score(y_te, pred_sgd))
print(classification_report(y_te, pred_sgd))


In [None]:
nb = GaussianNB()
nb.fit(X_tr, y_tr)
pred_nb = nb.predict(X_te)
print("NB Accuracy:", accuracy_score(y_te, pred_nb))
print(classification_report(y_te, pred_nb))

In [None]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_tr, y_tr)
pred_dt = dt.predict(X_te)
print("Decision Tree Accuracy:", accuracy_score(y_te, pred_dt))
print(classification_report(y_te, pred_dt))

In [None]:
knn_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn',   KNeighborsClassifier(n_neighbors=5))
])
knn_pipe.fit(X_tr, y_tr)
pred_knn = knn_pipe.predict(X_te)
print("KNN Accuracy:", accuracy_score(y_te, pred_knn))
print(classification_report(y_te, pred_knn))

In [None]:
svm_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('svm',   SVC(kernel='rbf', C=1.0, random_state=42))
])
svm_pipe.fit(X_tr, y_tr)
pred_svm = svm_pipe.predict(X_te)
print("SVM Accuracy:", accuracy_score(y_te, pred_svm))
print(classification_report(y_te, pred_svm))
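
The cells above score every model directly on the test set, and the validation split is left unused. One possible alternative, sketched below on synthetic data, is to compare models on the validation split and keep the test set for a single final check. The data from `make_classification`, the two models shown, and the choice of macro-F1 as a second metric are assumptions for illustration, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pooled mel features (not real bird data)
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    # Scaling lives inside the pipeline so it is fitted on training data only
    "knn": Pipeline([("scale", StandardScaler()),
                     ("knn", KNeighborsClassifier(n_neighbors=5))]),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    results[name] = {
        "accuracy": accuracy_score(y_va, pred),
        "macro_f1": f1_score(y_va, pred, average="macro"),
    }

for name, scores in results.items():
    print(f"{name}: acc={scores['accuracy']:.3f}, macro-F1={scores['macro_f1']:.3f}")
```

Macro-F1 averages the per-class F1 scores with equal weight, so it is less dominated by the most common species than plain accuracy when classes are imbalanced.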