# Data Exploration for Passive Acoustic Monitoring (PAM) Project

This notebook explores the dataset used in the PAM project. We examine the distribution of species and visualize a sample audio clip before moving on to preprocessing and model training.

In [1]:
# Import necessary libraries
import os
import numpy as np  # needed for np.max in the spectrogram cell
import pandas as pd
import matplotlib.pyplot as plt
import librosa
import librosa.display

# Set the path to the metadata file
metadata_file = '../data/train_metadata.csv'

# Load the metadata
metadata = pd.read_csv(metadata_file)
metadata.head()
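
Before loading any audio, it can help to confirm that every file named in the metadata is actually on disk. The sketch below is one possible check, not part of the original pipeline: `find_missing_files` is a hypothetical helper, and the demo uses a temporary directory with made-up file names so it runs without the real dataset.

```python
import os
import tempfile


def find_missing_files(filenames, audio_dir):
    """Return the filenames that do not exist inside audio_dir (hypothetical helper)."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(audio_dir, name))
    ]


# Demo on a temporary directory; the file names here are illustrative only.
with tempfile.TemporaryDirectory() as demo_dir:
    # Create one empty placeholder file so exactly one of the two names exists.
    open(os.path.join(demo_dir, "clip_a.wav"), "wb").close()
    missing = find_missing_files(["clip_a.wav", "clip_b.wav"], demo_dir)

print(missing)
```

In this notebook the equivalent call would presumably be `find_missing_files(metadata['filename'], '../data/raw_audio')`, assuming the `filename` column holds paths relative to that directory.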

## Dataset Overview

Let's take a look at the distribution of species in the dataset.

In [2]:
# Count the number of samples for each species
species_counts = metadata['species'].value_counts()
species_counts

species
species_1    50
species_4    40
species_2    30
species_3    20
species_5    10
Name: count, dtype: int64
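
Raw counts are easier to interpret as shares of the dataset. The following is a minimal sketch that rebuilds the counts shown above as a standalone Series so it runs on its own; in the notebook you would use `species_counts` directly. The names `species_share` and `imbalance_ratio` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the output above so the sketch is self-contained.
species_counts = pd.Series(
    {
        "species_1": 50,
        "species_4": 40,
        "species_2": 30,
        "species_3": 20,
        "species_5": 10,
    },
    name="count",
)

# Fraction of all samples that belongs to each species (sums to 1.0).
species_share = species_counts / species_counts.sum()

# Ratio between the most and the least represented species.
imbalance_ratio = species_counts.max() / species_counts.min()

print(species_share.round(3))
print(f"Imbalance ratio (max/min): {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that class imbalance may be worth addressing later, for example through resampling or class weights.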

### Visualizing Species Distribution

We will create a bar plot to visualize the distribution of species in the dataset.

In [3]:
# Plot the species distribution
plt.figure(figsize=(10, 6))
species_counts.plot(kind='bar')
plt.title('Species Distribution')
plt.xlabel('Species')
plt.ylabel('Number of Samples')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()


## Audio Sample Visualization

Let's visualize a sample audio clip to understand its waveform and spectrogram.

In [4]:
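# Load the first audio file listed in the metadata
sample_file = os.path.join('../data/raw_audio', metadata['filename'].iloc[0])
# sr=None keeps the file's native sampling rate instead of resampling to librosa's default of 22,050 Hz
signal, sr = librosa.load(sample_file, sr=None)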

# Plot the waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(signal, sr=sr)
plt.title('Waveform of Sample Audio')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.grid()
plt.show()
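
Alongside the plot, a few numbers help characterize a clip: its duration, peak amplitude, and RMS level. The sketch below assumes a mono signal stored as a NumPy array, which is what `librosa.load` returns by default. `summarize_waveform` is a hypothetical helper, and the demo signal is a synthetic sine wave standing in for a real recording.

```python
import numpy as np


def summarize_waveform(signal: np.ndarray, sr: int) -> dict:
    """Return duration (s), peak absolute amplitude, and RMS level of a mono signal."""
    return {
        "duration_s": len(signal) / sr,
        "peak": float(np.max(np.abs(signal))),
        "rms": float(np.sqrt(np.mean(signal ** 2))),
    }


# Synthetic stand-in for a loaded clip: 2 s of a 440 Hz sine sampled at 22,050 Hz.
demo_sr = 22050
t = np.arange(2 * demo_sr) / demo_sr
demo_signal = 0.5 * np.sin(2 * np.pi * 440 * t)

summary = summarize_waveform(demo_signal, demo_sr)
print(summary)
```

In the notebook, `summarize_waveform(signal, sr)` would give the same summary for the sample clip.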

### Spectrogram Visualization

Now, let's visualize the mel spectrogram of the same audio sample.

In [5]:
# Compute the mel spectrogram (power), convert it to decibels relative to its peak, and plot it
mel_spec = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

plt.figure(figsize=(12, 4))
librosa.display.specshow(mel_spec_db, sr=sr, x_axis='time', y_axis='mel', cmap='coolwarm')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram of Sample Audio')
plt.tight_layout()
plt.show()
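
To make the decibel conversion less of a black box, here is a NumPy-only sketch of what `librosa.power_to_db(..., ref=np.max)` computes, as far as I understand it: 10·log10 of each value relative to the peak, with everything more than `top_db` below the peak clipped. The function name `power_to_db_sketch` is hypothetical, and the `amin` and `top_db` values are assumed to match librosa's defaults.

```python
import numpy as np


def power_to_db_sketch(power, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to dB relative to its maximum value."""
    power = np.asarray(power, dtype=float)
    ref = np.max(power)
    # Floor tiny values at amin to avoid taking log10(0).
    log_spec = 10.0 * np.log10(np.maximum(amin, power))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # Clip everything more than top_db below the peak.
    return np.maximum(log_spec, log_spec.max() - top_db)


# Tiny made-up "spectrogram" with values spanning many orders of magnitude.
demo_power = np.array([[1.0, 0.1], [0.01, 1e-12]])
demo_db = power_to_db_sketch(demo_power)
print(demo_db)
```

The peak maps to 0 dB, each factor of 10 in power corresponds to a 10 dB step, and the smallest value is clipped at -80 dB, which is why the color bar of the plot above tops out at 0 dB.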

## Conclusion

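In this notebook, we explored the dataset by visualizing the species distribution and examining the waveform and mel spectrogram of a sample audio clip. The species counts are uneven (50 samples for species_1 versus 10 for species_5), which is worth keeping in mind during the preprocessing and model training phases of the PAM project.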