**Part 0: Exploratory Data Analysis**: https://www.kaggle.com/code/virajkadam/birdclef2022-data-and-audio-eda


**Part 2: Training a No-call Classifier**: https://www.kaggle.com/code/virajkadam/birdclef22-p2-no-call-classifier

# Introduction

<h3 style= 'color:green'> The objective of the competition is to classify bird species from audio recordings. </h3>
    

**Extracting Spectrograms**

From this Macaulay Library blog post: https://www.macaulaylibrary.org/2021/07/19/from-sound-to-images-part-1-a-deep-dive-on-spectrogram-creation/

What even is a Spectrogram?

    A spectrogram tracks the sound frequencies (vertical axis) which appear in the waveform, as a function of time (horizontal axis). Brighter colors correspond to louder sounds.

Why use spectrograms? 

    Rather than working with waveforms directly, we have the option of representing our sound as an image either as a spectrogram or some other image representation. This approach has several advantages.
    First, image representations are indispensable tools for bird sound ID experts who are trying to identify species in a recording.
    Second, by using image representations, we have the option of using well-understood computer vision model architectures like ResNets along with pretrained weights from ImageNet.
    
**The No-call Classifier**

    As part of training, we will convert the audio recordings into 10-second spectrograms. An individual spectrogram is not guaranteed to contain a bird call, so it is important to filter out the empty or noisy portions of the audio for better training. Many of the test recordings also contain no bird call.


    Hence we will train a classifier to filter out no-call samples here: https://www.kaggle.com/code/virajkadam/birdclef22-p2-no-call-classifier/edit

**This is step 1 of the training pipeline. Link to the next notebook:**

# Creating Spectrograms

From: https://www.macaulaylibrary.org/2021/07/19/from-sound-to-images-part-1-a-deep-dive-on-spectrogram-creation/

There are a number of choices one can make when constructing a spectrogram. Among them are the following:

• Clip length: How many seconds of audio should a spectrogram represent?

    > Shorter clips often eliminate important context from the soundscape, while longer clips are more expensive to train on.

• STFT window length: Represents a tradeoff between a spectrogram’s level of resolution in the time domain (short window length) and resolution in the frequency domain (long window length). 

    > 256 or 512 samples 

• STFT hop size: A shorter hop size leads to higher resolution in the time domain, but results in larger inputs to the model. In turn, larger inputs may slow down model training and inference.
    
    > 64 or 128 samples
    
• Mel scaling: In a frequency spectrogram, the vertical distance that represents an octave is not constant. As a result, it may be difficult for convolutional filters to learn to recognize harmonies, overtones, and repeated harmonic patterns. Mel scaling rescales the frequency axis, so that fixed differences in musical pitch (e.g. an octave or a fifth) correspond to fixed vertical distances. One possible downside to using mel scaling is that high frequency sounds will become compressed at the top of the spectrogram, and therefore might be harder to distinguish.
    
    > On or off

• Image rescaling: Choices of the parameters above affect the spatial dimensions of the resulting spectrogram. To make meaningful comparisons, we chose a set of image dimensions to rescale our spectrograms to.

    > Rescale to [128, 512] or [96, 512] (for hop size 128), or [128, 1024] (for hop size 64)
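
To make the trade-offs above concrete, here is a minimal sketch (not taken from the blog post) of how window length and hop size set the raw spectrogram shape, and how mel scaling evens out the spacing of octaves. It assumes a centered STFT, a hypothetical 10-second clip at 22,000 Hz, and the common HTK-style mel formula; `stft_shape` and `hz_to_mel` are illustrative helper names.

```python
import math


def stft_shape(n_samples, n_fft, hop_length):
    """Return (frequency_bins, time_frames) for a centered STFT."""
    freq_bins = 1 + n_fft // 2          # longer window -> finer frequency resolution
    frames = 1 + n_samples // hop_length  # shorter hop -> more time frames
    return freq_bins, frames


def hz_to_mel(freq_hz):
    """Convert Hz to mels with the HTK-style formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


n_samples = 10 * 22000  # hypothetical 10-second clip

# Window/hop pairs in the ranges listed above
for n_fft, hop in [(256, 64), (512, 128)]:
    print(f"n_fft={n_fft}, hop={hop} -> shape {stft_shape(n_samples, n_fft, hop)}")

# Width of one octave in Hz versus in mels
for low in (1000, 2000, 4000):
    hz_gap = 2 * low - low
    mel_gap = hz_to_mel(2 * low) - hz_to_mel(low)
    print(f"octave from {low} Hz: {hz_gap} Hz wide, {mel_gap:.0f} mel wide")
```

In Hz, each octave is twice as wide as the one below it. In mels the octaves are much closer to evenly spaced, though not exactly equal.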

# Imports

In [None]:
!pip install audiomentations -q
!pip install pqdm -q

import os
import pandas as pd
import numpy as np
import json

from pathlib import Path
import pqdm
from tqdm import tqdm


import matplotlib.pyplot as plt
from PIL import Image


import warnings
warnings.filterwarnings(action='ignore')

#audio
import librosa
from IPython.display import Audio
#audio augmentations
from audiomentations import Compose,AddGaussianSNR,Shift,TimeStretch,TimeMask,FrequencyMask,PolarityInversion

# Config

In [None]:
# Global vars
RANDOM_SEED = 7
SAMPLE_RATE = 22000
SIGNAL_LENGTH = 10 # seconds
SPEC_SHAPE = (128, 512) # height x width
FMIN = 500
FMAX = 12500
hop_length = int(SIGNAL_LENGTH * SAMPLE_RATE / (SPEC_SHAPE[1] - 1))
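
As a quick check of the `hop_length` formula above, the sketch below works through the arithmetic. It assumes librosa's default centered STFT, where the number of frames is `1 + n_samples // hop_length`; `n_samples` and `n_frames` are names introduced only for this illustration.

```python
SAMPLE_RATE = 22000
SIGNAL_LENGTH = 10        # seconds
SPEC_SHAPE = (128, 512)   # height x width

n_samples = SIGNAL_LENGTH * SAMPLE_RATE                # 220000 samples per chunk
hop_length = int(n_samples / (SPEC_SHAPE[1] - 1))      # 220000 / 511 -> 430

# Frame count of a centered STFT (assumption: center=True, librosa's default)
n_frames = 1 + n_samples // hop_length

print(hop_length, n_frames)
```

Under that assumption each 10-second chunk yields 512 frames, so the mel-spectrogram already has the target width and needs no image rescaling.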

**Loading data**

In [None]:
def json_to_pd(file):
    '''read and conv json file to pd row'''
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
        
    return json_data



In [None]:
#BIRDCLEF 22 DATA 


train_audio = '../input/birdclef-2022/train_audio'
train_metadata = pd.read_csv('../input/birdclef-2022/train_metadata.csv')

#taking samples with rating >= 2.5 (copy so that later column assignments do not act on a slice)
train_metadata = train_metadata.query('rating>=2.5').copy()


#make directories to save spectrograms
Train_Spectrograms = './Train_Spectrograms'
Freefield_Spectograms = './Freefield_Spectrograms'

#-p avoids an error when the directory already exists
!mkdir -p $Train_Spectrograms
!mkdir -p $Freefield_Spectograms

In [None]:
train_metadata.shape

In [None]:
#add a filepath to each audio 

train_metadata['filepath'] = train_audio + '/' + train_metadata['filename']
train_metadata.head()

**Freefield Data**

ABOUT

This dataset contains 7690 10-second audio files in a standardised format, extracted from contributions on the Freesound archive which were labelled with the "field-recording" tag. Note that the original tagging (as well as the audio submission) is crowdsourced, so the dataset is not guaranteed to consist purely of "field recordings" as might be defined by practitioners. The intention is to represent the content of an archive collection on such a topic, rather than to represent a controlled definition of such a topic.

Each audio file has a corresponding text file, containing metadata such as author and tags. 
The dataset has been randomly split into 10 equal-size subsets. This is so that you can perform 10-fold crossvalidation in machine-learning experiments, or can use fixed subsets of the data (e.g. use one subset for development, and others for later validation). Each of the 10 subsets has about 128 minutes of audio; the dataset totals over 21 hours of audio.



In [None]:
# for no-call classification (using the freefield data)

#credit to : 


#get all the json files (with description of the sounds)
file_list = Path("../input/freefield1010/freefield1010").rglob("*.json")

all_audio = []

for filepath in file_list:
    #conv json to pd 
    row = json_to_pd(filepath)
    #add the path of the .wav file that sits next to the json file
    row['filepath'] = str(filepath).rsplit('.',maxsplit=1)[0]  + '.wav'
    
    #append row to list
    all_audio.append(row)
    

    
    
freefield_df = pd.concat(all_audio,
                         ignore_index=True)

#check if there is a bird call in audio 
freefield_df['has_bird_call'] = freefield_df['tags'].apply(lambda x: 'bird' in x).astype(int)

freefield_df.head(3)

In [None]:
#check number of bird calls vs no-calls in freefield data 
print('Number of bird calls',freefield_df[freefield_df['has_bird_call']==1].shape[0])
print('Number of no-calls',freefield_df[freefield_df['has_bird_call']!=1].shape[0])



In [None]:
#undersampling the dataset to 1/4 bird calls and 3/4 no-calls
#(pd.concat replaces DataFrame.append, which was removed in pandas 2.0)

freefield_unsam = pd.concat(
    [freefield_df[freefield_df['has_bird_call']==1],
     freefield_df[freefield_df['has_bird_call']!=1].sample(n=378 * 3, random_state=RANDOM_SEED)],
    ignore_index=True)

freefield_unsam.head(2)

In [None]:
freefield_unsam.shape
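
The undersampling step can also be written without a hard-coded count. This is only a sketch on toy data: `undersample` is a hypothetical helper, and the default 1:3 ratio and the seed mirror the values used in this notebook.

```python
import pandas as pd


def undersample(df, label_col, ratio=3, seed=7):
    """Keep every positive row and sample `ratio` negatives per positive."""
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] != 1]
    # Never ask for more negatives than are available
    n_neg = min(len(negatives), ratio * len(positives))
    sampled = negatives.sample(n=n_neg, random_state=seed)
    return pd.concat([positives, sampled], ignore_index=True)


# Toy stand-in for the freefield dataframe: 5 bird calls, 40 no-calls
toy = pd.DataFrame({"has_bird_call": [1] * 5 + [0] * 40})
balanced = undersample(toy, "has_bird_call")
print(balanced["has_bird_call"].value_counts())
```

Deriving the sample size from the number of positive rows keeps the ratio intact if the tag filter ever changes.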

# Extracting Spectrograms

**Helper Functions**

In [None]:
def plot_spec(path):
    fig,ax = plt.subplots(figsize=(12,6))
    
    im = plt.imread(fname=path)
    plt.axis('off')
    plt.imshow(im,cmap='jet')
    plt.colorbar(shrink=0.25)
    plt.show()

**Add Audio Augmentations**

In [None]:
#Audio Augmentation:
augmentations = Compose(
    [
            FrequencyMask(min_frequency_band=0.05, max_frequency_band=0.15, p=0.25),
            TimeStretch(min_rate=0.11,max_rate=0.3,p=0.25),
            AddGaussianSNR(min_snr_in_db=5, max_snr_in_db=40, p=0.25)
                        ]
                        )
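
Each transform above has its own `p=0.25`. Assuming the transforms are applied independently of each other (my reading of how `Compose` behaves, not something stated in this notebook), the share of chunks that receive at least one augmentation can be estimated as follows.

```python
# Per-transform probabilities from the Compose block above
probs = [0.25, 0.25, 0.25]

# Probability that no transform fires, assuming independence
p_none = 1.0
for p in probs:
    p_none *= (1.0 - p)

p_any = 1.0 - p_none
print(f"untouched: {p_none:.3f}, augmented at least once: {p_any:.3f}")
```

Under that assumption roughly 42% of the chunks are saved unmodified and about 58% get one or more augmentations.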

In [None]:

#function to save stfts

def save_stft(file_path,
              dir_path,
              Augment=True,
              ):
    '''extracting mel-specs from given audio data and saving them to given folder'''
    
    #list to store the ids of the saved spectrograms
    stft_id = []
        
    #load audio     
    sig,sr=librosa.load(file_path,
                        sr=SAMPLE_RATE)
    
    
    n=0
    # break the signal into n second chunks
    for i in range(0,len(sig),int(SIGNAL_LENGTH*SAMPLE_RATE)):
        
        window = sig[i:i + int(SIGNAL_LENGTH * SAMPLE_RATE)]

        # End of signal
        if len(window) < int(SIGNAL_LENGTH * SAMPLE_RATE):
            break
            
            
            
        #Apply audio Augmentations :   
        if Augment:
            window=augmentations(window,
                                 sample_rate=sr)
            
        # extracting mel-spectrograms (recent librosa versions require the keyword y=):
        mel_spec = librosa.feature.melspectrogram(y=window,
                                                  sr=SAMPLE_RATE, 
                                                  n_fft=1024, 
                                                  hop_length=hop_length, 
                                                  n_mels=SPEC_SHAPE[0], 
                                                  fmin=FMIN, 
                                                  fmax=FMAX
                                                  )
        
        
        # log scaling (convert power to decibels, relative to the loudest bin)
        mel_spec = librosa.power_to_db(mel_spec,
                                       ref=np.max)
        
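        # Normalize to [0, 1]; skip the division for a constant (e.g. silent) chunk to avoid NaNs
        mel_spec -= mel_spec.min()
        if mel_spec.max() > 0:
            mel_spec /= mel_spec.max()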
        
        
        #saving Image
        
        #image_id
        ids=file_path.split('/')[-1].split('.')[0]
        
        save_id=f'{ids}_{n}.jpg'
        save_path=os.path.join(dir_path,save_id)
        
        n+=1
        
        image = Image.fromarray(mel_spec * 255.0).convert("L")
        image.save(save_path)
        
        #keep track of the saved image id
        stft_id.append(save_id)
        
    return stft_id
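
The two core steps of `save_stft`, chunking the signal and scaling the decibel values into an 8-bit image, can be tried in isolation on small arrays. This is a simplified sketch: `split_into_chunks` and `to_uint8_image` are hypothetical helpers, and the guard against a constant input is an addition for illustration.

```python
import numpy as np


def split_into_chunks(signal, chunk_len):
    """Split a 1-D signal into full-length chunks, dropping the short tail."""
    n_chunks = len(signal) // chunk_len
    return [signal[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def to_uint8_image(spec_db):
    """Min-max scale a decibel spectrogram to 0..255 (values are truncated)."""
    spec = spec_db - spec_db.min()
    peak = spec.max()
    if peak > 0:  # a constant input would otherwise divide by zero
        spec = spec / peak
    return (spec * 255.0).astype(np.uint8)


# 25 samples with a chunk length of 10 -> two chunks, the last 5 samples are dropped
signal = np.arange(25, dtype=np.float32)
chunks = split_into_chunks(signal, 10)
print(len(chunks), [len(c) for c in chunks])

# Tiny fake "spectrogram" in decibels
img = to_uint8_image(np.array([[-80.0, -40.0], [-20.0, 0.0]]))
print(img)
```

Dropping the tail means a recording shorter than one chunk produces no spectrogram at all, which matches the `break` in the loop above.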



# **Extracting Train Spectrograms**

In [None]:
train_ids=[]
for idx,row in tqdm(train_metadata.iterrows(), total=len(train_metadata)):
    
    #save spectrograms
    audio_ids = save_stft(file_path= row.filepath,
                           dir_path =Train_Spectrograms)
    
    train_ids.extend(audio_ids)
    

**Saving DataFrame with spectrogram ids**

In [None]:
train_df = pd.DataFrame(train_ids,
                        columns=['spec_id'])
train_df['file_id'] = train_df['spec_id'].apply(lambda x: x.split('_')[0])


train_metadata['file_id'] = train_metadata['filename'].apply(lambda x: x.split('/')[1].split('.')[0])


#join dfs
train_df = train_df.merge(train_metadata,
                          on='file_id',
                          how='left')


train_df.to_csv('train_df.csv',
                index=False)

train_df.shape
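
The join above relies on every spectrogram id mapping back to exactly one metadata row. The sketch below replays it on toy data; the filenames and labels are made up, and `validate="many_to_one"` is an optional safeguard that is not part of the original cell.

```python
from pathlib import Path

import pandas as pd

# Made-up spectrogram ids and metadata rows, for illustration only
spec_ids = ["XC100_0.jpg", "XC100_1.jpg", "XC200_0.jpg"]
metadata = pd.DataFrame({
    "filename": ["abc/XC100.ogg", "xyz/XC200.ogg"],
    "primary_label": ["abc", "xyz"],
})

spec_df = pd.DataFrame({"spec_id": spec_ids})
# Strip the trailing "_<chunk index>.jpg" to recover the recording id
spec_df["file_id"] = spec_df["spec_id"].str.rsplit("_", n=1).str[0]
# Recording id is the filename without folder and extension
metadata["file_id"] = metadata["filename"].apply(lambda name: Path(name).stem)

# many_to_one raises if a file_id appears twice in the metadata
merged = spec_df.merge(metadata, on="file_id", how="left", validate="many_to_one")
print(merged[["spec_id", "file_id", "primary_label"]])
```

With a left join every spectrogram keeps its row, so a missing label after the merge points to an id that did not match.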

In [None]:
plot_spec('./Train_Spectrograms/XC177993_0.jpg')

In [None]:
plot_spec('./Train_Spectrograms/XC177993_1.jpg')

# **Extracting Spectrograms from Freefield data (for the no-call classifier)**

In [None]:
freefield_ids=[]
for idx, row in tqdm(freefield_unsam.iterrows(), total=len(freefield_unsam)):
    
    #save spectrograms
    audio_ids = save_stft(file_path= row.filepath,
                           dir_path =Freefield_Spectograms)
    
    freefield_ids.extend(audio_ids)
    
print('Number of samples extracted ',len(freefield_ids))

In [None]:
plot_spec('./Freefield_Spectrograms/44218_0.jpg')

In [None]:
plot_spec('./Freefield_Spectrograms/111096_0.jpg')

In [None]:
#saving free field df 
freefield_unsam.to_csv('freefield_downsampled.csv',index=False)