# Odd Prediction explained with code example

Mark Eckdahl
https://www.kaggle.com/meckdahl/

If you are here, you are probably confused. You may have already looked at the original skeleton code and still be confused (that was me).
I added more to @cwthompson's code to make it quite a bit more useful.

First off, this code leverages the birdcall-check dataset (big thanks to @shonenkov for a nice checking dataset and notebook, and for several discussions that made this competition better).

[ORIGINAL CODE from @cwthompson] 
In this notebook we will import the test data and make a simple prediction.

@cwthompson put together a working skeletal submission example here:

https://www.kaggle.com/cwthompson/birdsong-making-a-prediction



# Key Points

1. Code runs and commits will not find the bird test files; they stay hidden UNTIL a SUBMIT is done. Yes, you are flying blind
2. This code uses @shonenkov's alternative data for coding and committing IF the SUBMIT data is NOT FOUND
3. I added some functionality to this code: it puts the file path into the test DataFrame and prints much more about what is going on
4. Remember, your code will only see the ACTUAL test data after you press the Submit button on whatever submission.csv file you have (good luck!)

In [None]:
import numpy as np
import pandas as pd
import librosa
import librosa.display
from pathlib import Path

np.random.seed(0)

import warnings
warnings.filterwarnings('ignore')


nSAMPLERATE = 22050

There are several pieces of information that we are given about the competition and the test set. Two of the important pieces are quoted below.

The following can be found on the [evaluation page](https://www.kaggle.com/c/birdsong-recognition/overview/evaluation):
> Submissions will be evaluated based on their row-wise micro averaged F1 score.
> 
> For each row_id/time window, you need to provide a space separated list of the set of birds that made a call beginning or ending in that time window. If there are no bird calls in a time window, use the code nocall.
> 
> There are three sites in the test set. Sites 1 and 2 are labeled in 5 second increments, while site 3 was labeled per audio file due to the time consuming nature of the labeling process.

This explains how we need to structure our submission file. There will be several rows for each test audio file, one per 5-second time window, unless the file is from site 3, which is predicted per audio file.
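
To make the row structure concrete, here is a minimal sketch of how predictions for a few windows could be turned into the space-separated `birds` strings described above. The row ids and bird codes are made up for illustration, and `format_birds` is a hypothetical helper, not part of the competition code.

```python
def format_birds(bird_codes):
    """Join predicted bird codes into one space-separated string.

    Falls back to 'nocall' when nothing was predicted for the window.
    """
    unique_codes = sorted(set(bird_codes))
    return " ".join(unique_codes) if unique_codes else "nocall"


# Hypothetical predictions: (row_id, list of predicted codes) per row.
example_predictions = [
    ("example_row_1", ["amecro", "aldfly"]),  # two birds in one window
    ("example_row_2", []),                    # nothing heard -> nocall
    ("example_row_3", ["amerob"]),            # e.g. a file-level site 3 row
]

# Build the rows in the same [row_id, birds] layout the submission uses.
example_rows = [[row_id, format_birds(codes)] for row_id, codes in example_predictions]

for row_id, birds_string in example_rows:
    print(row_id, "->", birds_string)
```

Sorting and de-duplicating the codes is only a convenience so the output is stable; the quoted rules ask for a set of birds, so the order should not matter.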

The following can be found on the [data page](https://www.kaggle.com/c/birdsong-recognition/data):
>The hidden test set audio consists of approximately 150 recordings in mp3 format, each roughly 10 minutes long. The recordings were taken at three separate remote locations. Sites 1 and 2 were labeled in 5 second increments and need matching predictions, but due to the time consuming nature of the labeling process the site 3 files are only labeled at the file level. Accordingly, site 3 has relatively few rows in the test set and needs lower time resolution predictions.
>
>Two example soundscapes from another data source are also provided to illustrate how the soundscapes are labeled and the hidden dataset folder structure. The two example audio files are BLKFR-10-CPL_20190611_093000.pt540.mp3 and ORANGE-7-CAP_20190606_093000.pt623.mp3. These soundscapes were kindly provided by Jack Dumbacher of the California Academy of Science's Department of Ornithology and Mammology.

This is just further information on the test data.

# Loading Audio

First, we need to be able to read five-second windows of the test audio. We can do this using librosa. If the audio is from site 3 then we need the whole audio clip, which we can get by setting duration to None.

In [None]:
def load_test_clip(path, start_time, duration=5):
    return librosa.load(path, offset=start_time, duration=duration, sr=nSAMPLERATE)[0] 
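
As a rough sanity check on the window arithmetic, the sketch below works out where each five-second window starts and how many samples it should contain at this notebook's sample rate. It assumes each window is identified by its end time in seconds, which is how the prediction loop further down treats the `seconds` column (`start_time = row['seconds'] - 5`). `window_bounds` and `expected_samples` are hypothetical helper names used only for illustration.

```python
SAMPLE_RATE = 22050     # same value as nSAMPLERATE in this notebook
WINDOW_SECONDS = 5      # sites 1 and 2 are labeled in 5 second increments


def window_bounds(end_second, window_seconds=WINDOW_SECONDS):
    """Return (start, end) in seconds for a window identified by its end time."""
    return end_second - window_seconds, end_second


def expected_samples(duration_seconds, sample_rate=SAMPLE_RATE):
    """Number of audio samples a clip of the given duration should hold."""
    return int(duration_seconds * sample_rate)


for end_second in (5, 10, 15):
    start, end = window_bounds(end_second)
    # 'start' is what would be passed as start_time to load_test_clip
    print(f"window {start}-{end}s: offset={start}, samples={expected_samples(end - start)}")
```

A full five-second clip at 22050 Hz should hold 110250 samples, so a much shorter array coming back from `load_test_clip` usually means the offset ran past the end of the file.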


# Getting Test Data

The information on the test audio is given in test.csv, and its first rows are shown below. The test audio itself is in the test_audio folder.

In [None]:
TEST = Path("../input/birdsong-recognition/test_audio").exists()

if TEST:
    DATA_DIR = str(Path("../input/birdsong-recognition/"))
    print("SUBMIT PATH EXISTS, using it")
else:
    # dataset created by @shonenkov, thanks!
    DATA_DIR = str(Path("../input/birdcall-check/"))
    print("Alternative Data PATH in use.")

print("DATA_DIR (Dynamic) = ",DATA_DIR)


TEST_FOLDER = DATA_DIR + '/test_audio/'
test = pd.read_csv(DATA_DIR + "/test.csv")

test.head()

# Possible Birds

The possible birds can be found in the training set, with the ebird_code feature. Almost all are six letter codes. The first twenty have been outputted below.

In [None]:
train = pd.read_csv('../input/birdsong-recognition/train.csv')
birds = train['ebird_code'].unique()
birds=np.append(birds,'nocall')
birds[0:20]

In [None]:
def addPathToTest(test, root_path=TEST_FOLDER):
    test['path'] = root_path + test['audio_id'] + '.mp3'      
    return test
    
test_path = addPathToTest(test)

test_path.head()

# Making Predictions

With all of the information we have found so far, it is now possible for us to make predictions. This can be done in different ways depending on your model - for this example we will just be selecting random birds. For each row:

1. Extract the information from test.csv  
2. Load in the correct clip (using librosa)  
3. Make a prediction  
4. Store the prediction  

In [None]:

def make_prediction(sound_clip, birds, bSite1or2=True):
    if (np.random.randint(0,80) % 79 == 0) or not bSite1or2:
        return np.random.choice(birds)
    return "nocall"  # default guess; a random bird is returned above only for draws 0 and 79 (2.5% of the time) or for site 3
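
It is easy to misread how often `make_prediction` returns a random bird for sites 1 and 2. The short check below enumerates every value that `np.random.randint(0, 80)` can produce (the upper bound is exclusive, so 0 through 79) and counts how many satisfy the `% 79 == 0` condition. It is an illustration of the condition only, not a change to the function.

```python
# All values np.random.randint(0, 80) can return: 0, 1, ..., 79.
possible_draws = range(0, 80)

# Draws that pass the `% 79 == 0` test used in make_prediction.
hits = [n for n in possible_draws if n % 79 == 0]

# Chance that a site 1 or site 2 row gets a random bird instead of "nocall".
bird_probability = len(hits) / len(possible_draws)

print(hits)              # [0, 79]
print(bird_probability)  # 0.025
```

So roughly 1 row in 40 from sites 1 and 2 gets a random bird, and every site 3 row does.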

In [None]:
try:  # Comment out top and bottom to allow error propagation
# if True:
    preds = []
    nCount = 0
    row_id = audio_id = None  # defined up front so the except block can always print them
    for index, row in test.iterrows():
        nCount +=1
        # Get test row information
        site = row['site']
        start_time = row['seconds'] - 5
        row_id = row['row_id']
        audio_id = row['audio_id']
        row_path = row['path']
        print("count = #", nCount, ", start, end = ", start_time, ", ", row['seconds'],"######" )
        
        bSite1or2 = (site == 'site_1' or site == 'site_2')
        # Get the test sound clip
        if bSite1or2:
#             sound_clip = load_test_clip(TEST_FOLDER + audio_id + '.mp3', start_time)
            sound_clip = load_test_clip(row_path, start_time)
            print("===== site = ", site, " - fn = ", (audio_id ), " - 5 sec")
            if nCount < 3:
                print("+++++ path = ", (TEST_FOLDER + audio_id + '.mp3') )
        else:
            sound_clip = load_test_clip(TEST_FOLDER + audio_id + '.mp3', 0, duration=None)
            print("===== site = ", site, " - fn = ", (audio_id ), "- No LIMIT")

        # Make the prediction
        pred = make_prediction(sound_clip, birds, bSite1or2 )

        # Store prediction
        preds.append([row_id, pred])
        if (TEST == False) and (nCount >= 15): # make sure not submitting 
            print("Exit after quick test of 15")
            break
    preds = pd.DataFrame(preds, columns=['row_id', 'birds'])

except Exception as e:
    preds = pd.read_csv('../input/birdsong-recognition/sample_submission.csv')
    print("EXCEPTION -- count = ", nCount," - ", row_id, audio_id)
    print("EXCEPTION  = ",e)

# Outputting Submission

In [None]:
preds.to_csv('submission.csv', index=False)
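
Because the real test data is only visible during a submit, it can help to sanity check the file format before spending a submission. The sketch below is one possible check using only the standard library; `check_submission_text` is a hypothetical helper, and the sample CSV text and row ids are made up for illustration.

```python
import csv
import io


def check_submission_text(csv_text, expected_row_ids):
    """Return a list of human-readable problems found in submission CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["row_id", "birds"]:
        return [f"unexpected columns: {reader.fieldnames}"]

    rows = list(reader)
    problems = []

    # Every expected row_id should appear in the file.
    seen = {row["row_id"] for row in rows}
    missing = [row_id for row_id in expected_row_ids if row_id not in seen]
    if missing:
        problems.append(f"missing row_ids: {missing}")

    # An empty prediction should be written as 'nocall', not left blank.
    for row in rows:
        if not (row["birds"] or "").strip():
            problems.append(f"empty birds for {row['row_id']} (use 'nocall')")
    return problems


# Made-up example: one good row, one blank prediction, one missing row.
sample_text = "row_id,birds\nexample_row_1,nocall\nexample_row_2,\n"
for problem in check_submission_text(
    sample_text, ["example_row_1", "example_row_2", "example_row_3"]
):
    print(problem)
```

In the quick-test path that stops after 15 rows, missing row ids are expected, so the check is mainly useful for the column names and blank predictions.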

In [None]:
preds.tail(30)