# Rainforest t-SNE with Colored Labels
This post was inspired by and based on [Bojan's Notebook](https://www.kaggle.com/tunguz/visualizing-fft-features-with-t-sne-and-umap), which showed how to use [Rapids](https://rapids.ai/) to create t-SNE visualizations more quickly with a GPU. I found the post interesting, but really wanted to see colors for each species in the final plot.

Instead of forking the original notebook, I decided to start from scratch. It allowed me more control over how I processed the data, which I prefer.

In [None]:
# Generic and sound-processing libs
import numpy as np
import pandas as pd
import os
!pip install librosa
import librosa

# Rapids and plotting libs
import cupy as cp
import cudf, cuml
from cuml.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

## Determining clip locations
I don't want to use the entire 1-minute clip for each song for a few reasons:
* Most songs are only 1-2 seconds long, and I'm afraid they'll get lost in the noise of a 60 second clip.
* The 60 second recording may have multiple songs. I want to label each clip as having 1 **main** song. I realize even in a 6 second clip, there may be multiple songs. However, it'll happen less often.

To account for this, I take 6-second clips (`CLIP_SIZE`), each centered on the midpoint of the song in question. This means a recording can have multiple rows: one for each song in the recording. This is consistent with the train_tp and train_fp files.
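The windowing rule described above can be sketched on its own, without pandas. This is a minimal illustration, assuming a 60-second recording; the helper name `clip_window` and the example times are hypothetical and not part of the notebook.

```python
CLIP_SIZE = 6.0        # clip length in seconds, as in the notebook's inputs
RECORDING_LEN = 60.0   # assumed recording length in seconds

def clip_window(t_min, t_max, clip_size=CLIP_SIZE, recording_len=RECORDING_LEN):
    """Return (lower, upper) bounds of a clip centered on the song, or None if it does not fit."""
    t_center = t_min + (t_max - t_min) / 2
    lower = t_center - clip_size / 2
    upper = t_center + clip_size / 2
    # Keep only clips that lie strictly inside the recording
    if lower <= 0 or upper >= recording_len:
        return None
    return lower, upper

print(clip_window(20.0, 22.0))  # (18.0, 24.0)
print(clip_window(1.0, 2.0))    # None: the clip would start before the recording does
```

Songs too close to either end of the recording are dropped rather than padded, which is what keeps every clip the same length.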

In [None]:
# Inputs
CLIP_SIZE = 6 # How long to make each clip (s). Centered at the midpoint of the song
F_MIN = 90 # Of all the rows of true positive data, this is the lowest f_min frequency
F_MAX = 14000 # Of all the rows of true positive data, this is the highest f_max frequency
Nfft = 2048 # FFT Parameter
hop_len = 512 # FFT Parameter
sr = 44100 # FFT Parameter

In [None]:
train_tp = pd.read_csv('/kaggle/input/rfcx-species-audio-detection/train_tp.csv')
train_fp = pd.read_csv('/kaggle/input/rfcx-species-audio-detection/train_fp.csv')

# Combine datasets, after labeling which are TP and which are FP
train_tp['tp'] = 'tp'
train_fp['tp'] = 'fp'

train_fp = train_fp.sample(frac=1, random_state=234)
train_fp = train_fp.head(len(train_tp)) # Too many fp rows, so will only keep as many as I have of tp rows

train = pd.concat([train_tp, train_fp])

# Create identification labels
train['species_song'] = train['species_id'].astype(str) + '-' + train['songtype_id'].astype(str)
train['species_song_tp'] = train['species_song'].astype(str) + '-' + train['tp'].astype(str)

def define_data_clips(df):
    '''
    Returns a copy of the dataframe with CLIP_SIZE-second clip bounds to use for t-SNE, centered on each song.
    Removes rows whose clip would extend past the start or end of the recording.
    '''
    df = df.copy() # Avoid modifying the caller's dataframe in place
    df['t_center'] = df['t_min'] + (df['t_max'] - df['t_min'])/2
    df['t_clip_lower'] = df['t_center'] - CLIP_SIZE/2
    df['t_clip_upper'] = df['t_center'] + CLIP_SIZE/2
    df = df[(df['t_clip_lower'] > 0) & (df['t_clip_upper'] < 60)] # Bounds must be within the 60 s recording, so all clips have the same shape
    return df

train = define_data_clips(train)
print(len(train))

In [None]:
# Determine bins to keep
freqs = librosa.fft_frequencies(sr=sr, n_fft=Nfft)
bin_size = freqs[1]-freqs[0]
# librosa returns bin-center frequencies; here freqs[i] is treated as the lower bound of bin i and freqs[i+1] as its upper bound

# Initialize bins to keep
i_low = 0
i_high = len(freqs)
for i in range(len(freqs)):
    if freqs[i]+ bin_size < F_MIN:
        i_low = i+1 # At end of loop, will return the bin number for the first bin to exceed low threshold
    if freqs[i] > F_MAX:
        i_high = i
        break # Break loop once finding first bin above threshold
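As a sanity check on the loop above, here is a minimal sketch that reproduces the same bin selection without librosa. It assumes the frequencies are `k * sr / n_fft`, which is what `librosa.fft_frequencies` returns for these inputs; the upper-case names are only used to keep the example self-contained.

```python
SR, N_FFT = 44100, 2048
F_MIN, F_MAX = 90, 14000

# Same values as librosa.fft_frequencies(sr=SR, n_fft=N_FFT): k * SR / N_FFT
freqs = [k * SR / N_FFT for k in range(N_FFT // 2 + 1)]
bin_size = freqs[1] - freqs[0]

# Number of leading bins whose upper edge is still below F_MIN
i_low = sum(1 for f in freqs if f + bin_size < F_MIN)
# First bin whose frequency exceeds F_MAX (or all bins if none does)
i_high = next((k for k, f in enumerate(freqs) if f > F_MAX), len(freqs))

print(i_low, i_high, i_high - i_low)    # 4 651 647
print(freqs[i_low], freqs[i_high - 1])  # 86.1328125 13996.58203125
```

With these settings the slice `[i_low:i_high]` keeps 647 of the 1025 frequency bins, covering roughly 86 Hz to 14 kHz.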


## Generate STFTs for each clip
I use a pandas apply function to load each recording, clip it down to the times of interest, and convert it to an STFT (short-time Fourier transform, called `sfft` in the code).

I actually considered using Rapids to do all of the FFT calculations, but decided against it when I realized I would have to re-write the FFT calculations. Additionally, I don't think Rapids can parallelize loading the data from the .flac files, which was also time consuming.

**If that's not the case, someone let me know in the comments!**

In [None]:
%%time
def create_sfft(row, sr, Nfft, hop_len, i_low, i_high):
    '''
    For each row in the dataframe, compute the STFT of the CLIP_SIZE-second clip around the song and store the result in the row.
    '''
    offset = row['t_clip_lower']
    duration = row['t_clip_upper'] - row['t_clip_lower']
    fname = f"../input/rfcx-species-audio-detection/train/{row['recording_id']}.flac"
    x , sr = librosa.load(fname, sr=sr, offset=offset, duration=duration)
    
    X = librosa.stft(x, n_fft= Nfft, hop_length=hop_len)
    Xdb = librosa.amplitude_to_db(abs(X)).astype(np.float16) # Need to reduce from float32 to float16 to fit in memory
    Xdb = Xdb[i_low:i_high]
    
    row['sfft'] = Xdb
    return row

train = train.apply(lambda row: create_sfft(row, sr, Nfft, hop_len, i_low, i_high), axis=1)
train['sfft'].iloc[0].shape
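To make the transform in the cell above less of a black box, here is a rough NumPy-only sketch of a magnitude spectrogram in dB. It is not librosa's implementation: it assumes a symmetric Hann window (`np.hanning`), does no centering or padding, and does not clip the dynamic range, so its output will differ from `librosa.stft` plus `librosa.amplitude_to_db` in the details. The function name `stft_db` and the 1 kHz test tone are made up for the example.

```python
import numpy as np

def stft_db(x, n_fft=2048, hop_len=512):
    """Magnitude spectrogram in dB (relative to 1.0) of a 1-D signal, without padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_len
    # Slice the signal into overlapping, windowed frames: shape (n_frames, n_fft)
    frames = np.stack([x[i * hop_len : i * hop_len + n_fft] * window
                       for i in range(n_frames)])
    # One real FFT per frame, transposed to (n_fft // 2 + 1, n_frames)
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    # Floor the magnitude so log10 never sees zero
    return 20.0 * np.log10(np.maximum(spec, 1e-5))

sr = 44100
t = np.arange(sr) / sr                  # one second of samples
x = np.sin(2 * np.pi * 1000.0 * t)      # synthetic 1 kHz tone
S = stft_db(x)
peak_bin = int(S.mean(axis=1).argmax())
print(S.shape, peak_bin, peak_bin * sr / 2048)
```

The loudest row should sit at the bin closest to 1 kHz, which is a quick way to check that the rows of the array really are frequency bins and the columns are time frames.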

## Verify all clips have the same size
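As shown below, all clips have the same shape. Note that 1025x862 is the shape of an untrimmed 10-second clip (1025 frequency bins by 862 time frames); with the 6-second clips and frequency trimming used here, the shape is smaller. We want all the clips to have the same shape, so this is good. Otherwise, we'd need to pad/clip bins as needed.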
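The expected shape can also be worked out by hand. The sketch below assumes each clip holds exactly `clip_seconds * sr` samples, that `librosa.stft` runs with its default centered framing (which yields `1 + n_samples // hop_len` frames), and that the bin-selection loop ends with `i_low = 4` and `i_high = 651`, which is what I get for the settings above. The helper name is hypothetical.

```python
def expected_stft_shape(clip_seconds, sr=44100, hop_len=512, i_low=4, i_high=651):
    """Expected (frequency bins, time frames) of one trimmed spectrogram."""
    n_samples = int(clip_seconds * sr)
    n_frames = 1 + n_samples // hop_len  # frame count under centered framing
    return (i_high - i_low, n_frames)

print(expected_stft_shape(6))                          # (647, 517)
print(expected_stft_shape(10, i_low=0, i_high=1025))   # (1025, 862)
```

The second call shows where 1025x862 comes from: a 10-second clip with no frequency trimming.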

In [None]:
train['sfft_n_freq_bins'] = train['sfft'].apply(lambda x: x.shape[0])
train['sfft_n_time_bins'] = train['sfft'].apply(lambda x: x.shape[1])

# As expected, all data has the same shape. Good!
print(f"med freq bins: {train['sfft_n_freq_bins'].median()}")
print(f"min freq bins: {train['sfft_n_freq_bins'].min()}")
print(f"max freq bins: {train['sfft_n_freq_bins'].max()}")
print(f"med time bins: {train['sfft_n_time_bins'].median()}")
print(f"min time bins: {train['sfft_n_time_bins'].min()}")
print(f"max time bins: {train['sfft_n_time_bins'].max()}")   

## t-SNE
Create a numpy array as an input to the t-SNE algorithm, then run t-SNE using Rapids.

In [None]:
# Create numpy array
train_sfft = train['sfft'].to_list()
train_sfft = np.array(train_sfft)
train_sfft = train_sfft.reshape(train_sfft.shape[0], -1)
print(train_sfft.shape)
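The stacking and flattening step is easier to see on toy data. This is a small sketch with three made-up 2x4 "spectrograms" standing in for the real clips; only the shapes matter here.

```python
import numpy as np

# Three hypothetical clips, each a tiny 2 x 4 spectrogram filled with its own index
clips = [np.full((2, 4), fill_value=i, dtype=np.float16) for i in range(3)]

stacked = np.array(clips)                      # shape (3, 2, 4): clip, freq bin, time frame
flat = stacked.reshape(stacked.shape[0], -1)   # shape (3, 8): one row of features per clip

print(stacked.shape, flat.shape, flat.dtype)
```

Each row of the flattened array is one clip, so t-SNE treats every (frequency bin, time frame) cell as a separate feature. This only works because all clips share the same shape.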

In [None]:
# Build t-SNE
tsne = TSNE(n_components=2)
train_sfft_2D = tsne.fit_transform(train_sfft)

In [None]:
# Put parameters back in the dataframe
train_sfft_2D = cp.asnumpy(train_sfft_2D)
train['tsneX'] = train_sfft_2D[:, 0]
train['tsneY'] = train_sfft_2D[:, 1]

# Plots
Explore the data, with colored labels. Possibly good features for training?

In [None]:
fig = px.scatter(train, x="tsneX", y="tsneY", color="species_song_tp",
                 labels={
                     "tsneX": "X",
                     "tsneY": "Y",
                     "species_song_tp": "Species + Song + TP"
                 },
                 opacity = 0.5, 
                 title=f'Rainforest t-SNE: {CLIP_SIZE} second clip')
fig.update_yaxes(matches=None, showticklabels=False, visible=True)
fig.update_xaxes(matches=None, showticklabels=False, visible=True)
fig.show()

In [None]:
train_tp = train[train['tp']=='tp'].copy() # Copy so columns can be added later without a SettingWithCopyWarning
fig = px.scatter(train_tp, x="tsneX", y="tsneY", color="species_song",
                 labels={
                     "tsneX": "X",
                     "tsneY": "Y",
                     "species_song": "Species + Song"
                 },
                 opacity = 0.5, 
                 title=f'Rainforest t-SNE, True Positives Only: {CLIP_SIZE} second clip')
fig.update_yaxes(matches=None, showticklabels=False, visible=True)
fig.update_xaxes(matches=None, showticklabels=False, visible=True)
fig.show()

In [None]:
train_tp['species_id_str'] = train_tp['species_id'].astype(str) # String labels so plotly uses discrete colors
fig = px.scatter(train_tp, x="tsneX", y="tsneY", color="species_id_str",
                 labels={
                     "tsneX": "X",
                     "tsneY": "Y",
                     "species_id_str": "Species"
                 },
                 opacity = 0.5, 
                 title=f'Rainforest t-SNE, True Positives Only: {CLIP_SIZE} second clip')
fig.update_yaxes(matches=None, showticklabels=False, visible=True)
fig.update_xaxes(matches=None, showticklabels=False, visible=True)
fig.show()

## Save Training Data with t-SNE labels

In [None]:
train = train.drop(columns=['sfft', 'sfft_n_freq_bins', 'sfft_n_time_bins'])
train.to_csv('train_output.csv', index=False)
train.head() # Last expression in the cell, so the preview is displayed

# Takeaways
The output data (train_output.csv) is a great resource for exploring true positives versus false positives, and how similar they may look. Additionally, the t-SNE features *might* be good for training predictive models.

While developing this notebook, I found myself running into CPU and RAM limits. I had to make a few adjustments to solve these issues. Specifically:
* Converting the STFT outputs from float32 to float16
* Taking a 6-second clip length instead of a 10-second length
* Removing frequencies outside of the global range of f_min and f_max

I could have also adjusted the STFT parameters, such as lowering the sampling rate, but didn't need to.
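To get a feel for how much the three adjustments above save, here is a back-of-the-envelope estimate of the spectrogram storage. The clip count `N_CLIPS` is a made-up round number, not the real dataset size, and the 647 kept frequency bins are my own calculation for the settings in this notebook, so treat the totals as illustrative only.

```python
def stft_bytes(n_clips, clip_seconds, n_freq_bins, bytes_per_value, sr=44100, hop_len=512):
    """Approximate bytes needed to hold all spectrograms in memory."""
    n_frames = 1 + int(clip_seconds * sr) // hop_len  # frames per clip with centered framing
    return n_clips * n_freq_bins * n_frames * bytes_per_value

N_CLIPS = 2000  # hypothetical number of clips

baseline = stft_bytes(N_CLIPS, 10, 1025, 4)  # 10 s clips, all bins, float32
reduced = stft_bytes(N_CLIPS, 6, 647, 2)     # 6 s clips, trimmed bins, float16
ratio = baseline / reduced

print(f"baseline: {baseline / 1e9:.2f} GB")
print(f"reduced:  {reduced / 1e9:.2f} GB")
print(f"ratio:    {ratio:.1f}x")
```

Under these assumptions the three changes together shrink the array by a bit more than a factor of five, with the float16 conversion alone accounting for a factor of two.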

Feel free to leave your comments!