# ERROR:
File size limits are killing the kernel. The function that exports the list of matrices breaks after the first round of pickling. I managed to brute-force my way through a chunk of the data, but the two resulting .pkl files are 14 GB and 18 GB, which (a) is too large considering this is only a quarter of the data, and (b) doesn't make sense: the first should be larger than the second, so the files may also be corrupt. I can't tell, because pickle can't handle importing something that large...  
  
EDIT: The kernel survives if the function pickles at len=2000 instead of 3000 and the matrices use smaller dimensions (lower resolution). That now gives 3 x 12 GB of pickled lists; I doubt this is the best way to go about data management.  


# Generating model inputs from raw audio

Generating spectrograms and then saving them as _images_ loses much of the information stored in the spectrogram input data. The spectrogram input is a complex-valued matrix D holding the magnitude and phase of frequency bin f at frame t. A plotted spectrogram shows time on the x-axis and frequency on the y-axis, with amplitude (intensity, in decibels) as color. The data here have a minimum sample rate of 50 kHz, i.e. 50,000 samples per second across each 30-second clip. The frequency and time binning done to generate spectrogram matrices groups these 1.5M+ samples into windows of a specified length; the lowest resolution used for marine mammal acoustics yields a matrix of dimensions >1000 x >1000.

If the rendered spectrogram is exported to a .png and reimported, the resulting matrix dimensions represent pixels (based on the physical size of the export), and the matrix values represent colors on a different scale than audio intensity. Very careful standardization of spectrograms, cmaps, etc. may yield a comparable input for a CNN; but given the significant smoothing, binning, resampling, frequency ranges, and other parameters that go into generating each spectrogram matrix, I think the information loss is too great, or at least too hard to quantify.

To "feed spectrograms to CNN", what we want is the true spectrogram matrix, not the matrix of the .png (or other image format) rendering of the spectrogram matrix. 
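
As a rough illustration of the loss described above, the sketch below squeezes a float32 magnitude matrix into 8-bit grey levels (roughly what one image channel can hold) and maps it back. The matrix is random stand-in data, not a real spectrogram, and the linear scaling is an assumption for illustration; a real image export also resamples to pixel dimensions and applies a colormap.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for a magnitude spectrogram (random values, NOT real audio)
S = rng.random((2049, 740), dtype=np.float32)

# scale to 0-255 and round, as an 8-bit image channel would store it
lo, hi = S.min(), S.max()
as_pixels = np.round((S - lo) / (hi - lo) * 255).astype(np.uint8)

# map the pixel values back onto the original scale
restored = as_pixels.astype(np.float32) / 255 * (hi - lo) + lo

n_levels = np.unique(as_pixels).size          # at most 256 distinct values survive
max_error = float(np.abs(S - restored).max()) # rounding error that cannot be undone
print(n_levels, max_error)
```

Whatever the exact numbers, the round trip cannot recover more than 256 distinct intensity values per channel.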

---
Parameters used herein, based on marine mammal and underwater acoustic literature review:
* sr=50 kHz --> all data are either 50k or 64k; resample the 64k files to 50k so every frame covers the same amount of time
* n_fft=4096; downsized to accommodate file size issues. Ideal was n_fft>=8192 (based on methods from Thomas et al 2019, and applying to the unique attributes of _S. longirostris_ vocalizations)
    * n_fft=128 could best capture clicks (bursts and/or trains) 
        * 128/50000 = .00256 sec = 2560$\mu$s per FFT
        * fastest dolphin clicks (burst pulses) ~ 1750 clicks/sec = 570$\mu$s between clicks, each click 50-128 $\mu$s duration 
        * 2560$\mu$s window could capture several clicks; too short for whistles
    * more inclusive window ~ sr/4 = 12500 --> n_fft=8192(.16 sec); n_fft=16384(.33s)
        * _S. longirostris_ whistle duration = 0.05-1.28s, avg .49s
        * don't want to exceed n_fft=16384 (next power of 2 is 32768, i.e. 32K) because matrix size gets out of hand
    * Future goal: use multiple n_fft windows: train separate NNs on spectrograms of different n_ffts and compare performance, OR stack/interpolate into single spectrogram 
* win_length = n_fft
    * Future: smaller values to better discriminate clicks
* hop_length = n_fft/2; downsized to accommodate file size issues. Ideal was win_length/4 (default)
* window: Hann window
    * tapers each block to zero at both ends, so the block behaves as if it were periodic (reduces spectral leakage)
    * Future: Hamming, Blackman
    
---
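
The window durations and matrix sizes quoted above follow from simple arithmetic. A minimal sketch, assuming librosa's default centered framing (`center=True`), under which the STFT has `1 + n_fft/2` frequency bins and `1 + n_samples // hop_length` frames. `window_seconds` and `stft_shape` are hypothetical helper names, and the 30-second clip length is nominal.

```python
def window_seconds(n_fft, sr=50000):
    """Duration in seconds covered by one FFT window."""
    return n_fft / sr

def stft_shape(n_samples, n_fft, hop_length):
    """(frequency bins, frames) of a centered STFT."""
    return (1 + n_fft // 2, 1 + n_samples // hop_length)

n_samples = 30 * 50000  # nominal 30-second clip at 50 kHz
for n_fft in (128, 4096, 8192, 16384):
    # hop_length = n_fft/2, as used in this notebook
    print(n_fft, window_seconds(n_fft), stft_shape(n_samples, n_fft, n_fft // 2))
```

Halving `n_fft` halves the number of frequency bins, while the number of frames depends only on `hop_length`.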
  
This notebook generates spectrogram matrices for training and test data with Librosa's stft function.



In [2]:
import pandas as pd
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
import os
import wave
import librosa
import librosa.display
import IPython.display as ipd
import pickle
import sys

## Step 1. set aside holdout set
This set will remain separate from regular train/test data for final validation.  
500 positive files, 1538 negative files

In [2]:
#relocate positives and negatives from existing folders to holdout folders
def define_holdout(n_pos, pos_origin, pos_dest, n_neg, neg_origin, neg_dest):
    """Move n_pos / n_neg randomly chosen files from the origin folders to the holdout folders."""
    import shutil
    
    #positives: the folder is re-listed on every pass, so a moved file can't be picked twice
    for i in range(n_pos):
        shutil.move(os.path.join(pos_origin, np.random.choice(os.listdir(pos_origin))), pos_dest)
        
    #negatives
    for j in range(n_neg):
        shutil.move(os.path.join(neg_origin, np.random.choice(os.listdir(neg_origin))), neg_dest)
        
    #report how many files ended up in each folder
    print(f'# positives in holdout: {len(os.listdir(pos_dest))}')
    print(f'# negatives in holdout: {len(os.listdir(neg_dest))}')
    print(f'# pos for tts: {len(os.listdir(pos_origin))}')
    print(f'# neg for tts: {len(os.listdir(neg_origin))}')

In [3]:
pos_origin = '../scratch_data/yes_dolphin/'
pos_dest = '../scratch_data/holdout/positives'
neg_origin = '../scratch_data/no_dolphin/'
neg_dest = '../scratch_data/holdout/negatives'

# define_holdout(500, pos_origin, pos_dest, 1538, neg_origin, neg_dest)

# positives in holdout: 501
# negatives in holdout: 1538
# pos for tts: 5588
# neg for tts: 16766
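
One way to make the holdout draw repeatable is to pick all the file names up front with a seeded sampler and only then move them. This is a sketch of an alternative, not the function used above; `pick_holdout` and the seed value are hypothetical, and the file names are made up for the demo.

```python
import random

def pick_holdout(filenames, n, seed=42):
    """Return n distinct file names, chosen reproducibly for a given seed."""
    rng = random.Random(seed)
    # sort first so the draw doesn't depend on the order the folder was listed in
    return rng.sample(sorted(filenames), n)

demo_files = [f"clip_{k:04d}.wav" for k in range(20)]  # made-up names
holdout = pick_holdout(demo_files, 5)
remaining = sorted(set(demo_files) - set(holdout))
print(holdout)
print(len(remaining))
```

Because the names are chosen before anything is moved, the same seed always gives the same holdout set.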

## Step 2. Create spectrograms for tts
* Pos and Neg files currently only differentiated by directory; attach label (Yes/No) here
* Save spectrogram matrices and labels in list(s), save list via pickle or librosa cache?

---
* Adding all pos ID spectrogram matrices (only 1/4 total tts data) to a single list breaks kernel somewhere between 4000-5000 files. Instead, pickle increments of 2000, tupled with their presence/absence labels.
* Kernel still died after exporting first pkl @ 3000 (presumably emptying spectrolist, so _list_ size not the issue now; must be pickle limits)
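
One alternative to pickling lists of tuples, sketched here under assumptions and not tested on this dataset: write each matrix straight into a single on-disk `.npy` array with `numpy.lib.format.open_memmap`, so only one spectrogram sits in memory at a time. The sketch assumes every matrix is cropped or zero-padded to a fixed number of frames and that labels are stored separately. `write_spectrograms` is a hypothetical helper, and the demo matrices are small random stand-ins.

```python
import os
import tempfile
import numpy as np
from numpy.lib.format import open_memmap

def write_spectrograms(spec_iter, n_items, n_bins, n_frames, out_path):
    """Stream (n_bins, frames) float32 matrices into one on-disk .npy array."""
    out = open_memmap(out_path, mode="w+", dtype=np.float32,
                      shape=(n_items, n_bins, n_frames))
    for k, S in enumerate(spec_iter):
        frames = min(S.shape[1], n_frames)  # crop long clips; short ones stay zero-padded
        out[k, :, :frames] = S[:, :frames]
    out.flush()
    del out  # release the memory map

rng = np.random.default_rng(0)
demo = [rng.random((9, w), dtype=np.float32) for w in (12, 13, 11)]  # random stand-ins
path = os.path.join(tempfile.mkdtemp(), "tts_demo.npy")
write_spectrograms(iter(demo), n_items=3, n_bins=9, n_frames=12, out_path=path)

# mmap_mode="r" reads slices lazily instead of loading the whole file
specs = np.load(path, mmap_mode="r")
print(specs.shape)
```

The file on disk is still as large as the raw matrices, but neither writing nor reading requires the full set in memory.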

In [20]:
def listospects(in_path, n_fft, win_length, window, pos_neg, ex_path):
    import pickle
    
    files = os.listdir(in_path)
    spectrolist = []
    count = 1
    
    for i in files:
    #for i in range(3001, len(files)):  #for when you had to rerun from the middle...
        y, sr = librosa.load(in_path + i, sr=50000) #hardcode sr
        S = np.abs(librosa.stft(y, n_fft=n_fft, win_length=win_length, hop_length=int(n_fft/2),
                               window=window))
        
        spectrolist.append((S, pos_neg)) #append tuple: ([spectro matrix], presence/absence)
        
        if len(spectrolist) % 1000 == 0:
            print(len(spectrolist))
            
        if len(spectrolist) % 2000 == 0: #save results so far and clear list (size limits?)
            with open(f'{ex_path}tts_{count}.pkl', mode = 'wb') as pickle_out:
                pickle.dump(spectrolist, pickle_out)
            count += 1
            spectrolist = []
    #export whatever's in list when pau        
    if len(spectrolist) > 0:
        with open(f'{ex_path}tts_{count}.pkl', mode = 'wb') as pickle_out:
            pickle.dump(spectrolist, pickle_out)
            
    print("thanks for all the fish")

In [None]:
#positive IDs first
#kernel dies after first pickle_out

in_path = '../scratch_data/yes_dolphin/'
ex_path = '../scratch_data/tts_matrices/'
# yes_tts = listospects(in_path=in_path, n_fft=8192, win_length=8192,
#                       window=signal.windows.hann, pos_neg=1, ex_path=ex_path)

1000
2000


In [4]:
with open ('../scratch_data/tts_matrices/tts_1a.pkl', mode = 'rb') as pickle_in:
    workplease = pickle.load(pickle_in)

EOFError: Ran out of input

In [7]:
os.path.getsize('../scratch_data/tts_matrices/tts_1a.pkl') #ruh roh

14532512366

In [22]:
# try with smaller n_fft?
# changed hop_length to n_fft/2 in function for this example; change back to ideal n_fft/4 if possible!
in_path = '../scratch_data/yes_dolphin/'
ex_path = '../scratch_data/tts_matrices/'
fft_sm = 4096
yes_tts = listospects(in_path=in_path, n_fft=fft_sm, win_length=fft_sm,
                      window=signal.windows.hann, pos_neg=1, ex_path=ex_path)

1000
2000
1000
2000
1000
thanks for all the fish


In [24]:
with open ('../scratch_data/tts_matrices/tts_1.pkl', mode = 'rb') as pickle_in:
    howboutnow = pickle.load(pickle_in) #!!

In [28]:
howboutnow[0][0].shape

(2049, 740)

In [32]:
sys.getsizeof(howboutnow)

18104

#### Random investigation of matrix and file sizes
The above function ended up working with the reduced matrix sizes. Not the resolution I was going for, but it could suffice if needed. The output is 33.9 GB for just the positive files, roughly 1/4 of the total train-test-split data. This strikes me as somewhat bonkers considering the audio files themselves are 19.7 GB. 
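
The 33.9 GB figure is what the matrix dimensions predict. A quick check, assuming float32 values (4 bytes each) and about 740 frames per clip; `projected_gb` is a hypothetical helper.

```python
def projected_gb(n_files, n_bins, n_frames, bytes_per_value=4):
    """Uncompressed size in GB (1e9 bytes) of n_files spectrogram matrices."""
    return n_files * n_bins * n_frames * bytes_per_value / 1e9

small = projected_gb(5588, 2049, 740)   # n_fft=4096, hop_length=2048
large = projected_gb(5588, 4097, 739)   # n_fft=8192, hop_length=2048
print(round(small, 1), round(large, 1))
```

With hop_length = n_fft/2, every hop of 2048 samples yields 2049 float32 values, so the matrix stores roughly one 4-byte value per audio sample. If the .wav files hold 16-bit samples (an assumption, not checked here), that alone would make the spectrograms about twice the size of the audio.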

In [4]:
test_path = '../scratch_data/testinggg/'

In [5]:
test_files = os.listdir(test_path)

In [6]:
test_files

['makua2016_00014626.e.wav',
 'makua2016_00014642.e.wav',
 'makua2016_00014611.e.wav',
 'makua2016_00014628.e.wav',
 'honolua2016_00000104.e.wav']

In [7]:
n_fft = 8192
# win_length = n_fft (default)
# hop_length = n_fft/4 (default)
window = signal.windows.hann

y, sr = librosa.load(test_path+test_files[0], sr=50000)
S = np.abs(librosa.stft(y, n_fft=n_fft, win_length=n_fft, hop_length=int(n_fft/4), 
                        window=window))

In [8]:
S.shape

(4097, 739)

In [9]:
import sys
sys.getsizeof(S) #12_110_852 bytes = 12.11 MB * 5588 pos files = 67.67 GB

12110852

In [10]:
spectlist = []
spectlist.append((S,0))
sys.getsizeof(spectlist)

88

In [15]:
n_fft_sm=4096
y, sr = librosa.load(test_path+test_files[0], sr=50000)
S_smaller = np.abs(librosa.stft(y, n_fft=n_fft_sm, hop_length=int(n_fft_sm/2), 
                        window=window))

In [16]:
S_smaller.shape

(2049, 739)

In [41]:
spectlist_sm=[]
spectlist_sm.append((S_smaller,1))
print(sys.getsizeof(S_smaller))
print(sys.getsizeof(spectlist_sm))

6056964
88


In [42]:
s_list=[]
for i in test_files:
    y, sr = librosa.load(test_path+i, sr=50000)
    S = np.abs(librosa.stft(y, n_fft=n_fft, win_length=n_fft, hop_length=int(n_fft/4), 
                            window=window))
    s_list.append((S,1))

In [45]:
print(len(s_list))
print(sys.getsizeof(s_list)) #so what's the deal?
print(sys.getsizeof(s_list[0][0])) #oh.

5
120
12110852


In [46]:
with open('../scratch_data/size_test.pkl', mode = 'wb') as pickle_out:
    pickle.dump(s_list, pickle_out)

    
#60.6 MB in Finder, len=5 --> ~12.1 MB per matrix. Projected len=5588 --> ~67.7 GB woof
with open('../scratch_data/size_test.pkl', mode = 'rb') as pickle_in:
    size_check = pickle.load(pickle_in)

In [60]:
sys.getsizeof(size_check[0][0]) #120 bytes?  Why size difference?

120

In [64]:
print(sys.getsizeof(s_list[0][0]))
s_list[0]

12110852


(array([[0.15962058, 0.19476938, 0.18493547, ..., 0.7757199 , 0.7546011 ,
         0.7965746 ],
        [0.04913787, 0.1263316 , 0.11499742, ..., 0.4314431 , 0.33885133,
         0.42975217],
        [0.09760977, 0.0261576 , 0.05600911, ..., 0.15420917, 0.02857238,
         0.11313909],
        ...,
        [0.0329048 , 0.07270378, 0.12733585, ..., 0.08322319, 0.13698879,
         0.11471724],
        [0.03562226, 0.01525263, 0.18407874, ..., 0.02256095, 0.11547531,
         0.11492181],
        [0.04500581, 0.0670425 , 0.21629652, ..., 0.0559493 , 0.1409184 ,
         0.22505325]], dtype=float32),
 1)

In [65]:
print(sys.getsizeof(size_check[0][0])) #unpickled copy of s_list, but reported size is ~5 orders of magnitude smaller
size_check[0]

120


(array([[0.15962058, 0.19476938, 0.18493547, ..., 0.7757199 , 0.7546011 ,
         0.7965746 ],
        [0.04913787, 0.1263316 , 0.11499742, ..., 0.4314431 , 0.33885133,
         0.42975217],
        [0.09760977, 0.0261576 , 0.05600911, ..., 0.15420917, 0.02857238,
         0.11313909],
        ...,
        [0.0329048 , 0.07270378, 0.12733585, ..., 0.08322319, 0.13698879,
         0.11471724],
        [0.03562226, 0.01525263, 0.18407874, ..., 0.02256095, 0.11547531,
         0.11492181],
        [0.04500581, 0.0670425 , 0.21629652, ..., 0.0559493 , 0.1409184 ,
         0.22505325]], dtype=float32),
 1)
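
A likely explanation for the size puzzle above: `sys.getsizeof` counts an array's data buffer only when the array owns that buffer, and for a list it counts only the list's own pointer storage, never the contents. An array that borrows its memory from another object (a view, or in many NumPy versions an array rebuilt by `pickle.load`) reports just its small header. `ndarray.nbytes` is the more reliable measure. The sketch below uses a zero-filled array and a slice view to show the effect; whether an unpickled array owns its data depends on the NumPy version, so that case is only printed.

```python
import pickle
import sys
import numpy as np

a = np.zeros((1000, 1000), dtype=np.float32)   # owns its 4,000,000-byte buffer
view = a[:]                                     # borrows a's buffer
restored = pickle.loads(pickle.dumps(a))        # round trip through pickle

for name, arr in [("original", a), ("view", view), ("unpickled", restored)]:
    # owndata tells whether getsizeof will include the data buffer
    print(name, arr.flags.owndata, sys.getsizeof(arr), arr.nbytes)

# a list's getsizeof covers only its pointer storage, not the arrays inside
print(sys.getsizeof([a, a, a]))
```

In every case `nbytes` reports the full data size, which is the number that matters for memory and file-size estimates.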