In [101]:
import pandas as pd
import numpy as np
import os
#pd.set_option('display.max_columns', None)

# In this notebook:
## 1. Create master list of all positive ID files
* concatenate individual dataset logs  
* bring in sample_rates.csv (varies by site) and attach to master list
    

## 2. Create loop to download data
Transfer select files from the external hard drive to local storage.
* all positive ID files
* random sample of negative files
* files are all separated by site and have repeat filenames; it's chaos

In [2]:
path = '../scratch_data/positive_file_logs/'
logs = [i for i in os.listdir(path) if '.csv' in i]
logs

['MM17b_20160630_20160830_triton output_MM.csv',
 'Kahekili1_20160115_20160316_triton output_MM.csv',
 'Kahekili2_20160630_20160830_triton output_MM.csv',
 'MM17a_20160115_20160316_triton output_MM.csv',
 'Lopa_20160930_20161130_triton output_MM.csv',
 'Honolua_20161005_20161130_triton output_MM.csv',
 'MauiLanai_20160630_20160830_triton output_MM.csv',
 'Launiupoko_20160630_20160820_triton output_MM.csv',
 'Makua_20161001_20161130_triton output_MM.csv',
 'Manele_20160930_20161130_triton output_MM.csv',
 'NorthMala_20160630_20160830_triton output_MM.csv']

In [3]:
def mass_import(path, listdir):
    all_data = []
    for i in listdir:
        all_data.append(pd.read_csv(path+i))
    return pd.concat(all_data)

In [4]:
all_positives = mass_import(path, logs)
all_positives.columns = all_positives.columns.str.lower().str.replace(' ','_')

#all_positives.isnull().sum()  #64000 nulls from empty csv rows at the end of one df
#drop nulls that are completely empty across all fields
all_positives.dropna(how = 'all', inplace=True)

#keep only the columns of interest
all_positives = all_positives[['key', 'source', 'call_type', 'date', 'hour', 'frequency_1']].copy()
all_positives.rename(columns = {'key':'filename'}, inplace=True)



In [5]:
all_positives.info() #should only be ~6000, check

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6138 entries, 0 to 489
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   filename     6138 non-null   object 
 1   source       6138 non-null   object 
 2   call_type    6138 non-null   object 
 3   date         6138 non-null   object 
 4   hour         6138 non-null   float64
 5   frequency_1  6138 non-null   float64
dtypes: float64(2), object(4)
memory usage: 335.7+ KB
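
The index range in the summary above (`0 to 489` across 6138 entries) indicates that the concatenated frame keeps each log's own row labels, so labels repeat. A minimal sketch, using two toy frames with invented values rather than the real logs, of how `ignore_index=True` in `pd.concat` would give a unique index instead:

```python
import pandas as pd

# Toy stand-ins for two per-deployment logs (values invented for illustration)
log_a = pd.DataFrame({"key": ["a_001.wav", "a_002.wav"], "source": ["Dolphin", "Dolphin"]})
log_b = pd.DataFrame({"key": ["b_001.wav"], "source": ["Whale"]})

kept = pd.concat([log_a, log_b])                      # row labels 0, 1, 0
fresh = pd.concat([log_a, log_b], ignore_index=True)  # row labels 0, 1, 2

print(kept.index.is_unique, fresh.index.is_unique)
```

Repeated labels are harmless for the steps in this notebook, but they can cause surprises with label-based selection such as `.loc`.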


#### Assign sample rate
Sample rate takes one of two values (50,000 Hz or 64,000 Hz); the rate for a dataset_id applies to all .wav files within that deployment.

In [7]:
srs = pd.read_csv('../data/sample_rates.csv')

In [8]:
srs

Unnamed: 0,dataset_id,sample_rate_hz
0,honolua2016,64000
1,kahekili1,50000
2,kahekili2,64000
3,launiupoko2016,64000
4,lopa2016,64000
5,makua2016,50000
6,manele2016,64000
7,mauilanai2016,64000
8,mm17a,50000
9,mm17b,64000


In [9]:
sr_dict = dict(zip(srs['dataset_id'], srs['sample_rate_hz']))  # dataset_id -> sample rate in Hz

In [10]:
sr_dict

{'honolua2016': 64000,
 'kahekili1': 50000,
 'kahekili2': 64000,
 'launiupoko2016': 64000,
 'lopa2016': 64000,
 'makua2016': 50000,
 'manele2016': 64000,
 'mauilanai2016': 64000,
 'mm17a': 50000,
 'mm17b': 64000,
 'nmala2016': 64000}

In [11]:
all_positives['sr_key'] = all_positives['filename'].map(lambda x: x.split('_')[0])
all_positives['sample_rate'] = all_positives['sr_key'].map(sr_dict)
all_positives.drop(columns='sr_key', inplace=True)
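
Any filename prefix that has no entry in `sr_dict` maps to `NaN` without raising an error, so it may be worth checking for unmapped keys before dropping `sr_key`. A small sketch of that check; the frame, the shortened dictionary, and the `other_` filename are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical miniature of the mapping step (shortened dict, invented third filename)
sr_dict = {"mm17b": 64000, "kahekili1": 50000}
df = pd.DataFrame({"filename": ["mm17b_00000333.e.wav",
                                "kahekili1_00000001.e.wav",
                                "other_00000001.e.wav"]})

df["sr_key"] = df["filename"].str.split("_").str[0]
df["sample_rate"] = df["sr_key"].map(sr_dict)

# Keys with no entry in sr_dict come back as NaN
unmapped = sorted(df.loc[df["sample_rate"].isna(), "sr_key"].unique())
print(unmapped)
```

An empty list here would mean every file received a sample rate.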

In [13]:
print(all_positives.shape)
all_positives.head()

(6138, 7)


Unnamed: 0,filename,source,call_type,date,hour,frequency_1,sample_rate
0,mm17b_00000333.e.wav,Dolphin,click,7/1/16,3.0,23896.5383,64000
1,mm17b_00000334.e.wav,Dolphin,click,7/1/16,3.0,21559.0068,64000
2,mm17b_00000335.e.wav,Dolphin,click,7/1/16,3.0,24621.9791,64000
3,mm17b_00000336.e.wav,Dolphin,click,7/1/16,4.0,27523.7424,64000
4,mm17b_00000337.e.wav,Dolphin,click,7/1/16,4.0,26516.1857,64000


#### Break up by source

In [14]:
all_positives.source.value_counts()

Dolphin      6104
Anthropog      25
unknown         5
Unknown         3
Whale           1
Name: source, dtype: int64
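
The counts above list `unknown` and `Unknown` separately. If the two spellings are meant to be the same category (an assumption; the logs do not say), the labels could be normalized before counting. A sketch on a toy series:

```python
import pandas as pd

# Toy labels mirroring the spellings seen above (counts are not the real ones)
source = pd.Series(["Dolphin", "unknown", "Unknown", "Anthropog", "unknown"])

# capitalize() upper-cases the first letter and lower-cases the rest
normalized = source.str.capitalize()
print(normalized.value_counts())
```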

In [15]:
anthro = all_positives.loc[all_positives['source'] == 'Anthropog'].copy()
whale = all_positives.loc[all_positives['source'] == 'Whale'].copy()
all_positives = all_positives.loc[all_positives['source'] == 'Dolphin']

In [16]:
all_positives.to_csv('../data/dolphin_positives.csv', index=False)
anthro.to_csv('../data/anthro_examples.csv', index=False)
whale.to_csv('../data/whale_solo.csv', index=False)

## 2. Download loop: select and transfer data
Currently I have an external hard drive of 190,000 audio files distributed among 11 folders, each representing a deployment of the recording device. Each folder contains positive and negative files intermingled. Only 6,104 of those files are positive, per the key above, which leaves negative files as a roughly 97% majority. That's not a great ratio for training a model (at least one that will outperform baseline). To balance classes, I will randomly select negative files (from across all 11 sites) to achieve a 1:3 positive:negative ratio. Even if 1,000 of these positive files end up being set aside as a hold-out set, that still leaves roughly a 1:3.6 ratio for training, which is under 1:4. There are supposedly more positive files en route, but I am unsure of their number or arrival time, so I am moving ahead as is.

---
* 6,104 positive files @ 1p:3n ratio = 18,312 negative files
* write a function to physically move all the positive files from the external hard drive to the scratch_data positive folder
* once positive files removed, can freely random-sample from all sites for negatives
* directories for each deployment about the same size: sample same # from each
* projected total size = 90GB; ample room on laptop; means not having to rely on external connection for analysis
    * Consider copying some example files as a small dummy dataset for GitHub. GitHub file size limits mean the real data can't be pushed.
    * scratch_data is included in .gitignore
* Note: complete data backed up on separate drive. Can delete, rename, relocate with abandon.

---
code guidance: [1](https://stackoverflow.com/questions/8858008/how-to-move-a-file), [2](https://stackoverflow.com/questions/49280966/pulling-random-files-out-of-a-folder-for-sampling)
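
The class-balance arithmetic in the plan above can be checked in a few lines. This sketch assumes all sampled negatives stay in the training set and that the hold-out removes only positives:

```python
n_pos = 6104         # positive files, from the source counts above
ratio = 3            # target negatives per positive
holdout_pos = 1000   # hold-out size floated in the text above

n_neg = n_pos * ratio                         # negatives needed for 1:3
train_ratio = n_neg / (n_pos - holdout_pos)   # negatives per remaining positive

print(n_neg, round(train_ratio, 2))
```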

In [94]:
# relocate all positive dolphin ID files from aggregated data to the yes_dolphin folder
import shutil  # os.rename might not move between devices

def dolphins_assemble(file_list, origin, destination):
    wanted = set(file_list)  # set lookup is much faster than scanning a list per file
    directories = [i for i in os.listdir(origin) if '2016' in i]  # all deployments titled with 2016
    count = 0

    for i in directories:
        wav_dir = os.path.join(origin, i, 'ewavs')
        files = [f for f in os.listdir(wav_dir) if f.endswith('.wav')]

        for file in files:
            if file in wanted:
                shutil.move(os.path.join(wav_dir, file), os.path.join(destination, file))
                count += 1
                if count % 1000 == 0:
                    print(count)  # progress marker every 1,000 files moved

In [95]:
origin = '/Volumes/hmm1mhm/'
destination = '../scratch_data/yes_dolphin/'
file_list = all_positives['filename'].to_list()

dolphins_assemble(file_list, origin, destination)

1000
2000
3000
4000
5000
6000


Confirmed on the external drive that the positive ID files are no longer in the origin directories, so these directories can now be randomly sampled for negative files. For this operation, copy the files rather than move them; that way, negative training data can be resampled as desired. (If resampling or adding new data, discard the old training copies or otherwise ensure there are no duplicates.)
* Target: 18,300 files
* Deployments: 11, all similar file count (~16k)
* Random Sample: 1664 from each  
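
The per-site figure follows from dividing the target by the number of deployments and rounding up. The sketch below also shows one way to make the draw repeatable with a seeded generator; the seed value and the filenames are placeholders, not taken from the real data:

```python
import numpy as np

target = 18_300   # target number of negatives, from the list above
n_sites = 11      # deployments

per_site = -(-target // n_sites)   # ceiling division
total = per_site * n_sites
print(per_site, total)

# Seeded draw without replacement from one site's (invented) file list
rng = np.random.default_rng(seed=42)   # arbitrary seed, chosen for illustration
files = [f"site_{i:08d}.e.wav" for i in range(16_000)]
subsample = rng.choice(files, size=per_site, replace=False)
```

With a fixed seed and the same file ordering, the same subsample comes back on every run, which could help if the negatives ever need to be regenerated.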

In [99]:
# random sample (no replacement) negative files from aggregated to no_dolphin
# equal samples from each site (directory)
import shutil  # os.rename might not move between devices

def sample_negatives(n_samples, origin, destination):
    directories = [i for i in os.listdir(origin) if '2016' in i]  # all deployments titled with 2016
    count = 0

    for i in directories:
        wav_dir = os.path.join(origin, i, 'ewavs')
        files = [f for f in os.listdir(wav_dir) if f.endswith('.wav')]

        # raises ValueError if a directory holds fewer than n_samples files
        subsample = np.random.choice(files, size=n_samples, replace=False)

        for s in subsample:
            shutil.copy(os.path.join(wav_dir, s), destination)
            count += 1
            if count % 3000 == 0:
                print(count)  # progress marker every 3,000 files copied

In [100]:
origin = '/Volumes/hmm1mhm/'
destination = '../scratch_data/no_dolphin/'

sample_negatives(1664, origin, destination)

3000
6000
9000
12000
15000
18000
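
A quick sanity check after both transfers is to count the `.wav` files in each folder and confirm that no filename appears in both. The helper `check_split` below is a hypothetical name, demonstrated on throwaway folders with empty placeholder files rather than the real audio:

```python
import os
import tempfile

def check_split(pos_dir, neg_dir):
    """Return (n_pos, n_neg, overlap) for the .wav files in two folders."""
    pos = {f for f in os.listdir(pos_dir) if f.endswith(".wav")}
    neg = {f for f in os.listdir(neg_dir) if f.endswith(".wav")}
    return len(pos), len(neg), pos & neg

# Demo on temporary folders; one filename is deliberately placed in both
with tempfile.TemporaryDirectory() as tmp:
    pos_dir = os.path.join(tmp, "yes_dolphin")
    neg_dir = os.path.join(tmp, "no_dolphin")
    os.makedirs(pos_dir)
    os.makedirs(neg_dir)
    for name in ["a_1.e.wav", "a_2.e.wav"]:
        open(os.path.join(pos_dir, name), "w").close()
    for name in ["a_2.e.wav", "b_1.e.wav", "b_2.e.wav"]:
        open(os.path.join(neg_dir, name), "w").close()
    result = check_split(pos_dir, neg_dir)

print(result)
```

On the real folders, an empty overlap set and counts matching the targets above would suggest the transfer went as planned.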


### Pau
Pretty proud of this whole operation. Ready to start the actual project!