# **Complete subsampling pipeline breakdown**

## 1) Start with our imports:

### The cell below contains the imports needed for `src/pipeline/pipeline.py`

In [1]:
import numpy as np
import pandas as pd

from pathlib import Path

from torch import multiprocessing
from tqdm import tqdm

### The cell below contains the imports needed for `src/pipeline/audio_segmentor.py`

In [2]:
import os
import soundfile as sf

### The cell below contains the imports needed for `src/cli.py`

In [3]:
import sys

# Append the parent directory and the source folders to the module search path
sys.path.append('..')
sys.path.append('../src/')
sys.path.append('../src/models/bat_call_detector/batdetect2/')

In [4]:
from src.pipeline import pipeline
import src.subsampling as ss

## 2) Write any custom methods below:

### a) The method below is the subsampling implementation we used to generate the detections for the Symposium results
#### &nbsp;&nbsp;&nbsp; i) Takes in `segmented_file_paths` from the MSDS pipeline.
#### &nbsp;&nbsp;&nbsp; ii) Filters the `segmented_file_paths` generated by the MSDS pipeline.
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Removes paths that would not exist if the recorder had been duty cycled, based on the `cycle_length` and `percent_on` parameters of the duty cycle.

In [5]:
def subsample_withpaths(segmented_file_paths, cfg, cycle_length, percent_on):
    # Seconds at the start of each cycle during which the recorder is "on"
    on_duration = int(cycle_length * percent_on)
    necessary_paths = []

    for path in segmented_file_paths:
        # Keep a segment that starts exactly at the beginning of a cycle...
        starts_cycle = path['offset'] % cycle_length == 0
        # ...or whose end falls inside the "on" window of its cycle
        end_in_cycle = (path['offset'] + cfg['segment_duration']) % cycle_length
        if starts_cycle or 0 < end_in_cycle <= on_duration:
            necessary_paths.append(path)

    return necessary_paths
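
To see which segments this filter keeps, here is a small self-contained check. It restates the same condition in a hypothetical helper, `keeps_segment`, and assumes a 1800 s recording cut into 30 s segments with integer offsets. The dictionaries are simplified stand-ins for the real MSDS segment entries, which carry more fields.

```python
def keeps_segment(offset, segment_duration, cycle_length, percent_on):
    """Hypothetical helper restating the condition used in subsample_withpaths."""
    on_duration = int(cycle_length * percent_on)
    end_in_cycle = (offset + segment_duration) % cycle_length
    return offset % cycle_length == 0 or 0 < end_in_cycle <= on_duration


# Assumed inputs: a 1800 s recording cut into 30 s segments.
segment_duration = 30
segments = [{'offset': start} for start in range(0, 1800, segment_duration)]

kept_offsets = [
    seg['offset']
    for seg in segments
    if keeps_segment(seg['offset'], segment_duration, cycle_length=1800, percent_on=0.167)
]
print(kept_offsets)
```

With these inputs the kept offsets run from 0 s to 270 s in 30 s steps, which is the first 5 minutes of the 30-minute cycle.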

### b) The method below is a modified version of the MSDS `run_models` method
#### &nbsp;&nbsp;&nbsp; i) Runs only the batdetect2 submodule, so it generates only search-phase call detections.

In [6]:
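## Run models and get detections!
def run_models(file_mappings, cfg, csv_name):
    # Collect one DataFrame per segment and concatenate once at the end
    bd_preds_per_segment = []
    for i in tqdm(range(len(file_mappings))):
        cur_seg = file_mappings[i]
        # Detections are timed relative to the start of the segment
        bd_annotations_df = cur_seg['model']._run_batdetect(cur_seg['audio_seg']['audio_file'])
        # Shift them so they are timed relative to the original recording
        bd_preds = pipeline._correct_annotation_offsets(
                bd_annotations_df,
                cur_seg['original_file_name'],
                cur_seg['audio_seg']['offset']
            )
        bd_preds_per_segment.append(bd_preds)

    # pd.concat raises on an empty list, so fall back to an empty DataFrame
    bd_dets = pd.concat(bd_preds_per_segment) if bd_preds_per_segment else pd.DataFrame()
    bd_dets.to_csv(f"{cfg['output_dir']}/{csv_name}", index=False)

    return bd_dets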
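
The offset correction is why the detection times in the output are measured from the start of the original recording rather than from the start of each segment. The sketch below illustrates the idea with a hypothetical stand-in, `shift_to_recording_time`. It is not the MSDS `_correct_annotation_offsets` code, whose internals are not shown here, and the detection values are made up for illustration.

```python
import pandas as pd


def shift_to_recording_time(annotations_df, original_file_name, offset):
    """Hypothetical stand-in for an offset correction step (not the MSDS code)."""
    corrected = annotations_df.copy()
    # Segment-relative times become recording-relative times
    corrected['start_time'] = corrected['start_time'] + offset
    corrected['end_time'] = corrected['end_time'] + offset
    # Attribute the detections to the original recording, not the segment file
    corrected['input_file'] = original_file_name
    return corrected


# Made-up detection 6.5 s into a segment that starts 720 s into the recording.
segment_dets = pd.DataFrame({'start_time': [6.5], 'end_time': [6.75]})
recording_dets = shift_to_recording_time(segment_dets, 'example.WAV', 720)
print(recording_dets)
```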

## 3) Where does the subsampling pipeline start?

### Let's start with an audio file to demonstrate how we used our `subsampling.py` script

In [7]:
filepath = f"{Path.home()}/Documents/Research/Lab_related/example/original_recording"
filename = "20210910_030000.WAV"

### The command below invokes the subsampling pipeline from the command line.

#### **Command: `python src/subsampling.py ../Documents/Research/Lab_related/example/original_recording '5min_every_30min__Central_20210910_030000.csv' 'output_dir' 'output/tmp' 1800 0.167`**

- `../Documents/Research/Lab_related/example/original_recording` is the path of the folder that contains our recording.
   - Our pipeline takes in a folder and generates detections for every recording in that folder.
- `5min_every_30min__Central_20210910_030000.csv` is the name of the output detections .csv.
   - For multiple consecutive recordings, we've labelled the output file as "...030000to130000.csv".
- `output_dir` is the repository folder where output detections .csv files will be saved.
- `output/tmp` is the repository folder where generated segment recordings are saved; they are deleted after detections have been generated.
- `1800` is the `cycle_length` used to generate duty-cycled detections: 1800 s is 30 min.
- `0.167` is the `percent_on` used to generate duty-cycled detections: the fraction of each cycle during which the recorder is on.

This subsampling scheme records for about 300 s (1800 × 0.167 ≈ 300.6, which the pipeline truncates to 300), i.e. 5 min out of every 30 min.
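
The arithmetic can be checked directly. This short sketch only reproduces the `int(cycle_length * percent_on)` truncation that `subsample_withpaths` applies; the variable names `exact_on` and `on_duration` are introduced here for illustration.

```python
cycle_length = 1800   # seconds in one 30-minute cycle
percent_on = 0.167    # fraction of each cycle spent recording

exact_on = cycle_length * percent_on   # slightly over 300 s, because 0.167 > 1/6
on_duration = int(exact_on)            # truncated the same way subsample_withpaths does
print(on_duration, on_duration / 60)   # on-window in seconds and in minutes
```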

The recording was made at Central Pond.

The recording was made on 09/10/2021 at 8pm PST.

**We wrote these details into the file name of the output detections .csv.**

### Calling the above command runs the following code:

- `args = parse_args()`: **Takes in the command line positional arguments**

- `run_subsampling_pipeline(args['input_dir'], args['cycle_length'], args['percent_on'], args['csv_filename'], args['output_dir'], args['temp_dir'])`
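
For readers who have not seen `parse_args()`, the following is a minimal sketch of a positional-argument parser that would produce the keys used above. The argument order is taken from the example command and the key names from the call shown above; the types, the description string, and the optional `argv` parameter are assumptions, and the real `src/subsampling.py` may differ.

```python
import argparse


def parse_args(argv=None):
    """Sketch of a positional-argument parser; the real script may differ."""
    parser = argparse.ArgumentParser(description="Duty-cycled detection pipeline (sketch)")
    parser.add_argument('input_dir', type=str, help="folder containing the recordings")
    parser.add_argument('csv_filename', type=str, help="name of the output detections .csv")
    parser.add_argument('output_dir', type=str, help="folder for output .csv files")
    parser.add_argument('temp_dir', type=str, help="folder for temporary segment files")
    parser.add_argument('cycle_length', type=int, help="duty cycle length in seconds")
    parser.add_argument('percent_on', type=float, help="fraction of each cycle spent recording")
    # vars() turns the Namespace into a dict so values can be read as args['...']
    return vars(parser.parse_args(argv))


args = parse_args([
    '../Documents/Research/Lab_related/example/original_recording',
    '5min_every_30min__Central_20210910_030000.csv',
    'output_dir', 'output/tmp', '1800', '0.167',
])
print(args['cycle_length'], args['percent_on'])
```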

## 4) A look into the `run_subsampling_pipeline()` method

In [8]:
def run_subsampling_pipeline(input_dir, cycle_length, percent_on, csv_name, output_dir, tmp_dir):
    cfg = ss.get_params(output_dir, tmp_dir, 4, 30.0)
    audio_files = sorted(list(Path(input_dir).iterdir()))
    segmented_file_paths = ss.generate_segmented_paths(audio_files, cfg)
    
    ## Get file paths specific to our subsampling parameters
    if (percent_on < 1.0):
        necessary_paths = subsample_withpaths(segmented_file_paths, cfg, cycle_length, percent_on)
    else:
        necessary_paths = segmented_file_paths

    file_path_mappings = ss.initialize_mappings(necessary_paths, cfg)
    bd_dets = run_models(file_path_mappings, cfg, csv_name)

    return bd_dets

### Description: 

This pipeline runs much like the MSDS pipeline, with two modifications:
1) A filtering step that simulates duty cycling using the generated MSDS segments.
2) Only batdetect2 is run in `run_models`, for search-phase call detections.

To run the pipeline without any duty cycling, provide a `percent_on` of 1.0.
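
A quick way to sanity-check a duty-cycle setting is to estimate how many segments the models will process. The sketch below uses a hypothetical helper, `expected_segment_count`, and assumes one 1800 s recording, 30 s segments, and durations that divide evenly (cycle length into recording length, segment duration into the on-window). It is a rough estimate, not part of the pipeline.

```python
def expected_segment_count(recording_length, segment_duration, cycle_length, percent_on):
    """Rough count of segments processed; assumes all durations divide evenly."""
    if percent_on >= 1.0:
        # Mirrors the pipeline's branch that skips filtering entirely
        return recording_length // segment_duration
    on_duration = int(cycle_length * percent_on)
    cycles = recording_length // cycle_length
    return cycles * (on_duration // segment_duration)


# Assumed: one 1800 s recording and 30 s segments.
counts = {
    (cycle_length, percent_on): expected_segment_count(1800, 30, cycle_length, percent_on)
    for cycle_length, percent_on in [(360, 0.167), (1800, 0.167), (1800, 1.0)]
}
print(counts)
```

Under these assumptions the estimate is 10 segments for both duty-cycled settings and 60 for continuous recording, which matches the progress-bar totals of the runs in this notebook.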

In [9]:
ss.run_subsampling_pipeline(filepath, 360, 0.167, '1min_every_6min__Central_20210910_030000.csv', '../output_dir', '../output/tmp')

100%|██████████| 10/10 [00:54<00:00,  5.44s/it]


Unnamed: 0,start_time,end_time,low_freq,high_freq,detection_confidence,event,input_file
0,726.5015,726.5129,24609.0,31793.0,0.570,Echolocation,20210910_030000.WAV
1,726.6575,726.6712,23750.0,30947.0,0.510,Echolocation,20210910_030000.WAV
2,727.0355,727.0591,21171.0,27033.0,0.528,Echolocation,20210910_030000.WAV
3,727.3045,727.3254,21171.0,28485.0,0.564,Echolocation,20210910_030000.WAV
4,727.8175,727.8405,21171.0,27830.0,0.550,Echolocation,20210910_030000.WAV
...,...,...,...,...,...,...,...
80,1498.2985,1498.3119,23750.0,28609.0,0.673,Echolocation,20210910_030000.WAV
81,1498.7365,1498.7499,24609.0,28873.0,0.588,Echolocation,20210910_030000.WAV
82,1499.2145,1499.2291,23750.0,28587.0,0.648,Echolocation,20210910_030000.WAV
83,1499.5175,1499.5324,23750.0,28491.0,0.590,Echolocation,20210910_030000.WAV


In [10]:
ss.run_subsampling_pipeline(filepath, 1800, 0.167, '5min_every_30min__Central_20210910_030000.csv', '../output_dir', '../output/tmp')

100%|██████████| 10/10 [00:54<00:00,  5.41s/it]


Unnamed: 0,start_time,end_time,low_freq,high_freq,detection_confidence,event,input_file
0,246.9325,246.9418,28046.0,37602.0,0.549,Echolocation,20210910_030000.WAV
1,247.2165,247.2264,28046.0,38133.0,0.573,Echolocation,20210910_030000.WAV
2,247.3525,247.3627,28046.0,38892.0,0.555,Echolocation,20210910_030000.WAV
3,247.4715,247.4794,27187.0,42404.0,0.763,Echolocation,20210910_030000.WAV
4,247.5905,247.6001,28046.0,37383.0,0.530,Echolocation,20210910_030000.WAV
...,...,...,...,...,...,...,...
57,257.2315,257.2460,22031.0,28742.0,0.614,Echolocation,20210910_030000.WAV
58,257.3965,257.4096,23750.0,29575.0,0.628,Echolocation,20210910_030000.WAV
59,258.3155,258.3265,24609.0,31268.0,0.639,Echolocation,20210910_030000.WAV
60,258.8635,258.8733,24609.0,30445.0,0.553,Echolocation,20210910_030000.WAV


In [11]:
ss.run_subsampling_pipeline(filepath, 1800, 1.0, 'continuous__Central_20210910_030000.csv', '../output_dir', '../output/tmp')

100%|██████████| 60/60 [05:25<00:00,  5.43s/it]


Unnamed: 0,start_time,end_time,low_freq,high_freq,detection_confidence,event,input_file
0,246.9325,246.9418,28046.0,37602.0,0.549,Echolocation,20210910_030000.WAV
1,247.2165,247.2264,28046.0,38133.0,0.573,Echolocation,20210910_030000.WAV
2,247.3525,247.3627,28046.0,38892.0,0.555,Echolocation,20210910_030000.WAV
3,247.4715,247.4794,27187.0,42404.0,0.763,Echolocation,20210910_030000.WAV
4,247.5905,247.6001,28046.0,37383.0,0.530,Echolocation,20210910_030000.WAV
...,...,...,...,...,...,...,...
49,1781.4465,1781.4565,25468.0,32664.0,0.584,Echolocation,20210910_030000.WAV
50,1781.5785,1781.5891,26328.0,32298.0,0.577,Echolocation,20210910_030000.WAV
51,1781.9385,1781.9477,26328.0,34069.0,0.614,Echolocation,20210910_030000.WAV
52,1782.0615,1782.0702,25468.0,34393.0,0.600,Echolocation,20210910_030000.WAV


## 5) Output detections and generating comparisons