# Prepare NFC annotations

This notebook
* Uses the NFC annotation files that Lauren Chronister cleaned (`clean_nfcs_lmc_fall2021/Annotations_cleaned`)
* Pairs annotation files with sound files, including modifying the names of some annotation files for consistency and saving them in a new folder (`./kearney-nfc_annotations_cleaned`)
* Loads a table of preliminary frequency information, guessed manually and stored in `freqs_and_durations_draft.csv`. (The codes in this table were created by finding all the codes using `0_nfc_finding.ipynb`).
* Extracts some preliminary information about the lengths of each species's calls and adds it to the previously loaded `freqs_and_durations_draft.csv`

In [1]:
from pathlib import Path
import pandas as pd
from glob import glob
import os
import shutil
import numpy as np
from opensoundscape.helpers import run_command
from opensoundscape.audio import Audio
from opensoundscape.spectrogram import Spectrogram

## Get cleaned annotations
Got the zip file from Lauren and unzipped them in this folder.

## Get list of species to extract

This is in Lauren's `clean_nfcs_lmc_fall2021/master_key.csv`.

Column meaning:

* Code = the code from the original annotation
* New_Code = Corrected code that includes clade and any needed corrections
* Check: 
  * "yes" = is good
  * "uncertain" = we're not sure what they annotator was going for; these were dropped
  * some other value = something that was updated through a check
* Count = the number of annotations

In [5]:
counts = pd.read_csv("../../../annotations/clean_nfcs_lmc_fall2021/master_key.csv")
counts.head()

Unnamed: 0,Code,New_Code,Check,Count
0,ALAUDIDAE-EREMOPHILA-ALPESTRIS-HOLA,PASSERIFORMES-ALAUDIDAE-EREMOPHILA-ALPESTRIS-HOLA,yes,1
1,ANATIDAE-ANAS-CRECCA-GWTE,ANSERIFORMES-ANATIDAE-ANAS-CRECCA-GWTE,yes,29
2,ANATIDAE-BRANTA-CANADENSIS-CAGO,ANSERIFORMES-ANATIDAE-BRANTA-CANADENSIS-CANG,yes,148
3,ANATIDAE-BUCEPHALA-CLANGULA-COGE,ANSERIFORMES-ANATIDAE-BUCEPHALA-CLANGULA-COGO,yes,2
4,ANATIDAE-CLANGULA-HYEMALIS-LTDU,ANSERIFORMES-ANATIDAE-CLANGULA-HYEMALIS-LTDU,yes,40


Get 4-letter alpha codes

In [6]:
counts['alpha'] = counts.Code.str.split('-', expand=True)[3]

Exclude uncertain ones and ones where there is no species based on whether or not the alpha code is 4 characters (e.g. ANATIDAE, NA)

In [7]:
counts = counts.query('Check != "uncertain"') # remove ones where we weren't sure what the annotators were going for
counts = counts[counts.alpha.str.len() == 4] # remove non-4 letters
counts = counts.query('alpha != "UNKN"') # remove ones that the annotators marked as unknown
counts = counts.reset_index(drop=True)

Get counts

In [8]:
counts_by_alpha = counts.groupby('alpha').Count.sum()

Get list of alpha codes to extract annotations for

In [9]:
len(counts_by_alpha.index.tolist())

135

## Match up annotation and sound files

Annotations are located in `clean_nfcs/Annotations_cleaned` and are `.txt` files with the following columns:
```
Begin time (s)	End time (s)	Low freq (hz)	High freq (hz)	Order	Family	Genus	Species	Alpha code
```

Get list of files

In [10]:
selection_tables = glob('../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/*.txt')
selection_tables.sort(key=lambda x: Path(x).name)
len(selection_tables)

2318

Audio files are located in `/bgfs/jkitzes/ter38/data/kearney-nfc`:

In [11]:
filenames = list(Path('/bgfs/jkitzes/ter38/data/kearney-nfc/').rglob('*.wav'))
filenames.sort(key=lambda x: Path(x).name)
len(filenames)

235126

In [12]:
audio_stems = [Path(t).stem for t in filenames]

Get an abbreviated list of sound files that have a selection table associated with them so that it doesn't take as long to make the dictionary below

In [13]:
table_stems = [Path(t).stem for t in selection_tables]
to_keep = []
for idx, filename in enumerate(filenames):
    if filename.stem in table_stems:
        to_keep.append(idx)
sound_files = [filenames[i] for i in to_keep]

### Create new set of filename-fixed annotation files

Inspect the lengths of the matched arrays and see...

In [14]:
len(selection_tables)

2318

In [15]:
len(sound_files)

1229

...that sound files weren't found for many selection tables. We'll have to find which ones those are and fix the names of the annotation tables.

In [16]:
cleaned_annot_path = Path('kearney-nfc_annotations_cleaned')
cleaned_annot_path.mkdir(exist_ok=True)

Create a dictionary matching the selection tables we already have with the necessary audio files.

In [17]:
Path(sound_files[0])

PosixPath('/bgfs/jkitzes/ter38/data/kearney-nfc/Amherst Study 2014-2015/Amherst Access Road/A2AR1_20140504_210100.wav')

In [18]:
tables_to_filenames = {}
need_to_fix_names = []

# For all the selection tables
for selection_table in selection_tables:
    
    # Search through the list of audio files to find one with a 
    # filename matching this selection table
    for idx, sound_file in enumerate(sound_files):
        if Path(sound_file).stem == Path(selection_table).stem:
            new_annotation_filename = cleaned_annot_path.joinpath(Path(selection_table).name)
            shutil.copy(str(selection_table), str(new_annotation_filename))
            tables_to_filenames[Path(selection_table).name] = str(sound_file)
            break
            
    # If we haven't found a matching audio filename in the loop above,
    # take note to change the name
    if Path(selection_table).name not in tables_to_filenames.keys():
        need_to_fix_names.append(selection_table)

In [19]:
len(need_to_fix_names)

1089

In [20]:
need_to_fix_names[:5]

['../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/BERI1_20180805_211600.txt',
 '../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/BERI1_20180806_211500.txt',
 '../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/BERI1_20180807_211300.txt',
 '../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/BERI1_20180808_211200.txt',
 '../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/BERI1_20180809_211000.txt']

Most of the names that need to be fixed are problematic because the selection tables and audio files are these two styles:
* Selection table: `BERI1_20180805_211600.txt`
* Audio file: `BERI1-20180805_211600.txt`

Others are problematic because they have two underscores instead of one in one place.

Change this for all of the files in the `need_to_fix_names` list.

In [21]:
proposed_new_names = {}

for need_to_fix_name in need_to_fix_names:
    name = Path(need_to_fix_name).name.replace('__', '_') # Fix the two-underscore typo
    parts = name.split('_')
    new_name = parts[0] + '-' + parts[1] + '_' + parts[2] # Fix the underscore-instead-of-hyphen typo
    new_annotation_filename = cleaned_annot_path.joinpath(new_name)
    proposed_new_names[need_to_fix_name] = new_annotation_filename
    #shutil.copy(need_to_fix_name, new_name)
    #fixed_names.append(new_name)

In [22]:
len(proposed_new_names.keys())

1089

Create a new list of sound files that correspond to the fixed names.

In [23]:
table_stems_fixed = [Path(t).stem for t in proposed_new_names.values()]
to_keep = []
for idx, filename in enumerate(filenames):
    if filename.stem in table_stems_fixed:
        to_keep.append(idx)
sound_files = [filenames[i] for i in to_keep] 
len(sound_files)

1087

So we have found 1087 sound files of the 1089 selection tables--pretty close.

Add correspondence between the newly fixed names to the master dictionary.

In [24]:
need_to_fix_names = []

# For all the proposed fixes
for old_annotation_filename, new_annotation_filename in proposed_new_names.items():
    
    # Search through the list of audio files to find one with a 
    # filename matching this selection table
    for idx, sound_file in enumerate(sound_files):
        if Path(sound_file).stem == new_annotation_filename.stem:
            shutil.copy(str(old_annotation_filename), str(new_annotation_filename))
            tables_to_filenames[Path(new_annotation_filename).name] = str(sound_file)
            break
            
    # If we haven't found a matching audio filename in the loop above,
    # take note to change the name
    if new_annotation_filename.name not in tables_to_filenames.keys():
        need_to_fix_names.append(selection_table)

What are the two remaining names?

In [25]:
need_to_fix_names

['../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/SWWB_20161020_183900.txt',
 '../../../annotations/clean_nfcs_lmc_fall2021/Annotations_cleaned/SWWB_20161020_183900.txt']

After looking through the original dataset I cannot find audio files corresponding to these names.

Save the correspondence betwen selection tables and audio files

In [26]:
for key, val in tables_to_filenames.items():
    tables_to_filenames[key] = [val]

In [27]:
annotation_audio_pairs = pd.DataFrame(tables_to_filenames).transpose().reset_index()
annotation_audio_pairs.columns = ['annotation_file', 'audio_file']
annotation_audio_pairs.to_csv('annotation_audio_pairs.csv', index=False)

## Get duration information for NFCs

We created a file called `freqs_and_durations_draft.csv` which contains the expected frequencies for each of the species in the dataset. Now let's add the duration information.

In [28]:
cleaned_annot_path = Path('kearney-nfc_annotations_cleaned')

In [29]:
lengths_dict = {}
for txt_file, audio_file in list(tables_to_filenames.items()):
    txt_filename = cleaned_annot_path.joinpath(txt_file)
    df = pd.read_csv(txt_filename, sep='\t')
    for idx, row in df.iterrows():
        alpha = row['Alpha code']
        if alpha in ["BLGR", "BRCR", "COGO", "EVGR", "LALO", "OROR", "RBGU", "SORA", "STSA"]:
            print(txt_filename)
        # Skip rows where alpha code was unknown or not able to be determined
        if alpha == '?':
            continue
        length = row["End time (s)"] - row['Begin time (s)']
        if alpha not in lengths_dict.keys():
            lengths_dict[alpha] = [length]
        else:
            lengths_dict[alpha].append(length)
        

kearney-nfc_annotations_cleaned/A2PS1_20140418_203900.txt
kearney-nfc_annotations_cleaned/A2WA_20150509_210800.txt
kearney-nfc_annotations_cleaned/BRIS1_20180901_203200.txt
kearney-nfc_annotations_cleaned/BRIS1_20180901_203200.txt
kearney-nfc_annotations_cleaned/BRIS1_20180901_203200.txt
kearney-nfc_annotations_cleaned/CAFO1_20170916_200200.txt
kearney-nfc_annotations_cleaned/CAFO1_20180423_204800.txt
kearney-nfc_annotations_cleaned/CAFO1_20180423_204800.txt
kearney-nfc_annotations_cleaned/SWSG_20131003_191000.txt
kearney-nfc_annotations_cleaned/SWWB_20160906_200200.txt
kearney-nfc_annotations_cleaned/SWWB_20160906_200200.txt
kearney-nfc_annotations_cleaned/SWWB_20160906_200200.txt
kearney-nfc_annotations_cleaned/SWWB_20160906_200200.txt
kearney-nfc_annotations_cleaned/SWWB_20160906_200200.txt
kearney-nfc_annotations_cleaned/SWWB_20160906_200200.txt
kearney-nfc_annotations_cleaned/BERI1-20190902_202900.txt
kearney-nfc_annotations_cleaned/CAWRR-20151106_173800.txt
kearney-nfc_annotation

### Round durations and save to CSV ***OUTDATED***

Use a basic ceiling rounder to assign each sound type to one of four lengths. These will be modified by later scripts.

In [43]:
break # THIS CODE IS OUTDATED

SyntaxError: 'break' outside loop (<ipython-input-43-6aaf1f276005>, line 4)

freq_limits = pd.read_csv("../freqs_and_durations_draft.csv", usecols=['code', 'low_freq', 'high_freq'])
codes = freq_limits.code.unique()

Remove the question mark (unknown code) - doesn't exist because we already removed it from the freq_limits df.

#codes.sort()
#assert codes[0] == '?'
#codes = codes[1:]

def ceiling_rounder(
    in_number,
    acceptable_numbers = [0.05, 0.1, 0.15, 0.5]):
    """Perform ceiling rounding of a number to one of a list of numbers. 
    
    Handles np.nan by returning 0.1.
    """
    acceptable_numbers.sort()
    if np.isnan(in_number):
        return 0.1
    for acceptable_number in acceptable_numbers:
        if in_number <= acceptable_number:
            return acceptable_number

import numpy as np

freq_limits.sort_values('code', inplace=True)

species_lengths = []
for code in freq_limits.code.tolist():
    ls = lengths_dict[code]
    species_lengths.append(np.round(np.median(np.array(ls)), decimals=3))

freq_limits['median_duration'] = species_lengths
freq_limits['approx_duration'] = [ceiling_rounder(l) for l in species_lengths]

freq_limits.to_csv('freqs_and_durations_draft.csv', index=False)