I think the dominant frequencies are a good way to estimate all the CF peak frequencies that may be present in a chunk of audio. However, the power spectrum can be a bit ragged, *and* Doppler-shifted call reflections may mean that there are many more dominant frequencies than CF components. This notebook is my attempt to simplify or filter out the excess dominant frequencies.

Approach 1 is to manually go through all the non-silent audio snippets and check which dominant frequencies actually match the CF frequencies. This might take too much time, but may be worth it. 

Approach 2 is to estimate the total variation caused by Doppler effects. I've previously estimated it to be at most $\pm$ 600 Hz. This essentially means that any dominant frequencies within $\pm$ 600 Hz of each other are likely to be from the same source. (The +600 Hz components are the direct and approaching-reflection Doppler shifts, while the -600 Hz components are the direct and out-flight Doppler shifts.) This method has the danger of being a bit too conservative.

I see a combination of approaches 1 & 2 as sensible. Let me try to see if the 'manual' expectation matches the automated approach. 

### Summary 
I've been able to come to a satisfactory compromise that meets approaches 1 & 2 halfway. The 'centering' approach identifies groups of dominant frequencies that are within a given distance of each other. Each group is then replaced by the median of its dominant frequencies. Manual inspection over several audio windows shows that this approach performs satisfactorily: it is neither too conservative nor too lax.


## Table of Contents

* [Verifying the effect of centering dominant frequencies](#verifying)
* [Why 600 Hz as the inter-frequency difference?](#why600)
* [Centering dominant frequencies: the functions](#centering)
* [Removing raw dominant frequencies and adding centred dominant frequencies](#addcenteredfreqs)
* [Saving non silent measurements with centred dominant frequencies](#saving)




Original date of notebook creation: 2020-11-24

Author: Thejasvi Beleyur

In [1]:
import datetime as dt
import sys 
sys.path.append('../../correct_call_annotations/')
sys.path.append('../')
import numpy as np 
np.random.seed(2222)
import networkx as nx
import pandas as pd
import sklearn
from sklearn.neighbors import kneighbors_graph
import soundfile as sf
import matplotlib.pyplot as plt

In [2]:
print(f'This notebook run started at: {dt.datetime.now()}')

This notebook run started at: 2020-12-17 15:21:02.730790


In [3]:
import measure_annot_audio
from measure_annot_audio import split_measure

In [4]:
import correct_call_annotations
import correct_call_annotations.correct_call_annotations as cca2

In [5]:
%matplotlib notebook

In [6]:
obs_nonsilent = pd.read_csv('obs_nonsilent_measurements_20dBthreshold.csv')
virt_nonsilent = pd.read_csv('virt_nonsilent_measurement_20dBthreshold.csv')

In [7]:
domfreq_obs_nonsilent = obs_nonsilent[obs_nonsilent['measurement']=='dominant_frequencies']
domfreq_virt_nonsilent = virt_nonsilent[virt_nonsilent['measurement']=='dominant_frequencies']

In [8]:
obs_nonsilent

Unnamed: 0.1,Unnamed: 0,value,segment_number,measurement,file_name,unique_window_id,video_annot_id,num_bats
0,0,0.022077,0,rms,matching_annotaudio_Aditya_2018-08-16_21502300...,0_matching_annotaudio_Aditya_2018-08-16_215023...,Aditya_2018-08-16_21502300_100,1
1,1,0.062134,0,peak_amplitude,matching_annotaudio_Aditya_2018-08-16_21502300...,0_matching_annotaudio_Aditya_2018-08-16_215023...,Aditya_2018-08-16_21502300_100,1
2,2,105040.000000,0,dominant_frequencies,matching_annotaudio_Aditya_2018-08-16_21502300...,0_matching_annotaudio_Aditya_2018-08-16_215023...,Aditya_2018-08-16_21502300_100,1
3,3,90332.031250,0,fm_terminal_freqs,matching_annotaudio_Aditya_2018-08-16_21502300...,0_matching_annotaudio_Aditya_2018-08-16_215023...,Aditya_2018-08-16_21502300_100,1
4,4,89843.750000,0,fm_terminal_freqs,matching_annotaudio_Aditya_2018-08-16_21502300...,0_matching_annotaudio_Aditya_2018-08-16_215023...,Aditya_2018-08-16_21502300_100,1
...,...,...,...,...,...,...,...,...
17933,17933,0.044647,3,peak_amplitude,matching_annotaudio_Aditya_2018-08-20_0300-040...,3_matching_annotaudio_Aditya_2018-08-20_0300-0...,Aditya_2018-08-20_0300-0400_91,1
17934,17934,105200.000000,3,dominant_frequencies,matching_annotaudio_Aditya_2018-08-20_0300-040...,3_matching_annotaudio_Aditya_2018-08-20_0300-0...,Aditya_2018-08-20_0300-0400_91,1
17935,17935,87402.343750,3,fm_terminal_freqs,matching_annotaudio_Aditya_2018-08-20_0300-040...,3_matching_annotaudio_Aditya_2018-08-20_0300-0...,Aditya_2018-08-20_0300-0400_91,1
17936,17936,85449.218750,3,fm_terminal_freqs,matching_annotaudio_Aditya_2018-08-20_0300-040...,3_matching_annotaudio_Aditya_2018-08-20_0300-0...,Aditya_2018-08-20_0300-0400_91,1


In [9]:
by_obs_filenamesegnum = domfreq_obs_nonsilent.groupby(['file_name','segment_number'])
by_virt_filenamesegnum = domfreq_virt_nonsilent.groupby(['file_name','segment_number'])
print(len(by_obs_filenamesegnum.groups), len(by_virt_filenamesegnum))

2940 722


In [10]:
print(f' There are {len(by_obs_filenamesegnum.groups)+len(by_virt_filenamesegnum)} segments to be inspected overall')

 There are 3662 segments to be inspected overall


All the segments will somehow need to be inspected and checked to see if the dominant frequencies measured really match those seen in the CF parts of calls in a spectrogram. Let's first do a few segments and rounds. 

In [11]:
q = by_obs_filenamesegnum.get_group(('matching_annotaudio_Aditya_2018-08-16_21502300_100_hp.WAV',
  0))

In [12]:
# what is the relationship between the number of bats and the number of dominant frequencies?
def get_numbats_and_numdomfreqs(df):
    return np.unique(df['num_bats']), len(df['value'])

In [13]:
obs_numbatsdomfreqs = by_obs_filenamesegnum.apply(get_numbats_and_numdomfreqs)
virt_numbatsdomfreqs = by_virt_filenamesegnum.apply(get_numbats_and_numdomfreqs)

In [14]:
def reformat_nbats_ndomfreqs(input_list):
    nbats, ndomfreqs = [], []
    for each in input_list:
        nbats.append(each[0])
        ndomfreqs.append(each[1])
    return nbats, ndomfreqs

In [15]:
obsbats, obsdomfreqs = reformat_nbats_ndomfreqs(obs_numbatsdomfreqs)
virtbats, virtdomfreqs = reformat_nbats_ndomfreqs(virt_numbatsdomfreqs)


In [16]:
list(by_obs_filenamesegnum.groups)[2445]

('matching_annotaudio_Aditya_2018-08-17_45_162_hp.WAV', 21)

Some segments have a *crazy* number of dominant frequencies, but these are clearly exceptions!

In [17]:
plt.figure()
plt.subplot(121)
plt.plot(obsbats, obsdomfreqs,'*')
plt.ylim(0,40);plt.xticks(np.arange(1,5));
plt.title('Observed')
plt.subplot(122)
plt.plot(virtbats, virtdomfreqs,'*')
plt.title('Virtual')
plt.ylim(0,40);plt.xticks(np.arange(1,5));

[Figure: number of dominant frequencies per segment (y-axis, 0 to 40) plotted against the number of bats (x-axis, 1 to 4); panels 'Observed' and 'Virtual'.]

Clearly, many windows have more dominant frequencies than expected from the number of bats.
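
To put a number on the impression from the plot, the share of segments with more dominant frequencies than bats can be counted directly. This is a minimal sketch that assumes one expected CF peak per bat. The function `fraction_with_excess` and the toy counts are hypothetical and not part of the notebook or its data.

```python
def fraction_with_excess(num_bats, num_domfreqs):
    """Fraction of segments with more dominant frequencies than bats."""
    pairs = list(zip(num_bats, num_domfreqs))
    if not pairs:
        raise ValueError("need at least one segment")
    n_excess = sum(1 for bats, domfreqs in pairs if domfreqs > bats)
    return n_excess / len(pairs)


# made-up counts for five segments (NOT taken from the data)
toy_bats = [1, 1, 2, 2, 3]
toy_domfreqs = [1, 4, 2, 5, 3]
print(fraction_with_excess(toy_bats, toy_domfreqs))
```

With the notebook's own variables, `obsbats` holds one-element arrays returned by `np.unique`, so each entry would probably need converting to a plain integer before being passed in.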

In [18]:
groups = by_obs_filenamesegnum.groups
groups = list(groups.keys())

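# take the single drawn element explicitly: int() on a 1-element array is deprecated in newer NumPy
randomgroup = int(np.random.choice(np.arange(len(groups)),1)[0])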
filename, segnum = groups[randomgroup]
print(filename, segnum)
df = by_obs_filenamesegnum.get_group(groups[randomgroup])

out = cca2.find_file_in_folder(filename,
                               '../../individual_call_analysis/hp_annotation_audio/')
audio, fs = sf.read(out[0])
audio_segs = split_measure.split_audio(audio[:,0], fs, 0.05)

import scipy.spatial as spl
print(spl.distance_matrix(df['value'].to_numpy().reshape(-1,1),df['value'].to_numpy().reshape(-1,1)))


plt.figure()

plt.specgram(audio_segs[segnum],Fs=fs, NFFT=512,noverlap=482)
for each in df['value']:
    plt.hlines(each, 0, 0.05)

plt.ylim(50000,125000)

df


matching_annotaudio_Aditya_2018-08-19_0120-0200_74_hp.WAV 6
Match found!
[[  0. 260.]
 [260.   0.]]


[Figure: spectrogram of the segment (50-125 kHz) with the measured dominant frequencies overlaid as horizontal lines.]

Unnamed: 0.1,Unnamed: 0,value,segment_number,measurement,file_name,unique_window_id,video_annot_id,num_bats
16139,16139,108920.0,6,dominant_frequencies,matching_annotaudio_Aditya_2018-08-19_0120-020...,6_matching_annotaudio_Aditya_2018-08-19_0120-0...,Aditya_2018-08-19_0120-0200_74,1
16140,16140,109180.0,6,dominant_frequencies,matching_annotaudio_Aditya_2018-08-19_0120-020...,6_matching_annotaudio_Aditya_2018-08-19_0120-0...,Aditya_2018-08-19_0120-0200_74,1


<a id='centering'></a>
## 'Centering' the dominant frequencies: the functions

In [19]:

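def replace_raw_domfreqs_w_centred_domfreqs(df, interfreq_diff):
    '''
    Replaces the raw dominant frequencies of one audio segment with centred ones.

    Parameters
    ----------
    df : pd.DataFrame
        Dominant frequency rows of a single file and segment. The frequencies
        are in the 'value' column.
    interfreq_diff : float>0
        Frequencies within this distance (Hz) of each other are grouped.

    Returns
    -------
    pd.DataFrame
        One row per centred dominant frequency, with all other columns copied
        from the first row of df. df is returned unchanged if it has fewer
        than two rows.
    '''
    domfreqs = df['value'].to_numpy().reshape(-1,1)

    if domfreqs.size > 1:
        # group the raw dominant frequencies
        grouped_freqs = centre_freqs_to_median(domfreqs, interfreq_diff)

        # assign the grouped dominant frequencies
        grouped_df = pd.DataFrame(data={'value':grouped_freqs.tolist()})

        # copy the remaining columns from the first row of the segment
        for col in df.columns:
            if col!='value':
                grouped_df.loc[:,col] = df.loc[df.index[0],col]

        return grouped_df
    else:
        return df
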

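def centre_freqs_to_median(domfrequencies, intergroup_difference):
    '''
    Replaces each group of close dominant frequencies with the group median.
    Frequencies that belong to no group are kept as they are.

    Parameters
    ----------
    domfrequencies : np.array
        Dominant frequencies, shape (N,1) with N>1.
    intergroup_difference : float>0
        Passed to group_dominant_freqs as inter_nbr_distance.

    Returns
    -------
    final_freqs : np.array
        Sorted ungrouped frequencies and group medians.
    '''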
    groupedfreqs = group_dominant_freqs(domfrequencies, intergroup_difference)
    
    allfreqs = domfrequencies.flatten().tolist()
    for each in groupedfreqs:
        for entry in each:
            if entry in allfreqs:
                allfreqs.remove(entry)
    group_medians = []
    for group in groupedfreqs:
        group_medians.append(np.nanmedian(list(group)))
    # final dominant frequencies :
    final_freqs = np.sort(np.concatenate((allfreqs, group_medians)).flatten())
    return final_freqs


def group_dominant_freqs(domfrequencies, inter_nbr_distance):
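    '''
    Groups dominant frequencies that lie close to each other.

    Parameters
    ----------
    domfrequencies : np.array
        Array with dominant frequencies, shape (N,1) with N>1.
    inter_nbr_distance : float>0
        If a frequency and its nearest neighbour differ by at most this much,
        they are considered neighbours and put into one group. The neighbours
        are put into groups transitively, e.g. if A-B are neighbours and B-C are
        neighbours, then A, B, C are put into one group.

    Returns
    -------
    groups : list with sets
        Each set holds the frequencies of one group. Frequencies without a
        close neighbour do not appear in any group.
    '''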
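    # distance from each frequency to its single nearest neighbour
    A = kneighbors_graph(domfrequencies,n_neighbors=1,mode='distance')
    nearestnbrs_distances = A.toarray()
    # keep only the nearest-neighbour pairs that are close enough
    rows, cols = np.where(np.logical_and(nearestnbrs_distances>0, nearestnbrs_distances<=inter_nbr_distance))
    locs = np.column_stack((rows,cols))
    
    freqs = []
    for each in locs:
        # index the single column explicitly so that float() gets a scalar
        freqs.append([float(domfrequencies[each[0],0]),float(domfrequencies[each[1],0])])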
    
    # thanks to robert king at Stack Overflow https://stackoverflow.com/a/9387057/4955732
    g = nx.Graph()

    for sub_list in freqs:
        for i in range(1,len(sub_list)):
            g.add_edge(sub_list[0],sub_list[i])
    groups = list(nx.connected_components(g))
    return groups
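
The centering idea can also be tried out on toy numbers without scikit-learn or networkx. The block below is a dependency-free sketch of the same idea, not the notebook's code: each frequency is linked to its single nearest neighbour if the two are close enough, linked frequencies are merged into groups, and each group is replaced by its median. The function name `centre_close_frequencies` and most of the toy frequencies are made up for illustration.

```python
from statistics import median


def centre_close_frequencies(freqs, max_diff):
    """Replace groups of nearby frequencies by their median (pure-Python sketch)."""
    freqs = sorted(freqs)
    n = len(freqs)
    if n < 2:
        return list(freqs)
    parent = list(range(n))  # union-find forest over the sorted positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # in sorted 1D data the nearest neighbour is one of the adjacent entries
        candidates = [j for j in (i - 1, i + 1) if 0 <= j < n]
        nearest = min(candidates, key=lambda j: abs(freqs[j] - freqs[i]))
        if 0 < abs(freqs[nearest] - freqs[i]) <= max_diff:
            parent[find(i)] = find(nearest)

    groups = {}
    for i, freq in enumerate(freqs):
        groups.setdefault(find(i), []).append(freq)
    # a frequency without a close neighbour forms its own group and is kept as is
    return sorted(median(members) for members in groups.values())


# the two dominant frequencies from the segment shown earlier, 260 Hz apart
print(centre_close_frequencies([108920.0, 109180.0], 600))
# illustrative values: three close peaks and one distant peak
print(centre_close_frequencies([104800.0, 105040.0, 105300.0, 108920.0], 600))
# illustrative values: two tight pairs that are 400 Hz apart
print(centre_close_frequencies([105000.0, 105100.0, 105500.0, 105600.0], 600))
```

In the last example the two pairs stay separate even though the gap between them is under 600 Hz, because each frequency is only linked to its single nearest neighbour. As far as I can tell the `n_neighbors=1` graph in `group_dominant_freqs` behaves the same way, which is worth keeping in mind when reading 'transitively' in the docstring.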


<a id='verifying'></a>
## Verifying the effect of centering 

In [20]:
groups = by_obs_filenamesegnum.groups
groups = list(groups.keys())

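# take the single drawn element explicitly: int() on a 1-element array is deprecated in newer NumPy
randomgroup = int(np.random.choice(np.arange(len(groups)),1)[0])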
filename, segnum = groups[randomgroup]
print(filename, segnum)
df = by_obs_filenamesegnum.get_group(groups[randomgroup])

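domfreqs = df['value'].to_numpy().reshape(-1,1)
if domfreqs.size >1:
    # centred frequencies: each group of close dominant frequencies is replaced by its median
    # (group_dominant_freqs alone returns sets, which cannot be passed to plt.hlines)
    grouped_domfreqs = centre_freqs_to_median(domfreqs, 600)
else:
    grouped_domfreqs = domfreqs.flatten()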

out = cca2.find_file_in_folder(filename,
                               '../../individual_call_analysis/hp_annotation_audio/')
audio, fs = sf.read(out[0])
audio_segs = split_measure.split_audio(audio[:,0], fs, 0.05)


plt.figure()
plt.subplot(121)
plt.specgram(audio_segs[segnum],Fs=fs, NFFT=512,noverlap=482)
for each in df['value']:
    plt.hlines(each, 0, 0.05, linewidth=0.75)

plt.ylim(50000,125000)

plt.subplot(122)
plt.specgram(audio_segs[segnum],Fs=fs, NFFT=512,noverlap=482)
for each in grouped_domfreqs:
    plt.hlines(each, 0, 0.05)

plt.ylim(50000,125000)


matching_annotaudio_Aditya_2018-08-16_21502300_36_hp.WAV 16
Match found!


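[Figure: spectrogram of the segment (50-125 kHz); left panel with the raw dominant frequencies, right panel with the grouped dominant frequencies, both overlaid as horizontal lines.]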

(50000.0, 125000.0)

<a id='why600'></a>
## Why 600 Hz as the inter-frequency difference?

The dominant frequency grouping works pretty well most of the time. Setting it too low means that many group-able dominant frequencies are left as they are, and setting it too high means all dominant frequencies get grouped together. I think 600 Hz is a good compromise based on the examples I've been seeing. 

Transitively grouping all frequencies that are within $\pm$ 600 Hz of each other makes sense because:

* The emitted CF is not a true pure tone. It is likely to have its own 'spectral width'.
* The recorded CF contains the Doppler shifted emitted CF *and* the reflections of the emitted CF, which may have positive and negative Doppler shifts. 

$\pm$ 600 Hz is the maximum expected Doppler shift when bats fly by a microphone at 3 m/s and call at 100 kHz.
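
As a back-of-envelope check of this figure, the one-way Doppler shift can be computed directly. The sketch assumes the textbook formula for a moving source and a stationary microphone, and a speed of sound of 343 m/s; neither assumption is stated in the notebook, and the function names are hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed)


def doppler_shift(emitted_hz, radial_speed, c=SPEED_OF_SOUND):
    """One-way shift (Hz) heard by a stationary microphone.

    radial_speed > 0 means the source approaches, < 0 means it recedes.
    """
    return emitted_hz * c / (c - radial_speed) - emitted_hz


def speed_for_shift(emitted_hz, shift_hz, c=SPEED_OF_SOUND):
    """Radial speed (m/s) of an approaching source that produces shift_hz."""
    return c * shift_hz / (emitted_hz + shift_hz)


print(round(doppler_shift(100_000, 3.0)))       # approaching at 3 m/s
print(round(doppler_shift(100_000, -3.0)))      # receding at 3 m/s
print(round(speed_for_shift(100_000, 600), 2))  # speed that gives +600 Hz
```

Under these assumptions a bat heading straight at the microphone at 3 m/s is shifted by roughly 880 Hz, and a 600 Hz shift corresponds to a radial speed of about 2 m/s. The $\pm$ 600 Hz bound may therefore rest on details of the earlier estimate, such as only part of the flight speed pointing towards the microphone, so that estimate is worth revisiting if the threshold is reused elsewhere.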

In [21]:
# centre all raw dominant frequencies when many of them are very close together
interfreq_difference = 600
obs_centred_domfreqs = by_obs_filenamesegnum.apply(replace_raw_domfreqs_w_centred_domfreqs, interfreq_difference)
virt_centred_domfreqs = by_virt_filenamesegnum.apply(replace_raw_domfreqs_w_centred_domfreqs, interfreq_difference)

<a id='addcenteredfreqs'></a>
    
## Removing raw dominant frequencies and adding centred dominant frequencies

In [22]:
# remove the raw dominant frequencies and replace with the centred dominant frequencies in the main 
# split-measure files

obs_nonsilent_nodomfreqs = obs_nonsilent[obs_nonsilent['measurement']!='dominant_frequencies']
virt_nonsilent_nodomfreqs = virt_nonsilent[virt_nonsilent['measurement']!='dominant_frequencies']

# add the centred dominant frequencies into the datasets 

obs_nonsilent_wcentreddomfreqs = pd.concat((obs_nonsilent_nodomfreqs, obs_centred_domfreqs)).reset_index(drop=True)
virt_nonsilent_wcentreddomfreqs = pd.concat((virt_nonsilent_nodomfreqs, virt_centred_domfreqs)).reset_index(drop=True)



<a id='saving'></a>
## Saving non silent measurements with centred dominant frequencies

In [23]:
filenames = ['obs_nonsilent_measurements_20dBthreshold_wcentdomfreqs.csv',
             'virt_nonsilent_measurement_20dBthreshold_wcentdomfreqs.csv']
for outfilename, dataset in zip(filenames, [obs_nonsilent_wcentreddomfreqs, virt_nonsilent_wcentreddomfreqs]):
    dataset.to_csv(outfilename)

In [24]:
print(f'This notebook run ended at: {dt.datetime.now()}')

This notebook run ended at: 2020-12-17 15:21:27.351225
