This notebook details my attempts at getting video and acoustic tracking to work for the speaker playbacks. This is a simple recording to handle. 

* Speaker recording: SPKRPLAYBACK_multichirp_2018-07-29_09-42-59.WAV
* Video recording : 2018-07-28/P03/K1,2,3/02000.TMC


In [92]:
import datetime as dt
import scipy.signal as signal 
import scipy.spatial as spatial
import scipy.ndimage as ndimage
import soundfile as sf
import numpy as np 
import matplotlib.pyplot as plt


In [93]:
print(f'Notebook cell run at {dt.datetime.now()}')

Notebook cell run at 2021-06-27 08:00:52.624796


In [94]:
import batracker
from batracker.localisation import friedlander_1987 as fr87
from batracker.localisation import schau_robinson_1987 as sr87
from batracker.localisation import spiesberger_wahlberg_2002 as sw02

from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from batracker.signal_detection.detection import cross_channel_threshold_detector
from batracker.signal_detection.detection import envelope_detector
from batracker.tdoa_estimation.tdoa_estimators import measure_tdoa
from batracker.correspondence_matching.multichannel_match import generate_crosscor_boundaries

In [95]:
%matplotlib notebook

In [96]:
folder = 'E://fieldwork_2018_002/actrackdata/wav/2018-07-28_003/'
filename = 'SPKRPLAYBACK_multichirp_2018-07-29_09-42-59.WAV'
# take out only the first 20 s with 4 channels + the cam sync and electronic output signal
fs = 192000
audio, fs = sf.read(folder+filename, stop=int(fs*20))
part_audio = audio[:,[0,1,2,3,6,7]]
sf.write('multichirp_sankenscamerasyncoutput_2018-07-29_09-42-59.wav', part_audio, fs)

In [97]:
audiofile = 'multichirp_sankenscamerasyncoutput_2018-07-29_09-42-59.wav'
# get only the first 7.5 s for now. 
audio, fs = sf.read(audiofile, stop=int(192000*7.5))

In [98]:
# get all audio that start from frame 1 of the camera sync signal (1st frame that is +ve)
first_frame_sample = np.min(np.argwhere(audio[:,-1]>=np.percentile(audio[:,-1],95)))
# audio sync'ed with 1st camera frame
cam_audio = audio[first_frame_sample:,:]

print(first_frame_sample/fs)
# get the array audio 
array_audio = cam_audio[:,:4]

1.2398802083333333


In [99]:
b,a = signal.butter(2,np.array([30e3,90e3])/(fs*.5),'bandpass')
array_audiohp = np.apply_along_axis(lambda X: signal.filtfilt(b,a,X),0,array_audio)

In [100]:
positive, num_regions = ndimage.label(cam_audio[:,-1]>0)
frames = ndimage.find_objects(positive)
frame_start_times = [each[0].start/fs for each in frames]

In [101]:
digital_pbk_env = abs(signal.hilbert(cam_audio[:,-2]))


In [102]:
playbacks, num_pbks = ndimage.label(digital_pbk_env>0.005)
playback_chunks = ndimage.find_objects(playbacks)
playback_midpoints = np.array([np.mean([each[0].stop,each[0].start])/fs for each in playback_chunks])
# This will give the exact frame numbers to digitise
fps = 25

playback_frames = np.array(playback_midpoints*fps, dtype=np.int32)+1 # +1 because camera frames are numbered starting from 1

In [103]:
playback_frames

array([  5,  10,  15,  20,  25,  30,  35,  40,  45,  90,  95, 100, 105,
       110, 115, 120, 125, 130])
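
The frame numbers above come from a simple mapping: time since the first camera sync frame, multiplied by the frame rate, truncated to an integer, plus one because frames are counted from 1. Here is a minimal sketch of that mapping. The helper name `time_to_frame` and the midpoint values are hypothetical and only illustrative; they are not read from the recording.

```python
import numpy as np

fps = 25  # camera frame rate used in this notebook

def time_to_frame(t_seconds, fps):
    """Map times (s, relative to the first sync frame) to 1-based frame numbers."""
    # floor() matches the integer cast used above for non-negative times
    return np.floor(np.asarray(t_seconds) * fps).astype(np.int32) + 1

# Hypothetical playback midpoints in seconds (illustrative values only)
example_midpoints = np.array([0.19, 0.39, 0.59, 3.59])
example_frames = time_to_frame(example_midpoints, fps)
print(example_frames)
```

A playback centred at 0.19 s falls in the fifth 40 ms frame interval, so it maps to frame 5.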

In [104]:
plt.figure()
plt.specgram(array_audiohp[:,0],Fs=fs);


In [105]:
detections = cross_channel_threshold_detector(array_audiohp, fs,
                                              detector_function=envelope_detector,
                                              threshold_db_floor=12,
                                              lowpass_durn=0.004)
# for now just use manual detections to generate the correlation boundaries

              
    
            

  0%|                                                                                            | 0/4 [00:00<?, ?it/s]

4 1201943


  floor_level = np.percentile(20*np.log10(envelope),5)
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.55it/s]


In [106]:
                                    
# Spectrogram of the cross-corr boundaries
plt.figure(figsize=(4,4))
ax= plt.subplot(411)
plt.specgram(array_audiohp[:,0], Fs=fs)
for each in detections[0]:
    plt.vlines(each, 0, fs*0.5, linewidth=0.4)

for i in range(2,5):
    plt.subplot(410+i, sharex=ax)
    plt.specgram(array_audiohp[:,i-1], Fs=fs)
    for each in detections[i-1]:
        plt.vlines(each, 0, fs*0.5, linewidth=0.4)


In [107]:
# filter all detections, and keep only those that are >= 7.5 ms long. 
min_durn = 0.0075
filtered_detections = []
for channel_dets in detections:
    long_detections = [] 
    for detn in channel_dets:
        if detn[1]-detn[0]>=min_durn:
            long_detections.append(detn)
    filtered_detections.append(long_detections)

In [108]:
[len(each)for each in filtered_detections]

[0, 0, 18, 0]

In [109]:
                                    
# Spectrogram of the cross-corr boundaries
plt.figure(figsize=(4,4))
ax= plt.subplot(411)
plt.specgram(array_audiohp[:,0], Fs=fs)
for each in filtered_detections[0]:
    plt.vlines(each, 0, fs*0.5, linewidth=0.4)

for i in range(2,5):
    plt.subplot(410+i, sharex=ax)
    plt.specgram(array_audiohp[:,i-1], Fs=fs)
    for each in filtered_detections[i-1]:
        plt.vlines(each, 0, fs*0.5, linewidth=0.4)


In [110]:
# Array geometry
## What we expect it to be theoretically
R = 1.2 # meters
theta = np.pi/3
other_x_position = 0.5
theta2 = np.arctan(other_x_position/(R*np.cos(theta)))
R_2 = np.sqrt(other_x_position**2 +  (R*np.cos(theta))**2)
arbit_y = 0
mic_positions = np.array([[0,arbit_y,0],
                          [R_2*np.sin(theta2),  arbit_y, -R*np.cos(theta), ],
                          [-R*np.sin(theta), arbit_y, -R*np.cos(theta)],
                          [0,arbit_y,R]])
mic_positions[:,1] = np.random.normal(0,1e-5,4)
ag = pd.DataFrame(mic_positions)
ag.columns  = ['x','y','z']
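
One use of the nominal geometry is as a sanity bound. No TDOA relative to mic 0 can exceed the mic separation divided by the speed of sound, so larger values point to a bad correlation peak. The sketch below assumes the same nominal layout with `y` fixed at exactly 0. It also uses the fact that `R_2*np.sin(theta2)` reduces to `other_x_position` (0.5 m), and takes the 340 m/s used later in this notebook.

```python
import numpy as np

vsound = 340.0  # m/s, the value used later in this notebook

# Nominal mic positions (x, y, z) in metres, with y fixed at 0 for simplicity
R = 1.2
theta = np.pi / 3
mics = np.array([[0.0, 0.0, 0.0],
                 [0.5, 0.0, -R * np.cos(theta)],
                 [-R * np.sin(theta), 0.0, -R * np.cos(theta)],
                 [0.0, 0.0, R]])

# Distance of mics 1-3 from mic 0, and the largest physically possible TDOA
dist_to_ref = np.linalg.norm(mics[1:] - mics[0], axis=1)
max_tdoa = dist_to_ref / vsound
print(dist_to_ref)      # metres
print(max_tdoa * 1e3)   # milliseconds
```

A bound like this could also be passed as the `max_tau` argument of a GCC-PHAT style estimator to limit the search range.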

In [111]:

crosscor_boundaries = [(0.19,0.205),(0.387,0.402), (0.576,0.602),
                        (0.79,0.8), (0.985,1.0), (1.175, 1.2),
                       (1.39, 1.399), (1.585, 1.595), (1.774, 1.794),
                      (3.593, 3.603), (3.785,3.802), (3.98, 4),
                      (4.191,4.2), (4.385,4.4), (4.575,4.6),
                      (4.79, 4.80), (4.985,5.0), (5.175, 5.199)]    

In [112]:
# Spectrogram of the cross-corr boundaries
plt.figure(figsize=(6,4))
ax= plt.subplot(411)
plt.specgram(array_audiohp[:,0], Fs=fs)
# the manual boundaries are common to all channels, so draw all of them on each panel
for each in crosscor_boundaries:
    plt.vlines(each, 0, fs*0.5, linewidth=0.2, color='k', alpha=1)

for i in range(2,5):
    plt.subplot(410+i, sharex=ax)
    plt.specgram(array_audiohp[:,i-1], Fs=fs)
    for each in crosscor_boundaries:
        plt.vlines(each, 0, fs*0.5, linewidth=0.2, color='k', alpha=1)


In [113]:
# trying out gcc-phat

"""
 Estimate time delay using GCC-PHAT 
 Copyright (c) 2017 Yihui Xiong
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
     http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
"""

import numpy as np


def gcc_phat(sig, refsig, fs=1, max_tau=None, interp=16):
    '''
    This function computes the offset between the signal sig and the reference signal refsig
    using the Generalized Cross Correlation - Phase Transform (GCC-PHAT)method.
    '''
    
    # make sure the length for the FFT is larger or equal than len(sig) + len(refsig)
    n = sig.shape[0] + refsig.shape[0]

    # Generalized Cross Correlation Phase Transform
    SIG = np.fft.rfft(sig, n=n)
    REFSIG = np.fft.rfft(refsig, n=n)
    R = SIG * np.conj(REFSIG)

    cc = np.fft.irfft(R / np.abs(R), n=(interp * n))

    max_shift = int(interp * n / 2)
    if max_tau:
        max_shift = np.minimum(int(interp * fs * max_tau), max_shift)

    cc = np.concatenate((cc[-max_shift:], cc[:max_shift+1]))

    # find max cross correlation index
    shift = np.argmax(np.abs(cc)) - max_shift

    tau = shift / float(interp * fs)
    
    return tau, cc
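
Before trusting a GCC-PHAT estimator on field data, it may help to check it on a signal with a known delay. The block below is a compact, independent re-implementation for illustration only. The name `gcc_phat_delay`, the `eps` guard against division by zero, and the random-noise test burst are all assumptions and are not part of the function above. A noise burst is used because its flat spectrum makes the expected peak unambiguous; real playback chirps are band-limited.

```python
import numpy as np

def gcc_phat_delay(sig, refsig, fs=1, interp=16, eps=1e-12):
    """Delay of sig relative to refsig (seconds if fs is in Hz), PHAT-weighted."""
    n = sig.shape[0] + refsig.shape[0]
    cross_spectrum = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(refsig, n=n))
    # PHAT weighting: keep only the phase; eps guards against division by zero
    weighted = cross_spectrum / (np.abs(cross_spectrum) + eps)
    cc = np.fft.irfft(weighted, n=interp * n)
    max_shift = interp * n // 2
    # Reorder so that index max_shift corresponds to zero lag
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

fs = 192000
true_delay_samples = 37                      # hypothetical known delay
rng = np.random.default_rng(0)
burst = rng.normal(size=2048)                # broadband test signal
ref = np.concatenate((burst, np.zeros(true_delay_samples)))
delayed = np.concatenate((np.zeros(true_delay_samples), burst))

estimated_delay = gcc_phat_delay(delayed, ref, fs=fs)
print(estimated_delay * fs)  # estimated delay in samples
```

Passing `fs` here returns the delay directly in seconds. Leaving `fs=1` returns it in samples, which then needs dividing by the sampling rate.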


In [114]:
reference_ch = 0
gcc_phat_tdoas = {}
other_channels = list(set(range(4))-set([reference_ch]))
all_tdoas = {}
for i,each_common in enumerate(crosscor_boundaries):
    start, stop = each_common
    start_sample, stop_sample = int(start*fs), int(stop*fs)
    tdoas_gcc_phat = []
    for each in other_channels:
        ccmax, cc = gcc_phat(array_audiohp[start_sample:stop_sample,each],
                                 array_audiohp[start_sample:stop_sample,reference_ch])
        tdoas_gcc_phat.append(ccmax/fs)

    tdoas = measure_tdoa(array_audiohp[start_sample:stop_sample,:], fs, ref_channel=reference_ch)
    gcc_phat_tdoas[i] = np.array(tdoas_gcc_phat)
    all_tdoas[i] = tdoas

In [115]:
array_audiohp.shape

(1201943, 4)

In [116]:
# Using the Time of emission to get TDOAs
reference_audio = cam_audio[:,-2]
gcc_phat_tof = {}
other_channels = list(range(4))
all_tofs = {}
for i,each_common in enumerate(crosscor_boundaries):
    start, stop = each_common
    start_sample, stop_sample = int(start*fs), int(stop*fs)
    tof_gcc_phat = []
    playback_chunk_start = playback_chunks[i][0].start
    for each in other_channels:
        ccmax, cc = gcc_phat(array_audiohp[playback_chunk_start:stop_sample,each],
                                 reference_audio[playback_chunk_start:stop_sample])
        #cc = signal.correlate(array_audiohp[playback_chunk_start:stop_sample,each],
        #                         reference_audio[playback_chunk_start:stop_sample],
        #                        'same')
        #ccmax = np.argmax(cc) - cc.size/2.0
        tof_gcc_phat.append(ccmax/fs)
    all_tofs[i] = np.array(tof_gcc_phat)

# set 0 channel as ref
tdoas_tofbased = {}
for i, tofs in all_tofs.items():
    tdoas_tofbased[i] = tofs[1:]-tofs[0]

In [117]:
cc = signal.correlate(array_audiohp[playback_chunk_start:stop_sample,each],
                                 reference_audio[playback_chunk_start:stop_sample],'same')
delay = (np.argmax(cc) - cc.size/2.0)/fs

print(delay, ccmax/fs)

plt.figure()
plt.plot(cc)

0.0192421875 0.01925390625



[<matplotlib.lines.Line2D at 0x1f6cb039b80>]

In [118]:
all_tofs

{0: array([0.01768229, 0.01628516, 0.01832878, 0.01847526]),
 1: array([0.01766797, 0.01627181, 0.01830306, 0.01846224]),
 2: array([0.01766081, 0.01626497, 0.01829557, 0.01845508]),
 3: array([0.01765202, 0.01625716, 0.01829785, 0.01844694]),
 4: array([0.01764551, 0.01625065, 0.01829134, 0.01843945]),
 5: array([0.01764648, 0.0162513 , 0.01828255, 0.01843978]),
 6: array([0.0176582 , 0.01626204, 0.0182946 , 0.01845085]),
 7: array([0.01767318, 0.01627604, 0.01832129, 0.01846582]),
 8: array([0.01766439, 0.01622721, 0.01829818, 0.01846094]),
 9: array([0.01813086, 0.01680632, 0.01834701, 0.01940853]),
 10: array([0.01811165, 0.01678939, 0.0183252 , 0.01939323]),
 11: array([0.01809635, 0.01677962, 0.01832129, 0.01937272]),
 12: array([0.01808529, 0.01677083, 0.01830208, 0.01936003]),
 13: array([0.01807422, 0.01675944, 0.0182998 , 0.01934896]),
 14: array([0.01807064, 0.01674837, 0.01828776, 0.01934375]),
 15: array([0.01803939, 0.01673079, 0.01825911, 0.01930794]),
 16: array([0.0180

In [119]:
audio_ch = 2
start = playback_chunks[0][0].start
stop = int(crosscor_boundaries[0][1]*fs)
plt.figure()
plt.subplot(311)
plt.specgram(cam_audio[start:stop,-2],Fs=fs);
plt.subplot(312)
plt.specgram(cam_audio[start:stop,audio_ch],Fs=fs);
plt.vlines(all_tofs[0][audio_ch],0,fs*0.5)
plt.subplot(313)
timeshifted = np.roll(cam_audio[start:stop,-2], int(all_tofs[0][audio_ch]*fs))
plt.specgram(timeshifted,Fs=fs);
#plt.vlines(,0,fs*0.5)


In [120]:
vsound = 340.0
all_positions = []
num_rows = mic_positions.shape[0]-1
calculated_positions = np.zeros((len(all_tdoas.keys()), 3,2))
calculated_2_positions = np.zeros((len(all_tdoas.keys()), 3,2))
calculated_3_positions = np.zeros((len(all_tdoas.keys()), 3,2))

for det_number, tdoas in gcc_phat_tdoas.items():
    d = (vsound*tdoas).reshape(-1,1)
    solution1, solution2 = sw02.spiesberger_wahlberg_solution(mic_positions, d)
    calculated_2_positions[det_number,:,0] = solution1
    calculated_2_positions[det_number,:,1] = solution2


for det_number, tdoas in all_tdoas.items():
        d = (vsound*tdoas).reshape(-1,1)
        solution1, solution2 = sw02.spiesberger_wahlberg_solution(mic_positions, d)
        calculated_positions[det_number,:,0] = solution1
        calculated_positions[det_number,:,1] = solution2

for det_number, tdoas in tdoas_tofbased.items():
        d = (vsound*tdoas).reshape(-1,1)
        solution1, solution2 = sw02.spiesberger_wahlberg_solution(mic_positions, d)
        calculated_3_positions[det_number,:,0] = solution1
        calculated_3_positions[det_number,:,1] = solution2
        

  t_solution1 = (-b_quad + np.sqrt(b_quad**2 - 4*a_quad*c_quad))/(2*a_quad)
  t_solution2 = (-b_quad - np.sqrt(b_quad**2 - 4*a_quad*c_quad))/(2*a_quad)


In [121]:
calculated_3_positions[:,:,1]

array([[-3.07762492, -4.88793704,  1.93857972],
       [-3.04340297, -4.86212761,  1.93168983],
       [-3.04359538, -4.86479274,  1.93223239],
       [-3.09308253, -4.92725407,  1.95159873],
       [-3.08922768, -4.9204288 ,  1.94805195],
       [-3.04766792, -4.87146025,  1.93230086],
       [-3.042123  , -4.85637222,  1.92754315],
       [-3.08263326, -4.89241792,  1.93948273],
       [-2.87997748, -4.40347693,  1.82642103],
       [-4.02701409, -9.04148656,  4.44982977],
       [-4.06830577, -9.17753817,  4.51827941],
       [-4.18863147, -9.45221442,  4.61610282],
       [-4.12391327, -9.36681679,  4.56989057],
       [-4.20254573, -9.50065465,  4.62952408],
       [-4.01827796, -9.0358271 ,  4.43067802],
       [-4.16041966, -9.49416584,  4.59826049],
       [-3.79744597, -8.3356446 ,  4.12522223],
       [-3.95567583, -8.76341293,  4.32918434]])

In [122]:
calculated_2_positions[:,:,0]

array([[-3.03544479,  4.83819612,  1.92439031],
       [-3.03953241,  4.84921775,  1.92817787],
       [-3.04381921,  4.86205839,  1.93177449],
       [-3.04792965,  4.87227771,  1.93484184],
       [-3.045249  ,  4.8660541 ,  1.93272707],
       [-3.04379191,  4.85939991,  1.9295304 ],
       [-3.04215463,  4.85098554,  1.92654782],
       [-3.03665747,  4.83775265,  1.92330791],
       [-3.04854076,  4.88297479,  1.93873185],
       [-0.12296742, -3.0391992 , -0.6017335 ],
       [-3.79292597,  8.28631262,  4.16030841],
       [-0.10540065, -3.10570518, -0.62545976],
       [-3.8131045 ,  8.37580238,  4.17006668],
       [-3.82239439,  8.40337513,  4.1823051 ],
       [-3.83598133,  8.4454201 ,  4.19294038],
       [-0.13822805, -2.99785575, -0.57618184],
       [-3.90058321,  8.66097113,  4.25741948],
       [-3.95122445,  8.82710091,  4.31475183]])

In [123]:
calculated_positions[:,:,1]

array([[         nan,          nan,          nan],
       [         nan,          nan,          nan],
       [         nan,          nan,          nan],
       [         nan,          nan,          nan],
       [         nan,          nan,          nan],
       [         nan,          nan,          nan],
       [         nan,          nan,          nan],
       [ -3.01364212,  -4.78679779,   1.90886332],
       [         nan,          nan,          nan],
       [ -4.98170179, -12.01816239,   5.6732017 ],
       [ -3.67902033,  -7.92902603,   4.01239008],
       [ -3.72527873,  -8.05334072,   4.04761326],
       [ -5.10575633, -12.4654476 ,   5.82800847],
       [ -5.10575633, -12.4654476 ,   5.82800847],
       [ -4.8935126 , -11.90819349,   5.56192277],
       [ -5.12619165, -12.58444859,   5.84895085],
       [ -5.22431031, -12.86117706,   5.9368373 ],
       [ -3.88641401,  -8.58230914,   4.22544476]])

In [124]:
valid_positions = calculated_2_positions[:,:,1]
valid_positions

array([[-3.03625898, -4.83958203,  1.9247118 ],
       [-3.0403495 , -4.8506095 ,  1.92850108],
       [-3.04463952, -4.86345684,  1.93209946],
       [-3.04875273, -4.87368155,  1.93516829],
       [-3.04607033, -4.86745462,  1.93305254],
       [-3.04461175, -4.86079683,  1.92985457],
       [-3.04297268, -4.85237803,  1.92687066],
       [-3.03747191, -4.83913827,  1.92362907],
       [-3.0493657 , -4.88438424,  1.93906002],
       [-0.12292647,  3.03870748, -0.60153872],
       [-3.7946985 , -8.29033817,  4.16193227],
       [-0.10536156,  3.10519354, -0.62525699],
       [-3.81490299, -8.37990415,  4.17170909],
       [-3.82420308, -8.40750285,  4.18395856],
       [-3.83780476, -8.44958563,  4.19460616],
       [-0.13818532,  2.9973763 , -0.57599355],
       [-3.90248132, -8.66533541,  4.25915484],
       [-3.9531821 , -8.83162352,  4.31654635]])

In [125]:
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=17, azim=122)
ax.plot(valid_positions[:,0], valid_positions[:,1],
            valid_positions[:,2],'*')

for each in range(4):
    ax.plot(mic_positions[:,0],mic_positions[:,1],mic_positions[:,2],'k*')


### Distance of positions from central microphone - acoustic tracking 



In [126]:
mic_positions[0,:]

array([ 0.00000000e+00, -2.30418972e-05,  0.00000000e+00])

In [127]:

def calc_dist_to_m0(X, refpos):
    try:
        distance = spatial.distance.euclidean(X,refpos)
    except ValueError:
        distance = np.nan
    return distance
dist_to_mic0 = np.apply_along_axis(calc_dist_to_m0,1,valid_positions,mic_positions[0,:])
dist_to_mic0

array([ 6.02865784,  6.04078062,  6.05440554,  6.06566705,  6.05864067,
        6.05153874,  6.0430011 ,  6.02856704,  6.0758184 ,  3.10013595,
       10.02251176,  3.16929288, 10.1083653 , 10.13981041, 10.18423164,
        3.05536685, 10.41428821, 10.59515871])

In [128]:
print(np.mean(dist_to_mic0[:8]), np.nanstd(dist_to_mic0[:8]))

6.046407325832548 0.012686001856360596


Another way to estimate distance from speaker to mic0 is to utilise the digital copy of the playback signal. We can then estimate the time of flight of the playback. 
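
The idea can be sketched on synthetic data: cross-correlate the received channel with the emitted copy, read off the lag of the peak, and convert it to a distance. Everything in the block below is an assumption made for illustration: the sweep, the 3392-sample delay (about 17.7 ms), the attenuation and the noise level. It also uses `mode='full'`, where zero lag sits at index `len(emitted)-1`, whereas `mode='same'` places zero lag at the centre of the output.

```python
import numpy as np
import scipy.signal as signal

fs = 192000                 # Hz
vsound = 340.0              # m/s
true_delay_samples = 3392   # hypothetical delay, about 17.7 ms

# Synthetic 5 ms downward sweep standing in for the digital playback copy
t = np.arange(int(0.005 * fs)) / fs
sweep = signal.chirp(t, f0=90e3, t1=t[-1], f1=30e3) * signal.windows.hann(t.size)

n_total = 8000
emitted = np.zeros(n_total)
emitted[100:100 + sweep.size] = sweep

# Received signal: attenuated, delayed copy plus a little noise
rng = np.random.default_rng(1)
received = 0.1 * np.roll(emitted, true_delay_samples)
received += 0.005 * rng.normal(size=n_total)

# Full cross-correlation; index len(emitted)-1 corresponds to zero lag
cc = signal.correlate(received, emitted, mode='full')
lag_samples = np.argmax(cc) - (emitted.size - 1)
delay = lag_samples / fs
print(delay, delay * vsound)  # time of flight (s) and distance (m)
```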

In [129]:
# crosscorrelate the output signal with channel 0. 
output_ch = cam_audio[:,-2]
plt.figure()
a0 = plt.subplot(211)
plt.plot(array_audiohp[:,0])
plt.subplot(212,sharex=a0)
plt.plot(output_ch)


[<matplotlib.lines.Line2D at 0x1f6cd26ad00>]

In [130]:
ind_forcc = [30000, 80000]
cc = signal.correlate(array_audiohp[ind_forcc[0]:ind_forcc[1],0], cam_audio[ind_forcc[0]:ind_forcc[1],-2],'same')
delay = (np.argmax(cc)-cc.size/2.0)/fs
delay

0.017666666666666667

### Getting mic0-speaker distance through the time of flight

In [131]:
print(f'The sync-channel delay based mic0-speaker distance is {delay*340.0} m')

The sync-channel delay based mic0-speaker distance is 6.006666666666667 m


### Distance of positions from central mic - video tracking

In [132]:
# get video tracked speaker positions
speaker_posns = pd.read_csv('video_tracking/speaker_pbks/DLTdv7_data_2018-07-28_p03_2000_spkr_pbksxyzpts.csv')
speaker_posns.columns = ['x','y','z']
speaker_posns['frame_num'] = speaker_posns['x'].index+1
speaker_posns = speaker_posns[~pd.isna(speaker_posns['x'])]
speaker_posns

Unnamed: 0,x,y,z,frame_num
0,0.008015,2.229281,-0.078225,1
4,0.017248,2.213429,-0.073621,5
9,0.00897,2.219483,-0.071139,10
14,0.010627,2.222088,-0.077738,15
19,0.017311,2.203463,-0.078932,20
24,0.012832,2.21671,-0.079252,25
29,0.020581,2.190977,-0.072187,30
34,0.018498,2.204841,-0.067472,35
39,0.015844,2.20208,-0.068835,40
44,0.015176,2.218292,-0.077634,45


In [133]:
# choose only those frames with current playbacks 
pbk_posns_only = speaker_posns[speaker_posns['frame_num'].isin(playback_frames)]
pbk_posns_only

Unnamed: 0,x,y,z,frame_num
4,0.017248,2.213429,-0.073621,5
9,0.00897,2.219483,-0.071139,10
14,0.010627,2.222088,-0.077738,15
19,0.017311,2.203463,-0.078932,20
24,0.012832,2.21671,-0.079252,25
29,0.020581,2.190977,-0.072187,30
34,0.018498,2.204841,-0.067472,35
39,0.015844,2.20208,-0.068835,40
44,0.015176,2.218292,-0.077634,45
89,0.63024,2.209804,-1.058441,90


In [134]:
video_mic_positions = pd.read_csv('video_tracking/mic_positions_video/DLTdv7_data_mics9-12positionsxyzpts.csv')
mic_xyz = video_mic_positions[~pd.isna(video_mic_positions['pt1_X'])].reset_index(drop=True)
mic_xyz.columns=['x','y','z']
mic_xyz

Unnamed: 0,x,y,z
0,-0.162229,-3.854878,0.097387
1,-1.084803,-3.256231,-0.437829
2,0.278436,-4.036286,-0.54438
3,-0.118366,-3.985202,1.299513


The first speaker position in the video corresponds to the set of first playbacks (first 9 detections). Let's see the mic0 to speaker distance estimated here. 
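
As a quick check, the video-based mic0 to speaker distance can be computed directly from the coordinates printed above. The sketch below copies the speaker positions for frames 5 to 45 and the first microphone row by hand. It assumes that the first row of the mic table is mic0 and that both tables share the same coordinate system and units (metres).

```python
import numpy as np

# Video-tracked speaker positions for frames 5-45 (x, y, z), copied from the table above
speaker_xyz = np.array([[0.017248, 2.213429, -0.073621],
                        [0.008970, 2.219483, -0.071139],
                        [0.010627, 2.222088, -0.077738],
                        [0.017311, 2.203463, -0.078932],
                        [0.012832, 2.216710, -0.079252],
                        [0.020581, 2.190977, -0.072187],
                        [0.018498, 2.204841, -0.067472],
                        [0.015844, 2.202080, -0.068835],
                        [0.015176, 2.218292, -0.077634]])
# First row of the video-tracked mic table (assumed to be mic0)
mic0_xyz = np.array([-0.162229, -3.854878, 0.097387])

# Euclidean distance of each speaker position from mic0
video_dist_to_mic0 = np.linalg.norm(speaker_xyz - mic0_xyz, axis=1)
print(np.mean(video_dist_to_mic0), np.std(video_dist_to_mic0))
```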

In [135]:
speaker_posns_video = pbk_posns_only[['x','y','z']].to_numpy()
mic_positions_video = mic_xyz.to_numpy()

# set all video positions w.r.t. mic0
speaker_posns_video = speaker_posns_video - mic_positions_video[0,:]
mic_positions_video = mic_positions_video - mic_positions_video[0,:]


In [136]:
speaker_posns_video

array([[ 0.179477,  6.068307, -0.171008],
       [ 0.171199,  6.074361, -0.168526],
       [ 0.172856,  6.076966, -0.175125],
       [ 0.17954 ,  6.058341, -0.176319],
       [ 0.175061,  6.071588, -0.176639],
       [ 0.18281 ,  6.045855, -0.169574],
       [ 0.180727,  6.059719, -0.164859],
       [ 0.178073,  6.056958, -0.166222],
       [ 0.177405,  6.07317 , -0.175021],
       [ 0.792469,  6.064682, -1.155828],
       [ 0.796977,  6.051353, -1.159034],
       [ 0.800219,  6.046452, -1.150196],
       [ 0.79935 ,  6.039186, -1.145362],
       [ 0.810579,  6.054218, -1.14298 ],
       [ 0.807427,  6.045962, -1.138167],
       [ 0.807197,  6.027195, -1.130399],
       [ 0.807757,  6.029992, -1.123528],
       [ 0.824842,  6.020306, -1.110799]])

In [137]:
valid_positions

array([[-3.03625898, -4.83958203,  1.9247118 ],
       [-3.0403495 , -4.8506095 ,  1.92850108],
       [-3.04463952, -4.86345684,  1.93209946],
       [-3.04875273, -4.87368155,  1.93516829],
       [-3.04607033, -4.86745462,  1.93305254],
       [-3.04461175, -4.86079683,  1.92985457],
       [-3.04297268, -4.85237803,  1.92687066],
       [-3.03747191, -4.83913827,  1.92362907],
       [-3.0493657 , -4.88438424,  1.93906002],
       [-0.12292647,  3.03870748, -0.60153872],
       [-3.7946985 , -8.29033817,  4.16193227],
       [-0.10536156,  3.10519354, -0.62525699],
       [-3.81490299, -8.37990415,  4.17170909],
       [-3.82420308, -8.40750285,  4.18395856],
       [-3.83780476, -8.44958563,  4.19460616],
       [-0.13818532,  2.9973763 , -0.57599355],
       [-3.90248132, -8.66533541,  4.25915484],
       [-3.9531821 , -8.83162352,  4.31654635]])

In [138]:
plt.figure(figsize=(10,6))
a1 = plt.subplot(121, projection='3d')
a1.view_init(elev=8, azim=81)
a1.plot(speaker_posns_video[:,0], speaker_posns_video[:,1],
            speaker_posns_video[:,2],'*')
plt.title('Video tracking')
for each in range(4):
    a1.plot(mic_positions_video[:,0],mic_positions_video[:,1],mic_positions_video[:,2],'k*')
a1.set_ylim(0,10)
a1.set_xlim(-4,4)
a1.set_zlim(-1,5)

a2 = plt.subplot(122, projection='3d')
plt.title('Acoustic tracking')
a2.view_init(elev=8, azim=81)
a2.plot(valid_positions[:,0], valid_positions[:,1],
            valid_positions[:,2],'*')

for each in range(4):
    a2.plot(mic_positions[:,0],mic_positions[:,1],mic_positions[:,2],'k*')
a2.set_ylim(0,10)
a2.set_xlim(-4,4)
a2.set_zlim(-1,5)


(-1.0, 5.0)

In [161]:
# video based tof-estimates
video_tofs = {}
for i, speaker_position in enumerate(speaker_posns_video):
    position_tofs = []
    for each in mic_positions_video:
        position_tofs.append(spatial.distance.euclidean(each,speaker_position)/338.0)
    video_tofs[i] = np.array(position_tofs)
video_tof_tdoas = {}
# set 0 channel as ref
for i, tofs in video_tofs.items():
    video_tof_tdoas[i] = tofs[1:]-tofs[0]

In [163]:
# use video based tdoas to estimate positions
calculated_4_positions = np.zeros((len(video_tof_tdoas.keys()), 3,2))

for det_number, tdoas in video_tofs.items():
        d = (vsound*tdoas).reshape(-1,1)
        solution1, solution2 = sw02.spiesberger_wahlberg_solution(mic_positions, d)
        calculated_4_positions[det_number,:,0] = solution1
        calculated_4_positions[det_number,:,1] = solution2

In [164]:
calculated_4_positions[:,:,1]

array([[-1.18731292,  2.66398867, -0.46229162],
       [-1.1904253 ,  2.6653503 , -0.46501913],
       [-1.19160691,  2.66638655, -0.46401009],
       [-1.18632314,  2.65889826, -0.46051224],
       [-1.19020762,  2.66426709, -0.46277793],
       [-1.18180217,  2.65415356, -0.45948148],
       [-1.18422606,  2.66066455, -0.46200634],
       [-1.18470449,  2.6586553 , -0.46209637],
       [-1.18954842,  2.66552127, -0.46266873],
       [-1.18260967,  2.75693945, -0.19434593],
       [-1.17931779,  2.75170449, -0.19154373],
       [-1.17586373,  2.75015937, -0.19157181],
       [-1.17387475,  2.74651488, -0.19166327],
       [-1.17302576,  2.75590663, -0.19120233],
       [-1.17152087,  2.75133885, -0.19166934],
       [-1.16663088,  2.74256387, -0.19082335],
       [-1.16577916,  2.74403146, -0.19195531],
       [-1.15666039,  2.74318165, -0.18909217]])

In [167]:
video_tdoas_basedpositions = calculated_4_positions[:,:,1]
plt.figure()
a2 = plt.subplot(111, projection='3d')
plt.title('Acoustic tracking through video TDOAs')
a2.view_init(elev=8, azim=81)
a2.plot(video_tdoas_basedpositions[:,0], video_tdoas_basedpositions[:,1],
            video_tdoas_basedpositions[:,2],'*')

for each in range(4):
    a2.plot(mic_positions[:,0],mic_positions[:,1],mic_positions[:,2],'k*')
a2.set_ylim(0,10)
a2.set_xlim(-4,4)
a2.set_zlim(-1,5)


(-1.0, 5.0)

### Camera based m0-speaker distance

## Conclusion: it's not all crap - acoustic tracking and video tracking do work, but they need more troubleshooting!

* The camera based m0-speaker estimate is 6.08 m
* Acoustic tracking based m0-speaker estimate is 6.05 $\pm$ 0.01 m (mean, sd)
    * The time-of-flight based m0-speaker estimate is 6.0 m

### Important lessons

* Audio processing is *very* important - the reverberation below 40 kHz made a *huge* difference to the TDOAs estimated. Choosing the correct bandpass parameters made all the difference.
* GCC-PHAT is better than plain cross-correlation (CC) at estimating TDOAs
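
To get a feel for how much a band-pass stage suppresses low-frequency energy, a filter can be applied to test tones inside and outside the pass band. The sketch below uses the same order and band edges as the filter in this notebook (2nd-order Butterworth, 30-90 kHz). It swaps in second-order sections with `sosfiltfilt`, which is an assumption made here for numerical robustness and not what was run above. The 10 kHz and 60 kHz tones are hypothetical stand-ins for reverberation and playback energy.

```python
import numpy as np
import scipy.signal as signal

fs = 192000  # Hz

# 2nd-order Butterworth band-pass, 30-90 kHz, as second-order sections
sos = signal.butter(2, [30e3, 90e3], btype='bandpass', fs=fs, output='sos')

t = np.arange(int(0.02 * fs)) / fs
low_tone = np.sin(2 * np.pi * 10e3 * t)    # stands in for low-frequency reverberation
high_tone = np.sin(2 * np.pi * 60e3 * t)   # stands in for in-band playback energy

def rms(x):
    """Root-mean-square amplitude of a 1D signal."""
    return np.sqrt(np.mean(x ** 2))

# Zero-phase filtering, so the filter itself adds no delay to TDOA estimates
low_gain = rms(signal.sosfiltfilt(sos, low_tone)) / rms(low_tone)
high_gain = rms(signal.sosfiltfilt(sos, high_tone)) / rms(high_tone)
print(low_gain, high_gain)
```

Moving the lower band edge and re-running this shows how strongly the out-of-band gain depends on the chosen cutoff.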

### Next steps
* Why are the acoustic tracking positions so off??!! Use the fact that we can estimate time-of-flight to calculate TDOAs, and then check whether acoustic tracking is at least accurate with those.
* Now I'd like to push the same exercise to more playback positions, and then finally get to aligning the audio and video tracking systems into a common coordinate system. 
* The `batracker` detection routines need some tuning!
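
As a starting point for the first item, here is a small worked example of turning times of flight into TDOAs and range differences, with a plausibility check against the nominal mic separations. The TOF values are copied from the `all_tofs[0]` output above. The separations assume the nominal array geometry (R = 1.2 m, second mic offset 0.5 m in x) and the 340 m/s speed of sound used in this notebook.

```python
import numpy as np

vsound = 340.0  # m/s

# Times of flight (s) to mics 0-3 for the first playback, copied from all_tofs[0]
tofs = np.array([0.01768229, 0.01628516, 0.01832878, 0.01847526])

# TDOAs relative to mic 0, and the equivalent range differences in metres
tdoas = tofs[1:] - tofs[0]
range_differences = vsound * tdoas

# Nominal distances of mics 1-3 from mic 0 (assumed geometry)
mic_separations = np.array([np.sqrt(0.5**2 + 0.6**2), 1.2, 1.2])

# A range difference larger than the mic separation is physically impossible
plausible = np.abs(range_differences) <= mic_separations
print(tdoas)
print(range_differences)
print(plausible)
```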

In [42]:
print(f'Notebook cell run at {dt.datetime.now()}')

Notebook cell run at 2021-06-27 07:11:01.177384
