GitHub - smartcameras/AV3T: Impementation of "Multi-speaker tracking from an audio-visual sensing device"

smartcameras / AV3T Public

Notifications You must be signed in to change notification settings
Fork 6
Star 8

Impementation of "Multi-speaker tracking from an audio-visual sensing device" - IEEE Transactions on Multimedia,

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
avsync		avsync
crdtransfer		crdtransfer
evaluation		evaluation
gcf		gcf
initialisation		initialisation
likelihoods		likelihoods
projection		projection
readParas		readParas
visualisation		visualisation
AV3T.m		AV3T.m
AV3T_MOT_greedy.m		AV3T_MOT_greedy.m
AV3T_MOT_prediction.m		AV3T_MOT_prediction.m
AV3T_MOT_resampling.m		AV3T_MOT_resampling.m
AV3T_MOT_update.m		AV3T_MOT_update.m
License.doc		License.doc
readme.txt		readme.txt
test_AV3T.m		test_AV3T.m
test_main.m		test_main.m

Repository files navigation

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Date:     07/02/2019
% Author:   Xinyuan Qian (x.qian@qmul.ac.uk)
%           
% Please go through the readme.m and License.doc files before using the software.
%
% Requested citation acknowledgement when using this software:
% [1] X. Qian, A. Brutti, O. Lanz, M. Omologo and A. Cavallaro, "Multi-speaker tracking from an audio-visual sensing device" in IEEE Transactions on Multimedia, Feb 2018, accepted.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This folder contains the MATLAB code of our proposed 'AV3T' tracker for multi-speaker tracking in 3D
% using the multi-modal signals captured by a small-size co-located audio-visual sensing platform.
%
%
% The evaluation is made on two datasets: 
%   (1) CAV3D  - collected and annotated by ourselves
%   (2) AV16.3 - 
%       G. Lathoud, J.-M. Odobez, and D. Gatica-Perez, “AV16.3:an audio-visual corpus for speaker localization and tracking,” in Machine Learning for Multimodal Interaction. Martigny, Switzerland: Springer, Jun 2004.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% To execute the tracker, you need to
%   (1) download the CAV3D dataset:                 https://speechtek.fbk.eu/cav3d-dataset/
%   (2) download our re-arranged AV16.3 dataset:    https://drive.google.com/file/d/19vA1tSf9EauMBxZr4wkbS1Ra27hVlJpS/view?usp=sharing 
%   (3) change the 'datapath' in 'test_main.m' to the data directory
%   (4) run 'test_main.m' 
%   In the 'test_main.m' file, you can change the parameters to specify 
%       (i)     the test dataset i.e. either AV16.3 or CAV3D and sequences
%       (ii)    tracking mode i.e. audio-visual audio-only or video-only
%       (iii) 	face detection removal index => to test the system robustness of the face detection inputs
%       (iv)  	the number of particle (per target), and the experiment iterations
%       (v)     visualization and saving flag
%
% 
% Note: 
% The camera projection MEX-files ('projectin/mex/*.mex') were compiled from C++ code,
% which are not compatible to different machines (especially if you have the linux environment). If you cannot execute
% the default '.mex' files, you need to compile them by yourself by running the code 'projection/mex/cpp2mex.m'
% Some attention points:
%   (1) make sure you have the right compiler: https://uk.mathworks.com/support/requirements/supported-compilers.html
%   (2) make sure you have installed OpenCV
%   (3) make sure you have installed the MATLAB Computer Vision System Toolbox and it's OpenCV Interface:
%   https://uk.mathworks.com/matlabcentral/fileexchange/47953-computer-vision-system-toolbox-opencv-interface
%   (you may follow compilation instructions in the enclosed video.)
% If you still have problems during the compilation, please ask for the MathWorks technical support, 
% probably by creating a service request here: https://uk.mathworks.com/support/contact_us.html
%
%
% The proposed 3D multi-speaker tracker can be adapted to any audio-visual sensing platforms
% as long as signals are synchronized and calibration parameters are available.
% To execute on a new audio-visual data, you need to 
%   (1) compute and update the synchronization information, camera calibration parameters and microphone position in 'readParas.m' 
%   (2) compute the face detection results, save the results, and update the filename in 'readParas/readfaceDetect.m' 
%       the face detector we used in [1] can be found in https://github.com/tornadomeet/mxnet-face
%	the detection bounding box should be saved in format: [#frame, #detection, topleft_x, topleft_y, bottomright_x, bottomright_y, probability, ...]
%   (3) compute the GCC-PHAT results, save the results, and update 'readParas/extractGCF3Dres.m'
% 
% For any questions, please contact x.qian@qmul.ac.uk