# Speech latent feature discovery for multi-label forensic profiling

Project for IWBF 2018

----

**Objective:** Find latent representations in speech that can predict forensic attributes of the speaker.


## Update ~12/11/2017

1. Write-up:

    https://www.overleaf.com/12702004wrrmdybmpqsq#/48448185/

2. Implementation
    1. Data: TIMIT, Alcohol, (SRE)
    2. Model
        ![model](./write-up/figs/model.png)  
    3. Current attributes: speaker-id, gender, dialect, age, height, drunk
    4. [Code](./src)
    
3. Final poster for convex

## Update ~12/18/2017

1. Experiments
    1. Currently using latent codes from autoencoder to do multi-label classification and regression.
    2. It works well on certain tasks (reconstruction, gender, height), but not well on other tasks (id, dialect, age).
    3. Including more datasets (SRE, Yandong's face data)

large model openset (100 epochs, batch 4)

|Label | Reconstruction Loss ($l_2^2$) | ID Acc (%) | Gender Acc (%) | Dialect Acc (%) | Age MSE | Height MSE |
|:----|----:|---:|---:|---:|----:|---:|
| Train | 0.1417 | 75.15 | 97.15 | 80.40 | 6.6931 | 0.0668 |
| Test | 0.1847 | 3.59 | 92.83 | 14.73 | 81.1115 | 0.0818 |
 
2. Writing paper

3. Todo next week
    1. Finish draft of the paper.
    2. Improve the model.
    

## Update ~12/19/2017

1. Experiments
    1. Observations: ID, Age and Dialect seem to be correlated (similar trends of loss), whereas Gender and Height seem to be independent (not speaker-dependent).
    2. Back to closed set of speakers -- openset is far more difficult.
    3. Reduce model size to speed up training.

## Update ~12/20/2017

1. Experiments
    1. Train on size-reduced model: significantly reduced model size, fewer layers, fewer parameters, smaller z dimension.
    2. Use closed set speakers, split (0.8 / 0.2).
    3. Observations: 
    
small model openset (300 epochs, batch 4)

|Label | Reconstruction Loss ($l_2^2$) | ID Err (%) | Gender Err (%) | Dialect Err (%) | Age MSE | Height MSE |
|:----|----:|---:|---:|---:|----:|---:|
| Train | 0.1168 | 0.98 | 0.02 | 1.45 | 1.1543 | 0.0980 |
| Test | 0.1529 | 98.60 | 3.81 |  82.36 | 83.5361 | 0.0981 |
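
The difference between the closed-set split (0.8 / 0.2) and the open-set setting can be sketched as below. This is a minimal illustration, not the project's data loader: the `(speaker_id, utterance_id)` pair format, the function names, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def closed_set_split(utterances, train_frac=0.8, seed=0):
    """Split each speaker's utterances so every speaker appears in both sets.

    `utterances` is a list of (speaker_id, utterance_id) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker, utt in utterances:
        by_speaker[speaker].append(utt)

    train, test = [], []
    for speaker, utts in sorted(by_speaker.items()):
        rng.shuffle(utts)
        cut = int(round(train_frac * len(utts)))
        # Keep at least one utterance on each side when the speaker has two or more.
        if len(utts) > 1:
            cut = min(max(cut, 1), len(utts) - 1)
        train += [(speaker, u) for u in utts[:cut]]
        test += [(speaker, u) for u in utts[cut:]]
    return train, test


def open_set_split(utterances, train_frac=0.8, seed=0):
    """Split by speaker so that test speakers are never seen in training."""
    rng = random.Random(seed)
    speakers = sorted({speaker for speaker, _ in utterances})
    rng.shuffle(speakers)
    cut = int(round(train_frac * len(speakers)))
    train_speakers = set(speakers[:cut])
    train = [(s, u) for s, u in utterances if s in train_speakers]
    test = [(s, u) for s, u in utterances if s not in train_speakers]
    return train, test


# Toy data (made up): 5 speakers with 10 utterances each.
data = [(f"spk{s}", f"utt{s}_{i}") for s in range(5) for i in range(10)]
closed_train, closed_test = closed_set_split(data)
open_train, open_test = open_set_split(data)
```

In the open-set variant the test speakers have no training examples, so a speaker-ID classifier trained on the training identities cannot label them correctly.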

## Update ~12/20/2017

1. Experiments
    1. Observations: while the train losses decrease monotonically, the test losses for id and dialect strangely increase.

![strange-loss](./write-up/figs/strange_loss.png)

small model closedset (300 epochs, batch 16)

|Label | Reconstruction Loss ($l_2^2$) | ID Err (%) | Gender Err (%) | Dialect Err (%) | Age MSE | Height MSE |
|:----|----:|---:|---:|---:|----:|---:|
| Train | 0.1209 | 0.15 | 0.00 | 0.31 | 0.5237 | 0.0177 |
| Test | 0.1609 | 87.54 | 0.0150 | 77.97 | 60.2400 | 0.0773 |

## Update ~12/28/2017

Did not do much this week ...

1. Found the bug in the code; all losses are now decreasing.
    
small model closedset

|Label | Reconstruction Loss ($l_2^2$) | ID Err (%) | Gender Err (%) | Dialect Err (%) | Age MSE | Height MSE |
|:----|----:|---:|---:|---:|----:|---:|
| Train |  | 32.40 |  |  |  |  |
| Test |  | 48.33 |  |  |  |  |

2. Paper

    https://www.overleaf.com/12702004wrrmdybmpqsq#/48448185/

3. Todo

    1. Extend to multi-datasets
    2. Have to finish the paper !!!


## Update ~12/30/2017

1. Multitask learning (MTL)
    1. Models
        1. Neural models
            1. Hard parameter sharing
            2. Soft parameter sharing
            3. **what to share in model?**

                **Sharing information with unrelated tasks might impede learning individual tasks.**
        2. Non-neural models [assume model parameter $W$ of size ($d$-feature dims $\times$ $T$-tasks)]
            1. sparsity across tasks through norm regularization
                1. block-sparse regularization: mixed $l_1 / l_q$ norms
                2. group lasso
                3. trace norm regularization: low rank
                4. combine block-sparse and element-wise sparse
            2. modelling the relationships between tasks
                1. clustering constraints: task mean and between-task variance; possibly add within-task variance
                2. more complex structures: graph, tree, k-nn
                3. Bayesian
                    1. Gaussian process with shared covariance
                    2. mean task-dependent and a clustering of the tasks using a mixture distribution
                    3. Dirichlet process and enable the model to learn the similarity between tasks as well as the number of clusters
                    4. hierarchical Bayesian model -- a latent task hierarchy
                    5. actual tasks are linear combination of small number of latent basis tasks
        3. Auxiliary tasks
            1. related task
            2. adversarial task
            3. hints: predict features
            4. attention
            5. quantization smoothing
            6. representation learning
            7. **what auxiliary tasks are helpful?**
            
            e.g.  two tasks are $\mathcal{F}$-related if the data for both tasks can be generated from a fixed probability distribution using a set of transformations $\mathcal{F}$
            
            Task similarity is not binary, but resides on a spectrum. More similar tasks should help more in MTL, while less similar tasks should help less. Allowing our models to learn what to share with each task might allow us to temporarily circumvent the lack of theory and make better use even of only loosely related tasks. However, we also need to develop a more principled notion of task similarity with regard to multi-task learning in order to know which tasks we should prefer.
            
             **Existing problems**: task similarity, relationship, hierarchy, and benefit for MTL

2. **Multitask learning and Adversarial learning**
    1. Domain adaptation ?
    2. MTL with loosely related tasks: multiple adversarial learning?
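
For reference, the norm regularizers listed under the non-neural models can be written out as follows, using the $W \in \mathbb{R}^{d \times T}$ convention from that list. The symbols $w^{i}$ (row $i$ of $W$), $w_t$ (column $t$ of $W$), $\sigma_j(W)$ (singular values of $W$), $\mathcal{L}_t$ (loss of task $t$), $\Omega$ (the regularizer) and $\lambda$ are notation introduced here for illustration only.

$$\min_W \sum_{t=1}^{T} \mathcal{L}_t(w_t) + \lambda\,\Omega(W)$$

$$\|W\|_{1,q} = \sum_{i=1}^{d} \|w^{i}\|_q, \qquad \|W\|_* = \sum_{j=1}^{\min(d,T)} \sigma_j(W)$$

The mixed $l_1 / l_q$ norm sums the $l_q$ norms of the rows of $W$, so it tends to zero out whole feature rows across all tasks; $q = 2$ gives the group-lasso penalty over rows. The trace norm sums the singular values and therefore favors a low-rank $W$.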


## Update ~01/01/2018

1. Address the problems analyzed in **Update ~12/30/2017**
    1. Need a sharing mechanism among tasks
    2. Need a latent domain adaptation mechanism
    3. Weighted loss w.r.t. the task uncertainty

## Update ~01/02/2018

1. Experiments
    0. **Two difficulties**
        1. The model cannot handle even a single difficult task (e.g., id, age), let alone multiple tasks. --> need better model
        2. The tasks are only loosely related, and some tasks bias others. --> need proper model sharing, loss weighting, auxiliary tasks
    1. Tried different sharing mechanisms, but none worked
        1. sharing part of encoder but with different latent codes for different tasks
        2. sharing latent codes (including varying latent code distributions, e.g. Gaussians with centered mean but different variances)
        3. sharing part of decoder
    2. Changed model to Resnet-like structure
        1. combine multi-resolution features
    3. Applying weighting to different task losses
        1. hard-assigned weights: assign larger weights to tough tasks.
        2. soft-assigned weights: probabilistically reformulate the task objectives (for regression, model likelihood as a Gaussian; for classification, use softmax), and then use objective variance / uncertainty to adjust weights.
    4. Trying multi-stage autoencoders
    5. Attention model?

## Update ~01/03/2018

1. Experiments
    1. Resnet-like structure seems to improve accuracy over a standard CNN.
    2. Use task-wise learning rates and task-wise early stopping.
    3. Use regularization on shared latent representation 

## Update ~01/04/2018

0. Conclusions
    1. Based on previous observations and discussion, we need
        0. Model: Resnet, probably add attention
        1. Model sharing: must share from latent level
        2. Related tasks: design auxiliary tasks, corresponding loss and learning strategy
        3. Regularization on latent space: currently tried sparse, $l_2$, and normal
1. Experiments
    1. Resnet-like structure generally improves test performance (but not much).
    
    [See some loss curves](./write-up/figs/loss_records)
    
    2. Running task-wise early stopping with smaller learning rate.
    
    [Ref: Facial Landmark Detection by Deep Multi-task Learning](https://link.springer.com/chapter/10.1007/978-3-319-10599-4_7)
    
    [Ref: PhD Thesis: Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf)
    ![Example of MTL on nine 1D-ALVINN tasks](./write-up/figs/eg_mtl_loss_curves.png)
    
    3. Running single task.
    4. Using ID as main task, and others as auxiliary tasks.
    
    Other tasks can be viewed as sub-tasks of ID.
    
    5. Redesign loss
        
        For regression: the likelihood $p(y_1 | f(x)) = \mathcal{N}(f(x), \sigma_1^2)$
        
        For classification: $p(y_2 | f(x)) = \text{softmax}(\frac{1}{\sigma_2^2}f(x))$
        
        The total negative log-likelihood (the loss to minimize) for both regression and classification tasks: $\mathcal{L}(\sigma_1, \sigma_2) \approx \frac{1}{2\sigma_1^2}\|y_1 - f(x)\|_2^2 - \frac{1}{2\sigma_2^2}\log\text{softmax}(y_2, f(x)) + \log\sigma_1^2 + \log\sigma_2^2$
        
        Hence the variances $\sigma_1^2$ and $\sigma_2^2$ control the contribution of each task.
        
        Can further extend to generalized linear mixed model.
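
A minimal sketch of the redesigned loss, following the approximation as written above for one regression target and one classification target. Parameterizing each variance by its logarithm (`log_var_reg` $= \log\sigma_1^2$, `log_var_cls` $= \log\sigma_2^2$) is an assumption made here to keep the variances positive; the function and argument names are hypothetical, and in a real model the two log-variances would be trainable scalars rather than plain floats.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def uncertainty_weighted_loss(y_reg, pred_reg, label, logits,
                              log_var_reg, log_var_cls):
    """Combined loss: each task term is scaled by 1 / (2 * sigma^2),
    plus the log-variance penalties that stop sigma from growing freely."""
    sq_err = sum((a - b) ** 2 for a, b in zip(y_reg, pred_reg))
    cross_entropy = -log_softmax(logits)[label]
    reg_term = 0.5 * math.exp(-log_var_reg) * sq_err
    cls_term = 0.5 * math.exp(-log_var_cls) * cross_entropy
    return reg_term + cls_term + log_var_reg + log_var_cls


# Toy inputs (made up): one regression value, two-class logits.
loss = uncertainty_weighted_loss(
    y_reg=[1.0], pred_reg=[0.0], label=0, logits=[0.0, 0.0],
    log_var_reg=0.0, log_var_cls=0.0,
)
```

Raising a task's log-variance shrinks its data term but pays a linear penalty, which is what lets the weights be learned rather than hand-assigned.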

## Update ~01/07/2018

1. Experiments
    1. Apply task-wise early stopping
    
        Loss reduced for ID, Dialect and Age.
    2. Use Wasserstein loss for the autoencoder
        
        Loss reduced for reconstruction. 
    3. Learning strategy: train AE more before training Discriminators
    
        Improved performance.
    4. Conjecture: two adversaries in learning
        1. Adversary between Reconstruction and Discrimination
        
            Consider X -f- z -g- X' and X -f- z -h- y: $(g \circ f)(X) = X'$, $(h \circ f)(X) = y$. If reconstruction is exact ($X' = X$), then $g \circ f = I$, so $f = g^{-1} \circ I$ and $(h \circ (g^{-1} \circ I))(X) = y$. Hence the discrimination process involves inverting the reconstruction process.
            
        2. Adversary among discriminators
        
            Each decoding process has a true, underlying $\{\mathcal{Z}\}$, but the encoded $\{\widetilde{\mathcal{Z}}\}$ differs from $\{\mathcal{Z}\}$. There is a game among the tasks in learning these latent representations.
    
    5. Todo: Apply probabilistic early stopping

## Update ~01/08/2018

1. Experiments
    1. Dynamic task-wise stopping and resuming
        
        Use learning curve statistics to regularize learning: at step $t$, stop task $\alpha$ if 
        
        $$\frac{\text{med}_kE_{\text{train}}^{\alpha}}{\text{mean}_kE_{\text{train}}^{\alpha} - \text{med}_kE_{\text{train}}^{\alpha}} \cdot \frac{\left|E_{\text{test}}^{\alpha}(t) - \min_tE_{\text{train}}^{\alpha}\right|}{\lambda^{\alpha}\min_tE_{\text{train}}^{\alpha}} > \text{threshold},$$
        
        and resume it once the left-hand side falls below the threshold. Here $\text{med}_kE_{\text{train}}^{\alpha}$ and $\text{mean}_kE_{\text{train}}^{\alpha}$ are the median and mean of the most recent $k$ train errors for task $\alpha$, $E_{\text{test}}^{\alpha}(t)$ is its test error at step $t$, $\min_tE_{\text{train}}^{\alpha}$ is its minimum train error over the first $t$ steps, and $\lambda^{\alpha}$ is the loss weight for task $\alpha$.
        
    2. Loss
        1. Use the dual of the Wasserstein distance for reconstruction; mean absolute deviation for continuous attribute prediction --> weak convergence
        2. Todo: use probabilistic formulation for all tasks and derive weaker metric;
        
            also use it to adjust dynamic task-wise stopping and resuming.
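
The dynamic task-wise stopping and resuming rule above can be sketched for a single task as follows. This is a minimal illustration under stated assumptions: the function names, the argument layout, and the `eps` guard against a zero denominator are additions for the example and are not part of the rule as written in these notes.

```python
from statistics import mean, median


def stopping_score(train_errors, test_error, k, loss_weight, eps=1e-12):
    """Left-hand side of the stopping rule for one task.

    train_errors: train errors for this task up to the current step.
    test_error:   test error for this task at the current step.
    k:            number of recent train errors used for the trend term.
    loss_weight:  lambda, the loss weight of this task.
    eps:          replaces a zero denominator (assumption, not in the notes).
    """
    recent = train_errors[-k:]
    med = median(recent)
    denom = mean(recent) - med
    if abs(denom) < eps:
        denom = eps
    trend = med / denom

    best_train = min(train_errors)
    scale = loss_weight * best_train
    if abs(scale) < eps:
        scale = eps
    gap = abs(test_error - best_train) / scale
    return trend * gap


def is_active(score, threshold):
    """A task is stopped while its score exceeds the threshold, resumed otherwise."""
    return score <= threshold


# Toy learning curve (made up) for one task.
history = [4.0, 2.0, 1.5]
score = stopping_score(history, test_error=3.0, k=3, loss_weight=1.0)
keep_training = is_active(score, threshold=3.0)
```

Evaluating `is_active` for every task at each step gives both behaviors in the rule: a task whose score rises above the threshold is paused, and it is picked up again when the score drops back.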
            
            