# Speech Technologies for Forensic Profiling

> This project explores speech processing and machine (deep) learning technologies for forensic profiling.

## Motivations
**Speech forensics** employs speech processing technologies to discover rich information contained in (concealed) speech associated with suspects, and provides evidence that could be used in court. 

This information includes:
1. identity-related
    
    name, gender, age, height, weight, race, language & dialect, facial & body characteristics
2. geographical-related
    
    speech occurrence location & conditions, trace map
3. social-relation-related
    
    home, family members, education, work, party, social status, connections, upbringing
4. personal-traits-related
    
    mental state, personality, emotion tendency, habits
5. health-related

    illness history, disease tendency, DNA, body shape, composition and size of their vocal tract, skeletal proportions, lung volume and breathing functions
6. criminal-records-related
    
    crime history, crime tendency

**Justification**

We know that the acoustic aspects of speech are closely related to the speaker's articulatory system, which is in turn related to the speaker's facial structure and movement, and even to many other physical characteristics. Recordings also contain environmental information that can be exploited. Furthermore, the semantic aspects of speech contain a lot of useful information.

The sub-objectives are
1. Discover disguised voice

    How can we tell whether speech is disguised?
2. Discover voice under manipulation

    How can we tell if a speaker is under pressure, or threatened, etc.?
3. Privacy

    How can we protect the privacy of speakers while preserving their confidential information?
4. Reconstruction
    
    Can we reconstruct the speaker in 3D from speech?

## Methods
A disguised voice may come from a person deliberately impersonating someone else, or from machine synthesis. A disguise alters the time-frequency and semantic traits of speech. Either way, to break the disguise we need to identify the *invariant* and *variant* characteristics of speech: some factors are innate, while others can be modified. We first identify these factors (as listed in the previous section). We then define the *normal* manner in which a speaker talks, and categorize and quantify the *deviation* from it. Using the discovered factors, we create a forensic profile of the speaker. With this profile, we can perform a variety of prediction and classification tasks.
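
The idea of a *normal* manner and a quantified *deviation* can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how deviation is measured, so a per-feature z-score against the speaker's own reference recordings is assumed here, and the feature names (`pitch_hz`, `f1_hz`) and values are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(sessions):
    """Return {feature: (mean, std)} over a speaker's reference sessions.

    Needs at least two sessions, since a sample standard deviation is used.
    """
    names = sessions[0].keys()
    return {
        name: (mean(s[name] for s in sessions), stdev(s[name] for s in sessions))
        for name in names
    }


def deviation(baseline, observed):
    """Return {feature: z-score} of an observed recording against the baseline."""
    scores = {}
    for name, (mu, sd) in baseline.items():
        # A feature with zero spread in the reference data gives no usable scale.
        scores[name] = (observed[name] - mu) / sd if sd > 0 else 0.0
    return scores


# Hypothetical reference measurements for one speaker (assumed values).
reference = [
    {"pitch_hz": 100.0, "f1_hz": 500.0},
    {"pitch_hz": 110.0, "f1_hz": 520.0},
    {"pitch_hz": 120.0, "f1_hz": 540.0},
]
profile = build_baseline(reference)
scores = deviation(profile, {"pitch_hz": 130.0, "f1_hz": 520.0})
```

Features with a large absolute score would be candidates for the *variant* set in a given recording, while features that stay near zero across conditions would be candidates for the *invariant* set. Any threshold for "large" would have to be chosen empirically.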

1. Microstructures: the sub-phonetic level features
    
    Voice onset time
    
    Harmonic bandwidth

    Creak / Vocal fry
    
    Excitation

    Modulation

    Formant frequencies

    Formant bandwidth

    Formant dispersion

    Glottal airflow / Glottal pulse shape

    Harmonicity / Peak-to-valley ratio

    Long-term average spectra

    Nasality

    Pitch

    Register

    Resonance

    Voice bar 

    Voice bar bandwidth

    Voice coil peak displacement
    
    Gradient

    Region of interest

    Features discovered automatically by neural networks
    
2. Hypothesis & tests
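
As one illustration of the microstructure features listed above, formant dispersion can be computed from a set of formant frequencies. The sketch below assumes one common definition, the mean spacing between adjacent formants; the list above does not say which definition is intended, and the function name and input values are hypothetical.

```python
def formant_dispersion(formants_hz):
    """Mean spacing (Hz) between adjacent formants, e.g. F1..F4.

    With sorted formants this equals (F_n - F_1) / (n - 1).
    """
    if len(formants_hz) < 2:
        raise ValueError("need at least two formant frequencies")
    ordered = sorted(formants_hz)
    gaps = [high - low for low, high in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical formant estimates for one vowel segment (assumed values).
dispersion = formant_dispersion([500.0, 1500.0, 2500.0, 3500.0])
```

The formant frequencies themselves would have to come from a separate estimation step (for example, linear-prediction analysis), which is outside this sketch.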

## Update 3/5 - 3/11
1. Microfeatures
    1. For TIDIGITS, construct a speaker-feature dictionary. The speaker dictionary contains id, gender, age, dialect, and sequence info. The feature dictionary contains speaker id, spectrograms, mel spectrograms, constant-Q spectrograms, MFCCs, etc. Separate the training and test sets, and separate single digits from digit sequences.
    2. For TIMIT, segment by word and by phone. Compute spectrograms and MFCCs.
    3. Write an interface for general tasks: input a speaker id and retrieve its features; input features and predict speaker ids.
    4. Try a conv-deconv network for feature extraction.
2. Interspeech
    1. Compute metrics.
    2. Make figures.
3. Qual preparation
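
The speaker and feature dictionaries and the two-way interface described in the Microfeatures item could look roughly like the sketch below. It is an assumption-heavy illustration, not the project's actual code: the class and method names are hypothetical, feature vectors are plain lists of floats standing in for spectrogram or MFCC summaries, and nearest-neighbour matching stands in for whatever predictor is eventually used.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Speaker:
    # Metadata fields named in the update notes; the types are assumed.
    speaker_id: str
    gender: str
    age: str
    dialect: str


@dataclass
class FeatureStore:
    speakers: dict = field(default_factory=dict)  # speaker_id -> Speaker
    features: dict = field(default_factory=dict)  # speaker_id -> list of vectors

    def add(self, speaker, vector):
        """Register a speaker and attach one feature vector to them."""
        self.speakers[speaker.speaker_id] = speaker
        self.features.setdefault(speaker.speaker_id, []).append(list(vector))

    def get_features(self, speaker_id):
        """Speaker id in, stored feature vectors out."""
        return self.features[speaker_id]

    def predict_speaker(self, vector):
        """Feature vector in, id of the speaker with the closest stored vector out."""
        best_id, best_dist = None, math.inf
        for speaker_id, vectors in self.features.items():
            for stored in vectors:
                dist = math.dist(stored, vector)  # Euclidean distance
                if dist < best_dist:
                    best_id, best_dist = speaker_id, dist
        return best_id


# Hypothetical speakers and feature vectors (assumed values).
store = FeatureStore()
store.add(Speaker("spk_a", "F", "adult", "dialect_1"), [1.0, 2.0, 3.0])
store.add(Speaker("spk_b", "M", "adult", "dialect_2"), [9.0, 8.0, 7.0])
predicted = store.predict_speaker([1.1, 2.1, 2.9])
```

Keeping metadata and features in separate dictionaries keyed by speaker id mirrors the split described in the notes, and lets the training/test separation be done by partitioning the ids.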