# Speech Technologies for Forensic Profiling

> This project explores speech processing and machine (deep) learning technologies for forensic profiling.

## Motivations
**Speech forensics** employs speech processing technologies to discover rich information contained in (concealed) speech associated with suspects, and provides evidence that could be used in court. 

This information includes
1. identity-related
    
    name, gender, age, height, weight, race, language & dialect, facial & body characteristics
2. geographical-related
    
    speech occurrence location & conditions, trace map
3. social-relation-related
    
    home, family members, education, work, party, social status, connections, upbringing
4. personal-traits-related
    
    mental state, personality, emotion tendency, habits
5. health-related

    illness history, disease tendency, DNA, body shape, composition and size of their vocal tract, skeletal proportions, lung volume and breathing functions
6. criminal-records-related
    
    crime history, crime tendency

**Justification**

We know that the acoustic aspects of speech are closely related to the speaker's articulatory system, which is in turn related to the speaker's facial structure and movement, and even to many other physical characteristics. The recordings also contain environmental information that can be exploited. Furthermore, the semantic aspects of speech contain a lot of useful information.

The sub-objectives are
1. Discover disguised voice

    How can we tell whether speech is disguised?
2. Discover voice under manipulation

    How can we tell if a speaker is under pressure, or threatened, etc.?
3. Privacy

    How can we protect the privacy of speakers while preserving their confidential information?
4. Reconstruction
    
    Can we reconstruct the speaker in 3D?

## Methods
A disguised voice may come from a person deliberately impersonating someone else, or it may be machine-synthesized. In either case, the disguise alters time-frequency and semantic traits. To break the disguise, we need to identify the *invariant* and *variant* characteristics of speech: some factors are innate, while others can be modified. We first identify these factors (as listed in the previous section). We then define the *normal* manner in which a speaker talks, and categorize and quantify the *deviation* from it. Using the discovered factors, we create a forensic profile of the speaker. With this profile, we can perform a variety of prediction and classification tasks.
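
One simple way to make "normal manner" and "deviation" concrete is a per-feature z-score against a speaker's baseline recordings. The sketch below is only an illustration of that idea, not the project's chosen method; the function names, feature names, and all numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def baseline_stats(baseline):
    """Per-feature (mean, sample std) over a speaker's baseline utterances.

    `baseline` maps a feature name to a list of measurements,
    one measurement per baseline utterance.
    """
    return {name: (mean(vals), stdev(vals)) for name, vals in baseline.items()}


def deviation(stats, observed):
    """Z-score of each observed feature relative to the speaker's baseline."""
    scores = {}
    for name, value in observed.items():
        mu, sigma = stats[name]
        if sigma == 0:
            # Constant baseline: any difference is an unbounded deviation.
            scores[name] = 0.0 if value == mu else math.copysign(math.inf, value - mu)
        else:
            scores[name] = (value - mu) / sigma
    return scores


# Hypothetical measurements, for illustration only.
baseline = {"pitch_hz": [118.0, 122.0, 120.0], "vot_ms": [58.0, 62.0, 60.0]}
stats = baseline_stats(baseline)
scores = deviation(stats, {"pitch_hz": 126.0, "vot_ms": 60.0})
```

A large absolute score on a feature that is hard to control voluntarily would be a candidate cue for disguise; which features qualify is exactly what the factor analysis has to establish.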

1. Microstructures: the sub-phonetic level features
    
    Voice onset time
    
    Harmonic bandwidth

    Creak / Vocal fry
    
    Excitation

    Modulation

    Formant frequencies

    Formant bandwidth

    Formant dispersion

    Glottal airflow / Glottal pulse shape

    Harmonicity / Peak-to-valley ratio

    Long-term average spectra

    Nasality

    Pitch

    Register

    Resonance

    Voice bar 

    Voice bar bandwidth

    Voice coil peak displacement
    
    Gradient

    Region of interest

    Features auto-discovered by neural networks
    
2. Hypothesis & tests
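
As an example of how one of the microstructure features listed above can be computed, the sketch below implements formant dispersion under one common definition: the mean spacing between adjacent formant frequencies. It assumes the formant frequencies have already been estimated by some other tool; the function name and the input values are hypothetical.

```python
def formant_dispersion(formants_hz):
    """Mean spacing between adjacent formant frequencies, in Hz.

    Equivalent to (F_n - F_1) / (n - 1) for formants F_1 < ... < F_n.
    """
    if len(formants_hz) < 2:
        raise ValueError("need at least two formants")
    ordered = sorted(formants_hz)
    gaps = [hi - lo for lo, hi in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical formant estimates (Hz) for a single vowel frame.
dispersion = formant_dispersion([500.0, 1500.0, 2500.0, 3500.0])
```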

## Update 3/5 - 3/11
    1. Microfeatures
        a. For tidigits, construct speaker-feature dictionary. Speaker dict contains id, gender, age, dialect, seq info. Feature dict contains speaker id, spectrograms, mel spectrograms, constant-Q spectrograms, mfccs, etc. Separate training and test set. Separate single digits and seqs.
        b. For timit, segment by word and by phone. Compute spectrograms and mfccs.
        c. Write an interface for general tasks: input speaker id and retrieve its features; input features and predict speaker-ids.
        d. Try a Conv-deconv network for feature extraction.
    2. Interspeech
        a. Compute metrics.
        b. Make figures.
    3. Qual preparation
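
The speaker and feature dictionaries in item 1a, and the lookup half of the interface in item 1c, could be organized roughly as follows. This is a minimal sketch under assumed names (`Speaker`, `FeatureStore`, and their methods are hypothetical), and the example values are made up; the reverse direction, predicting speaker ids from features, would need a trained model and is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Speaker:
    speaker_id: str
    gender: str
    age: int
    dialect: str
    sequences: list = field(default_factory=list)  # ids of spoken sequences


@dataclass
class FeatureStore:
    speakers: dict = field(default_factory=dict)  # speaker_id -> Speaker
    features: dict = field(default_factory=dict)  # speaker_id -> {name -> [values]}

    def add_speaker(self, speaker):
        self.speakers[speaker.speaker_id] = speaker
        self.features.setdefault(speaker.speaker_id, {})

    def add_feature(self, speaker_id, name, value):
        """Append one feature value (e.g. an MFCC matrix) for a speaker."""
        self.features[speaker_id].setdefault(name, []).append(value)

    def get_features(self, speaker_id, name):
        """Return all stored values of one feature type for a speaker."""
        return self.features[speaker_id].get(name, [])


# Hypothetical entries, for illustration only.
store = FeatureStore()
store.add_speaker(Speaker("fd", "male", 30, "unknown", ["6z97za"]))
store.add_feature("fd", "mfcc", [0.1, 0.2, 0.3])
mfccs = store.get_features("fd", "mfcc")
```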

## Update 5/8 - 5/13

### 1. Get familiar with datasets
> TIMIT
    * total 6300 sentences, 10 sentences spoken by each of 630 speakers
    * 8 major dialect regions of the United States
    
    
    1. **Dialect distribution**
    
    Table 1:  Dialect distribution of speakers
    
Dialect Region (dr) | # Male | # Female | Total
:-------------------|-------:|---------:|------:
1 New England | 31 (63%) | 18 (37%) | 49 (8%)
2 Northern | 71 (70%) | 31 (30%) | 102 (16%)
3 North Midland | 79 (77%) | 23 (23%) | 102 (16%)
4 South Midland | 69 (69%) | 31 (31%) | 100 (16%)
5 Southern | 62 (63%) | 36 (37%) | 98 (16%)
6 New York City | 30 (65%) | 16 (35%) | 46 (7%)
7 Western | 74 (74%) | 26 (26%) | 100 (16%)
8 Army Brat (moved around) | 22 (67%) | 11 (33%) | 33 (5%)
Total | 438 (70%) | 192 (30%) | 630 (100%)
    
    2. **Corpus text**
    
    Table 2:  TIMIT speech material

Sentence Type | # Sentences | # Speakers per sentence | Total utterances | # Sentences per speaker
:-------------|------------:|------------------------:|-----------------:|------------------------:
Dialect (SA) | 2 | 630 | 1260 | 2
Compact (SX) | 450 | 7 | 3150 | 5
Diverse (SI) | 1890 | 1 | 1890 | 3
Total | 2342 | | 6300 | 10

    3. **Filesystem**
    
    /<CORPUS>/<USAGE>/<DIALECT>/<SEX><SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>
         SPEAKER_ID ::= <INITIALS><DIGIT>
             INITIALS ::= speaker initials, 3 letters
             DIGIT ::= number 0-9 to differentiate speakers with identical initials
                             
    .wav - waveform file (SPHERE-headered)
    .txt - transcription
    .wrd - time-aligned word transcription
    .phn - time-aligned phonetic transcription
    
    Examples:
     /timit/train/dr1/fcjf0/sa1.wav
                         
     (TIMIT corpus, training set, dialect region 1, female speaker, 
      speaker-ID "cjf0", sentence text "sa1", speech waveform file)
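
The path layout above can be split into its fields with a regular expression. The following is a sketch, not part of the TIMIT distribution: the names `TimitPath` and `parse_timit_path` are hypothetical, and the pattern assumes dialect directories of the form `dr<digit>` and a sex prefix of `f` or `m`, generalizing from the single example shown.

```python
import re
from typing import NamedTuple


class TimitPath(NamedTuple):
    corpus: str
    usage: str
    dialect: str
    sex: str
    speaker_id: str
    sentence_id: str
    file_type: str


# Assumed pattern: /<CORPUS>/<USAGE>/dr<N>/<SEX><3 letters><digit>/<SENTENCE_ID>.<FILE_TYPE>
TIMIT_RE = re.compile(
    r"^/(?P<corpus>[^/]+)/(?P<usage>[^/]+)/(?P<dialect>dr\d)"
    r"/(?P<sex>[mf])(?P<speaker_id>[a-z]{3}\d)"
    r"/(?P<sentence_id>[^/.]+)\.(?P<file_type>[^/.]+)$",
    re.IGNORECASE,
)


def parse_timit_path(path):
    """Split a TIMIT-style path into its named components."""
    match = TIMIT_RE.match(path)
    if match is None:
        raise ValueError(f"not a TIMIT-style path: {path!r}")
    return TimitPath(**match.groupdict())


info = parse_timit_path("/timit/train/dr1/fcjf0/sa1.wav")
# The first two letters of the sentence id give the sentence type (SA, SX, SI).
sentence_type = info.sentence_id[:2].upper()
```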
      
    4. **Other docs**
        
        sentences, dict, lexicon, alignment
        
        tagging
        
    5. **Extra info**
    
        Birthday, height, race, education
        
        low pitch, conscious attempt to change accent, denasality, inhale/exhale, slow rate, high freq, intonation
        
        /R/ in "WASH", whistling /S/'s
        
        movement
        
        hearing loss, cold, glottal fry, hoarse, voice disorder
        
        mixed-race, multi-lingual

> TIDIGITS
    * more than 25 thousand digit sequences
    * 326 speakers (111 men, 114 women, 50 boys, and 51 girls)
    * collected in a quiet environment and digitized at 20 kHz
    
    1. **Speaker statistics**
        1. Age distribution
        2. Dialect distribution
        
    2. **Corpus text**
    
    3. **Filesystem**
        
        FILESPEC ::= /tidigits/<USAGE>/<SPEAKER-TYPE>/<SPEAKER-ID>/<DIGIT-STRING><PRODUCTION>.wav
             USAGE ::= test | train
             SPEAKER-ID ::= aa | ab | ac | ... | tc
             
        Example:
         /tidigits/train/man/fd/6z97za.wav

         ("tidigits" corpus, training material, adult male, speaker code "fd", digit sequence "six zero nine seven 
         zero", first production, NIST SPHERE file.)
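
A TIDIGITS path can be decoded in the same spirit. This sketch makes several assumptions beyond what the FILESPEC above states: that `o` encodes "oh" alongside `z` for "zero", that the production code is `a` or `b`, and that speaker ids are two lowercase letters. The names `DIGIT_WORDS` and `parse_tidigits_path` are hypothetical.

```python
import re

# "z" = zero matches the example above; "o" = oh is an assumption.
DIGIT_WORDS = {
    "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
    "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "z": "zero", "o": "oh",
}

TIDIGITS_RE = re.compile(
    r"^/tidigits/(?P<usage>test|train)"
    r"/(?P<speaker_type>[^/]+)"
    r"/(?P<speaker_id>[a-z]{2})"
    r"/(?P<digits>[1-9zo]+)(?P<production>[ab])\.wav$"
)


def parse_tidigits_path(path):
    """Split a TIDIGITS-style path into fields and spell out the digit string."""
    match = TIDIGITS_RE.match(path)
    if match is None:
        raise ValueError(f"not a TIDIGITS-style path: {path!r}")
    fields = match.groupdict()
    fields["words"] = " ".join(DIGIT_WORDS[c] for c in fields["digits"])
    return fields


parsed = parse_tidigits_path("/tidigits/train/man/fd/6z97za.wav")
```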
         
    4. **Other docs**
    
    5. **Extra info**

> Comparison

### 2. Listen, visualize, and compare
    
    1. Listen to samples from the two datasets

    2. Compute their spectra, visualize, and analyze
    
    3. Discoveries
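
For step 2, a log-magnitude spectrogram can be computed with a short-time Fourier transform. The sketch below uses NumPy only and runs on a synthetic tone rather than corpus audio; the function name, the frame length, the hop size, and the 16 kHz sample rate are illustrative assumptions, not settings taken from this project.

```python
import numpy as np


def log_spectrogram(signal, frame_len=400, hop=160):
    """Log-magnitude STFT in dB; rows are frames, columns are frequency bins."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small offset avoids log(0) on silent bins.
    return 20.0 * np.log10(magnitude + 1e-10)


# Synthetic one-second 1 kHz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
spec = log_spectrogram(np.sin(2 * np.pi * 1000.0 * t))

# Strongest frequency bin on average, converted to Hz (bin spacing = rate / frame_len).
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 400
```

The resulting array can be passed directly to an image plot for visual comparison between recordings from the two datasets.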

### 3. Get familiar with Sphinx

    1. Read docs
    
    2. Read code and run demos

### 4. Other
    
    1. Code maintenance: format data I/O, extract and visualize features, earlier speech alignment and segmentation code, general interfaces
    
    2. Read relevant papers: speech production, deep kernel learning
    
    3. Summarize statistical learning