# Assignment 1: Decoding States

---

## Task 5) Dual-Tone Multi-Frequency

[Dual-tone multi-frequency (DTMF)](https://en.wikipedia.org/wiki/Dual-tone_multi-frequency_signaling) signaling is an old way of transmitting dial pad keystrokes over the phone.
Each key/symbol is assigned a frequency pair: `[(1,697,1209), (2,697,1336), (3,697,1477), (A,697,1633), (4,770,1209), (5,770,1336), (6,770,1477), (B,770,1633), (7,852,1209), (8,852,1336), (9,852,1477), (C,852,1633), (*,941,1209), (0,941,1336), (#,941,1477), (D,941,1633)]`.
You can generate some DTMF sequences online, e.g. at <https://www.audiocheck.net/audiocheck_dtmf.php>.
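
To make the frequency pairs concrete, here is a minimal sketch that synthesizes a key press as the sum of its two sinusoids. The names `DTMF_FREQS`, `synth_key` and `synth_sequence`, as well as the default durations, are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np

# Key -> (low frequency, high frequency) in Hz, taken from the list above.
DTMF_FREQS = {
    '1': (697, 1209), '2': (697, 1336), '3': (697, 1477), 'A': (697, 1633),
    '4': (770, 1209), '5': (770, 1336), '6': (770, 1477), 'B': (770, 1633),
    '7': (852, 1209), '8': (852, 1336), '9': (852, 1477), 'C': (852, 1633),
    '*': (941, 1209), '0': (941, 1336), '#': (941, 1477), 'D': (941, 1633),
}


def synth_key(key: str, sr: int = 22050, duration: float = 0.2) -> np.ndarray:
    """Hypothetical helper: sum of the two sinusoids assigned to `key`."""
    f_low, f_high = DTMF_FREQS[key]
    t = np.arange(int(sr * duration)) / sr
    return 0.5 * (np.sin(2 * np.pi * f_low * t) + np.sin(2 * np.pi * f_high * t))


def synth_sequence(keys: str, sr: int = 22050,
                   duration: float = 0.2, pause: float = 0.1) -> np.ndarray:
    """Hypothetical helper: key presses separated by short pauses of silence."""
    silence = np.zeros(int(sr * pause))
    parts = []
    for key in keys:
        parts.extend([synth_key(key, sr, duration), silence])
    return np.concatenate(parts)
```

A signal such as `synth_sequence("433A")` can serve as a controlled test input before moving on to downloaded recordings.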

### Features

For feature computation, use librosa to compute the power spectrum (`librosa.stft` and `librosa.amplitude_to_db`), and extract the approximate band energy for each of the eight DTMF frequencies.

> Note: It's best to identify silence by the overall spectral energy and to normalize the band energies to sum up to one.
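
The feature step and the note above can be sketched as follows. This is a NumPy-only illustration so that it runs without audio files; in the assignment, `librosa.stft` would replace the manual framing. The function name `band_features`, the Hann window, the hop size, the summing over a bin and its two neighbors, and the relative silence threshold of -40 dB are all assumptions, not requirements of the task.

```python
import numpy as np

DTMF_BANDS = [697, 770, 852, 941, 1209, 1336, 1477, 1633]  # Hz


def band_features(y, sr=22050, n_fft=2048, hop=512, silence_db=-40.0):
    """Hypothetical helper: return (features, is_silence).

    features has shape (n_frames, 8); each non-silent row sums to one.
    """
    y = np.asarray(y, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + max(0, len(y) - n_fft) // hop
    # Nearest FFT bin for each DTMF frequency.
    bins = np.round(np.array(DTMF_BANDS) * n_fft / sr).astype(int)

    feats = np.zeros((n_frames, len(DTMF_BANDS)))
    total = np.zeros(n_frames)
    for t in range(n_frames):
        frame = y[t * hop: t * hop + n_fft]
        frame = np.pad(frame, (0, n_fft - len(frame)))
        power = np.abs(np.fft.rfft(frame * window)) ** 2
        total[t] = power.sum()
        for k, b in enumerate(bins):
            # Approximate band energy: the bin itself plus one neighbor per side.
            feats[t, k] = power[b - 1: b + 2].sum()

    eps = 1e-12
    # Silence is judged by the overall energy relative to the loudest frame.
    level_db = 10.0 * np.log10((total + eps) / (total.max() + eps))
    is_silence = level_db < silence_db
    # Normalize the band energies so that each row sums to one.
    feats = feats / np.maximum(feats.sum(axis=1, keepdims=True), eps)
    feats[is_silence] = 0.0
    return feats, is_silence
```

With `sr=22050` and `n_fft=2048`, the bin spacing is roughly 10.8 Hz, so the eight DTMF frequencies fall into clearly separated bins.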

### Decoding

To decode DTMF sequences, we can again use dynamic programming, this time applied to states rather than edits.
For DTMF sequences, consider a small, fully connected graph that has 17 states: 0-9, A-D, \*, \# and _silence_.
As for the DP matrix: the rows denote the states and the columns represent time; we decode left-to-right (i.e. time-synchronously), and at each time step we have to find the best step forward.
The main difference to edit distances or DTW is that you may now also "go up" in the table, i.e. change state freely.
For distance/similarity, use a template vector for each state that has `.5` for those two bins that need to be active.
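
One possible shape of this state-based DP is sketched below. It assumes 8-dimensional, normalized band features per frame (one entry per DTMF frequency) plus a boolean silence flag per frame. The fixed `switch_penalty` for changing state, the scoring of silence as 1 or 0, and the names `make_templates` and `decode_states` are assumptions chosen for illustration; the assignment leaves these design choices open.

```python
import numpy as np

KEYS = ['1', '2', '3', 'A', '4', '5', '6', 'B',
        '7', '8', '9', 'C', '*', '0', '#', 'D']
STATES = KEYS + ['-']  # 16 keys plus silence


def make_templates() -> np.ndarray:
    """One template per key: 0.5 at its low band and 0.5 at its high band."""
    templates = np.zeros((len(KEYS), 8))
    for i in range(len(KEYS)):
        templates[i, i // 4] = 0.5      # keypad row -> low frequency band
        templates[i, 4 + i % 4] = 0.5   # keypad column -> high frequency band
    return templates


def decode_states(feats, is_silence, switch_penalty=0.5):
    """Time-synchronous DP over states; returns one state label per frame."""
    feats = np.asarray(feats, dtype=float)
    is_silence = np.asarray(is_silence, dtype=bool)
    n_frames = feats.shape[0]
    if n_frames == 0:
        return []

    # Local similarity of every state to every frame.
    local = np.zeros((len(STATES), n_frames))
    local[:-1] = make_templates() @ feats.T
    local[:-1, is_silence] = 0.0
    local[-1, is_silence] = 1.0

    D = np.full(local.shape, -np.inf)
    back = np.zeros(local.shape, dtype=int)
    D[:, 0] = local[:, 0]
    for t in range(1, n_frames):
        prev = D[:, t - 1]
        best_prev = int(np.argmax(prev))
        for s in range(len(STATES)):
            stay = prev[s]
            switch = prev[best_prev] - switch_penalty
            if stay >= switch:
                D[s, t], back[s, t] = local[s, t] + stay, s
            else:
                D[s, t], back[s, t] = local[s, t] + switch, best_prev

    # Backtrack from the best final state.
    path = [int(np.argmax(D[:, -1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return [STATES[s] for s in reversed(path)]
```

Compared with picking the best state per frame independently, the penalty discourages switching state for a single noisy frame, at the cost of one more parameter to tune.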

When decoding a sequence, the idea is that we remain in one state as long as the key is pressed; after that, either the key is released (the spectral energy is close to 0, so we are in a pause) or another key is pressed, which is a "direct" transition.
Thus, from the backtrack, collapse the sequence by digit and remove silence, e.g. `44443-3-AAA` becomes `433A`.
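
The collapse step can be sketched with `itertools.groupby`, which merges runs of identical states before the silence symbol is dropped. The function name `collapse` and the use of `-` as the silence symbol are assumptions that follow the example above.

```python
from itertools import groupby


def collapse(states, silence="-"):
    """Merge runs of identical states, then drop the silence symbol."""
    return [key for key, _ in groupby(states) if key != silence]


print("".join(collapse("44443-3-AAA")))  # 433A
```

Note that a key pressed twice with no detected silence in between forms a single run and is therefore reported only once.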

---

### Preparation

In [7]:
import librosa
import numpy as np
from typing import List, Tuple

In [8]:
### Notice: librosa defaults to a 22050 Hz sample rate; adjust if needed!

DTMF_TONES = [
    ('1', 697, 1209), 
    ('2', 697, 1336), 
    ('3', 697, 1477), 
    ('A', 697, 1633),
    ('4', 770, 1209),
    ('5', 770, 1336),
    ('6', 770, 1477),
    ('B', 770, 1633),
    ('7', 852, 1209),
    ('8', 852, 1336),
    ('9', 852, 1477),
    ('C', 852, 1633),
    ('*', 941, 1209),
    ('0', 941, 1336),
    ('#', 941, 1477),
    ('D', 941, 1633)
]

### Implement the Decoding

In [64]:
### TODO:
### 1. familiarize yourself with librosa stft to compute the power spectrum
### 2. extract the critical bands from the power spectrum (i.e. how much energy is in the DTMF-related frequency bins?)
### 3. define template vectors representing the states (see DTMF_TONES)
### 4. for a new recording, extract critical bands and do DP to get the state sequence
### 5. backtrack & collapse

### Notice: you will need a couple of helper functions...


def decode(y: np.ndarray, sr: float) -> list:
    """
    Apply DTMF signal decoding.
    
    Arguments:
    y: Input signal.
    sr: Sample rate. 
    
    Returns list of DTMF-signals (no silence).
    """
    ### YOUR CODE HERE
    def make_template_vector(freq_low, freq_high, sr, n_fft):
        bin_low = int((freq_low * n_fft) / sr)
        bin_high = int((freq_high * n_fft) / sr)

        template = np.zeros(int(n_fft / 2 + 1))
        template[bin_low] = 0.5
        template[bin_high] = 0.5
    
        return template
    
    def collapse_sequence(seq, silence_break_between_strokes = 2):
        # Keep a symbol only once it has persisted for more than
        # `silence_break_between_strokes` frames and differs from the last kept symbol.
        if len(seq) == 0:
            return []

        collapsed_seq = [seq[0]]
        symbol_count = 1
        last_symbol = seq[0]

        for symbol in seq:  # iterate over the argument, not the enclosing variable
            if symbol == last_symbol:
                symbol_count += 1

                if symbol_count > silence_break_between_strokes and symbol != collapsed_seq[-1]:
                    collapsed_seq.append(symbol)
            else:
                symbol_count = 0

            last_symbol = symbol
        
        return collapsed_seq

    n_fft = 2048
    # Power spectrogram, e.g. shape (1025, 345); S[..., f, t] is frequency bin f at frame t
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2
    S_db = librosa.amplitude_to_db(S)

    n = len(DTMF_TONES) + 1
    m = S.shape[1]
    D = np.full((n, m), -np.inf, dtype=float)
    template_freqs = [make_template_vector(fl, fh, sr, n_fft) for _, fl, fh in DTMF_TONES]

    for i in range(m):
        for j in range(n-1):
            D[j,i] = np.dot(S_db[:, i], template_freqs[j]) / (len(S_db[:, i]) * len(template_freqs[j]))  # dot product with the template, scaled by a constant (not a true cosine similarity)

        D[n-1, i] = 1.0 if np.argmax(S_db[:, i]) == 0 else 0.0  # silence label either 1 or 0 to dominate other similarity scores
    
    # select symbol with the highest similarity score for the given frame
    detected_symbols = []
    for i in range(m):
        max_idx = np.argmax(D[:, i])
        detected_symbols.append("-" if max_idx == n-1 else DTMF_TONES[max_idx][0]) 

    # collapsing
    collapsed_symbols = collapse_sequence(detected_symbols)

    return collapsed_symbols
    ### END YOUR CODE


### Test the Decoding

In [65]:
SR = 22050
TEST_FILE = "C:\\Users\\Felix\\PythonProjects\\seqlrn_assignments\\1-dynamic-programming\\data\\dtmf\\audiocheck.net_dtmf_112163_112196_11#9632_##9696.wav"

y, sr = librosa.load(TEST_FILE, sr=SR)
print(decode(y=y, sr=sr))

['1', '2', '1', '6', '3', '-', '1', '2', '1', '9', '6', '-', '1', '#', '9', '6', '3', '2', '-', '#', '9', '6', '9', '6']
