# Assignment 1: Decoding States

---

## Task 5) Dual-Tone Multi-Frequency

[Dual-tone multi-frequency (DTMF)](https://en.wikipedia.org/wiki/Dual-tone_multi-frequency_signaling) signaling is an older method of transmitting dial pad keystrokes over a phone line.
Each key/symbol is assigned a frequency pair: `[(1,697,1209), (2,697,1336), (3,697,1477), (A,697,1633), (4,770,1209), (5,770,1336), (6,770,1477), (B,770,1633), (7,852,1209), (8,852,1336), (9,852,1477), (C,852,1633), (*,941,1209), (0,941,1336), (#,941,1477), (D,941,1633)]`.
You can generate DTMF sequences online, e.g. at <https://www.audiocheck.net/audiocheck_dtmf.php>.
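If you prefer to create test signals locally, the frequency table above is enough to synthesize them. The following is a minimal sketch, assuming each key is the sum of two equal-amplitude sine waves; the helper name `synth_dtmf` and the tone and pause durations are illustrative choices, not part of the assignment.

```python
import numpy as np

# Key -> (low, high) frequency in Hz, as listed in the task description.
DTMF_FREQS = {
    '1': (697, 1209), '2': (697, 1336), '3': (697, 1477), 'A': (697, 1633),
    '4': (770, 1209), '5': (770, 1336), '6': (770, 1477), 'B': (770, 1633),
    '7': (852, 1209), '8': (852, 1336), '9': (852, 1477), 'C': (852, 1633),
    '*': (941, 1209), '0': (941, 1336), '#': (941, 1477), 'D': (941, 1633),
}

def synth_dtmf(keys, sr=22050, tone_s=0.1, pause_s=0.05):
    """Hypothetical helper: one two-tone burst followed by a pause per key."""
    t = np.arange(int(sr * tone_s)) / sr
    pause = np.zeros(int(sr * pause_s))
    chunks = []
    for key in keys:
        low, high = DTMF_FREQS[key]
        # Sum of the row (low) and column (high) sine waves, peak amplitude <= 1.
        tone = 0.5 * np.sin(2 * np.pi * low * t) + 0.5 * np.sin(2 * np.pi * high * t)
        chunks += [tone, pause]
    return np.concatenate(chunks)

signal = synth_dtmf("42#")
```

A signal generated this way is cleaner than a real recording, so it is mainly useful for checking that the feature extraction and decoding steps are wired up correctly.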

### Features

For feature computation, use librosa to compute the power spectrum (`librosa.stft` and `librosa.amplitude_to_db`), and extract the approximate band energy for each of the eight DTMF frequencies.

> Note: It's best to identify silence by the overall spectral energy and to normalize the band energies to sum up to one.
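The note above can be sketched for a single frame as follows. To keep the example dependency-free it uses `np.fft.rfft` on one Hann-windowed frame as a stand-in for one column of `librosa.stft`; the function name `band_features` and the `silence_threshold` value are assumptions for illustration and would need tuning on real recordings.

```python
import numpy as np

ALL_DTMFS = [697, 770, 852, 941, 1209, 1336, 1477, 1633]

def band_features(frame, sr, silence_threshold=1e-3):
    """Return (normalized band energies, is_silence) for one audio frame.

    silence_threshold is an illustrative value, not taken from the assignment.
    """
    window = np.hanning(len(frame))
    power = np.abs(np.fft.rfft(frame * window)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Pick the FFT bin closest to each DTMF frequency.
    bins = [int(np.argmin(np.abs(freqs - f))) for f in ALL_DTMFS]
    bands = power[bins]
    # Silence is judged on the overall spectral energy, not on the bands alone.
    total = power.sum() / len(frame)
    if total < silence_threshold:
        return np.zeros(len(ALL_DTMFS)), True
    # Normalize so the eight band energies sum to one.
    return bands / bands.sum(), False

# Usage: one 2048-sample frame containing the key '5' (770 Hz + 1336 Hz).
sr = 22050
t = np.arange(2048) / sr
frame = 0.5 * np.sin(2 * np.pi * 770 * t) + 0.5 * np.sin(2 * np.pi * 1336 * t)
feats, silent = band_features(frame, sr)
```

Because the FFT bins rarely fall exactly on the DTMF frequencies, the two active bands will generally not be exactly `0.5` each, which is why the band energy is described as approximate.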

### Decoding

To decode DTMF sequences, we can again use dynamic programming, this time applied to states rather than edits.
For DTMF sequences, consider a small, fully connected graph with 17 states: the digits 0-9, the letters A-D, \*, \#, and _silence_.
In the DP matrix, the rows denote the states and the columns represent time; we decode left to right (i.e. time-synchronously), and at each time step we have to find the best step forward.
The main difference to edit distances or DTW is that you may now also "go up" in the table, i.e. change state freely.
For the distance/similarity, use a template vector for each state that has `.5` in the two bins that need to be active.
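The state-based DP described above can be sketched as a Viterbi-style search. This is one possible formulation, not the reference solution: it assumes the features are the eight normalized band energies per frame, that silent frames are all-zero vectors matched by an all-zero silence template, and that changing state costs a fixed `switch_penalty` (a hypothetical parameter that discourages spurious jumps).

```python
import numpy as np

# Rows: 697/770/852/941 Hz, columns: 1209/1336/1477/1633 Hz.
KEYPAD = ["123A", "456B", "789C", "*0#D"]
SILENCE = "-"

def build_templates():
    """One 8-dim template per key (.5 in its two bands) plus a silence template."""
    labels, templates = [], []
    for r, row in enumerate(KEYPAD):
        for c, key in enumerate(row):
            vec = np.zeros(8)
            vec[r] = 0.5        # low-frequency band
            vec[4 + c] = 0.5    # high-frequency band
            labels.append(key)
            templates.append(vec)
    labels.append(SILENCE)
    templates.append(np.zeros(8))  # assumption: silent frames are all zeros
    return labels, np.array(templates)

def viterbi_decode(features, switch_penalty=0.2):
    """features: (n_frames, 8) array. Returns one state label per frame."""
    labels, templates = build_templates()
    # local[s, t]: Euclidean distance between template s and frame t.
    local = np.linalg.norm(templates[:, None, :] - features[None, :, :], axis=2)
    n_states, n_frames = local.shape
    cost = np.zeros((n_states, n_frames))   # rows: states, columns: time
    back = np.zeros((n_states, n_frames), dtype=int)
    cost[:, 0] = local[:, 0]
    # trans[p, s]: cost of moving from state p to state s (free if p == s).
    trans = switch_penalty * (1.0 - np.eye(n_states))
    for t in range(1, n_frames):
        total = cost[:, t - 1][:, None] + trans
        back[:, t] = np.argmin(total, axis=0)
        cost[:, t] = np.min(total, axis=0) + local[:, t]
    # Backtrack from the cheapest final state.
    state = int(np.argmin(cost[:, -1]))
    path = [state]
    for t in range(n_frames - 1, 0, -1):
        state = int(back[state, t])
        path.append(state)
    return "".join(labels[s] for s in reversed(path))

# Usage with idealized frames built directly from the templates.
labels, templates = build_templates()
frames = np.array([templates[labels.index(ch)] for ch in "44443-3-AAA"])
decoded_path = viterbi_decode(frames)
```

With a larger `switch_penalty`, short states such as a single silent frame between two identical keys get smoothed away, so the value trades robustness to noise against the ability to detect brief pauses.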

When decoding a sequence, the idea is that we remain in one state for as long as the key is pressed. After that, either the key is released (the spectral energy drops close to 0, so we are in a pause) or another key is pressed, which is a "direct" transition.
Thus, after backtracking, collapse runs of identical symbols and remove silence, e.g. `44443-3-AAA` becomes `433A`.
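The collapse step can be written compactly with `itertools.groupby`. This is a small sketch that assumes the backtracked path is a string of per-frame labels with `-` marking silence, as in the example above; the function name `collapse` is illustrative.

```python
from itertools import groupby

def collapse(path, silence="-"):
    """Merge runs of identical labels, then drop the silence label."""
    return "".join(label for label, _ in groupby(path) if label != silence)

print(collapse("44443-3-AAA"))  # 433A
```

Merging runs before removing silence matters: the two `3` frames are separated by a pause, so they count as two key presses rather than one.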

---

### Preparation

In [43]:
import librosa
import numpy as np
from typing import List, Tuple

In [44]:
### Notice: librosa defaults to a 22,050 Hz sample rate; adjust if needed!

DTMF_TONES = [
    ('1', 697, 1209), 
    ('2', 697, 1336), 
    ('3', 697, 1477), 
    ('A', 697, 1633),
    ('4', 770, 1209),
    ('5', 770, 1336),
    ('6', 770, 1477),
    ('B', 770, 1633),
    ('7', 852, 1209),
    ('8', 852, 1336),
    ('9', 852, 1477),
    ('C', 852, 1633),
    ('*', 941, 1209),
    ('0', 941, 1336),
    ('#', 941, 1477),
    ('D', 941, 1633)
]

### Implement the Decoding

In [None]:
### TODO:
### 1. familiarize yourself with librosa's stft to compute the power spectrum
### 2. extract the critical bands from the power spectrum (i.e. how much energy is in the DTMF-related freq bins?)
### 3. define template vectors representing the states (see DTMF_TONES)
### 4. for a new recording, extract critical bands and do DP to get the state sequence
### 5. backtrack & collapse

### Notice: you will need a couple of helper functions...
ALL_DTMFS = [697, 770, 852, 941, 1209, 1336, 1477, 1633]

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Despite the name, this is a Euclidean distance: smaller means more similar.
    return float(np.linalg.norm(b - a))

def decode_column(col: np.ndarray, dtmf_templates: list) -> str:
    best_label, best_score = None, float('inf')
    for label, template in dtmf_templates:
        score = similarity(col, template)
        if score < best_score:
            best_label, best_score = label, score
    return best_label

def decode(y: np.ndarray, sr: float):
    """
    Apply DTMF signal decoding.

    Arguments:
    y: Input signal.
    sr: Sample rate.

    Returns the decoded DTMF symbols as a string (silence removed).
    """
    ### YOUR CODE HERE
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    freqs = freqs[freqs <= 2000]  # Limit to 0 - 2000 Hz
    
    nearest_freqs = {f: np.argmin(np.abs(freqs - f)) for f in ALL_DTMFS}
    
    dtmf_templates = []
    for (label, low, high) in DTMF_TONES:
        template = np.zeros_like(freqs)
        template[nearest_freqs[low]] = 0.5
        template[nearest_freqs[high]] = 0.5
        dtmf_templates.append((label, template))
    
    silence_template = np.full(len(freqs), 1 / len(freqs))
    dtmf_templates.append((None, silence_template))
    
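    D = librosa.stft(y, n_fft=2048)
    D = np.abs(D[:len(freqs), :]) ** 2
    # Normalize each frame to sum to one; all-zero (silent) frames stay zero
    # instead of triggering a division by zero.
    totals = np.sum(D, axis=0, keepdims=True)
    D = np.divide(D, totals, out=np.zeros_like(D), where=totals > 0)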
    
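    decoded = [decode_column(D[:, i], dtmf_templates) for i in range(D.shape[1])]
    # Collapse runs of identical labels and drop silence (None). Track the previous
    # label explicitly: decoded[i-1] would wrap around to the last frame at i == 0.
    result = []
    previous = None
    for label in decoded:
        if label != previous and label is not None:
            result.append(label)
        previous = label
    return ''.join(result)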
    ### END YOUR CODE

### Test the Decoding

In [51]:
SR = 22050
TEST_FILE = "data/audiocheck.net_dtmf_112163_112196_11#9632_##9696.wav"

y, sr = librosa.load(TEST_FILE, sr=SR)
print(decode(y=y, sr=sr))

11216311219611#9632##9696


