## Some technical notes about audio parameters

- The sampled signal is encoded in Linear Pulse Code Modulation (LPCM).
- The signal is stereo (`nchannels=2`), but only the left channel is used.
- Each sample is encoded with 16 bits (2 bytes). The native data type is `int16`, which can store values in the [range from](https://www.mathworks.com/help/matlab/ref/audioread.html) `-32768` to `+32767`.
- The data is converted to `float` for numeric precision and because a `Python` floating-point number [is usually implemented as](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) a `double` in `C`, which is convenient.
- Each frame of the raw data is normalized by its $\ell_2$ norm.
- The original sampling rate is $44.1\;kHz$. Each recording is downsampled by a factor of 2 into two different signals (the even-indexed and the odd-indexed samples), each with a sampling rate of $F_s = 22.05\;kHz$.
- The audio dataset comprises five classes (the spoken Portuguese commands "avançar" (forward), "recuar" (back), "parar" (stop), "direita" (right), and "esquerda" (left)), each with 10 recordings, totaling 50 files. With the downsampling, there are 20 recordings per class. Since each `.wav` file is stereo (`nchannels=2`), the number of audio recordings per class increases to 40. A discrete-time signal is extracted from each of these recordings and converted to an $N_s$-dimensional vector, where $N_s$ is the number of samples of the signal.
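
The preprocessing steps listed above can be sketched on a small synthetic stereo buffer. This is only an illustration under stated assumptions: the helper name `split_recording` and the array `stereo` are hypothetical stand-ins for the data that `scipy.io.wavfile.read` would return, and only NumPy is required.

```python
import numpy as np

def split_recording(stereo, F_s):
    """Split an (N_s, 2) int16 stereo array into four mono float signals.

    Hypothetical helper: it mirrors the int16 -> float conversion, the
    even/odd downsampling, and the channel split described in the notes.
    """
    x = stereo.astype(float)              # int16 -> float64
    even, odd = x[0::2, :], x[1::2, :]    # two signals at half the sampling rate
    signals = {
        '1a': even[:, 0], '1b': even[:, 1],   # even samples, channels a and b
        '2a': odd[:, 0], '2b': odd[:, 1],     # odd samples, channels a and b
    }
    return signals, F_s / 2

# synthetic example (not real audio): 8 stereo samples at 44.1 kHz
stereo = np.arange(16, dtype=np.int16).reshape(8, 2)
signals, F_s_down = split_recording(stereo, 44100)

# l2 normalization of one block of samples (here the whole toy signal)
frame = signals['1a'] / np.linalg.norm(signals['1a'])
```

Each recording therefore yields four signals, which is how 10 files per class become 40 signals per class.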

## Some notes about the LPC (linear predictive coding) and the Yule-Walker algorithms

- The AR(p) model is implemented for `p=10`, `p=15`, and `p=20`.
- A single recording is divided into 31 non-overlapping frames. The number of samples per frame, $N_f$, and the number of samples between consecutive frames, $N_{gap}$, are given by
    $$ N_f = \frac{T_{sig}T_{f_{min}} F_s}{T_{min}} $$
    and
    $$ N_{gap} = \frac{T_{sig} F_s-31N_f}{30},$$
    where $T_{sig}$ is the signal duration, $T_{min}$ is the minimum signal duration in the dataset, and $T_{f_{min}} \triangleq 15\;ms$ is the minimum frame duration. All durations are expressed in seconds, and $F_s$ is the sampling rate (in Hz) after downsampling.
<!-- - The Yule-Walker equation is applied to each of the 31 frames produced from a single audio recording. Being $\mathbf{a}_{i,j} \in \mathbb{R}^{p}$ the $j$-th vector of the $i$-th audio recording, the matrix containing all coefficients of the AR(p) of the $i$-th audio recording is
$$\mathbf{A}_i = \begin{bmatrix}
\mathbf{a}_{i,1} & \mathbf{a}_{i,2} & \cdots & \mathbf{a}_{i,31}
\end{bmatrix} \in \mathbb{R}^{p\times 31}
$$ -->
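
As a worked example of the two formulas above, the following sketch computes $N_f$, $N_{gap}$, and the first sample of each frame. The function name `frame_layout` and the numbers used (a 1 s signal whose dataset minimum duration is 0.5 s) are hypothetical and chosen only for illustration.

```python
from math import floor

def frame_layout(N_s, F_s, T_min, T_f_min=15e-3, n_frames=31):
    """Return (N_f, N_gap, starts) for one downsampled signal.

    N_s: number of samples, F_s: sampling rate in Hz,
    T_min: shortest signal duration of the dataset in seconds.
    """
    T_sig = N_s / F_s                                       # signal duration in seconds
    N_f = floor(T_sig * T_f_min * F_s / T_min)              # samples per frame
    N_gap = floor((N_s - n_frames * N_f) / (n_frames - 1))  # samples between frames
    # frame i covers samples starts[i] ... starts[i] + N_f - 1
    starts = [i * (N_f + N_gap) for i in range(n_frames)]
    return N_f, N_gap, starts

# hypothetical values: 1 s of audio at 22.05 kHz, shortest recording of 0.5 s
N_f, N_gap, starts = frame_layout(N_s=22050, F_s=22050, T_min=0.5)
```

Because $N_f$ scales with $T_{sig}/T_{min}$, longer recordings get proportionally longer frames, so every recording produces the same number of frames.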

---

> 1. Load the audio files and downsample the signals of each channel in order to generate the training and test datasets.

### Initializing

In [9]:
from numpy import multiply, sum, matmul, inf, empty, concatenate, zeros, array
from statsmodels.regression.linear_model import yule_walker
from scipy.io import wavfile
from scipy.linalg import toeplitz
from math import floor
from numpy.linalg import norm, cond, matrix_rank as rank, inv
from warnings import warn
from os import listdir

# train/test set split
n_train, n_test = 8, 2
# AR(p) model order -> p = 10, 15, 20
all_p = range(10,21,5)
# AR(p) coefficients of every frame, keyed by command, file number, model order p, and signal id
all_a = {f'{command}_file{file_number}_p{p}_s{signal}': array([]) for p in all_p for command in ('avancar', 'esquerda', 'direita', 'parar', 'recuar') for file_number in range(1,11) for signal in ('1a', '1b', '2a', '2b')}

def get_T_min(root_dir):
    T_min = inf
    for file_name in listdir(root_dir):
        F_s, s_n = wavfile.read(root_dir+file_name)
        # signal duration
        T_sig = s_n[:,0].size * (1/F_s)
        if T_sig < T_min:
            T_min = T_sig
    return T_min

# minimum audio duration of the dataset
T_min = get_T_min('./Audio_files_TCC_Jefferson/')
# minimum frame duration, 15ms (user defined)
T_f_min = 15e-3

### LPC and Yule-Walker algorithm

In [11]:
for p in all_p:
    for command in ('avancar', 'direita', 'esquerda', 'parar', 'recuar'):
        # training set
        for file_number in range(1,n_train+1):
            file_name = f'./Audio_files_TCC_Jefferson/comando_{command}_{file_number:0>2d}.wav'
            # input audio vector, s_n -> [s[0], s[1], ..., s[N_s-1]]
            F_s, s_n = wavfile.read(file_name)
            # Number of samples
            N_s = s_n[:,0].size
            # signal duration
            T_sig = N_s/F_s
            # convert from int16 to float type
            s_n = s_n.astype(float)
            # downsampling: generate s0_n (even samples) and s1_n (odd samples) from s_n
            s0_n, s1_n = s_n[0::2,:], s_n[1::2,:]
            N_s //= 2
            F_s /= 2
            # number of samples per frame
            N_f = floor(T_sig*T_f_min*F_s/T_min)
            # number of samples between each frame (gap)
            N_gap = floor((N_s - 31*N_f)/30)
            # get channel a and channel b
            s0a_n, s0b_n, s1a_n, s1b_n = s0_n[:,0], s0_n[:,1], s1_n[:,0], s1_n[:,1]

            # for each of the 4 signals from a single recording: channel a and b, samples even and odd
            for s, signal_id in zip((s0a_n, s0b_n, s1a_n, s1b_n), ('1a', '1b', '2a', '2b')):
                # for each frame
                for i in range(31):
                    # first sample of the i-th frame (assumes 31*N_f <= N_s, i.e., N_gap >= 0)
                    n0 = i*(N_f + N_gap)
                    # s_n0 -> [s[n0], s[n0+1], ..., s[n0+N_f-1]], being n0\in\mathbb{N}
                    s_n0 = s[n0:n0+N_f]
                    # normalize the frame by its l2 norm (out of place, so that s is left untouched)
                    s_n0 = s_n0/norm(s_n0)
                    # compute the autocorrelation function, r_k -> r[k] -> [r[0], r[1], ..., r[p]]
                    r_k = empty(p+1)
                    for k in range(p+1):
                        # s_n0_minus_k -> [0, 0, ..., 0(k times), s[n0], s[n0+1], ..., s[n0+N_f-1-k]]
                        s_n0_minus_k = concatenate((zeros(k), s_n0[:s_n0.size-k]))
                        r_k[k] = sum(multiply(s_n0, s_n0_minus_k))
                    # autocorrelation matrix
                    # r_k[:p] -> [r[0], r[1], ..., r[p-1]]
                    R = toeplitz(r_k[:p])
                    # autocorrelation vector
                    # r -> [r[1], r[2], ..., r[p]]
                    r = r_k[1:]
                    if rank(R) == R.shape[0]:
                        if cond(R) > 1e3:
                            warn(f'The autocorrelation matrix of the audio {file_name} is ill-conditioned! The results are suspect!')
                        # Yule-Walker equation
                        a = matmul(inv(R), r)
                        # library implementation, for comparison purposes
                        a_hat, _ = yule_walker(s_n0, order=p)
                        all_a[f'{command}_file{file_number}_p{p}_s{signal_id}'] = concatenate((all_a[f'{command}_file{file_number}_p{p}_s{signal_id}'], a))
                    else:
                        warn(f'The autocorrelation matrix of the audio {file_name} is rank-deficient; skipping this frame.')
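
The Yule-Walker step can be checked in isolation on a synthetic signal. The sketch below is an assumption-laden illustration rather than the notebook's exact code: the function name `yule_walker_biased` and the AR(2) coefficients in `a_true` are hypothetical, and `numpy.linalg.solve` is swapped in for the explicit `inv(R)` product used in the cell above.

```python
import numpy as np
from scipy.linalg import toeplitz

def yule_walker_biased(x, p):
    """Estimate AR(p) coefficients from the autocorrelation sums of x."""
    N = x.size
    # r[k] = sum_{n=k}^{N-1} x[n] * x[n-k], for k = 0, ..., p
    r = np.array([np.dot(x[k:], x[:N - k]) for k in range(p + 1)])
    R = toeplitz(r[:p])                  # p x p autocorrelation matrix
    return np.linalg.solve(R, r[1:])     # solves R a = [r[1], ..., r[p]]

# synthetic AR(2) process with known (hypothetical) coefficients
rng = np.random.default_rng(0)
a_true = np.array([0.5, -0.3])
w = rng.standard_normal(20000)           # white-noise excitation
x = np.zeros(w.size)
for n in range(2, x.size):
    x[n] = a_true[0] * x[n - 1] + a_true[1] * x[n - 2] + w[n]

a_est = yule_walker_biased(x, p=2)       # should be close to a_true
```

Scaling `x` by a constant scales both `R` and `r` by the same factor, so the estimated coefficients do not depend on the $\ell_2$ normalization of the frame. Solving the linear system directly is generally preferred over forming the inverse when `R` is ill-conditioned.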
