# Analysis, transformation and synthesis of audio signals using the STFT

## Reminders

### Definition

Let $x(t) \in \mathbb{R}$ be a signal defined in the time domain $t \in \mathbb{Z}$ with support $\{0,...,T-1\}$. 

The short-time Fourier transform (STFT) is defined by:
$$ X(f,n) = \sum_{t \in \mathbb{Z}} x_n(t) \exp\left(-j2\pi \frac{ft}{F}\right), \qquad \forall (f,n) \in \{0,...,F-1\} \times \mathbb{Z}, \qquad (1)$$
where
$$ x_n(t) =  x(t + nH) w(t), \qquad (2)$$
and

- $w(t)$ is a window of support $\{0,...,L-1\}$;
- $H$ is the analysis hop size with $L/H \in \mathbb{N}$;
- $F$ is the order of the discrete Fourier transform (we will choose $F=L$).

The inverse STFT is defined by:
$$ \hat{x}(t) = \sum_{n \in \mathbb{Z} } w(t-nH) \hat{x}_n(t-nH), \qquad \forall t \in \mathbb{Z}, \qquad (3)$$ 
where
$$ \hat{x}_n(t) = \frac{1}{F}\sum_{f=0}^{F-1} X(f,n) \exp\left(+ j2\pi \frac{ft}{F}\right). \qquad (4) $$

### Important properties

1. Hermitian symmetry: $X(F-f,n) = X^*(f,n)$, which is inherited from the DFT of a real-valued signal.

2. Perfect reconstruction (i.e. $\hat{x}(t) = x(t)$) is ensured provided that the window $w(t)$ satisfies:

$$ \sum_{n \in \mathbb{Z} } w^2(t-nH) = 1. \qquad (5)$$
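
Property 1 can be checked numerically on a single frame. The snippet below is only a sketch: the frame length and the random real-valued frame are arbitrary choices made for illustration, not values used elsewhere in this notebook.

```python
import numpy as np

L_demo = 8                                # small frame length, chosen for illustration
rng = np.random.default_rng(0)
frame = rng.standard_normal(L_demo)       # a real-valued frame, standing in for x_n(t)

X_frame = np.fft.fft(frame)               # DFT of order F = L_demo, as in equation (1)

# Hermitian symmetry: X(F-f) = conj(X(f)) for f = 1, ..., F-1
f = np.arange(1, L_demo)
hermitian_ok = np.allclose(X_frame[L_demo - f], np.conj(X_frame[f]))
print(hermitian_ok)
```

Because of this symmetry, only the bins from $0$ to $F/2$ carry independent information when the signal is real.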

---

This is not a signal processing course, so proving these properties is left as homework, with the following hint: $\displaystyle \frac{1}{F} \sum_{f=0}^{F-1} \exp\left( -j 2 \pi \frac{ft}{F} \right) = 1$ if $t=0$, and $0$ for any other integer $t$ with $|t| < F$. For $t \neq 0$ this is the sum of the $F$-th roots of unity raised to the power $t$, and the result follows by recognizing the sum of successive terms of a geometric sequence.

---
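
The hint can also be checked numerically before attempting the proof. This is a minimal sketch; the helper name `mean_of_roots` and the order `F_demo = 8` are hypothetical choices made for the example.

```python
import numpy as np

def mean_of_roots(t, F):
    """Return (1/F) * sum over f = 0..F-1 of exp(-j*2*pi*f*t/F)."""
    f = np.arange(F)
    return np.exp(-2j * np.pi * f * t / F).mean()

F_demo = 8  # DFT order, chosen for illustration

# Evaluate the normalized sum for every lag t with |t| < F_demo
values = {t: mean_of_roots(t, F_demo) for t in range(-F_demo + 1, F_demo)}

for t, v in values.items():
    print(t, np.round(np.abs(v), 12))
```

The sum is periodic in $t$ with period $F$, so it equals $1$ again at $t = \pm F$; this case does not arise inside a single frame because the window has support $\{0,...,L-1\}$ with $L = F$.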

## Outline of the notebook

* #### <a href='#II.1'>1. Perfect reconstruction</a> 

    The objective is to verify that the perfect reconstruction condition is satisfied by the sine window.

* #### <a href='#II.2'>2. STFT implementation</a> 

    The objective is to implement the STFT and observe the power spectrogram for different sounds and parameters.

* #### <a href='#II.3'>3. Inverse STFT implementation</a> 

    The objective is to implement the inverse STFT and observe that we indeed have perfect reconstruction.
    
* #### <a href='#II.4'>4. Analysis, transformation and synthesis of audio signals</a> 

    The objective is to transform an audio signal in the STFT domain.

Let us first import packages and define some important parameters.

In [None]:
import matplotlib
# matplotlib.use('Qt4Agg') # if problem with PyQt5

import numpy as np
import soundfile as sf 
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fs = 16000 # sampling rate
wlen_sec = 32e-3 # STFT window length in seconds
hop_percent = 0.5  # hop size as a percentage of the window length
L = wlen_sec*fs # window length in samples
L = int(np.power(2, np.ceil(np.log2(L)))) # round up to the next power of 2 to speed up the FFT (np.int was removed from NumPy)
H = int(hop_percent*L) # hop size in samples
nfft = L # number of points (i.e. order) of the discrete Fourier transform
F = L//2 + 1  # number of non-negative frequency bins kept (0 Hz to Nyquist); the DFT order is nfft

<a id='II.1'></a>

### 1. Perfect reconstruction

We define the following analysis/synthesis window:
$$ w(t) = \sin\left( \frac{\pi}{L} \left(t + \frac{1}{2}\right) \right), \qquad t = 0,..., L-1. \qquad (6)$$

In [None]:
win = np.sin(np.arange(.5,L-.5+1)/L*np.pi) # sine analysis/synthesis window, equation (6)
time_axis = np.arange(0, L)/fs
plt.plot(time_axis, win)
plt.xlabel('time (s)')
plt.ylabel('amplitude')
plt.title('sine window')

We verify that the perfect reconstruction condition (5) is satisfied with the sine window. We define an arbitrary number of time frames $N = 10$ for the overlap-add operation, such that $n=0,...,N-1$, and we compute the overlap-add in (5).

In [None]:
N = 10
ola = np.zeros((N-1)*H+L) # array to save the result of the overlap-add

for n in np.arange(N):
    ind_beg = n*H
    ind_end = ind_beg+L
    ola[ind_beg:ind_end] += win**2
    
plt.plot(ola)

We indeed observe that, except at the edges, relation (5) holds. In practice, when computing the STFT, we will do some preprocessing (zero-padding) to deal with the edges. When computing the inverse STFT, we will apply the corresponding postprocessing (removing the first and last coefficients) to ensure perfect reconstruction. Note that we could also simply divide by the overlap-add of the squared sine window after computing the inverse STFT.
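
The visual check can be complemented by a numerical one. With $H = L/2$, the shifted window satisfies $w(t+H) = \cos\left(\frac{\pi}{L}\left(t+\frac{1}{2}\right)\right)$, so two overlapping squared windows sum to $\sin^2 + \cos^2 = 1$. The sketch below recomputes the overlap-add with the values assumed above ($L = 512$ samples, i.e. 32 ms at 16 kHz, and 50 % overlap) and tests the interior samples only.

```python
import numpy as np

L_demo = 512            # window length in samples (32 ms at 16 kHz, as assumed above)
H_demo = L_demo // 2    # hop size for 50 % overlap
N_demo = 10             # number of frames in the overlap-add

# Sine window of equation (6)
win_demo = np.sin(np.pi * (np.arange(L_demo) + 0.5) / L_demo)

# Overlap-add of the squared window, as in equation (5)
ola_demo = np.zeros((N_demo - 1) * H_demo + L_demo)
for n in range(N_demo):
    ola_demo[n * H_demo : n * H_demo + L_demo] += win_demo**2

# The first and last H_demo samples are covered by a single frame only
interior = ola_demo[H_demo:-H_demo]
interior_ok = np.allclose(interior, 1.0)
edge_max_dev = np.max(np.abs(ola_demo[:H_demo] - 1.0))
print(interior_ok, edge_max_dev)
```

The deviation at the edges is what the zero-padding in the preprocessing step is meant to absorb.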

<a id='II.2'></a>
### 2. STFT implementation

Let's first load a signal.

In [None]:
from preprocessing import preprocessing
import IPython.display as ipd

wavfile = './data/piano_drums_mix.wav'

x, fs = sf.read(wavfile)
if len(x.shape)>1:
    x = x[:,0]
x = x/np.max(np.abs(x))
x = preprocessing(x, L, H) # some preprocessing to deal with edges and ensure perfect reconstruction
T = x.shape[0]

time_axis = np.arange(0, T)/fs
plt.figure(figsize=(10,3))
plt.plot(time_axis, x)
plt.xlabel('time (s)')
plt.ylabel('amplitude')
plt.title('waveform')

ipd.Audio(x, rate=fs) 

### Exercise

Complete the following cell to implement the STFT. 

You have to:

- Loop over the number of frames $n = 0, ..., N-1$.
- At each time frame, select the appropriate portion of the signal, and multiply it by the sine window as in equation (2). You should end up with an array of dimension $L$.
- Compute the discrete Fourier transform (DFT) (over $L$ points) of the frame (use np.fft.fft) as in equation (1).
- Exploiting the Hermitian symmetry property, discard the redundant part of the spectrum by keeping only its first part (up to the Nyquist frequency, included). You should end up with an array of dimension $L/2 + 1$ (the variable `F` in the code).
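
One possible way to sanity-check a single frame of your implementation is to compare the truncated output of `np.fft.fft` with `np.fft.rfft`, which returns the bins from 0 Hz to Nyquist directly for real input. The sketch below uses a synthetic random signal and an arbitrary frame index; all names ending in `_demo` are assumptions made for this example and are not part of the exercise cell.

```python
import numpy as np

L_demo = 512                                   # frame length (assumed, as above)
H_demo = L_demo // 2                           # hop size
n_bins = L_demo // 2 + 1                       # bins from 0 Hz up to Nyquist, included

rng = np.random.default_rng(0)
x_demo = rng.standard_normal(4 * L_demo)       # synthetic stand-in for the audio signal
win_demo = np.sin(np.pi * (np.arange(L_demo) + 0.5) / L_demo)  # sine window, equation (6)

n = 3                                          # an arbitrary frame index
frame = x_demo[n * H_demo : n * H_demo + L_demo] * win_demo    # equation (2)

full = np.fft.fft(frame)                       # L_demo complex coefficients, equation (1)
half = full[:n_bins]                           # keep bins 0, ..., L_demo/2
reference = np.fft.rfft(frame)                 # real-input FFT, same bins

print(half.shape, np.allclose(half, reference))
```

If both arrays agree for a few frame indices, the windowing and truncation steps of the loop are likely correct.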

In [None]:
N = int( (T-L)/H ) + 1 # number of time frames

X = np.zeros( (F, N), dtype=np.complex128 ) # np.cfloat was removed in NumPy 2.0

for n in np.arange(N):
#### TO COMPLETE ####
    pass

Below, we display the resulting power spectrogram.

In [None]:
X2_dB = 10*np.log10(np.abs(X)**2 + 1e-5) # power spectrogram in dB (small offset avoids log of zero)

plt.figure(figsize=(10,7))
plt.imshow(X2_dB, origin='lower',  aspect='auto', cmap='magma', extent=[0, (N-1)*H/fs, 0, fs/2])

plt.clim(vmin=-50, vmax=None)
plt.colorbar()   
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.title('power spectrogram')

### Exercise

1. How would you describe the difference between the spectrograms of the piano and drums signals (```piano_scale.wav``` and ```drums.wav``` in the ```data``` directory)?

2. What is the effect of the analysis window length?

3. Would you choose the same window length for computing the spectrograms of the drums and piano signals?

4. Load ```piano_drums_mix.wav``` and look at its spectrogram. We would like to decompose this spectrogram into the sum of two spectrograms, one for the piano, one for the drums. What properties would you like to enforce in each of these two spectrograms?

<a id='II.3'></a>
### 3. Inverse STFT implementation

### Exercise

Complete the following cell to implement the inverse STFT. 

You have to:

- Loop over the number of frames $n = 0, ..., N-1$.
- For each frame, restore the Hermitian symmetry of the spectrum. You should end up with an array of dimension $L$.
- Compute the inverse DFT (use np.fft.ifft) as in equation (4).
- Compute the overlap-add, as in equation (3), similarly to what was done to compute the `ola` array in Section 1.
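
The symmetry-restoration step can be tried in isolation on one frame. This is a sketch under the same assumptions as before (frame length 512, random real frame); it rebuilds the $L$-point spectrum from the $L/2+1$ kept bins and compares the result with `np.fft.irfft`, which performs the same reconstruction internally.

```python
import numpy as np

L_demo = 512
rng = np.random.default_rng(0)
frame = rng.standard_normal(L_demo)        # real-valued frame, for illustration
half = np.fft.rfft(frame)                  # bins 0, ..., L_demo/2

# Bins L_demo/2+1, ..., L_demo-1 are the conjugates of bins L_demo/2-1, ..., 1
full = np.concatenate([half, np.conj(half[-2:0:-1])])

frame_hat = np.fft.ifft(full).real         # imaginary part is numerical noise only
frame_ref = np.fft.irfft(half, n=L_demo)   # NumPy's inverse real FFT

print(full.shape, np.allclose(frame_hat, frame), np.allclose(frame_ref, frame))
```

Note that the DC and Nyquist bins are not duplicated, which is why the mirrored slice skips the first and last elements of `half`.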

In [None]:
x_hat = np.zeros((N-1)*H + L)

for n in np.arange(N):
#### TO COMPLETE ####
    pass

Plot the reconstruction error in the next cell.
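
Besides plotting, the error can be summarized by two numbers. The helper below is a hypothetical sketch (the name `reconstruction_error` is not defined elsewhere in this notebook); it is demonstrated on synthetic data so that it runs on its own.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Return the maximum absolute error and the signal-to-error ratio in dB."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    if x.shape != x_hat.shape:
        raise ValueError("x and x_hat must have the same shape")
    err = x - x_hat
    max_abs = np.max(np.abs(err))
    # np.finfo(float).tiny avoids a division by zero for an exact reconstruction
    ser_db = 10 * np.log10(np.sum(x**2) / (np.sum(err**2) + np.finfo(float).tiny))
    return max_abs, ser_db

# Synthetic demonstration: a signal and a copy perturbed at the 1e-10 level
rng = np.random.default_rng(0)
x_demo = rng.standard_normal(16000)
x_hat_demo = x_demo + 1e-10 * rng.standard_normal(16000)

max_abs, ser_db = reconstruction_error(x_demo, x_hat_demo)
print(max_abs, ser_db)
```

With the arrays of this notebook, one would call it as `reconstruction_error(x[:len(x_hat)], x_hat)`, assuming `x_hat` is not longer than `x`; an error close to machine precision indicates perfect reconstruction.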

<a id='II.4'></a>
### 4. Analysis-transformation-synthesis

You are a DJ looking for awesome new audio effects for the post-COVID party you are preparing. You have this great idea of filtering the following piece of acid techno music with an oscillating Gaussian in the STFT domain (yes, you are also quite familiar with signal processing).

In [None]:
x, fs = sf.read('data/acid.wav')
if x.ndim > 1:
    x = x[:,0] # keep the first channel if the file is not mono
T = x.shape[0]
ipd.Audio(x, rate=fs) 

Now that you are an expert on the STFT, you are allowed to use a Python library to compute it. Look for the corresponding function in the [librosa](https://librosa.org/doc/latest/index.html) library and use it in the next cell to compute the STFT of the audio file loaded above.

In [None]:
import librosa

#### TO COMPLETE ####
X = None
####################

N = X.shape[1]

X2_dB = 10*np.log10(np.abs(X)**2 + 1e-5) # power spectrogram in dB (small offset avoids log of zero)

plt.figure(figsize=(10,7))
plt.imshow(X2_dB, origin='lower',  aspect='auto', cmap='magma', extent=[0, (N-1)*H/fs, 0, fs/2])

plt.clim(vmin=-50, vmax=None)
plt.colorbar()   
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.title('power spectrogram')

In the following cell, we design the oscillating Gaussian filter, with a length of 10 seconds.

In [None]:
T_filt = 10*fs
N_filt = int( (T_filt-L)/H ) + 1 # number of time frames

t = np.linspace(0, T_filt/fs, N_filt) # vector of time indices in seconds
osc_freq = 0.5 # frequency of the oscillation in Hz
center_frequency = 1500 + 500 * np.sin(2*np.pi*osc_freq*t) # center frequency of the band-pass filter in Hz
width = 2000.0 # width of the Gaussian in Hz
# generates a Gaussian over the frequency axis with a given center and width
gauss = lambda x, mu: 2.0 * np.pi * width**-2.0 * np.exp(- ((x - mu) / width)**2.0) 

frequencies = np.linspace(0, fs/2, F) # vector of frequency indices in Hz
TF_filter = np.array([gauss(frequencies, cf) for cf in center_frequency]).T # time-frequency magnitude filter
TF_filter /= TF_filter.max(axis=0, keepdims=True) # normalize to be between 0 and 1

plt.figure(figsize=(10,7))
plt.imshow(TF_filter, origin='lower',  aspect='auto', cmap='magma', extent=[0, (N_filt-1)*H/fs, 0, fs/2])

plt.colorbar()   
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.title('time-frequency filter')

### Exercise

Apply the filter to a portion of your choice of the original signal.
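
Applying a magnitude filter in the STFT domain amounts to an element-wise product over the selected frames. The following is a sketch on synthetic arrays only: the sizes, the random mask standing in for `TF_filter`, and the starting frame `n_start` are all hypothetical values chosen for illustration.

```python
import numpy as np

n_bins, n_frames, n_filt = 257, 100, 40     # illustrative sizes only
rng = np.random.default_rng(0)

# Synthetic complex STFT and a real mask with values in [0, 1]
X_demo = rng.standard_normal((n_bins, n_frames)) + 1j * rng.standard_normal((n_bins, n_frames))
mask = rng.uniform(0.0, 1.0, size=(n_bins, n_filt))   # stand-in for TF_filter

n_start = 20                                # hypothetical first filtered frame
assert n_start + n_filt <= n_frames         # the filtered portion must fit in the STFT

X_filt_demo = X_demo.copy()                 # frames outside the portion stay untouched
X_filt_demo[:, n_start:n_start + n_filt] *= mask

print(X_filt_demo.shape)
```

Since the mask is real and non-negative, it only scales the magnitude of each time-frequency bin and leaves the phase unchanged. The mask must have as many rows as the STFT has frequency bins.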

In [None]:
#### TO COMPLETE ####
X_filt = None
####################

X2_filt_dB = 10*np.log10(np.abs(X_filt)**2 + 1e-5) # power spectrogram in dB

plt.figure(figsize=(10,7))
plt.imshow(X2_filt_dB, origin='lower',  aspect='auto', cmap='magma', extent=[0, (N-1)*H/fs, 0, fs/2])

plt.clim(vmin=-50, vmax=None)
plt.colorbar()   
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.title('power spectrogram')

Look for the inverse STFT function in librosa, and use it in the next cell to reconstruct a time-domain signal.

In [None]:
#### TO COMPLETE ####
x_filt = None
####################

Plot and listen to the resulting audio signal.

In [None]:
time_axis = np.arange(0, x_filt.shape[0])/fs # the reconstructed signal may not have exactly T samples
plt.figure(figsize=(10,3))
plt.plot(time_axis, x_filt)
plt.xlabel('time (s)')
plt.ylabel('amplitude')
plt.title('waveform')

ipd.Audio(x_filt, rate=fs) 