### Seminar: Spectrogram Madness

![img](https://github.com/yandexdataschool/speech_course/raw/main/week_02/stft-scheme.jpg)

#### Today you're finally gonna deal with speech! We'll walk you through all the main steps of the speech processing pipeline and you'll get to do voice-warping. It's gonna be fun! ...and creepy. Very creepy.

In [None]:
from IPython.display import display, Audio
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import librosa

display(Audio("sample1.wav"))
display(Audio("sample2.wav"))
display(Audio("welcome.wav"))

In [None]:
# librosa resamples to 22050 Hz by default
amplitudes, sample_rate = librosa.load("sample1.wav")

display(Audio(amplitudes, rate=sample_rate))
print(sample_rate)

print("Length: {} seconds at sample rate {}".format(amplitudes.shape[0] / sample_rate, sample_rate))
plt.figure(figsize=[16, 4])
plt.title("First 10^4 out of {} amplitudes".format(len(amplitudes)))
plt.plot(amplitudes[:10000]);

### Task 1: Mel-Spectrogram (5 points)

As you can see, the amplitudes follow periodic patterns with different frequencies. However, it is very difficult to process these amplitudes directly because there are so many of them! Loaded with librosa's default settings, audio has 22050 amplitudes per second, which is already far above the typical sequence length in NLP applications. Hence, we need to compress this information into something manageable.

A typical solution is to use a __spectrogram:__ instead of saving thousands of amplitudes, we can perform a Fourier transform to find which frequencies are prevalent at each point in time. More formally, a spectrogram applies the [Short-Time Fourier Transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) to small overlapping windows of the amplitude time-series:


<img src="https://www.researchgate.net/profile/Phillip_Lobel/publication/267827408/figure/fig2/AS:295457826852866@1447454043380/Spectrograms-and-Oscillograms-This-is-an-oscillogram-and-spectrogram-of-the-boatwhistle.png" width="480px">
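
To make "which frequencies are prevalent at each point in time" concrete, here is a minimal sketch on a synthetic signal rather than on the seminar recordings. The two tone frequencies, `toy_sr`, and the window and hop sizes are illustrative assumptions, and the frames are cut without padding to keep the code short.

```python
import numpy as np

toy_sr = 22050          # samples per second (assumed, same as the rate discussed above)
window_length = 2048    # samples per analysis window (illustrative)
hop_length = 512        # step between consecutive windows (illustrative)

# Toy signal: one second of a 440 Hz tone followed by one second of an 880 Hz tone
t = np.arange(toy_sr) / toy_sr
toy_signal = np.concatenate([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t)])

# Hann window written out explicitly: large in the middle, small at the edges
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(window_length) / window_length)

# Slice into overlapping frames and apply one FFT per frame
starts = np.arange(0, len(toy_signal) - window_length + 1, hop_length)
frames = np.stack([toy_signal[s:s + window_length] * window for s in starts])
toy_stft = np.fft.rfft(frames, axis=1)   # complex [num_frames, window_length // 2 + 1]

# Frequency (in Hz) of the strongest bin in every frame
bin_freqs = np.fft.rfftfreq(window_length, d=1.0 / toy_sr)
dominant = bin_freqs[np.abs(toy_stft).argmax(axis=1)]

# Close to 440 and 880, up to the bin width of toy_sr / window_length Hz
print(dominant[0], dominant[-1])
```

The dominant frequency switches from the first tone to the second as the frames move along the signal, which is exactly the time-frequency picture a spectrogram draws.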

However, a raw spectrogram has many frequency bins (1025 for a 2048-sample window) and may contain extraordinarily large numbers that can break neural networks. Therefore the standard approach is to convert the spectrogram into a __mel-spectrogram__ by projecting its frequency axis onto the [mel scale](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum).

Hence, the algorithm to compute the mel-spectrogram of amplitudes $y$ becomes:
1. Compute the Short-Time Fourier Transform (STFT): apply the Fourier transform to overlapping windows
2. Build a spectrogram: $S_{ij} = |\text{STFT}(y)_{ij}|^2$
3. Convert the spectrogram to the Mel basis
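
Before implementing these steps, it can help to know what shapes to expect. The sketch below only does the bookkeeping, assuming reflect padding of `window_length // 2` on each side (as the framing helper in this notebook does) and illustrative sizes; `toy_len` is a hypothetical signal length, not one of the seminar files.

```python
import numpy as np

toy_len, window_length, hop_length, n_mels = 22050, 2048, 512, 128  # illustrative sizes

# Reflect-pad so that frame k is centered on sample k * hop_length
padded = np.pad(np.zeros(toy_len), window_length // 2, mode='reflect')

# Number of full windows that fit into the padded signal
num_frames = 1 + (len(padded) - window_length) // hop_length

# A real-input FFT keeps only the non-negative frequencies
num_bins = window_length // 2 + 1

print("STFT / spectrogram shape:", (num_bins, num_frames))
print("mel-spectrogram shape:   ", (n_mels, num_frames))
```

With this padding the frame count simplifies to `toy_len // hop_length + 1`, which is the value the shape assertions later in the notebook check for.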

In [None]:
# Some helpers:
# 1. slice time-series into overlapping windows
def slice_into_frames(amplitudes, window_length, hop_length):
    return librosa.util.frame(
        np.pad(amplitudes, int(window_length // 2), mode='reflect'),
        frame_length=window_length, hop_length=hop_length)
    # output shape: [window_length, num_windows]

dummy_amps = amplitudes[2048: 6144]
dummy_frames = slice_into_frames(dummy_amps, 2048, 512)
print(dummy_frames.shape)  # [window_length, num_windows]

plt.figure(figsize=[16, 4])
plt.subplot(121, title='Whole audio sequence', ylim=[-3, 3])
plt.plot(dummy_amps)

plt.subplot(122, title='Overlapping frames', yticks=[])
for i, frame in enumerate(dummy_frames.T):
    plt.plot(frame + 10 - i);

In [None]:
# 2. Weights for window transform. Before performing FFT you can scale amplitudes by a set of weights
# The weights we're gonna use are large in the middle of the window and small on the sides
dummy_window_length = 3000
dummy_weights_window = librosa.filters.get_window('hann', dummy_window_length, fftbins=True)
plt.plot(dummy_weights_window); plt.plot([1500, 1500], [0, 1.1], label='window center'); plt.legend()

In [None]:
# 3. Fast Fourier Transform in Numpy. Note: this function can process several inputs at once (mind the axis!)
dummy_fft = np.fft.rfft(dummy_amps[:3000, None] * dummy_weights_window[:, None], axis=0)  # complex[sequence_length, num_sequences]
plt.plot(np.real(dummy_fft)[:, 0])
print(dummy_fft.shape)
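
The output length of `np.fft.rfft` is easy to get wrong, so here is a small sketch on random data (not the seminar audio; `toy_frames` is a hypothetical stand-in). For real input of length `n`, only `n // 2 + 1` frequency bins are returned, and `np.fft.irfft` maps them back.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_frames = rng.standard_normal((3000, 2))   # [sequence_length, num_sequences]

# One FFT per column because axis=0 is the time axis
toy_fft = np.fft.rfft(toy_frames, axis=0)
print(toy_fft.shape)                          # (1501, 2), i.e. 3000 // 2 + 1 bins

# irfft inverts rfft; passing n explicitly also recovers odd lengths correctly
toy_back = np.fft.irfft(toy_fft, n=toy_frames.shape[0], axis=0)
print(np.allclose(toy_back, toy_frames))      # True
```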

Okay, now it's time to combine everything into a **S**hort-**T**ime **F**ourier **T**ransform

In [None]:
def get_STFT(amplitudes, window_length, hop_length):
    """ Compute short-time Fourier Transform """
    # slice amplitudes into overlapping frames [window_length, num_frames]
    frames = slice_into_frames(amplitudes, window_length, hop_length)
    
    # get weights for fourier transform, float[window_length]
    weights_window = <YOUR CODE>
   
    
    # apply fourier transform to frames scaled by weights
    stft = <YOUR CODE>
    return stft

In [None]:
stft = get_STFT(amplitudes, window_length=2048, hop_length=512)
plt.plot(abs(stft)[0])

In [None]:
def get_spectrogram(amplitudes, sample_rate=22050, n_mels=128,
                       window_length=2048, hop_length=512, fmin=1, fmax=8192):
    """
    Compute a power spectrogram as described above (steps I and II).
    :param amplitudes: float [num_amplitudes], time-series of sound amplitude, same as above
    :param sample_rate: number of amplitudes per second
    :param n_mels: number of mel channels (unused here, kept for a uniform signature)
    :param window_length: length of a patch to which you apply FFT
    :param hop_length: interval between consecutive windows
    :param fmin: minimal frequency (unused here)
    :param fmax: maximal frequency (unused here)
    :returns: power spectrogram [window_length // 2 + 1, duration]
    """
    # Step I: compute Short-Time Fourier Transform
    stft = <YOUR CODE>
    assert stft.shape == (window_length // 2 + 1, len(amplitudes) // hop_length + 1)
    
    # Step II: convert stft to a spectrogram
    spectrogram = <YOUR CODE>
    
    return spectrogram

#### The Mel Basis

The Mel-scale is a perceptual scale of pitch: it reflects the fact that humans distinguish low frequencies much more finely than high ones. We will use it to compress and transform our spectrograms.
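
To get a feel for the scale, here is a minimal sketch using the HTK-style conversion formula. This is an assumption made for illustration: `hz_to_mel_htk` is a hypothetical helper, and librosa's default mel filters use a different (Slaney-style) curve, so the numbers will not match `librosa.filters.mel` exactly.

```python
import numpy as np

def hz_to_mel_htk(freq_hz):
    """HTK-style Hz -> mel conversion (illustrative, not librosa's default)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

freqs_hz = np.array([100.0, 200.0, 4000.0, 4100.0])
mels = hz_to_mel_htk(freqs_hz)

# The same 100 Hz step covers far more mels at low frequencies than at high ones
low_step = mels[1] - mels[0]
high_step = mels[3] - mels[2]
print(low_step, high_step)
```

Because equal steps in mels correspond to ever wider steps in Hz, a fixed number of mel bands spends most of its resolution on the low frequencies.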

In [None]:
mel_basis = librosa.filters.mel(sr=22050, n_fft=2048,
                                    n_mels=128, fmin=1, fmax=8192)
plt.figure(figsize=[16, 10])
plt.title("Mel Basis"); plt.xlabel("Frequency bin"); plt.ylabel("Mel band")
plt.imshow(np.log(mel_basis),origin='lower', cmap=plt.cm.hot,interpolation='nearest', aspect='auto')
plt.colorbar(use_gridspec=True)

# Can 
mat= np.matmul(mel_basis.T, mel_basis)

plt.figure(figsize=[16, 10])
plt.title("Recovered frequency basis"); plt.xlabel("Frequency bin"); plt.ylabel("Frequency bin")
plt.imshow(np.log(mat),origin='lower', cmap=plt.cm.hot,interpolation='nearest', aspect='auto')
plt.colorbar(use_gridspec=True)


In [None]:
def get_melspectrogram(amplitudes, sample_rate=22050, n_mels=128,
                       window_length=2048, hop_length=512, fmin=1, fmax=8192):
    spectrogram = get_spectrogram(amplitudes, sample_rate=sample_rate, n_mels=n_mels,
                       window_length=window_length, hop_length=hop_length, fmin=fmin, fmax=fmax)
    
    # Step III: convert spectrogram into Mel basis (multiplying by transformation matrix)
    mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=window_length,
                                    n_mels=n_mels, fmin=fmin, fmax=fmax)
    # -- matrix [n_mels, window_length // 2 + 1]
    
    mel_spectrogram = <YOUR_CODE>
    assert mel_spectrogram.shape == (n_mels, len(amplitudes) // hop_length + 1)
    
    return mel_spectrogram

In [None]:
amplitudes1, s1 = librosa.load("./sample1.wav")
amplitudes2, s2 = librosa.load("./sample2.wav")
print(s1)
ref1 = librosa.feature.melspectrogram(y=amplitudes1, sr=s1, n_mels=128, fmin=1, fmax=8192)
ref2 = librosa.feature.melspectrogram(y=amplitudes2, sr=s2, n_mels=128, fmin=1, fmax=8192)
assert np.allclose(get_melspectrogram(amplitudes1), ref1, rtol=1e-4, atol=1e-4)
assert np.allclose(get_melspectrogram(amplitudes2), ref2, rtol=1e-4, atol=1e-4)

In [None]:
plt.figure(figsize=[16, 4])
plt.subplot(1, 2, 1)
plt.title("That's no moon - it's a space station!"); plt.xlabel("Time"); plt.ylabel("Frequency")
plt.imshow(np.log10(get_melspectrogram(amplitudes1)),origin='lower', vmin=-10, vmax=5, cmap=plt.cm.hot)
plt.colorbar(use_gridspec=True)

plt.subplot(1, 2, 2)
plt.title("Help me, Obi Wan Kenobi. You're my only hope."); plt.xlabel("Time"); plt.ylabel("Frequency")
plt.imshow(np.log10(get_melspectrogram(amplitudes2)),origin='lower', vmin=-10, vmax=5, cmap=plt.cm.hot);
plt.colorbar(use_gridspec=True)

# Note that the second spectrogram has a higher mean frequency, which corresponds to the difference in the speakers' gender

### Task 2: Griffin-Lim Algorithm (5 points)


In this task you are going to reconstruct the original audio signal from a spectrogram using the __Griffin-Lim Algorithm (GLA)__ . The Griffin-Lim Algorithm is a phase reconstruction method based on the redundancy of the short-time Fourier transform. It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained. GLA is based only on the consistency and does not take any prior knowledge about the target signal into account.


This algorithm aims to recover a __complex-valued spectrogram__ that is consistent and maintains the given amplitude $\mathbf{A}$, using the following alternating projection procedure. Initialize a random "reconstructed" signal $\mathbf{x}$ and obtain its STFT
$$\mathbf{X} = \text{STFT}(\mathbf{x})$$

Then we __discard__ the magnitude of $\mathbf{X}$ and keep only its phase $\mathbf{\phi}$, which is random at the first iteration. Using this phase and the given magnitude $\mathbf{A}$, we construct a new complex-valued spectrogram $\mathbf{\tilde X}$ using Euler's formula

$$\mathbf{\tilde X} = \mathbf{A}\cdot e^{j\mathbf{\phi}}$$

Then we reconstruct the signal $\mathbf{\tilde x}$ using an __inverse STFT__:

$$\mathbf{\tilde x} = \text{iSTFT}(\mathbf{\tilde X})$$

We update our value of the signal reconstruction:

$$ \mathbf{x} = \mathbf{\tilde x} $$

Finally, we iterate this procedure multiple times and return the final $\mathbf{x}$.
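
The magnitude-swap step is the heart of the procedure, so here is a minimal sketch of just that step on random stand-in arrays. `A_toy` and `X_toy` are hypothetical placeholders for the target magnitude and the STFT of the current estimate; no STFT or inverse STFT is computed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the quantities in the text (toy sizes, random values)
A_toy = rng.random((5, 4))                                                  # target magnitude, non-negative
X_toy = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))     # STFT of the current estimate

# Keep only the phase of the current estimate
phi = np.angle(X_toy)

# Impose the target magnitude via Euler's formula
X_tilde = A_toy * np.exp(1j * phi)

# The magnitude of the result now equals the target magnitude
print(np.allclose(np.abs(X_tilde), A_toy))
```

In the full algorithm this step sits between an STFT and an inverse STFT, and it is the inverse STFT that pulls the modified spectrogram back towards a consistent one.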

In [None]:
# Step I: Reconstruct your Spectrogram from the Mel-Spectrogram
def inv_mel_spectrogram(mel_spectrogram, sample_rate=22050, n_mels=128,
                       window_length=2048, hop_length=512, fmin=1, fmax=8192):
    
    mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=window_length,
                                    n_mels=n_mels, fmin=fmin, fmax=fmax)
    
    inv_mel_basis = <INSERT YOUR CODE>
    spectrogram = <INSERT YOUR CODE>
    
    
    return spectrogram

In [None]:
amplitudes, sample_rate = librosa.load("welcome.wav")
display(Audio(amplitudes, rate=sample_rate))


true_spec = get_spectrogram(amplitudes)
mel_spec = get_melspectrogram(amplitudes, window_length=2048, hop_length=512)

#!!! Here you can modify your Mel-Spectrogram. Let your twisted imagination fly wild here !!!

#mel_spec[40:50,:]=0 # Zero out some freqs

# mel_spec[10:124,:] = mel_spec[0:114,:] # #Pitch-up 
# mel_spec[0:10,:]=0 

# mel_spec[0:114,:] = mel_spec[10:124,:] # #Pitch-down 
# mel_spec[114:124,:]=0

#mel_spec[:,:] = mel_spec[:,::-1] #Time reverse

#mel_spec[64:,:] = mel_spec[:64,:] # Trippy Shit

#mel_spec[:,:] = mel_spec[::-1,:] # Aliens are here


#mel_spec[:,:] = mel_spec[::-1,::-1] # Say hello to your friendly neighborhood Chaos God

#!!! END MADNESS !!!


#Convert Back to Spec
spec = inv_mel_spectrogram(mel_spec, window_length=2048, hop_length=512)

scale_1 = 1.0 / np.amax(true_spec)
scale_2 = 1.0 / np.amax(spec)

plt.figure(figsize=[16, 4])
plt.subplot(1, 2, 1)
plt.title("Welcome...!"); plt.xlabel("Time"); plt.ylabel("Frequency")
plt.imshow((true_spec*scale_1)**0.125,origin='lower',interpolation='nearest', cmap=plt.cm.hot, aspect='auto')
plt.colorbar(use_gridspec=True)

plt.subplot(1, 2, 2)
plt.title("Xkdfsas...!"); plt.xlabel("Time"); plt.ylabel("Frequency")
plt.imshow((spec*scale_2)**0.125,origin='lower',interpolation='nearest', cmap=plt.cm.hot, aspect='auto')
plt.colorbar(use_gridspec=True)


plt.figure(figsize=[16, 10])
plt.title("Xkdfsas...!"); plt.xlabel("Time"); plt.ylabel("Frequency")
plt.imshow((mel_spec**0.125),origin='lower',interpolation='nearest', cmap=plt.cm.hot, aspect='auto')
plt.colorbar(use_gridspec=True)

In [None]:
# Let's examine how to take an inverse FFT
dummy_window_length = 3000
dummy_weights_window = librosa.filters.get_window('hann', dummy_window_length, fftbins=True)

dummy_fft = np.fft.rfft(dummy_amps[:3000, None] * dummy_weights_window[:, None], axis=0)  # complex[sequence_length, num_sequences]
print(dummy_fft.shape)
rec_dummy_amps = dummy_weights_window*np.real(np.fft.irfft(dummy_fft[:,0]))
plt.plot(dummy_amps[:3000])
plt.plot(rec_dummy_amps[:3000])
plt.legend(['Original', 'Reconstructed'])
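
Note that the reconstruction above is scaled by the window twice: once before the FFT and once after the inverse FFT. When overlapping frames are summed back together, the quantity that has to be divided out is therefore the sum of squared windows. A minimal sketch of that sum, assuming a periodic Hann window and a hop of a quarter window (illustrative sizes, window written out by hand instead of calling librosa):

```python
import numpy as np

window_length, hop_length = 2048, 512   # illustrative sizes, hop = window / 4
n = np.arange(window_length)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / window_length)   # periodic Hann window

# Overlap-add the squared window at every hop position
num_frames = 16
window_sum = np.zeros(window_length + hop_length * (num_frames - 1))
for i in range(num_frames):
    window_sum[i * hop_length : i * hop_length + window_length] += hann ** 2

# Away from the edges, where four frames overlap, the sum is constant
steady = window_sum[window_length : -window_length]
print(steady.min(), steady.max())   # both approximately 1.5
```

Near the edges fewer frames overlap, so the sum drops there; this is one reason the padded borders are trimmed after reconstruction.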

In [None]:
# Step II: Reconstruct amplitude samples from STFT
def get_iSTFT(spectrogram, window_length, hop_length):
    """ Compute inverse short-time Fourier Transform """
    
    # get weights for fourier transform, float[window_length]
    window = librosa.filters.get_window('hann', window_length, fftbins=True)
    
    time_slices = spectrogram.shape[1]
    len_samples = int(time_slices*hop_length+window_length)
    
    x = np.zeros(len_samples)
    # apply inverse fourier transform to frames scaled by weights and save into x
    amplitudes = <YOUR CODE>
        
    # Trim the array to correct length from both sides
    x = <YOUR_CODE>
    return x

In [None]:
# Step III: Implement the Griffin-Lim Algorithm
def griffin_lim(power_spectrogram, window_size, hop_length, iterations, seed=1, verbose=True):
    """Reconstruct an audio signal from a magnitude spectrogram.
    Given a power spectrogram as input, reconstruct
    the audio signal and return it using the Griffin-Lim algorithm from the paper:
    "Signal estimation from modified short-time fourier transform" by Griffin and Lim,
    in IEEE transactions on Acoustics, Speech, and Signal Processing. Vol ASSP-32, No. 2, April 1984.
    Args:
        power_spectrogram (2-dim Numpy array): The power spectrogram. The rows correspond to frequency bins
            and the columns correspond to the time slices.
        window_size (int): The FFT size, which should be a power of 2.
        hop_length (int): The hop size in samples.
        iterations (int): Number of iterations for the Griffin-Lim algorithm. Typically a few hundred
            is sufficient.
    Returns:
        The reconstructed time domain signal as a 1-dim Numpy array.
    """
    
    time_slices = power_spectrogram.shape[1]
    len_samples = int(time_slices*hop_length-hop_length)
    
    # Obtain STFT magnitude from Spectrogram

    magnitude_spectrogram = <YOUR CODE>
    
    # Initialize the reconstructed signal to noise.
    np.random.seed(seed)
    x_reconstruct = np.random.randn(len_samples)
    
    for n in range(iterations):
        # Get the STFT of the current reconstruction (random noise at the first iteration)
        reconstruction_spectrogram = <YOUR_CODE>
        
        # Obtain the angle part of this STFT. Hint: use np.angle
        reconstruction_angle = <YOUR_CODE>
        
        # Discard magnitude part of the reconstruction and use the supplied magnitude spectrogram instead.
        proposal_spectrogram = <YOUR_CODE>
        assert np.iscomplexobj(proposal_spectrogram)  # np.complex was removed from recent NumPy versions
        
        
        # Save previous construction
        prev_x = x_reconstruct
        
        # Reconstruct signal
        x_reconstruct = <YOUR CODE>
        
        # Measure RMSE between consecutive reconstructions
        diff = np.sqrt(np.mean((x_reconstruct - prev_x)**2))
        if verbose:
            # HINT: This should decrease over multiple iterations. If it's not, your code doesn't work right!
            # Use this to debug your code!
            print('Reconstruction iteration: {}/{} RMSE: {} '.format(n, iterations, diff))
    return x_reconstruct

In [None]:
rec_amplitudes1 = griffin_lim(true_spec, 2048, 512, 1, verbose=False)
display(Audio(rec_amplitudes1, rate=sample_rate))
rec_amplitudes2 = griffin_lim(true_spec, 2048, 512, 50, verbose=False)
display(Audio(rec_amplitudes2, rate=sample_rate))

rec_amplitudes3 = griffin_lim(spec, 2048, 512, 1, verbose=False)
display(Audio(rec_amplitudes3, rate=sample_rate))
rec_amplitudes4 = griffin_lim(spec, 2048, 512, 50, verbose=False)
display(Audio(rec_amplitudes4, rate=sample_rate))

In [None]:
# THIS IS AN EXAMPLE OF WHAT YOU ARE SUPPOSED TO GET
# Remember to apply sqrt to the power spectrogram: librosa.griffinlim expects a magnitude spectrogram.

# Let's try this on a real spectrogram
ref_amplitudes1 = librosa.griffinlim(np.sqrt(true_spec), n_iter=1, hop_length=512, win_length=2048)
display(Audio(ref_amplitudes1, rate=sample_rate))
ref_amplitudes2 = librosa.griffinlim(np.sqrt(true_spec), n_iter=50, hop_length=512, win_length=2048)
display(Audio(ref_amplitudes2, rate=sample_rate))

# Now let's try this on a reconstructed spectrogram
ref_amplitudes3 = librosa.griffinlim(np.sqrt(spec), n_iter=1, hop_length=512, win_length=2048)
display(Audio(ref_amplitudes3, rate=sample_rate))
ref_amplitudes4 = librosa.griffinlim(np.sqrt(spec), n_iter=50, hop_length=512, win_length=2048)
display(Audio(ref_amplitudes4, rate=sample_rate))