### 7.1. Ambient Noise Reduction
1. Divide the audio into frames of 25 ms with a hop length of 10 ms.
2. Calculate the spectral centroid of each frame. This is achieved with LibROSA's feature.spectral_centroid() method.
3. Determine the maximum and minimum centroid values from the values calculated in the previous step. We call them the upper and lower thresholds, respectively.
4. Apply a lowshelf filter with gain=-30 at the lower threshold frequency. The lower threshold points towards the noisy part of the audio, and setting the gain to -30 strongly attenuates the frequencies below it, which in turn reduces that noise in the audio. A lower gain reduces the volume further and eliminates insignificant noises.
5. Apply a highshelf filter with gain=-30 at the upper threshold frequency. This step is useful for reducing foreground noises that might exist in the audio. For example, a person clapping close to the microphone while another person is speaking can be classified as foreground noise.
6. To compensate for the volume loss in steps 4 and 5, apply a limiter with a gain of +10 to increase the volume.
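
Steps 4 to 6 are not implemented in the cells shown in this section. A minimal sketch of how they might look is given here, under several assumptions: the gains are taken to be in decibels, the shelf filters are second-order (biquad) sections built from the widely used Audio EQ Cookbook formulas, and the limiter is approximated by a plain gain followed by hard clipping. The names shelf_coefficients, biquad and reduce_ambient_noise are hypothetical helpers, not LibROSA functions, and the original pipeline may have used a different filter implementation.

```python
import math
import numpy as np

def shelf_coefficients(kind, freq_hz, gain_db, sample_rate, slope=1.0):
    """Normalized biquad coefficients (b, a) for a 'low' or 'high' shelf filter."""
    amp = 10.0 ** (gain_db / 40.0)                  # square root of the linear shelf gain
    w0 = 2.0 * math.pi * freq_hz / sample_rate      # shelf frequency in radians/sample
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / 2.0 * math.sqrt((amp + 1.0 / amp) * (1.0 / slope - 1.0) + 2.0)
    k = 2.0 * math.sqrt(amp) * alpha
    if kind == "low":
        b = [amp * ((amp + 1) - (amp - 1) * cos_w0 + k),
             2 * amp * ((amp - 1) - (amp + 1) * cos_w0),
             amp * ((amp + 1) - (amp - 1) * cos_w0 - k)]
        a = [(amp + 1) + (amp - 1) * cos_w0 + k,
             -2 * ((amp - 1) + (amp + 1) * cos_w0),
             (amp + 1) + (amp - 1) * cos_w0 - k]
    elif kind == "high":
        b = [amp * ((amp + 1) + (amp - 1) * cos_w0 + k),
             -2 * amp * ((amp - 1) + (amp + 1) * cos_w0),
             amp * ((amp + 1) + (amp - 1) * cos_w0 - k)]
        a = [(amp + 1) - (amp - 1) * cos_w0 + k,
             2 * ((amp - 1) - (amp + 1) * cos_w0),
             (amp + 1) - (amp - 1) * cos_w0 - k]
    else:
        raise ValueError("kind must be 'low' or 'high'")
    a0 = a[0]
    return np.array(b) / a0, np.array(a) / a0

def biquad(x, b, a):
    """Apply a normalized second-order IIR filter sample by sample (direct form I)."""
    y = np.zeros(len(x))
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y[n] = yn
    return y

def reduce_ambient_noise(signal, sample_rate, lower_th, upper_th,
                         shelf_gain_db=-30.0, makeup_gain_db=10.0):
    """Steps 4-6: lowshelf cut, highshelf cut, then make-up gain (gains assumed to be dB)."""
    b, a = shelf_coefficients("low", lower_th, shelf_gain_db, sample_rate)
    out = biquad(np.asarray(signal, dtype=float), b, a)
    b, a = shelf_coefficients("high", upper_th, shelf_gain_db, sample_rate)
    out = biquad(out, b, a)
    out = out * 10.0 ** (makeup_gain_db / 20.0)
    return np.clip(out, -1.0, 1.0)   # hard clipping: a crude stand-in for a real limiter
```

With the thresholds computed from the spectral centroids, a call such as reduce_ambient_noise(signal, sample_rate, lower_th, upper_th) would return the filtered signal. The sample-by-sample loop is slow for long recordings; it is written this way only to keep the sketch free of extra dependencies.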

In [3]:
import numpy as np
import scipy.io.wavfile as wav
import os
import csv # create the dataset from the audio data
import matplotlib.pyplot as plt
import librosa as l
import IPython.display as ipd

In [4]:
# audio data folder
check_folder = '../16000_pcm_speeches/Nelson_Mandela/'
print("tot files:", len(os.listdir(check_folder)))
os.listdir(check_folder)

tot files: 1500


['1452.wav',
 '526.wav',
 '880.wav',
 '430.wav',
 '1319.wav',
 '1167.wav',
 '902.wav',
 '167.wav',
 '1031.wav',
 '592.wav',
 '104.wav',
 '1492.wav',
 '1164.wav',
 '941.wav',
 '663.wav',
 '710.wav',
 '875.wav',
 '99.wav',
 '937.wav',
 '1273.wav',
 '525.wav',
 '574.wav',
 '1192.wav',
 '1490.wav',
 '1090.wav',
 '513.wav',
 '365.wav',
 '148.wav',
 '184.wav',
 '900.wav',
 '1487.wav',
 '961.wav',
 '529.wav',
 '11.wav',
 '1415.wav',
 '1431.wav',
 '985.wav',
 '682.wav',
 '121.wav',
 '884.wav',
 '654.wav',
 '982.wav',
 '516.wav',
 '1253.wav',
 '249.wav',
 '1396.wav',
 '1264.wav',
 '1383.wav',
 '1179.wav',
 '832.wav',
 '45.wav',
 '1476.wav',
 '290.wav',
 '1281.wav',
 '878.wav',
 '5.wav',
 '1000.wav',
 '116.wav',
 '1244.wav',
 '1011.wav',
 '822.wav',
 '626.wav',
 '1232.wav',
 '888.wav',
 '787.wav',
 '1108.wav',
 '1317.wav',
 '1338.wav',
 '1204.wav',
 '244.wav',
 '853.wav',
 '1077.wav',
 '958.wav',
 '501.wav',
 '348.wav',
 '346.wav',
 '535.wav',
 '162.wav',
 '398.wav',
 '726.wav',
 '425.wav',
 '13

In [5]:
ipd.Audio(check_folder+'100.wav')

In [9]:
signal, sample_rate = l.load(check_folder+'100.wav', sr=16000) # load at the known 16 kHz sampling rate
centroids = l.feature.spectral_centroid(y=signal, sr=sample_rate, n_fft=int(0.025*sample_rate),
                                       hop_length=int(0.010*sample_rate))
np_centroids = np.asarray(centroids)
print(np_centroids.shape)
np_centroids

(1, 101)


array([[3109.18363404, 3189.68301633, 3244.68667144, 3220.97288085,
        2883.12447593, 3185.49910855, 3426.42474383, 3329.62831332,
        3044.33695671, 3011.06633899, 3116.29153304, 2992.03922153,
        3168.05045582, 2980.90646335, 3279.81962601, 3648.69779067,
        3183.55472994, 3219.6102842 , 3064.8098304 , 2931.34704378,
        2889.25926418, 2650.56001271, 2813.38485301, 3128.82727447,
        2927.21903632, 3243.2022362 , 2945.41040867, 3283.34237844,
        3380.17751119, 2894.92271967, 2135.48333826, 2331.7601793 ,
        1482.44002686, 2740.14172514, 1516.32262188, 1516.66087416,
        1270.92192316, 1260.59479278, 1261.93356681, 1067.39678439,
         904.08673391,  910.85405123, 1028.25624718, 1143.91927496,
         904.06378312, 1028.56502367,  918.34864778,  844.36614839,
        1070.62060222, 1033.01620033, 1127.94440592, 1498.87823837,
        1820.65220441, 2102.36190901, 2032.52731941, 2045.39086577,
        2092.19343524, 2290.87022895, 2476.66183

In [11]:
lower_th = np.min(centroids)
upper_th = np.max(centroids)
print("centroid min:", lower_th, "\tmax:", upper_th)

centroid min: 844.3661483864162 	max: 4597.460470244332
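
The shape (1, 101) printed above follows directly from the framing parameters. The short check below reproduces that arithmetic, assuming a one-second clip of 16000 samples (which is consistent with the printed shape) and LibROSA's default centered framing, where the number of frames is 1 + n_samples // hop_length. The variable n_samples is an assumption, not a value read from the file.

```python
sample_rate = 16000                         # Hz, as passed to l.load above
frame_length = int(0.025 * sample_rate)     # 25 ms window in samples
hop_length = int(0.010 * sample_rate)       # 10 ms hop in samples
n_samples = 16000                           # assumption: the clip lasts exactly one second

# With centered frames, one frame starts at every hop, including sample 0.
n_frames = 1 + n_samples // hop_length
print(frame_length, hop_length, n_frames)   # 400 160 101
```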


### 7.2. Vocal Enhancement
Unlike ambient noise, vocal signals are what we want our feature extraction phase to focus
on. These are the significant parts of the audio from which we wish to extract features and
build machine learning classifiers. Hence, enhancing these parts of the audio can greatly
increase our accuracy. We can do this by using MFCCs again.
We took the following steps to enhance the vocals.
1. Divide the audio into frames of 25 ms with a hop length of 10 ms.
2. Calculate the MFCCs for each frame.
3. *Take the sum of squares of the MFCCs.* Squaring the coefficients helps in summarizing the spread of the data. Additionally, MFCCs are not always positive, and squaring ensures that all comparisons are made between non-negative numbers. Although absolute values would achieve this too, squaring summarizes the spread better.
4. Find the strongest frame, that is, the frame with the maximum sum of squares of its MFCCs. The strongest frame most likely indicates a vocal utterance: because this step runs after ambient noise removal, which also reduces the foreground noise, the strongest remaining frame will most likely be the vocal part of the audio.
5. Find the minimum frequency (in hertz) in the strongest frame and apply a lowshelf filter with a positive gain to enhance those vocals.

In [None]:
just take the sum of the squares...
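
Steps 3 and 4 can be sketched with NumPy alone. The example assumes the MFCCs are stored as a matrix of shape (n_mfcc, n_frames), which is the layout LibROSA's feature.mfcc() returns; the helper name strongest_frame and the small demo matrix are hypothetical and serve only as illustration. Step 5 is not covered here.

```python
import numpy as np

def strongest_frame(mfcc):
    """Return (index, energies), where energies[t] is the sum of squared MFCCs of frame t."""
    mfcc = np.asarray(mfcc, dtype=float)
    energies = np.sum(mfcc ** 2, axis=0)      # one value per frame (column)
    return int(np.argmax(energies)), energies

# In the notebook the matrix could come from LibROSA, for example:
#   mfcc = l.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13,
#                         n_fft=int(0.025 * sample_rate),
#                         hop_length=int(0.010 * sample_rate))
# (n_mfcc=13 is an assumption; the report does not state the number of coefficients.)

# Tiny made-up matrix: 2 coefficients, 3 frames.
demo = np.array([[1.0, -3.0, 0.5],
                 [2.0,  4.0, 0.5]])
index, energies = strongest_frame(demo)
print(index, energies.tolist())   # 1 [5.0, 25.0, 0.5]
```

Frame 1 wins because 9 + 16 = 25 is the largest sum, even though one of its coefficients is negative, which is the reason the steps above give for squaring before comparing.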

### 7.3. Audio Trimming
Audio trimming is another important pre-processing step to ensure that the audio is clean,
improved, and more suitable for research. We discussed earlier in this report why audio
trimming is needed and how it affects the quality. For the purpose of this research, we set the
threshold to 20 decibels. LibROSA interprets this threshold relative to a reference level (by
default the loudest part of the signal), so a sound more than 20 dB below that level is considered
silence and is trimmed if it lies at an edge of the audio. We achieve this with LibROSA's
librosa.effects.trim() method. We kept the frame length at 2048 and the hop length at 500,
where the hop length is the number of samples between neighboring frames.

In [None]:
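# trim() returns the trimmed signal and the [start, end] sample indices of the kept region
trim_signal, trim_index = l.effects.trim(y=signal, top_db=20, frame_length=2048, hop_length=500)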

### 7.4. Audio Splitting
Similar to audio trimming, audio splitting is a technique used to remove the unwanted parts
of an audio signal. Trimming and splitting follow very similar fundamentals. Audio trimming
removes unwanted or silent parts from the start and end of the audio, whereas audio
splitting takes this a step further and also removes silences from the middle portions of the audio.
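
LibROSA offers librosa.effects.split() for this purpose. To make the underlying idea visible, the sketch below reimplements a simplified version with NumPy only: it measures the RMS level of each frame, compares it with the loudest frame, and returns the sample intervals that stay within top_db of that level. The helper name split_on_silence is hypothetical, the defaults reuse the trimming parameters from section 7.3 as an assumption, and the result is not guaranteed to match LibROSA's output exactly.

```python
import numpy as np

def split_on_silence(y, top_db=20.0, frame_length=2048, hop_length=500):
    """Return (start, end) sample intervals whose frames are within top_db of the loudest frame."""
    y = np.asarray(y, dtype=float)
    n_frames = 1 + max(0, len(y) - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    # Level of each frame in dB relative to the loudest frame (floored to avoid log(0)).
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    active = db > -top_db

    intervals = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                   # a non-silent run begins
        elif not is_active and start is not None:
            intervals.append((start * hop_length, i * hop_length))
            start = None                                # the run ended at a silent frame
    if start is not None:
        intervals.append((start * hop_length, len(y)))  # run extends to the end of the signal
    return intervals

# Made-up test signal: tone, half a second of silence, tone (at 16 kHz).
t = np.arange(8000) / 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
y = np.concatenate([tone, np.zeros(8000), tone])
intervals = split_on_silence(y)
kept = np.concatenate([y[s:e] for s, e in intervals])   # audio with the middle silence removed
print(intervals, len(kept))
```

Concatenating the returned intervals, as in the last lines, yields a signal with the interior silence removed, which is the difference from trimming described above.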