[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rbg-research/AI-Training/blob/main/voice-analytics/speech-analytics-deep-learning/session-1/Tutorial-1.ipynb)

# Data Preparation

* The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.
* TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences.
* The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance.
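
As a rough illustration of what "16-bit, 16 kHz" means for storage, the sketch below estimates the raw PCM size of an utterance. It assumes a single channel and ignores any file header; `pcm_bytes` is a hypothetical helper, not part of any library.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated in the corpus description
BITS_PER_SAMPLE = 16      # 16-bit samples
CHANNELS = 1              # assumption: single-channel recordings


def pcm_bytes(duration_seconds):
    """Approximate raw PCM payload size (header excluded) for one utterance."""
    n_samples = round(duration_seconds * SAMPLE_RATE_HZ)
    return n_samples * CHANNELS * (BITS_PER_SAMPLE // 8)


three_second_size = pcm_bytes(3.0)
print(three_second_size)  # number of bytes for a 3-second utterance
```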

In [1]:
!pip3 install datasets

In [2]:
from datasets import load_dataset

* Datasets is a library for easily accessing and sharing datasets and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
* For more details, refer to the [documentation](https://huggingface.co/docs/datasets/index.html) and the [dataset viewer](https://huggingface.co/datasets/viewer/).

In [3]:
dataset = load_dataset(
   'timit_asr')

Downloading and preparing dataset timit_asr/clean (download: 828.75 MiB, generated: 7.90 MiB, post-processed: Unknown size, total: 836.65 MiB) to /Users/hbbg/.cache/huggingface/datasets/timit_asr/clean/2.0.1/5bebea6cd9df0fc2c8c871250de23293a94c1dc49324182b330b6759ae6718f8...


Downloading:   0%|          | 0.00/869M [00:00<?, ?B/s]

ConnectionError: HTTPSConnectionPool(host='data.deepai.org', port=443): Read timed out.

* `load_dataset` downloads the corpus the first time it is called and loads it from the local cache if it has already been downloaded.
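
Large downloads can time out, as the `ConnectionError` in the output above shows. One possible workaround is to retry the call a few times. The sketch below is a generic, hypothetical `retry` helper (not part of the `datasets` library); it catches `OSError` on the assumption that network errors derive from it, and is demonstrated with a stand-in function so that it runs without network access.

```python
import time


def retry(load_fn, attempts=3, wait_seconds=0.0):
    """Call load_fn until it succeeds or the attempts run out (hypothetical helper)."""
    last_error = None
    for _ in range(attempts):
        try:
            return load_fn()
        except OSError as err:  # assumption: network failures derive from OSError
            last_error = err
            time.sleep(wait_seconds)
    raise last_error


# Stand-in for a flaky download: fails twice, then succeeds.
call_log = {"n": 0}


def flaky_loader():
    call_log["n"] += 1
    if call_log["n"] < 3:
        raise ConnectionError("Read timed out.")
    return "loaded"


result = retry(flaky_loader, attempts=3)
print(result, "after", call_log["n"], "calls")
```

In the notebook, the same helper could wrap the real call, for example `retry(lambda: load_dataset('timit_asr'), wait_seconds=5.0)`.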

In [None]:
dataset

In [None]:
train_files, train_labels = dataset["train"]["file"], dataset["train"]["speaker_id"]

In [None]:
set(train_labels)

In [None]:
test_files, test_labels = dataset["test"]["file"], dataset["test"]["speaker_id"]
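
Since the labels are speaker IDs, it can be useful to check how many utterances each speaker contributes. The sketch below uses `collections.Counter` on a short made-up list; the IDs are placeholders, and in the notebook `train_labels` would take the place of `example_labels`.

```python
from collections import Counter

# Placeholder speaker IDs for illustration only (not taken from the corpus).
example_labels = ["SPK_A", "SPK_A", "SPK_B", "SPK_A", "SPK_B"]

utterances_per_speaker = Counter(example_labels)
n_speakers = len(utterances_per_speaker)

print(n_speakers)                              # number of distinct speakers
print(utterances_per_speaker.most_common(1))   # speaker with the most utterances
```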

# Common Features, Visualization & Information

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import IPython.display as ipd
import librosa
import librosa.display
import sklearn.preprocessing  # import the submodule explicitly so sklearn.preprocessing.minmax_scale is available

* Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
* For more details, refer to the [Matplotlib tutorials](https://matplotlib.org/stable/tutorials/index.html).
* librosa is a Python package for music and audio analysis.
* For more details, refer to the [librosa tutorial](https://librosa.org/doc/latest/tutorial.html).
* scikit-learn (`sklearn`) is used here for feature preprocessing; it also provides conventional machine learning algorithms.
* For more details, refer to the [scikit-learn documentation](https://scikit-learn.org/stable/).

In [None]:
demo_file = train_files[100]

In [None]:
demo_file

### load an audio file

In [None]:
x, sr = librosa.load(demo_file)  # without sr=..., librosa resamples to its default of 22050 Hz
print(type(x), type(sr))

In [None]:
x

In [None]:
sr

### load an audio file with specific sampling rate

In [None]:
x, sr = librosa.load(demo_file, sr=16000)  # 16 kHz matches the sampling rate of the TIMIT recordings
print(type(x), type(sr))

In [None]:
x

In [None]:
sr

### getting the audio duration

In [None]:
librosa.get_duration(y=x, sr=sr)
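
For a signal held in memory, the duration is simply the number of samples divided by the sampling rate. The sketch below shows that arithmetic on a synthetic array; `demo_signal` and `demo_sr` are illustrative names chosen so they do not overwrite the notebook's `x` and `sr`.

```python
import numpy as np

demo_sr = 16000                                        # samples per second
demo_signal = np.zeros(3 * demo_sr // 2, dtype=np.float32)  # 24000 samples of silence

demo_duration = len(demo_signal) / demo_sr             # seconds = samples / sampling rate
print(demo_duration)
```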

### playing audio

In [None]:
ipd.Audio(x, rate=sr)

### [Waveform](https://en.wikipedia.org/wiki/Waveform) - amplitude of the audio signal over time (time domain)

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)  # waveplot was removed in librosa 0.10; on older versions use waveplot

### [Spectrogram](https://en.wikipedia.org/wiki/Spectrogram) - frequencies present at each point in time, along with their amplitudes

In [None]:
X = librosa.stft(x)  # short-time Fourier transform: one complex spectrum per frame
Xdb = librosa.amplitude_to_db(abs(X))  # magnitudes converted to decibels
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')  # linear frequency axis
plt.colorbar()

In [None]:
X = librosa.stft(x)  # short-time Fourier transform: one complex spectrum per frame
Xdb = librosa.amplitude_to_db(abs(X))  # magnitudes converted to decibels
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')  # logarithmic frequency axis
plt.colorbar()
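
To make the two steps above more concrete, here is a minimal NumPy sketch of a magnitude spectrogram in decibels: slice the signal into overlapping frames, window each frame, take the FFT, and convert to dB. It is a simplification, not librosa's implementation; it does no padding or centring of frames and no dynamic-range clipping, and `stft_magnitude_db` is a hypothetical name.

```python
import numpy as np


def stft_magnitude_db(signal, n_fft=512, hop_length=256):
    """Simplified magnitude spectrogram in dB (no padding, no clipping)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # One column per frame: windowed excerpt of the signal.
    frames = np.stack(
        [signal[i * hop_length:i * hop_length + n_fft] * window
         for i in range(n_frames)],
        axis=1,
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=0))      # shape: (n_fft // 2 + 1, n_frames)
    return 20.0 * np.log10(np.maximum(magnitude, 1e-5))  # floor avoids log(0)


# Synthetic test signal: one second of a 1000 Hz tone at 16 kHz.
tone_sr = 16000
tone = np.sin(2 * np.pi * 1000.0 * np.arange(tone_sr) / tone_sr)

tone_db = stft_magnitude_db(tone)
peak_bin = int(np.argmax(tone_db[:, 0]))
peak_hz = peak_bin * tone_sr / 512   # convert bin index to frequency
print(tone_db.shape, peak_hz)
```

The strongest bin of the first frame should sit at the tone's frequency, which is a quick sanity check that rows correspond to frequency and columns to time.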

### [Zero Crossing Rate](https://en.wikipedia.org/wiki/Zero-crossing_rate) - rate of sign-changes along a signal

In [None]:
n0 = 0
n1 = 10
plt.figure(figsize=(14, 5))
plt.plot(x[n0:n1])
plt.grid()

In [None]:
zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
print(sum(zero_crossings))
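
The count can also be reproduced by hand: compare the sign of each sample with the sign of the previous one. The sketch below does this with NumPy on a tiny made-up signal that contains no exact zeros (libraries differ in how they treat zeros, so that case is left out here).

```python
import numpy as np

toy_signal = np.array([0.5, -0.2, -0.3, 0.4, 0.1, -0.6])

is_negative = np.signbit(toy_signal)
sign_changes = is_negative[1:] != is_negative[:-1]   # True where adjacent samples differ in sign

n_crossings = int(sign_changes.sum())
crossing_fraction = n_crossings / len(sign_changes)  # fraction of adjacent pairs that cross zero
print(n_crossings, crossing_fraction)
```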

### [Spectral Centroid](https://en.wikipedia.org/wiki/Spectral_centroid) - weighted mean of the frequencies present in the sound

In [None]:
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
spectral_centroids.shape

frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=sr) # time in seconds of each frame, for plotting

def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis) # scale to [0, 1] so the curve fits on the waveform plot

librosa.display.waveshow(x, sr=sr, alpha=0.4) # plot the spectral centroid over the waveform (waveplot on librosa < 0.10)
plt.plot(t, normalize(spectral_centroids), color='r')
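
The "weighted mean of the frequencies" can be written out directly: weight each frequency bin by its magnitude and divide by the total magnitude. The sketch below computes this for a single frame with NumPy; `centroid_frequency` is a hypothetical helper, and the test tone is synthetic.

```python
import numpy as np


def centroid_frequency(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame, in Hz."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitude) / np.sum(magnitude))


# A pure 1000 Hz tone: all spectral weight sits at 1000 Hz.
tone_sr = 16000
tone_time = np.arange(1600) / tone_sr
pure_tone = np.sin(2 * np.pi * 1000.0 * tone_time)

centroid_hz = centroid_frequency(pure_tone, tone_sr)
print(centroid_hz)  # close to 1000 Hz
```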

### [Spectral Rolloff](https://github.com/erwanrh/ML_Python-Music_Classification/wiki/Spectral-roll-off) - frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies

In [None]:
spectral_rolloff = librosa.feature.spectral_rolloff(y=x, sr=sr)[0]
librosa.display.waveshow(x, sr=sr, alpha=0.4) # waveplot on librosa < 0.10
plt.plot(t, normalize(spectral_rolloff), color='r')
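
The roll-off definition above translates into a cumulative sum over the spectrum: find the first frequency bin at which the running total reaches the chosen percentage. The sketch below does this for one frame, assuming the "energy" is measured by the magnitude spectrum; `rolloff_frequency` is a hypothetical helper and the two-tone signal is synthetic.

```python
import numpy as np


def rolloff_frequency(frame, sample_rate, roll_percent=0.85):
    """Lowest frequency below which roll_percent of the total magnitude lies."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    cumulative = np.cumsum(magnitude)
    threshold = roll_percent * cumulative[-1]
    index = np.searchsorted(cumulative, threshold)  # first bin reaching the threshold
    return float(freqs[index])


# Strong 500 Hz tone plus a weak 3000 Hz tone (one tenth of the amplitude).
mix_sr = 16000
mix_time = np.arange(1600) / mix_sr
mix = np.sin(2 * np.pi * 500.0 * mix_time) + 0.1 * np.sin(2 * np.pi * 3000.0 * mix_time)

rolloff_85 = rolloff_frequency(mix, mix_sr, roll_percent=0.85)
rolloff_95 = rolloff_frequency(mix, mix_sr, roll_percent=0.95)
print(rolloff_85, rolloff_95)
```

The strong tone alone carries about 91% of the total magnitude, so the 85% roll-off stays at the lower tone while the 95% roll-off moves up to the weaker, higher one.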

### [MFCC](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) — Mel-Frequency Cepstral Coefficients

MFCCs are commonly derived as follows:

* Take the Fourier transform of (a windowed excerpt of) a signal.
* Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows or alternatively, cosine overlapping windows.
* Take the logs of the powers at each of the mel frequencies.
* Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
* The MFCCs are the amplitudes of the resulting spectrum.
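
The steps above can be sketched for a single frame with NumPy. This is an illustrative simplification rather than librosa's implementation: it assumes the common `2595 * log10(1 + f / 700)` mel formula, a natural logarithm, and an unnormalized DCT-II, so the numbers will not match `librosa.feature.mfcc`. All function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular overlapping filters with centres equally spaced on the mel scale."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mfcc_frame(frame, sample_rate, n_mels=40, n_mfcc=13):
    # 1. Fourier transform of a windowed excerpt.
    power = np.abs(np.fft.rfft(frame * np.hanning(frame.size))) ** 2
    # 2. Map the power spectrum onto the mel scale.
    mel_power = mel_filterbank(n_mels, frame.size, sample_rate) @ power
    # 3. Log of the power in each mel band (small offset avoids log(0)).
    log_mel = np.log(mel_power + 1e-10)
    # 4. DCT-II of the log mel powers; 5. keep the first n_mfcc amplitudes.
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return dct_basis @ log_mel


frame_sr = 16000
example_frame = np.sin(2 * np.pi * 440.0 * np.arange(512) / frame_sr)
example_mfcc = mfcc_frame(example_frame, frame_sr)
print(example_mfcc.shape)
```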

In [None]:
mfccs = librosa.feature.mfcc(y=x, sr=sr)  # pass the signal by keyword; recent librosa versions require it
print(mfccs.shape)

In [None]:
mfccs = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13)  # keep only the first 13 coefficients
print(mfccs.shape)

In [None]:
librosa.display.specshow(mfccs, sr=sr, x_axis='time')

### [Chromagram](https://en.wikipedia.org/wiki/Chroma_feature) - relates to the twelve different pitch classes

In [None]:
chromagram = librosa.feature.chroma_stft(y=x, sr=sr)
plt.figure(figsize=(10, 4))
librosa.display.specshow(chromagram, sr=sr, y_axis='chroma', x_axis='time')  # sr is needed for a correct time axis
plt.tight_layout()
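
The idea of a pitch class can be illustrated without any audio: frequencies an octave apart fall into the same class. The sketch below maps a frequency to one of the twelve classes, assuming equal temperament with A4 tuned to 440 Hz; `pitch_class` is a hypothetical helper.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(frequency_hz, a4_hz=440.0):
    """Nearest equal-tempered pitch class for a frequency (assumes A4 = 440 Hz)."""
    midi_note = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    return PITCH_CLASSES[midi_note % 12]


print(pitch_class(440.0))    # A4
print(pitch_class(880.0))    # one octave higher, same class
print(pitch_class(261.63))   # approximately middle C
```

A chromagram applies the same folding to every frame of a spectrogram, summing the energy that lands in each of the twelve classes.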