# My notes for reference. Check out the original work at https://github.com/musikalkemist/AudioSignalProcessingForML

# Table of contents
    1. The Basics of Sound
        - what is sound and its wave property
        - freq, amplitude, phase of a sine wave
        - pitch vs freq relation
    2. Intensity, Loudness, Timbre
        - power and intensity measurement
        - loudness measurement
        - what is timbre and its properties
    3. Understanding audio signals
        - sampling, nyquist rate, aliasing
        - quantization, bit depth
    4. Audio Features
        - brief introduction to time domain audio features
        - brief introduction to freq domain audio features
        - spectral leakage, windowing and overlapping frames

# 1. The Basics of Sound

Sound is produced when a vibrating object makes the surrounding air molecules oscillate. The resulting changes in air pressure travel as a wave that is interpreted by our ears.

The basic waveform of a sound looks like this, with amplitude on the y axis and time on the x axis.

![sound_waveform_l1](./images/sound_waveform_l1.png)


### Sine wave terminology:

**amplitude** : The Amplitude is the height from the centre line to the peak (or to the trough). Or we can measure the height from highest to lowest points and divide that by 2.

**phase** : The Phase Shift is how far the function is shifted horizontally from the usual position.

**frequency/period** : the frequency is the number of complete cycles per second (one cycle runs from peak to peak). When we say a sine wave has a freq of 22 kHz, that means it completes 22,000 cycles in one second.
period = 1/freq, which is how long one cycle takes in seconds.
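To make the three terms concrete, here is a minimal sketch that generates a sine wave from its amplitude, frequency and phase. The function name `sine_wave`, the 440 Hz tone and the 44100 Hz sample rate are illustrative choices, not something taken from the original course.

```python
import math

def sine_wave(amplitude, freq_hz, phase_rad, duration_s, sample_rate):
    """Return samples of amplitude * sin(2*pi*freq*t + phase)."""
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate + phase_rad)
        for n in range(n_samples)
    ]

freq_hz = 440.0
period_s = 1.0 / freq_hz   # duration of one cycle in seconds

# 10 ms of a 440 Hz tone with peak amplitude 0.5 and no phase shift
wave = sine_wave(amplitude=0.5, freq_hz=freq_hz, phase_rad=0.0,
                 duration_s=0.01, sample_rate=44100)
print(f"period = {period_s * 1000:.3f} ms, samples = {len(wave)}")
```

Changing `phase_rad` shifts the wave horizontally, while `amplitude` only scales its height.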

good source for sine wave terminology: https://www.mathsisfun.com/algebra/amplitude-period-frequency-phase-shift.html

![sine_wave](./images/sine_wave.gif)

![hearing_range](./images/hearing_range.png)


**Generally, a higher amplitude results in a louder sound and a higher frequency results in a higher-pitched sound.**

good source for pitch/amplitude:  https://www.phys.uconn.edu/~gibson/Notes/Section2_1/Sec2_1.htm

**Pitch vs Freq** : The perception of pitch is logarithmic in frequency. For example, let's say we play the C1 note on a piano. If we double its freq we land on C2, the same note one octave higher. So every time we go up one octave the frequency doubles, and equal pitch steps correspond to equal frequency ratios rather than equal frequency differences. The relation follows the log scale below.

![pitch_freq](./images/pitch_freq.png)
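The log relation can be sketched in code using the usual equal-temperament mapping between note numbers and frequency. The MIDI note numbering and the A4 = 440 Hz reference are assumptions here, and the figure above may write the formula in a slightly different form.

```python
import math

A4_MIDI = 69     # MIDI note number of A4 (assumed convention)
A4_HZ = 440.0    # assumed tuning reference

def midi_to_hz(note):
    """Each step of 12 notes (one octave) doubles the frequency."""
    return A4_HZ * 2 ** ((note - A4_MIDI) / 12)

def hz_to_midi(freq_hz):
    """Inverse mapping: pitch grows with the log of frequency."""
    return A4_MIDI + 12 * math.log2(freq_hz / A4_HZ)

print(midi_to_hz(69), midi_to_hz(81))   # A4 and A5, one octave apart
```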


**TODO**
provide explanation for the below parts

Each octave can be further divided into cents.
http://hyperphysics.phy-astr.gsu.edu/hbase/Music/cents.html#:~:text=Musical%20intervals%20are%20often%20expressed,then%20the%20interval%20is%20cents.

Why can sound be represented by sine waves, and not by some other wave?

# 2. Intensity, Loudness, Timbre

**Sound power and intensity**  
Sound power is the rate at which energy is transferred: the energy per unit of time emitted by a sound source in all directions. It is measured in **watts (W)**.

Sound power per unit area is **sound intensity**, measured in W/m².

Threshold of hearing (TOH) for humans is 10⁻¹² W/m²
Threshold of pain (TOP) for humans is 10 W/m²

The intensity level of a sound is measured relative to the TOH and expressed in decibels (dB), which is a log scale.

For example, if we have a sound with intensity I, we can compute its level in dB using the formula below.

![intensity_formula](./images/intensity_formula.png)

**Every increase of about 3 dB doubles the intensity**
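As a quick check of the dB relation, here is a small sketch assuming the standard form dB(I) = 10 * log10(I / I_TOH) with the reference intensity I_TOH = 10⁻¹² W/m². The names `I_TOH` and `intensity_to_db` are made up for this example.

```python
import math

I_TOH = 1e-12   # threshold of hearing in W/m^2 (standard reference intensity)

def intensity_to_db(intensity):
    """Intensity level in dB relative to the threshold of hearing."""
    return 10 * math.log10(intensity / I_TOH)

toh_db = intensity_to_db(I_TOH)    # level at the threshold of hearing
top_db = intensity_to_db(10.0)     # level at the threshold of pain (10 W/m^2)
doubling_db = intensity_to_db(2e-6) - intensity_to_db(1e-6)

print(f"TOH: {toh_db:.1f} dB, TOP: {top_db:.1f} dB, doubling adds {doubling_db:.2f} dB")
```

Doubling the intensity always adds 10 * log10(2) dB, which is where the "about 3 dB" rule comes from.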

**Loudness of sound**:<br>
Loudness is how loud a sound seems to us, i.e. the subjective perception of sound intensity (not of pitch). It depends on the duration and the freq of the sound, and it can change with the age of the listener.

For example, 2 sounds may have the same intensity, but depending on how long they are sustained their loudness will differ.

**Loudness is measured in phons**

Below is a graph of freq vs sound intensity level (dB). Each curve corresponds to a different number of phons. If we follow the 80 phon line, we can see that at very low freq we need a much higher intensity to reach 80 phons.

![loudness_conture](./images/loudness_conture.png)

**Timbre of sound**:<br>
Timbre doesn't have a solid definition. We could say that timbre is the color of a sound. Let's say we play the same note on 2 different instruments with the same intensity, freq and duration: we still hear them differently. Timbre is sometimes described with words like bright, dark, dull, harsh, warm.

Timbre is multidimensional. To understand timbre we need to understand the following things.

**Sound envelope**:<br>
In sound and music, an envelope describes how a sound changes over time. A common description is the Attack-Decay-Sustain-Release (ADSR) model. Let's say we hit the C4 key: the sound has an initial strike (attack) phase, then decay, then sustain and finally release.
![envelop_model](./images/envelop_model.png)

If we compare 2 different instruments like piano and violin, we can see that the violin has a longer and smoother attack, not as sharp as the piano's.

![envelop_model_example](./images/envelop_model_example.png)


**Complex sound**:<br>
We can think of a complex sound as a superposition of different sinusoids. Each sinusoid is a **partial**. The lowest partial is the **fundamental freq**, and a **harmonic partial** is a partial whose frequency is an integer multiple of the fundamental frequency.
For example, if the fundamental freq is 440 Hz, the harmonic partials are 2 × 440 = 880 Hz, 3 × 440 = 1320 Hz, 4 × 440 = 1760 Hz and so on.
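A rough sketch of the idea: list the harmonic partials of a fundamental and sum them into one complex tone. The function names and the amplitude values are arbitrary picks for illustration. Real instruments have their own characteristic amplitude per partial.

```python
import math

def harmonic_partials(fundamental_hz, count):
    """First `count` harmonics; k = 1 is the fundamental itself."""
    return [k * fundamental_hz for k in range(1, count + 1)]

def complex_tone(fundamental_hz, amplitudes, duration_s, sample_rate):
    """Sum harmonic sinusoids; amplitudes[k-1] scales the k-th harmonic."""
    n_samples = round(duration_s * sample_rate)
    return [
        sum(a * math.sin(2 * math.pi * k * fundamental_hz * n / sample_rate)
            for k, a in enumerate(amplitudes, start=1))
        for n in range(n_samples)
    ]

print(harmonic_partials(440, 4))
tone = complex_tone(440, amplitudes=[1.0, 0.5, 0.25, 0.125],
                    duration_s=0.01, sample_rate=44100)
```

Changing the `amplitudes` list changes the shape of the waveform while the perceived note stays at the fundamental, which is one ingredient of timbre.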

**Frequency modulation/amplitude modulation**:<br>
Modulating the freq of a sound gives vibrato; modulating its amplitude gives tremolo. Both are used for expressive purposes.

![envelop_model_example](https://upload.wikimedia.org/wikipedia/commons/a/a4/Amfm3-en-de.gif)
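Both effects can be sketched with a slow sine wave that modulates a faster carrier. This is only a toy version under simple assumptions (sinusoidal modulator, made-up parameter names `depth` and `deviation_hz`), not how any particular instrument or synthesizer does it.

```python
import math

def tremolo(carrier_hz, mod_hz, depth, duration_s, sample_rate):
    """Amplitude modulation: the carrier's gain swings between 1-depth and 1+depth."""
    out = []
    for n in range(round(duration_s * sample_rate)):
        t = n / sample_rate
        gain = 1.0 + depth * math.sin(2 * math.pi * mod_hz * t)
        out.append(gain * math.sin(2 * math.pi * carrier_hz * t))
    return out

def vibrato(carrier_hz, mod_hz, deviation_hz, duration_s, sample_rate):
    """Frequency modulation: instantaneous freq swings by +/- deviation_hz."""
    out = []
    index = deviation_hz / mod_hz   # modulation index
    for n in range(round(duration_s * sample_rate)):
        t = n / sample_rate
        phase = (2 * math.pi * carrier_hz * t
                 + index * math.sin(2 * math.pi * mod_hz * t))
        out.append(math.sin(phase))
    return out

trem = tremolo(440, mod_hz=5, depth=0.5, duration_s=0.5, sample_rate=8000)
vib = vibrato(440, mod_hz=5, deviation_hz=10, duration_s=0.5, sample_rate=8000)
```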

All of the above (sound envelope, complex sound, frequency modulation and amplitude modulation) contribute to the **timbre** of a sound.

# 3. Understanding Audio signals

Audio signals are continuous **analog** waves, so the question is how do we store them in digital form?
We sample the signal at equally spaced intervals and store the values.

**Pulse-code modulation (PCM)** is a method used to digitally represent sampled analog signals. It is the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications. In a PCM stream, the amplitude of the analog signal is sampled regularly at uniform intervals, and each sample is quantized to the nearest value within a range of digital steps.

![sampling_audio](https://www.technologyuk.net/telecommunications/telecom-principles/images/pcm01.gif)

To locate a sample (basically to find out when sample n occurs) we can use the formula below, where T is the sampling period, i.e. the time between two consecutive samples <br>
$$t_n = n \cdot T$$

**sampling rate**:<br>
The rate at which we sample the analog audio. The more samples we take per second, the better we can represent the original signal.
Let's say we have a signal with freq **F** and we want to sample it. The question is how do we choose the sampling rate so that the digital signal is a good representation of the original signal. We can decide the sampling freq using the *Nyquist theorem*.

**Nyquist theorem**:<br>
It says that if we are sampling an audio signal whose highest freq is **f**, we need a sampling rate of at least **2f** (strictly speaking, more than 2f). In other words, we need at least 2 samples for each cycle of the wave. If we don't respect this, the sampled wave will not be able to represent the original wave, and we get an **aliased** waveform.

For example, let's say we have a waveform with freq content up to 44 kHz and we sample it at 1000 Hz. The Nyquist frequency is then 500 Hz, which means none of the freq above 500 Hz can be captured correctly. We lose the information beyond 500 Hz, and unless those freq are filtered out first they show up as aliases below 500 Hz.
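Aliasing is easy to see numerically. In the sketch below (helper names are made up for this example), a 700 Hz cosine sampled at 1000 Hz produces exactly the same samples as a 300 Hz cosine, so after sampling the two cannot be told apart.

```python
import math

def nyquist_frequency(sample_rate):
    """Highest frequency that can be represented at this sample rate."""
    return sample_rate / 2

def aliased_frequency(freq_hz, sample_rate):
    """Apparent frequency of a real sinusoid after sampling."""
    return abs(freq_hz - sample_rate * round(freq_hz / sample_rate))

def sample_cosine(freq_hz, sample_rate, n_samples):
    # Sample n is taken at time t_n = n * T, with T = 1 / sample_rate
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

sr = 1000
alias_hz = aliased_frequency(700, sr)     # 700 Hz is above the 500 Hz Nyquist freq
high = sample_cosine(700, sr, 20)
low = sample_cosine(alias_hz, sr, 20)
same = all(abs(a - b) < 1e-9 for a, b in zip(high, low))
print(f"700 Hz sampled at {sr} Hz looks like {alias_hz} Hz, identical samples: {same}")
```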

good resource for sampling/nyquist rate/aliasing https://www.tutorialspoint.com/digital_communication/digital_communication_sampling.htm

**Quantization**:<br>
Just as sampling discretizes the time axis, quantization discretizes the y axis, which is amplitude. It determines how many bits are used to store each amplitude value, so the resolution of quantization (the bit depth) is measured in bits.
![quantization](./images/quantization.png)
For example, a CD has a bit depth of 16 bits, which means each sample can take one of 2^16 = 65536 distinct amplitude values.
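A minimal sketch of quantization, assuming samples are floats in [-1.0, 1.0] and are stored as signed integers (the usual convention for 16-bit PCM). The function names are hypothetical.

```python
def quantize(sample, bit_depth):
    """Map a sample in [-1.0, 1.0] to the nearest signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1     # 32767 for 16 bits
    min_level = -(2 ** (bit_depth - 1))      # -32768 for 16 bits
    level = round(sample * max_level)
    return max(min_level, min(max_level, level))   # clip out-of-range input

def dequantize(level, bit_depth):
    """Map an integer level back to a float amplitude."""
    return level / (2 ** (bit_depth - 1) - 1)

original = 0.3
level = quantize(original, 16)
error = abs(dequantize(level, 16) - original)   # quantization error
print(f"levels available: {2 ** 16}, stored level: {level}, error: {error:.2e}")
```

A higher bit depth means more levels and therefore a smaller quantization error.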

**How much memory is needed to store 1 min of audio at CD quality, i.e. with a sampling rate of 44100 Hz and a bit depth of 16 bits?**

==> There are 44100 samples in 1 sec and each is represented by 16 bits, so 1 sec takes 16 * 44100 = 705,600 bits. For 60 sec it becomes 16 * 44100 * 60 = 42,336,000 bits = 5,292,000 bytes, which is about 5.29 MB (or 5.05 MiB) for a single channel.
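The same arithmetic as a small helper, assuming uncompressed PCM and adding a `channels` parameter (not part of the original question) so that stereo can be computed too.

```python
def pcm_size_bytes(sample_rate, bit_depth, duration_s, channels=1):
    """Size of uncompressed PCM audio in bytes."""
    total_bits = sample_rate * bit_depth * duration_s * channels
    return total_bits // 8

mono = pcm_size_bytes(44100, 16, 60)
stereo = pcm_size_bytes(44100, 16, 60, channels=2)
print(f"mono: {mono} bytes = {mono / 1e6:.2f} MB = {mono / 2**20:.2f} MiB")
print(f"stereo: {stereo} bytes")
```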

**When we record sound digitally we use an ADC (analog-to-digital converter), which applies a low-pass filter (to remove freq above the Nyquist freq), sampling and quantization to convert the signal.**

# 4. Audio Features

We extract relevant features from the audio signal, and these features later help us solve the problem at hand using machine learning or deep learning. There are different types of features, but the 2 most important types are **time domain and freq domain features**.<br>

The **time domain** representation is the original waveform, with time on the x axis and amplitude values on the y axis. Some important time domain features are **Amplitude envelope, Root-mean-square energy, Zero crossing rate**. <br>

Sometimes it is useful to look at the waveform in the **freq domain**. We take the original time domain waveform and apply the FFT (more on it later) to obtain the freq domain representation. Some important freq domain features are **Band energy ratio, Spectral centroid, Spectral flux**.<br>

There are also features that represent both **time and freq domain**, like the Spectrogram, Mel-spectrogram and Constant-Q transform.<br>

Let's talk about the typical pipelines for time/freq domain feature extraction.

**Time-domain feature pipeline**:<br>
Let's say we have a signal that was sampled at 44.1 kHz, which means we have 44,100 samples each sec. The human ear needs roughly **10 ms** of audio to perceive a chunk distinctly. If we processed each sample on its own, each chunk (a single sample here) would last only 1/44100 ≈ 0.0227 ms, which is far shorter than a **perceivable audio chunk**. That is why we divide the signal into frames.

**Frames**:<br>
Generally we use frame sizes from 256 to 8192 samples. To find the duration of a frame we need to know the frame size **K** (in samples). The duration of each sample is 1/sr. The formula is given below <br>

$$d_f = \frac{1}{sr} \cdot K$$

Frame sizes are usually powers of 2, since this speeds up the FFT calculation.

![time_domain_pipeline](./images/time_domain_pipeline.png)
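The pipeline can be sketched in plain Python: compute the frame duration with the formula above, cut the signal into frames, then compute one value per frame for each time domain feature. The feature definitions used here (max absolute amplitude, root of the mean square, count of sign changes) are the common textbook ones, and all function names are made up for this sketch.

```python
import math

def frame_duration(frame_size, sample_rate):
    """d_f = (1 / sr) * K, in seconds."""
    return frame_size / sample_rate

def split_frames(signal, frame_size):
    """Non-overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, frame_size)]

def amplitude_envelope(frame):
    """Largest absolute amplitude in the frame."""
    return max(abs(s) for s in frame)

def rms_energy(frame):
    """Root-mean-square of the samples in the frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossings(frame):
    """Number of times consecutive samples change sign."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

sr, K = 44100, 512
signal = [math.sin(2 * math.pi * 440 * n / sr) for n in range(4 * K)]
frames = split_frames(signal, K)
features = [(amplitude_envelope(f), rms_energy(f), zero_crossings(f)) for f in frames]
print(f"frame duration: {frame_duration(K, sr) * 1000:.1f} ms, frames: {len(frames)}")
```

With K = 512 at 44.1 kHz each frame lasts a bit over 10 ms, so it is long enough to be a perceivable chunk.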

**Freq-domain feature pipeline**:<br>
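When we convert a signal from the time domain to the freq domain we run into a problem called **spectral leakage**.
Spectral leakage occurs when the frame passed to the DFT contains a non-integer number of periods of the signal. The endpoints of the frame are then discontinuous, and these discontinuities show up in the freq domain as spurious freq components that are not present in the original signal.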
![spectral_leakage_fft](./images/spectral_leakage_fft.png)
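Leakage can be reproduced with a tiny experiment. The sketch below uses a naive O(N²) DFT written with the standard library so that it runs anywhere (in practice an FFT routine would be used), and compares a frame holding exactly 4 periods with one holding 4.5 periods. The helper names are invented for this example.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive DFT; returns the magnitudes of bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def sine_frame(cycles, n):
    """One frame of n samples containing `cycles` periods of a sine."""
    return [math.sin(2 * math.pi * cycles * i / n) for i in range(n)]

def energy_outside(mags, bins):
    """Fraction of spectral energy that falls outside the given bins."""
    total = sum(m * m for m in mags)
    return sum(m * m for k, m in enumerate(mags) if k not in bins) / total

N = 64
clean = dft_magnitudes(sine_frame(4, N))     # integer number of periods
leaky = dft_magnitudes(sine_frame(4.5, N))   # endpoints of the frame do not match

clean_leak = energy_outside(clean, {4})      # all energy sits in bin 4
leaky_leak = energy_outside(leaky, {4, 5})   # energy spreads beyond the 2 nearest bins
print(f"leakage with 4 periods: {clean_leak:.2e}, with 4.5 periods: {leaky_leak:.2f}")
```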

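In order to reduce the spectral leakage problem we apply a **windowing function** to each frame. The window tapers the samples at both ends of the frame towards zero, which removes the discontinuities at the endpoints.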
![windowing](./images/windowing.png)
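To see the effect of a window, the sketch below applies a Hann window (one common choice, picked here as an assumption since the notes do not name a specific window) to a frame holding 4.5 periods and measures how much energy falls outside the bins around the true frequency. It again uses a naive DFT from the standard library, and the helper names are hypothetical.

```python
import cmath
import math

def hann_window(n):
    """Symmetric Hann window: 0 at both ends, close to 1 in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Naive DFT; returns the magnitudes of bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def energy_outside(mags, bins):
    """Fraction of spectral energy that falls outside the given bins."""
    total = sum(m * m for m in mags)
    return sum(m * m for k, m in enumerate(mags) if k not in bins) / total

N = 64
frame = [math.sin(2 * math.pi * 4.5 * i / N) for i in range(N)]   # 4.5 periods
windowed = [s * w for s, w in zip(frame, hann_window(N))]

main_bins = {3, 4, 5, 6}   # bins around the true frequency of 4.5 cycles per frame
rect_leak = energy_outside(dft_magnitudes(frame), main_bins)
hann_leak = energy_outside(dft_magnitudes(windowed), main_bins)
print(f"leakage without window: {rect_leak:.4f}, with Hann window: {hann_leak:.4f}")
```

The window does not remove leakage completely. It concentrates the energy in a few bins around the true frequency and strongly attenuates everything further away.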

But windowing gives rise to one more problem. Since the window pushes the samples at both ends of each frame towards zero, we lose information around the frame edges when we join multiple frames.
![windowing_problem](./images/windowing_problem.png)

So to solve this we finally use something called **overlapping frames**.
![overlapping](./images/overlapping.png)
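Overlapping frames are usually described by a frame size and a hop length (the number of samples between the starts of consecutive frames). The term "hop length" and the function below are not from these notes, so treat this as an illustrative sketch.

```python
def overlapping_frames(signal, frame_size, hop_length):
    """Frames start every hop_length samples; a trailing partial frame is dropped."""
    return [signal[start:start + frame_size]
            for start in range(0, len(signal) - frame_size + 1, hop_length)]

signal = list(range(10))
frames = overlapping_frames(signal, frame_size=4, hop_length=2)   # 50% overlap
for frame in frames:
    print(frame)
```

With a hop length smaller than the frame size, the samples that sit at the tapered edge of one frame fall near the middle of a neighbouring frame, so their information is not lost.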

So the final freq-domain feature pipeline looks like below
![freq_domain_pipeline](./images/freq_domain_pipeline.png)

**NOTE**: To understand spectral leakage and windowing, please refer to this video, as I feel the above explanation is not sufficient: https://www.youtube.com/watch?v=tCWU9C-LdJQ