Based on the [first part of the Speech processing tutorial series](https://towardsdatascience.com/audio-deep-learning-made-simple-part-1-state-of-the-art-techniques-da1d3dff2504/)

- A sound signal is produced by variations in air pressure. Sound signals often repeat at regular intervals so that each wave has the same shape. The height shows the intensity of the sound and is known as the amplitude. The time taken for the signal to complete one full wave is the period. The number of waves made by the signal in one second is called the frequency. The frequency is the reciprocal of the period. <br>
<img src="./images/1-sound_signal.png">
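
The relationship between period and frequency can be checked numerically. This is a minimal sketch, assuming a pure 440 Hz tone with amplitude 1.0; the tone and the helper name `sine_wave` are illustrative choices, not part of the tutorial.

```python
import math

def sine_wave(amplitude, frequency_hz, t):
    """Value of a pure tone at time t (in seconds)."""
    return amplitude * math.sin(2 * math.pi * frequency_hz * t)

frequency_hz = 440.0           # assumed example tone
period_s = 1.0 / frequency_hz  # frequency is the reciprocal of the period

# The wave has the same value one full period later.
t = 0.0003
now = sine_wave(1.0, frequency_hz, t)
one_period_later = sine_wave(1.0, frequency_hz, t + period_s)
print(period_s, now, one_period_later)
```
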
- Most sounds do not follow such simple and regular periodic patterns. But signals of different frequencies can be added together to create composite signals with more complex repeating patterns. All sounds that we hear, including the human voice, consist of waveforms like these. <br>
<img src="./images/2-sound_components.png">
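
As a rough illustration of adding signals together, the sketch below sums three sine waves into one composite signal. The amplitudes and frequencies are assumed values picked for the example; the composite still repeats, with the period of its lowest component (1/200 s here).

```python
import math

def composite(t, components):
    """Sum of sine waves; components is a list of (amplitude, frequency_hz)."""
    return sum(a * math.sin(2 * math.pi * f * t) for a, f in components)

# Assumed example components: (amplitude, frequency in Hz)
components = [(1.0, 200.0), (0.5, 400.0), (0.25, 600.0)]

t = 0.001
repeat_after = 1.0 / 200.0  # period of the lowest-frequency component
print(composite(t, components), composite(t + repeat_after, components))
```
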
- To **digitize** a sound wave, we must turn the signal into a series of numbers so that we can input it into our models. This is done by measuring the amplitude of the sound at fixed intervals of time. *Each such measurement is called a **sample**, and the sample rate is the number of samples per second. For instance, a common sampling rate is 44,100 samples per second. That means a 10-second music clip would have 441,000 samples.* <br>
<img src="./images/3-sound_sampling.png">
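
The sampling step can be sketched as follows. This is only an illustration: the 440 Hz tone stands in for a real recording, and `sample_signal` is a hypothetical helper that measures a continuous signal at fixed time intervals.

```python
import math

def sample_signal(signal, sample_rate, duration_s):
    """Measure signal(t) at fixed intervals of 1 / sample_rate seconds."""
    n_samples = int(round(sample_rate * duration_s))
    return [signal(i / sample_rate) for i in range(n_samples)]

def tone(t):
    """Assumed stand-in for a real sound: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440.0 * t)

sample_rate = 44_100  # samples per second
samples = sample_signal(tone, sample_rate, duration_s=10.0)
print(len(samples))  # 441000
```
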
- Deep learning methods typically take an "image" of the audio as input rather than the raw samples. This image is obtained by generating Spectrograms from the audio.

### Spectrogram
#### Spectrum
- Signals of different frequencies can be added together to create composite signals, representing any sound that occurs in the real-world. This means that any signal consists of many distinct frequencies and can be expressed as the sum of those frequencies. <br>
- **The Spectrum** is the set of frequencies that are combined together to produce a signal. **The Spectrum** plots all of the frequencies that are present in the signal along with the strength or amplitude of each frequency.
<img src="./images/4-spectrum.png">
- The lowest frequency in a signal is called **the fundamental frequency**. Frequencies that are whole number multiples of the fundamental frequency are known as harmonics. <br>
    * For instance, if the fundamental frequency is 200 Hz, then its harmonic frequencies are 400 Hz, 600 Hz, and so on. <br>
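
To make the idea of a Spectrum concrete, the sketch below builds a signal from a 200 Hz fundamental and two harmonics, then recovers the amplitude of each frequency with a plain discrete Fourier transform. The sample rate, window length, and amplitudes are assumptions chosen so that every component falls exactly on a frequency bin; real audio would not be this tidy.

```python
import cmath
import math

def dft_amplitudes(samples):
    """Amplitude of each frequency bin from 0 Hz up to (not including) Nyquist."""
    n = len(samples)
    amplitudes = []
    for k in range(n // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        scale = 1.0 / n if k == 0 else 2.0 / n  # the 0 Hz bin is not doubled
        amplitudes.append(scale * abs(total))
    return amplitudes

sample_rate = 1600  # assumed, in samples per second
n = 160             # 0.1 s window, so each bin is 10 Hz wide
components = [(1.0, 200.0), (0.5, 400.0), (0.25, 600.0)]  # fundamental + harmonics
samples = [sum(a * math.sin(2 * math.pi * f * i / sample_rate) for a, f in components)
           for i in range(n)]

bin_hz = sample_rate / n
peaks = [(k * bin_hz, round(a, 3))
         for k, a in enumerate(dft_amplitudes(samples)) if a > 0.1]
print(peaks)  # [(200.0, 1.0), (400.0, 0.5), (600.0, 0.25)]
```

The lowest peak (200 Hz) is the fundamental frequency, and the remaining peaks sit at whole-number multiples of it.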
    
#### Time Domain and Frequency Domain
- The waveforms that we saw earlier showing Amplitude against Time are one way to represent a sound signal. Since the x-axis shows the range of time values of the signal, we are viewing the signal in the Time domain.
- The Spectrum is an alternative way to represent the same signal. It shows Amplitude against Frequency at a moment in time. Since the x-axis shows the range of frequency values of the signal, we are viewing the signal in the Frequency domain.
<img src="./images/5-time_and_frequency_domains.png">

#### Spectrograms
- Since a signal produces different sounds as it varies over time, its constituent frequencies also vary with time. In other words, its **Spectrum** varies with time.
- **A Spectrogram** of a signal plots its **Spectrum** over time and is like a "photograph" of the signal. It plots Time on the x-axis and Frequency on the y-axis. It is as though we took the **Spectrum** again and again at different instants in time, and then joined them all together into a single plot.
- It uses different colors to indicate the Amplitude or strength of each frequency. The brighter the color, the higher the energy of the signal. Each vertical "slice" of the Spectrogram is essentially the Spectrum of the signal at that instant in time and shows how the signal strength is distributed across the frequencies found in the signal at that instant.
- In the following figure, the first picture displays the signal in the Time domain, i.e. Amplitude vs Time. It gives us a sense of how loud or quiet a clip is at any point in time, but it gives us very little information about which frequencies are present. The second picture is the **Spectrogram** and displays the signal in the Frequency domain.
<img src="./images/6-spectrogram.png">

#### Generating Spectrograms
- Spectrograms are produced using **Fourier Transforms** to decompose any signal into its constituent frequencies.
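
One common way to do this is a short-time Fourier transform: cut the signal into overlapping frames and take the Fourier transform of each frame. The sketch below is a simplified, standard-library-only version under assumed settings (800 Hz sample rate, 80-sample frames, a Hann window, hop of 40 samples); the test signal switches from 100 Hz to 300 Hz halfway through. It is not necessarily how the tutorial or any particular audio library computes its spectrograms.

```python
import cmath
import math

def stft_magnitudes(samples, frame_size, hop):
    """Return one magnitude spectrum per frame: result[frame][frequency_bin]."""
    # Hann window reduces leakage caused by cutting the signal into frames.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_size)
              for i in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        spectrum = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / frame_size)
                    for i, x in enumerate(frame)))
            for k in range(frame_size // 2)
        ]
        frames.append(spectrum)
    return frames

sample_rate = 800  # assumed, in samples per second
# One second of audio: 100 Hz for the first half, 300 Hz for the second half.
samples = [math.sin(2 * math.pi * (100.0 if i < 400 else 300.0) * i / sample_rate)
           for i in range(800)]

spectrogram = stft_magnitudes(samples, frame_size=80, hop=40)
bin_hz = sample_rate / 80  # width of each frequency bin in Hz
dominant = [max(range(len(s)), key=s.__getitem__) * bin_hz for s in spectrogram]
print(len(spectrogram), dominant[0], dominant[-1])  # 19 100.0 300.0
```

Each inner list is one vertical "slice" of the Spectrogram; plotting the slices side by side, with magnitude mapped to color, gives the usual Time vs Frequency picture.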

### Audio Deep Learning Models
<img src="./images/7-audio_DL.png">
- Most deep learning audio applications use **Spectrograms** to represent audio. They follow a procedure like this:

    1. Start with raw audio data in the form of a wave file
    2. Convert the audio data into its corresponding spectrogram
    3. (Optional) Use simple audio processing techniques to augment the spectrogram data. (Some augmentation or cleaning can also be done on the raw audio before the spectrogram conversion)
    4. Apply standard CNN architectures to process and extract feature maps that are an encoded representation of the spectrogram image
    5. The next step is to generate output predictions from this encoded representation:
        1. Classification
        2. Speech-to-text: pass through some RNN layers

### What problems does audio deep learning solve?
#### Audio classification
<img src="./images/8-audio_classification.png">
- Audio classification could be applied to detect the failure of machinery or equipment based on the sound that it produces, or in a surveillance system, to detect security break-ins.

#### Audio Separation and Segmentation
- Audio Separation involves isolating a signal of interest from a mixture of signals so that it can then be used for further processing. For instance, it can be used to separate individual people's voices from a lot of background noise, or the sound of the violin from the rest of a musical performance.
<img src="./images/9-audio_separation.png">

#### Music Genre Classification and Tagging
- Identify and categorize music based on the audio. The content of the music is analyzed to figure out the genre to which it belongs. This is a multi-label classification problem because a given piece of music might fall under more than one genre (rock, pop, jazz, ...)
<img src="./images/10-audio_genre.png">
- Speaker information (age, gender, ...) can be predicted as tags in the same way.
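
The multi-label formulation described above is often implemented by scoring each label independently with a sigmoid and keeping every label above a threshold, instead of picking a single winner. The sketch below assumes made-up model outputs (`logits`) and a threshold of 0.5; `predict_tags` is a hypothetical helper, not a function from any library.

```python
import math

def sigmoid(x):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_tags(logits, labels, threshold=0.5):
    """Keep every label whose independent probability reaches the threshold."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]

labels = ["rock", "pop", "jazz"]
logits = [2.1, 0.3, -1.7]  # made-up scores for one clip, for illustration only
print(predict_tags(logits, labels))  # ['rock', 'pop']
```

Because each label is decided on its own, a clip can receive several tags or none at all.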

#### Music Generation and Music Transcription
<img src="./images/11-music_generation.png">

#### Voice Recognition
<img src="./images/12-voice_identification.png">

#### Speech to Text and Text to Speech
- This is one of the most challenging applications because it deals not just with analyzing audio, but also with NLP and requires developing some basic language capability to decipher distinct words from the uttered sounds.
<img src="./images/13-speech_to_text.png">