# DSP - Digital Signal Processing

## Signal
- Signal is anything that serves to indicate, warn, direct, command, or the like, as a light, a gesture, an act, etc.
- Signal is anything agreed upon or understood as the occasion for concerted action.

## Types of electronic signal 

### Analog signal 
An analog signal is a <strong>continuous, time-varying</strong> signal. Analog signals are classified as simple or composite: a simple analog signal is a single sine wave, which cannot be decomposed further, whereas a composite analog signal can be decomposed into multiple sine waves.
- An analog signal can be described by its amplitude, its period (or, equivalently, its frequency), and its phase.
    - Amplitude is the maximum height of the signal.
    - Frequency is the rate at which the signal varies.
    - Phase is the position of the waveform relative to time zero.
    \begin{equation}
    a(t) = A\cos(\omega t+\phi)
    \end{equation}
    Here $A$ is the amplitude, $\omega = 2\pi f$ is the angular frequency, and $\phi$ is the phase.
    ![analog_example](images/4.1-analog-signal.jpg)
- An analog signal is not resistant to noise, so it suffers distortion that reduces transmission quality. Its values are not restricted to a fixed set of levels.
![analog](images/analogSignal.png)

### Digital signal
- A digital signal carries data in **binary** form, that is, as a sequence of bits.
- A digital signal can be decomposed into sine waves, which are called harmonics. Like an analog signal, it has amplitude, frequency, and phase. It can also be described by its bit interval and bit rate: the bit interval is the time required to transmit a single bit, and the bit rate is the number of bit intervals per second (the reciprocal of the bit interval).
![digital](images/digitalSiagnal.png)

## What is DSP?

- Digital Signal Processors (DSP) take real-world signals like voice, audio, video, temperature, pressure, or position that have been digitized and then mathematically manipulate them. 
- A DSP is designed for performing mathematical functions like "add", "subtract", "multiply" and "divide" very quickly.

# Fundamentals of Audio Signals and Systems

## 1. Sound
- Sound is a vibration that propagates as an __acoustic wave__, through a transmission medium such as a gas, liquid or solid.

### Representing a sound
The **time domain** shows how the amplitude of a signal varies with time. This is commonly referred to as the filter-out, or overall, reading. <br>
 The **frequency domain** shows the signal as a series of sine and cosine waves. These waves have a magnitude and a phase, which vary with frequency.<br>

![Fourier_transform](images/Fourier-transform.gif)

### Measuring the sound
The sound pressure level (SPL), expressed in decibels (dB), compares a measured pressure $p_1$ with a reference pressure $p_0$. The reference $p_0$ is 20 μPa (20 μPa = 0.00002 Pa), and $p_1$ is the effective pressure (root-mean-square or RMS) measured with a microphone, so that $\mathrm{SPL} = 20\log_{10}(p_1/p_0)$ dB.
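As a small illustration of the SPL definition, the sketch below converts an RMS pressure in pascals to dB SPL. The function name `spl_db` and the constant name `P_REF` are illustrative choices, not taken from any library.

```python
import math

P_REF = 20e-6  # reference pressure p0 in pascals (20 micropascals)


def spl_db(p_rms):
    """Sound pressure level in dB for an RMS pressure given in pascals."""
    return 20.0 * math.log10(p_rms / P_REF)


# A pressure equal to the reference sits at 0 dB; each factor of 10 adds 20 dB.
level_at_reference = spl_db(20e-6)
level_at_one_pascal = spl_db(1.0)
```

Because the scale is logarithmic, doubling the pressure adds about 6 dB rather than doubling the level.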

The hearing threshold is the sound level below which a person’s ear is unable to detect any sound. For adults, 0 dB is the reference level.
- A threshold shift is an increase in the hearing threshold for a particular sound frequency. It means that the hearing sensitivity decreases and that it becomes harder for the listener to detect soft sounds. Threshold shifts can be temporary or permanent.

The sound wave—the pressure disturbance of alternating high and low pressure—progresses through the air at a rate referred to as the speed of sound.

### Wavelength, Frequency, and Spectrum

![sound_wave](images/cycles.jpg)

- The amount of time required for one oscillation is known as the period of the vibration.

#### Wavelength
Wavelength is the distance traveled by the sound wave in the time of one period of the oscillation.

![weigth_length](images/weightlength.gif)

#### Frequency
Sound oscillations are commonly expressed as an oscillation rate: how many cycles of the oscillation occur in 1 s [cycles/second]. The oscillation rate is the frequency of the oscillation. It is customary to use the unit hertz (abbreviated Hz) for a cycles/second frequency measurement.
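The relationship between period, frequency, and wavelength can be sketched in a few lines. The speed of sound used here (about 343 m/s, for air at roughly 20 °C) is an assumption added for the example, and the function names are illustrative.

```python
SPEED_OF_SOUND_AIR = 343.0  # m/s, assumed value for air at about 20 degrees Celsius


def period_s(frequency_hz):
    """Time of one oscillation, in seconds."""
    return 1.0 / frequency_hz


def wavelength_m(frequency_hz, speed=SPEED_OF_SOUND_AIR):
    """Distance the wave travels during one period, in metres."""
    return speed * period_s(frequency_hz)


# Concert pitch A4 (440 Hz): the wavelength in air is a bit under 0.8 m.
a4_period = period_s(440.0)
a4_wavelength = wavelength_m(440.0)
```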

#### Spectrum

- A graph of the spectrum of a sinusoid (sine wave) looks like a single "frequency line", because a pure tone (sinusoidal waveform) has energy only at the frequency of its repetition rate.

![spectrum](images/spectrum.png)

- A sound spectrum is a representation of a sound – usually a short sample of a sound – in terms of the amount of vibration at each individual frequency.
- We can think of the sound spectrum as a sound recipe: take this amount of that frequency, add this amount of that frequency, and so on, until you have put together the whole, complicated sound.
![spectrum](images/crspec.gif)
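To make the "single frequency line" idea concrete, here is a minimal sketch that computes the magnitude spectrum of a pure tone with a naive discrete Fourier transform (DFT). The helper `dft_magnitudes` is written for this example only. In practice a fast Fourier transform would normally be used instead.

```python
import cmath
import math


def dft_magnitudes(x):
    """Normalised DFT magnitudes for bins 0..N/2 of a real signal x."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
        for k in range(n // 2 + 1)
    ]


# A sine wave that completes exactly 5 cycles within 64 samples.
num_samples = 64
tone = [math.sin(2 * math.pi * 5 * t / num_samples) for t in range(num_samples)]
magnitudes = dft_magnitudes(tone)
peak_bin = max(range(len(magnitudes)), key=lambda k: magnitudes[k])
```

All the energy lands in bin 5 and the other bins are zero up to rounding error. This is the single line in the spectrum.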

The cepstrum (the inverse Fourier transform of the logarithm of the magnitude spectrum) is a common transform used to gain information from a person's speech signal. It can be used to separate the **"excitation signal"** (which contains the words (sound) and the pitch) and the **transfer function** (which contains the voice quality).

## 2. Digital audio

Digitization refers to two processes performed by a circuit known as an analog-to-digital converter (ADC).
- The first process in the ADC is time sampling, which means a rapid and repeated measurement of the instantaneous value of the analog audio signal many times per second. Each individual measurement is a time sample. The rate at which the time sampling occurs is called the sampling rate, expressed in samples per second [Hz].
![timesampling](images/timesampling.gif)
- The second process in the ADC is quantization, which means representing each waveform sample with an integer value. The precision of the measurement is typically expressed by the number of digital bits used for each sample.
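The two ADC steps can be imitated in software. The sketch below samples a sine wave at a chosen rate and then quantizes each sample to a signed integer. The function name, its parameters, and the choice of symmetric rounding are assumptions made for illustration. A real converter may quantize differently.

```python
import math


def sample_and_quantize(freq_hz, sample_rate_hz, bits, num_samples):
    """Sample a unit-amplitude sine wave and quantize it to signed integers."""
    max_level = 2 ** (bits - 1) - 1           # e.g. 32767 for 16 bits
    samples = []
    for i in range(num_samples):
        t = i / sample_rate_hz                # time sampling: instant of sample i
        value = math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(value * max_level))  # quantization: nearest integer
    return samples


# One period of a 1 kHz tone sampled at 8 kHz with 16-bit precision.
pcm = sample_and_quantize(1000, 8000, 16, 8)
```

More bits per sample give a finer amplitude grid, and a higher sampling rate gives more measurements per second.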

### Bit rate - audio quality 
- Bit rate (bitrate or as a variable R) is the number of bits that are conveyed or processed per unit of time.
- The bit rate is quantified using the bits per second unit (symbol: "bit/s"), often in conjunction with an SI prefix such as "kilo" (1 kbit/s = 1,000 bit/s), "mega" (1 Mbit/s = 1,000 kbit/s), "giga" (1 Gbit/s = 1,000 Mbit/s) or "tera" (1 Tbit/s = 1,000 Gbit/s). The non-standard abbreviation "bps" is often used to replace the standard symbol "bit/s", so that, for example, "1 Mbps" is used to mean one million bits per second.
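For uncompressed (PCM) audio, the bit rate follows directly from the sampling rate, the number of bits per sample, and the number of channels. The helper below is an illustrative sketch of that arithmetic. Compressed formats do not follow this formula.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed PCM audio in bit/s."""
    return sample_rate_hz * bits_per_sample * channels


# CD-quality audio: 44.1 kHz, 16 bits per sample, stereo.
cd_bit_rate = pcm_bit_rate(44_100, 16, 2)
cd_bit_rate_kbit = cd_bit_rate / 1000
```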

# Human voice processing

## 1. The Source Filter Model of Speech

### 1.1 Speech

The components of speech are the **words** and the **voice**.

To speak, air is first released over the vocal cords, which expand and contract to give the air column structure. This is the biological concept of words. The words are then passed through the vocal tract where they are shaped, giving them intonation. This shaping of the words is the biological concept of voice.
=> So the words are the information, known as the source, and a filter is needed to transform the source into the voice that comes out as speech.

The source filter model is a model of speech in which the spoken word consists of a source component originating from the vocal cords, which is then shaped by a filter imitating the effect of the vocal tract.
![source_filter](images/source_filter.png)

### 1.2 Signal Processing Considerations

The source filter model can easily be extended to signal processing. The source is simply a signal __x(t)__. This signal is the input to the filter and is called the __excitation signal__ since it excites the vocal tract.

The vocal tract acts as a filter: it is a linear time-invariant system with **impulse response h(t)**. This is sometimes called the **transfer function** of speech since it is what transfers the excitation signal to speech, that is, it adds voice to words. (Strictly speaking, the transfer function is the frequency-domain counterpart of h(t).)

Speech is the output y(t) of the source signal x(t) passed through the filter with impulse response h(t). Thus, the output is given by y(t) = x(t) ∗ h(t), where ∗ denotes convolution.
![speech_model](images/speech_modeling.png)

### 1.3 Convolution
- Convolution is a formal mathematical operation, just like multiplication, addition, and integration. Addition takes two numbers and produces a third number, while **convolution takes two signals and produces a third signal**. 
- Convolution is used in the mathematics of many fields, such as probability and statistics. In linear systems, convolution is used to **describe the relationship between three signals of interest: the input signal (excitation signal x(t)), the impulse response (transfer function h(t)), and the output signal**.
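A discrete-time version of y = x ∗ h can be written directly from the definition. The sketch below is a plain-Python illustration. The excitation (a short impulse train) and the impulse response (a decaying sequence) are made-up toy values, not measured speech data. NumPy users would typically call `numpy.convolve` instead.

```python
def convolve(x, h):
    """Discrete convolution: y[n] = sum over k of x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, x_i in enumerate(x):
        for j, h_j in enumerate(h):
            y[i + j] += x_i * h_j
    return y


# Toy source-filter example: an impulse train "excites" a decaying filter.
excitation = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
impulse_response = [1.0, 0.5, 0.25]
speech = convolve(excitation, impulse_response)
```

Each impulse in the excitation launches one copy of the impulse response, so the output is the sum of shifted, scaled copies of h.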

Linear system : http://www.dspguide.com/ch5/7.htm

### 1.4 Deconvolution
- Deconvolution is exactly what it sounds like: the undoing of convolution. This means that instead of mixing two signals like in convolution, we are isolating them.
- This is useful for analyzing the characteristics of the input signal and the impulse response when only given the output of the system.
![deconvolutions](images/deconvolutions.png)
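When the impulse response is known exactly and the output is noise-free, deconvolution reduces to polynomial long division. The sketch below illustrates that idealized case only. The function names are hypothetical, and real speech analysis has to estimate both signals (for example with LPC or the cepstrum) rather than divide by a known h.

```python
def convolve(x, h):
    """Discrete convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, x_i in enumerate(x):
        for j, h_j in enumerate(h):
            y[i + j] += x_i * h_j
    return y


def deconvolve(y, h):
    """Recover x from y = x * h by long division (requires h[0] != 0)."""
    x = [0.0] * (len(y) - len(h) + 1)
    for n in range(len(x)):
        # Remove the part of y[n] explained by already recovered samples.
        known = sum(h[k] * x[n - k] for k in range(1, len(h)) if n >= k)
        x[n] = (y[n] - known) / h[0]
    return x


original = [1.0, 2.0, 3.0, -1.0]
h = [1.0, 0.5, 0.25]
recovered = deconvolve(convolve(original, h), h)
```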

## 2. Linear Predictive Coding (LPC) in Voice Conversion

- Linear Predictive Coding (or "LPC") is a method of predicting a sample of a speech signal **based on several previous samples.**
- We can use the LPC coefficients to separate a speech signal into two parts: the transfer function h(t) (which contains the vocal quality) and the excitation x(t) (which contains the pitch and the sound).

\begin{equation} \hat{s}[n] = \sum_{k=1}^{p} {a_k}{s[n-k]} \end{equation}

- The number of previous samples (p) is referred to as the "order" of the LPC. <br>
    As p approaches infinity, we should be able to predict the nth sample exactly. However, p is usually on the order of ten to twenty, which provides an accurate enough representation at a limited computational cost.<br> 
- The weights on the previous samples ($a_k$) are chosen to minimize the squared error between the real sample and its predicted value.<br>
    Thus, we want the error signal e[n], which is sometimes referred to as the LPC residual, to be as small as possible:<br>
    \begin{equation} e[n] = s[n]- \hat{s}[n]= s[n]- \sum_{k=1}^{p} {a_k}{s[n-k]} \end{equation}

We can take the z-transform of the above equation:
\begin{equation} E(z) = S(z) - \sum_{k=1}^{p} {a_k}{S(z)z^{-k}} = S(z)\left[1 - \sum_{k=1}^{p}{a_k}z^{-k}\right] = S(z)A(z) \end{equation}

We can represent our original speech signal S(z) as the product of the error signal E(z) and the transfer function 1 / A(z):
\begin{equation}  S(z) = \frac{E(z)}{A(z)} \end{equation}

In speech processing, computing the LPC coefficients of a signal gives us its $a_k$ values. <br>

- We can get the filter A(z) as described above. A(z) is the transfer function between the original signal s[n] and the excitation component e[n]. <br>
    - The transfer function of a speech signal is the part dealing with the voice quality: what distinguishes one person's voice from another. The excitation component of a speech signal is the part dealing with the particular sounds and words that are produced. <br>
- In the time domain, the excitation and the impulse response of the transfer function are convolved to create the output voice signal. As shown in the figure below, we can put the original signal through the filter A(z) to get the excitation component.<br>
    - Putting the excitation component through the inverse filter (1 / A(z)) gives us the original signal back.<br>
    ![LPC](images/LPC_al.png)
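The analysis and synthesis steps above can be sketched in plain Python. This example assumes the autocorrelation method with the Levinson-Durbin recursion for finding the $a_k$. The notes do not say which estimation method is intended, so treat that choice, the function names, and the toy two-pole "speech" signal as assumptions.

```python
def autocorrelation(s, max_lag):
    """r[k] = sum over n of s[n] * s[n - k], for k = 0..max_lag."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(max_lag + 1)]


def lpc_coefficients(s, order):
    """Estimate a_1..a_p (autocorrelation method, Levinson-Durbin recursion)."""
    r = autocorrelation(s, order)
    a = [0.0] * (order + 1)        # a[k] holds a_k; a[0] is unused
    error = r[0]                   # prediction error energy so far
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def residual(s, a):
    """e[n] = s[n] - sum_k a_k s[n-k], i.e. s passed through A(z)."""
    return [s[n] - sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n >= k)
            for n in range(len(s))]


def synthesize(e, a):
    """s[n] = e[n] + sum_k a_k s[n-k], i.e. e passed through 1 / A(z)."""
    s = []
    for n in range(len(e)):
        s.append(e[n] + sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n >= k))
    return s


# Toy "speech": the impulse response of a known, stable two-pole filter.
true_a = [1.3, -0.6]
speech = synthesize([1.0] + [0.0] * 199, true_a)

a = lpc_coefficients(speech, order=2)   # estimated a_k, close to true_a
e = residual(speech, a)                 # excitation (LPC residual)
rebuilt = synthesize(e, a)              # back to the original signal
```

Passing a different excitation to `synthesize` with the same `a` corresponds to the excitation swap described in this section: the filter 1 / A(z), and with it the voice quality, stays the same.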

We can perform voice conversion by **replacing the excitation component** from the given speaker with a new one. Since we are still using the same transfer function 1 / A(z), the resulting speech sample will have the same voice quality as the original.

## 3. Changing Pitch with PSOLA (Pitch-Synchronous Overlap and Add) for Voice Conversion

### 3.1 Pitch period

- Voiced sounds such as vowels have a periodic structure, i.e., their waveform repeats itself after a certain time, which is called the pitch period $T_P$. Its reciprocal $f_P = 1/T_P$ is called the pitch frequency.

- There are a number of algorithms for pitch period estimation. They fall into two broad categories: time-domain and frequency-domain algorithms.
- Time-domain algorithms attempt to determine the **pitch directly** from the speech waveform.
- Frequency-domain algorithms use some form of spectral analysis to determine the **pitch period**.
- Pitch modification (also called pitch change or pitch scaling) means transposing the pitch without changing the other characteristics of the sound, so that the content of the speech is unaffected.

### 3.2 Pitch shifting

- Pitch is what makes some sounds sharper (higher) than others. It is determined by the number of vibrations produced in a given time interval. This vibration rate of a sound is called its frequency: the higher the frequency, the higher the pitch. 
- The aim of pitch shifting algorithms is to create a change in pitch without creating a change in the replay rate. Pitch shifting can be done by performing a time stretch using PSOLA and resampling.

### 3.3 PSOLA - Pitch-Synchronous Overlap-Add

- PSOLA is a method used to manipulate the pitch of a speech signal to match it to that of the target speaker.

- PSOLA is based on decomposing a signal into a series of elementary waveforms in such a way that each waveform represents one of the successive pitch periods of the signal and their sum (overlap-add) reconstitutes the signal.

There are several types of PSOLA, such as Time-Domain PSOLA (TD-PSOLA), Frequency-Domain PSOLA (FD-PSOLA), and Linear-Predictive PSOLA (LP-PSOLA).

### 3.4 TD-PSOLA algorithm

The TD-PSOLA algorithm allows **pitch modification** of a given speech signal **without changing its duration, and vice versa**.

1. Analysis 
    - The original speech signal is first divided into separate but often overlapping short-term (ST) analysis signals. The short-term signals $x_m(n)$ are obtained from the digital speech waveform $x(n)$ by multiplying it by a sequence of pitch-synchronous analysis windows $h_m(n)$:
    \begin{equation} x_m(n) = h_m(t_m-n)x(n) \end{equation}
    where m is the index of the short-term signal.

2. The windows
    - Windowing is the process of taking a small subset of a larger dataset for processing and analysis. A naive approach, the rectangular window, simply truncates the dataset before and after the window and leaves the contents of the window unmodified.
    - Windowing is equivalent to multiplying the signal samples by a window function of the same length. A window must be applied to the data to minimize signal 'leakage' effects.
    - The windows are centered on the successive instants $t_m$, called pitch marks. These marks are set at a pitch-synchronous rate on the voiced parts of the signal and at a constant rate on the unvoiced parts. <br>
    ref: https://think-engineer.com/blog/dsp/window-functions-and-how-we-use-them-in-dsp , 
    https://www.slideshare.net/ramagianhendraloka/windowing-signal-processing?from_action=save

3. The modification
    - Each frame is modified according to the target. In the synthesis step, the modified segments are then recombined by overlap-adding.

![pitch_change](images/pitchchange.png)
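The three steps can be sketched for the simplest possible case: a voiced signal whose pitch period is constant and already known, so that the pitch marks are evenly spaced. This is a heavily simplified illustration, not a complete TD-PSOLA implementation. It assumes a Hann window two periods long, picks the nearest analysis segment for each synthesis mark, and applies no amplitude normalization. The function names are hypothetical.

```python
import math


def hann(length):
    """Periodic Hann window; copies shifted by length / 2 sum to one."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / length) for i in range(length)]


def td_psola(x, period, factor):
    """Scale the pitch of x by `factor` while keeping its duration."""
    window = hann(2 * period)
    # 1. Analysis: two-period windowed segments centred on the pitch marks t_m.
    marks = list(range(period, len(x) - period + 1, period))
    segments = [[x[t - period + i] * window[i] for i in range(2 * period)]
                for t in marks]
    # 2. Modification: the synthesis marks are spaced by the new pitch period.
    new_period = round(period / factor)
    out = [0.0] * len(x)
    t_syn = period
    while t_syn + period <= len(x):
        # Reuse the analysis segment whose mark is closest in time.
        m = min(range(len(marks)), key=lambda k: abs(marks[k] - t_syn))
        # 3. Synthesis: overlap-add the segment at its new position.
        for i, value in enumerate(segments[m]):
            out[t_syn - period + i] += value
        t_syn += new_period
    return out


# Periodic test signal with a pitch period of 40 samples.
x = [math.sin(2 * math.pi * n / 40) + 0.5 * math.sin(4 * math.pi * n / 40)
     for n in range(800)]
unchanged = td_psola(x, period=40, factor=1.0)   # reproduces x away from the edges
higher = td_psola(x, period=40, factor=1.25)     # new pitch period: 32 samples
```

With `factor=1.0` the overlapping Hann windows sum to one, so the input is reconstructed. With `factor=1.25` the output has the same length but repeats every 32 samples instead of every 40.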

### 3.5 Pitch Shifting by Time Stretching using PSOLA and Resampling

Pitch shifting by time stretching and resampling involves simply performing a time stretch with PSOLA, as described earlier, and then resampling in order to return the sound to its original length. 
- Expanding the sound by time stretching and then resampling creates a higher pitch, while compressing and resampling creates a lower pitch.
    ![pitchshift](images/PitchShift.png)

### 3.6 Pitch detection
- Pitch determination is essential for many speech processing tasks and applications, including the classification of a speech signal into voiced and unvoiced regions.

There are several types of pitch detection algorithms such as:
- Time-domain analysis like Autocorrelation method, AMDF (Average Magnitude Difference Function)
- Frequency-domain analysis like Cepstrum, Harmonic product spectrum, Chens heuristic method
- Others like Maximum likelihood, Simple inverse filter tracking (SIFT), Neural network approaches
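As an example of the time-domain family, the autocorrelation method can be sketched as follows: the lag at which the signal best matches a shifted copy of itself is taken as the pitch period. The search range of 50 to 500 Hz and the function name are assumptions chosen for this illustration. Practical detectors add normalization and a voiced/unvoiced decision.

```python
import math


def detect_pitch(signal, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the pitch in Hz from the peak of the autocorrelation."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and itself delayed by `lag` samples.
        value = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# 0.1 s of a 200 Hz tone sampled at 8 kHz (pitch period: 40 samples).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
pitch_hz = detect_pitch(tone, sample_rate)
```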