# Exercise 8: Sound transformations

In this exercise you will use the HPS model to creatively transform sounds. There are two parts in this exercise. In the first one you should perform a natural sounding transformation on the speech sound that you used in the previous exercise (E7). In the second part you should select a sound of your choice and do a "creative" transformation. You will have to write a short description of the sound and of the transformation you did, giving the link to the original sound and uploading several transformed sounds.

For this exercise, you can use `transformations_GUI.py` (in `software/transformations_interface/`) to try things out; once you have decided on the settings, fill in the code in this file. You can also do everything from here and add any new code you wish.

To perform a good or interesting transformation, make sure the analysis is adequate for the type of transformation you want to do. Not every HPS analysis representation will work for every type of sound transformation: some things in the analysis, when modified, will produce undesired artifacts. In general, for any transformation, it is best to have harmonic values that are as smooth and continuous as possible and a stochastic representation that is as smooth and has as few values as possible. It might be much better to start with an analysis representation that does not give the best reconstruction in exchange for smoother and more compact data.
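One rough way to judge how "smooth and continuous" an analysis is: measure what fraction of frames have a tracked fundamental and how large the frame-to-frame pitch jumps are. The sketch below is only an illustration. The function name `track_smoothness` is hypothetical, and it assumes that a value of 0 marks a frame where no fundamental was found.

```python
import numpy as np

def track_smoothness(f0_track):
    """Return (voiced_fraction, mean_abs_jump_cents) for a per-frame f0 track.

    Assumption: a value of 0 marks a frame where no f0 was found.
    """
    f0 = np.asarray(f0_track, dtype=float)
    voiced = f0 > 0
    voiced_fraction = float(voiced.mean()) if f0.size else 0.0
    # Only compare consecutive frames that are both voiced
    both = voiced[1:] & voiced[:-1]
    if not both.any():
        return voiced_fraction, 0.0
    cents = 1200.0 * np.abs(np.log2(f0[1:][both] / f0[:-1][both]))
    return voiced_fraction, float(cents.mean())

# Hypothetical track: steady 200 Hz, one octave error, then an unvoiced frame
example = [200.0, 200.0, 400.0, 200.0, 0.0]
print(track_smoothness(example))
```

With an HPS analysis, the first column of the harmonic frequency array (for instance `hfreq[:, 0]`) could be passed in. A large mean jump usually points to octave errors or unstable tracking, which tend to become audible once the frequencies are scaled.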

To help you with the exercise, we give a brief description of the transformation parameters used by the HPS transformation function:

1. `freqScaling`: frequency scaling factors to be applied to the harmonics of the sound, in time-value pairs. The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The scaling factor is multiplicative, so a value of 1 means no change. Example: to transpose the sound up an octave you can specify `[0, 2, 1, 2]`.
2. `freqStretching`: frequency stretching factors to be applied to the harmonics of the sound, in time-value pairs. The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The stretching factor is a multiplicative factor whose effect depends on the harmonic number: higher harmonics are affected more than lower ones, which results in an inharmonic effect. A value of 1 results in no transformation. Example: an array like `[0, 1.2, 1, 1.2]` will result in a perceptually large inharmonic effect.
3. `timbrePreservation`: 1 preserves the original timbre, 0 does not. It can only have a value of 0 or of 1. By setting the value to 1 the spectral shape of the original sound is preserved even when the frequencies of the sound are modified. In the case of speech it would correspond to the idea of preserving the identity of the speaker after the transformation.
4. `timeScaling`: time scaling factors to be applied to the whole sound, in time-value pairs (value of 1 is no scaling). The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The time scaling factor is a multiplicative factor, thus 1 is no change. Example: to stretch the original sound to twice the original duration, we can specify `[0, 0, 1, 2]`.
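To make the difference between scaling and stretching concrete, here is a minimal sketch for a single frame. It assumes the stretching factor is raised to the power of the zero-based harmonic index, so the fundamental is left untouched and each higher harmonic is pushed further; this is an assumption for illustration, and the library's implementation may differ in detail. The function name `scale_and_stretch` is hypothetical.

```python
import numpy as np

def scale_and_stretch(harmonic_freqs, scaling, stretching):
    """Apply a scaling factor and a harmonic-dependent stretching factor.

    Assumption: harmonic k (zero-based) is multiplied by stretching ** k,
    so the fundamental is not affected by stretching.
    """
    freqs = np.asarray(harmonic_freqs, dtype=float)
    k = np.arange(freqs.size)
    return freqs * scaling * stretching ** k

harmonics = [200.0, 400.0, 600.0, 800.0]
print(scale_and_stretch(harmonics, 2.0, 1.0))  # octave up, still harmonic
print(scale_and_stretch(harmonics, 1.0, 1.2))  # upper partials pushed sharp
```

Under this assumption a scaling of 2 keeps the partials at integer multiples of the new fundamental, while a stretching of 1.2 breaks the integer relationship, which is why it is heard as inharmonic.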

All the transformation values can have as many points as desired, but they have to be in the form of an array with time-value pairs, so of even size. For example a good array for a frequency stretching of a sound that has a duration of 3.146 seconds could be: `[0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]`.
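A time-value array like the one above can be turned into one value per analysis frame by linear interpolation. The sketch below shows the idea with `np.interp`; the helper name `envelope_from_pairs` and the choice to spread the frames evenly between the first and last time value are assumptions for illustration, not the library's code.

```python
import numpy as np

def envelope_from_pairs(pairs, n_frames):
    """Interpolate an array of time-value pairs into n_frames values."""
    pairs = np.asarray(pairs, dtype=float)
    if pairs.size == 0 or pairs.size % 2 != 0:
        raise ValueError("expected a non-empty array of time-value pairs (even size)")
    times, values = pairs[::2], pairs[1::2]
    if np.any(np.diff(times) < 0):
        raise ValueError("time values must be non-decreasing")
    # Assumption: frames are spread evenly from the first to the last time value
    frame_times = np.linspace(times[0], times[-1], n_frames)
    return np.interp(frame_times, times, values)

stretching = [0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]
print(envelope_from_pairs(stretching, 5))
```

The envelope stays at 1.2 until 2.01 s, ramps down to 0.7 by 2.679 s, and then holds that value to the end of the sound.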

## Part 1. Perform natural sounding transformations of a speech sound

Use the HPS model with the sound `speech-female.wav`, available in the sounds directory, to first analyze the sound and then obtain a natural sounding transformation of it. The synthesized sound should sound as different as possible from the original while still sounding natural. By natural we mean that it should sound like speech that a human could have produced: on listening we should take it for a speech sound, even though we might not be able to understand it. You should first make sure that you start from a good analysis; then you can do time and/or frequency scaling transformations. The transformation should be done in a single pass, with no mixing of sounds coming from different transformations. Since you used the same sound in E7, use that experience to get a good analysis, but bear in mind that the analysis, now that it will feed a very strong transformation, might need to be done differently from what you did in E7.

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [1]:
# Setup for Google Colab: install sms-tools and download the course materials (comment these lines out when running locally)
!pip install sms-tools
!git clone https://github.com/MTG/sms-tools-materials.git
!pip install numpy==1.23.5


In [1]:
import numpy as np
from scipy.signal import get_window
import matplotlib.pyplot as plt
import IPython.display as ipd

from smstools.models import utilFunctions as UF
from smstools.models import stft as STFT
from smstools.models import hpsModel as HPS
from smstools.transformations import hpsTransformations as HPST
from smstools.transformations import harmonicTransformations as HT

In [2]:
from google.colab import files
uploaded = files.upload()

Saving speech-female.wav to speech-female.wav


In [3]:
# 1.1 perform an analysis/synthesis using the HPS model

input_file = 'speech-female.wav'

### set the parameters
window = 'blackman'
M = 1024             # Window size: good balance between time & frequency resolution
N = 2048             # FFT size: power of 2, ≥ M
t = -80              # Threshold in dB for peak detection
minSineDur = 0.1     # Minimum duration of a harmonic track in seconds
nH = 50              # Max number of harmonics to track
minf0 = 80           # Minimum expected pitch frequency (Hz) for female voice
maxf0 = 400          # Maximum expected pitch frequency (Hz)
f0et = 5             # Error threshold in the f0 detection
harmDevSlope = 0.01  # Slope of the allowed harmonic deviation (higher harmonics may deviate more)
stocf = 0.2          # Decimation factor of the stochastic envelope (smaller = fewer values, smoother)

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

**Question E8 - 1.2:**

Explain the parameter choices you made in 1.1.

________

In [11]:
# 1.3 Perform a transformation from the previous analysis

freqScaling = np.array([0.0, 1.2, 1.0, 1.2])      # Increase pitch 20%
freqStretching = np.array([0.0, 1.0, 1.0, 1.0])   # No spectral stretching
timbrePreservation = 1                           # Preserve spectral envelope
timeScaling = np.array([0.0, 0.0, 1.0, 1.2])      # Stretch entire sound by 1.2×

# no need to modify the following code
Ns = 512
H = 128

# frequency scaling of the harmonics
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

**Question E8 - 1.4:**

explain your transformations

___

## Part 2. Perform creative transformations with a sound of your choice

Pick any natural and harmonic sound from Freesound and use the HPS model to do the most creative and interesting transformation you can come up with, one that sounds as different as possible from the original sound.

It is essential that you start with a natural harmonic sound. Examples include (but are not limited to) any acoustic harmonic instrument, speech, harmonic sounds from nature, etc. As long as the sound has a harmonic structure, you can use it. You can even reuse the sound you used in E7 Part 2, or upload your own sound to Freesound and then use it.

The sound from Freesound can be in any format, but to use the sms-tools software you will first have to convert it to a monophonic file (one channel) with a sampling rate of 44100 Hz and 16-bit samples.
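Before running the analysis it can save time to confirm that the converted file really has the expected format. The following is a small sketch using Python's standard `wave` module; the function name `wav_format_problems` is hypothetical. Note that `wave` only reads PCM WAV files, so other formats have to be converted to WAV first with an audio editor or converter of your choice.

```python
import wave

def wav_format_problems(path, rate=44100, channels=1, sample_width_bytes=2):
    """Return a list describing how a PCM WAV file differs from the expected format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        if wav.getframerate() != rate:
            problems.append(f"{wav.getframerate()} Hz, expected {rate} Hz")
        if wav.getsampwidth() != sample_width_bytes:
            bits = 8 * wav.getsampwidth()
            problems.append(f"{bits}-bit samples, expected {8 * sample_width_bytes}-bit")
    return problems
```

An empty list means the file is mono, 44100 Hz, and 16-bit; for example, `wav_format_problems('speech-female.wav')` could be called right before `UF.wavread`.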

You can do any interesting transformation with a single pass. It is not allowed to mix sounds obtained from different transformations. The transformed sound need not sound natural. So, time to show some creativity!

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [12]:
# 2.1 perform an analysis/synthesis using the HPS model

### set the parameters
input_file = 'speech-female.wav'
window = 'blackman'     # Good for speech analysis due to low spectral leakage
M = 1024                # Window size (samples); need not be a power of 2
N = 1024                # FFT size: power of 2, ≥ M (same as M for no zero-padding)
t = -80                 # Threshold in dB (harmonic detection sensitivity)
minSineDur = 0.1        # Minimum duration (sec) for a peak to be considered harmonic
nH = 60                 # Max number of harmonics (usually 40–80 for speech)
minf0 = 70              # Minimum f0 (Hz) – suitable for low female voice
maxf0 = 500             # Maximum f0 (Hz) – suitable for high female voice
f0et = 5                # Error threshold in the f0 detection
harmDevSlope = 0.01     # Allowed harmonic deviation slope
stocf = 0.2             # Decimation factor of the stochastic envelope (smaller = fewer values, smoother)

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

**Question E8 - 2.2:**

explain your parameter choices

___

In [23]:
# 2.3 Perform a transformation from the previous analysis

### define the transformations
# freqScaling = np.array([XX,  ])
# freqStretching = np.array([XX, ])
# timbrePreservation = X
# timeScaling = np.array([XX, ])

freqScaling = np.array([0, 1.2, 1651, 1.2])      # Raise pitch 20% across entire clip
freqStretching = np.array([0, 1.0, 1651, 1.0])   # No harmonic stretch
timbrePreservation = 1
timeScaling = np.array([0, 0, 1, 2])


# no need to modify anything after this
Ns = 512
H = 128

# frequency scaling of the harmonics
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

**Question E8 - 2.4:**

explain your transformations


___


1. Frequency Scaling (`freqScaling`)

`freqScaling = np.array([0, 1.2, 1651, 1.2])`

This array controls how the frequencies of the harmonics are scaled over time. It is an array of time-value pairs: `[time1, value1, time2, value2, ...]`.

At time 0 (the start of the sound) the scaling factor is 1.2, so every harmonic frequency is multiplied by 1.2, a 20% increase (roughly 3 semitones). At time 1651 the factor is still 1.2. Since the time values can be normalized or given in seconds, what matters is their position relative to the last one, so 1651 in effect just marks the end of the sound and `[0, 1.2, 1, 1.2]` would describe the same constant envelope.

The result is a constant upward pitch shift over the whole sound.

2. Frequency Stretching (`freqStretching`)

`freqStretching = np.array([0, 1.0, 1651, 1.0])`

This parameter controls the stretching of the harmonics, which can create inharmonic effects. It is also an array of time-value pairs. Here the stretching factor is kept at 1.0 throughout the sound, so no stretching is applied and the original harmonic relationships are preserved.

3. Timbre Preservation (`timbrePreservation`)

`timbrePreservation = 1`

This is a binary setting (0 or 1) that determines whether the original timbre (spectral envelope) of the sound is preserved during the transformation. Setting it to 1 preserves the timbre: even though the frequencies of the harmonics are altered, the overall spectral shape and character of the sound are maintained, which keeps it from sounding too artificial.
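The idea of timbre preservation can be sketched for one frame: rather than letting each harmonic carry its magnitude to the new frequency, the magnitudes are read again from the original spectral envelope at the shifted frequencies. The sketch assumes the envelope is a straight-line interpolation through the original harmonic peaks; the function name `preserve_envelope` is hypothetical, and the library may estimate the envelope differently.

```python
import numpy as np

def preserve_envelope(freqs, mags_db, new_freqs):
    """Read the envelope defined by (freqs, mags_db) at new_freqs.

    Assumption: the envelope is a linear interpolation through the harmonic peaks.
    """
    freqs = np.asarray(freqs, dtype=float)
    mags_db = np.asarray(mags_db, dtype=float)
    # np.interp holds the edge values outside the analyzed frequency range
    return np.interp(new_freqs, freqs, mags_db)

freqs = np.array([200.0, 400.0, 600.0, 800.0])
mags = np.array([-20.0, -30.0, -40.0, -50.0])
shifted = freqs * 1.2  # 240, 480, 720, 960 Hz
print(preserve_envelope(freqs, mags, shifted))
```

Without preservation the shifted harmonics would keep the magnitudes -20, -30, -40, -50 dB, so the whole spectral shape (and with it the formants) would move up along with the pitch.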

4. Time Scaling (`timeScaling`)

`timeScaling = np.array([0, 0, 1, 2])`

This parameter controls the time scaling of the whole sound. The time values here are normalized, from 0 to 1. The start of the input (time 0) maps to the start of the output (value 0), and the end of the input (time 1) maps to the value 2, so the output lasts twice as long as the original and the speech is heard at half speed.

In Summary

The transformations applied to the sound are:

1. Pitch shift: raising the pitch of the sound by 20% across the entire clip using `freqScaling`.
2. Timbre preservation: maintaining the original timbre using `timbrePreservation` to make the pitch shift sound more natural.
3. Time stretch: doubling the duration of the sound using `timeScaling`; this value can be changed as desired.
4. No frequency stretching is applied (`freqStretching`), which preserves the natural harmonic relationships in the sound.

The goal of these transformations is to make the sound higher-pitched and twice as long while still sounding relatively natural thanks to timbre preservation. This can lead to interesting creative effects without making the sound completely unrecognizable.
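Following the parameter description earlier in this exercise, where `[0, 0, 1, 2]` stretches the sound to twice its original duration, the expected length of the output can be estimated before synthesizing. The sketch below assumes each pair is (input time, output time) and that only the largest value of each kind sets the overall ratio; the function name `output_duration` is hypothetical.

```python
def output_duration(input_duration, time_scaling):
    """Estimate the duration after time scaling.

    Assumption: pairs are (input time, output time), and the overall ratio
    is max(output times) / max(input times).
    """
    if len(time_scaling) == 0 or len(time_scaling) % 2 != 0:
        raise ValueError("expected a non-empty list of time-value pairs (even size)")
    in_times = time_scaling[0::2]
    out_times = time_scaling[1::2]
    max_in = max(in_times)
    if max_in <= 0:
        raise ValueError("the largest input time must be positive")
    return input_duration * max(out_times) / max_in

print(output_duration(3.146, [0, 0, 1, 2]))    # twice the input duration
print(output_duration(3.146, [0, 0, 1, 1.2]))  # 20% longer than the input
```

Comparing this estimate with `len(y) / fs` after synthesis is a quick check that the time scaling array was read the way you intended.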