
Add Stereo to Mono Conversions #877

Open
RemonComputer opened this issue Aug 12, 2020 · 18 comments

Comments

@RemonComputer

🚀 Feature

Add mono-to-stereo and stereo-to-mono conversion

Motivation

You have done an amazing job, but stereo-to-mono conversion (and vice versa) is simple, and it seems you missed it.
I think it can be done with a simple mean over the channel dimension.

Pitch

This should be a simple transform like ToMono(channel_first=True) in torchaudio.transforms
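A rough sketch of what such a transform might look like, following the mean-over-channels idea above (the class name and `channel_first` parameter mirror the pitch; this is illustrative, not torchaudio's API):

```python
import torch


class ToMono(torch.nn.Module):
    """Illustrative sketch: downmix by averaging over the channel dimension."""

    def __init__(self, channel_first: bool = True):
        super().__init__()
        # channel dim is 0 for <channel, time>, -1 for <time, channel>
        self.dim = 0 if channel_first else -1

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # keepdim=True preserves a singleton channel dimension
        return waveform.mean(dim=self.dim, keepdim=True)


stereo = torch.rand(2, 16000)   # <channel, time>
mono = ToMono()(stereo)
print(mono.shape)               # torch.Size([1, 16000])
```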

@ezyang ezyang transferred this issue from pytorch/pytorch Aug 12, 2020
@RemonComputer
Author

Attached is a small implementation of mine:
transformations.py.txt

@mthrok
Collaborator

mthrok commented Aug 14, 2020

Hi @RemonComputer

Thanks for the suggestion.

I think this problem is underspecified. I am not sure there is one and only one way to convert stereo to monaural. Taking the average is the simplest way, but adjusting the power level of each signal before mixing is also a popular approach. So I imagine there will be a wide variety of opinions on what "stereo to monaural conversion" should do.

Adding this to the library would increase the maintenance cost, and I do not see much value in it, especially when it is so easy for users to write their own. Users can pick their favorite conversion algorithm and write it in only a few lines of code.

@RemonComputer
Author

@mthrok OK, if I can help with anything, please let me know.

@bagustris

I support this feature request.

Although the conversion from stereo to mono can be done in just a few lines of code or by using other libraries, having a common pattern would make the overall audio processing workflow easier and more consistent (e.g., the type and format of the output [tensor vs. array, int64 vs. float32], the conversion method [mean, first channel, or second channel], etc.).

@mthrok
Collaborator

mthrok commented Feb 4, 2023

By popular demand, we are reconsidering this.
There are a couple of improvements we can make:

  1. In documentation, explain how monaural conversion can be achieved.
  2. Add transforms.

I am not sure what kinds of mixing methods exist, but I guess we can simply start by taking the average across the channel dimension.

CLI tools like sox and ffmpeg allow picking one channel. Perhaps that could be an extension to this.

https://www.nesono.com/node/275
https://trac.ffmpeg.org/wiki/AudioChannelManipulation
https://en.wikipedia.org/wiki/Out_of_Phase_Stereo
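For reference, picking a single channel (what the sox and ffmpeg channel-manipulation recipes above do) is plain tensor indexing; a minimal sketch, assuming the <channel, time> layout returned by torchaudio.load:

```python
import torch

waveform = torch.rand(2, 16000)                # stereo, <channel, time>

left = waveform[0:1, :]                        # slice keeps the channel dim: (1, 16000)
right = waveform[1:2, :]
mean_mix = waveform.mean(dim=0, keepdim=True)  # average downmix: (1, 16000)

print(left.shape, right.shape, mean_mix.shape)
```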

@mthrok
Collaborator

mthrok commented Feb 4, 2023

cc @rbracco @hwangjeff @xiaohui-zhang

@rbracco

rbracco commented Feb 4, 2023

Thank you Moto, I support this feature request. Even though it can be done in a single line of PyTorch code, it is a common operation that is (a) unintuitive for beginners and (b) often necessary before training a model.

I would also support documentation clearly explaining that conversion to mono can be achieved by taking the average over the channel dim, with a clear code example. If you go the documentation-only route, I would suggest making sure it is easy to find via Google and/or the documentation search. Thanks for re-raising this.

@jjmmchema

jjmmchema commented Apr 4, 2023

Hey there, it looks like this issue is still open. Can I work on this?
I was manipulating some audio files and saw that torchaudio doesn't have a built-in transformation for converting from stereo to mono.

@mthrok
Collaborator

mthrok commented Apr 4, 2023

> Hey there, it looks like this issue is still open. Can I work on this?
> I was manipulating some audio files and saw that torchaudio doesn't have a built-in transformation for converting from stereo to mono.

@jjmmchema Thanks. Go ahead. torchaudio transforms take waveforms in channel-first format, so make sure to support the shape <..., channel, time>

@jjmmchema

> Hey there, it looks like this issue is still open. Can I work on this?
> I was manipulating some audio files and saw that torchaudio doesn't have a built-in transformation for converting from stereo to mono.
>
> @jjmmchema Thanks. Go ahead. torchaudio transforms take waveforms in channel-first format, so make sure to support the shape <..., channel, time>

@mthrok Just a quick question: why should the supported shape be <..., channel, time>? I thought it should be <..., time>, since the dots, from what I understand, refer to the number of channels of the input tensor.

@mthrok
Collaborator

mthrok commented Apr 5, 2023

@jjmmchema My intent was to emphasize that channel should not be last. However, come to think of it, since this transform is about channels, it might be better to be explicit about where the channel dimension is, and also flexible.

The most typical and common shape is <batch, channel, time>, which is what most other transforms handle.
Another typical one is <channel, time> (like Tensors returned by torchaudio.load(..., channels_first=True)).
Both cases can be handled by accepting <..., time> and specifying the channel dimension as -2, as in torch.mean(waveform, dim=-2).

There are cases where we use the <time, channel> format. For example, I/O features in torchaudio.io use this format because they are geared toward streaming I/O. In this case, dim=-1 would be better.

So I think simply making the channel dimension configurable, expecting the shape <..., channel, ...>, and defaulting to dim=-2 could be a way. What do you think?
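The proposal above can be sketched as a small helper (the name and signature are illustrative, not an existing torchaudio function):

```python
import torch


def to_mono(waveform: torch.Tensor, dim: int = -2) -> torch.Tensor:
    """Average across the channel dimension; dim is configurable, default -2."""
    return waveform.mean(dim=dim, keepdim=True)


batched = torch.rand(8, 2, 16000)   # <batch, channel, time>
plain = torch.rand(2, 16000)        # <channel, time>
streamed = torch.rand(16000, 2)     # <time, channel>, as in torchaudio.io

print(to_mono(batched).shape)           # torch.Size([8, 1, 16000])
print(to_mono(plain).shape)             # torch.Size([1, 16000])
print(to_mono(streamed, dim=-1).shape)  # torch.Size([16000, 1])
```

The default dim=-2 covers both the batched and unbatched channel-first layouts, while dim=-1 handles the streaming <time, channel> layout.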

@jjmmchema

@mthrok Yeah, it makes sense for the channel dimension to default to dim=-2 while being flexible. To be honest, I completely forgot about the batch dimension, so thanks for bringing it up.

Expecting the input to have the shape <..., channel, ...> seems to be the most appropriate thing to do.

@faroit
Contributor

faroit commented Apr 7, 2023

@mthrok @jjmmchema Just a general remark for such a feature:

  • Using mean to downmix stereo to mono is not a good idea for music signals, since it cancels out-of-phase stereo content. For spectrogram models it is much preferable to downmix in the magnitude STFT or mel domain, which does not have that issue.
  • ffmpeg suggests measuring and shifting the phase prior to downmixing: https://superuser.com/questions/1667532/ffmpeg-aphasemeter-meaning-of-measurement
  • A general downmix to mono for more than 2 channels (e.g., 5.1) is not a good idea, since specific downmix schemes must be respected for Dolby surround content or spatial audio in general.

So, as proposed in this issue and implemented in #3242, I strongly propose not adding such a feature to torchaudio, and instead letting users decide how to downmix on their own, to make them aware that this is not a trivial operation.
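The phase-cancellation point can be demonstrated in a few lines: two exactly out-of-phase channels average to silence in the time domain, while averaging magnitude spectrograms preserves the energy (a minimal sketch of the issue, not a recommended universal downmix):

```python
import torch

sample_rate = 8000
t = torch.arange(sample_rate) / sample_rate
left = torch.sin(2 * torch.pi * 440 * t)
right = -left                              # exactly out of phase
stereo = torch.stack([left, right])        # (2, time)

# Time-domain mean: the channels cancel completely
time_mix = stereo.mean(dim=0)
print(time_mix.abs().max().item())         # 0.0

# Magnitude-STFT-domain mean: the spectral energy survives
n_fft = 512
window = torch.hann_window(n_fft)
spec = torch.stft(stereo, n_fft=n_fft, window=window, return_complex=True)
mag_mix = spec.abs().mean(dim=0)           # (freq, frames)
print(mag_mix.max().item() > 0.0)          # True
```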

@jjmmchema

jjmmchema commented Apr 7, 2023

@faroit Thanks for your response.

I see that this isn't as simple as it seemed at first.

Maybe instead of implementing the downmixing with a simple mean, a new issue could be opened to discuss different downmixing techniques?

Or maybe just create a documentation section that mentions the possibility of downmixing with mean but warns about the things you mentioned?

Let me know the decision, so I know whether to drop the mean implementation shown in #3242 or finish writing the tests for the PR.

@mthrok
Collaborator

mthrok commented Apr 7, 2023

@faroit Thanks for the feedback. I agree with your points, and I am glad they are laid out nicely.
What do you think of:

  • limiting the scope of the feature to stereo input (rejecting if #channels != 2),
  • implementing phase correction,
  • and emphasizing in the docs that this is just AN implementation of such a transform, not a universal one.

As for a way to implement the phase correction, we could implement it in PyTorch, or we could delegate it to FFmpeg.
I recently landed a feature to run FFmpeg filters on Tensors (PR, doc, tutorial (wip)), although that one is a bit overkill for just a channel manipulation, so a more efficient implementation would be ideal.

(Also, in the nightly build I added a feature to delegate channel manipulation to FFmpeg in StreamReader, so this is somewhat doable when loading audio from a file.)

@mthrok
Collaborator

mthrok commented Apr 12, 2023

@faroit @jjmmchema

So I tested AudioEffector against what the FFmpeg documentation says, and was able to bring in the phase.

[Figure 1: waveforms and spectrograms of the original channels, the plain mean, and the phase-in-then-mean result]

script:

```python
import torch
from torchaudio.io import AudioEffector

sample_rate = 8000

phase = torch.linspace(0, 2 * torch.pi * 3000, sample_rate, dtype=torch.float32)
left = torch.sin(phase)
right = -left
waveform = torch.stack((left, right), dim=-1)

print(waveform.shape)

mean = torch.mean(waveform, -1)
assert mean.abs().sum().item() == 0.0


effector = AudioEffector(
    effect=(
        "asplit[a],"
        "aphasemeter=video=0,"
        "ametadata=select:key=lavfi.aphasemeter.phase:value=-0.005:function=less,"
        "pan=1c|c0=c0,"
        "aresample=async=1:first_pts=0,"
        "[a]amix")
)

applied = effector.apply(waveform, sample_rate=sample_rate)
mean2 = torch.mean(applied, -1)


import matplotlib.pyplot as plt

f, axes = plt.subplots(4, 2)
axes[0][0].set_ylabel("Original - Channel 1")
axes[0][0].plot(waveform[:500, 0])
axes[0][1].specgram(waveform[:, 0], Fs=sample_rate)

axes[1][0].set_ylabel("Original - Channel 2")
axes[1][0].plot(waveform[:500, 1])
axes[1][1].specgram(waveform[:, 1], Fs=sample_rate)

axes[2][0].set_ylabel("Just mean")
axes[2][0].plot(mean[:500])
axes[2][1].specgram(mean, Fs=sample_rate)

axes[3][0].set_ylabel("Phase-in then mean")
axes[3][0].plot(mean2[:500])
axes[3][1].specgram(mean2, Fs=sample_rate)

plt.show()
```

To bring in the phase, we just need to add the effector step before taking the mean:

```python
effector = torchaudio.io.AudioEffector(
    effect=(
        "asplit[a],"
        "aphasemeter=video=0,"
        "ametadata=select:key=lavfi.aphasemeter.phase:value=-0.005:function=less,"
        "pan=1c|c0=c0,"
        "aresample=async=1:first_pts=0,"
        "[a]amix")
)
applied = effector.apply(waveform, sample_rate=sample_rate)
```

@jjmmchema Can you apply this in the PR?

@jjmmchema

@mthrok Done. Updated the PR with the phase correction.

I still need to write the tests. It's the first time I'm writing actual tests, so I'm having a bit of a hard time understanding how to do it properly according to the PyTorch guidelines. Any advice or guidance would be really appreciated.

Also, if you have the time, I'd be grateful if you could explain what you see in the last graph and spectrogram that allows you to say the phase correction was properly applied.

@mthrok
Collaborator

mthrok commented Apr 13, 2023

> @mthrok Done. Updated the PR with the phase correction.
>
> Also, if you have the time, I'd be grateful if you could explain what you see in the last graph and spectrogram that allows you to say the phase correction was properly applied.

The script generates a 2-channel waveform whose channels have exactly opposite signs.
In the figure, the first row is the plot and spectrogram of the first channel, and the second row is the second channel.
The third row is what happens when taking the mean as-is: the waveforms cancel out and nothing is left.
The fourth row is the spectrogram when applying the phase-in before taking the mean.
We can see a frequency pattern there similar to the original one.

> I still need to write the tests. It's the first time I'm writing actual tests, so I'm having a bit of a hard time understanding how to do it properly according to the PyTorch guidelines. Any advice or guidance would be really appreciated.

Tests should be small and concise. For example, take a look at transforms.Speed. The test instantiates the transform, runs it, and checks that the output is what is expected.

```python
def test_speed_identity(self):
    """speed of 1.0 does not alter input waveform and length"""
    leading_dims = (5, 4, 2)
    time = 1000
    waveform = torch.rand(*leading_dims, time)
    lengths = torch.randint(1, 1000, leading_dims)
    speed = T.Speed(1000, 1.0)
    actual_waveform, actual_lengths = speed(waveform, lengths)
    self.assertEqual(waveform, actual_waveform)
    self.assertEqual(lengths, actual_lengths)
```

You can do something similar, but from here it depends on the functionality.
First, we need to ensure some fundamental properties:

  1. Tensors of shapes other than (2, N) (or (N, 2)?) should be rejected. You can pass tensors like torch.empty((1, 1, 1)) and torch.empty([]).
  2. When a valid input is provided, the number of channels of the output tensor must be one. You can check the shape of the output tensor for this.
  3. As discussed above, passing a waveform whose channels have exactly opposite signs should not produce a zero tensor.

There are other things to check, such as that tensors of any float dtype should work, etc.
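A sketch of how tests for properties 1 and 2 might look, written against a hypothetical mean-based ToMono (all names are illustrative; property 3 additionally requires the phase correction, so a plain mean would fail it):

```python
import torch


class ToMono(torch.nn.Module):
    """Hypothetical transform under test: mean over the channel dim (-2)."""

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        if waveform.ndim < 2 or waveform.shape[-2] != 2:
            raise ValueError("expected a stereo waveform of shape <..., 2, time>")
        return waveform.mean(dim=-2, keepdim=True)


transform = ToMono()

# 1. Shapes other than <..., 2, time> are rejected
for bad in (torch.empty(1, 1, 1), torch.empty([])):
    try:
        transform(bad)
        raise AssertionError("should have raised ValueError")
    except ValueError:
        pass

# 2. Valid input yields exactly one output channel
out = transform(torch.rand(5, 2, 1000))
assert out.shape == (5, 1, 1000)
print("ok")
```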
