
Add Stereo to Mono Conversions #877

Open
RemonComputer opened this issue Aug 12, 2020 · 18 comments

Comments

@RemonComputer

🚀 Feature

Add mono-to-stereo and stereo-to-mono conversion

Motivation

You have done an amazing job, but stereo-to-mono conversion (and vice versa) is simple, and it seems you missed it.
I think it can be done with a simple mean over the channel dimension.

Pitch

This should be a simple transform like ToMono(channel_first=True) in torchaudio.transforms
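A rough sketch of what such a transform might look like, following the mean-over-channels idea above (the class name and `channel_first` parameter mirror the pitch; this is illustrative, not torchaudio's API):

```python
import torch


class ToMono(torch.nn.Module):
    """Illustrative sketch: downmix by averaging over the channel dimension."""

    def __init__(self, channel_first: bool = True):
        super().__init__()
        # channel dim is 0 for <channel, time>, -1 for <time, channel>
        self.dim = 0 if channel_first else -1

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # keepdim=True preserves a singleton channel dimension
        return waveform.mean(dim=self.dim, keepdim=True)


stereo = torch.rand(2, 16000)   # <channel, time>
mono = ToMono()(stereo)
print(mono.shape)               # torch.Size([1, 16000])
```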

@ezyang ezyang transferred this issue from pytorch/pytorch Aug 12, 2020
@RemonComputer
Author

Attached is a small implementation of mine:
transformations.py.txt

@mthrok
Collaborator

mthrok commented Aug 14, 2020

Hi @RemonComputer

Thanks for the suggestion.

I think this problem is underspecified. I am not sure there is one and only one way to convert stereo to monaural. Taking the average is the simplest way, but adjusting the power level of each signal before mixing is also a popular approach. So I imagine there will be a wide variety of opinions on what "stereo to monaural conversion" should do.

Adding this to the library would increase the maintenance cost, and I do not see much value in it, especially when it is so easy for users to write their own. Users can pick their favorite conversion algorithm and write it in only a few lines of code.

@RemonComputer
Author

@mthrok OK, if I can help with anything, please let me know.

@bagustris

I support this feature request.

Although the conversion from stereo to mono can be done in just a few lines of code or by using other libraries, having a common pattern would make the overall audio processing workflow easier and more consistent (e.g., the type and format of the output [tensor vs. array, int64 vs. float32], the conversion method [mean, first channel, or second channel], etc.).

@mthrok
Collaborator

mthrok commented Feb 4, 2023

By popular demand, we are reconsidering this.
There are a couple of improvements we can make:

  1. In documentation, explain how monaural conversion can be achieved.
  2. Add transforms.

I am not sure what kinds of mixing methods exist, but I guess we can simply start by taking the average across the channel dimension.

CLI tools like sox and ffmpeg allow picking one channel. Perhaps that could be an extension to this.

https://www.nesono.com/node/275
https://trac.ffmpeg.org/wiki/AudioChannelManipulation
https://en.wikipedia.org/wiki/Out_of_Phase_Stereo
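For reference, picking a single channel (what the sox and ffmpeg channel-manipulation recipes above do) is plain tensor indexing; a minimal sketch, assuming the <channel, time> layout returned by torchaudio.load:

```python
import torch

waveform = torch.rand(2, 16000)                # stereo, <channel, time>

left = waveform[0:1, :]                        # slice keeps the channel dim: (1, 16000)
right = waveform[1:2, :]
mean_mix = waveform.mean(dim=0, keepdim=True)  # average downmix: (1, 16000)

print(left.shape, right.shape, mean_mix.shape)
```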

@mthrok
Collaborator

mthrok commented Feb 4, 2023

cc @rbracco @hwangjeff @xiaohui-zhang

@rbracco

rbracco commented Feb 4, 2023

Thank you Moto, I support this feature request. Even though it can be done in a single line of PyTorch code, it is a common operation that is (a) unintuitive for beginners and (b) often necessary before training a model.

I would also support documentation clearly explaining that conversion to mono can be achieved by taking the average over the channel dim, with a clear code example. If you go the documentation-only route, I would suggest making sure it is easy to find via Google and/or the documentation search. Thanks for re-raising this.

@jjmmchema

jjmmchema commented Apr 4, 2023

Hey there, it looks like this issue is still open. Can I work on this?
I was manipulating some audio files and saw that torchaudio doesn't have a built-in transformation for converting from stereo to mono.

@mthrok
Collaborator

mthrok commented Apr 4, 2023

> Hey there, it looks like this issue is still open. Can I work on this?
> I was manipulating some audio files and saw that torchaudio doesn't have a built-in transformation for converting from stereo to mono.

@jjmmchema Thanks. Go ahead. torchaudio transforms take waveforms in channel-first format, so make sure to support the shape <..., channel, time>

@jjmmchema

> Hey there, it looks like this issue is still open. Can I work on this?
> I was manipulating some audio files and saw that torchaudio doesn't have a built-in transformation for converting from stereo to mono.
>
> @jjmmchema Thanks. Go ahead. torchaudio transforms take waveforms in channel-first format, so make sure to support the shape <..., channel, time>

@mthrok Just a quick question: why should the supported shape be <..., channel, time>? I thought it should be <..., time>, since the dots, from what I understand, refer to the number of channels of the input tensor.

@mthrok
Collaborator

mthrok commented Apr 5, 2023

@jjmmchema My intent was to emphasize that channel should not be last. However, come to think of it, since this transform is about channels, it might be better to be explicit about where the channel dimension is, and also flexible.

The most typical and common shape is <batch, channel, time>, which is what most other transforms handle.
Another typical one is <channel, time> (like Tensors returned by torchaudio.load(..., channels_first=True)).
Both cases can be handled by accepting <..., time> and specifying the channel dimension as -2, as in torch.mean(waveform, dim=-2).

There are cases where we use the <time, channel> format. For example, I/O features in torchaudio.io use this format because they are geared toward streaming I/O. In this case, dim=-1 would be better.

So I think simply making the channel dimension configurable, expecting the shape <..., channel, ...>, and defaulting to dim=-2 could be a way. What do you think?
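The proposal above can be sketched as a small helper (the name and signature are illustrative, not an existing torchaudio function):

```python
import torch


def to_mono(waveform: torch.Tensor, dim: int = -2) -> torch.Tensor:
    """Average across the channel dimension; dim is configurable, default -2."""
    return waveform.mean(dim=dim, keepdim=True)


batched = torch.rand(8, 2, 16000)   # <batch, channel, time>
plain = torch.rand(2, 16000)        # <channel, time>
streamed = torch.rand(16000, 2)     # <time, channel>, as in torchaudio.io

print(to_mono(batched).shape)           # torch.Size([8, 1, 16000])
print(to_mono(plain).shape)             # torch.Size([1, 16000])
print(to_mono(streamed, dim=-1).shape)  # torch.Size([16000, 1])
```

The default dim=-2 covers both the batched and unbatched channel-first layouts, while dim=-1 handles the streaming <time, channel> layout.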

@jjmmchema

@mthrok Yeah, it makes sense for the channel dimension to default to dim=-2 while being flexible. To be honest, I completely forgot about the batch dimension, so thanks for bringing it up.

Expecting the input to have the shape <..., channel, ...> seems to be the most appropriate thing to do.

@faroit
Contributor

faroit commented Apr 7, 2023

@mthrok @jjmmchema Just a general remark for such a feature:

  • Using mean to downmix stereo to mono is not a good idea for music signals, since it cancels out-of-phase stereo content. For spectrogram models it is much preferable to downmix in the magnitude STFT or mel domain, which does not have that issue.
  • ffmpeg suggests measuring and shifting the phase prior to downmixing: https://superuser.com/questions/1667532/ffmpeg-aphasemeter-meaning-of-measurement
  • A general downmix to mono for more than 2 channels (e.g., 5.1) is not a good idea, since specific downmix schemes must be respected for Dolby surround content or spatial audio in general.

So, as proposed in this issue and implemented in #3242, I strongly propose not adding such a feature to torchaudio, and instead letting users decide how to downmix on their own, to make them aware that this is not a trivial operation.
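The phase-cancellation point can be demonstrated in a few lines: two exactly out-of-phase channels average to silence in the time domain, while averaging magnitude spectrograms preserves the energy (a minimal sketch of the issue, not a recommended universal downmix):

```python
import torch

sample_rate = 8000
t = torch.arange(sample_rate) / sample_rate
left = torch.sin(2 * torch.pi * 440 * t)
right = -left                              # exactly out of phase
stereo = torch.stack([left, right])        # (2, time)

# Time-domain mean: the channels cancel completely
time_mix = stereo.mean(dim=0)
print(time_mix.abs().max().item())         # 0.0

# Magnitude-STFT-domain mean: the spectral energy survives
n_fft = 512
window = torch.hann_window(n_fft)
spec = torch.stft(stereo, n_fft=n_fft, window=window, return_complex=True)
mag_mix = spec.abs().mean(dim=0)           # (freq, frames)
print(mag_mix.max().item() > 0.0)          # True
```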

@jjmmchema

jjmmchema commented Apr 7, 2023

@faroit Thanks for your response.

I see that this isn't as simple as it seemed at first.

Maybe instead of implementing the downmixing with a simple mean, a new issue could be opened to discuss different downmixing techniques?

Or maybe just create a documentation section that mentions the possibility of downmixing with mean but warns about the things you mentioned?

Let me know the decision, so I know whether to drop the mean implementation shown in #3242 or finish writing the tests for the PR.

@mthrok
Collaborator

mthrok commented Apr 7, 2023

@faroit Thanks for the feedback. I agree with your points, and I am glad they are laid out nicely.
What do you think of:

  • limiting the scope of the feature to stereo input (rejecting if #channels != 2),
  • implementing phase correction,
  • and emphasizing in the docs that this is just AN implementation of such a transform, not a universal one.

As for a way to implement the phase correction, we could implement it in PyTorch, or we could delegate it to FFmpeg.
I recently landed a feature to run FFmpeg filters on Tensors (PR, doc, tutorial (wip)), although that one is a bit overkill for just a channel manipulation, so a more efficient implementation would be ideal.

(Also, in the nightly build I added a feature to delegate channel manipulation to FFmpeg in StreamReader, so this is somewhat doable when loading audio from a file.)

@mthrok
Collaborator

mthrok commented Apr 12, 2023

@faroit @jjmmchema

So I tested AudioEffector against what the FFmpeg documentation says, and was able to bring in the phase.

[Figure 1: waveforms and spectrograms of the original channels, the plain mean, and the phase-in-then-mean result]

script:

```python
import torch
from torchaudio.io import AudioEffector

sample_rate = 8000

phase = torch.linspace(0, 2 * torch.pi * 3000, sample_rate, dtype=torch.float32)
left = torch.sin(phase)
right = -left
waveform = torch.stack((left, right), dim=-1)

print(waveform.shape)

mean = torch.mean(waveform, -1)
assert mean.abs().sum().item() == 0.0


effector = AudioEffector(
    effect=(
        "asplit[a],"
        "aphasemeter=video=0,"
        "ametadata=select:key=lavfi.aphasemeter.phase:value=-0.005:function=less,"
        "pan=1c|c0=c0,"
        "aresample=async=1:first_pts=0,"
        "[a]amix")
)

applied = effector.apply(waveform, sample_rate=sample_rate)
mean2 = torch.mean(applied, -1)


import matplotlib.pyplot as plt

f, axes = plt.subplots(4, 2)
axes[0][0].set_ylabel("Original - Channel 1")
axes[0][0].plot(waveform[:500, 0])
axes[0][1].specgram(waveform[:, 0], Fs=sample_rate)

axes[1][0].set_ylabel("Original - Channel 2")
axes[1][0].plot(waveform[:500, 1])
axes[1][1].specgram(waveform[:, 1], Fs=sample_rate)

axes[2][0].set_ylabel("Just mean")
axes[2][0].plot(mean[:500])
axes[2][1].specgram(mean, Fs=sample_rate)

axes[3][0].set_ylabel("Phase-in then mean")
axes[3][0].plot(mean2[:500])
axes[3][1].specgram(mean2, Fs=sample_rate)

plt.show()
```

To bring in the phase, we just need to add the effector step before taking the mean:

```python
effector = torchaudio.io.AudioEffector(
    effect=(
        "asplit[a],"
        "aphasemeter=video=0,"
        "ametadata=select:key=lavfi.aphasemeter.phase:value=-0.005:function=less,"
        "pan=1c|c0=c0,"
        "aresample=async=1:first_pts=0,"
        "[a]amix")
)
applied = effector.apply(waveform, sample_rate=sample_rate)
```

@jjmmchema Can you apply this in the PR?

@jjmmchema

@mthrok Done. Updated the PR with the phase correction.

I still need to write the tests. It's the first time I'm writing actual tests, so I'm having a bit of a hard time understanding how to do it properly according to the PyTorch guidelines. Any advice or guidance would be really appreciated.

Also, if you have the time, I'd be grateful if you could explain what you see in the last graph and spectrogram that allows you to say the phase correction was properly applied.

@mthrok
Collaborator

mthrok commented Apr 13, 2023

> @mthrok Done. Updated the PR with the phase correction.
>
> Also, if you have the time, I'd be grateful if you could explain what you see in the last graph and spectrogram that allows you to say the phase correction was properly applied.

The script generates a 2-channel waveform whose channels have exactly opposite signs.
In the figure, the first row is the plot and spectrogram of the first channel, and the second row is the second channel.
The third row is what happens when taking the mean as-is: the waveforms cancel out and nothing is left.
The fourth row is the spectrogram when applying the phase-in before taking the mean.
We can see a frequency pattern there similar to the original one.

> I still need to write the tests. It's the first time I'm writing actual tests, so I'm having a bit of a hard time understanding how to do it properly according to the PyTorch guidelines. Any advice or guidance would be really appreciated.

Tests should be small and concise. For example, take a look at transforms.Speed. The test instantiates the transform, runs it, and checks that the output is what is expected.

```python
def test_speed_identity(self):
    """speed of 1.0 does not alter input waveform and length"""
    leading_dims = (5, 4, 2)
    time = 1000
    waveform = torch.rand(*leading_dims, time)
    lengths = torch.randint(1, 1000, leading_dims)
    speed = T.Speed(1000, 1.0)
    actual_waveform, actual_lengths = speed(waveform, lengths)
    self.assertEqual(waveform, actual_waveform)
    self.assertEqual(lengths, actual_lengths)
```

You can do something similar, but from here it depends on the functionality.
First, we need to ensure some fundamental properties:

  1. Tensors of shapes other than (2, N) (or (N, 2)?) should be rejected. You can pass tensors like torch.empty((1, 1, 1)) and torch.empty([]).
  2. When a valid input is provided, the number of channels of the output tensor must be one. You can check the shape of the output tensor for this.
  3. As discussed above, passing a waveform whose channels have exactly opposite signs should not produce a zero tensor.

There are other things to check, such as that tensors of any float dtype should work, etc.
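A sketch of how tests for properties 1 and 2 might look, written against a hypothetical mean-based ToMono (all names are illustrative; property 3 additionally requires the phase correction, so a plain mean would fail it):

```python
import torch


class ToMono(torch.nn.Module):
    """Hypothetical transform under test: mean over the channel dim (-2)."""

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        if waveform.ndim < 2 or waveform.shape[-2] != 2:
            raise ValueError("expected a stereo waveform of shape <..., 2, time>")
        return waveform.mean(dim=-2, keepdim=True)


transform = ToMono()

# 1. Shapes other than <..., 2, time> are rejected
for bad in (torch.empty(1, 1, 1), torch.empty([])):
    try:
        transform(bad)
        raise AssertionError("should have raised ValueError")
    except ValueError:
        pass

# 2. Valid input yields exactly one output channel
out = transform(torch.rand(5, 2, 1000))
assert out.shape == (5, 1, 1000)
print("ok")
```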
