torchaudio.compliance.kaldi.spectrogram is very different from torchaudio.transforms.spectrogram #157

Closed
vincentqb opened this issue Jul 19, 2019 · 9 comments

Comments

@vincentqb
Contributor

vincentqb commented Jul 19, 2019

Does torchaudio.compliance.kaldi.spectrogram currently support only vectors?

When feeding a tensor of shape torch.Size([2, 276858]) the result is not what's expected, yet there is no error. I would expect a "train pattern" to be visible, as in the second figure below.

This is what kaldi gives:
[Figure: spectrogram from torchaudio.compliance.kaldi.spectrogram]

This is what torchaudio.transforms.spectrogram gives:
[Figure: spectrogram from torchaudio.transforms.Spectrogram]

The "train pattern" is also visible on academo.org.

@vincentqb
Contributor Author

The code below generates the two figures. Sound file here.

import torch
import torchaudio
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

filename = "assets/steam-train-whistle-daniel_simon-converted-from-mp3.wav"
# torchaudio.load returns the waveform and its sample rate
tensor, frequency = torchaudio.load(filename)

# Spectrogram from torchaudio.transforms, plotting the first channel
spec = torchaudio.transforms.Spectrogram()(tensor)
plt.imshow(spec.log2().transpose(1,2)[0,:,:].numpy(), cmap='gray')
plt.show()

# Spectrogram from the kaldi compliance module
spec = torchaudio.compliance.kaldi.spectrogram(tensor)
plt.imshow(spec.log2().transpose(0,1).numpy(), cmap='gray')
plt.show()

@vincentqb
Contributor Author

vincentqb commented Jul 19, 2019

The error appears unrelated to multiple channels, since I get similar results with a single channel:

spec = torchaudio.compliance.kaldi.spectrogram(tensor[0,:].view(1,-1))
plt.imshow(spec.log2().transpose(0,1).numpy(), cmap='gray')
plt.show()

[Figure: kaldi spectrogram of the single-channel input]

Note also that I had to pass the tensor with shape torch.Size([1, 276858]) and not torch.Size([276858]). The channel flag specifies which of the channels will be processed (the last by default). Thanks @jamarshon for pointing this out!

@vincentqb changed the title from "torchaudio.compliance.kaldi.spectrogram gives incorrect results for non-vector input" to "torchaudio.compliance.kaldi.spectrogram gives result different from torchaudio.transforms.spectrogram" on Jul 19, 2019
@vincentqb changed the title from "torchaudio.compliance.kaldi.spectrogram gives result different from torchaudio.transforms.spectrogram" to "torchaudio.compliance.kaldi.spectrogram gives results different from torchaudio.transforms.spectrogram" on Jul 19, 2019
@vincentqb changed the title from "torchaudio.compliance.kaldi.spectrogram gives results different from torchaudio.transforms.spectrogram" to "torchaudio.compliance.kaldi.spectrogram is very different from torchaudio.transforms.spectrogram" on Jul 19, 2019
@vincentqb
Contributor Author

The main issue is that the result from kaldi looks like noise: the train pattern that should be visible in the spectrogram is missing.

@cpuhrsch
Contributor

Try smaller inputs (zeros, ones, arange, etc.). In general, though, we want to standardize on kaldi: whatever they produce is what we produce.

@jamarshon
Contributor

@vincentqb I could investigate the kaldi.spectrogram flags further to get a closer result, but is this more similar to what you would expect?

import torch
import torchaudio
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

filename = "/Users/jamarshon/Documents/GitHub/audio/test/assets/steam-train-whistle-daniel_simon.mp3"
s, sr = torchaudio.load(filename)
EPSILON = torch.tensor(torch.finfo(torch.float).eps, dtype=torch.get_default_dtype())

spec = torchaudio.transforms.Spectrogram()(s)
x = torch.max(EPSILON, spec).log2().transpose(1,2)[0,:,:]
plt.imshow(x.numpy(), cmap='gray')
plt.show()

n_fft = 400.0
fl = n_fft / sr * 1000.0  # frame length in milliseconds
fs = fl / 2.0  # frame shift in milliseconds (half a frame)
spec2 = torchaudio.compliance.kaldi.spectrogram(
    s, dither=0.0, window_type='hanning',
    frame_length=fl, frame_shift=fs, remove_dc_offset=False,
    round_to_power_of_two=False, sample_frequency=sr)
y = spec2.t()
plt.imshow(y.numpy(), cmap='gray')
plt.show()

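Spec1: [Figure: output of torchaudio.transforms.Spectrogram]
Spec2: [Figure: output of torchaudio.compliance.kaldi.spectrogram]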
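
The frame_length and frame_shift arguments in the call above are given in milliseconds, which is why n_fft is divided by the sample rate. A minimal sketch of that conversion in plain Python follows. The 16 kHz sample rate and the helper names are assumptions for illustration only, not values from this thread.

```python
def samples_to_ms(n_samples, sample_rate):
    """Convert a length in samples to milliseconds."""
    return n_samples * 1000.0 / sample_rate


def ms_to_samples(ms, sample_rate):
    """Convert milliseconds back to a whole number of samples."""
    return round(ms * sample_rate / 1000.0)


sample_rate = 16000  # assumed for illustration; use the rate returned by torchaudio.load
n_fft = 400          # window size used in the comment above

frame_length_ms = samples_to_ms(n_fft, sample_rate)  # window length in ms
frame_shift_ms = frame_length_ms / 2.0               # hop of half a window

print(frame_length_ms, frame_shift_ms)
print(ms_to_samples(frame_length_ms, sample_rate),
      ms_to_samples(frame_shift_ms, sample_rate))
```

The sketch rounds when converting back to samples. A millisecond value that is not exactly representable in floating point could land slightly below the intended sample count, so truncating instead of rounding might drop a sample.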

@vincentqb
Contributor Author

Great, that's good enough. Thanks!

@mahmoodn

@vincentqb
Can you upload the wav file again? I cannot find it; the link is broken.

@vincentqb
Contributor Author

> @vincentqb
> Can you upload the wav file again? I cannot find it; the link is broken.

The file can still be accessed here.

@vincentqb
Contributor Author

vincentqb commented Dec 26, 2019

For reference, this is enough to produce a reasonable spectrogram:

spec = torchaudio.compliance.kaldi.spectrogram(tensor, dither=0.)
plt.imshow(spec.t().numpy(), cmap='gray')
plt.show()

EDIT: no log needed here.
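
Two things differ between this call and the original snippet: dither is switched off and the extra log2() is gone. Below is a rough sketch in plain Python of why each could matter. It rests on assumptions not confirmed in this thread: that the function already returns log-power values, that dither adds Gaussian noise with a standard deviation equal to the dither value, and that the loaded samples lie in [-1, 1]. The helper names are hypothetical.

```python
import math


def log2_like_tensor(x):
    """Mimic an element-wise log2: NaN for negatives, -inf for zero."""
    if x > 0:
        return math.log2(x)
    return -math.inf if x == 0 else math.nan


def sine_snr_db(amplitude, dither):
    """SNR in dB of a sine wave against Gaussian noise of std `dither`."""
    signal_power = amplitude ** 2 / 2.0
    noise_power = dither ** 2
    return 10.0 * math.log10(signal_power / noise_power)


# 1) Log of a log: powers below 1 have a negative log,
#    so taking a second log of them is undefined.
powers = [1e-4, 0.5, 1.0, 100.0]
log_power = [math.log(p) for p in powers]
double_log = [log2_like_tensor(v) for v in log_power]
print(double_log)

# 2) Dither: unit-variance noise is stronger than a sine in [-1, 1],
#    but negligible next to a sine at 16-bit integer scale.
print(sine_snr_db(1.0, 1.0))
print(sine_snr_db(32767.0, 1.0))
```

Under these assumptions, either effect alone could make the plot look like noise, which would be consistent with the fix being dither=0. together with dropping the log.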
