
Questions about the format and shape of the data #36
Open

funasshi opened this issue Nov 4, 2021 · 17 comments
@funasshi

funasshi commented Nov 4, 2021

Hello. I am currently using this package. I'm afraid this may be a basic question, but I'd like to ask:

1. Is the input a spectrogram or raw audio data?

2. When I run model(x, x_len, target, target_len), I get a four-dimensional output (batch, join_len, target_len, class_num) because of how the loss function is calculated. I wanted to see the recognition result, so I used model.recognize(x, x_len), but the shape of the output was (batch, join_len). I would like to get (batch, target_len). What does the recognize process actually do?

@sooftware
Owner

sooftware commented Nov 4, 2021

  1. Spectrogram
  2. Ignore everything behind .
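Since the input is a spectrogram, here is a minimal sketch of producing an input tensor of shape (batch, seq_len, dim) from raw audio. The front-end choices here (magnitude STFT, n_fft=400, hop_length=160) are assumptions for illustration, not something the package prescribes:

```python
import torch

# Fake batch of 1 second of 16 kHz audio (assumed sample rate).
waveform = torch.randn(1, 16000)

# Magnitude STFT as a simple spectrogram front-end.
# n_fft=400 yields 201 frequency bins; hop_length=160 yields ~100 frames/sec.
window = torch.hann_window(400)
spec = torch.stft(waveform, n_fft=400, hop_length=160,
                  window=window, return_complex=True).abs()

# (batch, freq_bins, frames) -> (batch, seq_len, dim) as the model expects
inputs = spec.transpose(1, 2)
print(inputs.shape)  # torch.Size([1, 101, 201])
```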

@funasshi
Author

funasshi commented Nov 5, 2021

Thank you very much.
Sorry, I have one more question related to point 2. When I look at the results of model.recognize, the last category label appears unusually often.

For example, in a model with 40 categories:
[40,40,40,40,40,40,40,3,40,40,40,40,40,40,5,8,40,40,....]

Does this mean that the last category is being used as the category that should be ignored?

@zwan074

zwan074 commented Dec 5, 2021

> 1. Spectrogram
> 2. Ignore everything behind .

Hi, I am working on an ASR-related project using Conformer.

The four-dimensional output has confused me when calculating the loss to train the ASR model.

Would you please provide an example of the loss calculation?

Kind Regards

@zwan074

zwan074 commented Dec 5, 2021

> Thank you very much. Sorry, I have one more question related to point 2. When I look at the results of model.recognize, the last category label appears unusually often.
>
> For example, in a model with 40 categories: [40,40,40,40,40,40,40,3,40,40,40,40,40,40,5,8,40,40,....]
>
> Does this mean that the last category is being used as the category that should be ignored?

It should be recognised as the blank symbol.
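If the last index is indeed the blank, the per-frame output can be turned into a label sequence by standard greedy decoding: collapse consecutive repeats, then drop blanks. A minimal sketch, assuming blank index 40 as in the example above:

```python
def ctc_greedy_decode(frame_preds, blank=40):
    """Collapse consecutive repeats, then drop blank frames."""
    decoded = []
    prev = None
    for p in frame_preds:
        if p != prev and p != blank:
            decoded.append(p)
        prev = p
    return decoded

frames = [40, 40, 40, 3, 3, 40, 40, 5, 8, 40]
print(ctc_greedy_decode(frames))  # [3, 5, 8]
```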

@ArtemisZGL

ArtemisZGL commented Jan 25, 2022

> Hi, I am working on an ASR-related project using Conformer.
>
> The four-dimensional output has confused me when calculating the loss to train the ASR model.
>
> Would you please provide an example of the loss calculation?
>
> Kind Regards

do you have some idea? I am confused about that too.


@jcgeo9

jcgeo9 commented Feb 18, 2022

@sooftware can you please answer @zwan074? Many of us are confused about how to use a loss function to train the Conformer, since the outputs are log probabilities of the model prediction in 4 dimensions.

@sooftware
Owner

Sorry for the late response. I recommend checking this project

@jcgeo9

jcgeo9 commented Feb 20, 2022

I have another question about how the Conformer works.
I am using a vocab of 6030 classes, and my input data is (batch, dim, seq_len) = (32, 201, 1162) (where 1162 is the max length after padding) and my targets are (32, 20) (where 20 is the max length after padding).
I forward-propagate, and then when using the recognize function it returns a tensor of shape (32, 289). I am trying to understand what that 289 is, as I was expecting a (32, 20) tensor that I would then convert to text. @sooftware

@sooftware
Owner

Show me the code.

@jcgeo9

jcgeo9 commented Feb 20, 2022

@sooftware When I execute the following code, the recognize_sp variable has shape [32, 289]:

import torch
import time
import torch.nn as nn
from conformer import Conformer

cuda = torch.cuda.is_available()  
device = torch.device('cuda' if cuda else 'cpu')
print(device)

#conformer model init
model = nn.DataParallel(Conformer(num_classes=6030, input_dim=201, encoder_dim=32, num_encoder_layers=3, decoder_dim=32)).to(device)

for i, (audio,audio_len, translations, translation_len) in enumerate(train_loader):

  #sorting inputs and targets to have targets in descending order based on len
  sorted_list,sorted_indices=torch.sort(translation_len,descending=True)

  sorted_audio=torch.zeros((32,201,1162),dtype=torch.float)
  sorted_audio_len=torch.zeros(32,dtype=torch.int)
  sorted_translations=torch.zeros((32,20),dtype=torch.int)     
  sorted_translation_len=sorted_list

  for index, contentof in enumerate(translation_len):
    sorted_audio[index]=audio[sorted_indices[index]]
    sorted_audio_len[index]=audio_len[sorted_indices[index]]
    sorted_translations[index]=translations[sorted_indices[index]]

  #transpose inputs from 32, 201, 1162 (batch, dim, seq_len) to 32, 1162, 201 (batch, seq_len, dim)
  inputs=sorted_audio.to(device)
  inputs=torch.transpose(inputs, 1, 2)
  input_lengths=sorted_audio_len 
  targets=sorted_translations.to(device) 
  target_lengths=sorted_translation_len
  
  # shapes:
  # inputs: [32, 1162, 201]
  # input_len: [32]
  # targets: [32, 20]
  # target_len: [32]
  preds = model(inputs, input_lengths, targets, target_lengths)

  recognize_sp=model.module.recognize(inputs, input_lengths)
  print(recognize_sp.shape)
  break

@zwan074

zwan074 commented Feb 20, 2022

> @sooftware can you please answer @zwan074? Many of us are confused about how to use a loss function to train the Conformer, since the outputs are log probabilities of the model prediction in 4 dimensions.

As per the https://github.com/openspeech-team/openspeech project:

When training the conformer model, it uses the conformer block to compute the output for a CTC loss. The LSTM decoder layer is unused.

The code is as below:

def training_step(self, batch: tuple, batch_idx: int) -> OrderedDict:
    inputs, targets, input_lengths, target_lengths = batch
    encoder_outputs, encoder_logits, output_lengths = self.encoder(inputs, input_lengths)

    logits = self.fc(encoder_outputs).log_softmax(dim=-1)
    return self.collect_outputs(
        stage='train',
        logits=logits,
        output_lengths=output_lengths,
        targets=targets,
        target_lengths=target_lengths,
    )

@sooftware
Owner

@zwan074 Check this link

@sooftware
Owner

sooftware commented Feb 21, 2022

@jcgeo9 289 is almost a quarter of 1162. This happens because of the Conv2dSubsampling module in the convolution block of the Conformer, which reduces the time dimension by a factor of about four.
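The arithmetic can be checked directly. Assuming Conv2dSubsampling applies two Conv2d layers with kernel size 3, stride 2, and no padding (a common choice; verify against the package source), 1162 input frames come out to exactly 289:

```python
def conv_out_len(n, kernel=3, stride=2):
    # Output length of one convolution with no padding.
    return (n - kernel) // stride + 1

seq_len = 1162
subsampled = conv_out_len(conv_out_len(seq_len))  # two stride-2 convs
print(subsampled)  # 289
```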

@jcgeo9

jcgeo9 commented Feb 21, 2022

@sooftware Hmm, OK, but what do I do with that? I mean, how do I convert it to what I actually want? Isn't it supposed to return a [32, 20] tensor containing integers that correspond to words from my vocabulary, which would then be converted with itos in order to check the loss?

@sooftware
Owner

I updated the code and README because many people seemed to have a hard time calculating losses.
Below is an example of calculating CTC Loss.

import torch
import torch.nn as nn
from conformer import Conformer

batch_size, sequence_length, dim = 3, 12345, 80

cuda = torch.cuda.is_available()  
device = torch.device('cuda' if cuda else 'cpu')

criterion = nn.CTCLoss()

inputs = torch.rand(batch_size, sequence_length, dim).to(device)
input_lengths = torch.IntTensor([12345, 12300, 12000])
targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                            [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                            [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)
target_lengths = torch.LongTensor([9, 8, 7])

model = Conformer(num_classes=10, 
                  input_dim=dim, 
                  encoder_dim=32, 
                  num_encoder_layers=3).to(device)

# Forward propagate
outputs, output_lengths = model(inputs, input_lengths)

# Calculate CTC Loss
loss = criterion(outputs.transpose(0, 1), targets, output_lengths, target_lengths)

@aijianiula0601

aijianiula0601 commented Mar 29, 2022

I have a question. The input_lengths are not passed in to calculate the mask for the multi-head attention. Does it still work correctly?
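For reference, a padding mask built from input_lengths would look like the sketch below. This is a hypothetical helper to illustrate what the question is about; the name make_pad_mask and the shapes are assumptions, not the package's API:

```python
import torch

def make_pad_mask(lengths, max_len):
    # True at padded positions beyond each sequence's length.
    idx = torch.arange(max_len).unsqueeze(0)   # (1, max_len)
    return idx >= lengths.unsqueeze(1)         # (batch, max_len)

mask = make_pad_mask(torch.tensor([3, 5]), 5)
print(mask)
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])
```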
