
Questions about the format and shape of the data #36
Open

funasshi opened this issue Nov 4, 2021 · 17 comments
@funasshi

funasshi commented Nov 4, 2021

Hello. I am currently using this package. I'm afraid this may be a basic question, but I'd like to ask:

1. Is the input a spectrogram or raw audio data?

2. When I run model(x, x_len, target, target_len), I get a four-dimensional output (batch, join_len, target_len, class_num) because of how the loss function is calculated. I wanted to see the recognition result, so I used model.recognize(x, x_len), but the shape of the output was (batch, join_len). I would like to get (batch, target_len). What does the recognize process actually do?

@sooftware
Owner

sooftware commented Nov 4, 2021

  1. Spectrogram
  2. Ignore everything behind .
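Since the input is a spectrogram, here is a minimal sketch of producing an input tensor of shape (batch, seq_len, dim) from raw audio. The front-end choices here (magnitude STFT, n_fft=400, hop_length=160) are assumptions for illustration, not something the package prescribes:

```python
import torch

# Fake batch of 1 second of 16 kHz audio (assumed sample rate).
waveform = torch.randn(1, 16000)

# Magnitude STFT as a simple spectrogram front-end.
# n_fft=400 yields 201 frequency bins; hop_length=160 yields ~100 frames/sec.
window = torch.hann_window(400)
spec = torch.stft(waveform, n_fft=400, hop_length=160,
                  window=window, return_complex=True).abs()

# (batch, freq_bins, frames) -> (batch, seq_len, dim) as the model expects
inputs = spec.transpose(1, 2)
print(inputs.shape)  # torch.Size([1, 101, 201])
```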

@funasshi
Author

funasshi commented Nov 5, 2021

Thank you very much.
Sorry, I have one more question related to point 2. When I look at the results of model.recognize, the last category label appears unusually often.

For example, in a model with 40 categories:
[40,40,40,40,40,40,40,3,40,40,40,40,40,40,5,8,40,40,....]

Does this mean that the last category is being used as the category that should be ignored?

@zwan074

zwan074 commented Dec 5, 2021

> 1. Spectrogram
> 2. Ignore everything behind .

Hi, I am working on an ASR-related project using Conformer.

The four-dimensional output has confused me when calculating the loss to train the ASR model.

Would you please provide an example of the loss calculation?

Kind Regards

@zwan074

zwan074 commented Dec 5, 2021

> Thank you very much. Sorry, I have one more question related to point 2. When I look at the results of model.recognize, the last category label appears unusually often.
>
> For example, in a model with 40 categories: [40,40,40,40,40,40,40,3,40,40,40,40,40,40,5,8,40,40,....]
>
> Does this mean that the last category is being used as the category that should be ignored?

It should be recognised as the blank symbol.
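If the last index is indeed the blank, the per-frame output can be turned into a label sequence by standard greedy decoding: collapse consecutive repeats, then drop blanks. A minimal sketch, assuming blank index 40 as in the example above:

```python
def ctc_greedy_decode(frame_preds, blank=40):
    """Collapse consecutive repeats, then drop blank frames."""
    decoded = []
    prev = None
    for p in frame_preds:
        if p != prev and p != blank:
            decoded.append(p)
        prev = p
    return decoded

frames = [40, 40, 40, 3, 3, 40, 40, 5, 8, 40]
print(ctc_greedy_decode(frames))  # [3, 5, 8]
```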

@ArtemisZGL

ArtemisZGL commented Jan 25, 2022

> Hi, I am working on an ASR-related project using Conformer.
>
> The four-dimensional output has confused me when calculating the loss to train the ASR model.
>
> Would you please provide an example of the loss calculation?
>
> Kind Regards

do you have some idea? I am confused about that too.


@jcgeo9

jcgeo9 commented Feb 18, 2022

@sooftware can you please answer @zwan074? Many of us are confused about how to use a loss function to train the Conformer, since the outputs are log probabilities of the model prediction in 4 dimensions.

@sooftware
Owner

Sorry for the late response. I recommend checking this project

@jcgeo9

jcgeo9 commented Feb 20, 2022

I have another question about how the Conformer works.
I am using a vocab of 6030 classes, and my input data is (batch, dim, seq_len) = (32, 201, 1162) (where 1162 is the max length after padding) and my targets are (32, 20) (where 20 is the max length after padding).
I forward-propagate, and then when using the recognize function it returns a tensor of shape (32, 289). I am trying to understand what that 289 is, as I was expecting a (32, 20) tensor that I would then convert to text. @sooftware

@sooftware
Owner

Show me the code.

@jcgeo9

jcgeo9 commented Feb 20, 2022

@sooftware When I execute the following code, the recognize_sp variable has shape [32, 289]:

import torch
import time
import torch.nn as nn
from conformer import Conformer

cuda = torch.cuda.is_available()  
device = torch.device('cuda' if cuda else 'cpu')
print(device)

#conformer model init
model = nn.DataParallel(Conformer(num_classes=6030, input_dim=201, encoder_dim=32, num_encoder_layers=3, decoder_dim=32)).to(device)

for i, (audio,audio_len, translations, translation_len) in enumerate(train_loader):

  #sorting inputs and targets to have targets in descending order based on len
  sorted_list,sorted_indices=torch.sort(translation_len,descending=True)

  sorted_audio=torch.zeros((32,201,1162),dtype=torch.float)
  sorted_audio_len=torch.zeros(32,dtype=torch.int)
  sorted_translations=torch.zeros((32,20),dtype=torch.int)     
  sorted_translation_len=sorted_list

  for index, contentof in enumerate(translation_len):
    sorted_audio[index]=audio[sorted_indices[index]]
    sorted_audio_len[index]=audio_len[sorted_indices[index]]
    sorted_translations[index]=translations[sorted_indices[index]]

  #transpose inputs from 32, 201, 1162 (batch, dim, seq_len) to 32, 1162, 201 (batch, seq_len, dim)
  inputs=sorted_audio.to(device)
  inputs=torch.transpose(inputs, 1, 2)
  input_lengths=sorted_audio_len 
  targets=sorted_translations.to(device) 
  target_lengths=sorted_translation_len
  
  # shapes:
  # inputs: [32, 1162, 201]
  # input_len: [32]
  # targets: [32, 20]
  # target_len: [32]
  preds = model(inputs, input_lengths, targets, target_lengths)

  recognize_sp=model.module.recognize(inputs, input_lengths)
  print(recognize_sp.shape)
  break

@zwan074

zwan074 commented Feb 20, 2022

> @sooftware can you please answer @zwan074? Many of us are confused about how to use a loss function to train the Conformer, since the outputs are log probabilities of the model prediction in 4 dimensions.

As per the https://github.com/openspeech-team/openspeech project:

When training the conformer model, it uses the conformer block to compute the output for a CTC loss. The LSTM decoder layer is unused.

The code is as below:

def training_step(self, batch: tuple, batch_idx: int) -> OrderedDict:
    inputs, targets, input_lengths, target_lengths = batch
    encoder_outputs, encoder_logits, output_lengths = self.encoder(inputs, input_lengths)

    logits = self.fc(encoder_outputs).log_softmax(dim=-1)
    return self.collect_outputs(
        stage='train',
        logits=logits,
        output_lengths=output_lengths,
        targets=targets,
        target_lengths=target_lengths,
    )

@sooftware
Owner

@zwan074 Check this link

@sooftware
Owner

sooftware commented Feb 21, 2022

@jcgeo9 289 is almost a quarter of 1162. This happens because of the Conv2dSubsampling module in the convolution block of the Conformer, which reduces the time dimension by a factor of about four.
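The arithmetic can be checked directly. Assuming Conv2dSubsampling applies two Conv2d layers with kernel size 3, stride 2, and no padding (a common choice; verify against the package source), 1162 input frames come out to exactly 289:

```python
def conv_out_len(n, kernel=3, stride=2):
    # Output length of one convolution with no padding.
    return (n - kernel) // stride + 1

seq_len = 1162
subsampled = conv_out_len(conv_out_len(seq_len))  # two stride-2 convs
print(subsampled)  # 289
```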

@jcgeo9

jcgeo9 commented Feb 21, 2022

@sooftware Hmm, OK, but what do I do with that? I mean, how do I convert it to what I actually want? Isn't it supposed to return a [32, 20] tensor containing integers that correspond to words from my vocabulary, which would then be converted with itos in order to check the loss?

@sooftware
Owner

I updated the code and README because many people seemed to have a hard time calculating losses.
Below is an example of calculating CTC Loss.

import torch
import torch.nn as nn
from conformer import Conformer

batch_size, sequence_length, dim = 3, 12345, 80

cuda = torch.cuda.is_available()  
device = torch.device('cuda' if cuda else 'cpu')

criterion = nn.CTCLoss()

inputs = torch.rand(batch_size, sequence_length, dim).to(device)
input_lengths = torch.IntTensor([12345, 12300, 12000])
targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                            [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                            [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)
target_lengths = torch.LongTensor([9, 8, 7])

model = Conformer(num_classes=10, 
                  input_dim=dim, 
                  encoder_dim=32, 
                  num_encoder_layers=3).to(device)

# Forward propagate
outputs, output_lengths = model(inputs, input_lengths)

# Calculate CTC Loss
loss = criterion(outputs.transpose(0, 1), targets, output_lengths, target_lengths)

@aijianiula0601

aijianiula0601 commented Mar 29, 2022

I have a question. The input_lengths are not passed in to calculate the mask for the multi-head attention. Does it still work correctly?
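For reference, a padding mask built from input_lengths would look like the sketch below. This is a hypothetical helper to illustrate what the question is about; the name make_pad_mask and the shapes are assumptions, not the package's API:

```python
import torch

def make_pad_mask(lengths, max_len):
    # True at padded positions beyond each sequence's length.
    idx = torch.arange(max_len).unsqueeze(0)   # (1, max_len)
    return idx >= lengths.unsqueeze(1)         # (batch, max_len)

mask = make_pad_mask(torch.tensor([3, 5]), 5)
print(mask)
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])
```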
