
Question about training process #1

Open
seaplus296 opened this issue Jun 17, 2024 · 3 comments
Comments

@seaplus296

It seems the T5 embedding from FrozenT5 has shape (B, max_length, D):

```python
def encode(self, text):
    t5_batch_encoding = self.t5_tokenizer(
        text, truncation=True, max_length=self.max_length, return_length=True,
        return_overflowing_tokens=False, padding="max_length", return_tensors="pt")
    struct_tokens = t5_batch_encoding["input_ids"].to(self.device)
    z = self.t5_transformer(input_ids=struct_tokens).last_hidden_state
    return z
```

1. Is the `text_features` used for the semantic loss in the quantizer the mean-pooled T5 embedding from FrozenT5?

```python
tmp_g_se_loss = self.loss_fn(z_q_i.mean(2), text_features)
```

2. Neural codecs and vocoders are usually trained on random segments of audio. Is LLM-codec also trained on random segments, or on whole audio?
@yangdongchao
Owner

  1. Yes, we use global pooling to extract the T5 embedding.
  2. We randomly crop a segment for training; since the encoder is a convolutional network, it can encode audio of any length at inference time.
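For readers following along, here is a minimal sketch of the pooling and semantic loss described in answer 1. The shapes, names, and MSE choice are my assumptions for illustration, not the repository's actual code:

```python
import numpy as np

# Assumed shapes (illustrative): quantized latent z_q is (B, D, T_audio),
# T5 embedding t5_emb is (B, T_text, D). Both are globally mean-pooled
# to (B, D) and compared with an MSE loss (the repo's loss_fn may differ).
def semantic_loss(z_q, t5_emb):
    z_pooled = z_q.mean(axis=2)          # pool the latent over the time axis -> (B, D)
    text_features = t5_emb.mean(axis=1)  # global pooling over tokens -> (B, D)
    return float(((z_pooled - text_features) ** 2).mean())
```

Note that pooling removes both time axes, so the audio segment and the text can have different lengths and still produce comparable (B, D) features.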

@seaplus296
Author

@yangdongchao Thanks for the fast reply. So the T5 embedding comes from the padded transcript or caption of the whole utterance, while the quantized latent comes from a random crop?

By the way, I really like this approach of injecting subword- and word-level information directly into the codec.

@yangdongchao
Owner


Yes, you are right. I am sorry for the late reply; I did not notice this message over the past few days.
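To make the cropping concrete, here is a minimal sketch of the training-time segment selection described above. The function name and signature are my assumptions, not the repository's code:

```python
import random

def random_crop(waveform, segment_len, rng=random.Random(0)):
    """Pick a random fixed-length training segment; shorter audio is kept whole."""
    if len(waveform) <= segment_len:
        return waveform
    start = rng.randint(0, len(waveform) - segment_len)
    return waveform[start:start + segment_len]

# At inference no cropping is needed: the convolutional encoder
# accepts audio of any length.
```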
