
Remove the redundant shift during the loss computation in the Moshi m… #36928

Open
wants to merge 1 commit into base: main

Conversation

@glynpu commented Mar 24, 2025

What does this PR do?

Correct the loss computation in the Moshi model so that the shift is applied only once; it is currently applied twice.

Because the class name MoshiForCausalLM contains 'ForCausalLM', the mapping rules in LOSS_MAPPING resolve the self.loss_function used in MoshiForCausalLM.forward to ForCausalLMLoss. As a result, logits and labels are shifted twice: once before the self.loss_function call and once inside it. This makes tokens < n - 1 predict token n, instead of the expected behavior where tokens < n predict token n.

This PR removes the shift before the self.loss_function call.
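
For illustration, a minimal before/after sketch of the change (the exact variable names in modeling_moshi.py may differ; this assumes the usual shift-then-call pattern and that self.loss_function is passed vocab_size):

# Before (sketch): logits/labels are shifted here, and ForCausalLMLoss shifts again.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = self.loss_function(shift_logits, shift_labels, vocab_size=self.config.vocab_size)

# After (sketch): pass the unshifted tensors and let ForCausalLMLoss apply the single shift.
loss = self.loss_function(logits, labels, vocab_size=self.config.vocab_size)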

References:
LOSS_MAPPING:

LOSS_MAPPING = {
    "ForCausalLM": ForCausalLMLoss,
    "ForMaskedLM": ForMaskedLMLoss,
    "ForQuestionAnswering": ForQuestionAnsweringLoss,
    "ForSequenceClassification": ForSequenceClassificationLoss,
    "ForTokenClassification": ForTokenClassification,
    # ... (remaining entries omitted)
}

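A minimal sketch (not the library's exact code) of how this mapping selects the loss: any model class whose name contains one of the keys above is routed to the corresponding function, so MoshiForCausalLM picks up ForCausalLMLoss.

def resolve_loss(model_class_name: str):
    # Match the class name against the LOSS_MAPPING keys by substring.
    for suffix, loss_fn in LOSS_MAPPING.items():
        if suffix in model_class_name:
            return loss_fn
    return None

assert resolve_loss("MoshiForCausalLM") is ForCausalLMLoss
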
ForCausalLMLoss:

def ForCausalLMLoss(
    logits,
    labels,
    vocab_size: int,
    num_items_in_batch: int = None,
    ignore_index: int = -100,
    shift_labels=None,
    **kwargs,
):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()

    if shift_labels is None:
        labels = labels.to(logits.device)
        # Shift so that tokens < n predict n
        labels = nn.functional.pad(labels, (0, 1), value=ignore_index)
        shift_labels = labels[..., 1:].contiguous()

    # Flatten the tokens
    logits = logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(logits.device)
    loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
    return loss
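
A toy example (not from the PR) showing the misalignment: a single shift trains position i to predict token i + 1, while shifting the labels before the call as well trains it to predict token i + 2.

import torch
from torch import nn

ignore_index = -100
labels = torch.tensor([[10, 11, 12, 13]])

# Single shift, as done inside ForCausalLMLoss: position i targets token i + 1.
once = nn.functional.pad(labels, (0, 1), value=ignore_index)[..., 1:]
print(once)   # -> [[11, 12, 13, -100]]

# Pre-shifting the labels before the call and then shifting again inside
# the loss function: position i now targets token i + 2.
pre_shifted = labels[..., 1:]
twice = nn.functional.pad(pre_shifted, (0, 1), value=ignore_index)[..., 1:]
print(twice)  # -> [[12, 13, -100]]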

github-actions bot marked this pull request as draft March 24, 2025 12:49

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

glynpu marked this pull request as ready for review March 24, 2025 12:56
github-actions bot requested a review from eustlb March 24, 2025 12:56
@Rocketknight1 (Member) commented

cc @eustlb!
