
Batch size and penalty terms #35

Closed
rgemulla opened this issue Aug 13, 2019 · 6 comments

@rgemulla
Member

7aa82ce introduces a patch that divides the penalty terms by the number of batches to keep the penalty terms consistent. This needs discussion.

In particular: we average example losses/gradients over the batch. Thus, before 7aa82ce:

E[gradient] = E[gradient of a random example] + gradient of penalty term

That's independent of the number of batches. After 7aa82ce:

E[gradient] = E[gradient of a random example] + gradient of penalty term / num_batches

That's dependent on the number of batches. The patch thus seems to introduce the very dependence it tries to avoid.
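A minimal numerical sketch of this argument (toy quadratic loss and L2 penalty, made-up values, not kge code):

    import torch

    # Toy setup: loss_i(w) = (w - x_i)^2, penalty = lam * w^2,
    # example losses are averaged per batch, as in the training loop.
    torch.manual_seed(0)
    x = torch.randn(10000)
    w = torch.tensor(0.5, requires_grad=True)
    lam = 0.1

    def mean_step_gradient(bs, divide_penalty):
        num_batches = len(x) // bs
        grads = []
        for batch in x.split(bs):
            w.grad = None
            loss = ((w - batch) ** 2).mean()   # averaged example losses
            penalty = lam * w ** 2
            if divide_penalty:                 # behavior introduced by 7aa82ce
                penalty = penalty / num_batches
            (loss + penalty).backward()
            grads.append(w.grad.item())
        return sum(grads) / len(grads)

    # Without the division, the mean per-step gradient is the same for any
    # batch size; with it, the penalty contribution shrinks as the number
    # of batches grows.
    print(mean_step_gradient(1000, False), mean_step_gradient(10, False))
    print(mean_step_gradient(1000, True), mean_step_gradient(10, True))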

@samuelbroscheit
Member

Gradients are not averaged; the losses are, however, if the loss is instantiated that way.

The relevant part is here:

https://github.com/uma-pi1/kge/blob/master/kge/job/train.py#L270

            penalty_values = self.model.penalty(
                epoch=self.epoch, batch_index=batch_index, num_batches=len(self.loader)
            )
            for pv_index, pv_value in enumerate(penalty_values):
                penalty_value = penalty_value + pv_value
                if len(sum_penalties) > pv_index:
                    sum_penalties[pv_index] += pv_value.item()
                else:
                    sum_penalties.append(pv_value.item())
            sum_penalty += penalty_value.item()
            batch_forward_time += time.time()
            forward_time += batch_forward_time

            # backward pass
            batch_backward_time = -time.time()
            cost_value = loss_value + penalty_value
            cost_value.backward()

penalty_value is not averaged.
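
A minimal sketch of that structure (toy model, not the actual kge loop): the example losses are averaged per batch, while penalty_value is added as-is, so over an epoch the penalty enters once per batch.

    import torch

    # Made-up model and data, for illustration only.
    w = torch.zeros(1, requires_grad=True)
    data = torch.randn(1000)
    lam, bs = 0.01, 100

    for batch in data.split(bs):
        loss_value = ((w - batch) ** 2).mean()    # averaged over the batch
        penalty_value = lam * (w ** 2).sum()      # not averaged
        cost_value = loss_value + penalty_value   # as in the snippet above
        cost_value.backward()
        # optimizer step / zero_grad omitted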

@rgemulla
Member Author

Loss average implies gradient average. The relevant code piece is here:

            loss_value, batch_size, batch_prepare_time, batch_forward_time = self._compute_batch_loss(
                batch_index, batch
            )
            sum_loss += loss_value.item() * batch_size

_compute_batch_loss is thus supposed to average; that's why its result is multiplied by the batch size afterwards.

7aa82ce should thus be reverted. If anything, the norms should be averaged over the number of training examples. But since this is constant (and can thus be viewed as part of lambda) and not meaningful for all penalties, we shouldn't do this.
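
A quick check of the "loss average implies gradient average" point with a made-up loss (it is just linearity of differentiation):

    import torch

    # Made-up example: the gradient of the mean loss equals the mean of the
    # per-example gradients, so averaging losses averages gradients.
    w = torch.tensor([1.0], requires_grad=True)
    x = torch.tensor([1.0, 2.0, 3.0])
    losses = (w * x - 1.0) ** 2

    grad_of_mean = torch.autograd.grad(losses.mean(), w, retain_graph=True)[0]
    per_example = [torch.autograd.grad(l, w, retain_graph=True)[0] for l in losses]
    assert torch.allclose(grad_of_mean, sum(per_example) / len(per_example))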

@samuelbroscheit
Member

samuelbroscheit commented Aug 13, 2019

So a smaller batch size (bs) means more penalty per epoch? With your suggestion we apply a penalty of N/bs * ||T|| per epoch, and with mine ||T||, correct?
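
With made-up numbers to spell out the two schemes (taking ||T|| = 1):

    # Hypothetical numbers, just to compare the per-epoch penalty totals.
    N, penalty = 1000, 1.0                                        # examples, ||T||
    for bs in (100, 50):
        num_batches = N // bs
        per_epoch_without_patch = num_batches * penalty               # N/bs * ||T||
        per_epoch_with_patch = num_batches * (penalty / num_batches)  # ||T||
        print(bs, per_epoch_without_patch, per_epoch_with_patch)
    # bs=100 -> 10.0 vs 1.0; bs=50 -> 20.0 vs 1.0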

Another question on this topic: why didn't we implement the weighted norm as in the CP paper?

@rgemulla
Member Author

Perhaps _compute_batch_loss should be renamed to _compute_batch_loss_avg to make it easier to interpret.

@rgemulla
Member Author

rgemulla commented Aug 13, 2019

Yes, more penalty per epoch. But also more loss per epoch, since each example's gradient now has more impact in every step (it is weighted by 1/batch_size within its batch). Again, without the patch:

E[gradient] = E[gradient of a random example] + gradient of penalty term

This is what we want: the expected gradient is independent of the batch size.

@rgemulla
Member Author

rgemulla commented Aug 13, 2019

As for the weighted norm: that's a separate issue. If you mean that frequency-based weighting is not implemented, see #20.
