
Negative KL divergence #44

Closed · yarinbar opened this issue Aug 13, 2023 · 3 comments
Labels: good first issue (Good for newcomers), question (Further information is requested)

Comments

yarinbar commented Aug 13, 2023

Hi!

I am using your package and get a negative KL divergence when training, and I am not sure why. I saw #29, but I suspect that solution is not applicable in my case.

Here is how the model is made:

import torch
import normflows as nf

# Binary mask for the affine coupling layer (alternating 1s and 0s)
b = torch.Tensor([1 if i % 2 == 0 else 0 for i in range(latent_dim)])

s = nf.nets.MLP([latent_dim, 2 * latent_dim, latent_dim], init_zeros=True)
t = nf.nets.MLP([latent_dim, 2 * latent_dim, latent_dim], init_zeros=True)
flows = [nf.flows.MaskedAffineFlow(b, t, s)]
flows += [nf.flows.ActNorm(latent_dim)]

base = nf.distributions.base.DiagGaussian(latent_dim)

# Construct flow model
self.nfm = nf.NormalizingFlow(base, flows)

And here is the training loop:

optimizer = torch.optim.Adam(self.nfm.parameters(), lr=lr, weight_decay=weight_decay)
loss_list = []

for epoch in range(n_epochs):
    print(f"Start epoch number {epoch + 1}")

    batch_cum_loss = 0
    n_batches = len(nf_train_loader)

    for batch_idx, (inputs, labels) in enumerate(nf_train_loader):
        batch_size = inputs.shape[0]

        inputs_cls = inputs.to(self.device)
        labels_cls = labels.to(self.device)
        optimizer.zero_grad()

        # The latent features come from a frozen network, so no gradients are needed here
        with torch.no_grad():
            outputs, _, latent = self.net(inputs_cls)

        # Compute loss
        loss = self.nfm.forward_kld(latent[-1])

        # Standard remainder of the loop (implied by the bookkeeping variables above)
        loss.backward()
        optimizer.step()
        batch_cum_loss += loss.item()

    loss_list.append(batch_cum_loss / n_batches)

Where latent[-1] is an intermediate output of a given network (before the classifier).

The loss that comes out is negative, whereas if I use sklearn's mutual_info_score I get a positive number:

from sklearn.metrics import mutual_info_score

q = torch.normal(mean=0, std=1, size=(batch_size, latent_dim))
# Each of the following was tried separately as a sanity check
res = mutual_info_score(latent[-1].view(-1,), q.view(-1,))
res = kl_div(latent[-1], q)
res = kl_loss(latent[-1], q)

As the loss graph shows, the loss values are also not stable while being negative, although if I ignore the sign, the curve does look like a normal training curve.

[Figure: training loss curve]

I would appreciate any help!

VincentStimper self-assigned this and added the good first issue and question labels on Sep 13, 2023.
VincentStimper (Owner) commented

Hi @yarinbar,

the forward KL divergence is given by $\text{KL}(p||q)=\mathbf{E}_p[\log\frac{p(x)}{q(x)}]=\mathbf{E}_p[\log p(x)]-\mathbf{E}_p[\log q(x)]$, where $p$ is the target and $q$ is the model. Since the target distribution is often unknown, as seems to be the case for your problem, the expectations are estimated with samples from the target, i.e. data. $\mathbf{E}_p[\log p(x)]$ still cannot be estimated in this case, but since it does not contain any model parameters, it is just a constant and is left out when computing the forward KL divergence. Hence, your loss is not literally the forward KL divergence, but the forward KL divergence minus an unknown constant shift, and, therefore, can become negative.

Best regards,
Vincent
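
To make this concrete, here is a minimal sketch (not from the original thread; it uses plain torch.distributions rather than the normflows API, and assumes forward_kld effectively returns the negative mean log-density of the data under the model): whenever the model's density exceeds 1 on the data, the estimate of $-\mathbf{E}_p[\log q(x)]$ is negative, even when the true KL divergence is zero.

import torch
from torch.distributions import Normal

# Model q: a narrow 1D Gaussian, so q(x) > 1 near its mode and log q(x) > 0 there.
q = Normal(loc=0.0, scale=0.05)

# Pretend the target p equals q and treat samples from it as "data".
x = q.sample((10_000,))

# Monte Carlo estimate of -E_p[log q(x)]: the forward KL divergence
# up to the unknown constant E_p[log p(x)].
loss = -q.log_prob(x).mean()
print(loss.item())  # roughly -1.58, even though KL(p||q) = 0 here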

ArtemKar123 commented

Hello,

Do I understand correctly that minimising such loss (KL divergence minus an unknown constant shift) will still be correct, despite it being negative?

VincentStimper (Owner) commented Jan 4, 2024

Hi @ArtemKar123,

Yes. Since the constant does not depend on the model's parameters, it disappears anyway when computing the gradient with respect to the parameters for the optimizer.
Moreover, in this case you are essentially minimizing $-\mathbf{E}_p[\log q(x)]$, so minimizing the forward KL divergence corresponds to maximizing the model's likelihood of the samples from the target, which is itself a common way to train machine learning models.

Best regards,
Vincent
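
To illustrate the first point, here is a small sketch (hypothetical, not part of the thread) showing that adding any constant to a loss leaves the gradients, and hence every optimizer update, unchanged:

import torch

theta = torch.tensor(2.0, requires_grad=True)

# Stand-in for -E_p[log q_theta(x)]; the exact form does not matter.
loss = (theta - 1.0) ** 2
# The same loss shifted by an arbitrary constant (playing the role of E_p[log p(x)]).
shifted = loss + 123.4

g1, = torch.autograd.grad(loss, theta, retain_graph=True)
g2, = torch.autograd.grad(shifted, theta)
assert torch.equal(g1, g2)  # identical gradients -> identical training dynamics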
