
Loss function in optimizer.py #20
Open · zzheyu opened this issue Dec 10, 2018 · 24 comments

@zzheyu commented Dec 10, 2018

Hi @tkipf,

Thank you for sharing the implementation.

In the OptimizerVAE class, when defining the KL divergence, I think there is an extra (1/num_nodes)^2 factor: one factor of num_nodes comes from the (0.5 / num_nodes) scaling, and the other is introduced by tf.reduce_mean.
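For reference, the lines I am referring to look roughly like this (paraphrased from memory of optimizer.py, so the exact code may differ slightly):

    # reconstruction term: mean over all N*N potential edges
    self.cost = norm * tf.reduce_mean(
        tf.nn.weighted_cross_entropy_with_logits(
            logits=preds_sub, targets=labels_sub, pos_weight=pos_weight))
    # latent term: per-node (negative) KL summed over latent dims, averaged over
    # the N nodes by tf.reduce_mean, then scaled again by 0.5 / num_nodes
    self.kl = (0.5 / num_nodes) * tf.reduce_mean(
        tf.reduce_sum(1 + 2 * model.z_log_std
                      - tf.square(model.z_mean)
                      - tf.square(tf.exp(model.z_log_std)), 1))
    self.cost -= self.kl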

This contradicts the results in Auto-Encoding Variational Bayes by Kingma and Welling (Appendix B, Solution of -KL, Gaussian case).

Could you expound on this a bit more?

Thanks!

@tkipf (Owner) commented Dec 10, 2018

This is because the cross-entropy loss applies to N^2 terms, i.e. all potential edges, while the KL term only applies to N terms, i.e. all nodes. This normalization makes sure they have a comparable scale.

@zzheyu (Author) commented Dec 11, 2018

Thank you.

But why would you divide the KL term by N^2 when it is the sum of N terms? Would it make more sense to divide it by N (so that we have the average of the sum, like in the cross-entropy term)?
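If I am reading the code correctly (and ignoring the pos_weight/norm factors in the reconstruction term), the two options are, with N = num_nodes, CE_ij the per-edge cross-entropy and KL_i the KL divergence of node i's posterior from the prior:

    cost ≈ (1/N^2) * Σ_{i,j} CE_ij  +  (1/N^2) * Σ_i KL_i    (as implemented)
    cost ≈ (1/N^2) * Σ_{i,j} CE_ij  +  (1/N)   * Σ_i KL_i    (KL averaged over its own N terms)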

@tkipf (Owner) commented Dec 11, 2018

I think you are right; this indeed makes the KL term quite small in comparison. The model can still be seen as a beta-VAE with a very small beta parameter. I'll have to check what scale these terms have in practice when running the model to see if there's an issue.

Overall, a Gaussian prior is not a good choice in any case in combination with a dot-product decoder, as mentioned in our follow-up paper: https://nicola-decao.github.io/s-vae.html

I would recommend running the GAE model in its non-probabilistic variant or using a hyperspherical posterior/prior.
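If you do want to experiment with the VGAE weighting, one simple (untested) modification is to make the KL weight explicit rather than leaving it implicit in the 0.5 / num_nodes scaling. A sketch, assuming the reconstruction term is available as self.log_lik (as I believe it is in optimizer.py) and introducing a new beta hyperparameter:

    # hypothetical beta-VAE style weighting (not part of the released code)
    beta = 1.0  # beta = 1 recovers the standard VAE objective; smaller values downweight the KL
    kl_per_node = -0.5 * tf.reduce_sum(
        1 + 2 * model.z_log_std
        - tf.square(model.z_mean)
        - tf.square(tf.exp(model.z_log_std)), 1)     # KL(q(z_i | X, A) || N(0, I)) per node
    self.kl = beta * tf.reduce_mean(kl_per_node)     # average over N nodes, not N^2
    self.cost = self.log_lik + self.kl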

@zzheyu (Author) commented Dec 11, 2018

Thanks for the suggestions :)

@YH-learning

Hi @tkipf,
Thanks for sharing your code. When I run it with a large beta (i.e., using kl_loss + ent_loss), the performance is not as good as with kl_loss / num_nodes + ent_loss. I am quite confused about this.

Could you explain this in more detail, or have you solved this issue?

Thanks again for sharing.

@tkipf (Owner) commented Jan 3, 2019 via email

@zzheyu (Author) commented Jan 18, 2019

Hi @tkipf,

Have you tried outputting the reconstructed adjacency matrix on the training data? I have found that the reconstruction deviates a lot from the original matrix (for both the cora and citeseer graphs): the number of edges in the reconstructed matrix is about 300 times that of the original adjacency matrix (for cora).

Of course this could be a mistake on my part, so please do correct me if that is the case.

Thanks

@tkipf (Owner) commented Jan 18, 2019 via email

@dawnranger commented Jan 20, 2019

Can you paste the code you used to reconstruct the adjacency matrix?

I'm also confused about this. The results of the reconstructed matrix look like this:

Epoch: 0010 TP=0013246 FN=0000018 FP=4508390 TN=2811610 Precision=0.0029 Recall=0.9986
Epoch: 0020 TP=0013238 FN=0000026 FP=3539226 TN=3780774 Precision=0.0037 Recall=0.9980
Epoch: 0030 TP=0013248 FN=0000016 FP=3282812 TN=4037188 Precision=0.0040 Recall=0.9988
Epoch: 0040 TP=0013254 FN=0000010 FP=3176094 TN=4143906 Precision=0.0042 Recall=0.9992
Epoch: 0050 TP=0013256 FN=0000008 FP=3168532 TN=4151468 Precision=0.0042 Recall=0.9994
Epoch: 0060 TP=0013258 FN=0000006 FP=3138802 TN=4181198 Precision=0.0042 Recall=0.9995
Epoch: 0070 TP=0013264 FN=0000000 FP=3110030 TN=4209970 Precision=0.0042 Recall=1.0000
Epoch: 0080 TP=0013264 FN=0000000 FP=3082102 TN=4237898 Precision=0.0043 Recall=1.0000
Epoch: 0090 TP=0013264 FN=0000000 FP=3063600 TN=4256400 Precision=0.0043 Recall=1.0000
Epoch: 0100 TP=0013264 FN=0000000 FP=3061570 TN=4258430 Precision=0.0043 Recall=1.0000
Epoch: 0110 TP=0013264 FN=0000000 FP=3065990 TN=4254010 Precision=0.0043 Recall=1.0000
Epoch: 0120 TP=0013264 FN=0000000 FP=3069514 TN=4250486 Precision=0.0043 Recall=1.0000
Epoch: 0130 TP=0013264 FN=0000000 FP=3075558 TN=4244442 Precision=0.0043 Recall=1.0000
Epoch: 0140 TP=0013264 FN=0000000 FP=3084226 TN=4235774 Precision=0.0043 Recall=1.0000
Epoch: 0150 TP=0013264 FN=0000000 FP=3092052 TN=4227948 Precision=0.0043 Recall=1.0000
Epoch: 0160 TP=0013264 FN=0000000 FP=3097308 TN=4222692 Precision=0.0043 Recall=1.0000
Epoch: 0170 TP=0013264 FN=0000000 FP=3100544 TN=4219456 Precision=0.0043 Recall=1.0000
Epoch: 0180 TP=0013264 FN=0000000 FP=3101718 TN=4218282 Precision=0.0043 Recall=1.0000
Epoch: 0190 TP=0013264 FN=0000000 FP=3103924 TN=4216076 Precision=0.0043 Recall=1.0000
Epoch: 0200 TP=0013264 FN=0000000 FP=3106388 TN=4213612 Precision=0.0043 Recall=1.0000

The code is:

    preds = tf.cast(tf.greater_equal(tf.sigmoid(adj_preds), 0.5), tf.int32)  # threshold probabilities at 0.5
    labels = tf.cast(adj_out, tf.int32)
    self.accuracy = tf.reduce_mean(tf.cast(tf.equal(preds, labels), tf.float32))
    self.TP = tf.count_nonzero(preds * labels)              # predicted 1, actual 1
    self.FP = tf.count_nonzero(preds * (labels - 1))        # predicted 1, actual 0
    self.FN = tf.count_nonzero((preds - 1) * labels)        # predicted 0, actual 1
    self.TN = tf.count_nonzero((preds - 1) * (labels - 1))  # predicted 0, actual 0
    self.precision = self.TP / (self.TP + self.FP)
    self.recall = self.TP / (self.TP + self.FN)

It seems that the inner-product decoder tends to reconstruct far more edges than expected. The optimization procedure reduces FP and increases TN, but it does not really improve TP.

@tkipf (Owner) commented Jan 20, 2019 via email

@dawnranger

In the case of adjacency matrix reconstruction, I think the training set is always unbalanced. For example, in the cora dataset the adjacency matrix has 2708*2708 = 7,333,264 entries, but there are only 5,429 edges. The positive-to-negative ratio is 5429*2 : (2708*2708 - 5429*2) ≈ 1 : 674.37, so pos_weight ≈ 674 in your code; that is why you use tf.nn.weighted_cross_entropy_with_logits rather than tf.nn.sigmoid_cross_entropy_with_logits.
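For concreteness, the weighting boils down to something like this (a sketch of how pos_weight can be derived from the adjacency matrix; the names adj, labels and logits are placeholders, and the released train.py may differ in detail):

    import tensorflow as tf

    # adj: binary N x N adjacency matrix (dense or scipy sparse)
    n = adj.shape[0]
    num_pos = adj.sum()                    # number of 1-entries (edges, counted in both directions)
    num_neg = n * n - num_pos              # number of 0-entries (non-edges)
    pos_weight = float(num_neg) / num_pos  # ~674 for cora, as computed above

    # every positive (edge) term in the loss is up-weighted by pos_weight
    loss = tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(
        targets=labels, logits=logits, pos_weight=pos_weight))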

Maybe your training set is unbalanced? It looks like this is not necessarily a problem with this code release...

@tkipf (Owner) commented Jan 21, 2019 via email

@zzheyu (Author) commented Jan 23, 2019

Hi @tkipf,

Thank you for your suggestion; I have tried various thresholds, but the reconstructed adjacency matrix still seems very noisy even when the threshold is above 0.9.

However, I think the problem is not with the threshold. The performance on the validation and test sets reported in the paper is very good because these sets are balanced (half edges, half non-edges). To test this, I increased the number of non-edges in the validation and test sets (to 5, 30, and 100 times the number of edges) to simulate the sparsity of the graph, and the average precision score dropped significantly as the number of non-edges increased. I also evaluated the F1 score on the val/test sets, and it is usually quite low.

Below is the code for the reconstruction. It is identical to the code you used to evaluate validation and test performance.

feed_dict.update({placeholders['dropout']: 0})
pred = sess.run(model.reconstructions, feed_dict=feed_dict)    # flat vector of edge logits
A_recon_prob_vec = 1 / (1 + np.exp(-pred))                     # sigmoid -> edge probabilities
A_recon_prob = A_recon_prob_vec.reshape(num_nodes, -1)         # back to an N x N matrix
A_recon_prob = A_recon_prob - np.diag(np.diag(A_recon_prob))   # zero out the diagonal (self-loops)
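This is roughly how I score the reconstruction at a given threshold (a sketch; A_true here stands for the dense ground-truth adjacency matrix, which is not a variable in the original scripts):

    from sklearn.metrics import precision_score, recall_score, f1_score

    threshold = 0.9
    A_pred = (A_recon_prob > threshold).astype(int)  # binarize the reconstructed probabilities
    y_true = A_true.flatten()                        # dense 0/1 ground-truth adjacency
    y_pred = A_pred.flatten()
    print('precision=%.4f recall=%.4f f1=%.4f' % (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred)))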

And here is the modification to sample more non-edges in val and test sets.

# Original
val_edges_false = []
while len(val_edges_false) < len(val_edges):
# Modified
val_edges_false = []
while len(val_edges_false) < non_edges * len(val_edges):
# non_edges is an integer indicating the multiple

Thanks

@tkipf (Owner) commented Jan 23, 2019 via email

@zzheyu (Author) commented Jan 24, 2019

Thanks for the clarification and the further suggestions.

@jlevy44 commented Apr 1, 2019

I've been exploring a few real-world social networks using GAE/VGAE and am running into this very problem. Is there a consensus on a solution? The number of positive edges is far smaller than the number of negative edges. I've been adjusting my threshold, which has helped somewhat, but ideally I should at least be able to overfit on the adjacency matrix itself and recover the original adjacency. That is not the case.

I have a 33-node network and a 71-node network.

@tkipf (Owner) commented Apr 1, 2019 via email

@jlevy44 commented Apr 1, 2019

Is there literature to support the use of this?

@tkipf (Owner) commented Apr 1, 2019 via email

@XuanHeIIIS

Hi @tkipf,
Thanks for your nice work!
I noticed that most of these GCN-based auto-encoders are trained by optimizing the structure (A) reconstruction error. Did you try training the GCN by minimizing the feature (X) reconstruction error instead? If you tried, can you share more details about it?
Thanks!

@tkipf (Owner) commented Apr 2, 2019 via email

@XuanHeIIIS

Can you share some examples that reconstruct the node features? I am a little confused about the reconstruction process. Thanks!

@tkipf (Owner) commented Apr 2, 2019 via email

@Yumlembam

What should I use for the activation of z*z^T if I use exp(-d(x, y)) as the loss, @tkipf?
