Loss function in optimizer.py #20
This is because the cross-entropy loss applies to N^2 terms, i.e. all potential edges, while the KL term only applies to N terms, i.e. all nodes. This normalization makes sure they have a comparable scale. |
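For reference, a minimal TF1-style sketch of how these two terms can be scaled; the placeholder shapes and the pos_weight/norm values below are illustrative assumptions, not the exact optimizer.py code:
import tensorflow as tf  # TF1-style, matching the code elsewhere in this thread

num_nodes, latent_dim = 2708, 16                               # Cora-sized, illustrative only
preds = tf.placeholder(tf.float32, [num_nodes * num_nodes])    # logits for all N^2 potential edges
labels = tf.placeholder(tf.float32, [num_nodes * num_nodes])   # flattened adjacency targets
z_mean = tf.placeholder(tf.float32, [num_nodes, latent_dim])
z_log_std = tf.placeholder(tf.float32, [num_nodes, latent_dim])
pos_weight, norm = 674.0, 0.5                                  # illustrative values (see the pos_weight discussion below)

# Reconstruction: one cross-entropy term per potential edge, averaged over the N^2 entries.
cross_entropy = norm * tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(
        targets=labels, logits=preds, pos_weight=pos_weight))

# (Negative) KL of the Gaussian posterior against the standard-normal prior: the inner
# sum runs over each node's latent dimensions, tf.reduce_mean averages over the N nodes,
# and (0.5 / num_nodes) is the additional scaling questioned in this issue.
kl = (0.5 / num_nodes) * tf.reduce_mean(tf.reduce_sum(
    1 + 2 * z_log_std - tf.square(z_mean) - tf.square(tf.exp(z_log_std)), 1))

cost = cross_entropy - kl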
Thank you. But why would you divide the KL term by N^2 when it is the sum of N terms? Would it make more sense to divide it by N (so that we have the average of the sum, like in the cross-entropy term)? |
I think you are right, this makes the KL term indeed quite small in comparison. The model can still be seen as a beta-VAE with a very small beta parameter. I'll have to check what scale these terms have in practice when running the model to see if there's an issue. Overall, a Gaussian prior is not a good choice in any case in combination with a dot-product decoder, as mentioned in our follow-up paper: https://nicola-decao.github.io/s-vae.html I would recommend running the GAE model in the non-probabilistic variant or using a hyperspherical posterior/prior. |
Thanks for the suggestions:) |
Hi @tkipf, Thanks for sharing. When running the code with a big beta parameter (i.e., *kl_loss + ent_loss*), the performance is not as good as with *kl_loss / num_nodes + ent_loss*; I am very confused about this. Could you explain this more, or have you solved this issue? Thanks again for your sharing. |
A large beta acts as a regularizer / soft constraint on the latent code, so this will typically harm link prediction performance if the parameter is too large. You should even get reasonable results with beta=0 (i.e. no KL term at all), as the model is typically sufficiently regularized via negative sampling (i.e. we choose random negative links).
|
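Reusing the tensors from the sketch further above, the beta-VAE reading mentioned here is simply a scalar weight on the KL term; beta is a hypothetical knob for illustration, not a flag in this repository:
# Per-node (negative) KL, summed over the latent dimensions.
kl_per_node = 0.5 * tf.reduce_sum(
    1 + 2 * z_log_std - tf.square(z_mean) - tf.square(tf.exp(z_log_std)), 1)

beta = 0.0   # beta = 0 drops the KL term entirely; a large beta over-regularizes
             # the latent code and tends to hurt link prediction.
cost = cross_entropy - beta * tf.reduce_mean(kl_per_node)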
Hi @tkipf, Have you tried to output the reconstructed adjacency matrix based on the training data? I have found that the reconstruction deviates a lot from the original matrix (for the Cora and Citeseer graphs): the number of edges in the reconstructed matrix is about 300 times larger than in the original adjacency matrix (Cora graph). Of course this could be a mistake, please do correct me if this is not the case. Thanks |
Can you paste the code you used to reconstruct the adjacency matrix?
|
I'm also confused about this. The results of the reconstructed matrix look like this:
Epoch: 0010 TP=0013246 FN=0000018 FP=4508390 TN=2811610 Precision=0.0029 Recall=0.9986
Epoch: 0020 TP=0013238 FN=0000026 FP=3539226 TN=3780774 Precision=0.0037 Recall=0.9980
Epoch: 0030 TP=0013248 FN=0000016 FP=3282812 TN=4037188 Precision=0.0040 Recall=0.9988
Epoch: 0040 TP=0013254 FN=0000010 FP=3176094 TN=4143906 Precision=0.0042 Recall=0.9992
Epoch: 0050 TP=0013256 FN=0000008 FP=3168532 TN=4151468 Precision=0.0042 Recall=0.9994
Epoch: 0060 TP=0013258 FN=0000006 FP=3138802 TN=4181198 Precision=0.0042 Recall=0.9995
Epoch: 0070 TP=0013264 FN=0000000 FP=3110030 TN=4209970 Precision=0.0042 Recall=1.0000
Epoch: 0080 TP=0013264 FN=0000000 FP=3082102 TN=4237898 Precision=0.0043 Recall=1.0000
Epoch: 0090 TP=0013264 FN=0000000 FP=3063600 TN=4256400 Precision=0.0043 Recall=1.0000
Epoch: 0100 TP=0013264 FN=0000000 FP=3061570 TN=4258430 Precision=0.0043 Recall=1.0000
Epoch: 0110 TP=0013264 FN=0000000 FP=3065990 TN=4254010 Precision=0.0043 Recall=1.0000
Epoch: 0120 TP=0013264 FN=0000000 FP=3069514 TN=4250486 Precision=0.0043 Recall=1.0000
Epoch: 0130 TP=0013264 FN=0000000 FP=3075558 TN=4244442 Precision=0.0043 Recall=1.0000
Epoch: 0140 TP=0013264 FN=0000000 FP=3084226 TN=4235774 Precision=0.0043 Recall=1.0000
Epoch: 0150 TP=0013264 FN=0000000 FP=3092052 TN=4227948 Precision=0.0043 Recall=1.0000
Epoch: 0160 TP=0013264 FN=0000000 FP=3097308 TN=4222692 Precision=0.0043 Recall=1.0000
Epoch: 0170 TP=0013264 FN=0000000 FP=3100544 TN=4219456 Precision=0.0043 Recall=1.0000
Epoch: 0180 TP=0013264 FN=0000000 FP=3101718 TN=4218282 Precision=0.0043 Recall=1.0000
Epoch: 0190 TP=0013264 FN=0000000 FP=3103924 TN=4216076 Precision=0.0043 Recall=1.0000
Epoch: 0200 TP=0013264 FN=0000000 FP=3106388 TN=4213612 Precision=0.0043 Recall=1.0000
The code is:
preds = tf.cast(tf.greater_equal(tf.sigmoid(adj_preds), 0.5), tf.int32)  # threshold predicted edge probabilities at 0.5
labels = tf.cast(adj_out, tf.int32)                                      # ground-truth adjacency entries
self.accuracy = tf.reduce_mean(tf.cast(tf.equal(preds, labels), tf.float32))
self.TP = tf.count_nonzero(preds * labels)               # predicted 1, actual 1
self.FP = tf.count_nonzero(preds * (labels - 1))         # predicted 1, actual 0
self.FN = tf.count_nonzero((preds - 1) * labels)         # predicted 0, actual 1
self.TN = tf.count_nonzero((preds - 1) * (labels - 1))   # predicted 0, actual 0
self.precision = self.TP / (self.TP + self.FP)
self.recall = self.TP / (self.TP + self.FN)
It seems that the inner product module tends to reconstruct far more edges than expected. The optimization procedure is reducing FP and increasing TN, but does nothing for TP. |
Maybe your training set is unbalanced? It looks like this is not
necessarily a problem with this code release...
|
In the case of adjacency matrix reconstruction, I think the training set is always unbalanced. For example, in the Cora dataset, the adjacency matrix has 2708*2708 = 7,333,264 entries, but there are only 5,429 edges. The positive-to-negative ratio is 5429*2 : (2708*2708 - 5429*2) ≈ 1:674, which is the pos_weight (https://github.com/tkipf/gae/blob/master/gae/train.py#L77) in your code; that's why you are using tf.weighted_cross_entropy_with_logits rather than tf.sigmoid_cross_entropy_with_logits.
|
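To make the arithmetic above concrete, a small sketch of computing pos_weight and the corresponding normalization constant from a scipy sparse adjacency matrix; the file name is hypothetical and the formulas are stated as assumptions rather than a quote of train.py:
import scipy.sparse as sp

adj = sp.load_npz('cora_adj.npz')        # hypothetical file; any scipy sparse adjacency works
num_nodes = adj.shape[0]                 # 2708 for Cora
num_entries = num_nodes * num_nodes      # 7,333,264 potential entries
num_pos = adj.sum()                      # ~10,858 nonzero entries (5,429 undirected edges)

pos_weight = float(num_entries - num_pos) / num_pos      # roughly 674 negatives per positive
norm = num_entries / float((num_entries - num_pos) * 2)  # keeps the weighted loss on a comparable scale

print(pos_weight, norm)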
I see, I think I misunderstood the question - I assumed that this was related to some custom adaptation of the model or the use of some other dataset.
Note that 0.5 is not necessarily the best threshold for edge classification in this case. If you want to find the prediction with the best precision/recall trade-off, you can plot a precision/recall curve and find the optimal threshold value for your particular case. This should hopefully give more reasonable results.
|
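A minimal sketch of picking such a threshold from a precision/recall curve with scikit-learn; y_true and y_score stand in for flattened ground-truth labels and predicted edge probabilities:
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])                    # toy labels
y_score = np.array([0.9, 0.4, 0.2, 0.7, 0.6, 0.1, 0.3, 0.8])   # toy predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# One possible trade-off criterion: the threshold with the best F1 score.
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = np.argmax(f1[:-1])   # the last precision/recall point has no associated threshold
print('best threshold:', thresholds[best], 'F1:', f1[best])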
Hi @tkipf, Thank you for your suggestion. I have tried various thresholds, but it seems the reconstructed adjacency matrix is still very noisy even when the threshold is above 0.9. However, I think the problem is not with the threshold. The performance on the validation and test sets in the paper is very good because these sets are very balanced (half edges and half non-edges). To verify this, I increased the number of non-edges in the validation and test sets (5, 30 and 100 times the number of edges) to simulate the sparsity of the graph, and the average precision scores dropped significantly as the number of non-edges increased. I also evaluated the F1 score on the val/test sets, and it is usually quite low.
Below is the code for the reconstruction. It is identical to the code you use to evaluate validation and test performance:
feed_dict.update({placeholders['dropout']: 0})   # disable dropout at evaluation time
pred = sess.run(model.reconstructions, feed_dict=feed_dict)
A_recon_prob_vec = 1 / (1 + np.exp(-pred))       # sigmoid over the reconstructed logits
A_recon_prob = A_recon_prob_vec.reshape(num_nodes, -1)
A_recon_prob = A_recon_prob - np.diag(np.diag(A_recon_prob))  # drop self-loops
And here is the modification to sample more non-edges in the val and test sets:
# Original
val_edges_false = []
while len(val_edges_false) < len(val_edges):
# Modified
val_edges_false = []
while len(val_edges_false) < non_edges * len(val_edges):  # non_edges is an integer indicating the multiple
Thanks |
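To illustrate the effect described above, a hedged sketch of how average precision degrades as the evaluation set becomes more imbalanced; the score arrays are random placeholders standing in for the model's outputs:
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
pos_scores = rng.beta(5, 2, size=500)          # scores for true edges (placeholder)
neg_scores = rng.beta(2, 5, size=500 * 100)    # scores for sampled non-edges (placeholder)

for multiple in (1, 5, 30, 100):
    neg = neg_scores[:500 * multiple]
    y_true = np.concatenate([np.ones_like(pos_scores), np.zeros_like(neg)])
    y_score = np.concatenate([pos_scores, neg])
    # AP drops as more negatives are added, even though the per-pair scores are unchanged.
    print(multiple, average_precision_score(y_true, y_score))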
Thanks, this is indeed a very insightful analysis. It would be interesting to see what model/loss adaptations are necessary for this model to generalize in more realistic scenarios.
We have evaluated variants of GAEs for knowledge base completion (https://arxiv.org/abs/1703.06103) and recommender systems (https://arxiv.org/abs/1706.02263), where the model was able to perform very well on more standard benchmarks. So I think it is not necessarily an issue with the model per se. I think this issue could probably be fixed by training with negative sampling (instead of using a weighted cross-entropy objective) -- as also mentioned in the GAE workshop paper.
|
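A rough sketch of what training with negative sampling could look like, instead of a dense weighted cross-entropy over all N^2 entries; the helper below is an illustrative assumption, not the repository's training loop:
import numpy as np

def sample_training_pairs(edges, num_nodes, rng, neg_per_pos=1):
    # Return (pairs, labels): all positive edges plus randomly sampled non-edges.
    edge_set = {(int(i), int(j)) for i, j in edges}
    negatives = []
    while len(negatives) < neg_per_pos * len(edge_set):
        i, j = map(int, rng.integers(0, num_nodes, size=2))
        if i != j and (i, j) not in edge_set:
            negatives.append((i, j))
    pairs = np.array(list(edge_set) + negatives)
    labels = np.concatenate([np.ones(len(edge_set)), np.zeros(len(negatives))])
    return pairs, labels

# Each epoch, the binary cross-entropy would then be evaluated only on these
# (roughly balanced) sampled pairs, rather than on all N^2 entries with pos_weight.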
Thanks for clarifying and your further suggestions. |
I've been exploring a few real-world social networks using GAE/VGAE and am running into this very problem. Is there a consensus on a solution? The number of positive edges is far smaller than the number of negative edges. I've just been adjusting my threshold, which has helped some, but ideally I should at least be able to overfit on the adjacency itself and recover the original adjacency, and this is not the case. I have a 33-node network and a 71-node network. |
You can try `exp(-d(x,y))` as a scoring function for links, where `d(x,y)` is the Euclidean distance between two node embedding vectors. This might be less susceptible to producing overly dense (or overly sparse) reconstructions than the inner-product scoring function `sigmoid(x^T y)`.
|
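For illustration, a sketch of such a distance-based decoder next to the inner-product one; z is an assumed [num_nodes, dim] embedding tensor and this is not code from the repository:
import tensorflow as tf  # TF1-style, as elsewhere in this thread

def inner_product_decoder(z):
    # sigmoid(z_i^T z_j) for every pair of nodes.
    return tf.sigmoid(tf.matmul(z, z, transpose_b=True))

def distance_decoder(z):
    # exp(-d(z_i, z_j)) with d the Euclidean distance: scores lie in (0, 1]
    # and decay with distance instead of growing with the embedding norms.
    sq_norms = tf.reduce_sum(tf.square(z), axis=1, keepdims=True)
    sq_dist = sq_norms - 2.0 * tf.matmul(z, z, transpose_b=True) + tf.transpose(sq_norms)
    dist = tf.sqrt(tf.maximum(sq_dist, 1e-12))
    return tf.exp(-dist)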
Is there literature to support the use of this? |
I recommend having a look at "A Tutorial on Energy-Based Learning" by LeCun et al. Using a scoring function like `exp(-d(x,y))` essentially boils down to an energy-based model with `d(x,y)` being the energy associated with a particular edge. Energy-based losses have been explored in the context of link prediction before, e.g., in https://arxiv.org/abs/1707.03815 . Scoring functions in terms of inner products, on the other hand, come up in the context of graphons (https://en.wikipedia.org/wiki/Graphon), which are known to produce "dense blobs" when queried as a generative model.
|
Hi @tkipf, Thanks for your nice work! I noticed that most of these GCN auto-encoders are trained by optimizing the structure (A) reconstruction error. Did you try training the GCN by minimizing the feature (X) reconstruction error instead? If you tried, can you share more details about it? Thanks! |
Yes, we've played around with model variants where we also reconstruct the node features (X) or include label information in a semi-supervised auto-encoder setting. Both approaches usually help improve the overall performance.
|
Can you share some examples which reconstruct the node features? I am a little confused about the reconstruction process. Thanks! |
You can use a node-level MLP to reconstruct node features from the final decoder GCN outputs. If node features are binary, then simply using a binary cross entropy loss works well. Otherwise, for continuous features, MSE (mean-squared error) works well.
|
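A minimal sketch of a node-level feature decoder along these lines; the layer sizes and names are assumptions, with z standing for the node embeddings and features for the original X:
import tensorflow as tf  # TF1-style, as elsewhere in this thread

def feature_decoder(z, num_features, hidden=64):
    # Per-node MLP mapping embeddings back to feature logits.
    h = tf.layers.dense(z, hidden, activation=tf.nn.relu)
    return tf.layers.dense(h, num_features)

# Binary features: sigmoid cross-entropy per feature dimension, e.g.
#   feat_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
#       labels=features, logits=feature_decoder(z, num_features)))
# Continuous features: mean-squared error instead, e.g.
#   feat_loss = tf.reduce_mean(tf.square(features - feature_decoder(z, num_features)))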
What should I use as the activation for z*z^T if I use exp(-d(x,y)) as the loss, @tkipf? |
Hi @tkipf,
Thank you for sharing the implementation.
In the OptimizerVAE class, when defining the KL divergence, I think there is an extra (1/num_nodes)^2 factor: one factor of 1/num_nodes comes from the (0.5 / num_nodes) coefficient, and the other is introduced by tf.reduce_mean. This contradicts the result in Auto-Encoding Variational Bayes by Kingma and Welling (Appendix B, the closed-form solution of -KL for the Gaussian case).
Could you expound on this a bit more?
Thanks!