
Add CTDNE (time-based random walk) demo #1056

Merged: 17 commits, merged Mar 19, 2020

Conversation

@kjun9 (Contributor) commented Mar 10, 2020

See #827

This is running on CI with reduced numbers for:

walk_length: 5  # random walk length
cw_size: 5  # context window size

@habiba-h I kept your approach of comparing biased random walks to temporal walks - there's barely any difference in the performance though 🤷‍♂

Also added the link prediction utilities directly in the demo - it ended up looking fine I think, so #1020 is closed
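For readers skimming the thread, here is a minimal sketch of the workflow under review, assuming StellarGraph's TemporalRandomWalk API and gensim's Word2Vec; the parameter values and the edges_graph DataFrame name are illustrative, and the notebook itself is the source of truth:

    from stellargraph import StellarGraph
    from stellargraph.data import TemporalRandomWalk
    from gensim.models import Word2Vec

    # edges_graph: a DataFrame with "source", "target" and "time" columns
    # (illustrative name); edge weights are interpreted as timestamps.
    graph = StellarGraph(edges=edges_graph, edge_weight_column="time")

    temporal_rw = TemporalRandomWalk(graph)
    temporal_walks = temporal_rw.run(
        num_cw=1000,        # total context windows to collect (illustrative)
        cw_size=5,          # context window size, as in the CI config above
        max_walk_length=5,  # walk_length above; temporal walks may end early
        walk_bias="exponential",
    )

    # Skip-gram embeddings over the walks; gensim < 4.0 uses `size`,
    # gensim >= 4.0 renames it to `vector_size`.
    temporal_model = Word2Vec(
        [[str(n) for n in walk] for walk in temporal_walks],
        size=128, window=5, min_count=0, sg=1,
    )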

@review-notebook-app bot: Check out this pull request on ReviewNB to see the Jupyter notebook diff and discuss changes.

@codeclimate bot commented Mar 10, 2020: Code Climate has analyzed commit 22da378 and detected 0 issues on this pull request.

@kjun9 kjun9 changed the title Add temporal walk demo Add CTDNE (time-based random walk) demo Mar 10, 2020
@kjun9 kjun9 marked this pull request as ready for review March 10, 2020 23:55
@huonw (Member) commented Mar 10, 2020

> I can do that here or in a separate PR depending on which is preferred for reviewers, but I think the notebook can be reviewed for its general content/workflow regardless

It'd be good to land this with CI checking it, rather than having to add another FIXME. I'd be happy to see the dataset added here or in a separate PR (that lands first).

@huonw (Member) left a comment:

Nice.

Given that the results of static and dynamic are so close, do you have a sense of the variance of metrics like the ROC AUC? As in, does it seem like temporal walks are truly outperforming static ones?

" return pos, neg\n",
"\n",
"\n",
"def sample_negative_examples(g, positive_examples):\n",
@huonw (Member) commented Mar 11, 2020:

I'm sure you realise this, but it's lucky this is a graph with only 151 nodes, since this function is O(n^2) 😄

@kjun9 (Contributor Author) replied:

I guess an alternative would be to just do the random sampling in a while loop, checking each time that the sampled pair doesn't already exist in the graph - do you think that would be better?

Looks like our EdgeSplitter._sample_negative_examples_global has a slightly different approach too, but I thought it was simple/clear enough to just do it in the notebook instead.
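A minimal sketch of the while-loop alternative mentioned above (a hypothetical helper, not the notebook's code; it assumes hashable, orderable node IDs and an undirected graph):

    import random

    def sample_negative_examples_rejection(g, positive_examples, seed=0):
        """Draw random node pairs, keeping only those that aren't edges in g."""
        rng = random.Random(seed)
        nodes = list(g.nodes())
        existing = {tuple(sorted(edge)) for edge in g.edges()}
        neg = set()
        # expected O(len(positive_examples)) iterations on a sparse graph
        while len(neg) < len(positive_examples):
            u, v = rng.sample(nodes, 2)
            pair = tuple(sorted((u, v)))
            if pair not in existing:
                neg.add(pair)
        return list(neg)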

@huonw (Member) replied:

My vote would be to use our code (i.e. EdgeSplitter) if we can; I think for this code it could be done by creating a graph from edges_other and then using EdgeSplitter. However, maybe that's awkward and not worth it...

Oh, and maybe it's not correct to use EdgeSplitter: it seems like this code considers everything that's outside pos a negative example, i.e. neg might include some edges that are in pos_test. Is this on purpose? Does this match what the paper does, or is this a point where it's too vague to tell? If it doesn't match (whether we know it or not), I wonder if this might explain the dramatic improvement we're seeing over the paper's numbers.
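For reference, the EdgeSplitter usage being suggested would look roughly like this (a sketch only; edges_other and the p value are illustrative, and as noted above it may not match the paper's sampling exactly):

    from stellargraph import StellarGraph
    from stellargraph.data import EdgeSplitter

    # Build a graph from the held-out edges, then let EdgeSplitter sample
    # an equal number of negative (non-edge) examples globally.
    held_out_graph = StellarGraph(edges=edges_other, edge_weight_column="time")
    splitter = EdgeSplitter(held_out_graph)
    reduced_graph, example_ids, example_labels = splitter.train_test_split(
        p=0.25, method="global"
    )
    # example_ids: node pairs; example_labels: 1 for positive, 0 for negative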

@kjun9 (Contributor Author) replied:

In section 3 of the paper:

> The remaining 25% are considered as positive links and we sample an equal number of negative edges randomly

is the only reference to how this is done. I'd expect the scores to improve further if negative edges were forced not to overlap with any of the positives at all, so in this case I'm not sure it explains the improvement in our code.

"source": [
"temporal_node_embeddings = temporal_model.wv.vectors\n",
"static_node_embeddings = static_model.wv.vectors\n",
"plot_tsne(\"TSNE visualisation of temporal node embeddings\", temporal_node_embeddings)\n",
@huonw (Member) commented:

Are there no labels we can use for these to make the plots coloured, and thus more interpretable (rather than just "oh, yeah, they seem grouped")?

@kjun9 (Contributor Author) replied:

Yeah, I guess I could omit this part; I found it slightly useful to see whether the embeddings "looked informative" on the surface while I was working on it.

@huonw (Member) replied:

I think it's reasonable to include since, as you say, it gives a hint about whether the embeddings are picking up signal in some form. I'm just wondering if there's a way we can augment them further... it seems not easily, from what's in the dataset, though.

@kjun9 (Contributor Author) replied:

Theoretically, the temporal embeddings might cluster together nodes that have edges with similar time values, and separate those with very different time values. I can try to assign colours to nodes based on something like the average time across each node's incident edges, but there's no guarantee it will actually look meaningful.
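A sketch of that colouring idea (hypothetical code, not necessarily what landed in the notebook; it assumes edges_graph has "source"/"target"/"time" columns and that node_ids lists nodes in the same order as the embedding rows):

    import numpy as np
    import pandas as pd

    # Average the "time" value over each node's incident edges.
    times_by_node = pd.concat(
        [
            edges_graph[["source", "time"]].rename(columns={"source": "node"}),
            edges_graph[["target", "time"]].rename(columns={"target": "node"}),
        ]
    )
    mean_time = times_by_node.groupby("node")["time"].mean()

    # One colour value per embedding row, to pass as `c=` to plt.scatter.
    y = np.array([mean_time.get(n) for n in node_ids])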

@huonw (Member) replied:

Ah, yeah. Interesting idea, although maybe too much code to be worth it, unless you think it is going to be enough of an improvement?

@kjun9 (Contributor Author) replied:

The code wasn't too much to add in the end, but I'm not sure if the visual is all that convincing:
[image: t-SNE plot coloured by average edge time]

@huonw (Member) commented Mar 12, 2020:

Yeah, I'm not particularly convinced. It kinda seems like similar-age things (assuming that's what the colors mean) are being clustered better, but it's hard to say for sure.

"* Node2Vec: 0.759 \n",
"* CTDNE: 0.777\n",
"\n",
"Below are a set of helper functions we can use to train and evaluate a link prediction classifier. The rest of the notebook will use the binary operator defined in the cell below. Available operators in the demo are:\n",
@huonw (Member) commented:

I'm not personally a huge fan of adding unused code to notebooks: it's easy for the code to just be a distraction from the teaching, and it's not tested and so may break. We could delete it, but that's a bit less nice than exploring multiple styles, especially if they give very different results.

Do you think we could at least run the temporal embeddings ROC AUC evaluation with all of them, but maybe still use only L2 for static and the t-SNE plot?

@kjun9 (Contributor Author) replied:

Some of the operators ended up causing convergence warnings while training the classifier (although the scores still looked fine) so that's partly why I didn't go through all of them - should I just remove them?

@huonw (Member) replied:

Can you reduce it to only two that converge quickly? If we do this, running them both unconditionally for temporal random walks and mentioning other possibilities in the text (like "any function that combines the features can work, such as the absolute (L1) distance, the average, or the product (Hadamard)") might be a good balance between being generic (demonstrating how to use any function) and not being an overload.
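For concreteness, the operators under discussion are usually defined as in node2vec, along these lines (function names are illustrative):

    import numpy as np

    def operator_l1(u, v):
        return np.abs(u - v)    # absolute (L1) distance, element-wise

    def operator_l2(u, v):
        return (u - v) ** 2     # squared (L2) distance, element-wise

    def operator_hadamard(u, v):
        return u * v            # element-wise product

    def operator_average(u, v):
        return (u + v) / 2.0    # element-wise mean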

@kjun9 (Contributor Author) replied:

Looks like it would be l1 and l2 - I'll switch to just using both of those unconditionally

@kjun9 (Contributor Author) added:

The unconditional switch ended up looking a lot more verbose, and I wasn't such a fan of introducing even more helper functions in the notebook... so I ended up just going with l2. The scores end up looking pretty much identical for l1 and l2 anyway.

@kjun9 (Contributor Author) commented Mar 11, 2020

> It'd be good to land this with CI checking it, rather than having to add another FIXME. I'd be happy to see the dataset added here or in a separate PR (that lands first).

@huonw I've opened #1058 if you're interested in reviewing!

@@ -0,0 +1,680 @@
{
A contributor commented (via ReviewNB):

Interestingly, the AUCs for the temporal and static methods are almost the same, beyond both being better than the paper's results.



@kjun9 (Contributor Author) replied:

Yeah, I tried for a while to figure out whether the way I've set it up is cheating in some way, but eventually decided to move forward... so let me know if you spot something!

@kjun9 kjun9 requested review from huonw and habiba-h March 12, 2020 01:12
" return pos, neg\n",
"\n",
"\n",
"def sample_negative_examples(g, positive_examples):\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote would be to use our code (i.e. EdgeSplitter) if we can, I think for this code it could be done by creating a graph from edges_other and then using EdgeSplitter. However, maybe that's awkward and not worth it...

Oh, and maybe it's not correct to use EdgeSplitter: it seems like this code considers everything that's outside pos a negative example, i.e. neg might include some edges that are in pos_test. Is this on purpose? Does this match what the paper does, or is this a point where it's too vague to tell? If it doesn't match (whether we know it or not), I wonder if this might explain the dramatic improvement we're seeing over the paper's numbers.

"\n",
" plt.figure(figsize=(7, 7))\n",
" plt.title(title)\n",
" if y is not None:\n",
@huonw (Member) commented Mar 12, 2020:

The colour choice here is a bit suboptimal, because red/green colour-blindness is the most common kind. Based on https://www.color-blindness.com/coblis-color-blindness-simulator/ it seems that this particular choice of red and green is worst for protanopia and deuteranopia, which apparently affect 1% of males (and up to 8% of males of northwestern European descent: https://en.wikipedia.org/wiki/Color_blindness#Classification ), e.g. the two classes are only just distinguishable:

[image: simulated colour-blind view of the red/green plot]

One way to solve this would be to choose something other than r and g, but I personally think a better way would be to simplify the code, e.g. replace these if and else statements with:

alpha = 0.7 if y is None else 0.5
plt.scatter(x_t[:, 0], x_t[:, 1], c=y, cmap="jet", alpha=alpha)

[image: the same plot using the jet colormap]

jet is a poor general-purpose colormap, but it's the most common one we use in other demos (i.e. if we're going to change it, we can change it everywhere), and it's ~ok for these binary/categorical plots (it fails for achromatopsia, though). https://matplotlib.org/tutorials/colors/colormaps.html has more options, but using many of the best ones (e.g. viridis) directly ends up with (translucent) yellow-on-white as one of the colours, which is bad for everyone.

@kjun9 (Contributor Author) replied:

Ah, thanks for providing those links! jet looks very monochrome-y to me... maybe I could use the colours used in the plot_history util?

    colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
    color_train = colors[0]
    color_validation = colors[1]

@huonw (Member) replied:

That's an option, but I don't know of a good way to specify that conveniently, without a chunk of manual code. I personally think the priority here should be obvious code, meaning we shouldn't have too much user-facing tweaking code, rather than perfect "report ready" output.

If the manual code isn't too bad, or there's another way to get better colours (like using a non-jet colormap), I'm all for it, though.

@kjun9 (Contributor Author) replied:

I've just gone with jet for now, but it also means a legend is more necessary, since the colours don't match the general expectation of green being true and red being false, so it does end up with slightly more code.
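For illustration, the extra legend code can stay small; a sketch assuming the scatter call suggested earlier (the labels are illustrative):

    import matplotlib.pyplot as plt

    scatter = plt.scatter(x_t[:, 0], x_t[:, 1], c=y, cmap="jet", alpha=0.5)
    # legend_elements() builds one handle per distinct value of `c`
    handles, _ = scatter.legend_elements()
    plt.legend(handles, ["negative", "positive"])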

@kjun9 kjun9 requested a review from huonw March 18, 2020 02:17
@habiba-h (Contributor) left a comment:

Looks good! Just a few things that stood out while reading the notebook.

Maybe this portion,

> It's good to verify that the data structures we've created so far from the raw data look reasonable. StellarGraph.info is a useful method for inspecting the graph we've created

(along with the accompanying code) can come after creating the StellarGraph object above, followed by the positive and negative links methods and results.

> And we can also check that the number of link examples correspond to the train/test subset values we defined earlier.

> For temporal walks, this is the maximum length of a walk, since walks may end up being shorter when there are not enough valid temporal edges to traverse.

"valid temporal edges" sounds vague to me; an alternative way of saying it would be "not enough time-respecting edges to traverse".

> indicate that they are uninformative."

The opening quotation mark is missing, and the closing one should come after the full stop.

For the positive and negative link classifier, it would have been visually appealing if the positive link embeddings were more compact, while the negative ones can be all over the place since they are randomly generated.

The demo has come together quite nicely!

@@ -0,0 +1,686 @@
{
@huonw (Member) commented (via ReviewNB):

Now that #1033 is merged, this can potentially be simplified slightly, although maybe we want to be sure we've got all the nodes by using the main graph? (Also, the source_column and target_column parameters are being set to their defaults, and so don't seem necessary?)

graph = StellarGraph(edges=edges_graph, edge_weight_column="time")


@kjun9 (Contributor Author) replied:

Hmm, it seems like the reduced graph actually has 2 fewer nodes than the full graph:

Number of nodes in full graph: 151
Number of nodes in training graph: 149

This is handled already for temporal walks by the function that grabs the embedding with try: ... except KeyError, but I'd have to add the same logic for static walks too. Alternatively:

  • keep passing in nodes=pd.DataFrame(index=full_graph.nodes()) when constructing the graph (sketched below)
  • update the dataset loader to make sure every node still exists in the reduced graph
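A sketch of the first option (assuming the notebook's full_graph and edges_graph objects):

    import pandas as pd
    from stellargraph import StellarGraph

    # Pass the full node set explicitly so every node exists in the reduced
    # graph, even if none of its edges fall in the first 75%.
    graph = StellarGraph(
        nodes=pd.DataFrame(index=full_graph.nodes()),
        edges=edges_graph,
        edge_weight_column="time",
    )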

A contributor asked:

@kjun9 what's the reduced graph? I might have missed something.

@kjun9 (Contributor Author) replied:

The reduced graph comes from taking the first 75% of edges from the full dataset as the "graph", and using the remaining 25% of edges for training and test examples.
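In code, that split reads roughly like this (a sketch; edges stands for the full edge DataFrame with a "time" column):

    # Sort edges by time and split 75% / 25%, as in the paper's setup.
    edges_sorted = edges.sort_values("time")
    num_graph_edges = int(len(edges_sorted) * 0.75)
    edges_graph = edges_sorted.iloc[:num_graph_edges]  # representation learning
    edges_other = edges_sorted.iloc[num_graph_edges:]  # positive link examples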

A contributor asked:

Is this how they are doing it in the paper?

A contributor added:

But yeah, it makes sense that some nodes might be left out by the edge splitting.

@kjun9 (Contributor Author) replied:

Yep, from the paper:

> To generate a set of labeled examples for link prediction, we first sort the edges in each graph by time (ascending) and use the first 75% for representation learning. The remaining 25% are considered as positive links and we sample an equal number of negative edges randomly

A contributor replied:

Another detail that the paper brushed under the rug... Your alternative sounds more proper to me, with a note in the demo that the paper's experimental section doesn't explain how they handle the missing nodes after the split, and that this is how you are taking care of it.

@huonw (Member) replied:

> keep passing in nodes=pd.DataFrame(index=full_graph.nodes()) when constructing the graph

Given that reducing the code would likely break it, going with this option and keeping it as is seems fine to me.

@kjun9 kjun9 requested a review from huonw March 19, 2020 00:54
@kjun9 (Contributor Author) commented Mar 19, 2020

Thanks everyone for the reviews! I'll merge once CI passes, and I'll work on removing the experimental tag separately.
