
Add CTDNE (time-based random walk) demo #1056

Merged: 17 commits, merged Mar 19, 2020

Conversation

@kjun9 (Contributor) commented Mar 10, 2020

See #827

This is running on CI with reduced numbers for:

walk_length: 5  # random walk length
cw_size: 5  # context window size

@habiba-h I kept your approach of comparing biased random walks to temporal walks - there's barely any difference in the performance though 🤷‍♂

Also added the link prediction utilities directly in the demo - it ended up looking fine I think, so #1020 is closed
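For readers skimming the thread, here is a minimal sketch of the workflow under review, assuming StellarGraph's TemporalRandomWalk API and gensim's Word2Vec; the parameter values and the edges_graph DataFrame name are illustrative, and the notebook itself is the source of truth:

    from stellargraph import StellarGraph
    from stellargraph.data import TemporalRandomWalk
    from gensim.models import Word2Vec

    # edges_graph: a DataFrame with "source", "target" and "time" columns
    # (illustrative name); edge weights are interpreted as timestamps.
    graph = StellarGraph(edges=edges_graph, edge_weight_column="time")

    temporal_rw = TemporalRandomWalk(graph)
    temporal_walks = temporal_rw.run(
        num_cw=1000,        # total context windows to collect (illustrative)
        cw_size=5,          # context window size, as in the CI config above
        max_walk_length=5,  # walk_length above; temporal walks may end early
        walk_bias="exponential",
    )

    # Skip-gram embeddings over the walks; gensim < 4.0 uses `size`,
    # gensim >= 4.0 renames it to `vector_size`.
    temporal_model = Word2Vec(
        [[str(n) for n in walk] for walk in temporal_walks],
        size=128, window=5, min_count=0, sg=1,
    )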

@review-notebook-app bot: Check out this pull request on ReviewNB to see the Jupyter notebook diff and discuss changes.

@codeclimate bot commented Mar 10, 2020: Code Climate has analyzed commit 22da378 and detected 0 issues on this pull request.

@kjun9 kjun9 changed the title Add temporal walk demo Add CTDNE (time-based random walk) demo Mar 10, 2020
@kjun9 kjun9 marked this pull request as ready for review March 10, 2020 23:55
@huonw (Member) commented Mar 10, 2020

> I can do that here or in a separate PR depending on which is preferred for reviewers, but I think the notebook can be reviewed for its general content/workflow regardless

It'd be good to land this with CI checking it, rather than having to add another FIXME. I'd be happy to see the dataset added here or in a separate PR (that lands first).

@huonw (Member) left a comment:

Nice.

Given that the results of static and dynamic are so close, do you have a sense of the variance of metrics like the ROC AUC? As in, does it seem like temporal walks are truly outperforming static ones?

" return pos, neg\n",
"\n",
"\n",
"def sample_negative_examples(g, positive_examples):\n",
@huonw (Member) commented Mar 11, 2020:

I'm sure you realise this, but it's lucky this is a graph with only 151 nodes, since this function is O(n^2) 😄

@kjun9 (Contributor Author) replied:

I guess an alternative would be to just do the random sampling in a while loop, checking each time that the sampled pair doesn't already exist in the graph - do you think that would be better?

Looks like our EdgeSplitter._sample_negative_examples_global has a slightly different approach too, but I thought it was simple/clear enough to just do it in the notebook instead.
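A minimal sketch of the while-loop alternative mentioned above (a hypothetical helper, not the notebook's code; it assumes hashable, orderable node IDs and an undirected graph):

    import random

    def sample_negative_examples_rejection(g, positive_examples, seed=0):
        """Draw random node pairs, keeping only those that aren't edges in g."""
        rng = random.Random(seed)
        nodes = list(g.nodes())
        existing = {tuple(sorted(edge)) for edge in g.edges()}
        neg = set()
        # expected O(len(positive_examples)) iterations on a sparse graph
        while len(neg) < len(positive_examples):
            u, v = rng.sample(nodes, 2)
            pair = tuple(sorted((u, v)))
            if pair not in existing:
                neg.add(pair)
        return list(neg)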

@huonw (Member) replied:

My vote would be to use our code (i.e. EdgeSplitter) if we can; I think for this code it could be done by creating a graph from edges_other and then using EdgeSplitter. However, maybe that's awkward and not worth it...

Oh, and maybe it's not correct to use EdgeSplitter: it seems like this code considers everything that's outside pos a negative example, i.e. neg might include some edges that are in pos_test. Is this on purpose? Does this match what the paper does, or is this a point where it's too vague to tell? If it doesn't match (whether we know it or not), I wonder if this might explain the dramatic improvement we're seeing over the paper's numbers.
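For reference, the EdgeSplitter usage being suggested would look roughly like this (a sketch only; edges_other and the p value are illustrative, and as noted above it may not match the paper's sampling exactly):

    from stellargraph import StellarGraph
    from stellargraph.data import EdgeSplitter

    # Build a graph from the held-out edges, then let EdgeSplitter sample
    # an equal number of negative (non-edge) examples globally.
    held_out_graph = StellarGraph(edges=edges_other, edge_weight_column="time")
    splitter = EdgeSplitter(held_out_graph)
    reduced_graph, example_ids, example_labels = splitter.train_test_split(
        p=0.25, method="global"
    )
    # example_ids: node pairs; example_labels: 1 for positive, 0 for negative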

@kjun9 (Contributor Author) replied:

In section 3 of the paper:

> The remaining 25% are considered as positive links and we sample an equal number of negative edges randomly

is the only reference to how this is done. I'd expect the scores to improve further if negative edges were forced not to overlap with any of the positives at all, so in this case I'm not sure it explains the improvement in our code.

"source": [
"temporal_node_embeddings = temporal_model.wv.vectors\n",
"static_node_embeddings = static_model.wv.vectors\n",
"plot_tsne(\"TSNE visualisation of temporal node embeddings\", temporal_node_embeddings)\n",
@huonw (Member) commented:

Are there no labels we can use for these to make the plots coloured, and thus more interpretable (rather than just "oh, yeah, they seem grouped")?

@kjun9 (Contributor Author) replied:

Yeah, I guess I could omit this part; I found it slightly useful to see whether the embeddings "looked informative" on the surface while I was working on it.

@huonw (Member) replied:

I think it's reasonable to include since, as you say, it gives a hint about whether the embeddings are picking up signal in some form. I'm just wondering if there's a way we can augment them further... it seems not easily, from what's in the dataset, though.

@kjun9 (Contributor Author) replied:

Theoretically, the temporal embeddings might cluster together nodes that have edges with similar time values, and separate those with very different time values. I can try to assign colours to nodes based on something like the average time across each node's incident edges, but there's no guarantee it will actually look meaningful.
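A sketch of that colouring idea (hypothetical code, not necessarily what landed in the notebook; it assumes edges_graph has "source"/"target"/"time" columns and that node_ids lists nodes in the same order as the embedding rows):

    import numpy as np
    import pandas as pd

    # Average the "time" value over each node's incident edges.
    times_by_node = pd.concat(
        [
            edges_graph[["source", "time"]].rename(columns={"source": "node"}),
            edges_graph[["target", "time"]].rename(columns={"target": "node"}),
        ]
    )
    mean_time = times_by_node.groupby("node")["time"].mean()

    # One colour value per embedding row, to pass as `c=` to plt.scatter.
    y = np.array([mean_time.get(n) for n in node_ids])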

@huonw (Member) replied:

Ah, yeah. Interesting idea, although maybe too much code to be worth it, unless you think it is going to be enough of an improvement?

@kjun9 (Contributor Author) replied:

The code wasn't too much to add in the end, but I'm not sure if the visual is all that convincing:
[image: t-SNE plot coloured by average edge time]

@huonw (Member) commented Mar 12, 2020:

Yeah, I'm not particularly convinced. It kinda seems like similar-age things (assuming that's what the colors mean) are being clustered better, but it's hard to say for sure.

"* Node2Vec: 0.759 \n",
"* CTDNE: 0.777\n",
"\n",
"Below are a set of helper functions we can use to train and evaluate a link prediction classifier. The rest of the notebook will use the binary operator defined in the cell below. Available operators in the demo are:\n",
@huonw (Member) commented:

I'm not personally a huge fan of adding unused code to notebooks: it's easy for the code to just be a distraction from the teaching, and it's not tested and so may break. We could delete it, but that's a bit less nice than exploring multiple styles, especially if they give very different results.

Do you think we could at least run the temporal embeddings ROC AUC evaluation with all of them, but maybe still use only L2 for static and the t-SNE plot?

@kjun9 (Contributor Author) replied:

Some of the operators ended up causing convergence warnings while training the classifier (although the scores still looked fine) so that's partly why I didn't go through all of them - should I just remove them?

@huonw (Member) replied:

Can you reduce it to only two that converge quickly? If we do this, running them both unconditionally for temporal random walks and mentioning other possibilities in the text (like "any function that combines the features can work, such as the absolute (L1) distance, the average, or the product (Hadamard)") might be a good balance between being generic (demonstrating how to use any function) and not being an overload.
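For concreteness, the operators under discussion are usually defined as in node2vec, along these lines (function names are illustrative):

    import numpy as np

    def operator_l1(u, v):
        return np.abs(u - v)    # absolute (L1) distance, element-wise

    def operator_l2(u, v):
        return (u - v) ** 2     # squared (L2) distance, element-wise

    def operator_hadamard(u, v):
        return u * v            # element-wise product

    def operator_average(u, v):
        return (u + v) / 2.0    # element-wise mean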

@kjun9 (Contributor Author) replied:

Looks like it would be l1 and l2 - I'll switch to just using both of those unconditionally

@kjun9 (Contributor Author) added:

The unconditional switch ended up looking a lot more verbose, and I wasn't such a fan of introducing even more helper functions in the notebook... so I ended up just going with l2. The scores end up looking pretty much identical for l1 and l2 anyway.

@kjun9 (Contributor Author) commented Mar 11, 2020

> It'd be good to land this with CI checking it, rather than having to add another FIXME. I'd be happy to see the dataset added here or in a separate PR (that lands first).

@huonw I've opened #1058 if you're interested in reviewing!

@@ -0,0 +1,680 @@
{
A contributor commented (via ReviewNB):

Interestingly, the AUCs for the temporal and static methods are almost the same, beyond both being better than the paper's results.



@kjun9 (Contributor Author) replied:

Yeah, I tried for a while to figure out whether the way I've set it up is cheating in some way, but eventually decided to move forward... so let me know if you spot something!

@kjun9 kjun9 requested review from huonw and habiba-h March 12, 2020 01:12
" return pos, neg\n",
"\n",
"\n",
"def sample_negative_examples(g, positive_examples):\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote would be to use our code (i.e. EdgeSplitter) if we can, I think for this code it could be done by creating a graph from edges_other and then using EdgeSplitter. However, maybe that's awkward and not worth it...

Oh, and maybe it's not correct to use EdgeSplitter: it seems like this code considers everything that's outside pos a negative example, i.e. neg might include some edges that are in pos_test. Is this on purpose? Does this match what the paper does, or is this a point where it's too vague to tell? If it doesn't match (whether we know it or not), I wonder if this might explain the dramatic improvement we're seeing over the paper's numbers.

"\n",
" plt.figure(figsize=(7, 7))\n",
" plt.title(title)\n",
" if y is not None:\n",
@huonw (Member) commented Mar 12, 2020:

The colour choice here is a bit suboptimal, because red/green colour-blindness is the most common kind. Based on https://www.color-blindness.com/coblis-color-blindness-simulator/ it seems that this particular choice of red and green is worst for protanopia and deuteranopia, which apparently affect 1% of males (and up to 8% of males of northwestern European descent: https://en.wikipedia.org/wiki/Color_blindness#Classification ), e.g. the two classes are only just distinguishable:

[image: simulated colour-blind view of the red/green plot]

One way to solve this would be to choose something other than r and g, but I personally think a better way would be to simplify the code, e.g. replace these if and else statements with:

alpha = 0.7 if y is None else 0.5
plt.scatter(x_t[:, 0], x_t[:, 1], c=y, cmap="jet", alpha=alpha)

[image: the same plot using the jet colormap]

jet is a poor general-purpose colormap, but it's the most common one we use in other demos (i.e. if we're going to change it, we can change it everywhere), and it's ~ok for these binary/categorical plots (it fails for achromatopsia, though). https://matplotlib.org/tutorials/colors/colormaps.html has more options, but using many of the best ones (e.g. viridis) directly ends up with (translucent) yellow-on-white as one of the colours, which is bad for everyone.

@kjun9 (Contributor Author) replied:

Ah, thanks for providing those links! jet looks very monochrome-y to me... maybe I could use the colours used in the plot_history util?

    colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
    color_train = colors[0]
    color_validation = colors[1]

@huonw (Member) replied:

That's an option, but I don't know of a good way to specify that conveniently, without a chunk of manual code. I personally think the priority here should be obvious code, meaning we shouldn't have too much user-facing tweaking code, rather than perfect "report ready" output.

If the manual code isn't too bad, or there's another way to get better colours (like using a non-jet colormap), I'm all for it, though.

@kjun9 (Contributor Author) replied:

I've just gone with jet for now, but it also means a legend is more necessary, since the colours don't match the general expectation of green being true and red being false, so it does end up with slightly more code.
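For illustration, the extra legend code can stay small; a sketch assuming the scatter call suggested earlier (the labels are illustrative):

    import matplotlib.pyplot as plt

    scatter = plt.scatter(x_t[:, 0], x_t[:, 1], c=y, cmap="jet", alpha=0.5)
    # legend_elements() builds one handle per distinct value of `c`
    handles, _ = scatter.legend_elements()
    plt.legend(handles, ["negative", "positive"])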

@kjun9 kjun9 requested a review from huonw March 18, 2020 02:17
@habiba-h (Contributor) left a comment:

Looks good! Just a few things that stood out while reading the notebook.

Maybe this portion,

> It's good to verify that the data structures we've created so far from the raw data look reasonable. StellarGraph.info is a useful method for inspecting the graph we've created

(along with the accompanying code) can come after creating the StellarGraph object above, followed by the positive and negative links methods and results.

> And we can also check that the number of link examples correspond to the train/test subset values we defined earlier.

> For temporal walks, this is the maximum length of a walk, since walks may end up being shorter when there are not enough valid temporal edges to traverse.

"valid temporal edges" sounds vague to me; an alternative way of saying it would be "not enough time-respecting edges to traverse".

> indicate that they are uninformative."

The opening quotation mark is missing, and the closing one should come after the full stop.

For the positive and negative link classifier, it would have been visually appealing if the positive link embeddings were more compact, while the negative ones can be all over the place since they are randomly generated.

The demo has come together quite nicely!

@@ -0,0 +1,686 @@
{
@huonw (Member) commented (via ReviewNB):

Now that #1033 is merged, this can potentially be simplified slightly, although maybe we want to be sure we've got all the nodes by using the main graph? (Also, the source_column and target_column parameters are being set to their defaults, and so don't seem necessary?)

graph = StellarGraph(edges=edges_graph, edge_weight_column="time")


@kjun9 (Contributor Author) replied:

Hmm, it seems like the reduced graph actually has 2 fewer nodes than the full graph:

Number of nodes in full graph: 151
Number of nodes in training graph: 149

This is handled already for temporal walks by the function that grabs the embedding with try: ... except KeyError, but I'd have to add the same logic for static walks too. Alternatively:

  • keep passing in nodes=pd.DataFrame(index=full_graph.nodes()) when constructing the graph (sketched below)
  • update the dataset loader to make sure every node still exists in the reduced graph
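A sketch of the first option (assuming the notebook's full_graph and edges_graph objects):

    import pandas as pd
    from stellargraph import StellarGraph

    # Pass the full node set explicitly so every node exists in the reduced
    # graph, even if none of its edges fall in the first 75%.
    graph = StellarGraph(
        nodes=pd.DataFrame(index=full_graph.nodes()),
        edges=edges_graph,
        edge_weight_column="time",
    )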

A contributor asked:

@kjun9 what's the reduced graph? I might have missed something.

@kjun9 (Contributor Author) replied:

The reduced graph comes from taking the first 75% of edges from the full dataset as the "graph", and using the remaining 25% of edges for training and test examples.
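In code, that split reads roughly like this (a sketch; edges stands for the full edge DataFrame with a "time" column):

    # Sort edges by time and split 75% / 25%, as in the paper's setup.
    edges_sorted = edges.sort_values("time")
    num_graph_edges = int(len(edges_sorted) * 0.75)
    edges_graph = edges_sorted.iloc[:num_graph_edges]  # representation learning
    edges_other = edges_sorted.iloc[num_graph_edges:]  # positive link examples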

A contributor asked:

Is this how they are doing it in the paper?

A contributor added:

But yeah, it makes sense that some nodes might be left out by the edge splitting.

@kjun9 (Contributor Author) replied:

Yep, from the paper:

> To generate a set of labeled examples for link prediction, we first sort the edges in each graph by time (ascending) and use the first 75% for representation learning. The remaining 25% are considered as positive links and we sample an equal number of negative edges randomly

A contributor replied:

Another detail that the paper brushed under the rug... Your alternative sounds more proper to me, with a note in the demo that the paper's experimental section doesn't explain how they handle the missing nodes after the split, and that this is how you are taking care of it.

@huonw (Member) replied:

> keep passing in nodes=pd.DataFrame(index=full_graph.nodes()) when constructing the graph

Given that reducing the code would likely break it, going with this option and keeping it as is seems fine to me.

@kjun9 kjun9 requested a review from huonw March 19, 2020 00:54
@kjun9 (Contributor Author) commented Mar 19, 2020

Thanks everyone for the reviews! I'll merge once CI passes, and I'll work on removing the experimental tag separately.
