Add a notebook that reproduces the results from the ComplEx paper #901
Conversation
Check out this pull request on ReviewNB. You'll be able to see Jupyter notebook diffs and discuss changes. Powered by ReviewNB.
Code Climate has analyzed commit f4a7c1b and detected 3 issues on this pull request. Here's the issue category breakdown:
View more on Code Climate.
Codecov Report
@@            Coverage Diff            @@
##           develop    #901     +/-  ##
========================================
- Coverage     85.3%   84.8%   -0.5%
========================================
  Files           51      51
  Lines         5189    5080    -109
========================================
- Hits          4427    4310    -117
- Misses         762     770      +8
Continue to review full report at Codecov.
Force-pushed from 277a294 to 1d5b556
The bulk operations for speeeeed are nice! I have a couple of initial small comments but I'll review the notebooks and tests more thoroughly soon.
Just to be clear, is this meant to remove the experimental status from ComplEx or are you waiting to reproduce the filtered metrics?
Yeah, I think it'd be better to reproduce the filtered metrics before calling this done, because they're what almost all papers use. I marked #1060 as
num_nodes = known_edges_graph.number_of_nodes()

def ranks(pred, true_ilocs, true_is_source):
I think this function could be pulled out of ComplEx and re-used for all our other knowledge graph algorithms. But this function is likely the main experimental blocker, so I'm happy for this to be done as part of #1060.
Similar to #865, I'd prefer to not start abstracting before we're more sure about what's useful, but yes, this is definitely likely to be shareable in some form.
]
},
{
"cell_type": "code",
Minor optional thing, but the mis-alignment of the tables below is kinda jarring. Maybe you could make and print out a DataFrame with the paper's results, e.g.

import pandas as pd

pd.DataFrame({
    "mrr": [0.587],
    "filtered mrr": [0.941],
    "hits at 1": [0.936],
    "hits at 3": [0.945],
    "hits at 10": [0.947],
})
"source": [ | ||
"For comparison, Table 2 in the paper gives the following results for WN18:\n", | ||
"\n", | ||
"| raw MRR | filtered MRR | hits at 1 | hits at 3 | hits at 10 |\n", |
I think we should mention that these are filtered hits
"| raw MRR | filtered MRR | hits at 1 | hits at 3 | hits at 10 |\n", | |
"| raw MRR | filtered MRR | filtered hits at 1 | filtered hits at 3 | filtered hits at 10 |\n", |
I've combined this with the suggestion of including these results as a dataframe, so that it has the raw and filtered numbers on separate rows.
Notebook looks good and it is great to see the results from the paper confirmed. I only have a couple of minor suggestions to make.

I think using early stopping with low patience, e.g. 10, and increasing the number of training epochs to something like 200 would be better than assuming 20 epochs are sufficient.

When you create the generators, you set the batch size as follows. I would rename the variable.

You should be able to calculate the filtered metrics by first calculating the raw metrics, then counting the number of known triplets (those in the data) that rank higher than the test triplet, and subtracting that count from the test triplet's rank.
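A minimal NumPy sketch of that filtering trick (the function and argument names here are hypothetical, not the notebook's code):

```python
import numpy as np

def filtered_rank(scores, true_iloc, known_ilocs):
    # `scores`: the model's score for every candidate node slotted into the
    # test triplet; `true_iloc`: index of the true node; `known_ilocs`:
    # indices of nodes that form known (train/validation/test) triplets,
    # excluding the test triplet itself.
    true_score = scores[true_iloc]
    raw_rank = 1 + np.sum(scores > true_score)
    # known triplets that outrank the test triplet don't count when filtering
    known_better = np.sum(scores[known_ilocs] > true_score)
    return raw_rank - known_better
```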
@PantelisElinas Is there a reason you didn't comment on the file itself either on the JSON or on ReviewNB? Is there a process/tooling thing that can be optimised?
👍
Per the comment on that line, that is not what that code is doing, and if we wanted 100 elements per batch, of course I could hardcode the size based on the known size of the data (e.g.
👍
Unfortunately, per the PR description, I tried various ways to compute it and none of them were working correctly, so restating a definition doesn't lend much light. @kieranricardo did some investigations and wrote some code that did seem to be working better (see #901 (comment) above), so maybe I just had bugs in all of the versions I'd tried.
Looks good! Just need to increase parallelism (or maybe merge with develop) to pass CI.
I don't have ReviewNB set up, so I didn't use it. I misunderstood what you were trying to do with the batch size, so it is fine as it is then.
This adds the final piece of the evaluation required to make `ComplEx` non-`@experimental`. It extends the ranking procedure performed in #901 to also compute the "filtered" ranks. This gives the rest of the metrics in Table 2 of the ComplEx paper (http://jmlr.org/proceedings/papers/v48/trouillon16.pdf).

As a reminder from #901, the knowledge graph link prediction metrics for a test edge `E = (s, r, o)` connecting nodes `s` and `o` are calculated by ranking the prediction for that edge against all modified-source `E' = (n, r, o)` and modified-object `E'' = (s, r, n)` edges (for all nodes `n` in the graph). The "raw" ranks are just the rank of `E` against the `E'` and against the `E''`. The "filtered" ranks exclude the modified edges `E'` and `E''` that are known, i.e. are in the train, validation or test sets.

For instance, if `E = (A, x, B)` has score 1, but the modified edges `(A, x, C)` and `(A, x, D)` have scores 1.3 and 1.5 respectively, `E` has raw modified-object rank 3. If `(A, x, D)` is in the train set (or validation or test) but `(A, x, C)` is not, it is excluded from the filtered ranking, and so `E` has filtered modified-object rank 2.

This has been a struggle to implement correctly, because it has been difficult to use the right nodes in the right places of the ranking procedure. For modified-object ranking, with `E` and `E''` as above, calculating the score of `E` in the column of scores of every modified-object edge `E''` needs to use `o`, but calculating the known edges similar to `E` needs to use `(s, r, _)`, not `(o, r, _)` (the latter is meaningless). (And similarly for modified-subject ranking.) It sounds obvious when written out like this, but it's somewhat difficult to keep track of which entity needs to go where in practice. (@kieranricardo had this key insight.)

The implementation works by starting with the raw `greater` matrix, where each column represents a test edge `E`, with a row for every node in the graph (i.e. row `n` represents swapping node `n` into the subject or object), and the elements of a column are `True` if the score of that modified edge is greater than the score of `E`. For each edge/column, compute the indices of the similar known edges and set those indices to `False`, leaving only unknown edges with scores greater than `E`.

See: #1060
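A rough sketch of that procedure in NumPy (illustrative only; the array names and shapes are assumptions, not the library's API):

```python
import numpy as np

def raw_and_filtered_ranks(all_scores, true_scores, known_ilocs_per_edge):
    # all_scores: shape (num_nodes, num_test_edges); column j holds the score
    # of every node swapped into test edge j. true_scores: shape
    # (num_test_edges,), the scores of the unmodified test edges.
    # known_ilocs_per_edge[j]: indices of nodes that give a known
    # (train/validation/test) edge for column j.
    greater = all_scores > true_scores[np.newaxis, :]
    raw = 1 + greater.sum(axis=0)
    # filtering: a known modified edge shouldn't count against the test edge
    for j, known in enumerate(known_ilocs_per_edge):
        greater[known, j] = False
    filtered = 1 + greater.sum(axis=0)
    return raw, filtered
```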
This adds a notebook that reproduces the results from the ComplEx paper (http://jmlr.org/proceedings/papers/v48/trouillon16.pdf), Table 2 in particular:
In particular, the notebook does an evaluation of the "raw" ranks of an edge `(s, r, o)` (source, relation type, object) by comparing the model's predicted score for that edge against the scores for all "mutated-source" versions of that link `(n, r, o)` (for every node `n`), and for all "mutated-object" versions `(s, r, n)`. It does this on the WN18 and FB15k datasets.

The notebook is phrased as a bit of a tutorial for WN18, but omits the description for FB15k, since the code is otherwise identical. I decided against defining a function to do all the work for pedagogical reasons: it's easier to interleave the discussion of the code (and the results) if it's not all in a single big function that is used for both WN18 and FB15k. (This somewhat relates to #910; this notebook is both a how-to and a paper reproduction.)

Why only raw ranks? The paper (like most knowledge graph papers) focuses more on the "filtered" ranks, which rank a link `(s, r, o)` against only the mutated-source and mutated-object instances that aren't known edges (that is, that don't appear in the train, test or validation set). However, as yet, I've been unable to reproduce the large jump in performance one sees when comparing only these edges, despite trying (and manually validating) several methods of ignoring the known edges. (Follow-up issue: #1060)

Why so much custom ranking-computing code? The naive approach to implementing this validation would be to create DataFrames storing all of the mutated-source and mutated-object links and running `model.predict` to get a score for each of the mutated edges. This would mean 2 (mutating source and object, separately) × 5000 (number of edges in the test set) = 10000 `predict` calls, each of which is on 40943 triples (the total number of nodes, to slot into the mutated source or object column). This is inefficient, and it's much better to work with the underlying embedding matrices directly, to phrase computing scores against all nodes as bulk operations on simple matrices (rather than having to go through embedding layers, etc.). This is implemented in the `ComplEx.rank_edges_against_all_nodes` function, and the optimised form is validated to match the naive/manual form in the unit tests.

Why work with WN18 and FB15k, since those datasets have test-set leakage of inverse relations? It's what the paper uses, and so for validating our results against the authors', we should match.
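As a sketch of the bulk-operations idea (a simplified illustration, not the actual `rank_edges_against_all_nodes` implementation): the ComplEx score of `(s, r, o)` is `Re(sum_k e_s[k] * w_r[k] * conj(e_o[k]))`, so the scores of all mutated-object edges for a single test edge reduce to one matrix-vector product with the full node embedding matrix.

```python
import numpy as np

def mutated_object_scores(node_emb, rel_emb, s_iloc, r_iloc):
    # node_emb: complex array of shape (num_nodes, k) of node embeddings;
    # rel_emb: complex array of shape (num_rel_types, k) of relation
    # embeddings. Returns the ComplEx score of (s, r, n) for every node n,
    # computed as a single bulk operation instead of per-edge predict calls.
    sr = node_emb[s_iloc] * rel_emb[r_iloc]   # elementwise product, shape (k,)
    return np.real(np.conj(node_emb) @ sr)    # shape (num_nodes,)
```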
See: #862