Add a notebook that reproduces the results from the ComplEx paper #901
Conversation
Check out this pull request on ReviewNB. You'll be able to see Jupyter notebook diffs and discuss changes. Powered by ReviewNB.
Code Climate has analyzed commit f4a7c1b and detected 3 issues on this pull request. Here's the issue category breakdown:
View more on Code Climate.
Codecov Report
@@            Coverage Diff            @@
##           develop    #901     +/-  ##
========================================
- Coverage     85.3%   84.8%   -0.5%
========================================
  Files           51      51
  Lines         5189    5080    -109
========================================
- Hits          4427    4310    -117
- Misses         762     770      +8
Continue to review full report at Codecov.
Force-pushed from 277a294 to 1d5b556
The bulk operations for speeeeed are nice! I have a couple of initial small comments but I'll review the notebooks and tests more thoroughly soon.
Just to be clear, is this meant to remove the experimental status from ComplEx or are you waiting to reproduce the filtered metrics?
Yeah, I think it'd be better to reproduce the filtered metrics before calling this done, because they're what almost all papers use. I marked #1060 as
num_nodes = known_edges_graph.number_of_nodes()

def ranks(pred, true_ilocs, true_is_source):
I think this function could be pulled out of ComplEx and re-used for all our other knowledge graph algorithms. But this function is likely the main experimental blocker, so I'm happy for this to be done as part of #1060.
Similar to #865, I'd prefer to not start abstracting before we're more sure about what's useful, but yes, this is definitely likely to be shareable in some form.
]
},
{
"cell_type": "code",
Minor optional thing, but the mis-alignment of the tables below is kinda jarring. Maybe you could make and print out a DataFrame with the paper's results, e.g.

import pandas as pd

pd.DataFrame({
    "mrr": [0.587],
    "filtered mrr": [0.941],
    "hits at 1": [0.936],
    "hits at 3": [0.945],
    "hits at 10": [0.947],
})
"source": [ | ||
"For comparison, Table 2 in the paper gives the following results for WN18:\n", | ||
"\n", | ||
"| raw MRR | filtered MRR | hits at 1 | hits at 3 | hits at 10 |\n", |
I think we should mention that these are filtered hits
"| raw MRR | filtered MRR | hits at 1 | hits at 3 | hits at 10 |\n", | |
"| raw MRR | filtered MRR | filtered hits at 1 | filtered hits at 3 | filtered hits at 10 |\n", |
I've combined this with the suggestion of including these results as a dataframe, so that it has the raw and filtered numbers on separate rows.
Notebook looks good and it is great to see the results from the paper confirmed. I only have a couple of minor suggestions to make.

I think using early stopping with low patience, e.g. 10, and increasing the number of training epochs to something like 200 would be better than assuming 20 epochs are sufficient.

When you create the generators, you set the batch size as follows. I would rename the variable.

You should be able to calculate the filtered metrics by first calculating the raw metrics, then counting the number of known triplets (those in the data) that rank higher than the test triplet, and subtracting that count from the test triplet's rank.
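A minimal NumPy sketch of that filtering trick (the function and argument names here are hypothetical, not the notebook's code):

```python
import numpy as np

def filtered_rank(scores, true_iloc, known_ilocs):
    # `scores`: the model's score for every candidate node slotted into the
    # test triplet; `true_iloc`: index of the true node; `known_ilocs`:
    # indices of nodes that form known (train/validation/test) triplets,
    # excluding the test triplet itself.
    true_score = scores[true_iloc]
    raw_rank = 1 + np.sum(scores > true_score)
    # known triplets that outrank the test triplet don't count when filtering
    known_better = np.sum(scores[known_ilocs] > true_score)
    return raw_rank - known_better
```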
@PantelisElinas Is there a reason you didn't comment on the file itself either on the JSON or on ReviewNB? Is there a process/tooling thing that can be optimised?
👍
Per the comment on that line, that is not what that code is doing, and if we wanted 100 elements per batch, of course I could hardcode the size based on the known size of the data (e.g.
👍
Unfortunately, per the PR description, I tried various ways to compute it and none of them were working correctly, so restating a definition doesn't lend much light. @kieranricardo did some investigations and wrote some code that did seem to be working better (see #901 (comment) above), so maybe I just had bugs in all of the versions I'd tried.
Looks good! Just need to increase parallelism (or maybe merge with develop) to pass CI.
I don't have ReviewNB set up, so I didn't use it. I misunderstood what you were trying to do with the batch size, so it is fine as it is then.
This adds the final piece of the evaluation required to make `ComplEx` non-`@experimental`. It extends the ranking procedure performed in #901 to also compute the "filtered" ranks. This gives the rest of the metrics in Table 2 of the ComplEx paper (http://jmlr.org/proceedings/papers/v48/trouillon16.pdf).

As a reminder from #901, the knowledge graph link prediction metrics for a test edge `E = (s, r, o)` connecting nodes `s` and `o` are calculated by ranking the prediction for that edge against all modified-source `E' = (n, r, o)` and modified-object `E'' = (s, r, n)` edges (for all nodes `n` in the graph). The "raw" ranks are just the rank of `E` against the `E'` and against the `E''`. The "filtered" ranks exclude the modified edges `E'` and `E''` that are known, i.e. are in the train, validation or test sets.

For instance, if `E = (A, x, B)` has score 1, but the modified edges `(A, x, C)` and `(A, x, D)` have scores 1.3 and 1.5 respectively, `E` has raw modified-object rank 3. If `(A, x, D)` is in the train set (or validation or test) but `(A, x, C)` is not, it is excluded from the filtered ranking, and so `E` has filtered modified-object rank 2.

This has been a struggle to implement correctly, because it has been difficult to use the right nodes in the right places of the ranking procedure. For modified-object ranking, with `E` and `E''` as above, calculating the score of `E` in the column of scores of every modified-object edge `E''` needs to use `o`, but calculating the known edges similar to `E` needs to use `(s, r, _)`, not `(o, r, _)` (the latter is meaningless). (And similarly for modified-subject ranking.) It sounds obvious when written out like this, but it's somewhat difficult to keep track of which entity needs to go where in practice. (@kieranricardo had this key insight.)

The implementation works by starting with the raw `greater` matrix, where each column represents a test edge `E`, with a row for every node in the graph (i.e. row `n` represents swapping node `n` into the subject or object), and the elements of a column are `True` if the score of that modified edge is greater than the score of `E`. For each edge/column, compute the indices of the similar known edges and set those indices to `False`, leaving only unknown edges with scores greater than `E`.

See: #1060
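A rough sketch of that procedure in NumPy (illustrative only; the array names and shapes are assumptions, not the library's API):

```python
import numpy as np

def raw_and_filtered_ranks(all_scores, true_scores, known_ilocs_per_edge):
    # all_scores: shape (num_nodes, num_test_edges); column j holds the score
    # of every node swapped into test edge j. true_scores: shape
    # (num_test_edges,), the scores of the unmodified test edges.
    # known_ilocs_per_edge[j]: indices of nodes that give a known
    # (train/validation/test) edge for column j.
    greater = all_scores > true_scores[np.newaxis, :]
    raw = 1 + greater.sum(axis=0)
    # filtering: a known modified edge shouldn't count against the test edge
    for j, known in enumerate(known_ilocs_per_edge):
        greater[known, j] = False
    filtered = 1 + greater.sum(axis=0)
    return raw, filtered
```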
This adds a notebook that reproduces the results from the ComplEx paper (http://jmlr.org/proceedings/papers/v48/trouillon16.pdf), Table 2 in particular:
In particular, the notebook does an evaluation of the "raw" ranks of an edge `(s, r, o)` (source, relation type, object) by comparing the model's predicted score for that edge against the scores for all "mutated-source" versions of that link `(n, r, o)` (for every node `n`), and for all "mutated-object" versions `(s, r, n)`. It does this on the WN18 and FB15k datasets.

The notebook is phrased as a bit of a tutorial for WN18, but omits the description for FB15k, since the code is otherwise identical. I decided against defining a function to do all the work for pedagogical reasons: it's easier to interleave the discussion of the code (and the results) if it's not all in a single big function that is used for both WN18 and FB15k. (This somewhat relates to #910; this notebook is both a how-to and a paper reproduction.)

Why only raw ranks? The paper (like most knowledge graph papers) focuses more on the "filtered" ranks, which rank a link `(s, r, o)` against only the mutated-source and mutated-object instances that aren't known edges (that is, that don't appear in the train, test or validation set). However, as yet, I've been unable to reproduce the large jump in performance one sees when comparing only these edges, despite trying (and manually validating) several methods of ignoring the known edges. (Follow-up issue: #1060)

Why so much custom ranking-computing code? The naive approach to implementing this validation would be to create DataFrames storing all of the mutated-source and mutated-object links and running `model.predict` to get a score for each of the mutated edges. This would mean 2 (mutating source and object, separately) × 5000 (number of edges in the test set) = 10000 `predict` calls, each of which is on 40943 triples (the total number of nodes, to slot into the mutated source or object column). This is inefficient, and it's much better to work with the underlying embedding matrices directly, to phrase computing scores against all nodes as bulk operations on simple matrices (rather than having to go through embedding layers, etc.). This is implemented in the `ComplEx.rank_edges_against_all_nodes` function, and the optimised form is validated to match the naive/manual form in the unit tests.

Why work with WN18 and FB15k, since those datasets have test-set leakage of inverse relations? It's what the paper uses, and so for validating our results against the authors', we should match.
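As a sketch of the bulk-operations idea (a simplified illustration, not the actual `rank_edges_against_all_nodes` implementation): the ComplEx score of `(s, r, o)` is `Re(sum_k e_s[k] * w_r[k] * conj(e_o[k]))`, so the scores of all mutated-object edges for a single test edge reduce to one matrix-vector product with the full node embedding matrix.

```python
import numpy as np

def mutated_object_scores(node_emb, rel_emb, s_iloc, r_iloc):
    # node_emb: complex array of shape (num_nodes, k) of node embeddings;
    # rel_emb: complex array of shape (num_rel_types, k) of relation
    # embeddings. Returns the ComplEx score of (s, r, n) for every node n,
    # computed as a single bulk operation instead of per-edge predict calls.
    sr = node_emb[s_iloc] * rel_emb[r_iloc]   # elementwise product, shape (k,)
    return np.real(np.conj(node_emb) @ sr)    # shape (num_nodes,)
```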
See: #862