Add a notebook that reproduces the results from the ComplEx paper #901

Merged
huonw merged 8 commits into develop from feature/862-complex-notebook on Mar 16, 2020

Conversation

@huonw (Member) commented Feb 20, 2020

This adds a notebook that reproduces the results from the ComplEx paper (http://jmlr.org/proceedings/papers/v48/trouillon16.pdf), in Table 2 in particular:

[image: Table 2 from the ComplEx paper]

Specifically, the notebook evaluates the "raw" ranks of an edge (s, r, o) (source, relation type, object) by comparing the model's predicted score for that edge against the scores of all "mutated-source" versions (n, r, o) of that link (for every node n), and of all "mutated-object" versions (s, r, n). It does this on the WN18 and FB15k datasets.
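For concreteness, a minimal sketch of what a "raw" rank means here, assuming the common convention of counting strictly higher-scoring candidates (the function and scores are purely illustrative, not code from the notebook):

```python
def raw_rank(true_score, candidate_scores):
    """Raw rank of the true edge among the mutated candidates:
    1 + the number of candidates that score strictly higher."""
    return 1 + sum(score > true_score for score in candidate_scores)

# e.g. if the true edge scores 1.0 and the mutated-object candidates score
# 1.3, 1.5, 0.2 and 0.9, the raw mutated-object rank is 3
raw_rank(1.0, [1.3, 1.5, 0.2, 0.9])  # -> 3
```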

The notebook is phrased as a bit of a tutorial for WN18, but omits the detailed discussion for FB15k, since the code is otherwise identical. I decided against defining a function to do all the work for pedagogical reasons: it's easier to interleave the discussion of the code (and the results) if it's not all in a single big function that is used for both WN18 and FB15k. (This somewhat relates to #910; this notebook is both a how-to and a paper reproduction.)

Why only raw ranks? The paper (like most knowledge graph papers) focuses more on the "filtered" ranks, which rank a link (s, r, o) against only the mutated-source and mutated-object candidates that aren't known edges (that is, that don't appear in the train, test or validation sets). However, as yet, I've been unable to reproduce the large jump in performance one sees when comparing against only these edges, despite trying (and manually validating) several methods of ignoring the known edges. (Follow-up issue: #1060)

Why so much custom ranking-computing code? The naive way to implement this validation would be to create DataFrames storing all of the mutated-source and mutated-object links and run model.predict to get a score for each mutated edge. This would mean 2 (mutating source and object, separately) × 5000 (number of edges in the test set) = 10000 predict calls, each on 40943 triples (the total number of nodes, slotted into the mutated source or object column). This is inefficient; it's much better to work with the underlying embedding matrices directly, phrasing the scoring against all nodes as bulk operations on plain matrices (rather than having to go through embedding layers, etc.). This is implemented in the ComplEx.rank_edges_against_all_nodes function, and the optimised form is validated against the naive/manual form in the unit tests.
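As an illustration of the bulk idea (not the actual ComplEx.rank_edges_against_all_nodes implementation; the embedding matrices and shapes below are made up), the ComplEx score Re(⟨e_s, w_r, conj(e_o)⟩) of a test edge can be compared against every candidate node with a single matrix product per direction:

```python
import numpy as np

# Hypothetical complex embedding matrices standing in for the trained weights.
rng = np.random.default_rng(0)
num_nodes, num_rel_types, k = 40943, 18, 200
node_emb = rng.normal(size=(num_nodes, k)) + 1j * rng.normal(size=(num_nodes, k))
rel_emb = rng.normal(size=(num_rel_types, k)) + 1j * rng.normal(size=(num_rel_types, k))

def raw_ranks(s, r, o):
    """Raw mutated-source and mutated-object ranks of the edge (s, r, o),
    scoring it against every node with one bulk operation per direction."""
    e_s, w_r, e_o = node_emb[s], rel_emb[r], node_emb[o]
    true_score = np.real(np.sum(e_s * w_r * np.conj(e_o)))
    # scores of (n, r, o) for every node n at once
    mutated_source_scores = np.real(node_emb @ (w_r * np.conj(e_o)))
    # scores of (s, r, n) for every node n at once
    mutated_object_scores = np.real((e_s * w_r) @ np.conj(node_emb).T)
    return (
        1 + int((mutated_source_scores > true_score).sum()),
        1 + int((mutated_object_scores > true_score).sum()),
    )
```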

Why work with WN18 and FB15k, since those datasets have test-set leakage of inverse relations? It's what the paper uses, and so to validate our results against the authors', we should match their setup.

See: #862

@codeclimate
codeclimate bot commented Feb 20, 2020

Code Climate has analyzed commit f4a7c1b and detected 3 issues on this pull request.

Here's the issue category breakdown:

| Category | Count |
|---|---|
| Security | 3 |

View more on Code Climate.


@codecov-io
Codecov Report

Merging #901 into develop will decrease coverage by 0.5%.
The diff coverage is 82.5%.

@@            Coverage Diff            @@
##           develop    #901     +/-   ##
=========================================
- Coverage     85.3%   84.8%   -0.5%     
=========================================
  Files           51      51             
  Lines         5189    5080    -109     
=========================================
- Hits          4427    4310    -117     
- Misses         762     770      +8

| Impacted Files | Coverage Δ |
|---|---|
| stellargraph/datasets/dataset_loader.py | 95% <100%> (+0.1%) ⬆️ |
| stellargraph/datasets/datasets.py | 46.4% <43.8%> (-3.6%) ⬇️ |
| stellargraph/layer/knowledge_graph.py | 98% <97.4%> (-0.4%) ⬇️ |
| stellargraph/core/graph.py | 98.5% <0%> (-0.1%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 928f80b...277a294.

@huonw force-pushed the feature/862-complex-notebook branch from 277a294 to 1d5b556 on March 11, 2020 04:51
@huonw marked this pull request as ready for review March 11, 2020 06:01
@kieranricardo (Contributor) left a comment

The bulk operations for speeeeed are nice! I have a couple of initial small comments but I'll review the notebooks and tests more thoroughly soon.

Just to be clear, is this meant to remove the experimental status from ComplEx or are you waiting to reproduce the filtered metrics?

@kieranricardo self-requested a review March 12, 2020
@huonw (Member, Author) commented Mar 12, 2020

> Just to be clear, is this meant to remove the experimental status from ComplEx or are you waiting to reproduce the filtered metrics?

Yeah, I think it'd be better to reproduce the filtered metrics before calling this done, because they're what almost all papers use. I marked #1060 as an experimental blocker for this reason, but we can remove/revise that if we think it's not worth it. However, given our discussion and your better results with feature/862-complex-notebook...feature/complex-filtered-scores (update: 957bf18 in #1180), maybe this can be done sooner rather than later.


num_nodes = known_edges_graph.number_of_nodes()

def ranks(pred, true_ilocs, true_is_source):
Contributor

I think this function could be pulled out of ComplEx and re-used for all our other knowledge graph algorithms. But this function is likely the main experimental blocker, so I'm happy for this to be done as part of #1060.

Member Author

Similar to #865, I'd prefer not to start abstracting before we're more sure about what's useful, but yes, this could definitely be shared in some form.

]
},
{
"cell_type": "code",
Contributor

Minor optional thing, but the misalignment of the tables below is kinda jarring. Maybe you could make and print out a dataframe with the paper's results, e.g.

import pandas as pd

pd.DataFrame({
    "mrr": [0.587],
    "filtered mrr": [0.941],
    "hits at 1": [0.936],
    "hits at 3": [0.945],
    "hits at 10": [0.947],
})

"source": [
"For comparison, Table 2 in the paper gives the following results for WN18:\n",
"\n",
"| raw MRR | filtered MRR | hits at 1 | hits at 3 | hits at 10 |\n",
Contributor

I think we should mention that these are filtered hits

Suggested change
"| raw MRR | filtered MRR | hits at 1 | hits at 3 | hits at 10 |\n",
"| raw MRR | filtered MRR | filtered hits at 1 | filtered hits at 3 | filtered hits at 10 |\n",

Member Author

I've combined this with the suggestion of including these results as a dataframe, so it has the raw and filtered numbers on separate rows.

@PantelisElinas (Contributor)

Notebook looks good and it is great to see the results from the paper confirmed. I only have a couple of minor suggestions to make.

I think using early stopping with low patience, e.g., 10, and increasing the number of training epochs to something like 200 would be better than assuming 20 epochs are sufficient.

When you create the generators, you set the batch size as follows batch_size=len(wn18_train) // 100. If your goal is to have a batch size of 100 then you should set it explicitly batch_size=100 as is common practice.

I would rename variables wn18_g and fb15k_g to wn18_graph and fb15k_graph respectively for clarity. Just making it a bit more obvious that this is the knowledge graph.

You should be able to calculate the filtered metrics by first calculating the raw metrics and then counting the number of known triplets (those in the data) that rank higher than the test triplet and subtracting from the test triplet rank.

@huonw (Member, Author) commented Mar 13, 2020

@PantelisElinas Is there a reason you didn't comment on the file itself either on the JSON or on ReviewNB? Is there a process/tooling thing that can be optimised?

> I think using early stopping with low patience, e.g., 10, and increasing the number of training epochs to something like 200 would be better than assuming 20 epochs are sufficient.

👍

> When you create the generators, you set the batch size as follows batch_size=len(wn18_train) // 100. If your goal is to have a batch size of 100 then you should set it explicitly batch_size=100 as is common practice.

Per the comment on that line, that is not what that code is doing, and if we wanted 100 elements per batch, of course batch_size=100 would be better. The goal is to have 100 batches per epoch, because that's what the paper uses. On the page with Table 2:

> We also tried varying the batch size but this had no impact and we settled with 100 batches per epoch

I could hardcode the size based on the known size of the data (e.g. batch_size=1414 for WN18), but that seems much less clear.
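For reference, the arithmetic behind that derived batch size (assuming the standard WN18 split of 141,442 training triples; variable names are purely illustrative):

```python
num_train_triples = 141_442           # size of the WN18 training split
batches_per_epoch = 100               # what the paper settles on
batch_size = num_train_triples // batches_per_epoch  # == 1414, as above
```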

> I would rename variables wn18_g and fb15k_g to wn18_graph and fb15k_graph respectively for clarity. Just making it a bit more obvious that this is the knowledge graph.

👍

> You should be able to calculate the filtered metrics by first calculating the raw metrics and then counting the number of known triplets (those in the data) that rank higher than the test triplet and subtracting from the test triplet rank.

Unfortunately, per the PR description, I tried various ways to compute it and none of them worked correctly, so restating the definition doesn't shed much light. @kieranricardo did some investigations and wrote some code that did seem to work better (see #901 (comment) above), so maybe I just had bugs in all of the versions I'd tried.

@kieranricardo (Contributor) left a comment

Looks good! Just need to increase parallelism (or maybe merge with develop) to pass CI.

@PantelisElinas (Contributor)

I don't have ReviewNB set up, so I didn't use it.

I misunderstood what you were trying to do with the batch size, so it's fine as it is.

@huonw merged commit 8684379 into develop Mar 16, 2020
@huonw deleted the feature/862-complex-notebook branch March 17, 2020 01:47
huonw added a commit that referenced this pull request Mar 19, 2020
This adds the final piece of the evaluation required to make `ComplEx`
non-`@experimental`. It extends the ranking procedure performed in #901 to also
compute the "filtered" ranks. This gives the rest of the metrics in Table 2 of
the ComplEx paper (http://jmlr.org/proceedings/papers/v48/trouillon16.pdf).

As a reminder from #901, the knowledge graph link prediction metrics for a test
edge `E = (s, r, o)` connecting nodes `s` and `o` are calculated by ranking the
prediction for that edge against all modified-source `E' = (n, r, o)` and
modified-object `E'' = (s, r, n)` edges (for all nodes `n` in the graph). The
"raw" ranks are just the rank of `E` against the `E'` and against the `E''`.

The "filtered" ranks exclude the modified edges `E'` and `E''` that are known,
i.e. are in the train, validation or test sets. For instance, if `E = (A, x, B)`
has score 1, but the modified edges `(A, x, C)` and `(A, x, D)` have scores 1.3
and 1.5 respectively, `E` has raw modified-object rank 3. If `(A, x, D)` is in
the train set (or validation or test) but `(A, x, C)` is not, it is excluded
from the filtered ranking, and so `E` has filtered modified-object rank 2.

This has been a struggle to implement correctly, because it has been difficult
to use the right nodes in the right places of the ranking procedure. For
modified-object ranking, with `E` and `E''` as above, calculating the score of
`E` within the column of scores of every modified-object edge `E''` needs to
use `o`, but calculating the known edges similar to `E` needs to use
`(s, r, _)`, not `(o, r, _)` (the latter is meaningless). (And similarly for
modified-source ranking.) It sounds obvious when written out like this, but
it's somewhat difficult to keep track of which entity needs to go where in
practice. (@kieranricardo had this key insight.)

The implementation works as follows: start with the raw `greater` matrix, where
each column corresponds to a test edge `E` and each row to a node in the graph
(i.e. row `n` represents swapping node `n` into the source or object), and an
element of a column is `True` if the score of that modified edge is greater
than the score of `E`. For each edge/column, compute the indices of the similar
known edges and set those indices to `False`, leaving only the unknown edges
with scores greater than `E`. (A minimal sketch follows after this commit
message.)

See: #1060
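A minimal numpy sketch of that "greater matrix, then mask the known edges" procedure for the modified-object direction; the matrix layout and names are illustrative, not the actual StellarGraph code:

```python
import numpy as np

def raw_and_filtered_ranks(greater, known_object_ilocs):
    """`greater` is a (num_nodes, num_test_edges) boolean matrix: entry [n, j]
    is True when the modified-object edge (s_j, r_j, n) scores higher than
    test edge j. `known_object_ilocs[j]` lists the nodes n for which
    (s_j, r_j, n) is a known (train/validation/test) edge."""
    raw = 1 + greater.sum(axis=0)
    filtered_greater = greater.copy()
    for j, known in enumerate(known_object_ilocs):
        # known edges never count against the test edge in the filtered ranking
        filtered_greater[known, j] = False
    filtered = 1 + filtered_greater.sum(axis=0)
    return raw, filtered

# The worked example above: E = (A, x, B) scores 1, the candidates (A, x, C)
# and (A, x, D) score 1.3 and 1.5, and (A, x, D) is a known edge.
scores = np.array([[1.3], [1.5], [0.2]])  # one column per test edge
raw, filtered = raw_and_filtered_ranks(scores > 1.0, [np.array([1])])
# raw == [3], filtered == [2]
```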