Textual Entailment using Benjamin Riedel's model #6
Run with the following command (GPU optional). Gets about 88% accuracy, which is within 1% of the score reported in [Riedel et al. 2017]. On our dataset, we're sitting at about 60% accuracy for 2-way classification 👎
This is without early stopping or a clever learning rate schedule. I also want to evaluate different vocab/NN sizes. Might try a grid search on Sharc over the weekend. |
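For reference, a minimal sketch of the Riedel et al. 2017-style model being discussed: term-frequency vectors for the claim and the evidence text plus the cosine similarity of their TF-IDF vectors, fed into a single-hidden-layer MLP. The vocab size, hidden width, and the `train_claims`/`train_bodies`/`train_labels` containers are illustrative assumptions, not values from the paper or this repo:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neural_network import MLPClassifier

def build_features(claims, bodies, tf_vec, tfidf_vec):
    # Term-frequency bags of words for the claim and the evidence text.
    tf_claim = tf_vec.transform(claims).toarray()
    tf_body = tf_vec.transform(bodies).toarray()
    # Cosine similarity between the claim and evidence TF-IDF vectors.
    a = tfidf_vec.transform(claims).toarray()
    b = tfidf_vec.transform(bodies).toarray()
    cos = (a * b).sum(axis=1, keepdims=True) / (
        np.linalg.norm(a, axis=1, keepdims=True)
        * np.linalg.norm(b, axis=1, keepdims=True) + 1e-9)
    return np.hstack([tf_claim, tf_body, cos])

# train_claims, train_bodies, train_labels are hypothetical containers.
# Vocab size and hidden width are illustrative, not the paper's values.
tf_vec = CountVectorizer(max_features=5000).fit(train_claims + train_bodies)
tfidf_vec = TfidfVectorizer(max_features=5000).fit(train_claims + train_bodies)

X = build_features(train_claims, train_bodies, tf_vec, tfidf_vec)
clf = MLPClassifier(hidden_layer_sizes=(100,)).fit(X, train_labels)
```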
Interesting! The random baseline would be 50%, right? (same number of supported/refuted claims). I would say let's first complete the full task evaluation (and make sure we are happy with the metrics), and then we optimize the various components according to what the metrics tell us. |
Training with randomly sampled pages for the Not Enough Info class: accuracy on the gold-labelled dev set is approximately 70%; a random baseline would be 33%. Will try incorporating DrQA predictions into the training set too. |
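A sketch of what the random NEI sampling could look like (the `nei_claims` and `all_page_ids` containers are hypothetical):

```python
import random

def sample_nei_examples(nei_claims, all_page_ids, seed=0):
    """Pair each NOT ENOUGH INFO claim with a randomly sampled page,
    since no evidence page is annotated for that class."""
    rng = random.Random(seed)
    return [(claim, rng.choice(all_page_ids), "NOT ENOUGH INFO")
            for claim in nei_claims]
```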
Training solely on pages retrieved by DrQA for the Not Enough Info class gives a dev accuracy of 0.37. Training on FNC (merging discuss and unrelated into Not Enough Info) and testing on pages predicted by DrQA gives an accuracy of 0.35. |
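For the FNC transfer experiment, the label merge could be as simple as the following mapping (using the standard FNC-1 stance names and FEVER-style labels):

```python
# FNC-1 stances collapsed onto the three-way label set: 'discuss' and
# 'unrelated' both become NOT ENOUGH INFO.
FNC_TO_FEVER = {
    "agree": "SUPPORTS",
    "disagree": "REFUTES",
    "discuss": "NOT ENOUGH INFO",
    "unrelated": "NOT ENOUGH INFO",
}
```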
Interesting! It is still the case that 0.33 is the random baseline, right? While accuracy is the main metric (given the balanced dataset), it would be good to look at the confusion matrix. |
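Producing the confusion matrix and a per-class report is straightforward with scikit-learn; a sketch, where `clf`, `X_dev`, and `dev_labels` are hypothetical names for the trained model and dev data:

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
y_pred = clf.predict(X_dev)  # hypothetical trained classifier and dev features
print(confusion_matrix(dev_labels, y_pred, labels=labels))
print(classification_report(dev_labels, y_pred, labels=labels))
```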
Got 2 families of experiments going for this MLP model:
- Oracle RS - generating training data with no DrQA for the test-time predictions, just using the annotator-labelled pages. Accuracy 71%. Confusion matrix/classification report: https://pastebin.com/x43y9vR0
- Oracle NN - using DrQA just to identify the nearest-neighbour pages, with DrQA selecting k pages for all claims. Confusion matrix for the k=1 RS model: https://pastebin.com/UQtTHYvK |
PS: that has early stopping with patience=8. I think the reason the RS model is doing well is that the cosine similarity between TF-IDF vectors (one of the features) is going to be very low for unrelated documents. This might work quite nicely as a document relatedness filter. |
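That relatedness-filter idea could be prototyped as a simple threshold on the TF-IDF cosine similarity; a sketch, where `corpus_texts` is a hypothetical list of documents and the threshold value is arbitrary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer().fit(corpus_texts)  # hypothetical corpus

def is_related(claim, page_text, threshold=0.1):
    # Treat a pair as related only if the TF-IDF cosine clears the threshold.
    vecs = vectorizer.transform([claim, page_text])
    return cosine_similarity(vecs[0:1], vecs[1:2])[0, 0] >= threshold
```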
Hey, yes, makes sense. Random is often hard to beat, and for a good reason (unless we know what mistakes we will make). Is it correct to say that the oracle RS is a bit more "oracle" than the oracle NN, as the first uses the pages labelled by the annotators while the second uses DrQA? |
Both use labelled pages for the supports/refutes classes. It's just for the NEI class where we have no labelled pages. I think because the nearest-neighbour pages are more semantically similar than randomly sampled ones, the classifier needs to be more sensitive, which we cannot achieve with this MLP. |
Got it. The reason could be what you say. Maybe see if the NN-chosen pages help if added to the RS-chosen ones. |