
Error Analysis #31

Closed
3 of 5 tasks
j6mes opened this issue Feb 9, 2018 · 2 comments

Comments


j6mes commented Feb 9, 2018

- How often did DR (document retrieval) return the right page?
- How often did SR (sentence retrieval) return the right page?
- How often did SR return the original evidence?
- For the cases where SR returned different evidence: how do the BLEU/ROUGE similarities between the claim and the returned evidence compare with those between the claim and the gold evidence? (A sketch of this comparison follows the list.)
- Error coding scheme
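
A minimal sketch of the BLEU/ROUGE comparison in the item above, assuming NLTK is available; the helper names and the `similarity_gap` output format are illustrative, not the repository's evaluation code:

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu(claim, evidence):
    # Sentence-level BLEU with smoothing, since individual evidence sentences are short.
    return sentence_bleu([word_tokenize(evidence.lower())],
                         word_tokenize(claim.lower()),
                         smoothing_function=SmoothingFunction().method1)


def rouge1_f1(claim, evidence):
    # Unigram-overlap approximation of ROUGE-1 F1, kept dependency-free.
    c, e = set(word_tokenize(claim.lower())), set(word_tokenize(evidence.lower()))
    overlap = len(c & e)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(e)
    return 2 * precision * recall / (precision + recall)


def similarity_gap(claim, retrieved_evidence, gold_evidence):
    # Compare claim-vs-returned-evidence similarity with claim-vs-gold-evidence similarity.
    return {
        "bleu_retrieved": bleu(claim, retrieved_evidence),
        "bleu_gold": bleu(claim, gold_evidence),
        "rouge1_retrieved": rouge1_f1(claim, retrieved_evidence),
        "rouge1_gold": rouge1_f1(claim, gold_evidence),
    }
```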

j6mes commented Feb 10, 2018

| Metric | NLTK | DrQA Sents (precomputed IDF) | DrQA Sents (new IDF) |
| --- | --- | --- | --- |
| Runtime | 2 hours | 10 hours | 12 hours |
| Strict accuracy (strict requirement for correct evidence) | 0.2476 | 0.1827 | 0.2698 |
| Classification accuracy (without need for evidence) | 0.4885 | 0.4588 | 0.4922 |
| Correct document return rate (dmatch) | 0.5793 | 0.5893 | 0.5893 |
| Correct document return rate after sentence selection (smatch) | 0.4773 | 0.2690 | 0.5596 |
| Correct text return rate (for Refutes/Supports) | 0.3647 | 0.1083 | 0.4680 |


j6mes commented Feb 10, 2018

@andreasvlachos using DrQA instead of NLTK for sentence selection gives us about a 2-point boost in strict accuracy, at the cost of an extra 10 hours of runtime. The dmatch and smatch figures give upper bounds on strict accuracy (considering the Supported/Refuted classes). With DrQA (new IDF), the correct document is still in the evidence after sentence selection 56% of the time (smatch 0.5596), whereas with NLTK this holds only 48% of the time (0.4773).
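
A rough sketch of how these figures relate, assuming a per-claim record format (gold/retrieved documents, gold/selected sentences, gold/predicted label) that is hypothetical rather than the repository's actual schema; NEI claims, which need no evidence, are ignored here for simplicity:

```python
def evaluate(records):
    """records: list of dicts with keys gold_docs, retrieved_docs, gold_sents,
    selected_sents, label, predicted_label (hypothetical schema)."""
    n = len(records)
    # dmatch: a gold document survives document retrieval.
    dmatch = sum(bool(set(r["gold_docs"]) & set(r["retrieved_docs"])) for r in records) / n
    # smatch: a gold sentence survives sentence selection.
    smatch = sum(bool(set(r["gold_sents"]) & set(r["selected_sents"])) for r in records) / n
    # Classification accuracy ignores evidence entirely.
    accuracy = sum(r["label"] == r["predicted_label"] for r in records) / n
    # Strict accuracy needs the right label AND a gold sentence in the selection,
    # so it can never exceed smatch (and, in this pipeline, smatch cannot exceed dmatch).
    strict = sum(r["label"] == r["predicted_label"]
                 and bool(set(r["gold_sents"]) & set(r["selected_sents"]))
                 for r in records) / n
    return {"dmatch": dmatch, "smatch": smatch, "accuracy": accuracy, "strict": strict}
```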

@j6mes j6mes closed this as completed Apr 10, 2018