Added new datasets for MTL, QA, common sense; new NER results #132

Merged 1 commit on Oct 25, 2018

1 change: 1 addition & 0 deletions README.md
@@ -7,6 +7,7 @@
- [Automatic speech recognition](english/automatic_speech_recognition.md)
- [CCG supertagging](english/ccg_supertagging.md)
- [Chunking](english/chunking.md)
- [Common sense](english/common_sense.md)
- [Constituency parsing](english/constituency_parsing.md)
- [Coreference resolution](english/coreference_resolution.md)
- [Dependency parsing](english/dependency_parsing.md)
46 changes: 46 additions & 0 deletions english/common_sense.md
@@ -0,0 +1,46 @@
# Common sense

Common sense reasoning tasks are intended to require the model to go beyond pattern
recognition. Instead, the model should use "common sense" or world knowledge
to make inferences.

### Event2Mind

Event2Mind is a crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations.
Given an event described in a short free-form text, a model should reason about the likely intents and reactions of the
event's participants. Models are evaluated based on average cross-entropy (lower is better).

| Model | Dev | Test | Paper / Source | Code |
| ------------- | :-----:| :-----:|--- | --- |
| BiRNN 100d (Rashkin et al., 2018) | 4.25 | 4.22 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |
| ConvNet (Rashkin et al., 2018) | 4.44 | 4.40 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |
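
As a reference for the metric, here is a minimal sketch of average cross-entropy. It assumes a hypothetical `token_probs(event, reference)` callable exposing the model's per-token probabilities; the actual evaluation depends on each model's decoder.

```python
import math

def average_cross_entropy(token_probs, examples):
    """Average per-token cross-entropy of the reference annotations
    under the model (lower is better).

    `token_probs(event, reference)` is a hypothetical callable returning
    P(reference[t] | event, reference[:t]) for every position t.
    """
    total_nll, total_tokens = 0.0, 0
    for event, reference in examples:
        probs = token_probs(event, reference)
        total_nll -= sum(math.log(p) for p in probs)
        total_tokens += len(reference)
    return total_nll / total_tokens
```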

### SWAG

Situations with Adversarial Generations (SWAG) is a dataset consisting of 113k multiple
choice questions about a rich spectrum of grounded situations.

| Model | Dev | Test | Paper / Source | Code |
| ------------- | :-----:| :-----:|--- | --- |
| BERT Large (Devlin et al., 2018) | 86.6 | 86.3 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| BERT Base (Devlin et al., 2018) | 81.6 | - | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| ESIM + ELMo (Zellers et al., 2018) | 59.1 | 59.2 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) | |
| ESIM + GloVe (Zellers et al., 2018) | 51.9 | 52.7 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) | |
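
SWAG is scored by accuracy. A minimal sketch of the evaluation loop, assuming a hypothetical `score(context, ending)` plausibility function (e.g. an NLI-style sentence-pair model such as ESIM):

```python
def multiple_choice_accuracy(score, examples):
    """Accuracy on 4-way multiple choice: the model must rank the gold
    ending above the three adversarially filtered distractors.

    `score(context, ending)` is a hypothetical plausibility function.
    """
    correct = 0
    for context, endings, gold in examples:
        predicted = max(range(len(endings)),
                        key=lambda i: score(context, endings[i]))
        correct += predicted == gold
    return correct / len(examples)
```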

### Winograd Schema Challenge

The [Winograd Schema Challenge](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492)
is a dataset for common sense reasoning. It employs Winograd Schema questions that
require the resolution of anaphora: the system must identify the antecedent of an ambiguous pronoun in a statement. Models
are evaluated based on accuracy.

Example:

The trophy doesn’t fit in the suitcase because _it_ is too big. What is too big?
Answer 0: the trophy. Answer 1: the suitcase

| Model | Score | Paper / Source |
| ------------- | :-----:| --- |
| Word-LM-partial (Trinh and Le, 2018) | 62.6 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) |
| Char-LM-partial (Trinh and Le, 2018) | 57.9 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) |
| USSM + Supervised DeepNet + KB (Liu et al., 2017) | 52.8 | [Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems](https://aaai.org/ocs/index.php/SSS/SSS17/paper/view/15392) |
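
A rough sketch of the language-model "partial" scoring used by the Trinh and Le (2018) entries: each candidate antecedent is substituted for the pronoun, and only the tokens after the substitution are scored. `lm_logprob` is a hypothetical interface to a pretrained LM; tokenization details and the paper's model ensembling are omitted.

```python
def partial_score(lm_logprob, tokens, pronoun_idx, candidate):
    """Substitute `candidate` (a token list) for the pronoun at
    `pronoun_idx`, then sum the LM log-probabilities of only the
    tokens *after* the substitution point.

    `lm_logprob(tokens, t)` is a hypothetical function returning
    log P(tokens[t] | tokens[:t]) under a pretrained language model.
    """
    substituted = tokens[:pronoun_idx] + candidate + tokens[pronoun_idx + 1:]
    start = pronoun_idx + len(candidate)
    return sum(lm_logprob(substituted, t)
               for t in range(start, len(substituted)))

# The predicted antecedent is the higher-scoring candidate, e.g.:
# answer = max(candidates, key=lambda c: partial_score(lm, tokens, idx, c))
```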
8 changes: 8 additions & 0 deletions english/multi-task_learning.md
@@ -3,6 +3,14 @@
Multi-task learning aims to learn multiple tasks simultaneously while maximizing
performance on one or all of the tasks.

### DecaNLP

The [Natural Language Decathlon](https://arxiv.org/abs/1806.08730) (decaNLP) is a benchmark for studying general NLP
models that can perform a variety of complex natural language tasks.
It evaluates performance across ten disparate tasks.

Results can be seen on the [public leaderboard](https://decanlp.com/).

### GLUE

The [General Language Understanding Evaluation benchmark](https://arxiv.org/abs/1804.07461) (GLUE)
5 changes: 4 additions & 1 deletion english/named_entity_recognition.md
@@ -13,11 +13,14 @@ Example:
### CoNLL 2003 (English)

The [CoNLL 2003 NER task](http://www.aclweb.org/anthology/W03-0419.pdf) consists of newswire text from the Reuters RCV1
corpus tagged with four different entity types (PER, LOC, ORG, MISC). Models are evaluated based on span-based F1 on the test set.

| Model | F1 | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Flair embeddings (Akbik et al., 2018) | 93.09 | [Contextual String Embeddings for Sequence Labeling](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view) | [Flair framework](https://github.com/zalandoresearch/flair)
| BERT Large (Devlin et al., 2018) | 92.8 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| CVT + Multi (Clark et al., 2018) | 92.6 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | |
| BERT Base (Devlin et al., 2018) | 92.4 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| BiLSTM-CRF+ELMo (Peters et al., 2018) | 92.22 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | [AllenNLP Project](https://allennlp.org/elmo) [AllenNLP GitHub](https://github.com/allenai/allennlp) |
| Peters et al. (2017) | 91.93 | [Semi-supervised sequence tagging with bidirectional language models](https://arxiv.org/abs/1705.00108) | |
| LM-LSTM-CRF (Liu et al., 2018)| 91.71 | [Empowering Character-aware Sequence Labeling with Task-Aware Neural Language Model](https://arxiv.org/pdf/1709.04109.pdf) | [LM-LSTM-CRF](https://github.com/LiyuanLucasLiu/LM-LSTM-CRF) |
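
For reference, a minimal sketch of the span-based F1 metric, assuming entities have already been decoded from the model's BIO tags into typed spans:

```python
def span_f1(gold, predicted):
    """CoNLL-style span-based F1: a predicted entity counts as correct
    only if both its boundaries and its type exactly match a gold entity.

    `gold` and `predicted` are sets of
    (sentence_id, start, end, entity_type) tuples.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```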
61 changes: 42 additions & 19 deletions english/question_answering.md
@@ -2,6 +2,24 @@

Question answering is the task of producing an answer to a question posed in natural language, typically with respect to a given context such as a document or passage.

### Table of contents

- [ARC](#arc)
- [Reading comprehension](#reading-comprehension)
- [CliCR](#clicr)
- [CNN / Daily Mail](#cnn--daily-mail)
- [CoQA](#coqa)
- [HotpotQA](#hotpotqa)
- [MS MARCO](#ms-marco)
- [MultiRC](#multirc)
- [NewsQA](#newsqa)
- [QAngaroo](#qangaroo)
- [QuAC](#quac)
- [RACE](#race)
- [SQuAD](#squad)
- [Story Cloze Test](#story-cloze-test)
- [RecipeQA](#recipeqa)

### ARC

The [AI2 Reasoning Challenge (ARC)](http://ai2-website.s3.amazonaws.com/publications/AI2ReasoningChallenge2018.pdf)
@@ -35,7 +53,6 @@ Example:
| Gated-Attention Reader (Dhingra et al., 2017) | 33.9 | [CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension](http://aclweb.org/anthology/N18-1140) |
| Stanford Attentive Reader (Chen et al., 2016) | 27.2 | [CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension](http://aclweb.org/anthology/N18-1140) |

### CNN / Daily Mail

The [CNN / Daily Mail dataset](https://arxiv.org/abs/1506.03340) is a Cloze-style reading comprehension dataset
@@ -68,6 +85,21 @@ Example:
| Classifier (Chen et al., 2016) | 67.9 | 68.3 | [A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task](https://www.aclweb.org/anthology/P16-1223) |
| Impatient Reader (Hermann et al., 2015) | 63.8 | 68.0 | [Teaching Machines to Read and Comprehend](https://arxiv.org/abs/1506.03340) |
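
An illustrative sketch of how a Cloze-style instance can be constructed in the spirit of Hermann et al. (2015); the original pipeline differs in detail (e.g. entity markers are shuffled per example), and `make_cloze_example` is hypothetical:

```python
def make_cloze_example(entities, document, summary_sentence, answer):
    """Replace entity mentions with anonymized markers and blank out the
    answer entity in a held-out summary sentence; the model must map
    @placeholder back to the correct marker."""
    markers = {e: f"@entity{i}" for i, e in enumerate(entities)}
    query = summary_sentence.replace(answer, "@placeholder")
    for entity, marker in markers.items():
        document = document.replace(entity, marker)
        query = query.replace(entity, marker)
    return document, query, markers[answer]
```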

### CoQA

[CoQA](https://arxiv.org/abs/1808.07042) is a large-scale dataset for building Conversational Question Answering systems.
CoQA contains 127,000+ questions with answers collected from 8000+ conversations.
Each conversation is obtained by pairing two crowdworkers to chat about a passage in the form of questions and answers.

The data and public leaderboard are available [here](https://stanfordnlp.github.io/coqa/).

### HotpotQA

HotpotQA is a dataset with 113k Wikipedia-based question-answer pairs. Questions require
finding and reasoning over multiple supporting documents and are not constrained to any pre-existing knowledge bases.
Sentence-level supporting facts are available.

The data and public leaderboard are available from the [HotpotQA website](https://hotpotqa.github.io/).

### MS MARCO
[MS MARCO](http://www.msmarco.org/dataset.aspx) aka Human Generated MAchine
@@ -121,6 +153,15 @@ PubMed.

The leaderboards for both datasets are available on the [QAngaroo website](http://qangaroo.cs.ucl.ac.uk/leaderboard.html).

### QuAC

Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information seeking dialog.
Data instances consist of an interactive dialog between two crowd workers:
(1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text,
and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.

The leaderboard and data are available on the [QuAC website](http://quac.ai/).

### RACE

The [RACE dataset](https://arxiv.org/abs/1704.04683) is a reading comprehension dataset
@@ -157,24 +198,6 @@ endings. The systems must then choose the correct ending to the story.
| Hidden Coherence Model (Chaturvedi et al., 2017) | 77.6 | [Story Comprehension for Predicting What Happens Next](http://aclweb.org/anthology/D17-1168) |
| val-LS-skip (Srinivasan et al., 2018) | 76.5 | [A Simple and Effective Approach to the Story Cloze Test](http://aclweb.org/anthology/N18-2015) |

### RecipeQA

[RecipeQA](https://arxiv.org/abs/1809.00812) is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.