Added new datasets for MTL, QA, common sense; new NER results #132

Merged 1 commit on Oct 25, 2018

1 change: 1 addition & 0 deletions README.md
@@ -7,6 +7,7 @@
- [Automatic speech recognition](english/automatic_speech_recognition.md)
- [CCG supertagging](english/ccg_supertagging.md)
- [Chunking](english/chunking.md)
- [Common sense](english/common_sense.md)
- [Constituency parsing](english/constituency_parsing.md)
- [Coreference resolution](english/coreference_resolution.md)
- [Dependency parsing](english/dependency_parsing.md)
46 changes: 46 additions & 0 deletions english/common_sense.md
@@ -0,0 +1,46 @@
# Common sense

Common sense reasoning tasks are intended to require the model to go beyond pattern
recognition. Instead, the model should use "common sense" or world knowledge
to make inferences.

### Event2Mind

Event2Mind is a crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations.
Given an event described in a short free-form text, a model should reason about the likely intents and reactions of the
event's participants. Models are evaluated based on average cross-entropy (lower is better).

| Model | Dev | Test | Paper / Source | Code |
| ------------- | :-----:| :-----:|--- | --- |
| BiRNN 100d (Rashkin et al., 2018) | 4.25 | 4.22 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |
| ConvNet (Rashkin et al., 2018) | 4.44 | 4.40 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |
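
As a reference for the metric, here is a minimal sketch of average cross-entropy. It assumes a hypothetical `token_probs(event, reference)` callable exposing the model's per-token probabilities; the actual evaluation depends on each model's decoder.

```python
import math

def average_cross_entropy(token_probs, examples):
    """Average per-token cross-entropy of the reference annotations
    under the model (lower is better).

    `token_probs(event, reference)` is a hypothetical callable returning
    P(reference[t] | event, reference[:t]) for every position t.
    """
    total_nll, total_tokens = 0.0, 0
    for event, reference in examples:
        probs = token_probs(event, reference)
        total_nll -= sum(math.log(p) for p in probs)
        total_tokens += len(reference)
    return total_nll / total_tokens
```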

### SWAG

Situations with Adversarial Generations (SWAG) is a dataset consisting of 113k multiple
choice questions about a rich spectrum of grounded situations.

| Model | Dev | Test | Paper / Source | Code |
| ------------- | :-----:| :-----:|--- | --- |
| BERT Large (Devlin et al., 2018) | 86.6 | 86.3 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| BERT Base (Devlin et al., 2018) | 81.6 | - | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| ESIM + ELMo (Zellers et al., 2018) | 59.1 | 59.2 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) | |
| ESIM + GloVe (Zellers et al., 2018) | 51.9 | 52.7 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) | |
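
SWAG is scored by accuracy. A minimal sketch of the evaluation loop, assuming a hypothetical `score(context, ending)` plausibility function (e.g. an NLI-style sentence-pair model such as ESIM):

```python
def multiple_choice_accuracy(score, examples):
    """Accuracy on 4-way multiple choice: the model must rank the gold
    ending above the three adversarially filtered distractors.

    `score(context, ending)` is a hypothetical plausibility function.
    """
    correct = 0
    for context, endings, gold in examples:
        predicted = max(range(len(endings)),
                        key=lambda i: score(context, endings[i]))
        correct += predicted == gold
    return correct / len(examples)
```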

### Winograd Schema Challenge

The [Winograd Schema Challenge](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492)
is a dataset for common sense reasoning. It employs Winograd Schema questions that
require the resolution of anaphora: the system must identify the antecedent of an ambiguous pronoun in a statement. Models
are evaluated based on accuracy.

Example:

The trophy doesn’t fit in the suitcase because _it_ is too big. What is too big?
Answer 0: the trophy. Answer 1: the suitcase

| Model | Score | Paper / Source |
| ------------- | :-----:| --- |
| Word-LM-partial (Trinh and Le, 2018) | 62.6 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) |
| Char-LM-partial (Trinh and Le, 2018) | 57.9 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) |
| USSM + Supervised DeepNet + KB (Liu et al., 2017) | 52.8 | [Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems](https://aaai.org/ocs/index.php/SSS/SSS17/paper/view/15392) |
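
A rough sketch of the language-model "partial" scoring used by the Trinh and Le (2018) entries: each candidate antecedent is substituted for the pronoun, and only the tokens after the substitution are scored. `lm_logprob` is a hypothetical interface to a pretrained LM; tokenization details and the paper's model ensembling are omitted.

```python
def partial_score(lm_logprob, tokens, pronoun_idx, candidate):
    """Substitute `candidate` (a token list) for the pronoun at
    `pronoun_idx`, then sum the LM log-probabilities of only the
    tokens *after* the substitution point.

    `lm_logprob(tokens, t)` is a hypothetical function returning
    log P(tokens[t] | tokens[:t]) under a pretrained language model.
    """
    substituted = tokens[:pronoun_idx] + candidate + tokens[pronoun_idx + 1:]
    start = pronoun_idx + len(candidate)
    return sum(lm_logprob(substituted, t)
               for t in range(start, len(substituted)))

# The predicted antecedent is the higher-scoring candidate, e.g.:
# answer = max(candidates, key=lambda c: partial_score(lm, tokens, idx, c))
```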
8 changes: 8 additions & 0 deletions english/multi-task_learning.md
@@ -3,6 +3,14 @@
Multi-task learning aims to learn multiple tasks simultaneously while maximizing
performance on one or all of the tasks.

### DecaNLP

The [Natural Language Decathlon](https://arxiv.org/abs/1806.08730) (decaNLP) is a benchmark for studying general NLP
models that can perform a variety of complex natural language tasks.
It evaluates performance across ten disparate tasks.

Results can be seen on the [public leaderboard](https://decanlp.com/).

### GLUE

The [General Language Understanding Evaluation benchmark](https://arxiv.org/abs/1804.07461) (GLUE)
5 changes: 4 additions & 1 deletion english/named_entity_recognition.md
@@ -13,11 +13,14 @@ Example:
### CoNLL 2003 (English)

The [CoNLL 2003 NER task](http://www.aclweb.org/anthology/W03-0419.pdf) consists of newswire text from the Reuters RCV1
corpus tagged with four different entity types (PER, LOC, ORG, MISC). Models are evaluated based on span-based F1 on the test set.

| Model | F1 | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Flair embeddings (Akbik et al., 2018) | 93.09 | [Contextual String Embeddings for Sequence Labeling](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view) | [Flair framework](https://github.com/zalandoresearch/flair)
| BERT Large (Devlin et al., 2018) | 92.8 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| CVT + Multi (Clark et al., 2018) | 92.6 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | |
| BERT Base (Devlin et al., 2018) | 92.4 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| BiLSTM-CRF+ELMo (Peters et al., 2018) | 92.22 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | [AllenNLP Project](https://allennlp.org/elmo) [AllenNLP GitHub](https://github.com/allenai/allennlp) |
| Peters et al. (2017) | 91.93 | [Semi-supervised sequence tagging with bidirectional language models](https://arxiv.org/abs/1705.00108) | |
| LM-LSTM-CRF (Liu et al., 2018)| 91.71 | [Empowering Character-aware Sequence Labeling with Task-Aware Neural Language Model](https://arxiv.org/pdf/1709.04109.pdf) | [LM-LSTM-CRF](https://github.com/LiyuanLucasLiu/LM-LSTM-CRF) |
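
For reference, a minimal sketch of the span-based F1 metric, assuming entities have already been decoded from the model's BIO tags into typed spans:

```python
def span_f1(gold, predicted):
    """CoNLL-style span-based F1: a predicted entity counts as correct
    only if both its boundaries and its type exactly match a gold entity.

    `gold` and `predicted` are sets of
    (sentence_id, start, end, entity_type) tuples.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```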
61 changes: 42 additions & 19 deletions english/question_answering.md
@@ -2,6 +2,24 @@

Question answering is the task of producing an answer to a question posed in natural language, typically with respect to a given context such as a document or passage.

### Table of contents

- [ARC](#arc)
- [Reading comprehension](#reading-comprehension)
- [CliCR](#clicr)
- [CNN / Daily Mail](#cnn--daily-mail)
- [CoQA](#coqa)
- [HotpotQA](#hotpotqa)
- [MS MARCO](#ms-marco)
- [MultiRC](#multirc)
- [NewsQA](#newsqa)
- [QAngaroo](#qangaroo)
- [QuAC](#quac)
- [RACE](#race)
- [SQuAD](#squad)
- [Story Cloze Test](#story-cloze-test)
- [RecipeQA](#recipeqa)

### ARC

The [AI2 Reasoning Challenge (ARC)](http://ai2-website.s3.amazonaws.com/publications/AI2ReasoningChallenge2018.pdf)
@@ -35,7 +53,6 @@ Example:
| Gated-Attention Reader (Dhingra et al., 2017) | 33.9 | [CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension](http://aclweb.org/anthology/N18-1140) |
| Stanford Attentive Reader (Chen et al., 2016) | 27.2 | [CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension](http://aclweb.org/anthology/N18-1140) |

### CNN / Daily Mail

The [CNN / Daily Mail dataset](https://arxiv.org/abs/1506.03340) is a Cloze-style reading comprehension dataset
@@ -68,6 +85,21 @@ Example:
| Classifier (Chen et al., 2016) | 67.9 | 68.3 | [A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task](https://www.aclweb.org/anthology/P16-1223) |
| Impatient Reader (Hermann et al., 2015) | 63.8 | 68.0 | [Teaching Machines to Read and Comprehend](https://arxiv.org/abs/1506.03340) |
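
An illustrative sketch of how a Cloze-style instance can be constructed in the spirit of Hermann et al. (2015); the original pipeline differs in detail (e.g. entity markers are shuffled per example), and `make_cloze_example` is hypothetical:

```python
def make_cloze_example(entities, document, summary_sentence, answer):
    """Replace entity mentions with anonymized markers and blank out the
    answer entity in a held-out summary sentence; the model must map
    @placeholder back to the correct marker."""
    markers = {e: f"@entity{i}" for i, e in enumerate(entities)}
    query = summary_sentence.replace(answer, "@placeholder")
    for entity, marker in markers.items():
        document = document.replace(entity, marker)
        query = query.replace(entity, marker)
    return document, query, markers[answer]
```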

### CoQA

[CoQA](https://arxiv.org/abs/1808.07042) is a large-scale dataset for building Conversational Question Answering systems.
CoQA contains 127,000+ questions with answers collected from 8000+ conversations.
Each conversation is obtained by pairing two crowdworkers to chat about a passage in the form of questions and answers.

The data and public leaderboard are available [here](https://stanfordnlp.github.io/coqa/).

### HotpotQA

HotpotQA is a dataset with 113k Wikipedia-based question-answer pairs. Questions require
finding and reasoning over multiple supporting documents and are not constrained to any pre-existing knowledge bases.
Sentence-level supporting facts are available.

The data and public leaderboard are available from the [HotpotQA website](https://hotpotqa.github.io/).

### MS MARCO
[MS MARCO](http://www.msmarco.org/dataset.aspx) aka Human Generated MAchine
@@ -121,6 +153,15 @@ PubMed.

The leaderboards for both datasets are available on the [QAngaroo website](http://qangaroo.cs.ucl.ac.uk/leaderboard.html).

### QuAC

Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information seeking dialog.
Data instances consist of an interactive dialog between two crowd workers:
(1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text,
and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.

The leaderboard and data are available on the [QuAC website](http://quac.ai/).

### RACE

The [RACE dataset](https://arxiv.org/abs/1704.04683) is a reading comprehension dataset
@@ -157,24 +198,6 @@ endings. The systems must then choose the correct ending to the story.
| Hidden Coherence Model (Chaturvedi et al., 2017) | 77.6 | [Story Comprehension for Predicting What Happens Next](http://aclweb.org/anthology/D17-1168) |
| val-LS-skip (Srinivasan et al., 2018) | 76.5 | [A Simple and Effective Approach to the Story Cloze Test](http://aclweb.org/anthology/N18-2015) |

### RecipeQA

[RecipeQA](https://arxiv.org/abs/1809.00812) is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.