# CoQA Dataset Analysis

This analysis is based mostly on the paper published by the Stanford CoQA team, the creators of the dataset, supplemented by my own exploration of the data and some additional examples drawn from it.

CoQA: A Conversational Question Answering Challenge https://arxiv.org/pdf/1808.07042.pdf
    

### Goals of the CoQA dataset:
 * To understand how questions in a human conversation become succinct because they rely on the conversation history
 * To ensure the naturalness of answers in a conversation
 * To enable building QA systems that perform robustly across domains
    

### Linguistic concepts in this dataset:

* Conversational question answering: The dataset provides background information in a passage, then presents a series of questions and answers that resemble what people would naturally ask when inquiring for information or testing a reader's understanding.

* Ways to arrive at an answer:
    * Lexical match: the answer can be found by matching words or tokens (the question contains at least one content word that appears in the given passage).
       * Example: 
           * Given: Outen was rescued by the coast guard
           * Question: Who had to rescue her?
           * Answer: the coast guard
            
    * Paraphrasing: the answer is derived from the source through a paraphrase of it rather than through shared words.
       * Example:
           * Given: he drew cautiously closer
           * Question: Did the wild dog approach?
           * Answer: Yes
        
    * Pragmatics: the answer is derived by understanding the text beyond its literal wording, for example through common sense or presupposition.
       * Example:  
           * Given: It looked like a stick man so she kept him. She named her new noodle friend Joey
           * Question:  Is Joey a male or female?
           * Answer: Male
* Types of relationship between a question and its conversation history:
    * No co-reference: 
       * Example:
           * Q: What is IFL?
    * Explicit co-reference: 
       * Example:
            * Q: Who had Bashti forgotten? 
            * A: the puppy
            * Q: What was  **his** name?
    * Implicit co-reference: 
        * Example:
            * Q: When will Sirisena be sworn in? 
            * A: 6 p.m local time
            * Q: **Where**?
* CoQA answer types:
    * Yes 
        * Given: There is also a site optimized for display on mobile devices
        * Q: is MedlinePlus optimized for mobile? 
        * A: Yes
   
    * No 
        * Given: AFL is the highest level of professional indoor American football
        * Q: Is it played outside? 
        * A: No
       
    * Fluency 
        * Given: while the investigation continued
        * Q: Why? 
        * A: so the investigation could continue
       
    * Counting 
        * Given: The service provides curated consumer health information in English and Spanish
        * Q: how many languages is it offered in? 
        * A: Two
        
    * Multiple choice 
        * Given: her baby sister is crying so loud that Jenny can’t hear herself
        * Q: Is Jenny older or younger?  
        * A: Older
* Breakdown of the edits made to the source text to gain fluency:
    * Multiple edits 
        * Given: She **would give** her **baby sister** **one of her** toy horses.
        * Q: What did she try just before that? 
        * A: She **gave** her **a** toy **horse**.
        * (morphology: give → gave, horses → horse; delete: would, baby sister, one of her; insert: a)
        
    * Coreference insertion 
        * Given: The service is funded by the NLM and **is free** to users
        * Q: what is the cost to end users? 
        * A: **It** is free
        
    * Morphology 
        * Given: **vandalism** in the neighborhoods
        * Q: Who was messing up the neighborhoods?  
        * A: **vandals**
       
    * Article insertion 
        * Given: the heavy ax
        * Q: What would they cut with?  
        * A: **an** ax
       
    * Adverb insertion 
        * Given: kept 190 years ago
        * Q: How old was the diary?  
        * A: 190 years **old**
       
    * Adjective deletion 
        * Given: a **120-page** diary
        * Q: What type of book? 
        * A: A diary.
        
    * Preposition insertion 
        * Given: By the time they arrived, it was almost supper time.
        * Q: how long did it take to get to the fire? 
        * A: **Until** supper time!
        
    * Adverb deletion 
        * Given: It had **somewhat** changed its formation when they approached it
        * Q: What had happened to the ice? 
        * A: It had changed
      
    * Conjunction insertion 
        * Given: paid well, both in potatoes, carrots
        * Q: what else do they get for their work? 
        * A: potatoes **and** carrots
        
    * Noun insertion 
        * Given: But it was a Comedy Central account
        * Q: Who did  
        * A: Comedy Central **employee**
       
    * Coreference deletion 
        * Given: This is the story of a young girl and **her** dog
        * Q: What is the story about?  
        * A: A girl and a dog
        
    * Noun deletion 
        * Given: and has the fourth largest **student** population
        * Q: What is the ranking in the country in terms of people studying?
        * A: the fourth largest population
       
    * Possessive insertion 
        * Given: a 120-page diary kept 190 years ago by Deborah Logan
        * Q: Whose diary was it?  
        * A: Deborah Logan**’s**
        
    * Article deletion 
        * Given: They **all** were going to the circus to see the clowns
        * Q: why? 
        * A: They were going to the circus
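
The lexical-match criterion from the "ways to arrive at an answer" list can be approximated in a few lines. This is only a rough sketch of the idea, not the procedure the CoQA authors used: the stopword list, the `same_stem` prefix heuristic, and the function names are all my own assumptions for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {
    "a", "an", "the", "who", "what", "did", "do", "does", "had", "has",
    "to", "by", "was", "is", "he", "she", "her", "his", "it", "of", "in",
}


def content_words(text):
    """Lower-case the text and keep only tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def same_stem(a, b, min_len=4):
    """Crude stand-in for stemming: the shorter word is a prefix of the longer one."""
    short, long_ = sorted((a, b), key=len)
    return len(short) >= min_len and long_.startswith(short)


def is_lexical_match(question, rationale):
    """True if any content word of the question also appears in the rationale."""
    return any(
        same_stem(q, r)
        for q in content_words(question)
        for r in content_words(rationale)
    )


# "rescue" in the question lines up with "rescued" in the given text.
print(is_lexical_match("Who had to rescue her?", "Outen was rescued by the coast guard"))
# No shared content word here, so this one falls under paraphrasing instead.
print(is_lexical_match("Did the wild dog approach?", "he drew cautiously closer"))
```

A real implementation would want a proper stemmer or lemmatizer; the prefix test is just enough to handle the two examples above.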
        

### Dataset schema by example (JSON):

#### Training dataset: 
     
    {
      "version": "1.0",
      "data":
              [
                {
                  "source": "wikipedia",
                  "id": "3zotghdk5ibi9cex97fepx7jetpso7",
                  "filename": "Vatican_Library.txt",
                  "story": "The Vatican Apostolic ...",
                  "questions": [
                                {
                                  "input_text": "When was the Vat formally opened?",
                                  "turn_id": 1
                                },
                                {
                                  "input_text": "what is the library for?",
                                  "turn_id": 2
                                },
                                {...},
                                ...
                               ],
                  "answers": [
                                {
                                  "span_start": 151,
                                  "span_end": 179,
                                  "span_text": "Formally established in 1475",
                                  "input_text": "It was formally established in 1475",
                                  "turn_id": 1
                                 },
                                 {...},
                                 ...
                              ],
                   "name": "Vatican_Library.txt"
                  },
                  {...},
                  ...
               ]
    }
    
  ##### Note:  In each answer, input_text is the gold (free-form) answer and span_text is the rationale taken from the original text (story) at offsets span_start to span_end. In the training dataset, each question corresponds to exactly one answer, matched by turn_id within the same story.
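
The relationship between span_start, span_end and span_text can be checked with a plain string slice. The sketch below runs on a made-up toy record rather than the real file, and `mismatched_spans` is a hypothetical helper name. I also assume that unanswerable turns are marked by a span_text of "unknown" (as seen in the sample rows of this notebook) and that surrounding whitespace should be ignored when comparing.

```python
def mismatched_spans(record):
    """Return the turn_ids whose span_text differs from story[span_start:span_end]."""
    story = record["story"]
    mismatches = []
    for answer in record["answers"]:
        # Assumption: unanswerable turns carry "unknown" and have no real span.
        if answer["span_text"] == "unknown":
            continue
        sliced = story[answer["span_start"]:answer["span_end"]]
        if sliced.strip() != answer["span_text"].strip():
            mismatches.append(answer["turn_id"])
    return mismatches


# Toy record (invented for illustration, not taken from the dataset).
toy_story = "Once upon a time, there was a little white kitten named Cotton."
rationale = "a little white kitten named Cotton"
start = toy_story.find(rationale)

toy_record = {
    "story": toy_story,
    "answers": [
        {"span_start": start, "span_end": start + len(rationale),
         "span_text": rationale, "input_text": "white", "turn_id": 1},
        # The -1 offsets are placeholders chosen for this toy example.
        {"span_start": -1, "span_end": -1,
         "span_text": "unknown", "input_text": "unknown", "turn_id": 2},
    ],
}

print(mismatched_spans(toy_record))
```

Running the same function over every entry of `data['data']` would show how often the stored span_text really is an exact slice of the story.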
  
    
#### Test dataset:


    {
      "version": "1.0",
      "data": [
                {
                  "source": "mctest",
                  "id": "3dr23u6we5exclen4th8uq9rb42tel",
                  "filename": "mc160.test.41",
                  "story": "Once upon a time,...",
                  "questions": [
                                 {
                                  "input_text": "What color was Cotton?",
                                  "turn_id": 1
                                 },
                                 {...},
                                 ...
                                ],
                   "answers": [
                                {
                                  "span_start": 59,
                                  "span_end": 93,
                                  "span_text": "a little white kitten named Cotton",
                                  "input_text": "white",
                                  "turn_id": 1
                                },
                                {...},
                                ...
                               ],
                    "additional_answers": {
                                            "0": [
                                                   {
                                                      "span_start": 68,
                                                      "span_end": 93,
                                                      "span_text": "white kitten named Cotton",
                                                      "input_text": "white",
                                                      "turn_id": 1
                                                    },
                                                    {...},
                                                     ...
                                                  ],
                                              "1": [],
                                              ...
                                            },
                      "name": "mc160.test.41"
                      },
                      {...},
                      ...
              ]
    }
              
  #### Note:  The test dataset has a similar schema. The only difference is an extra node, additional_answers, which holds a few alternative answer sets keyed by 0, 1, .... The main difference between the gold answer and the additional answers lies in the spans, and therefore in the span texts.
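
To see how the gold answer and the alternatives fit together, one can gather every reference answer for a given turn. This is a minimal sketch on an invented toy record. `references_for_turn` is a hypothetical helper, and I assume additional_answers may be missing (as in the training file) or contain empty lists.

```python
def references_for_turn(record, turn_id):
    """Collect the gold answer plus any additional answers for one turn."""
    refs = [a["input_text"] for a in record["answers"] if a["turn_id"] == turn_id]
    extra = record.get("additional_answers", {})  # absent in the training schema
    for key in sorted(extra):
        refs.extend(a["input_text"] for a in extra[key] if a["turn_id"] == turn_id)
    return refs


# Toy record (made up for illustration; only the fields used here are included).
toy_record = {
    "answers": [{"input_text": "white", "turn_id": 1}],
    "additional_answers": {
        "0": [{"input_text": "white", "turn_id": 1}],
        "1": [{"input_text": "White", "turn_id": 1}],
        "2": [],
    },
}

print(references_for_turn(toy_record, 1))
```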


In [1]:
import json
from pandas import json_normalize  # top-level since pandas 1.0; the pandas.io.json import path is deprecated
import pandas as pd
import re

In [2]:
# Load the CoQA training set; the context manager closes the file after reading.
with open('CoQA/coqa-train-v1.0.json', encoding='utf-8') as f:
    data = json.load(f)


In [3]:
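# One row per question, carrying the story-level fields along as metadata.
qas = json_normalize(data['data'], record_path='questions', meta=['source', 'id', 'story'])

In [4]:
# One row per answer, keyed by the story id.
ans = json_normalize(data['data'], record_path='answers', meta=['id'])

In [5]:
# Pair each question with its answer; shared column names get _x (question) and _y (answer) suffixes.
train_df = pd.merge(qas, ans, on=['id', 'turn_id'])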
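
The flatten-and-merge steps above can be tried on a tiny in-memory record before running them on the full file. The record below is invented, and adding `validate="one_to_one"` is my own suggestion rather than part of the original notebook: it makes pandas raise an error if a (story id, turn_id) pair ever appears more than once on either side.

```python
import pandas as pd

# Toy stand-in for the loaded JSON (made up; same shape as the training schema).
toy = {
    "data": [
        {
            "source": "toy",
            "id": "story-1",
            "story": "Cotton was a little white kitten.",
            "questions": [
                {"input_text": "What color was Cotton?", "turn_id": 1},
                {"input_text": "What was she?", "turn_id": 2},
            ],
            "answers": [
                {"span_text": "white", "input_text": "white", "turn_id": 1},
                {"span_text": "kitten", "input_text": "a kitten", "turn_id": 2},
            ],
        }
    ]
}

qas = pd.json_normalize(toy["data"], record_path="questions", meta=["source", "id", "story"])
ans = pd.json_normalize(toy["data"], record_path="answers", meta=["id"])

# validate="one_to_one" checks that each question has exactly one answer.
merged = qas.merge(ans, on=["id", "turn_id"], validate="one_to_one")

# input_text_x is the question, input_text_y is the answer.
print(merged[["turn_id", "input_text_x", "input_text_y", "span_text"]])
```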

In [6]:
train_df.loc[10:30,['turn_id','input_text_x','input_text_y','span_text'] ]

| | turn_id | input_text_x | input_text_y | span_text |
|---|---|---|---|---|
| 10 | 11 | when were the Secret Archives moved from the r... | at the beginning of the 17th century; | atican Secret Archives were separated from the... |
| 11 | 12 | how many items are in this secret collection? | 150000 | Vatican Secret Archives were separated from t... |
| 12 | 13 | Can anyone use this library? | anyone who can document their qualifications a... | The Vatican Library is open to anyone who can... |
| 13 | 14 | what must be requested to view? | unknown | unknown |
| 14 | 15 | what must be requested in person or by mail? | Photocopies | Photocopies for private study of pages from bo... |
| 15 | 16 | of what books? | only books published between 1801 and 1990 | hotocopies for private study of pages from boo... |
| 16 | 17 | What is the Vat the library of? | the Holy See | simply the Vat, is the library of the Holy See, |
| 17 | 18 | How many books survived the Pre Lateran period? | a handful of volumes | Pre-Lateran period, comprising the initial day... |
| 18 | 19 | what is the point of the project started in 2014? | digitising manuscripts | Vatican Library began an initial four-year pro... |
| 19 | 20 | what will this allow? | them to be viewed online. | manuscripts, to be made available online. |


In [7]:
#train_df[train_df['input_text_x']=='wife']
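# First word of each question, lower-cased; expand=False returns a Series rather than a DataFrame.
train_df['q_first_word'] = train_df['input_text_x'].str.lower().str.extract(r'(\w+)', expand=False)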

  


In [8]:
# First two whitespace-separated tokens, lower-cased; one-word questions do not match and become NaN.
train_df['q_first_two_words'] = train_df['input_text_x'].str.lower().str.extract(r'^(\S+\s+\S+)', expand=False)


In [9]:
train_df.groupby('q_first_word').count().sort_values(by='input_text_x',ascending=False).head(30)

| q_first_word | bad_turn_x | input_text_x | turn_id | source | id | story | bad_turn_y | input_text_y | span_end | span_start | span_text | q_first_two_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| what | 114 | 32092 | 32092 | 32092 | 32092 | 32092 | 611 | 32092 | 32092 | 32092 | 32092 | 31711 |
| who | 45 | 15684 | 15684 | 15684 | 15684 | 15684 | 301 | 15684 | 15684 | 15684 | 15684 | 15075 |
| how | 37 | 10946 | 10946 | 10946 | 10946 | 10946 | 224 | 10946 | 10946 | 10946 | 10946 | 10662 |
| did | 19 | 7381 | 7381 | 7381 | 7381 | 7381 | 137 | 7381 | 7381 | 7381 | 7381 | 7381 |
| where | 21 | 7214 | 7214 | 7214 | 7214 | 7214 | 121 | 7214 | 7214 | 7214 | 7214 | 6305 |
| was | 30 | 5121 | 5121 | 5121 | 5121 | 5121 | 121 | 5121 | 5121 | 5121 | 5121 | 5121 |
| when | 10 | 4530 | 4530 | 4530 | 4530 | 4530 | 83 | 4530 | 4530 | 4530 | 4530 | 3614 |
| is | 16 | 3431 | 3431 | 3431 | 3431 | 3431 | 76 | 3431 | 3431 | 3431 | 3431 | 3431 |
| why | 13 | 2921 | 2921 | 2921 | 2921 | 2921 | 65 | 2921 | 2921 | 2921 | 2921 | 1885 |
| does | 5 | 2110 | 2110 | 2110 | 2110 | 2110 | 33 | 2110 | 2110 | 2110 | 2110 | 2110 |


In [10]:
train_df.groupby('q_first_two_words').count().sort_values(by='input_text_x',ascending=False).head(30)

| q_first_two_words | bad_turn_x | input_text_x | turn_id | source | id | story | bad_turn_y | input_text_y | span_end | span_start | span_text | q_first_word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| what did | 27 | 5622 | 5622 | 5622 | 5622 | 5622 | 97 | 5622 | 5622 | 5622 | 5622 | 5622 |
| what was | 10 | 5079 | 5079 | 5079 | 5079 | 5079 | 100 | 5079 | 5079 | 5079 | 5079 | 5079 |
| what is | 15 | 4800 | 4800 | 4800 | 4800 | 4800 | 101 | 4800 | 4800 | 4800 | 4800 | 4800 |
| how many | 12 | 3692 | 3692 | 3692 | 3692 | 3692 | 108 | 3692 | 3692 | 3692 | 3692 | 3692 |
| who was | 9 | 3390 | 3390 | 3390 | 3390 | 3390 | 74 | 3390 | 3390 | 3390 | 3390 | 3390 |
| who is | 11 | 2409 | 2409 | 2409 | 2409 | 2409 | 29 | 2409 | 2409 | 2409 | 2409 | 2409 |
| did he | 5 | 2366 | 2366 | 2366 | 2366 | 2366 | 40 | 2366 | 2366 | 2366 | 2366 | 2366 |
| where did | 3 | 1988 | 1988 | 1988 | 1988 | 1988 | 40 | 1988 | 1988 | 1988 | 1988 | 1988 |
| when did | 5 | 1810 | 1810 | 1810 | 1810 | 1810 | 38 | 1810 | 1810 | 1810 | 1810 | 1810 |
| what does | 5 | 1797 | 1797 | 1797 | 1797 | 1797 | 37 | 1797 | 1797 | 1797 | 1797 | 1797 |
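
The `groupby(...).count()` calls above repeat the same number in almost every column, because `count()` reports the non-null entries per column. A more compact way to get the same ranking is `value_counts()`, sketched here on a few made-up questions. The sketch also shows why the q_first_two_words counts can be lower than the q_first_word counts: one-word questions such as "Why?" have no second token, so the two-word extract is NaN and is left out of the count.

```python
import pandas as pd

# Toy questions (invented for illustration).
questions = pd.Series([
    "What color was Cotton?",
    "what is the library for?",
    "Why?",
    "Who was he?",
])

lower = questions.str.lower()
first_word = lower.str.extract(r"(\w+)", expand=False)
first_two = lower.str.extract(r"^(\S+\s+\S+)", expand=False)

# value_counts() sorts by frequency and skips NaN by default.
word_counts = first_word.value_counts()
pair_counts = first_two.value_counts()

print(word_counts)
print(pair_counts)
print("one-word questions:", first_two.isna().sum())
```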


In [11]:
# bad_turn is stored as the string 'true', not a boolean; count the rows whose answer is flagged.
train_df[train_df['bad_turn_y']=='true'].count()

bad_turn_x             39
input_text_x         2093
turn_id              2093
source               2093
id                   2093
story                2093
bad_turn_y           2093
input_text_y         2093
span_end             2093
span_start           2093
span_text            2093
q_first_word         2092
q_first_two_words    2033
dtype: int64
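
Since every column of the filtered frame reports roughly the same number, a boolean mask gives the flagged-row count more directly and can be reused to drop those rows. This is a small sketch on a made-up frame. The column name bad_turn_y and the string value 'true' come from the cell above; everything else is assumed for illustration.

```python
import pandas as pd

# Toy frame (invented): only the second turn is flagged as a bad turn.
toy = pd.DataFrame({
    "turn_id": [1, 2, 3],
    "bad_turn_y": [None, "true", None],
})

# eq() treats missing values as "not equal", so unflagged rows become False.
flagged = toy["bad_turn_y"].eq("true")

n_flagged = int(flagged.sum())   # number of flagged rows
share = float(flagged.mean())    # fraction of all rows that are flagged
clean = toy[~flagged]            # rows with the flagged turns removed

print(n_flagged, round(share, 3), len(clean))
```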