# Creation of Dataset

## Final output files are:
1. **"./final_data/ul_train_data_from_nli_contradict.tsv"**<br>
   &emsp;Training data for unlikelihood training, synthesized from NLI contradiction samples.

2. **"./final_data/ul_train_data_from_nli_entailment.tsv"**<br>
   &emsp;Training data for unlikelihood training, synthesized from NLI entailment samples.

3. **"./final_data/ul_valid_data_from_nli_contradict.tsv"**<br>
   &emsp;Validation data for unlikelihood training, synthesized from NLI contradiction samples.

4. **"./final_data/ul_valid_data_from_nli_entailment.tsv"**<br>
   &emsp;Validation data for unlikelihood training, synthesized from NLI entailment samples.

5. **"./final_data/test_data_from_nli_contradict.tsv"**<br>
   &emsp;Test data used for our analyses, synthesized from NLI contradiction samples.

6. **"./final_data/test_data_from_nli_entailment.tsv"**<br>
   &emsp;Test data used for our analyses, synthesized from NLI entailment samples.

## 1. Download the data
We use the Multi-Genre Natural Language Inference (MultiNLI) corpus to synthesize stimulus inputs.<br>
See https://aclanthology.org/N18-1101/ for the details of the MultiNLI corpus.

In [1]:
%%bash
wget -q https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip -nc
unzip -q multinli_1.0.zip
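
Each line of `multinli_1.0/multinli_1.0_train.jsonl` is one JSON object. The following is a minimal sketch of reading the four fields that the later cells rely on (`genre`, `sentence1`, `sentence2`, `gold_label`). The record is a made-up example rather than a line from the corpus, and `parse_record` is a hypothetical helper that is not part of this repository.

```python
import json

# Made-up record for illustration; real lines carry additional fields.
example_line = json.dumps({
    "genre": "telephone",
    "sentence1": "yeah i like it a lot",
    "sentence2": "I like it.",
    "gold_label": "entailment",
})


def parse_record(line):
    """Return (genre, premise, hypothesis, label) from one JSONL line."""
    js = json.loads(line)
    return js["genre"], js["sentence1"], js["sentence2"], js["gold_label"]


genre, premise, hypothesis, label = parse_record(example_line)
print(genre, label)
```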

## 2. Split the data by genre

In [2]:
import json
import os

def collect_genre(in_fname):
    genre_set = set()
    with open(in_fname) as in_f:
        for l in in_f:
            js = json.loads(l.strip())
            # set.add() is a no-op for genres that were already seen
            genre_set.add(js['genre'])
    return genre_set


def write_genre_file(in_fname, out_fname, genre):
    with open(in_fname) as in_f, open(out_fname, 'w') as out_f:
        for l in in_f:
            js = json.loads(l.strip())
            if js['genre'] == genre:
                out_f.write('{}\t{}\t{}\n'.format(js['sentence1'], js['sentence2'], js['gold_label']))
            

def split_by_genre(in_fname, out_dname, suffix_fname='train'):
    genre_set = collect_genre(in_fname)
    print('Genre Set: {}'.format(genre_set))
    for genre in list(genre_set):
        out_fname = f'{genre}_{suffix_fname}.tsv'
        print('Writing to: {}'.format(out_fname))
        write_genre_file(in_fname, os.path.join(out_dname, out_fname), genre)

In [3]:
!mkdir -p tmp

In [4]:
in_fname = 'multinli_1.0/multinli_1.0_train.jsonl'
out_dname = 'tmp'
split_by_genre(in_fname, out_dname)

Genre Set: {'telephone', 'slate', 'travel', 'government', 'fiction'}
Writing to: telephone_train.tsv
Writing to: slate_train.tsv
Writing to: travel_train.tsv
Writing to: government_train.tsv
Writing to: fiction_train.tsv


## 3. Create questions from the "telephone" genre

### Transform hypotheses into yes-no questions
e.g., "I like it" -> "Do you like it?"

In [5]:
!python -m spacy download en_core_web_sm | grep -v 'already satisfied'

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
You should consider upgrading via the '/work/shiki/release_sigdial2022/create_questions/venv-create-questions/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
!python3 src/make_question_for_mnli.py \
    --in-fname 'tmp/telephone_train.tsv' \
    --out-fname 'tmp/telephone_train_q.tsv'

100%|█████████████████████████████████████| 83348/83348 [15:54<00:00, 87.34it/s]


In [7]:
def get_sample_of_label(in_fname, out_fname, label):
    with open(in_fname) as i_f, open(out_fname,'w') as o_f:
        for l in i_f:
            c, gr, r, b = l.strip().split('\t')
            if b == label:
                o_f.write('{}\t{}\t{}\n'.format(c,gr,r))
                
in_fname = 'tmp/telephone_train_q.tsv'
out_fname = 'tmp/telephone_train_q_e.tsv'
get_sample_of_label(in_fname, out_fname, 'entailment')
in_fname = 'tmp/telephone_train_q.tsv'
out_fname = 'tmp/telephone_train_q_c.tsv'
get_sample_of_label(in_fname, out_fname, 'contradiction')

### Make negative questions
e.g., "Do you like it?" -> "Don't you like it?"
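
To make the transformation concrete, here is a minimal sketch that contracts the leading auxiliary of a yes-no question with "n't". It only illustrates the idea; it is not the logic of `src/transform_negative_for_mnli.py`, and both `NEGATED_AUX` and `negate_question` are hypothetical names introduced here.

```python
# Hypothetical lookup table: leading auxiliary -> negated contraction.
NEGATED_AUX = {
    "do": "don't", "does": "doesn't", "did": "didn't",
    "is": "isn't", "are": "aren't", "was": "wasn't", "were": "weren't",
    "have": "haven't", "has": "hasn't", "had": "hadn't",
    "can": "can't", "could": "couldn't", "will": "won't",
    "would": "wouldn't", "should": "shouldn't",
}


def negate_question(question):
    """Negate the first word of a yes-no question.

    Returns None when the question does not start with a known auxiliary.
    """
    first, sep, rest = question.partition(" ")
    negated = NEGATED_AUX.get(first.lower())
    if negated is None or not sep:
        return None
    # Keep the capitalization of the original first word.
    if first[0].isupper():
        negated = negated[0].upper() + negated[1:]
    return negated + sep + rest


print(negate_question("Do you like it?"))
```

Questions that the table cannot handle come back as `None`, so a caller can skip them instead of writing a malformed row.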

In [8]:
!python3 src/transform_negative_for_mnli.py \
    --in-fname 'tmp/telephone_train_q_c.tsv' \
    --out-fname 'tmp/telephone_train_q_c_neg.tsv'

In [9]:
# Also generate negative questions for the entailment samples (used for unlikelihood training)
!python3 src/transform_negative_for_mnli.py \
    --in-fname 'tmp/telephone_train_q_e.tsv' \
    --out-fname 'tmp/telephone_train_q_e_neg.tsv'

## 4. Make pairs for unlikelihood training
Some premise sentences in the MultiNLI corpus have both contradicting and entailing hypotheses.<br>
We call these samples paired samples and use them for unlikelihood training.<br>
The remaining non-paired samples are used as stimulus inputs for the experiments.
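
The pairing rule can be sketched on toy data as follows. This is an illustration under the assumption that each premise appears at most once per label; `split_paired` and the example sentences are hypothetical and not taken from the corpus.

```python
def split_paired(entail, contra):
    """Split samples by whether the premise occurs under both labels.

    entail / contra: dicts mapping a premise to its hypothesis.
    Returns (paired, entail_unpaired, contra_unpaired).
    """
    shared = entail.keys() & contra.keys()
    # Paired: premises that have both an entailing and a contradicting hypothesis.
    paired = [(p, entail[p], contra[p]) for p in entail if p in shared]
    entail_unpaired = {p: h for p, h in entail.items() if p not in shared}
    contra_unpaired = {p: h for p, h in contra.items() if p not in shared}
    return paired, entail_unpaired, contra_unpaired


# Made-up toy samples.
entail = {"i like jazz a lot": "I like jazz.", "we went home": "We left."}
contra = {"i like jazz a lot": "I hate jazz.", "it rained all day": "It was sunny."}

paired, e_rest, c_rest = split_paired(entail, contra)
print(paired)
print(e_rest, c_rest)
```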

### Load preprocessed data

In [10]:
e_context_lis, e_response_lis, e_question_lis, e_negquestion_lis = [], [], [], []
with open('tmp/telephone_train_q_e_neg.tsv') as f:
    for l in f:
        e_context, e_response, e_question, e_negquestion = l.strip().split('\t')
        assert e_context and e_response and e_question and e_negquestion
        e_context_lis.append(e_context)
        e_response_lis.append(e_response)
        e_question_lis.append(e_question)
        e_negquestion_lis.append(e_negquestion)
assert len(e_context_lis) == len(set(e_context_lis))

c_context_lis, c_response_lis, c_question_lis, c_negquestion_lis = [], [], [], []
with open('tmp/telephone_train_q_c_neg.tsv') as f:
    for l in f:
        c_context, c_response, c_question, c_negquestion = l.strip().split('\t')
        assert c_context and c_response and c_question and c_negquestion
        c_context_lis.append(c_context)
        c_response_lis.append(c_response)
        c_question_lis.append(c_question)
        c_negquestion_lis.append(c_negquestion)
assert len(c_context_lis) == len(set(c_context_lis))

### Extract paired samples

In [11]:
# Map each contradiction context to its index for constant-time lookup
# (contexts are unique, as asserted when the data was loaded).
c_idx_of = {c_context: c_idx for c_idx, c_context in enumerate(c_context_lis)}
with open('tmp/telephone_train_q_e_paired.tsv', 'w') as of_ep, \
        open('tmp/telephone_train_q_c_paired.tsv', 'w') as of_cp:
    for e_idx, (e_context, e_response, e_question, e_negquestion) \
            in enumerate(zip(e_context_lis, e_response_lis, e_question_lis, e_negquestion_lis)):
        c_idx = c_idx_of.get(e_context)
        if c_idx is not None:
            of_ep.write(f"{e_context}\t{e_response}\t{e_question}\t{e_negquestion}\n")
            of_cp.write(f"{e_context}\t{c_response_lis[c_idx]}\t{c_question_lis[c_idx]}\t{c_negquestion_lis[c_idx]}\n")
            # blank out the paired rows so that only unpaired rows remain for the next step
            e_context_lis[e_idx], e_response_lis[e_idx], e_question_lis[e_idx], e_negquestion_lis[e_idx] = '', '', '', ''
            c_context_lis[c_idx], c_response_lis[c_idx], c_question_lis[c_idx], c_negquestion_lis[c_idx] = '', '', '', ''

### Extract unpaired samples

In [12]:
with open('tmp/telephone_train_q_e_unpaired.tsv', 'w') as of_eup, \
        open('tmp/telephone_train_q_c_unpaired.tsv', 'w') as of_cup:
    for e_idx, (e_context, e_response, e_question, e_negquestion) \
            in enumerate(zip(e_context_lis, e_response_lis, e_question_lis, e_negquestion_lis)):
        if e_context:  # not null after the preceding process = unpaired sample
            assert e_response and e_question and e_negquestion
            of_eup.write(f"{e_context}\t{e_response}\t{e_question}\t{e_negquestion}\n")
    for c_idx, (c_context, c_response, c_question, c_negquestion) \
            in enumerate(zip(c_context_lis, c_response_lis, c_question_lis, c_negquestion_lis)):
        if c_context:  # not null after the preceding process = unpaired sample
            assert c_response and c_question and c_negquestion
            of_cup.write(f"{c_context}\t{c_response}\t{c_question}\t{c_negquestion}\n")

## 5. Prepare final data sets

### Prepare training data for unlikelihood training

In [13]:
%%bash
mkdir -p final_data
head -n 1800 "tmp/telephone_train_q_c_paired.tsv" > "final_data/ul_train_data_from_nli_contradict.tsv"
head -n 1800 "tmp/telephone_train_q_e_paired.tsv" > "final_data/ul_train_data_from_nli_entailment.tsv"
head -n 2000 "tmp/telephone_train_q_c_paired.tsv" | tail -n 200 > "final_data/ul_valid_data_from_nli_contradict.tsv"
head -n 2000 "tmp/telephone_train_q_e_paired.tsv" | tail -n 200 > "final_data/ul_valid_data_from_nli_entailment.tsv"
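
The `head`/`tail` commands above take the first 1800 paired lines for training and lines 1801 to 2000 for validation. A rough Python equivalent, given as a sketch with a hypothetical helper `train_valid_split`, is:

```python
from itertools import islice


def train_valid_split(lines, n_train=1800, n_valid=200):
    """Return (train, valid): the first n_train lines, then the next n_valid lines."""
    head = list(islice(lines, n_train + n_valid))
    return head[:n_train], head[n_train:n_train + n_valid]


# Stand-in for the lines of a paired .tsv file.
toy_lines = [f"row {i}\n" for i in range(2500)]
train, valid = train_valid_split(toy_lines)
print(len(train), len(valid))
```

One difference is worth noting: if a file has fewer than 2000 lines, `head -n 2000 | tail -n 200` returns the last 200 lines that exist, which overlap the training portion, whereas the sketch returns a shorter validation set with no overlap.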

### Prepare test data for analyses

In [14]:
%%bash
head -n 2000 "tmp/telephone_train_q_c_unpaired.tsv" > "final_data/test_data_from_nli_contradict.tsv"
head -n 2000 "tmp/telephone_train_q_e_unpaired.tsv" > "final_data/test_data_from_nli_entailment.tsv"