# CoLA

Taken from [https://nyu-mll.github.io/CoLA/](https://nyu-mll.github.io/CoLA/).

## Split

We have split the data into an in-domain set with sentences from 17 sources and an out-of-domain set with the remaining 6 sources. The in-domain set is split into train/dev/test sections, and the out-of-domain set is split into dev/test sections. The test sets are not made public. For convenience, each dataset is provided twice: in raw form and in tokenized form (using the NLTK tokenizer). The public data is split into the following files:

- raw/in_domain_train.tsv (8551 lines)
- raw/in_domain_dev.tsv (527 lines)
- raw/out_of_domain_dev.tsv (516 lines)
- tokenized/in_domain_train.tsv (8551 lines)
- tokenized/in_domain_dev.tsv (527 lines)
- tokenized/out_of_domain_dev.tsv (516 lines)
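
The line counts listed above can serve as a quick sanity check after downloading the data. The following is a minimal sketch, not part of the original release: the helper names `count_lines` and `check_split` are hypothetical, and the directory passed in is assumed to be either the `raw` or the `tokenized` folder of the public data.

```python
import pathlib as pl
from typing import Dict

# Expected line counts, copied from the file list above.
EXPECTED_LINES: Dict[str, int] = {
    "in_domain_train.tsv": 8551,
    "in_domain_dev.tsv": 527,
    "out_of_domain_dev.tsv": 516,
}


def count_lines(path: pl.Path) -> int:
    """Count the lines in a text file (hypothetical helper)."""
    with path.open("rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_split(directory: pl.Path) -> Dict[str, bool]:
    """Map each expected file name to whether it exists with the expected line count."""
    return {
        name: (directory / name).is_file()
        and count_lines(directory / name) == expected
        for name, expected in EXPECTED_LINES.items()
    }
```

Calling `check_split(pl.Path("./cola_public/raw"))` would return a dictionary with one boolean per file. A `False` entry means the file is missing or its line count differs from the list above.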

## Data Format

Each line in the .tsv files consists of 4 tab-separated columns.

- Column 1: the code representing the source of the sentence.
- Column 2: the acceptability judgment label (0=unacceptable, 1=acceptable).
- Column 3: the acceptability judgment as originally notated by the author.
- Column 4: the sentence.
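
To make the column layout concrete, here is a small sketch that splits one line into its four fields. The names `ColaExample` and `parse_cola_line` are hypothetical and chosen only for illustration. The sample line is taken from the corpus sample in this document.

```python
from typing import NamedTuple


class ColaExample(NamedTuple):
    """One row of a CoLA .tsv file (hypothetical container, not part of the dataset)."""
    source: str              # column 1: code of the source publication
    label: int               # column 2: 0 = unacceptable, 1 = acceptable
    original_judgment: str   # column 3: author's notation, may be empty
    sentence: str            # column 4: the sentence itself


def parse_cola_line(line: str) -> ColaExample:
    """Split a single tab-separated line into its four columns."""
    # maxsplit=3 keeps the sentence in one piece.
    source, label, judgment, sentence = line.rstrip("\r\n").split("\t", 3)
    return ColaExample(source, int(label), judgment, sentence)


example = parse_cola_line("c-05\t0\t*\tBooks were sent to each other by the students.\n")
print(example.label, example.original_judgment)  # prints: 0 *
```

Column 3 is empty for acceptable sentences, so the parsed `original_judgment` is an empty string in that case.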


## Corpus Sample

- clc95	0	*	In which way is Sandy very anxious to see if the students will be able to solve the homework problem?
- c-05	1		The book was written by John.
- c-05	0	*	Books were sent to each other by the students.
- swb04	1		She voted for herself.
- swb04	1		I saw that gas can explode.

## Processing

While gathering and processing the data, some sentences from the source documents may have been omitted or altered. We retained all acceptable examples, and excluded any examples given intermediate judgments such as “?” or “#”. In addition, we excluded examples of unacceptable sentences not suitable for the present task because they required reasoning about pragmatic violations, unavailable semantic readings, or nonexistent words. We take responsibility for any errors.

In [1]:
import pathlib as pl
import csv
from typing import Union, List, Tuple

In [2]:
cola_path = pl.Path("./cola_public/raw")

In [3]:
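def read_cola_tsv(filepath: Union[str, pl.Path]) -> List[Tuple[str, str, str, str]]:
    """Read a CoLA .tsv file into rows of (source, label, original judgment, sentence)."""
    data = []

    # Explicit UTF-8 avoids depending on the platform's default encoding.
    with open(filepath, "rt", newline='', encoding="utf-8") as tsvfile:
        # QUOTE_NONE keeps any quotation marks inside a sentence as literal characters.
        reader = csv.reader(tsvfile, delimiter="\t", quoting=csv.QUOTE_NONE)

        for row in reader:
            data.append(row)

    return data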

In [4]:
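def save_cola_data_in_fairseq_format(data: List[Tuple[str, str, str, str]], filename: str, directory: str = "."):
    """Write sentences to `filename` and labels to `filename + ".lbl"`, one per line."""
    outdir = pl.Path(directory)
    sentence_file = outdir / filename
    label_file = outdir / (filename + ".lbl")

    # Both files are written in lockstep so that line i of each refers to the same example.
    with sentence_file.open("wt", encoding="utf-8") as s_file, label_file.open("wt", encoding="utf-8") as l_file:
        for _, label, _, sentence in data:
            s_file.write(sentence + "\n")
            l_file.write(label + "\n")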
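
The function above writes two parallel files: one sentence per line, and one label per line in a file with the same name plus `.lbl`. As a sketch of how that pair could be read back and checked for alignment, the following helper may be useful. The name `read_sentence_label_pair` is hypothetical; it relies only on the two-file layout described here and does not use any fairseq API.

```python
import pathlib as pl
from typing import List, Tuple


def read_sentence_label_pair(filename: str, directory: str = ".") -> List[Tuple[str, str]]:
    """Read `filename` and `filename + ".lbl"` back into (sentence, label) pairs."""
    sentence_file = pl.Path(directory) / filename
    label_file = pl.Path(directory) / (filename + ".lbl")

    with sentence_file.open("rt", encoding="utf-8") as s_file:
        sentences = [line.rstrip("\n") for line in s_file]
    with label_file.open("rt", encoding="utf-8") as l_file:
        labels = [line.rstrip("\n") for line in l_file]

    # The two files only make sense together if they have the same number of lines.
    if len(sentences) != len(labels):
        raise ValueError(
            f"{sentence_file} has {len(sentences)} lines, "
            f"but {label_file} has {len(labels)}"
        )

    return list(zip(sentences, labels))
```

For example, `read_sentence_label_pair("in_domain.train")` would return the training sentences paired with their labels once the files have been written.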

In [5]:
train_file = cola_path / "in_domain_train.tsv"
train_data = read_cola_tsv(train_file)
train_data[:10]

[['gj04',
  '1',
  '',
  "Our friends won't buy this analysis, let alone the next one we propose."],
 ['gj04', '1', '', "One more pseudo generalization and I'm giving up."],
 ['gj04', '1', '', "One more pseudo generalization or I'm giving up."],
 ['gj04', '1', '', 'The more we study verbs, the crazier they get.'],
 ['gj04', '1', '', 'Day by day the facts are getting murkier.'],
 ['gj04', '1', '', "I'll fix you a drink."],
 ['gj04', '1', '', 'Fred watered the plants flat.'],
 ['gj04', '1', '', 'Bill coughed his way out of the restaurant.'],
 ['gj04', '1', '', "We're dancing the night away."],
 ['gj04', '1', '', 'Herman hammered the metal flat.']]

In [6]:
save_cola_data_in_fairseq_format(train_data, filename="in_domain.train")  # will also create label file

In [7]:
%%bash
head in_domain.train
echo ""
head in_domain.train.lbl
echo ""
wc -l in_domain.train*

Our friends won't buy this analysis, let alone the next one we propose.
One more pseudo generalization and I'm giving up.
One more pseudo generalization or I'm giving up.
The more we study verbs, the crazier they get.
Day by day the facts are getting murkier.
I'll fix you a drink.
Fred watered the plants flat.
Bill coughed his way out of the restaurant.
We're dancing the night away.
Herman hammered the metal flat.

1
1
1
1
1
1
1
1
1
1

  8551 in_domain.train
  8551 in_domain.train.lbl
 17102 total


In [8]:
dev_file = cola_path / "in_domain_dev.tsv"
dev_data = read_cola_tsv(dev_file)
dev_data[:10]

[['gj04', '1', '', 'The sailors rode the breeze clear of the rocks.'],
 ['gj04', '1', '', 'The weights made the rope stretch over the pulley.'],
 ['gj04', '1', '', 'The mechanical doll wriggled itself loose.'],
 ['cj99', '1', '', 'If you had eaten more, you would want less.'],
 ['cj99', '0', '*', 'As you eat the most, you want the least.'],
 ['cj99', '0', '*', 'The more you would want, the less you would eat.'],
 ['cj99', '0', '*', 'I demand that the more John eat, the more he pays.'],
 ['cj99', '1', '', 'Mary listens to the Grateful Dead, she gets depressed.'],
 ['cj99', '1', '', 'The angrier Mary got, the more she looked at pictures.'],
 ['cj99', '1', '', 'The higher the stakes, the lower his expectations are.']]

In [9]:
save_cola_data_in_fairseq_format(dev_data, filename="in_domain.dev")

In [10]:
%%bash
head in_domain.dev
echo ""
head in_domain.dev.lbl
echo ""
wc -l in_domain.dev*

The sailors rode the breeze clear of the rocks.
The weights made the rope stretch over the pulley.
The mechanical doll wriggled itself loose.
If you had eaten more, you would want less.
As you eat the most, you want the least.
The more you would want, the less you would eat.
I demand that the more John eat, the more he pays.
Mary listens to the Grateful Dead, she gets depressed.
The angrier Mary got, the more she looked at pictures.
The higher the stakes, the lower his expectations are.

1
1
1
1
0
0
0
1
1
1

  527 in_domain.dev
  527 in_domain.dev.lbl
 1054 total


In [11]:
outdomain_dev_file = cola_path / "out_of_domain_dev.tsv"
outdomain_dev_data = read_cola_tsv(outdomain_dev_file)
outdomain_dev_data[:10]

[['clc95', '1', '', 'Somebody just left - guess who.'],
 ['clc95',
  '1',
  '',
  "They claimed they had settled on something, but it wasn't clear what they had settled on."],
 ['clc95', '1', '', 'If Sam was going, Sally would know where.'],
 ['clc95',
  '1',
  '',
  "They're going to serve the guests something, but it's unclear what."],
 ['clc95', '1', '', "She's reading. I can't imagine what."],
 ['clc95', '1', '', 'John said Joan saw someone from her graduating class.'],
 ['clc95', '0', '*', "John ate dinner but I don't know who."],
 ['clc95', '0', '*', "She mailed John a letter, but I don't know to whom."],
 ['clc95', '1', '', 'I served leek soup to my guests.'],
 ['clc95', '1', '', 'I served my guests.']]

In [12]:
save_cola_data_in_fairseq_format(outdomain_dev_data, filename="out_domain.dev")

In [13]:
%%bash
head out_domain.dev
echo ""
head out_domain.dev.lbl
echo ""
wc -l out_domain.dev*

Somebody just left - guess who.
They claimed they had settled on something, but it wasn't clear what they had settled on.
If Sam was going, Sally would know where.
They're going to serve the guests something, but it's unclear what.
She's reading. I can't imagine what.
John said Joan saw someone from her graduating class.
John ate dinner but I don't know who.
She mailed John a letter, but I don't know to whom.
I served leek soup to my guests.
I served my guests.

1
1
1
1
1
1
0
0
1
1

  516 out_domain.dev
  516 out_domain.dev.lbl
 1032 total
