Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

Use this notebook to pull in datasets and apply preprocessing. Most grammar datasets require preprocessing before they can be used for training. For example, JFLEG has 4 targets per input, so each input must be re-paired with every target to form 1:1 pairs.

In [1]:
import csv
from datasets import load_dataset
from pathlib import Path
import pandas as pd



In [2]:
list_replacements = [
  (" .", "."), 
  (" ,", ","),
  (" '", "'"),
  (" ?", "?"),
  (" !", "!"),
  (" :", ":"),
  (" ;", ";"),
  (" n't", "n't"),
  (" v", "v"),
  ("2 0 0 6", "2006"),
  ("5 5", "55"),
  ("4 0 0", "400"),
  ("1 7-5 0", "1750"),
  ("2 0 %", "20%"),
  ("5 0", "50"),
  ("1 2", "12"),
  ("1 0", "10"),
  ('" ballast water', '"ballast water')
  ]

In [3]:
def correct_spacing(item):
    """Apply every (old, new) pair in list_replacements, in order, to one text item."""
    for old, new in list_replacements:
        item = item.replace(old, new)
    return item
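
Because `str.replace` matches raw substrings, a rule fires wherever its pattern occurs, not only in the case it was written for. The following minimal sketch uses a hypothetical three-rule subset (`rules` and `apply_rules` are illustrative names, not part of the notebook) to show both the intended effect and a side effect of the `" v"` rule:

```python
# Hypothetical subset of the replacement rules, for illustration only.
rules = [(" .", "."), (" n't", "n't"), (" v", "v")]

def apply_rules(text, rules):
    """Apply each (old, new) substring replacement in order."""
    for old, new in rules:
        text = text.replace(old, new)
    return text

# Intended effect: tokenized punctuation and contractions are re-attached.
fixed = apply_rules("I do n't know .", rules)

# Side effect: the " v" rule also glues any word starting with "v"
# onto the preceding word.
glued = apply_rules("It is very good .", rules)

print(fixed)
print(glued)
```

It may be worth spot-checking a few cleaned sentences for merged words before training on the output.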



In [29]:
def generate_csv(csv_path, dataset):
    """Apply spacing corrections and write matched (input, target) pairs to a CSV file."""
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Clean the source sentence once; it is paired with each of its corrections.
            input_text = correct_spacing(case["sentence"])

            for correction in case["corrections"]:
                correction = correct_spacing(correction)
                # A few cases contain blank strings; skip those pairs.
                if input_text and correction:
                    writer.writerow([input_text, correction])
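
The one-to-many expansion can be checked on a toy record without touching disk. This is a minimal sketch, assuming records shaped like the JFLEG rows (a `sentence` plus a list of `corrections`); `expand_pairs` and the sample record are hypothetical, and spacing correction is left out to keep the example self-contained:

```python
import csv
import io

def expand_pairs(records):
    """Yield one (input, target) pair per non-blank correction."""
    for record in records:
        source = record["sentence"]
        for correction in record["corrections"]:
            # Mirror the blank-string filter used when writing the CSV.
            if source and correction:
                yield source, correction

# Hypothetical record: four corrections, one of them blank.
toy = [{"sentence": "There are several reason.",
        "corrections": ["There are several reasons.",
                        "There are several reasons.",
                        "",
                        "There are a few reasons."]}]

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["input", "target"])
writer.writerows(expand_pairs(toy))

rows = buffer.getvalue().splitlines()
print(len(rows))
```

The buffer holds the header plus one line per surviving pair, so a blank correction shortens the output by exactly one row.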

# dataset
## 1) JFLEG dataset

JFLEG ships only 'validation' and 'test' splits, so the validation split will be used as 'train' and the test split as 'validation'.


In [4]:
data = load_dataset("jfleg")
print(type(data))


<class 'datasets.dataset_dict.DatasetDict'>





In [7]:
# list the splits available in the dataset
data.keys()

dict_keys(['validation', 'test'])

In [8]:
train_dataset = load_dataset("jfleg", split='validation[:]') 
eval_dataset = load_dataset("jfleg", split='test[:]')


In [7]:
print(train_dataset)
print(eval_dataset)


Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 755
})
Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 748
})


In [17]:
print(len(train_dataset['sentence']))
print(len(eval_dataset['sentence']))
print(len(train_dataset['corrections']))
print(len(eval_dataset['corrections']))

755
748
755
748


In [22]:
# sentence
idx  = 9
print(train_dataset['sentence'][idx])
# corrections
for i, correction in enumerate(train_dataset['corrections'][idx]):
    print( i, correction)


There are several reason . 
0 There are several reasons . 
1 There are several reasons . 
2 There are several reasons . 
3 There are several reasons . 


In [9]:
clean22 = correct_spacing(train_dataset['sentence'][22])
clean22

'Students can focus on only a few subjects they are intwerested in and they will become an experts in those areas. '

In [26]:
jfleg_dir = Path.cwd()/'jfleg_dataset'  # avoid naming this directory just 'jfleg': load_dataset("jfleg") would then try to load from the local folder and complain
jfleg_dir.mkdir(parents=True,exist_ok=True)
c4_dir = Path.cwd()/'c4_dataset'
c4_dir.mkdir(parents=True,exist_ok=True)

Process the JFLEG data

In [31]:
j_train_file = jfleg_dir/'jtrain.csv'
j_eval_file = jfleg_dir/'jeval.csv'

In [34]:
generate_csv(j_train_file, train_dataset)

In [35]:
generate_csv(j_eval_file, eval_dataset)

In [39]:
# with open(j_train_file) as f:
#     train_data = f.read()

train_data = pd.read_csv(j_train_file)
eval_data = pd.read_csv(j_eval_file)

print(len(train_data))
print(len(eval_data))



3016
2988
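
These counts are consistent with the 1:1 expansion. As a rough check, assuming every JFLEG sentence carries four corrections (as in the example printed earlier), the gap between the upper bound and the observed row count is the number of pairs dropped by the blank-string filter:

```python
# Split sizes and CSV row counts are the values printed in this notebook.
CORRECTIONS_PER_SENTENCE = 4  # assumption: four references per sentence

splits = {"train": (755, 3016), "eval": (748, 2988)}

dropped = {}
for name, (num_sentences, num_rows) in splits.items():
    # Upper bound: every sentence contributes one row per correction.
    upper_bound = num_sentences * CORRECTIONS_PER_SENTENCE
    dropped[name] = upper_bound - num_rows

print(dropped)
```

Under that assumption, four pairs were skipped in each split.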


In [40]:
train_data.head()

Unnamed: 0,input,target
0,So I think we can not live if old people could...,So I think we would not be alive if our ancest...
1,So I think we can not live if old people could...,So I think we could not live if older people d...
2,So I think we can not live if old people could...,So I think we can not live if old people could...
3,So I think we can not live if old people could...,So I think we can not live if old people can n...
4,For not use car.,Not for use with a car.


In [41]:
eval_data.head()

Unnamed: 0,input,target
0,New and new technology has been introduced to ...,New technology has been introduced to society.
1,New and new technology has been introduced to ...,New technology has been introduced into the so...
2,New and new technology has been introduced to ...,Newer and newer technology has been introduced...
3,New and new technology has been introduced to ...,Newer and newer technology has been introduced...
4,One possible outcome is that an environmentall...,One possible outcome is that an environmentall...


## 2) C4_200M dataset
Process C4_200M. The full dataset is very large (!), so we stream it and pull only the first 10K examples to start.

In [46]:
c4_dataset = load_dataset("liweili/c4_200m", streaming = True)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [47]:
c4_dataset.keys()

dict_keys(['train'])

In [53]:
iterator = iter(c4_dataset['train'])


In [None]:
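# A streaming iterator has no len(); peek at one example from a fresh
# iterator so the main iterator above is not advanced.
sample = next(iter(c4_dataset['train']))
print(sample)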

In [56]:
def c4_generate_csv(csv_path, iterator, num_examples, encoding):
    """Pull num_examples items from the streaming iterator, fix spacing, and write non-blank pairs to CSV."""
    with open(csv_path, 'w', newline='', encoding=encoding) as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for _ in range(num_examples):
            example = next(iterator)
            input_text = correct_spacing(example["input"])
            correction = correct_spacing(example["output"])
            # Skip pairs where either side is blank.
            if input_text and correction:
                writer.writerow([input_text, correction])
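
Calling `next()` in a fixed-size loop raises `StopIteration` if the stream ends early. One alternative, shown here as a sketch and not as the notebook's method, is `itertools.islice`, which simply stops when the source runs out. The `fake_stream` generator is a hypothetical stand-in for the streaming dataset:

```python
from itertools import islice

def fake_stream(n):
    """Hypothetical stand-in for a streaming dataset of n examples."""
    for i in range(n):
        yield {"input": f"sentence {i}", "output": f"Sentence {i}."}

stream = fake_stream(5)

# Take the first 3 examples; the iterator keeps its position afterwards.
first_batch = list(islice(stream, 3))

# Asking for more than remains yields only what is left (no exception).
second_batch = list(islice(stream, 10))

print(len(first_batch), len(second_batch))
```

The same position-keeping behavior applies to the notebook's shared iterator: a second export call continues from where the first one stopped and does not start over.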

In [57]:
c4_dir = Path.cwd()/'c4_dataset'
c4_dir.mkdir(parents=True,exist_ok=True)

You can modify the following cells to build a CSV file with the desired number of examples; here we use 10k for a quick test.

In [58]:
c4_filename = c4_dir/'c4train_10k.csv'

In [63]:
c4_generate_csv(c4_filename, iterator, num_examples=10000, encoding='UTF-8')

In [64]:
c4_data = pd.read_csv(c4_filename)

print(len(c4_data))

10000


In [65]:
c4_data.head()

Unnamed: 0,input,target
0,Found in the south of Dorset Bournemouth had 7...,"Found in the south of Dorset, Bournemouth has ..."
1,Morrison finished 17th out of 25. with a score...,Morrison finished 17th out of 25 with a score ...
2,Routinely (maybe weekly/monthly) check for new...,Routinely (maybe weekly/monthly) check for new...
3,First day of Rest=VX oil change oil filter and...,"First day of rest=VX oil change, oil filter an..."
4,Books; A good selection of _ ? books on many d...,Books; A good selection of Art books on many d...


## 3) Merge datasets
Create a single training file by combining jtrain and c4train

In [67]:
merge_list = [j_train_file, c4_filename, ]

In [68]:
combined_csv = pd.concat([pd.read_csv(fn) for fn in merge_list])


In [69]:
merged_name = "gtrain_10k.csv"

In [70]:
combined_csv.to_csv(merged_name, index=False, encoding = 'utf-8-sig', )
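
The `utf-8-sig` encoding prepends a byte-order mark (BOM) to the file. A reader that decodes the file as plain `utf-8` will see the BOM attached to the first header name, which can matter for downstream readers that do not strip it. A small standard-library sketch of the effect (`read_header` is a hypothetical helper and the sample row is illustrative):

```python
import csv
import io

# Encode a tiny CSV the way 'utf-8-sig' writes it: BOM first, then the text.
raw = "input,target\r\nFor not use car.,Not for use with a car.\r\n".encode("utf-8-sig")

def read_header(data, encoding):
    """Decode the bytes with the given encoding and return the CSV header row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))

header_plain = read_header(raw, "utf-8")      # BOM stays on the first field
header_sig = read_header(raw, "utf-8-sig")    # BOM is stripped

print(header_plain == header_sig)
```

If the training code looks columns up by name, opening the file with `utf-8-sig` (or writing it as plain `utf-8`) avoids a mismatch on the `input` column.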

In [71]:
train_data = pd.read_csv(merged_name)

print(len(train_data))
train_data.head()

13016


Unnamed: 0,input,target
0,So I think we can not live if old people could...,So I think we would not be alive if our ancest...
1,So I think we can not live if old people could...,So I think we could not live if older people d...
2,So I think we can not live if old people could...,So I think we can not live if old people could...
3,So I think we can not live if old people could...,So I think we can not live if old people can n...
4,For not use car.,Not for use with a car.


In [73]:
eval_name = "grammar_validation.csv"

In [74]:
eval_csv = pd.read_csv(j_eval_file)
eval_csv.to_csv(eval_name, index=False, encoding = 'utf-8-sig', )

In [75]:
eval_data = pd.read_csv(eval_name)

print(len(eval_data))
eval_data.head()

2988


Unnamed: 0,input,target
0,New and new technology has been introduced to ...,New technology has been introduced to society.
1,New and new technology has been introduced to ...,New technology has been introduced into the so...
2,New and new technology has been introduced to ...,Newer and newer technology has been introduced...
3,New and new technology has been introduced to ...,Newer and newer technology has been introduced...
4,One possible outcome is that an environmentall...,One possible outcome is that an environmentall...
