# Dataset preprocessing

## Installing requirements (install a [PyTorch](https://pytorch.org/get-started/locally/) build compatible with your CUDA version first)

In [None]:
!pip install transformers 
!pip install datasets
!pip install pandas

## Check your CUDA and GPU (preprocessing does not need a GPU; this only confirms that the environment works)

In [1]:
!nvidia-smi

Wed Dec 28 14:15:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
|  0%   48C    P8    34W / 350W |   2802MiB / 24245MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [2]:
from transformers import (AutoConfig, AutoModel, AutoTokenizer)

# The tokenizer is only used to count BERT word pieces when filtering lines by length.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Define the possible labels for each word. A word followed by a punctuation mark gets that mark's label; every other word gets `O`.

In [None]:

special_labels = {',': 'I-COMMA',
                  '.': 'I-DOT',
                  '?': 'I-QMARK',
                  '!': 'I-EMARK',
                  ':': 'I-COLON',
                  ';': 'I-SEMICOLON'}
normal_label = 'O'

## Assign a label to each word in a line. We filter out paragraphs with fewer than 10 tokens (probably not a good sentence) or more than 510 tokens (BERT accepts at most 512 tokens per sequence, and two of those are taken by the special `[CLS]` and `[SEP]` tokens).

In [4]:
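def descrete_and_label(list_of_lines):
    """Return [sentence_id, word, label] rows for every line that passes the length filter."""
    list_of_lists = []
    for i, line in enumerate(list_of_lines):
        # Skip lines that are too short to be a useful sentence or too long for BERT.
        tkn_line = tokenizer.tokenize(line)
        if len(tkn_line) < 10 or len(tkn_line) > 510:
            continue
        # wikitext separates punctuation from words with spaces, so split() isolates it.
        for word in line.split():
            sl = special_labels.get(word)
            if sl and list_of_lists:
                # A punctuation token is not stored as a word;
                # it becomes the label of the word before it.
                list_of_lists[-1][2] = sl
            else:
                list_of_lists.append([i, word, normal_label])
    return list_of_lists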
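
To make the labeling rule concrete, here is a minimal sketch that applies the same rule to a single line. `label_words` is a hypothetical helper, not part of the notebook: it leaves out the tokenizer-based length filter so that it runs without downloading a model, and it only looks back within the current line.

```python
# Same label table as in the cell above, repeated so the sketch runs on its own.
special_labels = {',': 'I-COMMA', '.': 'I-DOT', '?': 'I-QMARK',
                  '!': 'I-EMARK', ':': 'I-COLON', ';': 'I-SEMICOLON'}
normal_label = 'O'

def label_words(line, sentence_id=0):
    """Hypothetical helper: label one whitespace-separated line (no length filter)."""
    rows = []
    for word in line.split():
        punct_label = special_labels.get(word)
        if punct_label and rows:
            # Punctuation is folded into the label of the previous word.
            rows[-1][2] = punct_label
        else:
            rows.append([sentence_id, word, normal_label])
    return rows

rows = label_words("Hello , world . Is it working ?")
for row in rows:
    print(row)
# [0, 'Hello', 'I-COMMA']
# [0, 'world', 'I-DOT']
# [0, 'Is', 'O']
# [0, 'it', 'O']
# [0, 'working', 'I-QMARK']
```

Note that two punctuation tokens in a row overwrite each other, so only the last one survives as the label of the preceding word.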

In [5]:
from datasets import load_dataset, ReadInstruction

In [19]:
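import os
import pandas as pd

def save_dataset(ds, path):
    # Keep only lines longer than 20 characters; this drops empty and very short lines.
    filtered = [i['text'] for i in ds if len(i['text']) > 20]
    dataset_1 = descrete_and_label(filtered)
    train_data = pd.DataFrame(dataset_1, columns=["sentence_id", "words", "labels"])
    # Create the output directory if it does not exist yet, otherwise to_csv fails.
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    train_data.to_csv(path, index=False)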

## We divide the `train` split of the main dataset into 10 bins. We train our models on the first 9 bins and test them on the last bin.

In [None]:
binz = 10
for i in range(binz):
    print(i * (100/binz), (i+1) * (100/binz))
    sub_dataset = load_dataset('wikitext', 'wikitext-103-v1', split=ReadInstruction(
      'train', from_=int(i * (100/binz)), to=int((i+1) * (100/binz)), unit='%', rounding='pct1_dropremainder'))
    print(len(sub_dataset))
    save_dataset(sub_dataset, f'./preprocessed_wikitext/train{i}-{binz}.csv')
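
The sketch below illustrates how the percent boundaries turn into row ranges, assuming that `pct1_dropremainder` gives every 1% exactly `num_examples // 100` rows and drops whatever is left over (this is my reading of the rounding mode described in the `datasets` documentation). `pct1_slice` is a hypothetical helper and the dataset size is made up.

```python
def pct1_slice(num_examples, from_pct, to_pct):
    """Hypothetical helper: absolute row range of a percent slice,
    assuming every 1% holds num_examples // 100 rows."""
    per_pct = num_examples // 100
    return from_pct * per_pct, to_pct * per_pct

binz = 10
num_examples = 1050  # made-up size, not the real size of the train split

slices = [pct1_slice(num_examples, i * 100 // binz, (i + 1) * 100 // binz)
          for i in range(binz)]

print(slices[0], slices[-1])         # (0, 100) (900, 1000)
print(num_examples - slices[-1][1])  # 50 rows fall outside every bin
```

Under this assumption all bins have the same size and do not overlap, at the cost of dropping a few rows at the end of the split.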

## We did not use the `test` and `validation` splits of the main dataset (wikitext-103-v1) for the final models, but we used them while developing this research. You can skip the rest of this notebook.

In [21]:
validation = load_dataset('wikitext', 'wikitext-103-v1', split='validation')
test = load_dataset('wikitext', 'wikitext-103-v1', split='test')

Found cached dataset wikitext (/home/mostafa/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Found cached dataset wikitext (/home/mostafa/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)


In [22]:
len(validation), len(test)

(3760, 4358)

In [23]:
save_dataset(validation, './preprocessed_wikitext/validation.csv')

In [24]:
save_dataset(test, './preprocessed_wikitext/test.csv')
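
A quick way to sanity-check a saved file is to rebuild the punctuated text from its rows. The sketch below does this with the standard `csv` module on a made-up in-memory sample that follows the column layout written by `save_dataset` (`sentence_id`, `words`, `labels`). `rebuild_sentences` and `label_to_punct` are hypothetical names introduced here for illustration.

```python
import csv
import io

# Inverse of the label table used during preprocessing.
label_to_punct = {'I-COMMA': ',', 'I-DOT': '.', 'I-QMARK': '?',
                  'I-EMARK': '!', 'I-COLON': ':', 'I-SEMICOLON': ';'}

def rebuild_sentences(csv_file):
    """Hypothetical helper: map sentence_id -> text with punctuation restored."""
    sentences = {}
    for row in csv.DictReader(csv_file):
        tokens = sentences.setdefault(int(row['sentence_id']), [])
        tokens.append(row['words'])
        punct = label_to_punct.get(row['labels'])
        if punct:
            # Put the punctuation mark back after the word that carries its label.
            tokens.append(punct)
    return {sid: ' '.join(tokens) for sid, tokens in sentences.items()}

# Made-up sample in the same column layout that save_dataset writes.
sample = io.StringIO(
    "sentence_id,words,labels\n"
    "0,Hello,I-COMMA\n"
    "0,world,I-DOT\n"
    "3,Is,O\n"
    "3,it,O\n"
    "3,working,I-QMARK\n"
)
restored = rebuild_sentences(sample)
print(restored)  # {0: 'Hello , world .', 3: 'Is it working ?'}
```

To run it on a real file, pass an open file handle instead of the sample, for example `open('./preprocessed_wikitext/validation.csv', newline='')`. Sentence ids have gaps because filtered-out lines keep their index.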