In [1]:
import pandas as pd
from tqdm import tqdm

import re
import unicodedata

**Set constants**

In [2]:
DATASET_PATH = "../.data/miguel"
SRC_LANG = "en"
TRG_LANG = "es"

## Load dataset

This dataset is made up of two files (src, trg) for each split (train, dev and test). All files are in plain text with UTF-8 encoding.

In [3]:
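def load_dataset(filename_src, filename_trg):
    # Read each file fully and split on '\n' only, so both sides stay aligned
    with open(filename_src, encoding='utf-8') as f:
        file_src = f.read().split('\n')
    with open(filename_trg, encoding='utf-8') as f:
        file_trg = f.read().split('\n')
    # A parallel corpus must have the same number of lines on both sides
    assert len(file_src) == len(file_trg)
    return file_src, file_trg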

In [4]:
(train_src, train_trg) = load_dataset(filename_src=f"{DATASET_PATH}/raw/europarl.en", filename_trg=f"{DATASET_PATH}/raw/europarl.es")
(dev_src, dev_trg) = load_dataset(filename_src=f"{DATASET_PATH}/raw/dev.en", filename_trg=f"{DATASET_PATH}/raw/dev.es")
(test_src, test_trg) = load_dataset(filename_src=f"{DATASET_PATH}/raw/test.en", filename_trg=f"{DATASET_PATH}/raw/test.es")

In [5]:
print(f"Train => Total sentences: SRC={len(train_src)} | TRG={len(train_trg)}")
print(f"Dev => Total sentences: SRC={len(dev_src)} | TRG={len(dev_trg)}")
print(f"Test => Total sentences: SRC={len(test_src)} | TRG={len(test_trg)}")

Train => Total sentences: SRC=1960642 | TRG=1960642
Dev => Total sentences: SRC=3004 | TRG=3004
Test => Total sentences: SRC=3001 | TRG=3001
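
Note that `split('\n')` returns a trailing empty string when a file ends with a newline, so each count above may include one empty entry. A minimal sketch with made-up text (the sample string is an assumption, not taken from the dataset):

```python
# Made-up two-line "file" that ends with a newline, as most text files do
text = "first sentence\nsecond sentence\n"

by_split = text.split("\n")        # keeps a trailing empty string
by_splitlines = text.splitlines()  # drops it

print(len(by_split), by_split)
print(len(by_splitlines), by_splitlines)
```

`str.splitlines()` avoids the empty entry, but it also breaks on other line boundaries (for example `\u2028` or `\x85`), which could desynchronize the two sides of a parallel corpus. Splitting on `'\n'` and dropping the empty pair afterwards is the more conservative choice.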


## Qualitative exploration

By simply opening the text files and exploring them, I have noticed the following things:

**Train:**
- Two files (en, es)
- `<0 `
- Chars like: `NBSP`
- Short sentences, around 50-100 words
- Sentences starting with `(`
- UTF-8
- No tokenization done (like replacing numbers with `NUM`, dates with `DATE` or things like that)
- Words/Sentences hard to translate: `(H-0521/00)`, `78/319/CEE`
- Last line empty
- Some sentences end with period, others don't

**Dev:**
- Two files (en, es)
- Chars like: `NBSP`, `ZWSP`
- Short sentences, around 50-100 words
- Sentences starting with `"`
- UTF-8
- No tokenization done (like replacing numbers with `NUM`, dates with `DATE` or things like that)
- Words/Sentences hard to translate: `Wolfgang Schäuble`, `Hašek`
- Last line empty
- Some sentences end with period, others don't


**Test:**
- Two files (en, es)
- `<seg id=` tags
- Chars like: `NBSP`, `ZWSP`
- Short sentences, around 50-100 words
- Sentences starting with `"`
- UTF-8
- No tokenization done (like replacing numbers with `NUM`, dates with `DATE` or things like that)
- Words/Sentences hard to translate: `Nikolaev`, `www.kpks.cz`, `(0-0)`
- Last line empty
- Some sentences end with period, others don't
- Multiple formats for dates: `20.12.2012`
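
Characters such as `NBSP` and `ZWSP` are invisible in most editors, so a small scan can back up the manual inspection. The sketch below is only an illustration: `count_invisible_chars` is a hypothetical helper and the sample lines are made up. It counts characters in the Unicode categories `Zs` (space separators other than the plain space) and `Cf` (format characters).

```python
import unicodedata
from collections import Counter

def count_invisible_chars(lines):
    """Count non-ASCII space separators (Zs) and format characters (Cf)."""
    counts = Counter()
    for line in lines:
        for ch in line:
            if ch != " " and unicodedata.category(ch) in {"Zs", "Cf"}:
                label = f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
                counts[label] += 1
    return counts

# Made-up sample: one NBSP (U+00A0) and one ZWSP (U+200B)
sample = ["Price:\u00a010 EUR", "zero\u200bwidth", "plain text"]
counts = count_invisible_chars(sample)
for label, n in counts.most_common():
    print(label, n)
```

Running the same function over `train_src`, `dev_src` and `test_src` would give a per-split inventory of such characters.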


**File sizes:**

```bash
-rw-r--r-- 1 salvacarrion salvacarrion 373K ene 23  2013 dev.en
-rw-r--r-- 1 salvacarrion salvacarrion 417K ene 23  2013 dev.es
-rw-rw-r-- 1 salvacarrion salvacarrion 281M mar 31  2020 europarl.en
-rw-rw-r-- 1 salvacarrion salvacarrion 310M mar 31  2020 europarl.es
-rw-r--r-- 1 salvacarrion salvacarrion 329K mar 31  2020 test.en
-rw-r--r-- 1 salvacarrion salvacarrion 374K mar 31  2020 test.es
```

### Head and tails

Now, I want to see the first and last *n* pairs of sentences for each partition to check whether everything has been read properly and matches the head and tail of the raw-text files.

Below we see that there is nothing strange here, except that we need to perform some cleaning (removing the empty line, XML tags, etc.).

In [6]:
def view_raw(src_raw, trg_raw, indices):
    for i, idx in enumerate(indices):
        (src, trg) = src_raw[idx], trg_raw[idx]
        print(f"#{i+1}: " + "-"*20)
        print(f"src => {src}")
        print(f"trg => {trg}")
    print("")


In [7]:
n=3
print("Head: " + "#"*20)
print("(Firsts) Train dataset: " + "*"*20)
view_raw(train_src, train_trg, indices=range(0, n))

print("(Firsts) Dev dataset: " + "*"*20)
view_raw(dev_src, dev_trg, indices=range(0, n))

print("(Firsts) Test dataset: " + "*"*20)
view_raw(test_src, test_trg, indices=range(0, n))

print("Tail: " + "#"*20)
print("(Lasts) Train dataset: " + "*"*20)
view_raw(train_src, train_trg, indices=range(-1,-n-1,-1))

print("(Lasts) Dev dataset: " + "*"*20)
view_raw(dev_src, dev_trg, indices=range(-1,-n-1,-1))

print("(Lasts) Test dataset: " + "*"*20)
view_raw(test_src, test_trg, indices=range(-1,-n-1,-1))


Head: ####################
(Firsts) Train dataset: ********************
#1: --------------------
src => Resumption of the session
trg => Reanudación del período de sesiones
#2: --------------------
src => I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
trg => Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el viernes 17 de diciembre pasado, y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones.
#3: --------------------
src => Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
trg => Como todos han podido comprobar, el gran "efecto del año 2000" no se ha producido. En cambio, los ciudadanos de varios de nuestros países han sido víctimas de

## Preprocessing dataset

From the previous exploration we've seen that we need to perform a bit of preprocessing. Here, I'll apply the same cleaning to each partition, regardless of the language, since it is a pretty general cleaning. Cleaning steps:

- Remove last empty row from: Train, Dev and Test
- Remove XML tags
- Apply Unicode normalization (NFKC)
- Collapse repeated spaces into one
- Strip lines
- Remove pair, if any of the lines is empty

I could also set everything to lowercase, but I prefer to leave that for the tokenizer, and have the preprocessed files as "raw" as possible.


In [8]:
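# Define regex patterns (raw strings, so "\d" is not treated as a string escape)
p_xml = re.compile(r'^<seg id="\d+">')  # opening <seg> tag at the start of a line
p_whitespace = re.compile(r" +")

def preprocess_text(text):
    # Remove the opening XML tag
    text = p_xml.sub('', text)

    # Normalization Form Compatibility Composition (e.g. NBSP => regular space)
    text = unicodedata.normalize("NFKC", text)

    # Collapse repeated spaces "   " => " " (done after NFKC so converted NBSPs are included)
    text = p_whitespace.sub(' ', text)

    # Strip leading and trailing whitespace
    text = text.strip()

    return text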
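
To make the NFKC step in `preprocess_text` more concrete, here is a small sketch of what `unicodedata.normalize("NFKC", ...)` does to a few made-up strings (the samples are assumptions, not lines from the dataset):

```python
import unicodedata

# Made-up samples covering characters seen during the exploration and two classic cases
samples = {
    "no-break space": "a\u00a0b",
    "zero width space": "a\u200bb",
    "fi ligature": "\ufb01n",
    "fullwidth digit": "\uff11",
}

normalized = {}
for label, s in samples.items():
    normalized[label] = unicodedata.normalize("NFKC", s)
    print(f"{label}: {s!r} -> {normalized[label]!r}")
```

Two things are worth noting. `NBSP` becomes a regular space, so runs of spaces can appear after normalization and are best collapsed after this step. `ZWSP` has no compatibility mapping and survives NFKC, so it would need its own rule if it should be removed.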

In [9]:
def preprocess_dataset(data_src, data_trg):
    data_src_new, data_trg_new = [], []


    total = len(data_src)
    for i in tqdm(range(total), total=total):
        src, trg = data_src[i], data_trg[i]

        # Preprocess
        src = preprocess_text(src)
        trg = preprocess_text(trg)

        # Remove empty line
        if len(src) > 0 and len(trg) > 0:
            data_src_new.append(src)
            data_trg_new.append(trg)
    return data_src_new, data_trg_new
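
The important property of this step is that a pair is dropped as a whole, so source and target never get out of sync. A toy check of that idea, using a hypothetical `drop_empty_pairs` helper and made-up data:

```python
def drop_empty_pairs(src_lines, trg_lines):
    """Keep only the pairs where both sides are non-empty after stripping."""
    kept = [(s, t) for s, t in zip(src_lines, trg_lines) if s.strip() and t.strip()]
    src_kept = [s for s, _ in kept]
    trg_kept = [t for _, t in kept]
    return src_kept, trg_kept

# Made-up data: pair 2 has an empty source, pair 3 a blank target, pair 4 is empty
src = ["Hello", "", "Thanks", ""]
trg = ["Hola", "Adiós", " ", ""]

src_clean, trg_clean = drop_empty_pairs(src, trg)
print(list(zip(src_clean, trg_clean)))
```

Filtering each side independently would instead leave lists of different lengths and misaligned translations.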

In [10]:
# Preprocess
train_src, train_trg = preprocess_dataset(train_src, train_trg)
dev_src, dev_trg = preprocess_dataset(dev_src, dev_trg)
test_src, test_trg = preprocess_dataset(test_src, test_trg)

100%|██████████| 1960642/1960642 [00:31<00:00, 63060.33it/s]
100%|██████████| 3004/3004 [00:00<00:00, 74557.90it/s]
100%|██████████| 3001/3001 [00:00<00:00, 80385.65it/s]


**Check the number of pairs again**

After the cleaning process, we see that only one pair has been removed from each file (the empty line).

In [11]:
print(f"Train => Total sentences: SRC={len(train_src)} | TRG={len(train_trg)}")
print(f"Dev => Total sentences: SRC={len(dev_src)} | TRG={len(dev_trg)}")
print(f"Test => Total sentences: SRC={len(test_src)} | TRG={len(test_trg)}")

Train => Total sentences: SRC=1960641 | TRG=1960641
Dev => Total sentences: SRC=3003 | TRG=3003
Test => Total sentences: SRC=3000 | TRG=3000


### From Pandas to CSV

Once we have the raw files cleaned, we can load them into Pandas and then save them as CSV.

Pandas is an astonishingly good library for working with tabular data. However, here I simply use it to save the CSV files. The reason for using CSV is that it is an easy-to-read format, widely supported by many libraries in the Python data science stack. Additionally, it compresses well, which saves a lot of storage.
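
Sentences contain commas and double quotes, so the CSV format has to quote those fields to stay parseable. The sketch below uses only the standard-library `csv` module and made-up sentence pairs to show that such text survives a write/read round trip. As far as I know, `DataFrame.to_csv` applies the same minimal-quoting convention by default, but treat that as an assumption.

```python
import csv
import io

# Made-up pairs containing commas and double quotes
rows = [
    ("Hello, world", 'Dijo "hola", y se fue'),
    ("No commas here", "Sin comas"),
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting is csv.QUOTE_MINIMAL
writer.writerow(["en", "es"])
writer.writerows(rows)
print(buffer.getvalue())

# Read it back and compare with the original pairs
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
restored = [tuple(r) for r in reader]
```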

In [12]:
train_raw = {SRC_LANG: train_src, TRG_LANG: train_trg}
train_df = pd.DataFrame(train_raw, columns=[SRC_LANG, TRG_LANG])

dev_raw = {SRC_LANG: dev_src, TRG_LANG: dev_trg}
dev_df = pd.DataFrame(dev_raw, columns=[SRC_LANG, TRG_LANG])

test_raw = {SRC_LANG: test_src, TRG_LANG: test_trg}
test_df = pd.DataFrame(test_raw, columns=[SRC_LANG, TRG_LANG])

**Preview pandas**

Now we take a look at the Pandas object before saving it to CSV

In [13]:
print("Train:")
print(train_df)

print("Dev:")
print(dev_df)

print("Test:")
print(test_df)

Train:
                                                        en  \
0                                Resumption of the session   
1        I declare resumed the session of the European ...   
2        Although, as you will have seen, the dreaded '...   
3        You have requested a debate on this subject in...   
4        In the meantime, I should like to observe a mi...   
...                                                    ...   
1960636  I would also like, although they are absent, t...   
1960637  I am not going to re-open the 'Millennium or n...   
1960638                         Adjournment of the session   
1960639  I declare the session of the European Parliame...   
1960640             (The sitting was closed at 10.50 a.m.)   

                                                        es  
0                      Reanudación del período de sesiones  
1        Declaro reanudado el período de sesiones del P...  
2        Como todos han podido comprobar, el gran "efec...  
3   

In [14]:
train_df.to_csv(f"{DATASET_PATH}/preprocessed/train.csv", index=False)
dev_df.to_csv(f"{DATASET_PATH}/preprocessed/dev.csv", index=False)
test_df.to_csv(f"{DATASET_PATH}/preprocessed/test.csv", index=False)
print("CSV files saved!")

CSV files saved!


**Save individual languages**

Now we save each language column to its own CSV file.

In [15]:
# For training
train_df_src = train_df[SRC_LANG]
train_df_trg = train_df[TRG_LANG]
train_df_src.to_csv(f"{DATASET_PATH}/preprocessed/train_{SRC_LANG}.csv", index=False)
train_df_trg.to_csv(f"{DATASET_PATH}/preprocessed/train_{TRG_LANG}.csv", index=False)

# For validation (dev)
dev_df_src = dev_df[SRC_LANG]
dev_df_trg = dev_df[TRG_LANG]
dev_df_src.to_csv(f"{DATASET_PATH}/preprocessed/dev_{SRC_LANG}.csv", index=False)
dev_df_trg.to_csv(f"{DATASET_PATH}/preprocessed/dev_{TRG_LANG}.csv", index=False)
print("Individual languages saved!")


Individual languages saved!
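
One caveat: as far as I know, `Series.to_csv` writes a header row and applies CSV quoting, so these files are not plain one-sentence-per-line text. If a downstream tool expects plain text, a sketch like the following could be used instead. The `save_lines` helper and the file name are hypothetical, and it assumes no sentence contains a newline (true here, since the data was split on `'\n'`).

```python
import os
import tempfile

def save_lines(sentences, path):
    """Write one sentence per line, UTF-8, with '\n' line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for sentence in sentences:
            f.write(sentence + "\n")

# Made-up sentences, including a comma and quotes that CSV would have escaped
sentences = ["Hello, world", 'He said "hi"', "Last one"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dev.en")  # hypothetical file name
    save_lines(sentences, path)
    with open(path, encoding="utf-8") as f:
        restored = f.read().split("\n")[:-1]  # drop the trailing empty entry

print(restored)
```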
