# Data preparation

This notebook prepares the data for a part-of-speech tagging model for the Croatian language.

In [None]:
!git clone https://github.com/facebookresearch/fastText.git
!pip install fastText/.

Cloning into 'fastText'...
remote: Enumerating objects: 3995, done.
remote: Counting objects: 100% (1023/1023), done.
remote: Compressing objects: 100% (183/183), done.
remote: Total 3995 (delta 893), reused 863 (delta 835), pack-reused 2972
Receiving objects: 100% (3995/3995), 8.29 MiB | 15.24 MiB/s, done.
Resolving deltas: 100% (2531/2531), done.
Processing ./fastText
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext==0.9.2)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4214525 sha256=89e5016d7393b5852b2b25f8fbc8e1cdb41487e4097f2dc34cb98eb9df499172
  Stored in directory: /tmp/pip-ephem-wheel-cache-00fw

Import the required libraries:

In [None]:
import fasttext
import fasttext.util
import gzip
import glob
import numpy as np
import os
import pandas as pd
import pickle
import shutil
from sklearn.model_selection import train_test_split
import string
from google.colab import drive
import xml.etree.ElementTree as ET

In [None]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Data Download

The data is available at https://www.clarin.si/repository/xmlui/handle/11356/1064 and consists of 14 compressed XML files.

Because the dataset is far larger than the available compute can handle, we will use just 0.14% of the full dataset.

As the "training model" notebook shows in more detail, training on the full dataset would take roughly 424 hours.

In [None]:
%cd /content/gdrive/MyDrive/opj/data
!ls

/content/gdrive/MyDrive/opj/data
checkpoints  fastText  hrWaC2.1.01_1.txt  hrWaC2.1.01_1.xml  hrWaC2.1.01.xml.gz  train	validation


In [None]:
!curl --remote-name-all https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1064{/hrWaC2.1.01.xml.gz}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  618M  100  618M    0     0   887k      0  0:11:53  0:11:53 --:--:--  249k


### Data Extraction

Let's unzip the file and split it into chunks of 100,000 sentences each.

In [None]:
def gunzip_file(gz_file_path, xml_file_path):
    print(f"Gunzipping file: {gz_file_path}")
    with gzip.open(gz_file_path, 'rb') as f_in:
        with open(xml_file_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

def split_xml_into_chunks(original_xml_path, chunk_size):
    """Split the file into pieces of `chunk_size` sentences, then delete the original."""
    print(f"Splitting XML into chunks: {original_xml_path}")
    base_name, extension = os.path.splitext(os.path.basename(original_xml_path))
    current_chunk = 1
    sentences_in_current_chunk = 0

    # The chunk handle is swapped inside the loop, so it is closed explicitly in `finally`
    chunk_file = open(f'{base_name}_{current_chunk}{extension}', 'w', encoding='utf-8')
    try:
        with open(original_xml_path, 'r', encoding='utf-8') as original_file:
            for line in original_file:
                chunk_file.write(line)

                # A closing </s> tag marks the end of one sentence
                if '</s>' in line:
                    sentences_in_current_chunk += 1
                    if sentences_in_current_chunk % int(chunk_size/4) == 0:
                        print(f"Processed {(sentences_in_current_chunk/chunk_size)*100}% of sentences")

                # The current chunk is full: start writing the next one
                if sentences_in_current_chunk >= chunk_size:
                    current_chunk += 1
                    sentences_in_current_chunk = 0
                    chunk_file.close()
                    chunk_file = open(f'{base_name}_{current_chunk}{extension}', 'w', encoding='utf-8')
    finally:
        chunk_file.close()

    # Empty the file before deleting it (presumably so the large file does not linger in the Drive trash)
    open(original_xml_path, 'w').close()
    os.remove(original_xml_path)
    print(f"Original XML file removed: {original_xml_path}")

xml_gz_file_path = '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01.xml.gz'
xml_file_path = '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01.xml'
gunzip_file(xml_gz_file_path, xml_file_path)
split_xml_into_chunks(xml_file_path, chunk_size=100000)

Gunzipping file: /content/gdrive/MyDrive/opj/data/hrWaC2.1.01.xml.gz
Splitting XML into chunks: /content/gdrive/MyDrive/opj/data/hrWaC2.1.01.xml
Processed 25.0% of sentences
Processed 50.0% of sentences
Processed 75.0% of sentences
Processed 100.0% of sentences
Processed 25.0% of sentences
Processed 50.0% of sentences
Processed 75.0% of sentences
Processed 100.0% of sentences
Processed 25.0% of sentences
Processed 50.0% of sentences
Processed 75.0% of sentences
Processed 100.0% of sentences
Processed 25.0% of sentences
Processed 50.0% of sentences
Processed 75.0% of sentences
Processed 100.0% of sentences
Processed 25.0% of sentences
Processed 50.0% of sentences
Processed 75.0% of sentences
Processed 100.0% of sentences
Processed 25.0% of sentences
Processed 50.0% of sentences
Processed 75.0% of sentences
Processed 100.0% of sentences
Processed 25.0% of sentences
Processed 50.0% of sentences
Processed 75.0% of sentences
Processed 100.0% of sentences
Processed 25.0% of sentences
Process
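
The decompress-then-split steps above could also be done in a single pass, without first writing the full uncompressed file to disk. The following is only a minimal sketch, assuming (as the code above does) that every sentence ends with a line containing `</s>`. `iter_sentence_chunks` is a hypothetical helper that is not part of this notebook, and the sample text is made up for the demonstration.

```python
import gzip
import io

def iter_sentence_chunks(lines, chunk_size):
    """Yield lists of lines, each holding `chunk_size` sentences (the last may hold fewer)."""
    chunk, sentences = [], 0
    for line in lines:
        chunk.append(line)
        if '</s>' in line:
            sentences += 1
            if sentences == chunk_size:
                yield chunk
                chunk, sentences = [], 0
    if chunk:  # leftover lines after the last full chunk
        yield chunk

# Tiny in-memory stand-in for the compressed corpus file (hypothetical content)
sample = "<s>\na\n</s>\n<s>\nb\n</s>\n<s>\nc\n</s>\n"
compressed = gzip.compress(sample.encode('utf-8'))

# gzip.open in text mode decompresses lazily while we iterate over lines
with gzip.open(io.BytesIO(compressed), 'rt', encoding='utf-8') as f:
    chunks = list(iter_sentence_chunks(f, chunk_size=2))

print([len(c) for c in chunks])  # number of lines in each chunk
```

With a real file, each yielded chunk would be written to its own output file instead of being collected in a list.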

For ease of data extraction, we rename all the .xml files to .txt (the file contents are unchanged):

In [None]:
directory = '/content/gdrive/MyDrive/opj/data/'
pattern = os.path.join(directory, '*.xml')

xml_files = glob.glob(pattern)

for xml_file in xml_files:
    txt_file = xml_file[:-4] + '.txt'  #replaces the extension
    os.rename(xml_file, txt_file)
    print(f"Renamed '{xml_file}' to '{txt_file}'")

Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_1.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_1.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_2.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_2.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_3.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_3.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_4.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_4.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_5.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_5.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_6.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_6.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_7.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_7.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_8.xml' to '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_8.txt'
Renamed '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_9.xml' to 
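
As a side note, the extension swap done above by slicing off the last four characters can also be expressed with `pathlib`. This is just an alternative sketch, not what the notebook ran; it only computes the new name and does not touch any file.

```python
from pathlib import PurePosixPath

xml_path = PurePosixPath('/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_1.xml')

# with_suffix replaces only the final suffix (.xml), leaving the dots in the stem alone
txt_path = xml_path.with_suffix('.txt')
print(txt_path)
```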

In [None]:
def load_and_process_data(filepath):
    sentences = []  # List to hold sentences, lists of (word, tag) tuples
    current_sentence = []

    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()

            # Starting a new sentence
            if line == "<s>":
                current_sentence = []
            # End the current sentence
            elif line == "</s>":
                if current_sentence:
                    sentences.append(current_sentence)
                    current_sentence = []
            # Process lines that are not xml tags or sentence markers
            elif not line.startswith("<"):
                parts = line.split("\t")  # Columns are tab-separated
                if len(parts) >= 4:
                    word = parts[0]
                    tag = parts[3]  # Tag is the fourth element
                    current_sentence.append((word, tag))

    return sentences

filepath = '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_1.txt'
sentences = load_and_process_data(filepath)

Let's take a look at one of the sentences:

In [None]:
if sentences:
    first_sentence = sentences[0]
    sentence_str = ' '.join([word for word, tag in first_sentence])
    print("First sentence:", sentence_str)
    print("With tags:", first_sentence)
else:
    print("No sentences found in the data.")

First sentence: Adam se otpočetka nudi s tri benzinska motora : 1.2 70 KS , 1.4 87 KS i 1.4 100 KS .
With tags: [('Adam', 'Npmsn'), ('se', 'Px--sa'), ('otpočetka', 'Rgp'), ('nudi', 'Vmr3s'), ('s', 'Si'), ('tri', 'Mlc'), ('benzinska', 'Agpmsgn'), ('motora', 'Ncmsg'), (':', 'Z'), ('1.2', 'Mdc'), ('70', 'Mdc'), ('KS', 'Y'), (',', 'Z'), ('1.4', 'Mdc'), ('87', 'Mdc'), ('KS', 'Y'), ('i', 'Cc'), ('1.4', 'Mdc'), ('100', 'Mdc'), ('KS', 'Y'), ('.', 'Z')]
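
The first letter of each tag identifies the part-of-speech category, which is the only part of the tag this notebook uses later on. As a quick illustration, the sketch below counts the categories in the sentence printed above. The list is copied from that output; `category_counts` is a name introduced here only for the example.

```python
from collections import Counter

# The (word, tag) pairs of the first sentence, copied from the output above
first_sentence = [
    ('Adam', 'Npmsn'), ('se', 'Px--sa'), ('otpočetka', 'Rgp'), ('nudi', 'Vmr3s'),
    ('s', 'Si'), ('tri', 'Mlc'), ('benzinska', 'Agpmsgn'), ('motora', 'Ncmsg'),
    (':', 'Z'), ('1.2', 'Mdc'), ('70', 'Mdc'), ('KS', 'Y'), (',', 'Z'),
    ('1.4', 'Mdc'), ('87', 'Mdc'), ('KS', 'Y'), ('i', 'Cc'), ('1.4', 'Mdc'),
    ('100', 'Mdc'), ('KS', 'Y'), ('.', 'Z'),
]

# Keep only the first character of each tag and count occurrences
category_counts = Counter(tag[0] for _, tag in first_sentence)
print(category_counts)
```

In this sentence, numerals (`M`) dominate, and the three punctuation tokens all carry the tag `Z`.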


The next step is to embed the words into a numerical representation usable by the model.

We will use the pretrained fastText word vectors for Croatian, available at https://fasttext.cc/docs/en/crawl-vectors.html

In [None]:
!curl https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hr.300.bin.gz --output cc.hr.300.bin.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4299M  100 4299M    0     0  38.1M      0  0:01:52  0:01:52 --:--:-- 38.5M


In [None]:
!gunzip cc.hr.300.bin.gz


Let's load the embeddings and double-check the dimension:

In [None]:
ft = fasttext.load_model('cc.hr.300.bin')
ft.get_dimension()

300

Originally, I planned to reduce the embeddings to 100 dimensions, but resource constraints and reasonable performance with the 300-dimensional vectors made me reconsider.

In [None]:
#fasttext.util.reduce_model(ft, 100)
#ft.get_dimension()

Testing out the embeddings:

In [None]:
ft.get_nearest_neighbors('zagreb')

[(0.687420666217804, 'zagreb.'),
 (0.6478269100189209, 'zagrebčki'),
 (0.6400067806243896, 'zagrebu'),
 (0.637527585029602, 'zagreb.hr'),
 (0.6205328106880188, '.zagreb'),
 (0.6144486665725708, 'zagrebli'),
 (0.6062915921211243, 'zagrebdox.net'),
 (0.6061169505119324, 'zagrebemo'),
 (0.6033006906509399, 'zagrebu.'),
 (0.5954848527908325, 'zagre-')]
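
The scores in the output above are, to my knowledge, cosine similarities between word vectors, so a value closer to 1 means a closer match. The sketch below shows that measure on toy vectors with NumPy. It does not call fastText, and `cosine_similarity` is a helper written only for this illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for 300-dimensional word vectors
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(a, b))  # partially aligned vectors
print(cosine_similarity(a, c))  # orthogonal vectors
```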

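Let's turn the words into embeddings and map each tag to one of 12 classes, one per part of speech. Only the first letter of the tag is used, punctuation tokens are skipped, and any tag outside the 12 categories maps to -1.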

In [None]:
def is_punctuation(word):
    return all(char in string.punctuation for char in word)

def tag_to_integer(tag):
    tag_to_int = {
        'N': 0,  # Noun
        'V': 1,  # Verb
        'A': 2,  # Adjective
        'P': 3,  # Pronoun
        'R': 4,  # Adverb
        'S': 5,  # Adposition
        'C': 6,  # Conjunction
        'M': 7,  # Numeral
        'Q': 8,  # Particle
        'I': 9,  # Interjection
        'Y': 10, # Abbreviation
        'X': 11 # Residual
    }

    return tag_to_int.get(tag, -1)

def load_and_process_data_with_embeddings(filepath, ft_model):
    sentences_with_embeddings = []  # List to hold sentences with embeddings

    with open(filepath, 'r', encoding='utf-8') as file:
        current_sentence = []  # Current sentence being processed
        for line in file:
            line = line.strip()

            if line == "<s>":
                current_sentence = []
            elif line == "</s>":
                if current_sentence:
                    sentences_with_embeddings.append(current_sentence)
                    current_sentence = []
            elif not line.startswith("<"):
                parts = line.split("\t")  # Ensure this matches the delimiter used in your file
                if len(parts) >= 4:
                    word = parts[0]
                    tag = parts[3]  # Assuming the tag is the fourth element
                    # Ignore punctuation
                    if not is_punctuation(word):
                        # Convert the word to its embedding
                        embedding = ft_model.get_word_vector(word)
                        current_sentence.append((embedding, tag_to_integer(tag[0])))

    return sentences_with_embeddings

filepath = '/content/gdrive/MyDrive/opj/data/hrWaC2.1.01_1.txt'
sentences_with_embeddings = load_and_process_data_with_embeddings(filepath, ft)
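
To make the encoding step above concrete, here is a small sketch that applies the same two helpers to a few (word, tag) pairs taken from the first sentence. It leaves out the embedding lookup so that it runs without the fastText model. `TAG_TO_INT` and `encoded` are names introduced only for this example.

```python
import string

# Same mapping as in tag_to_integer above, hoisted into a module-level constant
TAG_TO_INT = {'N': 0, 'V': 1, 'A': 2, 'P': 3, 'R': 4, 'S': 5,
              'C': 6, 'M': 7, 'Q': 8, 'I': 9, 'Y': 10, 'X': 11}

def is_punctuation(word):
    return all(char in string.punctuation for char in word)

def tag_to_integer(tag):
    return TAG_TO_INT.get(tag, -1)

pairs = [('Adam', 'Npmsn'), ('nudi', 'Vmr3s'), (':', 'Z'), ('1.2', 'Mdc'), ('KS', 'Y')]

# Drop punctuation tokens, then keep the class of the tag's first letter
encoded = [(word, tag_to_integer(tag[0])) for word, tag in pairs if not is_punctuation(word)]
print(encoded)  # → [('Adam', 0), ('nudi', 1), ('1.2', 7), ('KS', 10)]
```

One thing to keep in mind: `string.punctuation` covers ASCII punctuation only, so a token such as `„` is not filtered out. If such a token carries a tag outside the 12 categories, it is encoded as -1, which is also the value used for padding later in this notebook.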

We must also split the data into chunks so that each chunk can be loaded separately during training.

In [None]:
def save_in_chunks(data, base_filepath, chunk_size=10000):
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i+chunk_size]
        filepath = f"{base_filepath}_part_{i//chunk_size}.pkl"
        with open(filepath, 'wb') as file:
            pickle.dump(chunk, file)
        print(f"Saved chunk {i//chunk_size} to {filepath}")

base_filepath = '/content/gdrive/MyDrive/opj/data/embedded_chunked'
save_in_chunks(sentences_with_embeddings, base_filepath)

Saved chunk 0 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_0.pkl
Saved chunk 1 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_1.pkl
Saved chunk 2 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_2.pkl
Saved chunk 3 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_3.pkl
Saved chunk 4 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_4.pkl
Saved chunk 5 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_5.pkl
Saved chunk 6 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_6.pkl
Saved chunk 7 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_7.pkl
Saved chunk 8 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_8.pkl
Saved chunk 9 to /content/gdrive/MyDrive/opj/data/embedded_chunked_part_9.pkl
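
The saved chunks can later be read back one at a time. The sketch below is one possible way to do that, assuming the `_part_<n>.pkl` naming used above. `iter_saved_chunks` is a hypothetical helper, and the demo writes tiny placeholder chunks to a temporary directory rather than to Drive. It sorts by the numeric index because a plain alphabetical sort would place `part_10` before `part_2` once there are more than ten chunks.

```python
import glob
import os
import pickle
import re
import tempfile

def part_index(path):
    """Extract the integer n from a path ending in _part_<n>.pkl."""
    return int(re.search(r'_part_(\d+)\.pkl$', path).group(1))

def iter_saved_chunks(base_filepath):
    """Yield (index, chunk) pairs in numeric order, loading one file at a time."""
    paths = glob.glob(f"{glob.escape(base_filepath)}_part_*.pkl")
    for path in sorted(paths, key=part_index):
        with open(path, 'rb') as file:
            yield part_index(path), pickle.load(file)

# Demo with 12 tiny placeholder chunks in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'embedded_chunked')
    for i in range(12):
        with open(f"{base}_part_{i}.pkl", 'wb') as file:
            pickle.dump([i], file)
    order = [index for index, _ in iter_saved_chunks(base)]

print(order)
```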


We will now pad (or truncate) each sentence to a length of 30 tokens and split the data into training and validation sets (90% / 10%).

In [None]:
def pad_embeddings(sequences, max_len, dim=300):
    padded = np.zeros((len(sequences), max_len, dim))
    for i, sequence in enumerate(sequences):
        length = min(len(sequence), max_len)
        padded[i, :length, :] = sequence[:length]
    return padded

def pad_tags(sequences, max_len, pad_value=-1):
    padded = np.full((len(sequences), max_len), pad_value)
    for i, sequence in enumerate(sequences):
        length = min(len(sequence), max_len)
        padded[i, :length] = sequence[:length]
    return padded

def process_and_save_file(filepath, max_len, train_path, val_path, index):
    with open(filepath, 'rb') as file:
        sentences_with_embeddings = pickle.load(file)

        embeddings = [[emb for emb, _ in sentence] for sentence in sentences_with_embeddings]
        tags = [[tag for _, tag in sentence] for sentence in sentences_with_embeddings]

        # Padding
        embeddings_padded = pad_embeddings(embeddings, max_len)
        tags_padded = pad_tags(tags, max_len)

        X_train, X_val, y_train, y_val = train_test_split(embeddings_padded, tags_padded, test_size=0.1, random_state=42)

        train_filepath = os.path.join(train_path, f'train_{index}.pkl')
        val_filepath = os.path.join(val_path, f'validation_{index}.pkl')

        with open(train_filepath, 'wb') as f:
            pickle.dump((X_train, y_train), f)

        with open(val_filepath, 'wb') as f:
            pickle.dump((X_val, y_val), f)

        print(f"Processed and saved: {train_filepath} and {val_filepath}")


max_len = 30
directory = '/content/gdrive/MyDrive/opj/data/'
train_path = '/content/gdrive/MyDrive/opj/data/train/'
val_path = '/content/gdrive/MyDrive/opj/data/validation/'

os.makedirs(train_path, exist_ok=True)
os.makedirs(val_path, exist_ok=True)

filepaths = glob.glob(os.path.join(directory, '*.pkl'))
for i, filepath in enumerate(sorted(filepaths), start=1):
    process_and_save_file(filepath, max_len, train_path, val_path, i)

Processed and saved: /content/gdrive/MyDrive/opj/data/train/train_1.pkl and /content/gdrive/MyDrive/opj/data/validation/validation_1.pkl
Processed and saved: /content/gdrive/MyDrive/opj/data/train/train_2.pkl and /content/gdrive/MyDrive/opj/data/validation/validation_2.pkl
Processed and saved: /content/gdrive/MyDrive/opj/data/train/train_3.pkl and /content/gdrive/MyDrive/opj/data/validation/validation_3.pkl
Processed and saved: /content/gdrive/MyDrive/opj/data/train/train_4.pkl and /content/gdrive/MyDrive/opj/data/validation/validation_4.pkl
Processed and saved: /content/gdrive/MyDrive/opj/data/train/train_5.pkl and /content/gdrive/MyDrive/opj/data/validation/validation_5.pkl
Processed and saved: /content/gdrive/MyDrive/opj/data/train/train_6.pkl and /content/gdrive/MyDrive/opj/data/validation/validation_6.pkl
Processed and saved: /content/gdrive/MyDrive/opj/data/train/train_7.pkl and /content/gdrive/MyDrive/opj/data/validation/validation_7.pkl
Processed and saved: /content/gdrive/MyDr
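
To see what the padding step produces, here is a toy-sized sketch with `max_len=4` and 3-dimensional vectors instead of 30 and 300. The two helpers mirror the ones above, except that this version of `pad_embeddings` stores `float32` (an assumption made here to save memory; the notebook's version uses NumPy's default `float64`). The input data is made up for the demonstration.

```python
import numpy as np

def pad_embeddings(sequences, max_len, dim):
    padded = np.zeros((len(sequences), max_len, dim), dtype=np.float32)
    for i, sequence in enumerate(sequences):
        length = min(len(sequence), max_len)  # longer sentences are truncated
        padded[i, :length, :] = sequence[:length]
    return padded

def pad_tags(sequences, max_len, pad_value=-1):
    padded = np.full((len(sequences), max_len), pad_value)
    for i, sequence in enumerate(sequences):
        length = min(len(sequence), max_len)
        padded[i, :length] = sequence[:length]
    return padded

# Two toy sentences: one shorter and one longer than max_len
embeddings = [[np.ones(3)] * 2, [np.ones(3)] * 5]
tags = [[0, 1], [2, 3, 4, 5, 6]]

X = pad_embeddings(embeddings, max_len=4, dim=3)
y = pad_tags(tags, max_len=4)

# Positions equal to -1 carry no label and can be masked out of a loss or metric
mask = y != -1

print(X.shape)
print(y)
print(mask.sum(axis=1))  # number of labelled positions per sentence
```

For scale: a chunk of 10,000 sentences padded to 30 tokens of 300 dimensions holds 90 million values, which is about 720 MB in `float64` and about 360 MB in `float32`.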