In [1]:
import sys

sys.path.insert(0, "..")

In [2]:
from nltk.tag import HiddenMarkovModelTagger as ExternalHMM
from src.tagger import HiddenMarkovModelTrainer as LocalHMM

from src.scrapper import parse_conllu_file

In this notebook we analyse the performance of our algorithm in terms of speed and memory usage.
We run a set of experiments comparing our implementation with the nltk version of the HMM.

In this analysis, we don't expect to prove that our algorithm is better: we are well aware that big libraries such as nltk include many optimizations that we lack. However, it is an interesting exercise and a good way to reflect on our strengths and weaknesses, and to think about future work and improvements that could be made if this project went further.

-> It is recommended not to re-run the cells below, since results may vary slightly depending on the machine on which the code is run. However, the orders of magnitude and the proportions between experiments should be preserved.
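
Because the timings depend on the machine, it can help to store a short description of the environment next to the results. The cell below is a minimal sketch that uses only the standard library; the `environment` dictionary and its keys are our own illustrative names, not part of the project code.

```python
import os
import platform
import sys

# Collect a few facts about the machine and interpreter used for the timings.
# The dictionary name and its keys are illustrative, not part of the project.
environment = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "machine": platform.machine(),
    "cpu_count": os.cpu_count(),
}

for key, value in environment.items():
    print(f"{key}: {value}")
```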

# Text processing

In [6]:
dataset = parse_conllu_file(filepath="../datasets/ca_ancora-ud-train.conllu")

s_dataset = dataset[:10]
m_dataset = dataset[:len(dataset) // 2]
l_dataset = dataset
xl_dataset = dataset * 10

datasets = {
    'S': s_dataset, 
    'M': m_dataset, 
    'L': l_dataset, 
    'XL': xl_dataset
}

for name, data in datasets.items():
    print(f'Length size {name} = {len(data)}')

Length size S = 10
Length size M = 6561
Length size L = 13123
Length size XL = 131230


# Training Analysis

From now on, we will refer to:
* The models trained using the nltk library as the `external model`.
* The models trained using our implementation as the `local model`.
* The models trained with the different sizes of data as `S`, `M`, `L`, `XL`.

In [4]:
# external model
external_s_time = %timeit -o ExternalHMM.train(s_dataset)
external_m_time = %timeit -o ExternalHMM.train(m_dataset)
external_l_time = %timeit -o ExternalHMM.train(l_dataset)
external_xl_time = %timeit -o ExternalHMM.train(xl_dataset)

349 µs ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
157 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
305 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.03 s ± 4.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
# local model (note: these variables hold the %timeit results, not trained models)
local_s_model = %timeit -o LocalHMM(s_dataset).train()
local_m_model = %timeit -o LocalHMM(m_dataset).train()
local_l_model = %timeit -o LocalHMM(l_dataset).train()
local_xl_model = %timeit -o LocalHMM(xl_dataset).train()

249 µs ± 851 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
105 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
206 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.99 s ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
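
To make the two sets of timings easier to compare, the following sketch computes the external/local ratio for each dataset size, together with the local training time per sentence. The mean times and dataset lengths are copied by hand from the outputs above; the dictionary and function names are our own. A ratio above 1 only says that the local call finished sooner on this machine, which need not mean that both calls do the same amount of work.

```python
# Mean training times in seconds, copied from the %timeit outputs above.
external_times = {"S": 349e-6, "M": 157e-3, "L": 305e-3, "XL": 3.03}
local_times = {"S": 249e-6, "M": 105e-3, "L": 206e-3, "XL": 1.99}

# Number of sentences in each dataset, copied from the length printout above.
sizes = {"S": 10, "M": 6561, "L": 13123, "XL": 131230}


def speedup(external, local):
    """Return external/local time ratios, keyed by dataset name."""
    return {name: external[name] / local[name] for name in external}


ratios = speedup(external_times, local_times)

for name, ratio in ratios.items():
    # Convert seconds per sentence to microseconds for readability.
    per_sentence_us = local_times[name] / sizes[name] * 1e6
    print(f"{name}: external/local = {ratio:.2f}, "
          f"local time per sentence = {per_sentence_us:.1f} us")
```

With these numbers the ratio stays between roughly 1.4 and 1.5 for every size, which suggests that both implementations scale in a similar way as the dataset grows.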


# Prediction Analysis

In [7]:
test_data = parse_conllu_file(filepath="../datasets/ca_ancora-ud-test.conllu")

In [9]:
external_s_model = ExternalHMM.train(s_dataset)
external_xl_model = ExternalHMM.train(xl_dataset)
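
As a complement to `%timeit`, prediction time can also be measured with a small helper based on `time.perf_counter`. The sketch below is only an illustration: `time_tagging`, `dummy_tag` and `toy_sentences` are hypothetical names, and we assume that each sentence is a list of `(word, tag)` pairs and that a tagger exposes a callable taking a list of words.

```python
import time


def time_tagging(tag_fn, sentences, repeats=3):
    """Return the best wall-clock time (in seconds) needed to tag all sentences."""
    # Strip the gold tags so the tagger only sees the words.
    untagged = [[word for word, _ in sentence] for sentence in sentences]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for tokens in untagged:
            tag_fn(tokens)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in tagger and toy data so that the sketch runs on its own.
def dummy_tag(tokens):
    return [(token, "NOUN") for token in tokens]


toy_sentences = [[("el", "DET"), ("gat", "NOUN")], [("dorm", "VERB")]]

elapsed = time_tagging(dummy_tag, toy_sentences)
print(f"Tagged {len(toy_sentences)} sentences in {elapsed:.6f} s")
```

In this notebook, the analogous call would be something like `time_tagging(external_s_model.tag, test_data)`, provided that `test_data` has the shape assumed above.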

# Extra - Memory Analysis Approximation