In [1]:
import sys

sys.path.insert(0, "..")

In [3]:
from nltk.tag import HiddenMarkovModelTagger as ExternalHMM
from src.tagger import HiddenMarkovModel as LocalHMM

from src.scrapper import parse_conllu_file

In this notebook we analyse the computational performance of our algorithm in terms of time and memory.
We run a set of experiments that compare our implementation with the nltk version of the HMM.

In this analysis, we don't expect to show that our algorithm is better, since we are well aware that large libraries such as nltk include many optimizations that ours lacks. However, it is an interesting exercise and a good way to reflect on our weaknesses and strengths, and to think about future work and improvements that could be made if this project went further.

-> It is recommended not to re-execute the cells below, since results may vary slightly depending on the machine on which the code is run. However, the orders of magnitude and the proportions between experiments should be preserved.

# Text processing

In [4]:
dataset = parse_conllu_file(filepath="../datasets/ca_ancora-ud-train.conllu")

s_dataset = dataset[:10]
m_dataset = dataset[:len(dataset) // 2]
l_dataset = dataset
xl_dataset = dataset*10

datasets = {
    'S': s_dataset, 
    'M': m_dataset, 
    'L': l_dataset, 
    'XL': xl_dataset
}

for name, data in datasets.items():
    print(f'Length size {name} = {len(data)}')

Length size S = 10
Length size M = 6561
Length size L = 13123
Length size XL = 131230


# Training Analysis

From now on, we will refer to:
* The models trained using the nltk library as the `external model`.
* The models trained using our implementation as the `local model`.
* The models trained with the different sizes of data as `S`, `M`, `L`, `XL`.

In [5]:
# external model
external_s_time = %timeit -o ExternalHMM.train(s_dataset)
external_m_time = %timeit -o ExternalHMM.train(m_dataset)
external_l_time = %timeit -o ExternalHMM.train(l_dataset)
external_xl_time = %timeit -o ExternalHMM.train(xl_dataset)

355 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
162 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
309 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.11 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
# local model (%timeit -o returns timing results, not trained models)
local_s_time = %timeit -o LocalHMM(s_dataset).train()
local_m_time = %timeit -o LocalHMM(m_dataset).train()
local_l_time = %timeit -o LocalHMM(l_dataset).train()
local_xl_time = %timeit -o LocalHMM(xl_dataset).train()

256 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
108 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
217 ms ± 7.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.03 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Predict Analysis

In [7]:
test_data = parse_conllu_file(filepath="../datasets/ca_ancora-ud-test.conllu")

In [8]:
external_s_model = ExternalHMM.train(s_dataset)
external_xl_model = ExternalHMM.train(xl_dataset)
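
Prediction time can be measured in the same spirit as training time. The helper below is only a sketch: `time_call` is a hypothetical name, and the `sorted` call is a stand-in for the real tagging call of the model under test, whose exact signature depends on the implementation.

```python
import statistics
import time


def time_call(function_to_time, *args, repeats=5, **kwargs):
    """Return the mean and standard deviation (in seconds) of several timed calls."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        function_to_time(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Stand-in workload; replace it with the prediction call of the model under test.
mean_s, stdev_s = time_call(sorted, list(range(10_000, 0, -1)), repeats=5)
print(f"{mean_s * 1e3:.3f} ms ± {stdev_s * 1e3:.3f} ms per call")
```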

# Memory Analysis Approximation
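
A rough approximation of the memory used by a call can be obtained with the standard-library `tracemalloc` module, under the assumption that allocations traced by Python are a good enough proxy for the model's footprint. `peak_memory` is a hypothetical helper, and the list construction below is a stand-in for a training or prediction call.

```python
import tracemalloc


def peak_memory(function_to_measure, *args, **kwargs):
    """Run the call and return (result, peak bytes traced by tracemalloc during it)."""
    tracemalloc.start()
    try:
        result = function_to_measure(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Stand-in workload; replace it with the training or prediction call to measure.
result, peak_bytes = peak_memory(list, range(100_000))
print(f"Peak allocation: {peak_bytes / 1024:.1f} KiB")
```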

# Carbon Footprint Analysis

Our model is a very simple approach that requires few resources to train and predict. However, it is interesting to check the carbon footprint it may generate, if only out of curiosity.

In this section, we use a library that tracks the emissions generated by a function. Since the tracker is a decorator and we don't want to touch our source code, we have created a wrapper function that simply calls whatever callable it receives, forwarding the corresponding positional and keyword arguments.

This way, we can check our carbon footprint without affecting the actual code.

⚠️ Note that the conclusions drawn in this section can be misleading if the code is re-run on another computer or with other data, so they should be taken as indicative and subject to change ⚠️

In [25]:
from codecarbon import track_emissions

In [20]:
@track_emissions()
def compute_emissions(function_to_track: callable, *args, **kwargs):
    return function_to_track(*args,**kwargs)

In [23]:
initialized_class = LocalHMM(l_dataset)
model = compute_emissions(initialized_class.train)

[codecarbon INFO @ 23:23:11] [setup] RAM Tracking...
[codecarbon INFO @ 23:23:11] [setup] GPU Tracking...
[codecarbon INFO @ 23:23:11] No GPU found.
[codecarbon INFO @ 23:23:11] [setup] CPU Tracking...
[codecarbon INFO @ 23:23:12] CPU Model on constant consumption mode: Apple M1
[codecarbon INFO @ 23:23:12] >>> Tracker's metadata:
[codecarbon INFO @ 23:23:12]   Platform system: macOS-13.5.2-arm64-arm-64bit
[codecarbon INFO @ 23:23:12]   Python version: 3.11.6
[codecarbon INFO @ 23:23:12]   CodeCarbon version: 2.3.1
[codecarbon INFO @ 23:23:12]   Available RAM : 16.000 GB
[codecarbon INFO @ 23:23:12]   CPU count: 8
[codecarbon INFO @ 23:23:12]   CPU model: Apple M1
[codecarbon INFO @ 23:23:12]   GPU count: None
[codecarbon INFO @ 23:23:12]   GPU model: None
[codecarbon INFO @ 23:23:15] 
Graceful stopping: collecting and writing information.
Please wait a few seconds...
[codecarbon INFO @ 23:23:15] Energy consumed for RAM : 0.000000 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 23:23:15] Ene

In [24]:
_ = compute_emissions(model.predict, l_dataset)

[codecarbon INFO @ 23:23:27] [setup] RAM Tracking...
[codecarbon INFO @ 23:23:27] [setup] GPU Tracking...
[codecarbon INFO @ 23:23:27] No GPU found.
[codecarbon INFO @ 23:23:27] [setup] CPU Tracking...
[codecarbon INFO @ 23:23:27] CPU Model on constant consumption mode: Apple M1
[codecarbon INFO @ 23:23:27] >>> Tracker's metadata:
[codecarbon INFO @ 23:23:27]   Platform system: macOS-13.5.2-arm64-arm-64bit
[codecarbon INFO @ 23:23:27]   Python version: 3.11.6
[codecarbon INFO @ 23:23:27]   CodeCarbon version: 2.3.1
[codecarbon INFO @ 23:23:27]   Available RAM : 16.000 GB
[codecarbon INFO @ 23:23:27]   CPU count: 8
[codecarbon INFO @ 23:23:27]   CPU model: Apple M1
[codecarbon INFO @ 23:23:27]   GPU count: None
[codecarbon INFO @ 23:23:27]   GPU model: None
[codecarbon INFO @ 23:23:31] 
Graceful stopping: collecting and writing information.
Please wait a few seconds...
[codecarbon INFO @ 23:23:31] Energy consumed for RAM : 0.000004 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 23:23:31] Ene

When this model is run on an Apple M1 CPU without any GPU, the summarised metrics are as shown in the table below:

|                                       | Training      | Predict       |
|---------------------------------------|---------------|---------------|
| Energy consumed for RAM               | 0.00 kWh      | 0.000004 kWh  |
| Energy consumed for all CPUs          | 0.00 kWh      | 0.000003 kWh  |
| Electricity used since the beginning  | 0.000001 kWh  | 0.000008 kWh  |



* As can be seen, our model's consumption on the datasets we have used is essentially negligible.
* Nonetheless, in this run prediction consumed more energy than training (0.000008 kWh versus 0.000001 kWh in total).
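
To put these figures in perspective, the totals from the table above can be converted to joules. This is plain arithmetic on the reported values, which are rounded to six decimal places, so the resulting ratio is only a coarse estimate.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s

# Total electricity used, in kWh, copied from the table above.
training_kwh = 0.000001
predict_kwh = 0.000008

training_joules = training_kwh * JOULES_PER_KWH
predict_joules = predict_kwh * JOULES_PER_KWH

print(f"Training: {training_joules:.1f} J")
print(f"Predict:  {predict_joules:.1f} J")
print(f"Predict / training: {predict_kwh / training_kwh:.0f}x")
```

That is about 3.6 J for training and about 28.8 J for prediction, so prediction used roughly eight times as much energy as training in this run.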

# Conclusions