In [56]:
import numpy as np
import pandas as pd

from functools import reduce
import plotly.graph_objects as go
import plotly.express as px

import os
import sys
root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(root)

from validation import get_error_propagation_prob, get_accuracy, get_precision, get_recall, get_f1
from src.scrapper import parse_conllu_file
from src.tagger import HiddenMarkovModel, HiddenMarkovModelTagger
# from src.visualization import plot_viterbi_path_binary, plot_viterbi_matrix

In [4]:
# Importing data
train = parse_conllu_file("../datasets/en_gum-ud-train.conllu") 
test = parse_conllu_file("../datasets/en_gum-ud-test.conllu")

In [5]:
# Training the HMM model:
tagger = HiddenMarkovModel(corpus=train).train()

In [8]:
# Evaluation of the model over the testing dataset:
test_predictions = tagger.predict(corpus=test)
print(f"We have trained the model with {len(train)} sentences in English and the test dataset contains {len(test)} sentences")


We have trained the model with 8548 sentences in English and the test dataset contains 1096 sentences



## Confusion Matrix

In [35]:
cm = tagger.get_confusion_matrix(corpus=test, corpus_prediction=test_predictions)

In [22]:
fig = go.Figure(
        data=go.Heatmap(
            z=cm,
            x=tagger.tagset,
            y=tagger.tagset,
            colorscale="Thermal",
        )
    )
fig.show()

As the heatmap shows, most of the predictions are correct.
However, the model has trouble with the Noun tag: there are cases where Noun is confused with PropN, Verb, or Adj. This is probably because the dataset is unbalanced and contains more Noun examples than examples of any other tag.
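
For reference, a confusion matrix like `cm` can be built by counting (gold tag, predicted tag) pairs. The following is a minimal sketch, not the implementation of `tagger.get_confusion_matrix`: the function name `build_confusion_matrix`, the toy data, and the convention that rows are gold tags and columns are predicted tags are all assumptions made for illustration.

```python
import numpy as np

def build_confusion_matrix(gold_tags, predicted_tags, tagset):
    """Count (gold, predicted) pairs; rows are gold tags, columns are predicted tags."""
    index = {tag: i for i, tag in enumerate(tagset)}
    matrix = np.zeros((len(tagset), len(tagset)), dtype=int)
    for gold, predicted in zip(gold_tags, predicted_tags):
        matrix[index[gold], index[predicted]] += 1
    return matrix

# Toy data (hypothetical, not taken from the GUM corpus).
toy_tagset = ["noun", "propn", "adj"]
gold = ["noun", "propn", "propn", "adj", "noun"]
pred = ["noun", "noun", "propn", "noun", "noun"]

toy_cm = build_confusion_matrix(gold, pred, toy_tagset)
print(toy_cm)
```

In this toy matrix the off-diagonal counts in the `noun` column are the tokens that were wrongly tagged as Noun, which is the kind of confusion visible in the heatmap.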

We will look at this in more detail using the precision metric.

## Accuracy

In [34]:
acc = get_accuracy(cm)

print(f"Accuracy: {100*acc:.2f}%")

Accuracy: 82.13%
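
As a rough illustration of what this number means, accuracy is the share of tokens on the diagonal of the confusion matrix. The sketch below assumes the standard definition and uses a hypothetical helper name and toy matrix; the internals of `get_accuracy` are not shown in this notebook.

```python
import numpy as np

def accuracy_from_confusion_matrix(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    matrix = np.asarray(matrix, dtype=float)
    return np.trace(matrix) / matrix.sum()

# Toy matrix (hypothetical): 3 of the 5 tokens lie on the diagonal.
toy_cm = np.array([[2, 0, 0],
                   [1, 1, 0],
                   [1, 0, 0]])

toy_accuracy = accuracy_from_confusion_matrix(toy_cm)
print(f"Accuracy: {100 * toy_accuracy:.2f}%")
```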


## Precision

In [48]:
print("Precision per tag:\n")

precs, p_preds, p_micro, p_macro = get_precision(cm)
for tag, precision, predictions in zip(tagger.tagset, precs, p_preds):
    print(f"{tag:<5} {100*precision:.2f}% over {int(predictions)} predictions.")

print()
print(f"Micro precision: {100*p_micro:.2f} %")
print(f"Macro precision: {100*p_macro:.2f} %")
print()


Precision per tag:

noun  61.19% over 5377 predictions.
_     100.00% over 179 predictions.
adj   87.30% over 1008 predictions.
punct 99.14% over 2562 predictions.
det   96.40% over 1722 predictions.
verb  79.36% over 1899 predictions.
part  99.19% over 123 predictions.
cconj 99.13% over 686 predictions.
num   100.00% over 237 predictions.
sym   100.00% over 8 predictions.
pron  91.04% over 1551 predictions.
x     100.00% over 5 predictions.
intj  100.00% over 75 predictions.
sconj 97.62% over 84 predictions.
aux   83.54% over 972 predictions.
adp   82.61% over 2456 predictions.
adv   89.68% over 727 predictions.
propn 79.20% over 500 predictions.

Micro precision: 84.59 %
Macro precision: 91.41 %



In [57]:
data = pd.DataFrame(zip(tagger.tagset, precs), columns=['Tag', 'Precision'])

fig = px.bar(data, x='Tag', y='Precision')
fig.update_layout(xaxis_title='Tag', yaxis_title='Precision')

fig.show()

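The lowest precision belongs to the tag with the most predictions, Noun (61.19% over 5377 predictions). Because the micro average is dominated by the tags with the most examples while the macro average gives every tag the same weight, the micro precision is lower than the macro precision.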
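
The effect of a frequent, low-precision tag on the two averages can be reproduced on a toy matrix. This is a sketch under the standard definitions (macro: unweighted mean of per-tag precision; micro: pooled true positives over pooled predictions), with rows assumed to be gold tags and columns predicted tags. The weighting used inside `get_precision` is not shown in this notebook, so `precision_summary` is a hypothetical stand-in, not that function.

```python
import numpy as np

def precision_summary(matrix):
    """Per-tag, micro, and macro precision; assumes rows = gold, columns = predicted."""
    matrix = np.asarray(matrix, dtype=float)
    true_positives = np.diag(matrix)
    predicted_counts = matrix.sum(axis=0)  # column sums: how often each tag was predicted
    # Tags that were never predicted get precision 0 here (a convention, not a rule).
    per_tag = np.divide(
        true_positives,
        predicted_counts,
        out=np.zeros_like(true_positives),
        where=predicted_counts > 0,
    )
    micro = true_positives.sum() / predicted_counts.sum()
    macro = per_tag.mean()
    return per_tag, predicted_counts, micro, macro

# Toy matrix (hypothetical): "noun" is predicted 10 times but is right only 6 times.
toy_cm = [[6, 0],   # gold noun
          [4, 2]]   # gold propn
per_tag, counts, micro, macro = precision_summary(toy_cm)
print(f"Per-tag: {per_tag}, micro: {100 * micro:.2f} %, macro: {100 * macro:.2f} %")
```

The frequent tag pulls the micro average down, while the macro average is lifted by the rare tag with perfect precision.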

#### How could we solve the problem with Nouns?
In my opinion, we should find a bigger dataset with many examples of words that could be Adj, PropN, or Verb but are actually Nouns. With these examples, the precision would probably increase.

We should also take into account that Hidden Markov Models are small models with some limitations; using more complex models could also improve the precision.

## Recall

In [59]:
print("Recall per tag:\n")

recalls, r_preds, r_micro, r_macro = get_recall(cm)
for tag, recall, support in zip(tagger.tagset, recalls, r_preds):
    print(f"{tag:<5} {100*recall:.2f}% over {int(support)} examples.")

print()
print(f"Micro recall: {100*r_micro:.2f} %")
print(f"Macro recall: {100*r_macro:.2f} %")
print()



Recall per tag:

noun  93.79% over 3508 examples.
_     68.85% over 260 examples.
adj   66.31% over 1327 examples.
punct 100.00% over 2540 examples.
det   96.46% over 1721 examples.
verb  72.11% over 2090 examples.
part  29.05% over 420 examples.
cconj 97.14% over 700 examples.
num   59.10% over 401 examples.
sym   23.53% over 34 examples.
pron  98.47% over 1434 examples.
x     17.86% over 28 examples.
intj  50.68% over 148 examples.
sconj 32.80% over 250 examples.
aux   87.69% over 926 examples.
adp   99.12% over 2047 examples.
adv   70.49% over 925 examples.
propn 28.05% over 1412 examples.

Micro recall: 88.18 %
Macro recall: 66.19 %



In [60]:
data = pd.DataFrame(zip(tagger.tagset, recalls), columns=['Tag', 'Recall'])

fig = px.bar(data, x='Tag', y='Recall')
fig.update_layout(xaxis_title='Tag', yaxis_title='Recall')

fig.show()

For recall, the macro average is considerably lower than the micro average. This is mainly because the tags that are poorly represented in the dataset (part, sym, intj, sconj) have a low recall.
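
To make the role of rare tags concrete, the sketch below computes recall on a toy matrix with one frequent and one rare tag. It assumes the standard definitions and the same orientation as before (rows = gold tags, columns = predicted tags); `recall_summary` is a hypothetical name and is not the notebook's `get_recall`.

```python
import numpy as np

def recall_summary(matrix):
    """Per-tag, micro, and macro recall; assumes rows = gold, columns = predicted."""
    matrix = np.asarray(matrix, dtype=float)
    true_positives = np.diag(matrix)
    gold_counts = matrix.sum(axis=1)  # row sums: how often each tag really occurs
    per_tag = np.divide(
        true_positives,
        gold_counts,
        out=np.zeros_like(true_positives),
        where=gold_counts > 0,
    )
    micro = true_positives.sum() / gold_counts.sum()
    macro = per_tag.mean()
    return per_tag, gold_counts, micro, macro

# Toy matrix (hypothetical): 95 gold "noun" tokens, only 5 gold "sym" tokens.
toy_cm = [[95, 0],   # gold noun: all recovered
          [4, 1]]    # gold sym: 4 of 5 mistaken for noun
per_tag, support, micro, macro = recall_summary(toy_cm)
print(f"Per-tag: {per_tag}, micro: {100 * micro:.2f} %, macro: {100 * macro:.2f} %")
```

The rare tag barely moves the micro average, but it counts as much as the frequent tag in the macro average.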

We also see the previously mentioned problem again with Adj and PropN: the model cannot reliably distinguish them from Noun. I propose the same solution: giving the model a bigger dataset with a larger vocabulary.

## Probability of error accumulation
Hidden Markov Models are based on a first-order Markov chain approximation. This implies that the ith prediction depends on the (i-1)th prediction.
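
Written out, the first-order assumption means that the joint probability of a tag sequence and its words factorizes as shown below. The notation is chosen here for illustration and is not defined elsewhere in the notebook: $t_i$ is the tag of the ith token, $w_i$ is the ith word, and $t_0$ is a start-of-sentence symbol.

$$
P(t_1, \dots, t_n, w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
$$

Each tag is conditioned only on the tag immediately before it, which is why a wrong tag at position $i-1$ can influence the choice at position $i$.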

That is why it is interesting to measure the probability of two consecutive errors: this metric is the probability of a wrong prediction for the (i+1)th token given that the prediction for the ith token was wrong.
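
One possible way to estimate this conditional probability is to count, within each sentence, how often a wrong tag is followed by another wrong tag. The sketch below is an assumption about how such a metric could be computed, not the implementation of `get_error_propagation_prob`: the function name, the input format (one list of tags per sentence), and the toy data are hypothetical.

```python
def error_propagation_probability(gold_sentences, predicted_sentences):
    """Estimate P(error at token i+1 | error at token i) within sentences."""
    previous_errors = 0     # wrong predictions that have a following token
    consecutive_errors = 0  # of those, cases where the following token is also wrong
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        wrong = [g != p for g, p in zip(gold, predicted)]
        for current, following in zip(wrong, wrong[1:]):
            if current:
                previous_errors += 1
                consecutive_errors += following
    # With no qualifying errors the ratio is undefined; 0.0 is returned by convention.
    return consecutive_errors / previous_errors if previous_errors else 0.0

# Toy data (hypothetical): tags only, one list per sentence.
gold = [["det", "adj", "noun"], ["pron", "verb", "det", "noun"]]
pred = [["noun", "noun", "noun"], ["pron", "noun", "det", "noun"]]

toy_probability = error_propagation_probability(gold, pred)
print(f"{100 * toy_probability:.2f} %")
```

Pairs are only counted inside a sentence, so an error on the last token of one sentence is not linked to the first token of the next.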

In [15]:
error_probability = get_error_propagation_prob(test, test_predictions)
print(f"The probability of making a wrong prediction right after another wrong prediction is {100*error_probability:.2f} %")

The probability of making a wrong prediction right after another wrong prediction is 24.76 %


#### Let's check an example:
Imagine that we have the sentence "the blue cat" (Det Adj Noun) and the model makes a wrong prediction for the first token ("the" = Noun).

The probability of also getting "blue" wrong is then 24.76%, which leaves a 75.24% probability of correctly predicting "blue" as Adj.

These results are good overall, considering that the prediction follows a wrong prediction.