Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Levenshtein Distance and Segment Error Rate are computed wrong when labels are integers #373

Closed
NickleDave opened this issue Oct 1, 2021 · 3 comments
Assignees
Labels
BUG Something isn't working

Comments

@NickleDave
Copy link
Collaborator

NickleDave commented Oct 1, 2021

because the labels are concatenated into strings where the individual integers can no longer be differentiated

e.g.,
four consecutive labels 12, 12, 12, 13 become a string of 8 characters '12121213'.

I can verify this using a config for canary song from @yardencsGitHub and putting a breakpoint into the segment_error_rate function, and seeing what the inputs look like on the validation step.

By the time we get into the function
https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/metrics/distance/functional.py#L87

things already look like this:

(Pdb) y_true
'6111111111111111111111111133333333333333333333333333333366666661111118181818181818181818181818181818181818183333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333338888888888888888888888812121212121212121213'

I am pretty sure the culprit is the lbl_tb2labels function that gets called by the model._eval method:
https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/engine/model.py#L283

Note that if I call lbl_tb2labels In the same pdb session I get one big string, which is not what I expected:

(Pdb) lbl_tb2labels(y.cpu().numpy(), eval_data.dataset.labelmap)
'6111111111111111111111111133333333333333333333333333333366666661111118181818181818181818181818181818181818183333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333338888888888888888888888812121212121212121213'

What I think was supposed to happen is that this function would detect when all the labels were int and in that case not concatenate them:

https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/labeled_timebins.py#L160

        elif all([type(el) is int for el in labels]):
            return labels

but that's not what's happening now, because the inverse_label_mapping a few lines above converts the integer labels to their string equivalents:
https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/labeled_timebins.py#L139

    labels = [inverse_labels_mapping[label] for label in labels]

I think this happens because I wrote a converter that coerces the int labels to strings, without realizing the effect it would have--pretty sure I was trying to have a consistent type for labels to avoid errors elsewhere.
https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/converters.py#L141

Not sure what the best fix is right now. I need to come back to this with a fresh mind.

@NickleDave
Copy link
Collaborator Author

NickleDave commented Oct 1, 2021

The quickest fix (not necessarily the best, long-term) would be to have lbl_tb2labels statically map any integer whose string representation has multiple characters to a pre-determined single character, e.g. using the Python built-in chr.

This will need to happen in a way that we don't duplicate labels that might already be in the labelset.

A way to do this is to have a semi-anonymous helper function that does the deterministic "conversion" of the labelset

It should happen before we invert the mapping from integer classes the network predicts back to string labels, so that those string labels can be safely concatenated:
https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/labeled_timebins.py#L139

We can for now insert some semi-anonymous helper function like this:


VALID_SINGLE_CHARS = range(33-96) + range(171-299)

def _multi_char_labels_to_single_char(labels_mapping):
    # function here
    return new_labels_mapping

that will get called by lbl_tb2labels

NickleDave added a commit that referenced this issue Oct 2, 2021
add helper function that coerces any string labels
in `labels_mapping` with multiple characters to
dummy single-string characters

this ensures correct computation of string-based metrics
like Levenshtein distance
@NickleDave
Copy link
Collaborator Author

NickleDave commented Oct 2, 2021

Actually better to do this remapping just once inside of the Model._eval function, so that we know we always pass the exact same labels_mapping to lbl_tb2labels, instead of having lbl_tb2labels do the re-mapping every time

@NickleDave
Copy link
Collaborator Author

NickleDave commented Oct 2, 2021

I'm pasting what we did in a previous script in the tweetynet repo as I mentioned in yardencsGitHub/tweetynet#139.

I kind of prefer something like this since it's more readable, esp if you stop to debug.

At least, probably makes sense to start with capital letters and pick other characters that are easier to read than some of the really weird ascii characters that are hangovers from the days of text as graphics

ALPHANUMERIC = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'


def remap(labelmap):
    """map integer labels to alphanumeric characters so we can compute edit distance metrics.
    The mapping can be arbitrary as long as it is constant across all times we compute the metric.
    """
    return {ALPHANUMERIC[ind]: val for ind, (key, val) in enumerate(labelmap.items())}


def map_number_labels_to_alphanumeric(labelvec):
    """Take a vector of 'str' labels that are all numbers and replace them with a string of characters 
    """
    return ''.join([ALPHANUMERIC[int(x)] for x in labelvec])

NickleDave added a commit that referenced this issue Oct 8, 2021
add helper function that coerces any string labels
in `labels_mapping` with multiple characters to
dummy single-string characters

this ensures correct computation of string-based metrics
like Levenshtein distance
NickleDave added a commit that referenced this issue Oct 8, 2021
add helper function that coerces any string labels
in `labels_mapping` with multiple characters to
dummy single-string characters

this ensures correct computation of string-based metrics
like Levenshtein distance

also add test for lbl_tb2labels to tests/test_labeled_timebins.py
NickleDave added a commit that referenced this issue Oct 8, 2021
change how `lbl_tb2labels` maps ints to str, fix #373
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUG Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant