New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Levenshtein Distance and Segment Error Rate are computed wrong when labels are integers #373
Comments
The quickest fix (not necessarily the best, long-term) would be to have This will need to happen in a way that we don't duplicate labels that might already be in the A way to do this is to have a semi-anonymous helper function that does the deterministic "conversion" of the It should happen before we invert the mapping from integer classes the network predicts back to string labels, so that those string labels can be safely concatenated: We can for now insert some semi-anonymous helper function like this:
that will get called by |
add helper function that coerces any string labels in `labels_mapping` with multiple characters to dummy single-string characters this ensures correct computation of string-based metrics like Levenshtein distance
Actually better to do this remapping just once inside of the |
I'm pasting what we did in a previous script in the tweetynet repo as I mentioned in yardencsGitHub/tweetynet#139. I kind of prefer something like this since it's more readable, esp if you stop to debug. At least, probably makes sense to start with capital letters and pick other characters that are easier to read than some of the really weird ascii characters that are hangovers from the days of text as graphics
|
add helper function that coerces any string labels in `labels_mapping` with multiple characters to dummy single-string characters this ensures correct computation of string-based metrics like Levenshtein distance
add helper function that coerces any string labels in `labels_mapping` with multiple characters to dummy single-string characters this ensures correct computation of string-based metrics like Levenshtein distance also add test for lbl_tb2labels to tests/test_labeled_timebins.py
change how `lbl_tb2labels` maps ints to str, fix #373
because the labels are concatenated into strings where the individual integers can no longer be differentiated
e.g.,
four consecutive labels
12, 12, 12, 13
become a string of 8 characters'12121213'
.I can verify this using a config for canary song from @yardencsGitHub and putting a breakpoint into the
segment_error_rate
function, and seeing what the inputs look like on the validation step.By the time we get into the function
https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/metrics/distance/functional.py#L87
things already look like this:
I am pretty sure the culprit is the
lbl_tb2labels
function that gets called by themodel._eval
method:https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/engine/model.py#L283
Note that if I call
lbl_tb2labels
In the same pdb session I get one big string, which is not what I expected:What I think was supposed to happen is that this function would detect when all the labels were
int
and in that case not concatenate them:https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/labeled_timebins.py#L160
but that's not what's happening now, because the
inverse_label_mapping
a few lines above converts the integer labels to their string equivalents:https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/labeled_timebins.py#L139
I think this happens because I wrote a converter that coerces the
int
labels to strings, without realizing the effect it would have--pretty sure I was trying to have a consistent type for labels to avoid errors elsewhere.https://github.com/NickleDave/vak/blob/1235c1bac125ebaa73385092b688a5d92f3982ce/src/vak/converters.py#L141
Not sure what the best fix is right now. I need to come back to this with a fresh mind.
The text was updated successfully, but these errors were encountered: