intersection: issue on consecutive duplicate words #45

Closed
scottmk opened this issue Jan 23, 2023 · 6 comments

@scottmk

scottmk commented Jan 23, 2023

Today I encountered an issue with the behavior of intersection.

Say I have a WORD tier that looks like this:

UPON | A | A | TIME

And I have a PHONE tier that looks like this:

AH0 | P | AA1 | N | AH0 | EY1 | T | AY1 | M

Assuming these are time-aligned correctly, when I call intersection, I get a list that looks something like this:

['UPON-AH0', 'UPON-P', 'UPON-AA1', 'UPON-N', 'A-AH0', 'A-EY1', 'TIME-T', 'TIME-AY1', 'TIME-M']

Because I have two intervals in the WORD tier which have the same label, from this intersection I can't really tell if I have two distinct words "A" that have the respective transcriptions "AH0" and "EY1", or if I have one distinct word "A" transcribed as "AH0 EY1".
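To make the ambiguity concrete, here is a minimal sketch in plain Python. It does not use the library itself: the `intersect` helper, the `(start, end, label)` tuple layout, and all of the timestamps are assumptions made up for illustration. It shows that a WORD tier with two separate "A" intervals and one with a single longer "A" produce the same list of labels.

```python
# Hypothetical time-aligned tiers as (start, end, label) tuples; times are invented.
phone_tier = [
    (0.0, 0.1, "AH0"), (0.1, 0.2, "P"), (0.2, 0.3, "AA1"), (0.3, 0.4, "N"),
    (0.4, 0.5, "AH0"), (0.5, 0.6, "EY1"),
    (0.6, 0.7, "T"), (0.7, 0.8, "AY1"), (0.8, 0.9, "M"),
]
two_a_words = [(0.0, 0.4, "UPON"), (0.4, 0.5, "A"), (0.5, 0.6, "A"), (0.6, 0.9, "TIME")]
one_a_words = [(0.0, 0.4, "UPON"), (0.4, 0.6, "A"), (0.6, 0.9, "TIME")]


def intersect(left, right):
    """Illustrative stand-in: one output interval per overlapping pair."""
    out = []
    for l_start, l_end, l_label in left:
        for r_start, r_end, r_label in right:
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:  # keep only pairs that share a non-empty time span
                out.append((start, end, f"{l_label}-{r_label}"))
    return out


labels_two_a = [label for _, _, label in intersect(two_a_words, phone_tier)]
labels_one_a = [label for _, _, label in intersect(one_a_words, phone_tier)]
print(labels_two_a)
print(labels_two_a == labels_one_a)  # the labels cannot tell the two tiers apart
```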

There is obviously no single right way to solve this, but since we know the word entries are distinct, I would suggest that the label instead be the WORD label plus a tuple of all the PHONE labels that coincide with it. Something like this:

['UPON-(AH0, P, AA1, N)', 'A-(AH0)', 'A-(EY1)', 'TIME-(T, AY1, M)']
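A rough sketch of that suggestion in plain Python follows. The `merge_by_left` helper, the tuple layout, and the timestamps are hypothetical and only illustrate the idea; this is not the library's implementation. Each interval of the left-hand tier is kept as-is and collects the labels of every right-hand interval it overlaps.

```python
# Hypothetical time-aligned tiers as (start, end, label) tuples; times are invented.
word_tier = [(0.0, 0.4, "UPON"), (0.4, 0.5, "A"), (0.5, 0.6, "A"), (0.6, 0.9, "TIME")]
phone_tier = [
    (0.0, 0.1, "AH0"), (0.1, 0.2, "P"), (0.2, 0.3, "AA1"), (0.3, 0.4, "N"),
    (0.4, 0.5, "AH0"), (0.5, 0.6, "EY1"),
    (0.6, 0.7, "T"), (0.7, 0.8, "AY1"), (0.8, 0.9, "M"),
]


def merge_by_left(left, right):
    """Keep the left tier's boundaries; append the overlapping right labels."""
    out = []
    for l_start, l_end, l_label in left:
        overlapping = [
            r_label
            for r_start, r_end, r_label in right
            if max(l_start, r_start) < min(l_end, r_end)
        ]
        out.append((l_start, l_end, f"{l_label}-({', '.join(overlapping)})"))
    return out


merged = merge_by_left(word_tier, phone_tier)
print([label for _, _, label in merged])
```

In this sketch a left-hand interval with no overlapping phones would get an empty `()` suffix; whether such intervals should be kept or dropped is a design choice the suggestion leaves open.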

This would also mean that the interval boundaries would be the boundaries of the left-hand side tier. So my example would be for

word_tier.intersection(phone_tier)

If you instead did

phone_tier.intersection(word_tier)

you would get

['AH0-UPON', 'P-UPON', 'AA1-UPON', 'N-UPON', 'AH0-A', 'EY1-A', 'T-TIME', 'AY1-TIME', 'M-TIME']

What do you think?

@scottmk
Author

scottmk commented Jan 23, 2023

Alternatively, it could behave as it currently does, and instead you could add the entryList entry index from each tier to the intersection entries.
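As a sketch of this alternative (plain Python with a hypothetical `intersect_with_indices` helper and invented timestamps, not the library's API), each intersection entry could carry the index of the source interval in each tier, so that two identically labelled words remain distinguishable:

```python
def intersect_with_indices(left, right):
    """Illustrative pairwise intersection that also records source indices."""
    out = []
    for i, (l_start, l_end, l_label) in enumerate(left):
        for j, (r_start, r_end, r_label) in enumerate(right):
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:
                # (start, end, label, index in left tier, index in right tier)
                out.append((start, end, f"{l_label}-{r_label}", i, j))
    return out


# Only the two "A" words from the example, with invented times.
word_tier = [(0.4, 0.5, "A"), (0.5, 0.6, "A")]
phone_tier = [(0.4, 0.5, "AH0"), (0.5, 0.6, "EY1")]

entries = intersect_with_indices(word_tier, phone_tier)
word_indices = {left_index for _, _, _, left_index, _ in entries}
print(entries)
print(len(word_indices))  # number of distinct word intervals involved
```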

@timmahrt
Owner

I think your first solution sounds reasonable. I'll take a look and see what I can do.

@timmahrt
Owner

timmahrt commented Feb 4, 2023

When I went to implement the changes, I realized that my original intention with intersection() wouldn't be compatible with the changes you suggested.

For example, what would happen with a phone list where only some of the phones are listed for each word, e.g.
[(0, 1, "hello")] and [(0.1, 0.2, 'e'), (0.7, 1.0, 'o')]
What would the expected output be? Under the existing intersection method, two intervals would be output, [(0.1, 0.2, "hello(e)"), (0.7, 1.0, "hello(o)")], but I think one could argue that in some cases only one interval is wanted, (0, 1, "hello(e,o)"), which is more in line with your use case.
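The two readings can be put side by side in a small plain-Python sketch. The helpers `pairwise` and `merged` are hypothetical names used only for illustration and are not the library's methods; the label format follows the example above.

```python
words = [(0.0, 1.0, "hello")]
phones = [(0.1, 0.2, "e"), (0.7, 1.0, "o")]


def pairwise(left, right):
    """One interval per overlapping pair, bounded by the overlap itself."""
    out = []
    for l_start, l_end, l_label in left:
        for r_start, r_end, r_label in right:
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:
                out.append((start, end, f"{l_label}({r_label})"))
    return out


def merged(left, right):
    """One interval per left entry, keeping the left entry's boundaries."""
    out = []
    for l_start, l_end, l_label in left:
        hits = [
            r_label
            for r_start, r_end, r_label in right
            if max(l_start, r_start) < min(l_end, r_end)
        ]
        out.append((l_start, l_end, f"{l_label}({','.join(hits)})"))
    return out


pairwise_result = pairwise(words, phones)
merged_result = merged(words, phones)
print(pairwise_result)
print(merged_result)
```

The first helper discards the uncovered stretches of the word (0 to 0.1 and 0.2 to 0.7), while the second keeps the word's full extent, which is why the two cannot easily share a single return format.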

I wondered how I could accommodate these two scenarios--parameterize intersection()?

I decided that a simpler solution was to create a different method, mergeLabels(). I implemented that in #47. I also added some documentation to the existing intersection().

What do you think? Does mergeLabels() work for your use case?

@timmahrt
Owner

timmahrt commented Feb 5, 2023

I merged my PR and built a release. I've been sitting on a lot of code since November, which I really shouldn't have done.

Reviews on the merged PR are still welcome--I can make a follow-up PR. 🙇

@scottmk
Author

scottmk commented Feb 5, 2023

Thanks for this! I think creating the new method is a great compromise and this helps my use case a lot.

I'll take a look at the PR and see if I have any comments to make.

Thanks for the quick response!

@scottmk scottmk closed this as completed Feb 5, 2023