Can I fill the blanks in the tier by extending the existing intervals? #25

Closed
GalaxieT opened this issue Apr 3, 2021 · 21 comments

@GalaxieT

GalaxieT commented Apr 3, 2021

I noticed that when saving the textgrid file, praatio would try to fill up the tiers with new blank intervals, which seems not to be quite friendly to automatic aligning. I am not sure but what would happen if the tier is left there and not filled up? Or can I use praatio to fill the blanks by extending the existing intervals (which would not change the total number of intervals, making it much easier for machines to recognize)?

@timmahrt
Owner

timmahrt commented Apr 3, 2021

Someone recently had a similar request when reading in textgrids in praat and I added a "readRaw" boolean parameter. I can do the same for the save function. If "writeRaw" is true, it won't insert blanks.

In your case the automatic aligner is assuming the blank label is a phone? I used one a long time ago that required blank intervals to be labeled "sp" (small pause)--IIRC. I imagine the exact behaviour is vendor specific.

When a segment is deleted in praat, praat inserts a space. I followed that behaviour when I first wrote praatio, but there are other use cases of course.

I should be able to add this tomorrow.

(which would not change the total number of intervals, making it much easier for machines to recognize)?

I don't understand the use case. Without knowing your case more, I would assume that changing the duration of the intervals would invalidate the meaning of the labeled interval (phone, word, phrase).

But, you can use praatio to do this if it would help (this is off the top of my head--I will double check tomorrow)

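# Each entry is (start, end, label). Stretch every interval's end to the next interval's start.
entryList = tier.entryList
newEntryList = []
for i in range(len(entryList) - 1):
    newEntryList.append([entryList[i][0], entryList[i + 1][0], entryList[i][2]])

# The last interval runs to the end of the tier; the first one starts at its beginning.
newEntryList.append([entryList[-1][0], tier.maxTimestamp, entryList[-1][2]])
newEntryList[0][0] = tier.minTimestamp

# Swap the gap-free tier into the textgrid.
newTier = tier.new(entryList=newEntryList)
tg.replaceTier(tier.name, newTier)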

@timmahrt
Owner

timmahrt commented Apr 3, 2021

When you say automatically align, you're talking about forced aligners or something else? What tool are you using?

@GalaxieT
Author

GalaxieT commented Apr 3, 2021

Thank you so much for the quick response!

In your case the automatic aligner is assuming the blank label is a phone?

Yes, and actually it usually contains a set of phones, in contrast to the silent parts. And the blank areas in the tier are often very short, 0.001 second for example. They exist because I am combining some tg files together, and due to the accuracy of floats some of the digits are lost.

I am using SPPAS to do some forced aligning work (just learnt this term :). But I wrote my own scripts to process the outcome of the aligner. The tokens and phones originally correspond to each other, and silent parts are marked with "#". A blank interval would occur only when the SPPAS aligner failed, so any additional intervals (blank or not) produced by praatio would surprise my script.

@timmahrt
Owner

timmahrt commented Apr 4, 2021

I have used SPPAS and praatio together before and did not encounter the problem you are having. But I have not used SPPAS for several years now, and things may have changed.

From the documentation, it seems that '#' is used to mark silence:
http://www.sppas.org/documentation_03_annotations.html#overview-7

I'm working on the two change requests now.

@timmahrt
Owner

timmahrt commented Apr 4, 2021

And the blank areas in the tier are often very short, 0.001 second for example.

If you suspect those are errors, it may be better to correct them. Textgrid.save() will automatically do this. It has an argument minimumIntervalLength. Any intervals smaller than that threshold will be deleted and the interval will be added to the interval before it (similar to the code snippet I shared earlier).

By default it's very short (0.00000001 second) and is used to fix floating point rounding errors.
http://timmahrt.github.io/praatIO/tgio.html#praatio.tgio.Textgrid.save
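
As a rough illustration of the merge described above, here is a minimal sketch in plain Python. It is not praatio's actual implementation: the function name `absorb_short_intervals` and the tuple layout `(start, end, label)` are assumptions made for this example only.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start, end, label)


def absorb_short_intervals(entries: List[Interval], min_length: float) -> List[Interval]:
    """Drop intervals shorter than min_length and extend the previous interval over them.

    Hypothetical helper for illustration; a too-short first interval is kept
    because there is no earlier interval that could absorb it.
    """
    merged: List[Interval] = []
    for start, end, label in entries:
        if merged and (end - start) < min_length:
            # Too short: let the previous interval swallow this span.
            prev_start, _, prev_label = merged[-1]
            merged[-1] = (prev_start, end, prev_label)
        else:
            merged.append((start, end, label))
    return merged


entries = [(0.0, 1.2, "a"), (1.2, 1.2005, ""), (1.2005, 2.0, "b")]
cleaned = absorb_short_intervals(entries, min_length=0.001)
print(cleaned)
```

With a threshold of 0.001 the half-millisecond blank is folded into "a", so the tier keeps two intervals instead of three. With a far smaller threshold, such as the default mentioned above, the same blank would be left in place.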

@GalaxieT
Author

GalaxieT commented Apr 4, 2021

Thank you for your instructions! This helps with the problem a lot.
As for sppas, it doesn't cause any problem with praatio.
I separate a wav file into several parts and have SPPAS align them one by one. Then I load the output textgrid files and combine them. Some digits were lost because the textgrid file cannot hold the exact value of the floats. The save-and-load process produces larger inaccuracy than normal floating point rounding errors.
I wonder if it will be good to provide a parameter in the save function to control the behavior of dealing with blanks. For example, 'new' by default to insert new blank intervals, 'exception' to raise errors with tips to keep this condition for the user, and 'extend' to extend nearby existing intervals.
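
To make the three proposed behaviours concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the function `fill_gaps`, its `mode` argument, and the `(start, end, label)` tuples are assumptions for illustration and are not part of praatio's save function.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start, end, label)


def fill_gaps(entries: List[Interval], min_time: float, max_time: float,
              mode: str = "new") -> List[Interval]:
    """Cover [min_time, max_time] using one of three hypothetical strategies.

    'new'       inserts a blank interval into every hole
    'exception' raises ValueError at the first hole
    'extend'    stretches neighbouring intervals over the holes
    """
    if mode not in ("new", "exception", "extend"):
        raise ValueError(f"Unknown mode: {mode!r}")

    result: List[Interval] = []
    cursor = min_time  # end of the region covered so far
    for start, end, label in sorted(entries):
        if start > cursor:  # hole between cursor and this interval
            if mode == "exception":
                raise ValueError(f"Unannotated gap from {cursor} to {start}")
            if mode == "new":
                result.append((cursor, start, ""))
            elif result:
                # 'extend': stretch the previous interval up to this one.
                prev_start, _, prev_label = result[-1]
                result[-1] = (prev_start, start, prev_label)
            else:
                # 'extend' with no earlier interval: pull this one back.
                start = cursor
        result.append((start, end, label))
        cursor = end

    if cursor < max_time:  # hole after the last interval
        if mode == "exception":
            raise ValueError(f"Unannotated gap from {cursor} to {max_time}")
        if mode == "new":
            result.append((cursor, max_time, ""))
        elif result:
            prev_start, _, prev_label = result[-1]
            result[-1] = (prev_start, max_time, prev_label)
    return result


entries = [(0.5, 1.0, "a"), (1.5, 2.0, "b")]
print(fill_gaps(entries, 0.0, 2.5, mode="new"))
print(fill_gaps(entries, 0.0, 2.5, mode="extend"))
```

In this sketch 'extend' keeps the number of intervals unchanged, which is the property a downstream script would rely on, while 'new' grows the tier by one interval per hole.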

@timmahrt
Owner

timmahrt commented Apr 4, 2021

The save-and-load process produces larger inaccuracy than normal floating point rounding errors.

For a textgrid produced by praat, praatio should be able to load and save the textgrid without modifying the original timestamps. If you have an example where save-and-load is distorting the textgrid, please share. That's a bug that would need to be fixed.

As for sppas, it doesn't cause any problem with praatio.

I think I misunderstood your original problem then. You first wrote:

I noticed that when saving the textgrid file, praatio would try to fill up the tiers with new blank intervals, which seems not to be quite friendly to automatic aligning.

By that, I thought you meant that praatio was making the use of SPPAS more difficult? Do you mean that it is more annoying to work with your textgrids in python? Or?

For example, 'new' by default to insert new blank intervals, 'exception' to raise errors with tips to keep this condition for the user, and 'extend' to extend nearby existing intervals.

I'm not sure about the behaviour of extend. It's already supported by minimumIntervalLength, right? exception could be useful for debugging purposes. And I guess there should be another option None which will cause praatio to leave unannotated segments alone (not insert empty intervals)?

@GalaxieT
Author

GalaxieT commented Apr 4, 2021

for name in name_list:
    tg_dict = tgio.openTextgrid(f'{cwd}\\{SPPAS_DIR_NAME}\\{SPPAS_OUTPUT_FOLDER_NAME}\\{name}-palign.TextGrid').tierDict
    tg_dict_tokens = tgio.openTextgrid(f'{cwd}\\{SPPAS_DIR_NAME}\\{SPPAS_OUTPUT_FOLDER_NAME}\\{name}-token.TextGrid', readRaw=True).tierDict
    phon_tier_list.append(tg_dict['PhonAlign'])
    token_tier_list.append(tg_dict_tokens['Tokens'])
phon_base_tier = phon_tier_list[0]
for tier in phon_tier_list[1:]:
    phon_base_tier = phon_base_tier.appendTier(tier)
token_base_tier = token_tier_list[0]
for tier in token_tier_list[1:]:
    token_base_tier = token_base_tier.appendTier(tier)

final_tg = tgio.Textgrid()
final_tg.addTier(phon_base_tier)
final_tg.addTier(token_base_tier)
final_tg.save(f'{cwd}\\{SPPAS_DIR_NAME}\\{SPPAS_OUTPUT_FOLDER_NAME}\\{basename}-palign.TextGrid')

The related files are attached below. There is a short blank interval at the end of the 'PhonAlign' tier in the textgrid whose filename does not contain 'seg'.

files.zip

Do you mean that it is more annoying to work with your textgrids in python?

Kind of, but not annoying indeed. I didn't expect this behavior. I thought that an interval would be there for only two reasons: my program added it, or it originally existed. This might matter less if the saved textgrid is only for manual use.

I'm not sure about the behaviour of extend. It's already supported by minimumIntervalLength, right? exception could be useful for debugging purposes. And I guess there should be another option None which will cause praatio to leave unannotated segments alone (not insert empty intervals)?

The None option will be very useful if a tier that is not filled up doesn't cause errors. As for "extend", this option can handle blank areas much longer than minimumIntervalLength, though substantially moving the boundaries of the intervals to fill long blank areas might not be a good choice for accurate marking.

@timmahrt
Owner

timmahrt commented Apr 4, 2021

Thanks, I'll give it a try.

I've got the code done for both change requests and should be pushing out a release in a bit.

@timmahrt
Owner

timmahrt commented Apr 4, 2021

The related files are attached below.

Ah, I don't see the attached files. Nvm. I got it. 😅

@timmahrt
Owner

timmahrt commented Apr 4, 2021

I've released praatio 4.3.0. Textgrid.save() has a new parameter ignoreBlankSpaces. By default it is False but if True, it should implement the functionality you were looking for.

I already had written this before our latest conversation. What is missing is the exception throwing behaviour and extend. I'm leaning towards not integrating extend into the save functionality, mostly because I think it doesn't apply to the general case.

For the exception throwing behaviour, I can see it being useful for debugging but I don't quite understand when we would expect to use it. "I'd like to save this textgrid that I expect is full, but if there are any holes, throw an exception." Ideally praatio doesn't create holes.

I didn't have a chance to look at the files you provided yet but that might illustrate the need for it.

Kind of, but not annoying indeed. I didn't expect this behavior. I thought that an interval would be there for only two reasons: my program added it, or it originally existed. This might matter less if the saved textgrid is only for manual use.

There is a lot of complex behaviour in loading and saving; and lots of configurations. I've been thinking of how I can distill that down and make it simpler. I've been thinking about adjusting the interface for some time now. No concrete ideas or plans though.

@timmahrt
Owner

timmahrt commented Apr 4, 2021

It's late here so I only took a cursory look at your data. Just to be clear, your concern is the very final entry (and not other blank entries)?

126.60643700000001
130.730312
"#"
130.730312
130.7303125
""

That does indeed look like some sort of truncation problem. I'll investigate further--tomorrow if I have time.

@GalaxieT
Author

GalaxieT commented Apr 4, 2021

The work is so fast and good!
The reason I recommended the exception-throwing is that I thought a tier with holes would cause problems for praat to read, and so it would be illegal to leave any holes there. If the holes do not cause problems or errors (probably), such a design will be unnecessary.

Ideally praatio doesn't create holes.

Is it possible to add nonadjacent Intervals directly to a newly created IntervalTier? If so, holes might be manually created by such low-level operations.
As for the simplicity, I don't have very good ideas either. One small thought is that by default, the program should do as little as possible (avoiding errors) to the original data. Other more complex behaviors can be called by additional parameters.
The part of file you show is just where the problem happened.

@timmahrt
Owner

timmahrt commented Apr 4, 2021

The reason I recommended the exception-throwing is that I thought a tier with holes would cause problems for praat to read, and so it would be illegal to leave any holes there. If the holes do not cause problems or errors (probably), such a design will be unnecessary.

Holes cannot be selected in praat, so it is a problem. I think there are two kinds of holes: intentional holes (eg imagine a long pause between two sentences--I think it should be filled in with a blank) and unintentional holes (rounding errors etc).
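
One way to picture the distinction between the two kinds of holes is to list the gaps in a tier and sort them by size. The sketch below is plain Python and only illustrative: the helper names `find_gaps` and `classify_gaps` are hypothetical, and the 0.001 second cut-off between a "rounding" gap and a "pause" is an assumption, not something praatio defines.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start, end, label)
Gap = Tuple[float, float]            # (start, end) of an unannotated region


def find_gaps(entries: List[Interval], min_time: float, max_time: float) -> List[Gap]:
    """Return every unannotated region between min_time and max_time."""
    gaps: List[Gap] = []
    cursor = min_time
    for start, end, _ in sorted(entries):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if max_time > cursor:
        gaps.append((cursor, max_time))
    return gaps


def classify_gaps(gaps: List[Gap], tolerance: float = 0.001) -> List[str]:
    """Label each gap 'rounding' (shorter than tolerance) or 'pause' (assumed cut-off)."""
    return ["rounding" if (end - start) < tolerance else "pause" for start, end in gaps]


# The last two timestamps are taken from the file excerpt earlier in this thread.
entries = [(0.0, 1.0, "a"), (3.0, 126.60643700000001, "b"),
           (126.60643700000001, 130.730312, "#")]
gaps = find_gaps(entries, 0.0, 130.7303125)
kinds = classify_gaps(gaps)
print(list(zip(gaps, kinds)))
```

Here the two-second hole is treated as a pause that could sensibly be filled with a blank, while the trailing hole of about half a microsecond is flagged as a likely rounding artifact.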

Is it possible to add nonadjacent Intervals directly to a newly created IntervalTier? If so, holes might be manually created by such low-level operations.

That is absolutely possible.

One small thought is that by default, the program should do as little as possible (avoiding errors) to the original data.

Praatio's behavior is mostly based on my original use case 🙈 --I worked with long recordings which naturally have lots of holes and post-processed the textgrids manually in praat, so I wanted the holes filled. Of course, this won't apply to everyone/most people.

I'm going to think on it a bit. 🤔

The part of file you show is just where the problem happened.

I found the problem. 家传秘方_Recording_10_seg_3-palign.TextGrid and 家传秘方_Recording_10_seg_3-token.TextGrid have a different max length (19.791687 and 19.7916875 respectively).

So 3-palign has an interval that runs to the end of the file. When the two tiers are appended to the textgrid, the textgrid forces all to have the same maxTimestamp. This causes the palign tier to become 0.0000005 seconds longer, creating a microgap [EDIT: between the final entry in the tier, which ends at the old end of the file, and the new end of the file].

Did those come from SPPAS directly?

If you already have a lot of data beyond just these files, I would run a function that checks the length of each pair.
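
A check along those lines could be as small as the sketch below. It is plain Python with a hypothetical helper name, `find_mismatched_pairs`; the end times are passed in as numbers, on the assumption that they were read beforehand from each textgrid's maxTimestamp. The example values are the two lengths quoted above.

```python
import math
from typing import Dict, Tuple


def find_mismatched_pairs(max_times: Dict[str, Tuple[float, float]],
                          tolerance: float = 1e-9) -> Dict[str, float]:
    """Map each segment name to the absolute difference between its two end times.

    max_times holds (palign_max, token_max) per segment; only pairs that differ
    by more than tolerance are returned. The tolerance is an assumed default.
    """
    mismatched: Dict[str, float] = {}
    for name, (palign_max, token_max) in max_times.items():
        if not math.isclose(palign_max, token_max, rel_tol=0.0, abs_tol=tolerance):
            mismatched[name] = abs(palign_max - token_max)
    return mismatched


max_times = {
    "seg_3": (19.791687, 19.7916875),  # lengths quoted in this comment
    "seg_4": (12.5, 12.5),             # made-up matching pair for contrast
}
mismatched = find_mismatched_pairs(max_times)
for name, diff in mismatched.items():
    print(f"{name}: end times differ by {diff:.7f} s")
```

Any pair reported here would produce a microgap once both tiers are placed in the same textgrid.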

For praatio, I think some sort of optional validation is needed but I'm not sure what it should look like. I'm going to open a new issue for validation.
#26

@GalaxieT
Author

GalaxieT commented Apr 5, 2021

I think filling the holes is better for manual processes, while not filling them is better for programs to recognize [EDIT: because the program importing praatio.save as a black box can hardly predict whether a hole will be filled].

The cause of the problem is totally different from what I thought 🤣 The files were produced by SPPAS directly.
I'm not sure what praatio can do to this problem, but it would be very nice if it can deal with such conditions.

@timmahrt
Owner

timmahrt commented Apr 5, 2021

I think filling the holes is better for manual processes, while not filling them is better for programs to recognize [EDIT: because the program importing praatio.save as a black box can hardly predict whether a hole will be filled].

One solution could be to make the optional parameters non-optional [eg ignoreBlankSpaces]. This would raise awareness that save could potentially modify the user's data.

The files were produced by SPPAS directly.

In that case, I think it's worth contacting the SPPAS author with the input and output files explaining the situation. It seems like a bug.

@GalaxieT
Author

GalaxieT commented Apr 5, 2021

One solution could be to make the optional parameters non-optional.

That's a concise solution. If so, it's better that most of the potential modifications can be reflected by the parameters.

In that case, I think it's worth contacting the SPPAS author with the input and output files explaining the situation.

OK, I'll do it later. I've contacted her about other issues in SPPAS. She is also a very kind and helpful developer😉

@timmahrt
Owner

timmahrt commented Apr 5, 2021

That's a concise solution. If so, it's better that most of the potential modifications can be reflected by the parameters.

This will break backwards compatibility. There are some other breaking changes I want to make, so I will bundle them together in a major change (praatio 5.0). I will try to push out a release this month if I can. I've made an issue: #27

@GalaxieT
Author

GalaxieT commented Apr 5, 2021

That's great. Thanks for your efforts and patience!

@timmahrt
Owner

timmahrt commented Aug 9, 2021

One solution could be to make the optional parameters non-optional [eg ignoreBlankSpaces]. This would raise awareness that save could potentially modify the user's data.

I've just released Praatio 5.0 which adds two new required parameters: format and includeBlankSpaces

tg.save(
    fn=name,
    format="short_textgrid",
    includeBlankSpaces=False,
)

Thank you for your help in guiding this new feature!

@GalaxieT
Author

GalaxieT commented Aug 9, 2021

I'm so glad to make some contributions!

GalaxieT closed this as completed Aug 9, 2021