Phone times? #20

Closed

eremingt opened this issue Jan 22, 2021 · 5 comments
@eremingt

Would it be straightforward to modify Allosaurus to return the approximate times of the recognized phones?

Also, I’m a novice in this area, but for what it’s worth, very impressive tool!

@xinjli (Owner) commented Jan 22, 2021

Hi,

Thanks for your comments!
The underlying recognition model was trained with CTC, which means the timestamps may not be very accurate, so any timestamp here is only an approximation.
But it should not take much effort to do this.
You can modify the following part of lm/decoder.py, where the decoding takes place:

        # find all emitting frames
        for i in range(len(logits)):

            logit = logits[i]
            logit[0] /= blank_factor

            arg_max = np.argmax(logit)

            # this is an emitting frame
            if arg_max != cur_max_arg and arg_max != 0:
                emit_frame_idx.append(i)
                cur_max_arg = arg_max

Basically, the loop variable i is the frame index, which serves as the timestamp indicator: each frame has a duration of 200 ms and shifts by 100 ms per frame.

Each recognized phone usually has one emitting frame corresponding to it, so you can find that frame's index and compute the timestamp from the 100 ms shift.
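
Something like this, assuming you return emit_frame_idx from compute() (a minimal sketch; frame_shift_sec is a name I'm introducing here):

    # Sketch: turn emitting-frame indices into approximate timestamps,
    # assuming a fixed 100 ms shift per frame.
    frame_shift_sec = 0.1
    timestamps = [idx * frame_shift_sec for idx in emit_frame_idx]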

@eremingt (Author)

Thanks! Will try this.

@eremingt (Author)

Started looking into this - returning emit_frame_idx through lm/decoder.compute() and then app.recognize(). However, I'm getting much higher indexes than I'd expect for a 100 ms shift per frame (ones that give times much longer than the length of the sample).

Is it possible that the frame duration is actually 75 ms with a shift of 30 ms? Looking at pm/feature.mfcc(), the default winstep is 0.01 and winlen is 0.025. Then in pm/mfcc.compute(), there is this block:

    # subsampling and windowing
    if self.feature_window == 3:
        feat = feature_window(feat)

Does pm/utils.feature_window() concatenate frames in groups of three?
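
If so, my guess at what that stacking could look like (just a sketch of my assumption, not the actual pm/utils.feature_window() code):

    import numpy as np

    def stack_frames(feat, window=3):
        # Assumed behavior: concatenate every `window` consecutive frames
        # into one wider frame, subsampling the frame rate by the same factor.
        n_frames = (feat.shape[0] // window) * window
        return feat[:n_frames].reshape(-1, window * feat.shape[1])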

@xinjli (Owner) commented Jan 25, 2021

Hi,

Yeah, sorry, you are correct.
The pretrained model concatenates 3 frames into 1, so each observed frame actually has a 66 ms duration and a 33 ms shift.
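
So the earlier timestamp sketch would use a 33 ms shift instead, e.g.:

    # Each observed (stacked) frame shifts by roughly 33 ms.
    frame_shift_sec = 0.033
    timestamps = [idx * frame_shift_sec for idx in emit_frame_idx]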

@eremingt (Author)

I decided to return the (approximate) relative position rather than the index, which I assume will be robust to any change in the step-size parameters:

    # find all emitting frames
    for i in range(len(logits)):

        logit = logits[i]
        logit[0] /= blank_factor

        arg_max = np.argmax(logit)

        # this is an emitting frame
        if arg_max != cur_max_arg and arg_max != 0:
            emit_frame_idx.append(i)
            cur_max_arg = arg_max

    # Position of emitted frame in recording (don't need to know step size)
    emit_frame_position = [idx/len(logits) for idx in emit_frame_idx]
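
If absolute times are needed downstream, the relative positions can be scaled by the recording's total duration; for example (reading the duration from the header of a hypothetical sample.wav):

    import wave

    # Recover approximate times by scaling relative positions by the
    # total duration of the recording.
    with wave.open("sample.wav") as w:
        duration_sec = w.getnframes() / w.getframerate()

    emit_times = [pos * duration_sec for pos in emit_frame_position]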

Thanks again for your help!
