Phone times? #20

Closed

eremingt opened this issue Jan 22, 2021 · 5 comments
@eremingt

Would it be straightforward to modify Allosaurus to return the approximate times of the recognized phones?

Also, I’m a novice in this area, but for what it’s worth, very impressive tool!

@xinjli (Owner) commented Jan 22, 2021

Hi,

Thanks for your comments!
The underlying recognition model was trained with CTC, which means the timestamps may not be very accurate, so any timestamp here is only an approximation.
But it should not take much effort to do this.
You can modify the following part of lm/decoder.py, where the decoding takes place:

        # find all emitting frames
        for i in range(len(logits)):

            logit = logits[i]
            logit[0] /= blank_factor

            arg_max = np.argmax(logit)

            # this is an emitting frame
            if arg_max != cur_max_arg and arg_max != 0:
                emit_frame_idx.append(i)
                cur_max_arg = arg_max

Basically, the loop variable i is the frame index, which serves as the timestamp indicator: each frame has a duration of 200 ms and shifts by 100 ms per frame.

Each recognized phone usually has one emitting frame corresponding to it, so you can find that frame's index and compute the timestamp from the 100 ms shift.
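
Something like this, assuming you return emit_frame_idx from compute() (a minimal sketch; frame_shift_sec is a name I'm introducing here):

    # Sketch: turn emitting-frame indices into approximate timestamps,
    # assuming a fixed 100 ms shift per frame.
    frame_shift_sec = 0.1
    timestamps = [idx * frame_shift_sec for idx in emit_frame_idx]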

@eremingt (Author)

Thanks! Will try this.

@eremingt (Author)

Started looking into this - returning emit_frame_idx through lm/decoder.compute() and then app.recognize(). However, I'm getting much higher indexes than I'd expect for a 100 ms shift per frame (ones that give times much longer than the length of the sample).

Is it possible that the frame duration is actually 75 ms with a shift of 30 ms? Looking at pm/feature.mfcc(), the default winstep is 0.01 and winlen is 0.025. Then in pm/mfcc.compute(), there is this block:

    # subsampling and windowing
    if self.feature_window == 3:
        feat = feature_window(feat)

Does pm/utils.feature_window() concatenate frames in groups of three?
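
If so, my guess at what that stacking could look like (just a sketch of my assumption, not the actual pm/utils.feature_window() code):

    import numpy as np

    def stack_frames(feat, window=3):
        # Assumed behavior: concatenate every `window` consecutive frames
        # into one wider frame, subsampling the frame rate by the same factor.
        n_frames = (feat.shape[0] // window) * window
        return feat[:n_frames].reshape(-1, window * feat.shape[1])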

@xinjli (Owner) commented Jan 25, 2021

Hi,

Yeah, sorry, you are correct.
The pretrained model concatenates 3 frames into 1, so each observed frame actually has a 66 ms duration and a 33 ms shift.
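
So the earlier timestamp sketch would use a 33 ms shift instead, e.g.:

    # Each observed (stacked) frame shifts by roughly 33 ms.
    frame_shift_sec = 0.033
    timestamps = [idx * frame_shift_sec for idx in emit_frame_idx]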

@eremingt (Author)

I decided to return the (approximate) relative position rather than the index, which I assume will be robust to any change in the step-size parameters:

    # find all emitting frames
    for i in range(len(logits)):

        logit = logits[i]
        logit[0] /= blank_factor

        arg_max = np.argmax(logit)

        # this is an emitting frame
        if arg_max != cur_max_arg and arg_max != 0:
            emit_frame_idx.append(i)
            cur_max_arg = arg_max

    # Position of emitted frame in recording (don't need to know step size)
    emit_frame_position = [idx/len(logits) for idx in emit_frame_idx]
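
If absolute times are needed downstream, the relative positions can be scaled by the recording's total duration; for example (reading the duration from the header of a hypothetical sample.wav):

    import wave

    # Recover approximate times by scaling relative positions by the
    # total duration of the recording.
    with wave.open("sample.wav") as w:
        duration_sec = w.getnframes() / w.getframerate()

    emit_times = [pos * duration_sec for pos in emit_frame_position]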

Thanks again for your help!
