Always chunking because of len #44

zmlee0514 · 2023-12-19T03:50:48Z

I tested whisper_online.py on a 10-minute video, but it always chunks when the buffer reaches 30 s rather than at sentence boundaries.
Running log: AI_pin.txt

I found that this is caused by the string comparison in the words_to_sentences function, so I added strip() to it. I am not sure whether this is the right fix; my goal is for Whisper to output one sentence per line.

def words_to_sentences(self, words):
    """Uses self.tokenizer for sentence segmentation of words.
    Returns: [(beg,end,"sentence 1"),...]
    """
    
    cwords = [w for w in words]
    t = " ".join(o[2] for o in cwords)
    s = self.tokenizer.split(t)
    out = []
    while s:
        beg = None
        end = None
        sent = s.pop(0).strip()
        fsent = sent
        while cwords:
            b,e,w = cwords.pop(0)
            if beg is None and sent.startswith(w.strip()):
                beg = b
            elif end is None and sent == w.strip():
                end = e
                out.append((beg,end,fsent))
                break
            sent = sent[len(w):].strip()
    return out

The result log: AI_pin.txt


Gldkslfmsd · 2023-12-19T11:51:17Z

Hi,
can you please specify which back-end you use, faster-whisper or whisper_timestamped? I think it has an impact.

Gldkslfmsd · 2023-12-19T12:39:01Z

OK, thanks for the feedback. #36 is a correct fix; I'm merging it.

zmlee0514 · 2023-12-21T06:49:01Z

It still fails when I run with Whisper large-v3. My command uses the defaults for all arguments except the model:
python whisper_online.py audio.wav --model large-v3

Support for the large-v3 model in faster-whisper was added only recently, so it may not have been tested. However, I found that the cause is an abnormal string such as " . Welcome to Humane. This is the Humane AI pin.". It is split into 3 sentences: [" .", "Welcome to Humane.", "This is the Humane AI pin."]. For the first sentence, " .", only beg is assigned, never end. This gap breaks all subsequent processing.
It can be solved simply by handling the one-word-sentence case, but that is not elegant. I should study your algorithm first.
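
The one-word-sentence failure can be reproduced in isolation. What follows is a minimal sketch, not the project's code and not necessarily the fix that should be merged. NaiveSplitter is a hypothetical regex-based stand-in for self.tokenizer, the allow_single_word flag is invented here only to compare the two behaviours, and the timestamps are made up. Unlike the function posted above, this sketch strips each word once before joining, so the stray sentence appears as "." rather than " .".

```python
import re


class NaiveSplitter:
    """Hypothetical stand-in for self.tokenizer: splits after '.', '!' or '?'."""

    def split(self, text):
        return [p for p in re.split(r"(?<=[.!?])\s+", text) if p]


def words_to_sentences(words, tokenizer, allow_single_word=True):
    """Map (beg, end, word) tuples to (beg, end, sentence) tuples.

    allow_single_word=False mimics the if/elif structure of the function
    posted above, where the word that sets beg can never also set end.
    """
    # Strip once up front so every comparison sees the same form of each word.
    cwords = [(b, e, w.strip()) for b, e, w in words]
    text = " ".join(w for _, _, w in cwords)
    out = []
    for fsent in (s.strip() for s in tokenizer.split(text)):
        sent = fsent
        beg = None
        while cwords:
            b, e, w = cwords.pop(0)
            if beg is None and sent.startswith(w):
                beg = b
                # A one-word sentence starts and ends on this same word.
                is_end = allow_single_word and sent == w
            else:
                is_end = sent == w
            if is_end:
                out.append((beg, e, fsent))
                break
            # Consume the matched word and continue with the rest of the sentence.
            sent = sent[len(w):].strip()
    return out


# Made-up timestamps; the words carry leading spaces, as in the report above.
words = [
    (0.0, 0.1, " ."), (0.1, 0.5, " Welcome"), (0.5, 0.7, " to"),
    (0.7, 1.2, " Humane."), (1.2, 1.5, " This"), (1.5, 1.7, " is"),
    (1.7, 1.9, " the"), (1.9, 2.4, " Humane"), (2.4, 2.6, " AI"),
    (2.6, 3.0, " pin."),
]

fixed = words_to_sentences(words, NaiveSplitter())
broken = words_to_sentences(words, NaiveSplitter(), allow_single_word=False)
print(fixed)   # three sentences, the first being the lone "."
print(broken)  # [] because the lone "." never gets an end
```

With allow_single_word=False, the "." sentence sets beg, skips the end check, and the inner loop then consumes every remaining word without a match, so no sentence is returned at all. Whether a lone "." should be emitted as its own sentence or dropped is a separate design question that this sketch does not settle.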

Gldkslfmsd · 2023-12-21T10:51:10Z

OK. Please reopen if you're sure you found a bug.

v3 is an unrelated issue: #45

Gldkslfmsd closed this as completed Dec 19, 2023

Gldkslfmsd mentioned this issue Dec 21, 2023

large-v3 model #45

Closed
