## Using Stable-ts with any ASR
Stable-ts can be used with other ASR models by wrapping their output in a `WhisperResult` object.

In [1]:
import stable_whisper
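# Compare version components as integers; stripping the dots and comparing one integer
# misorders versions such as '3.0' (which would become 30). Assumes numeric components.
version = tuple(int(part) for part in stable_whisper.__version__.split('.')[:3])
assert version >= (2, 6, 1), f"Requires Stable-ts 2.6.1+. Current version is {stable_whisper.__version__}."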

<br />

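To initialize a `WhisperResult` from arbitrary data, the data must follow one of the mappings below. The first is a list of segments, where each segment is a list of word dictionaries with `word`, `start`, and `end` keys.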

In [2]:
demo = [
    [   # 1st Segment
        {'word': ' And', 'start': 0.0, 'end': 1.28}, 
        {'word': ' when', 'start': 1.28, 'end': 1.52}, 
        {'word': ' no', 'start': 1.52, 'end': 2.26}, 
        {'word': ' ocean,', 'start': 2.26, 'end': 2.68},
        {'word': ' mountain,', 'start': 3.28, 'end': 3.58}
    ], 
    [   # 2nd Segment
        {'word': ' or', 'start': 4.0, 'end': 4.08}, 
        {'word': ' sky', 'start': 4.08, 'end': 4.56}, 
        {'word': ' could', 'start': 4.56, 'end': 4.84}, 
        {'word': ' contain', 'start': 4.84, 'end': 5.26}, 
        {'word': ' us,', 'start': 5.26, 'end': 6.27},
        {'word': ' our', 'start': 6.27, 'end': 6.58}, 
        {'word': ' gaze', 'start': 6.58, 'end': 6.98}, 
        {'word': ' hungered', 'start': 6.98, 'end': 7.88}, 
        {'word': ' starward.', 'start': 7.88, 'end': 8.64}
    ]
]
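As an illustration of getting another model's output into the mapping above, here is a minimal sketch. The input format, `(text, start_ms, end_ms)` tuples grouped per segment, is purely an assumption for this example, and `to_word_mapping` is a hypothetical helper, not part of Stable-ts.

```python
# Hypothetical output of some other ASR model: one list per segment,
# each word given as (text, start_ms, end_ms). This format is an assumption.
asr_output = [
    [('And', 0, 1280), ('when', 1280, 1520)],
    [('or', 4000, 4080), ('sky', 4080, 4560)],
]


def to_word_mapping(asr_segments):
    """Convert (text, start_ms, end_ms) tuples into a list of segments of word dicts."""
    return [
        [
            # The leading space mirrors the demo data, so words stay separated
            # when they are joined into segment text.
            {'word': ' ' + text, 'start': start_ms / 1000, 'end': end_ms / 1000}
            for text, start_ms, end_ms in segment
        ]
        for segment in asr_segments
    ]


converted = to_word_mapping(asr_output)
print(converted[0][0])
```

The result has the same shape as `demo`, so it could be passed to `WhisperResult` in the same way.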

<br />

If word timings are not available, they can be omitted and each segment given as a dictionary with `start`, `end`, and `text`. Fewer operations can be performed on such data.

In [3]:
no_word_demo = [
    {
        'start': 0.0, 
        'end': 3.58, 
        'text': ' And when no ocean, mountain,',
    }, 
    {
        'start': 4.0, 
        'end': 8.64, 
        'text': ' or sky could contain us, our gaze hungered starward.', 
    }
]
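The word-level and segment-level mappings describe the same speech at different granularity. The sketch below collapses the former into the latter by taking the first word's start, the last word's end, and the joined word text. `words_to_segments` is a hypothetical helper written for this example only.

```python
def words_to_segments(word_segments):
    """Collapse a word-level mapping into the segment-level mapping."""
    return [
        {
            'start': words[0]['start'],   # start of the first word
            'end': words[-1]['end'],      # end of the last word
            'text': ''.join(w['word'] for w in words),
        }
        for words in word_segments
        if words  # skip empty segments
    ]


word_level = [
    [
        {'word': ' ocean,', 'start': 2.26, 'end': 2.68},
        {'word': ' mountain,', 'start': 3.28, 'end': 3.58},
    ],
    [
        {'word': ' or', 'start': 4.0, 'end': 4.08},
        {'word': ' sky', 'start': 4.08, 'end': 4.56},
    ],
]

segment_level = words_to_segments(word_level)
print(segment_level)
```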

<br />

Below is the full mapping used by regular Stable-ts results. Any omitted value is replaced with `None`, except for `start`, `end`, and `text`/`word`, which are required.

In [4]:
full_demo = {
    'language': 'en',
    'text': ' And when no ocean, mountain, or sky could contain us, our gaze hungered starward.', 
    'segments': [
        {
            'seek': 0.0, 
            'start': 0.0, 
            'end': 3.58, 
            'text': ' And when no ocean, mountain,', 
            'tokens': [400, 562, 572, 7810, 11, 6937, 11], 
            'temperature': 0.0, 
            'avg_logprob': -0.48702024376910663, 
            'compression_ratio': 1.0657894736842106, 
            'no_speech_prob': 0.3386174440383911, 
            'id': 0, 
            'words': [
                {'word': ' And', 'start': 0.04, 'end': 1.28, 'probability': 0.6481522917747498, 'tokens': [400]}, 
                {'word': ' when', 'start': 1.28, 'end': 1.52, 'probability': 0.9869539141654968, 'tokens': [562]}, 
                {'word': ' no', 'start': 1.52, 'end': 2.26, 'probability': 0.57384192943573, 'tokens': [572]}, 
                {'word': ' ocean,', 'start': 2.26, 'end': 2.68, 'probability': 0.9484889507293701, 'tokens': [7810, 11]},
                {'word': ' mountain,', 'start': 3.28, 'end': 3.58, 'probability': 0.9581122398376465, 'tokens': [6937, 11]}
            ]
        }, 
        {
            'seek': 0.0, 
            'start': 4.0, 
            'end': 8.64, 
            'text': ' or sky could contain us, our gaze hungered starward.', 
            'tokens': [420, 5443, 727, 5304, 505, 11, 527, 24294, 5753, 4073, 3543, 1007, 13], 
            'temperature': 0.0, 
            'avg_logprob': -0.48702024376910663, 
            'compression_ratio': 1.0657894736842106, 
            'no_speech_prob': 0.3386174440383911, 
            'id': 1, 
            'words': [
                {'word': ' or', 'start': 4.0, 'end': 4.08, 'probability': 0.9937937259674072, 'tokens': [420]}, 
                {'word': ' sky', 'start': 4.08, 'end': 4.56, 'probability': 0.9950089454650879, 'tokens': [5443]}, 
                {'word': ' could', 'start': 4.56, 'end': 4.84, 'probability': 0.9915681481361389, 'tokens': [727]}, 
                {'word': ' contain', 'start': 4.84, 'end': 5.26, 'probability': 0.898974597454071, 'tokens': [5304]}, 
                {'word': ' us,', 'start': 5.26, 'end': 6.27, 'probability': 0.999351441860199, 'tokens': [505, 11]},
                {'word': ' our', 'start': 6.27, 'end': 6.58, 'probability': 0.9634224772453308, 'tokens': [527]}, 
                {'word': ' gaze', 'start': 6.58, 'end': 6.98, 'probability': 0.8934874534606934, 'tokens': [24294]}, 
                {'word': ' hungered', 'start': 6.98, 'end': 7.88, 'probability': 0.7424876093864441, 'tokens': [5753, 4073]}, 
                {'word': ' starward.', 'start': 7.88, 'end': 8.64, 'probability': 0.464096799492836, 'tokens': [3543, 1007, 13]}
            ]
        }
    ]
}
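Since only `start`, `end`, and `text`/`word` are required, it can be useful to check hand-built data before loading it. The following is a rough sketch of such a check. `find_missing_required` and the key tuples are hypothetical names for this example and are not part of Stable-ts.

```python
# Required keys as described in the text above.
REQUIRED_SEGMENT_KEYS = ('start', 'end', 'text')
REQUIRED_WORD_KEYS = ('word', 'start', 'end')


def find_missing_required(mapping):
    """Return (location, key) pairs for required values that are absent or None."""
    missing = []
    for i, segment in enumerate(mapping.get('segments', [])):
        for key in REQUIRED_SEGMENT_KEYS:
            if segment.get(key) is None:
                missing.append((f'segments[{i}]', key))
        # 'words' may be omitted or None when word timings are unavailable.
        for j, word in enumerate(segment.get('words') or []):
            for key in REQUIRED_WORD_KEYS:
                if word.get(key) is None:
                    missing.append((f'segments[{i}].words[{j}]', key))
    return missing


sample = {
    'language': 'en',
    'segments': [
        {
            'start': 0.0, 'end': 1.52, 'text': ' And when', 'tokens': None,
            'words': [
                {'word': ' And', 'start': 0.0, 'end': 1.28, 'probability': None},
                {'word': ' when', 'start': 1.28, 'end': None},  # required value missing
            ],
        },
    ],
}

problems = find_missing_required(sample)
print(problems)
```

Optional keys such as `tokens` and `probability` are ignored by the check, so they can stay `None`.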

The data can now be loaded as a `WhisperResult` instance. *Note: instead of `demo`, the path of a JSON file containing data in one of the above mappings can also be passed.*

In [5]:
result = stable_whisper.WhisperResult(demo)
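For the JSON route mentioned in the note above, a file can be produced with the standard `json` module. This is a small sketch. The file name `demo.json`, the temporary directory, and the shortened `demo_subset` data are assumptions for the example.

```python
import json
import os
import tempfile

# A shortened copy of the word-level mapping, to keep the example small.
demo_subset = [
    [
        {'word': ' And', 'start': 0.0, 'end': 1.28},
        {'word': ' when', 'start': 1.28, 'end': 1.52},
    ]
]

# Write the mapping to a JSON file in a temporary directory.
json_path = os.path.join(tempfile.mkdtemp(), 'demo.json')
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(demo_subset, f, ensure_ascii=False, indent=2)

# Read it back to confirm the round trip preserves the mapping.
with open(json_path, encoding='utf-8') as f:
    reloaded = json.load(f)

# Per the note above, the path could then be used in place of the data:
# result = stable_whisper.WhisperResult(json_path)
```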

<br />

All the operations available for regular Stable-ts results can now be performed on this data.
One of them is post-inference silence suppression, which requires the audio file the data was transcribed from.

In [6]:
audio = './demo.wav'

#### Non-VAD Suppression

In [7]:
from stable_whisper.stabilization import wav2mask, mask2timing

In [8]:
nonvad_silent_timings = mask2timing(wav2mask(audio))
nonvad_silent_timings

(array([0.  , 0.38, 0.78, 1.06, 1.72, 5.92, 6.2 , 8.9 ]),
 array([0.04, 0.56, 0.96, 1.14, 2.  , 5.96, 6.36, 9.48]))

In [9]:
result.suppress_silence(*nonvad_silent_timings)

<stable_whisper.result.WhisperResult at 0x106fee670>

In [10]:
for new_seg, old_seg in zip(result.segments, demo):
    for new_word, old_word in zip(new_seg.words, old_seg):
        if new_word.start != old_word['start'] or new_word.end != old_word['end']:
            print(f"word: {new_word.word}\n"
                  f"start: {old_word['start']} -> {new_word.start}\n"
                  f"end: {old_word['end']} -> {new_word.end}\n")

word:  And
start: 0.0 -> 0.04
end: 1.28 -> 1.28

word:  us,
start: 5.26 -> 5.26
end: 6.27 -> 6.2

word:  our
start: 6.27 -> 6.36
end: 6.58 -> 6.58
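
The adjustments printed above are consistent with a simple rule: a word boundary that falls inside a silent interval is moved to the edge of that interval. The sketch below implements only that simplified rule, as an aid to reading the output. It is not the library's implementation, which may apply further conditions, and `snap_to_silence` is a hypothetical name.

```python
def snap_to_silence(words, silent_starts, silent_ends):
    """Move word boundaries that fall inside a silent interval to its edge."""
    adjusted = []
    for word in words:
        start, end = word['start'], word['end']
        for s, e in zip(silent_starts, silent_ends):
            # Word starts inside the silence: push the start to the silence end,
            # but only if the word keeps a positive duration.
            if s <= start < e and e < end:
                start = e
            # Word ends inside the silence: pull the end back to the silence start.
            if s < end <= e and s > start:
                end = s
        adjusted.append({**word, 'start': start, 'end': end})
    return adjusted


# Silent intervals copied from the non-VAD output above.
silent_starts = [0.0, 0.38, 0.78, 1.06, 1.72, 5.92, 6.2, 8.9]
silent_ends = [0.04, 0.56, 0.96, 1.14, 2.0, 5.96, 6.36, 9.48]

words = [
    {'word': ' And', 'start': 0.0, 'end': 1.28},
    {'word': ' no', 'start': 1.52, 'end': 2.26},
    {'word': ' us,', 'start': 5.26, 'end': 6.27},
    {'word': ' our', 'start': 6.27, 'end': 6.58},
]

snapped = snap_to_silence(words, silent_starts, silent_ends)
for old, new in zip(words, snapped):
    print(old['word'], (old['start'], old['end']), '->', (new['start'], new['end']))
```

Silence that lies entirely inside a word, such as 1.72 to 2.0 within ` no`, leaves the word untouched under this rule.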



#### VAD Suppression

In [11]:
from stable_whisper.stabilization import get_vad_silence_func

In [12]:
vad_silent_timings = get_vad_silence_func(verbose=None)(audio)
vad_silent_timings

Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /Users/will/.cache/torch/hub/master.zip


(array([0.   , 2.014, 3.07 , 6.046, 9.342]),
 array([1.122, 2.018, 3.074, 6.402, 9.483]))

In [13]:
result.suppress_silence(*vad_silent_timings)

<stable_whisper.result.WhisperResult at 0x106fee670>

In [14]:
for new_seg, old_seg in zip(result.segments, demo):
    for new_word, old_word in zip(new_seg.words, old_seg):
        if new_word.start != old_word['start'] or new_word.end != old_word['end']:
            print(f"word: {new_word.word}\n"
                  f"start: {old_word['start']} -> {new_word.start}\n"
                  f"end: {old_word['end']} -> {new_word.end}\n")

word:  And
start: 0.0 -> 1.122
end: 1.28 -> 1.28

word:  us,
start: 5.26 -> 5.26
end: 6.27 -> 6.046

word:  our
start: 6.27 -> 6.402
end: 6.58 -> 6.58



Another operation is regrouping the words into new segments.

In [15]:
for i, seg in enumerate(result.segments):
    print(f'{i}: {seg.start} -> {seg.end} {seg.text}')

0: 1.122 -> 3.58  And when no ocean, mountain,
1: 4.0 -> 8.64  or sky could contain us, our gaze hungered starward.


In [16]:
result.regroup()

<stable_whisper.result.WhisperResult at 0x106fee670>

In [17]:
for i, seg in enumerate(result.segments):
    print(f'{i}: {seg.start} -> {seg.end} {seg.text}')

0: 1.122 -> 2.68  And when no ocean,
1: 3.28 -> 3.58  mountain,
2: 4.0 -> 6.046  or sky could contain us,
3: 6.402 -> 8.64  our gaze hungered starward.
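
For this sample, the regrouped segments above break after each word that ends in punctuation. The sketch below reproduces only that behaviour. It is a simplified illustration and not the algorithm used by `regroup()`, which may apply additional rules such as gap-based splitting and merging. `split_after_punctuation` is a hypothetical helper.

```python
def split_after_punctuation(segments, marks=(',', '.', '?', '!')):
    """Start a new segment after every word that ends with one of `marks`."""
    regrouped = []
    for words in segments:
        current = []
        for word in words:
            current.append(word)
            if word['word'].rstrip().endswith(marks):
                regrouped.append(current)
                current = []
        if current:  # keep trailing words that had no closing punctuation
            regrouped.append(current)
    return regrouped


# Word text from the demo data; timings are omitted because this rule ignores them.
segments = [
    [{'word': w} for w in (' And', ' when', ' no', ' ocean,', ' mountain,')],
    [{'word': w} for w in (' or', ' sky', ' could', ' contain', ' us,',
                           ' our', ' gaze', ' hungered', ' starward.')],
]

regrouped_texts = [
    ''.join(w['word'] for w in seg) for seg in split_after_punctuation(segments)
]
for i, text in enumerate(regrouped_texts):
    print(f'{i}: {text}')
```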
