<a href="https://colab.research.google.com/github/robgon-art/faster-whisper-test/blob/main/CTranslate2_Whisper_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install ctranslate2 transformers librosa

In [None]:
!wget https://github.com/jordimas/whisper-cpp-error/raw/main/15GdH9-curt.mp3

In [None]:
!ct2-transformers-converter --model openai/whisper-tiny --output_dir whisper-tiny-ct2
!ct2-transformers-converter --model openai/whisper-tiny --output_dir whisper-tiny-ct2-int8 --quantization int8

In [None]:
import ctranslate2
import librosa
import transformers

# Load and resample the audio file.
audio, _ = librosa.load("/content/15GdH9-curt.mp3", sr=16000, mono=True)

# Compute the features of the first 30 seconds of audio.
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-tiny")
inputs = processor(audio, return_tensors="np", sampling_rate=16000)
features = ctranslate2.StorageView.from_array(inputs.input_features)

# Describe the task in the prompt.
# See the prompt format in https://github.com/openai/whisper.
prompt = processor.tokenizer.convert_tokens_to_ids(
    [
        "<|startoftranscript|>",
        "<|ca|>",
        "<|transcribe|>",
        "<|notimestamps|>",  # Remove this token to generate timestamps.
    ]
)
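
The token list above follows Whisper's prompt layout: start token, language, task, then an optional marker that disables timestamps. As a minimal sketch, the same list could be assembled for other settings with a small helper. `build_prompt_tokens` is a hypothetical name introduced here for illustration; it is not part of `transformers` or CTranslate2.

```python
def build_prompt_tokens(language="ca", task="transcribe", timestamps=False):
    """Assemble Whisper prompt tokens (hypothetical helper, illustration only)."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    # Leaving out <|notimestamps|> asks the model to generate timestamps.
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


# Same four tokens as the hand-written list in the cell above.
print(build_prompt_tokens("ca", "transcribe"))
```

The returned strings would still need to go through `processor.tokenizer.convert_tokens_to_ids`, as in the cell above, before being passed to `model.generate`.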

In [6]:
# Load the int8 model on CPU.
model = ctranslate2.models.Whisper("whisper-tiny-ct2-int8")

# Detect the language.
results = model.detect_language(features)
language, probability = results[0][0]
print("Detected language: %s with probability: %f" % (language, probability))

# Show the encoder results
encoder_results = model.encode(features)
print("Encoder results:", encoder_results)

# Run generation for the 30-second window.
results = model.generate(features, [prompt], sampling_temperature=0, return_scores=True)
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)

Detected language: <|ca|> with probability: 0.966133
Encoder results:  0.102781 0.128416 0.00438075 ... 0.0382776 -0.605718 -0.177774
[cpu:0 float32 storage viewed as 1x1500x384]
 15 glaçons d'hydrogen. El podcast de l'engüa i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no passa, no passa en que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.
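
The cell above reads the top language guess with `results[0][0]`, that is, the first (token, probability) pair for the first audio in the batch. The sketch below shows one way to turn that pair into a plain language code. `top_language` is a hypothetical helper, and `mock_results` is a stand-in that copies only the top entry printed above rather than a real `detect_language` return value.

```python
def top_language(results):
    """Return (language_code, probability) for the first audio in the batch."""
    token, probability = results[0][0]
    # Strip the special-token markers, e.g. "<|ca|>" -> "ca".
    if token.startswith("<|") and token.endswith("|>"):
        token = token[2:-2]
    return token, probability


# Stand-in for the detect_language output, using the value printed above.
mock_results = [[("<|ca|>", 0.966133)]]
code, prob = top_language(mock_results)
print("Detected language: %s with probability: %f" % (code, prob))
```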


In [7]:
# Initial run
# Detected language: <|ca|> with probability: 0.974995
# Encoder results:  0.0998689 0.12546 0.0150658 ... 0.0410112 -0.61152 -0.173796
# [cpu:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'hydrogen. El podcast de l'engüa i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no passa, no passa en que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.

# Disconnect and rerun
# Detected language: <|ca|> with probability: 0.966133
# Encoder results:  0.102781 0.128416 0.00438075 ... 0.0382776 -0.605718 -0.177774
# [cpu:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'hydrogen. El podcast de l'engüa i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no passa, no passa en que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.

In [8]:
# Load the int8 model on GPU.
model = ctranslate2.models.Whisper("whisper-tiny-ct2-int8", device="cuda")

# Detect the language.
results = model.detect_language(features)
language, probability = results[0][0]
print("Detected language: %s with probability: %f" % (language, probability))

# Show the encoder results
encoder_results = model.encode(features)
print("Encoder results:", encoder_results)

# Run generation for the 30-second window.
results = model.generate(features, [prompt], sampling_temperature=0, return_scores=True)
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)

Detected language: <|ca|> with probability: 0.974822
Encoder results:  0.0937386 0.123954 0.0130554 ... 0.0373841 -0.595213 -0.162236
[cuda:0 float32 storage viewed as 1x1500x384]
 15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, aneu passant, aneu passant que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.


In [9]:
# Initial run
# Detected language: <|ca|> with probability: 0.978478
# Encoder results:  0.094764 0.123045 0.0108672 ... 0.0404294 -0.597082 -0.161252
# [cuda:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, aneu passant, aneu passant que la taverna s'ompla. Sembla que no, però ja som a la pisoni nou.

# Disconnect and rerun
# Detected language: <|ca|> with probability: 0.974822
# Encoder results:  0.0937386 0.123954 0.0130554 ... 0.0373841 -0.595213 -0.162236
# [cuda:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, aneu passant, aneu passant que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.

In [10]:
# Load the fp32 model on CPU.
model = ctranslate2.models.Whisper("whisper-tiny-ct2")

# Detect the language.
results = model.detect_language(features)
language, probability = results[0][0]
print("Detected language: %s with probability: %f" % (language, probability))

# Show the encoder results
encoder_results = model.encode(features)
print("Encoder results:", encoder_results)

# Run generation for the 30-second window.
results = model.generate(features, [prompt], sampling_temperature=0, return_scores=True)
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)

Detected language: <|ca|> with probability: 0.974038
Encoder results:  0.12497 0.117955 0.0298516 ... 0.0300566 -0.589291 -0.168419
[cpu:0 float32 storage viewed as 1x1500x384]
 15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no ha passat, no ha passat que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.


In [11]:
# Initial run
# Detected language: <|ca|> with probability: 0.974038
# Encoder results:  0.12497 0.117955 0.0298516 ... 0.0300566 -0.589291 -0.168419
# [cpu:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no ha passat, no ha passat que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.

# Disconnect and rerun
# Detected language: <|ca|> with probability: 0.974038
# Encoder results:  0.12497 0.117955 0.0298516 ... 0.0300567 -0.589291 -0.168419
# [cpu:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no ha passat, no ha passat que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.
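
The int8 and fp32 CPU transcriptions differ in only a few words, which is easier to see with a word-level diff than by eye. The sketch below uses the standard-library `difflib` on the two transcriptions recorded above; `word_diff` is a hypothetical helper name.

```python
import difflib

# Transcriptions copied from the recorded CPU outputs above.
int8_cpu = (
    "15 glaçons d'hydrogen. El podcast de l'engüa i tecnologia de sof català. "
    "Amb la és ideal. Vinga, vinga, no passa, no passa en que la taverna s'ompla. "
    "Sembla que no, però ja som a la pisòdinau."
)
fp32_cpu = (
    "15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. "
    "Amb la és ideal. Vinga, vinga, no ha passat, no ha passat que la taverna s'ompla. "
    "Sembla que no, però ja som a la pisòdinau."
)


def word_diff(a, b):
    """Return (old, new) pairs for every word span that differs between a and b."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return changes


changes = word_diff(int8_cpu, fp32_cpu)
for old, new in changes:
    print(f"int8: {old!r:20} fp32: {new!r}")
```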

In [12]:
# Load the fp32 model on GPU.
model = ctranslate2.models.Whisper("whisper-tiny-ct2", device="cuda")

# Detect the language.
results = model.detect_language(features)
language, probability = results[0][0]
print("Detected language: %s with probability: %f" % (language, probability))

# Show the encoder results
encoder_results = model.encode(features)
print("Encoder results:", encoder_results)

# Run generation for the 30-second window.
results = model.generate(features, [prompt], sampling_temperature=0, return_scores=True)
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)

Detected language: <|ca|> with probability: 0.974039
Encoder results:  0.124969 0.117955 0.029852 ... 0.0300568 -0.589291 -0.168419
[cuda:0 float32 storage viewed as 1x1500x384]
 15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no ha passat, no ha passat que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.


In [13]:
# Initial run
# Detected language: <|ca|> with probability: 0.974039
# Encoder results:  0.124969 0.117955 0.0298519 ... 0.0300569 -0.589291 -0.168419
# [cuda:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no ha passat, no ha passat que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.

# Disconnect and rerun
# Detected language: <|ca|> with probability: 0.974039
# Encoder results:  0.124969 0.117955 0.029852 ... 0.0300568 -0.589291 -0.168419
# [cuda:0 float32 storage viewed as 1x1500x384]
#  15 glaçons d'edragem. El podcast de l'engua i tecnologia de sof català. Amb la és ideal. Vinga, vinga, no ha passat, no ha passat que la taverna s'ompla. Sembla que no, però ja som a la pisòdinau.
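
The recorded runs suggest that the int8 encoder outputs shift noticeably between sessions while the fp32 outputs stay nearly constant. As a rough check, the sketch below compares only the six encoder values that each printout shows (the elided middle of each tensor is not available), so it illustrates the comparison rather than measuring the full tensors. The dictionary keys and the helper name `max_abs_diff` are chosen here for illustration.

```python
# (initial run, rerun) encoder values as printed in the cells above.
recorded = {
    "int8 cpu": (
        [0.0998689, 0.12546, 0.0150658, 0.0410112, -0.61152, -0.173796],
        [0.102781, 0.128416, 0.00438075, 0.0382776, -0.605718, -0.177774],
    ),
    "int8 cuda": (
        [0.094764, 0.123045, 0.0108672, 0.0404294, -0.597082, -0.161252],
        [0.0937386, 0.123954, 0.0130554, 0.0373841, -0.595213, -0.162236],
    ),
    "fp32 cpu": (
        [0.12497, 0.117955, 0.0298516, 0.0300566, -0.589291, -0.168419],
        [0.12497, 0.117955, 0.0298516, 0.0300567, -0.589291, -0.168419],
    ),
    "fp32 cuda": (
        [0.124969, 0.117955, 0.0298519, 0.0300569, -0.589291, -0.168419],
        [0.124969, 0.117955, 0.029852, 0.0300568, -0.589291, -0.168419],
    ),
}


def max_abs_diff(first, second):
    """Largest absolute element-wise difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(first, second))


spread = {name: max_abs_diff(a, b) for name, (a, b) in recorded.items()}
for name, value in spread.items():
    print(f"{name:10s} max |initial - rerun| = {value:.7f}")
```

On the visible values, both int8 configurations differ between runs by more than 1e-3, while both fp32 configurations agree to within about 1e-7, which is the level of the printout's rounding.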