# Ratchada Whisper: Development Demo Guide

Ratchada Whisper is an Automatic Speech Recognition (ASR) model. It is a fine-tuned derivative of the Whisper model.

It was fine-tuned on Ratchada-STT, an audio dataset collected from financial YouTube videos.

This model is developed and published by Thinking Machines Data Science Inc.

## Dependencies

In [None]:
!pip install -q -U transformers torch torchaudio tf-keras ipywebrtc ipywidgets

## Load with Pipeline

**Recommended**. You can use this model with the standard Transformers pipeline.

In [None]:
from transformers import pipeline
import torch

user_name = "ThinkingMachinesDataScience"
model_name = "Ratchada-Fang-Thon-Whisper"
repo_name = user_name + "/" + model_name  

device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
        'automatic-speech-recognition', 
        model=repo_name, 
        device=device, 
        generate_kwargs={"language": "th", "task": "transcribe"}
    )

### Demo with Voice Recorder

In [None]:
from ipywebrtc import AudioRecorder, CameraStream
import torchaudio
from IPython.display import Audio

In [None]:
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

In [None]:
# Save the recording made with the `recorder` widget above
recorder.save("audio.wav")

In [None]:
result = pipe(
    "audio.wav"
)["text"]

result

## Load with Transformers

You can also use this model directly through the Transformers module.

With this method, we recommend using our `ratchada-utils` library, available on PyPI, to post-process the transcription.

In [None]:
!pip install -q -U ratchada-utils

In [None]:
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch

user_name = "ThinkingMachinesDataScience"
model_name = "Ratchada-Fang-Thon-Whisper"
repo_name = user_name + "/" + model_name  

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

processor = AutoProcessor.from_pretrained(repo_name)
model = AutoModelForSpeechSeq2Seq.from_pretrained(repo_name).to(device)

# `waveform` is a NumPy array of 16 kHz audio samples, obtained from an audio library such as librosa or torchaudio

input_features = processor(waveform.squeeze(), sampling_rate=16000, return_tensors="pt").input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0] # take the first item in the batch

from ratchada_utils.processor import tokenize_text # post-processing is strongly recommended

processed_text = tokenize_text(transcription) # split the text into components and process them (see GitHub)

result = "".join(processed_text)

print(result)
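
The cell above assumes a `waveform` sampled at 16 kHz but does not show where it comes from. As a minimal sketch, assuming a 16-bit PCM WAV file as input, the helper below reads the file using only the Python standard library, checks the sampling rate, and mixes the channels down to mono. The name `read_wav_mono` is hypothetical and not part of `ratchada-utils` or Transformers. In practice librosa or torchaudio would also handle resampling and other formats.

```python
import array
import sys
import wave

TARGET_RATE = 16000  # sampling rate passed to the processor above


def read_wav_mono(path, expected_rate=TARGET_RATE):
    """Read a 16-bit PCM WAV file and return mono samples as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM WAV files are handled in this sketch")
        if wav.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz audio, got {wav.getframerate()} Hz; resample first"
            )
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    pcm = array.array("h")  # signed 16-bit integers
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    # Average the interleaved channels of each frame, then scale to floats.
    samples = []
    for i in range(0, len(pcm), channels):
        frame = pcm[i:i + channels]
        samples.append(sum(frame) / (channels * 32768.0))
    return samples
```

The returned list can be wrapped with `numpy.asarray(samples, dtype=numpy.float32)` to get an array that is usable as `waveform` in the cell above. The helper raises an error on a mismatched sampling rate rather than silently passing mis-sampled audio to the model.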
