## Whisper

[Official repository: installation and basic usage](https://github.com/openai/whisper)

### Way 1: command line

`whisper input/voice_test.m4a --model small --language zh`<br>

> 1. `--model`: the model size. We chose `small`; other options include `tiny`, `large`, etc.<br>
> 2. `--language`: `zh`, because the audio source is Chinese.
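
The same command can also be assembled from Python, which helps when the file name or model size varies. The sketch below is only an illustration: `build_whisper_command` is a hypothetical helper name, and it uses only the `--model` and `--language` flags shown above.

```python
import shlex

def build_whisper_command(audio_path, model="small", language="zh"):
    """Return the argument list for the whisper CLI call shown above."""
    return ["whisper", audio_path, "--model", model, "--language", language]

cmd = build_whisper_command("input/voice_test.m4a")
print(shlex.join(cmd))

# To actually run it (requires whisper and ffmpeg to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the path contains spaces.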

### Way 2: Python API

In [2]:
import whisper

In [3]:
model = whisper.load_model('small')

In [4]:
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("./input/audio_water.m4a")
audio = whisper.pad_or_trim(audio)

In [5]:
audio.shape

(480000,)
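
The shape `(480000,)` is 30 seconds at 16000 samples per second, the rate `whisper.load_audio` resamples to by default. As a simplified illustration of the pad-or-trim idea, here is a version that works on a plain Python list. It is not Whisper's actual implementation (which operates on arrays and tensors), and `pad_or_trim_list` is a hypothetical name.

```python
SAMPLE_RATE = 16000      # Hz
CHUNK_SECONDS = 30       # length of the window the model decodes at once
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def pad_or_trim_list(samples, length=N_SAMPLES):
    """Trim to `length` samples, or pad the end with zeros (silence)."""
    trimmed = list(samples[:length])
    return trimmed + [0.0] * (length - len(trimmed))

short_clip = [0.1] * (5 * SAMPLE_RATE)    # 5 seconds of audio
long_clip = [0.1] * (45 * SAMPLE_RATE)    # 45 seconds of audio
print(N_SAMPLES, len(pad_or_trim_list(short_clip)), len(pad_or_trim_list(long_clip)))
```

A clip shorter than 30 seconds ends in padded zeros, while a longer clip loses everything after the first 30 seconds.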

In [6]:
import IPython.display as ipd

In [7]:
audio

array([-3.0517578e-05, -3.0517578e-05, -3.0517578e-05, ...,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32)

In [8]:
sample_rate = 16000

In [9]:
ipd.Audio([audio], rate=sample_rate)

In [10]:
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

In [11]:
mel.shape

torch.Size([80, 3000])
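
The `[80, 3000]` shape can be checked by hand. The constants below are the values we understand Whisper's audio module to use for this model size (a hop of 160 samples between frames and 80 Mel bins); treat them as assumptions if your version differs.

```python
SAMPLE_RATE = 16000   # samples per second
N_SAMPLES = 480000    # 30 seconds of audio, as in audio.shape above
HOP_LENGTH = 160      # assumed: samples between consecutive spectrogram frames
N_MELS = 80           # assumed: number of Mel bins, matches mel.shape[0]

n_frames = N_SAMPLES // HOP_LENGTH            # frames in a 30-second window
frame_ms = 1000 * HOP_LENGTH / SAMPLE_RATE    # time covered by one hop
print((N_MELS, n_frames), frame_ms)
```

Under these assumptions each column of `mel` corresponds to a 10 ms step through the audio.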

In [12]:
type(mel)

torch.Tensor

In [13]:
model.device

device(type='cuda', index=0)

> Declare that the target language is Chinese.

In [14]:
options = whisper.DecodingOptions(language='zh')

In [15]:
result = whisper.decode(model, mel, options)

In [16]:
result.text

'我出门买了一瓶矿泉水'
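
`whisper.decode` works on a single 30-second window. For longer recordings, the library also provides `model.transcribe`, which processes the whole file and returns a dictionary with a `text` entry. A minimal sketch: `transcribe_file` is a hypothetical wrapper, and the model is passed in as an argument so the function can be tried without loading real weights.

```python
def transcribe_file(model, path, language="zh"):
    """Transcribe a whole audio file and return the recognized text."""
    result = model.transcribe(path, language=language)
    return result["text"].strip()

# Usage with a real model (requires whisper and ffmpeg):
# import whisper
# model = whisper.load_model("small")
# print(transcribe_file(model, "./input/audio_water.m4a"))
```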

### Online recognition

> Record voice from the microphone.

In [1]:
import pyaudio 

In [2]:
import wave

In [3]:
def get_audio(sec, path, format=pyaudio.paInt16, channels=1, rate=44100, chunk=1024):
    """Record `sec` seconds from the default microphone into a WAV file at `path`."""
    # create the PyAudio object
    p = pyaudio.PyAudio()
    # open the input stream
    stream = p.open(format=format,            # sample format (16-bit int)
                    channels=channels,        # 1 = mono
                    rate=rate,                # sample rate in Hz
                    input=True,               # capture from an input device
                    frames_per_buffer=chunk)  # samples read per buffer
    # create the WAV file
    wf = wave.open(path, 'wb')

    # set the file header to match the stream parameters above
    wf.setnchannels(channels)
    wf.setsampwidth(p.get_sample_size(format))
    wf.setframerate(rate)

    print('begin record...')

    # read whole chunks until about `sec` seconds have been captured
    for _ in range(int(rate * sec / chunk)):
        data = stream.read(chunk, exception_on_overflow=False)
        wf.writeframes(data)

    print('record over...')
    stream.stop_stream()
    stream.close()
    p.terminate()
    wf.close()

In [4]:
get_audio(6, './output/test1.wav')

ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pulse.c:243:(pulse_connect) PulseAudio: Unable to connect: Connection refused

ALSA lib pulse.c:243:(pulse_connect) PulseAudio: Unable to connect: Connection refused

ALSA lib pulse.c:243:(pulse_connect) PulseAudio: Unable to connect: Connection refused

ALSA lib pulse.c:243:(pulse_connect) PulseAudio: Unable to connect: Connection refused

Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is n

begin record...
record over...
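
To confirm that the recording was written as intended, the header of the WAV file can be read back with the standard `wave` module. A small sketch: `wav_info` and `expected_frames` are hypothetical helpers, and the defaults mirror those of `get_audio`.

```python
import wave

def wav_info(path):
    """Return (channels, sample_width_bytes, frame_rate, duration_seconds)."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return wf.getnchannels(), wf.getsampwidth(), rate, frames / rate

def expected_frames(sec, rate=44100, chunk=1024):
    """Frames written by the recording loop, which reads whole chunks only."""
    return int(rate * sec / chunk) * chunk

print(expected_frames(6))   # slightly fewer than 6 * 44100 frames
# print(wav_info(path))     # path: the file that was passed to get_audio
```

Because the loop reads whole chunks, the file is a little shorter than the requested duration.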


> Send the data to the server.

In [17]:
import base64
import json

from urllib.request import urlopen
from urllib.request import Request

In [6]:
path_voice = './input/voice_test.m4a'

In [12]:
format_ = path_voice[-3:]
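
Slicing the last three characters works for `.m4a` or `.wav`, but not for longer extensions such as `.flac`. A slightly more general sketch uses `os.path.splitext`; `audio_format` is a hypothetical name.

```python
import os

def audio_format(path):
    """Return the file extension without the dot, in lower case."""
    return os.path.splitext(path)[1].lstrip(".").lower()

print(audio_format("./input/voice_test.m4a"))
```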

> Encode the audio as base64.

In [8]:
with open(path_voice, 'rb') as f:
    data = f.read()

In [9]:
base_data = base64.b64encode(data).decode('utf-8')

In [11]:
length = len(base_data)

In [13]:
params = {
    'format': format_,
    'rate': 44100,
    'channel': 1,
    'len': length,
    'speech': base_data
}

In [16]:
data = json.dumps(params, sort_keys=False)
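
The steps above can be collected into one function. This is a minimal sketch under the same assumptions as the cells above: the field names (`format`, `rate`, `channel`, `len`, `speech`) are taken from the `params` dictionary, while `build_payload` is a hypothetical helper. As in the cells above, `len` is the length of the base64 string, which is about 4/3 of the raw file size.

```python
import base64
import json

def build_payload(raw_bytes, format_, rate=44100, channel=1):
    """Encode raw audio bytes as base64 and wrap them in a JSON string."""
    speech = base64.b64encode(raw_bytes).decode("utf-8")
    params = {
        "format": format_,
        "rate": rate,
        "channel": channel,
        "len": len(speech),
        "speech": speech,
    }
    return json.dumps(params)

# Four dummy bytes stand in for the contents of the audio file.
payload = build_payload(b"\x00\x01\x02\x03", "m4a")
print(payload)
```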

In [18]:
# encode to utf-8 before sending
# Request(url, data.encode('utf-8'))

In [20]:
# server end: decode
# json.loads(line.decode('utf-8'))
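
The two commented cells above outline the round trip: the client sends the UTF-8 encoded JSON, and the server decodes it. A sketch of both ends follows. The URL is a placeholder, and `make_request` and `decode_body` are hypothetical names. The server side uses `json.loads`, which only parses the data; unlike `eval`, it cannot execute code sent by a client.

```python
import base64
import json
from urllib.request import Request

def make_request(url, json_text):
    """Client side: build a POST request carrying the UTF-8 encoded JSON."""
    return Request(url, data=json_text.encode("utf-8"),
                   headers={"Content-Type": "application/json"})

def decode_body(body):
    """Server side: parse the JSON body and recover the raw audio bytes."""
    params = json.loads(body.decode("utf-8"))
    return params, base64.b64decode(params["speech"])

# Round trip without a network: req.data is what the server would read.
text = json.dumps({"format": "m4a",
                   "speech": base64.b64encode(b"abc").decode("utf-8")})
req = make_request("http://localhost:8000/recognize", text)  # placeholder URL
params, audio_bytes = decode_body(req.data)
print(req.get_method(), params["format"], audio_bytes)

# Sending for real would be: urlopen(req).read()
```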