## reference

### - ASR [whisper(OpenAI)](https://github.com/openai/whisper)
### - NLU [thkkvui/xlm-roberta-base-finetuned-massive](https://huggingface.co/thkkvui/xlm-roberta-base-finetuned-massive)
### - Record [pyaudio](https://people.csail.mit.edu/hubert/pyaudio/docs/)
### - TTS [TTS(coqui-ai)](https://github.com/coqui-ai/TTS)
### - [kunishou/Talking_Robot(GitHub)](https://github.com/kunishou/Talking_Robot)

In [2]:
!pip install -Uq pip
!pip install -q openai-whisper
!pip install -q transformers
!pip install -q datasets
!pip install -q torch
!pip install -q pyaudio
!pip install -q mecab-python3
!pip install -q alkana
!pip install -q unidic-lite
!pip install -q TTS

In [3]:
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
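
The cell above only distinguishes Apple's MPS backend from the CPU. On a machine with an NVIDIA GPU, a broader fallback order may be useful. The helper below is a hypothetical sketch (the name `pick_device_name` is not part of this notebook); it takes the availability flags as plain booleans so the ordering can be read and checked without torch installed.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Return a torch device name, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Assumed usage inside the notebook (requires torch):
# device = torch.device(
#     pick_device_name(torch.cuda.is_available(), torch.backends.mps.is_available())
# )
print(pick_device_name(cuda_available=False, mps_available=True))  # prints "mps"
```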

## ASR

In [4]:
import whisper
asr_model = whisper.load_model("base")

## NLU

### load model (from huggingface hub)

In [7]:
from transformers import AutoModelForSequenceClassification

model_name = "thkkvui/xlm-roberta-base-finetuned-massive"
nlu_model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

In [8]:
nlu_model.config.id2label

{0: 'datetime_query',
 1: 'iot_hue_lightchange',
 2: 'transport_ticket',
 3: 'takeaway_query',
 4: 'qa_stock',
 5: 'general_greet',
 6: 'recommendation_events',
 7: 'music_dislikeness',
 8: 'iot_wemo_off',
 9: 'cooking_recipe',
 10: 'qa_currency',
 11: 'transport_traffic',
 12: 'general_quirky',
 13: 'weather_query',
 14: 'audio_volume_up',
 15: 'email_addcontact',
 16: 'takeaway_order',
 17: 'email_querycontact',
 18: 'iot_hue_lightup',
 19: 'recommendation_locations',
 20: 'play_audiobook',
 21: 'lists_createoradd',
 22: 'news_query',
 23: 'alarm_query',
 24: 'iot_wemo_on',
 25: 'general_joke',
 26: 'qa_definition',
 27: 'social_query',
 28: 'music_settings',
 29: 'audio_volume_other',
 30: 'calendar_remove',
 31: 'iot_hue_lightdim',
 32: 'calendar_query',
 33: 'email_sendemail',
 34: 'iot_cleaning',
 35: 'audio_volume_down',
 36: 'play_radio',
 37: 'cooking_query',
 38: 'datetime_convert',
 39: 'qa_maths',
 40: 'iot_hue_lightoff',
 41: 'iot_hue_lighton',
 42: 'transport_query',
 43:

## Performance check

In [9]:
from transformers import pipeline

classifier = pipeline("text-classification", model=model_name)

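# Japanese test utterances: "Tell me today's weather", "Is there any news?",
# "Check my schedule", "What is the dollar-yen rate?"
text = ["今日の天気を教えて", "ニュースある？", "予定をチェックして", "ドル円は？"]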

for t in text:
    output = classifier(t)
    print(output)

[{'label': 'weather_query', 'score': 0.9735569953918457}]
[{'label': 'news_query', 'score': 0.9358323812484741}]
[{'label': 'calendar_query', 'score': 0.9178861975669861}]
[{'label': 'qa_currency', 'score': 0.8823915719985962}]
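
The scores above are all high, but a misheard utterance can still produce a low-confidence label. One way to guard against that is a simple confidence cutoff. The sketch below is an assumption, not part of the original notebook: the function name `top_intent`, the `0.5` threshold, and the `"unknown"` fallback label are illustrative choices that would need tuning on real data.

```python
def top_intent(predictions, threshold=0.5, fallback="unknown"):
    """Return the top label if its score clears the threshold, else the fallback."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else fallback


# Prediction copied from the classifier output above
confident = [{"label": "weather_query", "score": 0.9735569953918457}]
print(top_intent(confident))                  # weather_query
print(top_intent(confident, threshold=0.99))  # unknown
```

In the notebook, `predictions` would be the list returned by `classifier(t)`.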


## Record

In [7]:
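# Example utterances to speak into the microphone

# "今日の天気を教えて"    (Tell me today's weather)
# "ニュースある？"        (Is there any news?)
# "予定をチェックして？"  (Can you check my schedule?)
# "ドル円は？"            (What is the dollar-yen rate?)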

In [13]:
import pyaudio
import wave

record_time = 8                  # recording length in seconds
record_filepath = "record.wav"

FORMAT = pyaudio.paInt16         # 16-bit samples
rate = 44100                     # sampling rate in Hz
chunk = 2**10                    # frames read per buffer
audio = pyaudio.PyAudio()

stream = audio.open(format=FORMAT,
                    input=True,
                    rate=rate,
                    frames_per_buffer=chunk,
                    channels=1)

print(f"Speak to your microphone for {record_time} sec...")
frames = []
for _ in range(int(rate / chunk * record_time)):
    frames.append(stream.read(chunk))
print("Great!")

stream.stop_stream()
stream.close()
sample_width = audio.get_sample_size(FORMAT)  # query before releasing PortAudio
audio.terminate()

# Save the captured frames as a mono 16-bit WAV file
with wave.open(record_filepath, 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(sample_width)
    wf.setframerate(rate)
    wf.writeframes(b''.join(frames))

Speak to your microphone for 8 sec...
Great!
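
Because the loop reads whole buffers only, the recording comes out slightly shorter than `record_time`. The sketch below reproduces the arithmetic with the values from the cell above and checks it by writing silence of the same length to an in-memory WAV. The in-memory buffer is a stand-in for `record.wav`, so no microphone is needed.

```python
import io
import wave

rate, chunk, record_time = 44100, 2**10, 8   # values from the recording cell
n_chunks = int(rate / chunk * record_time)   # 344 full buffers
n_frames = n_chunks * chunk                  # 352256 frames

# Write silence of the same length into an in-memory WAV (stand-in for record.wav)
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)                       # paInt16 -> 2 bytes per sample
    wf.setframerate(rate)
    wf.writeframes(b"\x00\x00" * n_frames)

# Read the header back and derive the duration in seconds
buffer.seek(0)
with wave.open(buffer, "rb") as wf:
    duration = wf.getnframes() / wf.getframerate()

print(n_chunks, n_frames, round(duration, 3))  # 344 352256 7.988
```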


## data preprocessing

In [20]:
asr_text = asr_model.transcribe(record_filepath, verbose=False, language="ja")
print(asr_text["text"])
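
The recording always lasts the full 8 seconds, so trailing silence can end up in the transcript as spurious text. Besides `"text"`, the dictionary returned by `transcribe` carries a `"segments"` list whose entries include a `no_speech_prob` value. The sketch below filters on that value. The helper name `speech_text`, the `0.6` cutoff, and the hand-written sample result are assumptions for illustration, not output from this notebook.

```python
def speech_text(result, max_no_speech_prob=0.6):
    """Join the text of segments that are likely to contain speech."""
    kept = [
        seg["text"].strip()
        for seg in result.get("segments", [])
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]
    return "".join(kept)  # Japanese text is joined without spaces


# Hand-written stand-in for a transcribe() result; the numbers are illustrative only
fake_result = {
    "text": "ドル円の値段はご視聴ありがとうございました",
    "segments": [
        {"text": "ドル円の値段は", "no_speech_prob": 0.05},
        {"text": "ご視聴ありがとうございました", "no_speech_prob": 0.92},
    ],
}
print(speech_text(fake_result))  # ドル円の値段は
```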

In [19]:
import re
import MeCab
import alkana
import pandas as pd

# Tokenization
al_re = re.compile(r'^[a-zA-Z]+$')
def is_al(text):
    """Return True if the token consists only of ASCII letters."""
    return al_re.match(text) is not None

tmp_text = asr_text["text"]  # e.g. "helloテレビを見ました"
wakati = MeCab.Tagger('-Owakati')
wakati_output = wakati.parse(tmp_text)
print(wakati_output)


# Find English words
df = pd.DataFrame(wakati_output.split(" "), columns=["word"])
df["en_word"] = df["word"].apply(is_al)
df["katakana"] = df["word"].apply(alkana.get_kana)
print(df)
print(" ")
# Convert to katakana
# Keep only English words that alkana could convert (get_kana returns None otherwise)
df = df[df["en_word"] & df["katakana"].notna()]
dict_rep = dict(zip(df["word"], df["katakana"]))

# Apply every replacement cumulatively, not only the last one
asr_text = tmp_text
for word, katakana in dict_rep.items():
    asr_text = asr_text.replace(word, katakana)

print(asr_text)

ドル エン の 値段 は 

  word  en_word katakana
0   ドル    False     None
1   エン    False     None
2    の    False     None
3   値段    False     None
4    は    False     None
5   \n    False     None
 
ドルエンの値段は
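
The sample above contains no English words, so the katakana step is never exercised. The sketch below shows the same idea on a made-up sentence with two English tokens. It is a simplified stand-in: `toy_dict` is a hypothetical two-entry dictionary used in place of `alkana.get_kana`, and the token list is written by hand instead of coming from MeCab.

```python
import re

ASCII_WORD = re.compile(r"^[a-zA-Z]+$")


def english_to_katakana(text, tokens, lookup):
    """Replace each ASCII-letter token in text with its katakana reading, if known."""
    for token in dict.fromkeys(tokens):  # de-duplicate while keeping order
        if ASCII_WORD.match(token):
            kana = lookup(token)
            if kana is not None:         # skip words the dictionary does not know
                text = text.replace(token, kana)
    return text


# Hypothetical stand-in for alkana.get_kana so the sketch runs without alkana
toy_dict = {"hello": "ハロー", "news": "ニュース"}
tokens = ["hello", "と", "news", "を", "見", "まし", "た"]
print(english_to_katakana("helloとnewsを見ました", tokens, toy_dict.get))
# ハローとニュースを見ました
```

Plain substring replacement can misfire when one English token is contained in another, so a production version would likely rebuild the sentence from the token list instead.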


## TTS

In [21]:
from TTS.api import TTS

tts = TTS()
tts.list_models()

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`



['tts_models/multilingual/multi-dataset/xtts_v1',
 'tts_models/multilingual/multi-dataset/your_tts',
 'tts_models/multilingual/multi-dataset/bark',
 'tts_models/bg/cv/vits',
 'tts_models/cs/cv/vits',
 'tts_models/da/cv/vits',
 'tts_models/et/cv/vits',
 'tts_models/ga/cv/vits',
 'tts_models/en/ek1/tacotron2',
 'tts_models/en/ljspeech/tacotron2-DDC',
 'tts_models/en/ljspeech/tacotron2-DDC_ph',
 'tts_models/en/ljspeech/glow-tts',
 'tts_models/en/ljspeech/speedy-speech',
 'tts_models/en/ljspeech/tacotron2-DCA',
 'tts_models/en/ljspeech/vits',
 'tts_models/en/ljspeech/vits--neon',
 'tts_models/en/ljspeech/fast_pitch',
 'tts_models/en/ljspeech/overflow',
 'tts_models/en/ljspeech/neural_hmm',
 'tts_models/en/vctk/vits',
 'tts_models/en/vctk/fast_pitch',
 'tts_models/en/sam/tacotron-DDC',
 'tts_models/en/blizzard2013/capacitron-t2-c50',
 'tts_models/en/blizzard2013/capacitron-t2-c150_v2',
 'tts_models/en/multi-dataset/tortoise-v2',
 'tts_models/en/jenny/jenny',
 'tts_models/es/mai/tacotron2-DD

In [22]:
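# Japanese model; the list index depends on the installed TTS version,
# so check that the selected name is the one you expect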
tts_model = tts.list_models()[39]


In [24]:
# Download model
tts = TTS(tts_model)
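
Picking the model by a fixed list index is fragile, because the list can change between TTS releases. Since the names printed above follow a `type/language/dataset/model` pattern, one alternative is to filter by the language field. The sketch below assumes `list_models()` returns plain strings as shown above; `models_for_language` is a hypothetical helper name.

```python
def models_for_language(model_names, lang_code):
    """Return model names whose language field (second path component) matches."""
    selected = []
    for name in model_names:
        parts = name.split("/")
        if len(parts) >= 2 and parts[1] == lang_code:
            selected.append(name)
    return selected


# A few entries copied from the list printed above
listed = [
    "tts_models/multilingual/multi-dataset/your_tts",
    "tts_models/en/ljspeech/vits",
    "tts_models/en/vctk/vits",
    "tts_models/da/cv/vits",
]
print(models_for_language(listed, "en"))
# ['tts_models/en/ljspeech/vits', 'tts_models/en/vctk/vits']
```

In the notebook, `models_for_language(tts.list_models(), "ja")` would be the corresponding call for Japanese.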

## Inference

In [25]:
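# Canned replies per intent, in order: "Today is sunny, expected high 21°C.",
# "The Olympic 100 m final was postponed due to rain.", "Today's dollar-yen rate is 150 yen.",
# "A meeting at 12:00 and a dinner in Tokyo at 17:00 are scheduled."
sample_outputs = {"weather_query":"今日は晴れ、予想最高気温は21℃です。",
                 "news_query":"オリンピック陸上100メートル決勝は雨天順延となりました。",
                 "qa_currency":"今日のドル円は150円です。",
                 "calendar_query":"12時から会議、17時から東京で会食、が予定されています。"}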

In [26]:
# Reuse the classifier created earlier in the notebook
output = classifier(asr_text)
# Most intents have no canned answer, so fall back to a default reply
# (hypothetical message: "Sorry, I did not understand.") instead of raising KeyError
answer_text = sample_outputs.get(output[0]["label"], "すみません、よくわかりませんでした。")

In [27]:
answer_text

'今日のドル円は150円です。'
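
The steps so far (transcribe, classify, look up a reply) can be wrapped into one function. The sketch below passes the ASR and NLU stages in as callables so the wiring can be run with stubs. `answer_for`, `fake_transcribe`, `fake_classify`, and the stub score are all hypothetical and stand in for Whisper and the Hugging Face pipeline.

```python
def answer_for(audio_path, transcribe, classify, replies, fallback):
    """Run ASR -> NLU -> reply lookup and return (transcript, intent, reply)."""
    transcript = transcribe(audio_path)
    intent = classify(transcript)[0]["label"]
    return transcript, intent, replies.get(intent, fallback)


# Stubs standing in for the real models (hypothetical, for illustration only)
def fake_transcribe(path):
    return "ドルエンの値段は"


def fake_classify(text):
    return [{"label": "qa_currency", "score": 0.88}]


replies = {"qa_currency": "今日のドル円は150円です。"}
print(answer_for("record.wav", fake_transcribe, fake_classify, replies, "unknown intent"))
# ('ドルエンの値段は', 'qa_currency', '今日のドル円は150円です。')
```

With the real models, `transcribe` could be `lambda p: asr_model.transcribe(p, language="ja")["text"]` and `classify` the `classifier` pipeline.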

## output

In [28]:
tts_filepath = "output.wav"
tts.tts_to_file(answer_text, file_path=tts_filepath, progress_bar=False, gpu=False)

 > Text splitted to sentences.
['今日のドル円は150円です。']
 > Processing time: 0.3338449001312256
 > Real-time factor: 0.11499125293510254


'output.wav'

In [30]:
import librosa
from IPython.display import Audio

def sound():
    # sr=None keeps the file's native sampling rate instead of resampling to 22050 Hz
    y, sr = librosa.load(tts_filepath, sr=None)
    return Audio(data=y, rate=sr)

sound()