## Reference

- ASR: [whisper (OpenAI)](https://github.com/openai/whisper)
- NLU: [thkkvui/xlm-roberta-base-finetuned-JaQuAD (HuggingFace)](https://huggingface.co/thkkvui/xlm-roberta-base-finetuned-JaQuAD)
- Record: [pyaudio](https://people.csail.mit.edu/hubert/pyaudio/docs/)
- TTS: [TTS (coqui-ai)](https://github.com/coqui-ai/TTS)
- [kunishou/Talking_Robot (GitHub)](https://github.com/kunishou/Talking_Robot)

In [22]:
# Install the required modules

!pip install -q openai-whisper
!pip install -q torch
!pip install -q transformers
!pip install -q datasets

In [1]:
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
device

device(type='mps')

## ASR

In [1]:
import whisper
asr_model = whisper.load_model("base")

## NLU

### load model (from huggingface hub)

In [5]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "thkkvui/xlm-roberta-base-finetuned-JaQuAD"
nlu_model = (AutoModelForQuestionAnswering.from_pretrained(model_name).to(device))
tokenizer = AutoTokenizer.from_pretrained(model_name)

### load model (from local)

In [2]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "xlm-roberta-base"
nlu_model = (AutoModelForQuestionAnswering.from_pretrained("./output").to(device))
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Performance check

In [4]:
text = "私は音声アシスタントです。この7月で5歳になりました。今札幌に住んでいます。昨日は帯広に出かけました。好きなイベントはバルーンフェスティバルです。好きな食べ物はバタークッキーで趣味はカヌーです。"
questions = ["昨日はどこへ出かけましたか？", "あなたの名前は何ですか？", "何歳ですか？", "あなたの趣味を教えてください。",  "あなたが好きなイベントは何ですか？"]

for question in questions:
    
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt").to(device)
    
    with torch.no_grad():
        output = nlu_model(**inputs)

    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)

    answer_tokens = inputs.input_ids[0, answer_start : answer_end + 1]
    answer = tokenizer.decode(answer_tokens)

    print(f"質問: {question} -> 回答: {answer}")

質問: 昨日はどこへ出かけましたか？ -> 回答: 帯広
質問: あなたの名前は何ですか？ -> 回答: 音声アシスタント
質問: 何歳ですか？ -> 回答: 5歳
質問: あなたの趣味を教えてください。 -> 回答: カヌー
質問: あなたが好きなイベントは何ですか？ -> 回答: バルーンフェスティバル


## Record

In [25]:
# Install the required modules

!pip install -q pyaudio

In [14]:
# questions

# "昨日はどこへ出かけましたか？"
# "あなたの名前は何ですか？"
# "何歳ですか？"
# "あなたの趣味を教えてください。"
# "あなたが好きなイベントは何ですか？"

In [5]:
import pyaudio
import wave

record_time = 8
record_filepath = "record.wav"

FORMAT = pyaudio.paInt16        
rate = 44100
chunk = 2**10
audio = pyaudio.PyAudio()

stream = audio.open(format=FORMAT,
                    input=True,
                    rate=rate, 
                    frames_per_buffer=chunk,
                    channels=1,
)

print(f"Speak to your microphone for {record_time} sec...")
frames = []
for _ in range(int(rate / chunk * record_time)):
    data = stream.read(chunk)
    frames.append(data)
print("Great!")

stream.stop_stream()
stream.close()
audio.terminate()

wf = wave.open(record_filepath, 'wb')
wf.setnchannels(1)
wf.setsampwidth(audio.get_sample_size(FORMAT))
wf.setframerate(rate)
wf.writeframes(b''.join(frames))
wf.close()

Speak to your microphone for 8 sec...
Great!
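
Because the recording loop performs a whole number of reads, the saved file is slightly shorter than `record_time`. The sketch below works through that arithmetic with the same `rate` and `chunk` values and writes an equivalent silent WAV file using only the standard library, so it needs no microphone. The temporary file name `silence.wav` is an assumption made for illustration.

```python
import os
import tempfile
import wave

rate = 44100        # samples per second, as in the recording cell
chunk = 2**10       # frames per read
record_time = 8     # requested seconds

# The loop performs a whole number of reads, so the captured audio is
# slightly shorter than record_time.
n_reads = int(rate / chunk * record_time)
n_frames = n_reads * chunk
print(n_reads, n_frames, round(n_frames / rate, 3))  # 344 352256 7.988

# Write the same number of silent 16-bit mono frames and read the header back.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)          # paInt16 is 2 bytes per sample
    wf.setframerate(rate)
    wf.writeframes(b"\x00\x00" * n_frames)

with wave.open(path, "rb") as wf:
    duration = wf.getnframes() / wf.getframerate()
print(round(duration, 3))  # 7.988
```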


### data preprocessing

In [26]:
# Install the required modules

!pip install -q mecab-python3
!pip install -q alkana
!pip install -q unidic-lite

In [3]:
asr_text = asr_model.transcribe(record_filepath, verbose=False, language="ja")
print(asr_text["text"])

In [7]:
import re
import MeCab
import alkana
import pandas as pd

# Tokenization
al_re = re.compile(r'^[a-zA-Z]+$')
def is_al(text):
    return al_re.match(text) is not None

tmp_text = asr_text["text"]  # e.g. "helloテレビを見ました"
wakati = MeCab.Tagger('-Owakati')
wakati_output = wakati.parse(tmp_text)
print(wakati_output)


# Detect English words
df = pd.DataFrame(wakati_output.split(" "),columns=["word"])
df["en_word"] = df["word"].apply(is_al)
df["katakana"] = df["word"].apply(alkana.get_kana)
print(df)
print(" ")
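# Convert to katakana (skip words for which alkana has no reading)
df = df[df["en_word"] & df["katakana"].notna()]
dict_rep = dict(zip(df["word"], df["katakana"]))

# Apply every replacement cumulatively so earlier substitutions are not lost
asr_text = tmp_text
for word, katakana in dict_rep.items():
    asr_text = asr_text.replace(word, katakana)

print(asr_text)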

あなた の 名前 は 

  word  en_word katakana
0  あなた    False     None
1    の    False     None
2   名前    False     None
3    は    False     None
4   \n    False     None
 
あなたの名前は
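
The English-to-katakana step can also be tried without MeCab or alkana installed. The sketch below swaps in a regular-expression substitution for the tokenizer and a two-entry dictionary for the alkana lookup. `KANA`, `to_katakana`, and the lowercasing of matched words are assumptions of this sketch, not the behavior of either library.

```python
import re

# Hypothetical stand-in for alkana.get_kana; the real library ships its own dictionary.
KANA = {"hello": "ハロー", "robot": "ロボット"}

# Runs of half-width ASCII letters; Japanese characters never match.
ascii_word = re.compile(r"[a-zA-Z]+")


def to_katakana(text, lookup=KANA.get):
    """Replace each ASCII word with its katakana reading; unknown words are kept."""
    def repl(match):
        word = match.group(0)
        kana = lookup(word.lower())
        return kana if kana is not None else word
    return ascii_word.sub(repl, text)


print(to_katakana("helloテレビを見ました"))  # ハローテレビを見ました
print(to_katakana("helloとrobotとxyz"))      # ハローとロボットとxyz
```

Substituting match by match keeps every replacement and leaves unknown words untouched, which avoids passing `None` to `str.replace`.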


## TTS

In [4]:
# Install the required modules

!pip install -q TTS

In [8]:
from TTS.api import TTS

for i, name in enumerate(TTS.list_models()):
    print(f"{i}: {name}")

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`

0: tts_models/multilingual/multi-dataset/your_tts
1: tts_models/multilingual/multi-dataset/bark
2: tts_models/bg/cv/vits
3: tts_models/cs/cv/vits
4: tts_models/da/cv/vits
5: tts_models/et/cv/vits
6: tts_models/ga/cv/vits
7: tts_models/en/ek1/tacotron2
8: tts_models/en/ljspeech/tacotron2-DDC
9: tts_models/en/ljspeech/tacotron2-DDC_ph
10: tts_models/en/ljspeech/glow-tts
11: tts_models/en/ljspeech/speedy-speech
12: tts_models/en/ljspeech/tacotron2-DCA
13: tts_models/en/ljspeech/vits
14: tts_models/en/ljspeech/vits--neon
15: tts_models/en/ljspeech/fast_pitch
16: tts_models/en/ljspeech/overflow
17: tts_models/en/ljspeech/neural_hmm
18: tts_models/en/vctk/vits
19: tts_models/en/vctk/fast_pitch
20: tts_models/en/sam/tacotron-DDC
21: tts_models/en/blizzard2013/capacitron-t2-c50
22: tts_models/en/blizzard2013/capac

In [9]:
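# Japanese model. The index depends on the installed TTS version; in this run
# it resolved to tts_models/ja/kokoro/tacotron2-DDC (see the download log below).
tts_model = TTS.list_models()[38]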


In [10]:
# Download model
tts = TTS(tts_model)

 > tts_models/ja/kokoro/tacotron2-DDC is already downloaded.
 > vocoder_models/ja/kokoro/hifigan_v1 is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/Users/tthkky/Library/Application Support/tts/tts_models--ja--kokoro--tacotron2-DDC/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model's

### Inference

In [11]:
# output

inputs = tokenizer.encode_plus(asr_text, text, add_special_tokens=True, return_tensors="pt").to(device)

with torch.no_grad():
    output = nlu_model(**inputs)

answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)

answer_tokens = inputs.input_ids[0, answer_start : answer_end + 1]
answer_text = tokenizer.decode(answer_tokens)

answer_text

'音声アシスタント'

In [12]:
print(f"text: {text}")
print(f"question: {asr_text}")
print(f"answer: {answer_text}")

text: 私は音声アシスタントです。この7月で5歳になりました。今札幌に住んでいます。昨日は帯広に出かけました。好きなイベントはバルーンフェスティバルです。好きな食べ物はバタークッキーで趣味はカヌーです。
question: あなたの名前は
answer: 音声アシスタント


### output

In [13]:
tts_filepath = "output.wav"
tts.tts_to_file(answer_text, file_path=tts_filepath, progress_bar=False, gpu=False)

 > Text splitted to sentences.
['音声アシスタント']
 > Processing time: 0.353074312210083
 > Real-time factor: 0.1767455635722923


'output.wav'

In [5]:
import librosa
from IPython.display import Audio

def sound():
    # Load the synthesized file and return an inline audio player
    y, sr = librosa.load(tts_filepath)
    return Audio(data=y, rate=sr)

sound()
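
The cells above run the three stages (ASR, NLU, TTS) one at a time. As a rough sketch of how they might be chained, the function below takes each stage as a callable so the control flow can be exercised with stubs. The name `answer_question`, its parameters, and the empty-answer check are illustrative assumptions and do not appear in the notebook.

```python
def answer_question(audio_path, context, transcribe, extract_answer, synthesize,
                    out_path="output.wav"):
    """Run ASR -> NLU -> TTS with the three stages supplied as callables."""
    question = transcribe(audio_path)
    answer = extract_answer(question, context)
    if not answer.strip():
        # An empty span may mean the model found no answer; skip synthesis.
        return question, answer, None
    synthesize(answer, out_path)
    return question, answer, out_path


# Stub stages so the flow can be checked without models or a microphone.
spoken = []
result = answer_question(
    "record.wav",
    "私は音声アシスタントです。",
    transcribe=lambda path: "あなたの名前は",
    extract_answer=lambda q, ctx: "音声アシスタント",
    synthesize=lambda text, path: spoken.append((text, path)),
)
print(result)  # ('あなたの名前は', '音声アシスタント', 'output.wav')
```

In the notebook, `transcribe` could wrap `asr_model.transcribe(path, language="ja")["text"]`, `extract_answer` the tokenizer and `nlu_model` cell, and `synthesize` the `tts.tts_to_file` call.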