## reference

### - ASR [whisper(OpenAI)](https://github.com/openai/whisper)
### - NLU [thkkvui/mDeBERTa-v3-base-finetuned-nli-jnli](https://huggingface.co/thkkvui/mDeBERTa-v3-base-finetuned-nli-jnli)
### - Record [pyaudio](https://people.csail.mit.edu/hubert/pyaudio/docs/)
### - TTS [TTS(coqui-ai)](https://github.com/coqui-ai/TTS)
### - [kunishou/Talking_Robot(GitHub)](https://github.com/kunishou/Talking_Robot)

In [1]:
!pip install -Uq pip
!pip install -q openai-whisper
!pip install -q transformers
!pip install -q datasets
!pip install -q torch
!pip install -q pyaudio
!pip install -q mecab-python3
!pip install -q alkana
!pip install -q unidic-lite
!pip install -q TTS

In [2]:
import torch
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

## ASR

In [3]:
import whisper
asr_model = whisper.load_model("base")

## NLU

### load model (from huggingface hub)

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "thkkvui/mDeBERTa-v3-base-finetuned-nli-jnli"
nlu_model = (AutoModelForSequenceClassification.from_pretrained(model_name).to(device))

## Performance check

In [6]:
labels = ["天気", "ニュース", "予定", "マーケット"]

In [1]:
from transformers import pipeline

model_name = "thkkvui/mDeBERTa-v3-base-finetuned-nli-jnli"
classifier = pipeline("zero-shot-classification", model=model_name)

text = ["今日の天気を教えて", "ニュースある？", "予定をチェックして", "ドル円は？"]

for t in text:
    output = classifier(t, labels, multi_label=False)
    print(output)
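
Each `output` printed above is a dictionary. For the zero-shot classification pipeline it holds the input under `sequence`, plus parallel `labels` and `scores` lists sorted from most to least likely. The sketch below shows one way to turn that dictionary into readable (label, score) pairs. The helper `rank_labels` is hypothetical, and the scores are made up for illustration rather than taken from a real model call.

```python
def rank_labels(output, digits=3):
    """Pair each candidate label with its rounded score, best first."""
    pairs = zip(output["labels"], output["scores"])
    return [(label, round(score, digits)) for label, score in pairs]

# Illustrative values only (not real model output), shaped like a pipeline result.
example_output = {
    "sequence": "今日の天気を教えて",
    "labels": ["天気", "予定", "ニュース", "マーケット"],
    "scores": [0.91, 0.05, 0.03, 0.01],
}

ranked = rank_labels(example_output)
print(ranked[0])
```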

## Record

In [8]:
# example utterances

# "今日の天気を教えて"    (Tell me today's weather)
# "ニュースある？"        (Is there any news?)
# "予定をチェックして？"  (Can you check my schedule?)
# "ドル円は？"            (What is the dollar-yen rate?)

In [9]:
import pyaudio
import wave

record_time = 8
record_filepath = "record.wav"

FORMAT = pyaudio.paInt16        
rate = 44100
chunk = 2**10
audio = pyaudio.PyAudio()

stream = audio.open(format=FORMAT,
                    input=True,
                    rate=rate, 
                    frames_per_buffer=chunk,
                    channels=1,
)

print(f"Speak to your microphone for {record_time} sec...")
frames = []
for i in range(0, int(rate / chunk * record_time)):
    data = stream.read(chunk)
    frames.append(data) 
print("Great!")

stream.stop_stream()
stream.close()
audio.terminate()

wf = wave.open(record_filepath, 'wb')
wf.setnchannels(1)
wf.setsampwidth(audio.get_sample_size(FORMAT))
wf.setframerate(rate)
wf.writeframes(b''.join(frames))
wf.close()
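
As a quick sanity check on the recording, the standard-library `wave` module can read the header back and report the duration. The sketch below is self-contained: instead of using the microphone, it writes one second of silence to a hypothetical `check.wav` with the same mono, 16-bit, 44.1 kHz settings as above, then measures it. The names `wav_duration_seconds` and `sample_rate` are illustrative.

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# Write 1 second of 16-bit mono silence at 44.1 kHz (stand-in for a real recording).
sample_rate = 44100
with wave.open("check.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample = 16-bit, matching paInt16
    wf.setframerate(sample_rate)
    wf.writeframes(b"\x00\x00" * sample_rate)

duration = wav_duration_seconds("check.wav")
print(duration)
```

Applied to `record.wav`, the same function should report just under 8 seconds, because the loop above reads 344 whole chunks of 1,024 frames.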

## data preprocessing

In [11]:
asr_text = asr_model.transcribe(record_filepath, verbose=False, language="ja")
print(f'{asr_text["text"]}')

In [12]:
import re
import MeCab
import alkana
import pandas as pd

# „Éà„Éº„ÇØ„É≥Âåñ
al_re = re.compile(r'^[a-zA-Z]+$')
def is_al(text):
    return al_re.match(text) is not None

tmp_text = asr_text["text"] #"hello„ÉÜ„É¨„Éì„ÇíË¶ã„Åæ„Åó„Åü"
wakati = MeCab.Tagger('-Owakati')
wakati_output = wakati.parse(tmp_text)
print(wakati_output)


# Ëã±Ë™ûÊ§úÁ¥¢
df = pd.DataFrame(wakati_output.split(" "),columns=["word"])
df["en_word"] = df["word"].apply(is_al)
df["katakana"] = df["word"].apply(alkana.get_kana)
print(df)
print(" ")
# „Ç´„Çø„Ç´„ÉäÂ§âÊèõ
df = df[df["en_word"] == True]
dict_rep = dict(zip(df["word"], df["katakana"]))

if len(df) > 0:
    for word, katakana in dict_rep.items():
        asr_text = tmp_text.replace(word, katakana)
else:
    asr_text = tmp_text
    
print(asr_text)

予定 を チェック し て 

   word  en_word katakana
0    予定    False     None
1     を    False     None
2  チェック    False     None
3     し    False     None
4     て    False     None
5    \n    False     None
 
予定をチェックして


## TTS

In [13]:
from TTS.api import TTS

tts = TTS()
tts.list_models()

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`



['tts_models/multilingual/multi-dataset/xtts_v1',
 'tts_models/multilingual/multi-dataset/your_tts',
 'tts_models/multilingual/multi-dataset/bark',
 'tts_models/bg/cv/vits',
 'tts_models/cs/cv/vits',
 'tts_models/da/cv/vits',
 'tts_models/et/cv/vits',
 'tts_models/ga/cv/vits',
 'tts_models/en/ek1/tacotron2',
 'tts_models/en/ljspeech/tacotron2-DDC',
 'tts_models/en/ljspeech/tacotron2-DDC_ph',
 'tts_models/en/ljspeech/glow-tts',
 'tts_models/en/ljspeech/speedy-speech',
 'tts_models/en/ljspeech/tacotron2-DCA',
 'tts_models/en/ljspeech/vits',
 'tts_models/en/ljspeech/vits--neon',
 'tts_models/en/ljspeech/fast_pitch',
 'tts_models/en/ljspeech/overflow',
 'tts_models/en/ljspeech/neural_hmm',
 'tts_models/en/vctk/vits',
 'tts_models/en/vctk/fast_pitch',
 'tts_models/en/sam/tacotron-DDC',
 'tts_models/en/blizzard2013/capacitron-t2-c50',
 'tts_models/en/blizzard2013/capacitron-t2-c150_v2',
 'tts_models/en/multi-dataset/tortoise-v2',
 'tts_models/en/jenny/jenny',
 'tts_models/es/mai/tacotron2-DD

In [14]:
# Japanese
tts_model = tts.list_models()[39]




In [16]:
# Download model
tts = TTS(tts_model)
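
Picking the model by position (`[39]`) depends on the order of the list, which may change between TTS releases. A hedged alternative is to filter the names by their language segment. The helper `find_models` below is hypothetical and the sample list is abbreviated; the Japanese entry `tts_models/ja/kokoro/tacotron2-DDC` is an assumption about what the full list contains, since the output above is truncated.

```python
def find_models(model_names, lang):
    """Return the model names whose language segment equals `lang`."""
    prefix = f"tts_models/{lang}/"
    return [name for name in model_names if name.startswith(prefix)]

# Abbreviated sample; the "ja" entry is assumed, not taken from the output above.
sample_models = [
    "tts_models/multilingual/multi-dataset/your_tts",
    "tts_models/en/ljspeech/vits",
    "tts_models/ja/kokoro/tacotron2-DDC",
]

japanese_models = find_models(sample_models, "ja")
print(japanese_models)
```

In the notebook, `find_models(tts.list_models(), "ja")[0]` could then stand in for the hard-coded index, assuming `list_models()` returns a list of strings as in the output shown earlier.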

## Inference

In [17]:
labels = ["天気", "ニュース", "予定", "マーケット"]

sample_outputs = {"天気":"今日は晴れ、予想最高気温は21℃です。",
                  "ニュース":"オリンピック陸上100メートル決勝は雨天順延となりました。",
                  "マーケット":"今日のドル円は150円です。",
                  "予定":"12時から会議、17時から東京で会食、が予定されています。"}

In [18]:
from transformers import pipeline

model_name = "thkkvui/mDeBERTa-v3-base-finetuned-nli-jnli"
classifier = pipeline("zero-shot-classification", model=model_name)

output = classifier(asr_text, labels, multi_label=False)
answer_text = sample_outputs[output["labels"][0]]

In [19]:
answer_text

'12時から会議、17時から東京で会食、が予定されています。'
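
Because the top label is always used, an utterance unrelated to any label still triggers one of the canned answers. A minimal sketch of a guard follows, assuming a hypothetical `choose_answer` helper, a hypothetical fallback message, and a threshold of 0.5 chosen only for illustration. The scores are made up, not real model output.

```python
def choose_answer(output, responses, fallback, min_score=0.5):
    """Return the canned response for the top label, or a fallback if unsure."""
    top_label, top_score = output["labels"][0], output["scores"][0]
    if top_score < min_score or top_label not in responses:
        return fallback
    return responses[top_label]

# Illustrative inputs only, shaped like zero-shot pipeline results.
responses = {"天気": "今日は晴れです。", "予定": "12時から会議です。"}
fallback = "すみません、わかりませんでした。"

confident = {"labels": ["予定", "天気", "ニュース", "マーケット"],
             "scores": [0.88, 0.06, 0.04, 0.02]}
unsure = {"labels": ["天気", "予定", "ニュース", "マーケット"],
          "scores": [0.30, 0.28, 0.22, 0.20]}

print(choose_answer(confident, responses, fallback))
print(choose_answer(unsure, responses, fallback))
```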

## output

In [20]:
tts_filepath = "output.wav"
tts.tts_to_file(answer_text, file_path=tts_filepath, progress_bar=False, gpu=False)

 > Text splitted to sentences.
['12時から会議、17時から東京で会食、が予定されています。']
 > Processing time: 0.665740966796875
 > Real-time factor: 0.12224840371311703


'output.wav'

In [22]:
import librosa
from IPython.display import Audio

def sound():
    # Load the synthesized speech and return an inline audio player
    y, sr = librosa.load(tts_filepath)
    return Audio(data=y, rate=sr)

sound()