AI-based chatbot trained on specific waifus' speech
Set up / Installation
- Install Kaldi
- Install ffmpeg and mkvtoolnix (`sudo apt-get install mkvtoolnix ffmpeg`)
- Install python dependencies (`pip install -r requirements.txt`)
- Download the VoxCeleb1 and VoxCeleb2 datasets and train the VoxCeleb Kaldi DNN on them (by running the recipe in `kaldi/egs/voxceleb/v2`)
- OR -
Download a pretrained network, such as the VoxCeleb Xvector System 1a from http://kaldi-asr.org/models/m7, and install it by replacing the files in `kaldi/egs/voxceleb/v2` with the downloaded ones
- Prepare the data following the guidelines in the Data section
The system takes unmodified anime episodes as input. It assumes these are in `.mkv` format and that the first audio and subtitle tracks, as listed by `mkvmerge --identify`, are the ones that should be used to build the model. This is usually the case; the only issue that may arise from this assumption is that, if a series has its first subtitle track in Japanese and the second one in English, the model will use the Japanese one, resulting in a chatbot that speaks Japanese. If that happens, the subtitle track being used can be changed by modifying the corresponding variable in `main.py`, setting it to the 0-based index of the desired subtitle track (so if it's set to `0`, the system uses the first subtitle track listed by `mkvmerge --identify`; if it's set to `2`, it uses the third one; and so on).
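As a sketch of how that track selection works, the snippet below parses `mkvmerge --identify` output and returns the mkvmerge track ID of the n-th subtitle track. The function name and the abridged sample output are illustrative, not taken from `main.py`:

```python
import re

def pick_subtitle_track(identify_output: str, subtitle_index: int = 0) -> int:
    """Return the mkvmerge track ID of the n-th subtitle track (0-based).

    `identify_output` is the text printed by `mkvmerge --identify <file>`.
    """
    subtitle_ids = [
        int(m.group(1))
        for m in re.finditer(r"Track ID (\d+): subtitles", identify_output)
    ]
    return subtitle_ids[subtitle_index]

# Abridged example of `mkvmerge --identify` output:
sample = """\
Track ID 0: video (MPEG-4p10/AVC/h.264)
Track ID 1: audio (AAC)
Track ID 2: subtitles (SubStationAlpha)
Track ID 3: subtitles (SubStationAlpha)
"""

print(pick_subtitle_track(sample, 0))  # -> 2 (first subtitle track)
print(pick_subtitle_track(sample, 1))  # -> 3 (second subtitle track)
```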
These episodes should be in the `series` folder, organized by series, so that all the episodes from a series are in the same folder, following this folder structure:
```
series/Shingeki no Kyojin
series/Shingeki no Kyojin/1x01.mkv
series/Shingeki no Kyojin/1x02.mkv
series/Shingeki no Kyojin/1x03.mkv
series/Shingeki no Kyojin/2x01.mkv
series/Shingeki no Kyojin/2x02.mkv
series/Sword Art Online
series/Sword Art Online/Episode 1x01.mkv
series/Sword Art Online/Episode 1x02.mkv
series/Sword Art Online/Episode 1x03.mkv
```
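Under that layout, collecting the episodes per series is a simple directory walk. This is an illustrative sketch (the helper name is hypothetical, not from the repo):

```python
from pathlib import Path

def list_episodes(series_root: str) -> dict[str, list[Path]]:
    """Map each series name to its sorted list of .mkv episodes."""
    episodes = {}
    for series_dir in sorted(Path(series_root).iterdir()):
        if series_dir.is_dir():
            episodes[series_dir.name] = sorted(series_dir.glob("*.mkv"))
    return episodes
```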
How does it work?
- Extract the audio and subtitle tracks from anime episodes.
- Using timing information from the subtitles, extract the audio segment associated with each subtitle so that we have a list of (subtitle text, audio segment) tuples for all the subtitles in a series.
- Extract the x-vector associated with each audio segment. This is done by extracting a set of features from the audio segment and feeding them to a neural network that produces an embedding called an x-vector. The original research behind this is Snyder et al. (2018), listed in the references below. We use a Kaldi implementation of it with a pretrained DNN that was trained on the VoxCeleb datasets.
- Use DBSCAN to cluster the x-vectors so that audio segments spoken by the same speaker are grouped together. After this step we should have a list of (subtitle text, speaker) tuples for all subtitles.
- Select the speaker or set of speakers to mimic based on user input.
- Train a seq2seq model.
- Serve the chatbot as a Discord bot.
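The segment-extraction step can be done with ffmpeg, driven by the start/end times read from each subtitle. This sketch only builds the command line; the helper name and file paths are illustrative:

```python
def cut_segment_cmd(audio_path: str, start: float, end: float,
                    out_path: str) -> list[str]:
    """Build an ffmpeg command that cuts [start, end] (in seconds) out of an audio file."""
    return [
        "ffmpeg", "-nostdin", "-y",
        "-i", audio_path,
        "-ss", f"{start:.3f}",  # segment start, from the subtitle's timing
        "-to", f"{end:.3f}",    # segment end
        out_path,
    ]

# e.g. subprocess.run(cut_segment_cmd("ep.wav", 12.5, 15.0, "line_0001.wav"), check=True)
```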
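The clustering step can be sketched with scikit-learn's DBSCAN, assuming scikit-learn is available. The toy embeddings and the `eps`/`min_samples` values below are purely illustrative; real x-vectors need these tuned (often with cosine distance) on the target data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy stand-ins for x-vectors: two tight groups of 512-dim embeddings,
# one per hypothetical speaker.
speaker_a = rng.normal(0.0, 0.05, size=(20, 512))
speaker_b = rng.normal(1.0, 0.05, size=(20, 512))
xvectors = np.vstack([speaker_a, speaker_b])

# Each resulting label stands for one speaker; -1 would mark noise points.
labels = DBSCAN(eps=3.0, min_samples=5, metric="euclidean").fit_predict(xvectors)
print(len(set(labels)))  # -> 2 clusters for this toy data
```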
No waifu no laifu
- Adolf Hitler
References

Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. and Khudanpur, S. 2018. X-vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
http://kaldi-asr.org/models/m7 (VoxCeleb Xvector System 1a), retrieved the 19th of December of 2018.
Nagrani, A., Chung, J. S. and Zisserman, A. 2017. VoxCeleb: a large-scale speaker identification dataset. INTERSPEECH.
Chung, J. S., Nagrani, A. and Zisserman, A. 2018. VoxCeleb2: Deep Speaker Recognition. INTERSPEECH.
Sutskever, I., Vinyals, O. and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. Conference and Workshop on Neural Information Processing Systems (NIPS).