This project aims to classify Japanese sentence to how well similar to some Japanese classical writers, such as Soseki Natsume, Ogai Mori, Ryunosuke Akutagawa and so on.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.


This project aims to predict how well a Japanese sentence is similar to some Japanese classical author's ones. The Japanese authors include Soseki Natsume(夏目漱石), Ogai Mori(森鴎外), Ryunosuke Akutagawa(芥川龍之介) and so on.

Getting data from Aozora-bunko(青空文庫)

I'm getting training data from Aozora-bunko, which is a public repository for Japanese classical pieces with outdated copyright.

In order to download text pieces from Aozora-bunko, I made It downloads all the pieces in zip file of certain authors, specified in target_author.csv, extracts the zips to SHIFT-JIS encoded text files, converts them to utf-8 and finally changes the file format to csv. The last csv has each lines splitted by "。", the Japanese form of period, and new line feed.

Training the model

Thanks to the following paper and blog, I use character-level convolurional neural network to train the classification model for Japanese sentences.

After downloading and changing the text to csv, run the to generate classification model.


As you have finished generating the model, now you are ready to use it.