This project aims to predict how well a Japanese sentence is similar to some Japanese classical author's ones. The Japanese authors include Soseki Natsume(夏目漱石), Ogai Mori(森鴎外), Ryunosuke Akutagawa(芥川龍之介) and so on.
Getting data from Aozora-bunko(青空文庫)
I'm getting training data from Aozora-bunko, which is a public repository for Japanese classical pieces with outdated copyright.
In order to download text pieces from Aozora-bunko, I made aozora_scrape.py. It downloads all the pieces in zip file of certain authors, specified in target_author.csv, extracts the zips to SHIFT-JIS encoded text files, converts them to utf-8 and finally changes the file format to csv. The last csv has each lines splitted by "。", the Japanese form of period, and new line feed.
Training the model
Thanks to the following paper and blog, I use character-level convolurional neural network to train the classification model for Japanese sentences.
After downloading and changing the text to csv, run the aozora_cnn.py to generate classification model.
As you have finished generating the model, now you are ready to use it.