"Working with Corpora" final project - lyrics-based Japanese music classification.
Lyric-based Music Classification

How to use

  1. Scrape the data: run python3 scrape_lyrics.py
  2. Preprocess the data: copy the generated folders over and run python3 pool.py [folder] [tokenizer], where tokenizer is either "jumanpp" or "janome". Janome is recommended, since Juman++ is still flaky and quite slow. You need janome and Juman++ installed, along with pyknp, Juman++'s Python interface.
  3. Train the neural network: run python3 train.py training_config.json. Again, you need to copy the generated folders over. Training is quick if you have a GPU, even a home-use one: 100 epochs take just a few minutes, although 500 or more epochs seem to give better results. Evaluation results are printed at the end. You need TensorFlow, NumPy and Pandas. You also need a word2vec embedding for Japanese; you can find instructions on how to train one here.
  4. Train the naive Bayes classifier: run python3 classify.py. This is very fast and takes just a few seconds. Evaluation results are printed at the end. You need scikit-learn and nothing else.
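
For readers who want to see what step 4 does conceptually, here is a minimal, dependency-free sketch of a multinomial naive Bayes classifier over pre-tokenized lyrics. It is not the code in classify.py, which relies on scikit-learn; the function names, the toy token lists, and the genre labels "pop" and "rock" are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model.

    docs   -- list of token lists (e.g. the output of a tokenizer)
    labels -- list of class labels, one per document
    alpha  -- additive (Laplace) smoothing constant
    """
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {}
    log_likelihood = {}
    for label, n_docs in class_counts.items():
        log_prior[label] = math.log(n_docs / len(docs))
        # Smoothed denominator: tokens seen in this class + alpha per vocab entry.
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        log_likelihood[label] = {
            tok: math.log((token_counts[label][tok] + alpha) / denom)
            for tok in vocab
        }
    return {"log_prior": log_prior, "log_likelihood": log_likelihood}


def predict(model, tokens):
    """Return the most probable label; tokens outside the vocabulary are skipped."""
    scores = {}
    for label, prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        scores[label] = prior + sum(likelihood[t] for t in tokens if t in likelihood)
    return max(scores, key=scores.get)


# Hypothetical toy data: already-tokenized lyrics with made-up genre labels.
toy_docs = [
    ["愛", "涙", "夜"],
    ["愛", "君", "夢"],
    ["叫ぶ", "闇", "炎"],
    ["闇", "壊す", "叫ぶ"],
]
toy_labels = ["pop", "pop", "rock", "rock"]

model = train_naive_bayes(toy_docs, toy_labels)
print(predict(model, ["愛", "夢"]))
print(predict(model, ["闇", "炎"]))
```

The sketch works in log space to avoid underflow on long lyrics and uses additive smoothing so that a token never seen in one genre does not zero out that genre's score.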