# Document classification with fastText and MeCab

## References

- [DeepAge: Taking on document classification](https://deepage.net/bigdata/machine_learning/2016/08/28/fast_text_facebook.html#文書分類に挑戦する)
- [YoshihitoAso/gist:9048005 How to install mecab on Ubuntu](https://gist.github.com/YoshihitoAso/9048005)

## Preparation

Convert the livedoor news corpus (LDCC) into the line format that fastText expects.

    __label__<label name> , <body text with words separated by spaces>
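
As a minimal sketch of this format, the helper below builds one training line from a label and a list of tokens. The name `format_line` and the sample tokens are hypothetical, introduced only for illustration; the ` , ` separator follows the layout shown above.

```python
def format_line(label, tokens):
    """Build one line: '__label__<label> , <space-separated tokens>'."""
    # Drop empty tokens so the body contains no doubled spaces.
    body = ' '.join(t for t in tokens if t)
    return '__label__{} , {}'.format(label, body)


# Hypothetical example: label 0 with three tokens.
line = format_line(0, ['今日', 'は', '晴れ'])
print(line)
```

The label can be either a name or a number, as long as the same scheme is used for every line.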

In [1]:
# livedoor.py

# import, def
import MeCab
import os
from os import listdir, walk
from os.path import join, isfile

mecab = MeCab.Tagger('mecabrc')

labels = [
    'dokujo-tsushin',
    'it-life-hack',
    'kaden-channel',
    'livedoor-homme',
    'movie-enter',
    'peachy',
    'smax',
    'sports-watch',
    'topic-news'
]

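def tokenize(text):
    # Split the text into MeCab surface forms joined by single spaces.
    node = mecab.parseToNode(text.strip())
    tokens = []
    while node:
        # The BOS/EOS nodes have an empty surface; skip them.
        if node.surface:
            tokens.append(node.surface)
        node = node.next
    return ' '.join(tokens)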

def is_post(directory, filename):
    if isfile(join(directory, filename)):
        if filename != 'LICENSE.txt':
            return True
    return False

def read_content(directory, filename):
    path = join(directory, filename)
    print(path)
    # Skip the first two lines of each file; keep everything after them.
    with open(path, 'r', encoding="utf-8") as f:
        body = [l.strip() for i, l in enumerate(f) if i > 1]
    text = ''.join(body)
    return tokenize(text)

def read_posts(directory):
    if os.path.exists(join(directory, 'LICENSE.txt')):
        files = [f for f in listdir(directory) if is_post(directory, f)]
        for f in files:
            yield read_content(directory, f)

def read_corpus(input_directory, output_file):
    for (dirpath, _, _) in walk(input_directory):
        label_name = os.path.basename(dirpath)
        with open(output_file, "a", encoding="utf-8") as file:
            for post in read_posts(dirpath):
                file.write('__label__{} , {}'.format(labels.index(label_name), post) + "\n")
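
To see how `read_corpus` derives a label from the directory layout without needing MeCab or the real corpus, the following sketch builds a tiny temporary tree shaped like LDCC (one subdirectory per category, each containing a `LICENSE.txt`) and lists which files would be picked up and under which label index. The helper `list_labeled_posts`, the shortened `labels` list, and the file names are assumptions made for illustration.

```python
import os
import tempfile
from os.path import basename, isfile, join

labels = ['dokujo-tsushin', 'it-life-hack']  # shortened list for the example


def list_labeled_posts(input_directory):
    """Return sorted (label_index, filename) pairs, mirroring the walk above."""
    found = []
    for dirpath, _, filenames in os.walk(input_directory):
        # Only category directories contain LICENSE.txt; others are skipped.
        if 'LICENSE.txt' not in filenames:
            continue
        label_index = labels.index(basename(dirpath))
        for name in filenames:
            if name != 'LICENSE.txt' and isfile(join(dirpath, name)):
                found.append((label_index, name))
    return sorted(found)


# Build a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as root:
    for label in labels:
        os.makedirs(join(root, label))
        for name in ('LICENSE.txt', label + '-0001.txt'):
            with open(join(root, label, name), 'w', encoding='utf-8') as f:
                f.write('placeholder\n')
    result = list_labeled_posts(root)

print(result)
```

In this sketch, a directory that contains a `LICENSE.txt` but whose name is not in `labels` raises `ValueError` from `labels.index`, so the `labels` list has to cover every category directory.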

In [2]:
# Run (the corpus is stored in the corpora/ldcc/text folder)
!mkdir -p data && rm -f data/ldcc.txt
read_corpus('corpora/ldcc/text/', 'data/ldcc.txt')

corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4948016.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5797904.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5368649.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5220452.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5765250.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5866028.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5403411.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6624494.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4788357.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4971076.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6233611.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5143945.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6246330.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4832209.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6076462.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5659126.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-

corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6133730.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5688259.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5858691.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5961210.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5292138.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6765514.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5230237.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5368729.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6416239.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6613586.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6029295.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6820686.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6881520.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5492401.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6597056.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6408147.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-

corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6431821.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5666210.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5612929.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5858077.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6388523.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5927658.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4827888.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5539799.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6915005.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5700711.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5494962.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6778211.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6812547.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6371775.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5815286.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6546795.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-

corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6543226.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5821538.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5667700.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5795847.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5741627.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4990884.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4916133.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5982537.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6281676.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6154590.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6157285.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6325130.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5022310.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5705310.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6229498.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5842914.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-

corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5397222.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4866266.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6851257.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5785306.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6649357.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5971298.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6152168.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6157503.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6384000.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6533086.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-4819443.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6446026.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5353302.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5993505.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6313075.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6863452.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-

corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5724435.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6514148.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6506771.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5641553.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5005846.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5490268.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5005857.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6492698.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5913812.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6480135.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6530289.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5578197.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5283177.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5851239.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-6109935.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-tsushin-5079275.txt
corpora/ldcc/text/dokujo-tsushin/dokujo-

corpora/ldcc/text/movie-enter/movie-enter-5893366.txt
corpora/ldcc/text/movie-enter/movie-enter-6641170.txt
corpora/ldcc/text/movie-enter/movie-enter-6564488.txt
corpora/ldcc/text/movie-enter/movie-enter-6361857.txt
corpora/ldcc/text/movie-enter/movie-enter-6245687.txt
corpora/ldcc/text/movie-enter/movie-enter-5910270.txt
corpora/ldcc/text/movie-enter/movie-enter-6317558.txt
corpora/ldcc/text/movie-enter/movie-enter-6324648.txt
corpora/ldcc/text/movie-enter/movie-enter-6532347.txt
corpora/ldcc/text/movie-enter/movie-enter-5961892.txt
corpora/ldcc/text/movie-enter/movie-enter-6660936.txt
corpora/ldcc/text/movie-enter/movie-enter-6649752.txt
corpora/ldcc/text/movie-enter/movie-enter-5850745.txt
corpora/ldcc/text/movie-enter/movie-enter-6571818.txt
corpora/ldcc/text/movie-enter/movie-enter-6338863.txt
corpora/ldcc/text/movie-enter/movie-enter-6653375.txt
corpora/ldcc/text/movie-enter/movie-enter-5931721.txt
corpora/ldcc/text/movie-enter/movie-enter-6391931.txt
corpora/ldcc/text/movie-ente

corpora/ldcc/text/movie-enter/movie-enter-6724187.txt
corpora/ldcc/text/movie-enter/movie-enter-6265669.txt
corpora/ldcc/text/movie-enter/movie-enter-6600684.txt
corpora/ldcc/text/movie-enter/movie-enter-6725384.txt
corpora/ldcc/text/movie-enter/movie-enter-6278767.txt
corpora/ldcc/text/movie-enter/movie-enter-6671599.txt
corpora/ldcc/text/movie-enter/movie-enter-6595602.txt
corpora/ldcc/text/movie-enter/movie-enter-6565605.txt
corpora/ldcc/text/movie-enter/movie-enter-6354484.txt
corpora/ldcc/text/movie-enter/movie-enter-6569323.txt
corpora/ldcc/text/movie-enter/movie-enter-6568920.txt
corpora/ldcc/text/movie-enter/movie-enter-5966479.txt
corpora/ldcc/text/movie-enter/movie-enter-6383037.txt
corpora/ldcc/text/movie-enter/movie-enter-5962597.txt
corpora/ldcc/text/movie-enter/movie-enter-6582445.txt
corpora/ldcc/text/movie-enter/movie-enter-6537974.txt
corpora/ldcc/text/movie-enter/movie-enter-6389848.txt
corpora/ldcc/text/movie-enter/movie-enter-6632298.txt
corpora/ldcc/text/movie-ente

corpora/ldcc/text/movie-enter/movie-enter-6750298.txt
corpora/ldcc/text/movie-enter/movie-enter-6751039.txt
corpora/ldcc/text/movie-enter/movie-enter-6010541.txt
corpora/ldcc/text/movie-enter/movie-enter-6839823.txt
corpora/ldcc/text/movie-enter/movie-enter-6459217.txt
corpora/ldcc/text/movie-enter/movie-enter-5952392.txt
corpora/ldcc/text/movie-enter/movie-enter-6119832.txt
corpora/ldcc/text/movie-enter/movie-enter-5946996.txt
corpora/ldcc/text/movie-enter/movie-enter-6486859.txt
corpora/ldcc/text/movie-enter/movie-enter-6125085.txt
corpora/ldcc/text/movie-enter/movie-enter-6602556.txt
corpora/ldcc/text/movie-enter/movie-enter-6228153.txt
corpora/ldcc/text/movie-enter/movie-enter-6072412.txt
corpora/ldcc/text/movie-enter/movie-enter-6000744.txt
corpora/ldcc/text/movie-enter/movie-enter-6631900.txt
corpora/ldcc/text/movie-enter/movie-enter-6219105.txt
corpora/ldcc/text/movie-enter/movie-enter-6242229.txt
corpora/ldcc/text/movie-enter/movie-enter-6771207.txt
corpora/ldcc/text/movie-ente

corpora/ldcc/text/movie-enter/movie-enter-6590065.txt
corpora/ldcc/text/movie-enter/movie-enter-6274851.txt
corpora/ldcc/text/movie-enter/movie-enter-5988596.txt
corpora/ldcc/text/movie-enter/movie-enter-5914722.txt
corpora/ldcc/text/movie-enter/movie-enter-5974873.txt
corpora/ldcc/text/movie-enter/movie-enter-6167984.txt
corpora/ldcc/text/movie-enter/movie-enter-5939788.txt
corpora/ldcc/text/movie-enter/movie-enter-6063215.txt
corpora/ldcc/text/movie-enter/movie-enter-5981842.txt
corpora/ldcc/text/movie-enter/movie-enter-6322901.txt
corpora/ldcc/text/movie-enter/movie-enter-6309409.txt
corpora/ldcc/text/movie-enter/movie-enter-6418661.txt
corpora/ldcc/text/movie-enter/movie-enter-6045331.txt
corpora/ldcc/text/movie-enter/movie-enter-6312748.txt
corpora/ldcc/text/movie-enter/movie-enter-6609100.txt
corpora/ldcc/text/movie-enter/movie-enter-6317409.txt
corpora/ldcc/text/movie-enter/movie-enter-6332442.txt
corpora/ldcc/text/movie-enter/movie-enter-5924578.txt
corpora/ldcc/text/movie-ente

corpora/ldcc/text/peachy/peachy-5314792.txt
corpora/ldcc/text/peachy/peachy-4931393.txt
corpora/ldcc/text/peachy/peachy-5038282.txt
corpora/ldcc/text/peachy/peachy-6684851.txt
corpora/ldcc/text/peachy/peachy-5029985.txt
corpora/ldcc/text/peachy/peachy-5608046.txt
corpora/ldcc/text/peachy/peachy-6017152.txt
corpora/ldcc/text/peachy/peachy-5983252.txt
corpora/ldcc/text/peachy/peachy-4569185.txt
corpora/ldcc/text/peachy/peachy-6886196.txt
corpora/ldcc/text/peachy/peachy-4722751.txt
corpora/ldcc/text/peachy/peachy-5869312.txt
corpora/ldcc/text/peachy/peachy-6006848.txt
corpora/ldcc/text/peachy/peachy-5391098.txt
corpora/ldcc/text/peachy/peachy-6136783.txt
corpora/ldcc/text/peachy/peachy-4555381.txt
corpora/ldcc/text/peachy/peachy-4986056.txt
corpora/ldcc/text/peachy/peachy-4737880.txt
corpora/ldcc/text/peachy/peachy-4505609.txt
corpora/ldcc/text/peachy/peachy-6075529.txt
corpora/ldcc/text/peachy/peachy-5157972.txt
corpora/ldcc/text/peachy/peachy-5084377.txt
corpora/ldcc/text/peachy/peachy-

corpora/ldcc/text/peachy/peachy-6093673.txt
corpora/ldcc/text/peachy/peachy-6662103.txt
corpora/ldcc/text/peachy/peachy-6317921.txt
corpora/ldcc/text/peachy/peachy-4497382.txt
corpora/ldcc/text/peachy/peachy-5404292.txt
corpora/ldcc/text/peachy/peachy-4498426.txt
corpora/ldcc/text/peachy/peachy-5200746.txt
corpora/ldcc/text/peachy/peachy-4554951.txt
corpora/ldcc/text/peachy/peachy-6481150.txt
corpora/ldcc/text/peachy/peachy-4721837.txt
corpora/ldcc/text/peachy/peachy-6907491.txt
corpora/ldcc/text/peachy/peachy-4917177.txt
corpora/ldcc/text/peachy/peachy-5897616.txt
corpora/ldcc/text/peachy/peachy-6104139.txt
corpora/ldcc/text/peachy/peachy-6395811.txt
corpora/ldcc/text/peachy/peachy-6617204.txt
corpora/ldcc/text/peachy/peachy-4926769.txt
corpora/ldcc/text/peachy/peachy-4844138.txt
corpora/ldcc/text/peachy/peachy-5785872.txt
corpora/ldcc/text/peachy/peachy-5056841.txt
corpora/ldcc/text/peachy/peachy-4567316.txt
corpora/ldcc/text/peachy/peachy-6353784.txt
corpora/ldcc/text/peachy/peachy-

corpora/ldcc/text/peachy/peachy-6018386.txt
corpora/ldcc/text/peachy/peachy-5183028.txt
corpora/ldcc/text/peachy/peachy-6584106.txt
corpora/ldcc/text/peachy/peachy-4506355.txt
corpora/ldcc/text/peachy/peachy-5164920.txt
corpora/ldcc/text/peachy/peachy-4804202.txt
corpora/ldcc/text/peachy/peachy-6436065.txt
corpora/ldcc/text/peachy/peachy-4997019.txt
corpora/ldcc/text/peachy/peachy-4508336.txt
corpora/ldcc/text/peachy/peachy-4549159.txt
corpora/ldcc/text/peachy/peachy-4791639.txt
corpora/ldcc/text/peachy/peachy-4576896.txt
corpora/ldcc/text/peachy/peachy-4443176.txt
corpora/ldcc/text/peachy/peachy-5140181.txt
corpora/ldcc/text/peachy/peachy-5384015.txt
corpora/ldcc/text/peachy/peachy-6640009.txt
corpora/ldcc/text/peachy/peachy-6439762.txt
corpora/ldcc/text/peachy/peachy-6680385.txt
corpora/ldcc/text/peachy/peachy-6184515.txt
corpora/ldcc/text/peachy/peachy-5981741.txt
corpora/ldcc/text/peachy/peachy-6331719.txt
corpora/ldcc/text/peachy/peachy-6496165.txt
corpora/ldcc/text/peachy/peachy-

corpora/ldcc/text/peachy/peachy-5071580.txt
corpora/ldcc/text/peachy/peachy-6480255.txt
corpora/ldcc/text/peachy/peachy-5342062.txt
corpora/ldcc/text/peachy/peachy-5317349.txt
corpora/ldcc/text/peachy/peachy-4779790.txt
corpora/ldcc/text/peachy/peachy-4760150.txt
corpora/ldcc/text/peachy/peachy-5104964.txt
corpora/ldcc/text/peachy/peachy-4645799.txt
corpora/ldcc/text/peachy/peachy-4994814.txt
corpora/ldcc/text/peachy/peachy-6136755.txt
corpora/ldcc/text/peachy/peachy-6372255.txt
corpora/ldcc/text/peachy/peachy-5962941.txt
corpora/ldcc/text/peachy/peachy-4575000.txt
corpora/ldcc/text/peachy/peachy-4560813.txt
corpora/ldcc/text/peachy/peachy-6821596.txt
corpora/ldcc/text/peachy/peachy-5858720.txt
corpora/ldcc/text/peachy/peachy-6821701.txt
corpora/ldcc/text/peachy/peachy-4634646.txt
corpora/ldcc/text/peachy/peachy-5277777.txt
corpora/ldcc/text/peachy/peachy-6732499.txt
corpora/ldcc/text/peachy/peachy-5160910.txt
corpora/ldcc/text/peachy/peachy-5164654.txt
corpora/ldcc/text/peachy/peachy-

corpora/ldcc/text/livedoor-homme/livedoor-homme-5288157.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6214759.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6076393.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5421723.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6033246.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-4690346.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5769353.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6068780.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5933079.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-4826452.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5850380.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5736692.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5736745.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5590182.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5492081.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5690853.txt
corpora/ldcc/text/livedoor-homme/livedoo

corpora/ldcc/text/livedoor-homme/livedoor-homme-5929933.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6076318.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5182557.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5289571.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5398130.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5769318.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6307462.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5034493.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5086175.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6033379.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5619552.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5825834.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5608639.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5842557.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5056598.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5736681.txt
corpora/ldcc/text/livedoor-homme/livedoo

corpora/ldcc/text/livedoor-homme/livedoor-homme-5475486.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5497695.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5695260.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5934016.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5625149.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-4846086.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-4778609.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5769425.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-4921988.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-4697938.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5023303.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5953988.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-4740345.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-5825828.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6047883.txt
corpora/ldcc/text/livedoor-homme/livedoor-homme-6261782.txt
corpora/ldcc/text/livedoor-homme/livedoo

corpora/ldcc/text/sports-watch/sports-watch-6274355.txt
corpora/ldcc/text/sports-watch/sports-watch-6311178.txt
corpora/ldcc/text/sports-watch/sports-watch-6122002.txt
corpora/ldcc/text/sports-watch/sports-watch-6583940.txt
corpora/ldcc/text/sports-watch/sports-watch-5165756.txt
corpora/ldcc/text/sports-watch/sports-watch-5577670.txt
corpora/ldcc/text/sports-watch/sports-watch-6647596.txt
corpora/ldcc/text/sports-watch/sports-watch-6208877.txt
corpora/ldcc/text/sports-watch/sports-watch-5681363.txt
corpora/ldcc/text/sports-watch/sports-watch-5923406.txt
corpora/ldcc/text/sports-watch/sports-watch-6363473.txt
corpora/ldcc/text/sports-watch/sports-watch-4866959.txt
corpora/ldcc/text/sports-watch/sports-watch-4703658.txt
corpora/ldcc/text/sports-watch/sports-watch-6216242.txt
corpora/ldcc/text/sports-watch/sports-watch-4970500.txt
corpora/ldcc/text/sports-watch/sports-watch-6248348.txt
corpora/ldcc/text/sports-watch/sports-watch-6463563.txt
corpora/ldcc/text/sports-watch/sports-watch-5306

corpora/ldcc/text/sports-watch/sports-watch-5535448.txt
corpora/ldcc/text/sports-watch/sports-watch-6022786.txt
corpora/ldcc/text/sports-watch/sports-watch-5669385.txt
corpora/ldcc/text/sports-watch/sports-watch-5910333.txt
corpora/ldcc/text/sports-watch/sports-watch-6726708.txt
corpora/ldcc/text/sports-watch/sports-watch-5387061.txt
corpora/ldcc/text/sports-watch/sports-watch-5272092.txt
corpora/ldcc/text/sports-watch/sports-watch-6134059.txt
corpora/ldcc/text/sports-watch/sports-watch-4753557.txt
corpora/ldcc/text/sports-watch/sports-watch-6796211.txt
corpora/ldcc/text/sports-watch/sports-watch-4749582.txt
corpora/ldcc/text/sports-watch/sports-watch-5710714.txt
corpora/ldcc/text/sports-watch/sports-watch-6217502.txt
corpora/ldcc/text/sports-watch/sports-watch-5855668.txt
corpora/ldcc/text/sports-watch/sports-watch-5316383.txt
corpora/ldcc/text/sports-watch/sports-watch-6615750.txt
corpora/ldcc/text/sports-watch/sports-watch-6871227.txt
corpora/ldcc/text/sports-watch/sports-watch-6828

corpora/ldcc/text/sports-watch/sports-watch-5872002.txt
corpora/ldcc/text/sports-watch/sports-watch-6607026.txt
corpora/ldcc/text/sports-watch/sports-watch-6324914.txt
corpora/ldcc/text/sports-watch/sports-watch-5462204.txt
corpora/ldcc/text/sports-watch/sports-watch-5578259.txt
corpora/ldcc/text/sports-watch/sports-watch-6294303.txt
corpora/ldcc/text/sports-watch/sports-watch-4759172.txt
corpora/ldcc/text/sports-watch/sports-watch-6233095.txt
corpora/ldcc/text/sports-watch/sports-watch-6622996.txt
corpora/ldcc/text/sports-watch/sports-watch-5871224.txt
corpora/ldcc/text/sports-watch/sports-watch-5732485.txt
corpora/ldcc/text/sports-watch/sports-watch-6288475.txt
corpora/ldcc/text/sports-watch/sports-watch-5745595.txt
corpora/ldcc/text/sports-watch/sports-watch-6298624.txt
corpora/ldcc/text/sports-watch/sports-watch-6399003.txt
corpora/ldcc/text/sports-watch/sports-watch-6370065.txt
corpora/ldcc/text/sports-watch/sports-watch-5890429.txt
corpora/ldcc/text/sports-watch/sports-watch-6769

corpora/ldcc/text/sports-watch/sports-watch-4659287.txt
corpora/ldcc/text/sports-watch/sports-watch-5207160.txt
corpora/ldcc/text/sports-watch/sports-watch-6162603.txt
corpora/ldcc/text/sports-watch/sports-watch-4865730.txt
corpora/ldcc/text/sports-watch/sports-watch-5270759.txt
corpora/ldcc/text/sports-watch/sports-watch-4994682.txt
corpora/ldcc/text/sports-watch/sports-watch-6401705.txt
corpora/ldcc/text/sports-watch/sports-watch-6150974.txt
corpora/ldcc/text/sports-watch/sports-watch-5654955.txt
corpora/ldcc/text/sports-watch/sports-watch-5482074.txt
corpora/ldcc/text/sports-watch/sports-watch-5496791.txt
corpora/ldcc/text/sports-watch/sports-watch-4736427.txt
corpora/ldcc/text/sports-watch/sports-watch-5137919.txt
corpora/ldcc/text/sports-watch/sports-watch-6427340.txt
corpora/ldcc/text/sports-watch/sports-watch-6636520.txt
corpora/ldcc/text/sports-watch/sports-watch-5525752.txt
corpora/ldcc/text/sports-watch/sports-watch-5048627.txt
corpora/ldcc/text/sports-watch/sports-watch-6116

corpora/ldcc/text/topic-news/topic-news-6301890.txt
corpora/ldcc/text/topic-news/topic-news-6111664.txt
corpora/ldcc/text/topic-news/topic-news-6316919.txt
corpora/ldcc/text/topic-news/topic-news-6167947.txt
corpora/ldcc/text/topic-news/topic-news-6584350.txt
corpora/ldcc/text/topic-news/topic-news-5952476.txt
corpora/ldcc/text/topic-news/topic-news-6353761.txt
corpora/ldcc/text/topic-news/topic-news-6578941.txt
corpora/ldcc/text/topic-news/topic-news-5985275.txt
corpora/ldcc/text/topic-news/topic-news-5944310.txt
corpora/ldcc/text/topic-news/topic-news-6591348.txt
corpora/ldcc/text/topic-news/topic-news-5942965.txt
corpora/ldcc/text/topic-news/topic-news-6792801.txt
corpora/ldcc/text/topic-news/topic-news-6096688.txt
corpora/ldcc/text/topic-news/topic-news-6374961.txt
corpora/ldcc/text/topic-news/topic-news-6431446.txt
corpora/ldcc/text/topic-news/topic-news-6670196.txt
corpora/ldcc/text/topic-news/topic-news-6776809.txt
corpora/ldcc/text/topic-news/topic-news-6662611.txt
corpora/ldcc

corpora/ldcc/text/topic-news/topic-news-5936557.txt
corpora/ldcc/text/topic-news/topic-news-6105006.txt
corpora/ldcc/text/topic-news/topic-news-6201015.txt
corpora/ldcc/text/topic-news/topic-news-6873254.txt
corpora/ldcc/text/topic-news/topic-news-5968874.txt
corpora/ldcc/text/topic-news/topic-news-6069763.txt
corpora/ldcc/text/topic-news/topic-news-6740348.txt
corpora/ldcc/text/topic-news/topic-news-6165732.txt
corpora/ldcc/text/topic-news/topic-news-5903225.txt
corpora/ldcc/text/topic-news/topic-news-6903121.txt
corpora/ldcc/text/topic-news/topic-news-6855176.txt
corpora/ldcc/text/topic-news/topic-news-5933360.txt
corpora/ldcc/text/topic-news/topic-news-5931334.txt
corpora/ldcc/text/topic-news/topic-news-6364292.txt
corpora/ldcc/text/topic-news/topic-news-6257809.txt
corpora/ldcc/text/topic-news/topic-news-5974081.txt
corpora/ldcc/text/topic-news/topic-news-6014599.txt
corpora/ldcc/text/topic-news/topic-news-5927968.txt
corpora/ldcc/text/topic-news/topic-news-6103737.txt
corpora/ldcc

corpora/ldcc/text/topic-news/topic-news-5949110.txt
corpora/ldcc/text/topic-news/topic-news-6520582.txt
corpora/ldcc/text/topic-news/topic-news-6544734.txt
corpora/ldcc/text/topic-news/topic-news-6694210.txt
corpora/ldcc/text/topic-news/topic-news-6047734.txt
corpora/ldcc/text/topic-news/topic-news-5992147.txt
corpora/ldcc/text/topic-news/topic-news-6382493.txt
corpora/ldcc/text/topic-news/topic-news-6586368.txt
corpora/ldcc/text/topic-news/topic-news-6169165.txt
corpora/ldcc/text/topic-news/topic-news-6084672.txt
corpora/ldcc/text/topic-news/topic-news-5927215.txt
corpora/ldcc/text/topic-news/topic-news-6518334.txt
corpora/ldcc/text/topic-news/topic-news-6338538.txt
corpora/ldcc/text/topic-news/topic-news-6069918.txt
corpora/ldcc/text/topic-news/topic-news-5996126.txt
corpora/ldcc/text/topic-news/topic-news-6041233.txt
corpora/ldcc/text/topic-news/topic-news-6381982.txt
corpora/ldcc/text/topic-news/topic-news-5972429.txt
corpora/ldcc/text/topic-news/topic-news-6559973.txt
corpora/ldcc

corpora/ldcc/text/smax/smax-6512377.txt
corpora/ldcc/text/smax/smax-6761618.txt
corpora/ldcc/text/smax/smax-6774533.txt
corpora/ldcc/text/smax/smax-6817865.txt
corpora/ldcc/text/smax/smax-6701563.txt
corpora/ldcc/text/smax/smax-6537923.txt
corpora/ldcc/text/smax/smax-6815664.txt
corpora/ldcc/text/smax/smax-6913645.txt
corpora/ldcc/text/smax/smax-6648738.txt
corpora/ldcc/text/smax/smax-6631150.txt
corpora/ldcc/text/smax/smax-6792463.txt
corpora/ldcc/text/smax/smax-6572740.txt
corpora/ldcc/text/smax/smax-6569026.txt
corpora/ldcc/text/smax/smax-6750093.txt
corpora/ldcc/text/smax/smax-6563515.txt
corpora/ldcc/text/smax/smax-6543512.txt
corpora/ldcc/text/smax/smax-6842269.txt
corpora/ldcc/text/smax/smax-6511548.txt
corpora/ldcc/text/smax/smax-6856578.txt
corpora/ldcc/text/smax/smax-6862961.txt
corpora/ldcc/text/smax/smax-6702477.txt
corpora/ldcc/text/smax/smax-6545687.txt
corpora/ldcc/text/smax/smax-6594409.txt
corpora/ldcc/text/smax/smax-6530489.txt
corpora/ldcc/text/smax/smax-6782470.txt


corpora/ldcc/text/smax/smax-6552402.txt
corpora/ldcc/text/smax/smax-6520016.txt
corpora/ldcc/text/smax/smax-6737569.txt
corpora/ldcc/text/smax/smax-6599746.txt
corpora/ldcc/text/smax/smax-6684339.txt
corpora/ldcc/text/smax/smax-6780539.txt
corpora/ldcc/text/smax/smax-6726081.txt
corpora/ldcc/text/smax/smax-6549551.txt
corpora/ldcc/text/smax/smax-6736017.txt
corpora/ldcc/text/smax/smax-6838218.txt
corpora/ldcc/text/smax/smax-6762863.txt
corpora/ldcc/text/smax/smax-6575124.txt
corpora/ldcc/text/smax/smax-6783204.txt
corpora/ldcc/text/smax/smax-6538588.txt
corpora/ldcc/text/smax/smax-6872861.txt
corpora/ldcc/text/smax/smax-6732193.txt
corpora/ldcc/text/smax/smax-6869011.txt
corpora/ldcc/text/smax/smax-6577502.txt
corpora/ldcc/text/smax/smax-6574916.txt
corpora/ldcc/text/smax/smax-6514482.txt
corpora/ldcc/text/smax/smax-6781533.txt
corpora/ldcc/text/smax/smax-6620243.txt
corpora/ldcc/text/smax/smax-6671169.txt
corpora/ldcc/text/smax/smax-6834562.txt
corpora/ldcc/text/smax/smax-6907687.txt


corpora/ldcc/text/smax/smax-6872210.txt
corpora/ldcc/text/smax/smax-6623564.txt
corpora/ldcc/text/smax/smax-6578513.txt
corpora/ldcc/text/smax/smax-6900354.txt
corpora/ldcc/text/smax/smax-6682299.txt
corpora/ldcc/text/smax/smax-6819141.txt
corpora/ldcc/text/smax/smax-6701039.txt
corpora/ldcc/text/smax/smax-6644378.txt
corpora/ldcc/text/smax/smax-6719283.txt
corpora/ldcc/text/smax/smax-6671898.txt
corpora/ldcc/text/smax/smax-6512217.txt
corpora/ldcc/text/smax/smax-6719108.txt
corpora/ldcc/text/smax/smax-6730319.txt
corpora/ldcc/text/smax/smax-6667535.txt
corpora/ldcc/text/smax/smax-6753126.txt
corpora/ldcc/text/smax/smax-6826514.txt
corpora/ldcc/text/smax/smax-6741096.txt
corpora/ldcc/text/smax/smax-6807630.txt
corpora/ldcc/text/smax/smax-6794878.txt
corpora/ldcc/text/smax/smax-6895468.txt
corpora/ldcc/text/smax/smax-6603882.txt
corpora/ldcc/text/smax/smax-6641862.txt
corpora/ldcc/text/smax/smax-6894879.txt
corpora/ldcc/text/smax/smax-6742773.txt
corpora/ldcc/text/smax/smax-6799152.txt


corpora/ldcc/text/smax/smax-6559533.txt
corpora/ldcc/text/smax/smax-6605183.txt
corpora/ldcc/text/smax/smax-6770542.txt
corpora/ldcc/text/smax/smax-6526832.txt
corpora/ldcc/text/smax/smax-6680196.txt
corpora/ldcc/text/smax/smax-6858955.txt
corpora/ldcc/text/smax/smax-6789801.txt
corpora/ldcc/text/smax/smax-6736756.txt
corpora/ldcc/text/smax/smax-6793152.txt
corpora/ldcc/text/smax/smax-6817640.txt
corpora/ldcc/text/smax/smax-6615051.txt
corpora/ldcc/text/smax/smax-6635291.txt
corpora/ldcc/text/smax/smax-6695147.txt
corpora/ldcc/text/smax/smax-6703170.txt
corpora/ldcc/text/smax/smax-6604645.txt
corpora/ldcc/text/smax/smax-6645676.txt
corpora/ldcc/text/smax/smax-6669873.txt
corpora/ldcc/text/smax/smax-6880417.txt
corpora/ldcc/text/smax/smax-6802794.txt
corpora/ldcc/text/smax/smax-6709742.txt
corpora/ldcc/text/smax/smax-6544357.txt
corpora/ldcc/text/smax/smax-6551994.txt
corpora/ldcc/text/smax/smax-6890633.txt
corpora/ldcc/text/smax/smax-6697784.txt
corpora/ldcc/text/smax/smax-6884447.txt


corpora/ldcc/text/kaden-channel/kaden-channel-6907482.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5792969.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6385485.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5961830.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6040069.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6065971.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5795135.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5979337.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5976932.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6180996.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5999828.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6028153.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6786021.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5825177.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6021242.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6108505.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6226557.txt
corpora/ldcc/t

corpora/ldcc/text/kaden-channel/kaden-channel-5845631.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5860708.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6086270.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6880433.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6578529.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6143631.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6349738.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6126542.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6291490.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6195750.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6011396.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5812802.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6180432.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5774093.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5965300.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6712758.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6250348.txt
corpora/ldcc/t

corpora/ldcc/text/kaden-channel/kaden-channel-6163877.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5971869.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6338063.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5913423.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6400674.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6146116.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6635481.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6139731.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5916012.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6110982.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6342265.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6914536.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6123560.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6872681.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6737960.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6069650.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5827057.txt
corpora/ldcc/t

corpora/ldcc/text/kaden-channel/kaden-channel-5775172.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6295084.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6025656.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6912640.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6011019.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6817385.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6178568.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5777269.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6217967.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5962352.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6265537.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6736034.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5930544.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6457205.txt
corpora/ldcc/text/kaden-channel/kaden-channel-5778048.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6853386.txt
corpora/ldcc/text/kaden-channel/kaden-channel-6391796.txt
corpora/ldcc/t

corpora/ldcc/text/it-life-hack/it-life-hack-6718721.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6844464.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6409860.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6487777.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6766700.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6720731.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6314499.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6651014.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6371966.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6503460.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6732196.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6686840.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6505621.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6380347.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6586747.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6686479.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6532010.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6318

corpora/ldcc/text/it-life-hack/it-life-hack-6359629.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6486243.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6303524.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6783296.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6579691.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6753266.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6471517.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6695154.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6302814.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6836455.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6737709.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6390865.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6429762.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6551783.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6350606.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6740487.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6388496.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6721

corpora/ldcc/text/it-life-hack/it-life-hack-6306744.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6789805.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6365378.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6526318.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6646390.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6771373.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6906431.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6375726.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6623842.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6763072.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6542449.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6326089.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6902676.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6809216.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6757202.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6896032.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6470626.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6802

corpora/ldcc/text/it-life-hack/it-life-hack-6691049.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6660114.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6331762.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6630489.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6570980.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6895229.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6750406.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6882227.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6838706.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6683639.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6786542.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6723983.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6815655.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6682577.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6527513.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6369088.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6379949.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6471

corpora/ldcc/text/it-life-hack/it-life-hack-6443684.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6839558.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6600578.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6473542.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6854373.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6389515.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6728289.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6656557.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6459379.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6917568.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6462896.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6494106.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6572234.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6330313.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6701234.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6703751.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6712770.txt
corpora/ldcc/text/it-life-hack/it-life-hack-6391

## Splitting the data into training and test sets

Randomly split the dataset into a training set and a test set at a ratio of roughly 9:1.

In [3]:
# rand_split.py
import numpy as np
import os
from os.path import join

def append_line(filename, line):
    # Append one line; `line` already ends with a newline.
    with open(filename, 'a', encoding="utf-8") as file:
        file.write(line)

def split_random(filename):
    # Derive e.g. ldcc_train.txt / ldcc_test.txt next to the input file.
    dirname = os.path.dirname(filename)
    stem, ext = os.path.splitext(os.path.basename(filename))
    train_name = '{}_train{}'.format(stem, ext)
    test_name = '{}_test{}'.format(stem, ext)
    with open(filename, 'r', encoding="utf-8") as src:
        for l in src:
            # Each line goes to train with probability 0.9, otherwise to test.
            if np.random.choice([True, False], p=[0.9, 0.1]):
                append_line(join(dirname, train_name), l)
            else:
                append_line(join(dirname, test_name), l)

In [4]:
# Run: split ldcc.txt into ldcc_train.txt and ldcc_test.txt
!rm -f data/ldcc_train.txt data/ldcc_test.txt
split_random('data/ldcc.txt')
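
Because the split above is unseeded, every run produces different train and test files. The following is a minimal sketch of a reproducible variant, not the notebook's own method: `split_lines`, its `seed` parameter, and the toy `sample` data are hypothetical names introduced here for illustration.

```python
import random

def split_lines(lines, train_ratio=0.9, seed=0):
    """Assign each line to train or test independently, using a fixed seed."""
    rng = random.Random(seed)  # seeded so the same split is produced every run
    train, test = [], []
    for line in lines:
        # With probability train_ratio the line goes to train, otherwise to test.
        (train if rng.random() < train_ratio else test).append(line)
    return train, test

# Hypothetical toy input in the same "__label__N , text" line format.
sample = ['__label__{} , doc {}\n'.format(i % 9, i) for i in range(1000)]
train, test = split_lines(sample)
print(len(train), len(test))
```

The printed counts should come out close to 9:1, but not exactly, since each line is assigned independently.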

## Training fastText

With the data prepared, start training fastText.

In [None]:
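import os
# %env stores "$PATH" literally instead of expanding it, so prepend the directory in Python.
os.environ['PATH'] = '/workspace/bin:' + os.environ['PATH']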


In [None]:
!fasttext supervised -input data/ldcc_train.txt -output data/ldcc_fasttext_supervised \
-dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 100 -thread 4

Read 26M words
Number of words:  75927
Number of labels: 9


Progress: 1.7%  words/sec/thread: 2560510  lr: 0.098344  loss: 0.986300  eta: 0h4m
Progress: 3.1%  words/sec/thread: 2556699  lr: 0.096896  loss: 0.733835  eta: 0h4m
Progress: 4.4%  words/sec/thread: 2558552  lr: 0.095627  loss: 0.577596  eta: 0h4m
Progress: 5.6%  words/sec/thread: 2553471  lr: 0.094420  loss: 0.483851  eta: 0h4m
Progress: 6.8%  words/sec/thread: 2544640  lr: 0.093155  loss: 0.406061  eta: 0h4m
Progress: 8.1%  words/sec/thread: 2538004  lr: 0.091927  loss: 0.358738  eta: 0h4m
Progress: 10.0%  words/sec/thread: 2539052  lr: 0.090009  loss: 0.295168  eta: 0h3m
Progress: 11.3%  words/sec/thread: 2540170  lr: 0.088676  loss: 0.264873  eta: 0h3m
Progress: 12.8%  words/sec/thread: 2541078  lr: 0.087213  loss: 0.235050  eta: 0h3m
Progress: 14.4%  words/sec/thread: 2545350  lr: 0.085635  loss: 0.211679  eta: 0h3m
Progress: 15.0%  words/sec/thread: 2545429  lr: 0.085036  loss: 0.203478  eta: 0h3m
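
fastText reports the number of labels it found in the training file. As a rough cross-check, the labels can also be counted directly from the `__label__` prefix. This is a minimal sketch that assumes each line starts with exactly one label, as in the format used in this notebook; `count_labels` and the toy `sample` lines are hypothetical.

```python
from collections import Counter

def count_labels(lines, prefix='__label__'):
    """Count lines per label, reading the label from the first token of each line."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Assumption: the label is the first whitespace-separated token.
        if parts and parts[0].startswith(prefix):
            counts[parts[0][len(prefix):]] += 1
    return counts

# Hypothetical toy lines in the "__label__N , text" format.
sample = [
    '__label__0 , 今日 は 晴れ\n',
    '__label__3 , 新 製品 の 発表\n',
    '__label__0 , 映画 の 感想\n',
]
label_counts = count_labels(sample)
print(len(label_counts), dict(label_counts))
```

Passing an open file handle for `data/ldcc_train.txt` instead of `sample` would give the per-label line counts for the real training set.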

## Testing fastText

Evaluate the trained model on the test set.

In [None]:
!fasttext test data/ldcc_fasttext_supervised.bin data/ldcc_test.txt

The model classified the test set with 99.2% accuracy, a surprisingly (?) good result.
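
A score this high is worth a sanity check. One possible check, which is not part of the original notebook, is whether any test line also appears verbatim in the training file, since duplicated articles would inflate the score. The sketch below uses toy data; `shared_lines` is a hypothetical helper.

```python
def shared_lines(train_lines, test_lines):
    """Return the test lines that also occur verbatim in the training lines."""
    train_set = {line.strip() for line in train_lines}
    return [line for line in test_lines if line.strip() in train_set]

# Hypothetical toy data in the "__label__N , text" format.
toy_train = ['__label__0 , a b\n', '__label__1 , c d\n']
toy_test = ['__label__1 , c d\n', '__label__2 , e f\n']
dupes = shared_lines(toy_train, toy_test)
print(len(dupes), 'of', len(toy_test), 'test lines also appear in train')
```

An empty result on the real files would rule out exact duplicates, though not near-duplicates.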