# Fetch wiki corpus
- [NLP study notes](https://hackmd.io/Gw4TgIwEwhaAGYAzOAWKBWYtIYMyxbICGeUATMAKbFRA)
- [Training Chinese word vectors with gensim](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)
- [TF-IDF keyword extraction](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)

In [2]:
import time
import re
import os

import jieba
jieba.set_dictionary('datas/dict/dict.txt.big')
jieba.load_userdict('datas/dict/edu_dict.txt')

Building prefix dict from /home/sunset/word_contest/datas/dict/dict.txt.big ...
Loading model from cache /tmp/jieba.u849ecfdca27003d306f39ca004b82b5b.cache
Loading model cost 1.181 seconds.
Prefix dict has been built succesfully.


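### Extract data from wiki
- [Wikipedia: Database download](https://zh.wikipedia.org/wiki/Wikipedia:%E6%95%B0%E6%8D%AE%E5%BA%93%E4%B8%8B%E8%BD%BD)
    - Some style tags in the articles are not removed
    - The articles contain unrelated words
    - Chinese and English, Traditional and Simplified are mixed together
    - The text does not read like ordinary articles (some sentence breaks are odd)
- Run `python3 WikiExtractor.py` to convert `datas/raw/zhwiki-20170801-pages-articles.xml.bz2` into `datas/wiki-texts.txt`
- Run `opencc -i datas/wiki-texts.txt -o datas/wiki-texts-zht.txt` to convert all Simplified Chinese to Traditional
    - _hanziconv_ converts very poorly (e.g. 數學家 => 數學傢 / 一方面 => 一方麵)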

### Tokenize
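- [How to use the JIEBA Chinese word segmentation tool](http://blog.fukuball.com/ru-he-shi-yong-jieba-jie-ba-zhong-wen-fen-ci-cheng-shi/)
- [Chinese stop words](https://github.com/stopwords-iso/stopwords-zh)
- [jieba-zh_TW](https://github.com/ldkrsi/jieba-zh_TW.git)
- The word2vec word embedding trained later uses a fixed window size, so each sentence should convey its meaning in as few tokens as possible
- Stop-word lists (some odd words that should not be stop words are included and must be removed by hand, e.g. 社会主义)
    - Note: any token that appears in these lists will be removed
    - **TODO** update and manually review the stop-word lists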
    - https://github.com/stopwords-iso/stopwords-zh/blob/master/raw/baidu.txt
    - https://github.com/stopwords-iso/stopwords-zh/tree/master/raw
    - https://github.com/fxsjy/jieba/tree/master/extra_dict
    - https://gist.github.com/dreampuf/5548203
- Chinese dictionaries
    - ✓ https://github.com/fxsjy/jieba/tree/master/extra_dict
    - ✓ http://resources.publicense.moe.edu.tw/
    - https://raw.githubusercontent.com/NLPchina/ansj_seg/master/src/main/resources/bigramdict.dic
    - https://g0v.hackpad.tw/3du.tw-ZNwaun62BP4
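
The point about the fixed window can be illustrated with a small sketch. It is not the notebook's actual pipeline: `remove_stop_words`, `context_pairs`, and the three-word stop list are hypothetical names and values chosen for illustration, and gensim's real window sampling differs in detail. The idea is only that removing stop words lets more informative neighbours fall inside the same window.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def context_pairs(tokens, window=2):
    """Yield (center, context) pairs within a fixed window around each token."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


# Hypothetical stop-word set; in practice it would be loaded from one of the lists above.
stop_words = {'的', '是', '之'}
tokens = ['數學家', '現在', '被', '認為', '是', '幾何', '之', '父']

filtered = remove_stop_words(tokens, stop_words)
contexts = [ctx for center, ctx in context_pairs(filtered, window=2) if center == '幾何']
print(filtered)
print(contexts)
```

Without filtering, two of the four window slots around 幾何 would be taken by 是 and 之.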

In [5]:
# Show one "sentence" for preview
with open('datas/wiki-texts-zht.txt', 'r') as fi:
    not_chinese_word = u'[^\u4e00-\u9fff]'  # matches any character outside U+4E00..U+9FFF, i.e. anything that is not a Chinese character
    not_chinese_word_rule = re.compile(not_chinese_word)
    for line in fi:
        line = not_chinese_word_rule.sub(' ', line)
        print([word for word in jieba.cut(line) if word != ' '])
        break

['歐幾', '裏', '得', '西元前', '三', '世紀', '的', '希臘', '數學家', '現在', '被', '認為', '是', '幾何', '之', '父', '此畫', '為', '拉斐爾', '的', '作品', '雅典', '學院', '數學', '是', '利用', '符號語言', '研究', '數量', '結構', '變化', '以及', '空間', '等概', '唸', '的', '一門', '學科', '從', '某種', '角度看', '屬於', '形式', '科學', '的', '一種', '數學', '透過', '抽象化', '和', '邏輯推理', '的', '使用', '由', '計數', '計算', '數學家', '們', '拓展', '這些', '概念', '對', '數學', '基本', '概', '唸', '的', '完善', '早', '在', '古埃及', '而', '在', '古希臘', '那裡', '有', '更', '為', '嚴謹', '的', '處理', '從', '那時', '開始', '數學', '的', '發展', '便', '持續', '不斷', '地', '小幅', '進展', '世紀', '的', '文藝復興', '時期', '致使', '數學', '的', '加速', '發展', '直至', '今日', '今日', '數學', '使用', '在', '不同', '的', '領域', '中', '包括', '科學', '工程', '醫學', '和', '經濟學', '等', '有時', '亦', '會', '激起', '新', '的', '數學', '發現', '並', '導致', '全新', '學科', '的', '發展', '數學家', '也', '研究', '純數學', '就是', '數學', '本身', '的', '實質性', '內容', '而', '不以', '任何', '實際', '應用', '為', '目標', '雖然', '許多', '研究', '以', '純數學', '開始', '但', '其', '過程', '中', '也', '發現', '許多', '應用', '之', '處', '詞源', '西方', '語言', '中', '數學', '一', '詞源', '自'
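
To make the effect of the character filter concrete, here is a minimal sketch that applies the same pattern as the cell above to two sample strings. The second sample string is made up for illustration. Everything outside U+4E00..U+9FFF (digits, Latin letters, punctuation, tabs) becomes a space, so only runs of Chinese characters survive.

```python
import re

# Same pattern as in the cell above: any character outside U+4E00..U+9FFF.
not_chinese_word_rule = re.compile(r'[^\u4e00-\u9fff]')

subtitle = '01:00:40:18\t01:00:43:23\t阿亞姐, 這是假人耶'
mixed = 'Python3 教學 (2017)'  # hypothetical sample with Latin letters and digits

subtitle_chunks = not_chinese_word_rule.sub(' ', subtitle).split()
mixed_chunks = not_chinese_word_rule.sub(' ', mixed).split()
print(subtitle_chunks)
print(mixed_chunks)
```

One consequence worth keeping in mind is that numbers and English terms are dropped entirely, not tokenized.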

In [None]:
with open('datas/wiki-texts-zht.txt', 'r') as fi, open('datas/wiki-seg.txt', 'w') as fo:
    not_chinese_word = u'[^\u4e00-\u9fff]'  # matches any character outside U+4E00..U+9FFF, i.e. anything that is not a Chinese character
    not_chinese_word_rule = re.compile(not_chinese_word)
    start_time = time.time()
    for i, line in enumerate(fi):
        line = not_chinese_word_rule.sub(' ', line)
        fo.write(' '.join([word for word in jieba.cut(line) if word != ' ']) + '\n')
        if i % 1000 == 0:
            print('Finished %3dk lines / elapsed time %10.2f' % (i/1000, time.time() - start_time), end='\r')

Finished 293k lines / elapsed time    1664.16

### Official corpus

In [4]:
corpus[0]

['01:00:00:29\t\t躲推銷員和金蟬脫殼',
 '01:00:02:21\t\t有什麼關係呢',
 '01:00:04:08\t\t因為牠們整個的羽化過程',
 '01:00:06:14\t\t是非常的長的',
 '01:00:07:23\t01:00:09:05\t而且這段時間牠們很脆弱',
 '01:00:35:03\t01:00:35:28\t阿倫哥呢',
 '01:00:38:24\t01:00:39:19\t這什麼啊',
 '01:00:40:18\t01:00:43:23\t阿亞姐, 這是假人耶',
 '01:00:47:01\t01:00:48:06\t不好意思啊',
 '01:00:49:27\t\t各位, 剛剛有一個推銷員',
 '01:00:51:26\t\t一直「盧」我買東西',
 '01:00:53:19\t\t所以我就來個金蟬脫殼',
 '01:00:56:00\t\t相信他知道我的意思了吧',
 '01:00:58:15\t\t會不會有點太大費周章了啊',
 '01:01:00:18\t01:01:02:15\t但是它長得跟你蠻像的',
 '01:01:03:21\t\t人不在還可以防小偷',
 '01:01:06:02\t\t有什麼東西可以偷嗎',
 '01:01:07:18\t\t把大門關起來不就好了',
 '01:01:10:07\t\t說到大門啊',
 '01:01:11:19\t\t就要問問阿亞姐了',
 '01:01:13:07\t\t門這麼久都還沒有修好',
 '01:01:15:06\t\t我早就跟你說過',
 '01:01:16:07\t\t這個門再怎麼修都修不好啦',
 '01:01:18:06\t01:01:18:27\t要不然這樣子',
 '01:01:19:28\t01:01:21:00\t交給你, 你自己來',
 '01:01:22:12\t\t這有點像我溜去同學家玩',
 '01:01:24:19\t\t用棉被蓋住枕頭',
 '01:01:26:10\t\t假裝在家裡睡覺',
 '01:01:27:29\t\t柏寬, 這樣好像有點不太好吧',
 '01:01:30:05\t\t不太好',
 '01:01:31:17\t\t是啊, 後來被媽媽抓包罰站',
 '01:01:34:00\t\t再也不敢了
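
The raw lines above appear to consist of three tab-separated fields. A minimal parsing sketch follows, assuming the first two fields are start and end timecodes and the third is the subtitle text. That interpretation, and the helper name `parse_subtitle_line`, are assumptions rather than something the corpus documents.

```python
def parse_subtitle_line(line):
    """Split one raw subtitle line into (start, end, text).

    Empty timecode fields become None. Lines without two tabs are
    treated as plain text with no timecodes.
    """
    parts = line.rstrip('\n').split('\t', 2)
    if len(parts) < 3:
        return None, None, line.strip()
    start, end, text = parts
    return start or None, end or None, text


with_end = parse_subtitle_line('01:00:07:23\t01:00:09:05\t而且這段時間牠們很脆弱')
without_end = parse_subtitle_line('01:00:00:29\t\t躲推銷員和金蟬脫殼')
plain = parse_subtitle_line('阿倫哥呢\n')
print(with_end)
print(without_end)
print(plain)
```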

In [None]:
for corpus_dir in ['datas/training_data/subtitle_with_TC', 'datas/training_data/subtitle_no_TC']:
    not_chinese_word = u'[^\u4e00-\u9fff]'  # matches any character that is not a Chinese character
    not_chinese_word_rule = re.compile(not_chinese_word)
    for dir_name in os.listdir(path=corpus_dir):
        print(dir_name)  # list the program directories under each corpus directory

In [10]:
corpus = []
for corpus_dir in ['datas/training_data/subtitle_with_TC', 'datas/training_data/subtitle_no_TC']:
    not_chinese_word = u'[^\u4e00-\u9fff]'  # matches any character that is not a Chinese character; each match is replaced by a space below
    not_chinese_word_rule = re.compile(not_chinese_word)
    for dir_name in os.listdir(path=corpus_dir):
        
        full_dir_name = os.path.join(corpus_dir, dir_name)
        for f_name in os.listdir(path=full_dir_name):
            full_name = os.path.join(full_dir_name, f_name)
            if not os.path.isfile(full_name):
                continue
            with open(full_name) as f:
                corpus.append([not_chinese_word_rule.sub(' ', line).strip() for line in f])
            print('%4d lines in %s' % (len(corpus[-1]), full_name))

 844 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(35)_G201142100350013_022.txt
 841 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(37)_G201142100370008_024.txt
 808 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(29)_G201142100290005_016.txt
 772 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(26)_G201142100260009_013.txt
 756 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(15)_G201142100150009_002.txt
 930 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(27)_G201142100270014_014.txt
 751 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(21)_G201142100210008_008.txt
 784 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(20)_G201142100200012_007.txt
 783 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(32)_G201142100320010_019.txt
 805 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(36)_G201142100360008_023.txt
 820 lines in datas/training_data/subtitle_with_TC/成語賽恩思/成語賽恩思(24)_G201142100240

 831 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(118)_G200678601180003_014.txt
 713 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(119)_G200678601190003_013.txt
 854 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(93)_G200678600930016_039.txt
 715 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(127)_G200678601270002_005.txt
 758 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(102)_G200678601020012_037.txt
 692 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(117)_G200678601170003_015.txt
 770 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(108)_G200678601080003_024.txt
 734 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(122)_G200678601220002_010.txt
 713 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(107)_G200678601070002_025.txt
 704 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(98)_G200678600980005_031.txt
 729 lines in datas/training_data/subtitle_with_TC/流言追追追/流言追追追(126)_G200

 966 lines in datas/training_data/subtitle_with_TC/人生劇展/銘謝惠顧(1)_G201432000010014_055.txt
 796 lines in datas/training_data/subtitle_with_TC/人生劇展/衣櫃裡的貓(1)_G201628200010015_013.txt
1356 lines in datas/training_data/subtitle_with_TC/人生劇展/找到(1)_G201596400010017_026.txt
1423 lines in datas/training_data/subtitle_with_TC/人生劇展/海豬仔(1)_G201628600010016_018.txt
1309 lines in datas/training_data/subtitle_with_TC/人生劇展/(委194)料理人生TC已上字版.txt
1200 lines in datas/training_data/subtitle_with_TC/人生劇展/失業陣線聯盟(1)_G201592100010016_012.txt
1178 lines in datas/training_data/subtitle_with_TC/人生劇展/只想比你多活一天(白天不宜)(1)_G201432200010004_056.txt
1740 lines in datas/training_data/subtitle_with_TC/人生劇展/關老爺(1)_G201628300010016_020.txt
1221 lines in datas/training_data/subtitle_with_TC/人生劇展/偷窺心事(白天不宜)(1)_G201523700010011_037.txt
1641 lines in datas/training_data/subtitle_with_TC/人生劇展/外星有情人(1)_G201623700010015_021.txt
 189 lines in datas/training_data/subtitle_with_TC/人生劇展/可麗餅在台灣(1)_G201404900010011_059.txt
1728 lines in d

 746 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1440)_G000081614400003_026.txt
 649 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1380)_G000081613800006_044.txt
 609 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1510)_G000081615100008_088.txt
 839 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1468)_G000081614680003_130.txt
 583 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1430)_G000081614300003_036.txt
 860 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1374)_G000081613740003_050.txt
 677 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1428)_G000081614280003_038.txt
 735 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1381)_G000081613810005_043.txt
 718 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1505)_G000081615050010_093.txt
 714 lines in datas/training_data/subtitle_wit

 774 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1456)_G000081614560003_010.txt
 682 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1513)_G000081615130006_085.txt
 853 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1460)_G000081614600003_006.txt
 913 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1469)_G000081614690003_129.txt
 601 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1433)_G000081614330004_033.txt
 906 lines in datas/training_data/subtitle_with_TC/下課花路米/下課花路米(1323集起HD可播)(1475)_G000081614750004_123.txt
1531 lines in datas/training_data/subtitle_with_TC/誰來晚餐/誰來晚餐7(23)_G201554800230008_018.txt
1363 lines in datas/training_data/subtitle_with_TC/誰來晚餐/誰來晚餐7(34)_G201554800340009_007.txt
1402 lines in datas/training_data/subtitle_with_TC/誰來晚餐/誰來晚餐7(38)_G201554800380009_003.txt
1429 lines in datas/training_data/subtitle_with_TC/誰來晚餐/誰來晚餐9(1)_G201774400010017_008.txt
1

 415 lines in datas/training_data/subtitle_no_TC/我的這一班/class177.txt
 486 lines in datas/training_data/subtitle_no_TC/我的這一班/class319.txt
 435 lines in datas/training_data/subtitle_no_TC/我的這一班/我的這一班(414)_G000021604140008_035.txt
 441 lines in datas/training_data/subtitle_no_TC/我的這一班/class156.txt
 553 lines in datas/training_data/subtitle_no_TC/我的這一班/class018.txt
 457 lines in datas/training_data/subtitle_no_TC/我的這一班/class276.txt
 503 lines in datas/training_data/subtitle_no_TC/我的這一班/class245.txt
 504 lines in datas/training_data/subtitle_no_TC/我的這一班/class332.txt
 556 lines in datas/training_data/subtitle_no_TC/我的這一班/class277.txt
 203 lines in datas/training_data/subtitle_no_TC/我的這一班/我的這一班(463)_G000021604630005_006.txt
 496 lines in datas/training_data/subtitle_no_TC/我的這一班/class370.txt
 574 lines in datas/training_data/subtitle_no_TC/我的這一班/class199.txt
 358 lines in datas/training_data/subtitle_no_TC/我的這一班/class219.txt
 418 lines in datas/training_data/subtitle_no_TC/我的這一班/class315.txt
 5

 937 lines in datas/training_data/subtitle_no_TC/聽聽看/220.txt
1118 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(756)_G000015907560002_010.txt
1027 lines in datas/training_data/subtitle_no_TC/聽聽看/144.txt
 995 lines in datas/training_data/subtitle_no_TC/聽聽看/533.txt
1242 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(748)_G000015907480002_002.txt
 851 lines in datas/training_data/subtitle_no_TC/聽聽看/387.txt
1356 lines in datas/training_data/subtitle_no_TC/聽聽看/485.txt
1304 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(719)_G000015907190002_023.txt
1198 lines in datas/training_data/subtitle_no_TC/聽聽看/481.txt
1079 lines in datas/training_data/subtitle_no_TC/聽聽看/338.txt
1205 lines in datas/training_data/subtitle_no_TC/聽聽看/443.txt
 703 lines in datas/training_data/subtitle_no_TC/聽聽看/054.txt
1021 lines in datas/training_data/subtitle_no_TC/聽聽看/670.txt
 971 lines in datas/training_data/subtitle_no_TC/聽聽看/350.txt
 171 lines in datas/training_data/subtitle_no_TC/聽聽看/514.txt
1016 li

1272 lines in datas/training_data/subtitle_no_TC/聽聽看/268.txt
 884 lines in datas/training_data/subtitle_no_TC/聽聽看/083.txt
 967 lines in datas/training_data/subtitle_no_TC/聽聽看/646.txt
1252 lines in datas/training_data/subtitle_no_TC/聽聽看/149.txt
 998 lines in datas/training_data/subtitle_no_TC/聽聽看/231.txt
 254 lines in datas/training_data/subtitle_no_TC/聽聽看/119.txt
1121 lines in datas/training_data/subtitle_no_TC/聽聽看/378.txt
1165 lines in datas/training_data/subtitle_no_TC/聽聽看/413.txt
1293 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(736)_G000015907360010_040.txt
1262 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(769)_G000015907690002_023.txt
1253 lines in datas/training_data/subtitle_no_TC/聽聽看/261.txt
1054 lines in datas/training_data/subtitle_no_TC/聽聽看/355.txt
1011 lines in datas/training_data/subtitle_no_TC/聽聽看/317.txt
1058 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(738)_G000015907380002_042.txt
1227 lines in datas/training_data/subtitle_no_TC/聽聽看/255.txt
1326 li

 875 lines in datas/training_data/subtitle_no_TC/聽聽看/088.txt
1148 lines in datas/training_data/subtitle_no_TC/聽聽看/420.txt
1225 lines in datas/training_data/subtitle_no_TC/聽聽看/273.txt
1173 lines in datas/training_data/subtitle_no_TC/聽聽看/232.txt
1158 lines in datas/training_data/subtitle_no_TC/聽聽看/307.txt
1041 lines in datas/training_data/subtitle_no_TC/聽聽看/587.txt
1224 lines in datas/training_data/subtitle_no_TC/聽聽看/688.txt
 373 lines in datas/training_data/subtitle_no_TC/聽聽看/105.txt
 996 lines in datas/training_data/subtitle_no_TC/聽聽看/326.txt
1360 lines in datas/training_data/subtitle_no_TC/聽聽看/504.txt
1370 lines in datas/training_data/subtitle_no_TC/聽聽看/243.txt
 914 lines in datas/training_data/subtitle_no_TC/聽聽看/526.txt
1199 lines in datas/training_data/subtitle_no_TC/聽聽看/201.txt
 919 lines in datas/training_data/subtitle_no_TC/聽聽看/541.txt
 334 lines in datas/training_data/subtitle_no_TC/聽聽看/097.txt
1219 lines in datas/training_data/subtitle_no_TC/聽聽看/661.txt
1212 lines in datas/trai

1091 lines in datas/training_data/subtitle_no_TC/聽聽看/319.txt
1083 lines in datas/training_data/subtitle_no_TC/聽聽看/335.txt
1098 lines in datas/training_data/subtitle_no_TC/聽聽看/397.txt
1078 lines in datas/training_data/subtitle_no_TC/聽聽看/381.txt
1078 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(737)_G000015907370003_041.txt
1043 lines in datas/training_data/subtitle_no_TC/聽聽看/174.txt
 390 lines in datas/training_data/subtitle_no_TC/聽聽看/139.txt
1189 lines in datas/training_data/subtitle_no_TC/聽聽看/363.txt
1186 lines in datas/training_data/subtitle_no_TC/聽聽看/170.txt
1102 lines in datas/training_data/subtitle_no_TC/聽聽看/164.txt
1109 lines in datas/training_data/subtitle_no_TC/聽聽看/186.txt
1143 lines in datas/training_data/subtitle_no_TC/聽聽看/281.txt
 973 lines in datas/training_data/subtitle_no_TC/聽聽看/536.txt
1048 lines in datas/training_data/subtitle_no_TC/聽聽看/628.txt
1072 lines in datas/training_data/subtitle_no_TC/聽聽看/282.txt
1065 lines in datas/training_data/subtitle_no_TC/聽聽看/聽聽看(75

 897 lines in datas/training_data/subtitle_no_TC/人生劇展/(委048)老魏旁白字幕.txt
 764 lines in datas/training_data/subtitle_no_TC/人生劇展/(委074)過了天橋, 看見海.txt
 481 lines in datas/training_data/subtitle_no_TC/人生劇展/(委161)我愛耳邊風.txt
1172 lines in datas/training_data/subtitle_no_TC/人生劇展/(自08)患難.txt
1184 lines in datas/training_data/subtitle_no_TC/人生劇展/(委148)顏色.txt
1290 lines in datas/training_data/subtitle_no_TC/人生劇展/(委011)百里香煎魚.txt
1153 lines in datas/training_data/subtitle_no_TC/人生劇展/(委012)秘密.txt
1138 lines in datas/training_data/subtitle_no_TC/人生劇展/(委190)台北．叢林.txt
 731 lines in datas/training_data/subtitle_no_TC/人生劇展/(委051)在親密與孤獨間漂流的愛情.txt
 647 lines in datas/training_data/subtitle_no_TC/人生劇展/(委082)異度東區.txt
1462 lines in datas/training_data/subtitle_no_TC/人生劇展/(自13)車站.txt
1154 lines in datas/training_data/subtitle_no_TC/人生劇展/(委100)我家有張協商桌.txt
 629 lines in datas/training_data/subtitle_no_TC/人生劇展/(委110)地鼠.txt
1055 lines in datas/training_data/subtitle_no_TC/人生劇展/(委133)艾草.txt
 831 lines in datas/trainin

1577 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道1.txt
1377 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道16.txt
1592 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(27)_G201207800270008_002.txt
1500 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道18.txt
 908 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道10.txt
1402 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(198)_G201207801980007_024.txt
1591 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(75)_G201207800750007_050.txt
1206 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(41)_G201207800410002_016.txt
1755 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(33)_G201207800330002_008.txt
1379 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(74)_G201207800740008_049.txt
1527 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(57)_G201207800570006_032.txt
1306 lines in datas/training_data/subtitle_no_TC/公視藝文大道/公視藝文大道(32)_G201207800320

 673 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-688.txt
 743 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-219.txt
 772 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-750.txt
 726 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-629.txt
 582 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-341.txt
 783 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1086.txt
 826 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-955.txt
 700 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1028.txt
 659 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1175.txt
 712 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-461.txt
 828 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-614.txt
 667 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-646.txt
 864 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1239.txt
 654 lines in datas/training_data/subtitle_no_TC/下課花路米/下課花路米(1323集起HD可播)(1345)_G000081613450007_080.txt
 454 lin

 675 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1000.txt
 645 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-692.txt
 754 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-195.txt
 610 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-999.txt
1552 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1249.txt
 579 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-909.txt
 682 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-482.txt
 789 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-478.txt
 617 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-448.txt
 561 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-825.txt
 742 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-145.txt
 710 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-987.txt
 624 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1202.txt
 770 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-466.txt
 866 lines in datas/training_data/subtitle_no

 738 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-563.txt
 608 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-794.txt
 582 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1115.txt
 884 lines in datas/training_data/subtitle_no_TC/下課花路米/下課花路米(1323集起HD可播)(1353)_G000081613530003_072.txt
 766 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-998.txt
 658 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-839.txt
 718 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-647.txt
 765 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-445.txt
 619 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-363.txt
 496 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1290.txt
 680 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-652.txt
 640 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-514.txt
 702 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-524.txt
 712 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-640.txt
 589 lines

 489 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-827.txt
 581 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1157.txt
 512 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1200.txt
 613 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1182.txt
 734 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-471.txt
 708 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-417.txt
 534 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1241.txt
 771 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1291.txt
 567 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-615.txt
 884 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-444.txt
 689 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1254.txt
 796 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1263.txt
 544 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-806.txt
 640 lines in datas/training_data/subtitle_no_TC/下課花路米/FOLL-77.txt
 670 lines in datas/training_data/subtitle

 625 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-918.txt
 686 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-510.txt
 731 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-917.txt
 713 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-148.txt
 749 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1090.txt
 569 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-697.txt
 558 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-488.txt
 753 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-671.txt
 814 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-469.txt
 539 lines in datas/training_data/subtitle_no_TC/下課花路米/F-08.txt
 829 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-567.txt
 615 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-780.txt
 799 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1292.txt
 756 lines in datas/training_data/subtitle_no_TC/下課花路米/下課花路米(1323集起HD可播)(1366)_G000081613660005_059.txt
 588 lines in 

 653 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-344.txt
 519 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-840.txt
 626 lines in datas/training_data/subtitle_no_TC/下課花路米/Foll-311.txt
 766 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-612.txt
 699 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-1052.txt
 778 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-532.txt
 837 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-587.txt
 486 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-901.txt
 597 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-921.txt
 620 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-329.txt
 695 lines in datas/training_data/subtitle_no_TC/下課花路米/Foll-254.txt
 755 lines in datas/training_data/subtitle_no_TC/下課花路米/foll-252.txt
1510 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐3+1(7).txt
1452 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐11.txt
1462 lines in datas/training_data/subtitle_no_TC/

1474 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐2-39.txt
1446 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐2-36.txt
1322 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐2-14.txt
1729 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐3+1(28).txt
1422 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐2-30.txt
1465 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐318.txt
1473 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐6(22)_G201451400220009_018.txt
1577 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐315.txt
1428 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐2-9.txt
1315 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐6(19)_G201451400190003_022.txt
1387 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐3+1(40).txt
1537 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐6(31)_G201451400310003_009.txt
1455 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐2-50.txt
1453 lines in datas/training_data/subtitle_no_TC/誰來晚餐/誰來晚餐6(

In [18]:
[word for s in corpus[0][10].split() for word in jieba.cut(s)]

['一直', '盧', '我', '買東西']
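
The single-line expression above could be extended to a whole file roughly as follows. This is a sketch, not the notebook's code: `tokenize_lines` is a hypothetical helper, and the tokenizer is passed in as `cut` so the example runs without jieba. The stand-in used here simply splits each chunk into single characters.

```python
def tokenize_lines(lines, cut):
    """Tokenize each cleaned line chunk by chunk; returns one token list per line."""
    return [[word for chunk in line.split() for word in cut(chunk)]
            for line in lines]


# Stand-in tokenizer for illustration only: splits a chunk into single characters.
lines = ['一直 盧 我買東西', '不太好']
tokenized = tokenize_lines(lines, cut=list)
print(tokenized)
```

With jieba loaded as in the first cell, `tokenize_lines(corpus[0], cut=jieba.cut)` would apply the same segmentation used in the cell above to every line of the first file.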