### Obtaining the corpus
This archive holds the title and body text of each article.

In [None]:
!wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
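
Before running the full extraction, it can help to peek inside the archive and confirm that it really contains page titles and text. The sketch below is only an illustration: `iter_titles` is a hypothetical helper, and the tiny in-memory sample (including its namespace URI) stands in for the real multi-gigabyte dump.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_titles(xml_stream, limit=5):
    """Yield up to `limit` page titles from a MediaWiki XML export stream."""
    count = 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        # Tags carry an XML namespace prefix, so compare only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "title":
            yield elem.text
            count += 1
            if count >= limit:
                return
        elif local_name == "page":
            elem.clear()  # release the memory held by the finished page


# Hypothetical sample data shaped like a MediaWiki export.
sample_xml = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<page><title>数学</title><revision><text>数学是...</text></revision></page>"
    "<page><title>哲学</title><revision><text>哲学是...</text></revision></page>"
    "</mediawiki>"
).encode("utf-8")
sample_dump = bz2.compress(sample_xml)

# bz2.open accepts a path or a file object and decompresses on the fly.
with bz2.open(io.BytesIO(sample_dump), "rb") as stream:
    titles = list(iter_titles(stream))
print(titles)
```

For the real file, passing `"./zhwiki-latest-pages-articles.xml.bz2"` to `bz2.open` instead of the `BytesIO` object should stream the first few titles without decompressing the whole dump.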

### Extracting the body text
Convert the wiki data from XML to plain text.

In [7]:
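!pip install gensim
from gensim.corpora import WikiCorpus

dump_path = "./zhwiki-latest-pages-articles.xml.bz2"
text_path = "./wiki.zh.text"
# Write UTF-8 explicitly so the output does not depend on the platform default.
with open(text_path, "w", encoding="utf-8") as f:
    # dictionary={} skips the vocabulary scan, which this step does not need.
    # lemmatize=False applies to gensim 3.x; gensim 4 removed the argument.
    wiki = WikiCorpus(dump_path, lemmatize=False, dictionary={})
    # Each item is one article's token list; write one article per line.
    for tokens in wiki.get_texts():
        f.write(" ".join(tokens) + "\n")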

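### Traditional-to-Simplified conversion
Chinese Wikipedia mixes Simplified and Traditional text (mainland Simplified, Taiwan Traditional, Hong Kong and Macau Traditional, and so on), so Traditional characters need to be converted to Simplified when extracting Chinese text. OpenCC can do the conversion; other Traditional-to-Simplified tools work as well.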
```
Options:
 -i [file], --input=[file]   Read original text from [file].
 -o [file], --output=[file]  Write converted text to [file].
 -c [file], --config=[file]  Load configuration of conversion from [file].
 -v, --version               Print version and build information.
 -h, --help                  Print this help.

```

In [9]:
!opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini

^C
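
To make the conversion step concrete, here is a toy sketch of character-level Traditional-to-Simplified mapping. It is not how OpenCC is implemented: `T2S_TABLE` and `to_simplified` are hypothetical names, and the five-entry table is only a stand-in for a real conversion dictionary, which is far larger and also handles phrase-level and regional variants.

```python
# Hypothetical, deliberately tiny mapping from Traditional to Simplified characters.
T2S_TABLE = str.maketrans({
    "漢": "汉",
    "語": "语",
    "數": "数",
    "學": "学",
    "維": "维",
})


def to_simplified(text: str) -> str:
    """Replace each mapped Traditional character; leave everything else untouched."""
    return text.translate(T2S_TABLE)


converted = to_simplified("維基百科 漢語 數學")
print(converted)
```

Characters that are identical in both scripts (such as 基, 百 and 科) pass through unchanged, which is why a mixed corpus can be converted in a single pass.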


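### Encoding conversion
The word segmentation step that follows expects UTF-8 input, and the Simplified text produced above may contain bytes that are not valid UTF-8. Converting the encoding first keeps segmentation from failing partway through.

Use the `iconv` command to convert the file to UTF-8: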
```
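 Input/Output format specification:
  -f, --from-code=NAME       encoding of original text
  -t, --to-code=NAME         encoding for output

 Information:
  -l, --list                 list all known coded character sets

 Output control:
  -c                         omit invalid characters from output
  -o, --output=FILE          output file
  -s, --silent               suppress warnings
      --verbose              print progress information

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version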

```

In [10]:
!iconv -c -t UTF-8 wiki.zh.text.jian > wiki.zh.text.jian.utf8

iconv: incomplete character or shift sequence at end of buffer
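
The effect of `iconv -c` can be approximated in Python by decoding with `errors="ignore"`, which silently drops byte sequences that are not valid UTF-8. The sketch below uses a made-up byte string containing one stray byte and a multi-byte character cut off at the end, the same kind of input the warning above complains about. `drop_invalid_utf8` is a hypothetical helper name.

```python
def drop_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, discarding any invalid or truncated sequences."""
    return raw.decode("utf-8", errors="ignore")


raw = (
    "中文".encode("utf-8")
    + b"\xff"        # a byte that can never appear in valid UTF-8
    + "分词".encode("utf-8")
    + b"\xe4\xb8"    # first two bytes of a three-byte character, cut off
)
cleaned = drop_invalid_utf8(raw)
print(cleaned)
```

For a whole file, opening it with `open(path, encoding="utf-8", errors="ignore")` applies the same policy while reading line by line, so the file never has to fit in memory.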


### Word segmentation

In [12]:
# Install jieba into the same interpreter that `python -m jieba` runs under.
!python -m pip install jieba
!python -m jieba -d ' ' wiki.zh.text.jian.utf8 > wiki.zh.text.jian.utf8.seg

/usr/bin/python: No module named jieba
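
Segmentation can also be done from Python instead of the command line. The sketch below shows the line-by-line shape of that step. `segment_lines` is a hypothetical helper that accepts any tokenizer callable, and `char_cut` is a stand-in that splits text into single characters so the example runs without jieba installed. Dropping whitespace-only tokens is a choice made here, not something taken from the notebook.

```python
def segment_lines(lines, cut, delimiter=" "):
    """Yield each input line as delimiter-joined tokens produced by `cut`."""
    for line in lines:
        # Drop whitespace-only tokens so the delimiter is the only separator.
        tokens = [tok for tok in cut(line.strip()) if tok.strip()]
        yield delimiter.join(tokens)


def char_cut(text):
    """Stand-in tokenizer: one token per character (not real word segmentation)."""
    return list(text)


segmented = list(segment_lines(["我爱北京\n", "天安门\n"], char_cut))
print(segmented)
```

With jieba available, passing `jieba.cut` as the `cut` argument and an open UTF-8 file as `lines` should produce space-separated words in the form that `LineSentence` reads in the training step.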


### Training word2vec

In [13]:
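from gensim.models import word2vec
import logging

corpus_path = './wiki.zh.text.jian.utf8.seg'
vector_path = './wiki.vector'
# Log progress at INFO level so vocabulary building and each epoch are visible.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# LineSentence streams the file, reading whitespace-separated tokens line by line.
sentences = word2vec.LineSentence(corpus_path)
# 100-dimensional vectors; `size` is the gensim 3.x name (gensim 4 calls it `vector_size`).
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
# Save the vectors in the plain-text word2vec format.
model.wv.save_word2vec_format(vector_path, binary=False)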

2018-03-04 08:14:55,583 : INFO : collecting all words and their counts
2018-03-04 08:14:55,586 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-04 08:14:59,361 : INFO : PROGRESS: at sentence #10000, processed 12094411 words, keeping 585586 word types
2018-03-04 08:15:02,192 : INFO : PROGRESS: at sentence #20000, processed 20888421 words, keeping 842063 word types
2018-03-04 08:15:04,525 : INFO : PROGRESS: at sentence #30000, processed 28821064 words, keeping 1022757 word types
2018-03-04 08:15:06,857 : INFO : PROGRESS: at sentence #40000, processed 36129182 words, keeping 1187820 word types
2018-03-04 08:15:08,993 : INFO : PROGRESS: at sentence #50000, processed 43069207 words, keeping 1331775 word types
2018-03-04 08:15:11,209 : INFO : PROGRESS: at sentence #60000, processed 49625064 words, keeping 1456963 word types
2018-03-04 08:15:13,163 : INFO : PROGRESS: at sentence #70000, processed 55796553 words, keeping 1569634 word types
2018-03-04 08:15:15,

2018-03-04 08:16:40,939 : INFO : EPOCH 1 - PROGRESS: at 10.65% examples, 891137 words/s, in_qsize 4, out_qsize 1
2018-03-04 08:16:41,949 : INFO : EPOCH 1 - PROGRESS: at 11.08% examples, 891274 words/s, in_qsize 8, out_qsize 0
2018-03-04 08:16:42,954 : INFO : EPOCH 1 - PROGRESS: at 11.46% examples, 887412 words/s, in_qsize 6, out_qsize 1
2018-03-04 08:16:43,963 : INFO : EPOCH 1 - PROGRESS: at 11.78% examples, 880690 words/s, in_qsize 5, out_qsize 0
2018-03-04 08:16:44,964 : INFO : EPOCH 1 - PROGRESS: at 12.22% examples, 882351 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:16:45,972 : INFO : EPOCH 1 - PROGRESS: at 12.72% examples, 886575 words/s, in_qsize 4, out_qsize 1
2018-03-04 08:16:46,978 : INFO : EPOCH 1 - PROGRESS: at 13.10% examples, 884939 words/s, in_qsize 8, out_qsize 0
2018-03-04 08:16:47,984 : INFO : EPOCH 1 - PROGRESS: at 13.45% examples, 879710 words/s, in_qsize 4, out_qsize 1
2018-03-04 08:16:48,985 : INFO : EPOCH 1 - PROGRESS: at 13.94% examples, 882342 words/s, in_qsiz

2018-03-04 08:17:54,443 : INFO : EPOCH 1 - PROGRESS: at 50.08% examples, 915154 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:17:55,449 : INFO : EPOCH 1 - PROGRESS: at 50.70% examples, 915546 words/s, in_qsize 2, out_qsize 0
2018-03-04 08:17:56,475 : INFO : EPOCH 1 - PROGRESS: at 51.36% examples, 916131 words/s, in_qsize 8, out_qsize 0
2018-03-04 08:17:57,475 : INFO : EPOCH 1 - PROGRESS: at 51.92% examples, 916109 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:17:58,484 : INFO : EPOCH 1 - PROGRESS: at 52.49% examples, 916206 words/s, in_qsize 3, out_qsize 0
2018-03-04 08:17:59,487 : INFO : EPOCH 1 - PROGRESS: at 53.18% examples, 920110 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:18:00,504 : INFO : EPOCH 1 - PROGRESS: at 53.73% examples, 919876 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:18:01,549 : INFO : EPOCH 1 - PROGRESS: at 54.22% examples, 917391 words/s, in_qsize 5, out_qsize 2
2018-03-04 08:18:02,554 : INFO : EPOCH 1 - PROGRESS: at 54.74% examples, 915909 words/s, in_qsiz

2018-03-04 08:19:08,112 : INFO : EPOCH 1 - PROGRESS: at 99.91% examples, 945008 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:19:08,276 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-03-04 08:19:08,286 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-04 08:19:08,297 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-04 08:19:08,299 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-04 08:19:08,300 : INFO : EPOCH - 1 : training on 182244424 raw words (169823309 effective words) took 179.7s, 945133 effective words/s
2018-03-04 08:19:09,305 : INFO : EPOCH 2 - PROGRESS: at 0.12% examples, 769008 words/s, in_qsize 5, out_qsize 0
2018-03-04 08:19:10,310 : INFO : EPOCH 2 - PROGRESS: at 0.27% examples, 711180 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:19:11,316 : INFO : EPOCH 2 - PROGRESS: at 0.50% examples, 799441 words/s, in_qsize 6, out_qsize 2
2018-03-04 08:19:12,319 : INFO : EPOCH 2 - P

2018-03-04 08:20:17,898 : INFO : EPOCH 2 - PROGRESS: at 31.49% examples, 963655 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:20:18,916 : INFO : EPOCH 2 - PROGRESS: at 31.94% examples, 961270 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:20:19,921 : INFO : EPOCH 2 - PROGRESS: at 32.41% examples, 958991 words/s, in_qsize 4, out_qsize 0
2018-03-04 08:20:20,930 : INFO : EPOCH 2 - PROGRESS: at 33.05% examples, 960567 words/s, in_qsize 4, out_qsize 0
2018-03-04 08:20:21,935 : INFO : EPOCH 2 - PROGRESS: at 33.66% examples, 959851 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:20:22,940 : INFO : EPOCH 2 - PROGRESS: at 34.27% examples, 960256 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:20:23,943 : INFO : EPOCH 2 - PROGRESS: at 34.91% examples, 961636 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:20:24,958 : INFO : EPOCH 2 - PROGRESS: at 35.54% examples, 961904 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:20:25,964 : INFO : EPOCH 2 - PROGRESS: at 36.18% examples, 961935 words/s, in_qsiz

2018-03-04 08:21:31,388 : INFO : EPOCH 2 - PROGRESS: at 78.80% examples, 974338 words/s, in_qsize 3, out_qsize 0
2018-03-04 08:21:32,394 : INFO : EPOCH 2 - PROGRESS: at 79.44% examples, 974728 words/s, in_qsize 1, out_qsize 0
2018-03-04 08:21:33,396 : INFO : EPOCH 2 - PROGRESS: at 80.12% examples, 975672 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:21:34,398 : INFO : EPOCH 2 - PROGRESS: at 80.78% examples, 975494 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:21:35,401 : INFO : EPOCH 2 - PROGRESS: at 81.22% examples, 973311 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:21:36,401 : INFO : EPOCH 2 - PROGRESS: at 81.82% examples, 973230 words/s, in_qsize 7, out_qsize 1
2018-03-04 08:21:37,407 : INFO : EPOCH 2 - PROGRESS: at 82.51% examples, 973637 words/s, in_qsize 4, out_qsize 0
2018-03-04 08:21:38,409 : INFO : EPOCH 2 - PROGRESS: at 83.11% examples, 973833 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:21:39,426 : INFO : EPOCH 2 - PROGRESS: at 83.60% examples, 972558 words/s, in_qsiz

2018-03-04 08:22:41,255 : INFO : EPOCH 3 - PROGRESS: at 14.45% examples, 928716 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:22:42,272 : INFO : EPOCH 3 - PROGRESS: at 14.79% examples, 923670 words/s, in_qsize 5, out_qsize 0
2018-03-04 08:22:43,274 : INFO : EPOCH 3 - PROGRESS: at 15.31% examples, 925237 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:22:44,280 : INFO : EPOCH 3 - PROGRESS: at 15.83% examples, 926681 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:22:45,287 : INFO : EPOCH 3 - PROGRESS: at 16.30% examples, 926518 words/s, in_qsize 8, out_qsize 0
2018-03-04 08:22:46,302 : INFO : EPOCH 3 - PROGRESS: at 16.86% examples, 927882 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:22:47,314 : INFO : EPOCH 3 - PROGRESS: at 17.43% examples, 931292 words/s, in_qsize 3, out_qsize 0
2018-03-04 08:22:48,320 : INFO : EPOCH 3 - PROGRESS: at 18.03% examples, 933787 words/s, in_qsize 4, out_qsize 0
2018-03-04 08:22:49,332 : INFO : EPOCH 3 - PROGRESS: at 18.50% examples, 933416 words/s, in_qsiz

2018-03-04 08:23:54,986 : INFO : EPOCH 3 - PROGRESS: at 56.55% examples, 946426 words/s, in_qsize 8, out_qsize 0
2018-03-04 08:23:55,996 : INFO : EPOCH 3 - PROGRESS: at 57.03% examples, 944127 words/s, in_qsize 7, out_qsize 1
2018-03-04 08:23:56,997 : INFO : EPOCH 3 - PROGRESS: at 57.71% examples, 944580 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:23:58,001 : INFO : EPOCH 3 - PROGRESS: at 58.56% examples, 946225 words/s, in_qsize 2, out_qsize 0
2018-03-04 08:23:59,005 : INFO : EPOCH 3 - PROGRESS: at 59.33% examples, 946493 words/s, in_qsize 0, out_qsize 1
2018-03-04 08:24:00,016 : INFO : EPOCH 3 - PROGRESS: at 60.03% examples, 946123 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:24:01,028 : INFO : EPOCH 3 - PROGRESS: at 60.68% examples, 945872 words/s, in_qsize 6, out_qsize 1
2018-03-04 08:24:02,029 : INFO : EPOCH 3 - PROGRESS: at 61.51% examples, 946841 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:24:03,047 : INFO : EPOCH 3 - PROGRESS: at 62.17% examples, 946764 words/s, in_qsiz

2018-03-04 08:25:04,871 : INFO : EPOCH 4 - PROGRESS: at 0.37% examples, 936426 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:25:05,872 : INFO : EPOCH 4 - PROGRESS: at 0.57% examples, 882226 words/s, in_qsize 6, out_qsize 1
2018-03-04 08:25:06,880 : INFO : EPOCH 4 - PROGRESS: at 0.80% examples, 901361 words/s, in_qsize 7, out_qsize 1
2018-03-04 08:25:07,894 : INFO : EPOCH 4 - PROGRESS: at 1.12% examples, 927137 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:25:08,898 : INFO : EPOCH 4 - PROGRESS: at 1.38% examples, 921982 words/s, in_qsize 8, out_qsize 0
2018-03-04 08:25:09,903 : INFO : EPOCH 4 - PROGRESS: at 1.66% examples, 922467 words/s, in_qsize 3, out_qsize 0
2018-03-04 08:25:10,913 : INFO : EPOCH 4 - PROGRESS: at 2.00% examples, 933506 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:25:11,921 : INFO : EPOCH 4 - PROGRESS: at 2.35% examples, 951493 words/s, in_qsize 5, out_qsize 0
2018-03-04 08:25:12,921 : INFO : EPOCH 4 - PROGRESS: at 2.70% examples, 952446 words/s, in_qsize 0, out_

2018-03-04 08:26:18,466 : INFO : EPOCH 4 - PROGRESS: at 32.85% examples, 918547 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:26:19,471 : INFO : EPOCH 4 - PROGRESS: at 33.42% examples, 917904 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:26:20,472 : INFO : EPOCH 4 - PROGRESS: at 34.00% examples, 917950 words/s, in_qsize 4, out_qsize 0
2018-03-04 08:26:21,479 : INFO : EPOCH 4 - PROGRESS: at 34.61% examples, 919250 words/s, in_qsize 1, out_qsize 0
2018-03-04 08:26:22,482 : INFO : EPOCH 4 - PROGRESS: at 35.27% examples, 920873 words/s, in_qsize 4, out_qsize 0
2018-03-04 08:26:23,489 : INFO : EPOCH 4 - PROGRESS: at 35.80% examples, 919030 words/s, in_qsize 6, out_qsize 1
2018-03-04 08:26:24,496 : INFO : EPOCH 4 - PROGRESS: at 36.24% examples, 916516 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:26:25,502 : INFO : EPOCH 4 - PROGRESS: at 36.82% examples, 915950 words/s, in_qsize 4, out_qsize 0
2018-03-04 08:26:26,504 : INFO : EPOCH 4 - PROGRESS: at 37.46% examples, 916256 words/s, in_qsiz

2018-03-04 08:27:32,060 : INFO : EPOCH 4 - PROGRESS: at 78.47% examples, 931197 words/s, in_qsize 5, out_qsize 1
2018-03-04 08:27:33,073 : INFO : EPOCH 4 - PROGRESS: at 79.06% examples, 930750 words/s, in_qsize 8, out_qsize 0
2018-03-04 08:27:34,073 : INFO : EPOCH 4 - PROGRESS: at 79.65% examples, 931023 words/s, in_qsize 2, out_qsize 0
2018-03-04 08:27:35,076 : INFO : EPOCH 4 - PROGRESS: at 80.32% examples, 931809 words/s, in_qsize 0, out_qsize 1
2018-03-04 08:27:36,085 : INFO : EPOCH 4 - PROGRESS: at 81.01% examples, 932329 words/s, in_qsize 6, out_qsize 1
2018-03-04 08:27:37,086 : INFO : EPOCH 4 - PROGRESS: at 81.62% examples, 932484 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:27:38,089 : INFO : EPOCH 4 - PROGRESS: at 82.23% examples, 932352 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:27:39,092 : INFO : EPOCH 4 - PROGRESS: at 82.69% examples, 931222 words/s, in_qsize 3, out_qsize 2
2018-03-04 08:27:40,096 : INFO : EPOCH 4 - PROGRESS: at 83.31% examples, 931859 words/s, in_qsiz

2018-03-04 08:28:42,471 : INFO : EPOCH 5 - PROGRESS: at 13.67% examples, 939775 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:28:43,472 : INFO : EPOCH 5 - PROGRESS: at 14.13% examples, 939778 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:28:44,473 : INFO : EPOCH 5 - PROGRESS: at 14.51% examples, 934681 words/s, in_qsize 3, out_qsize 1
2018-03-04 08:28:45,480 : INFO : EPOCH 5 - PROGRESS: at 14.92% examples, 932830 words/s, in_qsize 6, out_qsize 0
2018-03-04 08:28:46,488 : INFO : EPOCH 5 - PROGRESS: at 15.39% examples, 932231 words/s, in_qsize 6, out_qsize 1
2018-03-04 08:28:47,507 : INFO : EPOCH 5 - PROGRESS: at 15.89% examples, 931437 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:28:48,519 : INFO : EPOCH 5 - PROGRESS: at 16.27% examples, 927845 words/s, in_qsize 7, out_qsize 1
2018-03-04 08:28:49,520 : INFO : EPOCH 5 - PROGRESS: at 16.82% examples, 928989 words/s, in_qsize 2, out_qsize 0
2018-03-04 08:28:50,524 : INFO : EPOCH 5 - PROGRESS: at 17.38% examples, 932393 words/s, in_qsiz

2018-03-04 08:29:56,045 : INFO : EPOCH 5 - PROGRESS: at 55.06% examples, 945429 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:29:57,050 : INFO : EPOCH 5 - PROGRESS: at 55.61% examples, 944623 words/s, in_qsize 0, out_qsize 0
2018-03-04 08:29:58,064 : INFO : EPOCH 5 - PROGRESS: at 56.17% examples, 943712 words/s, in_qsize 7, out_qsize 1
2018-03-04 08:29:59,071 : INFO : EPOCH 5 - PROGRESS: at 56.89% examples, 944624 words/s, in_qsize 5, out_qsize 1
2018-03-04 08:30:00,074 : INFO : EPOCH 5 - PROGRESS: at 57.58% examples, 945183 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:30:01,091 : INFO : EPOCH 5 - PROGRESS: at 58.34% examples, 945800 words/s, in_qsize 6, out_qsize 2
2018-03-04 08:30:02,110 : INFO : EPOCH 5 - PROGRESS: at 58.95% examples, 944544 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:30:03,111 : INFO : EPOCH 5 - PROGRESS: at 59.65% examples, 943794 words/s, in_qsize 7, out_qsize 0
2018-03-04 08:30:04,121 : INFO : EPOCH 5 - PROGRESS: at 60.35% examples, 944143 words/s, in_qsiz

2018-03-04 08:31:05,265 : INFO : training on a 911222120 raw words (849119780 effective words) took 896.7s, 946989 effective words/s
2018-03-04 08:31:05,266 : INFO : storing 782240x100 projection weights into ./wiki.vector
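
The last log line reports the shape of the saved matrix (vocabulary size by vector dimension). With `binary=False`, the word2vec text format starts with a header line holding those two numbers, followed by one line per word: the word itself and then its space-separated components. The sketch below reads that layout with the standard library only. `read_word2vec_text` is a hypothetical helper, and the three-dimensional sample is made up for illustration.

```python
import io


def read_word2vec_text(stream):
    """Parse the plain-text word2vec format into (vocab_size, dim, vectors)."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header declares {vocab_size} words, found {len(vectors)}")
    return vocab_size, dim, vectors


# Made-up sample in the same layout as the saved vector file.
sample = io.StringIO("2 3\n国王 0.1 0.2 0.3\n皇后 0.2 0.1 0.4\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
print(vocab_size, dim, sorted(vectors))
```

Reading the real file this way would load every vector into a Python dict, so for anything beyond inspection the gensim loader is the more practical route.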


### Testing word2vec

In [15]:
# Analogy test: vector('woman') + vector('king') - vector('man') ≈ vector('queen')
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

2018-03-04 08:32:10,519 : INFO : precomputing L2-norms of word weight vectors


[('queen', 0.7293198108673096),
 ('bride', 0.683803915977478),
 ('mistress', 0.6707652807235718),
 ('prince', 0.6648019552230835),
 ('wives', 0.6588137149810791),
 ('princess', 0.6529775857925415),
 ('queens', 0.6459839344024658),
 ('daughters', 0.6443694829940796),
 ('mother', 0.6377485990524292),
 ('godmother', 0.6289682388305664)]

In [16]:
model.wv.similarity('woman', 'man')

0.6361447854063201

In [19]:
model.wv.similarity('queen', 'king')

0.6681771014772736
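
The queries above rest on cosine similarity between word vectors. The sketch below reproduces the idea on made-up three-dimensional vectors: `cosine` compares two vectors, and `analogy` adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. The function names and the toy vectors are assumptions for illustration. gensim's own implementation differs in detail and is much faster.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: axis 0 = "person", axis 1 = "female", axis 2 = "royal".
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}
results = analogy(toy, positive=["woman", "king"], negative=["man"])
print(results[0][0])
print(round(cosine(toy["woman"], toy["man"]), 4))
```

In this toy space "queen" ranks first because it shares both the "female" and "royal" components of the query vector, while the input words themselves are excluded from the ranking.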

In [21]:
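from gensim.models import word2vec
import logging

corpus_path = './wiki.zh.text.jian.utf8.seg'
vector_path = './wiki.vector50'
# Log progress at INFO level so vocabulary building and each epoch are visible.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# LineSentence streams the file, reading whitespace-separated tokens line by line.
sentences = word2vec.LineSentence(corpus_path)
# Same settings as before, but with 50-dimensional vectors (`vector_size` in gensim 4).
model = word2vec.Word2Vec(sentences, size=50, window=5, min_count=5, workers=4)
# Save the vectors in the plain-text word2vec format.
model.wv.save_word2vec_format(vector_path, binary=False)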

2018-03-04 09:15:03,233 : INFO : collecting all words and their counts
2018-03-04 09:15:03,236 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-04 09:15:07,066 : INFO : PROGRESS: at sentence #10000, processed 12094411 words, keeping 585586 word types
2018-03-04 09:15:09,865 : INFO : PROGRESS: at sentence #20000, processed 20888421 words, keeping 842063 word types
2018-03-04 09:15:12,327 : INFO : PROGRESS: at sentence #30000, processed 28821064 words, keeping 1022757 word types
2018-03-04 09:15:14,899 : INFO : PROGRESS: at sentence #40000, processed 36129182 words, keeping 1187820 word types
2018-03-04 09:15:16,929 : INFO : PROGRESS: at sentence #50000, processed 43069207 words, keeping 1331775 word types
2018-03-04 09:15:19,264 : INFO : PROGRESS: at sentence #60000, processed 49625064 words, keeping 1456963 word types
2018-03-04 09:15:21,090 : INFO : PROGRESS: at sentence #70000, processed 55796553 words, keeping 1569634 word types
2018-03-04 09:15:23,

2018-03-04 09:16:48,359 : INFO : EPOCH 1 - PROGRESS: at 11.20% examples, 959126 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:16:49,364 : INFO : EPOCH 1 - PROGRESS: at 11.72% examples, 961180 words/s, in_qsize 4, out_qsize 1
2018-03-04 09:16:50,390 : INFO : EPOCH 1 - PROGRESS: at 12.20% examples, 962717 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:16:51,391 : INFO : EPOCH 1 - PROGRESS: at 12.58% examples, 958740 words/s, in_qsize 6, out_qsize 1
2018-03-04 09:16:52,395 : INFO : EPOCH 1 - PROGRESS: at 13.05% examples, 959035 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:16:53,400 : INFO : EPOCH 1 - PROGRESS: at 13.59% examples, 962257 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:16:54,410 : INFO : EPOCH 1 - PROGRESS: at 14.13% examples, 965005 words/s, in_qsize 4, out_qsize 0
2018-03-04 09:16:55,412 : INFO : EPOCH 1 - PROGRESS: at 14.59% examples, 965224 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:16:56,418 : INFO : EPOCH 1 - PROGRESS: at 15.04% examples, 963481 words/s, in_qsiz

2018-03-04 09:18:01,896 : INFO : EPOCH 1 - PROGRESS: at 54.09% examples, 986759 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:18:02,896 : INFO : EPOCH 1 - PROGRESS: at 54.87% examples, 988227 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:18:03,898 : INFO : EPOCH 1 - PROGRESS: at 55.61% examples, 989664 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:18:04,903 : INFO : EPOCH 1 - PROGRESS: at 56.19% examples, 988682 words/s, in_qsize 5, out_qsize 1
2018-03-04 09:18:05,910 : INFO : EPOCH 1 - PROGRESS: at 57.03% examples, 990945 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:18:06,913 : INFO : EPOCH 1 - PROGRESS: at 57.82% examples, 992036 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:18:07,925 : INFO : EPOCH 1 - PROGRESS: at 58.55% examples, 991967 words/s, in_qsize 1, out_qsize 1
2018-03-04 09:18:08,930 : INFO : EPOCH 1 - PROGRESS: at 59.40% examples, 992756 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:18:09,935 : INFO : EPOCH 1 - PROGRESS: at 60.28% examples, 994477 words/s, in_qsiz

2018-03-04 09:19:10,738 : INFO : EPOCH 2 - PROGRESS: at 0.51% examples, 803971 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:19:11,741 : INFO : EPOCH 2 - PROGRESS: at 0.77% examples, 862697 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:19:12,756 : INFO : EPOCH 2 - PROGRESS: at 1.08% examples, 889983 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:19:13,761 : INFO : EPOCH 2 - PROGRESS: at 1.38% examples, 920687 words/s, in_qsize 3, out_qsize 0
2018-03-04 09:19:14,766 : INFO : EPOCH 2 - PROGRESS: at 1.72% examples, 938215 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:19:15,768 : INFO : EPOCH 2 - PROGRESS: at 2.11% examples, 973690 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:19:16,773 : INFO : EPOCH 2 - PROGRESS: at 2.47% examples, 986801 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:19:17,800 : INFO : EPOCH 2 - PROGRESS: at 2.87% examples, 998813 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:19:18,804 : INFO : EPOCH 2 - PROGRESS: at 3.22% examples, 1006214 words/s, in_qsize 8, out

2018-03-04 09:20:24,508 : INFO : EPOCH 2 - PROGRESS: at 37.42% examples, 997477 words/s, in_qsize 5, out_qsize 0
2018-03-04 09:20:25,510 : INFO : EPOCH 2 - PROGRESS: at 38.11% examples, 996983 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:20:26,513 : INFO : EPOCH 2 - PROGRESS: at 38.62% examples, 993121 words/s, in_qsize 3, out_qsize 0
2018-03-04 09:20:27,539 : INFO : EPOCH 2 - PROGRESS: at 39.28% examples, 993660 words/s, in_qsize 2, out_qsize 0
2018-03-04 09:20:28,547 : INFO : EPOCH 2 - PROGRESS: at 39.82% examples, 992597 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:20:29,549 : INFO : EPOCH 2 - PROGRESS: at 40.37% examples, 992227 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:20:30,555 : INFO : EPOCH 2 - PROGRESS: at 40.92% examples, 992045 words/s, in_qsize 4, out_qsize 0
2018-03-04 09:20:31,567 : INFO : EPOCH 2 - PROGRESS: at 41.34% examples, 988278 words/s, in_qsize 6, out_qsize 1
2018-03-04 09:20:32,572 : INFO : EPOCH 2 - PROGRESS: at 42.05% examples, 989432 words/s, in_qsiz

2018-03-04 09:21:38,056 : INFO : EPOCH 2 - PROGRESS: at 85.69% examples, 999233 words/s, in_qsize 1, out_qsize 0
2018-03-04 09:21:39,061 : INFO : EPOCH 2 - PROGRESS: at 86.30% examples, 999454 words/s, in_qsize 2, out_qsize 1
2018-03-04 09:21:40,066 : INFO : EPOCH 2 - PROGRESS: at 86.92% examples, 998643 words/s, in_qsize 2, out_qsize 0
2018-03-04 09:21:41,072 : INFO : EPOCH 2 - PROGRESS: at 87.71% examples, 999237 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:21:42,077 : INFO : EPOCH 2 - PROGRESS: at 88.51% examples, 999945 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:21:43,096 : INFO : EPOCH 2 - PROGRESS: at 89.28% examples, 1000395 words/s, in_qsize 6, out_qsize 1
2018-03-04 09:21:44,099 : INFO : EPOCH 2 - PROGRESS: at 89.92% examples, 999684 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:21:45,106 : INFO : EPOCH 2 - PROGRESS: at 90.51% examples, 997954 words/s, in_qsize 5, out_qsize 1
2018-03-04 09:21:46,109 : INFO : EPOCH 2 - PROGRESS: at 91.33% examples, 997992 words/s, in_qsi

2018-03-04 09:22:46,869 : INFO : EPOCH 3 - PROGRESS: at 20.69% examples, 1000153 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:22:47,887 : INFO : EPOCH 3 - PROGRESS: at 21.24% examples, 1001336 words/s, in_qsize 0, out_qsize 1
2018-03-04 09:22:48,902 : INFO : EPOCH 3 - PROGRESS: at 21.89% examples, 1002845 words/s, in_qsize 2, out_qsize 1
2018-03-04 09:22:49,906 : INFO : EPOCH 3 - PROGRESS: at 22.47% examples, 1002885 words/s, in_qsize 1, out_qsize 0
2018-03-04 09:22:50,919 : INFO : EPOCH 3 - PROGRESS: at 22.91% examples, 998398 words/s, in_qsize 4, out_qsize 1
2018-03-04 09:22:51,931 : INFO : EPOCH 3 - PROGRESS: at 23.52% examples, 998259 words/s, in_qsize 7, out_qsize 1
2018-03-04 09:22:52,932 : INFO : EPOCH 3 - PROGRESS: at 24.16% examples, 1000909 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:22:53,969 : INFO : EPOCH 3 - PROGRESS: at 24.71% examples, 999882 words/s, in_qsize 1, out_qsize 2
2018-03-04 09:22:54,985 : INFO : EPOCH 3 - PROGRESS: at 25.28% examples, 998270 words/s, in

2018-03-04 09:24:00,629 : INFO : EPOCH 3 - PROGRESS: at 67.71% examples, 1004583 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:24:01,629 : INFO : EPOCH 3 - PROGRESS: at 68.39% examples, 1004065 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:24:02,634 : INFO : EPOCH 3 - PROGRESS: at 69.05% examples, 1004326 words/s, in_qsize 4, out_qsize 0
2018-03-04 09:24:03,653 : INFO : EPOCH 3 - PROGRESS: at 69.74% examples, 1004867 words/s, in_qsize 7, out_qsize 1
2018-03-04 09:24:04,663 : INFO : EPOCH 3 - PROGRESS: at 70.30% examples, 1003817 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:24:05,665 : INFO : EPOCH 3 - PROGRESS: at 70.79% examples, 1001333 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:24:06,667 : INFO : EPOCH 3 - PROGRESS: at 71.48% examples, 1001411 words/s, in_qsize 2, out_qsize 0
2018-03-04 09:24:07,670 : INFO : EPOCH 3 - PROGRESS: at 72.18% examples, 1001918 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:24:08,674 : INFO : EPOCH 3 - PROGRESS: at 72.87% examples, 1002369 words/s

2018-03-04 09:25:09,192 : INFO : EPOCH 4 - PROGRESS: at 7.47% examples, 982542 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:25:10,192 : INFO : EPOCH 4 - PROGRESS: at 7.89% examples, 981704 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:25:11,202 : INFO : EPOCH 4 - PROGRESS: at 8.31% examples, 980321 words/s, in_qsize 1, out_qsize 1
2018-03-04 09:25:12,209 : INFO : EPOCH 4 - PROGRESS: at 8.82% examples, 983064 words/s, in_qsize 7, out_qsize 1
2018-03-04 09:25:13,213 : INFO : EPOCH 4 - PROGRESS: at 9.32% examples, 987265 words/s, in_qsize 6, out_qsize 1
2018-03-04 09:25:14,221 : INFO : EPOCH 4 - PROGRESS: at 9.75% examples, 989818 words/s, in_qsize 5, out_qsize 0
2018-03-04 09:25:15,235 : INFO : EPOCH 4 - PROGRESS: at 10.14% examples, 983435 words/s, in_qsize 6, out_qsize 1
2018-03-04 09:25:16,251 : INFO : EPOCH 4 - PROGRESS: at 10.61% examples, 982784 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:25:17,253 : INFO : EPOCH 4 - PROGRESS: at 11.04% examples, 981102 words/s, in_qsize 1, o

2018-03-04 09:26:22,780 : INFO : EPOCH 4 - PROGRESS: at 49.66% examples, 1004565 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:26:23,787 : INFO : EPOCH 4 - PROGRESS: at 50.27% examples, 1004252 words/s, in_qsize 7, out_qsize 1
2018-03-04 09:26:24,788 : INFO : EPOCH 4 - PROGRESS: at 50.93% examples, 1004484 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:26:25,795 : INFO : EPOCH 4 - PROGRESS: at 51.73% examples, 1006760 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:26:26,796 : INFO : EPOCH 4 - PROGRESS: at 52.36% examples, 1006849 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:26:27,796 : INFO : EPOCH 4 - PROGRESS: at 52.96% examples, 1008424 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:26:28,800 : INFO : EPOCH 4 - PROGRESS: at 53.60% examples, 1009161 words/s, in_qsize 5, out_qsize 0
2018-03-04 09:26:29,806 : INFO : EPOCH 4 - PROGRESS: at 54.22% examples, 1008338 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:26:30,810 : INFO : EPOCH 4 - PROGRESS: at 54.86% examples, 1007435 words/s

2018-03-04 09:27:32,166 : INFO : EPOCH - 4 : training on 182244424 raw words (169825457 effective words) took 165.1s, 1028585 effective words/s
2018-03-04 09:27:33,177 : INFO : EPOCH 5 - PROGRESS: at 0.18% examples, 1033451 words/s, in_qsize 7, out_qsize 1
2018-03-04 09:27:34,186 : INFO : EPOCH 5 - PROGRESS: at 0.47% examples, 1117372 words/s, in_qsize 5, out_qsize 0
2018-03-04 09:27:35,189 : INFO : EPOCH 5 - PROGRESS: at 0.69% examples, 1045970 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:27:36,193 : INFO : EPOCH 5 - PROGRESS: at 0.83% examples, 918765 words/s, in_qsize 8, out_qsize 1
2018-03-04 09:27:37,202 : INFO : EPOCH 5 - PROGRESS: at 1.15% examples, 948634 words/s, in_qsize 0, out_qsize 2
2018-03-04 09:27:38,205 : INFO : EPOCH 5 - PROGRESS: at 1.48% examples, 983260 words/s, in_qsize 3, out_qsize 0
2018-03-04 09:27:39,215 : INFO : EPOCH 5 - PROGRESS: at 1.83% examples, 979896 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:27:40,225 : INFO : EPOCH 5 - PROGRESS: at 2.09% example

2018-03-04 09:28:45,751 : INFO : EPOCH 5 - PROGRESS: at 34.41% examples, 977037 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:28:46,752 : INFO : EPOCH 5 - PROGRESS: at 34.79% examples, 972681 words/s, in_qsize 2, out_qsize 0
2018-03-04 09:28:47,763 : INFO : EPOCH 5 - PROGRESS: at 35.37% examples, 971850 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:28:48,765 : INFO : EPOCH 5 - PROGRESS: at 36.04% examples, 972714 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:28:49,770 : INFO : EPOCH 5 - PROGRESS: at 36.69% examples, 972540 words/s, in_qsize 4, out_qsize 0
2018-03-04 09:28:50,772 : INFO : EPOCH 5 - PROGRESS: at 37.20% examples, 971089 words/s, in_qsize 2, out_qsize 0
2018-03-04 09:28:51,775 : INFO : EPOCH 5 - PROGRESS: at 37.95% examples, 971173 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:28:52,781 : INFO : EPOCH 5 - PROGRESS: at 38.67% examples, 971771 words/s, in_qsize 1, out_qsize 0
2018-03-04 09:28:53,805 : INFO : EPOCH 5 - PROGRESS: at 39.35% examples, 972848 words/s, in_qsiz

2018-03-04 09:29:59,315 : INFO : EPOCH 5 - PROGRESS: at 81.14% examples, 972298 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:30:00,324 : INFO : EPOCH 5 - PROGRESS: at 81.75% examples, 972158 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:30:01,330 : INFO : EPOCH 5 - PROGRESS: at 82.24% examples, 970342 words/s, in_qsize 8, out_qsize 0
2018-03-04 09:30:02,336 : INFO : EPOCH 5 - PROGRESS: at 82.86% examples, 970580 words/s, in_qsize 3, out_qsize 1
2018-03-04 09:30:03,343 : INFO : EPOCH 5 - PROGRESS: at 83.47% examples, 970599 words/s, in_qsize 6, out_qsize 0
2018-03-04 09:30:04,343 : INFO : EPOCH 5 - PROGRESS: at 84.13% examples, 971429 words/s, in_qsize 7, out_qsize 0
2018-03-04 09:30:05,354 : INFO : EPOCH 5 - PROGRESS: at 84.67% examples, 970178 words/s, in_qsize 7, out_qsize 1
2018-03-04 09:30:06,356 : INFO : EPOCH 5 - PROGRESS: at 85.35% examples, 970428 words/s, in_qsize 0, out_qsize 0
2018-03-04 09:30:07,368 : INFO : EPOCH 5 - PROGRESS: at 85.97% examples, 971231 words/s, in_qsiz