## Training a word-level language model

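#### Experimental findings:
* Sequence modeling: next-word prediction quality is roughly in line with expectations.
* Training needs at least 200-500 epochs.
* Optimizer: Adam with lr=1e-3.
* So far only up to 1,000 lines of the corpus have been tested; training becomes very slow as the corpus grows.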

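#### Strategies for handling variable sequence lengths [todo]
* Padding: pad the inputs with 0, run the computation directly, then truncate the results accordingly. This is fairly simple,
	but it wastes a lot of computation when sequence lengths differ widely.
* Packed sequence: stack the sequences time step by time step and record where each one starts and ends.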
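
The two strategies above can be sketched with PyTorch's built-in utilities. This is a minimal illustration, not what the notebook's `sgd_nlp` code does: the toy sequences, the layer sizes, and the choice of 0 as the reserved padding id are all assumptions made for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three toy token-id sequences of different lengths (hypothetical data; id 0 is reserved for padding).
seqs = [torch.tensor([4, 7, 2, 9]), torch.tensor([5, 3]), torch.tensor([8, 1, 6])]
lengths = torch.tensor([len(s) for s in seqs])

# Padding: fill every sequence with 0 up to the longest length.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (3, 4)

embed = torch.nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Packing: the LSTM only processes the real time steps of each sequence.
packed = pack_padded_sequence(embed(padded), lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor; positions past each sequence's length are zeros.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(padded.shape, out.shape, out_lengths.tolist())
```

With plain padding the padded positions are still computed and have to be masked or truncated afterwards; with packing they are skipped, which matters most when lengths vary widely.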
  
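#### Training the LSTM (this notebook only uses truncated BPTT)
* Truncated BPTT with hidden-state repackaging (drawback: long-range dependencies are lost).
* Record the final h and c, and use detach to drop the previous chunk's computation graph (this keeps long-range dependencies and sits between full BPTT and truncated BPTT). [todo]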
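
The second item above, carrying h and c across chunks while detaching them, could look roughly like the sketch below. It uses `torch.nn.LSTM` directly rather than the notebook's `LanguageModel`, and the toy sizes and random token stream are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)
vocab_size, emb_dim, hid_dim = 50, 16, 32   # toy sizes, not the notebook's settings
seq_len, batch_size = 5, 4

embed = torch.nn.Embedding(vocab_size, emb_dim)
lstm = torch.nn.LSTM(emb_dim, hid_dim)
head = torch.nn.Linear(hid_dim, vocab_size)
params = list(embed.parameters()) + list(lstm.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Random token stream laid out as (time, batch), standing in for batchified text.
data = torch.randint(0, vocab_size, (21, batch_size))

hidden = None  # nn.LSTM starts from zeros when the state is None
losses = []
for i in range(0, data.size(0) - 1, seq_len):
    x = data[i:i + seq_len]            # inputs for this chunk
    y = data[i + 1:i + 1 + seq_len]    # targets: the same chunk shifted by one token
    if hidden is not None:
        # Keep the values of h and c but cut the graph, so backward() stops at the chunk boundary.
        hidden = tuple(h.detach() for h in hidden)
    out, hidden = lstm(embed(x), hidden)
    loss = loss_fn(head(out).view(-1, vocab_size), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(losses)
```

Without the `detach()` call, the second `backward()` would try to propagate into the previous chunk's already-freed graph and fail; with it, the state values still flow forward while gradients stay within one chunk.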

#### TODO:
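* Compare efficiency and quality against the official PyTorch implementation
* Gradient clipping
* A complete training procedure that uses the train, valid, and test splits to select the best model parameters
* Visualize the training process with plots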
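
For the gradient clipping item, one possible approach is PyTorch's `torch.nn.utils.clip_grad_norm_`, called between `backward()` and the optimizer step. The sketch below is only an assumption about how it might be wired in; the `max_norm` value and the artificially inflated loss are hypothetical.

```python
import torch

torch.manual_seed(0)
model = torch.nn.LSTM(input_size=8, hidden_size=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
max_norm = 0.25  # hypothetical clipping threshold

x = torch.randn(5, 3, 8)            # (time, batch, features)
out, _ = model(x)
loss = out.pow(2).sum() * 100.0     # inflated loss so the gradients are large

opt.zero_grad()
loss.backward()

# Rescale all gradients in place so that their global L2 norm is at most max_norm.
# The return value is the total norm measured before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
opt.step()

print(float(total_norm), float(clipped_norm))
```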

In [None]:
# Environment setup
%cd /playground/sgd_deep_learning/sgd_nlp/
import sys 
sys.path.append('./python')

In [None]:
# Download the datasets
import urllib.request
import os

!mkdir -p './data/ptb'
# Download Penn Treebank dataset

# Downloading from the GitHub raw-file URL can fail; if it does, download the files into the data directory manually
# ptb_data = "https://github.com/wojzaremba/lstm/blob/master/data/ptb."
ptb_data = "https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb."
for f in ['train.txt', 'test.txt', 'valid.txt']:
    if not os.path.exists(os.path.join('./data/ptb', f)):
        print(ptb_data + f)
        urllib.request.urlretrieve(ptb_data + f, os.path.join('./data/ptb', f))

In [None]:
import torch
import sgd_nlp
import numpy as np
from sgd_nlp.models import LanguageModel
from sgd_nlp.simple_training import train_ptb, evaluate_ptb

In [None]:
# Training hyperparameters
# device = torch.device('cpu')
device = torch.device('cuda:0')

num_layers = 2  # number of RNN layers
n_epochs = 200  # number of passes over the data

embedding_size = 400  # word embedding dimension
hidden_size = 1150  # hidden dimension

seq_len = 20  # truncation length for truncated BPTT
batch_size = 200  # batch size

optimizer = torch.optim.Adam
lr = 1e-4  # learning rate
weight_decay = 0
loss_fn = torch.nn.CrossEntropyLoss()

# Load the training data
corpus = sgd_nlp.data.Corpus("data/ptb", max_lines=1000)
train_data = sgd_nlp.data.batchify(corpus.train, batch_size=batch_size, device=device, dtype=np.float32)
print(train_data.shape)
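
The internals of `sgd_nlp.data.batchify` are not shown here. As a rough mental model, the sketch below assumes the layout commonly used for word-level language models: the token stream is trimmed, split into `batch_size` contiguous columns, and then read in chunks of `seq_len` rows with targets shifted by one token. The helpers `batchify_sketch` and `get_chunk` are hypothetical names, not part of `sgd_nlp`.

```python
import numpy as np

def batchify_sketch(token_ids, batch_size):
    """Trim the stream so it divides evenly, then lay it out as (n_steps, batch_size)."""
    n_steps = len(token_ids) // batch_size
    trimmed = np.asarray(token_ids[:n_steps * batch_size])
    # Each column is one contiguous slice of the original stream.
    return trimmed.reshape(batch_size, n_steps).T

def get_chunk(data, i, seq_len):
    """Return inputs and next-token targets for the chunk starting at row i."""
    length = min(seq_len, len(data) - 1 - i)
    return data[i:i + length], data[i + 1:i + 1 + length]

# Toy stream of 26 "token ids"; the last 2 are dropped so 24 tokens fill 4 columns of 6.
data = batchify_sketch(list(range(26)), batch_size=4)
x, y = get_chunk(data, 0, seq_len=3)
print(data.shape)
print(x)
print(y)
```

Under this layout each column is an independent stream, so the hidden state of column k at the end of one chunk lines up with column k at the start of the next.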

### RNN

In [None]:
# Verify that the model parameters are set up correctly
model = LanguageModel(embedding_size=embedding_size,
                      output_size=len(corpus.dictionary),
                      hidden_size=hidden_size,
                      num_layers=num_layers,
                      seq_model='rnn',
                      device=device)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)

In [None]:

model = LanguageModel(embedding_size=embedding_size,
                      output_size=len(corpus.dictionary),
                      hidden_size=hidden_size,
                      num_layers=num_layers,
                      seq_model='rnn',
                      device=device)

train_ptb(model, 
          train_data, 
          seq_len=seq_len, 
          n_epochs=n_epochs, 
          device=device, 
          optimizer=optimizer, 
          lr=lr, 
          weight_decay=weight_decay, 
          loss_fn=loss_fn)

evaluate_ptb(model,
             train_data,
             seq_len=seq_len,
             device=device)

### LSTM

In [None]:
lr = 1e-3
n_epochs = 100

model = LanguageModel(embedding_size=embedding_size, output_size=len(corpus.dictionary), hidden_size=hidden_size, num_layers=num_layers, seq_model='lstm', device=device)
train_ptb(model, train_data, seq_len=seq_len, n_epochs=n_epochs, device=device, optimizer=optimizer, lr=lr, weight_decay=weight_decay, loss_fn=loss_fn)
evaluate_ptb(model, train_data, seq_len=seq_len, device=device)

### GRU

In [None]:
lr = 1e-3
n_epochs = 100

model = LanguageModel(embedding_size=embedding_size, output_size=len(corpus.dictionary), hidden_size=hidden_size, num_layers=num_layers, seq_model='gru', device=device)
train_ptb(model, train_data, seq_len=seq_len, n_epochs=n_epochs, device=device, optimizer=optimizer, lr=lr, weight_decay=weight_decay, loss_fn=loss_fn)
evaluate_ptb(model, train_data, seq_len=seq_len, device=device)