In [1]:
import torch
import torch.nn as nn
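# Field, TabularDataset and BucketIterator belong to the legacy torchtext API
# (moved to torchtext.legacy.data in 0.9, removed in 0.12).
from torchtext import data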

In [2]:
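# No custom tokenizer is passed, so Field falls back to its default (str.split).
TEXT = data.Field(init_token='<SOS>', eos_token='<EOS>', lower=True)
# Numeric ID and label columns: not sequential, no vocabulary lookup.
LABEL = data.Field(sequential=False, use_vocab=False)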

In [3]:
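def raw_info(vocab, ID_list):
    """Print each sequence of a (length, batch) tensor of token IDs as text."""
    length, bsz = ID_list.shape
    for b in range(bsz):  # one printed line per batch column
        for i in range(length):
            print(vocab.itos[ID_list[i, b]], end=' ')
        print()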

In [4]:
data_path = '/home/zhaoyu/Datasets/NLPBasics/sentiment/train.tsv'
dataset = data.TabularDataset(data_path, 'TSV', skip_header=True, 
                              fields=[('PhraseId', LABEL), ('SentenceId', LABEL),
                                      ('Phrase', TEXT), ('Sentiment', LABEL)])
TEXT.build_vocab(dataset)
vocab = TEXT.vocab

In [5]:
train_iter = data.BucketIterator(dataset, batch_size=4, 
                                 train=False,
                                 sort=False,
                                 shuffle=True, 
                                 sort_within_batch=False,
                                 sort_key=lambda x: len(x.Phrase),
                                 repeat=False)

print('train:', train_iter.train, '\nsort:', train_iter.sort, 
      '\nshuffle:', train_iter.shuffle)

train: False 
sort: False 
shuffle: True


In [6]:
for i, batch in enumerate(train_iter):
    # Inspect a window of batches deep into the epoch.
    if 10000 < i < 10100:
        print(i, batch.PhraseId, batch.Sentiment, batch.Phrase.shape)
        raw_info(vocab, batch.Phrase)
    if i >= 10100:
        break

10001 tensor([ 36584, 101035,  27719,  52323]) tensor([2, 2, 3, 2]) torch.Size([7, 4])
<SOS> , for that matter . <EOS> 
<SOS> defeated but defiant <EOS> <pad> <pad> 
<SOS> pedro <EOS> <pad> <pad> <pad> <pad> 
<SOS> fosters <EOS> <pad> <pad> <pad> <pad> 
10002 tensor([ 32512, 129524,  23216, 139599]) tensor([2, 3, 3, 0]) torch.Size([32, 4])
<SOS> of war 's madness remembered that we , today , can prevent its tragic waste of life <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> a real audience-pleaser that will strike a chord with anyone who 's ever waited in a doctor 's office , emergency room , hospital bed or insurance company office . <EOS> 
<SOS> casual filmgoers <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> neglecting character development <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pa

<SOS> the deliberate <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> fondness and respect <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> has crafted here a worldly-wise and very funny script . <EOS> <pad> <pad> 
10035 tensor([ 11338,  93054, 149064,  72571]) tensor([1, 1, 2, 2]) torch.Size([16, 4])
<SOS> 's no point of view , no contemporary interpretation of joan 's prefeminist plight <EOS> 
<SOS> 've reeked of a been-there , done-that sameness <EOS> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> passed a long time <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> tiny acts <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
10036 tensor([34041, 64690, 33895, 33770]) tensor([3, 2, 3, 0]) torch.Size([27, 4])
<SOS> watch the film twice <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> than five <EOS> <pad> <pa

10067 tensor([82078, 47902, 15058, 62193]) tensor([4, 1, 1, 2]) torch.Size([28, 4])
<SOS> is a perfect family film to take everyone to since there 's no new `` a christmas carol '' out in the theaters this year . <EOS> 
<SOS> a bad improvisation exercise , <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> were a series of bible parables and not an actual story <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> married <EOS> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
10068 tensor([ 46215, 153094,  32880, 104774]) tensor([3, 2, 2, 2]) torch.Size([10, 4])
<SOS> , rain is the far superior film . <EOS> 
<SOS> sisterly obsession <EOS> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> as weakness <EOS> <pad> <pad> <pad> <pad> <pad> <pad> 
<SOS> curmudgeon
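
The batches printed above are padded to the longest phrase in each batch, which is why short phrases such as `pedro` carry several `<pad>` tokens. The following is a minimal pure-Python sketch of that layout, not torchtext's own implementation. The helper names `pad_batch` and `pad_fraction` are hypothetical; the four phrases are taken from batch 10001 above.

```python
def pad_batch(phrases, init_token="<SOS>", eos_token="<EOS>", pad_token="<pad>"):
    """Wrap each tokenized phrase in SOS/EOS markers and right-pad to a common length."""
    wrapped = [[init_token] + tokens + [eos_token] for tokens in phrases]
    max_len = max(len(tokens) for tokens in wrapped)
    return [tokens + [pad_token] * (max_len - len(tokens)) for tokens in wrapped]


def pad_fraction(padded, pad_token="<pad>"):
    """Share of positions in the padded batch that hold the pad token."""
    total = sum(len(row) for row in padded)
    pads = sum(row.count(pad_token) for row in padded)
    return pads / total


# Phrases from batch 10001, already lower-cased and whitespace-tokenized.
phrases = [
    ", for that matter .".split(),
    "defeated but defiant".split(),
    ["pedro"],
    ["fosters"],
]

padded = pad_batch(phrases)
for row in padded:
    print(" ".join(row))

# (sequence length, batch size), matching the (length, batch) layout of batch.Phrase
print(len(padded[0]), len(padded))
print(round(pad_fraction(padded), 3))
```

In this sketch, 10 of the 28 positions are padding. Grouping phrases of similar length into the same batch would lower that share, which is the motivation behind bucketing iterators.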

In [7]:
class Classifier(nn.Module):
    def __init__(self, vocab_size, nemb, nhid, nclass, nlayer=2):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=nemb)
        self.rnn = nn.GRU(input_size=nemb, hidden_size=nhid, num_layers=nlayer)
        self.fc = nn.Linear(nhid, nclass)

    def forward(self, x, hidden_state):
        # x: (seq_len, batch) token IDs; hidden_state: (nlayer, batch, nhid) or None
        emb = self.emb(x)  # (seq_len, batch, nemb)
        rnn_output, hidden_state = self.rnn(emb, hidden_state)
        # One score vector per time step: (seq_len, batch, nclass)
        logits = self.fc(rnn_output)
        return logits

In [8]:
# 300-dim embeddings, 512 hidden units, 5 sentiment classes (labels 0-4)
rnn = Classifier(len(vocab), 300, 512, 5)
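
A quick shape check can confirm what `forward` returns. The sketch below repeats the `Classifier` definition so that it runs on its own. It uses small toy sizes and random token IDs in place of `batch.Phrase`; these are assumptions for illustration, not values from the dataset.

```python
import torch
import torch.nn as nn


class Classifier(nn.Module):
    # Same layers as the Classifier cell above, repeated so this sketch is self-contained.
    def __init__(self, vocab_size, nemb, nhid, nclass, nlayer=2):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=nemb)
        self.rnn = nn.GRU(input_size=nemb, hidden_size=nhid, num_layers=nlayer)
        self.fc = nn.Linear(nhid, nclass)

    def forward(self, x, hidden_state):
        emb = self.emb(x)
        rnn_output, hidden_state = self.rnn(emb, hidden_state)
        return self.fc(rnn_output)


# Assumed toy sizes, chosen only to keep the check fast.
toy_vocab_size, seq_len, bsz, nclass = 50, 7, 4, 5
model = Classifier(toy_vocab_size, nemb=8, nhid=16, nclass=nclass)

# Random token IDs standing in for batch.Phrase, laid out as (seq_len, batch).
x = torch.randint(0, toy_vocab_size, (seq_len, bsz))

with torch.no_grad():
    logits = model(x, None)  # None lets nn.GRU start from a zero hidden state

print(logits.shape)     # scores for every token position: (seq_len, batch, nclass)
last_step = logits[-1]  # scores at the final position only: (batch, nclass)
print(last_step.shape)
```

Because batches are padded, the final position of a short phrase is a `<pad>` step. Taking `logits[-1]` as the phrase-level prediction is therefore only one possible choice; selecting the step at each phrase's true length is another option.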