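### Assignment goal: become proficient at loading text data with Torchtext

This assignment uses the movie reviews from the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset to practice loading data with torchtext. The data is in the attached polarity.tsv.

Hint: You can try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset), which makes reading the data simpler.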
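
To make concrete what a tabular loader does with a tab-separated file, here is a minimal standard-library sketch that maps each row onto named columns. It is not torchtext's implementation. The two-column `text`/`label` layout, the helper name `read_tsv`, and the sample rows are assumptions made for illustration.

```python
import csv
import io

def read_tsv(handle, names=("text", "label")):
    """Read tab-separated rows and map each one onto the given column names."""
    # QUOTE_NONE keeps literal double quotes inside the review text untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(names, row)) for row in reader if row]

# Made-up rows standing in for the real file (hypothetical content).
sample = io.StringIO("a great film\t1\na dull film\t0\n")
rows = read_tsv(sample)
print(rows[0]["text"], rows[0]["label"])
```

Labels come back as strings here; a real pipeline would convert them to numbers or let a label field build its own vocabulary.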

### Load packages

In [1]:
import torch
import pandas as pd
import numpy as np
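# Field, Example, Dataset and Iterator are in torchtext.data only in older releases
# (they moved to torchtext.legacy in 0.9 and were later removed).
from torchtext import data, datasets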

In [2]:
# Explore the data
# Each row holds a review text and a class; the class is either positive (1) or negative (0)
input_data = pd.read_csv('./polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

| | text | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | jaws is a rare film that grabs your attentio... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
| ... | ... | ... |
| 1995 | if anything , " stigmata " should be taken as ... | 0 |
| 1996 | john boorman's " zardoz " is a goofy cinematic... | 0 |
| 1997 | the kids in the hall are an acquired taste .it... | 0 |
| 1998 | there was a time when john carpenter was a gre... | 0 |


### Build a pipeline to generate the data

In [14]:
input_data.values

array([['films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there\'s never really been a comic book like from hell before .for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid \'80s with a 12-part series called the watchmen .to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .in other words , don\'t dismiss this film because of its source .if you can get past the whole comic book thing , you might find another stumbling block in from hell\'s directors , albert and allen hughes .getting the hughes brothers to direct this seems almost a

In [21]:
import re

def remove_non_char(x):
    """Join the tokens, replace every non-letter character with a space, and re-split."""
    x = ' '.join(x)
    x = re.sub(r'[^a-zA-Z]', ' ', x)
    return x.split()
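
The helper above is defined but not called in the cells that follow. As a minimal usage sketch, the block below repeats the function so it runs on its own and applies it to a made-up token list (the sample tokens are an assumption for illustration). It shows that digits and punctuation are dropped and that a contraction such as `it's` is split in two.

```python
import re

def remove_non_char(x):
    """Join the tokens, replace every non-letter character with a space, and re-split."""
    x = ' '.join(x)
    x = re.sub(r'[^a-zA-Z]', ' ', x)
    return x.split()

# Hypothetical token list, not taken from polarity.tsv.
tokens = ["it's", "2", "good!"]
cleaned = remove_non_char(tokens)
print(cleaned)  # ['it', 's', 'good']
```

A function with this shape (list of tokens in, list of tokens out) could plausibly be passed as the `preprocessing` argument of a `Field`, though the notebook does not do so.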

In [43]:
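# Build the Fields and the Dataset
import spacy
# Note: calling spaCy's tokenizer returns a Doc of Token objects, not a list of strings.
nlp = spacy.load('en_core_web_sm').tokenizer

# Note: dtype=torch.float64 stores token indices as floats (see the batch output below);
# embedding layers expect integer indices, which is the Field default (torch.long).
TEXT = data.Field(sequential=True, dtype=torch.float64, tokenize=nlp)
LABEL = data.LabelField(dtype=torch.float)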
fields = [('text', TEXT), ('label', LABEL)]

examples = []
for text, label in input_data.values:
    examples.append(data.Example.fromlist(data=[text, label], fields=fields))

In [50]:
# Take the examples and shuffle them
import random
random.shuffle(examples)
# Split the examples 8:2
train_ex = examples[:int(len(examples)*0.8)]
test_ex = examples[int(len(examples)*0.8):]

# Build the training and testing datasets
train_data = data.Dataset(examples=train_ex, fields=dict(fields))
test_data = data.Dataset(examples=test_ex, fields=dict(fields))

train_data[0].label, train_data[0].text

(0,
  desperate measures  is a generic title for a film that's beyond generic .it's also a depressing waste of talent , with the solid team of michael keaton and andy garcia unthankfully thrown thankless lead roles , not to mention once-cool director barbet schroeder sadly continuing his string of not-cool flicks -- this thriller is more " before and after " than " reversal of fortune . "the movie is a big disappointment , and yet it's somewhat easy to see what motivated such big names to attach themselves to it -- the premise is both promising and intriguing .too bad the execution's all wrong , though , because the set-up of " desperate measures " boasts some rather enticing elements that deserve to be put to far better use .san francisco cop frank connor ( garcia ) is a single parent with a troubling dilemma -- his son matt ( joseph cross ) is stricken with cancer which only a bone marrow transplant can push into remission .even worse , the only compatible donor is violent sociopath 
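
The shuffle above uses the global random state, so the split changes on every run. A minimal sketch of a reproducible 8:2 split follows. The helper name `split_examples` and the `seed` parameter are assumptions for illustration, and plain integers stand in for the `Example` objects.

```python
import random

def split_examples(examples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the examples with a fixed seed, then cut at train_ratio."""
    rng = random.Random(seed)   # private generator, leaves the global state alone
    shuffled = list(examples)   # copy so the caller's sequence is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for torchtext Example objects (hypothetical data).
train_part, test_part = split_examples(range(10))
print(len(train_part), len(test_part))  # 8 2
```

Calling the helper again with the same seed returns the same split, which makes later comparisons between runs easier.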

In [54]:
# Build the vocabularies
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

print(f"Vocabulary entries at index 0-9: {TEXT.vocab.itos[:10]} \n")
print(f"words to index {LABEL.vocab.stoi}")

Vocabulary entries at index 0-9: ['<unk>', '<pad>',  , synopsis, delicatessen, the, one, perhaps, yet, i] 

words to index defaultdict(None, {0: 0, 1: 1})
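
The vocabulary printed above lists its entries without quotes, and the batch indices further down run past one million. Both suggest that the tokens are spaCy `Token` objects rather than strings, so every occurrence counts as a distinct entry. The sketch below illustrates the effect in plain Python. `FakeToken` is a hypothetical stand-in hashed by identity, `build_vocab` is a simplified helper and not torchtext's implementation, and the ordering of entries is simplified too.

```python
from collections import Counter

class FakeToken:
    """Hypothetical stand-in for a token object that is hashed by identity."""
    def __init__(self, text):
        self.text = text

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Count tokens and assign indices, special tokens first, then by frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

docs = ["the film is good", "the film is bad"]
string_tokens = [d.split() for d in docs]
object_tokens = [[FakeToken(w) for w in d.split()] for d in docs]

itos_str, stoi_str = build_vocab(string_tokens)
itos_obj, _ = build_vocab(object_tokens)
print(len(itos_str))  # 7: two specials plus five distinct words
print(len(itos_obj))  # 10: two specials plus eight token occurrences
```

With spaCy, one way to get strings would be a tokenizer function such as `lambda s: [t.text for t in nlp(s)]`; this is a suggestion, not something the notebook does.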


In [68]:
train_iter = data.Iterator(dataset=train_data, batch_size=2, repeat=False, sort_key=lambda ex: len(ex.text))
test_iter = data.Iterator(dataset=test_data, batch_size=2, repeat=False, sort_key=lambda ex: len(ex.text))

In [70]:
i = 0
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    i+=1
    if i == 3:
        break

tensor([[7.1700e+02, 5.4600e+02],
        [2.2310e+03, 3.2040e+03],
        [5.2360e+03, 4.5270e+03],
        ...,
        [1.0915e+06, 1.0000e+00],
        [1.0920e+06, 1.0000e+00],
        [1.0927e+06, 1.0000e+00]], dtype=torch.float64) torch.Size([920, 2])
tensor([0., 0.]) torch.Size([2])
tensor([[4.1600e+02, 7.0000e+00],
        [2.8720e+03, 3.4210e+03],
        [4.1580e+03, 4.4010e+03],
        ...,
        [1.0098e+06, 1.0000e+00],
        [1.0102e+06, 1.0000e+00],
        [1.0106e+06, 1.0000e+00]], dtype=torch.float64) torch.Size([754, 2])
tensor([1., 0.]) torch.Size([2])
tensor([[1.0510e+03, 1.3350e+03],
        [2.9900e+03, 2.3620e+03],
        [3.6370e+03, 4.3470e+03],
        ...,
        [1.0000e+00, 9.2563e+05],
        [1.0000e+00, 9.2612e+05],
        [1.0000e+00, 9.2700e+05]], dtype=torch.float64) torch.Size([626, 2])
tensor([1., 0.]) torch.Size([2])
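
In the batches printed above, each text tensor has shape `[sequence length, batch size]`, and the trailing `1.0` values appear to be the index of `<pad>` (index 1 in the vocabulary listing). The sketch below shows this padding and layout with plain lists. The helper name `pad_batch` is an assumption for illustration, the index values are taken from the printed batch only as sample numbers, and this is not torchtext's own code.

```python
def pad_batch(sequences, pad_index=1):
    """Pad index sequences to equal length and return them as [seq_len, batch]."""
    max_len = max(len(seq) for seq in sequences)
    # Right-pad every sequence with pad_index up to the longest length.
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that rows are time steps and columns are examples.
    return [list(column) for column in zip(*padded)]

# Two index sequences of different length (sample numbers only).
batch = pad_batch([[717, 2231, 5236], [546, 3204]])
print(batch)  # [[717, 546], [2231, 3204], [5236, 1]]
```

The shorter sequence is padded in the last row, which matches the pattern of `1.0` entries at the bottom of one column in the printed tensors.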
