### Assignment goal: become proficient at loading text data with Torchtext

This assignment uses the movie reviews from the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset to practice loading data with torchtext. The data is provided in the attached polarity.tsv.

Hint: try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset), which makes reading the data simpler.
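
For orientation, the sketch below shows what reading a two-column TSV amounts to, using only the standard library. It is not how torchtext implements `TabularDataset`; the `read_tsv` helper and the two inline sample rows are hypothetical, and the text/label column order is assumed from the rest of this notebook.

```python
import csv
import io

def read_tsv(handle, field_names=("text", "label")):
    """Read tab-separated rows into dicts keyed by field_names (hypothetical helper)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(field_names, row)) for row in reader]

# Two made-up rows standing in for polarity.tsv (illustration only).
sample = io.StringIO("a fine film\t1\na dull film\t0\n")
rows = read_tsv(sample)
print(rows[0]["text"], rows[0]["label"])
```

Note that the label comes back as a string, which is also what the `'0'` label printed later in this notebook suggests for the torchtext pipeline.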

### Load packages

In [1]:
import torch
import pandas as pd
import numpy as np
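# Field, TabularDataset and Iterator belong to the legacy torchtext API;
# in torchtext 0.9 to 0.11 they are found under torchtext.legacy instead.
from torchtext import data, datasets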
import re

In [2]:
# Explore the data
# Each row holds a review text and a class label; the label marks the review as positive or negative
input_data = pd.read_csv('./polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

| | text | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | jaws is a rare film that grabs your attentio... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
| ... | ... | ... |
| 1995 | if anything , " stigmata " should be taken as ... | 0 |
| 1996 | john boorman's " zardoz " is a goofy cinematic... | 0 |
| 1997 | the kids in the hall are an acquired taste .it... | 0 |
| 1998 | there was a time when john carpenter was a gre... | 0 |


### Build a pipeline to generate the data

In [3]:
import spacy
spacy_en = spacy.load('en_core_web_sm')

def tokenizer(text):  # tokenizer function passed to the Field
    # Returns a list of str: the text of each spaCy token
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [4]:
# Build the Fields and the Dataset

def remove_non_char(x):
    x = ' '.join(x)
    x = re.sub("[^a-zA-Z]", " ", x)
    x = x.split()
    return x

text_field = data.Field(sequential=True, tokenize=tokenizer, lower=True, preprocessing=remove_non_char)
label_field = data.Field(sequential=False)
input_data = data.TabularDataset(path='polarity.tsv', 
                                 format='tsv', 
                                 fields=[('text', text_field), ('label', label_field)])
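
To make the effect of `remove_non_char` concrete, here is a small trace on a hand-written token list. The input tokens are hypothetical (the kind of pieces a tokenizer may emit for "can't stop!"); the function body repeats the same three steps as the cell above.

```python
import re

def remove_non_char(tokens):
    # Same steps as the function above: join, blank out non-letters, re-split.
    joined = " ".join(tokens)
    letters_only = re.sub("[^a-zA-Z]", " ", joined)
    return letters_only.split()

# Hypothetical token list, for illustration only.
tokens = ["ca", "n't", "stop", "!"]
cleaned = remove_non_char(tokens)
print(cleaned)  # ['ca', 'n', 't', 'stop']
```

Because the apostrophe becomes a space, a contraction piece such as `n't` is split into `n` and `t`, and pure punctuation tokens disappear. This would account for fragments like `'ca', 'n', 't'` in the sample output of this notebook.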

In [5]:
# Get the examples and shuffle them in place
examples = input_data.examples
np.random.shuffle(examples)

# Split the examples 80:20
train_ex = examples[:int(len(examples)*0.8)]
test_ex = examples[int(len(examples)*0.8):]

# Build the training and testing datasets
train_data = data.Dataset(examples=train_ex, fields={'text':text_field, 'label':label_field})
test_data = data.Dataset(examples=test_ex, fields={'text':text_field, 'label':label_field})

train_data[0].label, train_data[0].text

('0',
 ['reindeer', 'games', 'is', 'easily', 'the', 'worst', 'of', 'the', 'three',
  'recent', 'films', 'penned', 'by', 'ehren', 'kruger', 'scream', 'and',
  'arlington', 'rd', 'are', 'the', 'others', 'each', 'derivative', 'in',
  'their', 'own', 'special', 'way', 'the', 'guy', 'ca', 'n', 't', 'seem',
  'to', 'write', 'believable', 'dialogue', 'sample', 'from', 'reindeer',
  'games', 'rule', 'never', 'put', 'a', 'car', 'thief', 'behind', 'the',
  'wheel', 'create', 'multi', 'faceted', 'characters', 'or', 'even',
  'engineer', 'coherent', 'plots', 'but', 'he', 'sure', 'knows', 'how', 'to',
  'pile', 'on', 'numerous', 'nonsensical', 'twists', 'and', 'turns', 'no',
  'matter', 'if', 'each', 'one', 'deems', 'the', 'actual', 'story',
  'increasingly', 'unlikely', 'his', 'screenplay', 'for', 'reindeer',
  'games', 'turns', 'the',
  't
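
The shuffle-and-slice split used in this cell can be sketched without torchtext. The `split_examples` helper and its `seed` parameter are hypothetical additions; the notebook itself shuffles in place with `np.random.shuffle` and no fixed seed, so its split changes from run to run.

```python
import random

def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of examples and cut it into train/test parts (hypothetical helper)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

items = list(range(10))  # stand-in for input_data.examples
train_part, test_part = split_examples(items)
print(len(train_part), len(test_part))  # 8 2
```

Shuffling a copy with a seeded generator keeps the original list untouched and makes the split repeatable, which can help when comparing runs.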

In [6]:
# Build the vocabularies
text_field.build_vocab(train_data)
label_field.build_vocab(train_data)

print(f"Vocabulary entries at indices 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabulary entries at indices 0-9: ['<unk>', '<pad>', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 's'] 
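
The printed list starts with two special tokens followed by very common words, which is consistent with a frequency-ordered vocabulary. The sketch below builds such a vocabulary by hand. It is an illustration, not torchtext's implementation: the `build_vocab` function here, the toy `docs`, and the alphabetical tie-break are all assumptions.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ties are broken alphabetically here; that tie-break is an assumption of this sketch.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Toy corpus, for illustration only.
docs = [["the", "film", "is", "good"], ["the", "film", "is", "bad"], ["the", "end"]]
itos, stoi = build_vocab(docs)
print(itos[:5])  # ['<unk>', '<pad>', 'the', 'film', 'is']

# Unseen words fall back to the <unk> index.
print(stoi.get("unseen", stoi["<unk>"]))  # 0
```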



In [7]:
# create iterator for training and testing data
train_iter, test_iter = data.Iterator.splits(datasets=(train_data, test_data),
                                             batch_sizes=(3, 3),
                                             repeat=False,
                                             sort_key=lambda ex: len(ex.text))
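
Each batch pads its reviews to the length of the longest one, and the text tensor is laid out as (sequence length, batch size). The plain-Python sketch below mimics that layout; `pad_batch` is a hypothetical helper, and the pad index of 1 is taken from the position of `<pad>` in the vocabulary printed in this notebook.

```python
def pad_batch(sequences, pad_index=1):
    """Pad index lists to equal length and return rows indexed by time step."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so the result has shape (max_len, batch_size).
    return [list(step) for step in zip(*padded)]

# Three made-up index sequences of different lengths.
batch = pad_batch([[18, 199, 29], [101, 3], [18]])
for row in batch:
    print(row)
# [18, 101, 18]
# [199, 3, 1]
# [29, 1, 1]
```

With reviews of very different lengths in one batch, most of the shorter columns end up as padding. Grouping examples of similar length into the same batch is the usual way to reduce that waste.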

In [8]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[   18,   101,    18],
        [  199,     3,   173],
        [   29,    99,   723],
        ...,
        [    1,     1,    52],
        [    1,     1,  5567],
        [    1,     1, 23342]]) torch.Size([1664, 3])
tensor([1, 2, 1]) torch.Size([3])
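
The label tensor holds 1 and 2 rather than 0 and 1. A likely explanation, assuming the legacy `Field` defaults, is that `label_field` builds its own vocabulary with an unknown token at index 0, so the two label strings are pushed to indices 1 and 2. The sketch below reproduces that effect with made-up labels; which label receives which index depends on the label frequencies in the training split.

```python
from collections import Counter

# Hypothetical raw label strings as read from the TSV.
raw_labels = ["1", "0", "1", "1", "0"]

# Assumption: the label vocabulary reserves index 0 for an unknown token,
# then lists the observed labels by descending frequency.
counts = Counter(raw_labels)
itos = ["<unk>"] + [label for label, _ in counts.most_common()]
stoi = {label: idx for idx, label in enumerate(itos)}

encoded = [stoi[label] for label in raw_labels]
print(itos, encoded)  # ['<unk>', '1', '0'] [1, 2, 1, 1, 2]
```

When feeding these labels to a loss function that expects classes 0 and 1, the offset has to be accounted for, for example by subtracting 1 or by configuring the label field without an unknown token.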
