**Table of contents**<a id='toc0_'></a>    
- [Representing text as Tensors](#toc1_)    
  - [Text classification task](#toc1_1_)    
  - [Tokenization and Vectorization](#toc1_2_)    
  - [Bag of Words text representation](#toc1_3_)    
  - [Training BoW classifier](#toc1_4_)    
  - [BiGrams, TriGrams and N-Grams](#toc1_5_)    
  - [Term Frequency Inverse Document Frequency TF-IDF](#toc1_6_)    
  - [Bag of Words text representation2](#toc1_7_)    
  - [BiGrams, TriGrams and N-Grams2](#toc1_8_)    
  - [Term Frequency Inverse Document Frequency TF-IDF2](#toc1_9_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Representing text as Tensors](#toc0_)

> **NOTE** If this looks a bit disorganized, it may be because some of the content was extracted from GitHub and some from the official website.

In [3]:
!pip install -r ./requirements.txt

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


## <a id='toc1_1_'></a>[Text classification task](#toc0_)

In this module, we will start with a simple text classification task based on the **AG_NEWS** dataset: classifying news headlines into one of 4 categories (World, Sports, Business and Sci/Tech). This dataset is built into the [`torchtext`](https://github.com/pytorch/text) module, so we can easily access it.

In [4]:
import torch
import torchtext
import os
import collections
os.makedirs('./data',exist_ok=True)
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

test.csv: 1.86MB [00:00, 6.71MB/s]                          


In [5]:
next(train_dataset)

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [6]:
# Labels in AG_NEWS are 1-based (1..4), so subtract 1 to index into `classes`
for i, x in zip(range(5), train_dataset):
    print(f"**{classes[x[0]-1]}** -> {x[1]}")

**Business** -> Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
**Business** -> Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
**Business** -> Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.
**Business** -> Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace ba

In [7]:
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)

## <a id='toc1_2_'></a>[Tokenization and Vectorization](#toc0_)
Now we need to convert text into **numbers** that can be represented as tensors. If we want a word-level representation, we need to do two things:
* use a **tokenizer** to split text into **tokens**
* build a **vocabulary** of those tokens.

In [9]:
# torchtext's built-in tokenizer
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
tokenizer('He said: hello')

['he', 'said', 'hello']

In [16]:
# Build a vocabulary of all tokens
counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(tokenizer(line))
vocab = torchtext.vocab.Vocab(counter, min_freq=1)

In [47]:
print(len(counter))
counter.most_common()[:5]


95800


[('.', 225963), ('the', 203833), (',', 165675), ('to', 119203), ('a', 110149)]

In [23]:
# Using the vocabulary, we can easily encode our tokenized string into a list of numbers.
# The torchtext `vocab.stoi` dictionary maps a string representation to a number (the name stoi stands for "string to integer").
vocab_size = len(vocab)
print(f"Vocab size is {vocab_size}")

def encode(x):
    return [vocab.stoi[s] for s in tokenizer(x)]

encode('I love to play with my words')

Vocab size is 95802


[283, 2321, 5, 337, 19, 1301, 2357]
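
To make the two steps concrete without depending on a particular `torchtext` version, here is a minimal pure-Python sketch. The names `simple_tokenize`, `build_vocab` and `encode_with` are hypothetical helpers, not part of `torchtext`, and the regular expression is only a rough stand-in for the `basic_english` tokenizer. The two special entries placed at the front are an assumption; they would explain why the vocabulary above has 95802 entries while the counter has 95800.

```python
import collections
import re

def simple_tokenize(text):
    # Rough stand-in for a word-level tokenizer: lowercase the text, then
    # keep runs of letters/digits/apostrophes and a few punctuation marks.
    return re.findall(r"[a-z0-9']+|[.,!?]", text.lower())

def build_vocab(texts, min_freq=1, specials=("<unk>", "<pad>")):
    # Count token frequencies across the whole corpus.
    counter = collections.Counter()
    for text in texts:
        counter.update(simple_tokenize(text))
    # Special tokens come first, then tokens by descending frequency
    # (ties broken alphabetically), so frequent tokens get small indices.
    itos = list(specials)
    for token, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos

def encode_with(stoi, text):
    # Tokens missing from the vocabulary map to the index of "<unk>".
    return [stoi.get(token, stoi["<unk>"]) for token in simple_tokenize(text)]

toy_corpus = ["He said: hello", "She said hello again"]
toy_stoi, toy_itos = build_vocab(toy_corpus)

print(simple_tokenize("He said: hello"))          # ['he', 'said', 'hello']
print(encode_with(toy_stoi, "He said goodbye"))   # [5, 3, 0]
```

The unknown word `goodbye` is mapped to index 0, which is how a fixed vocabulary can still encode text it has never seen.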

## <a id='toc1_3_'></a>[Bag of Words text representation](#toc0_)


In [64]:
# To compute a bag-of-words vector for a text from our AG_NEWS dataset, we can use the following function:
vocab_size = len(vocab)

def to_bow(text,bow_vocab_size=vocab_size):
    res = torch.zeros(bow_vocab_size,dtype=torch.float32)
    for i in encode(text):
        if i<bow_vocab_size:
            res[i] += 1
    return res

print(to_bow(train_dataset[0][1]))

tensor([0., 0., 3.,  ..., 0., 0., 0.])


> **Note:** Here we use the global `vocab_size` variable as the default size of the vocabulary. Because the vocabulary is often very large, we can limit it to the most frequent words. Try lowering the `vocab_size` value, rerunning the code below, and observing how it affects the accuracy. You should expect some drop in accuracy, though not a dramatic one, in exchange for higher performance.
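
The effect of a smaller `bow_vocab_size` can be seen on a toy example. The sketch below uses plain Python lists instead of tensors, and the token indices are made up for illustration. It assumes, as with the vocabulary above, that smaller indices correspond to more frequent tokens; `to_bow_list` is a hypothetical helper name.

```python
def to_bow_list(token_ids, bow_vocab_size):
    # One counter per vocabulary entry; indices at or beyond the cutoff are dropped.
    res = [0] * bow_vocab_size
    for i in token_ids:
        if i < bow_vocab_size:
            res[i] += 1
    return res

# Hypothetical encoded sentence: indices 2 and 5 are frequent tokens, 40 is rare.
token_ids = [2, 5, 2, 40, 5, 2]

full = to_bow_list(token_ids, bow_vocab_size=50)
small = to_bow_list(token_ids, bow_vocab_size=10)

print(len(full), sum(full))    # 50 6
print(len(small), sum(small))  # 10 5
print(small)                   # [0, 0, 3, 0, 0, 2, 0, 0, 0, 0]
```

The shorter vector loses only the rare token, which is why the accuracy drop from truncating the vocabulary is usually modest.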

## <a id='toc1_4_'></a>[Training BoW classifier](#toc0_)

Next, we train a classifier on top of the BoW representation.
First, we need to convert our dataset for training so that every text is turned into its bag-of-words representation.
This can be achieved by passing the `bowify` function as the `collate_fn` parameter to the standard torch `DataLoader`:

In [65]:
from torch.utils.data import DataLoader
import numpy as np 

# this collate function gets list of batch_size tuples, and needs to 
# return a pair of label-feature tensors for the whole minibatch
def bowify(b):
    return (
            torch.LongTensor([t[0]-1 for t in b]),
            torch.stack([to_bow(t[1]) for t in b])
    )

train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)

In [66]:
# Now let's define a simple classifier neural network that contains one linear layer.
# The size of the input vector equals `vocab_size`, and the output size corresponds to the number of classes (4).
# Because we are solving a classification task, the final activation function is `LogSoftmax()`.
net = torch.nn.Sequential(torch.nn.Linear(vocab_size, 4), torch.nn.LogSoftmax(dim=1))
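
For reference, the reason `LogSoftmax` pairs naturally with a negative log-likelihood loss such as `torch.nn.NLLLoss` can be written out. The symbols are notation introduced here for illustration: $z \in \mathbb{R}^4$ is the output of the linear layer for one example and $y$ is its true class index.

$$
\mathrm{LogSoftmax}(z)_k = z_k - \log \sum_{j=1}^{4} e^{z_j},
\qquad
\mathrm{NLL}(z, y) = -\,\mathrm{LogSoftmax}(z)_y = -z_y + \log \sum_{j=1}^{4} e^{z_j}
$$

The right-hand side is the cross-entropy between the one-hot label and the softmax distribution over the four classes, so the combination behaves like a cross-entropy loss applied directly to the linear outputs.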

In [67]:
# Now we will define a standard PyTorch training loop. Because our dataset is quite large, for teaching purposes we will train for only one epoch,
#   and sometimes for even less than an epoch (the `epoch_size` parameter lets us limit training).
# We also report the accumulated training accuracy during training; the reporting frequency is set by the `report_freq` parameter.
def train_epoch(net, dataloader, lr=0.01, optimizer=None, loss_fn = torch.nn.NLLLoss(), epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr=lr)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels, features in dataloader:
        optimizer.zero_grad()
        out = net(features)
        loss = loss_fn(out, labels) # cross_entropy(out,labels)
        loss.backward()     # backpropagation
        optimizer.step()    # update parameters
        total_loss+=loss.item()     # .item() keeps a plain float instead of a tensor tied to the graph
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum().item()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss/count, acc/count

In [68]:
train_epoch(net, train_loader, epoch_size=15000)

3200: acc=0.8
6400: acc=0.83984375
9600: acc=0.8551041666666667
12800: acc=0.861875


(0.025586477983226653, 0.8661380597014925)

## <a id='toc1_5_'></a>[BiGrams, TriGrams and N-Grams](#toc0_)

In [None]:
from torchtext.data.utils import ngrams_iterator
line = "I love to play with my words"
ite = ngrams_iterator(tokenizer(line), ngrams=2)
list(ite)

['i',
 'love',
 'to',
 'play',
 'with',
 'my',
 'words',
 'i love',
 'love to',
 'to play',
 'play with',
 'with my',
 'my words']
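
To see what the iterator is doing, here is a minimal pure-Python sketch that reproduces the output order shown above for `ngrams=2`: all unigrams first, then every contiguous run of tokens of each larger size. The function name `ngrams_sketch` is hypothetical, and this is an illustration rather than the `torchtext` implementation.

```python
def ngrams_sketch(tokens, ngrams):
    # Yield every unigram first, then every contiguous run of 2..ngrams tokens,
    # joined by a single space.
    tokens = list(tokens)
    for token in tokens:
        yield token
    for n in range(2, ngrams + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])

print(list(ngrams_sketch(["i", "love", "to", "play"], ngrams=2)))
# ['i', 'love', 'to', 'play', 'i love', 'love to', 'to play']
```

A sequence of `k` tokens yields `k` unigrams plus `k - 1` bigrams, which is why n-gram vocabularies grow so much faster than word vocabularies.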

In [None]:
# Build a bigram vocabulary from our news dataset
from torchtext.data.utils import ngrams_iterator

bi_counter = collections.Counter()
for (label, line) in train_dataset:
    # ngrams_iterator converts a token sequence into a sequence of n-grams
    bi_counter.update(ngrams_iterator(tokenizer(line),ngrams=2))
bi_vocab = torchtext.vocab.Vocab(bi_counter, min_freq=2)

print(f"Bigram vocab size = {len(bi_vocab)}")

Bigram vocab size = 481947


In [None]:
bi_vocab

<torchtext.vocab.Vocab at 0x7fd811a15760>

In [None]:
print(len(bi_counter))
bi_counter.most_common()[55:60]

1308790


[('first', 9035),
 ('two', 8926),
 ('he', 8901),
 ('for the', 8819),
 ('world', 8464)]


## <a id='toc1_6_'></a>[Term Frequency Inverse Document Frequency TF-IDF](#toc0_)


In [None]:
# Document frequency: for each token, count how many of the first N documents contain it
N = 1000
df = torch.zeros(vocab_size)
for _,line in train_dataset[:N]:
    for i in set(encode(line)):
        df[i] += 1

In [None]:
def tf_idf(s):
    bow = to_bow(s)     # term-frequency vector
    return bow*torch.log((N+1)/(df+1))

print(tf_idf(train_dataset[0][1]))

tensor([0.0000, 0.0000, 0.0544,  ..., 0.0000, 0.0000, 0.0000])
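
The `tf_idf` function above multiplies each term count by $\log\frac{N+1}{df_i+1}$, where $df_i$ is the number of the $N$ documents that contain term $i$. The dependency-free sketch below applies the same smoothed formula to a toy corpus so the numbers can be checked by hand. The corpus and the helper name `tf_idf_sketch` are made up for illustration.

```python
import collections
import math

toy_docs = [
    ["oil", "prices", "rise"],
    ["oil", "exports", "halt"],
    ["team", "wins", "match"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each term appear?
doc_freq = collections.Counter()
for doc in toy_docs:
    doc_freq.update(set(doc))

def tf_idf_sketch(tokens):
    # Term frequency is the raw count; the inverse document frequency uses
    # the same +1 smoothing as the tensor version: log((N + 1) / (df + 1)).
    tf = collections.Counter(tokens)
    return {t: c * math.log((n_docs + 1) / (doc_freq[t] + 1)) for t, c in tf.items()}

weights = tf_idf_sketch(["oil", "prices", "rise", "oil"])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.4f}")
# oil: 0.5754
# prices: 0.6931
# rise: 0.6931
```

Although `oil` occurs twice, its weight is lower than that of `prices`, because it appears in two of the three documents and is therefore less distinctive.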


## <a id='toc1_7_'></a>[Bag of Words text representation2](#toc0_)


In [None]:
# How to generate a bag-of-words representation using the scikit-learn Python library:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()     # array([[1, 1, 0, 2, 0, 0, 0, 0, 0]])


## <a id='toc1_8_'></a>[BiGrams, TriGrams and N-Grams2](#toc0_)

Below is an example of how to generate a bigram bag-of-words representation using scikit-learn:


In [52]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
bigram_vectorizer.fit_transform(corpus)
print("Vocabulary:\n",bigram_vectorizer.vocabulary_)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()


Vocabulary:
 {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}


array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [56]:
# The main drawback of the N-gram approach is that the vocabulary size grows extremely fast.
# In practice, we need to combine the N-gram representation with a dimensionality reduction technique, such as *embeddings*, which we will discuss in the next unit.
# To use an N-gram representation with our **AG News** dataset, we need to build a special n-gram vocabulary:
counter = collections.Counter()
for (label, line) in train_dataset:
    l = tokenizer(line)
    counter.update(torchtext.data.utils.ngrams_iterator(l,ngrams=2))
    
bi_vocab = torchtext.vocab.Vocab(counter, min_freq=1)

print("Bigram vocabulary length = ",len(bi_vocab))

# We could then use the same code as above to train the classifier; however, it would be very memory-inefficient.
# In the next unit, we will train a bigram classifier using embeddings.
# > **Note:** You can keep only those n-grams that occur in the text more than a specified number of times.
#   This ensures that infrequent bigrams are omitted and decreases the dimensionality significantly.
#   To do this, set the `min_freq` parameter to a higher value and observe how the length of the vocabulary changes.

Bigram vocabulary length =  1308844
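
The effect of a frequency threshold can be illustrated on a toy corpus with the standard library only. This is a sketch under stated assumptions: the headlines are made up, `unigrams_and_bigrams` is a hypothetical helper, and special tokens that a real vocabulary object may add are ignored.

```python
import collections

def unigrams_and_bigrams(tokens):
    # Unigrams plus adjacent pairs, mirroring the ngrams=2 setting used above.
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

toy_headlines = [
    "oil prices rise again",
    "oil prices fall",
    "oil exports halt",
]

toy_counter = collections.Counter()
for line in toy_headlines:
    toy_counter.update(unigrams_and_bigrams(line.split()))

# Keep only entries that occur at least min_freq times and compare sizes.
sizes = {}
for min_freq in (1, 2, 3):
    kept = sorted(t for t, c in toy_counter.items() if c >= min_freq)
    sizes[min_freq] = len(kept)
    print(f"min_freq={min_freq}: {len(kept)} entries")
# min_freq=1: 13 entries
# min_freq=2: 3 entries
# min_freq=3: 1 entries
```

Most n-grams occur only once, so even a threshold of 2 removes the bulk of the vocabulary.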


## <a id='toc1_9_'></a>[Term Frequency Inverse Document Frequency TF-IDF2](#toc0_)

In [None]:
# create TF-IDF vectorization of text
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()
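
Note that the values produced here are not directly comparable with the hand-written `tf_idf` function from the earlier section. With its default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), scikit-learn documents the weighting as follows, where $n$ is the number of documents in the fitted corpus and $\mathrm{df}(t)$ is the number of those documents containing term $t$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each resulting row vector is then scaled to unit Euclidean length. The extra $+1$ means that a term occurring in every document still receives a non-zero weight, unlike in the $\log\frac{N+1}{df+1}$ variant used earlier.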