### Classifying sklearn 20 Newsgroups articles with PyTorch

This notebook is based on the following tutorial article:

[https://www.deborahmesquita.com/2017-11-05/how-pytorch-gives-the-big-picture-with-deep-learning](https://www.deborahmesquita.com/2017-11-05/how-pytorch-gives-the-big-picture-with-deep-learning)

GitHub repository of the original author, Déborah Mesquita:

[https://github.com/dmesquita/understanding_pytorch_nn](https://github.com/dmesquita/understanding_pytorch_nn)

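In the original article, the author classifies data from 3 newsgroup topics.

Here I increase the number of newsgroups to 14 to see how well the classification holds up. I also made a few small related changes to the code.

In principle, further preprocessing of the text (removing stop words, stemming, lemmatization), together with replacing the BoW term-frequency vectors with word2vec-style word embeddings, should improve predictive performance somewhat.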
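
As a rough illustration of the preprocessing mentioned above, the sketch below lowercases a text, tokenizes it with a regular expression, and drops stop words. The `STOP_WORDS` set and the `preprocess` helper are hypothetical names, and the word list is deliberately tiny; a real run would more likely rely on a library such as NLTK or spaCy for stop words, stemming, and lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in", "and", "be", "on"}

def preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "Let me be the first on my block.\nYou have just violated one of the major shibboleths"
tokens = preprocess(sample)
print(tokens)
```

Unlike `split(' ')`, the regular expression also separates words at newlines and strips punctuation, so tokens such as `'block.\n>\n>you'` would no longer appear in the vocabulary.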


In [1]:
import torch
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.datasets import fetch_20newsgroups

In [2]:
x = torch.IntTensor([1,3,6])
y = torch.IntTensor([1,1,1])
result = x + y
print(result)
print(result.size())

tensor([2, 4, 7], dtype=torch.int32)
torch.Size([3])


In [3]:
categories = ['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns']

#categories = ["comp.graphics","sci.space","rec.sport.baseball"]

newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

print('total texts in train:', len(newsgroups_train.data))
print('total texts in test:', len(newsgroups_test.data))

total texts in train: 8242
total texts in test: 5487


### Split the training and test texts into words and store them in a Counter

In [4]:
vocab = Counter()

for text in newsgroups_train.data:
    for word in text.split(' '):
        vocab[word.lower()]+=1

for text in newsgroups_test.data:
    for word in text.split(' '):
        vocab[word.lower()]+=1

total_words = len(vocab)
vocab

Counter({'from:': 13254,
         '"jack': 5,
         'previdi"': 3,
         '<p00020@psilink.com>\nsubject:': 1,
         're:': 8801,
         'printing\nin-reply-to:': 1,
         '<1qk2m5$1up@agate.berkeley.edu>\nnntp-posting-host:': 1,
         '127.0.0.1\norganization:': 7,
         'avoirdupois': 1,
         'institute\nx-mailer:': 1,
         'psilink-dos': 5,
         '(3.4)\nlines:': 4,
         '38\n\n>date:': 1,
         '': 807949,
         '15': 432,
         'apr': 746,
         '1993': 701,
         '16:32:05': 1,
         'gmt\n>from:': 2,
         'cozzlab@garnet.berkeley.edu\n>\n>in': 1,
         'article': 6888,
         '<1993apr15.053905.16811@sarah.albany.edu>': 2,
         'me9574@albnyvms.bitnet': 2,
         'writes:\n>\n>[advertises': 1,
         'his': 3832,
         'printing': 100,
         'business]\n>\n>oh,': 1,
         'dear.': 9,
         'let': 895,
         'me': 5025,
         'be': 16824,
         'the': 126577,
         'first': 2273,
         ...})

In [5]:
total_words

434897
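
A vocabulary of 434,897 entries is large partly because `split(' ')` leaves newlines and punctuation attached to tokens, and many of the tokens shown above occur only once. One common way to shrink it, sketched below on a toy `Counter`, is to keep only tokens seen at least `min_count` times. The `prune_vocab` helper and the `min_count` threshold are assumptions for illustration and are not part of the original notebook.

```python
from collections import Counter

def prune_vocab(vocab, min_count=2):
    """Return a new Counter holding only tokens with count >= min_count."""
    return Counter({word: n for word, n in vocab.items() if n >= min_count})

# Toy counts copied from the output above.
toy_vocab = Counter({'from:': 13254, 'avoirdupois': 1, '16:32:05': 1,
                     'printing': 100, 'dear.': 9})
pruned = prune_vocab(toy_vocab, min_count=2)
print(len(toy_vocab), '->', len(pruned))
print(sorted(pruned))
```

A smaller vocabulary directly shrinks the input layer of the network, since each vocabulary entry is one input feature.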

### Build a dictionary mapping each word to an index

In [6]:
def get_word_2_index(vocab):
    word2index = {}
    for i,word in enumerate(vocab):
        word2index[word.lower()] = i

    return word2index

word2index = get_word_2_index(vocab)
word2index

{'from:': 0,
 '"jack': 1,
 'previdi"': 2,
 '<p00020@psilink.com>\nsubject:': 3,
 're:': 4,
 'printing\nin-reply-to:': 5,
 '<1qk2m5$1up@agate.berkeley.edu>\nnntp-posting-host:': 6,
 '127.0.0.1\norganization:': 7,
 'avoirdupois': 8,
 'institute\nx-mailer:': 9,
 'psilink-dos': 10,
 '(3.4)\nlines:': 11,
 '38\n\n>date:': 12,
 '': 13,
 '15': 14,
 'apr': 15,
 '1993': 16,
 '16:32:05': 17,
 'gmt\n>from:': 18,
 'cozzlab@garnet.berkeley.edu\n>\n>in': 19,
 'article': 20,
 '<1993apr15.053905.16811@sarah.albany.edu>': 21,
 'me9574@albnyvms.bitnet': 22,
 'writes:\n>\n>[advertises': 23,
 'his': 24,
 'printing': 25,
 'business]\n>\n>oh,': 26,
 'dear.': 27,
 'let': 28,
 'me': 29,
 'be': 30,
 'the': 31,
 'first': 32,
 'on': 33,
 'my': 34,
 'block.\n>\n>you': 35,
 'have': 36,
 'just': 37,
 'violated': 38,
 'one': 39,
 'of': 40,
 'major': 41,
 'shibboleths': 42,
 'usenet': 43,
 'groups:\n': 44,
 '^^^^^^^^^^^\n': 45,
 '\tnit:': 46,
 'is': 47,
 'he': 48,
 'unable': 49,
 'to': 50,
 'type': 51,
 "'h'": 52,
 ...}
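
The vocabulary above is built from both the training and the test texts, which guarantees that every test word has an index. An alternative, sketched here under the assumption that all unseen words may share one index, is to build the mapping from training texts only and reserve a placeholder entry. The names `build_word2index`, `lookup`, and `UNK` are illustrative and do not appear in the original notebook.

```python
UNK = '<unk>'  # hypothetical placeholder token for out-of-vocabulary words

def build_word2index(train_texts):
    """Map each lowercased training token to an index; index 0 is reserved for UNK."""
    word2index = {UNK: 0}
    for text in train_texts:
        for word in text.lower().split():
            if word not in word2index:
                word2index[word] = len(word2index)
    return word2index

def lookup(word2index, word):
    """Return the index of a word, falling back to the UNK index."""
    return word2index.get(word.lower(), word2index[UNK])

w2i = build_word2index(["the first article", "the printing business"])
print(w2i)
print(lookup(w2i, "Printing"), lookup(w2i, "hockey"))  # 4 0
```

With this variant the test set no longer influences the size of the input layer, at the cost of mapping every unseen test word to the same feature.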

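### Build the training batches

Each training text is split into words and encoded as a bag-of-words (BoW) vector of word counts. Each vector is paired with the label of the newsgroup the text belongs to.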

In [7]:
def get_batch(df, i, batch_size):
    """Return (BoW count vectors, labels) for the i-th batch of a newsgroups Bunch."""
    start, end = i * batch_size, (i + 1) * batch_size
    texts = df.data[start:end]
    targets = df.target[start:end]

    batches = []
    for text in texts:
        # One count vector per text, as long as the vocabulary
        layer = np.zeros(total_words, dtype=float)
        for word in text.split(' '):
            layer[word2index[word.lower()]] += 1
        batches.append(layer)

    # The integer targets from sklearn already serve as class indices
    results = list(targets)

    return np.array(batches), np.array(results)
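
To make the bag-of-words encoding in `get_batch` concrete, here is a small sketch on a toy vocabulary. It uses plain Python lists instead of NumPy arrays so that it runs without dependencies; `toy_word2index` and `bow_vector` are hypothetical names.

```python
def bow_vector(text, word2index):
    """Count how often each vocabulary word occurs in the text."""
    vector = [0] * len(word2index)
    for word in text.split(' '):
        vector[word2index[word.lower()]] += 1
    return vector

toy_word2index = {'the': 0, 'puck': 1, 'hit': 2, 'post': 3}
vec = bow_vector('The puck hit the post', toy_word2index)
print(vec)  # [2, 1, 1, 1]
```

Word order is discarded: any permutation of the same words produces the same vector.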

### Basic training parameters

In [8]:
# Parameters
learning_rate = 0.01
num_epochs = 2
batch_size = 150
display_step = 1

# Network Parameters
hidden_size = 100           # 1st layer and 2nd layer number of features
input_size = total_words    # Words in vocab
num_classes = len(categories)         
print('number of classes: %d' % num_classes)

number of classes: 14
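
With `input_size` equal to the full vocabulary, each text becomes a dense vector of 434,897 float64 values, so batch size has a direct memory cost. The back-of-the-envelope sketch below estimates it for one training batch and for the whole test set, which the evaluation cell later encodes in a single call. `dense_batch_bytes` is a hypothetical helper, and the estimate ignores the extra float32 copy made by `torch.FloatTensor`.

```python
def dense_batch_bytes(num_texts, vocab_size, bytes_per_value=8):
    """Memory for a dense (num_texts x vocab_size) array; float64 uses 8 bytes per value."""
    return num_texts * vocab_size * bytes_per_value

vocab_size = 434897                                  # total_words reported above
train_batch = dense_batch_bytes(150, vocab_size)     # one training batch
full_test = dense_batch_bytes(5487, vocab_size)      # all test articles at once
print('training batch: %.2f GB' % (train_batch / 1e9))
print('full test set:  %.2f GB' % (full_test / 1e9))
```

If memory is tight, evaluating the test set in several smaller batches, or using a sparse representation, would avoid materializing the full matrix.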


In [9]:
#from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

### Model definition

A simple stack of fully connected layers with ReLU activations.

In [10]:
class OurNet(nn.Module):
     def __init__(self, input_size, hidden_size, num_classes):
        super(OurNet, self).__init__()
        self.layer_1 = nn.Linear(input_size,hidden_size, bias=True)
        self.relu = nn.ReLU()
        self.layer_2 = nn.Linear(hidden_size, hidden_size, bias=True)
        self.output_layer = nn.Linear(hidden_size, num_classes, bias=True)

     def forward(self, x):
        out = self.layer_1(x)
        out = self.relu(out)
        out = self.layer_2(out)
        out = self.relu(out)
        out = self.output_layer(out)
        return out
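
Because the input layer is as wide as the vocabulary, almost all of the model's weights sit in `layer_1`. The arithmetic below counts the parameters of the three `nn.Linear` layers from the sizes used in this notebook. `linear_params` is a hypothetical helper; once a model instance exists, the same total could be checked with `sum(p.numel() for p in net.parameters())`.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

input_size, hidden_size, num_classes = 434897, 100, 14
counts = {
    'layer_1': linear_params(input_size, hidden_size),
    'layer_2': linear_params(hidden_size, hidden_size),
    'output_layer': linear_params(hidden_size, num_classes),
}
total_params = sum(counts.values())
print(counts)
print('total parameters:', total_params)  # 43501314
```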

### Training loop

In [11]:
net = OurNet(input_size, hidden_size, num_classes)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()  
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)  

# Train the Model
total_batch = len(newsgroups_train.data) // batch_size  # a final partial batch is skipped
for epoch in range(num_epochs):
    # Loop over all batches
    for i in range(total_batch):
        batch_x, batch_y = get_batch(newsgroups_train, i, batch_size)
        articles = torch.FloatTensor(batch_x)
        labels = torch.LongTensor(batch_y)

        # Forward + Backward + Optimize
        optimizer.zero_grad()  # zero the gradient buffer
        outputs = net(articles)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        if (i+1) % 4 == 0:
            print('Epoch [%d/%d], Step [%d/%d], Loss: %.4f'
                  % (epoch+1, num_epochs, i+1, total_batch, loss.item()))

Epoch [1/2], Step [4/54], Loss: 2.5895
Epoch [1/2], Step [8/54], Loss: 2.3218
Epoch [1/2], Step [12/54], Loss: 1.9189
Epoch [1/2], Step [16/54], Loss: 1.5746
Epoch [1/2], Step [20/54], Loss: 1.1540
Epoch [1/2], Step [24/54], Loss: 1.2092
Epoch [1/2], Step [28/54], Loss: 0.9164
Epoch [1/2], Step [32/54], Loss: 0.7724
Epoch [1/2], Step [36/54], Loss: 0.7211
Epoch [1/2], Step [40/54], Loss: 0.6619
Epoch [1/2], Step [44/54], Loss: 0.5575
Epoch [1/2], Step [48/54], Loss: 0.7289
Epoch [1/2], Step [52/54], Loss: 0.5483
Epoch [2/2], Step [4/54], Loss: 0.3211
Epoch [2/2], Step [8/54], Loss: 0.2343
Epoch [2/2], Step [12/54], Loss: 0.1787
Epoch [2/2], Step [16/54], Loss: 0.5945
Epoch [2/2], Step [20/54], Loss: 0.2380
Epoch [2/2], Step [24/54], Loss: 0.1309
Epoch [2/2], Step [28/54], Loss: 0.3651
Epoch [2/2], Step [32/54], Loss: 0.1366
Epoch [2/2], Step [36/54], Loss: 0.0562
Epoch [2/2], Step [40/54], Loss: 0.1330
Epoch [2/2], Step [44/54], Loss: 0.1916
Epoch [2/2], Step [48/54], Loss: 0.0729
...
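
The loop runs only as many full batches as fit into the training set (54 per epoch, as the step counter shows), so the texts left over at the end are never used. The sketch below works this out for the sizes reported earlier and shows the ceiling-division alternative that would include the remainder. The variable names are illustrative.

```python
import math

num_train, batch_size = 8242, 150   # sizes reported earlier in the notebook

full_batches = num_train // batch_size            # what the loop above uses
skipped = num_train - full_batches * batch_size   # texts never seen in an epoch
all_batches = math.ceil(num_train / batch_size)   # includes the final partial batch

print(full_batches, skipped, all_batches)  # 54 142 55
```

Iterating over `all_batches` would work with `get_batch` unchanged, because slicing past the end of the data simply returns the shorter remainder.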

In [12]:
# Test the Model
total_test_data = len(newsgroups_test.target)

# Encode the whole test set as a single batch
batch_x_test, batch_y_test = get_batch(newsgroups_test, 0, total_test_data)
articles = torch.FloatTensor(batch_x_test)
labels = torch.LongTensor(batch_y_test)
net.eval()
with torch.no_grad():
    outputs = net(articles)

print('output size: ')
print(outputs.size())
_, predicted = torch.max(outputs, 1)
print('correctly predicted: ')
print((predicted == labels).sum())

total = labels.size(0)
correct = (predicted == labels).sum().item()  # convert the 0-dim tensor to a Python int

accuracy = 100 * correct / total
print('Accuracy of the network on the %d test articles: %d %%' % (total, accuracy))

output size: 
torch.Size([5487, 14])
correctly predicted: 
tensor(4209)
Accuracy of the network on the 5487 test articles: 76 %
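
A single accuracy figure hides how performance varies across the 14 newsgroups. A per-class breakdown can be sketched as follows, using plain lists of made-up labels; with the notebook's tensors one would first call `predicted.tolist()` and `labels.tolist()`. The `per_class_accuracy` helper is hypothetical and the numbers below are not results from this model.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Return {class: fraction of that class's examples predicted correctly}."""
    totals, hits = Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    return {cls: hits[cls] / totals[cls] for cls in sorted(totals)}

# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2, 2]
acc = per_class_accuracy(y_true, y_pred)
print(acc)
```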
