# Text mining
## HW 2 MaxEnt on 20 Newsgroups

We are going to train a multiclass Maximum Entropy (Softmax Regression) classifier to predict which newsgroup a document from the 20 Newsgroups dataset belongs to.

This exercise is similar to 01-LinearRegression. The difference is that you'll have to implement the algorithm yourself.

For this purpose we'll use PyTorch and sklearn. Your job is to fill in the missing code in the cells below.

You will find the steps you need to perform in the **Task** section in each cell.

In [1]:
import torch
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

In [2]:
print('Loading data...')

# Passing categories=None loads all 20 categories, as we want to train over all the data.
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=None)

newsgroups_test = fetch_20newsgroups(subset='test',
                                      categories=None)

Loading data...


In [3]:
BATCH_SIZE = 32
MAX_EPOCHS = 20
# Lambda
REG_PARAM = 0.01
# Alpha
LEARNING_RATE = 1e-02
# Number of features
MAX_WORDS = 10000
# Print error information every DISPLAY_STEP epochs
DISPLAY_STEP = 1
NUM_CLASSES = np.max(newsgroups_train.target) + 1

In [4]:
def to_categorical(y, num_classes):
    """ 1-hot encodes a tensor """
    return np.eye(num_classes, dtype='uint8')[y]

In [5]:
print(NUM_CLASSES, 'classes')

print('Vectorizing sequence data...')

tokenizer = TfidfVectorizer(max_features=MAX_WORDS)

x_train = tokenizer.fit_transform(newsgroups_train.data).toarray()
x_test = tokenizer.transform(newsgroups_test.data).toarray()
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')

y_train = to_categorical(newsgroups_train.target, num_classes=NUM_CLASSES)
y_test = to_categorical(newsgroups_test.target, num_classes=NUM_CLASSES)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

20 classes
Vectorizing sequence data...
x_train shape: (11314, 10000)
x_test shape: (7532, 10000)
Convert class vector to binary class matrix (for use with categorical_crossentropy)
y_train shape: (11314, 20)
y_test shape: (7532, 20)


In [6]:
print(type(x_train))
print(y_train.dtype)

<class 'numpy.ndarray'>
uint8


In [7]:
from torch.utils.data import Dataset, DataLoader

class TwentyNewsGroupsDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

        assert(len(self.x) == len(self.y))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

train_dataset = TwentyNewsGroupsDataset(x_train, y_train)
test_dataset = TwentyNewsGroupsDataset(x_test, y_test)

# Model initialization

Here comes the most interesting part: the model. You'll have to implement Softmax Regression with SGD. The formulas are given below. You don't have to derive them; you can use them as they are, or you can let PyTorch's automatic differentiation compute the gradients.

## Softmax regression formulas

*Keep in mind that these are the final formulas. The derivation of the gradients has been omitted; to derive them yourself you need the chain and quotient rules.*

Here is the basic linear (activation) function:

$ z_i = x^T w_i + b_i$

This is the softmax (prediction) for class i:

$\hat{y}_i = \sigma(\textbf{z})_i = \frac{\exp(z_i)}{\sum_{k=1}^{K}{\exp(z_k)}}$
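The softmax above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the required solution: the helper name `stable_softmax` and the toy activations are assumptions made for the example. Subtracting the row maximum before exponentiating does not change the result, but it avoids overflow for large activations.

```python
import torch
import torch.nn.functional as F

def stable_softmax(z):
    """Row-wise softmax for an (N, K) tensor of activations."""
    # Shifting each row by its maximum leaves the softmax unchanged
    # but keeps exp() from overflowing.
    shifted = z - z.max(dim=1, keepdim=True).values
    exp_z = torch.exp(shifted)
    return exp_z / exp_z.sum(dim=1, keepdim=True)

# Toy activations: the second row would overflow a naive exp()
z = torch.tensor([[1.0, 2.0, 3.0],
                  [1000.0, 1000.0, 1000.0]], dtype=torch.float64)
probs = stable_softmax(z)
```

Each row of `probs` sums to 1, and the result should agree with the built-in `F.softmax(z, dim=1)`.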

Derivative of the softmax with respect to the weights $w_j$. Here $1(i = j)$ is the indicator function, which is $1$ if $i = j$ and $0$ otherwise:

$\frac{\partial}{\partial w_j} \sigma(\textbf{z})_i = \sigma(\textbf{z})_i\ (1(i = j) - \sigma(\textbf{z})_j)\ x$

Cross-entropy loss. For each example $i$ this is the negative dot product of $y_i$ and $\log(\hat{y}_i)$, which are $K$-dimensional vectors. Since $y_i$ is one-hot (1 for the correct class and 0 everywhere else), only the term for the correct class contributes and the others can be omitted.

$\mathcal{L_s} = - \frac{1}{N}\sum_{i = 1}^N y_{i} \log(\hat{y}_{i}) $
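As a rough sketch of this loss in PyTorch, assuming one-hot targets as produced by `to_categorical`: the helper name `cross_entropy_one_hot` and the toy data are hypothetical and only serve the illustration. The result is compared with the built-in `F.cross_entropy`, which takes class indices.

```python
import torch
import torch.nn.functional as F

def cross_entropy_one_hot(logits, y_one_hot):
    """Mean cross-entropy for (N, K) logits and (N, K) one-hot targets."""
    log_probs = F.log_softmax(logits, dim=1)
    # The product with a one-hot row keeps only the log-probability
    # of the correct class; the mean averages over the N examples.
    return -(y_one_hot * log_probs).sum(dim=1).mean()

torch.manual_seed(0)
logits = torch.randn(4, 3, dtype=torch.float64)   # toy activations, N=4, K=3
targets = torch.tensor([0, 2, 1, 2])               # toy class indices
y_one_hot = F.one_hot(targets, num_classes=3).to(torch.float64)

loss = cross_entropy_one_hot(logits, y_one_hot)
reference = F.cross_entropy(logits, targets)       # built-in, takes class indices
```

Both values should match up to floating-point error.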

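Gradient of the loss for a single example $x$ with respect to the weights $w_i$ of class $i$, where $c$ is the correct class and $1(i = c)$ is the indicator function:

$ \frac{\partial }{\partial w_i} \mathcal{L_s} = (\hat{y}_i - 1(i = c))\ x $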
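One way to gain confidence in a gradient formula is to compare it with autograd. The sketch below does this on random toy data, assuming the standard batch form of the gradient of the mean cross-entropy, $\frac{1}{N} X^T (\hat{Y} - Y)$ for the weights and the column mean of $\hat{Y} - Y$ for the bias. The sizes and variable names are chosen only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, M, K = 5, 4, 3  # toy sizes, chosen only for illustration
x = torch.randn(N, M, dtype=torch.float64)
y = F.one_hot(torch.randint(0, K, (N,)), num_classes=K).to(torch.float64)
w = torch.randn(M, K, dtype=torch.float64, requires_grad=True)
b = torch.randn(K, dtype=torch.float64, requires_grad=True)

# Forward pass and mean cross-entropy, differentiated by autograd
logits = x @ w + b
loss = -(y * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
loss.backward()

# Analytic gradients of the same loss (assumed batch form)
y_hat = F.softmax(logits.detach(), dim=1)
grad_w = x.T @ (y_hat - y) / N
grad_b = (y_hat - y).mean(dim=0)
```

If the formula is right, `w.grad` and `b.grad` agree with `grad_w` and `grad_b`.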

Weight update: we take a step in the direction opposite to the gradient, since we are minimizing the loss and the gradient points in the direction of steepest ascent.
Alpha is the learning rate.

$ w_i = w_i - \alpha \frac{\partial }{\partial w_i} \mathcal{L_s} $
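The update rule can be written by hand with in-place tensor operations. The following is a minimal sketch under assumptions: the helper `sgd_step` and the toy quadratic loss are hypothetical and are not part of the notebook's cells.

```python
import torch

def sgd_step(params, lr):
    """In-place gradient-descent update: p <- p - lr * grad."""
    with torch.no_grad():  # the update itself must not be tracked by autograd
        for p in params:
            p.sub_(lr * p.grad)
            p.grad.zero_()  # reset, otherwise gradients accumulate across steps

w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()   # toy loss whose gradient is 2 * w
loss.backward()
sgd_step([w], lr=0.1)   # w becomes [1 - 0.1 * 2, -2 - 0.1 * (-4)] = [0.8, -1.6]
```

Zeroing the gradient after each step matters because `backward()` adds to `.grad` rather than overwriting it.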

Accuracy:

$ Acc(y, \hat{y}) = \frac{1}{N}\sum_{i = 1}^N 1(arg\,max_{j \in K}\ \hat{y}_{i,j} = y_i) $
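A small sketch of the accuracy computation, assuming one-hot labels so that the expected class is also recovered with an arg max. The function name `accuracy` and the toy tensors are made up for this example.

```python
import torch

def accuracy(y_hat, y_one_hot):
    """Fraction of rows where the arg max of the prediction matches the label."""
    predicted = y_hat.argmax(dim=1)
    expected = y_one_hot.argmax(dim=1)
    return (predicted == expected).float().mean().item()

# Toy predictions for N=4 examples and K=3 classes
y_hat = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.5, 0.4, 0.1],
                      [0.2, 0.2, 0.6]])
y_one_hot = torch.tensor([[1, 0, 0],
                          [0, 0, 1],
                          [0, 1, 0],
                          [0, 0, 1]])
acc = accuracy(y_hat, y_one_hot)  # 3 of the 4 rows match, so 0.75
```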

## Dimensions of components
$ N $ - number of examples

$ M $ - number of features

$ K $ - number of classes

Features input $ x \in {\rm I\!R}^{N \times M} $

Expected class (one-hot encoded) $ y \in {\rm I\!R}^{N \times K} $

Weight matrix $ W \in {\rm I\!R}^{M  \times K} $

Per class bias $ b \in {\rm I\!R}^{K} $
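A quick shape check can help when wiring these components together. The sketch below uses small toy sizes (an assumption for illustration, the notebook uses $M = 10000$ and $K = 20$) and shows that the bias is broadcast across the rows of the activation matrix.

```python
import torch

N, M, K = 8, 10, 3  # toy sizes, chosen only for illustration
x = torch.randn(N, M)   # features
W = torch.randn(M, K)   # weight matrix
b = torch.randn(K)      # per class bias

# (N, M) @ (M, K) gives (N, K); b of shape (K,) is broadcast over the N rows
z = x @ W + b
```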

## Tasks
1. Implement softmax regression using the formulas above;
2. Implement the accuracy metric (in the `evaluate` function), but use cross-entropy for optimization.

## Tips
Check the PyTorch documentation and the lecture "Introduction to PyTorch". You can also use PyTorch's built-in automatic differentiation to compute the gradients!

Also in the loss function you can use the [LogSoftmax](https://pytorch.org/docs/master/nn.html?highlight=log_softmax#torch.nn.LogSoftmax) for numerical stability.
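To see why the fused log-softmax helps, consider a small illustrative case with one very large activation (the values are chosen only to trigger the problem). Taking the log of a softmax that has underflowed to zero yields `-inf`, while `F.log_softmax` stays finite.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0]], dtype=torch.float64)

# Naive: the softmax of the second class underflows to exactly 0,
# so its logarithm is -inf
naive = torch.log(F.softmax(logits, dim=1))

# Fused: works in log space, so the result stays finite
stable = F.log_softmax(logits, dim=1)
```

An infinite log-probability would turn the loss and its gradients into `inf` or `nan`, which is why the fused version is the safer choice inside a loss function.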

Check the [sub](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.sub_) function of a Tensor, you will most probably need it.


In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogisticRegression(nn.Module):

  def __init__(self, features_size, num_classes):
    super(LogisticRegression, self).__init__()

    self.w = nn.Parameter(torch.randn(features_size, num_classes, dtype=torch.float64))
    self.b = nn.Parameter(torch.randn(num_classes, dtype=torch.float64))

  def forward(self, x):
    logits = torch.matmul(x, self.w) + self.b
    y_hat = F.softmax(logits, 1)

    return (logits, y_hat)

model = LogisticRegression(MAX_WORDS, NUM_CLASSES)

In [9]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=REG_PARAM)

def update_weights(model, x, y):
  y = y.type(torch.float64)
  logits, y_hat = model(x)
  loss = F.cross_entropy(logits, y)

  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  return loss.detach().cpu()

In [10]:
def evaluate(model, dataset):
  # Fill in the evaluation function; you can change parameters if you like
  model.eval()

  dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)
  correct = 0
  total = 0

  # Gradients are not needed for evaluation
  with torch.no_grad():
    for x, y in dataloader:
      _, y_pred = model(x)
      # Predicted and expected classes are the arg max of each row
      _, predicted_index = torch.max(y_pred, dim=1)
      _, y_true_index = torch.max(y, dim=1)

      total += y.size(0)
      correct += (predicted_index == y_true_index).sum().item()

  return correct / total

# Model training

Train your model by calling the `update_weights` function and tracking the cost. You don't have to modify this section.

## Sanity check

Your loss should be similar to:

Epoch: 0001 cost=4.237748146  
Epoch: 0002 cost=2.006925821  
Epoch: 0003 cost=0.838360906  
Epoch: 0004 cost=0.526503205  
Epoch: 0005 cost=0.406159312  
Epoch: 0006 cost=0.338935345  
Epoch: 0007 cost=0.288057804  
Epoch: 0008 cost=0.245860726  
Epoch: 0009 cost=0.208140314  
Epoch: 0010 cost=0.170706153  
Epoch: 0011 cost=0.141715422  
Epoch: 0012 cost=0.117129274  
Epoch: 0013 cost=0.094932191  
Epoch: 0014 cost=0.075968713  
Epoch: 0015 cost=0.060179509  
Epoch: 0016 cost=0.049887933  
Epoch: 0017 cost=0.039890103  
Epoch: 0018 cost=0.033839807  
Epoch: 0019 cost=0.027970247  
Epoch: 0020 cost=0.024634583  
Optimization Finished!  

In [11]:
# Training cycle
def train(model, dataset):
  model.train()

  for epoch in range(1, MAX_EPOCHS+1):
    avg_cost = []
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE,
                        shuffle=True, drop_last=False)
    for x, y in (dataloader):
      cost = update_weights(model, x, y)

      avg_cost.append(cost)
    # Display logs per each DISPLAY_STEP
    if (epoch) % DISPLAY_STEP == 0:
      print ("Epoch: {:04d} cost={:.9f}".format(epoch, np.mean(avg_cost)))
train(model, train_dataset)
print ("Optimization Finished!")

Epoch: 0001 cost=2.077966992
Epoch: 0002 cost=0.713722233
Epoch: 0003 cost=0.381527853
Epoch: 0004 cost=0.239921180
Epoch: 0005 cost=0.164923919
Epoch: 0006 cost=0.120508174
Epoch: 0007 cost=0.091987143
Epoch: 0008 cost=0.072579703
Epoch: 0009 cost=0.058907330
Epoch: 0010 cost=0.048655904
Epoch: 0011 cost=0.040869345
Epoch: 0012 cost=0.034714190
Epoch: 0013 cost=0.029770270
Epoch: 0014 cost=0.025771626
Epoch: 0015 cost=0.022453961
Epoch: 0016 cost=0.019547111
Epoch: 0017 cost=0.017104139
Epoch: 0018 cost=0.015096708
Epoch: 0019 cost=0.013260410
Epoch: 0020 cost=0.011861033
Optimization Finished!


In [12]:
print("Training dataset", evaluate(model, train_dataset))
print("Test dataset", evaluate(model, test_dataset))

Training dataset 0.9994696835778681
Test dataset 0.8182421667551779
