In [None]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [None]:
import torch
import torch.nn as nn

### Task 1

Write a function **word_2_vec(L)**. It takes a list L of strings (you can assume that the strings contain only lower-case letters, with no digits or special characters) and translates each string into a matrix (2D array) of shape n×26, where n is the number of letters in the string. (Different strings may have different lengths and thus different n.) Each row of the matrix is a one-hot vector (length 26) encoding the corresponding letter. For example, the encoding vector for the letter 'a' should be all zeros except for the first entry, which should be one. The encoding vector for 'b' should have a one in the second entry. The function should return a list of matrices, one for each string in L.

[Use Python and NumPy only; no other packages. If you don't want to build a letter-to-number dictionary by hand, you may use the Python function **ord(l)** to obtain the ASCII value of the letter l.]

In [None]:
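def word_2_vec(L):
    r = []
    for w in L:
        n = len(w)
        t = np.zeros((n, 26))  # one row per letter, one column per alphabet position
        for i, l in enumerate(w):
            j = ord(l) - ord('a')  # 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
            t[i, j] = 1
        r.append(t)
    return r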

In [None]:
for m in word_2_vec(['abc', 'to', 'vec']):
    print(m)

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]]
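
The same encoding can also be written without the inner loop. The following is a minimal sketch, assuming the same lower-case-only input; the name `word_2_vec_vectorized` is hypothetical and not part of the task. It indexes rows of a 26×26 identity matrix, since row k of the identity is exactly the one-hot vector for the k-th letter.

```python
import numpy as np

def word_2_vec_vectorized(L):
    """Illustrative variant of word_2_vec that uses identity-matrix row lookup."""
    identity = np.eye(26)  # row k is the one-hot vector for the k-th letter
    result = []
    for word in L:
        # Map each lower-case letter to an index in 0..25.
        idx = np.array([ord(ch) - ord('a') for ch in word], dtype=int)
        # Integer-array indexing picks one identity row per letter -> shape (n, 26).
        result.append(identity[idx])
    return result

mats = word_2_vec_vectorized(['abc', 'to', 'vec'])
print([m.shape for m in mats])  # [(3, 26), (2, 26), (3, 26)]
```

Indexing returns a copy of the selected rows, so the matrices in the result do not share memory with `identity` or with each other.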


### IMDb dataset


In [None]:
from keras.datasets import imdb

max_features = 20000  # keep only this many of the most common words and ignore the rest
maxlen = 80  # make all reviews the same length (truncating longer ones and padding shorter ones)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

25000 train sequences
25000 test sequences


In [None]:
min_id = min([min(x) for x in x_train])
max_id = max([max(x) for x in x_train])
print(min_id, max_id)

1 19999


### Task 2

Build a model to classify text. The inputs are IMDb reviews; your model should read a review and predict whether it is positive (1) or negative (0). Note that in the data, each review is a list of numbers (word ids), not words.



#### Task 2.1

Preprocess the data so that all sequences have the same length (as defined by maxlen above).
After preprocessing, turn x_train and x_test into tensor datasets so that you can create PyTorch dataloaders from them.

In [None]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, TensorDataset

x_train = pad_sequence([torch.tensor(seq[:maxlen]) for seq in x_train], batch_first=True)
x_test = pad_sequence([torch.tensor(seq[:maxlen]) for seq in x_test], batch_first=True)

train_data = TensorDataset(x_train, torch.from_numpy(y_train))
test_data = TensorDataset(x_test, torch.from_numpy(y_test))

bsz = 128
train_loader = DataLoader(train_data, batch_size=bsz, shuffle=True)
test_loader = DataLoader(test_data, batch_size=bsz, shuffle=False)
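
To see what the truncate-then-pad step does, here is a small sketch on made-up data. The names `toy_maxlen` and `toy_reviews` and their values are hypothetical stand-ins, not part of the dataset. It shows that `pad_sequence` appends zeros on the right and pads only up to the longest sequence it is given.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

toy_maxlen = 4  # hypothetical small value standing in for maxlen
toy_reviews = [[5, 8, 2, 9, 7, 3], [4, 6], [1, 2, 3]]  # made-up word ids

# Truncate each sequence to toy_maxlen, then right-pad with zeros.
padded = pad_sequence([torch.tensor(seq[:toy_maxlen]) for seq in toy_reviews],
                      batch_first=True)
print(padded.shape)  # torch.Size([3, 4])
print(padded)
# tensor([[5, 8, 2, 9],
#         [4, 6, 0, 0],
#         [1, 2, 3, 0]])
```

Two points follow from this. The padded width equals maxlen only if at least one review has maxlen or more tokens. Padding with 0 does not collide with a real token here, because the smallest id printed for the training data is 1.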

#### Task 2.2

Build a model (a subclass of nn.Module) with two LSTM layers (each with 128 hidden units) for text classification. Note that:
 - We don't use the word number as input directly. You need an embedding layer (nn.Embedding) to translate each word (number) into a vector. (You can set the length of the vector to 128.)
 - After the LSTM layers, pass the LSTM output at the last position to a linear output layer to compute the probability of positive/negative.
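
The "LSTM output at the last position" can be read from `nn.LSTM` in two equivalent ways. The sketch below uses toy sizes chosen only for illustration and assumes a unidirectional LSTM with `batch_first=True`. It checks that the last time step of the output sequence matches the final hidden state of the top layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy sizes chosen for illustration only (not the 128/128 used in the task).
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 10, 8)  # (batch, seq_len, features)

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([4, 10, 16]): top-layer output at every step
print(h_n.shape)     # torch.Size([2, 4, 16]): final hidden state of each layer

# The top layer's output at the last step equals its final hidden state.
same = torch.allclose(output[:, -1], h_n[-1])
print(same)  # True
```

Either `output[:, -1]` or `h_n[-1]` can therefore feed the linear layer. This equivalence does not carry over unchanged to bidirectional LSTMs or packed variable-length inputs.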



In [None]:
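class SeqClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Word ids in [0, max_features) -> 128-dimensional vectors.
        self.embedding = nn.Embedding(max_features, embedding_dim=128)
        self.lstm = nn.LSTM(input_size=128, hidden_size=128,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(128, 2)  # two logits: negative (0), positive (1)

    def forward(self, x):
        x = self.embedding(x)
        # h_n has shape (num_layers, batch, hidden); h_n[-1] is the top layer's final state.
        _, (h_n, _) = self.lstm(x)
        return self.out(h_n[-1])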

#### Task 2.3

After building your model, train it for 20 epochs on the training data and evaluate it on the test data. Print the accuracy on the test data.

In [None]:
from torch.optim import Adam
from sklearn.metrics import accuracy_score

device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU without a GPU
model = SeqClassifier().to(device)
loss_func = nn.CrossEntropyLoss()
opt = Adam(model.parameters(), lr=0.001)

for epoch in range(20):
    model.train()
    for inputs, labels in train_loader:
        opt.zero_grad()
        outputs = model(inputs.to(device))
        loss = loss_func(outputs, labels.to(device))
        loss.backward()
        opt.step()

    model.eval()
    all_predictions = []
    with torch.no_grad():
        for inputs, _ in test_loader:
            # The model returns logits, not probabilities; pick the class with the larger logit.
            outputs = model(inputs.to(device))
            predictions = outputs.argmax(dim=1).cpu().numpy().tolist()
            all_predictions.extend(predictions)
    accuracy = accuracy_score(y_test, all_predictions)
    print(f'Epoch {epoch+1}, Accuracy: {accuracy:.3f}')
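
Because the classifier ends in `nn.Linear(128, 2)` and is trained with `nn.CrossEntropyLoss`, its outputs are raw logits rather than probabilities. The sketch below uses made-up logits to compare three ways of reading a prediction. Taking the argmax agrees with thresholding the softmax probability of class 1 at 0.5, while thresholding the raw class-1 logit at 0.5 can disagree.

```python
import torch

# Made-up logits for three reviews; columns are (negative, positive).
logits = torch.tensor([[2.0, -1.0],
                       [0.2, 0.3],
                       [-0.5, 1.5]])

probs = torch.softmax(logits, dim=1)          # each row sums to 1
pred_argmax = logits.argmax(dim=1)            # class with the larger logit
pred_thresh = (probs[:, 1] >= 0.5).long()     # threshold the class-1 probability
raw_thresh = (logits[:, 1] >= 0.5).long()     # threshold the raw class-1 logit

print(pred_argmax.tolist())  # [0, 1, 1]
print(pred_thresh.tolist())  # [0, 1, 1]
print(raw_thresh.tolist())   # [0, 0, 1]
```

The second review shows the difference: its positive logit (0.3) is larger than its negative logit (0.2), so it is classified as positive, yet 0.3 falls below a raw threshold of 0.5.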