In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt


In [2]:
import torch
import torch.nn as nn

### Task 1

Write a function **word_2_vec(L)**. It takes a list L of strings (you can assume that the strings contain only lower-case letters, with no numbers or special characters) and translates each string into a matrix (2D array) of shape n×26, where n is the number of letters in the string. (Different strings may have different lengths and thus different n.) Each row of the matrix is a one-hot vector (length 26) encoding the corresponding letter. For example, the encoding vector for the letter 'a' should be all zeros except for the first entry, which should be one. The encoding vector for 'b' should have the second entry set to one. The function should return a list of matrices, one for each string in L.

[Use Python and NumPy only. No other packages. If you don't want to build a letter-to-number dictionary by hand, you may use the Python function **ord(l)** to obtain the ASCII value of the letter l.]

In [11]:
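def char_2_one_hot(c):
    # Position of the letter in the alphabet: 'a' -> 0, 'b' -> 1, ..., 'z' -> 25.
    index = ord(c) - ord('a')
    return [1 if i == index else 0 for i in range(26)]

def word_2_vec(L):
    # One n x 26 matrix (as a nested list) per word, with one one-hot row per letter.
    return [[char_2_one_hot(c) for c in word] for word in L]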

In [12]:
for m in word_2_vec(['abc', 'to', 'vec']):
    print(m)

[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
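The task statement asks for 2D arrays, while the solution above returns nested Python lists. As a minimal sketch of an array-based alternative (the name `word_2_vec_np` is hypothetical and not part of the original solution), the rows of a 26×26 identity matrix can serve as the one-hot vectors:

```python
import numpy as np

def word_2_vec_np(L):
    """Hypothetical NumPy variant: returns one (n, 26) uint8 array per word."""
    identity = np.eye(26, dtype=np.uint8)  # row k is the one-hot vector for letter k
    matrices = []
    for word in L:
        # Map each lower-case letter to 0..25, then select the matching rows.
        indices = np.array([ord(c) - ord('a') for c in word], dtype=np.intp)
        matrices.append(identity[indices])
    return matrices

mats = word_2_vec_np(['abc', 'to', 'vec'])
print([m.shape for m in mats])  # [(3, 26), (2, 26), (3, 26)]
```

Indexing the identity matrix avoids the inner Python loop over 26 positions. It assumes, as the task does, that every character is a lower-case letter.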


### IMDb dataset


In [13]:
from keras.datasets import imdb

max_features = 20000  # keep only this many of the most common words
maxlen = 80  # make all reviews the same length (truncating longer ones and padding shorter ones)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

25000 train sequences
25000 test sequences


In [14]:
min_id = min([min(x) for x in x_train])
max_id = max([max(x) for x in x_train])
print(min_id, max_id)

1 19999


### Task 2

Build a model to classify text. The inputs are IMDb reviews, and your model should read a review and predict whether it is positive (1) or negative (0). Note that in the data, each review is a list of numbers (word ids), not words.



#### Task 2.1

Preprocess the data so that all sequences have the same length (as defined by maxlen above).
After preprocessing, turn x_train and x_test into tensors so that you can create PyTorch dataloaders from these tensors.
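As a rough illustration of what "same length" means here, the sketch below clips and pads toy sequences in plain Python. The helper name `pad_or_truncate` and the toy ids are made up for this example. Keeping the first tokens and padding on the right is only one possible convention (Keras's `pad_sequences`, for instance, pads and truncates at the front by default). Using 0 as the padding id is an assumption that fits this data, since the smallest id printed above is 1.

```python
def pad_or_truncate(seq, maxlen, pad_id=0):
    """Return exactly maxlen ids: the first maxlen tokens, right-padded with pad_id."""
    clipped = list(seq[:maxlen])
    return clipped + [pad_id] * (maxlen - len(clipped))

# Toy word-id sequences (hypothetical values, not taken from the dataset).
toy_reviews = [[1, 14, 22], [1, 7, 9, 30, 5, 6, 8]]
fixed = [pad_or_truncate(r, maxlen=5) for r in toy_reviews]
print(fixed)  # [[1, 14, 22, 0, 0], [1, 7, 9, 30, 5]]
```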

In [15]:

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, TensorDataset

# Keep the first maxlen tokens of each review, then right-pad shorter reviews
# with 0 so every row has the same length. Truncating before padding avoids
# building a matrix as wide as the longest review.
x_train_padded = pad_sequence(
    [torch.tensor(seq[:maxlen]) for seq in x_train], batch_first=True, padding_value=0)
x_test_padded = pad_sequence(
    [torch.tensor(seq[:maxlen]) for seq in x_test], batch_first=True, padding_value=0)

y_train_tensor = torch.tensor(y_train)
y_test_tensor = torch.tensor(y_test)

train_data = TensorDataset(x_train_padded, y_train_tensor)
test_data = TensorDataset(x_test_padded, y_test_tensor)

# Shuffle only the training data; keep the test order fixed for evaluation.
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)

#### Task 2.2

Build a model (subclass of nn.Module) with two layers of LSTM (each has 128 neurons/cells) for the text classification. Note that:
 - We don't use the word number as input directly. You need to use an embedding layer (nn.Embedding) to translate the word (number) into a vector. (You can set the length of the vector to be 128.)  
 - After going through the LSTM layers, give the LSTM output at the last position to a linear output layer to compute the probability of positive/negative.
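To make the tensor shapes in these two notes concrete, here is a small sketch with toy sizes (the values 50, 8, and 16 are chosen only for illustration; the task itself asks for 128-dimensional embeddings and 128 LSTM cells). It also checks that, for a unidirectional LSTM with `batch_first=True`, the output at the last position matches the final hidden state of the top layer, so either can be passed to the linear layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only.
vocab_size, embed_dim, hidden_dim = 50, 8, 16
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)

tokens = torch.randint(0, vocab_size, (4, 10))  # (batch, seq_len) of word ids
embedded = embedding(tokens)                    # (batch, seq_len, embed_dim)
output, (h_n, c_n) = lstm(embedded)             # output: (batch, seq_len, hidden_dim)
                                                # h_n:    (num_layers, batch, hidden_dim)

last_step = output[:, -1, :]                    # top-layer output at the last position
same = torch.allclose(last_step, h_n[-1])
print(embedded.shape, output.shape, h_n.shape, same)
```

Note that with right-padded inputs the last position may be a padding token, so the choice of padding side affects what this final state summarizes.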



In [16]:
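class MovieReviewClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Map each word id (0..max_features-1) to a 128-dimensional vector.
        self.embedding = nn.Embedding(
            num_embeddings=max_features, embedding_dim=128)
        # Two stacked LSTM layers with 128 hidden units each.
        self.lstm = nn.LSTM(input_size=128, hidden_size=128,
                            num_layers=2, batch_first=True)
        # Single output unit: probability that the review is positive.
        self.out = nn.Linear(128, 1)

    def forward(self, x):
        x = self.embedding(x)              # (batch, maxlen, 128)
        _, (hidden, _) = self.lstm(x)      # hidden: (num_layers, batch, 128)
        x = hidden[-1]                     # final hidden state of the top layer
        return torch.sigmoid(self.out(x))  # (batch, 1), values in (0, 1)


model = MovieReviewClassifier()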

#### Task 2.3

After building your model, train it (20 epochs) on the train data and test it on the test data. Print out the accuracy on the test data.  

In [17]:
from torch.optim import Adam
from sklearn.metrics import accuracy_score

criterion = nn.BCELoss()  # the model already applies a sigmoid
optimizer = Adam(model.parameters(), lr=0.001)

for epoch in range(20):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        # squeeze(1) drops only the output dimension, so a batch of size 1 keeps its shape.
        outputs = model(inputs).squeeze(1)
        loss = criterion(outputs, labels.float())
        loss.backward()
        optimizer.step()

    # Evaluate on the test set after every epoch.
    model.eval()
    all_predictions = []
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs).squeeze(1)
            predictions = (outputs >= 0.5).long()  # threshold the probability at 0.5
            all_predictions.extend(predictions.numpy())
    accuracy = accuracy_score(y_test, all_predictions)
    print(f'Epoch {epoch+1}, Accuracy: {accuracy}')

Epoch 1, Accuracy: 0.56968
Epoch 2, Accuracy: 0.7348
Epoch 3, Accuracy: 0.7688
Epoch 4, Accuracy: 0.79588
Epoch 5, Accuracy: 0.7892
Epoch 6, Accuracy: 0.7898
Epoch 7, Accuracy: 0.782
Epoch 8, Accuracy: 0.776
Epoch 9, Accuracy: 0.77452
Epoch 10, Accuracy: 0.7782
Epoch 11, Accuracy: 0.77616
Epoch 12, Accuracy: 0.76404
Epoch 13, Accuracy: 0.77208
Epoch 14, Accuracy: 0.7772
Epoch 15, Accuracy: 0.77228
Epoch 16, Accuracy: 0.76996
Epoch 17, Accuracy: 0.77196
Epoch 18, Accuracy: 0.76928
Epoch 19, Accuracy: 0.7692
Epoch 20, Accuracy: 0.76976
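
In the log above, test accuracy peaks at epoch 4 (0.79588) and drifts down afterwards, which is consistent with the model overfitting the training data. One common response is early stopping. The sketch below replays the logged accuracies through a simple patience rule. The function name `stop_epoch` and the patience value of 3 are assumptions made for illustration. In practice the monitored metric would usually come from a held-out validation split rather than the test set.

```python
# Test accuracies copied from the 20-epoch log above.
accuracies = [0.56968, 0.7348, 0.7688, 0.79588, 0.7892, 0.7898, 0.782, 0.776,
              0.77452, 0.7782, 0.77616, 0.76404, 0.77208, 0.7772, 0.77228,
              0.76996, 0.77196, 0.76928, 0.7692, 0.76976]

def stop_epoch(history, patience):
    """Return (stop_epoch, best_epoch, best_value), with 1-based epochs.

    Training stops once `patience` consecutive epochs fail to beat the best value.
    """
    best_value, best_epoch, waited = float('-inf'), 0, 0
    for epoch, value in enumerate(history, start=1):
        if value > best_value:
            best_value, best_epoch, waited = value, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch, best_value
    return len(history), best_epoch, best_value

print(stop_epoch(accuracies, patience=3))  # (7, 4, 0.79588)
```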
