# Subjectivity classification with CNNs

In this notebook we implement the approched described in this [paper](https://arxiv.org/pdf/1408.5882.pdf) for classifiying sentences using Convolutional Neural Networks. In particular, we will classify sentences into "subjective" or "objective". 

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import Dataset, DataLoader

In [2]:
from sklearn.model_selection import train_test_split

## Subjectivity Dataset

The subjectivity dataset has 5000 subjective and 5000 objective processed sentences. To get the data:
```
wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
```

In [3]:
from pathlib import Path
PATH = Path("data")
list(PATH.iterdir())

[PosixPath('data/glove.6B.300d.txt'),
 PosixPath('data/glove.6B.100d.txt'),
 PosixPath('data/names_train.csv'),
 PosixPath('data/names_test.csv'),
 PosixPath('data/glove.6B.50d.txt'),
 PosixPath('data/plot.tok.gt9.5000'),
 PosixPath('data/subjdata.README.1.0'),
 PosixPath('data/pmlb'),
 PosixPath('data/quote.tok.gt9.5000'),
 PosixPath('data/glove.6B.200d.txt'),
 PosixPath('data/glove.6B.zip')]

From the readme file:
- quote.tok.gt9.5000 contains 5000 subjective sentences (or snippets)
- plot.tok.gt9.5000 contains 5000 objective sentences

In [4]:
! head data/plot.tok.gt9.5000

the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . 
emerging from the human psyche and showing characteristics of abstract expressionism , minimalism and russian constructivism , graffiti removal has secured its place in the history of modern art while being created by artists who are unconscious of their artistic achievements . 
spurning her mother's insistence that she get on with her life , mary is thrown out of the house , rejected by joe , and expelled from school as she grows larger with child . 
amitabh can't believe the board of directors and his mind is filled with revenge and what better revenge than robbing the bank himself , ironic as it may sound . 
she , among others excentricities , talks to a small rock , gertrude , like if she was alive . 
this gives the girls a fair chance of pulling the wool over their eyes using their sexiness to poach any last vestige of common sense the dons might have had . 
styled after vh1's "

## String cleaning functions

In [5]:
import numpy as np
from collections import defaultdict
import re

In [6]:
# this is from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Every dataset is lower cased except for TREC
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)     
    string = re.sub(r"\'s", " \'s", string) 
    string = re.sub(r"\'ve", " \'ve", string) 
    string = re.sub(r"n\'t", " n\'t", string) 
    string = re.sub(r"\'re", " \'re", string) 
    string = re.sub(r"\'d", " \'d", string) 
    string = re.sub(r"\'ll", " \'ll", string) 
    string = re.sub(r",", " , ", string) 
    string = re.sub(r"!", " ! ", string) 
    string = re.sub(r"\(", " \( ", string) 
    string = re.sub(r"\)", " \) ", string) 
    string = re.sub(r"\?", " \? ", string) 
    string = re.sub(r"\s{2,}", " ", string)    
    return string.strip().lower()

In [7]:
def read_file(path):
    """ Read file returns a shuttled list.
    """
    with open(path, encoding = "ISO-8859-1") as f:
        content = np.array(f.readlines())
    return content

In [8]:
def get_vocab(list_of_content):
    """Computes Dict of counts of words.
    
    Computes the number of times a word is on a document.
    """
    vocab = defaultdict(float)
    for content in list_of_content:
        for line in content:
            line = clean_str(line.strip())
            words = set(line.split())
            for word in words:
                vocab[word] += 1
    return vocab       

## Split train and test

In [9]:
sub_content = read_file(PATH/"quote.tok.gt9.5000")
obj_content = read_file(PATH/"plot.tok.gt9.5000")
sub_content = np.array([clean_str(line.strip()) for line in sub_content])
obj_content = np.array([clean_str(line.strip()) for line in obj_content])
sub_y = np.zeros(len(sub_content))
obj_y = np.ones(len(obj_content))
X = np.append(sub_content, obj_content)
y = np.append(sub_y, obj_y)

In [10]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
X_train[:5], y_train[:5]

(array(['will god let her fall or give her a new path \\?',
        "the director 's twitchy sketchbook style and adroit perspective shifts grow wearisome amid leaden pacing and indifferent craftsmanship \\( most notably wretched sound design \\)",
        "welles groupie scholar peter bogdanovich took a long time to do it , but he 's finally provided his own broadside at publishing giant william randolph hearst",
        'based on the 1997 john king novel of the same name with a rather odd synopsis a first novel about a seasoned chelsea football club hooligan who represents a disaffected society operating by brutal rules',
        'yet , beneath an upbeat appearance , she is struggling desperately with the emotional and physical scars left by the attack'],
       dtype='<U679'), array([1., 0., 0., 1., 1.]))

In [12]:
# getting vocab from training sets
word_count = get_vocab([X_train])

In [13]:
#word_count

In [14]:
len(word_count.keys())

19310

In [15]:
# let's delete words that are very infrequent
for word in list(word_count):
    if word_count[word] < 5:
        del word_count[word]
len(word_count.keys())

4203

In [16]:
## Finally we need an index for each word in the vocab
vocab2index = {"<PAD>":0, "UNK":1} # init with padding and unknown
words = ["<PAD>", "UNK"]
for word in word_count:
    vocab2index[word] = len(words)
    words.append(word)

## Embedding Layer

In [17]:
# an Embedding module containing 10 (words) tensors of size 3
embed = nn.Embedding(10, 3)
a = torch.LongTensor([[1,2,4,5,1]])
embed(a)

tensor([[[-0.6895,  0.5455,  0.9540],
         [-0.7710,  0.6922, -1.5952],
         [-0.3402, -1.1721, -1.0863],
         [ 0.8806, -0.4155,  1.0337],
         [-0.6895,  0.5455,  0.9540]]], grad_fn=<EmbeddingBackward>)

In [18]:
## here is the randomly initialized embeddings
embed.weight.data

tensor([[-0.3417,  0.4184,  0.5396],
        [-0.6895,  0.5455,  0.9540],
        [-0.7710,  0.6922, -1.5952],
        [-1.0999, -0.3351,  0.2493],
        [-0.3402, -1.1721, -1.0863],
        [ 0.8806, -0.4155,  1.0337],
        [-0.1286,  1.2243,  0.7281],
        [-0.2313,  1.6461, -1.4697],
        [-0.6915,  1.0947, -0.2442],
        [-0.2410,  0.2314, -0.6789]])

Question: How many parameters do we have in this embedding matrix?

## Encoding training and validation sets

We will be using 1D Convolutional neural networks as our model. CNNs assume a fixed input size so we need to assume a fixed size and truncate or pad the sentences as needed. Let's find a good value to set our sequence length to.

In [19]:
x_len = np.array([len(x.split()) for x in X_train])

In [20]:
np.percentile(x_len, 95) # let set the max sequence len to N=40

42.0

In [21]:
X_train[0]

'will god let her fall or give her a new path \\?'

In [22]:
# returns the index of the word or the index of "UNK" otherwise
vocab2index.get("will", vocab2index["UNK"])

2

In [23]:
np.array([vocab2index.get(w, vocab2index["UNK"]) for w in X_train[0].split()])

array([ 2, 11, 10,  4, 12,  5,  6,  4,  7,  3,  8,  9])

In [24]:
def encode_sentence(s, N=40):
    enc = np.zeros(N, dtype=np.int32)
    enc1 = np.array([vocab2index.get(w, vocab2index["UNK"]) for w in s.split()])
    l = min(N, len(enc1))
    enc[:l] = enc1[:l]
    return enc

In [25]:
class SubjectivityDataset(Dataset):
    def __init__(self, X, y):
        self.x = X
        self.y = y
    
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        x = self.x[idx]
        x = encode_sentence(x)
        return x, self.y[idx]
    
train_ds = SubjectivityDataset(X_train, y_train)
valid_ds = SubjectivityDataset(X_val, y_val)

In [26]:
valid_ds[0]

(array([   1,  498, 2405,   63,   94,   61, 3622,   19, 1331,  498, 2151,
         315,   94,   61,    1,    1,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0], dtype=int32), 1.0)

In [27]:
train_dl = DataLoader(train_ds, batch_size=500, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=500)

## Playing and debugging CNN layers

In [28]:
tr_dl = DataLoader(train_ds, batch_size=3, shuffle=True)

In [41]:
V = len(words)
D = 7
N = 40

In [42]:
emb = nn.Embedding(V, D)

In [43]:
x, y = next(iter(tr_dl))
x.shape, y

(torch.Size([3, 40]), tensor([0., 1., 1.]))

In [44]:
x

tensor([[ 151,   69,  180,    1,  172,   26,  797,    7,    1,   92,    7,  251,
          273,  122, 1270,  587,    7,  158,   63, 1526,    1,   55,  344,    7,
            1,   37,  153, 3492, 3690,  391,    1, 1896,    7,  273,  220,   46,
          395,    1,    0,    0],
        [ 151, 2283,  981, 1547,   59,    1,    1,   19,    1,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [  77,    7, 3784,   19,  148,   98,  790,   37, 1538,    1,    1, 3783,
          391,    1,  363,  619,   19,    1,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]], dtype=torch.int32)

In [45]:
x1 = emb(x.long())

In [46]:
x1.size()

torch.Size([3, 40, 7])

In [47]:
x1 = x1.transpose(1,2)  # needs to convert x to (batch, embedding_dim, sentence_len)
x1.size()

torch.Size([3, 7, 40])

In [48]:
conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)

In [49]:
x3 = conv_3(x1)

In [50]:
x3.size()

torch.Size([3, 100, 38])

In [51]:
conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)

In [52]:
x4 = conv_4(x1)
x5 = conv_5(x1)
print(x4.size(), x5.size())

torch.Size([3, 100, 37]) torch.Size([3, 100, 36])


Note that the convolution all apply to the same `x1`. How do we combine now the results of the convolutions? 

In [53]:
# 100 3-gram detectors
x3 = nn.ReLU()(x3)
x3 = nn.MaxPool1d(kernel_size = 38)(x3)
x3.size()

torch.Size([3, 100, 1])

In [54]:
# 100 4-gram detectors
x4 = nn.ReLU()(x4)
x4 = nn.MaxPool1d(kernel_size = 37)(x4)
x4.size()

torch.Size([3, 100, 1])

In [55]:
# 100 5-gram detectors
x5 = nn.ReLU()(x5)
x5 = nn.MaxPool1d(kernel_size = 36)(x5)
x5.size()

torch.Size([3, 100, 1])

In [56]:
# concatenate x3, x4, x5
out = torch.cat([x3, x4, x5], 2)
out.size()

torch.Size([3, 100, 3])

In [57]:
out = out.view(out.size(0), -1)
out.size()

torch.Size([3, 300])

After this we have a fully connected network. Let's write a network that implements this.

## 1D CNN model for sentence classification

Notation:
* V -- vocabulary size
* D -- embedding size
* N -- MAX Sentence length

In [58]:
class SentenceCNN(nn.Module):
    
    def __init__(self, V, D):
        super(SentenceCNN, self).__init__()
        self.embedding = nn.Embedding(V, D, padding_idx=0)

        self.conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)
        self.conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
        self.conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)
        
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(300, 1)
        
    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1,2)
        x3 = F.relu(self.conv_3(x))
        x4 = F.relu(self.conv_4(x))
        x5 = F.relu(self.conv_5(x))
        x3 = nn.MaxPool1d(kernel_size = 38)(x3)
        x4 = nn.MaxPool1d(kernel_size = 37)(x4)
        x5 = nn.MaxPool1d(kernel_size = 36)(x5)
        out = torch.cat([x3, x4, x5], 2)
        out = out.view(out.size(0), -1)
        out = self.dropout(out)
        return self.fc(out)   

In [59]:
V = len(words)
D = 50
N = 40
model = SentenceCNN(V, D)

In [61]:
x, y = next(iter(train_dl))

In [62]:
y_hat = model(x.long())
y_hat.size()

torch.Size([500, 1])

In [64]:
F.binary_cross_entropy_with_logits(y_hat, y.unsqueeze(1))

tensor(0.7309, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)

## Training

In [69]:
def valid_metrics(model):
    model.eval()
    total = 0
    sum_loss = 0
    correct = 0
    for x, y in valid_dl:
        x = x.long()  #.cuda()
        y = y.float().unsqueeze(1)
        batch = y.shape[0]
        out = model(x)
        loss = F.binary_cross_entropy_with_logits(out, y)
        sum_loss += batch*(loss.item())
        total += batch
        pred = (out > 0).float()
        correct += (pred == y).float().sum().item()
    val_loss = sum_loss/total
    val_acc = correct/total
    return val_loss, val_acc

In [70]:
def train_epocs(model, optimizer, epochs=10):
    for i in range(epochs):
        model.train()
        total_loss = 0
        total = 0
        for x, y in train_dl:
            x = x.long()
            y = y.float().unsqueeze(1)
            out = model(x)
            loss = F.binary_cross_entropy_with_logits(out, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += x.size(0)*loss.item()
            total += x.size(0)
        train_loss = total_loss/total
        val_loss, val_accuracy = valid_metrics(model)
        
        print("train_loss %.3f val_loss %.3f val_accuracy %.3f" % (
            train_loss, val_loss, val_accuracy))

In [71]:
model = SentenceCNN(V, D)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
train_epocs(model, optimizer, epochs=5)

train_loss 0.892 val_loss 0.542 val_accuracy 0.719
train_loss 0.468 val_loss 0.454 val_accuracy 0.790
train_loss 0.322 val_loss 0.344 val_accuracy 0.863
train_loss 0.221 val_loss 0.320 val_accuracy 0.872
train_loss 0.149 val_loss 0.324 val_accuracy 0.881


In [72]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train_epocs(model, optimizer, epochs=10)

train_loss 0.107 val_loss 0.336 val_accuracy 0.876
train_loss 0.092 val_loss 0.346 val_accuracy 0.878
train_loss 0.084 val_loss 0.358 val_accuracy 0.878
train_loss 0.074 val_loss 0.367 val_accuracy 0.878
train_loss 0.064 val_loss 0.377 val_accuracy 0.878
train_loss 0.059 val_loss 0.386 val_accuracy 0.880
train_loss 0.053 val_loss 0.396 val_accuracy 0.877
train_loss 0.046 val_loss 0.406 val_accuracy 0.877
train_loss 0.041 val_loss 0.418 val_accuracy 0.881
train_loss 0.037 val_loss 0.424 val_accuracy 0.877


## References

The CNN is adapted from here https://github.com/junwang4/CNN-sentence-classification-pytorch-2017/blob/master/cnn_pytorch.py.
Code for the original paper can be found here https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py.