# Subjectivity classification with CNNs

In this notebook we implement the approach described in this [paper](https://arxiv.org/pdf/1408.5882.pdf) for classifying sentences using Convolutional Neural Networks. In particular, we will classify sentences as "subjective" or "objective".

## Subjectivity Dataset

The subjectivity dataset has 5000 subjective and 5000 objective processed sentences. To get the data:
```
wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
```

In [1]:
from pathlib import Path
PATH = Path("/data2/yinterian/rotten_imdb/")
list(PATH.iterdir())

[PosixPath('/data2/yinterian/rotten_imdb/glove.6B.300d.txt'),
 PosixPath('/data2/yinterian/rotten_imdb/glove.6B.100d.txt'),
 PosixPath('/data2/yinterian/rotten_imdb/glove.6B.zip'),
 PosixPath('/data2/yinterian/rotten_imdb/glove.6B.200d.txt'),
 PosixPath('/data2/yinterian/rotten_imdb/plot.tok.gt9.5000'),
 PosixPath('/data2/yinterian/rotten_imdb/subjdata.README.1.0'),
 PosixPath('/data2/yinterian/rotten_imdb/quote.tok.gt9.5000'),
 PosixPath('/data2/yinterian/rotten_imdb/glove.6B.50d.txt')]

From the readme file:
- quote.tok.gt9.5000 contains 5000 subjective sentences (or snippets)
- plot.tok.gt9.5000 contains 5000 objective sentences

In [2]:
! head /data2/yinterian/rotten_imdb/plot.tok.gt9.5000

the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . 
emerging from the human psyche and showing characteristics of abstract expressionism , minimalism and russian constructivism , graffiti removal has secured its place in the history of modern art while being created by artists who are unconscious of their artistic achievements . 
spurning her mother's insistence that she get on with her life , mary is thrown out of the house , rejected by joe , and expelled from school as she grows larger with child . 
amitabh can't believe the board of directors and his mind is filled with revenge and what better revenge than robbing the bank himself , ironic as it may sound . 
she , among others excentricities , talks to a small rock , gertrude , like if she was alive . 
this gives the girls a fair chance of pulling the wool over their eyes using their sexiness to poach any last vestige of common sense the dons might have had . 
styled after vh1's "

## String cleaning functions

In [26]:
import numpy as np
from collections import defaultdict
import re

In [28]:
# this is from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Every dataset is lower cased except for TREC
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)     
    string = re.sub(r"\'s", " \'s", string) 
    string = re.sub(r"\'ve", " \'ve", string) 
    string = re.sub(r"n\'t", " n\'t", string) 
    string = re.sub(r"\'re", " \'re", string) 
    string = re.sub(r"\'d", " \'d", string) 
    string = re.sub(r"\'ll", " \'ll", string) 
    string = re.sub(r",", " , ", string) 
    string = re.sub(r"!", " ! ", string) 
    string = re.sub(r"\(", " \( ", string) 
    string = re.sub(r"\)", " \) ", string) 
    string = re.sub(r"\?", " \? ", string) 
    string = re.sub(r"\s{2,}", " ", string)    
    return string.strip().lower()
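
The cleaned sentences printed later in this notebook contain sequences such as `\\?` and `\\(`. A minimal sketch of why, assuming Python 3: `re.sub` leaves unknown non-letter escapes in the replacement string untouched, so the backslash written in `" \? "` ends up in the output. The toy input `"path?"` is only for illustration, and the replacement is written as a raw string here to avoid an invalid-escape warning (the string value is the same).

```python
import re

# Replacement spelled as in clean_str: the backslash survives into the result.
with_backslash = re.sub(r"\?", r" \? ", "path?")

# Replacement without the backslash: only spaces are added around the "?".
without_backslash = re.sub(r"\?", " ? ", "path?")

print(repr(with_backslash))
print(repr(without_backslash))
```

If the backslashes are not wanted, dropping them from the three replacement strings for `(`, `)` and `?` would be one option. Tokens such as `\?` would then become `?`, which may change which tokens are found in the pre-trained vocabulary.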

In [16]:
def read_file(path):
    """Reads a file and returns its lines as a numpy array of strings.
    
    The lines are returned in file order (no shuffling happens here).
    """
    with open(path, encoding = "ISO-8859-1") as f:
        content = np.array(f.readlines())
    return content

In [41]:
def get_vocab(list_of_content):
    """Computes a dict of document frequencies.
    
    For each word, counts the number of sentences (documents) it appears in.
    Repeated occurrences within one sentence are counted once.
    """
    vocab = defaultdict(float)
    for content in list_of_content:
        for line in content:
            line = clean_str(line.strip())
            words = set(line.split())
            for word in words:
                vocab[word] += 1
    return vocab       
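
To make the difference between a document frequency and a raw count concrete, here is a small sketch on a made-up corpus. The helper name `document_frequency` and the toy sentences are assumptions for illustration; the counting logic mirrors the `set(line.split())` step in `get_vocab`.

```python
from collections import defaultdict

def document_frequency(sentences):
    """Counts, for each word, the number of sentences that contain it."""
    counts = defaultdict(int)
    for sentence in sentences:
        # set() removes repeats, so a word counts once per sentence
        for word in set(sentence.split()):
            counts[word] += 1
    return counts

toy_sentences = ["the movie is good", "the the plot", "good plot"]
df = document_frequency(toy_sentences)
print(df["the"], df["plot"], df["movie"])
```

Here "the" occurs three times in total but in only two sentences, so its document frequency is 2.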

## Split train and test

In [66]:
sub_content = read_file(PATH/"quote.tok.gt9.5000")
obj_content = read_file(PATH/"plot.tok.gt9.5000")
sub_content = np.array([clean_str(line.strip()) for line in sub_content])
obj_content = np.array([clean_str(line.strip()) for line in obj_content])
sub_y = np.zeros(len(sub_content))  # label 0: subjective
obj_y = np.ones(len(obj_content))   # label 1: objective
X = np.append(sub_content, obj_content)
y = np.append(sub_y, obj_y)

In [67]:
from sklearn.model_selection import train_test_split

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [69]:
X_train[:5], y_train[:5]

(array(['will god let her fall or give her a new path \\?',
        "the director 's twitchy sketchbook style and adroit perspective shifts grow wearisome amid leaden pacing and indifferent craftsmanship \\( most notably wretched sound design \\)",
        "welles groupie scholar peter bogdanovich took a long time to do it , but he 's finally provided his own broadside at publishing giant william randolph hearst",
        'based on the 1997 john king novel of the same name with a rather odd synopsis a first novel about a seasoned chelsea football club hooligan who represents a disaffected society operating by brutal rules',
        'yet , beneath an upbeat appearance , she is struggling desperately with the emotional and physical scars left by the attack'],
       dtype='<U679'), array([1., 0., 0., 1., 1.]))

In [70]:
# getting vocab from training sets
vocab = get_vocab([X_train])

## Embedding Layer

In [71]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

In [72]:
# an Embedding module containing 10 (words) tensors of size 3
embed = nn.Embedding(10, 3)
a = Variable(torch.LongTensor([[1,2,4,5,1]]))
embed(a)

Variable containing:
(0 ,.,.) = 
  1.4004 -1.4059 -0.4985
 -0.9177 -0.0454 -0.2885
 -0.9559  1.9323  0.9919
  0.0247  1.6792  0.9583
  1.4004 -1.4059 -0.4985
[torch.FloatTensor of size 1x5x3]

In [73]:
## here are the randomly initialized embeddings
embed.weight.data


 2.3088 -0.0738  2.5989
 1.4004 -1.4059 -0.4985
-0.9177 -0.0454 -0.2885
 1.9063 -0.8565 -0.1296
-0.9559  1.9323  0.9919
 0.0247  1.6792  0.9583
 1.7486  0.5193  0.2437
-1.4288 -0.0590 -0.0395
 0.0327 -1.8944 -0.8037
-1.3820 -1.6717  0.6392
[torch.FloatTensor of size 10x3]

### Initializing embedding layer with GloVe embeddings

To get GloVe pre-trained embeddings:
    `wget http://nlp.stanford.edu/data/glove.6B.zip`

In this section I am keeping the whole set of GloVe embeddings. You can decide to keep just the words in your training set.

In [74]:
! head -2 /data2/yinterian/rotten_imdb/glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392


We would like to initialize the embeddings in our model with the pre-trained GloVe embeddings. After initializing them we should "freeze" the embeddings, at least initially. The rationale is that we first want the network to learn weights for the other parameters, which were randomly initialized. After that phase we can fine-tune the embeddings for our task.

`embed.weight.requires_grad = False` freezes the embedding parameters.

The following code initializes the embedding. Here `V` is the vocabulary size and `D` is the embedding size. `pretrained_weight` is a numpy matrix of shape `(V, D)`.

In [75]:
def loadGloveModel(gloveFile="/data2/yinterian/rotten_imdb/glove.6B.300d.txt"):
    """ Loads word vectors into a dictionary."""
    word_vecs = {}
    # the context manager closes the file once all lines have been read
    with open(gloveFile, 'r', encoding="utf-8") as f:
        for line in f:
            # each line is: word followed by the vector components
            splitLine = line.split()
            word = splitLine[0]
            word_vecs[word] = np.array([float(val) for val in splitLine[1:]])
    return word_vecs

In [76]:
word_vecs = loadGloveModel()

In [77]:
print(len(word_vecs.keys()), len(vocab.keys()))

400000 19310


In [85]:
# from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
def add_unknown_words(word_vecs, vocab, min_df=1, D=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.    
    0.25 is chosen so the unknown vectors have (approximately) same variance 
    as pre-trained ones
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25,0.25,D)
    # here for rare words we will use UNK
    word_vecs["UNK"] = np.random.uniform(-0.25,0.25,D)
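
The docstring ties the value 0.25 to the variance of the random vectors. As a worked check of that arithmetic only (it says nothing about the GloVe vectors themselves), the variance of a uniform distribution on `(-a, a)` is `a**2 / 3`, which for `a = 0.25` is about 0.0208. The sample size and seed below are arbitrary choices for illustration.

```python
import numpy as np

a = 0.25
# Variance of Uniform(-a, a): (2a)^2 / 12 = a^2 / 3
theoretical_var = a ** 2 / 3

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
sample = rng.uniform(-a, a, size=300_000)
empirical_var = sample.var()

print(round(theoretical_var, 4), round(float(empirical_var), 4))
```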

In [86]:
def create_embedding_matrix(word_vecs, D=300):
    """Creates embedding matrix from word vectors. """
    V = len(word_vecs.keys())
    vocab2index = {}
    vocab = []
    W = np.zeros((V+1, D), dtype="float32")
    W[0] = np.zeros(D, dtype='float32')
    i = 1
    for word in word_vecs:
        W[i] = word_vecs[word]
        vocab2index[word] = i
        vocab.append(word)
        i += 1
    return W, np.array(vocab), vocab2index

In [87]:
# adds a few extra embeddings
add_unknown_words(word_vecs, vocab, min_df=10, D=300)

In [88]:
print(len(word_vecs.keys()))

400006


In [89]:
pretrained_weight, vocab, vocab2index = create_embedding_matrix(word_vecs)

In [90]:
len(pretrained_weight) # note that index 0 is for padding

400007

In [91]:
D = 300
V = len(pretrained_weight)
emb = nn.Embedding(V, D)
emb.weight.data.copy_(torch.from_numpy(pretrained_weight))


 0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
 0.0466  0.2132 -0.0074  ...   0.0091 -0.2099  0.0539
-0.2554 -0.2572  0.1317  ...  -0.2329 -0.1223  0.3550
          ...             ⋱             ...          
 0.1991 -0.2367  0.1619  ...   0.1165  0.0549  0.2112
-0.1448 -0.2493 -0.1492  ...   0.1099  0.0447  0.0731
 0.2005 -0.0730 -0.0157  ...  -0.0139  0.1225 -0.2240
[torch.FloatTensor of size 400007x300]
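
The cell above copies the weights but does not freeze them. As a hedged sketch on toy data, and assuming a PyTorch version that provides `nn.Embedding.from_pretrained` (newer than the one that produced the outputs in this notebook), loading and freezing can be done in one call. The sizes `V = 6`, `D = 4` and the random matrix are placeholders standing in for the real `pretrained_weight`.

```python
import numpy as np
import torch
import torch.nn as nn

V, D = 6, 4  # toy vocabulary and embedding sizes (placeholders)
rng = np.random.default_rng(0)
pretrained_weight = rng.uniform(-0.25, 0.25, size=(V, D)).astype("float32")
pretrained_weight[0] = 0.0  # row 0 is reserved for padding

# freeze=True sets requires_grad=False on the embedding weights
emb = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_weight), freeze=True, padding_idx=0
)

# No parameter is trainable while the layer is frozen
trainable = [p for p in emb.parameters() if p.requires_grad]
print(tuple(emb.weight.shape), len(trainable))

# Unfreeze later to fine-tune the embeddings
emb.weight.requires_grad = True
```

This is equivalent in effect to `copy_` followed by `embed.weight.requires_grad = False`. Which form to use is a matter of taste and PyTorch version.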

Question: How many parameters do we have in this embedding matrix?

## Encoding training and validation sets

We will be using 1D Convolutional neural networks as our model. CNNs assume a fixed input size so we need to assume a fixed size and truncate or pad the sentences as needed. Let's find a good value to set our sequence length to.

In [92]:
x_len = np.array([len(x.split()) for x in X_train])

In [96]:
np.percentile(x_len, 95) # let's set the max sequence length to N=40, close to the 95th percentile

42.0

In [97]:
X_train[0]

'will god let her fall or give her a new path \\?'

In [107]:
# returns the index of the word or the index of "UNK" otherwise
vocab2index.get("will", vocab2index["UNK"])

44

In [109]:
np.array([vocab2index.get(w, vocab2index["UNK"]) for w in X_train[0].split()])

array([    44,   1534,    887,     72,    808,     47,    456,     72,
            8,     51,   2819, 400001])

In [114]:
def encode_sentence(s, N=40):
    enc = np.zeros(N, dtype=np.int32)
    enc1 = np.array([vocab2index.get(w, vocab2index["UNK"]) for w in s.split()])
    l = min(N, len(enc1))
    enc[:l] = enc1[:l]
    return enc

In [115]:
encode_sentence(X_train[0])

array([    44,   1534,    887,     72,    808,     47,    456,     72,
            8,     51,   2819, 400001,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0],
      dtype=int32)
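
The example above only shows padding. The sketch below, on a made-up vocabulary, shows both padding and truncation with the same logic as `encode_sentence`. The names `toy_vocab2index` and `encode` are assumptions for illustration, and the vocabulary is passed explicitly instead of being read from a global.

```python
import numpy as np

# Toy vocabulary; index 0 is reserved for padding, as in the notebook
toy_vocab2index = {"UNK": 1, "a": 2, "new": 3, "path": 4}

def encode(sentence, vocab2index, N):
    """Maps words to indices, then pads with 0 or truncates to length N."""
    enc = np.zeros(N, dtype=np.int64)
    ids = np.array(
        [vocab2index.get(w, vocab2index["UNK"]) for w in sentence.split()],
        dtype=np.int64,
    )
    length = min(N, len(ids))
    enc[:length] = ids[:length]
    return enc

short = encode("a new path", toy_vocab2index, N=5)               # padded
long = encode("a new path to a new home", toy_vocab2index, N=5)  # truncated
print(short, long)
```

Words outside the vocabulary ("to", "home") map to the `UNK` index, and anything beyond the first `N` tokens is dropped.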

In [125]:
x_train = np.vstack([encode_sentence(x) for x in X_train])
x_train.shape

(8000, 40)

In [189]:
x_test = np.vstack([encode_sentence(x) for x in X_test])
x_test.shape

(2000, 40)

## Playing and debugging CNN layers

In [127]:
V = len(pretrained_weight)
D = 300
N = 40

In [128]:
emb = nn.Embedding(V, D)
emb.weight.data.copy_(torch.from_numpy(pretrained_weight))


 0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
 0.0466  0.2132 -0.0074  ...   0.0091 -0.2099  0.0539
-0.2554 -0.2572  0.1317  ...  -0.2329 -0.1223  0.3550
          ...             ⋱             ...          
 0.1991 -0.2367  0.1619  ...   0.1165  0.0549  0.2112
-0.1448 -0.2493 -0.1492  ...   0.1099  0.0447  0.0731
 0.2005 -0.0730 -0.0157  ...  -0.0139  0.1225 -0.2240
[torch.FloatTensor of size 400007x300]

In [137]:
x = x_train[:2]
x.shape

(2, 40)

In [138]:
x = torch.LongTensor(x)
x



Columns 0 to 5 
 4.4000e+01  1.5340e+03  8.8700e+02  7.2000e+01  8.0800e+02  4.7000e+01
 1.0000e+00  3.7000e+02  1.0000e+01  8.9539e+04  8.9665e+04  1.1360e+03

Columns 6 to 11 
 4.5600e+02  7.2000e+01  8.0000e+00  5.1000e+01  2.8190e+03  4.0000e+05
 6.0000e+00  6.8677e+04  5.2520e+03  8.6760e+03  2.2750e+03  1.4506e+05

Columns 12 to 17 
 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
 1.3520e+03  6.5315e+04  2.7433e+04  6.0000e+00  1.9001e+04  3.6307e+04

Columns 18 to 23 
 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
 4.0000e+05  9.7000e+01  3.7230e+03  3.2368e+04  1.5080e+03  1.2530e+03

Columns 24 to 29 
 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
 4.0000e+05  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00

Columns 30 to 35 
 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00

Columns 36 to 39 
 0.00

In [139]:
x1 = emb(Variable(x))

In [142]:
x1.size()

torch.Size([2, 40, 300])

In [143]:
x1 = x1.transpose(1,2)  # needs to convert x to (batch, embedding_dim, sentence_len)
x1.size()

torch.Size([2, 300, 40])

In [150]:
conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)

In [158]:
x3 = conv_3(x1)

In [159]:
x3.size()

torch.Size([2, 100, 38])

In [153]:
conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)

In [162]:
x4 = conv_4(x1)
x5 = conv_5(x1)
print(x4.size(), x5.size())

torch.Size([2, 100, 37]) torch.Size([2, 100, 36])
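
The output lengths 38, 37 and 36 follow from the sequence length and the kernel size. A small sketch of the arithmetic, using the output-length formula given in the PyTorch `Conv1d` documentation (the helper name `conv1d_output_length` is made up for this example):

```python
def conv1d_output_length(n, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution over a sequence of length n."""
    return (n + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

N = 40  # sequence length used in this notebook
lengths = {k: conv1d_output_length(N, k) for k in (3, 4, 5)}
print(lengths)  # {3: 38, 4: 37, 5: 36}
```

With the default stride, padding and dilation this reduces to `N - kernel_size + 1`.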


Note that the convolutions are all applied to the same `x1`. How do we now combine the results of the convolutions?

In [160]:
# 100 3-gram detectors
x3 = nn.ReLU()(x3)
x3 = nn.MaxPool1d(kernel_size = 38)(x3)
x3.size()

torch.Size([2, 100, 1])

In [163]:
# 100 4-gram detectors
x4 = nn.ReLU()(x4)
x4 = nn.MaxPool1d(kernel_size = 37)(x4)
x4.size()

torch.Size([2, 100, 1])

In [164]:
# 100 5-gram detectors
x5 = nn.ReLU()(x5)
x5 = nn.MaxPool1d(kernel_size = 36)(x5)
x5.size()

torch.Size([2, 100, 1])

In [165]:
# concatenate x3, x4, x5
out = torch.cat([x3, x4, x5], 2)
out.size()

torch.Size([2, 100, 3])

In [166]:
out = out.view(out.size(0), -1)
out.size()

torch.Size([2, 300])
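
The pooling cells above hard-code kernel sizes 38, 37 and 36, which are the full lengths of the feature maps when `N = 40`. The same idea can be written as a maximum over the whole length axis, which does not depend on the sentence length. The sketch below uses NumPy arrays with random values as stand-ins for the ReLU outputs, so it only illustrates the shapes, not the notebook's actual tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, channels = 2, 100

# Stand-ins for the three feature maps of lengths 38, 37 and 36
feature_maps = [rng.random((batch, channels, length)) for length in (38, 37, 36)]

# Max over the length axis: one value per filter, whatever the length
pooled = [fm.max(axis=2) for fm in feature_maps]  # each has shape (2, 100)

# Concatenate the three groups of 100 features
features = np.concatenate(pooled, axis=1)
print(features.shape)
```

Concatenating along axis 1 orders the 300 features by kernel size, whereas `torch.cat(..., 2)` followed by `view` interleaves them. Both give the same 300 values per sentence.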

After this we have a fully connected network. Let's write a network that implements this.

## 1D CNN model for sentence classification

Notation:
* V -- vocabulary size
* D -- embedding size
* N -- maximum sentence length

In [172]:
class SentenceCNN(nn.Module):
    
    def __init__(self, V, D, glove_weights):
        super(SentenceCNN, self).__init__()
        self.glove_weights = glove_weights
        self.embedding = nn.Embedding(V, D, padding_idx=0)
        self.embedding.weight.data.copy_(torch.from_numpy(self.glove_weights))
        self.embedding.weight.requires_grad = False ## freeze embeddings

        self.conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)
        self.conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
        self.conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)
        
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(300, 1)
        
    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1,2)
        x3 = F.relu(self.conv_3(x))
        x4 = F.relu(self.conv_4(x))
        x5 = F.relu(self.conv_5(x))
        x3 = nn.MaxPool1d(kernel_size = 38)(x3)
        x4 = nn.MaxPool1d(kernel_size = 37)(x4)
        x5 = nn.MaxPool1d(kernel_size = 36)(x5)
        out = torch.cat([x3, x4, x5], 2)
        out = out.view(out.size(0), -1)
        out = self.dropout(out)
        return self.fc(out)   

In [173]:
V = len(pretrained_weight)
D = 300
N = 40
model = SentenceCNN(V, D, glove_weights=pretrained_weight)

In [174]:
# testing the model
x = x_train[:10]
print(x.shape)
x = Variable(torch.LongTensor(x))

(10, 40)


In [175]:
y_hat = model(x)
y_hat.size()

torch.Size([10, 1])

## Training

Note that I am not bothering with mini-batches since our dataset is small.

In [337]:
#def train(model, x_train, y_train):
model = SentenceCNN(V, D, glove_weights=pretrained_weight).cuda()

In [338]:
test_metrics(model)

test loss 0.698 and accuracy 0.490


In [339]:
# this filters parameters with p.requires_grad=True
parameters = filter(lambda p: p.requires_grad, model.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.01)

In [340]:
def train_epocs(model, epochs=10, lr=0.01):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    model.train()
    for i in range(epochs):
        x = Variable(torch.LongTensor(x_train)).cuda()
        y = Variable(torch.Tensor(y_train)).cuda().unsqueeze(1)
        y_hat = model(x)
        loss = F.binary_cross_entropy_with_logits(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.data[0])
    test_metrics(model)

In [341]:
def test_metrics(m):
    m.eval()  # evaluation mode (disables dropout) on the model passed in
    x = Variable(torch.LongTensor(x_test)).cuda()
    y = Variable(torch.Tensor(y_test)).cuda().unsqueeze(1)
    y_hat = m(x)
    loss = F.binary_cross_entropy_with_logits(y_hat, y)
    y_pred = y_hat > 0  # a logit above 0 means a predicted probability above 0.5
    correct = (y_pred.float() == y).float().sum()
    accuracy = correct/y_pred.shape[0]
    print("test loss %.3f and accuracy %.3f" % (loss.data[0], accuracy.data[0]))
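
The model outputs logits, not probabilities, which is why predictions are obtained by thresholding at 0. A short NumPy sketch with made-up logits and labels shows that `logit > 0` and `sigmoid(logit) > 0.5` select the same predictions:

```python
import numpy as np

# Made-up logits and labels, for illustration only
logits = np.array([-2.0, -0.1, 0.3, 4.0])
labels = np.array([0.0, 1.0, 1.0, 1.0])

probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid

pred_from_logits = logits > 0
pred_from_probs = probs > 0.5

accuracy = float((pred_from_logits == (labels == 1)).mean())
print(pred_from_logits, accuracy)
```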

In [342]:
train_epocs(model, epochs=10, lr=0.01)

0.7035446763038635
1.1868393421173096
0.9086019396781921
0.5865945816040039
0.3559541702270508
0.3733510375022888
0.4246615469455719
0.4422186017036438
0.4157627522945404
0.3762800097465515
test loss 0.360 and accuracy 0.860


In [343]:
train_epocs(model, epochs=10, lr=0.01)

0.34370580315589905
0.5294861793518066
0.32328617572784424
0.40581214427948
0.2806266248226166
0.2568565309047699
0.30520763993263245
0.29174017906188965
0.24017520248889923
0.22176489233970642
test loss 0.285 and accuracy 0.892


In [344]:
train_epocs(model, epochs=10, lr=0.001)

0.23691332340240479
0.2138260155916214
0.20660893619060516
0.20420822501182556
0.20270921289920807
0.20497538149356842
0.19893155992031097
0.1915552318096161
0.19009748101234436
0.18447645008563995
test loss 0.246 and accuracy 0.898


In [345]:
# how to figure out the parameters
parameters = filter(lambda p: p.requires_grad, model.parameters())
print([p.size() for p in parameters])

[torch.Size([100, 300, 3]), torch.Size([100]), torch.Size([100, 300, 4]), torch.Size([100]), torch.Size([100, 300, 5]), torch.Size([100]), torch.Size([1, 300]), torch.Size([1])]


In [346]:
# unfreezing the embeddings
model.embedding.weight.requires_grad = True

In [347]:
parameters = filter(lambda p: p.requires_grad, model.parameters())
print([p.size() for p in parameters])

[torch.Size([400007, 300]), torch.Size([100, 300, 3]), torch.Size([100]), torch.Size([100, 300, 4]), torch.Size([100]), torch.Size([100, 300, 5]), torch.Size([100]), torch.Size([1, 300]), torch.Size([1])]
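
The two printed lists make it possible to count how many values the optimizer updates before and after unfreezing. The sketch below hard-codes the shapes copied from the outputs above:

```python
from math import prod

# Shapes of the trainable parameters while the embeddings are frozen
trainable_shapes = [
    (100, 300, 3), (100,),  # conv_3 weights and biases
    (100, 300, 4), (100,),  # conv_4 weights and biases
    (100, 300, 5), (100,),  # conv_5 weights and biases
    (1, 300), (1,),         # fully connected layer
]
embedding_shape = (400007, 300)

frozen_total = sum(prod(shape) for shape in trainable_shapes)
unfrozen_total = frozen_total + prod(embedding_shape)
print(frozen_total, unfrozen_total)  # 360601 120362701
```

Unfreezing the embeddings therefore increases the number of trainable values by a factor of several hundred, which is one reason to use a smaller learning rate for that phase.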


In [348]:
train_epocs(model, epochs=10, lr=0.001)

0.1830516755580902
0.17343704402446747
0.16393160820007324
0.15581797063350677
0.14433661103248596
0.1368999481201172
0.1289738416671753
0.12384654581546783
0.11491061747074127
0.1075991541147232
test loss 0.228 and accuracy 0.913


In [349]:
train_epocs(model, epochs=10, lr=0.0001)

0.10160469263792038
0.10249238461256027
0.10043996572494507
0.10032773017883301
0.09892096370458603
0.09907413274049759
0.09923809766769409
0.09644830971956253
0.09582842141389847
0.0971798449754715
test loss 0.227 and accuracy 0.912


## TODOs

* Show how to save model
* Show how to predict on new data
* Test a version with a smaller word embedding matrix
* Try another tokenizer

## Lab 

* You may not need to keep all word embeddings.
* Extend this code by fine-tuning the embedding layer.
* Use fastText instead of the GloVe model. (https://fasttext.cc/docs/en/english-vectors.html)

   `! pip install git+https://github.com/facebookresearch/fastText.git`
* Extend this code to do cross-validation. Look at https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py for an example on how to do it.

## References

* The CNN is adapted from https://github.com/junwang4/CNN-sentence-classification-pytorch-2017/blob/master/cnn_pytorch.py
* Code for the original paper can be found at https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py