# Subjectivity classification with CNNs

In this notebook we implement the approach described in this [paper](https://arxiv.org/pdf/1408.5882.pdf) for classifying sentences using Convolutional Neural Networks. In particular, we will classify sentences as "subjective" or "objective".

## Subjectivity Dataset

The subjectivity dataset has 5000 subjective and 5000 objective processed sentences. To get the data:
```
wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
```

In [10]:
from pathlib import Path
PATH = Path("/data2/yinterian/rotten_imdb/")
list(PATH.iterdir())

[PosixPath('/data2/yinterian/rotten_imdb/plot.tok.gt9.5000'),
 PosixPath('/data2/yinterian/rotten_imdb/subjdata.README.1.0'),
 PosixPath('/data2/yinterian/rotten_imdb/quote.tok.gt9.5000')]

From the readme file:
- quote.tok.gt9.5000 contains 5000 subjective sentences (or snippets)
- plot.tok.gt9.5000 contains 5000 objective sentences

In [11]:
! head /data2/yinterian/rotten_imdb/plot.tok.gt9.5000

the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . 
emerging from the human psyche and showing characteristics of abstract expressionism , minimalism and russian constructivism , graffiti removal has secured its place in the history of modern art while being created by artists who are unconscious of their artistic achievements . 
spurning her mother's insistence that she get on with her life , mary is thrown out of the house , rejected by joe , and expelled from school as she grows larger with child . 
amitabh can't believe the board of directors and his mind is filled with revenge and what better revenge than robbing the bank himself , ironic as it may sound . 
she , among others excentricities , talks to a small rock , gertrude , like if she was alive . 
this gives the girls a fair chance of pulling the wool over their eyes using their sexiness to poach any last vestige of common sense the dons might have had . 
styled after vh1's "

## Split train and test

In [46]:
import numpy as np

In [53]:
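def read_file(path):
    """ Read a text file and return its lines as a numpy array of strings.
    """
    # use the `path` argument (the original referenced an undefined `path_sub`)
    with open(path, encoding = "ISO-8859-1") as f:
        content = np.array(f.readlines())
    return content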

In [60]:
sub_content = read_file(PATH/"quote.tok.gt9.5000")
obj_content = read_file(PATH/"plot.tok.gt9.5000")
sub_y = np.zeros(len(sub_content))
obj_y = np.ones(len(obj_content))
X = np.append(sub_content, obj_content)
y = np.append(sub_y, obj_y)

In [58]:
from sklearn.model_selection import train_test_split

In [61]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [65]:
X_train[:5], y_train[:5]

(array(['the first hour alone is worth the admission price . \n',
        "the director's twitchy sketchbook style and adroit perspective shifts grow wearisome amid leaden pacing and indifferent craftsmanship ( most notably wretched sound design ) . \n",
        "welles groupie/scholar peter bogdanovich took a long time to do it , but he's finally provided his own broadside at publishing giant william randolph hearst . \n",
        'a coming-of-age film that avoids the cartoonish clichés and sneering humor of the genre as it provides a fresh view of an old type -- the uncertain girl on the brink of womanhood . \n',
        'there is something in full frontal , i guess , about artifice and acting and how it distorts reality for people who make movies and watch them , but like most movie riddles , it works only if you have an interest in the characters you see . \n'],
       dtype='<U264'), array([1., 0., 0., 1., 1.]))

## Embedding Layer

In [77]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

In [80]:
# an Embedding module containing 10 (words) tensors of size 3
embed = nn.Embedding(10, 3)
a = Variable(torch.LongTensor([[1,2,4,5,1]]))
embed(a)

Variable containing:
(0 ,.,.) = 
 -0.4092 -1.7332 -1.1100
 -1.3909 -0.7689  1.7758
 -0.6456 -1.4082  1.4182
  0.0972  1.0144  1.3114
 -0.4092 -1.7332 -1.1100
[torch.FloatTensor of size 1x5x3]

In [82]:
## here are the randomly initialized embeddings
embed.weight.data


-0.2949 -0.9054  0.5869
-0.4092 -1.7332 -1.1100
-1.3909 -0.7689  1.7758
-0.7651  1.3772 -1.3829
-0.6456 -1.4082  1.4182
 0.0972  1.0144  1.3114
 0.6111  0.9083  0.7442
-1.8281  1.3925 -0.3052
 0.4369  1.1894  0.1140
 0.0385 -0.8008 -0.6054
[torch.FloatTensor of size 10x3]
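
Comparing the two outputs above suggests that an embedding layer is a lookup table: word id `i` selects row `i` of the weight matrix. The following is a minimal sketch of that check. It assumes a recent PyTorch version, where plain tensors replace `Variable`; the names `ids`, `out` and `rows` are introduced here for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(10, 3)             # 10 rows (one per word id), 3 columns
ids = torch.tensor([[1, 2, 4, 5, 1]])   # a batch with one sentence of five word ids

out = embed(ids)                        # shape (1, 5, 3)
rows = embed.weight[ids]                # plain row indexing into the weight matrix

print(out.shape)
print(torch.equal(out, rows))           # the lookup and the indexing agree
print(torch.equal(out[0, 0], out[0, 4]))  # id 1 appears twice, so both rows match
```

Because the same id always maps to the same row, repeated words in a sentence share one vector.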

### Initializing embedding layer with GloVe embeddings

To get the pre-trained GloVe embeddings:
    `wget http://nlp.stanford.edu/data/glove.6B.zip`

In [87]:
! head -2 /data2/yinterian/rotten_imdb/glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392


We would like to initialize the embeddings in our model with the pre-trained GloVe embeddings. After initializing, we should "freeze" the embeddings, at least initially. The rationale is that we first want the network to learn weights for the other parameters, which were randomly initialized. After that phase we can fine-tune the embeddings for our task.

`embed.weight.requires_grad = False` freezes the embedding parameters.
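
As a rough illustration of what freezing does, the sketch below trains one step on a toy model and checks that the embedding table is untouched. It assumes a recent PyTorch version; the `linear` layer, the Adam optimizer and the learning rate are stand-ins chosen for the example, not part of the notebook's model.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(10, 3)
embed.weight.requires_grad = False       # freeze the embedding table
linear = nn.Linear(3, 1)                 # stand-in for the rest of the network

# Hand only the trainable parameters to the optimizer.
params = list(embed.parameters()) + list(linear.parameters())
trainable = [p for p in params if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.01)

before = embed.weight.clone()
loss = linear(embed(torch.tensor([1, 2, 4]))).sum()
loss.backward()
optimizer.step()

print(torch.equal(before, embed.weight))  # embeddings are unchanged
print(embed.weight.grad is None)          # no gradient was computed for them
```

To fine-tune later, one option is to set `embed.weight.requires_grad = True` and build a new optimizer that includes the embedding parameters.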

The following code initializes the embedding. Here `V` is the vocabulary size and `D` is the embedding size. `pretrained_weight` is a numpy matrix of shape `(V, D)`.

In [105]:
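def loadGloveModel(gloveFile="/data2/yinterian/rotten_imdb/glove.6B.300d.txt"):
    """ Load GloVe vectors and return (weights, vocab, vocab2index).
    """
    D = 300     # embedding size of this GloVe file
    V = 400000  # number of words in glove.6B
    vocab2index = {}
    vocab = []
    weights = np.zeros((V, D))
    # the context manager closes the file; GloVe text files are UTF-8 encoded
    with open(gloveFile, 'r', encoding="utf-8") as f:
        for ind, line in enumerate(f):
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            weights[ind] = embedding
            vocab2index[word] = ind
            vocab.append(word)
    return weights, np.array(vocab), vocab2index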

In [110]:
pretrained_weight, vocab, vocab2index = loadGloveModel()
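
Before sentences can go through an embedding layer, each token has to be mapped to an integer id. The sketch below shows one possible way to do this with `vocab2index`. It uses a toy four-word vocabulary instead of the real GloVe one, and the helper `encode_sentence`, the fixed length `N=8`, and the choice of index `V` for unknown words and `V + 1` for padding are assumptions made for the example.

```python
import numpy as np

# Toy stand-in for the GloVe vocabulary; the real vocab2index has 400000 entries.
vocab2index = {"the": 0, "movie": 1, "begins": 2, ".": 3}
V = len(vocab2index)
UNK_IDX, PAD_IDX = V, V + 1   # assumed convention: two extra ids after the GloVe ids

def encode_sentence(sentence, vocab2index, N=8):
    """Hypothetical helper: map a tokenized sentence to exactly N word ids."""
    # unknown tokens get UNK_IDX; sentences longer than N are truncated
    ids = [vocab2index.get(tok, UNK_IDX) for tok in sentence.split()][:N]
    # shorter sentences are padded on the right with PAD_IDX
    return np.array(ids + [PAD_IDX] * (N - len(ids)))

print(encode_sentence("the movie begins in the past . \n", vocab2index))
# [0 1 2 4 0 4 3 5]
```

Here "in" and "past" are not in the toy vocabulary, so both map to the unknown id 4, and the last position is filled with the padding id 5.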

In [112]:
D = 300
V = 400000
emb = nn.Embedding(V, D)
emb.weight.data.copy_(torch.from_numpy(pretrained_weight))


 4.6560e-02  2.1318e-01 -7.4364e-03  ...   9.0611e-03 -2.0989e-01  5.3913e-02
-2.5539e-01 -2.5723e-01  1.3169e-01  ...  -2.3290e-01 -1.2226e-01  3.5499e-01
-1.2559e-01  1.3630e-02  1.0306e-01  ...  -3.4224e-01 -2.2394e-02  1.3684e-01
                ...                   ⋱                   ...                
 7.5713e-02 -4.0502e-02  1.8345e-01  ...   2.1838e-01  3.0967e-01  4.3761e-01
 8.1451e-01 -3.6221e-01  3.1186e-01  ...   7.5486e-02  2.8408e-01 -1.7559e-01
 4.2919e-01 -2.9690e-01  1.5011e-01  ...   2.8975e-01  3.2618e-01 -5.9053e-02
[torch.FloatTensor of size 400000x300]

Question: How many parameters do we have in this embedding matrix?
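
One way to check an answer: an embedding layer has a single weight matrix, so the count is the number of rows times the number of columns. The sketch below does the arithmetic for `V` and `D` and confirms the rule on a smaller layer (the 1000-word layer `small` is a stand-in used to avoid allocating the full matrix).

```python
import torch.nn as nn

V, D = 400000, 300
print(V * D)   # 120000000

# Confirm the rows-times-columns rule on a smaller embedding layer.
small = nn.Embedding(1000, 300)
n_params = sum(p.numel() for p in small.parameters())
print(n_params == 1000 * 300)
```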

## 1D CNN model for sentence classification

Notation:
- V -- vocabulary size
- D -- embedding size
- N -- maximum sentence length
- in -- in channel = 1
- out - 
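
To make the shapes concrete, here is a small sketch of how a batch of word ids could flow through an embedding layer and one `nn.Conv1d` layer. It assumes a recent PyTorch version and toy sizes for `V`, `D` and `N`. It follows the `Conv1d` setup used in the model code, where the embedding dimension `D` acts as the input channels. The ReLU and the max-over-time pooling are taken from the linked paper, not from code shown in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, N = 20, 8, 12          # toy vocabulary size, embedding size, sentence length
batch_size = 4

embedding = nn.Embedding(V, D)
conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)

x = torch.randint(0, V, (batch_size, N))   # word ids, shape (4, 12)
e = embedding(x)                            # (4, 12, 8): batch, length, D
e = e.transpose(1, 2)                       # (4, 8, 12): Conv1d expects channels before length
h = F.relu(conv_3(e))                       # (4, 100, 10): length shrinks to N - 3 + 1
pooled = h.max(dim=2)[0]                    # (4, 100): max over positions for each filter

print(e.shape, h.shape, pooled.shape)
```

With three kernel sizes and 100 filters each, concatenating the three pooled outputs would give a vector of size 300 per sentence.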

In [None]:
class SentenceCNN(nn.Module):
    
    def __init__(self, V, D, glove_weights):
        super(SentenceCNN, self).__init__()
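        # two extra rows: index V for UNK and index V + 1 for zero padding
        self.glove_weights = glove_weights
        self.embedding = nn.Embedding(V + 2, D, padding_idx=V + 1)
        # glove_weights has shape (V, D), so copy it into the first V rows only
        self.embedding.weight.data[:V].copy_(torch.from_numpy(self.glove_weights))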
        ## freeze embeddings
        self.embedding.weight.requires_grad = False

        self.conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)
        self.conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
        self.conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)
        
        self.dropout = nn.Dropout(p=0.5)
        # 3 kernel sizes x 100 filters each; the single output logit is an assumption
        self.fc1 = nn.Linear(3 * 100, 1)