# Sentiment Analysis

## Text classification with CNN

Classification tasks can be binary, multi-class, or multi-label.
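As a small illustration of how the three task types differ, the targets could be encoded as follows. The class names and tags here are hypothetical and are not part of this notebook's dataset.

```python
# Hypothetical targets, used only to illustrate the three task types.
binary_label = 1                # one of two classes: 0 = negative, 1 = positive
multi_class_label = 2           # exactly one of several classes: 0 = negative, 1 = neutral, 2 = positive
multi_label_target = [1, 0, 1]  # any subset of tags, one 0/1 flag per tag

tags = ["plot", "characters", "style"]  # hypothetical tag names
active_tags = [tag for tag, flag in zip(tags, multi_label_target) if flag == 1]
print(active_tags)  # ['plot', 'style']
```

The sentiment example in this notebook is the binary case: each sentence is labelled 0 or 1.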

## Word embedding

Word embeddings represent words as numerical vectors that preserve semantic meaning and the relationships between words.

In [2]:
import torch

In [69]:
# sentence, label pairs
text_samples = [
    ("I would recommend this book.".split(),1),
    ("The story was interesting.".split(),1),
    ("The plot is not written well".split(),0),
    ("I like the characters".split(),1),
]

In [70]:
words = {word for sentence, _ in text_samples for word in sentence}

In [71]:
words

{'I',
 'The',
 'book.',
 'characters',
 'interesting.',
 'is',
 'like',
 'not',
 'plot',
 'recommend',
 'story',
 'the',
 'this',
 'was',
 'well',
 'would',
 'written'}

In [72]:
word_to_index = {word: idx for idx, word in enumerate(words)}

In [73]:
word_to_index

{'like': 0,
 'The': 1,
 'was': 2,
 'interesting.': 3,
 'written': 4,
 'characters': 5,
 'I': 6,
 'not': 7,
 'is': 8,
 'would': 9,
 'plot': 10,
 'book.': 11,
 'the': 12,
 'story': 13,
 'this': 14,
 'well': 15,
 'recommend': 16}

In [74]:
preprocessed_words = [word_to_index[i] for i in words]

In [75]:
preprocessed_words

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

In [76]:
inputs = torch.LongTensor(preprocessed_words)
embedding = torch.nn.Embedding(num_embeddings=len(words), embedding_dim=5)

In [77]:
output = embedding(inputs)

In [78]:
output

tensor([[ 0.4200, -1.4579, -0.9933, -0.0159, -0.3277],
        [-0.3111, -0.1781,  0.4281,  0.9825,  0.5980],
        [ 0.0804, -0.4480, -0.9898,  0.4201, -1.2605],
        [-1.2659, -0.6779,  0.5481,  0.7124, -2.3134],
        [ 0.1346,  0.6053,  1.9627,  2.1202, -0.4237],
        [-0.1819, -1.3125, -0.4762, -0.4572,  0.6339],
        [-0.9884, -0.3030, -0.8270, -0.4089, -0.2556],
        [ 0.5830, -0.1364,  0.3091, -1.2305,  0.1511],
        [ 0.6196,  0.9154, -1.5264,  2.0294, -0.6247],
        [ 0.7854, -2.4355,  0.4791,  0.3470,  0.6774],
        [ 1.3339,  0.2000, -1.2083,  2.1884, -0.2339],
        [-0.9382, -0.7047, -0.4650,  0.8876, -0.8559],
        [ 0.9153, -0.3533,  0.1098,  1.6644, -0.2976],
        [-0.5616, -1.2606, -0.3202,  0.5183,  1.0450],
        [-1.1563, -1.1008, -0.4844, -0.8209,  0.1986],
        [ 0.3050, -0.0158,  0.9197,  0.0267, -0.8344],
        [ 1.0470, -0.9032,  0.6473,  1.0519, -0.7370]],
       grad_fn=<EmbeddingBackward0>)

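The output contains one embedding vector per input word, so here it has shape (17, 5). The vectors are randomly initialized and are adjusted during training.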
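One way to see what the embedding layer does is to treat it as a lookup table: each index selects a row of the layer's weight matrix. The sketch below uses a hypothetical vocabulary of 4 words and 3-dimensional vectors, chosen only to keep the example small.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# Hypothetical sizes: 4 words in the vocabulary, 3 numbers per word vector.
embedding = torch.nn.Embedding(num_embeddings=4, embedding_dim=3)

indices = torch.tensor([2, 0, 2])
vectors = embedding(indices)

print(vectors.shape)  # torch.Size([3, 3])
# Each output row is the matching row of the weight matrix.
print(torch.equal(vectors[0], embedding.weight[2]))  # True
# The same index always returns the same vector.
print(torch.equal(vectors[0], vectors[2]))  # True
```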

## CNN - convolutional layer

A convolutional layer detects local patterns in its input.

The convolution operation slides a filter (kernel) over the input and, at each position, multiplies the filter with the covered input values element-wise and sums the results.

The filter (kernel) is a small matrix of learnable weights.

The stride is the number of positions the filter moves at each step.
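The sliding-window computation can be sketched by hand and compared with PyTorch's built-in function. The signal and kernel values below are made up for illustration, and the example uses a single channel to keep the arithmetic easy to follow.

```python
import torch

signal = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative input
kernel = torch.tensor([1.0, 0.0, -1.0])           # illustrative filter
stride = 1

# Slide the kernel over the signal by hand.
manual = []
for start in range(0, len(signal) - len(kernel) + 1, stride):
    window = signal[start:start + len(kernel)]
    manual.append((window * kernel).sum())  # element-wise product, then sum
manual = torch.stack(manual)

# Same computation with PyTorch; conv1d expects (batch, channels, length).
builtin = torch.nn.functional.conv1d(
    signal.view(1, 1, -1), kernel.view(1, 1, -1), stride=stride
).flatten()

print(manual.tolist())   # [-2.0, -2.0, -2.0]
print(builtin.tolist())  # [-2.0, -2.0, -2.0]
```

With 5 input values, a kernel of size 3, a stride of 1, and no padding, the filter fits in 3 positions, so the output has 3 values.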

<img src="./img/cnn_conv.png" alt="cnn_conv" style="width: 600px;"/>

## CNN - pooling layer

A pooling layer reduces the size of the data while preserving the most important information.
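A minimal sketch of pooling on a one-dimensional feature map, assuming a window of size 2 (the values are made up for illustration). Max pooling keeps the largest value in each window, and average pooling keeps the mean.

```python
import torch

# Illustrative feature map with shape (batch, channels, length) = (1, 1, 6).
features = torch.tensor([[[1.0, 3.0, 2.0, 8.0, 5.0, 4.0]]])

# With kernel_size=2 the stride defaults to 2, so the length is halved.
max_pooled = torch.nn.functional.max_pool1d(features, kernel_size=2)
avg_pooled = torch.nn.functional.avg_pool1d(features, kernel_size=2)

print(max_pooled.tolist())  # [[[3.0, 8.0, 5.0]]]
print(avg_pooled.tolist())  # [[[2.0, 5.0, 4.5]]]
```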

<img src="./img/cnn_pooling.png" alt="cnn_pooling.png" style="width: 600px;"/>

## CNN - fully connected layer

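The fully connected layer is the last layer. It maps the extracted features to one score per class, and these scores are used to make the prediction.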
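As a rough sketch, a fully connected layer for a two-class problem maps a feature vector to two raw scores (logits), and a softmax can turn them into probabilities. The feature size of 4 and the random input are assumptions made only for this example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# Hypothetical sizes: 4 input features, 2 output classes.
fc = torch.nn.Linear(in_features=4, out_features=2)

features = torch.randn(1, 4)          # one sample with 4 features
logits = fc(features)                 # raw class scores, shape (1, 2)
probs = torch.softmax(logits, dim=1)  # probabilities that sum to 1 per sample

print(logits.shape)                # torch.Size([1, 2])
print(probs.argmax(dim=1).item())  # index of the most likely class
```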

## Model

In [11]:
# Conv1d is preferred to Conv2d because text is one-dimensional (a sequence of tokens)
class Model(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.conv = torch.nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
        self.fc = torch.nn.Linear(embed_dim, 2)

    def forward(self, text):
        # look up embeddings: (batch_size, sequence_len, embed_dim)
        # permute to the layout Conv1d expects: (batch_size, embed_dim, sequence_len)
        embed = self.embedding(text).permute(0, 2, 1)
        # the convolution extracts local features; ReLU is the activation function
        conved = torch.nn.functional.relu(self.conv(embed))
        # average each feature channel across the sequence length
        conved = conved.mean(dim=2)
        return self.fc(conved)
        

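Averaging the convolutional features across the sequence dimension reduces each sentence to a single fixed-size vector with one average value per feature channel. The fully connected layer can then classify this vector regardless of the sentence length.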
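The tensor shapes at each step of the forward pass can be traced with a short sketch. It rebuilds the same layers outside the class and feeds them random token indices; the sentence length of 4 and the random input are assumptions made for illustration.

```python
import torch

batch_size, seq_len = 1, 4        # assumed: one sentence of 4 tokens
vocab_size, embed_dim = 17, 10    # sizes used in this notebook

embedding = torch.nn.Embedding(vocab_size, embed_dim)
conv = torch.nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embedding(tokens)                 # (1, 4, 10)
channels_first = embedded.permute(0, 2, 1)   # (1, 10, 4)
conved = torch.relu(conv(channels_first))    # (1, 10, 4): padding=1 keeps the length
pooled = conved.mean(dim=2)                  # (1, 10): one average per channel

for name, tensor in [("embedded", embedded), ("channels_first", channels_first),
                     ("conved", conved), ("pooled", pooled)]:
    print(name, tuple(tensor.shape))
```

Whatever the sentence length, the pooled tensor has shape (batch_size, embed_dim), which matches the input size of the final linear layer.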

In [88]:
# define embedding shape
vocab_size = len(word_to_index)
embed_dim = 10

In [89]:
model = Model(vocab_size, embed_dim)
loss_func = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

In [90]:
# training loop
for epoch in range(10):
    for sentence, label in text_samples:
        model.zero_grad()
        sentence = torch.LongTensor([word_to_index[word] for word in sentence]).unsqueeze(0)
        outputs = model(sentence)
        label = torch.LongTensor([int(label)])
        loss = loss_func(outputs, label)
        loss.backward()
        optimizer.step()

We call unsqueeze(0) to add an extra dimension at the start of the tensor. This creates a batch that contains a single sequence, which matches the input shape the model expects.
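The effect on the tensor shape can be checked directly. The indices below are illustrative stand-ins for one encoded sentence.

```python
import torch

indices = torch.LongTensor([6, 0, 14, 13])  # illustrative word indices for one sentence
batched = indices.unsqueeze(0)              # add a batch dimension at position 0

print(indices.shape)  # torch.Size([4])
print(batched.shape)  # torch.Size([1, 4])
```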

In [82]:
test_samples = [
    ("I like this story".split()),
]

In [91]:
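# test (every word must already be in word_to_index)
model.eval()
with torch.no_grad():  # gradients are not needed for prediction
    for sentence in test_samples:
        sentence = torch.LongTensor([word_to_index[word] for word in sentence]).unsqueeze(0)
        outputs = model(sentence)
        # pick the class with the highest score
        predicted_label = torch.argmax(outputs, dim=1)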

In [93]:
predicted_label.item()

1