$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$
# Data creation
<a id=part1></a>

We will start by setting up our datasets and data loaders, then splitting the data into train/validation/test sets.

In [1]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Using device: cpu


In [3]:
import torchtext
from torchtext import data
from torchtext import datasets

# create Field objects
# ID uses torch.long: with use_vocab=True the values are vocabulary indices,
# which exceed the int8 range once there are more than 127 distinct IDs
ID = data.Field(sequential=False, dtype=torch.long, use_vocab=True)
TARGET = data.Field(sequential=False, lower=True, dtype=torch.long, use_vocab=True)
TWEET = data.Field(sequential=True, use_vocab=True, lower=True,
    init_token='<sos>', eos_token='<eos>', dtype=torch.long)
STANCE = data.Field(is_target=True, sequential=False, unk_token=None, use_vocab=True)
SENTIMENT = data.Field(is_target=True, sequential=False, unk_token=None, use_vocab=True)

# create tuples representing the columns
fields = [
    ('ID', ID),
    ('TARGET', TARGET),
    ('TWEET', TWEET),
    ('STANCE', STANCE),
    (None, None), # ignore age column
    ('SENTIMENT', SENTIMENT)
]


# load the dataset in TSV (tab-separated) format
train_ds, valid_ds, test_ds = data.TabularDataset.splits(
    path = 'data-all-annotations',
    train = 'trainingdata-all-annotations.txt',
    validation = 'legalization_of_abortion_valid_set.txt',
    test = 'legalization_of_abortion_test_set.txt',
    format = 'tsv',
    fields = fields,
    skip_header = True
)

# check an example
print(vars(train_ds[1]))
print(vars(valid_ds[100]))
print(len(train_ds))

{'ID': '102', 'TARGET': 'atheism', 'TWEET': ['blessed', 'are', 'the', 'peacemakers,', 'for', 'they', 'shall', 'be', 'called', 'children', 'of', 'god.', 'matthew', '5:9', '#scripture', '#peace', '#semst'], 'STANCE': 'AGAINST', 'SENTIMENT': 'POSITIVE'}
{'ID': '2413', 'TARGET': 'legalization of abortion', 'TWEET': ['@showtruth', 'no,', 'i', "can't", 'explain', 'why', 'you', 'would', 'consider', 'a', 'medical', 'procedure', 'that', 'leaves', 'the', 'patient', 'healthy', '&', 'happy', 'as', 'killing.', '#semst'], 'STANCE': 'FAVOR', 'SENTIMENT': 'NEGATIVE'}
2814
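
To make the column mapping concrete, here is a minimal standard-library sketch of how a list of column names lines up with the tab-separated columns, with `None` marking a column to drop. It assumes whitespace tokenization and lower-casing of the tweet and target, which is what the printed examples above suggest. The sample row is made up and `parse_row` is a hypothetical helper, not part of torchtext.

```python
import csv
import io

# Column names in file order; None marks a column to skip
# (mirrors the (None, None) entry in `fields`).
COLUMNS = ["ID", "TARGET", "TWEET", "STANCE", None, "SENTIMENT"]

def parse_row(row, columns=COLUMNS):
    """Map one TSV row to a dict, lower-casing and whitespace-tokenizing the tweet."""
    example = {}
    for name, value in zip(columns, row):
        if name is None:
            continue  # ignored column
        if name == "TWEET":
            value = value.lower().split()
        elif name == "TARGET":
            value = value.lower()
        example[name] = value
    return example

# Made-up sample in the same layout: one header line plus one data row.
sample = (
    "ID\tTarget\tTweet\tStance\tExtra\tSentiment\n"
    "1\tAtheism\tHello World #SemST\tNONE\tx\tPOSITIVE\n"
)
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
next(reader)  # skip the header, like skip_header=True
examples = [parse_row(row) for row in reader]
print(examples[0])
```

The label columns are kept as raw strings here; turning them into integers is the job of the vocabulary step.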




### Data Pre-processing
<a id=part1_3></a>

First, we need to build a vocabulary for each of our fields.

In [4]:
ID.build_vocab(train_ds)
TARGET.build_vocab(train_ds)
TWEET.build_vocab(train_ds)
STANCE.build_vocab(train_ds)
SENTIMENT.build_vocab(train_ds)

print(f"Number of tokens in training samples: {len(TWEET.vocab)}")
print(f"Number of tokens in training stance labels: {len(STANCE.vocab)}")
print(f"Number of tokens in training sentiment labels: {len(SENTIMENT.vocab)}")

Number of tokens in training samples: 12528
Number of tokens in training stance labels: 3
Number of tokens in training sentiment labels: 3


In [5]:
print('first 20 tokens:\n', TWEET.vocab.itos[:20], end='\n\n')

first 20 tokens:
 ['<unk>', '<pad>', '<sos>', '<eos>', '#semst', 'the', 'to', 'a', 'of', 'is', 'and', 'you', 'i', 'in', 'for', 'be', 'that', 'are', 'on', 'not']
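
The listing above shows the special tokens first, followed by tokens in decreasing frequency. As a rough illustration of that idea (not torchtext's actual implementation), a vocabulary can be built with a `Counter`. The names `build_vocab` and `numericalize` and the toy corpus are assumptions made for this sketch.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_vocab(tokenized_texts, specials=SPECIALS):
    """Return (itos, stoi): special tokens first, then tokens by descending count."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, sending unseen tokens to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

toy = [["the", "cat", "#semst"], ["the", "dog", "#semst"], ["the", "end"]]
itos, stoi = build_vocab(toy)
print(itos[:6])
print(numericalize(["the", "bird"], stoi))
```

Building the vocabulary from the training split only, as the notebook does, means any token that appears only in validation or test data maps to `<unk>`.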



In [6]:
print('stance labels vocab:\n', dict(STANCE.vocab.stoi))
print('sentiment labels vocab:\n', dict(SENTIMENT.vocab.stoi))

stance labels vocab:
 {'AGAINST': 0, 'NONE': 1, 'FAVOR': 2}
sentiment labels vocab:
 {'NEGATIVE': 0, 'POSITIVE': 1, 'NEITHER': 2}


# Data Loaders

We want to create batches and iterate over the datasets; for that we need to define data loaders.


In [7]:
BATCH_SIZE = 10

dl_train, dl_valid, dl_test = torchtext.data.BucketIterator.splits(
    (train_ds, valid_ds, test_ds), batch_size=BATCH_SIZE,
    shuffle=True, sort=False)



This is what a single batch looks like:

In [8]:
batch = next(iter(dl_train))

X, y_stance, y_sentiment = batch.TWEET, batch.STANCE, batch.SENTIMENT
print('X = \n', X, X.shape, end='\n\n')
print('y = \n', y_stance, y_stance.shape)

X = 
 tensor([[    2,     2,     2,     2,     2,     2,     2,     2,     2,     2],
        [   42,  6136,  6651,  6586,  1025,     5,    46,   474,  1477,  9607],
        [   19,    44,  6490,    38,   211,   110,    11,   201,    73, 11624],
        [   38,    69,  2487,     7,    56, 11022,  2736,   471,   394,  5284],
        [ 4528,   171,   216,  1877,   644,   314,   747,    35,    59,     9],
        [   16,     6,    36,    16,     6,    34,     5, 10418,   117,  8203],
        [   56,   112,    19,    11,  3254,   291,   282,    27,   179, 11311],
        [  424,   481,    13,    17,    16,  1326,   600, 11218,   147,  4426],
        [  334, 10139,     7,  2438,  3660,   449,  5319,  5007,  1286,  3926],
        [   87,    35, 11882,    13,  1707,     9,   762,  3776,  2816,  3971],
        [   52,     5,  1080,   137,    18,    16,     4,  5042,  9817,   669],
        [  274, 10533,   787,   119,     5,   207,     3,     4,   230,     4],
        [ 7911, 12204,    92,    1
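
In the tensor above each column is one tweet and each row is one time step, which is why the first row is all `2` (the `<sos>` index). The following sketch shows how such a sequence-first batch could be assembled by hand. It assumes the special-token indices from the vocabulary listing (`<pad>`=1, `<sos>`=2, `<eos>`=3); `make_batch` is a hypothetical helper and the token ids are arbitrary.

```python
PAD, SOS, EOS = 1, 2, 3  # assumed indices of <pad>, <sos>, <eos>

def make_batch(sequences):
    """Wrap each sequence in <sos>/<eos>, pad to equal length, return S rows of B columns."""
    wrapped = [[SOS] + seq + [EOS] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in wrapped]
    # Transpose from (B, S) to (S, B): row t holds token t of every sequence.
    return [list(step) for step in zip(*padded)]

batch_sb = make_batch([[42, 19, 38], [6136, 44], [6651]])
for row in batch_sb:
    print(row)
```

Shorter sequences end with `<eos>` followed by padding, so the sequence length S of a batch is set by its longest tweet.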



# Model Implementation

The difficulty in our problem is that of stance detection: the sentiment is not generic but is expressed with respect to a specific topic.
As stated in our proposed solution, we will build an RNN-based deep neural network with a two-phase architecture.
We will first experiment with the LSTM, a more sophisticated variant of the RNN.

#### The two-phase architecture

The first phase (subjectivity classification):
1. Embedding layer
2. LSTM layer
3. Linear classification layer (positive / negative / neutral)

Testing the model:

In [9]:
from implementations.models import SubjectivityLSTM

EMB_DIM = 128
HID_DIM = 64
NUM_LAYERS = 2
DROP_OUT = 0.1
VOCAB_SIZE = len(TWEET.vocab)

sub = SubjectivityLSTM(VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROP_OUT)
out, h, ht = sub(X)
print(f'h (S, B, H): {h.shape}')
print(f'ht (L, B, H): {ht[0].shape}')
print(f'out (S, B, V_tgt): {out.shape}')

h (S, B, H): torch.Size([25, 10, 64])
ht (L, B, H): torch.Size([2, 10, 64])
out (S, B, V_tgt): torch.Size([25, 10, 3])
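
The source of `SubjectivityLSTM` is not shown here, so the following is only a minimal sketch of a module with the same three layers that reproduces the shapes printed above. The class name `SubjectivitySketch`, the `num_classes` argument, and the absence of a final log-softmax are assumptions; the real implementation in `implementations.models` may differ.

```python
import torch
import torch.nn as nn

class SubjectivitySketch(nn.Module):
    """Hypothetical stand-in: embedding -> LSTM -> linear classifier."""

    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # batch_first is left at its default (False), so inputs are (S, B, E).
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout)
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, x):
        emb = self.embedding(x)    # (S, B) -> (S, B, E)
        h, ht = self.lstm(emb)     # h: (S, B, H); ht = (h_n, c_n), each (L, B, H)
        out = self.classifier(h)   # (S, B, num_classes), raw class scores
        return out, h, ht

model = SubjectivitySketch(vocab_size=100, emb_dim=128, hid_dim=64, num_layers=2, dropout=0.1)
x = torch.randint(0, 100, (25, 10))  # dummy batch with S=25, B=10
out, h, ht = model(x)
print(out.shape, h.shape, ht[0].shape)
```

Returning both the per-step hidden states `h` and the final state `ht` lets a second module reuse them, which is how they are passed on in the next cell.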



The second phase (stance detection):
1. Attention layer
2. LSTM layer
3. Linear classification layer (favor / against / none)

In [10]:
from implementations.models import StanceLSTM

stance = StanceLSTM(VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROP_OUT)
yhat, _ = stance(X, ht, h)
print(f'yhat (S, B, V_tgt): {yhat.shape}')

yhat (S, B, V_tgt): torch.Size([25, 10, 3])
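
The notebook does not show which attention variant `StanceLSTM` uses. As one common possibility, the sketch below implements scaled dot-product attention over a sequence of hidden states such as the first-phase output `h`. The function name, the choice of query, and the scaling are all assumptions made for illustration.

```python
import math
import torch

def dot_product_attention(query, keys):
    """query: (B, H); keys: (S, B, H). Returns context (B, H) and weights (B, S)."""
    k = keys.transpose(0, 1)                                 # (B, S, H)
    scores = torch.bmm(k, query.unsqueeze(2)).squeeze(2)     # (B, S)
    scores = scores / math.sqrt(query.shape[-1])             # scale by sqrt(H)
    weights = torch.softmax(scores, dim=1)                   # each row sums to 1
    context = torch.bmm(weights.unsqueeze(1), k).squeeze(1)  # (B, H)
    return context, weights

S, B, H = 25, 10, 64
h = torch.randn(S, B, H)   # stand-in for first-phase hidden states
query = torch.randn(B, H)  # stand-in for a second-phase LSTM state
context, weights = dot_product_attention(query, h)
print(context.shape, weights.shape)
```

The context vector is a weighted average of the hidden states, so the second phase can focus on the time steps most relevant to its current state.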


Now for the model that combines both phases:

In [11]:
from implementations.models import TwoPhaseLSTM

two_phase_model = TwoPhaseLSTM(sub, stance)
y_stance, y_sub = two_phase_model(X)
print('y_stance: (S, B, V_tgt) =', tuple(y_stance.shape))
print('y_sub: (S, B, V_tgt) =', tuple(y_sub.shape))

y_stance: (S, B, V_tgt) = (25, 10, 3)
y_sub: (S, B, V_tgt) = (25, 10, 3)


# Training

Training follows the standard approach: a cross-entropy loss on the class scores, with accuracy measured as the fraction of correct predictions. Note that the code uses `nn.NLLLoss`, which equals cross-entropy only when the model outputs log-probabilities.
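
As a worked example of these two quantities, the sketch below computes the loss (negative log-likelihood of log-softmax scores) and the accuracy for a tiny made-up batch in plain Python. The function names and the numbers are illustrative and are not taken from `implementations.training`.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of one score vector."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def nll_loss(log_probs_batch, targets):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(lp[t] for lp, t in zip(log_probs_batch, targets)) / len(targets)

def accuracy(scores_batch, targets):
    """Fraction of samples whose highest-scoring class equals the target."""
    preds = [max(range(len(s)), key=s.__getitem__) for s in scores_batch]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.5, 0.0]]  # made-up class scores
targets = [0, 2, 0]
loss = nll_loss([log_softmax(s) for s in scores], targets)
acc = accuracy(scores, targets)
print(f"loss={loss:.4f}, accuracy={acc:.2f}")
```

Here two of the three predictions match their targets, and the third sample contributes most of the loss because its correct class does not have the highest score.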

In [13]:
from implementations.training import train_two_phase_rnn
from implementations.training import eval_two_phase_rnn
import torch.nn as nn

EPOCHS = 100
BATCHES_PER_EPOCH = 28
LR = 1e-3

optimizer = torch.optim.Adam(two_phase_model.parameters(), lr=LR)
sub_loss_fn = nn.NLLLoss()
stance_loss_fn = nn.NLLLoss()

losses = []
sub_accuracies = []
stance_accuracies = []
for epoch in range(EPOCHS):
    
    print(f'=== EPOCH {epoch+1}/{EPOCHS} ===')
    losses += train_two_phase_rnn(two_phase_model, dl_train, optimizer, sub_loss_fn, stance_loss_fn, BATCHES_PER_EPOCH)
    sub_acc, stance_acc = eval_two_phase_rnn(two_phase_model, dl_valid)
    sub_accuracies += sub_acc
    stance_accuracies += stance_acc


=== EPOCH 1/100 ===
train loss=0.8484514355659485,: 100%|██████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.04it/s]
 sentiment accuracy=0.6000000238418579, stance accuracy=0.699999988079071 : 100%|██████| 30/30 [00:00<00:00, 40.48it/s]
=== EPOCH 2/100 ===
train loss=0.8861850500106812,: 100%|██████████████████████████████████████████████████| 28/28 [00:03<00:00,  9.11it/s]
 sentiment accuracy=0.4000000059604645, stance accuracy=0.5 : 100%|████████████████████| 30/30 [00:00<00:00, 42.97it/s]
=== EPOCH 3/100 ===
train loss=1.063779592514038,: 100%|███████████████████████████████████████████████████| 28/28 [00:03<00:00,  8.64it/s]
 sentiment accuracy=0.6000000238418579, stance accuracy=0.5 : 100%|████████████████████| 30/30 [00:00<00:00, 42.49it/s]
=== EPOCH 4/100 ===
train loss=0.9704071879386902,: 100%|██████████████████████████████████████████████████| 28/28 [00:03<00:00,  8.64it/s]
 sentiment accuracy=0.10000000149011612, stance accuracy=0.30000001192092896 : 1