$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$
# Data creation
<a id=part1></a>

We will start by setting up our datasets and data loaders, splitting the data into train/validation/test sets.

In [1]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Using device: cpu


In [3]:
import torchtext
from torchtext import data
from torchtext import datasets

# create Field objects
ID = data.Field(sequential=False, dtype=torch.int8, use_vocab=True)
TARGET = data.Field(sequential=False, lower=True, dtype=torch.long, use_vocab=True)
TWEET = data.Field(sequential=True, use_vocab=True, lower=True,
                   init_token='<sos>', eos_token='<eos>', dtype=torch.long)
STANCE = data.Field(is_target=True, sequential=False, unk_token=None, use_vocab=True)
SENTIMENT = data.Field(is_target=True, sequential=False, unk_token=None, use_vocab=True)

# create tuples representing the columns
fields = [
    ('ID', ID),
    ('TARGET', TARGET),
    ('TWEET', TWEET),
    ('STANCE', STANCE),
    (None, None),  # ignore age column
    ('SENTIMENT', SENTIMENT),
]


# load the train/validation/test splits from tab-separated (TSV) files
train_ds, valid_ds, test_ds = data.TabularDataset.splits(
    path='data-all-annotations',
    train='legalization_of_abortion_train_set.txt',
    validation='legalization_of_abortion_valid_set.txt',
    test='legalization_of_abortion_test_set.txt',
    format='tsv',
    fields=fields
)

# check an example
print(vars(train_ds[1]))
print(vars(valid_ds[100]))
print(len(train_ds))

{'ID': '10971', 'TARGET': 'legalization of abortion', 'TWEET': ['where', 'is', 'the', 'childcare', 'program', '@joanburton', 'which', 'you', 'said', 'would', 'be', 'in', 'place?', '#loneparents', '#7istooyoung', '#semst'], 'STANCE': 'AGAINST', 'SENTIMENT': 'NEGATIVE'}
{'ID': '2412', 'TARGET': 'legalization of abortion', 'TWEET': ['please', 'join', 'us', 'as', 'we', 'pray', 'to', 'end', 'the', 'global', 'holocaust', 'of', 'abortion!', '#semst'], 'STANCE': 'AGAINST', 'SENTIMENT': 'NEGATIVE'}
280
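
To make the column mapping concrete, here is a minimal plain-Python sketch of what happens to one TSV row. It is not torchtext's implementation: `COLUMNS`, `parse_row` and the sample row are hypothetical names and data made up for illustration, and it assumes whitespace tokenization, which matches the examples printed above (note `'place?'` keeps its punctuation).

```python
# Column order assumed from the `fields` list above; the fifth column is skipped.
COLUMNS = ["ID", "TARGET", "TWEET", "STANCE", None, "SENTIMENT"]


def parse_row(line):
    """Turn one tab-separated line into a dict shaped like the examples above."""
    cells = line.rstrip("\n").split("\t")
    example = {}
    for name, cell in zip(COLUMNS, cells):
        if name is None:
            continue  # ignored column
        if name == "TWEET":
            # lower-case, then split on whitespace (punctuation stays attached)
            example[name] = cell.lower().split()
        elif name == "TARGET":
            example[name] = cell.lower()
        else:
            example[name] = cell
    return example


# Hypothetical row, made up for illustration (not taken from the dataset).
row = "42\tLegalization of Abortion\tAn Example tweet #SemST\tNONE\tx\tNEITHER"
print(parse_row(row))
```

The `<sos>` and `<eos>` tokens do not appear at this stage; they are only added when a batch is converted to indices.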




### Data Pre-processing
<a id=part1_3></a>

First, we need to build a vocabulary for each of our fields.

In [4]:
ID.build_vocab(train_ds)
TARGET.build_vocab(train_ds)
TWEET.build_vocab(train_ds)
STANCE.build_vocab(train_ds)
SENTIMENT.build_vocab(train_ds)

print(f"Number of tokens in training samples: {len(TWEET.vocab)}")
print(f"Number of tokens in training stance labels: {len(STANCE.vocab)}")
print(f"Number of tokens in training sentiment labels: {len(SENTIMENT.vocab)}")

Number of tokens in training samples: 1941
Number of tokens in training stance labels: 3
Number of tokens in training sentiment labels: 3


In [5]:
print('first 20 tokens:\n', TWEET.vocab.itos[:20], end='\n\n')

first 20 tokens:
 ['<unk>', '<pad>', '<sos>', '<eos>', '#semst', 'the', 'to', 'a', 'of', 'is', 'i', 'in', 'and', 'you', 'for', 'are', 'that', 'with', 'not', 'be']
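
The ordering above (special tokens first, then tokens by frequency) can be reproduced with a small sketch. This is an illustration under assumptions, not torchtext's code: `build_vocab`, `numericalize` and the toy tweets are hypothetical, and ties are broken alphabetically here.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]


def build_vocab(tokenized_tweets, specials=SPECIALS):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    # Sort by frequency (high to low), breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tweet, stoi):
    """Wrap a tweet in <sos>/<eos> and map unknown tokens to <unk>."""
    unk = stoi["<unk>"]
    return [stoi["<sos>"]] + [stoi.get(tok, unk) for tok in tweet] + [stoi["<eos>"]]


# Toy data, made up for illustration.
tweets = [["the", "vote", "#semst"], ["the", "law", "#semst"], ["the", "end"]]
itos, stoi = build_vocab(tweets)
print(itos[:6])                                      # ['<unk>', '<pad>', '<sos>', '<eos>', 'the', '#semst']
print(numericalize(["the", "unseen", "law"], stoi))  # [2, 4, 0, 7, 3]
```

Because the vocabulary is built from the training set only, any token that first appears in the validation or test set maps to index 0 (`<unk>`).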



In [6]:
print('stance labels vocab:\n', dict(STANCE.vocab.stoi))
print('sentiment labels vocab:\n', dict(SENTIMENT.vocab.stoi))

stance labels vocab:
 {'AGAINST': 0, 'FAVOR': 1, 'NONE': 2}
sentiment labels vocab:
 {'NEGATIVE': 0, 'POSITIVE': 1, 'NEITHER': 2}


# Data Loaders

We want to create batches and iterate through the datasets. For that, we need to define data loaders.


In [7]:
BATCH_SIZE = 10

dl_train, dl_valid, dl_test = torchtext.data.BucketIterator.splits(
    (train_ds, valid_ds, test_ds), batch_size=BATCH_SIZE,
    shuffle=True, sort=False)



This is what a single batch looks like:

In [8]:
batch = next(iter(dl_train))

X, y_stance, y_sentiment = batch.TWEET, batch.STANCE, batch.SENTIMENT
print('X = \n', X, X.shape, end='\n\n')
print('y = \n', y_stance, y_stance.shape)

X = 
 tensor([[   2,    2,    2,    2,    2,    2,    2,    2,    2,    2],
        [ 160,   24, 1345,  861,   21,  841,   18,  757,    5,  884],
        [1656,    9,  207,   61,  607, 1764,   28,   72, 1372, 1705],
        [ 188,   18,  275,  133,    9,   13, 1879,    7,   16,  141],
        [1021,  176,   13,   74,  133,   68,   18, 1763,    7,   15],
        [1309,   21,   15, 1814,  224,    5,   28, 1317,  521,    5],
        [  12,   13, 1064, 1371,   14,   99, 1101,   32,  136, 1636],
        [1263,  234,  802,  152,   28,    8,  158,  455,   81,  122],
        [1445,   44,  235,   90,  609, 1909, 1561,  223,    9, 1708],
        [  12,    8,  421,    4,   24,   52,   24, 1567, 1573,  145],
        [1407,    5,  120,    3,   67,   18,  261, 1270,   10,  595],
        [ 428,   65,  236,    1,   19,  514,    9,   32,  431,    4],
        [  13, 1917,   36,    1,  133,   74,   16, 1171,   41,    3],
        [ 422,   16,   28,    1,  224,    6, 1757,  456,  371,    1],
        [1377,
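
The batch printed above is time-major: each column is one tweet, starting with index 2 (`<sos>`) and filled with index 1 (`<pad>`) after index 3 (`<eos>`). A minimal sketch of that padding and transposition, using plain lists and a hypothetical helper `pad_batch` (the iterator does this internally; this is only an illustration):

```python
PAD_IDX = 1  # index of '<pad>' in the vocabulary printed earlier


def pad_batch(sequences, pad_idx=PAD_IDX):
    """Pad index lists to equal length and return them time-major (S rows, B columns)."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that row t holds token t of every sequence in the batch.
    return [list(step) for step in zip(*padded)]


# Three made-up, already numericalized tweets (2 = <sos>, 3 = <eos>).
batch_rows = pad_batch([[2, 5, 9, 3], [2, 7, 3], [2, 3]])
for row in batch_rows:
    print(row)
# [2, 2, 2]
# [5, 7, 3]
# [9, 3, 1]
# [3, 1, 1]
```

Wrapping the result in `torch.tensor(...)` would give a tensor of shape `(S, B)`, the layout of `X` above.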



# Model Implementation

The difficulty in our problem is that of stance detection: the sentiment is not generic but is expressed with respect to a specific topic.
As stated in our proposed solution, we will build an RNN-based deep neural network with a two-phase architecture.
We first experiment with the LSTM, a more sophisticated variant of the RNN.

#### The two-phase architecture

The first phase (subjectivity classification):
1. Embedding layer
2. LSTM layer
3. Linear classification layer that classifies the sentiment (positive / negative / neutral)

Testing the model:

In [9]:
from implementations.models import SubjectivityLSTM

EMB_DIM = 128
HID_DIM = 64
NUM_LAYERS = 2
DROP_OUT = 0.1
VOCAB_SIZE = len(TWEET.vocab)

sub = SubjectivityLSTM(VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROP_OUT)
out, h, ht = sub(X)
print(f'h (S, B, H): {h.shape}')
print(f'ht (L, B, H): {ht[0].shape}')
print(f'out (S, B, V_tgt): {out.shape}')

h (S, B, H): torch.Size([28, 10, 64])
ht (L, B, H): torch.Size([2, 10, 64])
out (S, B, V_tgt): torch.Size([28, 10, 3])
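
The source of `SubjectivityLSTM` is not shown here, so the following is only a sketch of how the three layers listed above could fit together so as to produce the shapes printed above. `SubjectivitySketch` is a hypothetical stand-in, and the log-softmax output is an assumption chosen to match the `nn.NLLLoss` used later in training.

```python
import torch
import torch.nn as nn


class SubjectivitySketch(nn.Module):
    """Hypothetical stand-in: embedding -> LSTM -> linear classifier."""

    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout)
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, x):
        # x: (S, B) token indices, time-major like the batches above
        emb = self.embedding(x)                   # (S, B, E)
        h, state = self.lstm(emb)                 # h: (S, B, H); state: two (L, B, H) tensors
        scores = self.classifier(h)               # (S, B, num_classes)
        out = torch.log_softmax(scores, dim=-1)   # log-probabilities (assumed output format)
        return out, h, state


model = SubjectivitySketch(vocab_size=1941, emb_dim=128, hid_dim=64, num_layers=2, dropout=0.1)
x = torch.randint(0, 1941, (28, 10))              # random indices standing in for a batch
out, h, state = model(x)
print(out.shape, h.shape, state[0].shape)
```

With these hyperparameters the three shapes are `(28, 10, 3)`, `(28, 10, 64)` and `(2, 10, 64)`, the same as in the cell output.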



The second phase (stance detection):
1. Attention layer
2. LSTM layer
3. Linear classification layer that classifies the stance (favor / against / none)

In [10]:
from implementations.models import StanceLSTM

stance = StanceLSTM(VOCAB_SIZE, EMB_DIM, HID_DIM, NUM_LAYERS, DROP_OUT)
yhat, _ = stance(X, ht, h)
print(f'yhat (S, B, V_tgt): {yhat.shape}')

yhat (S, B, V_tgt): torch.Size([28, 10, 3])
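
The attention layer inside `StanceLSTM` is not shown in this notebook. As an illustration of what such a layer computes, here is a sketch of scaled dot-product attention over the phase-one hidden states `h`. The function name, the scaling, and the choice of dot-product scoring are assumptions; the actual model may use a different attention variant.

```python
import torch


def dot_product_attention(queries, keys):
    """queries: (S_q, B, H), keys: (S_k, B, H) -> context (S_q, B, H), weights (B, S_q, S_k)."""
    q = queries.transpose(0, 1)                      # (B, S_q, H)
    k = keys.transpose(0, 1)                         # (B, S_k, H)
    scores = torch.bmm(q, k.transpose(1, 2))         # (B, S_q, S_k)
    scores = scores / (q.size(-1) ** 0.5)            # scale by sqrt(H)
    weights = torch.softmax(scores, dim=-1)          # each query's weights sum to 1
    context = torch.bmm(weights, k).transpose(0, 1)  # (S_q, B, H)
    return context, weights


h_enc = torch.randn(28, 10, 64)      # stand-in for the phase-one hidden states
queries = torch.randn(28, 10, 64)    # stand-in for the phase-two LSTM outputs
context, weights = dot_product_attention(queries, h_enc)
print(context.shape, weights.shape)
```

Each time step of the second phase gets a weighted summary of all phase-one hidden states, which lets the stance classifier focus on the parts of the tweet that matter for the target.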


Now for the model that combines both phases:

In [11]:
from implementations.models import TwoPhaseLSTM

two_phase_model = TwoPhaseLSTM(sub, stance)
y_stance, y_sub = two_phase_model(X)
print('y_stance: (S, B, V_tgt) =', tuple(y_stance.shape))
print('y_sub: (S, B, V_tgt) =', tuple(y_sub.shape))

y_stance: (S, B, V_tgt) = (28, 10, 3)
y_sub: (S, B, V_tgt) = (28, 10, 3)
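
Judging from the calls in the two cells before it, the combined model most likely runs phase one and feeds its hidden states into phase two. A minimal sketch of such a wrapper, assuming the call signatures used above (`TwoPhaseSketch`, `StubSub` and `StubStance` are hypothetical and exist only to make the example runnable):

```python
import torch
import torch.nn as nn


class TwoPhaseSketch(nn.Module):
    """Hypothetical wrapper that chains the two phases the way cells 9 and 10 do by hand."""

    def __init__(self, sub_model, stance_model):
        super().__init__()
        self.sub_model = sub_model
        self.stance_model = stance_model

    def forward(self, x):
        # Phase one: subjectivity scores plus the LSTM outputs and final state.
        y_sub, h, state = self.sub_model(x)
        # Phase two: stance scores, conditioned on what phase one produced.
        y_stance, _ = self.stance_model(x, state, h)
        return y_stance, y_sub


class StubSub(nn.Module):
    """Placeholder phase-one model returning zeros with plausible shapes."""

    def forward(self, x):
        s, b = x.shape
        state = (torch.zeros(1, b, 4), torch.zeros(1, b, 4))
        return torch.zeros(s, b, 3), torch.zeros(s, b, 4), state


class StubStance(nn.Module):
    """Placeholder phase-two model returning ones with plausible shapes."""

    def forward(self, x, state, h):
        return torch.ones(*x.shape, 3), None


wrapper = TwoPhaseSketch(StubSub(), StubStance())
y_stance_demo, y_sub_demo = wrapper(torch.zeros(5, 2, dtype=torch.long))
print(tuple(y_stance_demo.shape), tuple(y_sub_demo.shape))  # (5, 2, 3) (5, 2, 3)
```

Registering both sub-models as attributes of one `nn.Module` means a single optimizer can update the parameters of both phases.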


# Training

The training approach is standard: a cross-entropy loss on the class scores (implemented below with `nn.NLLLoss`, which assumes the models output log-probabilities), with the number of correct predictions counted for accuracy evaluation.
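
As a quick check of the relation between the two losses, the sketch below shows that `nll_loss` applied to log-softmax outputs equals `cross_entropy` applied to raw scores, and how accuracy is counted. The scores and labels are random, made-up values, not outputs of the model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(10, 3)            # made-up raw class scores for a batch of 10
labels = torch.randint(0, 3, (10,))    # made-up ground-truth class indices

log_probs = F.log_softmax(scores, dim=1)
nll = F.nll_loss(log_probs, labels)
ce = F.cross_entropy(scores, labels)
print(torch.allclose(nll, ce))         # True

# Accuracy: fraction of samples whose highest-scoring class matches the label.
num_correct = (scores.argmax(dim=1) == labels).sum().item()
accuracy = num_correct / labels.numel()
print(f"accuracy = {accuracy:.2f}")
```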

In [13]:
from implementations.training import train_two_phase_rnn
from implementations.training import eval_two_phase_rnn
import torch.nn as nn

EPOCHS = 100
BATCHES_PER_EPOCH=28
LR=1e-2

optimizer = torch.optim.Adam(two_phase_model.parameters(), lr=LR)
sub_loss_fn = nn.NLLLoss()
stance_loss_fn = nn.NLLLoss()

losses = []
sub_accuracies = []
stance_accuracies = []
for epoch in range(EPOCHS):
    
    print(f'=== EPOCH {epoch+1}/{EPOCHS} ===')
    losses += train_two_phase_rnn(two_phase_model, dl_train, optimizer, sub_loss_fn, stance_loss_fn, BATCHES_PER_EPOCH)
    sub_acc, stance_acc = eval_two_phase_rnn(two_phase_model, dl_valid)
    sub_accuracies += sub_acc
    stance_accuracies += stance_acc


=== EPOCH 1/100 ===
train loss=0.3520396053791046,: 100%|██████████████████████████████████████████████████| 28/28 [00:01<00:00, 15.42it/s]
 sentiment accuracy=0.8999999761581421, stance accuracy=0.4000000059604645 : 100%|█████| 31/31 [00:00<00:00, 40.26it/s]
=== EPOCH 2/100 ===
train loss=0.715445339679718,: 100%|███████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.95it/s]
 sentiment accuracy=0.30000001192092896, stance accuracy=0.30000001192092896 : 100%|███| 31/31 [00:00<00:00, 36.02it/s]
=== EPOCH 3/100 ===
train loss=0.18244589865207672,: 100%|█████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.09it/s]
 sentiment accuracy=0.699999988079071, stance accuracy=0.4000000059604645 : 100%|██████| 31/31 [00:00<00:00, 38.90it/s]
=== EPOCH 4/100 ===
train loss=0.2815021872520447,: 100%|██████████████████████████████████████████████████| 28/28 [00:02<00:00, 11.74it/s]
 sentiment accuracy=0.699999988079071, stance accuracy=0.20000000298023224 : 100