# Dimensional Sentiment Model

## Overview

This notebook implements a variation on the dimensional CNN-LSTM for sentiment analysis in valence-arousal (VA) space, based on [Wang et al. 2016](https://www.aclweb.org/anthology/P16-2037.pdf) and [Wang et al. 2020](https://ieeexplore.ieee.org/ielx7/6570655/8938144/08930925.pdf).

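The dataset is JULIELab's [EmoBank](https://github.com/JULIELab/EmoBank), a large-scale (10k sentence) corpus annotated in the Valence-Arousal-Dominance (VAD) scheme. Only the valence (`V`) and arousal (`A`) columns are used here.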

## Data Preparation


### Download Data
EmoBank provides three sets of annotations: 1) the reader perspective, 2) the writer perspective, and 3) the weighted average of the reader and writer annotations. This notebook uses the third, `emobank.csv`.

In [None]:
!wget --show-progress --continue -O /content/emobank.csv https://raw.githubusercontent.com/JULIELab/EmoBank/master/corpus/emobank.csv

--2020-07-22 16:24:51--  https://raw.githubusercontent.com/JULIELab/EmoBank/master/corpus/emobank.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1335010 (1.3M) [text/plain]
Saving to: ‘/content/emobank.csv’


2020-07-22 16:24:51 (9.86 MB/s) - ‘/content/emobank.csv’ saved [1335010/1335010]



In [None]:
!wget --show-progress --continue -O /content/crawl-300d-50K.vec.zip https://mb-14.github.io/static/crawl-300d-50K.vec.zip
!unzip /content/crawl-300d-50K.vec.zip -d /content/

--2020-07-22 16:24:54--  https://mb-14.github.io/static/crawl-300d-50K.vec.zip
Resolving mb-14.github.io (mb-14.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to mb-14.github.io (mb-14.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37233955 (36M) [application/zip]
Saving to: ‘/content/crawl-300d-50K.vec.zip’


2020-07-22 16:24:55 (51.4 MB/s) - ‘/content/crawl-300d-50K.vec.zip’ saved [37233955/37233955]

Archive:  /content/crawl-300d-50K.vec.zip
  inflating: /content/crawl-300d-50K.vec  


### Loading Data


In [None]:
import pandas as pd
import tensorflow as tf

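# `import distutils` alone does not load the `version` submodule, so import the class directly.
from distutils.version import LooseVersion
if LooseVersion(tf.__version__) < LooseVersion('2.0'):
  raise Exception('This notebook is compatible with TensorFlow 2.0 or higher.')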

full = pd.read_csv("/content/emobank.csv", index_col=0)[['split', 'text', 'V', 'A']]

import nltk
nltk.download('punkt')

# Characters deleted from every sentence (commas and quotes are kept).
PUNCT_TABLE = str.maketrans('', '', '.?!;:()[]/')

def sanitize(text):
  """Split an entry into sentences, lowercase them, and strip punctuation."""
  return [s.lower().translate(PUNCT_TABLE) for s in nltk.tokenize.sent_tokenize(text)]

full['text'] = full['text'].map(sanitize)

train = full[full['split'] == 'train'][['text', 'V', 'A']]
test = full[full['split'] == 'test'][['text', 'V', 'A']]

print(train.sort_values(by=['V']))

len(train)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
                                                                                     text  ...     A
id                                                                                         ...      
A_defense_of_Michael_Moore_12034_12044                                       ["fuck you"]  ...  4.20
captured_moments_5506_5538                              [i hate it, despise it, abhor it]  ...  4.40
captured_moments_6594_6611                                            ["obscenely ugly,"]  ...  2.90
detroit_13623_13627                                                                 [sad]  ...  3.20
Nathans_Bylichka_7597_7686              [my girlfriend has disappeared, i don’t even k...  ...  4.00
...                                                                                   ...  ...   ...
captured_moments_28753_28863            [for a perfect moment, emil and tasha and i we

8062
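
The punctuation stripping used by `sanitize` can be tried in isolation. The sketch below uses only the standard library and leaves out NLTK's sentence splitting; `strip_punctuation` is a hypothetical helper name, not part of the notebook.

```python
# Translation table that deletes the same characters as `sanitize`.
PUNCT_TABLE = str.maketrans('', '', '.?!;:()[]/')

def strip_punctuation(sentence):
  """Lowercase a sentence and delete the listed punctuation characters."""
  return sentence.lower().translate(PUNCT_TABLE)

cleaned = strip_punctuation("I hate it, despise it!")
print(cleaned)
```

Commas are not in the deletion list, so they survive, which matches the sorted sample printed above.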

### Tokenizing

In [None]:
MAX_SEQUENCE_LENGTH=50
MAX_SENTENCE_LENGTH=70
MAX_ENTRY_LENGTH=15
MAX_NUMBER_WORDS=50000

flatten = lambda l: [item for sublist in l for item in sublist]

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import numpy as np

tokenizer = Tokenizer(num_words=MAX_NUMBER_WORDS)
sentences = flatten(train['text'].values)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

def tokenize_and_pad(dataset):
  # Each entry is a list of sentences; turn every sentence into a sequence of word ids.
  sequences = [tokenizer.texts_to_sequences(entry) for entry in dataset['text'].values]
  # Pad each sentence to MAX_SENTENCE_LENGTH. Entries still differ in sentence count,
  # so keep them in a plain list instead of a ragged NumPy array.
  x_unpadded = [pad_sequences(seq, maxlen=MAX_SENTENCE_LENGTH) for seq in sequences]

  def pad_3d_array(array, shape):
    # Bottom-align each entry's sentences in a zero-filled (entries, sentences, words) array.
    padded = np.zeros(shape, dtype=int)
    for x in range(shape[0]):
      # Keep at most shape[1] sentences per entry (the last ones).
      kept = array[x][-shape[1]:]
      if len(kept) > 0:
        padded[x][shape[1] - len(kept):] = kept
    return padded

  x_f = pad_3d_array(x_unpadded, (len(dataset['text'].values), MAX_ENTRY_LENGTH, MAX_SENTENCE_LENGTH))
  y_f = np.array(dataset[['V', 'A']].values)

  return x_f, y_f

x_train, y_train = tokenize_and_pad(train)
x_val, y_val = tokenize_and_pad(test)

Using TensorFlow backend.
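
The model expects every entry as a fixed `(MAX_ENTRY_LENGTH, MAX_SENTENCE_LENGTH)` grid of word ids, with the real sentences aligned to the bottom and zero rows above them. Below is a minimal sketch of that bottom-aligned padding on toy data; `bottom_align` and the toy entries are illustrative assumptions, not the notebook's own function.

```python
import numpy as np

def bottom_align(entries, max_sentences, max_words):
  """Stack variable-length entries into a zero-padded 3-D array.

  Each entry is a 2-D array of shape (n_sentences, max_words).
  """
  padded = np.zeros((len(entries), max_sentences, max_words), dtype=int)
  for i, entry in enumerate(entries):
    kept = entry[-max_sentences:]  # keep at most the last max_sentences rows
    if len(kept) > 0:
      padded[i, max_sentences - len(kept):] = kept
  return padded

# Two toy entries: one sentence and two sentences, 4 word slots each.
entries = [np.array([[0, 0, 5, 6]]), np.array([[0, 1, 2, 3], [0, 0, 7, 8]])]
x = bottom_align(entries, max_sentences=3, max_words=4)
print(x.shape)
print(x[1])
```

With the notebook's constants the same layout gives an array of shape `(n_entries, 15, 70)`.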


## Model


### Make Embedding Layer

In [None]:
from gensim.models import KeyedVectors
from tensorflow.keras import layers
from tensorflow.keras.initializers import Constant

def make_embedding_layer(word_index, embeddings_path):
    embeddings = KeyedVectors.load_word2vec_format(embeddings_path)
    embedding_dims = embeddings.vector_size
    # `vectors` has one row per known word and exists in gensim 3.x and 4.x; `vocab` was removed in 4.0.
    nb_words = min(len(embeddings.vectors), len(word_index))+1

    embedding_matrix = np.zeros((nb_words, embedding_dims))

    for word, i in word_index.items():
        if i >= nb_words:
            continue
        try:
            embedding_vector = embeddings.get_vector(word)
            embedding_matrix[i] = embedding_vector
        except KeyError:
            continue

    embedding_layer = layers.Embedding(input_dim=nb_words, output_dim=embedding_dims, embeddings_initializer=Constant(embedding_matrix),
                                input_length=MAX_SENTENCE_LENGTH,
                                trainable=False)
    return embedding_layer, embedding_dims
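
The core of `make_embedding_layer` is the lookup table it builds: row `i` holds the pretrained vector of the word with tokenizer index `i`, and words without a pretrained vector keep an all-zero row. The sketch below reproduces that step with plain dictionaries so it runs without gensim; `build_embedding_matrix`, `toy_index`, and `toy_vectors` are hypothetical stand-ins.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dims):
  """Return a (len(word_index) + 1, dims) matrix; row 0 is reserved for padding."""
  matrix = np.zeros((len(word_index) + 1, dims))
  for word, i in word_index.items():
    if word in vectors:
      matrix[i] = vectors[word]
  return matrix

# Toy stand-ins (assumptions): a Tokenizer-style index and a 3-d vector table.
toy_index = {'good': 1, 'bad': 2, 'zzz': 3}
toy_vectors = {'good': [0.1, 0.2, 0.3], 'bad': [-0.1, -0.2, -0.3]}
matrix = build_embedding_matrix(toy_index, toy_vectors, dims=3)
print(matrix.shape)
```

Index 0 is never assigned by the Keras `Tokenizer`, which is why the matrix has one more row than the vocabulary and why zero padding maps to a zero vector.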

### Build the Model

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import os

CNN_DIM = 32
CNN_WINDOW = 3
POOL_SIZE = 2

def cnn_lstm_model(word_index, embeddings_path):
  # Frozen embedding layer built from the pretrained vectors.
  embedding_layer, _ = make_embedding_layer(word_index, embeddings_path)

  return tf.keras.Sequential([
    # Embed and convolve each sentence of an entry independently.
    layers.TimeDistributed(embedding_layer, input_shape=(MAX_ENTRY_LENGTH, MAX_SENTENCE_LENGTH)),
    layers.TimeDistributed(layers.Conv1D(CNN_DIM, CNN_WINDOW, activation='relu')),
    layers.TimeDistributed(layers.MaxPool1D(pool_size=POOL_SIZE, strides=1, padding='valid')),
    layers.TimeDistributed(layers.Flatten()),
    # The LSTM reads one feature vector per sentence.
    layers.LSTM(20),
    layers.Dense(10),
    # Two linear outputs: valence and arousal.
    layers.Dense(2)
  ])

model = cnn_lstm_model(word_index, '/content/crawl-300d-50K.vec')
model.summary()



Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
time_distributed_3 (TimeDist (None, 15, 70, 300)       4682400   
_________________________________________________________________
time_distributed_4 (TimeDist (None, 15, 68, 32)        28832     
_________________________________________________________________
time_distributed_5 (TimeDist (None, 15, 67, 32)        0         
_________________________________________________________________
time_distributed_6 (TimeDist (None, 15, 2144)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 20)                173200    
_________________________________________________________________
dense_2 (Dense)              (None, 10)                210       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                
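
The output shapes and parameter counts in the summary can be checked by hand. The arithmetic below uses only the constants defined in the notebook and the standard formulas for `Conv1D`, `LSTM`, and `Dense` layers; the variable names are illustrative.

```python
# Constants copied from the notebook.
MAX_SENTENCE_LENGTH = 70
EMBEDDING_DIMS = 300
CNN_DIM, CNN_WINDOW, POOL_SIZE = 32, 3, 2
LSTM_UNITS = 20

# A 'valid' convolution and stride-1 pooling each shorten the sequence.
conv_len = MAX_SENTENCE_LENGTH - CNN_WINDOW + 1      # 68
pool_len = conv_len - POOL_SIZE + 1                  # 67
flat_dim = pool_len * CNN_DIM                        # 2144

# Parameter counts: kernel weights plus biases.
conv_params = CNN_WINDOW * EMBEDDING_DIMS * CNN_DIM + CNN_DIM          # 28832
lstm_params = 4 * ((flat_dim + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)  # 173200
dense_10_params = LSTM_UNITS * 10 + 10                                 # 210
dense_2_params = 10 * 2 + 2                                            # 22

print(conv_len, pool_len, flat_dim)
print(conv_params, lstm_params, dense_10_params, dense_2_params)
```

Most of the non-embedding parameters sit in the LSTM, because each sentence is flattened to a 2144-dimensional vector before it reaches the recurrent layer.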

## Training

### Train the Model


In [None]:
model = cnn_lstm_model(word_index, '/content/crawl-300d-50K.vec')

# Valence and arousal are continuous targets, so track mean absolute error rather than accuracy.
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.fit(x=x_train, y=y_train, validation_data=(x_val, y_val), epochs=50, batch_size=100)

model.save('Dimensional_Sentiment_Model')



Epoch 1/50
...
Epoch 50/50
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: Dimensional_Sentiment_Model/assets
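
Since the model predicts two continuous values per entry, its predictions can be inspected with regression measures such as per-dimension mean absolute error and Pearson correlation. The following is a rough sketch on toy numbers; `regression_report` is a hypothetical helper, and in the notebook the inputs would be `y_val` and `model.predict(x_val)`.

```python
import numpy as np

def regression_report(y_true, y_pred):
  """Return per-column MAE and Pearson r for (n_samples, n_targets) arrays."""
  y_true = np.asarray(y_true, dtype=float)
  y_pred = np.asarray(y_pred, dtype=float)
  mae = np.mean(np.abs(y_true - y_pred), axis=0)
  r = np.array([np.corrcoef(y_true[:, k], y_pred[:, k])[0, 1]
                for k in range(y_true.shape[1])])
  return mae, r

# Toy values for illustration only: column 0 is valence, column 1 is arousal.
y_true = np.array([[3.0, 3.0], [2.0, 4.0], [4.0, 2.0]])
y_pred = np.array([[3.5, 3.0], [2.5, 4.0], [4.5, 2.0]])
mae, r = regression_report(y_true, y_pred)
print(mae, r)
```

In the toy data the valence predictions are offset by a constant 0.5, so the MAE is 0.5 while the correlation is still perfect; the two measures answer different questions.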


### Save Model

In [None]:
!pip install tensorflowjs


In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
time_distributed_4 (TimeDist (None, 15, 70, 300)       4682400   
_________________________________________________________________
time_distributed_5 (TimeDist (None, 15, 68, 32)        28832     
_________________________________________________________________
time_distributed_6 (TimeDist (None, 15, 67, 32)        0         
_________________________________________________________________
time_distributed_7 (TimeDist (None, 15, 2144)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 20)                173200    
_________________________________________________________________
dense_2 (Dense)              (None, 10)                210       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                

In [None]:
import tensorflowjs as tfjs

def get_model_without_embeddings(model, input_shape):
  # Rebuild every layer after the embedding, then copy the trained weights by position.
  # This avoids relying on auto-generated layer names such as 'time_distributed_5'.
  new_model = tf.keras.Sequential([
    tf.keras.Input(shape=input_shape),
    layers.TimeDistributed(layers.Conv1D(CNN_DIM, CNN_WINDOW, activation='relu')),
    layers.TimeDistributed(layers.MaxPool1D(pool_size=POOL_SIZE, strides=1, padding='valid')),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(20),
    layers.Dense(10),
    layers.Dense(2)
  ])
  # model.layers[0] is the embedding wrapper, so skip it.
  for source, target in zip(model.layers[1:], new_model.layers):
    target.set_weights(source.get_weights())
  return new_model

def save_model(path, model):
  tfjs.converters.save_keras_model(model, path)
  model.save(os.path.join(path, 'model.h5'))

embedding_dims=300

new_model = get_model_without_embeddings(model, (MAX_ENTRY_LENGTH, MAX_SENTENCE_LENGTH, embedding_dims))
new_model.summary()
save_model('generated', new_model)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
time_distributed_8 (TimeDist (None, 15, 68, 32)        28832     
_________________________________________________________________
time_distributed_9 (TimeDist (None, 15, 67, 32)        0         
_________________________________________________________________
time_distributed_10 (TimeDis (None, 15, 2144)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 20)                173200    
_________________________________________________________________
dense_4 (Dense)              (None, 10)                210       
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 22        
Total params: 202,264
Trainable params: 202,264
Non-trainable params: 0
________________________________________________
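
Because the exported model starts at the convolution, its input is already-embedded text of shape `(MAX_ENTRY_LENGTH, MAX_SENTENCE_LENGTH, embedding_dims)`, so the caller has to perform the embedding lookup itself. A rough NumPy sketch of that lookup follows; the random `embedding_matrix`, the toy `VOCAB_SIZE`, and `x_ids` are stand-ins for the trained matrix and real tokenized input.

```python
import numpy as np

MAX_ENTRY_LENGTH, MAX_SENTENCE_LENGTH = 15, 70
EMBEDDING_DIMS = 300
VOCAB_SIZE = 100  # toy vocabulary size (assumption)

# Stand-in for the pretrained embedding matrix; row 0 is the padding vector.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(VOCAB_SIZE, EMBEDDING_DIMS))
embedding_matrix[0] = 0.0

# Stand-in for a batch of two tokenized, padded entries.
x_ids = rng.integers(0, VOCAB_SIZE, size=(2, MAX_ENTRY_LENGTH, MAX_SENTENCE_LENGTH))

# Integer indexing performs the lookup the removed Embedding layer used to do.
x_embedded = embedding_matrix[x_ids]
print(x_embedded.shape)
```

An array shaped like `x_embedded` is what the embedding-free model would take as input.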

