# Sentence Emotion Detection Model

This notebook contains the code we used to train a model that uses word embeddings and an LSTM to predict the emotion of a journal entry (a single text sentence).

## Preparations

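Install pinned versions of torch, torchtext and torchvision. spaCy, used later for tokenization, is not installed by this cell and is assumed to be present in the environment.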


In [2]:
!pip install --upgrade torch==1.7.1 torchtext==0.8.1 torchvision==0.8.2

Collecting torch==1.7.1
  Downloading https://files.pythonhosted.org/packages/90/5d/095ddddc91c8a769a68c791c019c5793f9c4456a688ddd235d6670924ecb/torch-1.7.1-cp37-cp37m-manylinux1_x86_64.whl (776.8MB)
Collecting torchtext==0.8.1
  Downloading https://files.pythonhosted.org/packages/13/80/046f0691b296e755ae884df3ca98033cb9afcaf287603b2b7999e94640b8/torchtext-0.8.1-cp37-cp37m-manylinux1_x86_64.whl (7.0MB)
Collecting torchvision==0.8.2
  Downloading https://files.pythonhosted.org/packages/94/df/969e69a94cff1c8911acb0688117f95e1915becc1e01c73e7960a2c76ec8/torchvision-0.8.2-cp37-cp37m-manylinux1_x86_64.whl (12.8MB)
Installing collected packages: torch, torchtext, torchvision
  Found existing installation: torch 1.8.1+cu101
    Uninstalling torch-1.8.1+cu101:
      Successfully uninstalled

In [3]:
import torch, torchtext
from torch import nn, optim
import torch.nn.functional as F   # torch.functional is a different module from torch.nn.functional
import pandas as pd, csv
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import pdb

Download the (already cleaned) dataset from a Dropbox link and load it with pandas

In [4]:
!wget -O text.csv https://www.dropbox.com/s/iulhdbo1yc8farq/Emotion_final.csv?dl=0

--2021-04-07 08:11:33--  https://www.dropbox.com/s/iulhdbo1yc8farq/Emotion_final.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.64.18, 2620:100:6020:18::a27d:4012
Connecting to www.dropbox.com (www.dropbox.com)|162.125.64.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/iulhdbo1yc8farq/Emotion_final.csv [following]
--2021-04-07 08:11:34--  https://www.dropbox.com/s/raw/iulhdbo1yc8farq/Emotion_final.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc286a79fbc82545e23a0802b85f.dl.dropboxusercontent.com/cd/0/inline/BMJlvO8_3bNWASB5y3WeRwage67btWKh2Bar-8qnsO_8-It4RL6TS4xw_fean-Aj1OqgyRGEgEjkVIgUQGJs0zxZZt9ugI48sOn_FHLaewm_cVZRaWG1s9ewvHVd3PnjsGzbmFwd8h4FuLgLMBT8EWKG/file# [following]
--2021-04-07 08:11:34--  https://uc286a79fbc82545e23a0802b85f.dl.dropboxusercontent.com/cd/0/inline/BMJlvO8_3bNWASB5y3WeRwage67btWKh2Bar-8qnsO_8-It4RL6TS4xw_fean-Aj1Oq

In [5]:
text = pd.read_csv('/content/text.csv')

In [6]:
text

| Unnamed: 0 | Number | Text | Emotion |
|---|---|---|---|
| 0 | 1 | i didnt feel humiliated | sadness |
| 1 | 2 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | 3 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | 4 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | 5 | i am feeling grouchy | anger |
| ... | ... | ... | ... |
| 21454 | 21455 | Melissa stared at her friend in dism | fear |
| 21455 | 21456 | Successive state elections have seen the gover... | fear |
| 21456 | 21457 | Vincent was irritated but not dismay | fear |
| 21457 | 21458 | Kendall-Hume turned back to face the dismayed ... | fear |


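Store the emotion labels in a list for later use. The order matches the output of `unique()`, so the list index of each emotion is also its integer label.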

In [7]:
text.Emotion.unique()

array(['sadness', 'anger', 'love', 'surprise', 'fear', 'happy'],
      dtype=object)

In [8]:
sentiment = ['sadness', 'anger', 'love', 'surprise', 'fear', 'happy']
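
The order of this list matters: the `Sentences` dataset defined in the next section numbers the emotions in order of first appearance, so `sentiment[k]` only decodes label `k` correctly if both orders agree. A minimal sketch of that round trip, using `dict.fromkeys` as a stdlib stand-in for `Series.unique()` and a made-up label column:

```python
# Hypothetical label column; the real one comes from text['Emotion'].
labels = ['sadness', 'sadness', 'anger', 'love', 'anger', 'surprise', 'fear', 'happy']

# Unique labels in order of first appearance (the order pandas' unique() returns).
classes = list(dict.fromkeys(labels))

# Label -> integer id, and back again by indexing into the list.
to_id = {name: i for i, name in enumerate(classes)}
ids = [to_id[name] for name in labels]
decoded = [classes[i] for i in ids]

print(classes)
print(ids)
```

Decoding with `classes[i]` is the same operation the notebook later performs with `sentiment[...]`.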

## Dataset

Define a `Dataset` for the text and split it 80/20 into train and test subsets

In [9]:
class Sentences(torch.utils.data.Dataset):
    def __init__(self, fn):
        # Map each of the 6 emotion labels to an integer id, starting from 0
        convert = { u: n for n, u in enumerate(fn['Emotion'].unique()) }
        tokenizer = torchtext.data.utils.get_tokenizer('spacy', 'en_core_web_sm')  # tokenizer using spaCy
        # Number of tokens in each sentence, used in __getitem__ to slice self.toks
        lengths = [len(tokenizer(s)) for s in fn['Text']]
        # Combine all sentences into one string and tokenize it in a single pass.
        # Note: this assumes the joined string tokenizes exactly like the individual
        # sentences; if it does not, the slices in __getitem__ drift out of alignment.
        toks = tokenizer(' '.join(fn['Text']))

        self.vocab = torchtext.vocab.build_vocab_from_iterator([toks])
        self.sentiment = fn['Emotion'].map(convert).values   # integer labels, without modifying fn
        self.text = fn['Text'].values
        self.length = lengths
        self.toks = torch.LongTensor([self.vocab[tok] for tok in toks])
        # Start position of each sentence in self.toks (running sum of the lengths)
        self.offsets = [0]
        for n in lengths[:-1]:
            self.offsets.append(self.offsets[-1] + n)

    def __len__(self):
        return len(self.length)

    def __getitem__(self, i):
        start = self.offsets[i]
        # Return the label id and the token ids of sentence i
        return (self.sentiment[i], self.toks[start: start + self.length[i]])
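
The dataset keeps all token ids in one flat tensor and recovers sentence `i` by slicing from its start position. A small stdlib-only sketch of that layout, with made-up token lists in place of the spaCy output (the names `flat`, `offsets` and `get_sentence` are illustrative only):

```python
from itertools import accumulate

# Made-up tokenized sentences standing in for the spaCy output.
sentences = [['i', 'am', 'happy'], ['so', 'sad'], ['i', 'feel', 'fine', 'today']]

lengths = [len(s) for s in sentences]
flat = [tok for s in sentences for tok in s]   # all tokens in one flat list

# offsets[i] is where sentence i starts in `flat`.
offsets = [0] + list(accumulate(lengths))[:-1]

def get_sentence(i):
    """Return the tokens of sentence i by slicing the flat list."""
    return flat[offsets[i]: offsets[i] + lengths[i]]

print(offsets)          # [0, 3, 5]
print(get_sentence(2))  # ['i', 'feel', 'fine', 'today']
```

Computing the start positions once keeps each lookup constant-time, which matters when the loader visits every sentence in every epoch.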

In [10]:
ds_full = Sentences(text)
n_train = int(0.8 * len(ds_full))
n_test = len(ds_full) - n_train
rng = torch.Generator().manual_seed(291)
ds_train, ds_test = torch.utils.data.random_split(ds_full, [n_train, n_test], rng)

1lines [00:00, 16.66lines/s]


Check that the outputs are correct

In [11]:
ds_test[100]

(5, tensor([   2,   25,   10,   17, 4064,    4, 1331]))

In [12]:
' '.join([ds_full.vocab.itos[x] for x in ds_test[100][1]])

'i am feeling so hyper and bouncy'

In [13]:
sentiment[ds_test[100][0]]

'happy'

In [14]:
len(ds_full.toks)

416914

## Model

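Model with embedding and LSTM: each token is embedded, the sequence is passed through a two-layer LSTM, a linear layer produces logits at every time step, and the logits are averaged over the sequence.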

In [15]:
class SentenceModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, lstm_dim, n_cats, n_layers=2, drop_prob=0.5):
        super().__init__()                                                      # constructor of the parent class
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)          # learned word embeddings
        self.lstm = torch.nn.LSTM(embedding_dim, lstm_dim, n_layers,
                                  dropout=drop_prob, batch_first=True)          # stacked LSTM, dropout between layers
        self.linear = nn.Linear(lstm_dim, n_cats)                               # per-step logits, one per emotion
        nn.init.xavier_uniform_(self.embedding.weight.data)  # ??? need to check this with TA ???
        nn.init.xavier_uniform_(self.linear.weight.data)     # ??? need to check this with TA ???

    def forward(self, text):
        emb = self.embedding(text)          # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(emb)        # (batch, seq_len, lstm_dim)
        out = self.linear(lstm_out)         # (batch, seq_len, n_cats)
        return torch.mean(out, dim=1)       # average over seq_len -> (batch, n_cats) ??? need to check this with TA ???
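
Written out, the forward pass amounts to the following, where $x_1,\dots,x_T$ are the token ids of one sentence, $E$ is the embedding table, and $W$, $b$ are the weight and bias of the linear layer. These symbols are notation introduced here for illustration, not names from the code:

```latex
e_t = E[x_t], \qquad
(h_1,\dots,h_T) = \mathrm{LSTM}(e_1,\dots,e_T), \qquad
z = \frac{1}{T}\sum_{t=1}^{T}\left(W h_t + b\right)
```

Here $h_t$ is the top-layer hidden state at step $t$ and $z$ holds one logit per emotion, which `nn.CrossEntropyLoss` converts to probabilities with a softmax. Because the linear layer is affine, averaging the per-step logits gives the same result as applying the layer to the mean hidden state.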

Test and Train loops

In [16]:
device = torch.device('cpu')

def run_test(model, ds, crit):
    preds = []                                                                  # list to store the predictions
    batch_size = 1                                                              # sentences differ in length, so batch_size > 1 needs a padding collate_fn
    model.eval()
    total_loss, total_acc = 0, 0
    ldr = torch.utils.data.DataLoader(ds, batch_size=batch_size)
    for labs, txts in ldr:                                                
        labs, txts = labs.to(device), txts.to(device)
        with torch.no_grad():
            outs = model(txts)
            loss = crit(outs, labs)
            total_loss += loss.item()
            total_acc += (outs.argmax(1) == labs).sum().item()
            preds.append(outs.argmax(1))                                        # append all the predictions to an array
    return total_loss / len(ds), total_acc / len(ds), preds, batch_size         # added array return value 'preds' and batchsize

def run_train(model, ds, crit, opt, sched):
    model.train()
    total_loss, total_acc = 0, 0
    ldr = torch.utils.data.DataLoader(ds)
    for labs, txts in tqdm(ldr, leave=False, desc='train iter'):          
        opt.zero_grad()
        labs, txts = labs.to(device), txts.to(device)
        outs = model(txts)
        loss = crit(outs, labs)
        loss.backward()
        opt.step()
        total_loss += loss.item()
        total_acc += (outs.argmax(1) == labs).sum().item()
    sched.step()
    return total_loss / len(ds), total_acc / len(ds)

def run_all(model, test_ds, train_ds, crit, opt, sched, n_epochs=10):
    for epoch in tqdm(range(n_epochs), desc='epochs'):
        train_loss, train_acc = run_train(model, train_ds, crit, opt, sched)
        test_loss, test_acc, _, _ = run_test(model, test_ds, crit)
        tqdm.write(f'epoch {epoch}   train loss {train_loss:.6f} acc {train_acc:.4f}   test loss {test_loss:.6f} acc {test_acc:.4f}')   
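
Both loops process one sentence per step, because sentences have different lengths and cannot be stacked into a batch as they are. If larger batches are wanted, one common option is a custom `collate_fn` that pads each batch to its longest sentence. The sketch below is not part of the notebook: the helper name `pad_collate` is made up, and the padding id of 1 assumes the vocabulary uses torchtext 0.8's default specials (`<unk>` at 0, `<pad>` at 1), which should be checked with `ds_full.vocab.stoi['<pad>']`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch, pad_id=1):
    """Collate (label, token_tensor) pairs into padded batch tensors."""
    labels = torch.tensor([int(lab) for lab, _ in batch])
    lengths = torch.tensor([len(toks) for _, toks in batch])
    # pad_sequence right-pads every sequence to the longest one in the batch
    padded = pad_sequence([toks for _, toks in batch],
                          batch_first=True, padding_value=pad_id)
    return labels, padded, lengths

# Toy batch with the same (label, tokens) layout the dataset returns.
batch = [(5, torch.tensor([2, 25, 10])), (0, torch.tensor([7, 3, 9, 4, 8]))]
labels, padded, lengths = pad_collate(batch)
print(padded.shape)   # torch.Size([2, 5])
```

It would be passed as `DataLoader(ds, batch_size=32, collate_fn=pad_collate)`, and the loops would then unpack three values per batch. Note that the model as written would also average over the padded positions unless they are masked out using `lengths`.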

## Training

In [17]:
model = SentenceModel(len(ds_full.vocab), 32, 1, len(text.Emotion.unique()))
model.to(device);
crit = nn.CrossEntropyLoss().to(device)
opt = optim.SGD(model.parameters(), lr=1.0)
sched = optim.lr_scheduler.StepLR(opt, 10, gamma=0.1)
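
`StepLR(opt, 10, gamma=0.1)` multiplies the learning rate by `gamma` once every 10 calls to `sched.step()`, and `run_train` calls it once per epoch. A stdlib-only sketch of the resulting schedule; the helper `step_lr` is hypothetical and only mirrors the closed-form rule `lr0 * gamma ** (epoch // step_size)`:

```python
def step_lr(lr0, step_size, gamma, epoch):
    """Learning rate used during `epoch` under a StepLR-style schedule."""
    return lr0 * gamma ** (epoch // step_size)

# Settings from this cell: lr=1.0, step_size=10, gamma=0.1, 10 epochs.
schedule = [step_lr(1.0, 10, 0.1, e) for e in range(10)]
print(schedule)                    # ten times 1.0: no decay within the run
print(step_lr(1.0, 10, 0.1, 10))   # first decayed value, reached at epoch 10
```

With 10 epochs and a step size of 10, the decay never takes effect during the run, so every epoch trains at lr = 1.0.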

In [18]:
run_all(model, ds_test, ds_train, crit, opt, sched, 10)

epoch 0   train loss 1.809258 acc 0.2762   test loss 1.712620 acc 0.3269
epoch 1   train loss 1.809175 acc 0.2765   test loss 1.712628 acc 0.3269
epoch 2   train loss 1.809174 acc 0.2765   test loss 1.712601 acc 0.3269
epoch 3   train loss 1.809171 acc 0.2764   test loss 1.712594 acc 0.3269
epoch 4   train loss 1.809168 acc 0.2764   test loss 1.712592 acc 0.3269
epoch 5   train loss 1.809171 acc 0.2764   test loss 1.712599 acc 0.3269
epoch 6   train loss 1.809172 acc 0.2764   test loss 1.712600 acc 0.3269
epoch 7   train loss 1.809170 acc 0.2765   test loss 1.712599 acc 0.3269

KeyboardInterrupt: ignored

In [20]:
model = SentenceModel(len(ds_full.vocab), 32, 64, len(text.Emotion.unique())) #lstm_dim: 1->64
model.to(device);
crit = nn.CrossEntropyLoss().to(device)
opt = optim.SGD(model.parameters(), lr=1.0)
sched = optim.lr_scheduler.StepLR(opt, 10, gamma=0.1)

run_all(model, ds_test, ds_train, crit, opt, sched, 10)

epoch 0   train loss 1.809752 acc 0.2763   test loss 1.712660 acc 0.3269
epoch 1   train loss 1.809259 acc 0.2765   test loss 1.712617 acc 0.3269
epoch 2   train loss 1.809241 acc 0.2764   test loss 1.712661 acc 0.3269
epoch 3   train loss 1.809223 acc 0.2765   test loss 1.712638 acc 0.3269
epoch 4   train loss 1.809217 acc 0.2765   test loss 1.712623 acc 0.3269

KeyboardInterrupt: ignored

In [27]:
model = SentenceModel(len(ds_full.vocab), 32, 64, len(text.Emotion.unique()))
model.to(device);
crit = nn.CrossEntropyLoss().to(device)
opt = optim.SGD(model.parameters(), lr=0) #lr 0.1->0; note that with lr=0 the weights are never updated
sched = optim.lr_scheduler.StepLR(opt, 10, gamma=1) #gamma 0.1->1

run_all(model, ds_test, ds_train, crit, opt, sched, 10)

epoch 0   train loss 1.783586 acc 0.2140   test loss 1.784622 acc 0.2027
epoch 1   train loss 1.783529 acc 0.2139   test loss 1.784622 acc 0.2027
epoch 2   train loss 1.783530 acc 0.2149   test loss 1.784622 acc 0.2027

KeyboardInterrupt: ignored
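
A useful reference point for these numbers is the loss of a classifier that assigns equal probability to all six emotions: its cross-entropy is $\ln 6$ regardless of the true label. The sketch below computes that value with the standard library only. Comparing it with the losses above is an interpretation, not something the notebook verifies.

```python
import math

n_classes = 6   # sadness, anger, love, surprise, fear, happy

# Cross-entropy of a uniform prediction: -log(1 / n_classes) = log(n_classes)
uniform_loss = -math.log(1.0 / n_classes)
print(round(uniform_loss, 4))   # 1.7918
```

The run with `lr=0` reports losses of about 1.78, close to this value, which is what one would expect from a freshly initialised model whose weights never change.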

In [19]:
model = SentenceModel(len(ds_full.vocab), 32, 64, len(text.Emotion.unique()))
device = torch.device('cuda:0') #added GPU since CPU too slow (enable that in notebook settings)
model.to(device);
crit = nn.CrossEntropyLoss().to(device)
opt = optim.SGD(model.parameters(), lr=0.1) #lr 1.0->0.1
sched = optim.lr_scheduler.StepLR(opt, 1, gamma=1) #step 10->1

run_all(model, ds_test, ds_train, crit, opt, sched, 10)

epoch 0   train loss 1.608390 acc 0.3168   test loss 1.583752 acc 0.3269
epoch 1   train loss 1.606537 acc 0.3162   test loss 1.583432 acc 0.3269
epoch 2   train loss 1.606490 acc 0.3162   test loss 1.583368 acc 0.3269
epoch 3   train loss 1.606478 acc 0.3162   test loss 1.583350 acc 0.3269
epoch 4   train loss 1.606472 acc 0.3161   test loss 1.583339 acc 0.3269
epoch 5   train loss 1.606451 acc 0.3163   test loss 1.583281 acc 0.3269
epoch 6   train loss 1.606380 acc 0.3161   test loss 1.583220 acc 0.3269
epoch 7   train loss 1.606091 acc 0.3169   test loss 1.583020 acc 0.3269
epoch 8   train loss 1.604681 acc 0.3217   test loss 1.579531 acc 0.3288
epoch 9   train loss 1.527158 acc 0.3887   test loss 1.414449 acc 0.4823



TRAINING NOT FINISHED, CONTINUE BELOW

**Notes for future training:**


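*   Will probably need to add dropout to the model to reduce overfitting (the LSTM already applies `drop_prob = 0.5` between its two layers)
*   Increase the number of epochs, since accuracy only started to improve around epochs 8 and 9 of the last run
*   Also, I (Kaleb) accidentally changed 2 variables for some of the training runs, so that will need to be fixed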
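
One more check that may help with the next round: test accuracy sits at exactly 0.3269 for many epochs and across several runs, which would be consistent with the model predicting the same class for every sentence. That is a guess, not something verified here. A sketch of how to compute the majority-class baseline to compare against, using made-up labels (in the notebook they would come from iterating over `ds_test`):

```python
from collections import Counter

# Hypothetical integer labels standing in for the test-set labels.
labels = [0, 5, 5, 1, 5, 2, 0, 5, 4, 3]

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the most frequent class.
baseline_acc = majority_count / len(labels)
print(majority_class, baseline_acc)   # 5 0.4
```

If the baseline on `ds_test` comes out near 0.3269, the early epochs are not learning anything beyond the class frequencies.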

