# LSTM

By

A.Ntoumi & A. Steger (University of Groningen, Language Technology Project 2019-20)

## Importing libraries

In [2]:
import numpy as np
import pandas as pd
import string
import torch
from torch.utils.data import TensorDataset, DataLoader
import random
from pprint import pprint
import statistics
from collections import Counter
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib.pyplot as plt

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

stemmer = SnowballStemmer("dutch")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Loading the dataset

In [6]:
essays_df = pd.read_csv("./data/clean_data.csv")

In [7]:
# show a sample of the loaded dataset
essays_df.sample(3)

Unnamed: 0,user_id,essay,personality,Openness,Conscientiousness,Extroversion,Agreeableness,Neuroticism,clean_essay
200,56610387,"Het is niet enkel algemeen geweten, het is ook...",12-35-70-38-49,0,0,1,0,0,enkel algemen gewet bewez bevind anno sted wei...
413,12570386,Alweer reclame? En het programma is nog maar n...,84-83-64-69-55,1,1,1,1,1,alwer reclam programma net begonn zin twijfel ...
77,10289345,Van welvaartstoename tot psychische neerval\n\...,65-58-48-38-80,1,1,0,0,1,welvaartstoenam psychisch neerval belgie stat ...


In [8]:
essays_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 470 entries, 0 to 469
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   user_id            470 non-null    int64 
 1   essay              470 non-null    object
 2   personality        470 non-null    object
 3   Openness           470 non-null    int64 
 4   Conscientiousness  470 non-null    int64 
 5   Extroversion       470 non-null    int64 
 6   Agreeableness      470 non-null    int64 
 7   Neuroticism        470 non-null    int64 
 8   clean_essay        470 non-null    object
dtypes: int64(6), object(3)
memory usage: 33.2+ KB


The dataset above was already preprocessed in the SVM notebook, so we can use the `clean_essay` column and the five binary trait columns directly to build our LSTM model.

## 2. Feature engineering

### Create dictionaries and encode essays 

In [9]:
# getting the list of tokens we have in the essays corpus
all_words = ' '.join(essays_df.clean_essay.tolist())

#counting the word frequencies in the essay corpus
word_counts = Counter(all_words.split())

# sorted word list according to descending order
word_list = sorted(word_counts, key = word_counts.get, reverse = True)

# creating two dictionaries to map word to index, and map index to word.
word_to_index = {word:idx+1 for idx,word in enumerate(word_list)}
index_to_word = {idx+1:word for idx,word in enumerate(word_list)}

In [10]:
# samples:
print("word_to_index sample dict :\n")
pprint(dict(random.sample(list(word_to_index.items()), 10)))
print("\nindex_to_word sample dict :\n")
pprint(dict(random.sample(list(index_to_word.items()), 10)))

word_to_index sample dict :

{'gaybashingtr': 15325,
 'gebruikt': 161,
 'gerold': 12705,
 'gesabotteeerd': 12554,
 'ieder': 322,
 'losstond': 13829,
 'omschol': 4911,
 'productiev': 15873,
 'verdubbel': 7236,
 'woordbegrip': 5380}

index_to_word sample dict :

{50: 'belangrijk',
 59: 'bijvoorbeeld',
 2519: 'opgegroeid',
 5204: 'representatief',
 6197: 'rub',
 6373: 'stolt',
 9065: 'stramien',
 9196: 'fmri',
 12713: 'prijkt',
 14871: 'begrafeniskost'}


In [11]:
# encoding essays
encoded_essays = [[word_to_index[word] for word in essay.split()] for essay in essays_df['clean_essay']]
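
To make the encoding step concrete, here is a minimal sketch on a toy corpus. The names `toy_corpus`, `toy_word_to_index` and `toy_encoded` are illustrative and not part of the notebook; the logic mirrors the cells above, with indices starting at 1 so that 0 stays free for padding.

```python
from collections import Counter

# two tiny "essays" that are already cleaned and stemmed (illustrative data)
toy_corpus = ["goed boek goed", "goed boek lang"]

# count word frequencies over the whole corpus
toy_counts = Counter(" ".join(toy_corpus).split())

# most frequent word first; index 0 is deliberately left unused
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_word_to_index = {word: idx + 1 for idx, word in enumerate(toy_vocab)}

# replace every word by its index
toy_encoded = [[toy_word_to_index[w] for w in essay.split()] for essay in toy_corpus]
print(toy_word_to_index)
print(toy_encoded)
```

The most frequent word receives index 1, so small indices correspond to common words.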

### Encode labels

In this section we will encode the labels for all the Big Five personality traits.

In [12]:
encoded_OPN_labels = essays_df['Openness'].values
encoded_CON_labels = essays_df['Conscientiousness'].values
encoded_EXT_labels = essays_df['Extroversion'].values
encoded_AGR_labels = essays_df['Agreeableness'].values
encoded_NEU_labels = essays_df['Neuroticism'].values

In [13]:
# Asserting that the number of essays equals the number of labels
assert len(encoded_essays) == len(encoded_EXT_labels),"Number of encoded essays and encoded labels should be the same"

### Padding essays

In this step, we need to make the essays of the same length, so we will use padding.

In [14]:
# printing the 20 largest essay lengths
len_max = ([len(x) for x in encoded_essays])
print(sorted(list(len_max), reverse=True)[:20])

[2196, 1958, 1786, 1735, 1650, 1594, 1493, 1472, 1470, 1463, 1434, 1430, 1420, 1404, 1394, 1344, 1329, 1325, 1325, 1322]


To choose a suitable padding length, we compute the median of all essay lengths.

In [15]:
print("The median of the essays lengths distributions is {}".format(statistics.median(len_max)))

The median of the essays lengths distributions is 362.0


The median essay length is 362 tokens, so a fixed length of 400 should capture most of the content of a typical essay (we also assume that people express how they feel within their first 300 words). With a fixed length of 400, shorter essays are left-padded with zeros and longer ones are truncated to their first 400 tokens.

In [16]:
# function to pad our encoded essays/feature
def pad_features(essays, max_length):
    """
    Returns the encoded essays as an array in which each essay is left-padded with 0's or truncated to max_length
    """
    
    features = []
    
    # pad or truncate each essay
    for idx, row in enumerate(essays):
        if len(row) >= max_length:
            features.append(row[:max_length])
        else:
            features.append(np.concatenate((np.zeros(max_length-len(row)), np.array(row))))
        
    return np.array(features)

In [17]:
# a simple sanity check
test_array = [[1,2,3,4],
    [1,2,3,4,5,6,7,8,9,10]]

# pad the test_array to a maximum size of 8
pad_features(test_array,8)

array([[0., 0., 0., 0., 1., 2., 3., 4.],
       [1., 2., 3., 4., 5., 6., 7., 8.]])
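
Note that `np.zeros` defaults to `float64`, so the padded array holds floats, as the output above shows. If integer indices are preferred, one possible variant is sketched below. The name `pad_features_int` is hypothetical and not part of the notebook; it pre-allocates an integer array and writes each essay into the right-hand side of its row.

```python
import numpy as np

def pad_features_int(essays, max_length):
    """Left-pad with 0 or truncate each essay to max_length, keeping an integer dtype."""
    features = np.zeros((len(essays), max_length), dtype=np.int64)
    for i, row in enumerate(essays):
        trimmed = row[:max_length]
        # guard against empty essays: a slice starting at -0 would select the whole row
        if trimmed:
            features[i, -len(trimmed):] = trimmed
    return features

padded_int = pad_features_int([[1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]], 8)
print(padded_int.dtype)
print(padded_int)
```

The padding and truncation behaviour is the same as in `pad_features`; only the dtype differs.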

In [18]:
# saving under a new variable:
padded_features = pad_features(essays = encoded_essays, max_length = 400)

In [20]:
# checking that the number of padded rows equals the number of essays we passed in
assert len(padded_features) == len(encoded_essays),"Length Mismatch after padding"
assert len(padded_features[0]) == 400

## 3. Data Splitting

In [21]:
# get the total number of essays in the features
total = padded_features.shape[0]
# set the train size to 0.8
train_ratio = 0.8

# we will use 80% of the data for training and the remaining 20% for testing and validation
# the remaining 20% is split in half into a testing set and a validation set
train_idx = int(total*train_ratio)
train_x, remaining_x  = padded_features[:train_idx], padded_features[train_idx:]

# doing the same for labels 
train_y_OPN, remaining_y_OPN = encoded_OPN_labels[:train_idx], encoded_OPN_labels[train_idx:]
train_y_CON, remaining_y_CON = encoded_CON_labels[:train_idx], encoded_CON_labels[train_idx:]
train_y_EXT, remaining_y_EXT = encoded_EXT_labels[:train_idx], encoded_EXT_labels[train_idx:]
train_y_AGR, remaining_y_AGR = encoded_AGR_labels[:train_idx], encoded_AGR_labels[train_idx:]
train_y_NEU, remaining_y_NEU = encoded_NEU_labels[:train_idx], encoded_NEU_labels[train_idx:]


# splitting the remaining 20% to validation and testing
test_idx = int(len(remaining_x)*0.5)
test_x, valid_x  = remaining_x[:test_idx], remaining_x[test_idx:]

# doing the same for labels
test_y_OPN, valid_y_OPN = remaining_y_OPN[:test_idx], remaining_y_OPN[test_idx:]
test_y_CON, valid_y_CON = remaining_y_CON[:test_idx], remaining_y_CON[test_idx:]
test_y_EXT, valid_y_EXT = remaining_y_EXT[:test_idx], remaining_y_EXT[test_idx:]
test_y_AGR, valid_y_AGR = remaining_y_AGR[:test_idx], remaining_y_AGR[test_idx:]
test_y_NEU, valid_y_NEU = remaining_y_NEU[:test_idx], remaining_y_NEU[test_idx:]

In [22]:
# let us see the shape of our training, validation and testing data
print("\t\t\t Features Shape")
print("Train Set:\t\t{}".format(train_x.shape),
     "\nValidation Set:\t\t{}".format(valid_x.shape),
     "\nTesting Set\t\t{}".format(test_x.shape))

			 Features Shape
Train Set:		(376, 400) 
Validation Set:		(47, 400) 
Testing Set		(47, 400)


In [23]:
# let us see the shape of the labels for our training, validation and testing data
print("\t\t\t Label Shape")
print("Train Set:\t\t{}".format(train_y_OPN.shape),
     "\nValidation Set:\t\t{}".format(valid_y_OPN.shape),
     "\nTesting Set\t\t{}".format(test_y_OPN.shape))

			 Label Shape
Train Set:		(376,) 
Validation Set:		(47,) 
Testing Set		(47,)


In [24]:
print('Total data After preprocessing: \nFeatures:{}\nLabels:{}'.format(padded_features.shape, encoded_OPN_labels.shape))

Total data After preprocessing: 
Features:(470, 400)
Labels:(470,)
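
The slicing above takes the essays in file order. If a shuffled split is wanted instead, a minimal sketch could look as follows. The function `split_indices` and its `seed` parameter are assumptions made for illustration, not part of the notebook; the 80/10/10 proportions and the order test-then-validation follow the cell above.

```python
import numpy as np

def split_indices(n_samples, train_ratio=0.8, seed=0):
    """Return shuffled (train, test, valid) index arrays for an 80/10/10-style split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    train_end = int(n_samples * train_ratio)
    # the remainder is split in half: first half for testing, second half for validation
    test_end = train_end + (n_samples - train_end) // 2
    return order[:train_end], order[train_end:test_end], order[test_end:]

train_i, test_i, valid_i = split_indices(470)
print(len(train_i), len(test_i), len(valid_i))
```

The same index arrays can then be applied to the features and to each of the five label arrays, which keeps essays and labels aligned.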


## 4. Data Modeling

### Loading the dataset and batching into DataLoaders

We will use `TensorDataset` and `DataLoader` for this purpose. `TensorDataset` wraps feature and label tensors that share the same first dimension into a single dataset, and `DataLoader` iterates over that dataset in batches of a given size.
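
As a small illustration of how batching behaves, the sketch below builds a loader over toy tensors (`toy_x`, `toy_y` and `toy_loader` are illustrative names). It shows that the final batch is smaller when the dataset size is not a multiple of the batch size.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 10 toy samples with 2 features each, and alternating binary labels
toy_x = torch.arange(20).reshape(10, 2)
toy_y = torch.tensor([0, 1] * 5)

toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=4, shuffle=False)

# each iteration yields one (features, labels) batch
for xb, yb in toy_loader:
    print(tuple(xb.shape), tuple(yb.shape))
```

This matters for a model whose hidden state is created for a fixed batch size: a smaller last batch would not match it. With 376 training essays and a batch size of 47, every training batch is full (376 = 8 × 47).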

#### 4.1 Openness prediction

In [25]:
# create TensorDatasets
train_data_OPN = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y_OPN))
valid_data_OPN = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y_OPN))
test_data_OPN =  TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y_OPN))

# set the batch size
batch_size = 47

In [26]:
# shuffling the data
train_loader_OPN = DataLoader(train_data_OPN, shuffle=True, batch_size=batch_size, drop_last=False)
valid_loader_OPN = DataLoader(valid_data_OPN, shuffle=True, batch_size=batch_size, drop_last=False)
test_loader_OPN = DataLoader(test_data_OPN, shuffle=True, batch_size=batch_size, drop_last=False)

In [27]:
# visualizing a batch of our training data
dataiter = iter(train_loader_OPN)
sample_x, sample_y = next(dataiter)

print('Sample input size:{}'.format(sample_x.size()))
print('Sample Input:\n{}\n'.format(sample_x))
print('Sample Label size:{}'.format(sample_y.size()))
print('Sample Label:\n{}'.format(sample_y))

Sample input size:torch.Size([47, 400])
Sample Input:
tensor([[1.6920e+03, 5.2000e+01, 3.6600e+02,  ..., 5.6100e+02, 5.3000e+01,
         1.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 2.4200e+02, 9.6200e+02,
         6.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 3.5000e+01, 9.7000e+01,
         4.9900e+02],
        ...,
        [1.5100e+02, 5.0000e+00, 3.9700e+02,  ..., 5.8000e+01, 3.0000e+00,
         1.7700e+02],
        [2.2000e+01, 1.0950e+03, 1.0950e+03,  ..., 1.4800e+02, 1.1969e+04,
         2.2000e+01],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 7.2500e+02, 1.0000e+01,
         1.3020e+03]], dtype=torch.float64)

Sample Label size:torch.Size([47])
Sample Label:
tensor([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
        0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0])


### Defining our LSTM model

In [28]:
# checking if GPU is available
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print('Training on GPU')
else:
    print("Training on CPU")

Training on CPU


In [29]:
import torch.nn as nn

class PersonalityLSTM(nn.Module):
    """
    The RNN model that will be used to predict a binary personality trait from an encoded essay
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initializing the model by setting up the layers
        """
        super(PersonalityLSTM, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Defining forward pass function
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] 
        
        # returns last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Creates two new tensors of shape (n_layers, batch_size, hidden_dim),
        # initialized to zero, for the hidden state and cell state of the LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

### Instantiate the network

Next we instantiate the network. First, let us define the hyperparameters:

* `vocab_size`: Size of our vocabulary or the range of values for our input.
* `output_size`: Size of our desired output. Here it is 1: a single sigmoid score that is rounded to 1 if the subject has the trait and to 0 if not.
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. We use 256.
* `n_layers`: Number of LSTM layers in the network. This value is typically between 1 and 3; we use 2.
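
To see how these hyperparameters determine tensor shapes, here is a minimal sketch with deliberately small, made-up sizes (all variable names and values below are illustrative, not the notebook's settings). It follows the same path as the model: embedding, LSTM, then a linear layer and sigmoid applied to the last time step.

```python
import torch
import torch.nn as nn

# small illustrative sizes
batch, seq_len = 3, 5
toy_vocab_size, toy_embedding_dim, toy_hidden_dim, toy_n_layers = 50, 8, 16, 2

embedding = nn.Embedding(toy_vocab_size, toy_embedding_dim)
lstm = nn.LSTM(toy_embedding_dim, toy_hidden_dim, toy_n_layers, batch_first=True)
fc = nn.Linear(toy_hidden_dim, 1)

tokens = torch.randint(0, toy_vocab_size, (batch, seq_len))  # integer word indices
h0 = torch.zeros(toy_n_layers, batch, toy_hidden_dim)        # initial hidden state
c0 = torch.zeros(toy_n_layers, batch, toy_hidden_dim)        # initial cell state

embedded = embedding(tokens)                    # (batch, seq_len, embedding_dim)
lstm_out, (hn, cn) = lstm(embedded, (h0, c0))   # (batch, seq_len, hidden_dim)
last_step = lstm_out[:, -1, :]                  # (batch, hidden_dim)
prob = torch.sigmoid(fc(last_step)).squeeze(1)  # (batch,), values between 0 and 1

print(embedded.shape, lstm_out.shape, prob.shape)
```

Because the sequences are left-padded, the last time step always corresponds to a real token of the essay rather than to padding (for essays shorter than the fixed length).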

In [30]:
# instantiating our model with hyperparameters

vocab_size = len(word_to_index) + 2
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = PersonalityLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

PersonalityLSTM(
  (embedding): Embedding(20461, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


### Training the LSTM model

In [31]:
# set the learning rate to 0.001
lr = 0.001

# loss and optimization functions
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# path where we save the weights with the lowest validation loss
save_OPN_path = './models/best_validation_OPN.pt'

In [32]:
# training parameters
def train(model, criterion, optimizer, train_loader, valid_loader, batch_size, train_on_gpu, save_path):
    valid_loss_min = np.Inf
        
    epochs = 4 

    counter = 0
    print_every = 10
    clip=5 # gradient clipping to avoid exploding gradients

    # move model to GPU, if available
    if(train_on_gpu):
        model.cuda()

    model.train()
    # train for some number of epochs
    for e in range(epochs):
        # initialize hidden state
        h = model.init_hidden(batch_size)

        # batch loop
        for inputs, labels in train_loader:
            counter += 1
            if(train_on_gpu):
                inputs, labels = inputs.cuda(), labels.cuda()

            # Creating new variables for the hidden state to
            # avoid backpropagation through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            model.zero_grad()

            # get the output from the model
            output, h = model(inputs, h)

            # calculate the loss and perform backpropagation
            loss = criterion(output.squeeze(), labels.float())
            loss.backward()

            nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = model.init_hidden(batch_size)
                val_losses = []
                model.eval()
                
                for inputs, labels in valid_loader:

                # Creating new variables for the hidden state to
                # avoid backpropagation through the entire training history
                    val_h = tuple([each.data for each in val_h])

                    if(train_on_gpu):
                        inputs, labels = inputs.cuda(), labels.cuda()

                    output, val_h = model(inputs, val_h)
                    val_loss = criterion(output.squeeze(), labels.float())

                    val_losses.append(val_loss.item())

                model.train()

                # saving the model with the lowest validation loss so far
                if np.mean(val_losses) <= valid_loss_min:
                    print('Validation loss decreased ({:.6f} ---------> {:.6f}).\t Saving model...'.
                          format(valid_loss_min, np.mean(val_losses)))
                    torch.save(model.state_dict(), save_path)
                    valid_loss_min = np.mean(val_losses)

                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.6f}...".format(loss.item()),
                      "Val Loss: {:.6f}".format(np.mean(val_losses)))

In [33]:
# training the model
train(model=net, criterion=criterion, optimizer=optimizer, train_loader=train_loader_OPN, valid_loader=valid_loader_OPN, batch_size=batch_size, train_on_gpu=train_on_gpu, save_path=save_OPN_path)

Validation loss decreased (inf ---------> 0.694335).	 Saving model...
Epoch: 2/4... Step: 10... Loss: 0.663769... Val Loss: 0.694335
Epoch: 3/4... Step: 20... Loss: 0.450650... Val Loss: 0.697766
Epoch: 4/4... Step: 30... Loss: 0.216449... Val Loss: 0.954250
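
One way to read these loss values is against a simple baseline. Binary cross-entropy, which `nn.BCELoss` averages over a batch, equals ln 2 ≈ 0.693 for a model that outputs 0.5 for every essay. The plain-Python restatement below is only a sketch (the function name `binary_cross_entropy` is hypothetical) to make that reference point explicit.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy of predicted probabilities against 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# a "model" that always predicts 0.5, whatever the label
chance_loss = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(chance_loss, math.log(2))
```

A validation loss close to 0.693 therefore suggests predictions that are not far from uninformative, while a falling training loss combined with a rising validation loss is the usual sign of overfitting.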


In [34]:
# load the model that got the lowest validation loss
net.load_state_dict(torch.load(save_OPN_path))

<All keys matched successfully>

### Testing the trained model

In [35]:
def test(model, criterion, test_loader, batch_size, train_on_gpu):
    """
    Evaluates the given model on the given dataset and prints its mean loss and accuracy
    """
    test_losses = [] # track loss
    num_correct = 0
    
    # initial hidden state
    h = model.init_hidden(batch_size)
    model.eval() # evaluation mode: disables dropout
    
    #iterating over test data
    for inputs, labels in test_loader:
        # Creating new variables for the hidden state to
        # avoid backpropagation through the entire training history
        h = tuple([each.data for each in h])
        
        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()
            
        #get predicted outputs
        output, h = model(inputs, h)
        
        #calculate loss
        test_loss = criterion(output.squeeze(), labels.float())
        test_losses.append(test_loss.item())
        
        # convert the output probabilities to predicted class( 0 or 1)
        pred = torch.round(output.squeeze()) # rounds to nearest integer
        
        # compare prediction to true label
        correct_tensor = pred.eq(labels.float().view_as(pred))
        correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)
    
    # printing stats
    print('Test loss: {:.3f}'.format(np.mean(test_losses)))
    
    # accuracy over all test data, as a fraction in [0, 1]
    test_acc = num_correct/len(test_loader.dataset)
    print("Test accuracy: {:.3f} %".format(test_acc))

In [36]:
# testing performance of our model
test(model=net, criterion=criterion, test_loader=test_loader_OPN, batch_size=batch_size, train_on_gpu=train_on_gpu)

Test loss: 0.688
Test accuracy: 0.574 %
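
The accuracy printed above is a fraction between 0 and 1, despite the trailing `%`. The sketch below restates the computation in plain Python; the function name `accuracy_from_probs` and the toy values are illustrative only.

```python
def accuracy_from_probs(probs, labels):
    """Fraction of predictions that match the labels after thresholding at 0.5."""
    # torch.round rounds half to even, so a probability of exactly 0.5 becomes 0
    preds = [1 if p > 0.5 else 0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, labels) if pred == y)
    return correct / len(labels)

toy_probs = [0.9, 0.2, 0.6, 0.4]
toy_labels = [1, 0, 0, 0]
toy_acc = accuracy_from_probs(toy_probs, toy_labels)
print(toy_acc)
```

With 47 test essays, a printed value of 0.574 is consistent with 27 correct predictions (27/47 ≈ 0.574), that is, about 57.4%.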


#### 4.2 Conscientiousness prediction

In [57]:
train_data_CON = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y_CON))
valid_data_CON = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y_CON))
test_data_CON =  TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y_CON))

In [58]:
train_loader_CON = DataLoader(train_data_CON, shuffle=True, batch_size=batch_size, drop_last=False)
valid_loader_CON = DataLoader(valid_data_CON, shuffle=True, batch_size=batch_size, drop_last=False)
test_loader_CON = DataLoader(test_data_CON, shuffle=True, batch_size=batch_size, drop_last=False)

In [59]:
dataiter = iter(train_loader_CON)
sample_x, sample_y = next(dataiter)

print('Sample input size:{}'.format(sample_x.size()))
print('Sample Input:\n{}\n'.format(sample_x))
print('Sample Label size:{}'.format(sample_y.size()))
print('Sample Label:\n{}'.format(sample_y))

Sample input size:torch.Size([47, 400])
Sample Input:
tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 1.6920e+03, 4.4850e+03,
         3.4990e+03],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 1.0000e+00, 2.9400e+02,
         1.7400e+02],
        [3.4000e+01, 6.1000e+01, 2.2270e+03,  ..., 3.1000e+01, 4.8900e+02,
         2.7000e+01],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 3.5700e+02, 1.0200e+02,
         4.1300e+02],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 1.1680e+03, 9.0000e+00,
         4.5740e+03],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 1.0000e+00, 5.0000e+00,
         1.6120e+03]], dtype=torch.float64)

Sample Label size:torch.Size([47])
Sample Label:
tensor([0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,
        1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0])


In [60]:
save_CON_path = './models/best_validation_CON.pt'

In [62]:
train(model=net, criterion=criterion, optimizer=optimizer, train_loader=train_loader_CON, valid_loader=valid_loader_CON, batch_size=batch_size, train_on_gpu=train_on_gpu, save_path=save_CON_path)

Validation loss decreased (inf ---------> 0.707953).	 Saving model...
Epoch: 2/4... Step: 10... Loss: 0.701135... Val Loss: 0.707953
Epoch: 3/4... Step: 20... Loss: 0.518467... Val Loss: 0.739995
Epoch: 4/4... Step: 30... Loss: 0.301660... Val Loss: 0.915529


In [64]:
net.load_state_dict(torch.load(save_CON_path))

<All keys matched successfully>

In [65]:
test(model=net, criterion=criterion, test_loader=test_loader_CON, batch_size=batch_size, train_on_gpu=train_on_gpu)

Test loss: 0.701
Test accuracy: 0.468 %


#### 4.3 Extroversion prediction

In [66]:
train_data_EXT = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y_EXT))
valid_data_EXT = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y_EXT))
test_data_EXT =  TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y_EXT))

In [67]:
train_loader_EXT = DataLoader(train_data_EXT, shuffle=True, batch_size=batch_size, drop_last=False)
valid_loader_EXT = DataLoader(valid_data_EXT, shuffle=True, batch_size=batch_size, drop_last=False)
test_loader_EXT = DataLoader(test_data_EXT, shuffle=True, batch_size=batch_size, drop_last=False)

In [68]:
dataiter = iter(train_loader_EXT)
sample_x, sample_y = next(dataiter)

print('Sample input size:{}'.format(sample_x.size()))
print('Sample Input:\n{}\n'.format(sample_x))
print('Sample Label size:{}'.format(sample_y.size()))
print('Sample Label:\n{}'.format(sample_y))

Sample input size:torch.Size([47, 400])
Sample Input:
tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 8.9000e+01, 2.8310e+03,
         5.3000e+01],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 3.0000e+01, 4.4800e+02,
         3.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 2.3100e+02, 1.4550e+03,
         3.1140e+03],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 2.4200e+02, 9.6200e+02,
         6.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 1.4004e+04, 1.5610e+03,
         9.3800e+03],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 3.3000e+01, 2.2800e+02,
         1.1200e+02]], dtype=torch.float64)

Sample Label size:torch.Size([47])
Sample Label:
tensor([0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,
        1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])


In [69]:
save_EXT_path = './models/best_validation_EXT.pt'

In [70]:
train(model=net, criterion=criterion, optimizer=optimizer, train_loader=train_loader_EXT, valid_loader=valid_loader_EXT, batch_size=batch_size, train_on_gpu=train_on_gpu, save_path=save_EXT_path)

Validation loss decreased (inf ---------> 0.721867).	 Saving model...
Epoch: 2/4... Step: 10... Loss: 0.662397... Val Loss: 0.721867
Epoch: 3/4... Step: 20... Loss: 0.452118... Val Loss: 0.896608
Epoch: 4/4... Step: 30... Loss: 0.236388... Val Loss: 1.450716


In [71]:
net.load_state_dict(torch.load(save_EXT_path))

<All keys matched successfully>

In [72]:
test(model=net, criterion=criterion, test_loader=test_loader_EXT, batch_size=batch_size, train_on_gpu=train_on_gpu)

Test loss: 0.697
Test accuracy: 0.553 %


#### 4.4 Agreeableness prediction

In [73]:
train_data_AGR = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y_AGR))
valid_data_AGR = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y_AGR))
test_data_AGR =  TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y_AGR))

In [74]:
train_loader_AGR = DataLoader(train_data_AGR, shuffle=True, batch_size=batch_size, drop_last=False)
valid_loader_AGR = DataLoader(valid_data_AGR, shuffle=True, batch_size=batch_size, drop_last=False)
test_loader_AGR = DataLoader(test_data_AGR, shuffle=True, batch_size=batch_size, drop_last=False)

In [75]:
dataiter = iter(train_loader_AGR)
sample_x, sample_y = next(dataiter)

print('Sample input size:{}'.format(sample_x.size()))
print('Sample Input:\n{}\n'.format(sample_x))
print('Sample Label size:{}'.format(sample_y.size()))
print('Sample Label:\n{}'.format(sample_y))

Sample input size:torch.Size([47, 400])
Sample Input:
tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 7.3200e+02, 1.2050e+03,
         1.1080e+03],
        [1.0000e+00, 8.8000e+01, 6.1000e+01,  ..., 1.1520e+03, 1.1520e+03,
         2.4200e+02],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 2.9000e+01, 3.4990e+03,
         7.6300e+02],
        ...,
        [1.7995e+04, 7.0000e+00, 1.2200e+02,  ..., 6.5300e+02, 7.2000e+02,
         4.3100e+02],
        [1.6550e+03, 8.3000e+01, 1.8900e+02,  ..., 9.7800e+02, 1.5740e+03,
         3.2100e+02],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 3.9900e+02, 1.7430e+03,
         6.9000e+01]], dtype=torch.float64)

Sample Label size:torch.Size([47])
Sample Label:
tensor([0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1])


In [76]:
save_AGR_path = './models/best_validation_AGR.pt'

In [77]:
train(model=net, criterion=criterion, optimizer=optimizer, train_loader=train_loader_AGR, valid_loader=valid_loader_AGR, batch_size=batch_size, train_on_gpu=train_on_gpu, save_path=save_AGR_path)

Validation loss decreased (inf ---------> 0.736247).	 Saving model...
Epoch: 2/4... Step: 10... Loss: 0.532577... Val Loss: 0.736247
Epoch: 3/4... Step: 20... Loss: 0.345809... Val Loss: 0.826437
Epoch: 4/4... Step: 30... Loss: 0.170313... Val Loss: 1.058094


In [78]:
net.load_state_dict(torch.load(save_AGR_path))

<All keys matched successfully>

In [79]:
test(model=net, criterion=criterion, test_loader=test_loader_AGR, batch_size=batch_size, train_on_gpu=train_on_gpu)

Test loss: 0.666
Test accuracy: 0.617 %


#### 4.5 Neuroticism prediction

In [80]:
train_data_NEU = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y_NEU))
valid_data_NEU = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y_NEU))
test_data_NEU =  TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y_NEU))

In [81]:
train_loader_NEU = DataLoader(train_data_NEU, shuffle=True, batch_size=batch_size, drop_last=False)
valid_loader_NEU = DataLoader(valid_data_NEU, shuffle=True, batch_size=batch_size, drop_last=False)
test_loader_NEU = DataLoader(test_data_NEU, shuffle=True, batch_size=batch_size, drop_last=False)

In [82]:
dataiter = iter(train_loader_NEU)
sample_x, sample_y = next(dataiter)

print('Sample input size:{}'.format(sample_x.size()))
print('Sample Input:\n{}\n'.format(sample_x))
print('Sample Label size:{}'.format(sample_y.size()))
print('Sample Label:\n{}'.format(sample_y))

Sample input size:torch.Size([47, 400])
Sample Input:
tensor([[5.5800e+02, 1.0000e+00, 4.2270e+03,  ..., 2.0000e+02, 4.3900e+02,
         4.3290e+03],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 5.1200e+02, 5.2000e+01,
         4.1000e+02],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 3.0000e+00, 1.6740e+03,
         6.4800e+02],
        ...,
        [3.1000e+01, 2.7800e+02, 5.7900e+02,  ..., 4.0160e+03, 3.0890e+03,
         5.5810e+03],
        [4.3600e+02, 5.2700e+02, 5.3000e+01,  ..., 1.1790e+03, 1.0050e+03,
         5.0930e+03],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 2.4510e+03, 7.2000e+01,
         2.3500e+02]], dtype=torch.float64)

Sample Label size:torch.Size([47])
Sample Label:
tensor([1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,
        0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1])


In [83]:
save_NEU_path = './models/best_validation_NEU.pt'

In [84]:
train(model=net, criterion=criterion, optimizer=optimizer, train_loader=train_loader_NEU, valid_loader=valid_loader_NEU, batch_size=batch_size, train_on_gpu=train_on_gpu, save_path=save_NEU_path)

Validation loss decreased (inf ---------> 0.695801).	 Saving model...
Epoch: 2/4... Step: 10... Loss: 0.696556... Val Loss: 0.695801
Validation loss decreased (0.695801 ---------> 0.677939).	 Saving model...
Epoch: 3/4... Step: 20... Loss: 0.627210... Val Loss: 0.677939
Validation loss decreased (0.677939 ---------> 0.676995).	 Saving model...
Epoch: 4/4... Step: 30... Loss: 0.459556... Val Loss: 0.676995


In [86]:
net.load_state_dict(torch.load(save_NEU_path))

<All keys matched successfully>

In [87]:
test(model=net, criterion=criterion, test_loader=test_loader_NEU, batch_size=batch_size, train_on_gpu=train_on_gpu)

Test loss: 0.726
Test accuracy: 0.489 %


Based on the test accuracies above, the LSTM outperforms the SVM from the previous notebook on three of the five personality traits: Openness, Extroversion and Agreeableness. The SVM reached higher scores for Conscientiousness and Neuroticism.

## 5. Combining the five trained models

### Building a pipeline that takes an essay and predicts the personalities

* First, we will build a function that preprocesses a given essay. It will perform tokenization, cleaning, stopword removal and, finally, pad the essay.

* Second, we will prepare a function that takes an essay and outputs a dictionary of personality traits.

In [88]:
# get stopwords list
stoplist = stopwords.words('dutch') 
# get list of punctuations
punctuations = string.punctuation + "’¶•@°©®™"

def preprocess_text(text):
    """
    This function preprocess a given raw text by removing the urls, mentions,
    punctuations, stop words, numbers, emojis etc.
    
    @param text string
    @return text string
    """
        
    # string to lowercase
    txt = text.lower()
    
    # keep only letters (including accented Latin-1 letters); map the rest to spaces
    txt = re.sub(r"[^a-zA-ZÀ-ÿ]", " ", txt)
    
    # punctuation removal and map it to space
    translator = str.maketrans(punctuations, " "*len(punctuations))
    s = txt.translate(translator)
    
    # remove digits 
    no_digits = ''.join([i for i in s if not i.isdigit()])
    cleaner = " ".join(no_digits.split())
    
    # tokenize and remove stop words
    word_tokens = word_tokenize(cleaner)
    filtered_sentence = [w for w in word_tokens if w not in stoplist]
    filtered_sentence = " ".join(filtered_sentence)
    
    # stem each remaining token
    filtered_sentence = [stemmer.stem(word) for word in word_tokenize(filtered_sentence)]
    filtered_sentence = " ".join(filtered_sentence)
    
    # encode the essay using the word list generated earlier; unknown words are dropped
    encoded_review = [word_to_index[word] for word in filtered_sentence.split() if word in word_to_index]
    
    return encoded_review

In [146]:
test_essay = """
Geen geld terug bij teerlongen
Een patiënt met longkanker krijgt vaak hoge ziekenhuisrekeningen voorgeschoteld. Gelukkig kan hij, in een land als België, rekenen op een medische terugbetaling. Maar achteraf blijkt dat de patiënt een roker is. Sommigen vinden het oneerlijk dat deze mensen ook recht hebben op medische terugbetaling. Dezelfde opinie heerst bij mensen over alchoholici die lijden aan levercirrose. Hebben deze mensen gelijk of prediken zij onzin? 
Uit een enquëte bij Belgische artsen van de Vlekho Business School in samenwerking met de Artsenkrant, blijkt dat drie op tien artsen de terugbetaling van rokers met longkanker overbodig vindt. Hierbij denkt ook ongeveer een kwart van de artsen hetzelfde over alcholici met levercirrose. Een significant aantal Belgische artsen staat blijkbaar onverschillig ten opzichte van rokers en alcoholici. De enquëte legde ook een ander voorbeeld voor. De artsen moesten bepalen of een bromfietser die zonder helm valt en daardoor blind wordt, recht heeft op medische terugbetaling. Een kwart van de artsen vond van niet. 
Uit deze statistieken kunnen we afleiden dat er geen overkoepelende mening heerst bij Belgische artsen. Alhoewel de meerderheid vindt dat deze patiënten nog steeds recht hebben op medische terugbetaling, is de tegenstand zeker niet klein. Om een duidelijker beeld te geven van de situatie, kunnen we best de argumenten van beide standpunten vergelijken. 
Bij rokers met longkanker denken de meesten meteen dat de longkanker veroorzaakt werd door de tabak. Roken kan zonder twijfel longkanker veroorzaken, maar in alle gevallen van longkanker is roken zeker niet de oorzaak. Zo kan bij een rokende longkankerpatiënt zijn kanker niet veroorzaakt zijn door zijn verslaving. Oordelen of deze patiënt terugbetaling verdient of niet wordt dan heel moeilijk. Hierbij is het ook moeilijk oordelen wanneer je een roker bent en wanneer niet. Dit zijn feiten die eerst grondig bepaald en onderzocht moeten worden. Bij alcoholici met levercirrose wordt het zelfs nog moeilijker. In België consumeert een meerderheid alcohol, maar wanneer ben je een alcoholicus? Daarbovenop kan levercirrose ook andere oorzaken hebben en kan je, zelfs bij een alcoholicus, nooit met 100% zekerheid verklaren dat alcohol de levercirrose heeft veroorzaakt. 
De artsen die daarentegen vinden dat deze mensen geen terugbetaling verdienen, vinden dat rokers en alcoholici deze aandoeningen aan hun eigen te danken hebben. Ze zijn tenslotte zelf begonnen met roken en drinken. Ze vinden het oneerlijk tegenover de andere longkanker- of levercirrosepatiënten. Tegenwoordig staan op alle pakjes sigaretten waarschuwingen over de risico's van roken. Rokers kunnen zeker niet verklaren dat ze de gevolgen van roken niet kenden. Ze gebruiken het ook op eigen risico. Bij alcohol is het weer een ander verhaal. Op de verpakking van alcoholische dranken staan nergens waarschuwingen over leveraandoeningen. Iemand kan daarmee alcohol consumeren zonder bewust te zijn van de gevolgen. 
Iemand definiëren als een roker of alcoholicus is heel moeilijk. Vooraleer ziekenfondsen geen medische terugbetalingen meer moeten geven aan rokende longkankerpatiënten of aloholici met een leveraandoening, moeten de begrippen 'alcoholicus' en 'roker' duidelijk gedefinieerd worden. Dit is onmogelijk en zou voor een nachtmerrie van berekeningen en papierwerk zorgen. Het is daarmee juridisch onmogelijk om een stop te zetten op de terugbetaling van bijvoorbeeld rokende longkankerpatiënten. 
Uiteindelijk zien we dat het onmogelijk is om deze specifieke patiënten niet meer terug te betalen. Een arts kan nooit met 100% zekerheid bepalen of de ziekte van de patiënt echt veroorzaakt is door zijn verslaving. Ook iemand bestempelen als roker of alcoholist is moeilijk, omdat er geen parameters bestaan waarbinnen iemand een roker of alcoholist is. Realistisch gezien is het onmogelijk om deze medische terugbetalingen stop te zetten, waardoor rokers en alcoholisten uiteindelijk als de gelukkigen uit de bus komen. 
"""
preprocessed_text = preprocess_text(test_essay)
print(pad_features([preprocessed_text], 400))

[[0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 9.7000e+01 2.6500e+02 1.

In [147]:
def get_predictions_sub_classes(pred):
    """
    Return the personality sub class for a given prediction in [0, 1].
    The range is split into five buckets of width 0.2.
    """
    # chained upper bounds; a prediction of exactly 0 now maps to "very low"
    if pred <= 0.2:
        return "very low"
    elif pred <= 0.4:
        return "low"
    elif pred <= 0.6:
        return "medium"
    elif pred <= 0.8:
        return "high"
    else:
        return "very high"
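The same bucketing can also be written as a table lookup. The sketch below is an alternative formulation, not the notebook's code: `BOUNDS`, `LABELS` and `bucket` are hypothetical names, and it treats each bucket as a half-open interval (lower, upper], so a prediction at or below 0 falls into "very low".

```python
from bisect import bisect_left

# Upper bounds of the first four buckets; anything above 0.8 is "very high".
BOUNDS = [0.2, 0.4, 0.6, 0.8]
LABELS = ["very low", "low", "medium", "high", "very high"]

def bucket(pred):
    """Return the label of the bucket (lower, upper] that contains pred."""
    return LABELS[bisect_left(BOUNDS, pred)]

print(bucket(0.2483))  # low
print(bucket(0.5231))  # medium
```

Keeping the thresholds in one list makes it easier to change the number of sub classes without rewriting a chain of conditions.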

In [148]:
def get_personality_prediction(model_path, feature_tensor):
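    """
    Load the trained model stored at `model_path` and return the rounded
    prediction for `feature_tensor` together with its sub class.
    """
    # Build a fresh model from the main LSTM model class `PersonalityLSTM`
    model = PersonalityLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
    
    # Load the trained weights for this trait from the .pt file
    model.load_state_dict(torch.load(model_path))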
    model.eval()
    
    batch_size = feature_tensor.size(0)
    
    # initialize the hidden state
    h = model.init_hidden(batch_size)
    
    # get the output from the model
    output, h = model(feature_tensor, h)
    pred_value = output[0].item()
    
    return (round(pred_value, 4), get_predictions_sub_classes(pred_value))

In [149]:
def predict_five_personality_traits(essay):
    """
    Return the personality traits predicted by our models for the given essay,
    as a dictionary mapping each trait to a (score, sub class) tuple.
    """
    
    # clean, tokenize and encode the essay using our `preprocess_text` function
    essay = preprocess_text(essay)
    
    # padding
    features = pad_features([essay], 400)

    # convert this numpy array to tensor
    feature_tensor = torch.from_numpy(features)
    
    # get the predictions
    OPN_value = get_personality_prediction(save_OPN_path, feature_tensor)
    CON_value = get_personality_prediction(save_CON_path, feature_tensor)
    EXT_value = get_personality_prediction(save_EXT_path, feature_tensor)
    AGR_value = get_personality_prediction(save_AGR_path, feature_tensor)
    NEU_value = get_personality_prediction(save_NEU_path, feature_tensor)
    
    # build the final dictionary with prediction
    final_prediction = {
        "Openness": OPN_value,
        "Conscientiousness": CON_value,
        "Extroversion": EXT_value,
        "Agreeableness": AGR_value,
        "Neuroticism": NEU_value
    }
    
    return final_prediction

In [150]:
predict_five_personality_traits(test_essay)

{'Openness': (0.5231, 'medium'),
 'Conscientiousness': (0.5378, 'medium'),
 'Extroversion': (0.5487, 'medium'),
 'Agreeableness': (0.2483, 'low'),
 'Neuroticism': (0.5338, 'medium')}
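The five near-identical calls in `predict_five_personality_traits` could also be driven by a mapping from trait name to checkpoint path. The following is a minimal sketch under stated assumptions: `predict_all_traits`, `stub_predict` and the paths in `model_paths` are hypothetical, and the stub only stands in for `get_personality_prediction` so that the example runs on its own.

```python
def predict_all_traits(model_paths, predict_fn, features):
    """Call predict_fn(path, features) once per trait and collect the results."""
    return {trait: predict_fn(path, features) for trait, path in model_paths.items()}

# Hypothetical checkpoint paths, for illustration only.
model_paths = {
    "Openness": "path/to/openness.pt",
    "Neuroticism": "path/to/neuroticism.pt",
}

def stub_predict(path, features):
    # A real predictor would load the checkpoint at `path` and run the model.
    return (0.5, "medium")

result = predict_all_traits(model_paths, stub_predict, features=None)
print(result)
```

With this shape, adding or removing a trait only means editing the mapping, and the result dictionary keeps the order of its keys.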