# Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm

by Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, Sune Lehmann

Reference: [EMNLP 2017 Paper](http://aclweb.org/anthology/D/D17/D17-1169.pdf)

## Abstract

Many NLP tasks are limited by the scarcity of annotated data, since manual labeling is time-consuming and sometimes introduces misinterpretations. Co-occurring emotional expressions such as emoji and hashtags can instead be used as noisy labels for distant supervision. The DeepMoji project extends distant supervision to a more diverse set of noisy labels, so that human emotions can be predicted beyond a one-dimensional sentiment measure by mapping a given sentence onto different emoji. In the paper, the authors train a model to predict which emoji fits a given sentence and, through that task, to learn the emotional content of the sentence. Comparisons on the experimental datasets show that including diverse emotional labels improves performance over previous distant supervision approaches. The model was pretrained on a dataset of 1246 million tweets, each containing one of 64 common emoji, and evaluated on 8 benchmark datasets. In a comparison with human raters, a random MTurk rater reaches 76.1% agreement while the DeepMoji model reaches 82.4%, which means the model is better at capturing the average human sentiment rating than a single MTurk rater.

## Key Ideas
The model is a modified two-layer Long Short-Term Memory (LSTM) network that is pretrained on emoji prediction. First, the Twitter data is preprocessed: tweets with URLs are filtered out, and the text is tokenized and normalized by shortening repeated characters. Second, two bidirectional LSTM layers are combined with an attention layer that takes the outputs of all earlier layers as input. Finally, the researchers adopt the transfer learning approach ‘chain-thaw’, which sequentially unfreezes and fine-tunes one layer at a time. Training each layer separately lets the model adjust individual patterns while reducing the risk of overfitting.
### Data Preprocessing
Since the collected tweets are noisy, the researchers first preprocess them to improve performance:
1.	URL filtering: Only tweets without URLs are used, based on the hypothesis that the content behind a URL is likely to be important for understanding the emotional content of the tweet. Emoji in tweets with URLs are therefore more likely to be noisy labels.
2.	Tokenization: To generalize better, all tweets are tokenized word by word. Words with repeated characters are shortened, so that elongated variants of a word are treated the same.
3.	Repetition: For a tweet that contains several different emoji, or the same emoji multiple times, a separate copy of the tweet is saved for each unique emoji type, regardless of how many times that emoji appears. This keeps the pretraining task a single-label classification while still capturing the multiple types of emotional content.
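The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the three-emoji label set, the whitespace tokenizer, the `max_run=2` threshold and the helper names `shorten_repeats` and `make_examples` are all hypothetical and are not taken from the DeepMoji code base.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Hypothetical three-emoji label set for illustration; the paper uses 64 emoji.
EMOJI = ["😂", "😭", "😍"]


def shorten_repeats(token, max_run=2):
    """Collapse runs of one character longer than max_run ('soooo' -> 'soo')."""
    return re.sub(r"(.)\1{%d,}" % max_run, r"\1" * max_run, token)


def make_examples(tweet):
    """Return (tokens, emoji) pairs for one tweet, or [] if it contains a URL."""
    # Step 1: drop tweets with URLs, since their emoji are likely noisy labels.
    if URL_RE.search(tweet):
        return []
    # Step 3: one label per unique emoji type, however often it appears.
    labels = [e for e in EMOJI if e in tweet]
    # Assumption: the emoji are removed from the text before tokenizing.
    text = tweet
    for e in EMOJI:
        text = text.replace(e, " ")
    # Step 2: naive whitespace tokenization plus shortening of repeats.
    tokens = [shorten_repeats(t.lower()) for t in text.split()]
    return [(tokens, e) for e in labels]


for tokens, label in make_examples("Sooooo happy 😂😂 😍"):
    print(tokens, label)
```

A tweet with two emoji types yields two training examples that share the same tokens, which is how the task stays single-label.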


### Model generation
We use a variant of the Long Short-Term Memory (LSTM) model. The DeepMoji model uses an embedding layer of 256 dimensions with a hyperbolic tangent activation function to constrain each embedding dimension to the range [−1, 1]. Two bidirectional LSTM layers with 1024 hidden units each (512 in each direction) then capture the context of each word.

Bidirectional LSTMs are supported in Keras via the `Bidirectional` layer wrapper:

`BLSTM = Bidirectional(LSTM(encoder_units, return_sequences=True))`

The parameter `encoder_units` sets the number of hidden units (the output dimensionality) of the LSTM in each direction. We use `return_sequences=True` because we want access to the complete encoded sequence rather than only the final summary state. We can define two bidirectional LSTM layers with 512 hidden units per direction and concatenate their outputs with the embedding:
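```python
from keras.layers import LSTM, Bidirectional, concatenate

# x is the tanh-activated embedding output with shape (batch, maxlen, 256)
lstm_0_output = Bidirectional(LSTM(512, return_sequences=True), name="bi_lstm_0")(x)
lstm_1_output = Bidirectional(LSTM(512, return_sequences=True), name="bi_lstm_1")(lstm_0_output)
# Concatenate both LSTM outputs with the embedding so later layers can access every level
x = concatenate([lstm_1_output, lstm_0_output, x])
```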
But this model has two potential issues:
1. The features learned by the last LSTM layer might be too complex for the new transfer learning task, so it is better to include the outputs of earlier layers, or at least allow access to them.
2. When the model is used in a new domain, the understanding of a specific word, as given by its embedding in vector space, may need to be updated. However, datasets for new domains may be quite small, so simply training the entire model with its 22.4M parameters will quickly cause overfitting.

The first problem is addressed by adding a simple attention layer to the LSTM model that takes all prior layers as input, which gives the Softmax layer easy access to any previous time step at any layer of the architecture. The second issue is addressed by the proposed ‘chain-thaw’ fine-tuning procedure, which iteratively unfreezes part of the network and trains it.
<img src="image/layers.png" width="450">
As shown in the image above, the procedure starts by training any new layers, then fine-tunes each layer individually from the first to the last, and finally trains the entire model.
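To make the order of the stages concrete, here is a minimal sketch that only computes which layers are trainable at each chain-thaw stage. It does not train anything, and it is not the library's implementation: the function name `chain_thaw_schedule` and the layer names `attention` and `softmax` are hypothetical, chosen for illustration.

```python
def chain_thaw_schedule(pretrained_layers, new_layers):
    """Yield, for each training stage, the list of layer names left trainable."""
    # Stage 1: train only the newly added layers (for example a fresh softmax).
    yield list(new_layers)
    # Stage 2: fine-tune each pretrained layer on its own, first to last.
    for name in pretrained_layers:
        yield [name]
    # Stage 3: train the whole model with every layer unfrozen.
    yield list(pretrained_layers) + list(new_layers)


# Layer names are illustrative; everything not listed in a stage stays frozen.
layers = ["embedding", "bi_lstm_0", "bi_lstm_1", "attention"]
stages = list(chain_thaw_schedule(layers, ["softmax"]))
for i, trainable in enumerate(stages, 1):
    print("stage", i, "->", trainable)
```

In a real training loop, each stage would presumably freeze every other layer, train until the validation loss stops improving, and then move on to the next stage.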

The overall fine-tuning workflow is outlined below (a simplified outline in which `...` stands for loading your own data):
```python
from deepmoji import SentenceTokenizer, finetune_chainthaw, define_deepmoji
vocab_path = '..'
pretrained_path = '..'
maxlen = 100
nb_classes = 2
# Load your dataset into two Python arrays, 'texts' and 'labels'
...
# Splits the dataset into train/val/test sets. Then tokenizes each text into separate words and convert them to our vocabulary.
st = SentenceTokenizer(vocab_path, maxlen)
split_texts, split_labels = st.split_train_val_test(texts, labels)
# Defines the DeepMoji model and loads the pretrained weights stored in hdf5 files
model = define_deepmoji(nb_classes, maxlen, pretrained_path)
# Finetunes the model using our chain-thaw approach and evaluates it
model, acc = finetune_chainthaw(model, split_texts, split_labels)
print("Accuracy: {}".format(acc))
```

## Running sample

First, import the dependencies and define a few test sentences:

    from __future__ import print_function, division
    import json
    import numpy as np
    from deepmoji.sentence_tokenizer import SentenceTokenizer
    from deepmoji.model_def import deepmoji_emojis
    from deepmoji.global_variables import PRETRAINED_PATH, VOCAB_PATH
    TEST_SENTENCES = [u'This movie can be so much better if they put more efforts',
                      u'You think it is funny, huh?',
                      u'Amazing, my flight gets delayed again']

Output:

    Using TensorFlow backend.

Next, define a helper that returns the indices of the `k` largest values, together with the maximum sequence length and the batch size:

    def top_elements(array, k):
        ind = np.argpartition(array, -k)[-k:]
        return ind[np.argsort(array[ind])][::-1]
    maxlen = 30
    batch_size = 32

Load the vocabulary and tokenize the sentences:

    with open(VOCAB_PATH, 'r') as f:
        vocabulary = json.load(f)
    st = SentenceTokenizer(vocabulary, maxlen)
    tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)

Load the pretrained model and print its architecture:

    model = deepmoji_emojis(maxlen, PRETRAINED_PATH)
    model.summary()

The summary begins as follows (the remaining rows are cut off):

    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to
    input_1 (InputLayer)            (None, 30)           0
    __________________________________________________________________________________________________
    embedding (Embedding)           (None, 30, 256)      12800000    input_1[0][0]
    __________________________________________________________________________________________________
    activation_1 (Activation)       (None, 30, 256)      0           embedding[0][0]
    __________________________________________________________________________________________________
    bi_lstm_0 (Bidirectional)       (None, 30, 1024)     3149824     activation_1[0][0]
    __________________________________________________________________________________________________
    bi_lstm_1 ...
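The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias); the helper name `lstm_params` is mine, and the 50,000-token vocabulary is inferred from 12,800,000 / 256 rather than read from the code.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


# Embedding: vocabulary size x embedding dimension (vocabulary size is inferred).
embedding_params = 50000 * 256
# bi_lstm_0: two directions, each an LSTM with 512 units over 256-d inputs.
bi_lstm_0_params = 2 * lstm_params(256, 512)

print(embedding_params, bi_lstm_0_params)  # 12800000 3149824
```

Both numbers match the `embedding` and `bi_lstm_0` rows of the summary.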

Run the model on the tokenized sentences:

    prob = model.predict(tokenized)

For each sentence, collect the five most probable emoji and write the results to a CSV file:

    import csv

    # Assumed output file name; OUTPUT_PATH is not defined earlier in this sample.
    OUTPUT_PATH = 'test_sentences.csv'

    scores = []
    for i, t in enumerate(TEST_SENTENCES):
        t_tokens = tokenized[i]
        t_score = [t]
        t_prob = prob[i]
        ind_top = top_elements(t_prob, 5)
        t_score.append(sum(t_prob[ind_top]))
        t_score.extend(ind_top)
        t_score.extend([t_prob[ind] for ind in ind_top])
        scores.append(t_score)
        print(t_score)

    # 'wb' is the Python 2 idiom; on Python 3 use open(OUTPUT_PATH, 'w', newline='')
    with open(OUTPUT_PATH, 'wb') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', lineterminator='\n')
        writer.writerow(['Text', 'Top5%',
                         'Emoji_1', 'Emoji_2', 'Emoji_3', 'Emoji_4', 'Emoji_5',
                         'Pct_1', 'Pct_2', 'Pct_3', 'Pct_4', 'Pct_5'])
        for i, row in enumerate(scores):
            try:
                writer.writerow(row)
            except Exception:
                print("Exception at row {}!".format(i))

Output:

    [u'This movie can be so much better if they put more efforts', 0.28318660706281662, 35, 5, 27, 34, 46, 0.063187309, 0.058821324, 0.056749657, 0.055143535, 0.049284782]
    [u'You think it is funny, huh?', 0.31099872291088104, 9, 1, 0, 38, 55, 0.083211482, 0.067881264, 0.055250548, 0.053413376, 0.051242054]
    [u'Amazing, my flight gets delayed again', 0.38134982064366341, 32, 55, 19, 1, 37, 0.094081961, 0.090335786, 0.081730463, 0.063715212, 0.051486399]


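As you can see, the code tokenizes each test sentence and feeds it to the prediction model. The final result keeps only the five emoji with the highest probability estimates, together with their probabilities. Note that the emoji appear as class indices (for example 35 or 5) rather than as emoji characters.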
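To see how each output row is laid out, here is a dependency-free sketch of the same top-k selection on a toy distribution. It mirrors what `top_elements` and the row assembly compute but uses plain `sorted` instead of NumPy; the names `top_indices` and `build_row` and the probabilities are made up for illustration.

```python
def top_indices(probs, k):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def build_row(text, probs, k=5):
    """Assemble [text, sum of top-k, k class indices, k probabilities]."""
    top = top_indices(probs, k)
    return [text, sum(probs[i] for i in top)] + top + [probs[i] for i in top]


# Toy distribution over 6 classes (made-up numbers, not model output).
toy_probs = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
row = build_row("toy sentence", toy_probs, k=3)
print(row)
```

The second field is the total probability mass covered by the selected classes, which is why the values in the real output above (roughly 0.28 to 0.38) show that the model spreads its predictions over many emoji.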

## Experiment and result analysis
Although current work on emotional content in text focuses mostly on positive/negative sentiment, human emotions are more complex than that. The DeepMoji project therefore aims to predict human emotions (emotion, sentiment and sarcasm detection) beyond the one-dimensional sentiment measure. The results of the sample above are quite accurate for sarcastic and ambiguous sentences, and the model performs better than ordinary positive/negative sentiment analysis on real-life data.  
<img src="image/goodexample.png" width="350">

However, there is still a lot to do to improve the accuracy of this technique. According to the paper, the attention mechanism lets the model decide the importance of each word for the prediction task by weighting the words when constructing the representation of the text. For instance, a word such as ‘amazing’ is likely to be very informative of the emotional meaning of a text and should thus be assigned a larger weight. But when a sentence contains many such heavily weighted words, it becomes more difficult to decide which one deserves the highest weight. 
<img src="image/badexample.png" width="350">
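The weighting described above can be sketched as a softmax over per-word scores followed by a weighted sum of the word vectors. This is a toy, pure-Python illustration with made-up two-dimensional vectors and a hypothetical function name `attention_pool`; it is not the model's actual layer.

```python
import math


def attention_pool(hidden_states, w):
    """Return (weights, pooled vector) for softmax attention over word vectors."""
    # Score each word: dot product between its vector and the weight vector w.
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for h in hidden_states]
    # Softmax turns the scores into importance weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The text representation is the weighted sum of the word vectors.
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled


w = [1.0, 1.0]
# One clearly informative word: the second vector dominates the weights.
peaked, _ = attention_pool([[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]], w)
# Several equally strong words: the weights flatten out.
flat, _ = attention_pool([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], w)
print(peaked)
print(flat)
```

The second case illustrates the difficulty mentioned above: when every word scores about the same, no single word stands out in the final representation.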

In general, though, DeepMoji performs well at emoji prediction and on the benchmark datasets, and it achieves higher accuracy than the fastText classifier. 
<img src="image/experimentdata1.png" width="300">
<center>Figure 1: Accuracy of classifiers on the emoji prediction task. d refers to the dimensionality of each LSTM layer</center>


<img src="image/benchmark.png" width="500">
<center>Figure 2: Comparison across benchmark datasets. Reported values are averages across five runs. </center>

The three transfer learning approaches differ as follows. The <b>chain-thaw</b> approach first fine-tunes any new layers on the target task until convergence on a validation set, then fine-tunes each layer individually, starting from the first layer in the network, and lastly trains the entire model with all layers unfrozen. The <b>last</b> approach fine-tunes only the new layers while the rest of the model stays frozen, and the <b>full</b> approach fine-tunes all layers together. 

Last, the researchers compared DeepMoji with human-level agreement using raters from Amazon Mechanical Turk (MTurk). Human raters scored tweets on a scale from 1 to 9, with an additional ‘Do not know’ option. The results show that the agreement of a random MTurk rater is 76.1%, meaning that a randomly selected rater agrees with the average of the nine other MTurk ratings of the tweet’s polarity 76.1% of the time, while the DeepMoji model achieves 82.4% agreement. 
<img src="image/Mturk.png" width="300">
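The leave-one-out agreement measure described above can be sketched as follows. This is my reading of the description, not the authors' evaluation code: treating ratings above the scale midpoint of 5 as positive is an assumption, ‘Do not know’ answers are ignored, the function names are hypothetical, and the ratings are made up.

```python
def polarity(score, midpoint=5.0):
    """Binary polarity of a rating; the midpoint threshold is an assumption."""
    return score > midpoint


def rater_agreement(ratings_per_tweet):
    """Fraction of (tweet, rater) pairs where the rater matches the mean of the others."""
    agree = 0
    total = 0
    for ratings in ratings_per_tweet:
        for i, r in enumerate(ratings):
            # Compare this rater against the average of all remaining raters.
            others = ratings[:i] + ratings[i + 1:]
            mean_others = sum(others) / float(len(others))
            agree += polarity(r) == polarity(mean_others)
            total += 1
    return agree / float(total)


# Made-up ratings for two tweets with four raters each.
toy_ratings = [[8, 7, 9, 2], [3, 2, 4, 1]]
print(rater_agreement(toy_ratings))  # 0.875
```

In the toy data only the rater who gave 2 to the first tweet disagrees with the others, so 7 of the 8 comparisons agree.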

## Summary
Some people might question the practicality of this project, since emoji are usually treated as simple emotion tags. But the research has many potential applications. For instance, companies may want to make sense of what customers are saying about them, and DeepMoji can help with that using data collected from social networks such as Twitter and YouTube. Chatbot services such as Siri would also benefit from a comprehensive and nuanced understanding of the emotional content in text, which is especially relevant since DeepMoji can detect emotion, sentiment and sarcasm with good accuracy. It can also help detect offensive language, because the model handles complex sequential patterns and sarcasm and so has a decent grasp of the nuances of such language (for instance, a smiling face emoji is not always a “happy” signal).

In short, it is a very interesting paper to read and try out. I am looking forward to seeing how this technique can be applied not only in real business settings but also in NLP research. 