# 1 Motivation

The emergence and ubiquity of social media platforms such as Facebook, Twitter and Instagram has given every individual a platform to connect with others and to speak for themselves. People are increasingly used to expressing their thoughts through social media, which has become a form of "We media". People are also influenced by others' comments and opinions, which social media lets reach a much wider audience. As a result, one can get a sense of whether a certain product, organization or person is liked by monitoring social media and forums. In fact, companies are already taking steps to manage their brand and reputation on the internet.

Meanwhile, the wide use of social media has also provided researchers with a large amount of data and interesting topics to explore, for example social network analysis. Twitter in particular, being a platform where people post comments, produces a huge volume of natural language data every day. Using NLP and machine learning algorithms, we can do many interesting things with the tweets people post. In this project, we are interested in analyzing the reputation of certain entities through sentiment analysis of tweets.

# 2 Related Work

## 2.1 Sentiment Analysis

Sentiment analysis is a well-studied area of Natural Language Processing (NLP). In the past year alone, a number of papers have looked at sentiment analysis on different datasets [10; 11; 12]. Traditionally, the task was tackled with hand-crafted features or sentiment lexicons [13; 14], which were fed to classifiers such as Naive Bayes or Support Vector Machines (SVM). These methods require a laborious feature engineering process and may require domain-specific knowledge, often resulting in redundant or missing features. With the recent development of deep learning, more solutions [3; 4; 5] based on deep Recurrent Neural Networks (RNN) or Long Short-Term Memory (LSTM) networks have achieved very good results compared with traditional methods. Sentiment analysis in Twitter has also been an annual task run by SemEval since 2013 [3; 8; 9]. In our project we ran the SemEval task for comparison and benchmarking. Our model outperforms the top-ranked systems of SemEval 2017.

## 2.2 Reputation analysis on Twitter

Reputation research and modeling have attracted the interest of scientists in different fields, such as sociology [15; 16], economics [17], psychology [18] and computer science [19]. Reputation analysis of celebrities based on Twitter usually uses metrics such as the number of followers, frequently used words, and the number of likes, comments and retweets [20]. In our project we observe that the sentiment expressed by ordinary users towards celebrities should also be considered an important metric, since it shows the attitudes of ordinary people towards those celebrities.

# 3 Approach

Sentiment analysis is essentially a sentence classification problem (positive, negative and/or neutral) to which machine learning algorithms such as neural networks can be applied. In this project, we use Bidirectional Encoder Representations from Transformers (BERT) [1] as a sentence encoder and add a classification layer on top to predict the final class label.

After training, the model can be applied to tweets crawled from Twitter that @ a certain account to produce a "reputation score". We compute the scores by week and show the absolute level as well as the relative variation of the score.

## BERT with classifier

BERT is a deep Transformer network [6] trained on a very large corpus. We use the base uncased version of BERT initialized with pretrained weights, which has 12 layers, 768 hidden units, 12 attention heads and 110M parameters. The input (a tweet) is preprocessed and tokenized (details below). The tokens `[CLS]` and `[SEP]` are added to the beginning and end of the input tokens, respectively. Depending on its length, the input is truncated or padded to a fixed length of 100, including the two added tokens. An input mask is created, with 1 marking real tokens and 0 marking padding. The tokens are then converted to ids using the BERT vocabulary and fed to BERT together with the mask.
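The truncation, padding and masking step can be sketched in plain Python as follows. This is a minimal illustration, not our training code: the function name `build_inputs` is hypothetical, and padding with the literal `[PAD]` token before id conversion is an assumption (padding the id sequence with zeros after conversion would be equivalent for the BERT vocabulary).

```python
from typing import List, Tuple

MAX_LEN = 100   # fixed sequence length, including [CLS] and [SEP]
PAD = "[PAD]"   # assumption: pad with BERT's padding token before id conversion


def build_inputs(tokens: List[str], max_len: int = MAX_LEN) -> Tuple[List[str], List[int]]:
    """Add [CLS]/[SEP], truncate or pad to max_len, and build the input mask."""
    # Reserve two positions for the special tokens, truncating if necessary.
    body = tokens[: max_len - 2]
    seq = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(seq)
    n_pad = max_len - len(seq)
    seq += [PAD] * n_pad
    mask += [0] * n_pad
    return seq, mask


if __name__ == "__main__":
    seq, mask = build_inputs(["what", "a", "great", "day", "!"])
    print(seq[:8])
    print(mask[:8])
```

The resulting token list would then be mapped to ids (for example with the tokenizer's `convert_tokens_to_ids` method) and passed to the model together with the mask.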

BERT produces a sentence encoding, which is the 768-dimensional output vector corresponding to `[CLS]`. A dropout layer with probability 0.1 is applied to this output, followed by a fully connected layer that outputs the logits of the labels.
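As a sketch of this classification head in our own notation (the symbols $\mathbf{h}$, $W$, $\mathbf{b}$, $\mathbf{z}$ and $K$ are introduced here for illustration only), the logits can be written as:

$$ \mathbf{z} = W \, \mathrm{Dropout}_{0.1}(\mathbf{h}) + \mathbf{b}, \qquad \mathbf{h} \in \mathbb{R}^{768},\; W \in \mathbb{R}^{K \times 768},\; \mathbf{b} \in \mathbb{R}^{K} $$

where $\mathbf{h}$ is the `[CLS]` output of BERT and $K$ is the number of labels ($K = 3$ for the positive, negative and neutral classes). Dropout is only active during training, so at inference time the head reduces to a single affine map.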

## Compute "reputation score"

We crawl the tweets that @ a certain account, grouped by week, preprocess them in the same way as the training data, and feed them to the model. The outputs of the model are the logits of the corresponding labels (positive, negative and neutral). We then apply softmax to obtain probabilities. The final "reputation score" is computed by:

$$ S = \frac{1}{N} \sum_{i = 1}^{N}\left( P_i(\text{positive}) - P_i(\text{negative}) \right) $$

where $P_i(\text{positive})$ is the probability of tweet $i$ being positive, $P_i(\text{negative})$ is the probability of tweet $i$ being negative, and $N$ is the total number of tweets crawled in a one-week period. The score therefore lies in $[-1, 1]$.
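The score computation can be illustrated with a short standard-library sketch. The function names and the label order (`NEG`, `NEU`, `POS`) are assumptions made for this example; the actual order depends on how the labels were encoded during training.

```python
import math
from typing import List, Sequence

# Assumed index of each label in the logit vector (hypothetical ordering).
NEG, NEU, POS = 0, 1, 2


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reputation_score(week_logits: Sequence[Sequence[float]]) -> float:
    """Mean of P(positive) - P(negative) over all tweets of one week."""
    if not week_logits:
        raise ValueError("no tweets for this week")
    diffs = []
    for logits in week_logits:
        p = softmax(logits)
        diffs.append(p[POS] - p[NEG])
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Three made-up tweets: clearly positive, clearly negative, neutral.
    logits = [[0.1, 0.2, 3.0], [2.5, 0.3, 0.1], [0.0, 2.0, 0.0]]
    print(round(reputation_score(logits), 3))
```

A neutral tweet contributes close to zero, so a week dominated by neutral mentions yields a score near 0 regardless of volume.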

# 4 Data
We train the model on the training data from SemEval 2017 task 4A and evaluate on its test set. The SemEval 2017 task 4A training data has 50k examples with 3 labels: negative, positive and neutral.

We also ran experiments on the Sentiment140 dataset. The dataset has 1.6M examples, which we split into 90% training and 10% testing.

All data are preprocessed before being fed to the network. The data are first cleaned as follows:

- All @s are removed.
- HTTP addresses are removed.
- Words containing invalid ASCII symbols are removed.
- All characters that are not alphanumeric and not one of `'"?!` are converted to a space.
- After the above steps, tweets with fewer than 1 character (i.e. empty tweets) are removed.
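The cleaning steps above can be sketched with the standard `re` module. This is an illustrative reimplementation rather than our exact script: the function name `clean_tweet` is hypothetical, and we assume here that "removing @s" means dropping the whole `@username` mention and that "invalid ASCII" means any non-ASCII character.

```python
import re
from typing import Optional

MENTION = re.compile(r"@\w+")                 # assumption: drop the whole @username
URL = re.compile(r"https?://\S+")             # http and https addresses
DISALLOWED = re.compile(r"[^A-Za-z0-9'\"?!]")  # everything except alphanumerics and '"?!


def clean_tweet(text: str) -> Optional[str]:
    """Apply the cleaning steps in order; return None if nothing is left."""
    text = MENTION.sub(" ", text)
    text = URL.sub(" ", text)
    # Drop whole words that contain any non-ASCII character.
    words = [w for w in text.split() if all(ord(c) < 128 for c in w)]
    text = " ".join(words)
    # Replace every remaining disallowed character with a space.
    text = DISALLOWED.sub(" ", text)
    # Collapse repeated whitespace introduced by the substitutions.
    text = " ".join(text.split())
    return text if text else None


if __name__ == "__main__":
    print(clean_tweet("@united thanks for the GREAT flight!! http://t.co/xyz #happy"))
    print(clean_tweet("@someone http://t.co/abc"))
```

Lowercasing is deliberately left out of this sketch because the uncased BERT tokenizer handles it later.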

Then, the sentences are tokenized with BertTokenizer. BertTokenizer consists of a basic tokenizer, which does simple splitting and lowercasing, and a WordPiece tokenizer [2].

# 5 Code

We used the pretrained `BertForSequenceClassification` implementation and `BertTokenizer` from the `pytorch-pretrained-bert` package. We used code from [7] to crawl tweets.

# 6 Experimental Setup

As mentioned above, we ran experiments on SemEval 2017 task 4A and compare our method to others. We compare average recall, F1 score and accuracy. Note that average recall is the main metric used in the task because it is a better metric for unbalanced data (e.g. more positive than negative examples). We therefore mainly look at average recall.
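To make the metric concrete, the following sketch computes macro-averaged recall, i.e. the unweighted mean of the per-class recalls. The function name `average_recall` and the toy labels are illustrative assumptions; this is not the official task scorer.

```python
from typing import Hashable, Sequence


def average_recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in set(y_true):
        # Positions where the gold label is this class.
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Unbalanced toy data: a classifier that always predicts "positive".
    y_true = ["positive"] * 8 + ["negative"] + ["neutral"]
    y_pred = ["positive"] * 10
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, average_recall(y_true, y_pred))
```

On this toy data the always-positive classifier reaches 0.8 accuracy but only 1/3 average recall, which is why average recall is less sensitive to class imbalance. The same reasoning explains the 0.333 average recall of the "All Positive" baseline in the results table.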

We also ran experiments on the Sentiment140 dataset. However, this dataset does not come from a shared task, so no direct comparison can be made. We report our result below.


# 7 Results

## SemEval Result

We compare our result with those of the SemEval 2017 participants. The best-ranked teams were *BB twtr* and *DataStories*, both achieving a macro-average recall of 0.681. Both top teams used deep learning; *BB twtr* used an ensemble of LSTMs and CNNs with multiple convolution operations, while *DataStories* used deep LSTM networks with an attention mechanism. As shown in the table below, our system, which uses BERT for sequence classification, achieved a better result than any participant of that shared task.

| #          | System        | AvgRec    | F1        | Accuracy  | Architecture                    |
| ---------- | ------------- | --------- | --------- | --------- | -------------------------------- |
| Our System | Windows Vista | **0.701** | **0.702** | **0.714** | Bert for Sequence Classification |
| 1          | DataStories[3]   | 0.681     | 0.677     | 0.651     | LSTM with attention              |
| 1          | BB twtr[4]       | 0.681     | 0.685     | 0.658     | LSTM and CNNs                    |
| 3          | LIA[5]           | 0.676     | 0.674     | 0.661     | LSTM and CNNs                    |
| Baseline   | All Positive  | 0.333     | 0.162     | 0.193     |                                  |



## Sentiment140 Result

This dataset does not come from a shared task, so no direct comparison can be made. We nevertheless report our result on the test set (10%) of our own split of this dataset.

`Accuracy: 86.48% F1 score: 0.8648 AvgRec: 0.8648`

# 8 Analysis of the Results

The result on the SemEval task shows the power of BERT as a large-scale pretrained Transformer for language representation. Initialized with pretrained weights, fine-tuning BERT on other tasks is simple and feasible, yet produces very good results.

Most existing methods use LSTMs and/or CNNs. LSTMs are often hard to train, which makes it difficult to train a deep LSTM on a very large dataset. CNNs alleviate this problem, but they turn a sequence into n-gram features and do not pay enough attention to word order. The Transformer, on the other hand, does not use recurrent connections; it preserves ordering information through positional embeddings and allows bidirectional information flow through multi-head attention. In the era of deep learning, more data usually enables more powerful networks, which is exactly what BERT exploits.


# 9 Reputation Analysis

<span style="color:red">**TODO**</span>: Include figures and analysis of reputations 

# 10 Future Work

Conducting sentiment analysis on Twitter is difficult because the language used in tweets is informal. Tweets contain slang, misspellings, abbreviations, emojis, multimedia content and so on. A challenging task is to recognize such informal usage and exploit it for prediction. For example, emojis are often obvious indicators of sentiment and are ubiquitous on social media. It would be very useful if this information could be captured and utilized.

Besides text, visual content is also an important part of social media and carries many sentiment cues. In fact, there is already work on combining text and visual content for sentiment analysis on Twitter.

# References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[2] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[3] Baziotis C, Pelekis N, Doulkeridis C. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) 2017 (pp. 747-754).

[4] Mathieu Cliche. 2017. BB twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs. In Proceedings of the 11th International Workshop on Semantic Evaluation. Vancouver, Canada, SemEval ’17, pages 572–579.

[5] Rouvier M. LIA at SemEval-2017 Task 4: An Ensemble of Neural Networks for Sentiment Classification. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) 2017 (pp. 760-765).

[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

[7] https://github.com/Jefferson-Henrique/GetOldTweets-python


[8] Rosenthal S, Nakov P, Kiritchenko S, Mohammad S, Ritter A, Stoyanov V. SemEval-2015 Task 10: Sentiment analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) 2015 (pp. 451-463).

[9] Nakov P, Ritter A, Rosenthal S, Sebastiani F, Stoyanov V. SemEval-2016 Task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) 2016 (pp. 1-18).

[10] Cambria E, Poria S, Gelbukh A, Thelwall M. Sentiment analysis is a big suitcase. IEEE Intelligent Systems. 2017 Nov;32(6):74-80.


[11] Soleymani M, Garcia D, Jou B, Schuller B, Chang SF, Pantic M. A survey of multimodal sentiment analysis. Image and Vision Computing. 2017 Sep 1;65:3-14.

[12] Cummins N, Amiriparian S, Ottl S, Gerczuk M, Schmitt M, Schuller B. Multimodal Bag-of-Words for Cross Domains Sentiment Analysis. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Calgary, Canada 2018 (pp. 1-5).

[13] Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242. 2013.

[14] Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research 50:723–762. 2014.

[15] P. Hage and F. Harary. Island Networks. Cambridge University Press, 1996.

[16] V. Buskens. The social structure of trust. Social Networks, (20):265–298, 1998.

[17] M. Celentani, D. Fudenberg, D.K. Levine, and W. Pesendorfer. Maintaining a reputation against a long-lived opponent. Econometrica, 64(3):691–704, 1996.

[18] D.B. Bromley. Reputation, Image and Impression Management. John Wiley & Sons, 1993.

[19] Sabater J, Sierra C. Reputation and social network analysis in multi-agent systems. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 1 2002 Jul 15 (pp. 475-482). ACM.

[20] Snape Jennifer. Does Taylor Swift have a big (and bad) reputation? Twitter scraping using R. https://statfr.blogspot.com/2018/10/does-taylor-swift-have-big-and-bad.html. 2018.
