# Sentiment Analysis
## Benchmarking State-of-the-Art Classifiers

Oleksandra Kovalenko (578447)   
Cosima Heymann (569413)  
Sascha Geyer (546266)       

![sentiment](https://camo.githubusercontent.com/899f79e8a2d62fd642eba0791ff66d13d38e427901bfc3cd89c6f613311e1789/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f70726f78792f312a5f4a57314a614d704b5f6656476c64387064315f4a512e676966 'sentiment')

## Table of Contents

1. Introduction
2. Research Overview
3. Community Datasets Overview
4. Experimental Setup
   - 4.1 Datasets
   - 4.2 Models
   - 4.3 NLP Preprocessing Pipeline
5. Results
6. Conclusion
7. Resources

## 1. Introduction: What is Sentiment Analysis?

The growth of user-generated content on websites and social networks such as Yelp, Twitter, Amazon, Tripadvisor, Rotten Tomatoes and IMDB has given people ever more ways to express opinions. In recent years, the automatic extraction of opinions from text has become an area of growing interest in Natural Language Processing (NLP). Because online content spreads quickly, online opinions have turned into a valuable asset. Many NLP tasks are used to analyze this massive amount of data. In particular, Sentiment Analysis (SA), also known as Opinion Mining, has become an increasingly important task whose goal is to classify the opinions and sentiments expressed in user-generated text. SA is on the rise because of the growing need to analyze and structure the hidden information in user-generated content, which comes in the form of unstructured data (Ain, Ali, Riaz, Noureen, Kamran, Hayat & Rehman, 2017). It makes it possible to detect the emotion and sentiment that the author of a text felt towards a described subject or entity. SA is of interest in many fields and helps to solve various tasks, e.g.:

- companies are able to measure the feedback about a product or service,
- sociologists can look at people’s reactions to certain public events,
- psychologists can study the general state of mind of communities with regard to various issues, e.g. with a depression detection model that is based on SA in micro-blog social networks (Wang, Zhang, Ji, Sun, Wu & Bao, 2013),
- governments and political parties are able to correct their actions according to social approval or disapproval,
- etc.

The challenge is that sentiments are not always expressed explicitly, and meanings can be hidden in the context. In these cases, additional word and language knowledge is necessary. Moreover, opinions may involve sarcasm and negation, which can be interpreted differently in various domains and contexts. Sentiment classification is rather easy for humans (Pang, Lee & Vaithyanathan, 2002), but the manual review and analysis of texts is very time-consuming and expensive. For this reason, automatic sentiment classifiers are used instead.

#### Sentiment Analysis: Definition, Application & Classification 

Sentiment Analysis is an active research area in NLP that refers to the use of text analysis, statistical learning and often Machine Learning to extract subjective information in source materials such as user-generated texts from social networks, blogs, forums and product or service reviews.
Selecting the basic emotions is a difficult task for a computer because of the variety of human emotions. Most authors in the NLP community agree on the classification proposed by Ekman, Friesen and Ellsworth (1982), according to which six basic emotions exist: anger, disgust, fear, joy, sadness and surprise. As such a division requires complex processing and analysis of the input data, the majority of researchers accept a simpler representation of sentiments according to their polarity (Pang & Lee, 2008). Kurosu (2015) defines sentiment polarity as follows: “The polarity of a sentiment is the point on the evaluation scale that corresponds to our positive or negative evaluation of the meaning of this sentiment.” Sentiment polarity allows researchers to use a binary (positive, negative) or ternary (positive, negative, neutral) measurement and therefore simplifies the representation and management of sentiment information. The granularity of SA can be either coarse-grained or fine-grained. Coarse-grained usually stands for a binary classification (positive, negative). Fine-grained classification, on the other hand, can use five (or more) levels (high positive, low positive, neutral, low negative, high negative).
Liu et al. (2015) presented three levels of SA: document level, sentence level and entity / aspect level. Document-level SA studies the polarity of the whole text with respect to a single entity (e.g. a product), whereas sentence-level SA studies the polarity of single sentences, analyzing clauses and phrases for their sentiment. In contrast, entity / aspect-level SA analyzes what exactly people liked or disliked. An entity or aspect might be a single token, and its polarity might differ from the overall polarity of the text (Liu et al., 2015).

#####  Application
SA can be applied in many areas. A few application areas are listed below:

- Social media monitoring,
- Customer support / feedback,
- Brand monitoring and reputation management,
- Voice of customer,
- Voice of employee,
- Product analysis,
- Market research and competitive research.

##### Classification

All methods used to solve sentiment classification fall into three main categories: lexicon-based, machine learning-based and hybrid approaches.

In lexicon-based approaches, also known as knowledge-based methods, sentiment is seen as a function of keywords and is based on their count. The main task is the construction of sentiment word lexicons in which each word carries a class label (positive or negative) and, in some cases, an intensity, which becomes important for fine-grained classification.
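The keyword-counting idea can be illustrated with a minimal sketch. The tiny lexicon, its intensity weights and the simple negation rule are assumptions made up for this example; real lexicons contain thousands of entries.

```python
import re

# Hypothetical toy lexicon: word -> signed intensity (assumption for illustration).
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "boring": -1.0, "terrible": -2.0,
}
# Assumed negation words; a negation flips only the token that directly follows it.
NEGATIONS = {"not", "no", "never"}


def lexicon_polarity(text):
    """Sum the lexicon weights of all tokens and map the total to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        weight = LEXICON.get(token, 0.0)
        score += -weight if negate else weight
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("I love this great movie"))
print(lexicon_polarity("The plot was not good"))
```

Such a scorer is fast and needs no training data, but it misses context: in "not very good" the negation is lost because it only affects the next token.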

An alternative to the knowledge-based method is Machine Learning (ML), which is attracting growing interest from researchers due to its adaptability and higher accuracy. Traditional Machine Learning methods were long the dominant approach in SA (Pang et al., 2002), with three main algorithms: Naïve Bayes (NB), Support Vector Machines (SVM) and Maximum Entropy (MaxEnt, known in statistics as Logistic Regression). Deep Learning (DL) models and Transformer models are also part of the Machine Learning family.

DL is an area of ML research that attempts to learn on multiple levels, corresponding to different levels of abstraction. Traditional ML relies on non-deep nets composed of one input layer, one output layer and at most one hidden layer in between. More than three layers (including the input and output layers) qualify a net as “deep”.

A Transformer is a type of neural network architecture developed by Vaswani et al. (2017). In short, this architecture combines a multi-head self-attention mechanism with an encoder-decoder structure. In a bit more detail, transformers work like this: first, the input embedding is multi-dimensional in the sense that the model processes complete sentences at once instead of a series of words one by one. Second, a powerful multi-head attention mechanism captures the context and the relationships between the words within a sentence. This attention computation is performed several times for each word, once per attention head. Lastly, feed-forward layers process the attention output, on top of which a sentiment prediction is made. To learn more about the architecture of transformer models, be sure to visit the documentation of the [transformers library](https://huggingface.co/docs/transformers/index) provided by Hugging Face, as this library gives you access to more than 32 pre-trained state-of-the-art (SOTA) models.
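To make the attention step more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention as described by Vaswani et al. (2017). The random inputs and projection matrices are placeholders for learned values; a real Transformer runs several such heads in parallel and adds residual connections, normalization and feed-forward layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every token with every other token
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # 4 tokens, 8-dimensional embeddings (toy sizes)
X = rng.normal(size=(seq_len, d_model))      # stand-in for the token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape, weights.shape)
```

Each row of `weights` states how strongly one token attends to every token of the sentence, which is how the model keeps track of context.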

The hybrid approach, also known as combined analysis or ensemble model, combines knowledge-based and Machine Learning-based methods and can thus lead to superior performance. It attracted researchers because it promises to combine the accuracy of an ML approach with the speed of a lexical approach.

## 2. Research Overview

The following section describes related work from the past 5 years that uses ML and DL approaches to solve SA tasks on different datasets and from different perspectives. This review is based on numerous recent studies in the field of SA. The first table presents several methods for English texts, whereas the second table presents papers for other languages such as Greek, German or French. SA for different languages could be a topic for future studies.

We compared the papers and models based on the following parameters:
- paper name, 
- year of publication,
- used datasets,
- superior classification of used algorithm(s),
- used algorithms,
- used performance evaluation metrics and
- where to find the paper.

There are several papers that use lexicon-based models. In this essay, we have focused only on ML-based, DL-based and ensemble methods. For an overview of lexicon-based models, have a look at the paper by Vizcarra et al. (2021).

In our literature review, we focussed on papers about ML and DL approaches (Singh, J., Singh, G., & Singh, R. (2017); Rustam, F., Ashraf, I., Mehmood, A., Ullah, S., & Choi, G. S. (2019); Ahuja, R., Chug, A., Kohli, S., Gupta, S., & Ahuja, P. (2019); Purchases, C. J. O. I., Stoyanova, L., & Wallace, W. (2019); da Silva, N. F. F., Coletta, L. F., Hruschka, E. R., & Hruschka Jr, E. R. (2016); Yi, S., & Liu, X. (2020); Ouyang, X., Zhou, P., Li, C. H., & Liu, L. (2015)). 

We also looked into hybrid approaches exploiting the power of ML approach and the stability from a lexicon-based approach (Gaye, B., Zhang, D., Wulamu, A. (2021); Novikova, A., Stupnikov, S. (2017); Alsayat, A. (2021); Araque, O., Corcuera-Platas, I., Sánchez-Rada, J. F., & Iglesias, C. A. (2017)). 

Last but not least, we also studied some papers about the latest trend and state-of-the-art in SA: Transformer models (Bacco, L., Cimino, A., Dell’Orletta, F., & Merone, M. (2021); Jiang, M., Wu, J., Shi, X., & Zhang, M. (2019); Wu, Z., Ying, C., Dai, X., Huang, S., & Chen, J. (2020)).

English:

<table>
  <tr>
   <td><strong>Paper Name</strong>
   </td>
   <td><strong>Year of Publication</strong>
   </td>
   <td><strong>Dataset(s)</strong>
   </td>
   <td><strong>Classification</strong>
   </td>
   <td><strong>Algorithms</strong>
   </td>
   <td><strong>Performance Evaluation Criteria</strong>
   </td>
   <td><strong>Source</strong>
   </td>
  </tr>
  <tr>
   <td>Explainable Sentiment Analysis: A Hierarchical Transformer-Based Extractive Summarization Approach
   </td>
   <td>2021
   </td>
   <td>IMDB
   </td>
   <td>Transformer 
   </td>
   <td>Explainable Hierarchical Transformer (ExHiT),  Sentence Classification Combiner Model (SCC)
   </td>
   <td>accuracy
   </td>
   <td><a href="https://www.mdpi.com/2079-9292/10/18/2195/pdf">https://www.mdpi.com/2079-9292/10/18/2195/pdf</a>
   </td>
  </tr>
  <tr>
   <td>A Tweet Sentiment Classification Approach Using a Hybrid Stacked Ensemble Technique
   </td>
   <td>2021
   </td>
   <td>Sentiment140
   </td>
   <td>Hybrid 
   </td>
   <td>stacked ensemble of three long short-term memory (LSTM) as base classifiers and logistic regression (LR) as a meta classifier
   </td>
   <td>accuracy, F1 
   </td>
   <td><a href="https://www.mdpi.com/2078-2489/12/9/374">https://www.mdpi.com/2078-2489/12/9/374</a>
   </td>
  </tr>
  <tr>
   <td>Optimization of sentiment analysis using machine learning classifiers
   </td>
   <td>2017
   </td>
   <td>3 manually compiled datasets; two of them are captured from Amazon and one dataset is assembled from IMDB movie reviews
   </td>
   <td>Machine Learning
   </td>
   <td>Naïve Bayes, J48, BFTree and OneR
   </td>
   <td>accuracy, F-measure, correctly classified instances
   </td>
   <td><a href="https://doi.org/10.1186/S13673-017-0116-3">https://doi.org/10.1186/S13673-017-0116-3</a>
   </td>
  </tr>
  <tr>
   <td>Sentiment Analysis of Short Texts from Social Networks Using Sentiment Lexicons and Blending of Machine Learning Algorithms
   </td>
   <td>2017
   </td>
   <td>VKontakte social network posts
   </td>
   <td>Hybrid
   </td>
   <td>Logistic Regression, Random Forest Classifier, SVM, Gradient Boosting Classifier, KNeighbors Classifier, Multinomial Naive Bayes
   </td>
   <td>F1 
   </td>
   <td>http://ceur-ws.org/Vol-2268/paper21.pdf
   </td>
  </tr>
  <tr>
   <td>Tweets Classification on the Base of Sentiments for US Airline Companies
   </td>
   <td>2019
   </td>
   <td>Twitter US Airline Sentiment
   </td>
   <td>Machine Learning
   </td>
   <td>Voting Classifier based on logistic regression (LR) and stochastic gradient descent classifier (SGDC) <strong>vs</strong> a variety of machine learning classifiers
   </td>
   <td>accuracy, F1
   </td>
   <td><a href="https://doi.org/10.3390/e21111078">https://doi.org/10.3390/e21111078</a>
   </td>
  </tr>
  <tr>
   <td>The Impact of Features Extraction on the Sentiment Analysis
   </td>
   <td>2019
   </td>
   <td>Sentiment Strength Twitter Dataset		
   </td>
   <td>Machine Learning
   </td>
   <td>TFIDF vs N-gram on 6 ML algos (LR, SVM, Decision Tree, Random Forest, KNN, Naive Bayes)
   </td>
   <td>accuracy, F1 
   </td>
   <td>https://www.sciencedirect.com/science/article/pii/S1877050919306593
   </td>
  </tr>
  <tr>
   <td>TOPIC MODELLING, SENTIMENT ANALYSIS AND CLASSIFICATION OF SHORT-FORM TEXT
   </td>
   <td>2019
   </td>
   <td>data was obtained through
Twitter and Facebook’s public APIs with Netlytic
   </td>
   <td>Lexicon-based, Machine Learning, Deep Learning
   </td>
   <td>LDA (Latent Dirichlet Allocation), 
LSA (Latent Semantic Analysis) vs LR, SVM and Naive Bayes
   </td>
   <td>technical performance (perplexity score and topic coherence score), ease of application, as well as proximity to human agent performance on the same problem
   </td>
   <td>https://local.cis.strath.ac.uk/wp/extras/msctheses/papers/strath_cis_publication_2733.pdf
   </td>
  </tr>
  <tr>
   <td>Using unsupervised information to improve semi-supervised tweet sentiment classification
   </td>
   <td>2016
   </td>
   <td>6 datasets: SemEval 2013, LiveJournal, SMS2013, Twitter2013, Twitter2014, Twitter Sarcasm 2014 
   </td>
   <td>Machine Learning
   </td>
   <td>semi-supervised C3E algorithm vs SVM
   </td>
   <td>F-Scores
   </td>
   <td>https://www.researchgate.net/publication/295244270_Using_unsupervised_information_to_improve_semi-supervised_tweet_sentiment_classification
   </td>
  </tr>
  <tr>
   <td>Improving Sentiment Analysis for Social Media Applications Using an Ensemble Deep Learning Language Model
   </td>
   <td>2021
   </td>
   <td>3 datasets: own Twitter coronavirus hashtag dataset as well as public review datasets from Amazon and Yelp
   </td>
   <td>Hybrid
   </td>
   <td>customized deep learning model with an advanced word embedding technique and a long short-term memory (LSTM) network
   </td>
   <td>accuracy
   </td>
   <td>https://pubmed.ncbi.nlm.nih.gov/34660170/
   </td>
  </tr>
  <tr>
   <td>Enhancing Deep Learning Sentiment Analysis with Ensemble Techniques in Social Applications
   </td>
   <td>2017
   </td>
   <td>7 datasets on movie reviews and microblogging 
   </td>
   <td>Deep Learning, Hybrid
   </td>
   <td>
   </td>
   <td>F1 
   </td>
   <td>https://www.researchgate.net/publication/313332224_Enhancing_Deep_Learning_Sentiment_Analysis_with_Ensemble_Techniques_in_Social_Applications
   </td>
  </tr>
  <tr>
   <td>Machine learning based customer sentiment analysis for recommending shoppers, shops based on customers’ review
   </td>
   <td>2020
   </td>
   <td>product data with customer reviews is collected from benchmark Unified computing system (UCS)
   </td>
   <td>Machine Learning 
   </td>
   <td>Hybrid Recommendation System
   </td>
   <td>MAPE
   </td>
   <td>https://link.springer.com/article/10.1007/s40747-020-00155-2
   </td>
  </tr>
  <tr>
   <td>Sentiment Analysis Using Convolutional Neural Network
   </td>
   <td>2015
   </td>
   <td>IMDB
   </td>
   <td>Deep Learning
   </td>
   <td>RNN, LSTM, CNN
   </td>
   <td>accuracy
   </td>
   <td>https://ieeexplore.ieee.org/abstract/document/7363395
   </td>
  </tr>
  <tr>
   <td>Transformer Based Memory Network for Sentiment Analysis of Web Comments
   </td>
   <td>2019
   </td>
   <td>2 datasets: Weibo, Semeval 
   </td>
   <td>Transformer
   </td>
   <td>Transformer based memory network (TF-MN)
   </td>
   <td>accuracy, F1 
   </td>
   <td>https://www.researchgate.net/publication/337697651_Transformer_Based_Memory_Network_for_Sentiment_Analysis_of_Web_Comments
   </td>
  </tr>
  <tr>
   <td>Transformer-based Multi-Aspect Modeling for Multi-Aspect Multi-Sentiment Analysis
   </td>
   <td>2020
   </td>
   <td>MultiAspect Multi-Sentiment (MAMS) dataset
   </td>
   <td>Transformer
   </td>
   <td>RoBERTa-Transformer-based
Multi-aspect Modeling method (TMM)
   </td>
   <td>accuracy, F1
   </td>
   <td>https://arxiv.org/abs/2011.00476
   </td>
  </tr>
</table>

<strong>Table 1. Literature Review for English Datasets</strong>

SA in languages other than English has become a popular research area in the last 1-2 years. For these languages, too, we focused on papers that use ML- and DL-based methods. We studied the following papers: Elfaik, H. (2021); Alexandridis, G., Varlamis, I., Korovesis, K., Caridakis, G., & Tsantilas, P. (2021); Flender, M., & Gips, C. (2017); Tellez, E. S., Miranda-Jiménez, S., Graff, M., Moctezuma, D., Siordia, O. S., & Villaseñor, E. A. (2017); Carosia, A. E. O., Coelho, G. P., & Silva, A. E. A. (2020); Rhouati, A., Berrich, J., Belkasmi, M. G., & Bouchentouf, T. (2018).

<table>
  <tr>
   <td><strong>Language</strong>
   </td>
   <td><strong>Paper Name</strong>
   </td>
   <td><strong>Year of Publication</strong>
   </td>
   <td><strong>Dataset(s)</strong>
   </td>
   <td><strong>Classification</strong>
   </td>
   <td><strong>Algorithms</strong>
   </td>
   <td><strong>Performance Evaluation Criteria</strong>
   </td>
   <td><strong>Source</strong>
   </td>
  </tr>
  <tr>
   <td>Arabic
   </td>
   <td>Deep Bidirectional LSTM Network Learning-Based Sentiment Analysis for Arabic Text
   </td>
   <td>2021
   </td>
   <td>6 benchmark sentiment analysis datasets
   </td>
   <td>Deep Learning
   </td>
   <td>Bidirectional LSTM Network (BiLSTM)
   </td>
   <td>
   </td>
   <td>https://www.degruyter.com/document/doi/10.1515/jisys-2020-0021/html
   </td>
  </tr>
  <tr>
   <td>Greek
   </td>
   <td>A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media
   </td>
   <td>2021
   </td>
   <td>self-collected and annotated Greek Social Media Texts 
   </td>
   <td>Deep Learning
   </td>
   <td>PaloBert, GreekBERT
   </td>
   <td>accuracy, F1
   </td>
   <td>https://doi.org/10.3390/info12080331
   </td>
  </tr>
  <tr>
   <td>German
   </td>
   <td>Sentiment analysis of a German Twitter-Corpus
   </td>
   <td>2017
   </td>
   <td>German tweets from a bigger dataset
   </td>
   <td>Machine Learning
   </td>
   <td>Multinomial NB,  LinearSVC, Decision Tree Classifier, Maxent Classifier
   </td>
   <td>accuracy, F1
   </td>
   <td>http://ceur-ws.org/Vol-1917/paper06.pdf
   </td>
  </tr>
  <tr>
   <td>Spanish
   </td>
   <td>A case study of Spanish text transformations for twitter sentiment analysis
   </td>
   <td>2017
   </td>
   <td>2 Spanish datasets
   </td>
   <td>Machine Learning
   </td>
   <td>SVM
   </td>
   <td>accuracy, computing time
   </td>
   <td>https://www.sciencedirect.com/science/article/abs/pii/S0957417417302312?via%3Dihub
   </td>
  </tr>
  <tr>
   <td>(Brazilian) Portuguese
   </td>
   <td>Analyzing the Brazilian Financial Market Through Portuguese Sentiment Analysis in Social Media
   </td>
   <td>2018
   </td>
   <td>self annotated Twitter dataset on financial market
   </td>
   <td>Machine Learning
   </td>
   <td>Naive Bayes, Support Vector Machines, Maximum Entropy and Multilayer Perceptron
   </td>
   <td>accuracy
   </td>
   <td>https://www.researchgate.net/profile/Arthur-Carosia/publication/336933355_Analyzing_the_Brazilian_Financial_Market_through_Portuguese_Sentiment_Analysis_in_Social_Media/links/5e67edc24585153fb3d5b305/Analyzing-the-Brazilian-Financial-Market-through-Portuguese-Sentiment-Analysis-in-Social-Media.pdf
   </td>
  </tr>
  <tr>
   <td>French
   </td>
   <td>Sentiment Analysis of French Tweets based on Subjective Lexicon Approach: Evaluation of the use of OpenNLP and CoreNLP Tools
   </td>
   <td>2018
   </td>
   <td>French tweets using "Public Opinion Knowledge (POK)" platform
   </td>
   <td>Lexicon based in comparison to Machine Learning
   </td>
   <td>OpenNLP, CoreNLP, dependency analysis implemented by CoreNLP
   </td>
   <td>F-Scores
   </td>
   <td>https://www.researchgate.net/publication/326514882_Sentiment_Analysis_of_French_Tweets_based_on_Subjective_Lexicon_Approach_Evaluation_of_the_use_of_OpenNLP_and_CoreNLP_Tools
   </td>
  </tr>
</table>

<strong>Table 2. Literature Review for other Languages</strong>

You may ask what the future holds for Sentiment Analysis. It is pretty clear to us that there is still lots of room to improve the performance measures of transformer models. 

## 3. Community Datasets Overview

Below is an overview of community datasets for the use of SA that are publicly available. We have outlined them with the following features: 
- name,
- platform,
- domain,
- size, 
- evaluation, 
- language,
- source.

This list is far from exhaustive and lacks, for example, many datasets in languages other than English.



<table>
  <tr>
   <td><strong>Name</strong>
   </td>
   <td><strong>Platform</strong>
   </td>
   <td><strong>Domain</strong>
   </td>
   <td><strong>Size</strong>
   </td>
   <td><strong>Evaluation (binary or more)</strong>
   </td>
   <td><strong>Language</strong>
   </td>
   <td><strong>Source</strong>
   </td>
  </tr>
  <tr>
   <td>Twitter US Airline Sentiment
   </td>
   <td>Twitter 
   </td>
   <td>US Airline user experiences
   </td>
   <td>3.42 MB
   </td>
   <td>ternary = positive, negative, neutral
   </td>
   <td>English
   </td>
   <td>https://www.kaggle.com/crowdflower/twitter-airline-sentiment
   </td>
  </tr>
  <tr>
   <td>Sentiment140
   </td>
   <td>Twitter
   </td>
   <td>user responses to different products, brands, or topics
   </td>
   <td>228 MB Training (1.600.000) 
   </td>
   <td>0 = negative, 
2 = neutral, 4 = positive
   </td>
   <td>English
   </td>
   <td>http://help.sentiment140.com/for-students
   </td>
  </tr>
  <tr>
   <td>Stanford Sentiment Treebank
   </td>
   <td>Rotten Tomatoes
   </td>
   <td>movie reviews
   </td>
   <td>10.000
   </td>
   <td>1-25 (25: most positive)
   </td>
   <td>English
   </td>
   <td>https://nlp.stanford.edu/sentiment/code.html
   </td>
  </tr>
  <tr>
   <td>Large IMDB Movie Reviews
   </td>
   <td>IMDB
   </td>
   <td>movie reviews
   </td>
   <td>25.000 training, 25.000 test
   </td>
   <td>binary
   </td>
   <td>English
   </td>
   <td>https://ai.stanford.edu/~amaas/data/sentiment/
   </td>
  </tr>
  <tr>
   <td>Polarity v2.0
   </td>
   <td>
   </td>
   <td>movie reviews
   </td>
   <td>3MB (1000 positive and 1000 negative processed reviews)
   </td>
   <td>binary
   </td>
   <td>English
   </td>
   <td>http://www.cs.cornell.edu/people/pabo/movie-review-data/
   </td>
  </tr>
  <tr>
   <td>Paper Reviews
   </td>
   <td>conference of computing
   </td>
   <td>user’s opinion about a paper
   </td>
   <td>405
   </td>
   <td>-2: very negative,
-1: negative,
0: neutral,
1: positive,
2: very positive
   </td>
   <td>English, Spanish
   </td>
   <td>https://archive.ics.uci.edu/ml/datasets/Paper+Reviews
   </td>
  </tr>
  <tr>
   <td>Multi-Domain Sentiment Dataset
   </td>
   <td>Amazon
   </td>
   <td>reviews of amazon products
   </td>
   <td>unprocessed: 1.9 GB, processed: 19 MB
   </td>
   <td>reviews contain ratings from 1 to 5 stars (can be converted to binary)
   </td>
   <td>English
   </td>
   <td>https://www.cs.jhu.edu/~mdredze/datasets/sentiment/
   </td>
  </tr>
  <tr>
   <td>Opin-Rank Review Dataset
   </td>
   <td>Tripadvisor, Edmunds
   </td>
   <td>hotel, car reviews
   </td>
   <td>300.000
   </td>
   <td>ratings that can be turned into binary
   </td>
   <td>English
   </td>
   <td>https://archive.ics.uci.edu/ml/datasets/opinrank+review+dataset
   </td>
  </tr>
  <tr>
   <td>Sentiment Lexicons For 81 Languages
   </td>
   <td>-
   </td>
   <td>-
   </td>
   <td>2 text files per language
   </td>
   <td>binary
   </td>
   <td>81 languages: Afrikaans to Yiddish
   </td>
   <td>https://sites.google.com/site/datascienceslab/projects/multilingualsentiment
   </td>
  </tr>
  <tr>
   <td>Lexicoder
   </td>
   <td>-
   </td>
   <td>-
   </td>
   <td>2,858 negative sentiment words and 1,709 positive sentiment words
   </td>
   <td>binary
   </td>
   <td>English
   </td>
   <td>http://www.snsoroka.com/data-lexicoder/
   </td>
  </tr>
  <tr>
   <td>DynaSent
   </td>
   <td>Dynabench
   </td>
   <td>naturally occurring sentences with sentences created using the open-source Dynabench Platform
   </td>
   <td>121.634 sentences
   </td>
   <td>ternary
   </td>
   <td>English
   </td>
   <td>https://github.com/cgpotts/dynasent
   </td>
  </tr>
  <tr>
   <td>Amazon Fine Foods
   </td>
   <td>Amazon
   </td>
   <td>product reviews
   </td>
   <td>568.454 reviews
   </td>
   <td>ratings that can be turned into binary
   </td>
   <td>English
   </td>
   <td>https://snap.stanford.edu/data/web-FineFoods.html
   </td>
  </tr>
  <tr>
   <td>Germeval2017
   </td>
   <td>Social Media 
   </td>
   <td>messages from various social media and web sources
   </td>
   <td>22.000 messages 
   </td>
   <td>ternary
   </td>
   <td>German
   </td>
   <td>https://sites.google.com/view/germeval2017-absa/data
   </td>
  </tr>
  <tr>
   <td>Yelp_polarity_reviews
   </td>
   <td>Yelp
   </td>
   <td>business reviews
   </td>
   <td>600.000 reviews for training, 38.000 for testing
   </td>
   <td>binary (1 - bad, 2 - good)
   </td>
   <td>English
   </td>
   <td><a href="https://www.kaggle.com/irustandi/yelp-review-polarity">https://www.kaggle.com/irustandi/yelp-review-polarity</a>
   </td>
  </tr>
  <tr>
   <td>Financial PhraseBank
   </td>
   <td>
   </td>
   <td>financial news (rated as pos/neg/neutral) for investor
   </td>
   <td>4840 (4 configurations available (size depends on the level of agreement of annotators))
   </td>
   <td>ternary
   </td>
   <td>English
   </td>
   <td><a href="https://github.com/huggingface/datasets/tree/master/datasets/financial_phrasebank">https://github.com/huggingface/datasets/tree/master/datasets/financial_phrasebank</a>
   </td>
  </tr>
  <tr>
   <td>The SigmaLaw- Aspect-Based-SA dataset
   </td>
   <td>Court cases
   </td>
   <td>Legal opinion texts
   </td>
   <td>2.000 sentences
   </td>
   <td>ternary
   </td>
   <td>English
   </td>
   <td><a href="https://osf.io/efrqt/">https://osf.io/efrqt/</a>
   </td>
  </tr>
</table>

<strong>Table 3. Overview of Community Datasets</strong>
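
Several datasets in the table ship with star ratings or numeric codes instead of ready-made polarity labels. The following sketch shows one possible way to map them to sentiment classes. The function names and the star thresholds (1-2 negative, 4-5 positive, 3 dropped) are assumptions for illustration; the 0/2/4 coding is the one listed for Sentiment140 above.

```python
def stars_to_binary(stars):
    """Map a 1-5 star rating to 0 (negative) or 1 (positive).

    Assumed thresholds: 1-2 stars are negative, 4-5 stars are positive and
    ambiguous 3-star reviews are dropped (None).
    """
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    return None


# Label coding of Sentiment140 as listed in the table above.
SENTIMENT140_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def sentiment140_to_label(raw):
    """Translate the numeric Sentiment140 code into a readable class name."""
    return SENTIMENT140_LABELS[raw]


ratings = [5, 1, 3, 4]
print([stars_to_binary(r) for r in ratings])
print(sentiment140_to_label(4))
```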

## 4. Experimental Setup

### 4.1 Datasets

We aim to compare selected DL and Transformer models with the performance of a traditional ML model (our baseline) on a document-level sentiment analysis task. For this purpose we selected two datasets from different domains: 

- Large IMDB Movie Reviews

This dataset from Stanford researchers consists of 50.000 polarized, binary labeled reviews with balanced classes. The researchers originally split the data 50/50 into a training and a test set. In our experiment, we joined both parts and performed the train-test split ourselves in order to leave more data for training.


- Sentiment140

Sentiment140 was created by Alec Go, Richa Bhayani, and Lei Huang, who were Computer Science graduate students at Stanford University. The dataset allows you to discover the sentiment of a brand, product or topic on Twitter. We use the version of the dataset that consists of 1.6 million tweets with equally distributed labels (50% positive, 50% negative).
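
A custom train-test split on a balanced dataset could look like the sketch below. The toy DataFrame is made up, and stratifying by label is an assumption on our side to keep the 50/50 class balance in both parts; `test_size=0.3` and `random_state=42` mirror the values in our configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a balanced review dataset (assumption for illustration).
toy_df = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "sentiment": [0, 1] * 10,
})

train_df, test_df = train_test_split(
    toy_df,
    test_size=0.3,                 # 30% of the rows are held out for testing
    random_state=42,               # fixed seed for a reproducible split
    stratify=toy_df["sentiment"],  # keep the label ratio identical in both parts
)

print(len(train_df), len(test_df))
print(test_df["sentiment"].value_counts().to_dict())
```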

### 4.2 Models

#### Baseline model

This experiment used Logistic Regression, a popular and simple ML algorithm, as the baseline. Logistic Regression is a statistical technique capable of predicting a binary outcome. With the built-in functions of scikit-learn, the model was very easy to build. After loading the respective dataset, the code first re-creates the words of the preprocessed dataset by building an index that translates all lists of word indices back to strings. It then uses Term Frequency - Inverse Document Frequency (TF-IDF) as the text representation. TF-IDF is a statistical measure of how important a word is to a document. On the training data, the vectorizer computes the Term Frequency (TF) for each review, the Inverse Document Frequency (IDF) across all reviews and finally the TF-IDF for each review. On the test data, it only computes the TF for each review and combines it with the IDF learned from the training data to obtain the TF-IDF. Finally, the model was fit to classify the sentiment of the movie reviews and tweets.
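
The baseline idea can be sketched with a scikit-learn pipeline. This is a simplified stand-in and not the code of our `LogisticRegression` wrapper class: the four training sentences are made up, while `ngram_range`, `max_iter` and `random_state` follow the values from our configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up miniature training set (1 = positive, 0 = negative).
train_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible plot boring",
    "awful boring waste of time",
]
train_labels = [1, 1, 0, 0]

# The pipeline learns vocabulary and IDF on the training texts only and
# reuses both when it transforms unseen texts at prediction time.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),
    LogisticRegression(max_iter=100, random_state=42),
)
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(["great story", "boring plot"])
probabilities = baseline.predict_proba(["great story", "boring plot"])
print(predictions, probabilities.shape)
```

Wrapping both steps in one pipeline prevents the IDF statistics from leaking from the test data into training.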

#### DL model

As Long Short Term Memory networks (LSTM) are the most often used DL models for SA (Ligthart et al., 2021), we decided to test them in our experiment. Our vanilla LSTM architecture consists of an embedding layer, an LSTM layer and a fully connected layer. In between we use two dropout layers to prevent overfitting.
We found the blog of Christopher Olah particularly helpful and suggest checking it out to gain a first understanding of how LSTMs work: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
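
For readers who prefer code to diagrams, the following NumPy sketch runs the forward pass of a single LSTM cell over a short sequence and feeds the last hidden state into a sigmoid output unit. It mirrors the embedding, LSTM and fully connected layers of the architecture above in miniature; all weights are random placeholders, the sizes are toy values and dropout is left out because it is only active during training. It is not the Keras model used in our experiments.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[:hidden])                # input gate
    f = sigmoid(z[hidden:2 * hidden])      # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])            # candidate cell state
    c_t = f * c_prev + i * g               # new cell state
    h_t = o * np.tanh(c_t)                 # new hidden state
    return h_t, c_t


rng = np.random.default_rng(42)
vocab_size, embed_dim, hidden = 10, 4, 3   # toy sizes (assumption)
embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W = rng.normal(scale=0.1, size=(4 * hidden, embed_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
w_out = rng.normal(scale=0.1, size=hidden)  # fully connected output layer

token_ids = [3, 7, 1, 5]                    # an already tokenized review
h, c = np.zeros(hidden), np.zeros(hidden)
for token_id in token_ids:
    h, c = lstm_step(embedding[token_id], h, c, W, U, b)

probability = float(sigmoid(w_out @ h))     # probability of the positive class
print(h.shape, probability)
```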

According to the findings of Ligthart et al. (2021), convolutional neural networks (CNN) are the second most often used DL models for SA. Recent research also indicates that a hybrid CNN & LSTM model can outperform traditional ML algorithms on SA tasks (e.g. Jain et al., 2021; Yadav & Vishwakarma, 2020). Our proposed architecture consists of an embedding layer, a convolution layer that receives its input from the embedding layer, a pooling layer whose output is fed into the LSTM layer, two fully connected layers and dropout layers.

We tokenize and pad the data using Keras built-in functions in order to bring all reviews / tweets into a suitable format. An embedding layer is necessary for dimensionality reduction, which is achieved by converting each word into a vector of a defined length. Similar words have similar embeddings. We tried out two approaches: 
- training embeddings from scratch on the vocabularies from our respective datasets and 
- using pre-trained embeddings (transfer learning). We used GloVe (Global Vectors for Word Representation) embeddings from Stanford researchers (Pennington et al., 2014) trained on 1) Wikipedia texts and 2) Tweets.

|Short name|Training data|Number of tokens|Vocabulary|Dimensionality|
|------|------|------|------|------|
|Glove Wiki|combination of Wikipedia 2014 & Gigaword5|6B tokens| 400 thousand |50d, 100d, 200d, & 300d vectors|
|Glove Tweets |2B tweets | 27B tokens| 1.2 million |25d, 50d, 100d, & 200d vectors|


GloVe embeddings can be easily downloaded from the respective Stanford website, which takes around 5-10 minutes depending on your internet connection. We provide the links below in the code. GloVe embeddings pretrained on other data, which we don't use in our essay, are also publicly available on the project web page: https://nlp.stanford.edu/projects/glove/
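
The path from raw text to an embedding-ready input can be sketched in plain Python and NumPy. This is a simplified stand-in for the Keras utilities we actually use: the helper names are hypothetical, the two "GloVe" lines are made-up 3-dimensional vectors in the GloVe text format (a word followed by its space-separated values), and words missing from the file keep a zero vector.

```python
import io
import numpy as np


def build_vocab(texts):
    """Assign an integer id to every word; id 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_and_pad(texts, vocab, max_len):
    """Turn texts into id sequences, cut them to max_len and left-pad with zeros."""
    batch = np.zeros((len(texts), max_len), dtype=np.int64)
    for row, text in enumerate(texts):
        ids = [vocab[w] for w in text.lower().split() if w in vocab][:max_len]
        if ids:
            batch[row, -len(ids):] = ids
    return batch


def load_embedding_matrix(lines, vocab, dim):
    """Copy the pre-trained vector of every known word into row vocab[word]."""
    matrix = np.zeros((len(vocab) + 1, dim), dtype=np.float32)
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        index = vocab.get(word)
        if index is not None and len(values) == dim:
            matrix[index] = np.asarray(values, dtype=np.float32)
    return matrix


texts = ["good movie", "bad bad movie indeed"]
fake_glove_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")

vocab = build_vocab(texts)
padded = encode_and_pad(texts, vocab, max_len=3)
embedding_matrix = load_embedding_matrix(fake_glove_file, vocab, dim=3)
print(padded)
print(embedding_matrix.shape)
```

With real GloVe files, `fake_glove_file` would be replaced by an opened text file and `dim` by the chosen dimensionality. The resulting matrix can then initialize the embedding layer, either frozen or fine-tuned.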

#### Transformer model


DistilBERT is a Transformer model based on the BERT architecture which is smaller, faster, lighter and cheaper to pre-train. Knowledge distillation is performed during the pre-training phase in order to reduce the size of a BERT model by 40%. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses (Sanh, Debut, Chaumond & Wolf, 2019).
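
The triple loss can be approximated with a short NumPy sketch. It only illustrates the three ingredients named by Sanh et al. (2019) and is not their implementation: the temperature, the equal weighting of the three terms and the random logits and hidden states are assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


def language_modeling_loss(student_logits, target_ids):
    """Standard cross-entropy against the true (masked) token ids."""
    log_probs = np.log(softmax(student_logits))
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity, which aligns the directions of the hidden states."""
    dot = (student_hidden * teacher_hidden).sum(axis=-1)
    norms = np.linalg.norm(student_hidden, axis=-1) * np.linalg.norm(teacher_hidden, axis=-1)
    return float((1.0 - dot / norms).mean())


rng = np.random.default_rng(0)
n_tokens, vocab_size, hidden_dim = 5, 11, 6            # toy sizes
teacher_logits = rng.normal(size=(n_tokens, vocab_size))
student_logits = rng.normal(size=(n_tokens, vocab_size))
teacher_hidden = rng.normal(size=(n_tokens, hidden_dim))
student_hidden = rng.normal(size=(n_tokens, hidden_dim))
target_ids = rng.integers(0, vocab_size, size=n_tokens)

# Equal weights for the three terms are an assumption made for this sketch.
total_loss = (
    distillation_loss(student_logits, teacher_logits)
    + language_modeling_loss(student_logits, target_ids)
    + cosine_loss(student_hidden, teacher_hidden)
)
print(total_loss)
```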

#### GPU usage in experiment

All experiments were run on an NVIDIA Tesla P100 GPU provided by Kaggle. Usage is free of charge and limited to 30-40 hours per week (see https://www.kaggle.com/docs/efficient-gpu-usage).

### 4.3 NLP Preprocessing Pipeline

We compare the performance of the selected models on raw and preprocessed datasets. For this task we set up a Preprocessor class which runs an NLP pipeline that takes care of all necessary preprocessing steps, such as lemmatization, stopword removal and text cleaning with regular expressions. In addition, the Preprocessor factorizes the labels to obtain a binary encoding.
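
The following sketch illustrates the kind of steps such a pipeline performs. It is not our Preprocessor class: the function names, the regular expressions and the tiny stopword list are assumptions, and lemmatization is left out because it requires an external library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full list
STOPWORDS = {"the", "a", "an", "is", "and", "to", "of"}


def clean_text(text):
    """Lowercase, strip HTML tags, URLs, mentions and non-letters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags such as <br />
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                   # Twitter mentions
    text = re.sub(r"[^a-z\s]", " ", text)               # digits, punctuation, emoticons
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def encode_labels(labels):
    """Map the distinct label values to 0..k-1 (e.g. Sentiment140's 0/4 to 0/1)."""
    mapping = {value: i for i, value in enumerate(sorted(set(labels)))}
    return [mapping[value] for value in labels], mapping


cleaned = clean_text("The movie is GREAT!<br />See http://x.co @bob")
encoded, mapping = encode_labels([0, 4, 4, 0])
```

Note that cleaning of this kind also removes punctuation and emoticons, which may themselves carry sentiment.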

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from Preprocessor import Preprocessor
from LogisticRegression import LogisticRegression
from LSTM import LSTM
from DistilBert import DistilBert

imdb_df = pd.read_csv("./data/IMDB.csv", names=["text", "sentiment"])
sentiment140_df = pd.read_csv("./data/Sentiment140.csv", header=None, index_col=False, encoding='latin-1', usecols=[0,5], names=["sentiment", "text"])

In [3]:
#small_sentiment140_df = sentiment140_df.tail(5000)
#small_sentiment140_df = small_sentiment140_df.append(sentiment140_df.head(5000))

In [5]:
configs = {
    "imdb": {
         "preprocessor": {
            "name": "imdb",
            "df": imdb_df,
            "cache": True,
            "test_size": 0.3,
             "random_state": 42,
        },
        "logistic_regression": {
            "ngram_range": (1,1),
            "random_state": 42,
            "max_iter": 100,
        },
        "LSTM": {
            "lstm_units": 80,
            "batch_size": 256,
            "dropout_rate": 0.1,
            "activation": "sigmoid",
            "epochs": 3,
            "random_state": 42,
        }
    },
    "sentiment140": {
       "preprocessor": {
            "name": "sentiment140",
            "df": sentiment140_df,
            "cache": True
        },
        "logistic_regression": {
            "ngram_range": (1,3),
            "random_state": 42,
            "max_iter": 500,
        },
        "LSTM": {
            "lstm_units": 27,
            "batch_size": 4096,
            "dropout_rate": 0.1,
            "activation": "sigmoid",
            "epochs": 3,
            "random_state": 42,
        }
    }
}

def run_models(config):
    # -------------- Preprocessor -------------- #
    preprocessor = Preprocessor(**config['preprocessor'])
    Xy = preprocessor.run()
    Xy_train_test_dict = preprocessor.split(Xy)
    
    # ----------- LogisticRegression ----------- #
    lr_model = LogisticRegression(**config['logistic_regression'])
    lr_model.fit(Xy_train_test_dict)
    
    # ------------------ LSTM ------------------ #
    lstm_model = LSTM(**config['LSTM'])
    lstm_model.fit(Xy_train_test_dict)
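
The `logistic_regression` entries in the configuration (`ngram_range`, `max_iter`, `random_state`) map naturally onto a scikit-learn pipeline. The sketch below shows one plausible form of such a baseline. It is an assumption for illustration, not the code of our own `LogisticRegression` wrapper class; in particular, the use of TF-IDF features is assumed. The scikit-learn class imported here merely shares its name with our wrapper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline(ngram_range=(1, 1), max_iter=100, random_state=42):
    """TF-IDF n-gram features followed by a logistic regression classifier."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        LogisticRegression(max_iter=max_iter, random_state=random_state),
    )


# Toy data for illustration only
texts = ["great movie", "wonderful film", "terrible movie", "awful film"] * 5
labels = [1, 1, 0, 0] * 5

baseline = build_baseline(ngram_range=(1, 3), max_iter=500)
baseline.fit(texts, labels)
```

With `ngram_range=(1, 3)` the vectorizer also builds features from word pairs and triples, which is the setting used for Sentiment140 in the configuration above.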

In [8]:
run_models(configs["imdb"])

Read imdb_preprocessed.parquet.gzip from cache...
Successfully read imdb_preprocessed.parquet.gzip into memory.
Logistic Regression validation accuracy: 0.8904


Epoch 1/3
Epoch 2/3
Epoch 3/3
LSTM validation accuracy: 88.11%


In [6]:
run_models(configs["sentiment140"])

Read sentiment140_preprocessed.parquet.gzip from cache...
Successfully read sentiment140_preprocessed.parquet.gzip into memory.
Logistic Regression validation accuracy: 0.7786958333333334
Epoch 1/3
Epoch 2/3
Epoch 3/3
LSTM validation accuracy: 76.98%


In [None]:
imdb_preprocessed = pd.read_parquet("/cache/imdb_preprocessed.parquet.gzip")
sent140_preprocessed = pd.read_parquet("/cache/sentiment140_preprocessed.parquet.gzip")

sets_raw_imdb=preprocessor.split(imdb_df,0.3,42)
sets_preprocessed_imdb=preprocessor.split(imdb_preprocessed,0.3,42)
sets_raw_sent140=preprocessor.split(sentiment140_df,0.3,42)
sets_preprocessed_sent140=preprocessor.split(sent140_preprocessed,0.3,42)

#### Baseline model on raw datasets

#### LSTM on raw datasets

In [6]:
from LSTM_new import LSTM
from CNN_LSTM import CNN_LSTM

In [None]:
import tensorflow

# Allow GPU memory to grow on demand instead of reserving it all up front.
# Looping over the device list keeps the cell from failing on CPU-only machines.
for gpu in tensorflow.config.list_physical_devices('GPU'):
    tensorflow.config.experimental.set_memory_growth(gpu, True)

In [None]:
#import embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

!wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
!unzip glove.twitter.27B.zip

GLOVE_EMB_WIKI_50 = './glove.6B.50d.txt'
GLOVE_EMB_WIKI_100 = './glove.6B.100d.txt'
GLOVE_EMB_WIKI_200 = './glove.6B.200d.txt'
GLOVE_EMB_WIKI_300 = './glove.6B.300d.txt'

GLOVE_EMB_TWI_25 = './glove.twitter.27B.25d.txt'
GLOVE_EMB_TWI_50 = './glove.twitter.27B.50d.txt'
GLOVE_EMB_TWI_100 = './glove.twitter.27B.100d.txt'
GLOVE_EMB_TWI_200 = './glove.twitter.27B.200d.txt'

- LSTM on IMDB Dataset

In [30]:
lstm_model.fit_lstm(sets_raw_imdb,epochs=20, batch_size=128,embedding_dim=300,embeddings_name=GLOVE_EMB_WIKI_300)

Vocabulary of the dataset is :  105600
2493
Data padded!
Found 400000 word vectors.
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 2493, 300)         31680300  
_________________________________________________________________
dropout_9 (Dropout)          (None, 2493, 300)         0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dropout_10 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 64)                6464      
_________________________________________________________________
dropout_11 (Dropout)         (None, 64)                0         
____________________________________

- LSTM on Sentiment140 dataset

In [54]:
lstm_model.fit_lstm(sets_raw_sent140,epochs=10, batch_size=128,embedding_dim=200,
                    embeddings_name=GLOVE_EMB_TWI_200)

Vocabulary of the dataset is :  543420
118
Data padded!
Found 1193514 word vectors.
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 118, 200)          108684200 
_________________________________________________________________
dropout_15 (Dropout)         (None, 118, 200)          0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               120400    
_________________________________________________________________
dropout_16 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 64)                6464      
_________________________________________________________________
dropout_17 (Dropout)         (None, 64)                0         
____________________________________
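
The parameter counts in the two model summaries above can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are hypothetical. The embedding counts match a table with one row more than the reported vocabulary size, presumably the reserved padding index 0.

```python
def embedding_params(vocab_size, dim):
    """One vector per word plus one extra row (assumed to be the padding index)."""
    return (vocab_size + 1) * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, units):
    """Weight matrix plus one bias per output unit."""
    return inputs * units + units


imdb = (embedding_params(105600, 300), lstm_params(300, 100), dense_params(100, 64))
sent140 = (embedding_params(543420, 200), lstm_params(200, 100), dense_params(100, 64))
```

The embedding layer dominates the total in both cases, and the effect is strongest for Sentiment140 with its much larger vocabulary.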

### DistilBert 

In [5]:
from DistilBert import DistilBert
distilbert_model=DistilBert()

- DistilBert on IMDB dataset

In [24]:
distilbert_model.fit_distil_bert(sets_raw_imdb,1,256)

Tokenizing
Tokenizing completed


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the 

----Building the model----
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 256)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 256)]        0                                            
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880    input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
tf.__operators__.getitem (Slici (None, 768)          0           tf

Test score: [0.2185227870941162, 0.9147999882698059]


- DistilBert on Sentiment140 dataset

In [25]:
distilbert_model.fit_distil_bert(sets_raw_sent140,epochs=1,max_len=50)

Tokenizing
Tokenizing completed


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


----Building the model----
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 50)]         0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 50)]         0                                            
__________________________________________________________________________________________________
tf_distil_bert_model_1 (TFDisti TFBaseModelOutput(la 66362880    input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None, 768)          0           

### 5. Results


#### Baseline

The logistic regression achieves better accuracy on the raw data. While the difference is very small for the IMDB reviews (0.16 percentage points), it is substantial for the tweets from the Sentiment140 dataset (4.22 percentage points). All values are accuracies in %.

IMDB raw|IMDB preprocessed|Sentiment140 raw|Sentiment140 preprocessed
-|-|-|-
89.20|89.04|82.07|77.85


#### DL Models

As was the case with our baseline model, both the LSTM and the CNN-LSTM models perform better when applied to the raw data. Our CNN-LSTM model, however, did not outperform the LSTM model, and it beat the baseline only on the Sentiment140 dataset, and only marginally (82.35 vs. 82.07). The LSTM model does achieve better accuracy than the logistic regression, but even for its best results the improvement over the baseline is marginal (around 1 percentage point).

Below you can find an overview of the results for both datasets. While presenting many results from running the models with slightly different parameters is usually not considered best practice in ML, we did it in this essay on purpose to illustrate the effect of changing the embeddings, their dimensions, and the number of epochs. In the tables, an empty Embeddings cell denotes embeddings trained from scratch, and all values are accuracies in %.

We found that pretrained embeddings had a slight positive effect on the models' performance, which is consistent with the literature we reviewed. As expected, on the IMDB dataset the LSTM with GloVe embeddings trained on Wikipedia & Gigaword texts performed better than the LSTM with embeddings trained on tweets. The opposite was true for the Sentiment140 dataset.

We can also observe a trend where increased dimensionality of the embeddings leads to slight improvements in the models' accuracy. This is most visible for the pretrained embeddings, while the results for embeddings trained from scratch are less clear-cut.

Finally, increasing the number of epochs for the LSTM model with pretrained embeddings also yields small accuracy gains. Interestingly, this is not the case for the embeddings trained from scratch: here, our results for both datasets usually got worse as the number of epochs increased.

- IMDB dataset

Dimension|Embeddings|Batch size|Epochs|LSTM|CNN-LSTM||LSTM|CNN-LSTM
-|-|-|-|-|-|-|-|-
||||raw |raw ||preprocessed|preprocessed 
100|GloVe Wiki |128|3|87.39|87.16||88.85|85.75
100|GloVe Twitter |128|3|85.94|87.02||83.85|84.62
100||128|3|**89.31**|84.77||87.83|87.74
100|GloVe Wiki |128|20|**90.18**|88.25||86.94|85.65
100|GloVe Twitter|128|20|**89.47**|88.02||86.80|84.55
100||128|20|87.07|87.63||86.32|86.30
||||||||
200|GloVe Wiki |128|3|87.09|88.70||86.47|86.62
200|GloVe Twitter |128|3|86.01|86.42||85.95|86.37
200||128|3|86.14|88.15||87.00|87.16
200|GloVe Wiki |128|20|**89.97**|87.21||86.82|85.89
||||||||
300|GloVe Wiki |128|3|88.60|88.91||89.03|87.44
300||128|3|89.01|88.90||87.10|87.05
300|GloVe Wiki|128|20|**90.31**|88.52||87.88|86.28
300||128|20|87.27|88.11||86.47|86.21
||||||||

- Sentiment140 dataset

Dimension|Embeddings|Batch size|Epochs|LSTM|CNN-LSTM||LSTM|CNN-LSTM
-|-|-|-|-|-|-|-|-
||||raw data|raw data||preprocessed data|preprocessed data
25|GloVe Twitter|2048|3|76.42|72.40||73.82|72.40
25|GloVe Twitter|512|3|78.61|75.85||75.14|73.54
25|GloVe Twitter|128|3|79.58|76.78||75.53|73.92
||||||||
50|GloVe Twitter|2048|3|79.34|78.36||75.98|75.00
50|GloVe Twitter|128|3|81.58|79.89||76.86|75.87
50|GloVe Wiki|128|3|80.05|78.03||75.96|74.09
50||128|3|79.14|80.41||77.46|77.26
||||||||
100|GloVe Twitter|128|3|**82.28**|81.29||76.98|76.55
100|GloVe Twitter|128|10|**82.81**|81.95||77.93|77.01
100|GloVe Wiki|128|3|81.10|80.27||76.53|75.62
100|GloVe Wiki|128|10|82.05|80.81||76.94|75.79
100||128|3|80.15|78.75||77.44|76.99
||||||||
200|GloVe Twitter|128|10|**83.16**|**82.35**||77.97|76.95
200|GloVe Wiki|128|10|**82.26**|77.05||77.11|75.87
200||128|3|80.40|80.06||77.22|76.66
200||128|10|79.22|79.03||76.31|74.60
||||||||
300|GloVe Wiki|128|10|**82.41**|75.58||77.18|75.61




#### Transformer: DistilBert

Our DistilBert model outperforms both the baseline and the LSTM models on both datasets. Consistent with our previous findings, the results are better when the model is trained on and applied to the raw data. In the table below we only present the results from training the model for one epoch, as the accuracy decreased with an increasing number of epochs. 

Epochs|IMDB raw|IMDB preprocessed|Sentiment140 raw|Sentiment140 preprocessed
-|-|-|-|-
1|**91.47**|88.32|**86.10**|78.65

#### Best results: Accuracy vs. Time

While the accuracy gain of DistilBert is, again, rather small for the IMDB reviews (about 2.3 percentage points over the baseline), it is more pronounced for the Sentiment140 dataset at around 4 percentage points. This gain, however, comes with a significant increase in training time, most notably on Sentiment140 (1 h 30 min). 

Data/Time|LogReg|LSTM|CNN-LSTM|DistilBert
-|-|-|-|-
IMDB (raw)|89.20|90.31|88.90|91.47
Time|3.5 min|19 min|11 min|14 min
||||
Sentiment140 (raw)|82.07|83.16|82.35|86.10
Time||15 min|14 min|1h 30 min
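
The accuracy gains over the baseline can be read off the table with a few lines of Python. The numbers are copied from the table above; the helper name `gains_over_baseline` is hypothetical.

```python
# Best accuracies (in %) on the raw data, copied from the table above
results = {
    "IMDB (raw)": {"LogReg": 89.20, "LSTM": 90.31, "CNN-LSTM": 88.90, "DistilBert": 91.47},
    "Sentiment140 (raw)": {"LogReg": 82.07, "LSTM": 83.16, "CNN-LSTM": 82.35, "DistilBert": 86.10},
}


def gains_over_baseline(scores, baseline="LogReg"):
    """Difference to the baseline in percentage points, rounded to two decimals."""
    base = scores[baseline]
    return {model: round(acc - base, 2) for model, acc in scores.items() if model != baseline}


gains = {dataset: gains_over_baseline(scores) for dataset, scores in results.items()}
for dataset, deltas in gains.items():
    print(dataset, deltas)
```

Expressing the differences in percentage points makes it easier to weigh them against the training times listed in the same table.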

#### Preprocessed Data vs. Raw Data
We were surprised to see that running the classifiers on the raw data actually yielded better performance scores than running them on the preprocessed versions. This suggests that our preprocessing may have stripped away parts of the information contained in the texts, which in turn may have led to wrong predictions. Another thought is that the more complex the sentences are, and the more irony or sarcasm is "hidden" in the texts, the more DL models have an advantage over traditional ML models. Unfortunately, with time running short at the end of the semester, we could not look deeper into this question, but we recommend it as a topic for further research.

### 6. Conclusion

Consistent with the current state of the art, our LSTM model performed better than the baseline logistic regression model, and DistilBert provided the best results on the sentiment classification task. This, however, requires additional time and computational resources. Whether applying complex models is appropriate depends on the specific use case. In many real-world business applications, marginal accuracy improvements might not justify the increase in computational resources. This is especially true if classifications are made frequently. At the same time, training does not need to be repeated before every classification: it might be feasible to make a one-time investment, train the transformer model once, save it and reuse it.
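
The train-once, save-and-reuse pattern can be sketched as follows. For brevity the example uses a lightweight scikit-learn pipeline and `pickle` as a stand-in; this is an assumption for illustration and not how our DistilBert model is stored. Keras models offer `model.save` and `load_model`, and Hugging Face models offer `save_pretrained` and `from_pretrained` for the same purpose. The function names and the file name are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_and_save(texts, labels, path):
    """One-time investment: fit the model and persist it to disk."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=200))
    model.fit(texts, labels)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_and_predict(path, texts):
    """Later runs: load the stored model and classify without retraining."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    return model.predict(texts).tolist()


# Toy data for illustration only
train_texts = ["great movie", "wonderful film", "terrible movie", "awful film"] * 5
train_labels = [1, 1, 0, 0] * 5

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_model.pkl")
    train_and_save(train_texts, train_labels, model_path)
    predictions = load_and_predict(model_path, ["great film", "awful movie"])
```

Only the loading and prediction step has to run on a frequent basis, so the training cost is paid once.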

As for the classroom setting, the choice of the appropriate model would strongly depend on the available computational resources and the overall application setting. If the goal is to explore or improve the state of the art, transformers are surely the best choice. Due to time constraints we were not able to explore all fine-tuning options for our DistilBert model, and we nevertheless obtained good results. It would therefore be interesting to explore whether further fine-tuning of the model (including the preprocessing of the respective data) can further improve performance.

### 7. Resources

Gaye, B., Zhang, D., & Wulamu, A. (2021). A Tweet Sentiment Classification Approach Using a Hybrid Stacked Ensemble Technique. Information, 12(9), 374.

Alsayat, A. (2021). Improving Sentiment Analysis for Social Media Applications Using an Ensemble Deep Learning Language Model. Arabian Journal for Science and Engineering, 1-13.

Vizcarra, J., Kozaki, K., Ruiz, M. T., & Quintero, R. (2021). Knowledge-based sentiment analysis and visualization on social networks. New Generation Computing, 39(1), 199-229.

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP 2002, pp. 79-86.

Ain, Q., Ali, M., Riaz, A., Noureen, A., Kamran, M., Hayat, B., & Rehman, A. (2017). Sentiment Analysis using Deep Learning techniques. International Journal of Advanced Computer Science and Applications (IJACSA), 8(6).

Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1-135. DOI: 10.1561/1500000001.

Ekman, P., Friesen, W., & Ellsworth, P. (1982). What emotion categories or dimensions can observers judge from facial behavior? In Emotion in the human face (pp. 39-55). Cambridge University Press, New York.

Kurosu, M. (2015). Human-Computer Interaction. 17th International Conference, HCI International 2015, Los Angeles. Proceedings, Part II, p. 423. [book]

Wang, X., Liu, Y., Sun, C., Wang, B., & Wang, X. (2015). Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (pp. 1343-1353).

Wang, X., Zhang, C., Ji, Y., Sun, L., Wu, L., & Bao, Z. (2013). A Depression Detection Model Based on Sentiment Analysis in Micro-blog Social Network. In PAKDD 2013: Trends and Applications in Knowledge Discovery and Data Mining (pp. 201-213).

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12, 2493-2537.

Bacco, L., Cimino, A., Dell’Orletta, F., & Merone, M. (2021). Explainable Sentiment Analysis: A Hierarchical Transformer-Based Extractive Summarization Approach. Electronics, 10(18), 2195.

Wu, Z., Ying, C., Dai, X., Huang, S., & Chen, J. (2020, October). Transformer-Based Multi-aspect Modeling for Multi-aspect Multi-sentiment Analysis. In CCF International Conference on Natural Language Processing and Chinese Computing (pp. 546-557). Springer, Cham.

Jiang, M., Wu, J., Shi, X., & Zhang, M. (2019). Transformer based memory network for sentiment analysis of web comments. IEEE Access, 7, 179942-179953.

Novikova, A., & Stupnikov, S. (2017, July). Sentiment analysis of short texts from social networks using sentiment lexicons and blending of machine learning algorithms. In Proc. CEUR Workshop (pp. 190-201).

Singh, J., Singh, G., & Singh, R. (2017). Optimization of sentiment analysis using machine learning classifiers. Human-centric Computing and information Sciences, 7(1), 1-12.

Rustam, F., Ashraf, I., Mehmood, A., Ullah, S., & Choi, G. S. (2019). Tweets classification on the base of sentiments for US airline companies. Entropy, 21(11), 1078.

Ahuja, R., Chug, A., Kohli, S., Gupta, S., & Ahuja, P. (2019). The impact of features extraction on the sentiment analysis. Procedia Computer Science, 152, 341-348.

Purchases, C. J. O. I., Stoyanova, L., & Wallace, W. (2019). Topic Modelling, Sentiment Analysis and Classification of Short-Form Text.

da Silva, N. F. F., Coletta, L. F., Hruschka, E. R., & Hruschka Jr, E. R. (2016). Using unsupervised information to improve semi-supervised tweet sentiment classification. Information Sciences, 355, 348-365.

Araque, O., Corcuera-Platas, I., Sánchez-Rada, J. F., & Iglesias, C. A. (2017). Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications, 77, 236-246.

Yi, S., & Liu, X. (2020). Machine learning based customer sentiment analysis for recommending shoppers, shops based on customers’ review. Complex & Intelligent Systems, 6(3), 621-634.

Ouyang, X., Zhou, P., Li, C. H., & Liu, L. (2015, October). Sentiment analysis using convolutional neural network. In 2015 IEEE international conference on computer and information technology; ubiquitous computing and communications; dependable, autonomic and secure computing; pervasive intelligence and computing (pp. 2359-2364). IEEE.

Elfaik, H. (2021). Deep Bidirectional LSTM Network Learning-Based Sentiment Analysis for Arabic Text. Journal of Intelligent Systems, 30(1), 395-412.

Alexandridis, G., Varlamis, I., Korovesis, K., Caridakis, G., & Tsantilas, P. (2021). A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media. Information, 12(8), 331.

Flender, M., & Gips, C. (2017, September). Sentiment Analysis of a German Twitter-Corpus. In LWDA (p. 25).

Tellez, E. S., Miranda-Jiménez, S., Graff, M., Moctezuma, D., Siordia, O. S., & Villaseñor, E. A. (2017). A case study of Spanish text transformations for twitter sentiment analysis. Expert Systems with Applications, 81, 457-471.

Carosia, A. E. O., Coelho, G. P., & Silva, A. E. A. (2020). Analyzing the Brazilian financial market through Portuguese sentiment analysis in social media. Applied Artificial Intelligence, 34(1), 1-19.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.

Jain, P. K., Saravanan, V., & Pamula, R. (2021). A hybrid CNN-LSTM: A deep learning approach for consumer sentiment analysis using qualitative user-generated contents. Transactions on Asian and Low-Resource Language Information Processing, 20(5), 1-15.

Ligthart, A., Catal, C., & Tekinerdogan, B. (2021). Systematic reviews in sentiment analysis: a tertiary study. Artificial intelligence review, 54(7), 4997-5053. 

Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.