## Sentiment Analysis
### Benchmarking State-of-the-Art Classifiers

Oleksandra Kovalenko (???)   
Cosima Heymann (569413)  
Sascha Geyer (546266)

![sentiment](https://camo.githubusercontent.com/899f79e8a2d62fd642eba0791ff66d13d38e427901bfc3cd89c6f613311e1789/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f70726f78792f312a5f4a57314a614d704b5f6656476c64387064315f4a512e676966 'sentiment')

### Introduction: What is Sentiment Analysis?

The growth of user-generated content on websites and social networks such as Yelp, Twitter, Amazon, Tripadvisor, Rotten Tomatoes and IMDB has given people ever more ways to express opinions. In recent years, the automatic extraction of opinions from text has become an area of growing interest in Natural Language Processing (NLP). Because online content spreads quickly, online opinions have turned into a valuable asset. Many NLP tasks are used to analyze this massive amount of data. In particular, Sentiment Analysis (SA), also known as Opinion Mining, has become an increasingly popular task (Liu, 2015) whose goal is to classify the opinions and sentiments expressed in user-generated text. SA is on the rise because of the growing need to analyze and structure the hidden information contained in unstructured user-generated content (Ain, Ali, Riaz, Noureen, Kamran, Hayat & Rehman, 2017). It makes it possible to detect the emotion and sentiment that the author of a text felt towards a described subject or entity. It is of interest in many fields and helps to solve various tasks, e.g.:

- companies are able to measure the feedback about a product or service,
- sociologists can look at people’s reactions to certain public events,
- psychologists can study the general state of mind of communities with regard to various issues, e.g. with a depression detection model based on SA in micro-blog social networks (Wang, Zhang, Ji, Sun, Wu & Bao, 2013),
- governments and political parties are able to correct their actions according to social approval or disapproval,
- etc.

The challenge is that sentiments are not always expressed explicitly and meanings can be hidden in the context. In these cases, additional word and language knowledge is necessary. Moreover, opinions may involve sarcasm and negation, which can be interpreted differently across domains and contexts. Sentiment classification is rather easy for humans (Pang, Lee & Vaithyanathan, 2002), but manually reviewing and analyzing texts is very time-consuming and expensive. For this reason, automatic sentiment classifiers are used instead.

#### Sentiment Analysis: Definition, Application & Classification 

Sentiment Analysis is an active research area in NLP that refers to the use of text analysis, statistical learning and often Machine Learning to extract subjective information from source materials such as user-generated texts from social networks, blogs, forums and product or service reviews.
Selecting the basic emotions is a difficult task for a computer because of the variety of human emotions. Most authors in the NLP community agree on the classification proposed by Ekman, Friesen and Ellsworth (1982), according to which six basic emotions exist: anger, disgust, fear, joy, sadness and surprise. Because such a division requires complex processing and analysis of the input data, the majority of researchers accept a simpler representation of sentiments according to their polarity (Pang & Lee, 2008). Kurosu (2015) defines sentiment polarity as follows: “The polarity of a sentiment is the point on the evaluation scale that corresponds to our positive or negative evaluation of the meaning of this sentiment.” Sentiment polarity allows researchers to use a binary (positive, negative) or ternary (positive, negative, neutral) measurement and therefore simplifies the representation and management of sentiment information. The granularity of SA can be either coarse-grained or fine-grained. Coarse-grained usually means a binary classification (positive, negative). Fine-grained classification, on the other hand, uses for example five levels (high positive, low positive, neutral, low negative, high negative). Liu (2012) presented three levels of SA: document level, sentence level and entity / aspect level.

While the document level studies the polarity of the whole text with respect to a single entity (e.g. a product), the sentence level studies the polarity of single sentences, analyzing clauses and phrases for their sentiment. In contrast, the entity / aspect level analyzes what exactly people liked or disliked. An entity aspect might be a single token, and its polarity might differ from the overall polarity of the text (Liu, 2012).
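To make the coarse-grained and fine-grained label schemes concrete, the following minimal sketch maps a star rating (1 to 5) to binary, ternary and five-level polarity labels. The function names and the thresholds (for example, treating 3 stars as neutral and dropping it in the binary case) are assumptions chosen for illustration; they are not taken from any of the cited works.

```python
# Five-level (fine-grained) scheme, using the level names from the text above.
FINE_GRAINED = {
    1: "high negative",
    2: "low negative",
    3: "neutral",
    4: "low positive",
    5: "high positive",
}


def to_fine_grained(stars: int) -> str:
    """Map a 1-5 star rating to one of five polarity levels."""
    return FINE_GRAINED[stars]


def to_ternary(stars: int) -> str:
    """Map a 1-5 star rating to positive / neutral / negative."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def to_binary(stars: int):
    """Map a rating to positive / negative; neutral ratings yield None (assumed to be dropped)."""
    label = to_ternary(stars)
    return None if label == "neutral" else label


for stars in range(1, 6):
    print(stars, to_binary(stars), to_ternary(stars), to_fine_grained(stars))
```

Whether neutral examples are dropped or merged into one of the two classes is a design decision that differs between data sets.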

##### Application
To mention a few application areas:

- Social media monitoring,
- Customer support / feedback,
- Brand monitoring and reputation management,
- Voice of customer (VoC),
- Voice of employee,
- Product analysis,
- Market research and competitive research.

##### Classification

All methods used to solve sentiment classification fall into three main categories: lexicon-based, machine learning-based and hybrid approaches.

In lexicon-based approaches, also known as knowledge-based methods, sentiment is seen as a function of keywords and is based on their count. The main task is the construction of sentiment word lexicons in which each word carries the class label positive or negative and, in some cases, an intensity, which becomes important for fine-grained classification.
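The idea of a lexicon-based classifier can be sketched in a few lines. The example below is only an illustration: the tiny lexicon, its intensity weights, the negation list and the function names are all hypothetical, and real lexicons contain thousands of entries.

```python
import re

# Hypothetical mini-lexicon: word -> polarity weight (sign = class, magnitude = intensity).
LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "love": 2.0,
    "bad": -1.0,
    "boring": -1.0,
    "terrible": -2.0,
}
# Assumed simple negation rule: a negation word flips the weight of the next token only.
NEGATIONS = {"not", "no", "never"}


def lexicon_score(text: str) -> float:
    """Sum the lexicon weights of all tokens, flipping the sign directly after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    negate = False
    for token in tokens:
        weight = LEXICON.get(token, 0.0)
        score += -weight if negate else weight
        negate = token in NEGATIONS
    return score


def classify(text: str) -> str:
    """Turn the summed score into a ternary polarity label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("A great movie, I love it"))
print(classify("Not good and rather boring"))
```

Because the score only depends on dictionary lookups, such a classifier needs no training data, but it cannot handle sarcasm or words missing from the lexicon.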

An alternative to the knowledge-based method is Machine Learning, which is attracting more and more interest from researchers due to its adaptability and higher accuracy. Traditional Machine Learning methods were the dominant approach in SA (Pang et al., 2002), with three main algorithms: Naïve Bayes (NB), Support Vector Machines (SVM) and Maximum Entropy (MaxEnt, known in statistics as Logistic Regression). Deep Learning (DL) models and Transformer models are also part of the Machine Learning family.

The hybrid approach, also known as combined analysis or ensemble models, combines knowledge-based and Machine Learning-based methods and can thus lead to superior performance. Researchers have explored hybrid approaches in the hope of combining the accuracy of a Machine Learning approach with the speed of a lexical approach.

Although traditional Machine Learning algorithms like Support Vector Machines have shown good performance in various NLP tasks over the past decades, they have a few shortcomings. DL has the potential to overcome these limitations to a large extent and has already shown excellent performance in NLP tasks, including SA (Collobert, Weston, Bottou, Karlen, Kavukcuoglu & Kuksa, 2011).

### Research Overview

- brief historical overview
- research/literature streams and focus
- what is the state-of-the-art research towards 2022

The following section describes related work from the past five years that uses different approaches to solve SA tasks on different data sets and from different perspectives. The review is based on numerous recent studies in the field of SA. The first table presents several methods for English texts, whereas the second table lists a few papers for other languages such as Greek, German or French. SA for languages other than English remains a topic for future studies.

There are several papers that exploit the methods of lexicon-based models, i.e. 

Hybrid:

- Gaye, Zhang & Wulamu (2021)
- Novikova & Stupnikov (2018)
- Alsayat (2021)

English:

<table>
  <tr>
   <td><strong>Paper Name</strong>
   </td>
   <td><strong>Year of Publication</strong>
   </td>
   <td><strong>Dataset(s)</strong>
   </td>
   <td><strong>Classification</strong>
   </td>
   <td><strong>Algorithms</strong>
   </td>
   <td><strong>Performance Evaluation Criteria</strong>
   </td>
   <td><strong>Source</strong>
   </td>
  </tr>
  <tr>
   <td>Explainable Sentiment Analysis: A Hierarchical Transformer-Based Extractive Summarization Approach
   </td>
   <td>2021
   </td>
   <td>Large IMDB
   </td>
   <td>Transformer Models
   </td>
   <td>Explainable Hierarchical Transformer (ExHiT),  Sentence Classification Combiner Model (SCC)
   </td>
   <td>Accuracy
   </td>
   <td><a href="https://www.mdpi.com/2079-9292/10/18/2195/pdf">https://www.mdpi.com/2079-9292/10/18/2195/pdf</a>
   </td>
  </tr>
  <tr>
   <td>A Tweet Sentiment Classification Approach Using a Hybrid Stacked Ensemble Technique
   </td>
   <td>2021
   </td>
   <td>Sentiment140
   </td>
   <td>Hybrid of Lexicon-, ML- and DL-based models
   </td>
   <td>stacked ensemble of three long short-term memory (LSTM) networks as base classifiers and logistic regression (LR) as a meta classifier
   </td>
   <td>accuracy, precision, recall, F1 score
   </td>
   <td><a href="https://www.mdpi.com/2078-2489/12/9/374">https://www.mdpi.com/2078-2489/12/9/374</a>
   </td>
  </tr>
  <tr>
   <td>Optimization of sentiment analysis using machine learning classifiers
   </td>
   <td>2017
   </td>
   <td>three manually compiled datasets: two captured from Amazon and one assembled from IMDB movie reviews
   </td>
   <td>Machine Learning
   </td>
   <td>Naïve Bayes, J48, BFTree and OneR
   </td>
   <td>accuracy, F-measure, correctly classified instances
   </td>
   <td><a href="https://doi.org/10.1186/S13673-017-0116-3">https://doi.org/10.1186/S13673-017-0116-3</a>
   </td>
  </tr>
  <tr>
   <td>Sentiment Analysis of Short Texts from Social Networks Using Sentiment Lexicons and Blending of Machine Learning Algorithms
   </td>
   <td>2018
   </td>
   <td>VKontakte social network posts
   </td>
   <td>Hybrid
   </td>
   <td>Logistic Regression, Random Forest Classifier, SVM, Gradient Boosting Classifier, KNeighbors Classifier, Multinomial Naive Bayes
   </td>
   <td>F1 Score
   </td>
   <td>http://ceur-ws.org/Vol-2268/paper21.pdf
   </td>
  </tr>
  <tr>
   <td>Tweets Classification on the Base of Sentiments for US Airline Companies
   </td>
   <td>2019
   </td>
   <td>Twitter US Airline Sentiment
   </td>
   <td>Machine Learning
   </td>
   <td>Voting Classifier is based on logistic regression (LR) and stochastic gradient descent classifier (SGDC) <strong>vs</strong> a variety of machine learning classifiers
   </td>
   <td>accuracy, precision, recall and F1 score
   </td>
   <td><a href="https://doi.org/10.3390/e21111078">https://doi.org/10.3390/e21111078</a>
   </td>
  </tr>
  <tr>
   <td>The Impact of Features Extraction on the Sentiment Analysis
   </td>
   <td>2019
   </td>
   <td>Sentiment Strength Twitter Dataset
   </td>
   <td>Machine Learning
   </td>
   <td>TFIDF vs N-gram on 6 ML algos (LR, SVM, Decision Tree, Random Forest, KNN, Naive Bayes)
   </td>
   <td>accuracy, precision, recall and F1 score
   </td>
   <td>https://www.sciencedirect.com/science/article/pii/S1877050919306593
   </td>
  </tr>
  <tr>
   <td>Topic Modelling, Sentiment Analysis and Classification of Short-Form Text
   </td>
   <td>2019
   </td>
   <td>data obtained through Twitter and Facebook’s public APIs with Netlytic
   </td>
   <td>Lexicon-based, Machine Learning, Deep Learning
   </td>
   <td>LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis) vs LR, SVM and Naive Bayes
   </td>
   <td>technical performance (perplexity score and topic coherence score), ease of application, as well as proximity to human agent performance on the same problem
   </td>
   <td>https://local.cis.strath.ac.uk/wp/extras/msctheses/papers/strath_cis_publication_2733.pdf
   </td>
  </tr>
  <tr>
   <td>Using unsupervised information to improve semi-supervised tweet sentiment classification
   </td>
   <td>2016
   </td>
   <td>6 datasets: SemEval 2013, LiveJournal, SMS2013, Twitter2013, Twitter2014, Twitter Sarcasm 2014 
   </td>
   <td>Machine Learning
   </td>
   <td>semi-supervised C3E algorithm vs SVM
   </td>
   <td>F-Scores
   </td>
   <td>https://www.researchgate.net/publication/295244270_Using_unsupervised_information_to_improve_semi-supervised_tweet_sentiment_classification
   </td>
  </tr>
  <tr>
   <td>Improving Sentiment Analysis for Social Media Applications Using an Ensemble Deep Learning Language Model
   </td>
   <td>2021
   </td>
   <td>3 datasets: own Twitter coronavirus hashtag dataset as well as public review datasets from Amazon and Yelp
   </td>
   <td>Hybrid
   </td>
   <td>customized deep learning model combining an advanced word embedding technique with a long short-term memory (LSTM) network
   </td>
   <td>accuracy
   </td>
   <td>https://pubmed.ncbi.nlm.nih.gov/34660170/
   </td>
  </tr>
  <tr>
   <td>Enhancing Deep Learning Sentiment Analysis with Ensemble Techniques in Social Applications
   </td>
   <td>2017
   </td>
   <td>7 datasets on movie reviews & microblogging 
   </td>
   <td>Deep Learning, Hybrid
   </td>
   <td>
   </td>
   <td>F1 score
   </td>
   <td>https://www.researchgate.net/publication/313332224_Enhancing_Deep_Learning_Sentiment_Analysis_with_Ensemble_Techniques_in_Social_Applications
   </td>
  </tr>
  <tr>
   <td>Machine learning based customer sentiment analysis for recommending shoppers, shops based on customers’ review
   </td>
   <td>2020
   </td>
   <td>product data with customer reviews is collected from benchmark Unified computing system (UCS)
   </td>
   <td>Machine Learning 
   </td>
   <td>Hybrid Recommendation System
   </td>
   <td>MAPE
   </td>
   <td>https://link.springer.com/article/10.1007/s40747-020-00155-2
   </td>
  </tr>
  <tr>
   <td>Sentiment Analysis Using Convolutional Neural Network
   </td>
   <td>2020
   </td>
   <td>IMDB movie reviews
   </td>
   <td>Deep Learning
   </td>
   <td>RNN, LSTM, CNN
   </td>
   <td>accuracy
   </td>
   <td>https://ieeexplore.ieee.org/abstract/document/7363395
   </td>
  </tr>
</table>


Other languages: 


<table>
  <tr>
   <td><strong>Language</strong>
   </td>
   <td><strong>Paper Name</strong>
   </td>
   <td><strong>Year of Publication</strong>
   </td>
   <td><strong>Dataset(s)</strong>
   </td>
   <td><strong>Classification</strong>
   </td>
   <td><strong>Algorithms</strong>
   </td>
   <td><strong>Performance Evaluation Criteria</strong>
   </td>
   <td><strong>Source</strong>
   </td>
  </tr>
  <tr>
   <td>Arabic
   </td>
   <td>Deep Bidirectional LSTM Network Learning-Based Sentiment Analysis for Arabic Text
   </td>
   <td>2021
   </td>
   <td> six benchmark sentiment analysis datasets
   </td>
   <td>Deep Learning
   </td>
   <td> Bidirectional LSTM Network (BiLSTM)
   </td>
   <td>
   </td>
   <td>https://www.degruyter.com/document/doi/10.1515/jisys-2020-0021/html
   </td>
  </tr>
  <tr>
   <td>Greek
   </td>
   <td>A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media
   </td>
   <td>2021
   </td>
   <td>self-collected and annotated Greek Social Media Texts 
   </td>
   <td>Deep Learning
   </td>
   <td>PaloBert, GreekBERT
   </td>
   <td>F1 Score, Accuracy
   </td>
   <td>https://doi.org/10.3390/info12080331
   </td>
  </tr>
  <tr>
   <td>German
   </td>
   <td>Sentiment analysis of a German Twitter-Corpus
   </td>
   <td>2017
   </td>
   <td>German tweets from a bigger dataset
   </td>
   <td>Machine Learning
   </td>
   <td>Multinomial NB,  LinearSVC, Decision Tree Classifier, Maxent Classifier
   </td>
   <td>F-measure, accuracy
   </td>
   <td>http://ceur-ws.org/Vol-1917/paper06.pdf
   </td>
  </tr>
  <tr>
   <td>Spanish
   </td>
   <td>A case study of Spanish text transformations for twitter sentiment analysis
   </td>
   <td>2021
   </td>
   <td>two Spanish datasets
   </td>
   <td>Machine Learning
   </td>
   <td>SVM
   </td>
   <td>accuracy, computing time
   </td>
   <td>https://www.sciencedirect.com/science/article/abs/pii/S0957417417302312?via%3Dihub
   </td>
  </tr>
  <tr>
   <td>(Brazilian) Portuguese
   </td>
   <td>Analyzing the Brazilian Financial Market Through Portuguese Sentiment Analysis in Social Media
   </td>
   <td>2018
   </td>
   <td>self annotated Twitter dataset on financial market
   </td>
   <td>Machine Learning
   </td>
   <td>Naive Bayes, Support Vector Machines, Maximum Entropy and Multilayer Perceptron
   </td>
   <td>accuracy
   </td>
   <td>https://www.researchgate.net/profile/Arthur-Carosia/publication/336933355_Analyzing_the_Brazilian_Financial_Market_through_Portuguese_Sentiment_Analysis_in_Social_Media/links/5e67edc24585153fb3d5b305/Analyzing-the-Brazilian-Financial-Market-through-Portuguese-Sentiment-Analysis-in-Social-Media.pdf
   </td>
  </tr>
  <tr>
   <td>French
   </td>
   <td>Sentiment Analysis of French Tweets based on Subjective Lexicon Approach: Evaluation of the use of OpenNLP and CoreNLP Tools
   </td>
   <td>2021
   </td>
   <td>French tweets using "Public Opinion Knowledge (POK)" platform
   </td>
   <td>Lexicon based in comparison to Machine Learning
   </td>
   <td>OpenNLP, CoreNLP, dependency analysis implemented by CoreNLP
   </td>
   <td>precision, F-measure
   </td>
   <td>https://www.researchgate.net/publication/326514882_Sentiment_Analysis_of_French_Tweets_based_on_Subjective_Lexicon_Approach_Evaluation_of_the_use_of_OpenNLP_and_CoreNLP_Tools
   </td>
  </tr>
</table>

### Classifier Benchmarking

Benchmarking classifiers using community data sets

- introduce data sets, justify choice
    - social media texts from Twitter (about several domains)
    - user reviews from IMDB, Rotten Tomatoes (about movies)
- compare classifiers
    - baseline model (Traditional ML):
        - Logistic Regression on TF-IDF features (LASSO) (movie reviews?)
        - SVM (Twitter?)
    - Deep Learning: Hierarchical Attention Network (HAN) / LSTM / CNN / ULMFiT / ELMo
    - Transformer Model: BERT or XLNet (Generalized Autoregressive Pretraining for Language Understanding), ALBERT

### Dataset Overview

<table>
  <tr>
   <td><strong>Name</strong>
   </td>
   <td><strong>Platform</strong>
   </td>
   <td><strong>Domain</strong>
   </td>
   <td><strong>Size</strong>
   </td>
   <td><strong>Evaluation (binary or more)</strong>
   </td>
   <td><strong>Language</strong>
   </td>
   <td><strong>Source</strong>
   </td>
  </tr>
  <tr>
   <td>Twitter US Airline Sentiment
   </td>
   <td>Twitter 
   </td>
   <td>US Airline user experiences
   </td>
   <td>3.42 MB
   </td>
   <td>ternary = positive, negative, neutral
   </td>
   <td>English
   </td>
   <td>https://www.kaggle.com/crowdflower/twitter-airline-sentiment
   </td>
  </tr>
  <tr>
   <td>Sentiment140
   </td>
   <td>Twitter
   </td>
   <td>user responses to different products, brands, or topics
   </td>
   <td>228 MB training set (1,600,000 tweets)
   </td>
   <td>0 = negative, 2 = neutral, 4 = positive
   </td>
   <td>English
   </td>
   <td>http://help.sentiment140.com/for-students
   </td>
  </tr>
  <tr>
   <td>Stanford Sentiment Treebank
   </td>
   <td>Rotten Tomatoes
   </td>
   <td>movie reviews
   </td>
   <td>10,000
   </td>
   <td>1-25 (25: most positive)
   </td>
   <td>English
   </td>
   <td>https://nlp.stanford.edu/sentiment/code.html
   </td>
  </tr>
  <tr>
   <td>Large IMDB Movie Reviews
   </td>
   <td>IMDB
   </td>
   <td>movie reviews
   </td>
   <td>25,000 training, 25,000 test
   </td>
   <td>binary
   </td>
   <td>English
   </td>
   <td>https://ai.stanford.edu/~amaas/data/sentiment/
   </td>
  </tr>
  <tr>
   <td>Polarity v2.0
   </td>
   <td>
   </td>
   <td>movie reviews
   </td>
   <td>3MB (1000 positive and 1000 negative processed reviews)
   </td>
   <td>binary
   </td>
   <td>English
   </td>
   <td>http://www.cs.cornell.edu/people/pabo/movie-review-data/
   </td>
  </tr>
  <tr>
   <td>Paper Reviews
   </td>
   <td>conference of computing
   </td>
   <td>user’s opinion about a paper
   </td>
   <td>405
   </td>
   <td>-2: very negative, -1: negative, 0: neutral, 1: positive, 2: very positive
   </td>
   <td>English, Spanish
   </td>
   <td>https://archive.ics.uci.edu/ml/datasets/Paper+Reviews
   </td>
  </tr>
  <tr>
   <td>Multi-Domain Sentiment Dataset
   </td>
   <td>Amazon
   </td>
   <td>reviews of amazon products
   </td>
   <td>unprocessed: 1.9 GB, processed: 19 MB
   </td>
   <td>reviews contain ratings from 1 to 5 stars (can be converted to binary)
   </td>
   <td>English
   </td>
   <td>https://www.cs.jhu.edu/~mdredze/datasets/sentiment/
   </td>
  </tr>
  <tr>
   <td>Opin-Rank Review Dataset
   </td>
   <td>Tripadvisor, Edmunds
   </td>
   <td>hotel & car reviews
   </td>
   <td>300,000
   </td>
   <td>ratings that can be turned into binary?
   </td>
   <td>English
   </td>
   <td>https://archive.ics.uci.edu/ml/datasets/opinrank+review+dataset
   </td>
  </tr>
  <tr>
   <td>Sentiment Lexicons For 81 Languages
   </td>
   <td>-
   </td>
   <td>-
   </td>
   <td>2 text files per language
   </td>
   <td>binary
   </td>
   <td>81 languages: Afrikaans to Yiddish
   </td>
   <td>https://sites.google.com/site/datascienceslab/projects/multilingualsentiment
   </td>
  </tr>
  <tr>
   <td>Lexicoder
   </td>
   <td>-
   </td>
   <td>-
   </td>
   <td>2,858 negative sentiment words and 1,709 positive sentiment words
   </td>
   <td>binary
   </td>
   <td>English
   </td>
   <td>http://www.snsoroka.com/data-lexicoder/
   </td>
  </tr>
  <tr>
   <td>DynaSent
   </td>
   <td>Dynabench
   </td>
   <td>naturally occurring sentences combined with sentences created using the open-source Dynabench Platform
   </td>
   <td>121,634 sentences
   </td>
   <td>ternary
   </td>
   <td>English
   </td>
   <td>https://github.com/cgpotts/dynasent
   </td>
  </tr>
  <tr>
   <td>Amazon Fine Foods
   </td>
   <td>Amazon
   </td>
   <td>product reviews
   </td>
   <td>5,000,000 reviews
   </td>
   <td>ratings that can be turned into binary?
   </td>
   <td>English
   </td>
   <td>https://snap.stanford.edu/data/web-FineFoods.html
   </td>
  </tr>
  <tr>
   <td>Germeval2017
   </td>
   <td>Social Media 
   </td>
   <td>messages
   </td>
   <td>22,000 messages from various social media and web sources
   </td>
   <td>ternary
   </td>
   <td>German
   </td>
   <td>https://sites.google.com/view/germeval2017-absa/data
   </td>
  </tr>
  <tr>
   <td>Yelp_polarity_reviews
   </td>
   <td>Yelp
   </td>
   <td>business reviews
   </td>
   <td>600,000 reviews for training, 38,000 for testing
   </td>
   <td>binary (1 - bad, 2 - good)
   </td>
   <td>English
   </td>
   <td><a href="https://www.kaggle.com/irustandi/yelp-review-polarity">https://www.kaggle.com/irustandi/yelp-review-polarity</a> 
   </td>
  </tr>
  <tr>
   <td>Financial PhraseBank
   </td>
   <td>
   </td>
   <td>financial news (rated as pos/neg/neutral) for investor
   </td>
   <td>4,840 sentences; 4 configurations available (size depends on the level of agreement of annotators)
   </td>
   <td>ternary
   </td>
   <td>English
   </td>
   <td><a href="https://github.com/huggingface/datasets/tree/master/datasets/financial_phrasebank">https://github.com/huggingface/datasets/tree/master/datasets/financial_phrasebank</a> 
   </td>
  </tr>
  <tr>
   <td>The SigmaLaw- Aspect-Based-SA dataset
   </td>
   <td>Court cases
   </td>
   <td>Legal opinion texts
   </td>
   <td>2,000 sentences
   </td>
   <td>ternary
   </td>
   <td>English
   </td>
   <td><a href="https://osf.io/efrqt/">https://osf.io/efrqt/</a> 
   </td>
  </tr>
  <tr>
   <td>Skytrax Users Review Dataset 
   </td>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
   <td>English
   </td>
   <td><a href="https://github.com/quankiquanki/skytrax-reviews-dataset">https://github.com/quankiquanki/skytrax-reviews-dataset</a> 
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
   <td>https://mpqa.cs.pitt.edu/corpora/mpqa_corpus/
   </td>
  </tr>
</table>

#### Dataset: Movie reviews 

We will compare three classifiers on two datasets that contain movie reviews:
- IMDB dataset: 50,000 movie reviews
- Stanford Sentiment Treebank: 10,000 reviews

##### Baseline algorithm
This experiment uses Logistic Regression, a popular and simple ML algorithm, as the baseline. Logistic Regression is a statistical technique capable of predicting a binary outcome. With the built-in functions of scikit-learn, the model was very easy to build. After loading the data set, the code had to re-create the words from the pre-processed data set by building an index that translates the lists of word indices back into strings. Term Frequency - Inverse Document Frequency (TF-IDF) was then used as the text representation. TF-IDF is a statistical measure of how important a word is to a document within a collection. On the training data, the vectorizer first computes the Term Frequency (TF) for each review, then the Inverse Document Frequency (IDF) across all reviews and finally the TF-IDF for each review. On the test data, it only transforms: it computes the TF for each review and then the TF-IDF using the IDF learned from the training data. Finally, the model was fit to classify the sentiment of the movie reviews.
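The fit-on-train, transform-on-test logic described above can be illustrated with a small pure-Python sketch. It assumes raw term counts as TF and the smoothed IDF variant `ln((1 + N) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`; the row-wise L2 normalization that scikit-learn applies by default is omitted here. The helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn one IDF weight per term from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def transform(doc, idf):
    """Compute TF-IDF for one document, reusing the IDF learned on the training data."""
    counts = Counter(doc.split())
    # Terms never seen during training have no IDF and are ignored.
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


train_docs = ["great movie great cast", "boring movie", "great fun"]
test_doc = "great boring sequel"

idf = fit_idf(train_docs)
print(transform(test_doc, idf))
```

In this toy corpus "boring" occurs in fewer training documents than "great" and therefore receives the higher weight, while "sequel" is dropped because it was never seen during fitting.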

##### Deep Learning
The DL algorithm of this experiment is an LSTM model built with Keras. Keras is an easy-to-use, high-level neural network API that is capable of running on top of either TensorFlow or Theano (Keras Documentation, n.d.). The model uses the Sequential API, a linear stack of layers, and consists of five layers: an embedding layer, two dropout layers, an LSTM layer and an output / dense layer.
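A five-layer architecture of this kind might look like the sketch below. It assumes the Keras API bundled with TensorFlow 2; the layer order, the vocabulary size, the embedding dimension, the number of LSTM units and the dropout rate are all assumptions for illustration and are not the settings used in the experiment.

```python
import numpy as np
from tensorflow import keras


def build_lstm_classifier(vocab_size=10_000, embedding_dim=32,
                          lstm_units=64, dropout_rate=0.2):
    """Build a Sequential LSTM model for binary sentiment classification (hypothetical settings)."""
    model = keras.Sequential([
        # Maps each word index to a dense vector.
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        keras.layers.Dropout(dropout_rate),
        # Reads the sequence and returns its final hidden state.
        keras.layers.LSTM(lstm_units),
        keras.layers.Dropout(dropout_rate),
        # Single sigmoid unit: probability of one of the two classes.
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


model = build_lstm_classifier()
# Two dummy reviews of 100 word indices each, only to check that the model runs.
dummy_batch = np.zeros((2, 100), dtype="int32")
print(model(dummy_batch).shape)
```

Training would then call `model.fit` on padded sequences of word indices and their binary labels.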

##### Transformer Model

### NLP Preprocessing Pipeline
Before diving into predictive modeling, we need to preprocess our textual data. For this task we set up a Preprocessor class, which runs an NLP pipeline that takes care of all necessary preprocessing tasks such as lemmatization, stopword removal and text cleaning with regular expressions. In addition, the Preprocessor factorizes the label to obtain a binary encoding.

In [2]:
import numpy as np
import pandas as pd
from Preprocessor import Preprocessor

config = {
    "name": "imdb",
    "df": pd.read_csv("./data/IMDB.csv"),
    "text_feature": "review",
    "label": "sentiment"
}

preprocessor = Preprocessor(**config)

In [3]:
preprocessed_imdb = preprocessor.run()
preprocessed_imdb.head(10)

Read imdb_preprocessed.parquet.gzip from cache...
Successfully read imdb_preprocessed.parquet.gzip into memory.


Unnamed: 0,review,sentiment
0,reviewer mention watch oz episode hook right e...,0
1,wonderful little production the film technique...,0
2,think wonderful way spend time hot summer week...,0
3,basically family little boy jake think zombie ...,1
4,petter mattei love time money visually stunnin...,0
5,probably time favorite movie story selflessnes...,0
6,sure like resurrection date seahunt series tec...,0
7,amazing fresh innovative idea air year brillia...,1
8,encourage positive comment film look forward w...,1
9,like original gut wrenching laughter like movi...,0
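The source of the Preprocessor class is not shown here. As a rough illustration of the kind of steps it performs, the following sketch cleans text with regular expressions, removes stopwords and factorizes the label. It is not the project's implementation: the function name, the tiny stopword list and the toy reviews are hypothetical, and lemmatization is left out because it requires an additional NLP library.

```python
import re

import pandas as pd

# Tiny illustrative stopword list; a real pipeline would use a full list.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "of", "and", "i"}


def clean_text(text: str) -> str:
    """Strip HTML tags and non-letters, lowercase the text and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)           # remove tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters and whitespace only
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


df = pd.DataFrame({
    "review": ["This movie was GREAT!<br />Loved it.", "A boring, predictable plot."],
    "sentiment": ["positive", "negative"],
})

df["review"] = df["review"].map(clean_text)
# factorize assigns integer codes in order of first appearance,
# so which class becomes 0 depends on the order of the data.
codes, classes = pd.factorize(df["sentiment"])
df["sentiment"] = codes

print(df)
print(list(classes))
```

Keeping the `classes` index returned by `pd.factorize` makes it possible to translate the integer codes back into the original labels later.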


### Train-Test-Split

In [4]:
from sklearn.model_selection import train_test_split

X_imdb = preprocessed_imdb['review']
y_imdb = preprocessed_imdb['sentiment']

X_train_imdb, X_test_imdb, y_train_imdb, y_test_imdb = train_test_split(
    X_imdb, y_imdb, test_size = 0.3, random_state = 42
)

### TFIDF

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,3))
X_train_imdb_tfidf = vectorizer.fit_transform(X_train_imdb)
X_test_imdb_tfidf = vectorizer.transform(X_test_imdb)

print('X_train_imdb_tfidf:', X_train_imdb_tfidf.shape)
print('X_test_imdb_tfidf:', X_test_imdb_tfidf.shape)

X_train_imdb_tfidf: (35000, 5287629)
X_test_imdb_tfidf: (15000, 5287629)


### Logistic Regression



In [6]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=100, random_state=42)
log_reg.fit(X_train_imdb_tfidf, y_train_imdb)

LogisticRegression(random_state=42)

In [11]:
y_hat_imdb = log_reg.predict(X_test_imdb_tfidf)
print(y_hat_imdb)

[1 0 1 ... 1 0 0]


In [12]:
score = log_reg.score(X_test_imdb_tfidf, y_test_imdb)
print(f'accuracy: {score}')

accuracy: 0.8787333333333334
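Several of the surveyed papers report precision, recall and F1 score in addition to accuracy. A minimal sketch of how these metrics could be computed with scikit-learn is shown below. The labels are toy values chosen for illustration, not results of this experiment, and treating class 1 as the positive class is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy ground truth and predictions (hypothetical, not taken from the IMDB run).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # pos_label=1 is scikit-learn's default for binary targets.
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On balanced data sets accuracy is often sufficient, while precision and recall become more informative when one class is much rarer than the other.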


#### Dataset: Tweets

##### Baseline Algorithm

##### Deep Learning

##### Transformer Model

### Results

- benchmark ranking of classifiers on different datasets

### Conclusion
- valuable insights on method selection


### Resources

- Gaye, B., Zhang, D., & Wulamu, A. (2021). A Tweet Sentiment Classification Approach Using a Hybrid Stacked Ensemble Technique. Information, 12, 374. https://doi.org/10.3390/info12090374
- Alsayat, A. (2021). Improving Sentiment Analysis for Social Media Applications Using an Ensemble Deep Learning Language Model. Arab J Sci Eng, 1-13. Epub ahead of print, 2021 Oct 11. https://doi.org/10.1007/s13369-021-06227-w. PMID: 34660170; PMCID: PMC8502794.
