## Sentiment Analysis: benchmarking state-of-the-art classifiers

### Introduction

What is Sentiment Analysis?
- NLP frame, one of many NLP tasks
- Definition: extract subjective information to determine polarity
- areas of application (e.g. user-generated content)
- research timeline for SA / deep learning in NLP (starting in the 2010s)

The growth of user-generated content on websites and social networks such as Twitter, Amazon, Tripadvisor, Rotten Tomatoes and IMDB has given people increasing power to express opinions. In recent years, the automatic extraction of opinions from text has become an area of growing interest. Combined with the fast-spreading nature of online content, online opinions have turned into a valuable asset. To analyze this massive amount of information, many Natural Language Processing (NLP) tasks are being used. In particular, Sentiment Analysis (SA), also known as Opinion Mining, has become an increasingly important task (Liu, 2015), whose goal is to classify the opinions and sentiments expressed in user-generated text. SA is on the rise because of the growing need to analyze and structure hidden information that comes from <span style="background-color:yellow">social media [user-generated content in general?]</span> in the form of unstructured data (Ain, Ali, Riaz, Noureen, Kamran, Hayat & Rehman, 2017). SA makes it possible to detect the emotions and sentiment that the author of a text felt towards a described subject or entity. It is of interest in many fields and helps solve various tasks, e.g.:

- companies can measure the feedback about a product or service,
- sociologists can look at people’s reactions to public events,
- psychologists can study the general state of mind of communities with regard to various issues, e.g. with a depression detection model based on SA in micro-blog social networks (Wang, Zhang, Ji, Sun, Wu & Bao, 2013),
- governments and political parties can correct their actions according to social approval or disapproval,
- etc.

Sentiments are not always expressed explicitly, and meaning can be hidden in the context, where additional word and language knowledge is necessary. Moreover, opinions may involve sarcasm and negation, which can be interpreted differently in various domains and contexts. Sentiment classification is rather easy for humans (Pang, Lee & Vaithyanathan, 2002), but manual review and analysis of texts is very time-consuming and thus expensive. For this reason, automatic sentiment classifiers are used instead. SA methods traditionally fall into three groups: lexicon-based methods, machine learning-based methods and hybrid methods, which combine the two.

Although traditional machine learning algorithms like Support Vector Machines have shown good performance in various NLP tasks for the past few decades, they have a few shortcomings, and <span style="background-color:yellow">DL models [is only defined in the next sentence]</span> have the potential to overcome these limitations to a large extent.
A promising alternative to traditional machine learning-based methods is Deep Learning (DL). It has shown excellent performance in NLP tasks, including Sentiment Analysis (Collobert, Weston, Bottou, Karlen, Kavukcuoglu & Kuksa, 2011). The main idea of DL is to learn complex features from data using deep neural networks with a minimum of human intervention.

#### Sentiment Analysis: <span style="background-color:yellow">Definition & Classification [already defined above?]</span>

Sentiment Analysis, also known as opinion mining or sentiment polarity, is an active research area in NLP that refers to the use of text analysis, statistical learning and often Machine Learning to extract subjective information in source materials such as user-generated texts from social networks, blogs, forums and product or service reviews.
Selecting the basic emotions is a difficult task for a computer because of the variety of human emotions. Most authors in the NLP community agree on the classification proposed by Ekman, Friesen and Ellsworth (1982), who identified six basic emotions: anger, disgust, fear, joy, sadness and surprise. Such a division requires complex processing and analysis of the input data, which is often not feasible. Therefore, the majority of researchers accept a simpler representation of sentiments according to their polarity (Pang & Lee, 2008). Kurosu (2015) defines sentiment polarity as follows: “The polarity of a sentiment is the point on the evaluation scale that corresponds to our positive or negative evaluation of the meaning of this sentiment.” Sentiment polarity allows researchers to use a single dimension (positive versus negative) and therefore simplifies the representation and management of sentiment information.
Liu (2012) presented three levels of SA: (i) document level, (ii) sentence level and (iii) entity / aspect level. The document level studies the polarity of the whole text with respect to a single entity (e.g. a product). The sentence level studies the polarity of individual sentences, analyzing clauses and phrases for their sentiment. The entity / aspect level analyzes what exactly people liked or disliked. An entity or aspect might be a single token, and its polarity might differ from the overall polarity of the text (Liu, 2012).
The granularity of SA can be either coarse-grained or fine-grained. Coarse-grained usually means a binary classification (positive, negative). Fine-grained classification, on the other hand, uses more levels, for example five (high positive, low positive, neutral, low negative, high negative).
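
The difference between the two granularities can be illustrated with a small sketch. It assumes a polarity score in the range [-1, 1]; the threshold values and the function names `coarse_label` and `fine_label` are illustrative assumptions, not part of any cited work.

```python
from bisect import bisect_right

# Assumed bin edges on a [-1, 1] polarity scale (illustrative only).
FINE_THRESHOLDS = [-0.6, -0.2, 0.2, 0.6]
FINE_LABELS = [
    "high negative",
    "low negative",
    "neutral",
    "low positive",
    "high positive",
]


def coarse_label(score: float) -> str:
    """Binary polarity; a score of exactly 0 is counted as positive by assumption."""
    return "positive" if score >= 0 else "negative"


def fine_label(score: float) -> str:
    """Five-level polarity: find the bin that the score falls into."""
    return FINE_LABELS[bisect_right(FINE_THRESHOLDS, score)]


for s in (-0.9, -0.3, 0.0, 0.9):
    print(s, coarse_label(s), fine_label(s))
```

The same score therefore receives a label of different resolution depending on the chosen granularity; in practice, the bins would be defined by the annotation scheme of the data set rather than by fixed thresholds.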

Application:

- product reviews,
- customer e-mails about a product or service,
- people’s reactions on Twitter to an advertisement, a campaign, a product release, etc.,
- blogs / news articles about recent topics, e.g. the presidential election.

Classification:

All methods used to solve sentiment classification fall into three main categories: lexicon-based, machine learning-based and hybrid approaches.
In knowledge-based approaches, also called lexicon-based approaches, sentiment is seen as a function of keywords and is usually based on their count. The main task is the construction of sentiment word lexicons in which each word carries the class label positive or negative. In some cases, the lexicon also records the intensity of each word, which becomes important for fine-grained classification.
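
A minimal sketch of the lexicon-based idea follows. The tiny lexicon, its intensity values and the helper names `lexicon_score` and `classify` are hypothetical and chosen only for illustration; real systems rely on much larger, curated lexicons.

```python
import re

# Hypothetical toy lexicon: word -> signed intensity (sign = polarity).
LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "wonderful": 2.0,
    "bad": -1.0,
    "boring": -1.0,
    "terrible": -2.0,
}


def lexicon_score(text: str) -> float:
    """Sum the signed intensities of all lexicon words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0.0) for token in tokens)


def classify(text: str) -> str:
    """Map the summed score to a polarity label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("A great movie with a good cast"))
print(classify("Terrible plot and boring dialogue"))
```

Because the sketch only counts keywords, it ignores negation and sarcasm: a phrase such as "not good" would still be scored as positive.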
An alternative to the knowledge-based method is Machine Learning, which is attracting growing interest among researchers because of its adaptability and higher accuracy. <span style="background-color:yellow">Currently, Machine Learning methods are the dominant approach in SA (Pang et al., 2002). The three main algorithms are Naïve Bayes (NB), Support Vector Machines (SVM) and Maximum Entropy (MaxEnt, known in statistics as Logistic Regression). [This seems a little bit outdated? The source is from 2002 :D]</span>
The hybrid approach, also known as combined analysis, combines knowledge-based and machine learning-based methods and can thus lead to superior performance. Researchers have been attracted by the possibility of a hybrid approach that combines the accuracy of a machine learning approach with the speed of a lexical approach.

Transformer???

### Research Overview

- brief historical overview
- research/literature streams and focus
- what is the state-of-the-art research towards 2022

### Classifier Benchmarking

Benchmarking classifiers using community data sets

- introduce data sets, justify choice
    - social media texts from Twitter (about several domains)
    - user reviews from IMDB, Rotten Tomatoes (about movies)
- compare classifiers
    - baseline model (traditional ML):
        - Logistic Regression on TF-IDF features (LASSO) (movie reviews?)
        - SVM (Twitter?)
    - Deep Learning: Hierarchical Attention Network (HAN) / LSTM / CNN / ULMFiT / ELMo
    - Transformer model: BERT or XLNet (Generalized Autoregressive Pretraining for Language Understanding), ALBERT

#### Dataset: Movie reviews 

We will compare three classifiers on two datasets that contain movie reviews:
- IMDB dataset: 50,000 movie reviews
- Stanford Sentiment Treebank: 10,000 reviews

##### Baseline algorithm
This experiment used Logistic Regression, a popular and simple ML algorithm, as the baseline. Logistic Regression is a statistical technique for predicting a binary outcome, and it was easy to build with the built-in functions of scikit-learn. After loading the data set, the code first built a word index that translates the lists of word indices in the preprocessed data set back into strings. It then used Term Frequency-Inverse Document Frequency (TF-IDF) as the text representation. TF-IDF is a statistical measure of how important a word is to a document within a collection. On the training data, the vectorizer computes the Term Frequency (TF) for each review, the Inverse Document Frequency (IDF) across all reviews and finally the TF-IDF for each review. On the test data, it only transforms: it computes the TF for each review and then the TF-IDF using the IDF learned from the training data. Finally, the model was fit to classify the sentiment of the movie reviews.
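
The fit-on-training, transform-on-test procedure described above could be sketched with scikit-learn roughly as follows. The toy reviews and labels are made up for illustration, and the default `TfidfVectorizer` and `LogisticRegression` settings are an assumption, not necessarily the configuration used in the experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data standing in for the movie reviews (1 = positive, 0 = negative).
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull and terrible film",
]
train_labels = [1, 1, 0, 0]
test_texts = ["a wonderful story", "terrible and boring"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and the IDF weights from the training data.
X_train = vectorizer.fit_transform(train_texts)
# transform reuses the training vocabulary and IDF; nothing is re-learned here.
X_test = vectorizer.transform(test_texts)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, train_labels)
predictions = classifier.predict(X_test)
print(predictions)
```

Fitting the vectorizer on the training reviews only keeps information from the test set out of the IDF weights, which is why the test data is passed to `transform` rather than `fit_transform`.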

##### Deep Learning
The DL algorithm of this experiment is an LSTM model built with Keras. Keras is an easy-to-use, high-level neural network API that can run on top of either TensorFlow or Theano (Keras Documentation, n.d.). The model uses the Sequential API, which is a linear stack of layers, and consists of five layers: an embedding layer, two dropout layers, an LSTM layer and an output (dense) layer.
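
A five-layer Sequential model of this kind might look like the sketch below. The vocabulary size, sequence length, layer widths, dropout rates and the exact order of the dropout layers are assumptions made for illustration; the text does not specify them.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # assumed vocabulary size
MAX_LEN = 200        # assumed (padded) review length in tokens

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),  # word indices -> vectors
    layers.Dropout(0.2),                                    # assumed rate
    layers.LSTM(64),                                        # sequence -> single vector
    layers.Dropout(0.2),                                    # assumed rate
    layers.Dense(1, activation="sigmoid"),                  # probability of "positive"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random word indices, used only to check the input and output shapes.
dummy_batch = np.random.randint(0, VOCAB_SIZE, size=(4, MAX_LEN))
probabilities = model.predict(dummy_batch, verbose=0)
print(probabilities.shape)
```

The sigmoid output yields one probability per review, which matches the binary (coarse-grained) setting; a fine-grained setup would instead need a softmax output with one unit per class.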

##### Transformer Model

#### Preprocessing

```python
import numpy as np
import pandas as pd
from Preprocessor import Preprocessor

imdb = pd.read_csv("./data/IMDB.csv")
preprocessor = Preprocessor(imdb)
preprocessed_imdb = preprocessor.run(imdb)
```

```
Read preprocessed.parquet.gzip from cache...
Successfully read preprocessed.parquet.gzip into memory.
```

```python
preprocessed_imdb['review'][5]
```

```
'probably time favorite movie story selflessness sacrifice dedication noble cause preachy bore old despite having time year paul lukas performance bring tear eye bette davis truly sympathetic role delight kid grandma like dress midget child fun watch mother slow awakening happen world roof believable startling dozen thumb movie'
```

#### Dataset: Tweets

##### Baseline Algorithm
- SVM?

##### Deep Learning

##### Transformer Model

### Results

- benchmark ranking of classifiers on different datasets

### Conclusion
- valuable insights on method selection
