# <center> Sentiment Mining </center>

References:
* http://spark-public.s3.amazonaws.com/nlp/slides/sentiment.pptx
* https://pythonprogramming.net/twitter-sentiment-analysis-nltk-tutorial/

## 1. What is Sentiment Mining?
* a.k.a. opinion extraction, opinion mining, sentiment analysis, subjectivity analysis
<img src="what_is_sentiment_mining.png" width='70%'>
<img src="what_is_sentiment_mining2.png" width='70%'>
source: http://spark-public.s3.amazonaws.com/nlp/slides/sentiment.pptx



## 2. Why sentiment mining?
* Movies: is this review positive or negative?
* Products: what do people think about the new iPhone?
* Public sentiment: how is consumer confidence? Is despair increasing?
* Politics: what do people think about this candidate or issue?
* Prediction: predict election outcomes or market trends from sentiment


## 3. Objectives of Sentiment Mining
* Detecting attitudes: “enduring, affectively colored beliefs, dispositions towards objects or persons”
  * Holder (**object**) of attitude
  * Target (**aspect/feature**) of attitude
  * Type of attitude
    * **positive** or **negative**
    * **Scale of the attitude**, e.g. [1, 5], [strongly agree, agree, neutral, disagree, strongly disagree]
* Example: The picture quality of this camera is amazing
  * Holder (object): camera
  * Target (aspect/feature): picture quality
  * Attitude: positive

## 4. Sentiment analysis tasks 
Given a set of texts (reviews, documents, etc.):
1. Identify objects of the sentiment analysis
    * **Named entities**: company names, brands, proper names, hashtags etc
    * Usually object names or synonyms are explicitly mentioned 
2. For each object, identify and extract object aspects/features that have been commented on in each review text
    * **Explicit** features
      * e.g. the **battery life** of this camera is too short
    * **Implicit** features 
      * e.g. the camera is too large (implicit feature: **size**)
3. Determine whether the sentiment on each feature is positive, negative or neutral.
4. Generate a summary of sentiment on each feature and on each object 
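
Step 4 can be sketched as a simple aggregation, assuming steps 1-3 have already produced (object, aspect, polarity) triples. The `Opinion` type, the `summarize` helper and the sample data below are hypothetical and only illustrate the shape of the output.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Opinion(NamedTuple):
    """One extracted opinion: which object, which aspect, which polarity."""
    obj: str
    aspect: str
    polarity: str  # "positive", "negative" or "neutral"


def summarize(opinions):
    """Count polarities per (object, aspect) pair."""
    summary = defaultdict(Counter)
    for op in opinions:
        summary[(op.obj, op.aspect)][op.polarity] += 1
    return summary


# Hypothetical output of steps 1-3 for a few camera reviews.
opinions = [
    Opinion("camera", "battery life", "negative"),
    Opinion("camera", "picture quality", "positive"),
    Opinion("camera", "battery life", "negative"),
    Opinion("camera", "size", "negative"),
]

summary = summarize(opinions)
for (obj, aspect), counts in sorted(summary.items()):
    print(obj, aspect, dict(counts))
```

A summary per object can be obtained the same way by keying on `op.obj` alone.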

## 4.1. Aspect/feature detection
* **Explicit features/aspects**: typically can be extracted by keywords and synonyms
  - Question: how to find synonyms?
  - Challenge: it may be difficult to find an exhaustive list of synonyms for an aspect
  - e.g. <img src="hotel_feature.png" width='50%'> Source: Lappas, T., Sabnis, G., & Valkanas, G. (2016). The <a href=https://www.researchgate.net/publication/309875086_The_Impact_of_Fake_Reviews_on_Online_Visibility_A_Vulnerability_Assessment_of_the_Hotel_Industry> impact of fake reviews on online visibility: A vulnerability assessment of the hotel industry</a>. Information Systems Research, 27(4), 940-961.  
  
* **Implicit features/aspects**, however, may need a **supervised approach** (e.g. the camera is too large)
  - Naive Bayes, SVM, or a CNN with word embeddings are reasonable candidates here
    - single-label or multi-label classification?
  - Process:
    - Select a set of documents with features/aspects both explicitly/implicitly mentioned
    - Label each of the documents with features/aspects as classes
    - Train a classification model
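
Keyword-and-synonym matching for explicit aspects might look like the sketch below. The `ASPECT_KEYWORDS` dictionary and the `find_explicit_aspects` helper are hypothetical; a real lexicon would be much larger and is hard to make exhaustive, as noted above.

```python
import re

# Hypothetical aspect lexicon: aspect name -> keywords and synonyms.
ASPECT_KEYWORDS = {
    "battery life": ["battery life", "battery", "charge"],
    "picture quality": ["picture quality", "image quality", "photos"],
    "price": ["price", "cost", "expensive", "cheap"],
}


def find_explicit_aspects(text, aspect_keywords=ASPECT_KEYWORDS):
    """Return the set of aspects whose keywords occur in the text."""
    lowered = text.lower()
    found = set()
    for aspect, keywords in aspect_keywords.items():
        for kw in keywords:
            # \b keeps "cost" from matching inside "costume".
            if re.search(r"\b" + re.escape(kw) + r"\b", lowered):
                found.add(aspect)
                break
    return found


print(find_explicit_aspects("The battery life of this camera is too short"))
print(find_explicit_aspects("The camera is too large"))
```

The second call finds nothing, because "size" is never mentioned by name. That is the gap a supervised classifier is meant to close.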

## 4.2. Sentiment Detection

### 4.2.1. Challenges of sentiment analysis
* Negation: 
  * e.g., This film should be <font color='blue'>brilliant</font>.  It sounds like a <font color='blue'>great</font> plot, the actors are <font color='blue'>first grade</font>, and the supporting cast is <font color='blue'>good</font> as well, and Stallone is attempting to deliver a good performance. However, it <font color='red'>can’t hold up</font>.
* Language subtlety: 
  * e.g. This is the kind of movie you go to because the theater has air-conditioning
* Domain Dependency
  * e.g. unpredictable movie vs. unpredictable steering (car domain)
* Lots of emoticons

### 4.2.2. Unsupervised Sentiment Analysis 
* Lexicon-based method where sentiment is determined based on **opinion words** (e.g. “amazing”, “great”, “poor”) counted near features/aspects. 
    - Some useful rules:
      - **Negative** sentiment:
        - negative words not preceded by a negation within $n$ (e.g. three) words in the same sentence. 
        - positive words preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - **Positive** sentiment (in a similar fashion):
        - positive words not preceded by a negation within $n$ (e.g. three) words in the same sentence. 
        - negative terms following a negation within $n$ (e.g. three) words in the same sentence
* **Polarity**-based (positive or negative) approaches:
    - <a href="https://www3.nd.edu/~mcdonald/Word_Lists.html"> WordStat Sentiment Dictionary</a>: probably one of the largest freely available lexicons. It contains ~14,000 words (9,164 negative and 4,847 positive) and gives each word a binary (positive or negative) label.
    - <a href="http://sentiwordnet.isti.cnr.it"> SentiWordNet</a>: gives words a positive or negative score between 0 and 1. It contains about 117,660 words, but only ~29,000 of these words have been scored (either positive or negative).
    - <a href="http://www.liwc.net/"> LIWC (Linguistic Inquiry and Word Count)</a>
    - Turney Algorithm (<a href="https://arxiv.org/abs/cs/0212032"> Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews</a>)
      1. extract phrases, 
      2. detect sentiment of phrases
         - Use search engine queries to measure the co-occurrence of a phrase (e.g. low fees) with "excellent"/"poor" (Pointwise Mutual Information)
      3. and average the sentiments
* **Valence**-based where the **intensity** of the sentiment is considered, e.g. excellent, good, average
     - VADER: <a href="http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf"> A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text </a>
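
The negation rules of the lexicon-based method can be sketched as follows. The three word sets are tiny hypothetical lexicons and `score_sentence` is an illustrative helper, not part of any library. The function expects a single sentence, since the rules only look for negations within the same sentence.

```python
import re

# Tiny hypothetical lexicons; a real system would load a full dictionary.
POSITIVE = {"amazing", "great", "good", "excellent"}
NEGATIVE = {"poor", "bad", "horrible"}
NEGATIONS = {"not", "no", "never", "isn't", "wasn't", "don't", "can't"}


def score_sentence(sentence, window=3):
    """Count positive minus negative opinion words, flipping the polarity
    of a word when a negation occurs within `window` tokens before it."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            polarity = 1
        elif tok in NEGATIVE:
            polarity = -1
        else:
            continue
        preceding = tokens[max(0, i - window):i]
        if any(p in NEGATIONS for p in preceding):
            polarity = -polarity
        score += polarity
    return score


for s in ["The picture quality is amazing",
          "The battery is not very good",
          "At least it isn't a horrible book"]:
    print(s, "->", score_sentence(s))
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment. Restricting the count to opinion words near a given aspect keyword would turn this into an aspect-level score.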

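Steps 2 and 3 of the Turney algorithm can be sketched as below. The semantic orientation of a phrase is the difference between its PMI with "excellent" and its PMI with "poor", which reduces to a single log ratio of hit counts. All counts here are made up for illustration, and the `smoothing` constant is an assumption that keeps the ratio defined when a phrase never co-occurs with one of the two words.

```python
import math


def semantic_orientation(near_excellent, near_poor, hits_excellent, hits_poor,
                         smoothing=0.01):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor'), in bits.

    near_excellent / near_poor: hits where the phrase occurs near each word.
    hits_excellent / hits_poor: total hits for each word on its own.
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical hit counts, made up for illustration only.
HITS_EXCELLENT, HITS_POOR = 1_000_000, 800_000
phrase_counts = {
    "low fees": (250, 40),             # (near "excellent", near "poor")
    "unethical practices": (5, 120),
}

scores = {phrase: semantic_orientation(ne, npoor, HITS_EXCELLENT, HITS_POOR)
          for phrase, (ne, npoor) in phrase_counts.items()}

# Step 3: the review is scored by the average orientation of its phrases.
review_score = sum(scores.values()) / len(scores)
print(scores, review_score)
```

A review whose average is above zero would be labeled positive, otherwise negative.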
### 4.2.3. VADER 
- The method of VADER:
    1. Created lexicons of sentiment-related words (~9000)
      -  Built based on existing well-established sentiment word-banks (e.g. LIWC). 
      - Incorporated many lexical features 
        - Western-style emoticons (for example, ":-)")
        - Sentiment-related acronyms (e.g., LOL) and commonly used slang with sentiment value (e.g., "nah", "meh" and "giggly"). 
    2. Had the sentiment-related words manually rated for sentiment intensity through Amazon Mechanical Turk: positive or negative (and, optionally, to what degree)
    3. Implemented heuristic rules:
        - **Punctuation exclamation mark(!)** increases sentiment intensity, e.g. *"The food here is good!!!"*
        - **Capitalization, specifically ALL-CAPS** of a sentiment-relevant word increases the sentiment intensity, e.g. *"The food here is GREAT!"*
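        - **Degree modifiers** (also called intensifiers, e.g. *extremely*) increase or decrease the sentiment intensity, e.g. *"The food here is extremely good"* vs. *"The food here is marginally good"*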
        - **Contrastive conjunction "but"** signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. e.g. *"The food here is great, but the service is horrible"*. 

- VADER checks whether any of the words in a piece of text are present in the lexicon. Sentiment metrics are derived from the ratings of those words:
    - **Positive**, **neutral** and **negative** represent the proportion of the text that falls into each category. 
    - The final metric, the **compound** score, is the sum of the lexicon ratings of the matched words, adjusted by the heuristic rules and then normalized to range between -1 and 1. 
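
The normalization behind the compound score can be sketched as below. To the best of my knowledge the reference implementation divides the raw sum $x$ by $\sqrt{x^2 + \alpha}$ with $\alpha = 15$; treat both the formula and the constant as assumptions to verify against the library version you use. The raw sums in the loop are made up.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded sum of valence scores into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Hypothetical raw sums of word ratings after the heuristic adjustments.
for raw in (-10.0, -1.9, 0.0, 1.9, 4.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

The mapping is monotonic and symmetric around zero, so stronger raw sentiment always gives a compound score closer to -1 or 1 without ever reaching them.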

In [None]:
# Exercise 4.1. SentimentIntensityAnalyzer

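import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# one-time download of the VADER lexicon (skipped if it is already up to date)
nltk.download('vader_lexicon', quiet=True)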

sid = SentimentIntensityAnalyzer()

text='The food is good and the atmosphere is nice'
ss = sid.polarity_scores(text)
print(ss)

In [None]:
# Exercise 4.2. Easy sentences

#http://www.nltk.org/howto/sentiment.html
#http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html
#https://www.researchgate.net/publication/275828927_VADER_A_Parsimonious_Rule-based_Model_for_Sentiment_Analysis_of_Social_Media_Text

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from nltk import tokenize

sentences = ["VADER is smart, handsome, and funny.", # positive sentence example
 "VADER is smart, handsome, and funny!", # punctuation emphasis handled correctly (sentiment intensity adjusted)
 "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
 "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
 "VADER is VERY SMART, handsome, and FUNNY!!!",# combination of signals - VADER appropriately adjusts intensity
 "VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!",# booster words & punctuation make this close to ceiling for score
 "The book was good.",         # positive sentence
 "The book was kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
 "The plot was good, but the characters \
 are uncompelling and the dialog is not great.", # mixed negation sentence
 "A really bad, horrible book.",       # negative sentence with booster words
 "At least it isn't a horrible book.", # negated negative sentence with contraction
 ":) and :D"     # emoticons handled
 ]

# initialize analyzer

sid = SentimentIntensityAnalyzer()

for sentence in sentences:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    # print all four scores on one line
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print("\n")

In [None]:
# Exercise 4.3. Tricky sentences
# How well do you think VADER performs
# on this group of sentences?

tricky_sentences = [
    "Sentiment analysis has never been good.",
    "Sentiment analysis with VADER has never been this good.",
    "Warren Beatty has never been so entertaining.",
    "I won't say that the movie is astounding and I wouldn't claim that "]

# initialize analyzer
sid = SentimentIntensityAnalyzer()

for sentence in tricky_sentences:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    # print all four scores on one line
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print("\n")

In [None]:
# Exercise 4.4. Tricky Paragraph

# Deal with Paragraph
# question: if a paragraph contains mixed positive and 
# negative sentences, how do you determine the sentiment
# of the entire paragraph?

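# adjacent string literals are joined, which avoids stray indentation inside the text;
# the straight apostrophe in "can't" lets contraction-based negation checks match it
paragraph = ("This film should be brilliant. "
             "It sounds like a great plot, the actors are first grade, "
             "and the supporting cast is good as well, "
             "and Stallone is attempting to deliver a good performance. "
             "However, it can't hold up.")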

# split into sentences
# (tokenize was imported in Exercise 4.2; sent_tokenize needs NLTK's
#  Punkt tokenizer data, e.g. nltk.download('punkt'))
lines_list = tokenize.sent_tokenize(paragraph)

# initialize analyzer
sid = SentimentIntensityAnalyzer()

# analyze the sentiment sentence by sentence

for sentence in lines_list:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print("\n")

# what if you analyze the entire paragraph as a whole?

In [None]:
# Exercise 4.5. Design a document sentiment classifier based on VADER
# test your classifier using amazon review dataset
# and estimate its accuracy

### 4.2.4. Supervised Sentiment Analysis
- Naive Bayes (baseline), SVM, CNN. 
- Different ways to generate the feature space:
  * TF-IDF with all tokens
  * All tokens with binary counts only (presence/absence)
  * Word embedding
- Check lecture notes for "Text Classification" and "Deep Learning II"
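
The difference between the binary and TF-IDF feature spaces can be seen on a toy corpus. This is a minimal pure-Python sketch. The documents are made up, and the weighting used (raw term frequency times $\log(N/df)$) is only one common TF-IDF variant; library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "great plot and great acting",
    "poor plot and poor acting",
    "great camera but poor battery",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Document frequency: in how many documents each token appears.
df = Counter(tok for doc in tokenized for tok in set(doc))
n_docs = len(tokenized)


def binary_vector(tokens):
    """1 if the token occurs in the document, else 0."""
    present = set(tokens)
    return [1 if tok in present else 0 for tok in vocab]


def tfidf_vector(tokens):
    """Raw term frequency times log(N / df) for every vocabulary token."""
    tf = Counter(tokens)
    return [tf[tok] * math.log(n_docs / df[tok]) for tok in vocab]


print(vocab)
print(binary_vector(tokenized[0]))
print([round(w, 3) for w in tfidf_vector(tokenized[0])])
```

The binary vector ignores that "great" occurs twice in the first document, while the TF-IDF vector weights it by both its count and its rarity across documents. Either vector can be fed to a Naive Bayes or SVM classifier.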