# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's `classification_report`. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis concerns the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in the misclassifications? Discuss why these errors are made and propose ways to solve them.
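
To make the quantitative scores concrete, the sketch below computes per-category Precision, Recall, and F<sub>1</sub> by hand, plus the macro average of F<sub>1</sub>. It is only an illustration of the definitions, not a replacement for `classification_report`; the function names and the toy labels are made up for this example.

```python
from collections import Counter

def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} computed from two label lists."""
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            true_pos[g] += 1

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        # precision: share of predictions for this label that are correct
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        # recall: share of gold items with this label that were found
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Toy labels, made up for illustration only
gold = ['positive', 'positive', 'negative', 'neutral', 'negative', 'neutral']
pred = ['positive', 'neutral', 'negative', 'neutral', 'positive', 'neutral']
for label, (p, r, f) in per_class_scores(gold, pred).items():
    print(f"{label:>8}  P={p:.2f}  R={r:.2f}  F1={f:.2f}")
print("macro F1:", round(macro_f1(per_class_scores(gold, pred)), 2))
```

In this toy example the negative category has perfect precision but low recall, which is the kind of imbalance the quantitative evaluation asks you to discuss.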

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon. Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?). What do all scores mean? See https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt


### [3.5 points] Question 1:

Consider the following sentences (1 to 7) and their output as given by VADER, and explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each sentence, explain why the result is correct or what is going wrong.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

VADER is a tool that classifies text into three sentiment categories: positive, negative and neutral. The `neg`, `neu` and `pos` values each range from 0 to 1 and indicate the proportion of the text that falls into each category. The last value, `compound`, provides a single, continuous measure of the text's overall sentiment: it aggregates the valence scores of the individual words and normalizes the sum to a range from -1 (very negative) to +1 (very positive), which allows a quick assessment.
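
The relation between word valences and the compound score can be made concrete with a short sketch. It assumes the normalisation used in the VADER reference implementation, x / sqrt(x² + α) with α = 15, and a negation scalar of -0.74. The lexicon valences in the example (3.2 for "love", 1.3 for ":-)") are assumptions chosen because they reproduce the outputs listed above; check them against the lexicon file. The sketch ignores all other VADER rules (intensifiers, punctuation, capitalisation).

```python
import math

ALPHA = 15        # normalisation constant (assumed from the VADER reference code)
NEGATION = -0.74  # scalar applied to the valence of a negated word (same assumption)

def compound(valences):
    """Map a list of word valences to a compound score between -1 and 1."""
    total = sum(valences)
    return total / math.sqrt(total * total + ALPHA)

# Assumed lexicon valences; verify them in vader_lexicon.txt
love, smiley = 3.2, 1.3

print(round(compound([love]), 4))             # sentence 1: "I love apples"
print(round(compound([love * NEGATION]), 4))  # sentence 2: "don't" flips and dampens "love"
print(round(compound([love, smiley]), 4))     # sentence 3: the emoticon adds valence
```

With these assumed valences, the three printed values should match the compound scores of sentences 1 to 3 above, which shows that the negation rule reverses the polarity of "love" and slightly reduces its strength rather than adding a separate negative score.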

**Sentence 1** 
"I love apples"

The word "love" has a strong positive sentiment. The high positive score and the positive compound score are reasonable for this sentence, and there is no negation that could change the polarity.
The analysis is correct.

**Sentence 2**
"I don't love apples"

VADER works with rules, and in this case the output follows VADER's logic: the negation "don't" reverses the positive valence of the word "love". This negation explains the high negative score and the negative compound score.
The analysis is correct (negation rule applied).

**Sentence 3**
"I love apples :-)"

When comparing this sentence to the first one, "I love apples", we see a higher positive score. This is because emoticons are part of the VADER lexicon. In this case the smiley emoticon boosts the positive sentiment, resulting in a higher positive score and compound score than for the first sentence.
The analysis is correct.

**Sentence 4**
"These houses are ruins"

It is likely that the word "ruins" has a negative valence in the VADER lexicon. Because the tool cannot take context into account, this output is potentially ambiguous: the scores could be correct (the sentence can be read as neutral or negative), but it is not fully clear whether they are accurate. In a historical context, for example, the sentence would be mostly neutral, whereas for a sad circumstance the negative score should be even higher.

**Sentence 5**
"These houses are certainly not considered ruins"

As we saw in sentence 2, a negation flips the sentiment of the word it applies to. Sentence 4, the non-negated version, was scored as negative because of "ruins". Here the negation rule is applied: "not" occurs shortly before "ruins" and reverses its negative valence, so the sentence is scored as positive. The compound score (0.5867) is larger in magnitude than that of sentence 4 (-0.4404), which suggests that another word, most likely "certainly", also contributes positive valence from the lexicon.
In our opinion the analysis is correct, but, as with sentence 4, the balance between the scores is debatable, because negating a negative word does not necessarily make a sentence positive.

**Sentence 6**
"He lies in the chair in the garden"

Incorrect analysis. The verb "lies" is considered to have a negative connotation (in its meaning of deceiving or misleading), while in this case the verb is neutral, as it describes a physical position. This shows that the tool can misinterpret words with multiple meanings, because it does not take into account the context in which they occur.

**Sentence 7**
"This house is like any house"

Incorrect analysis. The sentence is neutral overall, but VADER assigns it a positive score. The most likely cause is the word "like": the lexicon treats it as a positive word (as in "I like apples"), whereas here it is used to make a comparison and carries no sentiment. As in sentence 6, the tool cannot distinguish between different meanings of the same word form.

### [2.5 points] Question 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [1]:
import json

In [2]:
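# open the file explicitly as UTF-8 (the tweets contain emoji) and close it afterwards
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)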

In [3]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    #break

1 {'sentiment_label': 'positive', 'text_of_tweet': 'Just had the most amazing coffee at Cafe Bliss! ☕️ Starting my day off right!', 'tweet_url': 'https://twitter.com/example/status/0000000000001'}
2 {'sentiment_label': 'positive', 'text_of_tweet': 'Finally finished my marathon, and I beat my personal best! Could not be happier! 🏃\u200d♂️🎉', 'tweet_url': 'https://twitter.com/example/status/0000000000002'}
3 {'sentiment_label': 'positive', 'text_of_tweet': 'Huge shoutout to the team for pulling off an incredible project under tight deadlines. #TeamworkMakesTheDreamWork', 'tweet_url': 'https://twitter.com/example/status/0000000000003'}
4 {'sentiment_label': 'positive', 'text_of_tweet': 'The sunset today was breathtaking, nature is truly an incredible artist. 🌅', 'tweet_url': 'https://twitter.com/example/status/0000000000004'}
5 {'sentiment_label': 'positive', 'text_of_tweet': "Our garden's first tomatoes of the season! There's nothing like homegrown food. 🍅😊", 'tweet_url': 'https://twitte

### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point.
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER rules and the VADER lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [4]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'

In [5]:
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = ''# run vader
    vader_label = ''# convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
# use scikit-learn's classification report
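
One way to fill in the two placeholders above is sketched below. Because `run_vader` lives in the other notebook, the sketch takes the scorer as a parameter: `score_fn` is a hypothetical stand-in for any function that returns a VADER-style dict with a `compound` key. The helper names, the tiny keyword scorer, and the two example tweets are all made up so that the snippet runs on its own.

```python
def compound_to_label(compound):
    """Apply the same thresholds as vader_output_to_label above."""
    if compound < 0:
        return 'negative'
    if compound > 0:
        return 'positive'
    return 'neutral'

def label_tweets(my_tweets, score_fn):
    """Return (gold, predicted) label lists for a dict of annotated tweets."""
    gold, predicted = [], []
    for tweet_info in my_tweets.values():
        output = score_fn(tweet_info['text_of_tweet'])
        predicted.append(compound_to_label(output['compound']))
        gold.append(tweet_info['sentiment_label'])
    return gold, predicted

def toy_scorer(text):
    """Stand-in scorer (NOT VADER): +0.5 for 'amazing', -0.5 for 'awful'."""
    lowered = text.lower()
    score = 0.5 * ('amazing' in lowered) - 0.5 * ('awful' in lowered)
    return {'compound': score}

# Made-up tweets in the same format as my_tweets.json
example_tweets = {
    '1': {'sentiment_label': 'positive', 'text_of_tweet': 'Amazing coffee today!'},
    '2': {'sentiment_label': 'negative', 'text_of_tweet': 'The train was on time.'},
}
gold, predicted = label_tweets(example_tweets, toy_scorer)
print(list(zip(gold, predicted)))
```

The two lists can then be passed to scikit-learn's `classification_report(gold, predicted)`. To evaluate real VADER output, pass a function that wraps `run_vader` in place of `toy_scorer`.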

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.
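
The eight settings above differ only in which tokens are kept and whether word forms or lemmas are passed on. The sketch below shows that selection step in isolation. It assumes the tweet has already been tokenised, lemmatised and tagged (for example by the tagger used in the VADER notebook); `prepare_text` is a hypothetical helper and the tags in the example were assigned by hand.

```python
def prepare_text(tokens, lemmatize=False, pos_to_keep=None):
    """
    Build the string that would be passed to VADER.

    tokens: list of (text, lemma, pos) tuples
    pos_to_keep: set of tags to keep, or None to keep every token
    """
    kept = []
    for text, lemma, pos in tokens:
        if pos_to_keep is not None and pos not in pos_to_keep:
            continue  # skip tokens with another part of speech
        kept.append(lemma if lemmatize else text)
    return ' '.join(kept)

# Hand-tagged example (the tags are an assumption, not tagger output)
tokens = [
    ('The', 'the', 'DET'), ('flights', 'flight', 'NOUN'),
    ('were', 'be', 'AUX'), ('delayed', 'delay', 'VERB'),
    ('and', 'and', 'CCONJ'), ('awful', 'awful', 'ADJ'),
]
print(prepare_text(tokens))                                        # as it is
print(prepare_text(tokens, lemmatize=True))                        # lemmatized
print(prepare_text(tokens, pos_to_keep={'ADJ'}))                   # only adjectives
print(prepare_text(tokens, lemmatize=True, pos_to_keep={'VERB'}))  # only verbs, lemmatized
```

Since the lexicon lists inflected word forms, a lemma such as "delay" may or may not match the same entry as "delayed", which is one thing to check when comparing the experiments.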

In [4]:
import pathlib
from sklearn.datasets import load_files


cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
      airline_tweets_folder.exists())

airline_tweets_train = load_files(str(airline_tweets_folder))

path: /Users/pc/Documents/GitHub/Text_Mining_Group1/Lab3/airlinetweets
this will print True if the folder exists: True


## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

In [47]:
import numpy
import nltk
import pathlib
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report



cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
      airline_tweets_folder.exists())

airline_tweets_train = load_files(str(airline_tweets_folder))

vectorizers = {
    'airline_tfidf': TfidfVectorizer,
    'airline_count': CountVectorizer
}

min_df_values = [2, 5, 10]

for vectorizer_name, Vectorizer in vectorizers.items():
    for min_df in min_df_values:
        print(f"\nExperiment: {vectorizer_name} with min_df={min_df}")
        # build the document-term matrix, ignoring terms found in fewer than min_df tweets
        vectorizer = Vectorizer(min_df=min_df)
        airline_counts = vectorizer.fit_transform(airline_tweets_train.data)

        # 80% training / 20% test; the split is not seeded, so it differs per experiment
        docs_train, docs_test, y_train, y_test = train_test_split(
            airline_counts,
            airline_tweets_train.target,
            test_size=0.20,
        )

        # train Naive Bayes and evaluate it on the held-out 20%
        clf = MultinomialNB().fit(docs_train, y_train)
        y_pred = clf.predict(docs_test)

        report = classification_report(y_test, y_pred)
        print(report)
        

path: /Users/pc/Documents/GitHub/Text_Mining_Group1/Lab3/airlinetweets
this will print True if the folder exists: True

Experiment: airline_tfidf with min_df=2
              precision    recall  f1-score   support

           0       0.79      0.92      0.85       326
           1       0.85      0.68      0.76       321
           2       0.81      0.83      0.82       304

    accuracy                           0.81       951
   macro avg       0.81      0.81      0.81       951
weighted avg       0.81      0.81      0.81       951


Experiment: airline_tfidf with min_df=5
              precision    recall  f1-score   support

           0       0.79      0.91      0.85       333
           1       0.83      0.71      0.77       321
           2       0.84      0.83      0.83       297

    accuracy                           0.82       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.82      0.81       951


Experiment: airline_tfidf with min_d

Question 5.b

Based on the classification reports of the experiments with the different settings, we can analyze the results. Performance varies with the vectorizing technique and the document frequency threshold (min_df). The BoW representation, especially at lower min_df values, tends to achieve slightly higher precision and recall, probably because a lower threshold keeps more specific terms that can be crucial for classification. The best-performing setting is BoW with min_df = 2. Among the categories, class 0 performs best in the reports shown above, with the highest F<sub>1</sub> (0.85) and a recall above 0.90.

Increasing the min_df value leads to a decrease in scores. The threshold affects the scores because it determines which words are included in the model's vocabulary. Words that appear in a very low number of documents might be too specific, and keeping them can reduce the model's ability to generalize. On the other hand, when min_df is set too high, less frequent terms that could have been useful for classification are excluded, and mostly very common words remain.
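
The effect of the threshold on the vocabulary can be illustrated with a small sketch. It mimics the idea behind `min_df` (keep a term only if it occurs in at least that many documents) using plain whitespace tokens, which is a simplification of how scikit-learn tokenises. The function name and the mini corpus are made up for illustration.

```python
from collections import Counter

def build_vocabulary(documents, min_df):
    """Keep terms that occur in at least min_df documents (whitespace tokens)."""
    doc_freq = Counter()
    for doc in documents:
        # a set counts each term once per document, giving the document frequency
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Made-up mini corpus for illustration
docs = [
    'flight delayed again',
    'flight cancelled',
    'great flight great crew',
    'delayed and cancelled',
]
for min_df in (1, 2, 3):
    print(min_df, build_vocabulary(docs, min_df))
```

In this toy corpus the highest threshold removes the sentiment-bearing words "delayed" and "cancelled", while the uninformative word "flight" survives, which is the trade-off described above.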


### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [58]:
vectorizer = CountVectorizer(min_df=2)
airline_counts = vectorizer.fit_transform(airline_tweets_train.data)
        
docs_train, docs_test, y_train, y_test = train_test_split(airline_counts,
    airline_tweets_train.target,
    test_size = 0.20)
        
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.83      0.92      0.87       368
           1       0.83      0.69      0.76       290
           2       0.81      0.84      0.82       293

    accuracy                           0.82       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.82      0.82       951



In [59]:
def important_features_per_class(vectorizer, classifier, n=80):
    """Print the n most frequent features per class, ranked by raw count."""
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()
    class_names = ['negative', 'neutral', 'positive']
    for index, class_name in enumerate(class_names):
        if index > 0:
            print("-----------------------------------------")
        print(f"Important words in {class_name} documents")
        # feature_count_ holds, per class, how often each feature occurred in training
        topn = sorted(zip(classifier.feature_count_[index], feature_names), reverse=True)[:n]
        for count, feat in topn:
            print(class_labels[index], count, feat)

important_features_per_class(vectorizer, clf)

Important words in negative documents
0 1365.0 united
0 738.0 to
0 525.0 the
0 392.0 flight
0 380.0 and
0 379.0 you
0 320.0 for
0 316.0 my
0 303.0 on
0 275.0 is
0 265.0 in
0 216.0 your
0 196.0 it
0 194.0 of
0 185.0 that
0 171.0 me
0 168.0 not
0 159.0 no
0 158.0 have
0 152.0 at
0 149.0 was
0 148.0 with
0 118.0 can
0 113.0 this
0 108.0 we
0 107.0 from
0 104.0 virginamerica
0 104.0 service
0 104.0 be
0 97.0 now
0 96.0 an
0 92.0 cancelled
0 89.0 get
0 86.0 they
0 84.0 bag
0 84.0 are
0 83.0 plane
0 83.0 but
0 81.0 customer
0 79.0 just
0 79.0 delayed
0 77.0 why
0 74.0 time
0 72.0 been
0 71.0 what
0 70.0 co
0 69.0 hours
0 69.0 do
0 68.0 still
0 68.0 so
0 68.0 gate
0 67.0 will
0 65.0 http
0 63.0 hour
0 62.0 out
0 61.0 help
0 60.0 when
0 60.0 up
0 60.0 airline
0 59.0 again
0 57.0 amp
0 56.0 how
0 55.0 or
0 55.0 about
0 54.0 our
0 54.0 if
0 53.0 late
0 52.0 as
0 51.0 would
0 51.0 had
0 50.0 flights
0 49.0 there
0 49.0 has
0 48.0 all
0 48.0 after
0 47.0 waiting
0 47.0 don
0 46.0 worst
0 46.0 one


Answer B

Expected features:
1. Negative features: we see words that are associated with a negative experience, such as "delayed", "cancelled", "waiting" and "late", as well as general negative words like "worst".

2. Neutral features: mostly we notice the names of the airlines, which can be part of general tweets about the companies.

3. Positive features: here we see expected positive words like "thanks", "great", "love" and "best", which express satisfaction with the airlines.

Unexpected Features:

Unexpected features across the classes include common stopwords like "to", "the", "and", and pronouns "you", "my". While their high frequency is understandable in NLP, their relevance to sentiment analysis is minimal. Moreover, the presence of specific airline names in all categories might be unexpected since these could either be part of positive feedback or complaints, indicating their presence doesn’t necessarily dictate sentiment.

The words that we would remove:
1. Stop words and pronouns, such as "to", "the", "and" and "you". They offer little sentiment-specific insight, and removing them could help the model to focus on more meaningful features.

2. Names of the airlines. They could be relevant to see the subject of the sentiment, but they don't carry any sentiment value by themselves and could skew the results, since some airlines (usually low-cost ones) get more negative reviews than others.

Words that must be kept:

 1. Words with clear sentiment, like "delayed", "cancelled", "worst" and "great", are directly indicative of sentiment and are valuable for the model.
 2. Service-related words, like "customer", "service", "bag", "gate" and "flight", are relevant as they can be qualified by positive or negative descriptors, offering context to the sentiment analysis.
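
The removal proposed above can be tried out on the printed feature list before retraining anything. The sketch below filters a ranked list of (count, word) pairs; the pairs are copied from the negative-class output above, while `filter_features` and the two small word lists are assumptions made for this illustration.

```python
def filter_features(ranked_features, stop_words, airline_names):
    """Drop stop words and airline names from a list of (count, word) pairs."""
    blocked = set(stop_words) | set(airline_names)
    return [(count, word) for count, word in ranked_features if word not in blocked]

# A few (count, word) pairs copied from the negative-class output above
ranked = [(1365.0, 'united'), (738.0, 'to'), (525.0, 'the'), (392.0, 'flight'),
          (104.0, 'service'), (92.0, 'cancelled'), (79.0, 'delayed'), (46.0, 'worst')]

# Small hand-picked lists, assumptions for this sketch only
stop_words = {'to', 'the', 'and', 'you', 'my'}
airline_names = {'united', 'virginamerica'}

for count, word in filter_features(ranked, stop_words, airline_names):
    print(count, word)
```

In scikit-learn a similar effect can be obtained at vectorisation time through the `stop_words` parameter of `CountVectorizer`, so that the removed words never enter the model's vocabulary.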

### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook