# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (scikit-learn / Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data)
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis concerns the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in the misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works.
* Read more about the VADER tool in [this blog](https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the [lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? wordform? part-of-speech? polarity value? word meaning?). What do all scores mean?
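
To make the compound score less of a black box, it can help to recompute it by hand. The sketch below assumes the normalization used in the VADER reference code, where the summed word valences are divided by the square root of the squared sum plus a constant of 15. The function name `normalize` and the valence of 3.2 for "love" are assumptions made for illustration; check the lexicon file for the actual entry.

```python
import math


def normalize(valence_sum, alpha=15.0):
    """Map an unbounded sum of word valences into the open range (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# 3.2 is an assumed lexicon valence for "love"; verify it in vader_lexicon.txt.
compound = normalize(3.2)
print(round(compound, 4))
```

Under these assumptions, a single strongly positive word already yields a compound score of roughly 0.64, while the 'pos', 'neu', and 'neg' values are proportions of the text and are computed separately.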


### [3.5 points] Question 1:

Consider the following sentences and their output as given by VADER. Explain the outcome **for each sentence** (1 to 7), taking into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each result, explain why VADER gets it right or what is going wrong.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

1. The output for the first sentence is reasonable. "I love apples" is a positive sentence, and the compound score should reflect that. The compound score of 0.6369 is clearly positive, so VADER classifies the sentence correctly.

2. "I don't love apples" negates the first sentence and is therefore negative. We expect VADER to produce a clearly negative score, and the compound score of -0.5216 reflects this: the negation rule flips the positive valence of "love".

3. "I love apples :-)" is similar to the first sentence. While it is arguable whether the extra smiley really increases the positivity of the sentence, we at least expect the score to be as positive as that of the first sentence, since the same wording is used. The compound score is 0.7579, which is higher than the 0.6369 observed for the first sentence.

4. "These houses are ruins" is a sentence with a negative sentiment. We expect a clearly negative score that distinguishes it from a neutral sentence. With a compound score of -0.4404, VADER indeed classifies the sentence as negative.

5. "These houses are certainly not considered ruins". We would consider this sentence to be neutral. Compared to the first sentence, "I love apples", the compound score is only about 0.0502 lower, although the first sentence is clearly much more positive. In the VADER output, the 'neu' proportion (0.51) is slightly higher than the 'pos' proportion (0.49), but the compound score is computed from the summed word valences rather than from these proportions. Here the negation rule flips the negative valence of "ruins" into a positive one, so the words that count as positive carry a lot of weight and VADER gives the sentence a much more positive score than we would agree with. We believe the score should be much closer to neutral (possibly still positive, just not as positive) than it currently is.

6. "He lies in the chair in the garden" is classified as negative, although it should be neutral. VADER does not distinguish between the two senses of "lies": telling a falsehood (as in "he lies to a person") and reclining or resting (as in "he lies on a bed"). The lexicon assigns the word form "lies" a negative valence that only fits the first sense, so the sentence is classified as negative when it should have been classified as neutral.

7. "This house is like any house" is classified as positive with a compound score of 0.3612, although it should be neutral. As with sentence 5, the 'neu' proportion is much higher than the 'pos' proportion, but the compound score is still clearly positive. This is due to the relatively high valence of the word "like", which in this context means "similar to", while VADER treats it as "enjoy/prefer".

From these examples, we can see that VADER looks up each word form in its lexicon without disambiguating its sense. Apart from a few rules such as negation, it does not consider the context around a word when judging whether the word is positive or negative.
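
The pattern behind these outcomes can be reproduced with a toy scorer. The sketch below is not VADER itself: it assumes a handful of word valences back-derived from the compound scores above, plain whitespace tokenization, a negation check over the three preceding tokens with a dampening factor of -0.74, and a normalization constant of 15. The names `TOY_LEXICON`, `NEGATIONS`, and `toy_compound` are hypothetical.

```python
import math

# Illustrative valences back-derived from the compound scores shown above;
# the real lexicon is much larger and should be checked directly.
TOY_LEXICON = {"love": 3.2, "ruins": -1.9, "certainly": 1.4,
               "lies": -1.8, "like": 1.5, ":-)": 1.3}
NEGATIONS = {"not", "don't", "never"}
NEGATION_SCALAR = -0.74  # assumed dampened sign flip for negated words


def toy_compound(sentence):
    """Sum context-free word valences, flip after a negation, then normalize."""
    tokens = sentence.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token, 0.0)
        # A negation within the three preceding tokens flips the valence.
        if valence and NEGATIONS.intersection(tokens[max(0, i - 3):i]):
            valence *= NEGATION_SCALAR
        total += valence
    return total / math.sqrt(total * total + 15)


for text in ["I love apples",
             "I don't love apples",
             "These houses are certainly not considered ruins",
             "He lies in the chair in the garden"]:
    print(f"{toy_compound(text):+.4f}  {text}")
```

Because every word form is looked up without regard to its sense, "lies" contributes a negative valence even in the garden sentence, and the negated "ruins" adds to the positive valence of "certainly".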

### [2.5 points] Question 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream. If you have trouble accessing Twitter, try to find an existing dataset (on websites like kaggle or huggingface).

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [None]:
import json

In [None]:
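# open the file explicitly as UTF-8 (tweets often contain emoji) and close it afterwards
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)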

In [None]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'negative', 'text_of_tweet': "They don't want you to know what SIDS REALLY is.", 'tweet_url': 'https://x.com/DiedSuddenly_/status/1917356956936986696'}
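
Before running any classifier, it may be worth checking the hand-made annotations, since a misspelled label would silently become a fourth class in the classification report. The sketch below uses a small made-up dictionary in place of the loaded file; `ALLOWED_LABELS`, `example_tweets`, and `check_annotations` are names introduced here for illustration.

```python
from collections import Counter

ALLOWED_LABELS = {"negative", "neutral", "positive"}

# Hypothetical stand-in for the dict returned by json.load(...).
example_tweets = {
    "1": {"sentiment_label": "positive", "text_of_tweet": "Great day!", "tweet_url": ""},
    "2": {"sentiment_label": "negative", "text_of_tweet": "Worst service.", "tweet_url": ""},
    "3": {"sentiment_label": "neutral", "text_of_tweet": "Flight at 5pm.", "tweet_url": ""},
    "4": {"sentiment_label": "postive", "text_of_tweet": "Nice crew.", "tweet_url": ""},
}


def check_annotations(tweets):
    """Return the label counts and the ids whose label is not allowed."""
    counts = Counter(info["sentiment_label"] for info in tweets.values())
    invalid = [id_ for id_, info in tweets.items()
               if info["sentiment_label"] not in ALLOWED_LABELS]
    return counts, invalid


counts, invalid = check_annotations(example_tweets)
print(counts)
print("ids with invalid labels:", invalid)
```

The label counts also show how balanced the gold standard is, which matters when interpreting per-category scores later on.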


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point.
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.
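
To explain the scores in part a, it can help to compute them once by hand. The sketch below derives precision, recall, and F<sub>1</sub> for one label from two lists of labels, which is what the classification report does for every category. The function name `per_class_scores` and the toy labels are made up for illustration.

```python
def per_class_scores(gold, predicted, label):
    """Compute precision, recall and F1 for one label from two label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels, for illustration only.
toy_gold = ["positive", "positive", "neutral", "neutral", "negative", "negative"]
toy_pred = ["positive", "positive", "positive", "neutral", "positive", "negative"]

for label in ("negative", "neutral", "positive"):
    p, r, f = per_class_scores(toy_gold, toy_pred, label)
    print(f"{label:>8}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

In this toy example the classifier over-predicts the positive class: recall for positive is perfect, but its precision drops, while the other two classes lose recall.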

In [None]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
from nltk.sentiment import vader

from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader_model = SentimentIntensityAnalyzer()

import spacy
nlp = spacy.load('en_core_web_sm')

def run_vader(textual_unit,
              lemmatize=False,
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a textual unit after processing it with spaCy

    :param str textual_unit: a textual unit, e.g., sentence, sentences (one string)
    (processed by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output

    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)

    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-':
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add)
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))

    if verbose >= 1:
        print()
        print('INPUT SENTENCE', textual_unit)  # full input, not only the last sentence
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [None]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'

    :param dict vader_output: output dict from vader

    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']

    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'

assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
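
The mapping above treats every non-zero compound score as positive or negative, so a single weakly scored word is enough to move a tweet out of the neutral class. The VADER documentation suggests a neutral band around zero, typically with cut-offs at 0.05 and -0.05. A minimal sketch of such a mapping follows; `compound_to_label` and its `threshold` parameter are names introduced here for illustration, not part of the course code.

```python
def compound_to_label(compound, threshold=0.05):
    """Label a compound score, treating values close to zero as neutral."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


print(compound_to_label(0.4019))
print(compound_to_label(0.01))
print(compound_to_label(-0.0516))
```

Whether a wider neutral band helps depends on the data, so it is worth comparing the classification reports for both mappings rather than assuming an improvement.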

In [None]:
tweets = []
all_vader_output = []
gold = []

data = {'negative': [], 'positive': [], 'neutral': []}

# settings (to change for different experiments)
to_lemmatize = False  # set to True to feed lemmas to VADER
pos = set()           # empty set: all parts of speech are used

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    gold_label = tweet_info['sentiment_label']
    # run vader with the settings above
    vader_output = run_vader(the_tweet,
                             lemmatize=to_lemmatize,
                             parts_of_speech_to_consider=pos)
    vader_label = vader_output_to_label(vader_output) # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(gold_label)

    # collect misclassified tweets per gold label for the error analysis
    if gold_label != vader_label:
        print(f"{the_tweet}:\n Predicted: {vader_label}\n Actual: {gold_label}\n")
        if gold_label == 'negative':
            data['negative'].append([vader_label, the_tweet, vader_output])
        elif gold_label == 'positive':
            data['positive'].append([vader_label, the_tweet, vader_output])
        else:
            data['neutral'].append([vader_label, the_tweet, vader_output])

# use scikit-learn's classification report
from sklearn.metrics import classification_report
print(classification_report(gold, all_vader_output))

Canada has fallen to a slimy banker.  Goodbye free speech and guns.  It’s a sad day for the good Canadians.:
 Predicted: positive
 Actual: negative

Does mint chocolate chip ice cream taste good?:
 Predicted: positive
 Actual: neutral

A Tree Frog demonstrates how it 'adheres' to surfaces like leaves (or glass):
 Predicted: positive
 Actual: neutral

Mmm... Boston Cream Croissants! 🥐 #recipe https://recipesbyclare.com/recipes/boston-cream-pie-croissants:
 Predicted: neutral
 Actual: positive

🚨| Kimi Antonelli reportedly checked 5+7 on his calculator “just to be sure.”:
 Predicted: positive
 Actual: neutral

What’s the most stunning restaurant or café you’ve ever been to?:
 Predicted: positive
 Actual: neutral

Welcome to Country should not be performed at ANZAC Day services. It is disrespectful to our veterans and must stop. We are there to pay our respects to those who served our country and remember their sacrifices.:
 Predicted: positive
 Actual: negative

Harold forgot his medal a

In [None]:
# Show at most 10 misclassified tweets per gold label.
# Slicing with [:10] is safe even when a list holds fewer than 10 items.
print("Negative:")
for tweet_data in data['negative'][:10]:
  print(f"Tweet: {tweet_data[1]}")
  print(f"Predicted: {tweet_data[0]}")
  print(f"Scores: {tweet_data[2]}")

print("\nPositive:")
for tweet_data in data['positive'][:10]:
  print(f"Tweet: {tweet_data[1]}")
  print(f"Predicted: {tweet_data[0]}")
  print(f"Scores: {tweet_data[2]}")

print("\nNeutral:")
for tweet_data in data['neutral'][:10]:
  print(f"Tweet: {tweet_data[1]}")
  print(f"Predicted: {tweet_data[0]}")
  print(f"Scores: {tweet_data[2]}")

Negative:
Tweet: Canada has fallen to a slimy banker.  Goodbye free speech and guns.  It’s a sad day for the good Canadians.
Predicted: positive
Scores: {'neg': 0.209, 'neu': 0.56, 'pos': 0.231, 'compound': 0.1531}
Tweet: Welcome to Country should not be performed at ANZAC Day services. It is disrespectful to our veterans and must stop. We are there to pay our respects to those who served our country and remember their sacrifices.
Predicted: positive
Scores: {'neg': 0.086, 'neu': 0.788, 'pos': 0.126, 'compound': 0.4019}
Tweet: These cops are pigs who have names.
Predicted: neutral
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Positive:
Tweet: Mmm... Boston Cream Croissants! 🥐 #recipe https://recipesbyclare.com/recipes/boston-cream-pie-croissants
Predicted: neutral
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Tweet: This farmer rescued an abandoned family of pigs
Predicted: negative
Scores: {'neg': 0.254, 'neu': 0.508, 'pos': 0.237, 'compound': -0.0516}



### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.
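
The eight experiments differ only in two settings, so they can be described as a small grid. The sketch below is one possible way to organize them, assuming a scoring function with the same keyword arguments as **run_vader** and spaCy's coarse part-of-speech tags ('ADJ', 'NOUN', 'VERB'). The names `POS_FILTERS`, `build_settings`, and `run_setting` are hypothetical.

```python
from itertools import product

# spaCy coarse POS tags; None means "keep all parts of speech".
POS_FILTERS = {"all": None, "adjectives": {"ADJ"},
               "nouns": {"NOUN"}, "verbs": {"VERB"}}


def build_settings():
    """Return the eight (POS filter, lemmatize) combinations of Question 4."""
    return [
        {"name": f"{pos_name}{' + lemmas' if lemmatize else ''}",
         "lemmatize": lemmatize,
         "parts_of_speech_to_consider": pos_set}
        for (pos_name, pos_set), lemmatize
        in product(POS_FILTERS.items(), (False, True))
    ]


def run_setting(texts, setting, score_fn, to_label):
    """Apply score_fn (e.g. run_vader) to every text with one setting."""
    return [
        to_label(score_fn(
            text,
            lemmatize=setting["lemmatize"],
            parts_of_speech_to_consider=setting["parts_of_speech_to_consider"]))
        for text in texts
    ]


for setting in build_settings():
    print(setting["name"])
```

In the notebook, a call such as `run_setting(tweets, setting, run_vader, vader_output_to_label)` would give the predicted labels for one setting, which can then be passed to `classification_report` in a separate cell per experiment.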

In [None]:
# Your code here


## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count')
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10)
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ:
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?
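
To reason about the effect of the frequency threshold, it helps to see what min_df does to the vocabulary. With an integer value, scikit-learn's CountVectorizer ignores terms that occur in fewer than min_df documents. The sketch below mimics that filter in plain Python on a few made-up, pre-tokenized tweets; `vocabulary_after_min_df` and `toy_docs` are illustrative names.

```python
from collections import Counter


def vocabulary_after_min_df(tokenized_docs, min_df):
    """Keep only the terms that occur in at least min_df documents."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))  # count each term once per doc
    return sorted(term for term, df in document_frequency.items()
                  if df >= min_df)


# Made-up, pre-tokenized tweets for illustration only.
toy_docs = [
    ["flight", "delayed", "again"],
    ["flight", "cancelled"],
    ["great", "flight", "thanks"],
    ["delayed", "bag", "thanks"],
]

for min_df in (1, 2, 3):
    print(min_df, vocabulary_after_min_df(toy_docs, min_df))
```

A higher threshold removes rare and possibly noisy terms, but it can also remove infrequent words that carry a clear sentiment, which is one way to explain a drop in scores at higher values.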

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import pathlib
import sklearn
import nltk
from nltk.corpus import stopwords
from sklearn.datasets import load_files

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [None]:
import zipfile
with zipfile.ZipFile('/content/airlinetweets.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/temp')

In [None]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('temp/airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:',
      airline_tweets_folder.exists())

path: /content/temp/airlinetweets
this will print True if the folder exists: True


In [None]:
path = str(airline_tweets_folder)
airline_tweets_train = load_files(path)

In [None]:
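def train_classifier(airline_data, min_df: int, vectorization: str, return_vec = False, random_state = None):
  """
  Train a Naive Bayes classifier on an 80/20 split of the airline tweets.

  vectorization: 'tfidf' for TF-IDF weights, anything else for raw counts.
  random_state: pass an int to get the same split in every experiment.
  Returns (model, vectorizer) if return_vec is True,
  otherwise (model, docs_test, y_test).
  """
  docs_train, docs_test, y_train, y_test = train_test_split(
      airline_data.data, airline_data.target, test_size = 0.20, random_state = random_state)

  # bag-of-words counts; keep the fitted vectorizer so its feature names stay available
  vectorizer = CountVectorizer(min_df=min_df, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
  docs_train = vectorizer.fit_transform(docs_train)
  docs_test = vectorizer.transform(docs_test)

  if vectorization == 'tfidf':
    # reweight the counts with TF-IDF, without overwriting the vectorizer
    tfidf_transformer = TfidfTransformer()
    docs_train = tfidf_transformer.fit_transform(docs_train)
    docs_test = tfidf_transformer.transform(docs_test)

  model = MultinomialNB().fit(docs_train, y_train)
  if return_vec:
    print('Returning Vectorizer')
    return model, vectorizer
  return model, docs_test, y_test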

In [None]:
def run_experiments(results):
  for min_df in results:
    for vectorization in results[min_df]:
      print(f"--- Results for min_df = {min_df}; Vectorization = {vectorization} ---")
      model, docs_test, y_test = results[min_df][vectorization]
      model_pred = model.predict(docs_test)
      print(classification_report(y_test, model_pred))
    print('\n')

test_parameters = {
    'vectorization': ['bag', 'tfidf'],
    'min_df': [2, 5, 10],
}

results = {}

for min_df in test_parameters['min_df']:
  results.update({min_df: {}})
  for vectorization in test_parameters['vectorization']:
    results[min_df].update({vectorization: train_classifier(airline_tweets_train, min_df, vectorization)})

run_experiments(results)



--- Results for min_df = 2; Vectorization = bag ---
              precision    recall  f1-score   support

           0       0.85      0.92      0.88       356
           1       0.88      0.71      0.79       295
           2       0.82      0.88      0.85       300

    accuracy                           0.84       951
   macro avg       0.85      0.84      0.84       951
weighted avg       0.85      0.84      0.84       951

--- Results for min_df = 2; Vectorization = tfidf ---
              precision    recall  f1-score   support

           0       0.81      0.93      0.87       350
           1       0.83      0.67      0.74       299
           2       0.83      0.85      0.84       302

    accuracy                           0.82       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.82      0.82       951



--- Results for min_df = 5; Vectorization = bag ---
              precision    recall  f1-score   support

           0       0.80

- By analyzing the results, we find that the best combination was the bag-of-words representation with a min_df value of 5. Across the board, the bag-of-words representation outperformed the TF-IDF representation.

- The results above show that the frequency threshold does affect the performance of the model. The difference between min_df = 2 and min_df = 5 was negligible; only when the threshold is raised further, to 10, does the performance decrease noticeably. Because each experiment is trained and tested on its own random 80/20 split (the support per class differs between runs), small differences between settings should be interpreted with caution.

### [4 points] Question 6: Inspecting the best scoring features

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues:
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why?

In [None]:
def important_features_per_class(vectorizer,classifier,n=80):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat)
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat)

test_parameters = {
    'vectorization': ['bag'],
    'min_df': [2],
}

for min_df in test_parameters['min_df']:
  for vectorization in test_parameters['vectorization']:
    results[min_df].update({vectorization: train_classifier(airline_tweets_train, min_df, vectorization, return_vec = True)})

model, vectorizer = results[2]['bag']
important_features_per_class(vectorizer=vectorizer, classifier=model)



Returning Vectorizer
Important words in negative documents
0 1499.0 @
0 1377.0 united
0 1197.0 .
0 423.0 ``
0 386.0 flight
0 373.0 ?
0 353.0 !
0 303.0 #
0 218.0 n't
0 142.0 ''
0 120.0 's
0 110.0 service
0 105.0 virginamerica
0 100.0 :
0 95.0 get
0 91.0 delayed
0 89.0 cancelled
0 87.0 customer
0 84.0 bag
0 82.0 plane
0 79.0 time
0 78.0 ...
0 74.0 'm
0 73.0 -
0 71.0 hours
0 70.0 ;
0 67.0 http
0 64.0 airline
0 64.0 &
0 62.0 hour
0 62.0 gate
0 60.0 late
0 59.0 still
0 59.0 help
0 56.0 would
0 55.0 ca
0 54.0 flights
0 54.0 2
0 53.0 amp
0 52.0 worst
0 52.0 one
0 49.0 delay
0 48.0 've
0 46.0 back
0 45.0 waiting
0 45.0 $
0 43.0 never
0 43.0 like
0 42.0 us
0 42.0 flightled
0 42.0 (
0 41.0 lost
0 40.0 ever
0 39.0 day
0 39.0 3
0 39.0 )
0 38.0 check
0 38.0 bags
0 36.0 seat
0 36.0 really
0 36.0 fly
0 35.0 wait
0 35.0 thanks
0 35.0 people
0 35.0 luggage
0 33.0 u
0 33.0 even
0 33.0 due
0 32.0 last
0 32.0 hold
0 32.0 crew
0 32.0 4
0 31.0 ticket
0 30.0 could
0 30.0 airport
0 29.0 seats
0 29.0 guys
0 29

- Negative: we expected words like flight, time, delay, service, cancelled, late, worst, never, and luggage to be among the top-scoring features. When people complain about an airline, they usually complain about specific problems such as lost baggage or a late plane, so it is no surprise that these words rank highly.
- Positive: we expected words like flight, great, thanks, love, best, awesome, and thank to appear. These words express general happiness about the flight; unless they are writing a detailed review, most people describe a positive flight experience in general terms.
- Neutral: we expected help, please, need, and know to be used a lot, since they occur in a wide variety of contexts and questions, which usually tend to stay neutral.

- Negative: we did not expect to see words like "like" and "thanks", since they are usually associated with positive rather than negative sentiment.

- Positive: apart from some special symbols, most of the words here fit our idea of which words express positive sentiment.

- Neutral: a word like "cancelled" fits the negative class better, since cancelled flights are one of the major problems that people complain about when flying.

When trying to improve the model, we would first remove the names of the airlines. An airline name does not express sentiment by itself, and if the goal were to find out which airlines are best, we would need to consider all tweets about an airline in context and weigh its problems against its strengths. For that goal, a dataset from review sites would probably be more suitable than tweets, since people rarely post about positive flight experiences unless something unusual happens, such as being upgraded to a better seat.

We would also remove punctuation, since we are training a Naive Bayes model, which cannot use context to judge sentiment. If we were training a transformer model instead, we would keep the punctuation, since it helps determine the meaning of the surrounding words, which a transformer takes into account when producing its results.
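
The filtering described above could be sketched as a small token filter. The example assumes tokens similar to those in the feature list (for instance "@", "``", and "n't"); the set `AIRLINE_NAMES` is a hypothetical, incomplete list and `keep_token` is a name introduced here for illustration.

```python
import string

# Hypothetical list of airline names; extend it after inspecting the data.
AIRLINE_NAMES = {"united", "virginamerica"}


def keep_token(token):
    """Return True for tokens that are neither airline names nor punctuation."""
    token = token.lower()
    if token in AIRLINE_NAMES:
        return False
    # Drop tokens made up of punctuation characters only, e.g. "@" or "...".
    return not all(char in string.punctuation for char in token)


tokens = ["@", "united", "flight", "delayed", "?", "``", "n't", "worst"]
print([t for t in tokens if keep_token(t)])
```

A filter like this could be wrapped around the tokenizer that is passed to the vectorizer. Note that it keeps "n't", so negation remains available as a feature.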

### [Optional! (will not be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook