# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (scikit-learn / Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?). What do all scores mean?


### [3.5 points] Question 1:

Consider the following sentences and their output as given by VADER. For sentences 1 to 7, explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each result, explain why VADER gets it right or what goes wrong.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

1. VADER correctly identifies the sentence as highly positive: the positive word 'love' is used, so the sentence gets a high positive score and a positive compound score.

2. VADER detects the negation of 'love' and therefore correctly gives the sentence a negative compound score, because of "don't love".

3. VADER recognizes the smiley in the sentence and also sees the word 'love'. The sentence thus correctly gets a positive compound score, which is even higher than that of the first sentence (0.7579 vs. 0.6369) because of the smiley.

4. VADER sees the word 'ruins' and correctly identifies it as a negative word, with a high negative score. The negative compound score also reflects the negative sentiment of the sentence accurately.

5. VADER accurately assesses the sentiment of the sentence: the negative word 'ruins' is negated by 'certainly not', leading to a positive compound score. The score is also not too high, because the sentence only negates a negative assessment of the houses.

6. VADER is slightly off with this sentence, probably because of the word 'lies'. This word has multiple meanings, and the negative one (telling untruths) is not the one used here. VADER correctly sees a lot of neutrality in the sentence, but it also sees some negativity, which throws the compound score off. The compound score suggests a fairly negative sentence, whereas the sentence is neutral or even lightly positive.

7. VADER is also slightly off on the last sentence. The sentence can be read as a neutral or a negative assessment of a house, because the house is not distinct from other houses. VADER correctly detects some neutrality, since 'any house' is a mostly neutral description. However, VADER also saw the word 'like', which probably increased the positivity score even though it is used here as a comparison, leading to a somewhat positive compound score. The positive and negative scores should be interchanged, and the compound score should be closer to 0 or slightly negative to reflect the slightly negative tone of this sentence.
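
To make the link between lexicon valences and the compound score concrete, here is a minimal sketch of the normalization step that VADER applies to the summed valences. It is a simplification, not VADER's full pipeline: it ignores punctuation, capitalization and booster rules. The valences for 'love' and ':-)' are assumptions that should be verified against `vader_lexicon.txt`; under those assumptions the results agree with the compound scores listed for sentences 1 to 3.

```python
import math

ALPHA = 15               # normalization constant used in VADER's compound score
NEGATION_SCALAR = -0.74  # factor VADER applies to the valence of a negated word

def compound(valences, alpha=ALPHA):
    """Squash the sum of word valences into the open interval (-1, 1)."""
    total = sum(valences)
    return total / math.sqrt(total * total + alpha)

# Assumed lexicon valences (check them in vader_lexicon.txt)
LOVE = 3.2
SMILEY = 1.3  # the emoticon ":-)"

print("sentence 1:", round(compound([LOVE]), 4))
print("sentence 2:", round(compound([LOVE * NEGATION_SCALAR]), 4))
print("sentence 3:", round(compound([LOVE, SMILEY]), 4))
```

Because the sum is squashed rather than averaged, a second sentiment-bearing token such as the smiley raises the compound score without ever pushing it past 1.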

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [2]:
import json

In [3]:
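# read the file explicitly as UTF-8 so apostrophes and emoji are decoded correctly
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)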

In [4]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'negative', 'text_of_tweet': 'I’m actually like so angry about Iceland I can’t stop thinking about it. ', 'tweet_url': 'https://twitter.com/MaxxyRainbow/status/1764168102571360681'}
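
Since the file is edited by hand, a quick sanity check on the annotations can catch label typos before evaluation. The following is a minimal sketch; the function name `find_annotation_problems` and the in-memory `example` dictionary are hypothetical and only stand in for the loaded `my_tweets`.

```python
ALLOWED_LABELS = {"negative", "neutral", "positive"}
REQUIRED_KEYS = {"sentiment_label", "text_of_tweet", "tweet_url"}

def find_annotation_problems(tweets):
    """Return a list of (tweet_id, problem) pairs for malformed entries."""
    problems = []
    for tweet_id, info in tweets.items():
        missing = REQUIRED_KEYS - set(info)
        if missing:
            problems.append((tweet_id, f"missing keys: {sorted(missing)}"))
            continue
        if info["sentiment_label"] not in ALLOWED_LABELS:
            problems.append((tweet_id, f"unknown label: {info['sentiment_label']!r}"))
        if not info["text_of_tweet"].strip():
            problems.append((tweet_id, "empty tweet text"))
    return problems

# Hypothetical entries with the same structure as my_tweets.json
example = {
    "1": {"sentiment_label": "positive", "text_of_tweet": "Great day!",
          "tweet_url": "https://example.org/1"},
    "2": {"sentiment_label": "postive", "text_of_tweet": "Typo in the label",
          "tweet_url": "https://example.org/2"},
    "3": {"sentiment_label": "neutral", "text_of_tweet": ""},
}

for tweet_id, problem in find_annotation_problems(example):
    print(tweet_id, problem)
```

A misspelled label would otherwise show up as an extra category in the classification report.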


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

a. The quantitative evaluation shows that, on all three metrics (precision, recall and F1-score), the neutral class is misclassified more often than the other two classes. A likely explanation is that a neutral text can still contain words that carry a negative or positive connotation when used in subjective text. Take a factual statement about hurricanes: the text "a hurricane is a type of natural disaster" is undoubtedly neutral, but the word "disaster" has a negative connotation, so the sentence may be classified as negative. Precision is the share of tweets given a label that really have that label, recall is the share of tweets with a given gold label that the classifier finds, and the F1-score is the harmonic mean of the two. Assuming the goal is to assess how negative or positive the platform is, we consider the F1-score the most important metric, because it balances precision and recall and therefore best reflects how close the analysis is to reality. If, for example, we were building automatic moderation that bans tweets that are too negative, precision would be more important, since you don't want to ban innocent people.
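
As a hedged illustration of how the per-class scores are derived, the sketch below computes precision, recall and F1 for one label directly from two label lists. The function name `per_class_scores` and the five example labels are made up for illustration; in the notebook, `classification_report` does this for every class at once.

```python
def per_class_scores(gold_labels, predicted_labels, label):
    """Compute (precision, recall, F1) for a single label."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(g == label and p == label for g, p in pairs)  # correctly assigned
    fp = sum(g != label and p == label for g, p in pairs)  # wrongly assigned
    fn = sum(g == label and p != label for g, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels, not taken from my_tweets.json
gold_example = ["neutral", "neutral", "negative", "positive", "positive"]
pred_example = ["negative", "neutral", "negative", "positive", "neutral"]

for label in ("negative", "neutral", "positive"):
    p, r, f = per_class_scores(gold_example, pred_example, label)
    print(f"{label:<9} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy example the neutral tweet that was labeled negative lowers both the recall of neutral and the precision of negative, which is the pattern described above for words like "disaster".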

b. 
neutral: Words like disaster, harsh, prison and strange are interpreted as negative, although in this context they are not used that way. The same happens in the positive direction: words like safety and security can be seen as positive, while in these specific sentences they are used purely factually. This is the problem described earlier in 3a.

negative: Some sentences are ultimately negative but also contain many neutral or lightly positive words (for instance in a contrast, when someone hopes for better, or simply through the wording), so the final label differs from the actual one. VADER cannot pick up subtle nuances and contextual clues (such as sarcasm, or anything that requires knowledge about the world) very well, which makes sentences like sentence #6 hard to interpret. Another possibility is that VADER misinterprets the sentence and uses the wrong sense of a word with multiple meanings (such as 'like') for its sentiment analysis. This leads it to classify sentences incorrectly. 
Then there are also sentences that do not contain any words with a positive or negative connotation in the lexicon, which leads to a wrong label. This is simply a lack of coverage in the VADER lexicon.
For the sentence about Burger King we do not really know why it is classified as positive, since we can only find neutral words or words annotated as negative. Still, the sentence is classified as lightly positive.

positive: In some sentences, very positive words are not part of the VADER lexicon ('incredible' being one). Sometimes lemmatization changes positive words into negative ones (cutest -> cut). In other sentences there is not really one clear label (the Madonna tweet was meant ironically; it describes a negative thing with a positive intention). And finally, just as for the negative tweets, a sentence sometimes contains words that on their own point in the other direction: 'mad' is very negative, so the whole sentence is interpreted as negative.
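
To look for patterns in the misclassifications, it can help to count which gold label is confused with which predicted label. This is a minimal sketch using the standard library; the function name `most_common_errors` is hypothetical, and the example pairs are the six gold and VADER labels visible in the printed output of the evaluation cell.

```python
from collections import Counter

def most_common_errors(gold_labels, predicted_labels):
    """Return misclassified (gold, predicted) pairs, most frequent first."""
    errors = Counter(
        (g, p) for g, p in zip(gold_labels, predicted_labels) if g != p
    )
    return errors.most_common()

# Gold and VADER labels of the six misclassified tweets in the printed output
gold_errors = ["negative", "neutral", "positive", "negative", "negative", "negative"]
vader_errors = ["positive", "negative", "neutral", "positive", "neutral", "positive"]

for (gold_label, vader_label), count in most_common_errors(gold_errors, vader_errors):
    print(f"gold={gold_label:<9} vader={vader_label:<9} count={count}")
```

The most frequent pair is the natural place to start the qualitative analysis.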

In [5]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import spacy

vader_model = SentimentIntensityAnalyzer()
nlp = spacy.load("en_core_web_sm")  # small English spaCy pipeline

In [6]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
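
The mapping above treats only a compound score of exactly 0.0 as neutral. The VADER README suggests cut-offs of 0.05 and -0.05 instead, so that weakly scored texts also count as neutral. The sketch below shows that alternative; the function name `compound_to_label` and the `threshold` parameter are hypothetical, and switching to it would change the evaluation results reported in this notebook.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score to a label using symmetric cut-offs."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

for score in (0.6369, 0.01, 0.0, -0.01, -0.4215):
    print(score, compound_to_label(score))
```

With the strict mapping, 0.01 and -0.01 are labeled positive and negative; with the cut-offs they fall in the neutral band.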

In [7]:
def run_vader(textual_unit,
              lemmatize=False,
              parts_of_speech_to_consider=None):
    """
    Run VADER on a sentence from spacy
    
    :param str textual_unit: a textual unit, e.g., sentence, sentences (one string),
    which is processed sentence by sentence (by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
    input_to_vader = []

    for sentence in doc.sents:
        for token in sentence:
            to_add = token.text

            if lemmatize:
                to_add = token.lemma_ if token.lemma_ != '-PRON-' else token.text

            if not parts_of_speech_to_consider:
                input_to_vader.append(to_add)
                continue
            if token.pos_ in parts_of_speech_to_consider:
                input_to_vader.append(to_add)

    return vader_model.polarity_scores(' '.join(input_to_vader))

In [8]:
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet, to_lemmatize, pos)
    vader_label = vader_output_to_label(vader_output)
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
    if vader_label != tweet_info['sentiment_label']:
        print(f"LABEL: {tweet_info['sentiment_label']} V-LABEL: {vader_label} TWEET: {the_tweet}")
    
# use scikit-learn's classification report
from sklearn.metrics import classification_report
print(classification_report(y_true=gold, y_pred=all_vader_output))

LABEL: negative V-LABEL: positive TWEET: ffs i did not want to wake up to this israel news maybe ive been too naive but i really did trust the EBU would actually do something
LABEL: neutral V-LABEL: negative TWEET: Designed to withstand the harsh environment of Mars, these squishy robots could transform how first responders determine their approach to a disaster scene here on Earth. Learn how  @NASASpinoff tech could help disaster response: https://go.nasa.gov/3IpybN2
LABEL: positive V-LABEL: neutral TWEET: Incredible pan across Mars’ surface from NASA’s Perseverance rover 😲
LABEL: negative V-LABEL: positive TWEET: This is what Mali, Africa looks like today. This is what the future of Western countries looks like when you keep importing the third world. You become the third world.
LABEL: negative V-LABEL: neutral TWEET: UK just threw a guy into jail for 2 years for stickers Stickers
LABEL: negative V-LABEL: positive TWEET: Something how Dr. Fauci is revered by the LameStream Med

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

b. 
* Lemmatisation does not have a significant effect on the results: accuracy goes from 0.63 to 0.62 on the full text, and in the part-of-speech experiments no score changes by more than 0.03. This is probably because most lemmatized words keep the sentiment of their original form, so the judgement stays roughly the same (loving -> love doesn't suddenly become negative).
* Not all parts of speech are equally important. Ranked by macro-averaged F1, the order from most to least important is adjectives (0.47), verbs (0.46), nouns (0.38), and all three score well below the 0.62 obtained with the full text. Adjectives are used to describe something, and the choice of adjective determines which sentiment is conveyed (bad apple vs. good apple). The noun is simply what the sentence is about, and whether your opinion is positive or negative does not change which noun you use. Verbs allow a little variation depending on sentiment, but often the verb just describes an action that would be described anyway. Some verbs do carry sentiment themselves (loving, hating, killing, etc.). Restricting the input to one part of speech also pushes many tweets to the neutral label (neutral recall between 0.78 and 0.89 at a precision of 0.36 to 0.40), presumably because a tweet with no lexicon words left gets a compound score of 0.0.

In [9]:
import pathlib
from sklearn.datasets import load_files

cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
    airline_tweets_folder.exists())

# loading all files as training data.
airline_tweets_train = load_files(str(airline_tweets_folder))

path: d:\school\TUDelft\Python\Artificial Intelligence\Text Mining\ba-text-mining-group19\lab_sessions\lab3\airlinetweets
this will print True if the folder exists: True


In [24]:
print(len(airline_tweets_train.data))

4755


In [50]:
# run VADER on every airline tweet and print the classification report

def run_vader_on_airline_tweets(to_lemmatize: bool, pos: set):
    gold = []
    all_vader_output = []   
    for tweet in airline_tweets_train.data:
        tweet = tweet.decode('utf-8')
        vader_output = run_vader(tweet, to_lemmatize, pos)
        vader_label = vader_output_to_label(vader_output)

        all_vader_output.append(vader_label)

    for target in airline_tweets_train.target:
        gold.append(airline_tweets_train.target_names[target])

    #print the classification report
    print(classification_report(y_true=gold, y_pred=all_vader_output))



In [51]:
#run vader on airline tweets
print('As is')
run_vader_on_airline_tweets(to_lemmatize=False, pos=None)
print("______________________________________________________________________________________")


As is
              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.64      0.62      4755
weighted avg       0.66      0.63      0.62      4755

______________________________________________________________________________________


In [52]:
# 2. lemmatized
print('lemmatized')
run_vader_on_airline_tweets(to_lemmatize=True, pos=None)
print("______________________________________________________________________________________")


lemmatized
              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755

______________________________________________________________________________________


In [55]:
# 3. adjectives
print("Only adjectives")
run_vader_on_airline_tweets(to_lemmatize=False, pos={'ADJ'})
print("______________________________________________________________________________________")


Only adjectives
              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.66      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755

______________________________________________________________________________________


In [56]:
# 4. adjectives & lemmatized
print("Only adjectives & lemmatized")
run_vader_on_airline_tweets(to_lemmatize=True, pos={'ADJ'})
print("______________________________________________________________________________________")


Only adjectives & lemmatized
              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.66      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755

______________________________________________________________________________________


In [57]:
# 5. nouns
print("Only nouns")
run_vader_on_airline_tweets(to_lemmatize=False, pos={'NOUN'})
print("______________________________________________________________________________________")


Only nouns
              precision    recall  f1-score   support

    negative       0.73      0.14      0.24      1750
     neutral       0.36      0.82      0.50      1515
    positive       0.53      0.34      0.41      1490

    accuracy                           0.42      4755
   macro avg       0.54      0.43      0.38      4755
weighted avg       0.55      0.42      0.38      4755

______________________________________________________________________________________


In [58]:
# 6. nouns & lemmatized
print("Only nouns & lemmatized")
run_vader_on_airline_tweets(to_lemmatize=True, pos={'NOUN'})
print("______________________________________________________________________________________")


Only nouns & lemmatized
              precision    recall  f1-score   support

    negative       0.72      0.16      0.26      1750
     neutral       0.36      0.81      0.50      1515
    positive       0.52      0.33      0.40      1490

    accuracy                           0.42      4755
   macro avg       0.53      0.43      0.39      4755
weighted avg       0.54      0.42      0.38      4755

______________________________________________________________________________________


In [59]:
# 7. verbs
print("Only verbs")
run_vader_on_airline_tweets(to_lemmatize=False, pos={'VERB'})
print("______________________________________________________________________________________")


Only verbs
              precision    recall  f1-score   support

    negative       0.77      0.29      0.42      1750
     neutral       0.38      0.81      0.52      1515
    positive       0.57      0.34      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.58      0.48      0.46      4755
weighted avg       0.59      0.47      0.45      4755

______________________________________________________________________________________


In [60]:
# 8. verbs & lemmatized
print("Only verbs & lemmatized")
run_vader_on_airline_tweets(to_lemmatize=True, pos={'VERB'})
print("______________________________________________________________________________________")

Only verbs & lemmatized
              precision    recall  f1-score   support

    negative       0.74      0.30      0.42      1750
     neutral       0.38      0.78      0.51      1515
    positive       0.57      0.35      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.56      0.48      0.46      4755
weighted avg       0.57      0.47      0.45      4755

______________________________________________________________________________________
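
To compare the eight experiments at a glance, the macro-averaged F1 scores can be collected and ranked against the unmodified run. The figures below are copied from the `macro avg` rows of the reports above; the names `macro_f1` and `ranking` are only illustrative.

```python
# Macro-averaged F1 per experiment, copied from the classification reports
macro_f1 = {
    "as is": 0.62,
    "lemmatized": 0.62,
    "adjectives": 0.47,
    "adjectives + lemmatized": 0.47,
    "nouns": 0.38,
    "nouns + lemmatized": 0.39,
    "verbs": 0.46,
    "verbs + lemmatized": 0.46,
}

baseline = macro_f1["as is"]
ranking = sorted(macro_f1.items(), key=lambda item: item[1], reverse=True)

for setting, score in ranking:
    # show each score and its difference from the unmodified run
    print(f"{setting:<26} {score:.2f} {score - baseline:+.2f}")
```

The ranking makes it easy to see that every part-of-speech restriction costs far more than lemmatisation changes.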


## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df = 10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

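b.
* Just like lemmatisation in Question 4, the choice between TF-IDF and bag-of-words representation does not make a difference large enough to call either one superior: accuracy is 0.83 vs. 0.84 at min_df=2 and 0.84 vs. 0.83 at min_df=5. This holds for every min_df setting we tried, although the small differences that do occur vary between the settings. Across all settings the negative class has the highest recall (0.90 to 0.92) and the neutral class the lowest (0.70 to 0.74).
* Between a min_df of 2 and 5 there is very little noticeable difference. Between 5 and 10, however, the difference is larger (accuracy drops from 0.84 to 0.81 with TF-IDF). This is likely because the threshold of 10 is much stricter than that of 5 and therefore excludes many more words from consideration. At min_df=10, precision is lowest for the negative class (0.79) and recall is lowest for the neutral class (0.70); compared with min_df=5, the largest drops are in neutral precision (0.87 to 0.81) and positive recall (0.89 to 0.82). Note that every run draws a new random train/test split (the support per class differs between the reports), so differences of a few hundredths may partly be noise.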
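
The effect of the frequency threshold can be illustrated without scikit-learn. The sketch below mimics what `min_df` does when given an integer: a token is kept only if it occurs in at least that many different documents. The function name `vocabulary_with_min_df` and the toy documents are hypothetical.

```python
from collections import Counter

def vocabulary_with_min_df(tokenized_docs, min_df):
    """Keep tokens that occur in at least `min_df` different documents."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        # a token counts once per document, however often it is repeated
        document_frequency.update(set(tokens))
    return {token for token, df in document_frequency.items() if df >= min_df}

# Toy corpus, not taken from the airline tweets
toy_docs = [
    ["flight", "delayed", "again"],
    ["flight", "cancelled"],
    ["great", "flight", "great", "crew"],
    ["delayed", "luggage"],
]

for threshold in (1, 2, 3):
    print(threshold, sorted(vocabulary_with_min_df(toy_docs, threshold)))
```

Even in this tiny corpus the vocabulary shrinks quickly as the threshold rises, and a word like "great", which is informative for sentiment but appears in only one document, is among the first to be dropped.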

In [72]:
# Your code here
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

In [73]:
airline_vec = CountVectorizer(min_df=2, # tokens that occur in fewer documents than this are ignored
                            tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                            stop_words=stopwords.words('english')) # stopwords are removed

airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for development
    ) 

clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

#generate a classification report
print(classification_report(y_test, y_pred, target_names=airline_tweets_train.target_names))



              precision    recall  f1-score   support

    negative       0.82      0.92      0.86       337
     neutral       0.85      0.70      0.77       323
    positive       0.83      0.87      0.85       291

    accuracy                           0.83       951
   macro avg       0.83      0.83      0.83       951
weighted avg       0.83      0.83      0.83       951



In [74]:
airline_vec = CountVectorizer(min_df=2, # tokens that occur in fewer documents than this are ignored
                            tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                            stop_words=stopwords.words('english')) # stopwords are removed

airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the bag-of-words counts
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for development
    ) 

clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

#generate a classification report
print(classification_report(y_test, y_pred, target_names=airline_tweets_train.target_names))



              precision    recall  f1-score   support

    negative       0.85      0.90      0.87       353
     neutral       0.84      0.74      0.79       308
    positive       0.83      0.88      0.85       290

    accuracy                           0.84       951
   macro avg       0.84      0.84      0.84       951
weighted avg       0.84      0.84      0.84       951



In [75]:
airline_vec = CountVectorizer(min_df=5, # tokens that occur in fewer documents than this are ignored
                            tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                            stop_words=stopwords.words('english')) # stopwords are removed

airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for development
    ) 

clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

#generate a classification report
print(classification_report(y_test, y_pred, target_names=airline_tweets_train.target_names))



              precision    recall  f1-score   support

    negative       0.82      0.90      0.86       346
     neutral       0.87      0.73      0.80       313
    positive       0.85      0.89      0.87       292

    accuracy                           0.84       951
   macro avg       0.85      0.84      0.84       951
weighted avg       0.84      0.84      0.84       951



In [76]:
airline_vec = CountVectorizer(min_df=5, # tokens that occur in fewer documents than this are ignored
                            tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                            stop_words=stopwords.words('english')) # stopwords are removed

airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the bag-of-words counts
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for development
    ) 

clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

#generate a classification report
print(classification_report(y_test, y_pred, target_names=airline_tweets_train.target_names))



              precision    recall  f1-score   support

    negative       0.84      0.90      0.87       365
     neutral       0.79      0.74      0.76       288
    positive       0.83      0.81      0.82       298

    accuracy                           0.83       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.83      0.82       951



In [77]:
airline_vec = CountVectorizer(min_df=10, # ignore tokens that occur in fewer than this many documents
                            tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                            stop_words=stopwords.words('english')) # stopwords are removed

airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for development
    ) 

clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

#generate a classification report
print(classification_report(y_test, y_pred, target_names=airline_tweets_train.target_names))



              precision    recall  f1-score   support

    negative       0.79      0.90      0.84       326
     neutral       0.81      0.70      0.75       313
    positive       0.83      0.82      0.83       312

    accuracy                           0.81       951
   macro avg       0.81      0.81      0.81       951
weighted avg       0.81      0.81      0.81       951



In [78]:
airline_vec = CountVectorizer(min_df=10, # ignore tokens that occur in fewer than this many documents
                            tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                            stop_words=stopwords.words('english')) # stopwords are removed

airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the bag-of-words counts (not the tf-idf model)
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for development
    ) 

clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

#generate a classification report
print(classification_report(y_test, y_pred, target_names=airline_tweets_train.target_names))



              precision    recall  f1-score   support

    negative       0.83      0.93      0.88       372
     neutral       0.82      0.71      0.76       295
    positive       0.84      0.82      0.83       284

    accuracy                           0.83       951
   macro avg       0.83      0.82      0.82       951
weighted avg       0.83      0.83      0.83       951
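
The cells above vary `min_df` between 5 and 10. As a rough illustration of what that threshold does, the sketch below re-implements the filtering idea in plain Python: a token is kept only if it occurs in at least `min_df` documents, regardless of how often it occurs inside a single document. The function names and the toy corpus are hypothetical and are not part of the notebook or of scikit-learn.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count, for every token, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set(): a token counts once per document
    return df

def vocabulary_with_min_df(tokenized_docs, min_df):
    """Return the sorted tokens that occur in at least `min_df` documents."""
    df = document_frequency(tokenized_docs)
    return sorted(token for token, count in df.items() if count >= min_df)

# Hypothetical toy corpus, already tokenized
toy_docs = [
    ["flight", "delayed", "delayed", "delayed"],
    ["flight", "great", "crew"],
    ["flight", "great", "service"],
]

# 'delayed' occurs three times, but only in one document, so it is dropped
print(vocabulary_with_min_df(toy_docs, min_df=2))
```

A higher threshold therefore shrinks the vocabulary, which may remove both noise and rare but informative words.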



### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn Naive Bayes classifier with the following settings: airline tweets split into 80% training and 20% test; bag-of-words representation ('airline_counts'); min_df=2.
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below).
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words, such as names of airlines, punctuation, numbers, and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model, and why?

In [79]:
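def important_features_per_class(vectorizer,classifier,n=80):
    # Print the n features with the highest training counts for each class
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, call get_feature_names() instead
    feature_names = vectorizer.get_feature_names_out()
    # feature_count_[i] holds the per-feature counts observed for class i during fitting
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]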
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:
#important_features_per_class(airline_vec, clf)
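
For question 6b it can help to have the rankings as data rather than as printed lines, so that they can be filtered or compared. The sketch below is a hypothetical variant (`top_features_per_class` is not part of the notebook) that returns a dictionary and works for any number of classes. It only assumes an object with `classes_` and `feature_count_` attributes, which a fitted `MultinomialNB` provides; a small stand-in object with made-up counts is used here so the example runs without training a model.

```python
from types import SimpleNamespace

def top_features_per_class(feature_names, classifier, n=10):
    """Map each class label to its n highest-count (count, feature) pairs."""
    ranked = {}
    for label, counts in zip(classifier.classes_, classifier.feature_count_):
        ranked[label] = sorted(zip(counts, feature_names), reverse=True)[:n]
    return ranked

# Hypothetical stand-in for a fitted classifier; the counts are invented
toy_clf = SimpleNamespace(
    classes_=["negative", "neutral", "positive"],
    feature_count_=[[9.0, 1.0, 4.0], [2.0, 2.0, 5.0], [0.0, 8.0, 3.0]],
)
toy_features = ["delay", "thanks", "flight"]

for label, pairs in top_features_per_class(toy_features, toy_clf, n=2).items():
    print(label, pairs)
```

With the notebook's own objects, the feature names would come from the fitted vectorizer and the classifier would be `clf`.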

### [Optional! (will not be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets.
+ Train the model with the following settings (airline tweets 80% training and 20% test; bag-of-words representation ('airline_counts'), min_df=2)
+ Apply the model to your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.
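
For the error analysis, the correctly and incorrectly classified tweets first have to be separated. A minimal sketch of one way to do this is given below; `split_by_correctness` and the toy tweets are hypothetical, and the gold and predicted labels are assumed to be available as parallel lists.

```python
from collections import Counter

def split_by_correctness(texts, gold_labels, predicted_labels):
    """Return (correct, incorrect) lists of (text, gold, predicted) triples."""
    correct, incorrect = [], []
    for text, gold, pred in zip(texts, gold_labels, predicted_labels):
        (correct if gold == pred else incorrect).append((text, gold, pred))
    return correct, incorrect

# Hypothetical toy data standing in for your own tweets and the model output
toy_texts = ["love this airline", "lost my bag again", "flight leaves at 9"]
toy_gold = ["positive", "negative", "neutral"]
toy_pred = ["positive", "neutral", "neutral"]

correct, incorrect = split_by_correctness(toy_texts, toy_gold, toy_pred)

# Count which (gold, predicted) confusions occur among the errors
confusions = Counter((gold, pred) for _, gold, pred in incorrect)

print(len(correct), "correct,", len(incorrect), "incorrect")
print(confusions.most_common())
```

From these two lists, 10 examples each can then be sampled and inspected by hand.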

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects of these other settings might be.
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.
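
As one possible starting point for the punctuation idea in question 8a, the sketch below defines a tokenizer that drops tokens consisting only of punctuation. It is an assumption-laden illustration: `tokenize_without_punctuation` is a hypothetical name, and a simple regular expression stands in for `nltk.word_tokenize` so that the example has no external dependencies.

```python
import re
import string

def tokenize_without_punctuation(text):
    """Split text into tokens and drop tokens made up only of punctuation."""
    # Stand-in tokenizer: runs of word characters, or runs of other symbols
    tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]

# Hypothetical example tweet
print(tokenize_without_punctuation("@united Thanks!!! Flight 123 was great :)"))
```

A callable like this could be passed to `CountVectorizer` through its `tokenizer` parameter in place of `nltk.word_tokenize`. Note that this filter also removes emoticons such as `:)`, which may themselves carry sentiment, so the effect on the scores is not obvious in advance.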

## End of this notebook