# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the Lab 3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis concerns the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in the misclassifications? Discuss why these errors are made and propose ways to solve them.
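To make these scores concrete, here is a minimal sketch of how precision, recall and F<sub>1</sub> per category, and their averages, can be computed by hand. The `gold` and `predicted` lists are toy data invented for this illustration, and `per_class_scores` is a hypothetical helper; in the assignment itself scikit-learn's classification_report does this work for you.

```python
def per_class_scores(gold, predicted, label):
    """Return (precision, recall, f1) for one category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # true positives
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false positives
    fn = sum(1 for g, p in pairs if g == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# toy data, invented for this illustration
gold = ['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']
predicted = ['positive', 'neutral', 'negative', 'positive', 'neutral', 'neutral']

labels = ['negative', 'neutral', 'positive']
scores = {label: per_class_scores(gold, predicted, label) for label in labels}
for label, (p, r, f1) in scores.items():
    print(f'{label:9s} P={p:.3f} R={r:.3f} F1={f1:.3f}')

# macro average: unweighted mean over the categories
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(labels)
# micro average: with exactly one label per tweet this equals the accuracy
micro = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(f'macro F1={macro_f1:.3f} micro={micro:.3f}')
```

In this toy example the negative category has perfect precision but low recall, while the neutral category shows the opposite pattern. This is the kind of imbalance the quantitative evaluation asks you to discuss.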

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? word form? part of speech? polarity value? word meaning?). What do all scores mean?
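As a worked illustration of how the compound score relates to the lexicon values, the sketch below normalises a sum of valences into the range -1 to 1. The constants (`ALPHA = 15`, a negation scalar of `-0.74`) and the valences for "love" and ":-)" are quoted from memory of the VADER source and lexicon, so treat them as assumptions and check them against the files linked above. The sketch ignores the other VADER rules (intensifiers, capitalisation, punctuation).

```python
import math

ALPHA = 15        # normalisation constant (assumed from the VADER source)
NEGATION = -0.74  # scalar applied to a negated word (assumed from the VADER source)


def normalize(valence_sum, alpha=ALPHA):
    """Map an unbounded sum of valences to a value between -1 and 1."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


# lexicon valences as recalled from vader_lexicon.txt; verify them in the file
love = 3.2
smiley = 1.3  # the emoticon ':-)'

print('I love apples      ', round(normalize(love), 4))
print("I don't love apples", round(normalize(love * NEGATION), 4))
print('I love apples :-)  ', round(normalize(love + smiley), 4))
```

Under these assumptions the three values agree, up to rounding, with the compound scores VADER reports for sentences 1 to 3 of Question 1. The neg, neu and pos scores are computed differently: they are proportions of the text that fall into each category.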


### [3.5 points] Question 1:

Consider the following sentences (1 to 7) and the output VADER gives for them. Explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain why VADER produces a correct or an incorrect result in each case.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

The sentences with less reasonable scores are the second through the seventh, for the following reasons:

- In the second sentence the neutral score should be higher than the negative one, since the person does not hate apples but does not love them either.
- The third sentence should have a neutral score of 0.0, since it is explicitly a positive comment.
- The fourth sentence can be interpreted in two ways. If it means that the houses are old and dirty, a negative score is indeed appropriate. However, it could also refer to literal ruins, in which case the score should be more neutral.
- The fifth sentence should be mostly neutral, since it is neither a positive nor a negative comment.
- The sixth sentence should be neutral and have a negative score of 0.0.
- The seventh sentence should have a lower positive score and a slightly higher negative one. The tone of this sentence implies that the house being observed is nothing special.

The last sentences have these errors because not all parts of speech are provided. It is also important that negations, such as "not" in "These houses are certainly not considered ruins", are in the list of negation tokens, so that VADER flips the sentiment polarity appropriately while maintaining the intensity calculations. Most sentences lack them.

Additionally, titles (headlines) need to be converted into strings before applying the polarity_scores function.

### [2.5 points] Question 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```
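Because the file is edited by hand, it can help to check the entries before running any classifier. The sketch below is one possible check, not part of the provided notebooks: `find_problems`, `ALLOWED_LABELS` and `REQUIRED_KEYS` are hypothetical names, and the small in-memory example stands in for the real my_tweets.json.

```python
import json

ALLOWED_LABELS = {'negative', 'neutral', 'positive'}
REQUIRED_KEYS = {'sentiment_label', 'text_of_tweet', 'tweet_url'}


def find_problems(tweets):
    """Return a list of (tweet_id, message) pairs for malformed entries."""
    problems = []
    for tweet_id, info in tweets.items():
        missing = REQUIRED_KEYS - info.keys()
        if missing:
            problems.append((tweet_id, f'missing keys: {sorted(missing)}'))
        elif info['sentiment_label'] not in ALLOWED_LABELS:
            problems.append((tweet_id, f"unknown label: {info['sentiment_label']!r}"))
        elif not info['text_of_tweet'].strip():
            problems.append((tweet_id, 'empty tweet text'))
    return problems


# small in-memory example instead of the real my_tweets.json
example = json.loads('''{
    "1": {"sentiment_label": "positive", "text_of_tweet": "Great day!", "tweet_url": "https://example.org/1"},
    "2": {"sentiment_label": "possitive", "text_of_tweet": "Nice", "tweet_url": "https://example.org/2"},
    "3": {"sentiment_label": "neutral", "text_of_tweet": "", "tweet_url": ""}
}''')

for tweet_id, message in find_problems(example):
    print(tweet_id, message)
```

A misspelled label such as "possitive" would otherwise show up as an extra category in the classification report, so it is worth catching early.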

You can load your tweets with human annotation in the following way.

In [17]:
import json

In [18]:
# open the file in a with-block so that it is closed again after loading
with open('my_tweets.json', encoding='utf8') as infile:
    my_tweets = json.load(infile)

In [19]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)

1 {'sentiment_label': 'negative', 'text_of_tweet': 'The cybersecurity industry is so very embarrassing.', 'tweet_url': 'https://mobile.twitter.com/GossiTheDog/status/1497899214637912066?cxt=HHwWhMCi3cjZzckpAAAA '}
2 {'sentiment_label': 'positive', 'text_of_tweet': "This is Donut. He wanted to pop out and say hi. Wondering where your personal chauffeur is. Said you can borrow his, if you'd like. 13/10", 'tweet_url': 'https://mobile.twitter.com/dog_rates/status/1496265582550863879?cxt=HHwWjsCymdXn5sMpAAAA '}
3 {'sentiment_label': 'negative', 'text_of_tweet': 'Ukraine’s ambassador to the US just told us that a Russian platoon from the 74th Motorized Brigade has surrendered to Ukraine’s forces. She says that the Russian troops apparently had been unaware they were being sent to kill Ukrainians. No confirmation yet from Russia’s military', 'tweet_url': 'https://mobile.twitter.com/JoshNBCNews/status/1496884570716835840?cxt=HHwWgMC4tcKlgMYpAAAA '}
4 {'sentiment_label': 'negative', 'text_of_tw

### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER rules and the VADER lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

[2.5 points] a. The most relevant scores are those of tweets 7, 10, 14, 16, 18, 20, 21, 22, 23, 24, 26, 30, 33, 36, 40, 45 and 48. These scores are the most relevant because the label VADER assigns differs from the gold label in my_tweets.json. Most of these tweets are classified as positive or neutral by VADER, while their tone is negative.

[2.5 points] b.

Positive tweets: 10, 23, 26, 45, 48

tweet 10: "Data from the #Sentinel5P satellite are being used to detect emission hotspots in various regions, identifying #methane concentrations and locating exact facilities leaking methane". (classified as neutral)

tweet 23: "Excellent discussion covering the intersections between COVID-19 and geopolitical crisis with Professor Isa Blumi and Jesse Zurawell. Essential listening. NOTE: It's the 4th interview down on the list, dated 1 March 2022" (classified as negative)

tweet 26: "An opinion piece by Australian writer Alexandra Marshall on how the media has connived to keep their readers in the dark and in doing so, build into The Narrative." (classified as neutral)

tweet 45: "I'm beginning to think the cat is in charge" (classified as neutral)

tweet 48: "Achievement unlocked" (classified as neutral)

These positive tweets are misclassified because not all parts of speech are provided and because positive or affirmative keywords are missing. Positive words should be added to the list of affirmation tokens, so that the sentiment polarity is flipped appropriately while the intensity calculations are maintained. It is also important to avoid special characters for better predictions.



Negative tweets: 7, 20, 21, 22, 24, 30, 40

tweet 7: "This man drove to Poland from DENMARK to help Ukrainian women and children. He said he is going back with six people, whom he will help settle in a new country, over 2000 km. away from home. “Putin is a dictator, and we don’t like that in Denmark,” he said." (classified as positive)

tweet 20: "Ran for a train today for the first time ever. My mcm better tweet “if they wanted, to they would”" (classified as positive)

tweet 21: "My mistery: hyperglycemia, not yet diabete, low insulin level, low pep-c level, no D1 antibody, no mody, pancreas ok…no doctor has an explanation…any idea? Thanks" (classified as positive)

tweet 22: "Events of the last two years highlight the extreme danger presented by centralization. The intellectual backing for the idea that we should have more of it rests on the notion that having more data enables better centralized decision-making. /1" (classified as positive)

tweet 24: "Covid modelling cannot accurately predict numbers, admits Prof Graham Medley, of the Scientific Pandemic Influenza Group on Modelling (Spi-M). I think what he means is their models are rubbish and pointless. Is it time to disband SAGE? " (classified as positive)

tweet 40: "Tough question, prob would have to call a friend. Flat planet friends" (classified as positive)

These negative tweets are misclassified because not all parts of speech are provided and because negation keywords are missing. Negative words such as {not}, {bad}, etc. should be added to the list of negation tokens, so that the sentiment polarity is flipped appropriately while the intensity calculations are maintained. It is also important to avoid special characters for better predictions.



Neutral tweets: 14, 16, 18, 20, 22, 33, 36

tweet 14: "Deciding between security cameras or video doorbells can be tough. We cover the pros and cons of installation, video coverage, and price." (classified as positive)

tweet 16: "The James Carey Urban Communication Grant supports research that advances knowledge on the consequences of urban communication in and across urban societies. The grant is awarded by the Urban Communication Foundation (UCF) and co-sponsored" (classified as positive)

tweet 18: "“What’s the trend of the market? If it’s negative, you’ll want to do very little, if any buying, even if you see some stocks breaking out. Your probabilities of success are quite low when the market trend is going against you.”Stan Weinstein" (classified as negative)

tweet 20: "Ran for a train today for the first time ever. My mcm better tweet “if they wanted, to they would”" (classified as positive)

tweet 22: "Events of the last two years highlight the extreme danger presented by centralization. The intellectual backing for the idea that we should have more of it rests on the notion that having more data enables better centralized decision-making. /1" (classified as positive)

tweet 33: "The top for most US stocks was 1 year ago, why are you writing this stuff now (don't answer lol) Good luck" (classified as positive)

tweet 36: "Entered TQQQ this morning with a wide stop. Would like to add to my position once we get confirmation." (classified as positive)

These neutral tweets are misclassified because not all parts of speech are provided and because neutral keywords are missing. Neutral words should be added to the list of neutral tokens, so that the sentiment polarity is flipped appropriately while the intensity calculations are maintained. It is also important to avoid special characters for better predictions.
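The lists of misclassified tweets above were put together by hand. A small helper can collect them automatically, grouped by gold label. This is only a sketch: `collect_errors` is a hypothetical function, and it assumes that the gold labels and the system labels are both lists of label strings in the same order as the tweets. The data below is a toy example.

```python
from collections import defaultdict


def collect_errors(tweets, gold, predicted, max_per_label=10):
    """Group misclassified tweets by gold label, keeping at most max_per_label each."""
    errors = defaultdict(list)
    for text, gold_label, system_label in zip(tweets, gold, predicted):
        if gold_label != system_label and len(errors[gold_label]) < max_per_label:
            errors[gold_label].append((system_label, text))
    return dict(errors)


# toy example, invented for this illustration
toy_tweets = ['Achievement unlocked', 'So embarrassing.', 'We cover the pros and cons.', 'Love it']
toy_gold = ['positive', 'negative', 'neutral', 'positive']
toy_predicted = ['neutral', 'negative', 'positive', 'positive']

for gold_label, items in collect_errors(toy_tweets, toy_gold, toy_predicted).items():
    for system_label, text in items:
        print(f'gold={gold_label} system={system_label}: {text}')
```

Grouping by gold label matches the way the question asks for the errors: up to 10 positive, 10 negative and 10 neutral tweets.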

In [20]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
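The mapping above labels a tweet as neutral only when the compound score is exactly 0.0, so even a very weak score counts as positive or negative. The VADER documentation suggests a small band around zero instead (positive from 0.05, negative up to -0.05). The variant below is a sketch of that idea. The function name is hypothetical and the threshold is a setting to experiment with; whether it improves the scores on your own tweets has to be checked.

```python
def vader_output_to_label_with_threshold(vader_output, threshold=0.05):
    """
    Map VADER output to 'negative' | 'neutral' | 'positive',
    treating compound scores strictly between -threshold and threshold as neutral.
    """
    compound = vader_output['compound']

    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


print(vader_output_to_label_with_threshold({'compound': 0.01}))
print(vader_output_to_label_with_threshold({'compound': 0.3612}))
print(vader_output_to_label_with_threshold({'compound': -0.489}))
```

With the default threshold, a compound score of 0.01 is now mapped to neutral rather than positive.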

In [21]:
import nltk
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pathlib
import sklearn
import numpy
import spacy
from nltk.corpus import stopwords
from collections import Counter
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report

vader_model = SentimentIntensityAnalyzer() # create the analyzer once, outside the loop

tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = vader_model.polarity_scores(the_tweet) # run vader
    vader_label = vader_output_to_label(vader_output) # convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
    print(id_,'VADER OUTPUT', vader_output, vader_label)
    print('CORRECT SENTIMENT', tweet_info['sentiment_label'])

# use scikit-learn's classification report
print(classification_report(gold, all_vader_output, digits=3))

1 VADER OUTPUT {'neg': 0.346, 'neu': 0.654, 'pos': 0.0, 'compound': -0.489} negative
CORRECT SENTIMENT negative
2 VADER OUTPUT {'neg': 0.0, 'neu': 0.909, 'pos': 0.091, 'compound': 0.3612} positive
CORRECT SENTIMENT positive
3 VADER OUTPUT {'neg': 0.175, 'neu': 0.825, 'pos': 0.0, 'compound': -0.8271} negative
CORRECT SENTIMENT negative
4 VADER OUTPUT {'neg': 0.062, 'neu': 0.905, 'pos': 0.032, 'compound': -0.34} negative
CORRECT SENTIMENT negative
5 VADER OUTPUT {'neg': 0.313, 'neu': 0.612, 'pos': 0.075, 'compound': -0.5423} negative
CORRECT SENTIMENT negative
6 VADER OUTPUT {'neg': 0.091, 'neu': 0.909, 'pos': 0.0, 'compound': -0.296} negative
CORRECT SENTIMENT negative
7 VADER OUTPUT {'neg': 0.0, 'neu': 0.851, 'pos': 0.149, 'compound': 0.7845} positive
CORRECT SENTIMENT negative
8 VADER OUTPUT {'neg': 0.118, 'neu': 0.796, 'pos': 0.086, 'compound': -0.2648} negative
CORRECT SENTIMENT neutral
9 VADER OUTPUT {'neg': 0.0, 'neu': 0.815, 'pos': 0.185, 'compound': 0.7996} positive
CORRECT SENT

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
* - Does lemmatisation help? Explain why or why not.
Lemmatisation reduces each word to its dictionary base form. This is useful when the meaning of a word is important for deciding whether a sentence is positive, negative or neutral, so we expect it to increase the accuracy of the sentiment scores.
* - Are all parts of speech equally important for sentiment analysis? Explain why or why not.
We would say that all parts of speech are equally important for sentiment analysis, because the sentiment of a word can differ depending on its part of speech, for example whether it is used as a noun or as a verb.

In [24]:
import spacy
nlp = spacy.load('en_core_web_sm')

cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
airline_tweets = load_files('C:/Users/zakar/Downloads/ba-text-mining-master/ba-text-mining-master/lab_sessions/lab3/airlinetweets/airlinetweets')
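# load_files returns the tweets as bytes; decode them to str because spaCy expects strings
all_tweets = [tweet.decode('utf-8') for tweet in airline_tweets.data]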
vader_model = SentimentIntensityAnalyzer()
gold_labels = airline_tweets.target

def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a sentence from spacy
    
    :param str textual unit: a textual unit, e.g., sentence, sentences (one string)
    (by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', sent)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [25]:
#VADER (as it is) on the set of airline tweets
three_class_system_output = []
for tweet in all_tweets:
    vader_scores = run_vader(str(tweet), 
                    lemmatize = False,
                    verbose = 0)
    three_class_system_output.append(vader_output_to_label(vader_scores))

# the gold labels are integer indices; map them to category names so that
# gold and system labels have the same type
gold_names = [airline_tweets.target_names[index] for index in gold_labels]
report = classification_report(gold_names,three_class_system_output,digits = 3)
print(report)

In [9]:
#VADER on the set of airline tweets after having lemmatized the text
three_class_system_output = []
for tweet in all_tweets:
    vader_scores = run_vader(str(tweet),
                    lemmatize = True,
                    verbose = 0)
    three_class_system_output.append(vader_output_to_label(vader_scores))

# evaluate once, after all tweets have been labelled, against the gold category names
gold_names = [airline_tweets.target_names[index] for index in gold_labels]
report = classification_report(gold_names,three_class_system_output,digits = 3)
print(report)

In [7]:
#VADER on the set of airline tweets with only adjectives
three_class_system_output = []
for tweet in all_tweets:
    vader_scores = run_vader(str(tweet), 
          lemmatize=False, 
          parts_of_speech_to_consider={'ADJ'},
          verbose=0)
    three_class_system_output.append(vader_output_to_label(vader_scores))

# evaluate once, after all tweets have been labelled, against the gold category names
gold_names = [airline_tweets.target_names[index] for index in gold_labels]
report = classification_report(gold_names,three_class_system_output,digits = 3)
print(report)

In [8]:
#VADER on the set of airline tweets with only adjectives and after having lemmatized the text
three_class_system_output = []
for tweet in all_tweets:
    vader_scores = run_vader(str(tweet), 
          lemmatize=True, 
          parts_of_speech_to_consider={'ADJ'},
          verbose=0)
    three_class_system_output.append(vader_output_to_label(vader_scores))

# evaluate once, after all tweets have been labelled, against the gold category names
gold_names = [airline_tweets.target_names[index] for index in gold_labels]
report = classification_report(gold_names,three_class_system_output,digits = 3)
print(report)

In [9]:
#VADER on the set of airline tweets with only nouns
three_class_system_output = []
for tweet in all_tweets:
    vader_scores = run_vader(str(tweet), 
          lemmatize=False, 
          parts_of_speech_to_consider={'NOUN'},
          verbose=0)
    three_class_system_output.append(vader_output_to_label(vader_scores))

# evaluate once, after all tweets have been labelled, against the gold category names
gold_names = [airline_tweets.target_names[index] for index in gold_labels]
report = classification_report(gold_names,three_class_system_output,digits = 3)
print(report)

In [10]:
#VADER on the set of airline tweets with only nouns and after having lemmatized the text
three_class_system_output = []
for tweet in all_tweets:
    vader_scores = run_vader(str(tweet), 
          lemmatize=True, 
          parts_of_speech_to_consider={'NOUN'},
          verbose=0)
    three_class_system_output.append(vader_output_to_label(vader_scores))

# evaluate once, after all tweets have been labelled, against the gold category names
gold_names = [airline_tweets.target_names[index] for index in gold_labels]
report = classification_report(gold_names,three_class_system_output,digits = 3)
print(report)

In [11]:
#VADER on the set of airline tweets with only verbs and after having lemmatized the text
three_class_system_output = []
for tweet in all_tweets:
    vader_scores = run_vader(str(tweet), 
          lemmatize=True, 
          parts_of_speech_to_consider={'VERB'},
          verbose=0)
    three_class_system_output.append(vader_output_to_label(vader_scores))

# evaluate once, after all tweets have been labelled, against the gold category names
gold_names = [airline_tweets.target_names[index] for index in gold_labels]
report = classification_report(gold_names,three_class_system_output,digits = 3)
print(report)

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?
b)    
We think that we could use for loops similar to those in the previous exercise for all experiments, but 
run them on the airline tweets. For the other settings we would have to change min_df from 2 to 5 and to 10 in 
CountVectorizer. We expect that setting min_df too high (at 10) might cause too many important terms to be ignored,
while setting min_df too low (at 1 or 2) might cause too many terms to be considered, which can decrease the performance.
To conclude, we think that setting min_df somewhere in between (at 5) will give the best performance.
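The effect of min_df described above can be illustrated without scikit-learn. The sketch below counts in how many documents each token occurs (its document frequency) and keeps only tokens that reach the threshold. `vocabulary_with_min_df` is a hypothetical helper and the four "tweets" are toy data; CountVectorizer applies the same kind of cut-off internally when min_df is an integer.

```python
from collections import Counter


def vocabulary_with_min_df(tokenized_docs, min_df):
    """Keep the tokens that occur in at least min_df different documents."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))  # count each token once per document
    return sorted(token for token, df in document_frequency.items() if df >= min_df)


# toy documents, invented for this illustration
docs = [
    'flight delayed again'.split(),
    'flight was great'.split(),
    'delayed delayed delayed'.split(),
    'great crew great flight'.split(),
]

for min_df in (1, 2, 3):
    print(min_df, vocabulary_with_min_df(docs, min_df))
```

Note that "delayed" occurs four times in total but only in two documents, so it is dropped at min_df = 3. The threshold counts documents, not occurrences.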

In [7]:
import nltk
nltk.download('stopwords')
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

airline_vec = CountVectorizer(min_df=2, # tokens that occur in fewer than min_df documents are ignored
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed

cwd = pathlib.Path.cwd()
# load_files expects the folder that directly contains one subfolder per category
# (negative, neutral, positive); adjust the path if the data is nested one level deeper
airline_tweets_folder = cwd.joinpath('airlinetweets')
airline_tweets = load_files(str(airline_tweets_folder))
airline_counts = airline_vec.fit_transform(airline_tweets.data)
tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the bag-of-words model (use airline_tfidf for the TF-IDF experiments)
    airline_tweets.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 

clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

# y_pred only covers the test split, so it is evaluated against y_test
print(classification_report(y_test, y_pred, target_names=airline_tweets.target_names, digits=3))
# We keep getting an error that permission to the airlinetweets files is denied.
# Therefore we could not run the experiments, although we know how to carry them out.
# See the explanation under question b.


[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


PermissionError: [Errno 13] Permission denied: 'C:\\Users\\zakar\\Downloads\\ba-text-mining-master\\ba-text-mining-master\\lab_sessions\\lab3\\airlinetweets\\airlinetweets\\positive'

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why ? 
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 
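When thinking about which words to remove, it can help to try a filter on a handful of features first. The sketch below drops tokens that consist only of punctuation, plain numbers, and names from a list of airlines. `keep_feature` and the `handles` set are hypothetical and chosen for illustration; whether removing these tokens actually improves the model is something the experiment has to show.

```python
import string


def keep_feature(feature, airline_handles=frozenset()):
    """Return False for punctuation-only tokens, numbers and known airline names."""
    if all(char in string.punctuation for char in feature):
        return False
    if feature.replace('.', '').replace(',', '').isdigit():
        return False
    if feature.lower() in airline_handles:
        return False
    return True


# toy feature list and a hypothetical set of airline names
features = ['@', 'united', 'delay', '!', '2', 'bad', 'thanks', '...']
handles = frozenset({'united'})

kept = [feature for feature in features if keep_feature(feature, handles)]
print(kept)
```

Punctuation such as "!" is not necessarily noise for sentiment, so it may be worth comparing a model with and without it rather than removing it by default.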

In [143]:
def important_features_per_class(vectorizer,classifier,n=80):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out() # get_feature_names() was removed in newer scikit-learn versions
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 
        
# example of how to call from notebook:
important_features_per_class(airline_vec, clf)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zakar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


PermissionError: [Errno 13] Permission denied: 'C:\\Users\\zakar\\Downloads\\ba-text-mining-master\\ba-text-mining-master\\lab_sessions\\lab3\\airlinetweets\\airlinetweets\\positive'

### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook