# Sentiment & Dictionaries

We will mostly use NLTK to conduct sentiment analysis in this lab.

## NLTK Corpus

NLTK has several corpora. Some are useful for sentiment analysis.

http://www.nltk.org/howto/corpus.html

* opinion_lexicon
* WordNet
* SentiWordNet

### opinion lexicon

Opinion Lexicon: A list of English positive and negative opinion words or sentiment words (around 6800 words). This list was compiled over many years, starting from the paper by Hu and Liu (KDD-2004).

You first need to download the NLTK `opinion_lexicon` corpus:
`nltk.download('opinion_lexicon')`



In [None]:
import nltk
# nltk.download('opinion_lexicon')  # uncomment and run once, the first time you use this corpus
from nltk.corpus import opinion_lexicon

In [None]:
opinion_lexicon.positive()

In [None]:
len(opinion_lexicon.positive())

In [None]:
opinion_lexicon.negative()

**<span class="mark">Your turn</span>**: think of three positive and three negative sentiment words. See if they are in the lexicons.

In [None]:
# replace with your own words
my_pos = ['good','great','groovy']
my_neg = ['sick','demented','nasty']

In [None]:
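# Run this to see whether each word appears in either lexicon.
# Build the sets once so that each membership test is fast.
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())
print('WORD, POS, NEG\n---------------')
for lex in [my_pos, my_neg]:
    for word in lex:
        print(word, word in positive_words, word in negative_words)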

The results above show that, for certain words, `opinion_lexicon` cannot assign a positive or negative label because the word appears in neither list. Try a non-sentiment word and you will see the same result.
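
The three possible outcomes of a lookup (in the positive list, in the negative list, or in neither) can be sketched as a small helper. This is only an illustration: `label_word` and the tiny `toy_pos` / `toy_neg` sets are hypothetical stand-ins and not part of NLTK. In the lab you would pass `set(opinion_lexicon.positive())` and `set(opinion_lexicon.negative())` instead.

```python
def label_word(word, pos_words, neg_words):
    """Return 'positive', 'negative', or 'unknown' for a single word."""
    w = word.lower()
    if w in pos_words:
        return 'positive'
    if w in neg_words:
        return 'negative'
    # The word is in neither list, so the lexicon cannot label it.
    return 'unknown'

# Toy stand-ins for the real lexicon (hypothetical, for illustration only)
toy_pos = {'good', 'great'}
toy_neg = {'sick', 'nasty'}

for w in ['good', 'nasty', 'table']:
    print(w, label_word(w, toy_pos, toy_neg))
```

Returning an explicit `'unknown'` keeps "not in the lexicon" separate from "neutral", which is a different claim.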

### Sentiment of tweet
In the last lab, you all tried tokenizing tweets.

**<span class="mark">TODO</span>**: What's the sentiment of a tweet sample? 
You can try with "@john lol that was #awesome :)"


In [None]:
test_tweet = "@john lol that was #awesome :)"

#your code below


### sentiment analysis with `VADER`
https://pypi.org/project/vaderSentiment/

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #pip install vaderSentiment

In [None]:
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(test_tweet)

Now try another text, this time a news article. Recall this text from the last lab.

In [None]:
text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

In [None]:
analyzer.polarity_scores(text)

**How to interpret the overall score?**

The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
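
As a rough sketch of the normalization step: the vaderSentiment source appears to map the summed valence `x` to `x / sqrt(x² + alpha)` with `alpha = 15`. Treat both the formula and the constant as assumptions to check against the version you have installed. `normalize_sum` is an illustrative name, not the library's API.

```python
import math

def normalize_sum(valence_sum, alpha=15):
    """Squash an unbounded valence sum into [-1, 1].

    alpha=15 is assumed here; verify it against your installed vaderSentiment.
    """
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clip defensively so the result never leaves [-1, 1].
    return max(-1.0, min(1.0, score))

# Larger sums move toward +/-1 but never reach it.
for s in [-10, -1, 0, 1, 10]:
    print(s, round(normalize_sum(s), 4))
```

Under this assumption a summed valence of 1 maps to 1 / sqrt(16) = 0.25, so a text needs a fairly large sum before the compound score gets close to the extremes.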

* Positive sentiment: compound score >= 0.05
* Neutral sentiment: -0.05 < compound score < 0.05
* Negative sentiment: compound score <= -0.05

**Multi-dimensional measures of sentiment**

The `pos`, `neu`, and `neg` scores are the proportions of the text that fall into each category, so they should add up to 1 (or very close to it, because of floating-point rounding). These are the most useful metrics if you want multidimensional measures of sentiment for a given sentence.
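
A quick way to check the "adds up to 1" claim is to compare the sum with a tolerance rather than with `==`. The sketch below uses a made-up scores dictionary shaped like VADER's output. The numbers are illustrative only and are not real results, and `proportions_sum_to_one` is a hypothetical helper name.

```python
import math

def proportions_sum_to_one(scores, tol=0.002):
    """Check that the neg/neu/pos proportions sum to 1 within a tolerance."""
    total = sum(scores[key] for key in ('neg', 'neu', 'pos'))
    return math.isclose(total, 1.0, abs_tol=tol)

# Made-up values for illustration only (not actual VADER output)
example_scores = {'neg': 0.1, 'neu': 0.6, 'pos': 0.3, 'compound': 0.5}
print(proportions_sum_to_one(example_scores))
```

In the lab you would pass the dictionary returned by `analyzer.polarity_scores(...)` instead of `example_scores`.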

**<span class="mark">TODO</span>**:

Write a function that interprets the overall sentiment of a text as positive, negative, or neutral based on VADER's analysis.

In [None]:
# Your code below


### sentiment analysis with `TextBlob`

https://textblob.readthedocs.io/en/dev/

In [None]:
from textblob import TextBlob #pip install TextBlob

blob = TextBlob(test_tweet)
blob.polarity

In [None]:
blob.subjectivity

Now try another text, this time a news article. Recall this text from the last lab.

In [None]:
text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

In [None]:
blob = TextBlob(text)
blob.polarity

In [None]:
blob.subjectivity

There are a few other attributes and methods available as well. Type `blob.` and press Tab to see them.

#### A few more tests to see the rule-based approach

In [None]:
TextBlob('great').sentiment

In [None]:
TextBlob('not great').sentiment

So the rule applied to "not great" is: polarity of "great" × -0.5 = 0.8 × -0.5 = -0.4.
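
The same arithmetic can be written as a tiny sketch. `negate_polarity` is a hypothetical helper, not a TextBlob function. The -0.5 factor and the 0.8 base polarity are simply the values observed in the two cells above.

```python
def negate_polarity(polarity, factor=-0.5):
    """Flip and dampen a polarity score, mimicking the observed 'not' rule."""
    return polarity * factor

great_polarity = 0.8  # polarity observed for TextBlob('great')
print(negate_polarity(great_polarity))
```

Comparing this value with `TextBlob('not great').sentiment.polarity` is one way to confirm the rule on your own installation.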

**<span class="mark">TODO for fun</span>**

Try with a few different variations to see whether you can observe the rules working here.

### `Empath`

https://github.com/Ejhfast/empath-client

https://pypi.org/project/empath/

In [None]:
from empath import Empath #pip install empath

In [None]:
lexicon = Empath()

In [None]:
categ = lexicon.analyze("he hit the other person", normalize=True)

In [None]:
print('Categories for the sentence: "he hit the other person":')
for key, value in categ.items():
    if value != 0:
        print(key)

In [None]:
#available categories in empath
print(categ.keys())

In [None]:
# let's see how Empath works on our tweet text
categ_tweets = lexicon.analyze(test_tweet)
categ_tweets

In [None]:
print('Categories for the sentence:', test_tweet)
for key, value in categ_tweets.items():
    if value != 0:
        print(key)

In [None]:
categ_text = lexicon.analyze(text)
categ_text

In [None]:
print('Categories for the news sentence:', text, '\n---------')
for key, value in categ_text.items():
    if value != 0:
        print(key)

**<span class="mark">TODO</span>**: 

1. From the project pitches that you all submitted, you had some idea of what data to collect. Get one data point for your problem (this could be one Reddit post from a community, one tweet, etc.).
2. Now check which Empath categories are present in it.
3. Now loop through your entire dataset.