# Sentiment & Dictionaries

We will mostly use NLTK for sentiment analysis in this lab.



## NLTK Corpus

NLTK has several corpora. Some are useful for sentiment analysis.

http://www.nltk.org/howto/corpus.html

* opinion_lexicon
* WordNet
* SentiWordNet

### opinion lexicon

Opinion Lexicon: a list of English positive and negative opinion (sentiment) words, around 6,800 in total. The list was compiled over many years, starting with the paper by Hu and Liu (KDD-2004).

You first need to download the NLTK opinion_lexicon corpus:
`nltk.download('opinion_lexicon')`



In [8]:
import nltk
# nltk.download('opinion_lexicon')  # only needed the first time you use the corpus
from nltk.corpus import opinion_lexicon

In [10]:
opinion_lexicon.positive()

['a+', 'abound', 'abounds', 'abundance', 'abundant', ...]

In [11]:
len(opinion_lexicon.positive())

2006

In [12]:
opinion_lexicon.negative()

['2-faced', '2-faces', 'abnormal', 'abolish', ...]

In [13]:
len(opinion_lexicon.negative())

4783

**<span class="mark">Your turn</span>**: think of three positive and negative sentiment words. See if they are in the lexicons.

In [14]:
# replace with your own words
my_pos = ['good','great','groovy']
my_neg = ['sick','demented','nasty']

In [15]:
# run this to see whether each word is in either lexicon
pos_words = set(opinion_lexicon.positive())  # sets make membership tests fast
neg_words = set(opinion_lexicon.negative())
print('WORD, POS, NEG\n---------------')
for lex in [my_pos, my_neg]:
    for word in lex:
        print(word, word in pos_words, word in neg_words)

WORD, POS, NEG
---------------
good True False
great True False
groovy False False
sick False True
demented False False
nasty False True


The results above show that opinion_lexicon cannot assign a positive or negative label to some words (here, "groovy" and "demented"). A non-sentiment word gives the same result: False in both columns.
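The three possible outcomes of a lookup (positive, negative, or not in the lexicon at all) can be sketched as a small helper. This is only an illustration: `label_word` is a hypothetical helper, and the two tiny sets are stand-ins for `set(opinion_lexicon.positive())` and `set(opinion_lexicon.negative())` so that the block runs without downloading the corpus.

```python
def label_word(word, positive, negative):
    """Return 'positive', 'negative', or 'unknown' for a single word."""
    w = word.lower()
    if w in positive:
        return "positive"
    if w in negative:
        return "negative"
    return "unknown"  # the lexicon has no opinion about this word

# Stand-in sets (assumption): in the lab, build these from the real corpus.
positive = {"good", "great"}
negative = {"sick", "nasty"}

labels = {w: label_word(w, positive, negative)
          for w in ["good", "groovy", "nasty", "table"]}
print(labels)
```

Note that "unknown" covers both sentiment words missing from the lexicon ("groovy") and words with no sentiment ("table"); the lexicon alone cannot tell them apart.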

### Sentiment of tweet
In the last lab, you all tried tokenizing tweets.

**<span class="mark">TODO</span>**: What's the sentiment of a tweet sample? 
You can try with "@john lol that was #awesome :)"


In [16]:
from nltk.tokenize import sent_tokenize, word_tokenize
test_tweet = "@john lol that was #awesome :)"

# your code below
tweetoken = word_tokenize(test_tweet)  # use the function imported above
for word in tweetoken:
    print(word, word in opinion_lexicon.positive(), word in opinion_lexicon.negative())

@ False False
john False False
lol False False
that False False
was False False
# False False
awesome True False
: False False
) False False
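In the output above, the tokenizer has split `#awesome` into `#` and `awesome`, and `:)` into `:` and `)`, so only "awesome" matches the lexicon. One simple way to turn per-token lookups into a single label for the tweet is to count positive and negative hits. The sketch below assumes that counting rule; `lexicon_sentiment` is a hypothetical helper, and the small sets stand in for the real opinion_lexicon word lists.

```python
def lexicon_sentiment(tokens, positive, negative):
    """Count lexicon hits and return (label, n_pos, n_neg)."""
    n_pos = sum(1 for t in tokens if t.lower() in positive)
    n_neg = sum(1 for t in tokens if t.lower() in negative)
    if n_pos > n_neg:
        label = "positive"
    elif n_neg > n_pos:
        label = "negative"
    else:
        label = "neutral"  # no hits, or a tie
    return label, n_pos, n_neg

# Tokens copied from the output above.
tokens = ["@", "john", "lol", "that", "was", "#", "awesome", ":", ")"]
# Stand-in sets (assumption): replace with the real lexicon in the lab.
positive = {"awesome", "good", "great"}
negative = {"sick", "nasty"}

result = lexicon_sentiment(tokens, positive, negative)
print(result)
```

A plain count ignores negation, intensity, and emoticons, which is part of the motivation for rule-based tools.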


### sentiment analysis with `VADER`
https://pypi.org/project/vaderSentiment/

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

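`VADER` does comparatively well at detecting negative sentiment and is well suited to sentiment analysis of social media text.

It returns a dictionary with scores for how positive (`pos`), negative (`neg`), and neutral (`neu`) the text is; you can then pick out the sentiment with the highest score.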

In [17]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #pip install vaderSentiment

In [18]:
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(test_tweet)

{'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.872}

Let's try another text, this time from a news article. Recall this text from the last lab.

In [19]:
text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

In [20]:
analyzer.polarity_scores(text)

{'neg': 0.2, 'neu': 0.778, 'pos': 0.023, 'compound': -0.9287}

**How to interpret the overall score?**

The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
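The normalization step can be sketched as follows. As far as I know, the vaderSentiment source maps the summed valence x to x / sqrt(x² + α) with α = 15; treat both the constant and the function name `normalize_valence` as assumptions and check the library source before relying on them.

```python
import math

def normalize_valence(total, alpha=15.0):
    """Squash an unbounded summed valence into the open interval (-1, 1)."""
    return total / math.sqrt(total * total + alpha)

# Larger sums move toward -1 or +1 but never reach them.
for total in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(total, round(normalize_valence(total), 4))
```

The squashing explains why long texts with many sentiment-bearing words tend to get compound scores close to the extremes.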

* Positive sentiment: compound score >= 0.05
* Neutral sentiment: -0.05 < compound score < 0.05
* Negative sentiment: compound score <= -0.05
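The thresholds above can be written as a small function. This is a minimal sketch: `interpret_compound` is a hypothetical name, and the two sample values are the compound scores reported in the outputs earlier in this section.

```python
def interpret_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'neutral', or 'negative'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

tweet_label = interpret_compound(0.872)    # compound score of the sample tweet
news_label = interpret_compound(-0.9287)   # compound score of the news text
print(tweet_label, news_label)
```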

**Multi-dimensional measures of sentiment**

The `pos`, `neu`, and `neg` scores are the proportions of the text that fall in each category, so they should add up to 1 (or very close to it, allowing for rounding). These are the most useful metrics if you want multidimensional measures of sentiment for a given sentence.
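As a quick check, the sketch below sums the three proportions for the two score dictionaries printed earlier in this section and picks the largest one. The helper names are hypothetical; the numbers are copied from the outputs above.

```python
tweet_scores = {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.872}
news_scores = {'neg': 0.2, 'neu': 0.778, 'pos': 0.023, 'compound': -0.9287}

def proportion_total(scores):
    """Sum of the three proportions; expected to be about 1."""
    return scores['neg'] + scores['neu'] + scores['pos']

def dominant_category(scores):
    """Category ('neg', 'neu', or 'pos') with the largest proportion."""
    return max(('neg', 'neu', 'pos'), key=lambda k: scores[k])

for name, scores in [('tweet', tweet_scores), ('news', news_scores)]:
    print(name, round(proportion_total(scores), 3), dominant_category(scores))
```

For the news text the largest proportion is `neu` even though the compound score is strongly negative, so the two kinds of measure can lead to different labels.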

**<span class="mark">TODO</span>**:

Write a function to interpret the overall sentiment of a text as positive, negative, or neutral based on VADER's analysis.

In [22]:
import numpy as np

def get_vader_score(sent):
    # polarity_scores returns a dict with keys 'neg', 'neu', 'pos', 'compound'
    ss = analyzer.polarity_scores(sent)
    # drop 'compound', then return the index of the largest remaining score:
    # 0 = neg, 1 = neu, 2 = pos
    return np.argmax(list(ss.values())[:-1])

# Example use, assuming `news` is a pandas DataFrame with a 'headline_text'
# column (it is not defined in this lab):
# news['polarity'] = news['headline_text'].map(lambda x: get_vader_score(x))
# polarity = news['polarity'].replace({0: 'neg', 1: 'neu', 2: 'pos'})

### sentiment analysis with `TextBlob`

https://textblob.readthedocs.io/en/dev/

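Returns:

    polarity: in [-1, 1]; 1 indicates a positive statement, -1 a negative statement

    subjectivity: in [0, 1]; how much personal opinions and feelings shape the judgment (0 is very objective, 1 is very subjective)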

In [23]:
from textblob import TextBlob  # pip install textblob

blob = TextBlob(test_tweet)
blob.sentiment

Sentiment(polarity=0.7666666666666666, subjectivity=0.9)

In [24]:
blob.polarity #[-1,1]

0.7666666666666666

In [25]:
blob.subjectivity #[0,1]

0.9

Let's try another text, this time from a news article. Recall this text from the last lab.

In [26]:
text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

In [27]:
blob = TextBlob(text)
blob.polarity

-0.04285714285714285

In [28]:
blob.subjectivity
blob.sentiment

Sentiment(polarity=-0.04285714285714285, subjectivity=0.2642857142857143)

A few other attributes and methods are available as well. Type `blob.` and press Tab to see them.

#### A few more tests to see the rule-based approach

In [29]:
TextBlob('great').sentiment

Sentiment(polarity=0.8, subjectivity=0.75)

In [30]:
TextBlob('not great').sentiment

Sentiment(polarity=-0.4, subjectivity=0.75)

So the rule for "not great" is: polarity of "great" × -0.5 = 0.8 × -0.5 = -0.4.
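The arithmetic of that negation rule can be reproduced in a couple of lines. This sketch only mirrors the numbers observed above: the factor -0.5 is inferred from the "great" and "not great" outputs, `negate_polarity` is a hypothetical helper, and TextBlob's actual rule set covers more cases than this.

```python
def negate_polarity(polarity, factor=-0.5):
    """Apply the observed negation rule: flip the sign and halve the strength."""
    return polarity * factor

great = 0.8                       # polarity reported for 'great'
not_great = negate_polarity(great)
print(not_great)
```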

**<span class="mark">TODO for fun</span>**

Try with a few different variations to see whether you can observe the rules working here.

### `Empath`

https://github.com/Ejhfast/empath-client

https://pypi.org/project/empath/

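The main goal of Empath analysis is to connect text to a wide range of emotions and topics, using similarity comparisons to map words onto Empath's vocabulary. Empath values go beyond pos/neg/neu polarity: on top of counting occurrences, they can be normalized by the total number of words analyzed. Rather than just counting the most frequent emotion and assuming it correlates with good or bad, this lets you dig deeper into subjective feelings.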
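The idea of counting category hits and then normalizing by the number of words can be sketched in plain Python. Everything here is an illustration under stated assumptions: the `CATEGORIES` mini-lexicon is made up and is not Empath's real word lists, and `count_categories` is a hypothetical stand-in for what `lexicon.analyze(..., normalize=True)` does conceptually.

```python
# Hypothetical mini-lexicon (assumption): NOT Empath's real categories or words.
CATEGORIES = {
    "violence": {"hit", "punch", "attack"},
    "movement": {"hit", "run", "walk"},
    "money": {"cash", "pay"},
}

def count_categories(text, categories=CATEGORIES, normalize=False):
    """Count how many tokens fall in each category, optionally per token."""
    tokens = text.lower().split()
    counts = {name: 0.0 for name in categories}
    for tok in tokens:
        for name, words in categories.items():
            if tok in words:
                counts[name] += 1.0  # one word may belong to several categories
    if normalize and tokens:
        counts = {name: c / len(tokens) for name, c in counts.items()}
    return counts

raw = count_categories("he hit the other person")
norm = count_categories("he hit the other person", normalize=True)
print(raw)
print(norm)
```

Normalizing makes scores comparable between a short tweet and a long article, since a raw count grows with text length.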

In [31]:
from empath import Empath #pip install empath

In [32]:
lexicon = Empath()

In [33]:
categ = lexicon.analyze("he hit the other person", normalize=True)

In [34]:
print('Categories for the sentence: "he hit the other person":')
for key, value in categ.items():
    if value != 0:
        print(key)

Categories for the sentence: "he hit the other person":
movement
violence
pain
negative_emotion


In [35]:
#available categories in empath
print(categ.keys())

dict_keys(['help', 'office', 'dance', 'money', 'wedding', 'domestic_work', 'sleep', 'medical_emergency', 'cold', 'hate', 'cheerfulness', 'aggression', 'occupation', 'envy', 'anticipation', 'family', 'vacation', 'crime', 'attractive', 'masculine', 'prison', 'health', 'pride', 'dispute', 'nervousness', 'government', 'weakness', 'horror', 'swearing_terms', 'leisure', 'suffering', 'royalty', 'wealthy', 'tourism', 'furniture', 'school', 'magic', 'beach', 'journalism', 'morning', 'banking', 'social_media', 'exercise', 'night', 'kill', 'blue_collar_job', 'art', 'ridicule', 'play', 'computer', 'college', 'optimism', 'stealing', 'real_estate', 'home', 'divine', 'sexual', 'fear', 'irritability', 'superhero', 'business', 'driving', 'pet', 'childish', 'cooking', 'exasperation', 'religion', 'hipster', 'internet', 'surprise', 'reading', 'worship', 'leader', 'independence', 'movement', 'body', 'noise', 'eating', 'medieval', 'zest', 'confusion', 'water', 'sports', 'death', 'healing', 'legend', 'heroic

In [36]:
# let's see how Empath works on our tweet text
categ_tweets = lexicon.analyze(test_tweet)
categ_tweets

{'help': 0.0,
 'office': 0.0,
 'dance': 0.0,
 'money': 0.0,
 'wedding': 0.0,
 'domestic_work': 0.0,
 'sleep': 0.0,
 'medical_emergency': 0.0,
 'cold': 0.0,
 'hate': 0.0,
 'cheerfulness': 0.0,
 'aggression': 0.0,
 'occupation': 0.0,
 'envy': 0.0,
 'anticipation': 0.0,
 'family': 0.0,
 'vacation': 0.0,
 'crime': 0.0,
 'attractive': 0.0,
 'masculine': 0.0,
 'prison': 0.0,
 'health': 0.0,
 'pride': 0.0,
 'dispute': 0.0,
 'nervousness': 0.0,
 'government': 0.0,
 'weakness': 0.0,
 'horror': 0.0,
 'swearing_terms': 0.0,
 'leisure': 0.0,
 'suffering': 0.0,
 'royalty': 0.0,
 'wealthy': 0.0,
 'tourism': 0.0,
 'furniture': 0.0,
 'school': 0.0,
 'magic': 0.0,
 'beach': 0.0,
 'journalism': 0.0,
 'morning': 0.0,
 'banking': 0.0,
 'social_media': 0.0,
 'exercise': 0.0,
 'night': 0.0,
 'kill': 0.0,
 'blue_collar_job': 0.0,
 'art': 0.0,
 'ridicule': 0.0,
 'play': 0.0,
 'computer': 0.0,
 'college': 0.0,
 'optimism': 0.0,
 'stealing': 0.0,
 'real_estate': 0.0,
 'home': 0.0,
 'divine': 0.0,
 'sexual': 0.0

In [37]:
print('Categories for the sentence:', test_tweet)
for key, value in categ_tweets.items():
    if value != 0:
        print(key)

Categories for the sentence: @john lol that was #awesome :)


In [38]:
categ_text = lexicon.analyze(text)
categ_text

{'help': 0.0,
 'office': 0.0,
 'dance': 0.0,
 'money': 0.0,
 'wedding': 0.0,
 'domestic_work': 0.0,
 'sleep': 0.0,
 'medical_emergency': 0.0,
 'cold': 0.0,
 'hate': 0.0,
 'cheerfulness': 0.0,
 'aggression': 1.0,
 'occupation': 0.0,
 'envy': 0.0,
 'anticipation': 0.0,
 'family': 1.0,
 'vacation': 1.0,
 'crime': 0.0,
 'attractive': 0.0,
 'masculine': 0.0,
 'prison': 0.0,
 'health': 0.0,
 'pride': 0.0,
 'dispute': 0.0,
 'nervousness': 0.0,
 'government': 0.0,
 'weakness': 0.0,
 'horror': 0.0,
 'swearing_terms': 0.0,
 'leisure': 0.0,
 'suffering': 0.0,
 'royalty': 0.0,
 'wealthy': 0.0,
 'tourism': 1.0,
 'furniture': 0.0,
 'school': 0.0,
 'magic': 0.0,
 'beach': 0.0,
 'journalism': 0.0,
 'morning': 0.0,
 'banking': 0.0,
 'social_media': 1.0,
 'exercise': 0.0,
 'night': 0.0,
 'kill': 1.0,
 'blue_collar_job': 0.0,
 'art': 0.0,
 'ridicule': 0.0,
 'play': 0.0,
 'computer': 0.0,
 'college': 0.0,
 'optimism': 0.0,
 'stealing': 0.0,
 'real_estate': 0.0,
 'home': 0.0,
 'divine': 0.0,
 'sexual': 0.0

In [39]:
print('Categories for the news sentence:', text, '\n---------')
for key, value in categ_text.items():
    if value != 0:
        print(key)

Categories for the news sentence: Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack "despicable." 
---------
aggression
family
vacation
tourism
social_media
kill
reading
violence
communication
deception
fight
meeting
war
urban
phone
injury
appearance
traveling
ship
breaking
friends
weapon


**<span class="mark">TODO</span>**: 

1. From the project pitches that you all submitted, you had some idea of what data to collect. Get one data point for your problem (this could be one reddit post from a community, one tweet, etc.)
2. Now check to see which categories of Empath are present
3. Now loop through your entire data
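For step 3, one possible way to summarize a whole dataset is to count in how many documents each category appears. The sketch below assumes you have already collected one result dictionary per document (shaped like the output of `lexicon.analyze`); `aggregate_categories` and the sample dictionaries are hypothetical.

```python
from collections import Counter

def aggregate_categories(results):
    """Count, for each category, the number of documents where it is non-zero."""
    totals = Counter()
    for scores in results:
        for category, value in scores.items():
            if value != 0:
                totals[category] += 1
    return totals

# Made-up per-document results, standing in for lexicon.analyze(doc) outputs.
results = [
    {"violence": 1.0, "war": 2.0, "help": 0.0},
    {"violence": 0.0, "war": 1.0, "help": 0.0},
]
doc_counts = aggregate_categories(results)
print(doc_counts.most_common())
```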

In [40]:
import tweepy
import json

In [41]:
def loadKeys(key_file):
    with open(key_file) as f:
        key_dict = json.load(f)
    return key_dict['api_key'], key_dict['api_secret'], key_dict['token'], key_dict['token_secret']

In [42]:
KEY_FILE = 'twitterkeys-test.json'
api_key, api_secret, token, token_secret = loadKeys(KEY_FILE)
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(token, token_secret)
api = tweepy.API(auth)

FileNotFoundError: [Errno 2] No such file or directory: 'twitterkeys-test.json'
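The error above means the key file is not in the working directory. Judging from `loadKeys`, the file is expected to be a JSON object with the keys `api_key`, `api_secret`, `token`, and `token_secret`. The sketch below writes such a file with placeholder values to a temporary directory and loads it with a slightly more defensive loader; `load_keys` and `REQUIRED_KEYS` are hypothetical names, and real credentials should never be committed or shared.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("api_key", "api_secret", "token", "token_secret")

def load_keys(key_file):
    """Load the four credentials, failing with a clear message if anything is missing."""
    if not os.path.exists(key_file):
        raise FileNotFoundError(
            f"{key_file} not found; create a JSON file with keys {REQUIRED_KEYS}")
    with open(key_file) as f:
        key_dict = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in key_dict]
    if missing:
        raise KeyError(f"missing keys in {key_file}: {missing}")
    return tuple(key_dict[k] for k in REQUIRED_KEYS)

# Demo with placeholder values only.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "twitterkeys-test.json")
    with open(path, "w") as f:
        json.dump({k: "PLACEHOLDER" for k in REQUIRED_KEYS}, f)
    keys = load_keys(path)
print(keys)
```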

In [43]:
search_term = "COVID19"
new_search = search_term + " -filter:retweets"
no_of_pages = 1

# Note: in Tweepy v4 and later, api.search was renamed to api.search_tweets
for page in tweepy.Cursor(api.search, q=new_search, lang="en").pages(no_of_pages):
    for status in page:
        print("\033[1mtweet :\033[0m " + status.text)
        categ_text = lexicon.analyze(status.text)
        for key, value in categ_text.items():
            if value != 0:
                print(key)

NameError: name 'api' is not defined