# Twitter Sentiment Analysis

Perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) of Twitter data: that is, determine the "attitude" or "emotion" (e.g., how "positive", "negative", or "joyful") of tweets made by a particular Twitter user.

## The Data

Load tweet data taken directly from [Twitter's API](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline). 

We'll use a set of **word-sentiments**&mdash;a list of English-language words and what emotions (e.g., "joy", "anger") [are associated with them](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm).

* The [`nltk`](https://github.com/nltk/nltk/wiki/Sentiment-Analysis) library also supports sentiment analysis. However, for practice and extensibility, we'll do a more "manual" analysis using the provided data file.

In [2]:
# import a set of word-sentiments, a list of English-language words and what emotions are associated with them.
from data.sentiments_nrc import SENTIMENTS, EMOTIONS

## Text Sentiment
Define a function that takes a tweet's text (a string) and splits it into a list of individual words. Single-character words are discarded, so the string `"Amazingly, I prefer a #rainy day to #sunshine."` should produce a list with 6 lower-case words in it.

In [0]:
# Define a function that takes a tweet's text (a string) and splits it into a list of individual words
import re

def split_words(text):
    """Lower-case the string `text`, split it on non-word characters, and return the words longer than one character as a list."""
    split_text = re.split(r'\W+', text.lower())
    # Build a new list rather than removing items from the list being iterated over,
    # which would skip the element that follows each removal.
    return [word for word in split_text if len(word) > 1]
split_words("Amazingly, I prefer a #rainy day to #sunshine.")

['amazingly', 'prefer', 'rainy', 'day', 'to', 'sunshine']
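
As a cross-check on the splitting step, the same result can be sketched with `re.findall`, which collects runs of word characters directly instead of splitting on the gaps between them. The function name `tokenize` and the `min_length` parameter are illustrative and not part of the notebook.

```python
import re

def tokenize(text, min_length=2):
    """Return the lower-case words in `text` that have at least `min_length` characters."""
    # \w+ matches runs of letters, digits, and underscores, so punctuation
    # and the '#' in front of a hashtag are dropped.
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if len(word) >= min_length]

print(tokenize("Amazingly, I prefer a #rainy day to #sunshine."))
# ['amazingly', 'prefer', 'rainy', 'day', 'to', 'sunshine']
```

Because `findall` never produces empty strings, there is no trailing empty element to clean up after the final period.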

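Define a function that **filters** a list of words to keep only those associated with an emotion. For example, the `"positive"` words extracted from `"Amazingly, I prefer a #rainy day to #sunshine."` are `["amazingly", "prefer", "sunshine"]`. The implementation below is broader: it keeps every word that appears in the sentiment lexicon, whatever its emotion, so `"rainy"` is kept as well.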

In [0]:
# Define a function that filters a list of words to keep only those that have an associated emotion.
def filter_emotions(text):
    """Take a list of words and return a list of those that appear in SENTIMENTS (i.e., have at least one associated emotion)."""
    emotions_text = []
    for word in text:
        if word in SENTIMENTS:
            emotions_text.append(word)
    return emotions_text
filter_emotions(split_words("Amazingly, I prefer a #rainy day to #sunshine."))

['amazingly', 'prefer', 'rainy', 'sunshine']
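
To keep only the words for one emotion, such as the `"positive"` example above, the lookup can take the emotion as a second argument. The sketch below uses a tiny hand-made `TOY_SENTIMENTS` dictionary as a stand-in for the imported `SENTIMENTS`; its shape (each word mapped to a collection of emotion names) and the function name `words_with_emotion` are assumptions made for illustration.

```python
# Hand-made stand-in for SENTIMENTS; the real lexicon's structure may differ.
TOY_SENTIMENTS = {
    "amazingly": ["joy", "positive", "surprise"],
    "prefer": ["positive", "trust"],
    "rainy": ["sadness"],
    "sunshine": ["joy", "positive"],
}

def words_with_emotion(words, emotion, lexicon=TOY_SENTIMENTS):
    """Return the words from `words` that the lexicon tags with `emotion`."""
    # Words missing from the lexicon fall back to an empty tuple, so they never match.
    return [word for word in words if emotion in lexicon.get(word, ())]

words = ["amazingly", "prefer", "rainy", "day", "to", "sunshine"]
print(words_with_emotion(words, "positive"))  # ['amazingly', 'prefer', 'sunshine']
print(words_with_emotion(words, "sadness"))   # ['rainy']
```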

Define a function that determines which words from a list have _each_ emotion (i.e., the "emotional" words). For example, the words extracted from `"Amazingly, I prefer a #rainy day to #sunshine."` should produce a dictionary that looks like:

```
{
 'anger': [],
 'anticipation': [],
 'disgust': [],
 'fear': [],
 'joy': ['amazingly', 'sunshine'],
 'negative': [],
 'positive': ['amazingly', 'prefer', 'sunshine'],
 'sadness': ['rainy'],
 'surprise': ['amazingly'],
 'trust': ['prefer']
}
```

In [0]:
def emotions_sort(text):
    """Take a list of words and return a dictionary mapping each emotion to the words associated with it."""
    emotions_text = filter_emotions(text)
    # Start every emotion with an empty list so emotions with no matching words still appear.
    emotions_dict = {}
    for emotion in sorted(EMOTIONS):
        emotions_dict[emotion] = []
    for word in emotions_text:
        for emotion in SENTIMENTS[word]:
            emotions_dict[emotion].append(word)
    return emotions_dict
emotions_sort(split_words("Amazingly, I prefer a #rainy day to #sunshine."))

{'anger': [],
 'anticipation': [],
 'disgust': [],
 'fear': [],
 'joy': ['amazingly', 'sunshine'],
 'negative': [],
 'positive': ['amazingly', 'prefer', 'sunshine'],
 'sadness': ['rainy'],
 'surprise': ['amazingly'],
 'trust': ['prefer']}

Define a function that gets a list of the "most common" words in a list: that is, a new list containing each distinct word from the original list, in descending order of how many times that word appears in the original list.


In [0]:
def most_common(wordlist):
    """Take a list of words and return the distinct words as a list, ordered from most to least frequent."""
    text_count = {}
    for word in wordlist:
        text_count[word] = text_count.get(word, 0) + 1
    # Sort the (word, count) pairs by count, largest first.
    text_count_list = sorted(text_count.items(), key=lambda pair: pair[1], reverse=True)
    # Keep only the words so the result is easier to use in the following functions.
    return [word for word, count in text_count_list]
most_common(['a','b','c','c','c','a'])

['c', 'a', 'b']
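
The standard library's `collections.Counter` gives the same ordering in fewer lines. This is an alternative sketch rather than the notebook's implementation; the name `most_common_words` is illustrative.

```python
from collections import Counter

def most_common_words(words):
    """Return the distinct words, most frequent first."""
    # Counter.most_common() yields (word, count) pairs sorted by descending count.
    return [word for word, _count in Counter(words).most_common()]

print(most_common_words(["a", "b", "c", "c", "c", "a"]))  # ['c', 'a', 'b']
```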

## Tweet Statistics

Define a function (e.g., `analyze_tweets()`) that takes as an argument a **list** of tweet data (with the same structure as the imported `SAMPLE_TWEETS` variable), and _returns_ the data of interest to display in a table like the one at the very top of the notebook. In particular, we'll produce the following information **for each emotion**:

1. The percentage of words _across all tweets_ that have that emotion
2. The most common words _across all tweets_ that have that emotion (in order!)
3. The most common **hashtags** _across all tweets_ associated with that emotion
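
One way to read the third statistic is that a hashtag counts toward an emotion whenever its tweet contains at least one word with that emotion, which is the rule the function below applies. The following sketch illustrates that rule and the percentage calculation for a single emotion on two made-up tweets. The tweet contents and the `TOY_SENTIMENTS` lexicon are invented for illustration; only the `text` and `entities` / `hashtags` fields mirror the tweet structure.

```python
# Made-up lexicon and tweets, for illustration only.
TOY_SENTIMENTS = {"happy": ["joy", "positive"], "problem": ["fear", "sadness"]}

tweets = [
    {"text": "happy to learn", "entities": {"hashtags": [{"text": "Info340"}]}},
    {"text": "one problem left", "entities": {"hashtags": []}},
]

emotion = "joy"
emotion_words = []   # every word tagged with the emotion, across all tweets
hashtags = []        # hashtags from tweets containing at least one such word
total_words = 0

for tweet in tweets:
    words = tweet["text"].lower().split()
    total_words += len(words)
    matches = [w for w in words if emotion in TOY_SENTIMENTS.get(w, ())]
    emotion_words += matches
    if matches:
        hashtags += [tag["text"].lower() for tag in tweet["entities"]["hashtags"]]

# One "joy" word out of six words in total.
percentage = round(len(emotion_words) / total_words * 100, 2)
print(percentage, emotion_words, hashtags)  # 16.67 ['happy'] ['info340']
```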

In [0]:
def analyze_tweets(tweet):
    """Take a list of tweet data and return, for each emotion, the percentage of words with that emotion and the most common words and hashtags associated with it."""
    # Dictionaries to collect the emotion words and hashtags for each emotion.
    emotions_dict = {emotion: [] for emotion in EMOTIONS}
    hashtags_dict = {emotion: [] for emotion in EMOTIONS}
    for item in tweet:
        # Store each analysis on the tweet itself: the split text, the word count, the words for each emotion, and the hashtags.
        item['split_text'] = item['text'].split(" ")
        item['word_cnt'] = len(item['split_text'])
        item['emotions'] = emotions_sort(split_words(item['text']))
        item['hashtags'] = [tag['text'].lower() for tag in item['entities']['hashtags']]
        for emotion in emotions_dict:
            emotions_dict[emotion] += item['emotions'][emotion]
            # A tweet's hashtags count toward every emotion that has at least one word in that tweet.
            if item['emotions'][emotion]:
                hashtags_dict[emotion] += item['hashtags']
    # Count the total number of words across the tweets.
    total_count = sum(item['word_cnt'] for item in tweet)
    most_common_emotion_word = {}
    most_common_hashtags = {}
    emotions_count = []
    # Get the three most common words and hashtags for each emotion, and the percentage of words with that emotion.
    for emotion in EMOTIONS:
        most_common_emotion_word[emotion] = most_common(emotions_dict[emotion])[:3]
        most_common_hashtags[emotion] = most_common(hashtags_dict[emotion])[:3]
        # Avoid dividing by zero when there are no words at all.
        percentage = round(len(emotions_dict[emotion]) / total_count * 100, 2) if total_count else 0.0
        emotions_count.append((emotion, percentage))
    # Sort the emotions by percentage, largest first.
    emotions_count = sorted(emotions_count, key=lambda count: count[1], reverse=True)
    return (emotions_count, most_common_emotion_word, most_common_hashtags)

([('positive', 6.72),
  ('trust', 3.36),
  ('anticipation', 2.76),
  ('joy', 1.92),
  ('surprise', 1.08),
  ('negative', 0.96),
  ('sadness', 0.6),
  ('disgust', 0.48),
  ('fear', 0.48),
  ('anger', 0.36)],
 {'positive': ['learn', 'faculty', 'happy'],
  'negative': ['fall', 'rejection', 'outstanding'],
  'anger': ['rejection', 'disaster', 'involvement'],
  'anticipation': ['happy', 'top', 'ready'],
  'disgust': ['rejection', 'weird', 'finally'],
  'fear': ['rejection', 'surprise', 'problem'],
  'joy': ['happy', 'peace', 'deal'],
  'sadness': ['fall', 'rejection', 'problem'],
  'surprise': ['deal', 'award', 'surprised'],
  'trust': ['school', 'faculty', 'happy']},
 {'positive': ['accesstoinfoday', 'indigenouspeoplesday', 'idealistfair'],
  'negative': [],
  'anger': ['mlis'],
  'anticipation': ['indigenouspeoplesday', 'informatics', 'info340'],
  'disgust': [],
  'fear': [],
  'joy': ['indigenouspeoplesday', 'accesstoinfoday'],
  'sadness': [],
  'surprise': ['suzzallolibrary', 'nobrainer'],
  'trust': ['indigenouspeoplesday', 'diversity']})

Define another function to display the information as a printed table (the function should take as an argument the data structure returned from the "analysis" function).

In [0]:
def format_table(analysis):
    """Print a table showing the percentage of words for each emotion and the most common words and hashtags for that emotion."""
    emotions_count, common_words, common_hashtags = analysis
    row_format = "{:<15} {:<10} {:<35} {}"
    # Header of the table
    print(row_format.format('EMOTION', '% WORDS', 'EXAMPLE WORDS', 'HASHTAGS'))
    # One row per emotion, in the order given by the analysis (highest percentage first)
    for emotion, percent in emotions_count:
        percentage = str(percent) + "%"
        words = ", ".join(common_words[emotion])
        # Prefix each hashtag with '#'; an emotion with no hashtags gets an empty column.
        hashtags = ", ".join("#" + tag for tag in common_hashtags[emotion])
        print(row_format.format(emotion, percentage, words, hashtags))

EMOTION         % WORDS    EXAMPLE WORDS                       HASHTAGS
positive        6.72%      learn, faculty, happy               #accesstoinfoday, #indigenouspeoplesday, #idealistfair
trust           3.36%      school, faculty, happy              #indigenouspeoplesday, #diversity
anticipation    2.76%      happy, top, ready                   #indigenouspeoplesday, #informatics, #info340
joy             1.92%      happy, peace, deal                  #indigenouspeoplesday, #accesstoinfoday
surprise        1.08%      deal, award, surprised              #suzzallolibrary, #nobrainer
negative        0.96%      fall, rejection, outstanding        
sadness         0.6%       fall, rejection, problem            
disgust         0.48%      rejection, weird, finally           
fear            0.48%      rejection, surprise, problem        
anger           0.36%      rejection, disaster, involvement    #mlis
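
The column alignment in the table comes from Python's format specification: `{:<15}` left-aligns a value and pads it with spaces to a minimum width of 15 characters. A minimal illustration using the first three columns of the `joy` row from the table:

```python
row_format = "{:<15} {:<10} {}"

# '<' left-aligns; the number is the minimum field width in characters.
line = row_format.format("joy", "1.92%", "happy, peace, deal")
print(repr(line))

# A short value is padded out to the full field width.
print(len("{:<15}".format("joy")))  # 15
```

Values longer than the field width are not truncated, so a long list of example words simply pushes the following column to the right.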


## Getting Live Data

Define a function that takes in a Twitter username as an argument and then returns a list of dictionaries representing the tweets made by that user.

Normally we could fetch this data by sending a request directly to the web service's API (e.g., to the [statuses/user_timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline) endpoint provided by the Twitter API at `https://api.twitter.com/1.1/statuses/user_timeline`). However, Twitter includes access controls so that only registered developers are allowed to send requests. Instead, we'll use a [proxy](https://en.wikipedia.org/wiki/Proxy) set up by a UW faculty member, Joel Ross. Hence, we'll send a request to the URL
`https://faculty.washington.edu/joelross/proxy/twitter/timeline/`
instead of `https://api.twitter.com/1.1/statuses/user_timeline`. The proxy forwards the request to Twitter with the proper authentication and returns whatever JSON Twitter's API responds with.

In [0]:
def get_user(username):
    """Take a Twitter username and return that account's tweets as a list of dictionaries."""
    import requests
    query_params = {"screen_name": username}
    response = requests.get("https://faculty.washington.edu/joelross/proxy/twitter/timeline/", params=query_params)
    # Raise an exception on an HTTP error status instead of trying to parse an error page.
    response.raise_for_status()
    # Decode the JSON body into Python lists and dictionaries.
    return response.json()
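
For reference, `requests` turns the `params` dictionary into a URL query string before sending the request. The sketch below reproduces just that step with the standard library so the final URL can be inspected without making a network call; the helper name `build_timeline_url` is illustrative.

```python
from urllib.parse import urlencode

PROXY_URL = "https://faculty.washington.edu/joelross/proxy/twitter/timeline/"

def build_timeline_url(username):
    """Return the proxy URL with the username encoded as the screen_name parameter."""
    return PROXY_URL + "?" + urlencode({"screen_name": username})

print(build_timeline_url("Dior"))
# https://faculty.washington.edu/joelross/proxy/twitter/timeline/?screen_name=Dior
```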

Define a function that [prompts the user](https://docs.python.org/3/library/functions.html#input) for a Twitter username. The function should then fetch the tweets, and pass the returned tweet data into "analyze" and "show" functions in order to display your sentiment analysis of the user's timeline.

In [0]:
if __name__ == "__main__":
    from data.uw_ischool_sample import SAMPLE_TWEETS
    from data.sentiments_nrc import SENTIMENTS, EMOTIONS
    username = input("Which Twitter account do you want to analyze? ")
    tweet = get_user(username)
    analysis = analyze_tweets(tweet)
    format_table(analysis)

Which Twitter account do you want to analyze? Dior
EMOTION         % WORDS    EXAMPLE WORDS                       HASHTAGS
positive        8.18%      creative, lover, passion            #diormagazine, #peterphilips, #miwakomatsu
trust           3.69%      lover, passion, dance               #diormagazine, #peterphilips, #miwakomatsu
joy             3.43%      lover, passion, love                #miwakomatsu, #yoonahn, #missdior
anticipation    2.64%      lover, passion, calls               #diormagazine, #miwakomatsu, #yoonahn
negative        1.32%      shot, calls, frenzied               #diormagazine, #missdior
surprise        1.06%      inspired, shot, celebration         #diormagazine, #missdior, #peterphilips
fear            0.79%      shot, frenzied, whirlwind           #missdior, #diormagazine
anger           0.53%      shot, frenzied                      #diormagazine, #missdior
sadness         0.26%      shot                                #diormagazine
disgust         0.0%   