# Today's challenge: a sentiment analysis of Trump tweets

The goal for today is to introduce Python programming via a real world text analysis task: conducting a sentiment analysis of Twitter data. Our learning objectives include:

1. Introducing Python via a real-world text analysis task.
2. Illustrating how to load data into (and write data out of) Python
3. Learning to implement "standard" text pre-processing in Python
4. Learning how to conduct a lexicon-based sentiment analysis

To achieve these objectives, we will focus on a dataset which contains all of Donald Trump's tweets for the year 2017. Our challenge is to build a simple lexicon (or dictionary) to track negative sentiment in Trump's Twitter feed. Let's get started!

#### Note on programming in Python

This notebook offers an introduction to programming in the Python language. It's impossible to cover it all in a single notebook (or a single class!); however, this notebook highlights core aspects of Python that are important for this class. I highly recommend the (free and online!) book <a href=https://python.swaroopch.com/><i>A Byte of Python</i></a> if you would like to further study the ideas outlined in this notebook.

## Step 1: Load the required libraries

The first step in constructing our sentiment analysis script is to load any required libraries using the `import` statement:

In [75]:
import os # Needed to change your working directory
os.chdir('/Users/tcoan/git_repos/notebooks') # CHANGE THIS DIRECTORY!
import json # Needed to load the trump_tweets_2017.json file

# We will use a couple of regular expressions in this tutorial.
# I have another notebook that provides more details on working with
# regular expressions.
import re

# Import NLTK
import nltk
nltk.download('punkt')         # Download NLTK "data" once (see
nltk.download('stopwords')     # https://www.nltk.org/data.html)
nltk.download('vader_lexicon') # Needed by SentimentIntensityAnalyzer below

# Tokenization
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

# Stopwords
from nltk.corpus import stopwords

# Stemming and lemmatization
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Demonstrate pre-built sentiment lexicons
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Import the pandas library using the namespace "pd" to 
# save on typing. We will use pandas for reading/writing data,
# as well as for the occasional descriptive statistic
import pandas as pd

[nltk_data] Downloading package punkt to /Users/tcoan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Note that if there is a library that you need to use that is not pre-installed with Anaconda, you can install it within Jupyter using the following:

In [None]:
!pip install spacy

## Step 2: Read text data into Python for analysis

The next step is to read in tweets for our sentiment analysis. The `trump_tweets_2017.json` file is <a href="https://www.w3schools.com/js/js_json_intro.asp">JSON</a>-formatted, so we need to use the `json` library to open it:

In [76]:
# This "opens" a connection to the trump_tweets data on disk and "loads"
# it into an object called "tweets"
with open('data/trump_tweets_2017.json', 'r', encoding="utf-8") as jfile:
    tweets = json.load(jfile)

Great, so we've "loaded" our JSON file, but what is actually stored in the <b>variable</b> `tweets`? We could print it, but that doesn't help all that much!

In [None]:
print(tweets)

It turns out that our JSON file (an array of objects) is read in as a list of dictionaries. Awesome, so what's a `list` and what's a `dict`?
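To make that concrete, here is a minimal sketch that parses a small, made-up JSON string and inspects the types Python hands back. The two records and their values are invented for illustration; they are not taken from the tweet file.

```python
import json

# A made-up JSON array holding two objects (illustrative values only)
raw = '[{"text": "first example", "retweet_count": 3}, {"text": "second example", "retweet_count": 7}]'

records = json.loads(raw)   # json.loads parses a string; json.load reads from a file

print(type(records))        # the JSON array becomes a Python list
print(type(records[0]))     # each JSON object becomes a Python dict
print(records[1]["text"])   # index into the list, then look up a key
```

The same two-step lookup (list index first, dictionary key second) is how we will reach into `tweets` later on.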

### Lists

A list is just that -- a list of objects. These "objects" can be numbers, strings, and even other data structures (such as dictionaries!). For example:

In [5]:
names = ['trav', 'dre', 'riley', 'gwen']
ages = [41, 42, 10, 4]

In [6]:
print(names)

['trav', 'dre', 'riley', 'gwen']


Each list will have a <b>length</b>:

In [7]:
print(len(names))

4


And we can look up a particular element in the list using the appropriate <b>index</b>. Note that Python indexes lists starting at 0 and counts from left to right. So if we wanted to look up the 1st element in the `names` list, we would type:

In [9]:
print(names[0])

trav


We can also count from the end of a list by using negative indices. So to get the last and second to last name in the list, we could type:

In [11]:
# Last name
print(names[-1])

# Second to last name
print(names[-2])

gwen
riley


Lastly, we can `append` new objects to the end of a list, `insert` an object at a given position, and delete an element with `del`:

In [12]:
names.append('ranu')
ages.append(26)

In [17]:
names.insert(0, 'name')

In [21]:
del names[2]


We are going to be using lists a TON in this class, so it's good to get comfortable working with them.
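As a quick recap of the list operations in this section, here is a small self-contained sketch. It starts from a fresh copy of the `names` list so that it does not depend on the state of the cells above.

```python
# A fresh copy of the list so the example is self-contained
names = ['trav', 'dre', 'riley', 'gwen']

first = names[0]          # indexing starts at 0
last = names[-1]          # negative indices count from the end
middle = names[1:3]       # slices include the start index but exclude the stop index

names.append('ranu')      # add to the end
names.insert(0, 'name')   # insert at position 0, shifting everything to the right
del names[2]              # delete the element now at index 2 ('dre')

print(first, last, middle)
print(names)
```

Note that `insert` shifts every later element one position to the right, which is why `del names[2]` removes 'dre' rather than 'riley'.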

### Dictionaries

In addition to lists, dictionaries are one of the most often used data structures in Python programs. As aptly described in Byte of Python,

> "A dictionary is like an address-book where you can find the address or contact details of a person by knowing only his/her name i.e. we associate keys (name) with values (details). Note that the key must be unique just like you cannot find out the correct information if you have two persons with the exact same name."

For example, say we wanted to set up a dictionary that maps our names to ages using the data above:

In [26]:
name_to_age = {'trav': [41, 28], 'dre': 42, 'ranu': 26}

In [27]:
name_to_age['trav']

[41, 28]

Our `name_to_age` dictionary has the following keys: 

In [25]:
name_to_age.keys()

dict_keys(['trav', 'dre', 'ranu'])

Or we can store our `names` and `ages` list as follows: 

In [None]:
names_ages = {'names': names, 'ages': ages}

In [None]:
print(names_ages['ages'])

Like lists, dictionaries are extremely flexible and can store all different kinds of information. We will also use these a TON in this class!
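Here is a short sketch of a few more dictionary operations that tend to come up: building a dictionary from two lists, adding a key, looking up a key that may be missing, and looping over key/value pairs. The names and ages are the illustrative values used above.

```python
names = ['trav', 'dre', 'ranu']
ages = [41, 42, 26]

# zip pairs up the two lists; dict turns the pairs into key -> value
name_to_age = dict(zip(names, ages))

name_to_age['gwen'] = 4                 # add (or overwrite) a key
missing = name_to_age.get('riley')      # .get returns None instead of raising KeyError

for name, age in name_to_age.items():   # iterate over key/value pairs
    print(name, age)
```

Using `.get()` is handy when you are not sure a key exists; plain square brackets would raise a `KeyError` for 'riley'.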

### What's a list of dictionaries?

It's exactly what it says---it's a `list` holding 1 or more `dict` objects!

In [28]:
name_data = [
    {'name': 'trav', 'age': 41},
    {'name': 'dre', 'age': 42}
]

print(name_data[1]['name'])

dre


In the same way, our `tweets` data that we loaded is just a list of dicts:

In [29]:
tweets[0]

{'source': 'Twitter for iPhone',
 'text': 'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!',
 'created_at': 'Sat Dec 30 22:42:09 +0000 2017',
 'retweet_count': 24332,
 'favorite_count': 117013,
 'is_retweet': False,
 'id_str': '947236393184628741'}

In [31]:
tweets[0].keys()

dict_keys(['source', 'text', 'created_at', 'retweet_count', 'favorite_count', 'is_retweet', 'id_str'])

In [36]:
print(f'Our list of tweets is {len(tweets)} long and has the following keys:\n{list(tweets[0].keys())}')

Our list of tweets is 2275 long and has the following keys:
['source', 'text', 'created_at', 'retweet_count', 'favorite_count', 'is_retweet', 'id_str']


### Reading and writing data with `pandas`

We used the `json` library to read our Trump data, but what if your data is saved in something other than JSON? Python has standalone libraries for reading files stored in different formats (e.g., the <a href=https://docs.python.org/3/library/csv.html>csv</a> module). However, the `pandas` library offers a convenient approach to reading and writing files in most common formats.

Note that `pandas` offers so, so much more than just reading and writing files. It provides an R-like environment for data wrangling and analysis, and is quickly becoming the "go to" library for data analysis in Python. For an introduction to `pandas`, please check out:

<https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/>

We won't spend too much time on `pandas` in this section of the course; instead, we will use it for reading/writing data, as well as for a bit of descriptive statistics.

Reading data in `pandas` is as easy as:

In [37]:
# Read the Trump tweets CSV into a pandas "dataframe"
trump_df = pd.read_csv('data/trump_tweets_2017.csv')

The `trump_df` object is a `pandas` dataframe, which operates similarly to an R data frame:

In [38]:
trump_df.head()

Unnamed: 0,source,id_str,text,is_retweet,retweet_count,favorite_count
0,Twitter for iPhone,9.47e+17,Jobs are kicking in and companies are coming b...,False,24332,117013
1,Twitter for iPhone,9.47e+17,"I use Social Media not because I like to, but ...",False,50342,195754
2,Twitter for iPhone,9.47e+17,On Taxes: “This is the biggest corporate rate ...,False,16703,73325
3,Twitter for iPhone,9.47e+17,"Oppressive regimes cannot endure forever, and ...",False,23270,78932
4,Twitter for iPhone,9.47e+17,The entire world understands that the good peo...,False,23532,77986


In [39]:
trump_df['retweet_count'].mean()

19480.64043956044

To convert our `trump_df` to a list of dictionaries as above, we need to run the following code:

In [40]:
trump = trump_df.to_dict('records')
trump[0]

{'source': 'Twitter for iPhone',
 'id_str': 9.47e+17,
 'text': 'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!',
 'is_retweet': False,
 'retweet_count': 24332,
 'favorite_count': 117013}

<span style="color: red;">WARNING</span>: The code written below to clean and process our text data assumes that our data is a list of dictionaries. If you try to use this code directly on a `pandas` dataframe it will not work.
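Our objectives also mention writing data out of Python. Here is a minimal sketch of one way to do that for a list of dictionaries, using only the standard library. The records, file names, and temporary output directory are all made up for illustration; `pandas` dataframes have their own writers as well (e.g., `to_csv` and `to_json`).

```python
import csv
import json
import os
import tempfile

# Made-up records shaped like the tweet dictionaries (illustrative values only)
records = [
    {'id_str': '1', 'text': 'first example', 'retweet_count': 3},
    {'id_str': '2', 'text': 'second example', 'retweet_count': 7},
]

# Write into a temporary directory so nothing in your project is overwritten
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, 'records.json')
csv_path = os.path.join(out_dir, 'records.csv')

# JSON: json.dump is the mirror image of json.load
with open(json_path, 'w', encoding='utf-8') as jfile:
    json.dump(records, jfile)

# CSV: DictWriter maps each dictionary's keys onto columns
with open(csv_path, 'w', encoding='utf-8', newline='') as cfile:
    writer = csv.DictWriter(cfile, fieldnames=['id_str', 'text', 'retweet_count'])
    writer.writeheader()
    writer.writerows(records)

# Read the JSON back in to confirm the round trip
with open(json_path, 'r', encoding='utf-8') as jfile:
    reloaded = json.load(jfile)
print(reloaded == records)
```

Keeping identifiers such as `id_str` as strings avoids the rounding to `9.47e+17` that shows up when the CSV is read back as numbers.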


## Step 3: Prepare text for analysis (aka pre-processing)

With our list of tweets in hand, we now need to get our text ready for analysis. While the process of "pre-processing" text will vary a bit based on the specific analysis employed, we will cover the so-called "standard" pre-processing procedures:

1. Tokenization
2. Convert to lower (or upper) case
3. Punctuation removal
4. Stopword removal

Other common pre-processing procedures **not** covered in this notebook include:

5. Expanding contractions
6. Lemmatizing or stemming
7. Removing numbers

I will cover these additional pre-processing procedures in a separate notebook (extended-preprocessing.ipynb) for anyone who is interested.

Great, we are ready to get going. Before processing our text, however, we need to get a sense of how Python understands `strings`.

### Strings and string methods

Python has all of the standard data types (numbers, strings, booleans), but `strings` (or text!) are obviously quite important for a text analysis class.

In [70]:
tweet = tweets[0]['text']
tweet

'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!'

Strings are "iterable" sequences of characters, meaning that we can loop over them and look up individual characters by index:

In [None]:
tweet[1]

They also have a set of "methods" (or functions) that will be useful for us throughout the course. Here are some of the most important.

`lower()` and `upper()`:

In [42]:
'This text NEEDS to be lowercase.'.lower()

'this text needs to be lowercase.'

`strip()` white space from the ends of a string:

In [43]:
'   We need to strip the white space off of this sentence.         '.strip()

'We need to strip the white space off of this sentence.'

`split()` a string:

In [45]:
tweet.split('.')

['Jobs are kicking in and companies are coming back to the U',
 'S',
 ' Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better',
 ' MUCH MORE TO COME!']

I could go on and on here. The point is do yourself a favor and review the following methods:

<a href="https://www.w3schools.com/python/python_ref_string.asp">https://www.w3schools.com/python/python_ref_string.asp</a>
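To see how these methods combine, here is a small sketch on a made-up string. It chains `strip()` and `lower()`, and it highlights a detail that often trips people up: `split(' ')` and `split()` behave differently when there are runs of spaces.

```python
text = '   MUCH MORE   TO COME!  '

cleaned = text.strip().lower()     # methods can be chained left to right
print(cleaned)

# split(' ') splits on every single space, so runs of spaces yield empty strings
on_space = cleaned.split(' ')
print(on_space)

# split() with no argument splits on any run of whitespace
on_whitespace = cleaned.split()
print(on_whitespace)

# join is the reverse of split
rejoined = ' '.join(on_whitespace)
print(rejoined)
```

For rough word splitting, `split()` with no argument is usually the safer default, since it never produces empty strings.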

### Tokenization

Tokenization is just the process of splitting up our strings into smaller parts (usually sentences or words). We've already seen one way to do this using the `split()` method:

In [None]:
toks = tweet.split(' ')
print(toks)

There are also times when you want to first tokenize a string into sentences and then into words. Here, we can use the `sent_tokenize` function that we imported above to do the job:

In [48]:
sentences = sent_tokenize(tweet)
print(sentences)

['Jobs are kicking in and companies are coming back to the U.S.', 'Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better.', 'MUCH MORE TO COME!']


In [47]:
tweet

'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!'

For our simple twitter sentiment analysis, we will ignore the sentence structure and treat our text as a <b>bag of words</b>. We'll also use the `word_tokenize` function from `nltk` to get a cleaner set of tokens than what was produced using `split()`:

In [50]:
toks = word_tokenize(tweet)
print(toks)

['Jobs', 'are', 'kicking', 'in', 'and', 'companies', 'are', 'coming', 'back', 'to', 'the', 'U.S', '.', 'Unnecessary', 'regulations', 'and', 'high', 'taxes', 'are', 'being', 'dramatically', 'Cut', ',', 'and', 'it', 'will', 'only', 'get', 'better', '.', 'MUCH', 'MORE', 'TO', 'COME', '!']


Nice, so now we know how to tokenize a single tweet. However, we have 2,275 tweets to process. <b>Solution</b>: the trusty old `for` loop.

In [63]:
this_dict = f'This is {len(tweets)}'

In [64]:
this_dict

'This is 2275'

### The `for` loop

We often want to make repeated calculations and this is where the idea of a "loop" comes in. Let's start by taking a look at a `for` loop, which allows you to iterate over a sequence of objects. For example, let's generate the numbers from 0 to 9 using Python's `range()` function, loop over them, and print each number:

In [None]:
for i in range(10):
    print(i)

Similarly, we can loop over our list of tokens (i.e., `toks`) and print each token:

In [None]:
for token in toks:
    print(token)
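The same pattern works for a list of dictionaries. The sketch below loops over a few made-up records (shaped like our tweet dictionaries, but with invented values) and uses Python's built-in `enumerate()` to keep track of the position of each item while accumulating a running total.

```python
# Made-up records with the same shape as the tweet dictionaries
records = [
    {'text': 'first example', 'retweet_count': 3},
    {'text': 'second example', 'retweet_count': 7},
    {'text': 'third example', 'retweet_count': 5},
]

total_retweets = 0
for i, record in enumerate(records):     # enumerate yields (position, item) pairs
    print(i, record['text'])
    total_retweets += record['retweet_count']

mean_retweets = total_retweets / len(records)
print(mean_retweets)
```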

How can we use a `for` loop to tokenize our list of tweets? Like so:

In [None]:
tokens = [] # preallocate a list to hold the tokenized tweets
for tweet in tweets:
    # Note: I'm going to convert everything to lowercase here
    # to avoid looping again
    tokens.append(word_tokenize(tweet['text'].lower()))

In [None]:
print(tokens[1])

That's it! We can actually write this code a bit more compactly by using what is called a "<b>list comprehension</b>" in Python:

In [None]:
tokens_ = [word_tokenize(tweet['text'].lower()) for tweet in tweets]
tokens == tokens_
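If the comprehension syntax looks unfamiliar, the sketch below shows the loop and comprehension versions side by side on two made-up texts. To keep it runnable without any NLTK data, `str.split()` stands in for `word_tokenize` here; that substitution is an assumption made purely for illustration.

```python
# Made-up texts; str.split stands in for word_tokenize so no NLTK data is needed
texts = ['MUCH MORE TO COME', 'Jobs are back']

# Loop version
tokens_loop = []
for text in texts:
    tokens_loop.append(text.lower().split())

# Equivalent list comprehension: [expression for item in sequence]
tokens_comp = [text.lower().split() for text in texts]

# A comprehension can also filter: keep only texts with more than 3 tokens
long_texts = [toks for toks in tokens_comp if len(toks) > 3]

print(tokens_loop == tokens_comp)
print(long_texts)
```

The optional `if` clause at the end of a comprehension is what we will lean on for stopword and punctuation removal.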

### Removing stopwords

In many analyses, it makes sense to remove so-called **stopwords**. These are words that show up frequently in a text, but add very little meaning.

In [77]:
# Pull in NLTK's English language stopword list.
stops = stopwords.words('english')
print(stops[0:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


The first thing to notice is that our stopword list assumes lowercase. You always need to check this! But how do we remove these words from our list? Let's do it the long way first and then we will use the equivalent list comprehension:

In [None]:
tokens_no_stops = [] # preallocate a list to hold tweet-level data
stop_set = set(stops) # build the set once; set lookups are fast
# First loop is over tokenized tweets
for toks in tokens:
    toks_no_stops = [] # preallocate a list to hold token-level data
    # Next loop over the individual words in each tokenized tweet
    for tok in toks:
        # Check if the word (tok) is in the stopword set
        if tok not in stop_set:
            toks_no_stops.append(tok)
    # Store result and move to next tweet
    tokens_no_stops.append(toks_no_stops)

In [None]:
print(tokens_no_stops[0])

Here's a more compact version using list comprehension:

In [None]:
tokens_no_stops = [] # preallocate a list to hold tweet-level data
stop_set = set(stops) # build the set once, outside the loop
for toks in tokens:
    tokens_no_stops.append([tok for tok in toks if tok not in stop_set])

In [None]:
print(tokens_no_stops[0])

Note that there is still some junk in this tweet (e.g., 'amp', 'https', etc.). It's not uncommon to have "corpus-specific" (or "extended") stopwords. We can add additional words to the stopword list:

In [78]:
stops_to_add = ['amp', 'https', 'co']

# Concatenate the two lists
extended_stops = stops + stops_to_add

# The last stopword is now 'co'
print(extended_stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

We just need to use this extended stop word list and we are good to go:

In [None]:
tokens_no_stops = [] # preallocate a list to hold tweet-level data
extended_stop_set = set(extended_stops) # build the set once, outside the loop
for toks in tokens:
    tokens_no_stops.append([tok for tok in toks if tok not in extended_stop_set])
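Recall the earlier point that the stopword list assumes lowercase. The sketch below illustrates why that matters, using a tiny made-up stopword set and a made-up list of tokens (neither comes from NLTK or the tweet data).

```python
# A tiny, made-up stopword set (NLTK's real list is much longer)
stop_set = {'the', 'to', 'and', 'are', 'in'}

raw_tokens = ['The', 'jobs', 'ARE', 'coming', 'back', 'to', 'the', 'U.S']

# Without lowercasing, 'The' and 'ARE' slip through the filter
kept_raw = [tok for tok in raw_tokens if tok not in stop_set]

# Lowercasing first makes the comparison behave as intended
kept_lower = [tok.lower() for tok in raw_tokens if tok.lower() not in stop_set]

print(kept_raw)
print(kept_lower)
```

This is why we converted every tweet to lowercase at the tokenization step, before filtering.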

### Removing punctuation

Our use of the `word_tokenize()` function makes removing punctuation very, very easy. Check out how `word_tokenize()` handles punctuation:

In [None]:
print(word_tokenize('I love this class!!!!!'))

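As such, all we need to do is remove strings with a length of 1 (a rough heuristic: it also drops one-letter words and leaves multi-character punctuation tokens such as `...` in place):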

In [None]:
tokens_no_punct = []
for toks in tokens_no_stops:
    tokens_no_punct.append([tok for tok in toks if len(tok) > 1])

In [None]:
tokens_no_punct[0]
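If the length-based rule turns out to be too blunt for your data, one possible alternative is to drop a token only when every character in it is punctuation. The sketch below compares the two rules on a made-up token list, using `string.punctuation` from the standard library. It is offered as an option to consider, not as the approach used in the rest of this notebook.

```python
import string

# Made-up tokens, including multi-character punctuation and a one-letter word
toks = ['great', '!', '...', 'a', 'u.s', '``', 'win', ',']

punct = set(string.punctuation)

# Length-based rule: drops '!' and ',' but also 'a'; keeps '...' and '``'
by_length = [tok for tok in toks if len(tok) > 1]

# Character-based rule: drop a token only if every character is punctuation
by_chars = [tok for tok in toks if not all(ch in punct for ch in tok)]

print(by_length)
print(by_chars)
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would need to be handled separately.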

## Step 4: Build and use your lexicon (or dictionary)

What words should go into our lexicon of "negative" words? While there are a number of words with clear, negative connotations, lexicon-based approaches often need to be tailored to a specific task (or at least validated for the specific task at hand). We will start by illustrating how to build and employ our own lexicon -- as this could be useful across a range of tasks, not just sentiment analysis -- and then I will quickly demonstrate how to use existing sentiment lexicons in Python.

#### Starting simple: looking up a single word

Let's start simple and examine the presence or absence of a single word in our corpus: fake. Here's one way of completing this task:

In [None]:
fake = []
for toks in tokens_no_punct:
    if 'fake' in toks:
        fake.append(1)
    else:
        fake.append(0)

In [None]:
fake[0:10]
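The same indicator can be built with a list comprehension, and a small variation counts occurrences instead of recording presence or absence. The sketch below uses three made-up tokenized texts standing in for `tokens_no_punct`.

```python
# Made-up tokenized texts standing in for tokens_no_punct
tokenized = [
    ['fake', 'news', 'media'],
    ['jobs', 'coming', 'back'],
    ['fake', 'fake', 'stories'],
]

# int(True) is 1 and int(False) is 0, so this mirrors the if/else loop
has_fake = [int('fake' in toks) for toks in tokenized]

# list.count gives the number of occurrences rather than presence/absence
n_fake = [toks.count('fake') for toks in tokenized]

# Share of texts that mention the word at least once
share_with_fake = sum(has_fake) / len(has_fake)
print(has_fake, n_fake, share_with_fake)
```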

#### Building our negative sentiment lexicon

A lexicon (or dictionary) is just a list of words. That's it! So we can start by creating a list of "negative" words:

In [72]:
negative = ['fake','hoax', 'idiot', 'moron', 'phony', 'fight', 'dishonest', 'unfair']

Done! Okay, okay, this list isn't exhaustive, but it is good enough to show you how to implement your own lexicon-based approach. Using our `negative` words lexicon takes a little code, but we have all of the tools to implement it:

In [None]:
# Build the lookup set once so membership checks are fast
negative_set = set(negative)
# Start by looping over each list of tokens
negative_sentiment_score = []
for toks in tokens_no_punct:
    # Define a counter to hold the number of negative words
    negative_words = 0
    # It is possible for a tweet not to have text. If this is the case,
    # we need to skip it.
    if len(toks) == 0:
        print("This tweet does not have text. Setting negative sentiment to 0.")
        negative_sentiment_score.append(0)
    else:
        # Then loop over each token in each list
        for tok in toks:
            if tok in negative_set:
                negative_words += 1
        negative_sentiment_score.append(negative_words/len(toks))

In [None]:
negative_sentiment_score[0:10]
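To check the arithmetic by hand, here is the same calculation on three tiny, made-up tokenized texts. The score for each text is simply the number of tokens found in the lexicon divided by the total number of tokens in that text. The shortened lexicon and the texts below are invented for illustration.

```python
negative = {'fake', 'phony', 'dishonest', 'unfair'}

# Made-up tokenized texts standing in for tokens_no_punct
tokenized = [
    ['fake', 'news', 'dishonest', 'media'],   # 2 of 4 tokens are negative
    ['jobs', 'coming', 'back'],               # 0 of 3
    [],                                       # empty after pre-processing
]

scores = []
for toks in tokenized:
    if not toks:                 # an empty list is "falsy"
        scores.append(0)
        continue
    n_negative = sum(tok in negative for tok in toks)   # True counts as 1
    scores.append(n_negative / len(toks))

print(scores)
```

Dividing by the length of each text keeps long and short tweets comparable, which a raw count would not.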

## Using pre-built sentiment lexicons

There are a number of good sentiment lexicons already available and `NLTK` provides access to a number of these (see the <a href="https://www.nltk.org/api/nltk.sentiment.html">nltk.sentiment API documentation</a> for a full list). One lexicon that has been shown to work well across a range of tasks is <a href="http://scholar.google.co.uk/scholar_url?url=https://ojs.aaai.org/index.php/ICWSM/article/download/14550/14399&hl=en&sa=X&ei=rUoRYKXUDIfCmgGXoLToAw&scisig=AAGBfm22NmYm4wMvPAvkQZuGz-24V1Wu1A&nossl=1&oi=scholarr">VADER</a>. Here's how you can use VADER in `NLTK`. First, start by instantiating the `SentimentIntensityAnalyzer` class:

In [None]:
vader = SentimentIntensityAnalyzer()

In [None]:
vader.polarity_scores("This class really does suck.")

Note that there is no pre-processing necessary -- you can simply pass your text in "raw" form. Applying to our Trump example:

In [None]:
for tweet in tweets:
    # Here I'm adding a new field to each "tweet" dictionary directly.
    # While this changes the original dictionary (and so requires caution),
    # it ensures that we have all of the relevant meta-data about the tweet
    # easily accessible.
    tweet['negative_sentiment'] = vader.polarity_scores(tweet['text'])['neg']

In [None]:
print(tweets[1])
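The dictionary returned by `polarity_scores` has four keys: 'neg', 'neu', 'pos', and 'compound'. If you want a coarse label rather than a score, one common convention is to threshold the compound score at plus or minus 0.05. The sketch below shows that idea; `label_from_compound` is a hypothetical helper, and the score dictionaries are made up (they are not actual VADER output), so the example runs without NLTK.

```python
def label_from_compound(scores, threshold=0.05):
    """Hypothetical helper: map a VADER-style score dict to a coarse label."""
    compound = scores['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up dictionaries with the same keys that polarity_scores returns
example_scores = [
    {'neg': 0.45, 'neu': 0.55, 'pos': 0.0, 'compound': -0.44},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.63},
]

labels = [label_from_compound(s) for s in example_scores]
print(labels)
```

The threshold is a convention rather than a rule, so treat it as a starting point to validate against your own data.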

### Bonus material: defining a pre-processing function

It's good practice to roll up our various bits of pre-processing into a single **function**. Functions allow us to reuse pieces (or blocks) of code. We do so by declaring a function using the `def` statement. We have already used several of Python's built-in functions earlier in this tutorial. For instance, we "called" the `len` function to get the number of elements in a list. Python, however, makes it super easy to define your own functions.

For example, we can combine the main pre-processing steps above into a single function as follows:

In [73]:
def process_tweet(tweet, stops):
    '''
    Helper function to pre-process tweet text. It takes an individual tweet,
    converts it to lowercase, tokenizes it, and removes stopwords and
    punctuation.
    
    Args:
        tweet (str): The (unprocessed) tweet text
        stops (list): Stopword list to use
    
    Returns:
        list of str: Returns processed tokens
    '''
    stop_set = set(stops) # build the set once for fast lookups
    # Convert the entire tweet to lowercase, tokenize, and remove stop words
    tokens = [token for token in word_tokenize(tweet.lower())
              if token not in stop_set]
    # Remove punctuation
    tokens_no_punct = [token for token in tokens if len(token) > 1]
    # Return the processed list of tokens
    return tokens_no_punct

We can then process our corpus using a single line of code:

In [79]:
tweets_clean = [process_tweet(tweet['text'], extended_stops) for tweet in tweets]

In [None]:
print(tweets_clean[0])

We can set up a function to use our lexicon in the same exact way:

In [80]:
def calculate_lexicon(tokens, lexicon):
    '''
    Takes a tokenized set of texts and calculates the share of tokens
    in each text that are included in a lexicon.
    
    Args:
        tokens (list of list): Tokenized text
        lexicon (list of str): List of tokens to lookup
    
    Returns:
        list: Share of lexicon tokens per text (None for empty texts)
    '''
    lexicon_set = set(lexicon) # build the set once for fast lookups
    result = []
    for toks in tokens:
        lexicon_words = 0
        if len(toks) == 0:
            result.append(None)
        else:
            # Then loop over each token in each list
            for tok in toks:
                if tok in lexicon_set:
                    lexicon_words += 1
            # Divide by the length of the current text
            result.append(lexicon_words/len(toks))
    return result

In [82]:
res = calculate_lexicon(tweets_clean, negative)

In [83]:
res[0:10]