# Today's challenge: a sentiment analysis of Trump tweets

The goal for today is to introduce Python programming via a real world text analysis task: conducting a sentiment analysis of Twitter data. Our learning objectives include:

1. Introducing Python via a real-world text analysis task.
2. Illustrating how to load data into (and write data out of) Python
3. Learning to implement "standard" text pre-processing in Python
4. Learning how to conduct a lexicon-based sentiment analysis

To achieve these objectives, we will focus on a dataset which contains all of Donald Trump's tweets for the year 2017. Our challenge is to build a simple lexicon (or dictionary) to track negative sentiment in Trump's Twitter feed. Let's get started!

#### Note on programming in Python

This notebook offers an introduction to programming in the Python language. It's impossible to cover it all in a single notebook (or a single class!); however, this notebook highlights core aspects of Python that are important for this class. I highly recommend the (free and online!) book <a href=https://python.swaroopch.com/><i>A Byte of Python</i></a> if you would like to further study the ideas outlined in this notebook.

## Step 1: Load the required libraries

The first step in constructing our sentiment analysis script is to load any required libraries using the `import` statement:

In [75]:
import os # Needed to change your working directory
os.chdir('/Users/tcoan/git_repos/notebooks') # CHANGE THIS DIRECTORY!
import json # Needed to load the trump_tweets_2017.json file

# We will use a couple of regular expressions in this tutorial.
# I have another notebook that provides more details on working with
# regular expressions.
import re

# Import NLTK
import nltk
# nltk.download() # Download NLTK "data" once (see
                  # https://www.nltk.org/data.html)

# Tokenization
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

# Stopwords
from nltk.corpus import stopwords

# Stemming and lemmatization
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Demonstrate pre-built sentiment lexicons
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Note that if there is a library that you need to use that is not pre-installed with Anaconda, you can install it within Jupyter using the following:

In [None]:
!pip install spacy

## Step 2: Read text data into Python for analysis

The next step is to read in tweets for our sentiment analysis. The `trump_tweets_2017.json` file is a <a href="https://www.w3schools.com/js/js_json_intro.asp">JSON</a>-formatted file and so we need to use the `json` library to open it:

In [15]:
# This "opens" a connection to the trump_tweets data on disk and "loads"
# it into an object called "tweets"
with open('data/trump_tweets_2017.json', 'r', encoding="utf-8") as jfile:
    tweets = json.load(jfile)

Great, so we've "loaded" our JSON file, but what is actually stored in the <b>variable</b> `tweets`? We could print it, but that doesn't help all that much!

In [None]:
print(tweets)

It turns out that our JSON file is read in as a list of dictionaries. Awesome, so what's a `list` and what's a `dict`?

### Lists

A list is just that -- a list of objects. These "objects" can be numbers, strings, and even other data structures (such as dictionaries!). For example:

In [8]:
names = ['trav', 'dre', 'riley', 'gwen']
ages = [41, 42, 10, 4]

In [None]:
print(names)

Each list will have a <b>length</b>:

In [None]:
print(len(names))

And we can look up a particular element in the list using the appropriate <b>index</b>. Note that Python indexes lists starting at 0 and counts from left to right. So if we wanted to look up the 2nd element in the `names` list, we would type:

In [None]:
print(names[1])

We can also count from the end of a list by using negative indices. So to get the last and second-to-last names in the list, we could type:

In [None]:
# Last name
print(names[-1])

# Second to last name
print(names[-2])
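Indexing also extends to <b>slices</b>, which pull out several elements at once (this notebook uses one later, in `stops[0:10]`). Here is a minimal sketch using the same `names` list as above; the variable names `first_two`, `last_two`, and `from_second` are just for illustration:

```python
names = ['trav', 'dre', 'riley', 'gwen']

# A slice returns a new list; the start index is included, the stop index is not
first_two = names[0:2]

# Negative indices work in slices too: start two from the end, run to the end
last_two = names[-2:]

# Leaving out the stop index means "go to the end of the list"
from_second = names[1:]

print(first_two)    # ['trav', 'dre']
print(last_two)     # ['riley', 'gwen']
print(from_second)  # ['dre', 'riley', 'gwen']
```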

Lastly, we can `append` new objects to a list:

In [9]:
names.append('ranu')
ages.append(26)

In [None]:
print(names)

We are going to be using lists a TON in this class, so it's good to get comfortable working with them.

### Dictionaries

In addition to lists, dictionaries are one of the most often used data structures in Python programs. As aptly described in <i>A Byte of Python</i>,

> "A dictionary is like an address-book where you can find the address or contact details of a person by knowing only his/her name i.e. we associate keys (name) with values (details). Note that the key must be unique just like you cannot find out the correct information if you have two persons with the exact same name."

For example, say we wanted to set up a dictionary that maps our names to ages using the data above:

In [19]:
name_to_age = {'trav': 41, 'dre': 42, 'ranu': 26}

In [17]:
name_to_age['trav']

41

Our `name_to_age` dictionary has the following keys: 

In [20]:
name_to_age.keys()

dict_keys(['trav', 'dre', 'ranu'])

Or we can store our `names` and `ages` lists as follows:

In [10]:
names_ages = {'names': names, 'ages': ages}

In [11]:
print(names_ages['ages'])

[41, 42, 10, 4, 26]


Like lists, dictionaries are extremely flexible and can store all different kinds of information. We will also use these a TON in this class!
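A few more dictionary operations come up constantly: adding a new key, checking whether a key exists, and looking up a key that might be missing. The sketch below reuses the `name_to_age` dictionary from above; the names `has_gwen` and `gwen_age` are just for illustration:

```python
name_to_age = {'trav': 41, 'dre': 42, 'ranu': 26}

# Assigning to a new key adds an entry (assigning to an existing key overwrites it)
name_to_age['riley'] = 10

# `in` tests whether a key exists in the dictionary
has_gwen = 'gwen' in name_to_age

# name_to_age['gwen'] would raise a KeyError; .get() returns a default instead
gwen_age = name_to_age.get('gwen', None)

print(has_gwen)          # False
print(gwen_age)          # None
print(len(name_to_age))  # 4
```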

### What's a list of dictionaries?

It's exactly what it says---it's a `list` holding 1 or more `dict` objects!

In [14]:
name_data = [
    {'name': 'trav', 'age': 41},
    {'name': 'dre', 'age': 42}
]

print(name_data[1]['name'])

dre


In the same way, our `tweets` data that we loaded is just a list of dicts:

In [16]:
tweets[0]

{'source': 'Twitter for iPhone',
 'text': 'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!',
 'created_at': 'Sat Dec 30 22:42:09 +0000 2017',
 'retweet_count': 24332,
 'favorite_count': 117013,
 'is_retweet': False,
 'id_str': '947236393184628741'}

In [21]:
print(f'Our list of tweets is {len(tweets)} long and has the following keys:\n{list(tweets[0].keys())}')

Our list of tweets is 2275 long and has the following keys:
['source', 'text', 'created_at', 'retweet_count', 'favorite_count', 'is_retweet', 'id_str']


### Reading and writing data with `pandas`

We use the `json` library to read our Trump data, but what if your data is saved in something other than JSON? Python has standalone libraries for reading files stored in different formats (e.g., the <a href=https://docs.python.org/3/library/csv.html>csv</a> module). However, the `pandas` library offers a convenient approach to reading and writing files in pretty much any format out there.

Note that `pandas` offers so, so much more than just reading and writing files. It provides an R-like environment for data wrangling and analysis, and is quickly becoming the "go to" library for data scientists.
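As a rough sketch of what reading and writing with `pandas` can look like, the example below turns a list of dictionaries into a `DataFrame`, writes it to a CSV file, and reads it back. The records, the file name `tweets_example.csv`, and the temporary directory are all made up for illustration; they are not part of the Trump dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical records shaped like the tweet dictionaries above
records = [
    {'id_str': '1', 'text': 'first example tweet', 'retweet_count': 10},
    {'id_str': '2', 'text': 'second example tweet', 'retweet_count': 25},
]

# A list of dictionaries converts directly to a DataFrame (one row per dict)
df = pd.DataFrame(records)

# Write the data to a CSV file in a temporary directory
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'tweets_example.csv')
df.to_csv(csv_path, index=False)

# Read it back, keeping IDs as strings so they are not converted to numbers
df_back = pd.read_csv(csv_path, dtype={'id_str': str})
print(df_back)
```

The same pattern applies to other formats: `pandas` has matching reader and writer pairs such as `pd.read_json()` and `DataFrame.to_json()`.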

## Step 3: Prepare text for analysis (aka pre-processing)

With our list of tweets in hand, we now need to get our text ready for analysis. While the process of "pre-processing" text will vary a bit based on the specific analysis employed, so-called "standard" pre-processing includes some combination of the following procedures:

1. Tokenization
2. Convert to lower (or upper) case
3. Expanding contractions
4. Punctuation removal
5. Stopword removal
6. Removing numbers
7. Lemmatization (or stemming)

Before processing our text, however, we need to get a sense of how Python understands `strings`.

### Strings and string methods

Python has all of the standard data types (numbers, strings, booleans), but `strings` (or text!) are obviously quite important for a text analysis class.

In [25]:
tweet = tweets[0]['text']
tweet

'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!'

Strings are sequences of characters, meaning that you can loop over them and look up individual characters by index:

In [27]:
tweet[1]

'o'

They also have a set of "methods" (or functions) that will be useful for us throughout the course. Here are some of the most important.

`lower()` and `upper()`:

In [32]:
'This text NEEDS to be lowercase.'.lower()

'this text needs to be lowercase.'

`strip()` white space from the ends of a string:

In [31]:
'   We need to strip the white space off of this sentence.         '.strip()

'We need to strip the white space off of this sentence.'

`split()` a string:

In [33]:
tweet.split(' ')

['Jobs',
 'are',
 'kicking',
 'in',
 'and',
 'companies',
 'are',
 'coming',
 'back',
 'to',
 'the',
 'U.S.',
 'Unnecessary',
 'regulations',
 'and',
 'high',
 'taxes',
 'are',
 'being',
 'dramatically',
 'Cut,',
 'and',
 'it',
 'will',
 'only',
 'get',
 'better.',
 'MUCH',
 'MORE',
 'TO',
 'COME!']

I could go on and on here. The point is do yourself a favor and review the following methods:

<a href="https://www.w3schools.com/python/python_ref_string.asp">https://www.w3schools.com/python/python_ref_string.asp</a>
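String methods, together with the `re` module imported in Step 1, are enough to sketch two of the pre-processing steps listed earlier: expanding contractions and removing numbers. The example below is only an illustration. The `CONTRACTIONS` mapping is a tiny, hypothetical lookup table (a real one would be much longer), and the helper names `expand_contractions` and `remove_numbers` are my own, not part of any library. Note also that it only matches straight apostrophes, whereas tweets sometimes use curly ones.

```python
import re

# Hypothetical, deliberately tiny contraction map for illustration only
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

def expand_contractions(text):
    """Replace each contraction listed in CONTRACTIONS with its expanded form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

def remove_numbers(text):
    """Delete runs of digits, then collapse the leftover whitespace."""
    no_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()

example = "It's 2017 and we don't have 100 reasons to worry"

# Lowercase first so the text matches the lowercase keys in CONTRACTIONS
cleaned = remove_numbers(expand_contractions(example.lower()))
print(cleaned)  # it is and we do not have reasons to worry
```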

### Tokenization

Tokenization is just the process of splitting up our strings into smaller parts (usually sentences or words). We've already seen one way to do this using the `split()` method:

In [36]:
toks = tweet.split(' ')
print(toks)

['Jobs', 'are', 'kicking', 'in', 'and', 'companies', 'are', 'coming', 'back', 'to', 'the', 'U.S.', 'Unnecessary', 'regulations', 'and', 'high', 'taxes', 'are', 'being', 'dramatically', 'Cut,', 'and', 'it', 'will', 'only', 'get', 'better.', 'MUCH', 'MORE', 'TO', 'COME!']


There are also times when you want to first tokenize a string into sentences and then into words. Here, we can use the `sent_tokenize` function that we imported above to do the job:

In [38]:
sentences = sent_tokenize(tweet)
print(sentences)

['Jobs are kicking in and companies are coming back to the U.S.', 'Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better.', 'MUCH MORE TO COME!']


For our simple Twitter sentiment analysis, we will ignore the sentence structure and treat our text as a <b>bag of words</b>. We'll also use the `word_tokenize` function from `nltk` to get a cleaner set of tokens than what was produced using `split()`:

In [40]:
toks = word_tokenize(tweet)
print(toks)

['Jobs', 'are', 'kicking', 'in', 'and', 'companies', 'are', 'coming', 'back', 'to', 'the', 'U.S', '.', 'Unnecessary', 'regulations', 'and', 'high', 'taxes', 'are', 'being', 'dramatically', 'Cut', ',', 'and', 'it', 'will', 'only', 'get', 'better', '.', 'MUCH', 'MORE', 'TO', 'COME', '!']


Nice, so now we know how to tokenize a single tweet. However, we have 2,275 tweets to process. <b>Solution</b>: the trusty old `for` loop.

### The `for` loop

We often want to make repeated calculations and this is where the idea of a "loop" comes in. Let's start by taking a look at a `for` loop, which allows you to iterate over a sequence of objects. For example, let's generate the numbers from 0 to 9 using Python's `range()` function, loop over them, and print each number:

In [43]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


Similarly, we can loop over our list of tokens (i.e., `toks`) and print each token:

In [44]:
for token in toks:
    print(token)

Jobs
are
kicking
in
and
companies
are
coming
back
to
the
U.S
.
Unnecessary
regulations
and
high
taxes
are
being
dramatically
Cut
,
and
it
will
only
get
better
.
MUCH
MORE
TO
COME
!
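Two built-in helpers make `for` loops more useful in practice: `enumerate()`, which gives you the position of each item as you loop, and `zip()`, which lets you loop over two lists in parallel. Here is a small sketch using the `names` and `ages` lists from Step 2 (before `'ranu'` was appended):

```python
names = ['trav', 'dre', 'riley', 'gwen']
ages = [41, 42, 10, 4]

# enumerate() yields (index, item) pairs, so no manual counter is needed
for i, name in enumerate(names):
    print(i, name)

# zip() pairs up the items of two lists position by position
name_to_age = {}
for name, age in zip(names, ages):
    name_to_age[name] = age

print(name_to_age)
```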


How can we use a `for` loop to tokenize our list of tweets? Like so:

In [57]:
tokens = [] # start with an empty list to hold the tokenized tweets
for tweet in tweets:
    # Note: I'm going to convert everything to lowercase here
    # to avoid looping again
    tokens.append(word_tokenize(tweet['text'].lower()))

In [58]:
print(tokens[1])

['i', 'use', 'social', 'media', 'not', 'because', 'i', 'like', 'to', ',', 'but', 'because', 'it', 'is', 'the', 'only', 'way', 'to', 'fight', 'a', 'very', 'dishonest', 'and', 'unfair', '“', 'press', ',', '”', 'now', 'often', 'referred', 'to', 'as', 'fake', 'news', 'media', '.', 'phony', 'and', 'non-existent', '“', 'sources', '”', 'are', 'being', 'used', 'more', 'often', 'than', 'ever', '.', 'many', 'stories', '&', 'amp', ';', 'reports', 'a', 'pure', 'fiction', '!']


That's it! We can actually write this code a bit more compactly by using what is called a "<b>list comprehension</b>" in Python:

In [60]:
tokens_ = [word_tokenize(tweet['text'].lower()) for tweet in tweets]
tokens == tokens_

True
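One detail worth noticing in the tokenized tweet printed above is the sequence `'&'`, `'amp'`, `';'`. This happens because the raw tweet text stores an ampersand as the HTML entity `&amp;`. One possible fix, which this notebook does not apply, is to unescape the text with the standard library's `html` module before tokenizing. A minimal sketch on a shortened excerpt of that tweet:

```python
import html

# Shortened excerpt of the second tweet's raw text
raw = 'Many stories &amp; reports a pure fiction!'

# html.unescape() converts entities such as &amp; back to plain characters
clean = html.unescape(raw)
print(clean)  # Many stories & reports a pure fiction!
```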

### Removing stopwords

In many analyses, it makes sense to remove so-called **stopwords**. These are words that show up frequently in a text, but add very little meaning.

In [61]:
# Pull in NLTK's English language stopword list
stops = stopwords.words('english')
print(stops[0:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


The first thing to notice is that our stopword list assumes lowercase. You always need to check this! But how do we remove these words from our list? Let's do it the long way first and then we will use the equivalent list comprehension:

In [62]:
# Convert the stopword list to a set once; checking membership in a set
# is much faster than in a list, and we avoid rebuilding it for every word
stop_set = set(stops)
tokens_no_stops = [] # start with an empty list to hold tweet-level data
# First loop is over tokenized tweets
for toks in tokens:
    toks_no_stops = [] # start with an empty list to hold token-level data
    # Next loop over the individual words in each tokenized tweet
    for tok in toks:
        # Keep the word (tok) only if it is not in the stopword set
        if tok not in stop_set:
            toks_no_stops.append(tok)
    # Store result and move to next tweet
    tokens_no_stops.append(toks_no_stops)

In [64]:
print(tokens_no_stops[0])

['jobs', 'kicking', 'companies', 'coming', 'back', 'u.s.', 'unnecessary', 'regulations', 'high', 'taxes', 'dramatically', 'cut', ',', 'get', 'better', '.', 'much', 'come', '!']


Here's a more compact version using list comprehension:

In [65]:
stop_set = set(stops) # build the lookup set once, outside the loop
tokens_no_stops = [] # start with an empty list to hold tweet-level data
for toks in tokens:
    tokens_no_stops.append([tok for tok in toks if tok not in stop_set])

In [66]:
print(tokens_no_stops[0])

['jobs', 'kicking', 'companies', 'coming', 'back', 'u.s.', 'unnecessary', 'regulations', 'high', 'taxes', 'dramatically', 'cut', ',', 'get', 'better', '.', 'much', 'come', '!']


### Removing punctuation

Our use of the `word_tokenize()` function makes removing punctuation very, very easy. Check out how `word_tokenize()` handles punctuation:

In [68]:
print(word_tokenize('I love this class!!!!!'))

['I', 'love', 'this', 'class', '!', '!', '!', '!', '!']


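As such, all we need to do is remove strings with a length == 1. Keep in mind that this is a shortcut: it would also drop any remaining one-character words, and it leaves multi-character punctuation tokens (such as `...`) untouched: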

In [69]:
tokens_no_punct = []
for toks in tokens_no_stops:
    tokens_no_punct.append([tok for tok in toks if len(tok) > 1])

In [70]:
tokens_no_punct[0]

['jobs',
 'kicking',
 'companies',
 'coming',
 'back',
 'u.s.',
 'unnecessary',
 'regulations',
 'high',
 'taxes',
 'dramatically',
 'cut',
 'get',
 'better',
 'much',
 'come']
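If the length-based shortcut turns out to be too blunt for your data, one alternative is to keep a token only when it contains at least one letter or digit. The sketch below uses a made-up token list (not output from the Trump data) to show the difference; `toks_example` and `toks_no_punct` are hypothetical names:

```python
# Hypothetical tokens, including multi-character punctuation and a one-letter word
toks_example = ['jobs', '...', 'u.s.', '!', 'non-existent', '``', 'a', '2017']

# Keep a token only if at least one of its characters is a letter or digit
toks_no_punct = [tok for tok in toks_example if any(ch.isalnum() for ch in tok)]

print(toks_no_punct)  # ['jobs', 'u.s.', 'non-existent', 'a', '2017']
```

Unlike the `len(tok) > 1` filter, this drops `'...'` and keeps `'a'`; which behavior you want depends on the analysis.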

## Step 4: Build and use your lexicon (or dictionary)

What words should go into our lexicon of "negative" words? While there are a number of words with clear, negative connotations, lexicon-based approaches often need to be tailored to a specific task (or at least validated for the specific task at hand). We will start by illustrating how to build and employ our own lexicon -- as this could be useful across a range of tasks, not just sentiment analysis -- and then I will quickly demonstrate how to use existing sentiment lexicons in Python.

A lexicon (or dictionary) is just a list of words. That's it! So we can start by creating a list of "negative" words:

In [88]:
negative = ['fake','hoax', 'idiot', 'moron', 'phony', 'fight', 'dishonest', 'unfair']

Done! Okay, okay, this list isn't exhaustive, but it is good enough to show you how to implement your own lexicon-based approach. Using our `negative` words lexicon takes a little code, but we have all of the tools to implement it. For each tweet, the score is the share of its tokens that appear in the lexicon:

In [89]:
# Convert the lexicon to a set once so that each lookup is fast
negative_set = set(negative)
# Start by looping over each list of tokens
negative_sentiment_score = []
for toks in tokens_no_punct:
    # Define a counter to hold the number of negative words
    negative_words = 0
    # It is possible for a tweet to have no tokens left after pre-processing.
    # If this is the case, we set its score to 0 to avoid dividing by zero.
    if len(toks) == 0:
        print("This tweet does not have text. Setting negative sentiment to 0.")
        negative_sentiment_score.append(0)
    else:
        # Then loop over each token in each list
        for tok in toks:
            if tok in negative_set:
                negative_words += 1
        # Score = number of negative words / total number of tokens
        negative_sentiment_score.append(negative_words/len(toks))

This tweet does not have text. Setting negative sentiment to 0.


In [91]:
negative_sentiment_score[0:10]

[0.0,
 0.19230769230769232,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.06666666666666667,
 0.0]
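Since a lexicon can be useful for more than one task, it may help to wrap the scoring logic in a reusable function. The sketch below is one way to do that; the function name `negative_share` and the example token lists are hypothetical, and the lexicon is stored as a set rather than a list so that lookups are fast:

```python
def negative_share(toks, lexicon):
    """Return the share of tokens in `toks` that appear in `lexicon`."""
    # An empty token list gets a score of 0 to avoid dividing by zero
    if len(toks) == 0:
        return 0.0
    hits = sum(1 for tok in toks if tok in lexicon)
    return hits / len(toks)

# Same words as the `negative` list above, stored as a set
negative = {'fake', 'hoax', 'idiot', 'moron', 'phony', 'fight', 'dishonest', 'unfair'}

# Hypothetical pre-processed tweets (the last one has no tokens left)
examples = [
    ['fake', 'news', 'media', 'unfair'],
    ['jobs', 'coming', 'back'],
    [],
]

scores = [negative_share(toks, negative) for toks in examples]
print(scores)  # [0.5, 0.0, 0.0]
```

Swapping in a different lexicon (say, a set of "positive" words) then only requires changing the second argument.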

## Using pre-built sentiment lexicons

There are a number of good sentiment lexicons already available and `NLTK` provides access to a number of these (see the <a href="https://www.nltk.org/api/nltk.sentiment.html">nltk.sentiment api documentation</a> for a full list). One lexicon that has been shown to work well across a range of tasks is <a href="http://scholar.google.co.uk/scholar_url?url=https://ojs.aaai.org/index.php/ICWSM/article/download/14550/14399&hl=en&sa=X&ei=rUoRYKXUDIfCmgGXoLToAw&scisig=AAGBfm22NmYm4wMvPAvkQZuGz-24V1Wu1A&nossl=1&oi=scholarr">VADER</a>. Here's how you can use VADER in `NLTK`. First, start by instantiating the `SentimentIntensityAnalyzer` class:

In [77]:
vader = SentimentIntensityAnalyzer()

In [95]:
vader.polarity_scores("This class really does suck.")

{'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.4902}

Note that there is no pre-processing necessary -- you can simply pass your text in "raw" form. Applying to our Trump example:

In [92]:
for tweet in tweets:
    # Here I'm adding a new field to each tweet's dictionary directly.
    # While this changes the original dictionary (and so requires caution),
    # it ensures that we have all of the relevant meta-data about the tweet
    # easily accessible.
    tweet['negative_sentiment'] = vader.polarity_scores(tweet['text'])['neg']

In [94]:
print(tweets[1])

{'source': 'Twitter for iPhone', 'text': 'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!', 'created_at': 'Sat Dec 30 22:36:41 +0000 2017', 'retweet_count': 50342, 'favorite_count': 195754, 'is_retweet': False, 'id_str': '947235015343202304', 'negative_sentiment': 0.344}
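With the scores attached to each tweet, a natural last step is to rank the tweets and write the results back out to disk, which mirrors how we read the data in with `json.load()` in Step 2. The sketch below uses two made-up records, a temporary directory, and a hypothetical file name (`scored_example.json`) rather than the real `tweets` list:

```python
import json
import os
import tempfile

# Hypothetical records shaped like the scored tweets above
scored = [
    {'id_str': '2', 'text': 'example two', 'negative_sentiment': 0.0},
    {'id_str': '1', 'text': 'example one', 'negative_sentiment': 0.344},
]

# Sort from most to least negative
ranked = sorted(scored, key=lambda t: t['negative_sentiment'], reverse=True)

# Write the ranked list of dictionaries out as JSON
out_path = os.path.join(tempfile.mkdtemp(), 'scored_example.json')
with open(out_path, 'w', encoding='utf-8') as jfile:
    # ensure_ascii=False keeps characters such as curly quotes readable on disk
    json.dump(ranked, jfile, ensure_ascii=False, indent=2)

# Read the file back in to confirm the round trip
with open(out_path, 'r', encoding='utf-8') as jfile:
    reloaded = json.load(jfile)

print(reloaded[0]['id_str'])  # 1
```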
