---
title: Applying sentiment analysis with VADER and the Twitter API
date: 2017-04-15
comments: false
tags: python, programming tips, text mining
keywords: python, data science, text mining, machine learning
---

A few months ago, I published a post about a small project where I analysed how people felt about the New Year's resolutions they post on Twitter. In this post, we'll go through the under-the-hood details of how I carried out this analysis, as well as some of the issues I encountered that are pretty typical of a text mining project.

If you'd like a bit more detail on VADER, the package I used to do the sentiment analysis, you can find it in last week's post. If not, let's jump straight into it!

## Setting up your app

To do this analysis, I pulled data from [Twitter's public search API](https://dev.twitter.com/rest/public/search), which allows you to pull historical results from up to a week ago. To get started, you will need to create a Twitter account (if you don't already have one), and then jump over to Twitter's [application management portal](https://apps.twitter.com/). If you've never done this before, what we are doing here is creating a unique 'identity' that will allow Twitter to work out who we are when we're accessing their public API. This is a way for them to boot off users or apps that are using the API too heavily or doing dodgy stuff like spamming the site.

Once in there, hit the 'Create New App' button, and you'll be prompted to enter a name, description and website for your app. It doesn't really matter what you write in here - just make sure that the name is not so generic that you can't distinguish one app from another.

<img src="/figure/Vader_3.png" title="Create your application" style="display: block; margin: auto;" />

Once you've done that, you'll want to jump into the 'Keys and Access Tokens' tab. There are four bits of information we need to get from here so that our Python program can connect to the API. We need the consumer key and the consumer secret (circled at the top of the below screenshot), and also the access token and the access token secret (circled at the bottom). As you can see, I have blurred mine out - you should take care to keep these secure and not do something like commit them to a public GitHub repo (definitely not something I've done in the past...).

<img src="/figure/Vader_4.png" title="Get your keys" style="display: block; margin: auto;" />
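One way to keep the four values out of your code (and out of version control) is to read them from environment variables. The snippet below is a minimal sketch of that idea rather than what I did in the original analysis: the variable names and the `load_twitter_credentials` helper are made up for illustration.

```python
import os

# Hypothetical variable names; use whatever names suit your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any are missing."""
    # Default to the real environment, but allow a plain dict for testing.
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

You would set the variables in your shell before starting Python, then call `load_twitter_credentials()` and pass the values on wherever the keys are needed.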

## Pulling down some data

Now that we have our keys, we can connect to the API and pull down some data. In order to do this, we first need to install the `tweepy` package, and then import it along with `json` (which ships with Python):

```python
import tweepy
import json
```

Let's now take those keys that we got from the app and use them to set up the connection to the API. As you can see below, we need to pass these keys to the authorisation handler and then hand that handler to tweepy's `API` class. We also need to get tweepy to return the results as JSON.

```python
# Enter authorisations (placeholders: paste in the values from your own app)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_key = "YOUR_ACCESS_TOKEN"
access_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Set up your authorisations
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

# Set up API call, asking tweepy to hand back the parsed JSON
api = tweepy.API(auth, parser=tweepy.parsers.JSONParser())
```

Now that we've done that, let's define our search. We need to restrict our search to the exact phrase "new year's resolution", and I also want to get rid of retweets (because they are essentially just duplicates in this dataset). The full list of possible ways to search is in the [search API documentation](https://dev.twitter.com/rest/public/search), and they are surprisingly flexible - in fact you can even search on sentiment in your query!

```python
# Set search query
searchquery = '"new years resolution" -filter:retweets'
```
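If you want to reuse this for other phrases, the query string can be assembled by a small helper. This is just a sketch: `build_query` is a name I've made up, and it only covers the two operators used in this post (an exact phrase in double quotes and `-filter:retweets`).

```python
def build_query(phrase, exclude_retweets=True):
    """Build a search query for an exact phrase, optionally dropping retweets."""
    # Double quotes ask the search API for the exact phrase.
    query = '"{}"'.format(phrase)
    if exclude_retweets:
        query += " -filter:retweets"
    return query


# Produces the same string as the hand-written query
searchquery = build_query("new years resolution")
```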

We can now make our call to the API. You can see here I've limited my search to my specific query and English-language results. I'm also limiting the search to 100 tweets, which is the maximum you can return in a single call (we'll get to how we get some more volume soon).

```python
data = api.search(q=searchquery, count=100, lang='en', result_type='mixed')
```

Let's have a look at this data, which is in JSON format. For those of you who haven't worked with JSON before, in order to get our data out, we just need to find how to reference it properly in the structure, which is a series of nested Python lists and dictionaries. (I've written in more detail on how to work your way through a JSON file [here]({filename}2015-11-25-reddit-api-part-2.md).) In our case, all of the data about each tweet is contained in a dictionary. Each dictionary is contained in a list, and this list is stored under the `statuses` key of an overarching dictionary. Looking the list up by its key is safer than relying on its position in the dictionary, which is not guaranteed. Thus the below code returns the tweet text for the tweet at index 12 in our dataset:

```python
data['statuses'][12]['text']
```

```
u'my new years resolution is to be even more bitter than before'
```
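To make the nesting a bit more concrete, here is a tiny made-up response with the same shape as the real one: the search API returns a dictionary that holds the list of tweets under a `statuses` key, next to a `search_metadata` entry. The tweets and counts below are invented purely for illustration, and a real tweet dictionary has many more fields.

```python
# A made-up, heavily trimmed response with the same nesting as the real one.
sample_response = {
    "search_metadata": {"count": 2},
    "statuses": [
        {"id": 2, "text": "first example tweet", "favorite_count": 3},
        {"id": 1, "text": "second example tweet", "favorite_count": 0},
    ],
}

tweets = sample_response["statuses"]   # the list of tweet dictionaries
second_text = tweets[1]["text"]        # text of the tweet at index 1
all_texts = [tweet["text"] for tweet in tweets]
```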

## Getting some volume

Now that we've returned our first 100 tweets, we need to scale up to get enough tweets to actually analyse. In order to do this, we need to put our original API call into a loop. However, we need each loop to start after the final tweet returned by the previous call. To do this, we extract the ID of the last tweet from each call and pass this as the `max_id` argument in the `api.search()` method. Because `max_id` is inclusive, the first tweet of each new batch repeats the last tweet of the previous one, so we drop it.

In order to make sure we're not exceeding the number of API calls we can make, we can rate-limit our calls using the `sleep()` function from the `time` module. You can see I've put 4 seconds between calls.

Finally, you can see I've stripped the results out of the outer structure and appended them to a list called `data_all`. We'll use this list as the basis of our DataFrame in the next step.

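```python
import time

# First call: the tweets sit under the 'statuses' key of the response
data = api.search(q=searchquery, count=100, lang='en', result_type='mixed')
data_all = data['statuses']

while len(data_all) <= 20000:
    time.sleep(4)
    last = data_all[-1]['id']
    data = api.search(q=searchquery, count=100, lang='en',
                      result_type='mixed', max_id=last)
    # max_id is inclusive, so skip the first (repeated) tweet
    new_tweets = data['statuses'][1:]
    if not new_tweets:
        break  # no older matches left, so stop rather than loop forever
    data_all += new_tweets
```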

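When I ran this loop, it stopped with `RateLimitError: [{u'message': u'Rate limit exceeded', u'code': 88}]`, so a 4-second pause may not always be enough to stay under the limit.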
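The paging logic can also be written so that it is independent of the API client, which makes it easier to test without touching Twitter. The version below is a sketch under a few assumptions: `collect_tweets` and `search_page` are names I've made up, `search_page(max_id)` stands in for whatever function fetches one page of tweet dictionaries (newest first), and the 5-second default pause is an arbitrary choice. It asks for tweets strictly older than the oldest one already held (`max_id` set to that ID minus one) instead of dropping the repeated tweet afterwards.

```python
import time


def collect_tweets(search_page, target=20000, pause=5.0, sleep=time.sleep):
    """Page backwards through search results until `target` tweets or no more results.

    `search_page(max_id)` is a hypothetical callable returning a list of tweet
    dictionaries, newest first; `max_id=None` means "start from the newest".
    """
    tweets = search_page(None)
    while tweets and len(tweets) < target:
        sleep(pause)
        # Ask only for tweets strictly older than the oldest one we hold.
        batch = search_page(tweets[-1]["id"] - 1)
        if not batch:
            break  # the query has run out of matches
        tweets.extend(batch)
    return tweets[:target]
```

With tweepy, `search_page` could wrap the `api.search` call from above and return the list under `statuses`, passing `max_id` only when it is not `None`.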

## Putting it in a DataFrame

We now have a list of up to 20,000 dictionaries containing all of the metadata about each tweet (I say up to, as your particular query may not have enough matches from the past week). We now want to pull out specific information about each tweet, as well as generate our sentiment metrics. 

For my particular analysis, I used the tweet text and the number of favourites each tweet received, but feel free to play and explore the huge amount of metadata you get back about each tweet for your own purposes - it's honestly a bit creepy how much data you can readily access!
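As a rough sketch of this step, the helper below flattens each tweet dictionary into a row holding the text, the favourite count (the `favorite_count` field in the API response) and whatever sentiment scores you attach. The names `tweets_to_rows` and `score_text` are made up for illustration, and the scoring function is passed in so that the sketch does not depend on any particular sentiment package.

```python
def tweets_to_rows(tweets, score_text):
    """Flatten tweet dictionaries into rows of text, favourites and sentiment scores.

    `score_text` is any callable that maps a string to a dict of scores.
    """
    rows = []
    for tweet in tweets:
        row = {
            "text": tweet["text"],
            # Fall back to 0 if the field is missing from a tweet
            "favourites": tweet.get("favorite_count", 0),
        }
        # Add one column per score returned by the scoring function
        row.update(score_text(tweet["text"]))
        rows.append(row)
    return rows
```

One option for `score_text`, assuming the `vaderSentiment` package is installed, is `SentimentIntensityAnalyzer().polarity_scores`, which returns `neg`, `neu`, `pos` and `compound` scores for a piece of text. Passing the resulting rows to `pandas.DataFrame(rows)` would then give one row per tweet.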

The first 

## Categorising our tweets

## Doing the analyses

I think I should do a link out to a gist with details on the graphs I created

## Some issues with this analysis

Text mining traps that come up when you don't carefully clean your data.