# Extracting 2020 Candidate Tweets
This notebook extracts the baseline data for this project. 

Contents include:
* Prioritize what to pull first to establish a baseline. The baseline guides the extraction strategy and can later be extended to collect more tweets once the list of user handles is more complete.
* Start by pulling only the 2020 candidates' tweets, targeting 2018 through early 2019.

In [2]:
import tweepy
import os 
import bz2 
import json
import pandas as pd 
import numpy as np
import re 
from datetime import datetime, date, timedelta
import time  # the time module (not datetime.time) is what get_date() relies on

## Authorization codes for the REST API
These are the authorization codes from a personal Twitter developer account: https://apps.twitter.com/


In [2]:
# authorization codes here from personal Twitter developer account
# https://apps.twitter.com/
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_SECRET = ''

## Functions
* Setup API
* Getting Tweets using API

Standard Search Operators: https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators

In [12]:
def oauth_login():
    """Log in to Twitter with ordinary rate limiting.
    Requires the authorization codes of a personal Twitter developer application:
    CONSUMER_KEY (consumer api key)
    CONSUMER_SECRET (consumer api secret key)
    OAUTH_TOKEN (access token)
    OAUTH_SECRET (access token secret)
    Returns:
        [tweepy.api.API] -- [tweepy api]
    """
    # get the authorization from Twitter and save it in the Tweepy package
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_SECRET)
    tweepy_api = tweepy.API(auth)
    # if a null api is returned, give an error message
    if not tweepy_api:
        print("Problem Connecting to API with OAuth")
    # return the Twitter api object that allows access to the Tweepy api functions
    return tweepy_api

def get_handle_data(api, query):
    """Looks up a Twitter user with api.get_user and
    returns selected profile metadata as a dictionary
    Arguments:
        api {[tweepy.api.API]} -- [tweepy api]
        query {[str]} -- [handle text]    
    
    Returns:
        [dict] -- [dictionary of metadata for a handle]
    """
    item = api.get_user(query)
    hdict = {}
    hdict['handle'] = query
    hdict['name'] = item.name
    hdict['created_at'] = item.created_at
    hdict['screen_name'] = item.screen_name 
    hdict['description'] = item.description 
    hdict['statuses_count'] = item.statuses_count # number of tweets published 
    hdict['friends_count'] = item.friends_count 
    hdict['followers_count'] = item.followers_count
    return hdict

def get_date(created_at):
    """Function to convert Twitter created_at to date format
    Argument:
        created_at {[str]} -- [raw tweet creation date time stamp]
    Returns:
        [str] -- [date e.g. '2020-04-18']
    """
    return time.strftime('%Y-%m-%d', time.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y'))
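
As a quick check of the conversion, the sketch below parses a `created_at` string in the format named in the docstring and returns the ISO date. It is an illustrative alternative that uses `datetime.strptime` with `%z` instead of the notebook's `time` module; the function name `to_iso_date` and the sample timestamp are hypothetical and not part of the original notebook.

```python
from datetime import datetime

def to_iso_date(created_at):
    """Convert a Twitter-style created_at string to 'YYYY-MM-DD'.

    Hypothetical helper: assumes the 'Sat Apr 18 14:59:36 +0000 2020' layout
    and an English locale for the day and month abbreviations.
    """
    parsed = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    return parsed.date().isoformat()

sample_date = to_iso_date('Sat Apr 18 14:59:36 +0000 2020')
print(sample_date)
```

Returning an ISO-formatted string keeps the result directly comparable and sortable as text, which is how the dates are used later in the notebook.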

## People Data Source
This data source is an index of all people's names, class labels, and Twitter handles used in this project.

In [4]:
People = pd.read_excel('data/People.xlsx', sheet_name='All')
People['Twitter Handle'] = People['Twitter Handle'].fillna('-')
People.head()

| | Key | Name | Party | State | Governor | Senate | House | Ran 2020 | Ran 2016 | positions held | Twitter Handle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Schock-Illinois | Aaron Schock | Republican | Illinois | 0 | 0 | 1 | 0 | 0 | 1 | - |
| 1 | Abby Finkenauer-Iowa | Abby Finkenauer | Democratic | Iowa | 0 | 0 | 1 | 0 | 0 | 1 | - |
| 2 | Abigail Spanberger-Virginia | Abigail Spanberger | Democratic | Virginia | 0 | 0 | 1 | 0 | 0 | 1 | - |
| 3 | Adam Kinzinger-Illinois | Adam Kinzinger | Republican | Illinois | 0 | 0 | 1 | 0 | 0 | 1 | - |
| 4 | Adam Putnam-Florida | Adam Putnam | Republican | Florida | 0 | 0 | 1 | 0 | 0 | 1 | - |


## Setup API

In [5]:
api = oauth_login()

## Baseline: Get 2020 candidates user information

In [6]:
hlist = People.loc[(People['Twitter Handle'] != '-') & (People['Ran 2020'] == 1)]['Twitter Handle'].tolist()
print("There are {:d} Twitter handles respective to 2020 candidates in the data".format(len(hlist)))
print("Handles:", hlist)

There are 34 Twitter handles respective to 2020 candidates in the data
Handles: ['amyklobuchar', 'AndrewYang', 'BernieSanders', 'BetoORourke', 'BilldeBlasio', 'GovBillWeld', 'CoryBooker', 'DevalPatrick', 'realDonaldTrump', 'ewarren', 'ericswalwell', 'JayInslee', 'JoeSestak', 'WalshFreedom', 'JohnDelaney', 'Hickenlooper', 'JoeBiden', 'JulianCastro', 'KamalaHarris', 'SenGillibrand', 'marwilliamson', 'MarkSanford', 'MichaelBennet', 'MikeBloomberg', 0, 'PeteButtigieg', 'VoteOjeda2020', 'JoinRocky', 'sethmoulton', 'stevebullockmt', 'RepTimRyan', 'TomSteyer', 'TulsiGabbard', 'WayneMessam']
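
The printed handle list contains a stray integer `0` among the strings, and the metadata lookup further down fails for exactly that entry. One way to catch such entries before calling the API is sketched below. The helper `split_handles` and its rules (non-empty string, not the `'-'` placeholder) are assumptions for illustration, not part of the original notebook.

```python
def split_handles(raw_handles):
    """Separate usable Twitter handles from entries that cannot be handles.

    Hypothetical helper: an entry is kept only if it is a non-empty string
    other than the '-' placeholder used for missing handles.
    """
    valid, rejected = [], []
    for entry in raw_handles:
        if isinstance(entry, str) and entry.strip() and entry.strip() != '-':
            valid.append(entry.strip().lstrip('@'))
        else:
            rejected.append(entry)
    return valid, rejected

valid, rejected = split_handles(['amyklobuchar', 0, 'PeteButtigieg', '-'])
print("valid:", valid)
print("rejected:", rejected)
```

Keeping the rejected entries in their own list makes it easy to trace them back to the spreadsheet rows that need correcting.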


Get user data for the 2020 candidates.

These include:

* name
* created_at
* screen_name (i.e. handle)
* description
* statuses_count (i.e. total number of tweets)
* friends_count
* followers_count

In [7]:
%%time
user_data = [] # list of dictionaries where each element is metadata respective to a twitter handle (i.e. 'user')
user_err_list = [] # list of handles for which data could not be found
for h in hlist:
    try:
        user_data.append(get_handle_data(api, query = h))
    except Exception:  # e.g. unknown or invalid handle; avoids swallowing KeyboardInterrupt
        user_err_list.append(h)

print("Metadata collected for {:d} Twitter handles".format(len(user_data)))
print("Could not get data for {:d} Twitter handles: {:s}".format(len(user_err_list), str(user_err_list)))

Metadata collected for 33 Twitter handles
Could not get data for 1 Twitter handles: [0]
Wall time: 6.83 s


In [8]:
df_user = pd.DataFrame(user_data)
df_user.head()

| | handle | name | created_at | screen_name | description | statuses_count | friends_count | followers_count |
|---|---|---|---|---|---|---|---|---|
| 0 | amyklobuchar | Amy Klobuchar | 2009-04-20 14:59:36 | amyklobuchar | U.S. Senator from Minnesota. Text AMY to 91990... | 11549 | 138356 | 1043026 |
| 1 | AndrewYang | Andrew Yang🧢🇺🇸 | 2013-12-03 21:31:03 | AndrewYang | 2020 US Presidential Candidate (D). Entreprene... | 17647 | 7514 | 1403395 |
| 2 | BernieSanders | Bernie Sanders | 2010-11-17 17:53:52 | BernieSanders | U.S. Senator from Vermont and candidate for Pr... | 17853 | 1459 | 11809126 |
| 3 | BetoORourke | Beto O'Rourke | 2011-07-26 18:05:52 | BetoORourke | | 7970 | 946 | 1654429 |
| 4 | BilldeBlasio | Bill de Blasio | 2012-01-27 21:35:21 | BilldeBlasio | Mayor of New York City. Fighting for working p... | 2161 | 29 | 218641 |


### Summary of candidates:
* statuses_count
* friends_count
* followers_count

In [9]:
df_user.describe()

| | statuses_count | friends_count | followers_count |
|---|---|---|---|
| count | 33.0 | 33.0 | 33.0 |
| mean | 13857.818182 | 9647.69697 | 3722096.0 |
| std | 16685.650458 | 28333.174978 | 13465310.0 |
| min | 19.0 | 26.0 | 8348.0 |
| 25% | 5395.0 | 526.0 | 56385.0 |
| 50% | 9438.0 | 1428.0 | 241629.0 |
| 75% | 12354.0 | 3077.0 | 1849941.0 |
| max | 71815.0 | 138356.0 | 77568060.0 |


## Baseline: Get 2020 Candidate Tweets From User Timeline
* Timeframe target of 2018 up to early 2019.
* Method: Tweets extracted from user timeline

In [68]:
%%time
user_tweets = {} # dictionary containing list of tweets (values) respective to Twitter handles (keys)

for handle in hlist:
    try:        
        # get all available tweets from index for this Twitter handle
        search_results = [status for status in tweepy.Cursor(api.user_timeline, id = handle, wait_on_rate_limit=True).items()] 
        # format tweet as json
        tweets = [tweet._json for tweet in search_results]
        # add list of tweets (value) respective to Twitter handle (key)
        user_tweets[handle] = tweets
    except Exception as e:
        print(e)  # e.g. invalid user

Twitter error response: status code = 401
Wall time: 1h 17min 50s


In [73]:
print("number of users collected: {:d}".format(len(user_tweets.keys())))
print(user_tweets.keys())

number of users collected: 33
dict_keys(['amyklobuchar', 'AndrewYang', 'BernieSanders', 'BetoORourke', 'BilldeBlasio', 'GovBillWeld', 'CoryBooker', 'DevalPatrick', 'realDonaldTrump', 'ewarren', 'ericswalwell', 'JayInslee', 'JoeSestak', 'WalshFreedom', 'JohnDelaney', 'Hickenlooper', 'JoeBiden', 'JulianCastro', 'KamalaHarris', 'SenGillibrand', 'marwilliamson', 'MarkSanford', 'MichaelBennet', 'MikeBloomberg', 'PeteButtigieg', 'VoteOjeda2020', 'JoinRocky', 'sethmoulton', 'stevebullockmt', 'RepTimRyan', 'TomSteyer', 'TulsiGabbard', 'WayneMessam'])


## Number of Tweets and Timeframe of Tweets by User

In [74]:
for handle in user_tweets.keys():
    tweets = user_tweets[handle]
    datelist = [get_date(tweet['created_at']) for tweet in tweets]
    start = min(datelist)
    end = max(datelist)
    print(handle, len(user_tweets[handle]), "from: {:s} to: {:s}".format(start, end))

amyklobuchar 3239 from: 2019-02-18 to: 2020-04-18
AndrewYang 3215 from: 2020-01-06 to: 2020-04-18
BernieSanders 3238 from: 2019-09-07 to: 2020-04-18
BetoORourke 3228 from: 2018-11-06 to: 2020-04-18
BilldeBlasio 2159 from: 2016-10-19 to: 2020-03-12
GovBillWeld 1002 from: 2019-02-14 to: 2020-03-23
CoryBooker 3230 from: 2018-10-12 to: 2020-04-18
DevalPatrick 2032 from: 2011-03-25 to: 2020-03-07
realDonaldTrump 3222 from: 2019-12-31 to: 2020-04-18
ewarren 3210 from: 2019-10-08 to: 2020-04-18
ericswalwell 3239 from: 2018-07-30 to: 2020-04-16
JayInslee 3206 from: 2019-05-04 to: 2020-04-17
JoeSestak 3189 from: 2015-07-17 to: 2020-02-15
WalshFreedom 3205 from: 2020-01-28 to: 2020-04-18
JohnDelaney 3233 from: 2019-03-10 to: 2020-04-18
Hickenlooper 3241 from: 2015-03-24 to: 2020-04-18
JoeBiden 3200 from: 2017-07-31 to: 2020-04-18
JulianCastro 3228 from: 2019-07-13 to: 2020-04-17
KamalaHarris 3216 from: 2019-02-22 to: 2020-04-18
SenGillibrand 3219 from: 2018-08-07 to: 2020-04-18
marwilliamson 321

In [77]:
# total number of tweets across all handles
tweets_count = sum(len(tweets) for tweets in user_tweets.values())
print("A total of {:d} tweets have been collected".format(tweets_count))

A total of 97316 tweets have been collected
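
The per-handle counts printed earlier in this section cluster just above 3,200, which matches the roughly 3,200 most recent tweets that the standard user timeline endpoint returns. For prolific accounts the earliest collected tweet therefore falls well after the 2018 target start. The sketch below flags such handles. The helper `handles_missing_start` is hypothetical, and the date ranges are copied from three rows of the printed output.

```python
def handles_missing_start(date_ranges, target_start):
    """Return handles whose earliest collected tweet is later than target_start.

    Hypothetical helper: date_ranges maps handle -> (earliest, latest), with
    dates as 'YYYY-MM-DD' strings so that plain string comparison is valid.
    """
    return sorted(
        handle
        for handle, (earliest, _latest) in date_ranges.items()
        if earliest > target_start
    )

# three rows taken from the printed per-handle summary
date_ranges = {
    'amyklobuchar': ('2019-02-18', '2020-04-18'),
    'BilldeBlasio': ('2016-10-19', '2020-03-12'),
    'realDonaldTrump': ('2019-12-31', '2020-04-18'),
}

missing = handles_missing_start(date_ranges, '2018-01-01')
print(missing)
```

Handles returned by this check would need another collection method to cover the full 2018 to early 2019 window.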


## Save Tweets to JSON File

In [76]:
fname = 'data/2020_candidate_tweets.json'

with bz2.BZ2File(fname, 'w') as fout:
    fout.write(json.dumps(user_tweets).encode('utf-8'))
print("Results are saved in {:s}".format(fname))

Results are saved in data/2020_candidate_tweets.json


#### Read Tweets from existing JSON File

In [3]:
fname = 'data/2020_candidate_tweets.json'

with bz2.BZ2File(fname, 'r') as fin:
    user_tweets = json.loads(fin.read().decode('utf-8'))
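
The save and load cells can be verified with a small round trip before trusting them with a long-running extraction. This is a minimal sketch, assuming a temporary directory and a made-up sample dictionary. It uses `bz2.open` in text mode as an alternative to `bz2.BZ2File` with manual encoding, and the `.json.bz2` file name is only a suggestion that makes the compression visible in the extension.

```python
import bz2
import json
import os
import tempfile

# made-up sample in the same shape as user_tweets: handle -> list of tweet dicts
sample = {
    'example_handle': [
        {'created_at': 'Sat Apr 18 14:59:36 +0000 2020', 'text': 'hello'},
    ]
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample_tweets.json.bz2')
    # write compressed JSON in text mode
    with bz2.open(path, 'wt', encoding='utf-8') as fout:
        json.dump(sample, fout)
    # read it back and decode
    with bz2.open(path, 'rt', encoding='utf-8') as fin:
        restored = json.load(fin)

print(restored == sample)
```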

## Subset 2020 Candidate Tweets
* Timeframe will be from the beginning of 2018 (i.e. 1/1/2018) to early 2019 (i.e. 3/31/2019)
* Only tweet text is extracted

In [29]:
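def check_date(created_at):
    """Return True if the tweet was created between 2018-01-01 and 2019-03-31 (inclusive)."""
    x = get_date(created_at)
    # ISO 'YYYY-MM-DD' strings sort chronologically, so string comparison is valid here
    return '2018-01-01' <= x <= '2019-03-31'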
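
Comparing dates as strings works here only because `get_date` returns zero-padded `YYYY-MM-DD` values, which sort lexicographically in the same order as the calendar. The sketch below checks that claim against real `date` objects at the window boundaries. The helper `in_window` and the sample dates are hypothetical, and `date.fromisoformat` requires Python 3.7 or later.

```python
from datetime import date

def in_window(iso_date, start='2018-01-01', end='2019-03-31'):
    """String-based window test, mirroring the comparison in check_date."""
    return start <= iso_date <= end

samples = ['2017-12-31', '2018-01-01', '2019-03-31', '2019-04-01']

for s in samples:
    # reference answer computed with real date objects
    expected = date(2018, 1, 1) <= date.fromisoformat(s) <= date(2019, 3, 31)
    assert in_window(s) == expected

results = [in_window(s) for s in samples]
print(results)
```

Both boundary dates count as inside the window, matching the inclusive comparison in `check_date`.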

In [None]:
tweet_text_list = []
for user in user_tweets.keys():
    for tweet in user_tweets[user]:
        if tweet['retweeted']