# Summary

This document covers the data collection steps for this project.

I will weigh several data sources, the information each one provides, and the drawbacks of each. Then I will choose one and continue with the project.

# Source 1: Trump Twitter Archive

The archive (http://www.trumptwitterarchive.com) contains every one of Trump's tweets, including those he has deleted. The advantage of this source is that it's extremely easy to access. The drawbacks are that there's no reliable way of telling whether a tweet was deleted, and that it potentially offers fewer features than collecting directly from Twitter.

Here's what data from the archive looks like (data is taken from a previous project). 

In [1]:
# Load relevant packages 
import numpy as np
import pandas as pd
import tweepy 
import json

In [2]:
archive = pd.read_csv('archive_tweets.csv', encoding = 'ISO-8859-1')
archive.head()

,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter for iPhone,Jenna Ellis ÒFBI thought they wouldnÕt get ca...,8/9/18 22:50,26667.0,98925.0,False,1.02769e+18
1,Twitter for iPhone,@LindseyGrahamSC ÒWhy didnÕt the FBI tell Pre...,8/9/18 19:43,5966.0,12205.0,False,1.02764e+18
2,Twitter for iPhone,Congressman Ted Yoho of Florida is doing a fan...,8/9/18 17:00,16838.0,69806.0,False,1.0276e+18
3,Twitter for iPhone,Space Force all the way!,8/9/18 16:03,35382.0,131769.0,False,1.02759e+18
4,Twitter for iPhone,This is an illegally brought Rigged Witch Hunt...,8/9/18 16:02,24439.0,91267.0,False,1.02759e+18
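
Two quirks in the output above are worth a note. The `Ò` and `Õ` characters look like curly quotes from a Mac Roman file that was decoded as ISO-8859-1, and the `id_str` values were parsed as floats, which cannot hold a 19-digit ID exactly. The sketch below illustrates both effects using only the standard library; the sample text and the ID are made up for illustration, and the Mac Roman diagnosis is an assumption about the file.

```python
# Illustration only: the sample text and ID are made up, not read from the archive.
sample = "Ellis \u201cFBI thought they wouldn\u2019t"

# Curly quotes written out as Mac Roman bytes...
raw = sample.encode("mac_roman")

# ...become accented capitals when decoded as ISO-8859-1.
wrong = raw.decode("iso-8859-1")
right = raw.decode("mac_roman")
print(wrong)
print(right)

# A float keeps roughly 15 to 17 significant digits, so a 19-digit ID is rounded.
tweet_id = "1027690000000000123"  # hypothetical ID
as_float = float(tweet_id)
print(int(as_float) == int(tweet_id))  # False: the low digits were lost
```

If that diagnosis holds for `archive_tweets.csv`, passing `encoding='mac_roman'` and `dtype={'id_str': str}` to `pd.read_csv` would be the corresponding fix. IDs that are already stored in scientific notation inside the file cannot be recovered this way.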


# Source 2: Twitter API

The advantage of this source is that it potentially offers more information for later feature engineering. It also contains only the tweets Trump has not deleted. The drawback is that the data is clunkier to collect and organize. We'll do so now.

In [3]:
# Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

# Authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# Collect tweets
tweets = api.user_timeline(screen_name='realDonaldTrump', count=200,
                           tweet_mode='extended', include_rts=False)

The above code requested the latest 200 tweets from Trump's account. Because retweets are filtered out after the count is applied, fewer than 200 may come back. Let's take a look at what keys we have.
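
A single `user_timeline` call returns at most 200 tweets, so reaching further back means paging with the `max_id` parameter. The sketch below shows that loop against a stand-in fetch function so it runs without credentials. `collect_timeline`, `fake_fetch`, and `FAKE_TWEETS` are hypothetical names; in practice the fetch function would wrap `api.user_timeline` and pass `max_id` through.

```python
from typing import Callable, Dict, List, Optional


def collect_timeline(fetch_page: Callable[[Optional[int]], List[Dict]],
                     limit: int = 3200) -> List[Dict]:
    """Page backwards through a timeline until it runs out or `limit` is reached.

    `fetch_page(max_id)` is a hypothetical callable returning tweets with
    id <= max_id (newest first), or the newest tweets when max_id is None.
    """
    collected: List[Dict] = []
    max_id: Optional[int] = None
    while len(collected) < limit:
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Ask only for tweets strictly older than the oldest one seen so far.
        max_id = min(t["id"] for t in page) - 1
    return collected[:limit]


# Stand-in for the API: 450 fake tweets, served 200 at a time.
FAKE_TWEETS = [{"id": i} for i in range(450, 0, -1)]


def fake_fetch(max_id: Optional[int]) -> List[Dict]:
    eligible = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return eligible[:200]


tweets_all = collect_timeline(fake_fetch)
print(len(tweets_all))
```

Subtracting one from the oldest ID avoids fetching the same tweet twice, since `max_id` is inclusive.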

In [4]:
# Take the raw JSON of one tweet and list its top-level keys
json_tweet = tweets[3]._json
primary_keys = list(json_tweet.keys())
print(primary_keys)

['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang']


Let's take a look at the secondary keys for each given primary key. 

In [5]:
# For each primary key
for prim_key in primary_keys:
    value = json_tweet[prim_key]
    # If the value is itself a dictionary, there is more data nested inside
    if isinstance(value, dict):
        # Collect the keys of the nested data
        secondary_keys = list(value.keys())
        print({prim_key: secondary_keys}, '\n')

{'entities': ['hashtags', 'symbols', 'user_mentions', 'urls']} 

{'user': ['id', 'id_str', 'name', 'screen_name', 'location', 'description', 'url', 'entities', 'protected', 'followers_count', 'friends_count', 'listed_count', 'created_at', 'favourites_count', 'utc_offset', 'time_zone', 'geo_enabled', 'verified', 'statuses_count', 'lang', 'contributors_enabled', 'is_translator', 'is_translation_enabled', 'profile_background_color', 'profile_background_image_url', 'profile_background_image_url_https', 'profile_background_tile', 'profile_image_url', 'profile_image_url_https', 'profile_banner_url', 'profile_link_color', 'profile_sidebar_border_color', 'profile_sidebar_fill_color', 'profile_text_color', 'profile_use_background_image', 'has_extended_profile', 'default_profile', 'default_profile_image', 'following', 'follow_request_sent', 'notifications', 'translator_type']} 
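
The loop above stops at the second level, but `user` itself contains a nested `entities` key. A minimal recursive sketch that lists keys at every depth follows; `key_paths` is a hypothetical helper, and `sample_tweet` is a made-up stand-in for `json_tweet`.

```python
def key_paths(obj, prefix=""):
    """Return dotted paths for every key in a nested dictionary."""
    paths = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.append(path)
        # Recurse only into dictionaries; lists and scalars are leaves here.
        if isinstance(value, dict):
            paths.extend(key_paths(value, path))
    return paths


# Made-up stand-in for json_tweet, much smaller than a real tweet.
sample_tweet = {
    "id": 1,
    "entities": {"hashtags": [], "urls": []},
    "user": {"id": 2, "entities": {"description": {"urls": []}}},
}

for path in key_paths(sample_tweet):
    print(path)
```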



The `user` info isn't helpful, since it describes the account and would be the same for each tweet. The `entities` info is.

In [6]:
df = pd.DataFrame([[tweet.id, tweet.created_at, tweet.full_text, tweet.retweet_count, tweet.favorite_count, 
                    tweet.entities['hashtags'], tweet.entities['symbols'], tweet.entities['user_mentions'], 
                    tweet.entities['urls'], tweet.source, tweet.display_text_range, tweet.in_reply_to_screen_name, 
                    tweet.in_reply_to_status_id] for tweet in tweets], 
                 columns = ['id', 'time', 'text', 'retweet_count', 'favorite_count', 'hashtags', 'symbols', 
                            'user_mentions', 'urls', 'source', 'length', 'in_reply_to_screen_name', 
                            'in_reply_to_status_id'])

In [7]:
df.head()

,id,time,text,retweet_count,favorite_count,hashtags,symbols,user_mentions,urls,source,length,in_reply_to_screen_name,in_reply_to_status_id
0,1046473870650290176,2018-09-30 18:56:27,"Wow! Just starting to hear the Democrats, who ...",21484,74516,[],[],[],[],Twitter for iPhone,"[0, 274]",,
1,1046456403651698693,2018-09-30 17:47:03,So if African-American unemployment is now at ...,19995,74252,[],[],[],[],Twitter for iPhone,"[0, 278]",,
2,1046443996074127361,2018-09-30 16:57:45,"Like many, I don’t watch Saturday Night Live (...",26606,113731,[],[],[],[],Twitter for iPhone,"[0, 279]",,
3,1046230634103025664,2018-09-30 02:49:55,NBC News incorrectly reported (as usual) that ...,40298,135551,[],[],[],[],Twitter for iPhone,"[0, 259]",,
4,1046201064469549056,2018-09-30 00:52:25,Thank you West Virginia - I love you! https://...,15180,69733,[],[],[],[],Twitter for iPhone,"[0, 37]",,
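
The `hashtags`, `user_mentions`, and `urls` columns hold lists of dictionaries, and `length` holds the `display_text_range` pair rather than a single number. Below is a sketch of reducing these to plain values, assuming the usual entity fields (`text`, `screen_name`, `expanded_url`). `flatten_entities` and the sample record are hypothetical.

```python
def flatten_entities(entities, display_text_range):
    """Reduce entity objects to plain strings and the text range to a length."""
    start, end = display_text_range
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "length": end - start,
    }


# Made-up record shaped like one tweet's `entities` and `display_text_range`.
sample_entities = {
    "hashtags": [{"text": "MAGA", "indices": [10, 15]}],
    "symbols": [],
    "user_mentions": [{"screen_name": "example_user", "indices": [0, 13]}],
    "urls": [{"url": "https://t.co/xyz", "expanded_url": "https://example.com/page"}],
}

flat = flatten_entities(sample_entities, [0, 37])
print(flat)
```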


It is clear that we can access far more information from the Twitter API directly. Unfortunately, I only just discovered that the Twitter API returns only a user's most recent 3,200 tweets, so we're going to have to rely on Source 1. Thankfully, nearly all the additional variables I was going to use can be created from the existing ones. Ultimately, it was helpful to see what the API generates, as that informs the kind of features I can create.
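
As a rough sketch of that last point, hashtags, mentions, URLs, and length can be approximated from the archive's `text` column with regular expressions. The patterns, the `text_features` helper, and the example string are assumptions for illustration, and they will not match Twitter's own entity parsing in every case.

```python
import re

# Simple approximations of Twitter's entity rules (assumed patterns).
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def text_features(text):
    """Derive API-like features from the raw tweet text alone."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "user_mentions": MENTION_RE.findall(text),
        "urls": URL_RE.findall(text),
        "length": len(text),
    }


# Made-up example text, not an actual tweet.
example = "Thank you @example_user! #MAGA https://t.co/abc123"
features = text_features(example)
print(features)
```

Applied to the archive, this could look like `archive['text'].apply(text_features)`, which yields one dictionary of features per tweet.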