# DAY06: Structured Data Collection 

## REST API (REspresentational State Transfer Application Programming Interface):

* Application program interface (API) that uses HTTP requests to GET, PUT, POST and DELETE data.
* Based on representational state transfer (REST) technology, an architectural style and approach to communications often used in web services development.
* The REST used by browsers can be thought of as the language of the internet. 
* REST is a logical choice for building APIs that allow users to connect and interact with cloud services. 
* RESTful APIs are used by such sites as Amazon, Google, LinkedIn and Twitter.
* Source: http://searchcloudstorage.techtarget.com/definition/RESTful-API

## STEPS TO USE TWITTER API:
1. Register for Twitter API
    - sign in to https://apps.twitter.com/
    - create new App
    - Fill in the required informations 
    - Go to "key and Access Tokens" tab 
    - Click "Create Access Tokens" at the bottom
    - Copy down Consumer Key, Consumer Secret, Access Token, Access Token Secret values and paste here.
2. Install dependencies
3. Write our script

## Import necessary libraries

In [1]:
import tweepy 
from tweepy import OAuthHandler
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Configure Panda for display
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
pd.options.display.width = 120

## Authentication

In [2]:
# Copy and paste your info below

consumer_key = '1tXpwHv4jMkLUoEPQvaZJSrRu'
consumer_secret = 'JTVpGsfzGfat5Cn3x9b00e1cWwZ3ColHBn7NIHplcK1G1eGZ53'
access_token = '865149462309621761-uQY1bEBslrM96kS9Jf3G2RpFLuIR8Aj'
access_secret ='13etK8elPvWEukBI6E3Bak5OzmkDqM6S41S2bGJmAwrKi'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

## REST API (request existing data)

**API.search(q[, lang][, locale][, rpp][, page][, since_id][, geocode][, show_user])**
Return tweets that match a specified query

*Parameters*:

**q**: the search query string

**lang**: restricts tweets to the given language, given by an ISO 639-1 code.

**locale**: specify the language of the query you are sending. This is intended for language-specific clients and the default should work in the majority of cases.

**rpp**: the number of tweets to return per page, up to max of 160.

**page**: the page number (starting from 1) to return, up to maximum of roughly 1500 results based on rpp*page.

**geocode**: returns tweets by users located within a given radius of the given latitude/longtitude. The location is preferentially taking from the Geotagging API, but will fall back to their Twitter profile. The parameter value is specified by "latitude, longtitude, radius", where the radius units must be specified as either "mi" (miles) or "km (kilometers). 

**show_user**: when true, prepends "<user>: " to the beginning of the tweet. Default is set to False.

## Our objective is to collect tweets containing "stackup"

In [3]:
results = api.search(q="stackup")

## Inspecting Results

In [4]:
# len method will count the number of tweets that contains "stackup"
len(results)

15

In [5]:
# Print the exact tweet containing "stackup"
def print_tweet(tweet):
    print "@%s - %s (%s)" % (tweet.user.screen_name, tweet.user.name, tweet.created_at)
    print tweet.text

tweet = results[0] # printing ONLY THE FIRST TWEET we have
print_tweet(tweet)

@HeathStevens69 - Battlefield Marine (2017-07-13 08:25:22)
RT @StackUpDotOrg: Looking for some great people to game with?

https://t.co/78OzEZTVSD

#StackUp with the #RedShirtRaiders for fun and gam…


## Inspecting a Status Object

In [6]:
tweet = results[0]

#list everything inside the object
for param in dir(tweet): 
    # looking for object that does not start with underscore because usually there's hidden objects we don't need
    if not param.startswith("_"): 
        print "%s : %s" % (param, eval("tweet." + param))

author : User(follow_request_sent=False, has_extended_profile=True, profile_use_background_image=True, _json={u'follow_request_sent': False, u'has_extended_profile': True, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 1688386194, u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/378800000167008987/lV24LBfW.jpeg', u'verified': False, u'translator_type': u'none', u'profile_text_color': u'333333', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/845791086811512832/_L_9_ztQ_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'entities': {u'url': {u'urls': [{u'url': u'https://t.co/8SveZoPaVD', u'indices': [0, 23], u'expanded_url': u'https://donate.stack-up.org/fundraise?is_new=1&fcid=610450', u'display_url': u'donate.stack-up.org/fundraise?is_n\u2026'}]}, u'description': {u'urls': []}}, u'followers_count': 2076, u'profile_sidebar_border_color': u'000000', u'id_str': u'1688386194', u'profile_backgro

## Inspecting an User Object

In [None]:
user = tweet.author

for param in dir(user):
    if not param.startswith("_"):
        print "%s : %s" % (param, eval("user." + param)) #Output too long so is not shown 

## Using Cursor for Pagination

For data mining, you will be dealing with large amount of results. Cursor is a simple way to handle iteration and results page. 

In [8]:
results = []
# retrieving the latest 100 tweet items with the word "stackup", and iterate all items in multiple pages
for tweet in tweepy.Cursor(api.search, q="stackup").items(100): 
    results.append(tweet) # append in the list

print len(results) # Output will prove we collected 100 items 

100


## Store Results in a Data Frame

In [9]:
def process_results(results):
    id_list = [tweet.id for tweet in results]
    data_set = pd.DataFrame(id_list, columns = ["id"])
    
    # Processing Tweet Data
    data_set["text"] = [tweet.text for tweet in results]
    data_set["created_at"] = [tweet.created_at for tweet in results]
    data_set["retweet_count"] = [tweet.retweet_count for tweet in results]
    data_set["favorite_count"] = [tweet.favorite_count for tweet in results]
    data_set["source"] = [tweet.source for tweet in results]

    # Processing User Data (Note user == author)
    data_set["user_id"] = [tweet.author.id for tweet in results]
    data_set["user_screen_name"] = [tweet.author.screen_name for tweet in results]
    data_set["user_name"] = [tweet.author.name for tweet in results]
    data_set["user_created_at"] = [tweet.author.created_at for tweet in results]
    data_set["user_description"] = [tweet.author.description for tweet in results]
    data_set["user_followers_count"] = [tweet.author.followers_count for tweet in results]
    data_set["user_friends_count"] = [tweet.author.friends_count for tweet in results]
    data_set["user_location"] = [tweet.author.location for tweet in results]
    
    return data_set

data_set = process_results(results)    

In [10]:
def write_data_csv(results):
    pass


write_data_csv(results)


## Looking at the Data: First and Last 5

In [11]:
data_set.head(5)

Unnamed: 0,id,text,created_at,retweet_count,favorite_count,source,user_id,user_screen_name,user_name,user_created_at,user_description,user_followers_count,user_friends_count,user_location
0,885349492664532993,Listen to Is You Ready(Go!) feat. Blaze(StackU...,2017-07-13 04:05:42,0,0,Twitter for iPhone,25782629,Finalestackup,FinaleStackUp ☠️💵🆙,2009-03-22 04:47:56,#Spinrilla 🐵 #730DipsDJs 🦅 #HeadDj #TeamBiggaR...,24551,10082,New Jersey
1,885328994203295744,RT @StackUpDotOrg: The #BoredRoom meeting star...,2017-07-13 02:44:15,2,0,Twitter for Android,1688386194,HeathStevens69,Battlefield Marine,2013-08-21 14:13:05,@EA Game changer & DICE Friend @GUNNARoptiks #...,2076,1193,"Carrolltown, PA"
2,885293954765402112,@StarUsher3 #StackUp #salute on tha #FOLLO pee...,2017-07-13 00:25:01,0,0,Twitter for Android,830575400392630272,D_RockStackUpDJ,D Rock #StackUpDJs🔌,2017-02-12 00:33:01,Official Promoter for #StackUp 🤘🏽💵🆙 @blazestac...,140,82,"Vineland, NJ 💃💸💊"
3,885287722356686851,RT @StackUpDotOrg: The #BoredRoom meeting star...,2017-07-13 00:00:15,2,0,Twitter for iPhone,4757396335,vashnare,VaShNaRe,2016-01-14 09:36:01,"U.S. Marine Veteran OIF 05-07, #Affiliate #str...",613,481,
4,885287670880096256,The #BoredRoom meeting starts now!\n\nhttps://...,2017-07-13 00:00:03,2,3,TweetDeck,3691232175,StackUpDotOrg,Stack-UpOrg,2015-09-18 00:59:21,"Founded in 2015, Stack-Up is a Military Charit...",19803,9139,Earth


In [12]:
data_set.tail(5)

Unnamed: 0,id,text,created_at,retweet_count,favorite_count,source,user_id,user_screen_name,user_name,user_created_at,user_description,user_followers_count,user_friends_count,user_location
95,884810316391043074,RT @AgeeDior: Everyone has been showing their ...,2017-07-11 16:23:13,35,0,Twitter for iPhone,714150800176594944,Ji_Stackup,Ji 🤤🌊,2016-03-27 18:03:13,Self Made 🏄🏽🏋🏽 16 🤷🏾‍♂️💸,78,79,"Bronx, NY"
96,884810298460426241,RT @OsamaGuwop_: Gotta move different when the...,2017-07-11 16:23:08,200,0,Twitter for iPhone,714150800176594944,Ji_Stackup,Ji 🤤🌊,2016-03-27 18:03:13,Self Made 🏄🏽🏋🏽 16 🤷🏾‍♂️💸,78,79,"Bronx, NY"
97,884804611684663297,RT @StackUpDotOrg: Time to #StackUp with @Kade...,2017-07-11 16:00:32,4,0,Twitter for Android,1688386194,HeathStevens69,Battlefield Marine,2013-08-21 14:13:05,@EA Game changer & DICE Friend @GUNNARoptiks #...,2076,1193,"Carrolltown, PA"
98,884804609956601857,RT @StackUpDotOrg: Time to #StackUp with @Kade...,2017-07-11 16:00:32,4,0,Twitter for Android,3060276682,OriginalKyller,KcKyller,2015-02-24 20:59:29,Sponsored by @GeekIsUs @GlitchGearInc @BluvosE...,2678,1304,Live most nights 7-10pm EST
99,884804488271249408,Time to #StackUp with @Kadexgaming on #Twitch!...,2017-07-11 16:00:03,4,5,TweetDeck,3691232175,StackUpDotOrg,Stack-UpOrg,2015-09-18 00:59:21,"Founded in 2015, Stack-Up is a Military Charit...",19803,9139,Earth
