
[![AnalyticsDojo](../fig/final-logo.png)](http://rpi.analyticsdojo.com)
<center><h1>Introduction to API's with Python</h1></center>
<center><h3><a href = 'http://rpi.analyticsdojo.com'>rpi.analyticsdojo.com</a></h3></center>



This is adopted from [Mining the Social Web, 2nd Edition](http://bit.ly/16kGNyb)
Copyright (c) 2013, Matthew A. Russell
All rights reserved.

This work is licensed under the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt).

### Before you Begin #1
If you are working locally or on colab, this exercise requires the twitter package.
`!pip install twitter`


In [None]:
 !pip install twitter

In [1]:
#see if it worked by importing the twitter package & some other things we will use.  
from  twitter import *
import datetime, traceback 
import json
import time
import sys


### Before you Begin #2
Download the sample configuration.  


In [None]:
!wget <see filename on slack>


## Step1.  Loading Authorization Data
- Here we are going to store the authorization data in a .YAML file rather than directly in the notebook.  
- We have also added `config.yaml` to the `.gitignore` file so we won't accidentally commit our sensitive data to the repository.
- You should generally keep sensitive data out of all git repositories (public or private) but definitely Public. 
- If you ever accidentally commit data to a public repository you must consider it compromised.
- A .yaml file is a common way to store configuration data, but it is not really secure. 

In [2]:
#This will import some required libraries.
import sys 
import ruamel.yaml #A .yaml file 
#This is your configuration file. 
twitter_yaml='config.yaml'
with open(twitter_yaml, 'r') as yaml_t:
    cf_t=ruamel.yaml.round_trip_load(yaml_t, preserve_quotes=True)

#You can check your config was loaded by printing, but you should not commit this.
#print(cf_t)
cf_t['file']

'screen_names.csv'

## Create Some Relevant Functions
- We first will create a Twitter object we can used to authorize data.
- Then we will get profiles.
- Finally we will get some tweets.  

**Don't worry about not understanding all the code.  Here we are pushing you us more complex functions.**

In [3]:
def create_twitter_auth(cf_t):
        """Function to create a twitter object
           Args: cf_t is configuration dictionary. 
           Returns: Twitter object.
            """
        # When using twitter stream you must authorize.
        # these tokens are necessary for user authentication
        # create twitter API object

        auth = OAuth(cf_t['access_token'], cf_t['access_token_secret'], cf_t['consumer_key'], cf_t['consumer_secret'])

        try:
            # create twitter API object
            twitter = Twitter(auth = auth)
        except TwitterHTTPError:
            traceback.print_exc()
            time.sleep(cf_t['sleep_interval'])
        return twitter

In [4]:
def get_profiles(twitter, names, cf_t):
    """Function write profiles to a file with the form *data-user-profiles.json*
       Args: names is a list of names
             cf_t is a list of twitter config
       Returns: Nothing
        """
    # file name for daily tracking
    dt = datetime.datetime.now()
    fn = cf_t['data']+'/profiles/'+dt.strftime('%Y-%m-%d-user-profiles.json')
    with open(fn, 'w') as f:
        for name in names:
            print("Searching twitter for User profile: ", name)
            try:
                # create a subquery, looking up information about these users
                # twitter API docs: https://dev.twitter.com/docs/api/1/get/users/lookup
                profiles = twitter.users.lookup(screen_name = name)
                sub_start_time = time.time()
                for profile in profiles:
                    print("User found. Total tweets:", profile['statuses_count'])
                    # now save user info
                    f.write(json.dumps(profile))
                    f.write("\n")
                sub_elapsed_time = time.time() - sub_start_time;
                if sub_elapsed_time < cf_t['sleep_interval']:
                    time.sleep(cf_t['sleep_interval'] + 1 - sub_elapsed_time)
            except TwitterHTTPError:
                traceback.print_exc()
                time.sleep(cf_t['sleep_interval'])
                continue
    f.close()
    return fn

## Load Twitter Handle From CSV
- This is a .csv that has individuals we want to collect data on. 
- Go ahead and follow [AnalyticsDojo](https://twitter.com/AnalyticsDojo).  

In [13]:
import pandas as pd
df=pd.read_csv(cf_t['config']+"/"+cf_t['file'])
df

Unnamed: 0,index,screen_name
0,1,tensorflow
1,2,DeepLearningHub


## Create Twitter Object

In [14]:
import twitlab

In [15]:
#Create Twitter Object
twitter= twitlab.create_twitter_auth(cf_t)

In [7]:
df['screen_name']

0         tensorflow
1    DeepLearningHub
Name: screen_name, dtype: object

In [12]:
#This will get general profile data
profiles_fn=twitlab.get_profiles(twitter, df['screen_name'], cf_t)

Searching twitter for User profile:  tensorflow
User found. Total tweets: 426
Searching twitter for User profile:  DeepLearningHub
User found. Total tweets: 2431


The outcoming of running the above API is to generate a twitter object. 

## Step 2. Getting Help

In [11]:
# We can get some help on how to use the twitter api with the following. 
help(twitter)

Help on Twitter in module twitter.api object:

class Twitter(TwitterCall)
 |  The minimalist yet fully featured Twitter API class.
 |  
 |  Get RESTful data by accessing members of this class. The result
 |  is decoded python objects (lists and dicts).
 |  
 |  The Twitter API is documented at:
 |  
 |    https://dev.twitter.com/overview/documentation
 |  
 |  The list of most accessible functions is listed at:
 |  
 |    https://dev.twitter.com/rest/public
 |  
 |  
 |  Examples::
 |  
 |      from twitter import *
 |  
 |      t = Twitter(
 |          auth=OAuth(token, token_secret, consumer_key, consumer_secret))
 |  
 |      # Get your "home" timeline
 |      t.statuses.home_timeline()
 |  
 |      # Get a particular friend's timeline
 |      t.statuses.user_timeline(screen_name="billybob")
 |  
 |      # to pass in GET/POST parameters, such as `count`
 |      t.statuses.home_timeline(count=5)
 |  
 |      # to pass in the GET/POST parameter `id` you need to use `_id`
 |      t.sta


Go ahead and take a look at the [twitter docs](https://dev.twitter.com/rest/public).



In [12]:
# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/

WORLD_WOE_ID = 1
US_WOE_ID = 23424977

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter.trends.place(_id=US_WOE_ID)

print (world_trends)
print (us_trends)

[{'trends': [{'name': '#NationalVoterRegistrationDay', 'url': 'http://twitter.com/search?q=%23NationalVoterRegistrationDay', 'promoted_content': None, 'query': '%23NationalVoterRegistrationDay', 'tweet_volume': 31490}, {'name': 'Caputo', 'url': 'http://twitter.com/search?q=Caputo', 'promoted_content': None, 'query': 'Caputo', 'tweet_volume': 33073}, {'name': '#FelizMartes', 'url': 'http://twitter.com/search?q=%23FelizMartes', 'promoted_content': None, 'query': '%23FelizMartes', 'tweet_volume': 36370}, {'name': '#FantasticBeasts', 'url': 'http://twitter.com/search?q=%23FantasticBeasts', 'promoted_content': None, 'query': '%23FantasticBeasts', 'tweet_volume': 94250}, {'name': '#INDvAFG', 'url': 'http://twitter.com/search?q=%23INDvAFG', 'promoted_content': None, 'query': '%23INDvAFG', 'tweet_volume': 31758}, {'name': '#EleSimENo1Turno', 'url': 'http://twitter.com/search?q=%23EleSimENo1Turno', 'promoted_content': None, 'query': '%23EleSimENo1Turno', 'tweet_volume': 43322}, {'name': '내기니', 

## Step 3. Displaying API responses as pretty-printed JSON

In [13]:
import json

print (json.dumps(world_trends, indent=1))
print (json.dumps(us_trends, indent=1))

[
 {
  "trends": [
   {
    "name": "#NationalVoterRegistrationDay",
    "url": "http://twitter.com/search?q=%23NationalVoterRegistrationDay",
    "promoted_content": null,
    "query": "%23NationalVoterRegistrationDay",
    "tweet_volume": 31490
   },
   {
    "name": "Caputo",
    "url": "http://twitter.com/search?q=Caputo",
    "promoted_content": null,
    "query": "Caputo",
    "tweet_volume": 33073
   },
   {
    "name": "#FelizMartes",
    "url": "http://twitter.com/search?q=%23FelizMartes",
    "promoted_content": null,
    "query": "%23FelizMartes",
    "tweet_volume": 36370
   },
   {
    "name": "#FantasticBeasts",
    "url": "http://twitter.com/search?q=%23FantasticBeasts",
    "promoted_content": null,
    "query": "%23FantasticBeasts",
    "tweet_volume": 94250
   },
   {
    "name": "#INDvAFG",
    "url": "http://twitter.com/search?q=%23INDvAFG",
    "promoted_content": null,
    "query": "%23INDvAFG",
    "tweet_volume": 31758
   },
   {
    "name": "#EleSimENo1Turno",


Take a look at the [api docs](https://dev.twitter.com/rest/reference/get/trends/place) for the /trends/place call made above. 

## Step 4. Collecting search results for a targeted hashtag.

In [14]:
# Import unquote to prevent url encoding errors in next_results
#from urllib3 import unquote

#This can be any trending topic, but let's focus on a hashtag that is relevant to the class. 
q = '#analytics' 

count = 100

# See https://dev.twitter.com/rest/reference/get/search/tweets
search_results = twitter.search.tweets(q=q, count=count)

#This selects out 
statuses = search_results['statuses']


# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print ("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
        print ("next_results", next_results)
    except: # No more results when next_results doesn't exist
        break
        
    # Create a dictionary from next_results, which has the following form:
    # ?max_id=313519052523986943&q=NCAA&include_entities=1
    kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])
    print (kwargs)
    search_results = twitter.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list...
print (json.dumps(statuses[0], indent=1))

Length of statuses 100
next_results ?max_id=1044590656306511872&q=%23analytics&count=100&include_entities=1
{'max_id': '1044590656306511872', 'q': '%23analytics', 'count': '100', 'include_entities': '1'}
Length of statuses 200
next_results ?max_id=1044585942693031936&q=%2523analytics&count=100&include_entities=1
{'max_id': '1044585942693031936', 'q': '%2523analytics', 'count': '100', 'include_entities': '1'}
Length of statuses 200
{
 "created_at": "Tue Sep 25 14:28:36 +0000 2018",
 "id": 1044594521403772928,
 "id_str": "1044594521403772928",
 "text": "RT @MicroStrategy: ICYMI, @BARC_Research named MicroStrategy an #analytics market leader, citing its strengths of \"tight integration of sel\u2026",
 "truncated": false,
 "entities": {
  "hashtags": [
   {
    "text": "analytics",
    "indices": [
     64,
     74
    ]
   }
  ],
  "symbols": [],
  "user_mentions": [
   {
    "screen_name": "MicroStrategy",
    "name": "MicroStrategy",
    "id": 14883246,
    "id_str": "14883246",
    "in

In [None]:
#Print several
print (json.dumps(statuses[0:5], indent=1))

## Step 5. Extracting text, screen names, and hashtags from tweets

In [15]:
#We can access an individual tweet like so:
statuses[1]['text']





'RT @Harry_Robots: 12 Facts you need to know about #IoT and its usage and implications #Infographic #IIoT #CyberSecurity #BigData #infosec #…'

In [16]:
statuses[1]['entities']

{'hashtags': [{'indices': [50, 54], 'text': 'IoT'},
  {'indices': [86, 98], 'text': 'Infographic'},
  {'indices': [99, 104], 'text': 'IIoT'},
  {'indices': [105, 119], 'text': 'CyberSecurity'},
  {'indices': [120, 128], 'text': 'BigData'},
  {'indices': [129, 137], 'text': 'infosec'}],
 'symbols': [],
 'urls': [],
 'user_mentions': [{'id': 973892651522166785,
   'id_str': '973892651522166785',
   'indices': [3, 16],
   'name': 'Harry Miller',
   'screen_name': 'Harry_Robots'}]}

In [17]:
#notice the nested relationships.  We have to take notice of this to further access the data.
statuses[1]['entities']['hashtags']

[{'indices': [50, 54], 'text': 'IoT'},
 {'indices': [86, 98], 'text': 'Infographic'},
 {'indices': [99, 104], 'text': 'IIoT'},
 {'indices': [105, 119], 'text': 'CyberSecurity'},
 {'indices': [120, 128], 'text': 'BigData'},
 {'indices': [129, 137], 'text': 'infosec'}]

In [18]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

urls = [ url['url'] 
             for status in statuses
                 for url in status['entities']['urls'] ]



# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

# Explore the first 5 items for each...

print (json.dumps(status_texts[0:5], indent=1))
print (json.dumps(screen_names[0:5], indent=1)) 
print (json.dumps(hashtags[0:5], indent=1))
print (json.dumps(words[0:5], indent=1))

[
 "RT @MicroStrategy: ICYMI, @BARC_Research named MicroStrategy an #analytics market leader, citing its strengths of \"tight integration of sel\u2026",
 "RT @Harry_Robots: 12 Facts you need to know about #IoT and its usage and implications #Infographic #IIoT #CyberSecurity #BigData #infosec #\u2026",
 "RT @R_Demidchuk: Completo informe de Gartner con 100 predicciones sobre #Analytics y #BigData, y c\u00f3mo se potencian al interactuar con otras\u2026",
 "RT @Ronald_vanLoon: 9 Types of #Intelligence [#INFOGRAPHICS]\n\n#ArtificialIntelligence #AI #3D #BigData #VirtualReality #VR #Analytics #UX #\u2026",
 "RT @ahmedjr_16: 48 Best #Development Courses \n\nhttps://t.co/y3DBJGYs64 \n\n#AI #gamedev #indiedev #indiegame #games #game #unity #100daysofCo\u2026"
]
[
 "MicroStrategy",
 "BARC_Research",
 "Harry_Robots",
 "R_Demidchuk",
 "Ronald_vanLoon"
]
[
 "analytics",
 "IoT",
 "Infographic",
 "IIoT",
 "CyberSecurity"
]
[
 "RT",
 "@MicroStrategy:",
 "ICYMI,",
 "@BARC_Research",
 "named"
]


## Step 6. Creating a basic frequency distribution from the words in tweets

In [19]:
from collections import Counter

for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print (c.most_common()[:10]) # top 10, "\n")
    

[('RT', 114), ('to', 79), ('the', 63), ('and', 54), ('in', 48), ('#Analytics', 46), ('of', 37), ('for', 30), ('with', 29), ('#analytics', 27)]
[('Fisher85M', 17), ('MicroStrategy', 14), ('NeilCattermull', 8), ('FinMasonInc', 5), ('BARC_Research', 4), ('AiDirectory', 4), ('bamitav', 4), ('Ronald_vanLoon', 3), ('Flipkart', 3), ('upstreamcomm', 3)]
[('Analytics', 52), ('analytics', 34), ('AI', 26), ('BigData', 25), ('IoT', 10), ('CyberSecurity', 10), ('infosec', 10), ('data', 9), ('Security', 8), ('GDPR', 8)]
