# CIS600 - Social Media & Data Mining
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# Twitter

###  February 13, 2018

# Generalities - 'Scraping'

- You don't need an API to automate data retrieval
- Very often, none is provided.
- But you still have to worry about errors and rate limits
- Some packages to use for this:
    - Requests
    - Beautiful Soup
    - Scrapy
- We will use the designated APIs.
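### The point about errors and rate limits can be sketched as a retry loop with exponential backoff. This is a minimal illustration, not a recipe from any particular package: `fetch_with_retries` and `flaky_download` are made-up names, and the fake download stands in for a real HTTP request.

```python
import time


def fetch_with_retries(fetch, max_attempts=4, base_delay=0.01):
    """Call fetch() until it succeeds, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff


# Hypothetical stand-in for a real page download: fails twice, then succeeds.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>page content</html>"


page_text = fetch_with_retries(flaky_download)
print(calls["count"], page_text)
```

### In real scraping the delay would be seconds rather than hundredths of a second, and the exception type would depend on the HTTP library in use.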

# Generalities - REST APIs

- RESTful web services provide representations of resources as *text*, typically in JSON or XML format.
- This is done via *stateless* operations: each request carries everything the server needs to answer it.
- These operations use HTTP methods such as GET, POST, PUT and DELETE.
- We will mostly be using GET requests.
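### To make "stateless GET request" concrete, here is a small sketch that builds a request without sending it. The host `api.example.com` and the path are placeholders, not a real service; only the standard library is used.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (an assumption for illustration only).
base_url = "https://api.example.com/search"

# Everything the server needs travels in the request itself.
params = {"q": "#freebandnames", "count": 4}
url = base_url + "?" + urlencode(params)

request = Request(url, method="GET")
print(request.get_method(), request.full_url)
```

### Nothing is sent over the network here; the request object simply shows the method and the full URL that a GET would use.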

# Generalities - JSON & XML

- JSON: JavaScript Object Notation
- XML: Extensible Markup Language
- Both record information that is *human-* and *machine-*readable
- JSON works like dictionaries in Python
- XML looks like basic HTML
- Both are *hierarchical*
    - Values in a JSON dictionary can themselves be dictionaries.
    - Elements in XML can themselves be element trees.
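### A short sketch of the *hierarchical* point, using only the standard library. The nested layout of `json_text` and `xml_text` is invented for this illustration.

```python
import json
import xml.etree.ElementTree as ET

# JSON: a dictionary whose "name" value is itself a dictionary.
json_text = '{"name": {"first": "Guido", "last": "Rossum"}, "titles": ["BDFL", "Developer"]}'
person = json.loads(json_text)
json_first = person["name"]["first"]

# XML: a <person> element whose <name> child is itself a small tree.
xml_text = "<person><name><first>Guido</first><last>Rossum</last></name></person>"
root = ET.fromstring(xml_text)
xml_first = root.find("name").find("first").text

print(json_first, xml_first)
```

### In both cases we walk down one level at a time: by key in JSON, by child element in XML.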

## Example - JSON

In [1]:
#JSON
# See Introduction notebook for other JSON examples.
import json
json_string = '{"first_name": "Guido", "last_name":"Rossum"}'

# Parse the JSON string into a Python dictionary
parsed_json = json.loads(json_string)

# Get its values, treating it as a dictionary
parsed_json['first_name']

'Guido'

In [15]:
# Alternatively, we could start with a dictionary
d = {
    'first_name': 'Guido',
    'second_name': 'Rossum',
    'titles': ['BDFL', 'Developer'],
}
type(d)

dict

In [16]:
# Serialize it to a JSON string with dumps, then parse it back with loads
new_json = json.loads(json.dumps(d))
new_json['titles'][0]

'BDFL'
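### The difference between `json.dumps` and `json.loads` is easy to miss, so here is a minimal sketch: `dumps` goes from a dictionary to a string, and `loads` goes back. The dictionary is a shortened version of the one above.

```python
import json

d = {"first_name": "Guido", "titles": ["BDFL", "Developer"]}

as_text = json.dumps(d)            # dictionary -> JSON string
round_trip = json.loads(as_text)   # JSON string -> dictionary

print(type(as_text).__name__)
print(round_trip == d)
```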

## Example - XPath

In [2]:
#XML
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)

### We now have the entire HTML source for that page in the variable "tree". Let's use XPath, the XML query language, to pull things out of it.

In [11]:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')

In [12]:
buyers

['Carson Busses',
 'Earl E. Byrd',
 'Patty Cakes',
 'Derri Anne Connecticut',
 'Moe Dess',
 'Leda Doggslife',
 'Dan Druff',
 'Al Fresco',
 'Ido Hoe',
 'Howie Kisses',
 'Len Lease',
 'Phil Meup',
 'Ira Pent',
 'Ben D. Rules',
 'Ave Sectomy',
 'Gary Shattire',
 'Bobbi Soks',
 'Sheila Takya',
 'Rose Tattoo',
 'Moe Tell']

### We got a list of "buyers". Why? The first part "//" refers to *descendants* of a given node, in this case the root of "tree". All children are descendants, but not all descendants are (direct) children! Next, we are looking for *div* elements - but not all of them!

### The @ symbol refers to *attributes*. We are therefore filtering for div elements whose *title* attribute is equal to "buyer-name".

### Finally, we get the text that is the direct descendant, or child, of the div elements having title equal to "buyer-name". In other words, we get all the buyer names.

### See [this page](https://en.wikipedia.org/wiki/XPath) for more.
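### The children-versus-descendants distinction can be tried offline. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (for example, no `text()`), rather than `lxml`. The snippet is made up, reusing two buyer names from the output above.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet: one buyer div directly under <body>,
# another nested one level deeper inside <section>.
snippet = """
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <section>
    <div title="buyer-name">Earl E. Byrd</div>
    <div title="item-price">29.95</div>
  </section>
</body></html>
"""
root = ET.fromstring(snippet)
body = root.find("body")

# "./div" matches only direct children of <body>.
children = [d.text for d in body.findall("./div[@title='buyer-name']")]

# ".//div" matches all descendants, however deeply nested.
descendants = [d.text for d in root.findall(".//div[@title='buyer-name']")]

print(children)
print(descendants)
```

### The attribute filter `[@title='buyer-name']` plays the same role as in the `lxml` query above: it keeps the buyer divs and drops the price div.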

# Twitter API - Search

### The Twitter Search API is RESTful. Let's get some responses and look at what is inside them.

In [3]:
# Import only the names used in this notebook
from twitter import Twitter, OAuth, TwitterStream

In [4]:
# Loading my authentication tokens
with open('auth_dict','r') as f:
    twtr_auth = json.load(f)
# Notice that there are four tokens - you need to create these in the
# Twitter Apps dashboard after you have created your own "app".
t = Twitter(auth=OAuth(twtr_auth['token'], twtr_auth['token_secret'], 
                       twtr_auth['consumer_key'], twtr_auth['consumer_secret']))
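### For reference, a sketch of what a file like `auth_dict` could contain. The four key names are the ones used in the cell above; the values are placeholders, and writing to a temporary directory is only so the example cleans up after itself.

```python
import json
import os
import tempfile

# Placeholder values only: real ones come from the Twitter Apps dashboard.
placeholder_auth = {
    "token": "YOUR-ACCESS-TOKEN",
    "token_secret": "YOUR-ACCESS-TOKEN-SECRET",
    "consumer_key": "YOUR-CONSUMER-KEY",
    "consumer_secret": "YOUR-CONSUMER-SECRET",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "auth_dict")
    # Write the dictionary as JSON text ...
    with open(path, "w") as f:
        json.dump(placeholder_auth, f, indent=2)
    # ... and read it back exactly as the cell above does.
    with open(path, "r") as f:
        loaded_auth = json.load(f)

print(sorted(loaded_auth))
```

### Keeping the tokens in a separate file means they never appear in the notebook itself.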

### Let's look at the "Example Request" provided by Twitter.

In [5]:
docs_example = """GET https://api.twitter.com/1.1/search/tweets
                .json?q=%23freebandnames&since_id=2401261998405
                1000&max_id=250126199840518145&result_type=mixed&count=4"""

### Note that we used triple quotes (i.e. 3 double quotes) for this multiline string. It is still just a string, although the line breaks and indentation become part of it.

### What is going on in this string?

### The first piece of information in the string *docs_example* that stands out is "freebandnames". That's a search term, but what is its effect? It is preceded by "%23", the percent-encoded (URL-encoded) form of "#", so we are searching for the *hashtag* #freebandnames. See [this](https://brajeshwar.github.io/entities/) for more.
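### A quick check of the "%23" encoding with the standard library's `urllib.parse`; nothing here talks to Twitter.

```python
from urllib.parse import quote, unquote

encoded = quote("#freebandnames")     # "#" is not URL-safe, so it is escaped
decoded = unquote("%23freebandnames")  # and this reverses the escaping

print(encoded)
print(decoded)
```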

### Rmk: [There is no such thing as plaintext.](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

### Next, we see "since_id": we only want tweets with an ID greater than (that is, more recent than) the given tweet ID.

### The "max_id" gives the other bound - we want no tweets with IDs greater than this value.

### We are asking for "mixed" results: both popular and recent tweets.

### Finally, we specify the number of results, 4, as opposed to the default of 15.
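### The parameters discussed above can also be pulled out of the URL programmatically. This sketch assumes the example request is written on a single line (without the line breaks in *docs_example*) and uses only `urllib.parse`.

```python
from urllib.parse import urlsplit, parse_qs

# The documentation's example request, joined onto one line.
url = ("https://api.twitter.com/1.1/search/tweets.json"
       "?q=%23freebandnames&since_id=24012619984051000"
       "&max_id=250126199840518145&result_type=mixed&count=4")

# parse_qs returns a list per key; each key appears once here, so keep item 0.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(key, "=", value)
```

### Note that `parse_qs` decodes "%23" back to "#", and that every value comes back as a string, including the count.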

### Let's look at the Example Results

# Twitter API - Python Wrapper

### We don't want to build one of these strings every time, and we don't have to. Now we will run the above query using the Python library "twitter" from SixOhSix.

In [46]:
freebandnames = t.search.tweets(q='#freebandnames', since_id=24012619984051000,
                                result_type='mixed', count=4)

In [50]:
freebandnames.keys()

dict_keys(['statuses', 'search_metadata'])

In [51]:
freeb = freebandnames['statuses']

In [55]:
freeb[0]['user']['screen_name']

'jakeonthego'

### Note the structure of this JSON object. Let's discuss.
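### As a starting point for that discussion, here is a sketch of the nesting using a hand-written stand-in for the response. Only the keys `statuses`, `search_metadata`, `user` and `screen_name` come from the cells above; the tweet texts, the second user and the metadata value are made up, and a real response has many more fields.

```python
# Hand-written stand-in for a search response (not real API output).
mock_response = {
    "statuses": [
        {"text": "made-up tweet one", "user": {"screen_name": "jakeonthego"}},
        {"text": "made-up tweet two", "user": {"screen_name": "example_user"}},
    ],
    "search_metadata": {"count": 4},
}

# A dictionary holding a list of dictionaries, each holding another dictionary.
screen_names = [status["user"]["screen_name"]
                for status in mock_response["statuses"]]

print(screen_names)
```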

### Challenge: get "max_id" to work, so that this query reproduces exactly the "Example Results" from the documentation.

### Exercise: try the other example searches [here](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/guides/how-to-build-a-query). Play around and change things, see what you get!

# Python Wrapper - More Examples

In [60]:
t.statuses.oembed(_id=1234577890)

{'author_name': 'Larry Rosbach',
 'author_url': 'https://twitter.com/lrphils',
 'cache_age': '3153600000',
 'height': None,
 'html': '<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Ran Frostbite 5-miler in Ambler, PA this morning and did pretty well considering. A good start to, I hope, a good weekend.</p>&mdash; Larry Rosbach (@lrphils) <a href="https://twitter.com/lrphils/status/1234577890?ref_src=twsrc%5Etfw">February 21, 2009</a></blockquote>\n<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>\n',
 'provider_name': 'Twitter',
 'provider_url': 'https://twitter.com',
 'type': 'rich',
 'url': 'https://twitter.com/lrphils/status/1234577890',
 'version': '1.0',
 'width': 550}

In [None]:
# Get your "home" timeline
t.statuses.home_timeline()

# Get a particular friend's timeline
t.statuses.user_timeline(screen_name="zedshaw")

# to pass in GET/POST parameters, such as `count`
t.statuses.home_timeline(count=5)

# to pass in the GET/POST parameter `id` you need to use `_id`
t.statuses.oembed(_id=1234567890)

# Update your status
t.statuses.update(
    status="Using Python Twitter Tools.")

# Send a direct message
t.direct_messages.new(
    user="zedshaw",
    text="Hi Zed, big fan of your work.")

# Get the members of buffer's list "the-buffer-team"
t.lists.members(owner_screen_name="buffer", slug="the-buffer-team")

# An *optional* `_timeout` parameter can also be used for API
# calls which take much more time than normal or twitter stops
# responding for some reason:
t.users.lookup(
    screen_name=','.join(A_LIST_OF_100_SCREEN_NAMES), _timeout=1)

# Rmk: A_LIST_OF_100_SCREEN_NAMES is not defined. Why not fix that?

# Overriding the method: GET/POST
# You should not need this, as the library properly detects
# whether GET or POST should be used. Nevertheless, to force
# a particular method, use `_method`:
t.statuses.oembed(_id=1234567890, _method='GET')
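### One possible way to prepare the comma-separated argument used with `t.users.lookup` above, assuming (as the name A_LIST_OF_100_SCREEN_NAMES suggests) at most 100 names per call. The helper `batches` and the generated screen names are hypothetical; no API call is made.

```python
def batches(names, size=100):
    """Yield comma-separated strings holding at most `size` names each."""
    for start in range(0, len(names), size):
        yield ",".join(names[start:start + size])


# Made-up screen names, just to have something to split up.
screen_names = ["user{}".format(i) for i in range(250)]

queries = list(batches(screen_names))
print(len(queries))
print(queries[0][:17])
```

### Each string in `queries` could then be passed as the `screen_name` argument of one lookup call.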

In [None]:
# Search for the latest tweets about #pycon
t.search.tweets(q="#pycon")

# Search for the latest tweets about #pycon, using extended mode
t.search.tweets(q="#pycon", tweet_mode='extended')

In [None]:
x = t.statuses.home_timeline()

# The first 'tweet' in the timeline
x[0]

# The screen name of the user who wrote the first 'tweet'
x[0]['user']['screen_name']

# Twitter Streaming API

### The general idea (from SixOhSix):

In [None]:
# Rmk: this cell will not run as written - OAuth(...) needs real credentials!
twitter_stream = TwitterStream(auth=OAuth(...))
iterator = twitter_stream.statuses.sample()

for tweet in iterator:
    # ...do something with this tweet...
    pass

### Let's see a basic example.

In [None]:
# Create a *streaming* connection (not RESTful, different from Search).
t_stream = TwitterStream(auth=OAuth(twtr_auth['token'], twtr_auth['token_secret'], 
                       twtr_auth['consumer_key'], twtr_auth['consumer_secret']))


# Get an *iterator* object from the twitter wrapper

tweeterator = t_stream.statuses.sample()


# The loop below simply prints randomly selected new tweets
# until we reach the threshold of "tweet_count"

tweet_count = 200
for tweet in tweeterator:
    tweet_count -= 1
    print(json.dumps(tweet))  
    if tweet_count <= 0:
        break 
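### The same "take a fixed number of items from an endless iterator" pattern can be tried without credentials. Here `fake_stream` is a made-up generator standing in for the streaming iterator, and `itertools.islice` replaces the manual counter; this is an alternative sketch, not how the wrapper itself works.

```python
import itertools
import json


def fake_stream():
    """Hypothetical stand-in for a streaming iterator: it never ends."""
    for n in itertools.count(1):
        yield {"id": n, "text": "made-up tweet number {}".format(n)}


# islice stops after 3 items, so the endless generator is never exhausted.
first_three = list(itertools.islice(fake_stream(), 3))

for tweet in first_three:
    print(json.dumps(tweet))
```

### With the real stream, the counter-and-break loop above and `islice` do the same job; which one to use is a matter of taste.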