# CIS600 - Social Media & Data Mining
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# Twitter Again

###  February 20, 2018

# The Midterm

- Python
- Graph Theory
- Social Media Concepts

### The midterm exam will answer the question "Have you been paying attention?" There will be no difficult theory, and it is not meant to assess deep knowledge of any particular thing we have touched on.

### You will get a sample midterm later this week.

### Questions?

In [1]:
import json
import pandas as pd

In [2]:
from twitter import *

In [3]:
# Loading my authentication tokens
with open('auth_dict','r') as f:
    twtr_auth = json.load(f)

# To make it more readable, let's store
# the OAuth credentials in strings first.
CONSUMER_KEY = twtr_auth['consumer_key']
CONSUMER_SECRET = twtr_auth['consumer_secret']
OAUTH_TOKEN = twtr_auth['token']
OAUTH_TOKEN_SECRET = twtr_auth['token_secret']
    
# Then, we store the OAuth object in "auth"
auth = OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
# Notice that there are four tokens - you need to create these in the
# Twitter Apps dashboard after you have created your own "app".

# We now create the Twitter API object.
t = Twitter(auth=auth)

# The Friendship Graph

### Let's step through *Crawling a Friendship Graph* from *Mining the Social Web*

### From the text:
> You’d like to harvest the IDs of a user’s followers, followers of those followers, followers
of followers of those followers, and so on, as part of a network analysis—essentially
crawling a friendship graph of the “following” relationships on Twitter.

### Note that there is a distinction between followers and *reciprocal* followers - two users who follow one another. Let's call two users following each other *reciprocal friends*.

### Remark: this is distinct from Twitter's own notion of *friend*: someone a user is following. Yes, "friend" is a technical term from Twitter and the API uses it.

### Our ultimate goal is to do analysis on the results of our Twitter mining and then to visualize that analysis. In particular, you will apply community detection to the friendship graph. Further steps could include *profiling* the communities discovered and even *classifying* users - assigning them to a given community - according to predictive account features not used in the initial pattern discovery. That's a good idea for a project.

### To begin, we will write a function that harvests IDs of friends *or* followers of a user. Recall from last time our function *make_twitter_request*. We will use it here.

In [4]:
import sys
import time
from twitter.api import TwitterHTTPError
from urllib.error import URLError
from http.client import BadStatusLine

def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw):
    # A nested helper function that handles common HTTP errors. Returns an updated
    # value for wait_period if the problem is a 500-level error. Blocks until the
    # rate limit is reset if it's a rate limiting issue (429 error). Returns None
    # for 401 and 404 errors, which require special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):
        if wait_period > 3600: # Seconds
            print('Too many retries. Quitting.', file=sys.stderr)
            raise e
        if e.e.code == 401:
            return None
        elif e.e.code == 404:
            print('Encountered 404 Error (Not Found)', file=sys.stderr)
            return None
        elif e.e.code == 429:
            print('Encountered 429 Error (Rate Limit Exceeded)', file=sys.stderr)
            if sleep_when_rate_limited:
                print("Retrying in 15 minutes...ZzZ...", file=sys.stderr)
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print('...ZzZ...Awake now and trying again.', file=sys.stderr)
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print('Encountered %i Error. Retrying in %i seconds' % (e.e.code, wait_period), file=sys.stderr)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e

    # End of nested helper function

    wait_period = 2
    error_count = 0
    while True:
        try:
            return twitter_api_func(*args, **kw)
        except TwitterHTTPError as e:
            error_count = 0
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError as e:
            error_count += 1
            print("URLError encountered. Continuing.", file=sys.stderr)
            if error_count > max_errors:
                print("Too many consecutive errors...bailing out.", file=sys.stderr)
                raise
        except BadStatusLine as e:
            error_count += 1
            print("BadStatusLine encountered. Continuing.", file=sys.stderr)
            if error_count > max_errors:
                print("Too many consecutive errors...bailing out.", file=sys.stderr)
                raise
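
### The retry pattern inside *make_twitter_request* can be seen in isolation in the sketch below. It never contacts Twitter: the names *call_with_backoff*, *TransientError* and *flaky* are hypothetical, introduced only for illustration, and the *sleep* argument is injected so that the example runs instantly.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a 5xx-style error that is worth retrying."""


def call_with_backoff(func, max_wait=3600, sleep=time.sleep):
    # Start at 2 seconds and grow by a factor of 1.5 after each failure,
    # mirroring the wait_period logic in make_twitter_request.
    wait_period = 2
    while True:
        try:
            return func()
        except TransientError:
            if wait_period > max_wait:
                raise  # give up once the wait has grown too long
            sleep(wait_period)
            wait_period *= 1.5


# A fake API call that fails twice and then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated 503")
    return "ok"


waits = []  # record the requested waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

### Passing *waits.append* as the sleep function lets you check which waits were requested without pausing the notebook.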

### Now to define our *get_friends_followers_ids* function.

In [5]:
# This will let us create new partial
# functions with arguments set to 
# certain values.
from functools import partial

# This was maxint.
# There is no longer a maxint (in Python 3)
from sys import maxsize


def get_friends_followers_ids(twitter_api, screen_name=None, user_id=None,
                                friends_limit=maxsize, followers_limit=maxsize):
    # Must have exactly one of screen_name or user_id (logical xor)
    assert (screen_name is not None) != (user_id is not None), \
    "Must have screen_name or user_id, but not both"
    
    # You can also do this with a function closure.
    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids,
                                count=5000)
    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids,
                                count=5000)
    friends_ids, followers_ids = [], []
    for twitter_api_func, limit, ids, label in [
            [get_friends_ids, friends_limit, friends_ids, "friends"],
            [get_followers_ids, followers_limit, followers_ids, "followers"]
            ]:
        #LOOK HERE! This little line is important.
        if limit == 0: continue
        cursor = -1
        while cursor != 0:
            # Use make_twitter_request via the partially bound callable...
            if screen_name:
                response = twitter_api_func(screen_name=screen_name, cursor=cursor)
            else: # user_id
                response = twitter_api_func(user_id=user_id, cursor=cursor)
            if response is not None:
                ids += response['ids']
                cursor = response['next_cursor']
            print('Fetched {0} total {1} ids for {2}'.format(len(ids),
                    label, (user_id or screen_name)), file=sys.stderr)
            if len(ids) >= limit or response is None:
                break
    # Do something useful with the IDs, like store them to disk...
    return friends_ids[:friends_limit], followers_ids[:followers_limit]


In [None]:
friends_ids, followers_ids = get_friends_followers_ids(t,
                                screen_name="ZedShaw",
                                friends_limit=10,
                                followers_limit=10)
print(friends_ids)
print(followers_ids)
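
### The cursor loop in *get_friends_followers_ids* is easier to follow on fake data. The sketch below assumes a hypothetical in-memory endpoint (*fake_ids_endpoint* and *PAGES* are made up for illustration) that behaves like a cursored Twitter endpoint: the first page is requested with cursor -1, and a *next_cursor* of 0 means there are no more pages.

```python
from functools import partial

# Hypothetical paged data keyed by cursor value.
PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 10},
    10: {"ids": [4, 5, 6], "next_cursor": 20},
    20: {"ids": [7], "next_cursor": 0},
}


def fake_ids_endpoint(user_id, cursor, count):
    # count is accepted only so that partial() has an argument to bind.
    return PAGES[cursor]


def fetch_ids(endpoint, user_id, limit):
    ids, cursor = [], -1
    # Keep requesting pages until the cursor is exhausted or we have enough.
    while cursor != 0 and len(ids) < limit:
        response = endpoint(user_id=user_id, cursor=cursor)
        ids += response["ids"]
        cursor = response["next_cursor"]
    return ids[:limit]


# Bind count once, as get_friends_followers_ids does with count=5000.
get_ids = partial(fake_ids_endpoint, count=5000)

print(fetch_ids(get_ids, user_id=42, limit=100))  # walks all three pages
print(fetch_ids(get_ids, user_id=42, limit=4))    # stops after the second page
```

### With a limit of 0 the loop body never runs, which is why *get_friends_followers_ids* can skip a list entirely when its limit is 0.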

### We are almost ready to define our own version of the *crawl_followers* function from the text. Instead of writing to MongoDB, we write to a CSV file. Below we define a function that will play the role of *save_to_mongo*. 

In [6]:
# Create an empty data frame (column headers only),
# and write it to a CSV file.
df = pd.DataFrame(columns=['ID','followers'])
df.to_csv('followers.csv', index=False)

# Our function
def save_followers(fid, followers):
    df = pd.DataFrame([[fid, followers]], columns=['ID','followers'])
    with open('followers.csv', 'a') as f:
        df.to_csv(f,header=False, index=False)

### Note that *save_followers* is hardcoded to write to a particular file. This is generally not the best practice.
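
### One way to avoid the hardcoded file name is to pass the path in as an argument. The sketch below is an assumption-laden variant, not the version used in this notebook: it uses the standard library *csv* module instead of pandas, and the names *init_followers_file* and *save_followers_to* are hypothetical.

```python
import csv
import os
import tempfile


def init_followers_file(path):
    # Write just the header row, overwriting any existing file.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(["ID", "followers"])


def save_followers_to(path, fid, followers):
    # Append one row: the user ID and a comma-joined string of follower IDs.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([fid, followers])


# Usage with a throwaway file so nothing in the working directory is touched.
path = os.path.join(tempfile.mkdtemp(), "followers_demo.csv")
init_followers_file(path)
save_followers_to(path, "123", "4,5,6")

with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```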

### Below we define *crawl_followers*. How would you change it in order to create a *crawl_friends* function?

In [7]:
def crawl_followers(twitter_api, screen_name, limit=1000000, depth=2):
    
    # Resolve the ID for screen_name and start working with IDs for consistency
    seed_id = str(twitter_api.users.show(screen_name=screen_name)['id'])
    _, next_queue = get_friends_followers_ids(twitter_api, user_id=seed_id,
                        friends_limit=0, followers_limit=limit)
    
    # Store a seed_id => follower_ids mapping in the CSV file
    save_followers(seed_id, ','.join([str(x) for x in next_queue]))
    
    d = 1
    # Note that in the example in the next cell,
    # we never enter this loop.
    while d < depth:
        d += 1
        # Reset the next_queue so that we can
        # start building up the next level
        # of followers-of-followers
        (queue, next_queue) = (next_queue, [])
        # Loop through the current
        # level of followers
        for fid in queue:
            _, follower_ids = get_friends_followers_ids(twitter_api, user_id=fid,
                                friends_limit=0, followers_limit=limit)
            # Store an ID with a string recording
            # IDs of followers of the user with ID "fid"
            save_followers(str(fid), ','.join([str(x) for x in follower_ids]))
            # Extending the list
            next_queue += follower_ids


### Finally, let's try it and see what we get.

In [None]:
screen_name = "timoreilly"
crawl_followers(t, screen_name, depth=1, limit=10)

### Let's do it again, this time letting *crawl_followers* enter its *while* loop.

In [None]:
crawl_followers(t, screen_name, depth=2, limit=10)
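
### To see what the *depth* argument does without spending API calls, here is a sketch of the same queue logic on a toy follower graph. The data in *FOLLOWERS* and the name *crawl_toy* are hypothetical; the function returns a dictionary instead of writing to a CSV file.

```python
# Hypothetical follower data: user ID -> IDs of that user's followers.
FOLLOWERS = {
    1: [2, 3],
    2: [4],
    3: [4, 5],
    4: [],
    5: [1],
}


def crawl_toy(seed_id, depth):
    # Maps each crawled ID to its follower list, as followers.csv does.
    saved = {seed_id: FOLLOWERS[seed_id]}
    next_queue = list(FOLLOWERS[seed_id])
    d = 1
    while d < depth:
        d += 1
        # Swap queues: process the current level, collect the next one.
        queue, next_queue = next_queue, []
        for fid in queue:
            saved[fid] = FOLLOWERS[fid]
            next_queue += FOLLOWERS[fid]
    return saved


print(crawl_toy(1, depth=1))  # only the seed's followers are recorded
print(crawl_toy(1, depth=2))  # followers of the seed's followers as well
```

### At depth 3 user 4 would be queued twice, because neither this sketch nor *crawl_followers* keeps track of IDs already visited. On real data that means repeated API calls for the same user.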

# Assignment 2 - More Details

### You must read this in order to do Assignment 2 properly. As stated, the assignment is quite vague and asks you to do something prohibitively expensive in terms of time and API calls to Twitter.

### You have finite time and resources, therefore *do not attempt to build the entire friendship graph of a user*. Instead follow the steps below.

1. Select a user to start with.
2. Retrieve IDs of that user's friends and followers. These are two lists. You may set the max results, as we have here, to 5000.
3. Use the two lists from 2. in order to find the *reciprocal friends* of the user from 1. These are the *distance-1* friends of that user.
4. From that list of distance-1 friends, select the **5 most popular** as determined by their **followers_count** values in user profiles. Use, for instance, the *get_user_profile* function from *Mining the Social Web*.
5. Repeat steps 2, 3 & 4 for each of the distance-1 friends selected, then distance-2, and so on until you have gathered at least 100 nodes (users) for your social network.
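
### The steps above can be sketched on toy data as follows. This is one possible reading of the procedure, not a reference solution: the dictionaries *FRIENDS*, *FOLLOWERS* and *FOLLOWERS_COUNT* are hypothetical stand-ins for API responses, and the function names are made up for illustration.

```python
# Hypothetical toy data standing in for API responses.
FRIENDS = {  # user -> users they follow
    "a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b", "e"],
    "d": [], "e": ["c"],
}
FOLLOWERS = {  # user -> users who follow them
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "e"],
    "d": ["a"], "e": ["c"],
}
FOLLOWERS_COUNT = {"a": 50, "b": 10, "c": 300, "d": 5, "e": 70}


def reciprocal_friends(user):
    # Step 3: users who both follow and are followed by `user`.
    return set(FRIENDS[user]) & set(FOLLOWERS[user])


def most_popular(users, k=5):
    # Step 4: the k users with the largest followers_count.
    return sorted(users, key=lambda u: FOLLOWERS_COUNT[u], reverse=True)[:k]


def build_network(seed, k=5, min_nodes=100):
    # Step 5: expand level by level until enough nodes are gathered
    # or there is nobody new left to visit.
    nodes, edges = {seed}, set()
    frontier = [seed]
    while frontier and len(nodes) < min_nodes:
        next_frontier = []
        for user in frontier:
            for friend in most_popular(reciprocal_friends(user), k):
                edges.add(frozenset((user, friend)))  # undirected edge
                if friend not in nodes:
                    nodes.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return nodes, edges


nodes, edges = build_network("a", k=1)
print(sorted(nodes), len(edges))
```

### The toy graph is far too small to reach 100 nodes, so the loop stops when the frontier is empty. The resulting node and edge sets could then be handed to a graph library for the analysis and plotting.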

### After you have built a network in this way, proceed with the network analysis and plotting as described in the Assignment 2 file on BB.

### Example - Finding Reciprocal Friends

In [14]:
# Selecting a user
screen_name = 'zedshaw'

# Getting friends' ids
response = make_twitter_request(t.friends.ids,
                                screen_name=screen_name, count = 5000)
friends = response["ids"]

# Getting followers' ids
response = make_twitter_request(t.followers.ids,
                                screen_name=screen_name, count = 5000)
followers = response["ids"]

# Computing reciprocal friends
reciprocal_friends = set(friends) & set(followers)

In [None]:
reciprocal_friends