# DSCI 511: Data acquisition and pre-processing<br>Chapter 3: Acquiring Data from the Internet
## Exercises
Note: numberings refer to the main notes.

#### 3.1.2.3 Exercise: processing a JSON response
Make a request to the SEPTA Arrivals API to get data on the next 10 trains to arrive at Suburban Station. Store this JSON-format data into a dictionary. Inspect the dictionary structure. Then, write code to create a list containing 10 dictionaries, one for each train. These new dictionaries should look like this:

#### Discussion: figuring out what we got. 
Putting together and modifying the example URL from Section 3.1.2.2 was relatively straightforward, especially since Suburban Station appears with its station ID directly in the docs. The harder part is probably figuring out how to extract the information from the JSON response. As it turns out, we got 20 trains: 10 each going Northbound and Southbound! 
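One way to figure out what came back is to print only the shape of the response (keys, types, list lengths) rather than the full payload. The following is a minimal sketch: `describe` is a hypothetical helper, and `sample` is made-up data arranged the way the response is described here (a single top-level key, then a list of per-direction dictionaries), so the real key names may differ.

```python
def describe(obj, depth=0, max_depth=6):
    """Return lines summarizing the nesting of dicts and lists in obj."""
    pad = "  " * depth
    if depth >= max_depth:
        return [pad + "..."]
    if isinstance(obj, dict):
        lines = []
        for key, value in obj.items():
            lines.append(f"{pad}{key!r}: {type(value).__name__}")
            # only containers are worth descending into
            if isinstance(value, (dict, list)):
                lines.extend(describe(value, depth + 1, max_depth))
        return lines
    if isinstance(obj, list):
        lines = [f"{pad}list of {len(obj)}"]
        if obj:
            # the first element stands in for the rest of the list
            lines.extend(describe(obj[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(obj).__name__}"]

# made-up stand-in for response.json(); not real API output
sample = {
    "made-up timestamp key": [
        {"Northbound": [{"direction": "N", "line": "Warminster", "track": "1"}]},
        {"Southbound": [{"direction": "S", "line": "Media/Elwyn", "track": "6"}]},
    ]
}

print("\n".join(describe(sample)))
```

Calling `describe(response.json())` on the real response should show the same kind of outline, which makes the nested loops in the solution easier to plan.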

In [None]:
import requests
from pprint import pprint

response = requests.get("http://www3.septa.org/hackathon/Arrivals/Suburban Station/10")

# print(response.json())

In [None]:
# example of train dictionary format
train_dict = {
    'direction': 'S',
     'line': 'Media/Elwyn',
     'sched_time': '2018-08-22 17:31:01.000',
     'status': 'On Time',
     'track': '6'
}

pprint(train_dict)

{'direction': 'S',
 'line': 'Media/Elwyn',
 'sched_time': '2018-08-22 17:31:01.000',
 'status': 'On Time',
 'track': '6'}


In [None]:
data = response.json()
top_keys = list(data.keys())
# pprint(data[top_keys[0]][0]["Northbound"])

trains = []
for timestamp in data: ## timestamp is the sole key at the top level of response
    for outbound_direction in data[timestamp]: ## each track direction gets its own dictionary
        for direction in outbound_direction:
            for train in outbound_direction[direction]:
                trains.append({
                    'direction': train['direction'],
                    'line': train['line'],
                    'sched_time': train['sched_time'],
                    'status': train['status'],
                    'track': train['track']
                })

pprint(trains)

[{'direction': 'N',
  'line': 'Warminster',
  'sched_time': '2018-10-16 19:04:00.000',
  'status': '1 min',
  'track': '1'},
 {'direction': 'N',
  'line': 'Paoli/Thorndale',
  'sched_time': '2018-10-16 19:19:00.000',
  'status': '3 min',
  'track': '1'},
 {'direction': 'N',
  'line': 'Media/Elwyn',
  'sched_time': '2018-10-16 19:21:00.000',
  'status': '1 min',
  'track': '2'},
 {'direction': 'N',
  'line': 'Fox Chase',
  'sched_time': '2018-10-16 19:22:00.000',
  'status': 'On Time',
  'track': '1'},
 {'direction': 'N',
  'line': 'Trenton',
  'sched_time': '2018-10-16 19:22:00.000',
  'status': '8 min',
  'track': '1'},
 {'direction': 'N',
  'line': 'Cynwyd',
  'sched_time': '2018-10-16 19:28:00.000',
  'status': 'On Time',
  'track': '6'},
 {'direction': 'N',
  'line': 'Airport',
  'sched_time': '2018-10-16 19:34:00.000',
  'status': 'On Time',
  'track': '2'},
 {'direction': 'N',
  'line': 'Chestnut Hill West',
  'sched_time': '2018-10-16 19:36:00.000',
  'status': 'On Time',
  'tra

In [None]:
data = response.json()
top_keys = list(data.keys())
# pprint(data[top_keys[0]][0]["Northbound"])

train_keys = ['direction', 'line', 'sched_time', 'status', 'track']

trains = []
for timestamp in data: ## timestamp is the sole key at the top level of response
    for outbound_direction in data[timestamp]: ## each track direction gets its own dictionary
        for direction in outbound_direction:
            for train in outbound_direction[direction]:
                trains.append({
                    train_key: train[train_key]
                    for train_key in train_keys
                })
print(len(trains))
pprint(trains)

20
[{'direction': 'N',
  'line': 'Warminster',
  'sched_time': '2018-10-16 19:04:00.000',
  'status': '1 min',
  'track': '1'},
 {'direction': 'N',
  'line': 'Paoli/Thorndale',
  'sched_time': '2018-10-16 19:19:00.000',
  'status': '3 min',
  'track': '1'},
 {'direction': 'N',
  'line': 'Media/Elwyn',
  'sched_time': '2018-10-16 19:21:00.000',
  'status': '1 min',
  'track': '2'},
 {'direction': 'N',
  'line': 'Fox Chase',
  'sched_time': '2018-10-16 19:22:00.000',
  'status': 'On Time',
  'track': '1'},
 {'direction': 'N',
  'line': 'Trenton',
  'sched_time': '2018-10-16 19:22:00.000',
  'status': '8 min',
  'track': '1'},
 {'direction': 'N',
  'line': 'Cynwyd',
  'sched_time': '2018-10-16 19:28:00.000',
  'status': 'On Time',
  'track': '6'},
 {'direction': 'N',
  'line': 'Airport',
  'sched_time': '2018-10-16 19:34:00.000',
  'status': 'On Time',
  'track': '2'},
 {'direction': 'N',
  'line': 'Chestnut Hill West',
  'sched_time': '2018-10-16 19:36:00.000',
  'status': 'On Time',
  '

#### Alternative solution (dictionary comprehensions)
#### Discussion: 

In [None]:
# code goes here

arrivals_response = requests.get("http://www3.septa.org/hackathon/Arrivals/Suburban Station/10")
arrivals_dict = arrivals_response.json()

list(arrivals_dict.keys())

master_key = list(arrivals_dict.keys())[0]
northbound_trains = arrivals_dict[master_key][0]
southbound_trains = arrivals_dict[master_key][1]
trains = []
for train in southbound_trains["Southbound"]:
    trains.append({
        ky: train[ky]
        for ky in train_dict
    })
trains

#### 3.2.1.1 Exercise: accessing a soccer schedule

Make a request to the Sportradar Soccer schedule API to obtain the match schedule for Liverpool FC (team_id = sr:competitor:44). Then, from the obtained schedule, make a simple list of fixtures. Your output should be a list with strings as elements. The strings should be of the format "HOME_TEAM vs AWAY_TEAM".

#### Discussion: sometimes it's easier to work with support data than logic
This solution is a great example of when data can simplify code. If we had wanted to, we could have used `if/else` logic gates to make sure the home and away teams were always listed in the right order as we construct our fixtures. But creating the `fixture` object as a dictionary with two keys: `'home'` and `'away'` was strategic: using the _value_ of each `competitor`'s `'qualifier'` field (i.e., role as `'home'` and `'away'`) allowed us to just focus on routing each team to its appropriate position in the fixture _associatively_. 

In [None]:
import requests

# code goes here
soccer_key = ""
address = "https://api.sportradar.us/soccer-xt3/eu/en/teams/sr:competitor:44/schedule.json?api_key=" + soccer_key

response = requests.get(address)

data = response.json()

fixtures = []
for match in data['schedule']:
    fixture = {
        "home": "",
        "away": ""
    }
    for competitor in match['competitors']:
#         pprint(competitor)
        fixture[competitor['qualifier']] = competitor['name']
    
    fixtures.append(
        fixture['home']+" vs "+fixture['away']
    )
pprint(fixtures)

['Huddersfield Town vs Liverpool FC',
 'Liverpool FC vs FK Red Star Belgrade',
 'Liverpool FC vs Cardiff City',
 'Arsenal FC vs Liverpool FC',
 'FK Red Star Belgrade vs Liverpool FC',
 'Liverpool FC vs Fulham FC',
 'Watford FC vs Liverpool FC',
 'Paris Saint-Germain vs Liverpool FC',
 'Liverpool FC vs Everton FC',
 'Burnley FC vs Liverpool FC',
 'AFC Bournemouth vs Liverpool FC',
 'Liverpool FC vs SSC Napoli',
 'Liverpool FC vs Manchester United',
 'Wolverhampton Wanderers vs Liverpool FC',
 'Liverpool FC vs Newcastle United',
 'Liverpool FC vs Arsenal FC',
 'Manchester City vs Liverpool FC',
 'Brighton & Hove Albion FC vs Liverpool FC',
 'Liverpool FC vs Crystal Palace',
 'Liverpool FC vs Leicester City',
 'West Ham United vs Liverpool FC',
 'Liverpool FC vs AFC Bournemouth',
 'Manchester United vs Liverpool FC',
 'Liverpool FC vs Watford FC',
 'Everton FC vs Liverpool FC',
 'Liverpool FC vs Burnley FC',
 'Fulham FC vs Liverpool FC',
 'Liverpool FC vs Tottenham Hotspur',
 'Southampton

#### Alternative Solution Discussion
As it turns out, since the data appear to always be presented in home vs. away order in the list of competitors we can just use order to construct and print out the fixtures:

In [None]:
import requests
from pprint import pprint

# code goes here
key = ""
URL = "https://api.sportradar.us/soccer-xt3/eu/en/teams/sr:competitor:44/schedule.json?api_key=" + key
resp = requests.get(URL)
data = resp.json()

for game in data['schedule']:
  teams = []
  for competitor in game['competitors']:
    teams.append(competitor['name'])
  # print(teams[0] + " vs. " + teams[1])
  print(" vs. ".join(teams))
  # break

Everton FC vs. Liverpool FC
Ajax Amsterdam vs. Liverpool FC
Liverpool FC vs. Sheffield United FC
Liverpool FC vs. FC Midtjylland
Liverpool FC vs. West Ham United FC
Atalanta BC vs. Liverpool FC
Manchester City FC vs. Liverpool FC
Liverpool FC vs. Leicester City FC
Liverpool FC vs. Atalanta BC
Brighton & Hove Albion FC vs. Liverpool FC
Liverpool FC vs. Ajax Amsterdam
Liverpool FC vs. Wolverhampton Wanderers FC
FC Midtjylland vs. Liverpool FC
Fulham FC vs. Liverpool FC
Liverpool FC vs. Tottenham Hotspur FC
Crystal Palace FC vs. Liverpool FC
Liverpool FC vs. West Bromwich Albion
Newcastle United FC vs. Liverpool FC
Southampton FC vs. Liverpool FC
Liverpool FC vs. Burnley FC
Liverpool FC vs. Manchester United FC
Tottenham Hotspur FC vs. Liverpool FC
West Ham United FC vs. Liverpool FC
Liverpool FC vs. Brighton & Hove Albion FC
Liverpool FC vs. Manchester City FC
Leicester City FC vs. Liverpool FC
Liverpool FC vs. Everton FC
Sheffield United FC vs. Liverpool FC
Liverpool FC vs. Fulham FC
Wo
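Since the home-then-away ordering is only an observation about the data, it may be worth checking before relying on it. The sketch below compares the order-based fixture with the `'qualifier'`-based one. The function names are hypothetical, and `sample_schedule` is made-up data (its second match is deliberately listed away-first) that carries only the `'name'` and `'qualifier'` fields used above.

```python
def fixture_by_order(match):
    """Build the fixture string from list order alone."""
    return " vs ".join(c["name"] for c in match["competitors"])

def fixture_by_qualifier(match):
    """Build the fixture string by routing teams on their 'qualifier'."""
    roles = {c["qualifier"]: c["name"] for c in match["competitors"]}
    return roles["home"] + " vs " + roles["away"]

def out_of_order(schedule):
    """Return the qualifier-based fixtures that list order would get wrong."""
    return [
        fixture_by_qualifier(match)
        for match in schedule
        if fixture_by_order(match) != fixture_by_qualifier(match)
    ]

# made-up schedule; the second match lists the away team first on purpose
sample_schedule = [
    {"competitors": [{"name": "Liverpool FC", "qualifier": "home"},
                     {"name": "Everton FC", "qualifier": "away"}]},
    {"competitors": [{"name": "Liverpool FC", "qualifier": "away"},
                     {"name": "Fulham FC", "qualifier": "home"}]},
]

print(out_of_order(sample_schedule))
```

An empty list from `out_of_order(data['schedule'])` would support using order alone for that response; anything else suggests falling back to the qualifier-based approach.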

#### 3.3.3.1 Exercise: access some accidental haikus from Twitter's REST API
Create your Twitter API keys and download the last 15 tweets by @accidental575 (the hilarious Accidental Haiku Bot).

#### Discussion: just drop your keys in, and start accessing tweets
Working with a client is _very_ convenient, but clients like this exist only because access is so valued and controlled. If you haven't, sign up for a developer account and create an app today to get working with Twitter's API. 

In [None]:
from twython import Twython

access_token = ''
access_token_secret = ''
consumer_key = ''
consumer_secret = ''

twitter = Twython(consumer_key, consumer_secret)

haiku_twitter = twitter.get_user_timeline(screen_name = "accidental575")

In [None]:
for tweet in haiku_twitter[:15]:
    print(tweet['text']+"\n")

Just writing to let /
everyone know I have a /
new profile picture /
#accidentalhaiku by @BHump_12 
https://t.co/T2usPc5C0S

freelance is great cause /
sometimes you don’t wear pants for /
an entire day /
#accidentalhaiku by @mattgee 
https://t.co/y48pVrBc8D

Tell me you love me /
started playing at Starbucks /
and I gasped out loud /
#accidentalhaiku by @ashmj21 
https://t.co/Rfml47ypDL

My dad is singing /
Disney hits with me in the /
car! I’m so happy ❤️ /
#accidentalhaiku by @MakaylaBickhart 
https://t.co/FmLlroSBVv

Wow that’s a lot of /
instructions on how to use /
a public restroom! /
#accidentalhaiku by @kbakies 
https://t.co/cE3OUt0Exh

there are squirrels in /
the Grand Canyon that carry /
the bubonic plague /
#accidentalhaiku by @luckyenoughlin1 
https://t.co/bW8X9tBRDp

apparently he /
donated it to his own /
foundation - #taxdodge /
#accidentalhaiku by @woolkebb 
https://t.co/YNnXKnzul3

Really Ain't Tryna /
Go To Buffalo Wild Wings /
But Imma Have To /
#accidentalhaiku by

## Additional In-depth Exercises
### A. Building Twitter reply chains
Since Twitter provides no direct functionality for downloading reply chains, alternative strategies are needed. Some good suggestions can be found here on github:

- https://github.com/alok/thread-twitter/blob/master/thread_tweets.py
- https://gist.github.com/edsu/54e6f7d63df3866a87a15aed17b51eaf

and the latter provides some excellent explanation for this development need:

> Twitter's API doesn't allow you to get replies to a particular tweet. Strange
but true. But you can use Twitter's Search API to search for tweets that are
directed at a particular user, and then search through the results to see if 
any are replies to a given tweet. You probably are also interested in the
replies to any replies as well, so the process is recursive. The big caveat 
here is that the search API only returns results for the last 7 days. So 
you'll want to run this sooner rather than later.

Apply these techniques using Twython, by searching over source-user mentions from either of the following endpoints:

- https://developer.twitter.com/en/docs/tweets/search/overview/standard
- https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-mentions_timeline

to exhibit how replies to a given message are obtained. To continue, store the tweets one tweet per line in a 'json lines' file, and organize the thread's Tweet IDs in their natural, hierarchical graph/network structure by utilizing dictionaries and json serialization.
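The storage half of this task can be sketched without touching the API. The example below is a minimal illustration, assuming each tweet is a dictionary with `id` and `in_reply_to_status_id` fields; the helper names and `sample_tweets` are made up for the demonstration.

```python
import json
import os
import tempfile
from collections import defaultdict

# made-up tweets carrying only the two fields this sketch relies on
sample_tweets = [
    {"id": 1, "in_reply_to_status_id": None},
    {"id": 2, "in_reply_to_status_id": 1},
    {"id": 3, "in_reply_to_status_id": 1},
    {"id": 4, "in_reply_to_status_id": 2},
]

def write_json_lines(tweets, path):
    """Write one JSON-serialized tweet per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")

def read_json_lines(path):
    """Read the tweets back, skipping any blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def build_thread(tweets, root_id):
    """Nest tweet ids under the id of the tweet they reply to."""
    children = defaultdict(list)
    for tweet in tweets:
        parent = tweet["in_reply_to_status_id"]
        if parent is not None:
            children[parent].append(tweet["id"])

    def subtree(tid):
        # JSON object keys must be strings, so ids are stored as str
        return {str(child): subtree(child) for child in children[tid]}

    return {str(root_id): subtree(root_id)}

path = os.path.join(tempfile.mkdtemp(), "thread.jsonl")
write_json_lines(sample_tweets, path)
thread = build_thread(read_json_lines(path), root_id=1)
print(json.dumps(thread, indent=2))
```

Storing ids as strings in the nested dictionary keeps the structure serializable with `json.dumps`, at the cost of converting back to integers when looking tweets up by `id`.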

In [None]:
from twython import Twython

access_token = ''
access_token_secret = ''
consumer_key = ''
consumer_secret = ''

twitter = Twython(consumer_key, consumer_secret)
screen_name = 'drexeluniv'

#### Looking at mentions
What we're doing with the `twitter.search()` method is observing all tweets in the standard REST API's sample that 'mention' a user, i.e., by tweeting with an `@<screen_name>` in their tweet text.

Reviewing the output, since the first batch of mentions doesn't have one that 'targets' our tweet `id` in the `in_reply_to_status_id` field, we'll have to figure out how to get the next 'page'/batch of tweets.

In [None]:
results = twitter.search(q = "@" + screen_name, count = 100)

In [None]:
print(len(results['statuses']))

100


In [None]:
tweet_id = 1250762697865678849
for tweet in results['statuses']:
    if tweet['in_reply_to_status_id'] == tweet_id:
        pprint(tweet)
        break

#### Batch size
The batches are of size 100, and the API documentation doesn't seem to indicate how many 'pages' of results might be available from the search. We just know that we have access to 450 requests in a 15-minute window (see: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets), and that we can use the `since_id` and `max_id` parameters to retrieve 'pages' of search results after and before (inclusive), separately. 

So then, why isn't the `max_id` the same tweet `id` as `results['statuses'][-1]['id']`? Well, there are perhaps two things going on, and some lack of explainability. First, while the search results aren't a 'timeline', they are paginated, so this blog post (by Twitter's team) is really helpful:

- https://developer.twitter.com/en/docs/tweets/timelines/guides/working-with-timelines

The important aspect is that `max_id` is inclusive, i.e., the response starts with the tweet having that `id`, so reusing the last `id` from one batch would return that tweet a second time. Twitter has worked out a (strange, perhaps) solution to this, explained in the above post:

> While one redundant Tweet is not terribly inefficient, it is still possible to optimize max_id requests to address this problem if your platform is capable of working with 64-bit integers. Environments where a Tweet ID cannot be represented as an integer with 64 bits of precision (such as JavaScript) should skip this step. Subtract 1 from the lowest Tweet ID returned from the previous request and use this for the value of max_id. It does not matter if this adjusted max_id is a valid Tweet ID, or if it corresponds with a Tweet posted by a different user - the value is just used to decide which Tweets to filter. When adjusted in this manner, it is possible to page through a timeline without receiving redundant Tweets

i.e., <b>we should technically be using one less from the numerically smallest tweet id in the batch!</b>
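The subtract-one rule can be tried out offline. In the sketch below, `fake_search` is a made-up stand-in for a search call (it is not part of Twython or Twitter's API): it returns ids newest-first and treats `max_id` as inclusive, which is the behavior the quoted guide describes.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search call: newest-first ids, with max_id inclusive."""
    eligible = [
        i for i in sorted(all_ids, reverse=True)
        if max_id is None or i <= max_id
    ]
    return eligible[:count]

def page_through(all_ids, count):
    """Collect every id by repeatedly lowering max_id below the last batch."""
    collected = []
    max_id = None
    while True:
        batch = fake_search(all_ids, count, max_id)
        if not batch:
            break
        collected.extend(batch)
        # one less than the smallest id seen, so no tweet is returned twice
        max_id = min(batch) - 1
    return collected

ids = list(range(100, 110))          # ten made-up tweet ids
paged = page_through(ids, count=4)   # batches of 4, 4, and 2
print(paged)
```

Dropping the `- 1` in this sketch would make every batch begin with the last id of the previous one, and since that id is always eligible the loop would never reach an empty batch.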

Unfortunately, we can no longer see our target in this next batch (as of 4/22), and have to proceed to the `next_next_batch`. 

In [None]:
from pprint import pprint
pprint(results['search_metadata'])
print(results['search_metadata']['next_results'][8:27])

{'completed_in': 0.109,
 'count': 100,
 'max_id': 1252974332487311365,
 'max_id_str': '1252974332487311365',
 'next_results': '?max_id=1252240101012320255&q=%40drexeluniv&count=100&include_entities=1',
 'query': '%40drexeluniv',
 'refresh_url': '?since_id=1252974332487311365&q=%40drexeluniv&include_entities=1',
 'since_id': 0,
 'since_id_str': '0'}
1252240101012320255


In [None]:
pprint(results['statuses'][-1]['id'])

1252240101012320256


In [None]:
max_id = min([tweet['id'] for tweet in results['statuses']]) - 1
print(max_id)

1252240101012320255


In [None]:
next_results = twitter.search(q = "@" + screen_name, count = 100, 
                              max_id = max_id)

In [None]:
print(len(next_results['statuses']))

100


In [None]:
tweet_id = 1250762697865678849
for tweet in next_results['statuses']:
    if tweet['in_reply_to_status_id'] == tweet_id:
        pprint(tweet)
        break

In [None]:
max_id = min([tweet['id'] for tweet in next_results['statuses']]) - 1
next_next_results = twitter.search(q = "@" + screen_name, count = 100, 
                              max_id = max_id)
print(len(next_next_results['statuses']))

100


In [None]:
tweet_id = 1250762697865678849
for tweet in next_next_results['statuses']:
    if tweet['in_reply_to_status_id'] == tweet_id:
        pprint(tweet)
        break

{'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Apr 16 13:00:58 +0000 2020',
 'entities': {'hashtags': [],
              'symbols': [],
              'urls': [{'display_url': 'twitter.com/i/web/status/1…',
                        'expanded_url': 'https://twitter.com/i/web/status/1250771163753652225',
                        'indices': [117, 140],
                        'url': 'https://t.co/yu0izRs4Xp'}],
              'user_mentions': [{'id': 17809006,
                                 'id_str': '17809006',
                                 'indices': [0, 11],
                                 'name': 'Drexel University',
                                 'screen_name': 'DrexelUniv'},
                                {'id': 15599047,
                                 'id_str': '15599047',
                                 'indices': [12, 18],
                                 'name': 'LeBow College',
                                 'screen_name': 'LeBow'},
                  

#### What about the mention timeline feature?
Reviewing this approach as a way to process mentions more efficiently, it unfortunately appears that the feature is only available to authenticated users, i.e., we can only see our own mentions timeline, or that of someone who logs into our 'application'. So, we'll have to stick to the search feature for now, regardless.

In [None]:
mentions_timeline_results = twitter.get_mentions_timeline(name = screen_name, count = 200)

TwythonError: Twitter API returned a 403 (Forbidden), Your credentials do not allow access to this resource

#### Building out the necessary functions
The basic process of finding mentions of a target tweet (as above) can be thought of as 'leaf/branch discovery', mining out the new branches in the thread. So now that we have our technique (scroll through mentions and find any targeting our tweet), we just need to build a stack/queue of 'replying' mentions in the thread, and re-apply our mentions filtering algorithm until the stack/queue is exhausted.

So, we need functions for these purposes.

#### `get_mentions()`
This first function simply requests up to five batches of mentions. The number of batches is really a parameter of the implementation; a different stopping mechanism or an incremental application would be preferable, and both are directions for future development. As it turns out, this process is very call-intensive, so optimizing the calls will be essential over time.

In [None]:
def get_mentions(screen_name = 'drexeluniv'):
    mentions = dict()
    results = twitter.search(q = "@" + screen_name, count = 100, tweet_mode='extended')
    call_count = 1
    while results['statuses']:
        ## store the current batch, keyed by tweet id
        for tweet in results['statuses']:
            mentions[tweet['id']] = tweet
        ## stop after five calls, so the fifth batch is stored rather than discarded
        if call_count >= 5:
            break
        ## page backwards: one less than the smallest id seen in this batch
        max_id = min([tweet['id'] for tweet in results['statuses']]) - 1
        results = twitter.search(q = "@" + screen_name, count = 100, 
                                 tweet_mode='extended', max_id = max_id)
        call_count += 1
    return mentions

#### `filter_mentions()`
This function's job is to look through a batch of mentions and keep any which target tweets in a designated set. The reasoning for this is to move towards a batch implementation, assuming that users will often repeatedly mention (reply to) each other. However, this aspect of the function is not presently fully utilized and would be a direction for development.

In [None]:
def filter_mentions(mentions, screen_name = '', targets = set()):
    filtered = {}
    for tid in mentions:
        if 'in_reply_to_screen_name' in mentions[tid]:
            if ((screen_name and (mentions[tid]['in_reply_to_screen_name'] == screen_name)) or
                (not screen_name and (mentions[tid]['in_reply_to_status_id'] in targets))):
                filtered[tid] = mentions[tid]
    return filtered

#### `get_threads()`
This is the control function which manages the mentions acquisition and filtration process. Most importantly, the function manages the `target_users` queue (first in, first out) of users to have their mentions filtered for replies. 

This simple implementation really just relies on the `target_users` object's boolean value in a `while` loop, and on managing the state of <i>both</i> the `target_users` and `targets` objects, the latter of which associates the target tweet `id`s with the mentioning users' names, i.e., so we can filter their mentions for any subsequent replies.

This queue implementation is simple, and differs from the inspirational github gist:

- https://gist.github.com/edsu/54e6f7d63df3866a87a15aed17b51eaf

which uses recursion. So in addition to the other development directions, another important task will be reviewing the efficiency of a recursive implementation, and its convenience with respect to effectively managing rate limits.
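To make the queue-versus-recursion comparison concrete, here is a small offline sketch. The `replies` dictionary is made-up data standing in for the lookups that would otherwise require API calls, and both function names are hypothetical.

```python
from collections import deque

# made-up reply relationships: parent tweet id -> ids of its direct replies
replies = {1: [2, 3], 2: [4], 4: [5]}

def collect_iterative(root):
    """Breadth-first discovery using a first-in, first-out queue."""
    found, queue = [], deque([root])
    while queue:
        current = queue.popleft()
        for child in replies.get(current, []):
            found.append(child)
            queue.append(child)
    return found

def collect_recursive(root):
    """Depth-first discovery: follow each reply's own replies right away."""
    found = []
    for child in replies.get(root, []):
        found.append(child)
        found.extend(collect_recursive(child))
    return found

print(collect_iterative(1))
print(collect_recursive(1))
```

Both versions find the same replies and differ only in visiting order. With a real API behind each lookup, the queue version keeps all pending work in one place, which may make it easier to pause when a rate limit is reached.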

In [None]:
from collections import defaultdict

def get_threads(tid = 1250762697865678849, screen_name = 'drexeluniv'):
    
    mentions = {screen_name: get_mentions(screen_name)}
    tweets = filter_mentions(mentions[screen_name], targets = set([tid]))
    
    tweets_to_get = {tweets[tid]['in_reply_to_status_id'] for tid in tweets}
    ##
    targets = defaultdict(set)
    for tid in tweets:
        targets[tweets[tid]['user']['screen_name']].add(tid)
    
    target_users = list(targets.keys())
    print(len(mentions), len(tweets), len(target_users))
    while target_users:
        target_screen_name = target_users.pop(0)
        if target_screen_name not in mentions:
            mentions[target_screen_name] = get_mentions(target_screen_name)
        new_tweets = filter_mentions(mentions[target_screen_name], 
                                     targets = targets[target_screen_name])
        del(targets[target_screen_name])
        for tid in new_tweets:
            targets[new_tweets[tid]['user']['screen_name']].add(tid)
            tweets[tid] = new_tweets[tid]
            target_users.append(new_tweets[tid]['user']['screen_name'])
        print(len(mentions), len(tweets), len(target_users))
        
    return(tweets, mentions)   

#### Interpreting our results
In our code we're still printing out the `len()` of the `mentions`, `tweets`, and `target_users` objects. As it runs, we can see the number of tracked users (keyed by screen name in `mentions`) going up, along with the number of objects in `tweets` (keyed by tweet `id`) in the thread.

Reviewing the full texts, we can see the tweets are indeed those that we started with in our example.

In [None]:
tweets, mentions = get_threads(tid = 1250762697865678849, screen_name = 'drexeluniv')

1 1 1
2 2 1
3 3 1
3 4 1
3 6 2
3 6 1
3 6 0


In [None]:
[tweets[tid]['full_text'] for tid in tweets]

['@DrexelUniv @LeBow @LeBowMBA Only thing I’m missing is the $1212 dollars you’re charging per credit hour for classes that have lost all their value because labs are closed',
 '@hashtag_chris @DrexelUniv @LeBow @LeBowMBA You do realize that Drexel and thousands of other schools didn’t want this at all, but had to adapt in a short time for it. Suck it up.',
 '@therealmaart @DrexelUniv @LeBow @LeBowMBA Hey guys I found the guy who doesn’t have any student loans and mommy and daddy pays their entire tuition!',
 '@hashtag_chris @DrexelUniv @LeBow @LeBowMBA Nope, exact opposite. I’m paying it all myself. Stay woke bro.',
 '@therealmaart @DrexelUniv @LeBow @LeBowMBA https://t.co/q36UvC1FSO',
 '@therealmaart @DrexelUniv @LeBow @LeBowMBA So you like getting charged the same amount for a lesser product? Your argument is dumber than your haircut.']

#### So what's next?
To emphasize the directions mentioned above, we will really have to focus on optimizing our calls to stay within our rate limit. To see our limit, here is a basic utility that uses the `datetime` module (which we'll begin to utilize in future chapters) to render our 'reset time', i.e., the end of the current 15-minute window, in a human-readable format (as opposed to Unix time).

From this example (and an additional debug set of ~5 calls), we've used around 10 calls to populate a thread of seven tweets. This is expensive!

In [None]:
from datetime import datetime as dt
headers = [twitter.get_lastfunction_header(x) for x in 
           ['x-rate-limit-limit', 'x-rate-limit-remaining', 
            'x-rate-limit-class', 'x-rate-limit-reset']]
print(headers)

print(dt.utcfromtimestamp(int(headers[-1])).strftime('%Y-%m-%d %H:%M:%S'))

['450', '436', None, '1587569318']
2020-04-22 15:28:38
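Header values like these could also drive a simple pause between calls. The helper below is a hypothetical sketch, not part of Twython: it takes the `x-rate-limit-remaining` and `x-rate-limit-reset` values (as the strings shown above) and returns how many seconds to wait before the next request.

```python
import time

def seconds_until_reset(remaining, reset_epoch, now=None, reserve=0):
    """Return how long to sleep before the next call (0.0 if calls remain).

    remaining   -- value of the 'x-rate-limit-remaining' header
    reset_epoch -- value of the 'x-rate-limit-reset' header (Unix time)
    now         -- current Unix time; defaults to time.time()
    reserve     -- number of calls to hold back as a safety margin
    """
    now = time.time() if now is None else now
    if int(remaining) > reserve:
        return 0.0
    # never return a negative wait if the window has already reset
    return max(0.0, int(reset_epoch) - now)

# with the header values printed above, calls remain, so there is no wait
print(seconds_until_reset('436', '1587569318', now=1587569000))
```

In a downloader, the result could be passed to `time.sleep()` before each search call; the `reserve` parameter is an assumption about how cautious one wants to be.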


### B. Explore Reddit's pushshift api:
Reddit's APIs aren't quite as clear and comprehensive as Twitter's and, as a result, the best archive I've seen of complete historical Reddit data is hosted by a third party:

- https://github.com/pushshift/api
- https://pushshift.io/api-parameters/

Let's plan as above to explore this (unauthenticated) API, focused with the intention of mining out comment threads. 

Of particular use to us will be the referenced call to Get all Comment IDs for a particular Submission:

- https://github.com/pushshift/api#get-all-comment-ids-for-a-particular-submission

in addition to the endpoint which will allow us to get all comments from a comma-separated list of ids:

- https://github.com/pushshift/api#getting-comments-based-on-id

Let's work from here, considering the following submission (post) id: `j1dynm`.

Note: we should take extra care to determine whether these comment ids are complete, or just the direct replies to the original submission. If they're only these 'top level' replies, then we'll have to approach via the same methods as the Twitter API, recursively or iteratively searching for replies to individual comment objects. If not, then we're in excellent shape to begin requesting the comment objects. Regardless, we may have to adhere to the API's `size` parameter which accepts values $\leq500$, so we should be a little bit cautious here.

#### Discussion
Substituting the provided submission id (`j1dynm`) into the `/submission/comment_ids/` endpoint, we receive a fairly-simply constructed list of comment ids:

In [None]:
url = "https://api.pushshift.io/reddit/submission/comment_ids/j1dynm"
resp = requests.get(url)
ids = resp.json()

In [None]:
len(ids['data']), ids['data'][:10]

(5231,
 ['g6yjzut',
  'g6yk9gi',
  'g6ykhxs',
  'g6ylfml',
  'g6ylry5',
  'g6ylwv8',
  'g6ymza0',
  'g6yo2be',
  'g6yongq',
  'g6yp07d'])

Yikes, it looks like this submission has over 5,000 comments. So, can we use the `comment/search?ids=` endpoint with a comma-separated string of ids? 

In [None]:
url = "https://api.pushshift.io/reddit/comment/search?ids=" + ','.join(ids['data'])
requests.get(url).text

'<html>\r\n<head><title>414 Request-URI Too Large</title></head>\r\n<body>\r\n<center><h1>414 Request-URI Too Large</h1></center>\r\n<hr><center>cloudflare</center>\r\n</body>\r\n</html>\r\n'

So unfortunately this quantity of comments is too much for a single call! But it's not the API which has denied our request for too many comments (>500), but rather that the URI (URL) is too large! 

However, we can batch the requests! Since 500 does appear to be the size limit on other endpoints, we'll use batches of that size. The mechanics of this downloader use `batch_size` as a fixed parameter and take list slices of the `ids` from our first API call to make the multiple requests needed to collect the full set of comments:

In [None]:
all_comments = []
batch_size = 500
# step through the ids in slices of batch_size; the final slice may be shorter
for start in range(0, len(ids['data']), batch_size):
  batch_ids = ids['data'][start:start + batch_size]
  url = "https://api.pushshift.io/reddit/comment/search?ids=" + ','.join(batch_ids)
  resp = requests.get(url)
  comments = resp.json()
  all_comments.extend(comments['data'])

len(all_comments)

5231
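With the comment objects in hand, the earlier question (are these only top-level replies?) can be examined through each comment's parent. The sketch below assumes every comment dictionary has `id` and `parent_id` fields, and that `parent_id` follows Reddit's fullname convention, where a `t3_` prefix marks the submission and a `t1_` prefix marks another comment. `sample_comments` is made-up data and `split_levels` is a hypothetical helper.

```python
from collections import defaultdict

# made-up comments with only the fields this sketch relies on
sample_comments = [
    {"id": "c1", "parent_id": "t3_j1dynm"},   # reply to the submission
    {"id": "c2", "parent_id": "t1_c1"},       # reply to comment c1
    {"id": "c3", "parent_id": "t3_j1dynm"},
    {"id": "c4", "parent_id": "t1_c2"},
]

def split_levels(comments):
    """Separate top-level comment ids from a parent -> replies mapping."""
    top_level = [c["id"] for c in comments if c["parent_id"].startswith("t3_")]
    children = defaultdict(list)
    for c in comments:
        if c["parent_id"].startswith("t1_"):
            # strip the 't1_' prefix to recover the parent comment's id
            children[c["parent_id"][3:]].append(c["id"])
    return top_level, dict(children)

top_level, children = split_levels(sample_comments)
print(top_level)
print(children)
```

If `split_levels(all_comments)` returns a non-empty `children` mapping, the id list already includes nested replies, and the thread can be rebuilt from that mapping without further searching.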

### C. Investigate the ProPublica API

As an additional exercise we'll investigate the availability of congressional data from ProPublica's API:

- https://www.propublica.org/datastore/api/propublica-congress-api

You'll have to request an api key, which you can do via the link above. To pass our key through the api, we'll have to use a header, in a format like:

```
requests.get(URL, headers={'X-API-Key': key})
```

As a target, let's see if we can find the transcript of the recent congressional hearing from the House Committee on the Judiciary:

- https://www.congress.gov/event/116th-congress/house-event/110883
- https://docs.house.gov/Committee/Calendar/ByEvent.aspx?EventID=110883
- https://docs.house.gov/Committee/Calendar/ByDay.aspx?DayID=07292020

We can start by looking for our committee on the list provided by the following endpoint:

- https://projects.propublica.org/api-docs/congress-api/committees/#lists-of-committees

From this we'll be able to determine our specific committee's id, and then use the endpoint for hearings from a specific committee:

- https://projects.propublica.org/api-docs/congress-api/committees/#get-hearings-for-a-specific-committee

If it turns out our hearing isn't listed, we might be able to check and see how up to date the data are, by looking at what hearings are 'recent':

- https://projects.propublica.org/api-docs/congress-api/committees/#get-recent-committee-hearings

#### Discussion
Taking a look at the committees, we can see that the committee of interest is keyed by `'id': 'HSJU'`.
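Rather than scanning the full listing by eye, the id can be looked up by name. The sketch below assumes the response nests its committees under `results[...]['committees']`, each with `'id'` and `'name'` fields; `find_committee_ids` is a hypothetical helper, and `sample_response` is a heavily trimmed stand-in rather than real API output.

```python
def find_committee_ids(committees_response, name_fragment):
    """Map committee id -> name for names containing name_fragment."""
    matches = {}
    for result in committees_response["results"]:
        for committee in result["committees"]:
            # case-insensitive substring match on the committee name
            if name_fragment.lower() in committee["name"].lower():
                matches[committee["id"]] = committee["name"]
    return matches

# trimmed stand-in for the committees response; only 'id' and 'name' kept
sample_response = {
    "results": [{
        "chamber": "House",
        "committees": [
            {"id": "HSHA", "name": "Committee on Administration"},
            {"id": "HSJU", "name": "Committee on the Judiciary"},
        ],
    }]
}

print(find_committee_ids(sample_response, "judiciary"))
```

Once the real response has been stored in a variable, the same call on that object should surface the Judiciary committee's id.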

In [None]:
import requests

committees_URL = "https://api.propublica.org/congress/v1/116/house/committees.json"

key = ""
result = requests.get(committees_URL, headers={'X-API-Key': key})

committees = result.json()
committees

{'copyright': 'Copyright (c) 2020 Pro Publica Inc. All Rights Reserved.',
 'results': [{'chamber': 'House',
   'committees': [{'api_uri': 'https://api.propublica.org/congress/v1/116/house/committees/HSHA.json',
     'chair': 'Zoe Lofgren',
     'chair_id': 'L000397',
     'chair_party': 'D',
     'chair_state': 'CA',
     'chair_uri': 'https://api.propublica.org/congress/v1/members/L000397.json',
     'chamber': 'House',
     'id': 'HSHA',
     'name': 'Committee on Administration',
     'ranking_member_id': 'D000619',
     'subcommittees': [{'api_uri': 'https://api.propublica.org/congress/v1/116/house/committees/HSHA/subcommittees/HSHA01.json',
       'id': 'HSHA01',
       'name': 'Elections'},
      {'api_uri': 'https://api.propublica.org/congress/v1/116/house/committees/HSHA/subcommittees/HSHA08.json',
       'id': 'HSHA08',
       'name': 'Elections'}],
     'url': 'https://cha.house.gov/'},
    {'api_uri': 'https://api.propublica.org/congress/v1/116/house/committees/HSAG.json',
 

Looking at the specific committee hearings, it unfortunately appears that the most recent entry is from `2019-02-14`, i.e., the hearing of interest is not listed, and searching through the returned object confirms this. Moreover, we can see from the returned object's values

```
'num_results': 9
'offset': 0
```

that the search within the committee returned only the nine hearings we see, i.e., that there also aren't any older ones!

In [None]:
comm_id = "HSJU"
hearings_URL = "https://api.propublica.org/congress/v1/116/house/committees/" + comm_id + "/hearings.json"

result = requests.get(hearings_URL, headers={'X-API-Key': key})

hearings = result.json()
hearings

{'copyright': 'Copyright (c) 2020 Pro Publica Inc. All Rights Reserved.',
 'results': [{'congress': '116',
   'hearings': [{'api_uri': 'https://api.propublica.org/congress/v1/116/house/committees/HSJU.json',
     'bill_ids': [],
     'chamber': 'House',
     'committee': 'Committee on the Judiciary',
     'committee_code': 'HSJU',
     'date': '2019-02-14',
     'description': 'The State of Competition in the Wireless Market: Examining the Impact of the Proposed Merger of T-Mobile and Sprint on Consumers, Workers, and the Internet',
     'location': 'RHOB 2141',
     'meeting_type': 'Hearing',
     'time': '10:00:00',
     'url': 'https://docs.house.gov/Committee/Calendar/ByEvent.aspx?EventID=108892'},
    {'api_uri': 'https://api.propublica.org/congress/v1/116/house/committees/HSJU.json',
     'bill_ids': [],
     'chamber': 'House',
     'committee': 'Committee on the Judiciary',
     'committee_code': 'HSJU',
     'date': '2019-02-13',
     'description': 'H.R. 8 and H.R. 1112',
   

One last thing we can do is look at all recent hearings to see if there are some 'recent' data that just weren't indexed in their database.

Looking at the record from the same time period, we can see that 12pm ET on `2020-7-29` is passed over. So our hearing is not present at all, even though more recent ones are present! This then seems to indicate that ProPublica's data are not in fact sufficiently up to date, and only updated in targeted areas of the congressional record. To find more current records and certain records of interest we'll either have to wait for better features and more data, or build bespoke data collection from sources like `congress.gov`.

In [None]:
recent_hearings_URL = "https://api.propublica.org/congress/v1/116/committees/hearings.json?&offset=119"

result = requests.get(recent_hearings_URL, headers={'X-API-Key': key})

recent_hearings = result.json()
recent_hearings['results'][0]['hearings'][0:2]

[{'api_uri': 'https://api.propublica.org/congress/v1/116/senate/committees/SSVA.json',
  'bill_ids': [],
  'chamber': 'Senate',
  'committee': 'Committee on Veterans&#39; Affairs',
  'committee_code': 'SSVA',
  'date': '2020-07-29',
  'description': 'Hearings to examine VA telehealth during and beyond COVID-19, focusing on challenges and opportunities in rural America.',
  'location': 'SD-G50',
  'meeting_type': '',
  'time': '14:00:00',
  'url': ''},
 {'api_uri': 'https://api.propublica.org/congress/v1/116/senate/committees/SSFI.json',
  'bill_ids': [],
  'chamber': 'Senate',
  'committee': 'Committee on Finance',
  'committee_code': 'SSFI',
  'date': '2020-07-29',
  'description': 'Hearings to examine WTO reform, focusing on making global rules work for global challenges.',
  'location': 'SD-215',
  'meeting_type': '',
  'time': '10:15:00',
  'url': ''}]
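The presence check described in this discussion can be sketched as a small filter. `hearings_on` is a hypothetical helper, and `sample_hearings` holds the two records printed above, trimmed to the fields the filter uses.

```python
def hearings_on(hearings, date, committee_code=None):
    """Return hearings on a given date, optionally for a single committee."""
    return [
        h for h in hearings
        if h["date"] == date
        and (committee_code is None or h["committee_code"] == committee_code)
    ]

# the two records shown above, trimmed to the fields used here
sample_hearings = [
    {"committee_code": "SSVA", "date": "2020-07-29", "time": "14:00:00"},
    {"committee_code": "SSFI", "date": "2020-07-29", "time": "10:15:00"},
]

print(len(hearings_on(sample_hearings, "2020-07-29")))
print(hearings_on(sample_hearings, "2020-07-29", committee_code="HSJU"))
```

Applied across every page of the recent-hearings listing, an empty result for `HSJU` on that date would be consistent with the hearing being absent from the data.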