# DSCI 511: Data acquisition and pre-processing<br>Chapter 4: Pre-processing considerations: foresight for downstream needs
## Exercises
Note: numberings refer to the main notes.

#### 4.1.1.1 Exercise: CSV to JSON conversion
Read the `cities.csv` file and look at its contents. It should have a header (the first line of the file) that tells you which fields contain what data. Next, take the data for only those cities that have their population listed and store it in JSON format.
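
As a starting point, here is a minimal sketch of the filter-and-convert pattern on a small in-memory stand-in for the file. The column names `city` and `population` and the sample rows are assumptions made for illustration; check the real header of `cities.csv` before adapting it.

```python
import csv
import io
import json

# Hypothetical stand-in for cities.csv; the real header and rows may differ.
sample_csv = io.StringIO(
    "city,population\n"
    "Alphaville,120000\n"
    "Betatown,\n"
    "Gammaburg,45000\n"
)

# DictReader uses the header row to key each record by field name.
reader = csv.DictReader(sample_csv)

# Keep only the rows whose population field is non-empty.
populated = [row for row in reader if row["population"].strip()]

# Serialize the filtered records as a JSON string.
cities_json = json.dumps(populated, indent=2)
print(cities_json)
```

With a real file, `open(path, newline="")` would take the place of `io.StringIO`, and `json.dump` could write straight to an output file.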

In [None]:
## code here

#### 4.1.2.1 Exercise: JSON to CSV conversion
Load the data in the `american-movies.json` file. We only want the movies that were made from 1990 to 1999 (it was a truly glorious decade for American cinema). Your task is to take the title and year of each of these movies and write them to a tab-separated values file.
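
The writing half of this task can be sketched with the `csv` module and a tab delimiter. The keys `title` and `year` and the sample records below are hypothetical; the actual structure of `american-movies.json` may differ.

```python
import csv
import io

# Hypothetical records; the real file's keys may not be 'title' and 'year'.
movies = [
    {"title": "Example A", "year": 1989},
    {"title": "Example B", "year": 1994},
    {"title": "Example C", "year": 1999},
    {"title": "Example D", "year": 2001},
]

# Keep only the movies from 1990 through 1999, inclusive.
nineties = [m for m in movies if 1990 <= m["year"] <= 1999]

# io.StringIO stands in for an open file here; with a real file,
# use open(path, "w", newline="") instead.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["title", "year"])
for m in nineties:
    writer.writerow([m["title"], m["year"]])

tsv_text = out.getvalue()
print(tsv_text)
```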

In [None]:
## code here

#### 4.1.2.4 Exercise: Making JSON file reading scalable
Create a specialized JSON serialization of the data in `'nobel-laureates.json'`. Specifically, create a file called `'data/nobel-laureates-lines.json'` that has each laureate's record serialized separately as a JSON object, with newlines `'\n'` in between as delimiters (the `json.dumps()` string serialization function from Section 1.4.2.2 is useful here). As a follow-up, combine the line-by-line file reading syntax introduced in Section 1.4.1.5 with the string-based `json.loads()` function to _read only the first ten lines_. As you read these lines, load each from JSON and print the laureate's list of prizes.
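
The one-object-per-line pattern can be sketched as follows, using an in-memory buffer and made-up records. The `name` and `prizes` keys are assumptions for illustration, not the confirmed schema of the laureates file.

```python
import io
import json
from itertools import islice

# Hypothetical records standing in for the laureate data.
records = [
    {"name": f"Person {i}", "prizes": [{"year": 1900 + i}]}
    for i in range(25)
]

# Write one JSON object per line, with '\n' as the delimiter.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read back only the first ten lines, loading each one on its own.
buffer.seek(0)
first_ten = [json.loads(line) for line in islice(buffer, 10)]
for record in first_ten:
    print(record["prizes"])
```

Because each line is parsed independently, the reader never has to hold the whole dataset in memory, which is the point of the line-delimited layout.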

In [None]:
## code here

#### 4.4.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`. It contains a phone number in each line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`\] Store only the phone numbers with the area code "215" in a list and print it out. Use regex-based pattern matching, not any other methods which occur to you.
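
For orientation, here is a small sketch of area-code matching with `re`. The phone number formats in the sample list are assumptions; inspect `phone-numbers.txt` to see which formats actually occur and adjust the pattern accordingly.

```python
import re

# Hypothetical lines; the real file may use other formats.
sample_lines = [
    "(215) 555-0101\n",
    "215-555-0102\n",
    "610-555-0103\n",
    "267.215.0104\n",
    "2155550105\n",
]

# Anchor at the start so that '215' must be the area code, and allow
# optional parentheses and an optional space, dot, or dash separator.
pattern = re.compile(r"^\(?215\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

philly_numbers = [
    line.strip() for line in sample_lines if pattern.match(line.strip())
]
print(philly_numbers)
```

Anchoring with `^` matters: without it, a number such as `267.215.0104` would match on its middle digits.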

In [None]:
## code here

#### 4.4.1.8 Exercise: Names of the gods
In the cell below is some text. It's an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings), specifically, about a character's prayer to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.
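
One possible approach is to look for the word "the" followed by a capitalized word. The sentence below is a made-up stand-in, not the extract from the book, so the real text may need a more careful pattern (for example, to skip other capitalized phrases).

```python
import re

# Made-up sentence for illustration; not the actual extract.
toy_text = (
    "She lit a candle to the Father, another to the Mother, "
    "and a third to the Warrior."
)

# Lowercase 'the', a space, then a word starting with a capital letter.
god_names = re.findall(r"the [A-Z][a-z]+", toy_text)
print(god_names)
```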

In [None]:
## code here

#### 4.4.4.2 Exercise: Calculate your exact age
Calculate your own age using datetime parsing! Can you come up with a datetime format for your birthday that `dateutil.parser` doesn't recognize or recognizes incorrectly? If so, use the `datetime` module to specify the format exactly. [Hint. Review these docs: 
- https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
- https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
]
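
As a sketch of the explicit-format route, the example below parses a hypothetical day/month/two-digit-year string with `strptime` and computes an age in whole years. The birthday string and the helper name `age_in_years` are illustrative assumptions.

```python
from datetime import datetime

# Hypothetical birthday written as day/month/two-digit year.
birthday_text = "03/04/85"
birthday = datetime.strptime(birthday_text, "%d/%m/%y")  # 3 April 1985

def age_in_years(born, today):
    """Return the age in whole years on the date `today`."""
    # Subtract one if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - before_birthday

print(age_in_years(birthday, datetime.now()))
```

Subtracting the two datetime objects directly gives a `timedelta`, which is handy if you want the age in days or seconds instead.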

In [None]:
## code here

## Additional In-depth Exercises
### A. Batching Twitter Replies Requests
At the following path:

- ./data/1254014899950366720.json

we have data that represents a 'thread' initiated by Elon Musk's tweet:

- https://twitter.com/elonmusk/status/1254014899950366720

We'll be working with this thread across a few acquisition-related pre-processing exercises, so here's a data load:

In [None]:
import json
data = json.loads(open("./data/1254014899950366720.json").read())
data.keys(), data['tweets']["1254197528607637505"]

(dict_keys(['thread', 'tweets', 'rpws']),
 {'created_at': 'Sat Apr 25 23:56:07 +0000 2020',
  'id': 1254197528607637505,
  'id_str': '1254197528607637505',
  'full_text': '@elonmusk Looks* sheesh... algorithms hahaha',
  'truncated': False,
  'display_text_range': [10, 44],
  'entities': {'hashtags': [],
   'symbols': [],
   'user_mentions': [{'screen_name': 'elonmusk',
     'name': 'Elon Musk',
     'id': 44196397,
     'id_str': '44196397',
     'indices': [0, 9]}],
   'urls': []},
  'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
  'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  'in_reply_to_status_id': 1254179739620655105,
  'in_reply_to_status_id_str': '1254179739620655105',
  'in_reply_to_user_id': 1165558118262030343,
  'in_reply_to_user_id_str': '1165558118262030343',
  'in_reply_to_screen_name': 'kohtakangas',
  'user': {'id': 1165558118262030343,
   'id_str': '1165558118262030343',
   'name': 'Misty Brooke 

The three keys in this dictionary correspond to:
1. the `thread`'s topology, organized as a hierarchy of nested dictionaries keyed by tweet ids,
2. the conversation's `tweets` themselves, stored in a dictionary keyed by their own tweet ids, and
3. a log (`rpws`) of the current average <i>replies per week</i> (rpw) for each user, as observed in a separate replies database.

Here's a more programmatic view of the overall schema:

```
data = {
    'thread': {
        source: {
            child_1: {
                child_1_1: {...},
                ...
            },
            child_2: {...},
            ...
        }
    },
    'rpws': {
        user_id: [rpw, max_time],
        ...
    },
    'tweets': {
        tweet_id: tweet,
        ...
    }
}
```
where `max_time` refers to the most recent time stamp of any tweet in the `thread`.
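
To make the nested `thread` layout concrete, here is a short recursive sketch over a toy tree shaped like the schema above. It assumes that a tweet with no replies maps to an empty dictionary, which should be checked against the real data; the function names are illustrative.

```python
def count_tweets(tree):
    """Count every tweet id in a nested {id: {child_id: {...}}} dictionary."""
    return sum(1 + count_tweets(children) for children in tree.values())

def max_depth(tree):
    """Return the number of levels in the nested dictionary."""
    if not tree:
        return 0
    return 1 + max(max_depth(children) for children in tree.values())

# Toy thread mirroring the schema; leaves are assumed to be empty dicts.
toy_thread = {
    "source": {
        "child_1": {"child_1_1": {}},
        "child_2": {},
    }
}

print(count_tweets(toy_thread), max_depth(toy_thread))
```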

#### A.1 Exercise: Working with twitter timestamps
Utilise the datetime module and determine a string, `ttime`, which expresses the Twitter `'created_at'` date string's format to the datetime module, and then use datetime to parse any timestamp of your choosing in `data`.
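
As a hint at the shape of the answer, the sketch below parses the `'created_at'` value shown in the sample output above with one possible format string. Treat it as an assumption to verify against other timestamps in `data`; note also that `%a` and `%b` depend on an English locale.

```python
from datetime import datetime, timezone

# One candidate format for strings like 'Sat Apr 25 23:56:07 +0000 2020'.
ttime = "%a %b %d %H:%M:%S %z %Y"

sample_created_at = "Sat Apr 25 23:56:07 +0000 2020"
parsed = datetime.strptime(sample_created_at, ttime)
print(parsed.isoformat())
```

Because the format includes `%z`, the result is timezone-aware, which makes later comparisons against UTC times safer.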

In [None]:
## code here

#### A.2 Exercise: Who gets more replies per API batch: Penn or Drexeluniv?
This problem focuses on query batching on Twitter's search API. To get started, review the search API standard operators, and determine which (__at least 2 operators__) we'll need in order to 1) filter for user replies and 2) query for replies targeting two users at the same time. Here are the docs:

- https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators

For Twython reference, please review:

- https://twython.readthedocs.io/en/latest/api.html

In [None]:
## code here

#### A.3 Exercise: Querying Twitter's rate limits, and working with "unix time"
Now that we've checked on some replies to two Twitter users, let's check to see when our rate limit will be refreshed. In particular, use the `twitter.get_lastfunction_header(header_name)` method to recover each, by `header_name`:
- `'x-rate-limit-limit'` 
- `'x-rate-limit-remaining'`
- `'x-rate-limit-class'`
- `'x-rate-limit-reset'`

For details on the header-request method, see the Twython docs:

- https://twython.readthedocs.io/en/latest/api.html

Once you've collected the headers, convert the value of `'x-rate-limit-reset'` to a datetime object, and compare it to four hours past the current EST time (using `dt.now()`), to account for the time shift with Tweets, which are expressed in GMT. When this is complete, display both.

[Hint: here's a relevant stackoverflow post about this datetime conversion in Python: https://stackoverflow.com/questions/7703865/going-from-twitter-date-to-python-datetime-date]
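
The unix-time conversion itself can be sketched without touching the API. The header value below is a hypothetical example, assumed to arrive as a string of epoch seconds. Rather than adding a fixed four-hour offset to local time, this sketch swaps in timezone-aware objects so both values are expressed in UTC.

```python
from datetime import datetime, timezone

# Hypothetical 'x-rate-limit-reset' value: seconds since 1970-01-01 UTC.
reset_header = "1587859200"

# Convert epoch seconds into a timezone-aware UTC datetime.
reset_time = datetime.fromtimestamp(int(reset_header), tz=timezone.utc)

# The current time, also expressed in UTC for a like-for-like comparison.
now_utc = datetime.now(timezone.utc)

print(reset_time)
print(now_utc)
print(reset_time - now_utc)  # a timedelta; negative once the reset has passed
```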

In [None]:
## code here

#### A.4 Exercise: Maximizing query batch size for Twitter's rate limits
Now that we've vetted how to make batch requests for replies, our next job is to determine how many we can combine into a single query. __Is this always the same for each query?__

To answer this question, first consult the restrictions on the query parameter, `q`:

- https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets

__In particular, how many characters can we utilize per query, and how can we construct queries coming as close as possible to this maximum?__

When you've determined your strategy, build as many batches as are required, taking the users in sorted order of their user ids.
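
One possible strategy is greedy packing: keep appending terms to a query until the next one would exceed the limit. The sketch below assumes a 500-character limit, `to:name` terms joined by ` OR `, and a raw character count (ignoring URL encoding); each of these should be confirmed against the documentation linked above. The function and constant names are hypothetical.

```python
MAX_QUERY_CHARS = 500  # assumed limit; confirm against the API reference

def build_batches(screen_names, max_chars=MAX_QUERY_CHARS):
    """Greedily pack 'to:name' terms joined by ' OR ' into query strings."""
    batches, current = [], ""
    for name in sorted(screen_names):
        term = "to:" + name
        candidate = term if not current else current + " OR " + term
        if len(candidate) <= max_chars or not current:
            # The term fits (or it starts a fresh query on its own).
            current = candidate
        else:
            # The query is full: close it and start a new one.
            batches.append(current)
            current = term
    if current:
        batches.append(current)
    return batches

# Tiny limit so the batching behaviour is visible on a short list.
print(build_batches(["carol", "alice", "bob"], max_chars=20))
```

Since names differ in length, the number of users per query is not constant, which bears on the question posed above.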


In [None]:
## code here

#### A.5 Exercise: Evaluating query batch size for Twitter's rate limits
Now that we know how to build queries that request replies for the most users at once, our next goal is to find a way to batch users together into queries in a way that balances their aggregated numbers of replies per week. So the first thing we'll need is a way of evaluating the API rate-limit 'burden' of a given query-batch of users.

So, to get this off the ground, put together code that determines the number of replies per week per batch constructed in __A.4__. In particular, functionalize your code so that the aggregated replies per week can be assessed for an arbitrary batch in the dataset.
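
A minimal sketch of such a function, assuming a batch is a list of user ids and that `rpws` follows the `user_id: [rpw, max_time]` layout from the schema above. The sample values and the name `batch_rpw` are made up for illustration.

```python
def batch_rpw(batch, rpws):
    """Sum the replies-per-week values for the user ids in a batch."""
    # Each rpws entry is [rpw, max_time]; index 0 holds the rate.
    return sum(rpws[user_id][0] for user_id in batch)

# Made-up values in the same shape as data['rpws'].
rpws_example = {
    "u1": [12.5, "Sat Apr 25 23:56:07 +0000 2020"],
    "u2": [3.0, "Sat Apr 25 23:56:07 +0000 2020"],
    "u3": [40.0, "Sat Apr 25 23:56:07 +0000 2020"],
}

print(batch_rpw(["u1", "u3"], rpws_example))
```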

In [None]:
## code here

### B. Structuring Reddit Threads

Run the script in the cell below to build the Reddit data object `data`. We'll be using it throughout the exercise.


In [None]:
import requests

def get_data(sub_id):
  # Fetch the submission (the top-level post) itself
  post_url = "https://api.pushshift.io/reddit/search/submission/?ids=" + sub_id
  post_resp = requests.get(post_url)
  post = post_resp.json()

  data = [post['data'][0]]

  # Fetch the ids of every comment on the submission
  comments_url = "https://api.pushshift.io/reddit/submission/comment_ids/" + sub_id
  comments_resp = requests.get(comments_url)
  ids = comments_resp.json()

  # Request the comments in full batches of ids
  batch_size = 500
  num_batches = len(ids['data'])//batch_size
  for batch_num in range(num_batches):
    url = "https://api.pushshift.io/reddit/comment/search?ids=" + ','.join(ids['data'][batch_size*batch_num:batch_size*(batch_num + 1)])
    resp = requests.get(url)
    batch = resp.json()
    data.extend(batch['data'])

  # Request the leftover ids that did not fill a whole batch
  leftover = ids['data'][batch_size*num_batches:]
  if leftover:
    url = "https://api.pushshift.io/reddit/comment/search?ids=" + ','.join(leftover)
    resp = requests.get(url)
    batch = resp.json()
    data.extend(batch['data'])

  return data

sub_id = "j1dynm"
data = get_data(sub_id)

#### B.1 Exercise: Reviewing the Reddit comment data structure
Let's just take 5 minutes to review `data` and determine the following:
- What is the overall object type?
- What does a single element (comment) look like? (think schema)
- How do these data connect together, i.e., where's the 'thread'?
- Are the data ordered by time, and if not how could they be? 

Write any responses to these questions that you determine in the response box below.

_Response._

In [None]:
## code here

#### B.2 Exercise: Fast access by comment id
If we want to be able to move quickly between comments, a convenient option is to re-format the data into a dictionary. In particular, construct a `dict` called `comments` from `data` that is of the format:

```
comments = {
  id: comment,
  ...
}
```
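
The re-keying pattern can be sketched with a dictionary comprehension on toy records. It assumes each element carries its identifier under an `'id'` key, which you should confirm when reviewing `data`.

```python
# Toy records; real elements have many more fields.
sample = [
    {"id": "abc", "body": "first"},
    {"id": "def", "body": "second"},
]

# Map each record's id to the record itself for constant-time lookup.
comments_by_id = {item["id"]: item for item in sample}

print(comments_by_id["def"]["body"])
```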

In [None]:
## code here

#### B.3 Exercise: De-serializing timestamps
Now that we have our data set up for fast access, let's see if we can 'activate' the timestamps stored under the `'created_utc'` key by converting them into datetime objects.
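
A minimal sketch, assuming `'created_utc'` holds an integer count of seconds since the unix epoch (the value below is a made-up example):

```python
from datetime import datetime, timezone

# Hypothetical comment with an epoch-seconds timestamp.
toy_comment = {"id": "abc", "created_utc": 1600000000}

# Interpret the value explicitly as UTC rather than as local time.
created = datetime.fromtimestamp(toy_comment["created_utc"], tz=timezone.utc)
print(created.isoformat())
```

Passing `tz=timezone.utc` avoids silently shifting the time into the local timezone of whichever machine runs the notebook.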

In [None]:
from datetime import datetime as dt
## code here

#### B.4 Exercise: Computing timedeltas
Now that we have the ability to compute over Reddit's timestamps, let's create a function that takes a comment and determines the number of seconds that elapsed between the comment and its parent, identified by the `'parent_id'` key. For this, utilize the `timedelta` class within the `datetime` module and consider how to make linkages between comments and the post (top-level), vs. comments and other comments (replies).
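
One way this could look on toy data is sketched below. It assumes parent ids carry a type prefix such as `t3_` (post) or `t1_` (comment) ahead of the bare id, and that posts and comments share one lookup dictionary; verify both against `data`. The function name is hypothetical.

```python
from datetime import datetime, timezone

def seconds_since_parent(comment, items_by_id):
    """Seconds elapsed between a comment and the item it replies to."""
    # Strip an assumed type prefix such as 't1_' or 't3_' from the parent id.
    parent_key = comment["parent_id"].split("_", 1)[-1]
    parent = items_by_id[parent_key]
    child_time = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
    parent_time = datetime.fromtimestamp(parent["created_utc"], tz=timezone.utc)
    # Subtracting two datetimes yields a timedelta.
    return (child_time - parent_time).total_seconds()

# Toy thread: one post, one top-level comment, one reply.
toy_items = {
    "p1": {"id": "p1", "created_utc": 1000},
    "c1": {"id": "c1", "parent_id": "t3_p1", "created_utc": 1060},
    "c2": {"id": "c2", "parent_id": "t1_c1", "created_utc": 1200},
}

print(seconds_since_parent(toy_items["c1"], toy_items))
print(seconds_since_parent(toy_items["c2"], toy_items))
```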

In [None]:
from datetime import timedelta
## code here