# Social Media Analytics
## School of Information, University of Michigan

## Week 1: 
 
- Intro to social data: types of data, platforms, profiles
- Ethical considerations in working with social media data
- Intro to platform APIs: obtaining and managing data
- Understanding the structure of social platform data


## Assignment Overview
### The objective of this assignment is to:

- Access social platform data using an API
    - Figure out how to get and use authentication credentials
- Manipulate the retrieved data using Python

### The total score of this assignment will be 100 points consisting of:

- retrieve one tweet: 20 points
- retrieve one follower: 20 points
- `create_tweet_df` function: 20 points
- `create_hashtag_df` function: 20 points
- `create_weekday_hour_count_df` function: 20 points

### Resources:

 - [Tweepy API documentation (v4.10.0 is used for this assignment)](https://docs.tweepy.org/en/v4.10.0/index.html)
 - [Tweepy Getting Started Tutorial](https://docs.tweepy.org/en/v4.10.0/getting_started.html)
 - [Twitter API documentation](https://developer.twitter.com/en/docs/api-reference-index) 

### Instructions: 
In the first part of this assignment, you will use the Twitter API, Tweepy, and the Twitter API documentation to guide you through the process of obtaining social platform data. Once obtained, you will manipulate the data using Python to explore the types of data found on social platforms and the way in which those data are structured. 


---
## Part A (100 points)


## Important Note
You can execute calls to Twitter's API in your notebook. The autograder **cannot**.

To get around this limitation, for the first two problems we ask you to *paste a text string that is extracted from the results of a call to the Twitter API*.

Once you have run the code that produces the string you need, comment out that code. **If you don't comment out your code that makes Twitter API calls, the autograder will fail and won't score the rest of your cells**.
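
Before any API call can be made, the notebook needs the four credentials from a Twitter developer account. The sketch below shows one way to keep them out of the notebook itself by reading them from environment variables. The variable names and the `load_credentials` helper are hypothetical, chosen only for illustration; they are not part of the assignment or of Tweepy.

```python
import os

# Hypothetical environment variable names; use whatever names suit your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of hard-coded strings, so that real keys never appear in a shared notebook.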

In [1]:
import tweepy
import json

# Replace these placeholders with the credentials from your Twitter developer account.
# Never leave real keys or tokens in a notebook that you share or submit.
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# YOUR CODE HERE
#raise NotImplementedError()

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)


## Get the data for one tweet as a json-formatted string
20 points
1. Call Twitter's API to get data for a single tweet
    - One way is to make an API request for a single tweet. You can find one on the Twitter website (check the URL string).
    - Alternatively, you can use an API call that returns data for multiple tweets, but be sure to extract just a single one.
2. Serialize the Python object as a JSON-formatted string, print it out, and copy the string literal
    - Hint: `json.dumps`

3. Save it to a variable called `tweet_data_string`. Note that you will need to assign a hard-coded string literal to this variable: the string that you copied in the previous step.

Note: the example provided shows only a few fields on the tweet object. Your solution should include all the fields that the tweet JSON includes.
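
The serialize, print, and paste steps above can be sketched without touching the API. The tweet dict below is a hypothetical, heavily trimmed stand-in (only the three fields the visible test requires); a real payload has many more fields. Printing with `repr()` is a suggestion rather than a requirement: it yields a quoted, escaped literal that should survive pasting even when the tweet text contains quotes or backslashes.

```python
import json

# Hypothetical, heavily trimmed tweet payload; a real one has many more fields.
tweet = {
    "created_at": "Wed Nov 20 14:14:20 +0000 2019",
    "id": 1197156225588301824,
    "id_str": "1197156225588301824",
}

# Step 2: serialize the Python dict to a JSON-formatted string.
serialized = json.dumps(tweet)

# repr() shows the string as a Python literal that can be copied as-is.
print(repr(serialized))

# Step 3: assign the copied literal; no API call is needed from here on.
tweet_data_string = '{"created_at": "Wed Nov 20 14:14:20 +0000 2019", "id": 1197156225588301824, "id_str": "1197156225588301824"}'

# The pasted literal parses back into a dict equal to the original.
assert json.loads(tweet_data_string) == tweet
```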

In [2]:
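# Search for tweets mentioning @TheRock and keep only the first result.
# Comment this call out before submitting: the autograder cannot call the Twitter API.
results = api.search_tweets(q="@TheRock")[0]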

In [3]:
# Serialize the raw tweet payload to a JSON-formatted string.
tweet_data_string = json.dumps(results._json)

In [4]:
# Some tests for your code in the previous cell. 
# There are additional hidden tests not shown in this cell that the autograder uses.
import json
from jsonschema import validate, ValidationError
schema = '{"definitions": {}, "$schema": "http://json-schema.org/draft-07/schema#", "$id": "http://example.com/root.json", "type": "object", "additionalProperties": true, "title": "The Root Schema", "required": ["created_at", "id", "id_str"], "properties": {"created_at": {"$id": "#/properties/created_at", "type": "string", "title": "The Created_at Schema", "default": "", "examples": ["Wed Nov 20 14:14:20 +0000 2019"], "pattern": "^(.*)$"}, "id": {"$id": "#/properties/id", "type": "integer", "title": "The Id Schema", "default": 0, "examples": [1197156225588301800]}, "id_str": {"$id": "#/properties/id_str", "type": "string", "title": "The Id_str Schema", "default": "", "examples": ["1197156225588301824"], "pattern": "^(.*)$"}}}'
validation_schema = json.loads(schema)
sample = json.loads(tweet_data_string)
raised = False
try:
    validate(sample, schema=validation_schema)
except ValidationError:
    raised = True
assert not raised, 'tweet_data_string, JSON schema is not correct'

 

## Get the data for a single follower of the UMSI Twitter account

20 points

Again, serialize the data as a JSON-formatted string, print it, and paste it as a string literal assigned to the variable `follower_data_as_string`. Don't forget to comment out your code that makes Twitter API calls.

Note: you should paste the data for just a single follower.

In [5]:
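# Fetch the followers of @umsi, keep the first one, and serialize its raw payload.
# Comment this call out before submitting: the autograder cannot call the Twitter API.
follower_data_as_string = json.dumps(api.get_followers(screen_name="@umsi")[0]._json)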

In [6]:
# replace the string literal below
#follower_data_as_string = '{"id": 18033550, "id_str": "18033550", "name": "School of Information"}'

# YOUR CODE HERE
#raise NotImplementedError()

In [7]:
# Some tests for your code in the previous cell. 
# There are additional hidden tests not shown in this cell that the autograder uses.
submission = follower_data_as_string
assert isinstance(submission, str), 'follower_sample, function does not return a json string.'

converted = json.loads(submission)
assert isinstance(converted, dict), 'follower_sample, json string does not contain a dictionary response.'


## Using Cached Data
The remaining questions will have you process data of the kind that comes back from Twitter.

So that you work with the same data as our autograder, you will process data that we have already retrieved and saved in a text file.

In other words, congratulations on establishing that you can use the Twitter API to fetch data. But from here on out, you won't be using it for the graded assignments; you'll only be using data that we've already fetched and saved in files.


In [8]:
import json
import pandas as pd
def create_tweet_df(json_file_path):
    """ 20 points
        Transform the tweets_json object into a dataframe with the following columns and datatypes:
        'retweet_count', int64
        'created_at', datetime64[ns, UTC]
        'full_text', object
        'favorited', bool
        'retweeted', bool
        'lang', object
        'favorite_count', int64
        
        Return the dataframe
    """ 
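    # Columns required by the assignment, in the order listed in the docstring.
    columns = ["retweet_count", "created_at", "full_text", "favorited",
               "retweeted", "lang", "favorite_count"]
    # Load the cached tweets and keep only the required columns.
    return pd.read_json(json_file_path)[columns]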

In [9]:
json_file_path = "assets/POTUS_2019-03-07_2020-01-28.json"
df = create_tweet_df(json_file_path)
df

| | retweet_count | created_at | full_text | favorited | retweeted | lang | favorite_count |
|---|---|---|---|---|---|---|---|
| 0 | 5069 | 2020-01-28 21:20:27+00:00 | RT @realDonaldTrump: https://t.co/tvvvnGEmjo h... | False | False | und | 0 |
| 1 | 2502 | 2020-01-28 19:34:00+00:00 | RT @WhiteHouse: "All humanity should be able t... | False | False | en | 0 |
| 2 | 1170 | 2020-01-28 19:05:57+00:00 | RT @WhiteHouse: "Perhaps most importantly, my ... | False | False | en | 0 |
| 3 | 1174 | 2020-01-28 18:50:45+00:00 | RT @WhiteHouse: "We must break free of yesterd... | False | False | en | 0 |
| 4 | 1732 | 2020-01-28 18:50:45+00:00 | RT @WhiteHouse: On Sunday, President @realDona... | False | False | en | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3196 | 5674 | 2019-03-07 22:05:56+00:00 | RT @FLOTUS: It was wonderful to welcome the Pr... | False | False | en | 0 |
| 3197 | 2853 | 2019-03-07 21:21:02+00:00 | RT @FLOTUS: Honored to celebrate a group of ex... | False | False | en | 0 |
| 3198 | 27347 | 2019-03-07 14:40:32+00:00 | RT @realDonaldTrump: We are on track to APPREH... | False | False | en | 0 |
| 3199 | 2940 | 2019-03-07 14:21:03+00:00 | RT @WhiteHouse: .@IvankaTrump: “The mission of... | False | False | en | 0 |


In [10]:
# Some tests for your code in the previous cell. 
# There are additional hidden tests not shown in this cell that the autograder uses.
import json
import pandas as pd
import numpy as np
df = create_tweet_df('assets/POTUS_2019-03-07_2020-01-28.json')
df_length = 3201
assert len(df) == df_length, "create_tweet_df, the length of the dataframe should be %d" % df_length
df_cols = ['retweet_count','created_at','full_text','favorited','retweeted','lang','favorite_count']
for col_name in df_cols:
    assert col_name in df.columns.values, "create_tweet_df, the column %s should be included" % col_name

In [11]:
import json
import pandas as pd
def create_hashtag_df(json_file_path):
    """ 20 points
        Transform the tweets_json object into a dataframe with the following columns:
        'text', object, the text of the hashtag
        'user', object, the screen name of the user who tweeted; if it's a retweet then the retweeter, not the original tweeter
        'created_at', datetime, the time the hashtag was tweeted
        HINT: Use the entities.hashtags attribute in the tweet to build this dataframe
    """ 
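    df = pd.read_json(json_file_path)
    tags = []
    # Emit one record per hashtag; tweets without hashtags contribute no rows.
    for _, row in df.iterrows():
        # The top-level user is the account that posted the tweet (the retweeter for retweets).
        user = row["user"]["screen_name"]
        created_at = row["created_at"]
        for tag in row["entities"]["hashtags"]:
            tags.append({"text": tag["text"], "user": user, "created_at": created_at})

    # Passing columns keeps the expected schema even if no hashtags are found.
    return pd.DataFrame(tags, columns=["text", "user", "created_at"])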

In [12]:
json_file_path = "assets/POTUS_2019-03-07_2020-01-28.json"
hashtags = create_hashtag_df(json_file_path)

In [13]:
# Some tests for your code in the previous cell. 
# There are additional hidden tests not shown in this cell that the autograder uses.
import json
import pandas as pd
import numpy as np
df = create_hashtag_df('assets/POTUS_2019-03-07_2020-01-28.json')
df_length = 86
assert len(df) == df_length, "create_hashtag_df, the length of the dataframe should be %d" % df_length
df_cols = ['text','user','created_at']
for col_name in df_cols:
    assert col_name in df.columns.values, "create_hashtag_df, the column %s should be included" % col_name

In [14]:
import json
import pandas as pd
def create_weekday_hour_count_df(tweets_dataframe):
    """ 20 points
        Create a pivot table where the columns are the hours of the day (0-23)
        and the rows are weekdays ('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday').
        Each cell is the count of tweets on a given weekday at a given hour.
        If there are no tweets for a specific weekday and hour, the value should be 0.
        Sort the hours in ascending order starting from 0 and the weekdays starting from Monday.
    """
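    weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    df = tweets_dataframe.copy()
    df["hour"] = df["created_at"].dt.hour
    df["dow"] = df["created_at"].dt.day_name()
    # Count tweets per (weekday, hour); fill_value=0 covers combinations with no tweets.
    pt = df.pivot_table(values="created_at", index="dow", columns="hour",
                        aggfunc="count", fill_value=0)
    # reindex orders the rows Monday to Sunday and the columns 0 to 23,
    # adding any weekday or hour that never occurs as a row or column of zeros.
    return pt.reindex(index=weekdays, columns=range(24), fill_value=0)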
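
As a small worked illustration of the weekday-by-hour table, the sketch below counts three hypothetical timestamps. It uses `pd.crosstab` rather than `pivot_table`, purely as an alternative way to get the same kind of count table, and then relies on `reindex` to guarantee all 7 rows and 24 columns in the required order.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical timestamps standing in for the created_at column.
created_at = pd.Series(pd.to_datetime([
    "2019-03-07 14:40:32+00:00",  # a Thursday
    "2019-03-07 14:21:03+00:00",  # the same Thursday, same hour
    "2020-01-28 21:20:27+00:00",  # a Tuesday
], utc=True))

# Cross-tabulate weekday names against hours to count tweets per combination.
counts = pd.crosstab(created_at.dt.day_name(), created_at.dt.hour)

# Add the missing weekdays and hours as zeros and put both axes in order.
table = counts.reindex(index=WEEKDAYS, columns=range(24), fill_value=0)
```

Here `table.loc["Thursday", 14]` is 2 and `table.loc["Tuesday", 21]` is 1; every other cell is 0.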

In [15]:
df["hour"] = df["created_at"].dt.hour
df["dow"] = df["created_at"].dt.day_name()
df.set_index("created_at").resample("1h")["dow"].agg("count")

created_at
2019-03-07 21:00:00+00:00    1
2019-03-07 22:00:00+00:00    0
2019-03-07 23:00:00+00:00    0
2019-03-08 00:00:00+00:00    0
2019-03-08 01:00:00+00:00    0
                            ..
2020-01-22 12:00:00+00:00    0
2020-01-22 13:00:00+00:00    0
2020-01-22 14:00:00+00:00    0
2020-01-22 15:00:00+00:00    0
2020-01-22 16:00:00+00:00    1
Freq: H, Name: dow, Length: 7700, dtype: int64

In [16]:
# Some tests for your code in the previous cell. 
# There are additional hidden tests not shown in this cell that the autograder uses.
import json
import pandas as pd
import numpy as np
t_df = create_tweet_df('assets/POTUS_2019-03-07_2020-01-28.json')
df = create_weekday_hour_count_df(t_df)
for col_name in range(0,24):
    assert col_name in df.columns.values, "create_weekday_hour_count_df, the column %s should be included" % col_name
for row_name in ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday','Wednesday']:
    assert row_name in df.index.values, "create_weekday_hour_count_df, the row %s should be included" % row_name