In [1]:
%matplotlib inline

import json
import pandas as pd

# JSON - JavaScript Object Notation

Much of the data with which we will work comes in the JavaScript Object Notation (JSON) format.
JSON is a lightweight text format that allows one to describe objects by __keys__ and __values__ without needing to specify a schema beforehand (as compared to XML).

Many "RESTful" APIs available on the web today return data in JSON format, and the data we have stored from Twitter is in JSON as well.

Python supports JSON out of the box through the `json` module in its standard library.
This module lets us read and write JSON to/from a string or file, converting between JSON text and Python types such as dictionaries, lists, strings, numbers, booleans, and `None`.
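
As a minimal sketch of that round trip, the cell below serializes a dictionary to a JSON string with `json.dumps` and parses it back with `json.loads`. The `record` dictionary is made up purely for illustration.

```python
import json

# Hypothetical record, used only for illustration
record = {
    "name": "Jake",
    "age": 31,
    "tags": ["student", "boston"],
    "active": True,
    "advisor": None,
}

# Serialize the Python dictionary to a JSON-formatted string.
# True becomes true and None becomes null in the JSON text.
encoded = json.dumps(record)
print(encoded)

# Parse the string back into Python objects
decoded = json.loads(encoded)
print(decoded == record)
```

For files, `json.dump` and `json.load` do the same thing with an open file object in place of a string.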

## JSON and Keys/Values

The main idea here is that JSON allows one to specify a key, or name, for some data and then that data's value as a string, number, boolean, null, list, or another object.

An example line of JSON might look like:

> {"key": "value"}

In [2]:
json_string = '{"key": "value"}'

# Parse the JSON string
dict_from_json = json.loads(json_string)

# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dict_from_json)

Resulting dictionary object:
 {'key': 'value'}


In [3]:
# Will print the value
print ("Data stored in \"key\":\n", dict_from_json["key"])

Data stored in "key":
 value


In [4]:
# Looking up "value" would raise a KeyError: it is a value in this dictionary, not a key

# print ("Data stored in \"value\":\n", dict_from_json["value"])
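
The commented-out lookup fails because a dictionary can only be indexed by its keys. One way to handle a key that may be absent, sketched here under the assumption that a default value is acceptable, is to catch the `KeyError` or to use `dict.get`:

```python
import json

dict_from_json = json.loads('{"key": "value"}')

# Indexing with a key that does not exist raises KeyError
try:
    dict_from_json["value"]
except KeyError as err:
    missing = err.args[0]
    print("No such key:", missing)

# dict.get returns a default instead of raising
fallback = dict_from_json.get("value", "not found")
print(fallback)
```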

## Multiple Keys and Values

A JSON object can hold many key/value pairs, but every key must have a value.
Values can appear without keys inside a list, but this can be awkward to work with.

An example of a JSON string with multiple keys is below:

``
{
"name": "Jake",
"age": 31,
"gender": "male",
"city": "Boston"
}
``

Note the __comma__ after each of the first three values.
These commas are required for valid JSON: they separate each key/value pair from the next, and the last pair takes no trailing comma.

In [5]:
json_string = '{"name": "Jake","age": 31,"gender": "male","city": "Boston"}'

# Parse the JSON string
dict_from_json = json.loads(json_string)

# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dict_from_json)

Resulting dictionary object:
 {'name': 'Jake', 'age': 31, 'gender': 'male', 'city': 'Boston'}
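
The commas between key/value pairs are not optional. As a small sketch, the cell below feeds `json.loads` a deliberately broken string (the comma after `"Jake"` is missing) and catches the resulting `json.JSONDecodeError`:

```python
import json

# Deliberately invalid: no comma between the two key/value pairs
bad_json = '{"name": "Jake" "age": 31}'

try:
    json.loads(bad_json)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    # msg describes the problem; pos is the character offset where parsing failed
    print("Invalid JSON:", err.msg, "at position", err.pos)
```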


## JSON and Lists

The above JSON string describes an __object__ whose name is "Jake".
How would we describe a list of similar students?
Lists are useful here and are denoted with "[]" rather than the "{}" object notation.
For example:

``
{
    "students": [
        {
            "name": "Jake",
            "age": 31,
            "gender": "male",
            "city": "Boston"
        },
        {
            "name": "Natalia",
            "age": 28,
            "gender": "female",
            "city": "Cambridge"
        }
    ]
}
``

Again, note the comma between the "}" and "{" separating the two student objects and how they are both surrounded by "[]".

In [6]:
json_string = '{"students":[{"name":"Jake","age":31,"gender":"male","city":"Boston"},{"name": "Natalia","age":28,"gender":"female","city":"Cambridge"}]}'

# Parse the JSON string
dict_from_json = json.loads(json_string)

# Python now has a dictionary representing this data
print("Resulting list:\n", dict_from_json)



print("\nEach student:\n")
for student in dict_from_json["students"]:
    print(student, '\n')

Resulting list:
 {'students': [{'name': 'Jake', 'age': 31, 'gender': 'male', 'city': 'Boston'}, {'name': 'Natalia', 'age': 28, 'gender': 'female', 'city': 'Cambridge'}]}

Each student:

{'name': 'Jake', 'age': 31, 'gender': 'male', 'city': 'Boston'} 

{'name': 'Natalia', 'age': 28, 'gender': 'female', 'city': 'Cambridge'} 
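
Once parsed, the list of students behaves like any other Python list of dictionaries. The following sketch, which reuses the same JSON string, pulls out individual fields by index and by key:

```python
import json

json_string = '{"students":[{"name":"Jake","age":31,"gender":"male","city":"Boston"},{"name":"Natalia","age":28,"gender":"female","city":"Cambridge"}]}'
students = json.loads(json_string)["students"]

# Collect one field from every student
names = [s["name"] for s in students]

# Aggregate over a numeric field
average_age = sum(s["age"] for s in students) / len(students)

# Index into the list first, then into the object
second_city = students[1]["city"]

print(names, average_age, second_city)
```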



## More JSON + Lists

A couple of things to note:
1. JSON does not *need* a name for the list. It could be declared as just a list.
1. The student objects need not be identical.

As an example:

``
[
        {
            "name": "Jake",
            "age": 31,
            "gender": "male",
            "city": "Boston"
        },
        {
            "name": "Natalia",
            "gender": "female",
            "city": "Cambridge",
            "goal": "Master degree"
        }
]
``

In [7]:
json_string = '{"students":[{"name":"Jake","age":31,"gender":"male","city":"Boston"},{"name": "Natalia","gender":"female","city":"Cambridge","goal": "Master degree"}]}'

# Parse the JSON string
dict_from_json = json.loads(json_string)

# Python now has a dictionary representing this data
print("Resulting list:\n", dict_from_json)



print("\nEach student:\n")
for student in dict_from_json["students"]:
    print(student, '\n')

Resulting list:
 {'students': [{'name': 'Jake', 'age': 31, 'gender': 'male', 'city': 'Boston'}, {'name': 'Natalia', 'gender': 'female', 'city': 'Cambridge', 'goal': 'Master degree'}]}

Each student:

{'name': 'Jake', 'age': 31, 'gender': 'male', 'city': 'Boston'} 

{'name': 'Natalia', 'gender': 'female', 'city': 'Cambridge', 'goal': 'Master degree'} 
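
To illustrate both notes at once, here is a sketch that parses the students as a bare top-level list with no "students" key. In that case `json.loads` returns a Python list rather than a dictionary. Because the two objects do not share the same keys, `dict.get` is used for the fields that may be missing.

```python
import json

# A bare JSON list: no enclosing object and no "students" key
json_string = '[{"name":"Jake","age":31,"gender":"male","city":"Boston"},{"name":"Natalia","gender":"female","city":"Cambridge","goal":"Master degree"}]'

students = json.loads(json_string)
print(type(students))

for student in students:
    # "age" and "goal" are each present in only one of the two objects
    age = student.get("age")
    goal = student.get("goal", "unspecified")
    print(student["name"], age, goal)

ages = [s.get("age") for s in students]
```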



## Nested JSON Objects

We've shown you can have a list as a value, and you can do the same with objects.
In fact, one of the powers of JSON is its essentially unlimited nesting depth and expressiveness.
You can very easily nest objects within objects, and JSON in the wild relies on this heavily.

Below is an example of JSON data collected from the Google Maps Geocoding API (https://maps.googleapis.com/maps/api/geocode/json?address=Babson+College):



``
{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "231",
               "short_name" : "231",
               "types" : [ "street_number" ]
            },
            {
               "long_name" : "Forest Street",
               "short_name" : "Forest St",
               "types" : [ "route" ]
            },
            {
               "long_name" : "Babson Park",
               "short_name" : "Babson Park",
               "types" : [ "neighborhood", "political" ]
            },
            {
               "long_name" : "Wellesley",
               "short_name" : "Wellesley",
               "types" : [ "locality", "political" ]
            },
            {
               "long_name" : "Norfolk County",
               "short_name" : "Norfolk County",
               "types" : [ "administrative_area_level_2", "political" ]
            },
            {
               "long_name" : "Massachusetts",
               "short_name" : "MA",
               "types" : [ "administrative_area_level_1", "political" ]
            },
            {
               "long_name" : "United States",
               "short_name" : "US",
               "types" : [ "country", "political" ]
            },
            {
               "long_name" : "02457",
               "short_name" : "02457",
               "types" : [ "postal_code" ]
            },
            {
               "long_name" : "0310",
               "short_name" : "0310",
               "types" : [ "postal_code_suffix" ]
            }
         ],
         "formatted_address" : "231 Forest St, Babson Park, MA 02457, USA",
         "geometry" : {
            "location" : {
               "lat" : 42.2993708,
               "lng" : -71.2659951
            },
            "location_type" : "ROOFTOP",
            "viewport" : {
               "northeast" : {
                  "lat" : 42.3007197802915,
                  "lng" : -71.26464611970849
               },
               "southwest" : {
                  "lat" : 42.2980218197085,
                  "lng" : -71.26734408029149
               }
            }
         },
         "place_id" : "ChIJ7xQZi0GB44kRiWrnmTgf904",
         "types" : [ "establishment", "point_of_interest" ]
      }
   ],
   "status" : "OK"
}
``

Read the following code and think about what will be printed before executing the cell.

In [8]:
import urllib.request

url = "https://maps.googleapis.com/maps/api/geocode/json?address=Babson+College"

# Fetch the response and decode the raw bytes as UTF-8 text
with urllib.request.urlopen(url) as f:
    response_text = f.read().decode('utf-8')

# Parse the JSON text, then index into the nested lists and objects
response_data = json.loads(response_text)
print(response_data['results'][0]['address_components'][5]['long_name'])

Massachusetts
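
Indexing with a fixed position such as `[5]` depends on the order of the address components, which may differ between addresses. A more defensive sketch is to search the components by their "types" list. The cell below works offline on a trimmed copy of the sample response shown earlier (the live endpoint may require an API key). The helper name `find_component` is hypothetical, not part of any Google library.

```python
import json

# Trimmed copy of the sample response shown earlier, embedded so the cell runs offline
sample = '''
{
  "results": [
    {
      "address_components": [
        {"long_name": "Wellesley", "short_name": "Wellesley",
         "types": ["locality", "political"]},
        {"long_name": "Massachusetts", "short_name": "MA",
         "types": ["administrative_area_level_1", "political"]},
        {"long_name": "United States", "short_name": "US",
         "types": ["country", "political"]}
      ],
      "geometry": {"location": {"lat": 42.2993708, "lng": -71.2659951}}
    }
  ],
  "status": "OK"
}
'''
response_data = json.loads(sample)


def find_component(result, component_type):
    """Return the first address component tagged with component_type, or None."""
    for component in result["address_components"]:
        if component_type in component["types"]:
            return component
    return None


first_result = response_data["results"][0]
state = find_component(first_result, "administrative_area_level_1")
location = first_result["geometry"]["location"]

print(state["long_name"], location["lat"], location["lng"])
```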


## Reading Twitter JSON

We should now have all the tools necessary to understand how Python can read Twitter JSON data.
To show this, we'll read in @realDonaldTrump's first tweet and parse it with Python's JSON loader.

In [9]:
import gzip
with gzip.open('data/realdonaldtrump_22893.gz', 'rb') as f:
    tweet_content = f.readline()


# Print the raw json
print("Raw Tweet JSON:\n")
print(tweet_content)

# Convert the JSON to a Python object
tweet = json.loads(tweet_content)
print("\nTweet Object:\n")
print(tweet)


Raw Tweet JSON:

b'{"in_reply_to_screen_name":null,"favorited":false,"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"truncated":false,"screen_name":"realDonaldTrump","is_quote_status":false,"text":"Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!","display_text_range":[0,117],"in_reply_to_user_id":null,"favorite_count":359,"contributors":null,"created_at":"Mon May 04 18:54:25 +0000 2009","user_id":25073877,"in_reply_to_user_id_str":null,"id":1698308935,"coordinates":null,"entities":{"user_mentions":[],"hashtags":[],"urls":[],"symbols":[]},"id_str":"1698308935","lang":"en","source":"<a href=\\"http://twitter.com\\" rel=\\"nofollow\\">Twitter Web Client</a>","geo":null,"place":null,"retweet_count":412,"retweeted":false}\n'

Tweet Object:

{'in_reply_to_screen_name': None, 'favorited': False, 'in_reply_to_status_id_str': None, 'in_reply_to_status_id': None, 'truncated': False, 'screen_name': 'realDonaldTru
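
The cell above opens the file in binary mode, which is why the raw line prints with a `b'...'` prefix; `json.loads` accepts bytes as well as strings in Python 3.6 and later. To read every tweet rather than only the first, one option is to loop over the lines. The sketch below assumes the file holds one JSON object per line, as the `readline()` call suggests. It writes two made-up records to a temporary gzip file so that it runs without the real data.

```python
import gzip
import json
import os
import tempfile

# Two made-up records standing in for tweets (hypothetical data, not from the real file)
records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

# Write them as gzip-compressed, newline-delimited JSON to a temporary file
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read every line back; text mode ("rt") yields str rather than bytes
tweets = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            tweets.append(json.loads(line))

print(len(tweets), "records read")
```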

## Twitter JSON Fields

This tweet is pretty big, but we can still see some of the fields it contains. 
Note it also has many nested fields.
We'll go through some of the more important fields below.

In [10]:
# What fields can we see?
print("Keys:")
for k in sorted(tweet.keys()):
    print ("\t", k)
    
print("\n")

for k, v in tweet.items():
    print(k,"\t",v)

# Tweets have a list of hashtags, mentions, URLs, and other
# attachments in "entities" field
print ("\nEntities:")
for entity in tweet["entities"]:
    print ("\t", entity)
    
    for e in tweet["entities"][entity]:
        print ("\t\t", e)

Keys:
	 contributors
	 coordinates
	 created_at
	 display_text_range
	 entities
	 favorite_count
	 favorited
	 geo
	 id
	 id_str
	 in_reply_to_screen_name
	 in_reply_to_status_id
	 in_reply_to_status_id_str
	 in_reply_to_user_id
	 in_reply_to_user_id_str
	 is_quote_status
	 lang
	 place
	 retweet_count
	 retweeted
	 screen_name
	 source
	 text
	 truncated
	 user_id


in_reply_to_screen_name 	 None
favorited 	 False
in_reply_to_status_id_str 	 None
in_reply_to_status_id 	 None
truncated 	 False
screen_name 	 realDonaldTrump
is_quote_status 	 False
text 	 Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!
display_text_range 	 [0, 117]
in_reply_to_user_id 	 None
favorite_count 	 359
contributors 	 None
created_at 	 Mon May 04 18:54:25 +0000 2009
user_id 	 25073877
in_reply_to_user_id_str 	 None
id 	 1698308935
coordinates 	 None
entities 	 {'user_mentions': [], 'hashtags': [], 'urls': [], 'symbols': []}
id_str 	 1698308935