# JSON - JavaScript Object Notation

One of the most popular ways to transfer data on the web is JSON. It stands for JavaScript Object Notation. Although it comes from JavaScript, it looks pretty readable in Python. From a high-level Python perspective, JSON is merely a combination of lists and dictionaries. The following code, which looks like Python, is a dictionary nested in a list nested in a dictionary. Yet everything to the right of the equals sign is also valid JSON:

~~~ JavaScript
muppet_json = {"Type":"Muppet",
               "Cases":[{"Kermit":"Frog",
                         "Miss Piggy":"Pig",
                         "Gonzo":"Weirdo"}]}
~~~

To load JSON into Python directly you can use the ```json``` library. It provides a means to parse a string of JSON into Python data structures in memory (```json.loads(THE_DATA)```) and a means to take a data structure and transform it into a string of valid JSON, ready for writing to disk (```json.dumps(THE_DATA)```).
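As a minimal sketch of that round trip, the snippet below parses a string of JSON and then serialises the result again. The variable names are only illustrative, and the string is the Muppet example from above.

```python
import json

# A JSON document is just text; here it is held in an ordinary Python string.
muppet_text = '{"Type": "Muppet", "Cases": [{"Kermit": "Frog", "Miss Piggy": "Pig", "Gonzo": "Weirdo"}]}'

# json.loads parses the text into nested Python dictionaries and lists.
muppet_data = json.loads(muppet_text)
print(type(muppet_data))                 # <class 'dict'>
print(muppet_data["Cases"][0]["Gonzo"])  # Weirdo

# json.dumps goes the other way and returns a string of JSON.
round_trip = json.dumps(muppet_data)
print(round_trip)
```

The same library also has ```json.load``` and ```json.dump``` (without the 's'), which read from and write to an open file object rather than a string.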

Finding some JSON to work with is not difficult. (You can check the website files for an example of JSON.) For example, the website Reddit, which is one of the largest link-sharing communities on the web, will format just about every piece of data on the site as JSON. In order to do this, you simply append ```.json``` to the URL, as in ```http://www.reddit.com/r/aww.json```. Alternatively, go directly to ```http://api.reddit.com/r/aww```, which will also format the data as JSON by default. If you go to these URLs in a browser you might end up seeing a page with a complete wall of text. For example:

~~~ json
{"kind": "Listing", "data": {"modhash": "", "dist": 26, "children": [{"kind": "t3", "data": { "approved_at_utc": null, "subreddit": "aww", "selftext": "", "author_fullname": "t2_asw2a", "saved": false, "mod_reason_title": null, "gilded": 0, "clicked": false, "title": "/r/samoyeds: /r/Aww Subreddit of the Week!", "link_flair_richtext": [], "subreddit_name_prefixed": "r/aww", "hidden": false, "pwls": 6, "link_flair_css_class": null, "downs": 0, "thumbnail_height": 140, "hide_score": false, "name": "t3_cc1syd", "quarantine": false, "link_flair_text_color": "dark", "author_flair_background_color": null, "subreddit_type": "public", "ups": 95, "total_awards_received": 0, "media_embed": {}, "thumbnail_width": 140, "author_flair_template_id": null, "is_original_content": false, ... }}]}} 
~~~

This is just a wall of characters, though we can see both the ```{``` braces and ```[``` brackets we associate with dictionaries and lists, respectively. We can see that strings are in quotes, and there are also numbers and booleans. Here we can also see some of the subtle ways in which JSON is not quite Python. For example, the booleans are written ```false``` in lower-case, whereas Python uses ```False```. Similarly, missing values are written as ```null```, where Python uses ```None``` (an empty string is still written as ```""```). Despite these subtle differences it should look pretty familiar.
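To see how these literals come out on the Python side, here is a small sketch. The snippet is made up for illustration, loosely modelled on the Reddit fields above.

```python
import json

# A small JSON snippet using the literals that differ from Python.
snippet = '{"saved": false, "clicked": true, "mod_reason_title": null, "selftext": "", "ups": 95}'

parsed = json.loads(snippet)

print(parsed["saved"])             # False
print(parsed["mod_reason_title"])  # None
print(repr(parsed["selftext"]))    # ''
print(type(parsed["ups"]))         # <class 'int'>
```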

With all of these braces and brackets, it would appear that this structure is nested, but it is hard to see how when it is a wall of text. For JSON we have the notion of 'pretty-printing'. This is where the text is formatted in a more readable way. The function ```json.dumps``` has an argument ```indent=X``` where ```X``` refers to the number of spaces to indent for each level in the hierarchy. Conventionally we use 4, like so:

~~~ python
print(json.dumps(THE_DATA, indent=4))

> {
      "kind": "Listing",
      "data": {
          "modhash": "",
          "dist": 26,
          "children": [
              {
                  "kind": "t3",
                  "data": {
                      "approved_at_utc": null,
                      "subreddit": "aww",
                      "selftext": "",...
~~~

Pretty printing can help us navigate the JSON. Recall that the data we receive is in a structure that is not oriented to data science at the outset. Instead, it is organised towards the system that creates and manages the data. For Reddit, this means that it sends down a huge amount of extra text that would be useful if you wanted to display Reddit yourself. This is what one might do with a third-party Reddit client for iOS or Android: they take this data and use it to format the Reddit page for their app users. We, on the other hand, want to repurpose this data to ask questions _about_ Reddit. This means we have to learn a little about how the data is formatted, ask for the correct data, transform it into a DataFrame and then ask questions of the DataFrame.
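As a rough sketch of what "learning how the data is formatted" looks like in practice, the snippet below walks a heavily trimmed copy of the Reddit listing shown earlier. Most fields have been dropped and the variable names are only illustrative.

```python
import json

# A heavily trimmed copy of the Reddit listing shown earlier (most fields dropped).
listing_text = """
{"kind": "Listing",
 "data": {"dist": 26,
          "children": [{"kind": "t3",
                        "data": {"subreddit": "aww",
                                 "title": "/r/samoyeds: /r/Aww Subreddit of the Week!",
                                 "ups": 95,
                                 "downs": 0}}]}}
"""

listing = json.loads(listing_text)

# Each post sits under data -> children, with its fields under the child's own "data" key.
posts = [child["data"] for child in listing["data"]["children"]]

for post in posts:
    print(post["title"], post["ups"])
```

Once the posts are collected into a plain list of dictionaries like this, they are in a shape that is much easier to turn into a table.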

Since we are just learning to get data into Python at this point, our questions should not be too complicated, but they can still be useful. We will focus on some of the skills learned in the last chapter, like slicing data, counting elements, and getting an average. But first we have to get the data in. 

To practice, I have prepared a JSON file of Muppet episodes from the first four seasons of The Muppet Show. This data came from TheTVdb.com. That site is a third-party database set up to describe the episodes, characters, summaries, and details of television shows. In a later chapter, we will show how to access this data directly using authentication. For now, however, we will simply use the data provided with this book. If you need to find this data, simply download it from xx.

First we will want to import the ```json``` library and then load up the file. When we do that, we can start to explore its structure. Since JSON converts to dictionaries and lists, let's find out which one is the root. If it is a list, we will have to iterate through the elements. If it is a dictionary, we will have to navigate the keys.
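A minimal sketch of that first check is shown here. The helper ```describe_root``` is hypothetical, written only for this illustration, and the two JSON strings are toy inputs.

```python
import json

def describe_root(parsed):
    """Return a short description of the top-level JSON value (hypothetical helper)."""
    if isinstance(parsed, dict):
        # A dictionary root: the next step is to navigate its keys.
        return "dict with keys: " + ", ".join(parsed.keys())
    if isinstance(parsed, list):
        # A list root: the next step is to iterate through its elements.
        return "list with {} elements".format(len(parsed))
    return "single value of type " + type(parsed).__name__

print(describe_root(json.loads('{"links": {}, "data": []}')))  # dict with keys: links, data
print(describe_root(json.loads('[1, 2, 3]')))                  # list with 3 elements
```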

In the example below I show first how to read in JSON and then navigate some of the keys. From this code we discover that the JSON has a dictionary with two keys, ```links``` and ```data```. We are interested in ```data```. Under there are 100 elements in a list, which turns out to be the maximum number of entries TheTVdb returns in one query. Each element is a dictionary corresponding to one episode, with keys ```airedSeason```, ```writers```, ```overview```, ```seriesId```, etc.

~~~ python
import json
import os

# Build the relative path ../Data/muppetEpisodes.json in an OS-independent way.
path = os.path.join("..", "Data", "muppetEpisodes.json")

# Read the file and parse its text; the with-block closes the file afterwards.
with open(path) as f:
    filein = json.loads(f.read())

print(type(filein))  # This shows it is a dictionary, so let's ask for keys.

print(filein.keys())  # Perhaps we want to explore the 'data' key.

print(type(filein['data']))  # It would appear 'data' is a list.

print(len(filein['data']))  # This list has 100 entries.

# print(filein['data'][0])  # Let's view the first entry. It's very long with a summary and other details.

print(filein['data'][0].keys())  # Inspect the keys - these will go in our table.
~~~

~~~
<class 'dict'>
dict_keys(['links', 'data'])
<class 'list'>
100
dict_keys(['id', 'airedSeason', 'airedSeasonID', 'airedEpisodeNumber', 'episodeName', 'firstAired', 'guestStars', 'director', 'directors', 'writers', 'overview', 'language', 'productionCode', 'showUrl', 'lastUpdated', 'dvdDiscid', 'dvdSeason', 'dvdEpisodeNumber', 'dvdChapter', 'absoluteNumber', 'filename', 'seriesId', 'lastUpdatedBy', 'airsAfterSeason', 'airsBeforeSeason', 'airsBeforeEpisode', 'thumbAuthor', 'thumbAdded', 'thumbWidth', 'thumbHeight', 'imdbId', 'siteRating', 'siteRatingCount'])
~~~


With these 100 entries it seems like we should be able to make a table with 100 rows, one for each entry. In theory we could create a DataFrame with the first dictionary as a single row, then add each new entry just as we demonstrated adding a row in the previous chapter. However, that would be pretty tedious. Luckily, pandas provides a function for converting parsed JSON directly: ```json_normalize``` (with a 'z', not an 's'). This function has some quirks. In older versions of pandas it has to be imported separately from ```pandas.io.json```, while pandas 1.0 and later also expose it as ```pandas.json_normalize```. Either way, it is a really handy command. Notice that it takes in the JSON once it has already been parsed with ```json.loads```. Observe how it then takes the list and reshapes it as a table.

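~~~ python
import pandas as pd  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

with open(os.path.join("..", "Data", "muppetEpisodes.json")) as f:
    muppetjson = json.loads(f.read())

# Flatten the list of episode dictionaries under "data" into one row per episode.
muppetdf = pd.json_normalize(muppetjson["data"])
display(muppetdf.head())
~~~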

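| | absoluteNumber | airedEpisodeNumber | airedSeason | airedSeasonID | airsAfterSeason | airsBeforeEpisode | airsBeforeSeason | director | directors | dvdChapter | ... | productionCode | seriesId | showUrl | siteRating | siteRatingCount | thumbAdded | thumbAuthor | thumbHeight | thumbWidth | writers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 24 | 1 | 4221 | | | | | [] | | ... | | 72476 | | 8.0 | 1 | | 3549 | 300 | 400 | [] |
| 1 | 2.0 | 22 | 1 | 4221 | | | | | [] | | ... | | 72476 | | 7.0 | 3 | | 3549 | 300 | 400 | [] |
| 2 | 3.0 | 5 | 1 | 4221 | | | | | [] | | ... | | 72476 | | 6.8 | 4 | | 3549 | 300 | 400 | [] |
| 3 | 4.0 | 4 | 1 | 4221 | | | | | [] | | ... | | 72476 | | 7.7 | 3 | | 3549 | 300 | 400 | [] |
| 4 | 5.0 | 3 | 1 | 4221 | | | | | [] | | ... | | 72476 | | 6.6 | 5 | | 3549 | 300 | 400 | [] |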


In the last chapter we simply displayed the entire table. This was not an issue since we had only three or so rows and four columns. But now we have 100 rows and 34 columns. That is too much to print. Nevertheless, it is important to practice _data skepticism_. That is, did it work? So, we should print a little bit of the data using the commands ```df.head()``` and ```df.tail()``` to see if things worked as expected. These commands show the first 5 and last 5 rows respectively. If you want to see more rows you can pass the number as an argument, as in ```muppetdf.tail(10)```, which shows the last 10 entries in the DataFrame.

You might also notice that we did not do ```json_normalize(muppetjson)```. Instead, we did ```json_normalize(muppetjson["data"])```. Try removing ```["data"]``` and see for yourself what happens: you get a single row in which all of the episode data sits in one very long cell. _This_ is the reason that we wanted to explore the data a little bit first. ```json_normalize``` will transform JSON into a table, but when it is given a dictionary it treats it as a single record and uses the top-level keys as columns. Where the value of a key is itself another dictionary, the nested keys become columns as well; in the muppetjson case these would be ```links.first``` and ```links.last```. A list, such as the one under ```data```, is left in a single cell.
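The flattening idea can be sketched in plain Python. This is a rough illustration only, not pandas' actual implementation: the helper ```flatten``` is hypothetical, and the ```toy``` dictionary is a made-up stand-in that borrows the two top-level keys of the Muppet file with invented values.

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into one level with dotted keys.

    A rough illustration of the idea only, not pandas' actual implementation.
    """
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Nested dictionaries contribute one entry per inner key.
            flat.update(flatten(value, prefix=name + "."))
        else:
            # Lists, strings, numbers, etc. are kept whole as a single value.
            flat[name] = value
    return flat

# A made-up stand-in for the Muppet file: same top-level keys, invented values.
toy = {"links": {"first": 1, "last": 4}, "data": [{"id": 1}, {"id": 2}]}

print(flatten(toy))
# {'links.first': 1, 'links.last': 4, 'data': [{'id': 1}, {'id': 2}]}
```

The whole dictionary collapses into one flat record, which is why the table ends up with a single row, and the entire list of episodes lands under the one ```data``` key.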

TheTVdb is certainly not the only place for data about The Muppet Show. Later, in the chapter on merging, we will look at the data dumps from the Internet Movie Database (http://imdb.com/). This data set is much larger, as it contains information about every show and movie on IMDb. Thus, it will take considerable care to properly slice the data down to the cases that we want.