In [2]:
import pandas as pd      # the necessary library
import json              # json duh
from rich import print   # extra Python package that makes standard output colorful and more readable

## JSON Into DataFrame

Ok, so we have done a crash course on `DataFrames`, but we haven't even scratched the surface (yet). For a longer, but still condensed, introduction you can go here: https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html

For now I want to discuss converting a JSON file into something a DataFrame can handle. Loading a CSV into a DataFrame is really straightforward, but JSON is not really set up that way. That isn't always convenient, and most data isn't stored in a DataFrame-friendly layout, but get used to it. DataFrames are extremely popular across multiple libraries. Much of your time in the research game will be spent converting data from one format to another. Whether you are removing anomalies, cleaning up bad values, or altering the format for a specific purpose, you will find yourself manipulating data to increase its usability. That's what we're doing here: taking a somewhat useful JSON file and making it fit into a DataFrame.

JSON is built from `key:value` pairs and is not columnar like DataFrames expect. To fix JSON data so that a pandas DataFrame can read it, we need to turn all of the `key:value` pairs into parallel lists. Remember parallel arrays from early CS? The image below shows how you can use the same index to access the values for a single entity, an entity being (in this case) a person's `Name`, `Age`, and `Sex`.

<img src="./images/parallel_arrays.png" width="400">

It is equivalent to this JSON object (ignoring the empty spaces):
```json
{
    "Name": [
        "Braund, Mr. Owen Harris",
        "Allen, Mr. William Henry",
        "Bonnell, Miss. Elizabeth"
    ],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"]
}
```
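As a quick sketch of how parallel lists map onto columns, the same three people can be written as a Python dictionary of lists and handed to `pd.DataFrame`. The variable names `people` and `people_df` are made up for this illustration.

```python
import pandas as pd

# Hypothetical variable names; the values come from the JSON object above.
people = {
    "Name": [
        "Braund, Mr. Owen Harris",
        "Allen, Mr. William Henry",
        "Bonnell, Miss. Elizabeth",
    ],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
}

# Each key becomes a column header, each list becomes that column's values.
people_df = pd.DataFrame(people)

# The same index selects one person across all three columns.
print(people_df.columns.tolist())
print(people_df.loc[1])
```

Note that Python tolerates the trailing commas inside the dictionary literal, while strict JSON does not.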

When using a Python `dictionary of lists`, the dictionary keys are used as `column headers` and the values in each list are placed in the `columns` of the `DataFrame`. Given the following JSON object, can we create a copy that will load into a DataFrame?

```json
[
    {
        "city": "New York",
        "growth": 4.8,
        "latitude": 40.7127837,
        "longitude": -74.0059413,
        "population": 8405837,
        "rank": 1,
        "state": "New York"
    },
    {
        "city": "Los Angeles",
        "growth": 4.8,
        "latitude": 34.0522342,
        "longitude": -118.2436849,
        "population": 3884307,
        "rank": 2,
        "state": "California"
    },
    {
        "city": "Chicago",
        "growth": -6.1,
        "latitude": 41.8781136,
        "longitude": -87.6297982,
        "population": 2718782,
        "rank": 3,
        "state": "Illinois"
    },
    ...
]
```

This is found [here](./cities_latlon_w_pop.json) in its full form. Let's start by loading the file:

In [None]:
with open("./cities_latlon_w_pop.json") as f:
    cities = json.load(f)

# print first 3 cities
print(cities[:3])

We loaded the data, so now let's create a new dictionary with the seven parallel arrays needed, one for each of the data elements. The keys are `city`, `growth`, ... `state`. I will grab them programmatically, then create the parallel-array version of the JSON object.


In [None]:
# grab keys from 1st entry
keys = cities[0].keys()

# create a new dictionary (the Python counterpart of a JSON object)
cities2 = {}

# iterate over keys and create a list for every key
for key in keys:
    cities2[key] = []

# Note: dictionaries preserve insertion order (guaranteed since Python 3.7),
# so iterating over cities2 returns the keys in the order they were added.
print(cities2)

Now that we have the new structure, let's load it up:

In [None]:
for city in cities:
    for key in city:
        cities2[key].append(city[key])

for key in cities2:
    print(cities2[key][:5])
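The two loops above (create the empty lists, then append) can also be collapsed into a single dictionary comprehension. This is only a sketch under a few assumptions: `sample_cities` is a trimmed, hypothetical stand-in for the loaded file, and `city.get(key)` is a design choice that puts `None` where a record lacks a key, so the lists stay the same length.

```python
# Hypothetical three-record sample with a subset of the fields shown earlier.
sample_cities = [
    {"city": "New York", "population": 8405837, "rank": 1},
    {"city": "Los Angeles", "population": 3884307, "rank": 2},
    {"city": "Chicago", "population": 2718782, "rank": 3},
]

# Take the column names from the first record, as in the loop version.
keys = sample_cities[0].keys()

# One list per key; .get() yields None for a missing key instead of raising,
# which keeps every list aligned on the same index.
columns = {key: [city.get(key) for city in sample_cities] for key in keys}

for key, values in columns.items():
    print(key, values)
```

The explicit loops are easier to step through when learning; the comprehension is simply a more compact way to build the same parallel lists.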


Now we can load it into a DataFrame, no problem!

In [None]:
df = pd.DataFrame(cities2)
print(df)
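Doing the conversion by hand is good practice, but it is worth knowing that `pd.DataFrame` also accepts a list of dictionaries (one per row) directly. The sketch below uses a small hypothetical `records` list, copied from the first three cities above with a few fields dropped, and checks that both routes produce the same table.

```python
import pandas as pd

# Hypothetical sample: three of the records shown earlier, with fewer fields.
records = [
    {"city": "New York", "growth": 4.8, "population": 8405837, "state": "New York"},
    {"city": "Los Angeles", "growth": 4.8, "population": 3884307, "state": "California"},
    {"city": "Chicago", "growth": -6.1, "population": 2718782, "state": "Illinois"},
]

# Route 1: build the parallel lists by hand, as in the cells above.
parallel = {key: [record[key] for record in records] for key in records[0]}
df_from_lists = pd.DataFrame(parallel)

# Route 2: let pandas do the pivot from the list of row dictionaries.
df_from_records = pd.DataFrame(records)

# Compare the two DataFrames (values, column order, and dtypes).
print(df_from_records.equals(df_from_lists))
```

Either route should end in the same place; the manual version just makes the columnar layout visible.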