# Lab Assignment 3: How to Load, Convert, and Write JSON Files in Python - Rachel Holman
## DS 6001: Practice and Application of Data Science

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

## Problem 0
Import the following libraries:

In [1]:
import numpy as np
import pandas as pd
import requests
import json
import sys
sys.tracebacklimit = 0 # turn off the error tracebacks

## Problem 1 
JSON and CSV are both text-based formats for the storage of data. It's possible to open either one in a plain text editor. Given this similarity, why does a CSV file usually take less memory than a JSON formatted file for the same data? Under what conditions could a JSON file be smaller in memory than a CSV file for the same data? (2 points)

**A CSV file usually takes less memory than a JSON formatted file for the same data because JSON implements a tree structure: every record repeats its key names, and the values are wrapped in braces, brackets, quotation marks, and colons. A CSV file, on the other hand, lists the column names once in the header row and then contains only the data and a separating delimiter.**

**A JSON file could be smaller in memory than a CSV file for the same data if there are a lot of missing values. This is because missing values in JSON can simply be omitted, while a CSV file must still record every field of every row, either with a placeholder such as "NA" or as an empty field with its delimiter.**
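
A rough way to check both claims is to serialize the same records both ways and compare the length of the resulting text. The sketch below uses made-up records and hypothetical helper names (`csv_size`, `json_size`); it assumes missing values are omitted from the JSON records and written as empty fields in the CSV.

```python
import csv
import io
import json


def csv_size(records, columns):
    """Number of characters needed to store the records as CSV text."""
    buf = io.StringIO()
    # restval="" writes an empty field wherever a record lacks a column
    writer = csv.DictWriter(buf, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(records)
    return len(buf.getvalue())


def json_size(records):
    """Number of characters needed to store the records as JSON text."""
    return len(json.dumps(records))


# Dense data (made up): every record has a value for every column
dense_columns = ["name", "mass", "year"]
dense = [{"name": f"rock{i}", "mass": i, "year": 1900 + i} for i in range(100)]

# Sparse data (made up): 20 columns, but each record fills in only one of them
sparse_columns = [f"c{j}" for j in range(20)]
sparse = [{f"c{i % 20}": 1} for i in range(100)]

print("dense :", csv_size(dense, dense_columns), "CSV vs", json_size(dense), "JSON")
print("sparse:", csv_size(sparse, sparse_columns), "CSV vs", json_size(sparse), "JSON")
```

In the dense case the repeated key names make the JSON text longer; in the sparse case the CSV still has to write 19 delimiters per row, so the JSON text comes out shorter.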

## Problem 2
NASA has a dataset of all meteorites that have fallen to Earth between the years A.D. 860 and 2013. The data contain the name of each meteorite, along with the coordinates of the place where the meteorite hit, the mass of the meteorite, and the date of the collision. The data is stored as a JSON here: https://data.nasa.gov/resource/y77d-th95.json

Look at the data in your web-browser and explain which strategy for loading the JSON into Python makes the most sense and why. 

Then write and run the code that will work for loading the data into Python. (2 points)

**Because this JSON data is nested but does not include metadata, the strategy for loading the JSON into Python that makes the most sense is as follows:**
1. Use `requests.get()` to download the raw JSON data
2. Use `json.loads()` on the `.text` attribute of the output from step 1 to register the data as a list in Python
3. Use the `pd.json_normalize()` function on the list that is the output of step 2

In [2]:
url = "https://data.nasa.gov/resource/y77d-th95.json"
nasa = requests.get(url)
nasa_json = json.loads(nasa.text)
nasa_df = pd.json_normalize(nasa_json)
nasa_df

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation.type,geolocation.coordinates,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21,Fell,1880-01-01T00:00:00.000,50.775000,6.083330,Point,"[6.08333, 50.775]",,
1,Aarhus,2,Valid,H6,720,Fell,1951-01-01T00:00:00.000,56.183330,10.233330,Point,"[10.23333, 56.18333]",,
2,Abee,6,Valid,EH4,107000,Fell,1952-01-01T00:00:00.000,54.216670,-113.000000,Point,"[-113, 54.21667]",,
3,Acapulco,10,Valid,Acapulcoite,1914,Fell,1976-01-01T00:00:00.000,16.883330,-99.900000,Point,"[-99.9, 16.88333]",,
4,Achiras,370,Valid,L6,780,Fell,1902-01-01T00:00:00.000,-33.166670,-64.950000,Point,"[-64.95, -33.16667]",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Tirupati,24009,Valid,H6,230,Fell,1934-01-01T00:00:00.000,13.633330,79.416670,Point,"[79.41667, 13.63333]",,
996,Tissint,54823,Valid,Martian (shergottite),7000,Fell,2011-01-01T00:00:00.000,29.481950,-7.611230,Point,"[-7.61123, 29.48195]",,
997,Tjabe,24011,Valid,H6,20000,Fell,1869-01-01T00:00:00.000,-7.083330,111.533330,Point,"[111.53333, -7.08333]",,
998,Tjerebon,24012,Valid,L5,16500,Fell,1922-01-01T00:00:00.000,-6.666670,106.583330,Point,"[106.58333, -6.66667]",,
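
The effect of steps 2 and 3 can also be seen offline. The following is a minimal sketch on a hand-written, two-record sample; the field values are copied from the output above, but the sample keeps only a few fields and its exact layout is an assumption, not the actual NASA response.

```python
import json

import pandas as pd

# Hand-written sample that imitates the nesting of the NASA records (assumed shape)
sample_text = """
[
  {"name": "Aachen", "id": "1", "mass": "21",
   "geolocation": {"type": "Point", "coordinates": [6.08333, 50.775]}},
  {"name": "Aarhus", "id": "2", "mass": "720",
   "geolocation": {"type": "Point", "coordinates": [10.23333, 56.18333]}}
]
"""

# Step 2: parse the raw text into a Python list of dictionaries
sample_records = json.loads(sample_text)

# Step 3: flatten the nested "geolocation" dictionary into dotted column names
sample_df = pd.json_normalize(sample_records)
print(sample_df.columns.tolist())
```

The nested `geolocation` dictionary becomes the columns `geolocation.type` and `geolocation.coordinates`, which matches the column names in the full dataframe.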


## Problem 3
https://open-meteo.com/ provides free and accurate weather forecasts for any location, and shares these forecasts via a free and open API in JSON format. The JSON that contains the next week of forecasts for Charlottesville is here: https://api.open-meteo.com/v1/forecast?latitude=38.03&longitude=-78.48&hourly=temperature_2m,relativehumidity_2m,precipitation,cloudcover&temperature_unit=fahrenheit (You can paste this URL to jsonhero.io if you want)

Create a dataframe with 168 rows (one for each hour of each day for the next 7 days) and only columns for features contained within the `hourly` key. [Note: this problem does not require either `pd.read_json()` or `pd.json_normalize()`.]

Also, make sure to include your user-agent string.

As an aside: consider for a moment what we could use access to the API to do. We could write Python code that connects to events on Facebook or Meetup, pulls weather data from this API, and automatically cancels outdoor events that have a high probability of rain in the forecast. Or we can set up automated notifications for stargazing events when the skies will be clear. Maybe we can build a routing app for tornado chasers. Or we can build a model that predicts plant growth from watering times under different weather conditions and notify a gardener about the ideal times to tend to the plants. Can you think of other potential uses of this fast, free, and accurate data? (3 points)

In [3]:
r=requests.get("http://httpbin.org/user-agent")
useragent = json.loads(r.text)['user-agent']
headers={'User-agent': useragent}

url = "https://api.open-meteo.com/v1/forecast?latitude=38.03&longitude=-78.48&hourly=temperature_2m,relativehumidity_2m,precipitation,cloudcover&temperature_unit=fahrenheit"
weather = requests.get(url, headers=headers)
weather_json = json.loads(weather.text)
weather_df = pd.DataFrame(weather_json['hourly'])
weather_df

Unnamed: 0,time,temperature_2m,relativehumidity_2m,precipitation,cloudcover
0,2023-06-27T00:00,71.0,82,0.0,99
1,2023-06-27T01:00,69.8,91,0.0,99
2,2023-06-27T02:00,69.5,90,0.0,99
3,2023-06-27T03:00,68.1,89,0.0,100
4,2023-06-27T04:00,67.0,89,0.0,99
...,...,...,...,...,...
163,2023-07-03T19:00,88.1,45,0.0,96
164,2023-07-03T20:00,87.9,45,0.0,93
165,2023-07-03T21:00,87.0,47,0.0,89
166,2023-07-03T22:00,85.2,52,0.0,67


**Another potential use for this fast, free, and accurate data is automating solar panels so that they shift direction to gather the most sunlight and avoid cloud cover. The data could also be used to alert people to the best time to tan and how long they can tan before risking injury or burn.**
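
As a rough illustration of the cloud-cover idea, the sketch below filters a dataframe shaped like `weather_df` for hours below a cloud-cover threshold. The four sample rows are copied from the output above; the threshold `MAX_CLOUDCOVER` and the variable names are hypothetical choices for illustration only.

```python
import pandas as pd

# Sample in the same shape as weather_json['hourly'] (values copied from the output above)
hourly = {
    "time": ["2023-07-03T19:00", "2023-07-03T20:00", "2023-07-03T21:00", "2023-07-03T22:00"],
    "temperature_2m": [88.1, 87.9, 87.0, 85.2],
    "relativehumidity_2m": [45, 45, 47, 52],
    "precipitation": [0.0, 0.0, 0.0, 0.0],
    "cloudcover": [96, 93, 89, 67],
}

# A dictionary of equal-length lists maps directly onto dataframe columns
sample_df = pd.DataFrame(hourly)
sample_df["time"] = pd.to_datetime(sample_df["time"])

# Hypothetical threshold: treat hours under 90 percent cloud cover as usable
MAX_CLOUDCOVER = 90
clear_hours = sample_df[sample_df["cloudcover"] < MAX_CLOUDCOVER]
print(clear_hours[["time", "cloudcover"]])
```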

## Problem 4
The NBA has saved data on all 30 teams' shooting statistics for the 2014-2015 season here: https://stats.nba.com/js/data/sportvu/2015/shootingTeamData.json. Take a moment and look at this JSON file in your web browser. The structure of this particular JSON is complicated, but see if you can find the team-by-team data. In this problem our goal is to use `pd.json_normalize()` to get the data into a dataframe. The following questions will guide you towards this goal.

### Part a
Download the raw text of the NBA JSON file and register it as JSON formatted data in Python's memory. (2 points)

In [4]:
# this dataset lists the column headers first then the data...
url = "https://stats.nba.com/js/data/sportvu/2015/shootingTeamData.json"
nba = requests.get(url)
nba_json = json.loads(nba.text)
#nba_json

### Part b
Describe, in words, the path that leads to the team-by-team data. (2 points)

**The path that leads to the team-by-team data starts at the "resultSets" key, which holds a list. Take the first element of that list (index 0), which is a dictionary, and then look up its "rowSet" key. The "rowSet" value is a list of lists, and each inner list is one row of the team-by-team data.**
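
This path can be sketched on a small mock object. The mock below is hypothetical: it keeps only the keys discussed in this problem (`resultSets`, `headers`, `rowSet`) and three of the 33 columns, with values copied from the team data shown in Part c.

```python
# Hypothetical, trimmed-down imitation of the NBA JSON structure
mock_json = {
    "resultSets": [
        {
            "headers": ["TEAM_ID", "TEAM_CITY", "TEAM_NAME"],
            "rowSet": [
                [1610612744, "Golden State", "Warriors"],
                [1610612759, "San Antonio", "Spurs"],
            ],
        }
    ]
}

# "resultSets" is a list, so an integer index is needed before the next key
first_result_set = mock_json["resultSets"][0]

# "rowSet" is a list of lists: one inner list per team
rows = first_result_set["rowSet"]
print(rows[0])
```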

### Part c
Use the `pd.json_normalize()` function to pull the team-by-team data into a dataframe. This is going to be tricky. You will need to use indexing on the JSON data as well as the `record_path` parameter. 

If you are successful, you will have a dataframe with 30 rows and 33 columns. The first row will refer to the Golden State Warriors, the second row will refer to the San Antonio Spurs, and the third row will refer to the Cleveland Cavaliers. The columns will only be named 0, 1, 2, ... at this point. (4 points)

In [5]:
nba_df = pd.json_normalize(nba_json, record_path=['resultSets', 'rowSet'])
nba_df


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,32
0,1610612744,Golden State,Warriors,GSW,,82,48.7,114.9,14.9,0.498,...,0.478,21.2,42.5,0.497,2.3,6.3,0.363,10.8,25.3,0.429
1,1610612759,San Antonio,Spurs,SAS,,82,48.3,103.5,14.8,0.481,...,0.506,18.3,39.8,0.46,0.9,2.6,0.341,6.1,15.9,0.381
2,1610612739,Cleveland,Cavaliers,CLE,,82,48.7,104.3,16.9,0.481,...,0.473,18.2,40.7,0.447,1.7,5.7,0.299,9.0,23.9,0.378
3,1610612746,Los Angeles,Clippers,LAC,,82,48.6,104.5,15.0,0.497,...,0.48,18.9,42.0,0.45,2.0,6.0,0.334,7.7,20.8,0.373
4,1610612760,Oklahoma City,Thunder,OKC,,82,48.6,110.2,16.1,0.48,...,0.497,17.5,38.7,0.451,1.6,5.1,0.321,6.6,18.6,0.356
5,1610612737,Atlanta,Hawks,ATL,,82,48.6,102.8,19.0,0.463,...,0.483,19.4,44.6,0.435,1.0,3.1,0.311,9.0,25.3,0.355
6,1610612745,Houston,Rockets,HOU,,82,48.6,106.5,17.2,0.433,...,0.472,15.5,36.4,0.426,2.3,7.4,0.318,8.4,23.5,0.355
7,1610612757,Portland,Trail Blazers,POR,,82,48.5,105.1,17.5,0.441,...,0.447,18.0,39.8,0.453,1.7,5.9,0.295,8.8,22.6,0.389
8,1610612758,Sacramento,Kings,SAC,,81,48.4,106.7,18.7,0.452,...,0.473,18.1,39.7,0.454,0.9,3.1,0.276,7.2,19.4,0.372
9,1610612764,Washington,Wizards,WAS,,82,48.5,104.1,15.4,0.48,...,0.483,19.5,44.3,0.439,0.7,2.7,0.254,8.0,21.5,0.371
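
The call above passes a two-step `record_path`. An alternative that follows the prompt's hint more literally is to index into the first result set and pass only `'rowSet'` as the `record_path`. The sketch below compares the two forms on a hypothetical, trimmed mock of the JSON (three columns, two teams), not on the real download.

```python
import pandas as pd

# Hypothetical, trimmed-down imitation of the NBA JSON structure
mock_json = {
    "resultSets": [
        {
            "headers": ["TEAM_ID", "TEAM_CITY", "TEAM_NAME"],
            "rowSet": [
                [1610612744, "Golden State", "Warriors"],
                [1610612759, "San Antonio", "Spurs"],
            ],
        }
    ]
}

# Form 1: let record_path walk through both levels
via_path = pd.json_normalize(mock_json, record_path=["resultSets", "rowSet"])

# Form 2: index into the first result set, then point record_path at "rowSet"
via_index = pd.json_normalize(mock_json["resultSets"][0], record_path="rowSet")

print(via_path.equals(via_index))
```

Form 1 collects rows from every element of `resultSets`, while Form 2 uses only the first one, so the two agree whenever there is a single result set.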


### Part d
Find the path that leads to the headers (the column names), and extract these names as a list. Then set the `.columns` attribute of the dataframe you created in part c equal to this list. The result should be that the dataframe now has the correct column names. (3 points)

In [6]:
# use a separate name so the request headers from Problem 3 are not overwritten
nba_headers = nba_json['resultSets'][0]['headers']
nba_df.columns = nba_headers
nba_df

Unnamed: 0,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CODE,GP,MIN,PTS,PTS_DRIVE,FGP_DRIVE,...,CFGP,UFGM,UFGA,UFGP,CFG3M,CFG3A,CFG3P,UFG3M,UFG3A,UFG3P
0,1610612744,Golden State,Warriors,GSW,,82,48.7,114.9,14.9,0.498,...,0.478,21.2,42.5,0.497,2.3,6.3,0.363,10.8,25.3,0.429
1,1610612759,San Antonio,Spurs,SAS,,82,48.3,103.5,14.8,0.481,...,0.506,18.3,39.8,0.46,0.9,2.6,0.341,6.1,15.9,0.381
2,1610612739,Cleveland,Cavaliers,CLE,,82,48.7,104.3,16.9,0.481,...,0.473,18.2,40.7,0.447,1.7,5.7,0.299,9.0,23.9,0.378
3,1610612746,Los Angeles,Clippers,LAC,,82,48.6,104.5,15.0,0.497,...,0.48,18.9,42.0,0.45,2.0,6.0,0.334,7.7,20.8,0.373
4,1610612760,Oklahoma City,Thunder,OKC,,82,48.6,110.2,16.1,0.48,...,0.497,17.5,38.7,0.451,1.6,5.1,0.321,6.6,18.6,0.356
5,1610612737,Atlanta,Hawks,ATL,,82,48.6,102.8,19.0,0.463,...,0.483,19.4,44.6,0.435,1.0,3.1,0.311,9.0,25.3,0.355
6,1610612745,Houston,Rockets,HOU,,82,48.6,106.5,17.2,0.433,...,0.472,15.5,36.4,0.426,2.3,7.4,0.318,8.4,23.5,0.355
7,1610612757,Portland,Trail Blazers,POR,,82,48.5,105.1,17.5,0.441,...,0.447,18.0,39.8,0.453,1.7,5.9,0.295,8.8,22.6,0.389
8,1610612758,Sacramento,Kings,SAC,,81,48.4,106.7,18.7,0.452,...,0.473,18.1,39.7,0.454,0.9,3.1,0.276,7.2,19.4,0.372
9,1610612764,Washington,Wizards,WAS,,82,48.5,104.1,15.4,0.48,...,0.483,19.5,44.3,0.439,0.7,2.7,0.254,8.0,21.5,0.371


## Problem 5
Save the NBA dataframe you extracted in problem 4 as a JSON-formatted text file on your local machine. Format the JSON so that it is organized as dictionary with three lists: `columns` lists the column names, `index` lists the row names, and `data` is a list-of-lists of data points, one list for each row. (Hint: this is possible with one line of code) (2 points)

In [7]:
import os
os.chdir("/Users/rachelholman/Desktop/MSDS/DS6001 - Application of DS/Module 3- JSON Data")

# Passing a path to to_json() writes the file directly. Calling json.dump() on the
# string that to_json() returns would instead save one quoted, escaped JSON string
# rather than a dictionary with "columns", "index", and "data".
nba_df.to_json("nba_j", orient="split")
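
To confirm that a file saved with `orient="split"` has the requested layout, it can be read back and inspected. The sketch below does this for a small made-up dataframe in a temporary directory; `small_df` and the file name `small.json` are hypothetical stand-ins for the NBA dataframe and its output file.

```python
import json
import os
import tempfile

import pandas as pd

# Small stand-in for the NBA dataframe (made-up subset of two columns)
small_df = pd.DataFrame({"TEAM_NAME": ["Warriors", "Spurs"], "GP": [82, 82]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "small.json")

    # One line writes the dataframe in the columns/index/data layout
    small_df.to_json(path, orient="split")

    # Reading the file back as plain JSON shows the three top-level keys
    with open(path) as infile:
        saved = json.load(infile)

    # pandas can rebuild the dataframe from the same file
    restored = pd.read_json(path, orient="split")

print(sorted(saved))
print(saved["data"])
```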