# `004-data-manip-jsonlines`

Task: use list and dict comprehensions to work with data stored as newline-delimited JSON

## Setup

In [1]:
import json
import requests
from collections import Counter

## Task

As you discovered in Homework 1, preparing data is often a key (and tedious) part of model training. Here we'll practice a few basic data prep tasks. Large datasets often use streaming file formats such as ndjson (also known as JSONL), so we'll practice with a small dataset in that format.

In [2]:
url = 'https://raw.githubusercontent.com/jsonlines/guide/master/datagov100.json'

1. Load the data. Remove `tags` and `extras` as you read it in, since these are large data structures that we don't need. (You can use `del dct[key]` to remove a key from a dictionary.)
2. What is the most common `license_title` for these datasets? (Use `Counter`, imported above from the `collections` module, with a list comprehension.) *You should get 'U.S. Government Work', ignoring datasets that have no license title.*
3. What is the average number of `resources` for each dataset? (Use `len(dataset['resources'])` in a list comprehension.) *You should get 1.36.*
4. Create a dictionary mapping the title of each dataset to the `url` of its first listed resource. (Use a dict comprehension.) Skip datasets with no resources. Use this dict to find the URL of `'Geologic map of Arkansas (NGMDB)'`.

## Solution

In [3]:
# Load the data into a list of dictionaries
data = []

# Here's how to stream a JSONL file from a URL one line at a time
response = requests.get(url, stream=True)
response.raise_for_status()  # fail early on an HTTP error
for line in response.iter_lines():
    if not line:  # skip blank lines
        continue
    record = json.loads(line.decode('utf-8'))  # convert line from bytes to dictionary
    del record['tags'], record['extras']  # drop the large fields we don't need
    data.append(record)
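
The same cleanup can also be expressed with comprehensions rather than `del`. Below is a minimal sketch, assuming the JSONL text is already in memory as a string. The names `sample_jsonl`, `DROP_KEYS`, and `parse_jsonl` are illustrative, and the two records are made up rather than taken from the real dataset.

```python
import json

# Hypothetical two-record JSONL payload (with a blank line) standing in for the download
sample_jsonl = (
    '{"title": "A", "tags": ["x"], "extras": [], "resources": []}\n'
    '\n'
    '{"title": "B", "tags": [], "extras": [1], "resources": [{"url": "http://example.com/b"}]}\n'
)

DROP_KEYS = {"tags", "extras"}  # keys we do not need


def parse_jsonl(text, drop_keys=DROP_KEYS):
    """Parse JSONL text into a list of dicts, omitting drop_keys from each record."""
    return [
        # inner dict comprehension: copy every key except the unwanted ones
        {k: v for k, v in json.loads(line).items() if k not in drop_keys}
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    ]


records = parse_jsonl(sample_jsonl)
```

Building a new dictionary leaves the parsed record untouched and does not raise `KeyError` when a record lacks one of the keys, which `del` would.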

In [4]:
# Grab license_title values 
license_title_values = [x['license_title'] for x in data]
Counter(license_title_values)

Counter({'Creative Commons CCZero': 8,
         'Other License Specified': 8,
         'U.S. Government Work': 15,
         None: 69})

2. Ignoring the 69 datasets whose `license_title` is `None`, 'U.S. Government Work' is the most common `license_title` (15 datasets).
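
Because `None` is itself a key in the `Counter` output above, it can help to filter out missing values before counting and then ask for the top entry directly. This is a minimal sketch on made-up records; `sample`, `counts`, `top_title`, and `top_count` are illustrative names.

```python
from collections import Counter

# Hypothetical records; only the license_title field matters here
sample = [
    {"license_title": None},
    {"license_title": "U.S. Government Work"},
    {"license_title": None},
    {"license_title": "U.S. Government Work"},
    {"license_title": "Creative Commons CCZero"},
    {"license_title": None},
]

# Count only the records that actually have a license title
counts = Counter([d["license_title"] for d in sample if d["license_title"] is not None])

# most_common(1) returns a one-element list of (value, count) pairs
top_title, top_count = counts.most_common(1)[0]
```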

In [5]:
# Find average number of resources for each dataset
resources_lengths = [len(x['resources']) for x in data]
sum(resources_lengths) / len(resources_lengths)

1.36
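
As an alternative to dividing `sum` by `len`, the standard library's `statistics.mean` computes the same average. This is a small sketch on made-up records; `sample` and the `avg_*` names are illustrative.

```python
from statistics import mean

# Hypothetical records with 1, 2, 1, and 0 resources respectively
sample = [
    {"resources": [{"url": "http://example.com/a"}]},
    {"resources": [{"url": "http://example.com/b"}, {"url": "http://example.com/c"}]},
    {"resources": [{"url": "http://example.com/d"}]},
    {"resources": []},
]

lengths = [len(d["resources"]) for d in sample]

avg_manual = sum(lengths) / len(lengths)  # same approach as the cell above
avg_stdlib = mean(lengths)                # standard-library equivalent
```

Both approaches fail on an empty list (`ZeroDivisionError` and `statistics.StatisticsError` respectively), so check for that case when the input may be empty.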

In [6]:
# Map each dataset title to the URL of its first resource, skipping datasets with no resources
name_and_url = {x['title']: x['resources'][0]['url'] for x in data if x['num_resources'] > 0}
name_and_url['Geologic map of Arkansas (NGMDB)']

'http://ngmdb.usgs.gov/Prodesc/proddesc_16308.htm'
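
The filter can also test the `resources` list itself, since an empty list is falsy, and `dict.get` avoids a `KeyError` when a title was skipped. This is a minimal sketch on made-up records; `sample`, `first_url_by_title`, `found`, and `missing` are illustrative names.

```python
# Hypothetical records: one with two resources, one with none
sample = [
    {
        "title": "Has resources",
        "resources": [{"url": "http://example.com/a"}, {"url": "http://example.com/b"}],
    },
    {"title": "No resources", "resources": []},
]

# An empty list is falsy, so `if d["resources"]` skips datasets with no resources
first_url_by_title = {
    d["title"]: d["resources"][0]["url"] for d in sample if d["resources"]
}

found = first_url_by_title.get("Has resources")
missing = first_url_by_title.get("No resources")  # skipped above, so .get returns None
```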