# Data Structures / Formats
#### Attribution
* Airplane Crash Data: https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq

## Working with Files

In [None]:
# The naive way: if read() raises, close() never runs and the file stays open
f1 = open('important_text.txt', 'r')
print(f1.read())
f1.close()

In [None]:
# The Pythonic (good) way
with open('important_text.txt') as f2:
    print(f2.read())

# No need to call `f2.close()`: the file is closed when the block exits.
# The name f2 still exists afterwards, but the file object is closed,
# so reading from it raises a ValueError.

In [None]:
f2.read()
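
The behaviour of the `with` block can be checked without `important_text.txt`. The sketch below is only an illustration: it writes a throwaway temporary file (an assumption, not part of the original notebook), raises a simulated error inside the block, and then inspects the file object.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on important_text.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('hello')
    tmp_path = tmp.name

try:
    with open(tmp_path) as f:
        contents = f.read()
        raise RuntimeError('simulated failure inside the block')
except RuntimeError:
    pass

# The with statement closed the file even though the block raised.
closed_after_error = f.closed

# The name f is still bound, but reading from a closed file fails.
try:
    f.read()
    read_error = None
except ValueError as exc:
    read_error = str(exc)

os.remove(tmp_path)
print(closed_after_error, read_error)
```

With the naive `open` / `close` pair, the same simulated error would skip the `close()` call entirely.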

## Working with Data

### CSV

In [1]:
import csv

# Read the data (the csv module recommends opening files with newline='')
with open('airplane_crashes.csv', newline='') as f:
    csv_reader = csv.reader(f)
    airplane_data = list(csv_reader)
print(airplane_data[0])


['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type', 'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground', 'Summary']
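
Counting positions in the header row by hand is error-prone. As a hedged alternative, `csv.DictReader` uses the header row as keys so that columns can be looked up by name. The rows below are made up for illustration and only reuse four of the real column names; `to_int` is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample rows; the values are made up for illustration only.
sample_csv = io.StringIO(
    "Date,Type,Fatalities,Ground\n"
    "01/02/1950,Example Type A,3,0\n"
    "05/06/1950,Example Type B,,1\n"
    "07/08/1951,Example Type A,10,\n"
)

# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(sample_csv))

def to_int(value):
    """Treat empty cells as zero."""
    return int(value) if value else 0

fatalities = [to_int(row['Fatalities']) for row in rows]
ground = [to_int(row['Ground']) for row in rows]
print(fatalities)
print(ground)
```

Looking up `row['Fatalities']` makes it harder to read the neighbouring `Ground` column by mistake.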


In [None]:
from collections import defaultdict
from datetime import datetime

# Tabulate the data
# Column 10 is 'Fatalities'; column 11 is 'Ground' (see the header row above)
fatalities_per_year = defaultdict(int)
for incident in airplane_data[1:]:
    # The year is the last '/'-separated field of the date
    year = datetime(int(incident[0].split('/')[-1]), 1, 1)
    num_fatalities = int(incident[10]) if incident[10] else 0
    fatalities_per_year[year] += num_fatalities

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

# Plot the data
plt.scatter(*zip(*fatalities_per_year.items())) # Feed (year, total) pairs into pyplot
plt.ylim(bottom=0) # Start at zero; a cap of 200 would hide most yearly totals
plt.title("Airplane Fatalities per Year")
plt.xlabel("Year")
plt.ylabel("Fatalities")
plt.tight_layout()


In [None]:
from collections import Counter
import pprint

aircraft_types = Counter(x[6] for x in airplane_data[1:])
print("Most common aircraft in crashes:")
pprint.pprint(aircraft_types.most_common(10))

# Column 10 is 'Fatalities' (column 11 is 'Ground')
fatal_aircraft_types = Counter(
    x[6] for x in airplane_data[1:]
    if x[10] and int(x[10]) > 0
)
print("\nMost common aircraft in fatal crashes:")
pprint.pprint(fatal_aircraft_types.most_common(10))
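
To see what `Counter` does with a filtered generator expression, here is a small sketch on made-up `(type, fatalities)` pairs. The names and values are hypothetical and do not come from the crash dataset.

```python
from collections import Counter

# Hypothetical (type, fatalities) rows; empty strings stand for missing cells.
rows = [
    ('Type A', '3'),
    ('Type B', ''),
    ('Type A', '0'),
    ('Type C', '5'),
    ('Type A', '1'),
]

# Count every row by type.
all_types = Counter(t for t, _ in rows)

# Count only rows whose fatalities cell is non-empty and greater than zero.
fatal_types = Counter(t for t, n in rows if n and int(n) > 0)

print(all_types.most_common(2))
print(fatal_types.most_common(2))
```

The `n and int(n) > 0` guard skips empty cells before converting, which avoids a `ValueError` from `int('')`.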

### JSON

In [None]:
import json
import requests

r = requests.get(
    'https://reddit.com/r/all.json',
    headers={'User-Agent': ''}
)
r.text # Raw response body as a string (not yet parsed)

In [None]:
reddit_json = json.loads(r.text) # r.json() is a shortcut for this
reddit_json

In [None]:
# Data exploration
print(reddit_json.keys())
print(reddit_json['data'].keys())
print(reddit_json['data']['children'][0].keys())
print(reddit_json['data']['children'][0]['data'].keys())
print(reddit_json['data']['children'][0]['data'])
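
The exploration above walks down a nested structure: a dict with a `data` key, a `children` list inside it, and another `data` dict per child. The sketch below reproduces that shape offline with a made-up payload. The titles, scores, `kind` fields, and the `after` value are assumptions for illustration, not real API output.

```python
import json

# Hypothetical payload that mimics the listing's nesting; the values are made up.
sample_text = '''
{
  "kind": "Listing",
  "data": {
    "after": "t3_example",
    "children": [
      {"kind": "t3", "data": {"title": "First post", "score": 12}},
      {"kind": "t3", "data": {"title": "Second post", "score": 7}}
    ]
  }
}
'''

# json.loads turns JSON objects into dicts and JSON arrays into lists.
listing = json.loads(sample_text)

top_level_keys = sorted(listing.keys())
first_post = listing['data']['children'][0]['data']
sample_scores = [child['data']['score'] for child in listing['data']['children']]

print(top_level_keys, first_post['title'], sample_scores)
```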

In [None]:
# Extracting scores from data
scores = [x['data']['score'] for x in reddit_json['data']['children']]
plt.hist(scores)

In [None]:
# Getting more data
print(reddit_json['data']['after']) # Remember seeing this?

# Spoofing User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

data_sets = [reddit_json]
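after = reddit_json['data']['after']
for _ in range(10):
    if after is None: # No more pages
        break
    # `after` requests the page following the given post; `limit` caps the page size
    r = requests.get('https://reddit.com/r/all/.json',
                     params={'limit': 100, 'after': after},
                     headers=headers)
    data_sets.append(r.json())
    after = data_sets[-1]['data']['after']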
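
The loop above is a form of cursor-based pagination: each response carries a token that identifies the next page. A minimal offline sketch of the pattern follows. `FAKE_PAGES`, `fetch_page`, and `fetch_all` are hypothetical stand-ins for the HTTP calls, so it runs without a network.

```python
# Hypothetical in-memory "API": each page names the cursor of the page after it.
FAKE_PAGES = {
    None: {'data': {'children': ['a', 'b'], 'after': 'page2'}},
    'page2': {'data': {'children': ['c', 'd'], 'after': 'page3'}},
    'page3': {'data': {'children': ['e'], 'after': None}},
}

def fetch_page(after=None):
    """Stand-in for requests.get(...).json(); returns one page of results."""
    return FAKE_PAGES[after]

def fetch_all(max_pages=10):
    """Follow the 'after' cursor until it runs out or max_pages is reached."""
    pages = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        pages.append(page)
        after = page['data']['after']
        if after is None:  # the last page has no successor
            break
    return pages

pages = fetch_all()
cursors = [page['data']['after'] for page in pages]
print(len(pages), cursors)
```

Capping the loop with `max_pages` keeps a misbehaving cursor from causing an endless stream of requests.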

In [None]:
# Nested list comprehensions
scores = [x['data']['score']
          for y in data_sets 
          for x in y['data']['children']]

print(scores)
plt.hist(scores)
plt.xlabel("Scores")
plt.ylabel("Counts")
plt.title("Scores of Posts on /r/all")
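
A nested list comprehension reads left to right in the same order as the equivalent nested `for` loops. The sketch below checks this on a made-up stand-in for `data_sets`, trimmed to the fields the comprehension touches.

```python
# Hypothetical stand-in for data_sets; the scores are made up.
sample_sets = [
    {'data': {'children': [{'data': {'score': 5}}, {'data': {'score': 8}}]}},
    {'data': {'children': [{'data': {'score': 3}}]}},
]

# Comprehension form: the outer loop comes first, the inner loop second.
flat = [x['data']['score']
        for y in sample_sets
        for x in y['data']['children']]

# Equivalent explicit loops.
loop = []
for y in sample_sets:
    for x in y['data']['children']:
        loop.append(x['data']['score'])

print(flat == loop, flat)
```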

#### Further Reading
Other cool JSON data sources / datasets: https://github.com/jdorfman/awesome-json-datasets

## Challenge
Parse all the JSON files in the `parse_them_yourself` directory.

What is this dataset? Can you graph something interesting in this data?

In [None]:
!ls ./parse_them_yourself
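
As a possible starting point, the sketch below loads every `*.json` file in a directory into a dict keyed by file name. It assumes the files end in `.json`, which may not hold for `parse_them_yourself`. `load_json_dir` is a hypothetical helper, and the demo runs on a throwaway directory with made-up files.

```python
import json
import tempfile
from pathlib import Path

def load_json_dir(directory):
    """Return {file name: parsed JSON} for every *.json file in directory."""
    parsed = {}
    for path in sorted(Path(directory).glob('*.json')):
        with open(path) as f:
            parsed[path.name] = json.load(f)
    return parsed

# Demonstrate on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, 'a.json').write_text(json.dumps({'value': 1}))
    Path(tmp_dir, 'b.json').write_text(json.dumps([1, 2, 3]))
    Path(tmp_dir, 'notes.txt').write_text('not JSON')
    demo = load_json_dir(tmp_dir)

print(demo)
```

Calling `load_json_dir('./parse_them_yourself')` would apply the same idea to the challenge data.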