In [1]:
from traitlets.config.manager import BaseJSONConfigManager
path = '/Users/jmk/anaconda2/envs/data601/etc/jupyter/nbconfig'
cm = BaseJSONConfigManager(config_dir=path)
cm.update('livereveal', {
              'theme': 'simple',  
              #'transition': 'zoom',
              'start_slideshow_at': 'selected',
})

# theme names: beige, blood, default, moon, night, serif, simple, sky, solarized

{'scroll': True, 'start_slideshow_at': 'selected', 'theme': 'simple'}

# About Data

TL;DR  The measurements are not "the thing" being measured.  They are a simplified model of it.

* Data are measurements of a phenomenon
* Formats are a representation of the data
* Systems store formatted data

These are very interconnected.

For example, 

* The solar system has planets
* Each planet has 0+ moons
* The moon-count for each planet is `[0, 0, 1, 2, 67, 62, 27, 14]` (Source:  https://www.windows2universe.org/our_solar_system/moons_table.html)
* Those can be written as above (a Python list) or a CSV file `0,0,1,2,67,62,27,14` or XML

```
...
<mooncount>
   <planet name="Mercury" moons="0" />
   ...
</mooncount>
...
```

... or any number of arbitrary formats.  
* Those can be stored inside of filesystems, databases, etc.

* The process of writing data in a format like this is often called "serialization".
* e.g. "I serialized the data to a JSON file in order to store it in the database"
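As a minimal sketch of what "serializing" means in practice, the moon counts from the list above can be written out in two formats using only the standard library. The variable names are illustrative, and the CSV is written to an in-memory buffer rather than a real file.

```python
import csv
import io
import json

# The moon counts from the example above.
moon_counts = [0, 0, 1, 2, 67, 62, 27, 14]

# Serialize to JSON: the list becomes a plain string.
as_json = json.dumps(moon_counts)

# Serialize to CSV: write one row into an in-memory text buffer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(moon_counts)
as_csv = buffer.getvalue()

print(as_json)  # [0, 0, 1, 2, 67, 62, 27, 14]
print(as_csv)   # 0,0,1,2,67,62,27,14

# Deserialize: parse the JSON string back into a Python list.
restored = json.loads(as_json)
```

Same data, two representations. Either string could then be stored in a file or a database.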

## Why bother?

### To make data portable!

# Common Formats
For this semester, you'll mostly _read_ from these formats using libraries that are built just for that purpose.  The goal is to get data into your program in a form that you can use.

There's a deep, deep rabbit hole to follow if you're interested in the intricacies of any of these formats.

# CSV
* It's a plain text representation of tabular data
* You can think of it as a single sheet (tab) from a spreadsheet
* It's like a "list of rows of cells"

## Basic rules
* One row per line (usually)
* Each value separated by a comma (`,`)
* Cells may be wrapped in double quotes (`"`) so that they can contain commas or newlines
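To see why the quoting rule matters, here is a small sketch with made-up rows (the names and notes are hypothetical, not from any dataset). One cell contains a comma and another contains a newline. Splitting on commas by hand breaks the row apart, while the standard `csv` module recovers the original cells.

```python
import csv
import io

# Hypothetical rows: one cell contains a comma, another contains a newline.
rows = [
    ["name", "note"],
    ["Smith, Jane", "line one\nline two"],
]

# Write the rows as CSV text; the writer adds quotes where they are needed.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()
print(text)

# A naive split on commas cuts the quoted cell into pieces...
naive = text.splitlines()[1].split(",")

# ...while csv.reader honors the quotes and recovers the original cells.
parsed = list(csv.reader(io.StringIO(text)))
```

This is the main reason to reach for the library instead of `str.split(',')`, even for files that look simple.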

# Example...
```
RM,LSTAT,PTRATIO,MDEV
6.575,4.98,15.3,504000.0
6.421,9.14,17.8,453600.0
7.185,4.03,17.8,728700.0
...
```
(source:  Boston Housing Prices Dataset copied here https://raw.githubusercontent.com/ggallo/boston-housing/master/housing.csv)

## Note that the first row is often labels for each column

# JSON
* JavaScript Object Notation
* It's a way to write out a data structure literal in JavaScript
* As JavaScript has risen to ubiquity for modern web UIs, JSON has become a simple format for serializing data to transmit from servers to browsers
* Turns out to be a simple and fairly universal way to represent and transmit most basic data structures between languages
* The format is almost identical to Python dictionary and list literals (one difference: JSON strings must use double quotes)
* Supports arbitrary nesting

# Example JSON
```
[
  {"RM": 6.575, "LSTAT": 4.98, "PTRATIO": 15.3, "MDEV": 504000.0},
  ...
]
```
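JSON looks a lot like Python, but the two are not interchangeable. The sketch below parses one housing record written as strict JSON, then shows two places where the formats differ: quote characters and the spelling of a few literals. The `sold` and `garage` keys are invented for illustration.

```python
import json

# One record from the housing data, written as valid JSON (double quotes only).
text = '[{"RM": 6.575, "LSTAT": 4.98, "PTRATIO": 15.3, "MDEV": 504000.0}]'
records = json.loads(text)
print(records[0]["RM"])

# Python's own printout of the same structure uses single quotes,
# which a JSON parser rejects.
try:
    json.loads(str(records))
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False

# JSON spells some literals differently: true/false/null, not True/False/None.
literals = json.dumps({"sold": True, "garage": None})
print(literals)
```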

# XML
* "Tag"-based markup language
* More structured than HTML (Common ancestor in SGML)
* Can be parsed as a one-shot document object model (DOM)
* Can also be parsed in a "streaming" fashion (via the Simple API for XML, aka SAX)

# Example XML
```
<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me <em>this</em> weekend!</body>
</note>
```

## Note the open _and_ close tags.  These allow XML to "mark up" other content.
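The two parsing styles can be sketched with the standard library. Here `xml.etree.ElementTree` stands in for the load-the-whole-document approach (it builds a tree in memory, much like a DOM, though its API is not the W3C DOM), and `xml.sax` shows the streaming approach. The `TagCollector` class is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET
import xml.sax

# The note from the example above (XML declaration omitted for brevity).
xml_text = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me <em>this</em> weekend!</body>
</note>"""

# Tree-style parsing: load the whole document, then navigate it.
root = ET.fromstring(xml_text)
recipient = root.find("to").text
# itertext() walks through the marked-up <em> tag and yields only the text.
body_text = "".join(root.find("body").itertext())

# Streaming (SAX) parsing: the parser calls our handler as each tag opens.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)

collector = TagCollector()
xml.sax.parseString(xml_text.encode("utf-8"), collector)

print(recipient)
print(body_text)
print(collector.tags)
```

The tree approach is usually more convenient. Streaming tends to matter when a document is too large to hold in memory at once.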

#  Field Trip!

Well, sort of:  https://en.m.wikipedia.org/wiki/Comparison_of_data_serialization_formats
        
        

# From disk to memory
* To load data from one of these formats...
* ... we first read them from disk
* ... then "parse" them to convert them from a complex string to a data structure!

## Normally there's a library for a format.  Use it.  (e.g. `import csv`)



In [4]:
# Parsing a CSV file in python...
import csv
import requests
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
    
response = requests.get('https://raw.githubusercontent.com/ggallo/boston-housing/master/housing.csv')
print(response)
print(response.content)

<Response [200]>
b'RM,LSTAT,PTRATIO,MDEV\n6.575,4.98,15.3,504000.0\n6.421,9.14,17.8,453600.0\n7.185,4.03,17.8,728700.0\n6.998,2.94,18.7,701400.0\n7.147,5.33,18.7,760200.0\n6.43,5.21,18.7,602700.0\n6.012,12.43,15.2,480900.0\n6.172,19.15,15.2,569100.0\n5.631,29.93,15.2,346500.0\n6.004,17.1,15.2,396900.0\n6.377,20.45,15.2,315000.0\n6.009,13.27,15.2,396900.0\n5.889,15.71,15.2,455700.0\n5.949,8.26,21.0,428400.0\n6.096,10.26,21.0,382200.0\n5.834,8.47,21.0,417900.0\n5.935,6.58,21.0,485100.0\n5.99,14.67,21.0,367500.0\n5.456,11.69,21.0,424200.0\n5.727,11.28,21.0,382200.0\n5.57,21.02,21.0,285600.0\n5.965,13.83,21.0,411600.0\n6.142,18.72,21.0,319200.0\n5.813,19.88,21.0,304500.0\n5.924,16.3,21.0,327600.0\n5.599,16.51,21.0,291900.0\n5.813,14.81,21.0,348600.0\n6.047,17.28,21.0,310800.0\n6.495,12.8,21.0,386400.0\n6.674,11.98,21.0,441000.0\n5.713,22.6,21.0,266700.0\n6.072,13.04,21.0,304500.0\n5.95,27.71,21.0,277200.0\n5.701,18.35,21.0,275100.0\n6.096,20.34,21.0,283500.0\n5.933,9.68,19.2,396900.0\n5.84

In [7]:
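# The naive way: split each line on commas (this breaks on quoted cells).
fake_file = StringIO(response.text)
sheet = [line.replace('\n', '').split(',') for line in fake_file]

# The robust way: let the csv module handle quoting and line endings.
fake_file = StringIO(response.text)
reader = csv.reader(fake_file)
csvsheet = list(reader)
csvsheet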

[['RM', 'LSTAT', 'PTRATIO', 'MDEV'],
 ['6.575', '4.98', '15.3', '504000.0'],
 ['6.421', '9.14', '17.8', '453600.0'],
 ['7.185', '4.03', '17.8', '728700.0'],
 ['6.998', '2.94', '18.7', '701400.0'],
 ['7.147', '5.33', '18.7', '760200.0'],
 ['6.43', '5.21', '18.7', '602700.0'],
 ['6.012', '12.43', '15.2', '480900.0'],
 ['6.172', '19.15', '15.2', '569100.0'],
 ['5.631', '29.93', '15.2', '346500.0'],
 ['6.004', '17.1', '15.2', '396900.0'],
 ['6.377', '20.45', '15.2', '315000.0'],
 ['6.009', '13.27', '15.2', '396900.0'],
 ['5.889', '15.71', '15.2', '455700.0'],
 ['5.949', '8.26', '21.0', '428400.0'],
 ['6.096', '10.26', '21.0', '382200.0'],
 ['5.834', '8.47', '21.0', '417900.0'],
 ['5.935', '6.58', '21.0', '485100.0'],
 ['5.99', '14.67', '21.0', '367500.0'],
 ['5.456', '11.69', '21.0', '424200.0'],
 ['5.727', '11.28', '21.0', '382200.0'],
 ['5.57', '21.02', '21.0', '285600.0'],
 ['5.965', '13.83', '21.0', '411600.0'],
 ['6.142', '18.72', '21.0', '319200.0'],
 ['5.813', '19.88', '21.0', '3045

In [9]:
#  Or as dictionaries...
import csv
from io import StringIO   #  This lets us use a string in place of a file.

rows = []
for row in csv.DictReader(StringIO(response.text)):
    rows.append(row)
    print(row)

OrderedDict([('RM', '6.575'), ('LSTAT', '4.98'), ('PTRATIO', '15.3'), ('MDEV', '504000.0')])
OrderedDict([('RM', '6.421'), ('LSTAT', '9.14'), ('PTRATIO', '17.8'), ('MDEV', '453600.0')])
OrderedDict([('RM', '7.185'), ('LSTAT', '4.03'), ('PTRATIO', '17.8'), ('MDEV', '728700.0')])
OrderedDict([('RM', '6.998'), ('LSTAT', '2.94'), ('PTRATIO', '18.7'), ('MDEV', '701400.0')])
OrderedDict([('RM', '7.147'), ('LSTAT', '5.33'), ('PTRATIO', '18.7'), ('MDEV', '760200.0')])
OrderedDict([('RM', '6.43'), ('LSTAT', '5.21'), ('PTRATIO', '18.7'), ('MDEV', '602700.0')])
OrderedDict([('RM', '6.012'), ('LSTAT', '12.43'), ('PTRATIO', '15.2'), ('MDEV', '480900.0')])
OrderedDict([('RM', '6.172'), ('LSTAT', '19.15'), ('PTRATIO', '15.2'), ('MDEV', '569100.0')])
OrderedDict([('RM', '5.631'), ('LSTAT', '29.93'), ('PTRATIO', '15.2'), ('MDEV', '346500.0')])
OrderedDict([('RM', '6.004'), ('LSTAT', '17.1'), ('PTRATIO', '15.2'), ('MDEV', '396900.0')])
OrderedDict([('RM', '6.377'), ('LSTAT', '20.45'), ('PTRATIO', '15.2'

In [10]:
rows[0]['RM']

'6.575'
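Note that the value comes back as the string `'6.575'`, not a number: CSV carries no type information, so the `csv` module hands back text. A minimal sketch of one way to convert, assuming every column in this file is numeric (true for the rows shown above, but an assumption in general), with the first two data rows inlined so it runs without a network connection:

```python
import csv
import io

# First rows of the housing CSV shown earlier, inlined so this runs offline.
text = (
    "RM,LSTAT,PTRATIO,MDEV\n"
    "6.575,4.98,15.3,504000.0\n"
    "6.421,9.14,17.8,453600.0\n"
)

# Convert every cell to float as each row is read.
typed_rows = [
    {column: float(value) for column, value in row.items()}
    for row in csv.DictReader(io.StringIO(text))
]

# Arithmetic now works; with the raw strings this line would raise a TypeError.
print(typed_rows[0]["RM"] + 1)
```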

# Reading JSON
* `import json`
* `json.load()` or `json.loads()`

#  Writing JSON
* `json.dump()` or `json.dumps()`


In [11]:
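records = [row for row in csv.DictReader(StringIO(response.text))]

import json

# dump/load work with file objects; "with" closes each file when we're done.
with open('./boston.json', 'w') as f:
    json.dump(records, f)
with open('./boston.json', 'r') as f:
    from_file = json.load(f)

# dumps/loads work with strings.
serialized = json.dumps(records)
serialized

deserialized = json.loads(serialized)
deserialized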

[{'LSTAT': '4.98', 'MDEV': '504000.0', 'PTRATIO': '15.3', 'RM': '6.575'},
 {'LSTAT': '9.14', 'MDEV': '453600.0', 'PTRATIO': '17.8', 'RM': '6.421'},
 {'LSTAT': '4.03', 'MDEV': '728700.0', 'PTRATIO': '17.8', 'RM': '7.185'},
 {'LSTAT': '2.94', 'MDEV': '701400.0', 'PTRATIO': '18.7', 'RM': '6.998'},
 {'LSTAT': '5.33', 'MDEV': '760200.0', 'PTRATIO': '18.7', 'RM': '7.147'},
 {'LSTAT': '5.21', 'MDEV': '602700.0', 'PTRATIO': '18.7', 'RM': '6.43'},
 {'LSTAT': '12.43', 'MDEV': '480900.0', 'PTRATIO': '15.2', 'RM': '6.012'},
 {'LSTAT': '19.15', 'MDEV': '569100.0', 'PTRATIO': '15.2', 'RM': '6.172'},
 {'LSTAT': '29.93', 'MDEV': '346500.0', 'PTRATIO': '15.2', 'RM': '5.631'},
 {'LSTAT': '17.1', 'MDEV': '396900.0', 'PTRATIO': '15.2', 'RM': '6.004'},
 {'LSTAT': '20.45', 'MDEV': '315000.0', 'PTRATIO': '15.2', 'RM': '6.377'},
 {'LSTAT': '13.27', 'MDEV': '396900.0', 'PTRATIO': '15.2', 'RM': '6.009'},
 {'LSTAT': '15.71', 'MDEV': '455700.0', 'PTRATIO': '15.2', 'RM': '5.889'},
 {'LSTAT': '8.26', 'MDEV': '42840

In [12]:
deserialized = json.loads(serialized)
deserialized[0]

{'LSTAT': '4.98', 'MDEV': '504000.0', 'PTRATIO': '15.3', 'RM': '6.575'}

#  Sources of Data

Where do data come from?

* Filesystem
* Network server

Also (in follow-on lectures):
* Relational Databases (e.g. Oracle, MySQL, PostgreSQL)
* Non-relational (NoSQL) databases

# The Filesystem

* It's a place (system?) to keep files
* It normally forms a tree with a single root directory at the top
* It's got _directories_ and _files_ in it.
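The tree shape is easy to see with `os.walk`, which visits a directory and everything beneath it. This sketch builds a small throwaway tree in a temporary directory (the file and folder names are made up) and then lists every file it finds, relative to the top.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Build a tiny tree: top/README.md and top/data/raw/housing.csv
    os.makedirs(os.path.join(top, "data", "raw"))
    for relative in ["README.md", os.path.join("data", "raw", "housing.csv")]:
        with open(os.path.join(top, relative), "w") as f:
            f.write("placeholder\n")

    # Walk the tree from the top, collecting each file's path relative to it.
    found = []
    for directory, subdirectories, filenames in os.walk(top):
        for filename in filenames:
            full_path = os.path.join(directory, filename)
            found.append(os.path.relpath(full_path, top))

found.sort()
print(found)
```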

In [13]:
# Filesystem

#  We can read the whole thing at once...
f = open('./README.md', 'r')
contents = f.read()
f.close()
print('%d characters' % len(contents))

#  We can read it as a list of lines...
f = open('./README.md', 'r')
lines = f.readlines()
f.close()
print('%d lines' % len(lines))

#  Or use "with", which closes the file for us, and loop over the lines...
with open('./README.md', 'r') as f:
    line_counts = [len(line) for line in f]
line_counts

315 characters
9 lines


[1, 76, 71, 1, 107, 1, 56, 1, 1]

In [14]:
print(lines)

['\n', 'The following command will start the datascience notebook docker image and \n', 'mount `repos/data601` from your home directory into the `work` folder.\n', '\n', '    docker run -it --rm -v $HOME/repos/data601:/home/jovyan/work -p 8888:8888 jupyter/datascience-notebook\n', '\n', 'From here you can run R, Julia, and python 3 notebooks.\n', '\n', '\n']


# Network servers

* Many moons ago there were arguments about how computers should connect to each other over the internet
* The agreed-upon rules that people designed are called _protocols_
* Long ago TCP/IP won _mindshare_ due to simplicity (comparatively), effectiveness, and openness


* Back then there was no "web", just an internet
* In the years since then, HTTP (which runs on top of TCP/IP) has also taken over an enormous amount of market share

## You can get remarkably far just by knowing how to fetch data over the web

In [16]:
import requests
response = requests.get('https://raw.githubusercontent.com/ggallo/boston-housing/master/housing.csv')
response


response2 = requests.get('https://en.wikipedia.org')
response2.ok
response2.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":807996266,"wgRevisionId":807996266,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonth

HTTP responses include a "status code".  This is an integer that tells you if it worked (e.g. `200 OK`) or failed (e.g. `404 Not Found`).
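The standard library knows the registered status codes, so a number can be turned into its name without making a request. The helper below, `describe_status`, is a hypothetical function written for this example. With `requests`, the number itself is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code, its standard phrase, and a rough classification."""
    status = HTTPStatus(code)  # raises ValueError for unregistered codes
    if code < 400:
        outcome = "not an error"
    elif code < 500:
        outcome = "client error"
    else:
        outcome = "server error"
    return "%d %s: %s" % (code, status.phrase, outcome)

for code in (200, 404, 500):
    print(describe_status(code))
```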

Our request for the housing CSV succeeded.  Let's see what it contains...

In [17]:
response.text

'RM,LSTAT,PTRATIO,MDEV\n6.575,4.98,15.3,504000.0\n6.421,9.14,17.8,453600.0\n7.185,4.03,17.8,728700.0\n6.998,2.94,18.7,701400.0\n7.147,5.33,18.7,760200.0\n6.43,5.21,18.7,602700.0\n6.012,12.43,15.2,480900.0\n6.172,19.15,15.2,569100.0\n5.631,29.93,15.2,346500.0\n6.004,17.1,15.2,396900.0\n6.377,20.45,15.2,315000.0\n6.009,13.27,15.2,396900.0\n5.889,15.71,15.2,455700.0\n5.949,8.26,21.0,428400.0\n6.096,10.26,21.0,382200.0\n5.834,8.47,21.0,417900.0\n5.935,6.58,21.0,485100.0\n5.99,14.67,21.0,367500.0\n5.456,11.69,21.0,424200.0\n5.727,11.28,21.0,382200.0\n5.57,21.02,21.0,285600.0\n5.965,13.83,21.0,411600.0\n6.142,18.72,21.0,319200.0\n5.813,19.88,21.0,304500.0\n5.924,16.3,21.0,327600.0\n5.599,16.51,21.0,291900.0\n5.813,14.81,21.0,348600.0\n6.047,17.28,21.0,310800.0\n6.495,12.8,21.0,386400.0\n6.674,11.98,21.0,441000.0\n5.713,22.6,21.0,266700.0\n6.072,13.04,21.0,304500.0\n5.95,27.71,21.0,277200.0\n5.701,18.35,21.0,275100.0\n6.096,20.34,21.0,283500.0\n5.933,9.68,19.2,396900.0\n5.841,11.41,19.2,42000

# Before we leave data formats, let's take a quick peek where we're headed next...

In [21]:
import pandas as pd

housing_data = pd.read_csv('https://raw.githubusercontent.com/ggallo/boston-housing/master/housing.csv')
housing_data



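|   | RM | LSTAT | PTRATIO | MDEV |
|---|-------|-------|---------|----------|
| 0 | 6.575 | 4.98 | 15.3 | 504000.0 |
| 1 | 6.421 | 9.14 | 17.8 | 453600.0 |
| 2 | 7.185 | 4.03 | 17.8 | 728700.0 |
| 3 | 6.998 | 2.94 | 18.7 | 701400.0 |
| 4 | 7.147 | 5.33 | 18.7 | 760200.0 |
| 5 | 6.430 | 5.21 | 18.7 | 602700.0 |
| 6 | 6.012 | 12.43 | 15.2 | 480900.0 |
| 7 | 6.172 | 19.15 | 15.2 | 569100.0 |
| 8 | 5.631 | 29.93 | 15.2 | 346500.0 |
| 9 | 6.004 | 17.10 | 15.2 | 396900.0 |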


## Looks handy, right?  Welcome to pandas.

But first, let's look under the hood of pandas at NumPy, the heart of data science in Python.