# Example Notebook

Jupyter notebooks are a great tool for data analysis and prototyping. Each engineer will have their own folder under `notebooks/{username}/` to store Jupyter notebooks.

This notebook shows some examples.

Every notebook will start with this code cell. It moves up two directories from where the notebook is, so that we can execute code from the root directory, `transithealth/`. This allows us to import code from across the project as well as read and write files using paths relative to the root directory.

In [1]:
import os
os.chdir("../../")
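
Because `os.chdir("../../")` is relative to the current working directory, re-running the first cell moves up two more directories each time. A minimal sketch of a guard against that is shown below. The helper name `chdir_to_root` and its parameters are hypothetical, not part of the project; it assumes the root directory is named `transithealth`, as stated above.

```python
import os

def chdir_to_root(root_name="transithealth", levels=2):
    """Move up `levels` directories unless we are already in the project root.

    Hypothetical helper: `root_name` is assumed to be the name of the
    project's root directory.
    """
    if os.path.basename(os.getcwd()) != root_name:
        # Build "../.." for levels=2 and change into it.
        os.chdir(os.path.join(*[".."] * levels))
    return os.getcwd()
```

Calling `chdir_to_root()` in the first cell would then have the same effect as the original cell on the first run and do nothing on later runs.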

## Importing Code and Accessing Files

Now we can import code from the backend API and run it!

In [2]:
from api.metrics.rideshare import RideshareMetrics
from api.utils.testing import create_test_db

We can also refer to scripts from the offline pipeline using their relative paths from the root.

In [3]:
con, cur = create_test_db(
    scripts=[
        "./pipeline/load/rideshare.sql"
    ],
    tables={
        "rideshare": [
            { "n_trips": 7 },
            { "n_trips": 14 },
            { "n_trips": 3 }
        ]
    }
)

metric = RideshareMetrics(con)
actual = metric.get_max_trips()

expected = { "max_trips": 14 }

assert actual == expected

print(f"Actual:   {actual}")
print(f"Expected: {expected}")

None
Actual:   {'max_trips': 14}
Expected: {'max_trips': 14}


## Writing SQL

There are two ways to write SQL in a Jupyter notebook:

1. With the `sqlite3` module built into Python.
2. With the SQL extension for Jupyter notebooks.

Here is an example using Python. The rows returned by `fetchall()` are plain tuples without column names, so we have a method in `api.utils.database` to help. We can also use Pandas to load the result into a DataFrame.

In [4]:
import sqlite3

con = sqlite3.connect("./pipeline/database.db")
cur = con.cursor()

rows = cur.execute("""
SELECT *
FROM traffic_intensity
WHERE period IN (2016, 2017, 2018, 2019, 2020)

""").fetchall()

rows

[(35, 'all', 1409.5759192199896, None, 2020),
 (36, 'all', 2187.42000708787, None, 2020),
 (37, 'all', 12519.203851864098, None, 2020),
 (38, 'all', 619.207858606121, None, 2020),
 (39, 'all', 1493.64931934343, None, 2020),
 (4, 'all', 381.292721818742, None, 2020),
 (40, 'all', 811.282765402462, None, 2020),
 (41, 'all', 888.4804790248842, None, 2020),
 (42, 'all', 411.452469386972, None, 2020),
 (1, 'all', 444.675036896848, None, 2020),
 (11, 'all', 1905.7654520675696, None, 2020),
 (12, 'all', 1927.4555074795496, None, 2020),
 (13, 'all', 473.696753060702, None, 2020),
 (14, 'all', 476.536509929277, None, 2020),
 (15, 'all', 794.261271652258, None, 2020),
 (16, 'all', 4838.59722182382, None, 2020),
 (17, 'all', 554.9821530572441, None, 2020),
 (18, 'all', 252.313765695363, None, 2020),
 (19, 'all', 273.395860080378, None, 2020),
 (2, 'all', 611.37515081414, None, 2020),
 (20, 'all', 159.97931862329202, None, 2020),
 (21, 'all', 6318.37093815389, None, 2020),
 (22, 'all', 4466.252519

In [6]:
import pandas as pd
from api.utils.database import rows_to_dicts

pd.DataFrame(rows_to_dicts(cur, rows))

Unnamed: 0,area_number,segment,value,std_error,period
0,35,all,1409.575919,,2020
1,36,all,2187.420007,,2020
2,37,all,12519.203852,,2020
3,38,all,619.207859,,2020
4,39,all,1493.649319,,2020
...,...,...,...,...,...
380,74,all,17.484098,,2016
381,75,all,721.120725,,2016
382,76,all,1168.534244,,2016
383,77,all,976.951149,,2016
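
For readers curious how a helper like `rows_to_dicts` might recover the column names, here is a minimal sketch. It is an assumption about the approach, not the project's actual implementation: it reads the names from `cursor.description`, which the `sqlite3` module fills in after a query. The function name `rows_to_dicts_sketch` and the in-memory table are illustrative only.

```python
import sqlite3

def rows_to_dicts_sketch(cur, rows):
    """Pair each row tuple with the column names of the cursor's last query.

    Illustrative only; the real `api.utils.database.rows_to_dicts`
    may work differently.
    """
    # Each entry of cur.description is a 7-tuple whose first item is the column name.
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in rows]

# Small in-memory database so the example does not depend on pipeline/database.db.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE rideshare (week TEXT, n_trips INTEGER)")
cur.execute("INSERT INTO rideshare VALUES ('2018-11-04', 7)")
rows = cur.execute("SELECT week, n_trips FROM rideshare").fetchall()

print(rows_to_dicts_sketch(cur, rows))
# [{'week': '2018-11-04', 'n_trips': 7}]
```

A list of dictionaries in this shape can be passed straight to `pd.DataFrame`, which uses the keys as column headers.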


In [6]:
import requests

r = requests.post("http://18.191.174.84:5000/community/metrics", json={"metrics": ["traffic_intensity_2016"]})
pd.DataFrame(r.json())

Unnamed: 0,metrics
0,"{'area_number': 1, 'name': 'Rogers Park', 'par..."
1,"{'area_number': 2, 'name': 'West Ridge', 'part..."
2,"{'area_number': 3, 'name': 'Uptown', 'part': '..."
3,"{'area_number': 4, 'name': 'Lincoln Square', '..."
4,"{'area_number': 5, 'name': 'North Center', 'pa..."
...,...
72,"{'area_number': 73, 'name': 'Washington Height..."
73,"{'area_number': 74, 'name': 'Mount Greenwood',..."
74,"{'area_number': 75, 'name': 'Morgan Park', 'pa..."
75,"{'area_number': 76, 'name': 'O'Hare', 'part': ..."


Jupyter also supports notebook extensions. This extension allows us to declare a cell as a SQL cell with `%%sql` on the first line of the cell, write queries directly in the cell body, and then view the result as a table.

In [5]:
%%capture
%load_ext sql
%sql sqlite:///pipeline/database.db

In [6]:
%%sql
SELECT *
FROM rideshare
LIMIT 5

 * sqlite:///pipeline/database.db
Done.


week,pickup_community_area,dropoff_community_area,n_trips,n_trips_pooled_authorized,n_trips_pooled,avg_cost_no_tip_cents
2018-11-04,1,1,2223,701,260,627
2018-11-04,52,28,20,14,8,2567
2018-11-04,52,24,3,2,1,3228
2018-11-04,52,22,1,0,0,6800
2018-11-04,52,20,1,1,1,2260


## Querying a Socrata Data Portal

The Socrata Query Language (SoQL) is a SQL-like query language that we can use to access datasets from the City of Chicago data portal, as well as other data portals hosted on Socrata.

Below is an example code cell, which uses the `requests` module to send a query and get the response, and Pandas to display the result.

It can take a while to get a response, because you are sending a request to a remote server that will run your SoQL query against the entire dataset. That is why we write scripts in our offline pipeline to aggregate and download data before applying transformations locally.

In [7]:
import pandas as pd
import requests

dataset_json_url = "https://data.cityofchicago.org/resource/m6dm-c72p.json"
query = """
SELECT
    pickup_community_area,
    trip_seconds / 60 as trip_minutes
LIMIT 5
"""
r = requests.get(dataset_json_url, params={"$query": query})
pd.DataFrame(r.json())

Unnamed: 0,pickup_community_area,trip_minutes
0,72,4.65
1,7,3.4833333333333334
2,8,10.8
3,28,15.316666666666666
4,8,9.9
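
The advice above about aggregating on the remote server before downloading can be sketched as follows. This is a hedged illustration, not one of the pipeline scripts: the helper `build_aggregate_url` is hypothetical, and it assumes the portal accepts `count(*)`, `avg(...)`, and `GROUP BY` in a `$query` parameter. The block only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

dataset_json_url = "https://data.cityofchicago.org/resource/m6dm-c72p.json"

def build_aggregate_url(base_url, group_col, value_col):
    """Build a URL whose SoQL query returns one row per value of `group_col`.

    Hypothetical helper: the aggregate functions are assumed to be
    supported by the data portal.
    """
    query = (
        f"SELECT {group_col}, count(*) AS n_trips, avg({value_col}) AS avg_value "
        f"GROUP BY {group_col}"
    )
    # urlencode escapes the "$", spaces, and "*" so the query is safe in a URL.
    return f"{base_url}?{urlencode({'$query': query})}"

url = build_aggregate_url(dataset_json_url, "pickup_community_area", "trip_seconds")
print(url)
```

Passing the same query to `requests.get(dataset_json_url, params={"$query": query})`, as in the cell above, would return one summary row per community area instead of one row per trip.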
