# Example Notebook

Jupyter notebooks are a great tool for data analysis and prototyping. Each engineer will have their own folder under `notebooks/{username}/` to store Jupyter notebooks.

This notebook shows some examples.

Every notebook will start with this code cell. It moves up two directories from the notebook's location so that code executes from the root directory, `transithealth/`. This allows us to import code from across the project and to read and write files using paths relative to the root directory.

In [1]:
import os
os.chdir("../../")
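Because `os.chdir("../../")` is relative, running that cell a second time in the same kernel moves up two more directories. One way to make the step safe to re-run is to search upward for the root by name. The following is a minimal sketch, assuming the root directory is named `transithealth`; the helper `find_project_root` is hypothetical and not part of the project.

```python
import os
from pathlib import Path


def find_project_root(start, root_name="transithealth"):
    """Walk upward from `start` until a directory named `root_name` is found.

    Hypothetical helper for illustration. Returns None when neither `start`
    nor any of its ancestors has that name.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if candidate.name == root_name:
            return candidate
    return None


# Safe to re-run: once the working directory is the root, it stays there.
root = find_project_root(os.getcwd())
if root is not None:
    os.chdir(root)
```

If the notebook is started outside the project tree, the helper returns `None` and the working directory is left unchanged.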

## Importing Code and Accessing Files

Now we can import code from the backend API and run it!

In [2]:
from api.metrics.rideshare import RideshareMetrics
from api.metrics.rent_burdened import RentBurdenedMetrics
from api.utils.testing import create_test_db

We can also refer to scripts from the offline pipeline using their relative paths from the root.

In [3]:
con, cur = create_test_db(
    scripts=[
        "./pipeline/load/rideshare.sql"
    ],
    tables={
        "rideshare": [
            { "n_trips": 7 },
            { "n_trips": 14 },
            { "n_trips": 3 }
        ]
    }
)

metric = RideshareMetrics(con)
actual = metric.get_max_trips()

expected = { "max_trips": 14 }

assert actual == expected

print(f"Actual:   {actual}")
print(f"Expected: {expected}")

None
Actual:   {'max_trips': 14}
Expected: {'max_trips': 14}
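The implementation of `create_test_db` is not shown in this notebook. As a rough illustration of the idea (a throwaway database created from a schema and filled with a few rows), here is a minimal sketch using an in-memory SQLite database. The function `make_in_memory_db` and the one-column `rideshare` schema are assumptions made for this example only.

```python
import sqlite3


def make_in_memory_db(schema_sql, tables):
    """Create an in-memory SQLite database and fill it with rows.

    Hypothetical stand-in for illustration; the project's `create_test_db`
    may work differently. `tables` maps a table name to a list of dicts.
    """
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.executescript(schema_sql)
    for table, rows in tables.items():
        for row in rows:
            columns = ", ".join(row)
            placeholders = ", ".join(f":{name}" for name in row)
            # Table and column names come from trusted test code; values are bound.
            cur.execute(
                f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", row
            )
    con.commit()
    return con, cur


# Assumed, simplified schema: the real rideshare table has more columns.
test_con, test_cur = make_in_memory_db(
    "CREATE TABLE rideshare (n_trips INTEGER);",
    {"rideshare": [{"n_trips": 7}, {"n_trips": 14}, {"n_trips": 3}]},
)
max_trips = test_cur.execute("SELECT MAX(n_trips) FROM rideshare").fetchone()[0]
print(max_trips)  # 14
```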


## Writing SQL

There are two ways to write SQL in a Jupyter notebook:

1. With the `sqlite3` module built into Python.
2. With the SQL extension for Jupyter notebooks.

Here is an example using Python. By default, the fetched rows are plain tuples that do not include the column names, so we have a helper, `rows_to_dicts` in `api.utils.database`, that attaches them. We can also use Pandas to load the result into a DataFrame.

In [15]:
import sqlite3

con = sqlite3.connect("./pipeline/database.db")
cur = con.cursor()

rows = cur.execute("""
SELECT *
FROM rideshare
LIMIT 30
""").fetchall()

rows

[('2018-11-04', 1, 1, 2223, 701, 260, 627),
 ('2018-11-04', 52, 28, 20, 14, 8, 2567),
 ('2018-11-04', 52, 24, 3, 2, 1, 3228),
 ('2018-11-04', 52, 22, 1, 0, 0, 6800),
 ('2018-11-04', 52, 20, 1, 1, 1, 2260),
 ('2018-11-04', 52, 19, 2, 0, 0, 4385),
 ('2018-11-04', 52, 13, 1, 0, 0, 4500),
 ('2018-11-04', 52, 8, 28, 9, 6, 3191),
 ('2018-11-04', 52, 7, 5, 2, 1, 3521),
 ('2018-11-04', 52, 6, 7, 2, 1, 3345),
 ('2018-11-04', 52, 5, 1, 0, 0, 3681),
 ('2018-11-04', 52, 2, 1, 1, 0, 4021),
 ('2018-11-04', 51, 76, 6, 2, 2, 5157),
 ('2018-11-04', 51, 75, 14, 3, 3, 1536),
 ('2018-11-04', 51, 73, 29, 13, 5, 1237),
 ('2018-11-04', 51, 72, 2, 2, 0, 1401),
 ('2018-11-04', 52, 31, 4, 1, 0, 2765),
 ('2018-11-04', 52, 32, 18, 10, 9, 2408),
 ('2018-11-04', 52, 33, 11, 3, 3, 2441),
 ('2018-11-04', 52, 35, 1, 0, 0, 1750),
 ('2018-11-04', 52, 54, 2, 0, 0, 1752),
 ('2018-11-04', 52, 53, 4, 0, 0, 1465),
 ('2018-11-04', 52, 52, 59, 14, 1, 719),
 ('2018-11-04', 52, 51, 27, 12, 0, 857),
 ('2018-11-04', 52, 50, 2, 1, 

In [16]:
import pandas as pd
from api.utils.database import rows_to_dicts

records = pd.DataFrame(rows_to_dicts(cur, rows))

In [17]:
records

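| | week | pickup_community_area | dropoff_community_area | n_trips | n_trips_pooled_authorized | n_trips_pooled | avg_cost_no_tip_cents |
|---|---|---|---|---|---|---|---|
| 0 | 2018-11-04 | 1 | 1 | 2223 | 701 | 260 | 627 |
| 1 | 2018-11-04 | 52 | 28 | 20 | 14 | 8 | 2567 |
| 2 | 2018-11-04 | 52 | 24 | 3 | 2 | 1 | 3228 |
| 3 | 2018-11-04 | 52 | 22 | 1 | 0 | 0 | 6800 |
| 4 | 2018-11-04 | 52 | 20 | 1 | 1 | 1 | 2260 |
| 5 | 2018-11-04 | 52 | 19 | 2 | 0 | 0 | 4385 |
| 6 | 2018-11-04 | 52 | 13 | 1 | 0 | 0 | 4500 |
| 7 | 2018-11-04 | 52 | 8 | 28 | 9 | 6 | 3191 |
| 8 | 2018-11-04 | 52 | 7 | 5 | 2 | 1 | 3521 |
| 9 | 2018-11-04 | 52 | 6 | 7 | 2 | 1 | 3345 |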


In [11]:
records.columns

Index(['week', 'pickup_community_area', 'dropoff_community_area', 'n_trips',
       'n_trips_pooled_authorized', 'n_trips_pooled', 'avg_cost_no_tip_cents'],
      dtype='object')
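The column names come from the cursor rather than from the rows: after a `SELECT`, `sqlite3` exposes them through `cur.description`. The sketch below shows one way a helper like `rows_to_dicts` could pair names with values. It is illustrative only; the real `api.utils.database.rows_to_dicts` is not shown in this notebook and may differ, and the two-column table here is made up for the example.

```python
import sqlite3


def rows_to_dicts_sketch(cur, rows):
    """Pair each row tuple with the column names from `cur.description`.

    Illustrative only; not the project's implementation.
    """
    # Each entry of cur.description is a 7-tuple whose first item is the name.
    columns = [column[0] for column in cur.description]
    return [dict(zip(columns, row)) for row in rows]


sketch_con = sqlite3.connect(":memory:")
sketch_cur = sketch_con.cursor()
sketch_cur.execute("CREATE TABLE rideshare (week TEXT, n_trips INTEGER)")
sketch_cur.execute("INSERT INTO rideshare VALUES ('2018-11-04', 2223)")

sketch_rows = sketch_cur.execute("SELECT * FROM rideshare").fetchall()
sketch_records = rows_to_dicts_sketch(sketch_cur, sketch_rows)
print(sketch_records)  # [{'week': '2018-11-04', 'n_trips': 2223}]
```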

Jupyter also supports notebook extensions. The SQL extension lets us declare a cell as a SQL cell by putting `%%sql` on its first line, write queries directly in the cell body, and view the result as a table.

In [None]:
%%capture
%load_ext sql
%sql sqlite:///pipeline/database.db

In [None]:
%%sql
SELECT *
FROM rideshare
LIMIT 5

## Querying a Socrata Data Portal

The Socrata Query Language (SoQL) is a SQL-like query language that we can use to access datasets from the City of Chicago data portal, as well as other data portals hosted on Socrata.

Below is an example code cell. It uses the `requests` module to send a query and receive the response, and Pandas to display the result.

It can take a while to get a response, because you are sending a request to a remote server that will run your SoQL query against the entire dataset. That is why we write scripts in our offline pipeline to aggregate and download data before applying transformations locally.

In [None]:
import pandas as pd
import requests

dataset_json_url = "https://data.cityofchicago.org/resource/m6dm-c72p.json"
query = """
SELECT
    pickup_community_area,
    trip_seconds / 60 as trip_minutes
LIMIT 5
"""
r = requests.get(dataset_json_url, params={"$query": query})
pd.DataFrame(r.json())
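Since the remote server does the work, aggregating in the query keeps the response small. The sketch below builds the request URL for a grouped count without sending it. The helper `build_soql_url` and the `n_trips` alias are made up for this example, and the aggregate query is an assumption about what this dataset supports rather than a tested call.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

dataset_json_url = "https://data.cityofchicago.org/resource/m6dm-c72p.json"

# Aggregate on the server so that only one row per pickup area is returned.
aggregate_query = """
SELECT
    pickup_community_area,
    count(*) AS n_trips
GROUP BY pickup_community_area
"""


def build_soql_url(base_url, soql):
    """Return the request URL with the SoQL text in the `$query` parameter."""
    return f"{base_url}?{urlencode({'$query': soql.strip()})}"


soql_url = build_soql_url(dataset_json_url, aggregate_query)
print(soql_url)
```

Passing `params={"$query": aggregate_query}` to `requests.get`, as in the cell above, should produce an equivalent request; building the URL by hand is mainly useful for inspecting or logging exactly what will be sent.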

My own stuff below

In [None]:
import sqlite3

con = sqlite3.connect("./pipeline/database.db")
cur = con.cursor()

rows = cur.execute("""
SELECT *
FROM rent_burdened_households
LIMIT 5
""").fetchall()

rows

In [None]:
%%capture
%load_ext sql
%sql sqlite:///pipeline/database.db

In [None]:
%%sql
SELECT *
FROM rent_burdened_households
LIMIT 5

In [None]:
from api.utils.database import rows_to_dicts


class CommunityMetrics:
    """
    Metrics for community area data.
    """

    def __init__(self, con):
        self.con = con

    def income(self, year, segment):
        """
        Returns the income value, cast to an integer, for each community area.

        Args:
            year (int): period ending year to filter by
            segment (str): population segment to filter by
        """
        query = """
        SELECT
            area_number,
            CAST(value AS INTEGER) AS value
        FROM income
        WHERE period_end_year = :year
        AND segment = :segment
        """
        cur = self.con.cursor()
        # Bind the values as named parameters instead of formatting them
        # into the SQL string, which avoids quoting problems and injection.
        cur.execute(query, {"year": year, "segment": segment})
        rows = rows_to_dicts(cur, cur.fetchall())
        return rows
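One detail of the query worth checking: in SQLite, `CAST(... AS INTEGER)` truncates a real number toward zero rather than rounding it. If `value` is stored as a real number and rounding to the nearest integer is wanted, applying `ROUND` before the cast is one option. A minimal sketch with a made-up value:

```python
import sqlite3

cast_con = sqlite3.connect(":memory:")
cast_cur = cast_con.cursor()

# Compare a plain cast with a cast applied after ROUND on the same value.
truncated, rounded = cast_cur.execute(
    "SELECT CAST(:v AS INTEGER), CAST(ROUND(:v) AS INTEGER)",
    {"v": 13000.9},
).fetchone()
print(truncated, rounded)  # 13000 13001
```

When the column already holds integers, both expressions give the same result.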

In [None]:
import sys
sys.path.append("../")

from api.metrics.community import CommunityMetrics
from api.utils.testing import create_test_db

...

def test_income():
    income_table = [
        {
            "area_number": 1,
            "period_end_year": 2019,
            "segment": "all",
            "value": 13000
        },
        {
            "area_number": 2,
            "period_end_year": 2019,
            "segment": "all",
            "value": 27000
        },
        {
            "area_number": 1,
            "period_end_year": 2010,
            "segment": "all",
            "value": 10000
        }
    ]
    con, cur = create_test_db(
        scripts=[
            "./pipeline/load/income.sql"
        ],
        tables={
            "income": income_table
        }
    )

    metric = CommunityMetrics(con)

    assert metric.income(year=2019, segment="all") == [
        { "area_number": 1, "value": 13000 },
        { "area_number": 2, "value": 27000 }
    ], "Should have two results for 2019."

    assert metric.income(year=2010, segment="all") == [
        { "area_number": 1, "value": 10000 }
    ], "Should have one result for 2010."

    assert metric.income(year=2013, segment="all") == [], "Should have no results for 2013."

In [None]:
test_income()