Urban Data Science & Smart Cities <br>
URSP688Y Spring 2025<br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

# Demo 5 - Accessing and Wrangling Data from the Web

- GitHub Branches
- Loading Data from the Web with APIs
- Debugging

## GitHub Branches

Branches allow you to organize work in a contained space. For our purposes, their most important feature is allowing you to make a pull request with only the specific changes you want to submit and *not* all the other things you may be experimenting with in your repository.

[Here's a more extended overview](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-branches) of what branches are and how they work.

### Key Concepts
- Every repository or fork has a default branch ('main').
- You can make as many additional branches as you'd like.
- Commits are always made to a branch (even if it's 'main').
- When making a new branch in order to make a pull request to an upstream repository, I recommend using the 'main' branch of that upstream repository as the source for your new branch.

### Tips for Using Branches for Exercises
- Make a new branch for each exercise.
- Make a copy of the exercise notebook, then rename the copy with your name as an underscored suffix (e.g., `exercise02_chester.ipynb`).
- Don't make changes unrelated to your exercise on the branch you set up for that exercise.
- If you accidentally make other changes, the easiest way to clean things up is to make a new branch, then copy only the files you want to submit into that branch. You can temporarily copy them to the desktop as you move them between branches.
- Sync your branch before making a pull request.

### Detailed Steps for Doing Your Exercise on a Branch
1. Go to your fork of the course repository on the GitHub website (e.g., https://github.com/[your username]/ursp688y_sp2025).
2. In the upper-left, click where it says "1 Branch" or "[n] Branches" to open the branches page.
3. Click the green "New branch" button in the upper left.
4. Write a name for your new branch (e.g., "exercise-2").
5. Choose "ncsg/ursp688y_sp2025" as the source and "main" as the source branch. This will ensure your branch starts out in sync with the course repository, reducing the likelihood of a conflict when you make a pull request.
6. Open GitHub Desktop and navigate to the clone of your fork.
7. Fetch from the origin, which will sync your new branch.
8. Select your new branch from the "Current Branch" dropdown at the center-top of the GitHub Desktop window.
9. If you have uncommitted changes on your current branch (e.g., you may have been working on your exercise 2 notebook but hadn't yet committed the changes) a dialog will pop up asking if you want to keep the changes on 'main' or move them to your new branch. I recommend only moving changes if you're confident they're related to the purpose of your new branch.
10. Once you're on the new branch, work on your code (e.g., open Jupyter Lab and write code, copy and move files, etc.). Make commits to your new branch. You can keep and come back to this branch for however long you're working on the exercise.
    - When you have a branch selected, the Windows Explorer, Mac Finder, or file navigator in Jupyter Lab automatically show the state of that branch within the cloned directory. You are working with the *version* of the repository/fork stored in the selected branch.
    - ***Note: Don't delete the template exercise notebook. Instead, make a copy of it, then rename the copy with your name as an underscored suffix (e.g., `exercise02_chester.ipynb`)***
    - If you previously committed changes related to the exercise in the 'main' branch, I recommend going back to 'main', copying any new/changed files to your Desktop, then going back to your new branch and copying the files into it. Then commit them on the new branch.
11. When you're done coding and ready to make a pull request from your branch, push and fetch the origin a final time to make sure everything is in sync between your computer and the cloud.
12. Go to your fork on GitHub.com.
13. Go to the branches page and click on your branch.
14. Click "Sync fork" to make sure it's up-to-date with the upstream course repository.
15. Click "Contribute" and "Open pull request".
16. Scroll down and make sure that only the files you intended to include in the pull request are included.
17. Add a title and description. Click "Create pull request".

## APIs (Application Programming Interfaces)

An API is an interface for accessing information. At the most general level, nearly all programs that can be accessed with code have an API.

Python functions, for example, are programs with APIs. You access them by calling the function name and defining arguments.
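
As a small illustration, Python's built-in `sorted` function has an interface made up of one required argument and two optional keyword arguments. We can use it without knowing how the sorting works inside. The list of place names is just example input.

```python
# The "API" of sorted() is its name plus the arguments it accepts:
# sorted(iterable, key=None, reverse=False)
names = ['Rockville', 'Bethesda', 'Silver Spring']  # example input

# Call with only the required argument
alphabetical = sorted(names)

# Call with optional keyword arguments to customize the behavior
longest_first = sorted(names, key=len, reverse=True)

print(alphabetical)
print(longest_first)
```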

In practice, when people talk about getting data from APIs, they are usually talking about web APIs:
- These are usually designed with a software architecture called [REST](https://en.wikipedia.org/wiki/REST).
- Using a REST API involves making a request to a URL and receiving a response.
- Often, that response is in a format called [JSON](https://en.wikipedia.org/wiki/JSON), which is structured like nested dictionaries and lists.
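
To make the "nested dictionaries and lists" idea concrete, here is a minimal sketch that parses a small JSON string with Python's built-in `json` module. The string and its keys (`city`, `stations`, `docks`) are made up for illustration and don't come from any real API.

```python
import json

# A small JSON string shaped like a web API response (made-up values)
text = (
    '{"city": "Example City", '
    '"stations": [{"name": "A", "docks": 10}, {"name": "B", "docks": 15}]}'
)

# JSON objects become dicts and JSON arrays become lists
parsed = json.loads(text)

print(type(parsed).__name__)              # dict
print(type(parsed['stations']).__name__)  # list

# Drill down by chaining keys and list positions
second_station_docks = parsed['stations'][1]['docks']
print(second_station_docks)               # 15
```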

Today, we're going to practice retrieving data about cities from web APIs, then wrangling the data they return into a tabular format we can analyze.

Making a request to an API is just another way to get data, similar to downloading it from an open data portal. Why would you bother querying an API instead of just downloading a table?
- APIs allow programmatic access to data that can be easily scaled, replicated, and documented
- APIs can allow customization of which data you are accessing
- JSON allows for much more complex data structures than downloadable tables
- APIs can be an easy way to access real-time data
- Can we think of other reasons?

### Capital Bikeshare Data — Free, simple, real-time

Some APIs with data about cities are free and simple. The Capital Bikeshare (CABI) system, for example, has an API that reports on the status of bikes in its system in real-time. It's available free as part of CABI's operating agreement with the City of Washington, D.C.

The District Department of Transportation (DDOT) lists APIs for all of the micromobility systems operating in the city on [this webpage](https://ddot.dc.gov/page/dockless-api).

Let's request some data from the CABI system and see what it looks like.

- What could we do with these data?
- What are their limitations?

In [1]:
# Import package dependencies
import pandas as pd
import requests # for making RESTful API requests
import json # for converting strings in JSON format to python dictionaries and lists
import yaml # for converting yaml-structured text into python dictionaries and lists
import os # for basic operating system functions, like compiling paths

In [2]:
# Making a "GET" request
response = requests.get('https://gbfs.lyft.com/gbfs/1.1/dca-cabi/en/free_bike_status.json')

# Get JSON content
data = response.json()

In [3]:
# Preview the data
# data

In [4]:
# Inspect the keys
data.keys()

dict_keys(['data', 'last_updated', 'ttl', 'version'])

In [5]:
data['data']['bikes'][0]

{'is_reserved': 0,
 'is_disabled': 0,
 'fusion_lat': 0.0,
 'lon': -77.076197386,
 'fusion_lon': 0.0,
 'bike_id': '35caeda3733992250766c851fdab0d67',
 'name': '483-2424',
 'lat': 38.987995148,
 'type': 'electric_bike',
 'rental_uris': {'android': 'https://dc.lft.to/lastmile_qr_scan',
  'ios': 'https://dc.lft.to/lastmile_qr_scan'}}

In [6]:
# Make a dataframe out of data for available bikes
df = pd.DataFrame(data['data']['bikes'])

df.head()

| | is_reserved | is_disabled | fusion_lat | lon | fusion_lon | bike_id | name | lat | type | rental_uris |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | -77.076197 | 0.0 | 35caeda3733992250766c851fdab0d67 | 483-2424 | 38.987995 | electric_bike | {'android': 'https://dc.lft.to/lastmile_qr_sca... |
| 1 | 0 | 0 | 0.0 | -77.003288 | 0.0 | 91dbfcab197aa2573be04d2f081162b2 | 784-9151 | 38.890889 | electric_bike | {'android': 'https://dc.lft.to/lastmile_qr_sca... |
| 2 | 0 | 0 | 0.0 | -77.016247 | 0.0 | 7871df699ede9efc68a2f2e410bd6013 | 483-3620 | 38.903177 | electric_bike | {'android': 'https://dc.lft.to/lastmile_qr_sca... |
| 3 | 0 | 0 | 0.0 | -77.087029 | 0.0 | 91d887f54520594f5195168949f907bf | 339-0820 | 38.900869 | electric_bike | {'android': 'https://dc.lft.to/lastmile_qr_sca... |
| 4 | 0 | 0 | 0.0 | -77.072072 | 0.0 | 1b573760c0d109b57ba4b24a3db54309 | 872-1383 | 38.86826 | electric_bike | {'android': 'https://dc.lft.to/lastmile_qr_sca... |
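
The `rental_uris` column in the preview holds a whole dictionary in each cell, which is awkward to analyze. One way to handle this, sketched here in plain Python, is to flatten each record before building the dataframe. `flatten_record` is a hypothetical helper name, and the sample record is abbreviated from the first bike shown earlier.

```python
def flatten_record(record, sep='_'):
    """Return a copy of record with one level of nested dicts flattened."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Promote each nested key to a top-level key, e.g. rental_uris_ios
            for sub_key, sub_value in value.items():
                flat[f'{key}{sep}{sub_key}'] = sub_value
        else:
            flat[key] = value
    return flat


# Record abbreviated from the first bike in the API response
bike = {
    'bike_id': '35caeda3733992250766c851fdab0d67',
    'type': 'electric_bike',
    'rental_uris': {
        'android': 'https://dc.lft.to/lastmile_qr_scan',
        'ios': 'https://dc.lft.to/lastmile_qr_scan',
    },
}

flat_bike = flatten_record(bike)
print(list(flat_bike.keys()))
```

With the real response, something like `pd.DataFrame([flatten_record(b) for b in data['data']['bikes']])` should give one column per nested key. pandas also provides `pd.json_normalize`, which does similar flattening.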


In [11]:
# Save the json data for later

def save_json(data, file_name, timestamp=False):
    """Save data as json file
    data: json-compatible data structure (nested dicts and lists)
    file_name: string for file name; DO NOT include file extension (e.g., ".json")
    timestamp: optional value appended to the file name (e.g., a POSIX timestamp)
    """
    if timestamp:
        file_name = f'{file_name}_{timestamp}.json'
    else:
        file_name = f'{file_name}.json'
    with open(file_name, 'w') as f:
        json.dump(data, f, indent=4)

save_json(data, 'cabi_data')
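
`save_json` pastes whatever `timestamp` it receives straight into the file name. If that value is a POSIX timestamp (seconds since January 1, 1970, which is how I understand the GBFS specification to define `last_updated`), a more readable file name could be built with a small helper like this one. `readable_timestamp` is a hypothetical name, and formatting in UTC is my assumption rather than anything the API requires.

```python
from datetime import datetime, timezone


def readable_timestamp(posix_seconds):
    """Convert a POSIX timestamp to a UTC string that is safe in file names."""
    moment = datetime.fromtimestamp(posix_seconds, tz=timezone.utc)
    # Compact ISO-8601-style format without colons, e.g. 19700101T000000Z
    return moment.strftime('%Y%m%dT%H%M%SZ')


print(readable_timestamp(0))  # 19700101T000000Z
```

A call like `save_json(data, 'cabi_data', readable_timestamp(data['last_updated']))` would then produce a name that sorts chronologically and can be read at a glance.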

In [12]:
# Make an automated workflow to retrieve data and save it, all at once

def get_and_save_cabi_data():
    """Get current data from the CABI API and save as a timestamped JSON
    """
    # Making a "GET" request
    response = requests.get('https://gbfs.lyft.com/gbfs/1.1/dca-cabi/en/free_bike_status.json')
    # Get JSON content
    data = response.json()
    # Get timestamp from data
    timestamp = data['last_updated']
    # Save to file
    save_json(data, 'cabi_data', timestamp) 

In [13]:
# Run the automated workflow
# Could we do this on a schedule to collect "snapshots" of the state of the CABI system?
get_and_save_cabi_data()
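
One simple way to collect snapshots on a schedule is a loop that waits between calls. This is only a sketch: `collect_snapshots` is a hypothetical helper, and the stand-in task replaces the real API call so the example runs without a network connection. For long-running collection, an operating-system scheduler would likely be more robust than a loop in a notebook.

```python
import time


def collect_snapshots(task, n_snapshots, interval_seconds):
    """Call task n_snapshots times, pausing interval_seconds between calls."""
    results = []
    for i in range(n_snapshots):
        results.append(task())
        if i < n_snapshots - 1:  # no need to wait after the final call
            time.sleep(interval_seconds)
    return results


# Stand-in task so the sketch runs offline; it just counts its own calls
counter = {'n': 0}


def fake_task():
    counter['n'] += 1
    return counter['n']


snapshots = collect_snapshots(fake_task, n_snapshots=3, interval_seconds=0)
print(snapshots)  # [1, 2, 3]
```

With the real workflow, `collect_snapshots(get_and_save_cabi_data, n_snapshots=12, interval_seconds=300)` would save a snapshot every five minutes for roughly an hour.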

### Rentcast — Paid, more complex, updated less frequently

Free APIs like CABI's are becoming less common. (Does this sound familiar in light of today's reading about smart cities as emerging markets?) Many other APIs require that you pay for data, either through a subscription or for each request you make. Some have free tiers, but they're usually quite limited.

Several years ago, Zillow provided data about real estate markets through a free API available to the public. You now have to go through a complicated application process to get access to their API, and your use case needs to be aligned with their business model.

An alternative source of real estate data is a company called [Rentcast](https://www.rentcast.io/). They allow anyone to set up an account and purchase data through an API, and it [gets expensive fast](https://www.rentcast.io/api#api-pricing). You get 50 requests free for "development," but after that you pay \$0.20 per request or \$74 per month for a subscription to make up to 1,000 requests.

They keep track of who is making requests with an 'API key', which is a long string of characters you include in your request as a 'header'. Because API keys are attached to billing information (i.e., your credit card), they're very sensitive. You ***NEVER*** want to commit your API key to GitHub or share it anywhere else publicly.

It's best practice to store your API key in a separate file (I like to use a format called YAML) that you prevent from being committed by adding it to your repository's `.gitignore` file. This is a list of files that you explicitly tell git not to keep track of.

When you want to use your API key, you load the configs into memory in the Python kernel you're currently working in. When you close or restart the kernel, the computer forgets them.
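
Here is a rough sketch of how the setup could be scripted with the standard library. Both function names are hypothetical, the placeholder text is made up, and the `.gitignore` check is deliberately simplistic (it only looks for an exact line match). The `rentcast_api_key` entry matches the key name looked up in `CONFIGS` later in this demo.

```python
from pathlib import Path


def ensure_ignored(file_name, gitignore_path='.gitignore'):
    """Add file_name to .gitignore unless it is already listed."""
    path = Path(gitignore_path)
    lines = path.read_text().splitlines() if path.exists() else []
    if file_name not in lines:
        lines.append(file_name)
        path.write_text('\n'.join(lines) + '\n')


def write_config_template(config_path='configs.yaml'):
    """Create a placeholder configs file, but never overwrite a real one."""
    path = Path(config_path)
    if not path.exists():
        # One "key: value" pair per line is valid YAML
        path.write_text('rentcast_api_key: PASTE_YOUR_KEY_HERE\n')
```

Running `ensure_ignored('configs.yaml')` before `write_config_template()` means the file is already ignored by the time a real key is pasted into it.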

In [14]:
# Load personal data from a configs file (API key, local data path)
with open('configs.yaml', 'r') as file:
    CONFIGS = yaml.safe_load(file)

In [15]:
# Load eviction data we used last week
df = pd.read_csv('District_Court_of_Maryland_Eviction_Case_Data_MG_PG.csv')

In [16]:
# Preview columns
df.head(1)

| | Unnamed: 0 | Event Date | Event Type | Event Comment | County | Location | Tenant City | Tenant State | Tenant ZIP Code | Case Type | Case Number | Evicted Date | Event Year | Eviction Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 01/03/2023 | Warrant of Restitution - Return of Service - E... | | Montgomery | Rockville | Silver Spring | MD | 20910.0 | Failure to Pay Rent | D-061-LT-22-004107 | 12/08/2022 | 2023.0 | 2022.0 |


In [19]:
# Get rentcast market data for the zipcode that is most represented in the eviction case data
# (change head(1) to head(10) for the top 10; each zipcode uses one API request)
zipcodes = df['Tenant ZIP Code'].value_counts().head(1).index.astype('Int64')

for zipcode in zipcodes:
    # Make GET request to rentcast API
    url = f'https://api.rentcast.io/v1/markets?zipCode={zipcode}&dataType=All&historyRange=6'
    headers = {
        'X-Api-Key': CONFIGS['rentcast_api_key'],
        'Accept': 'application/json', 
    }
    response = requests.get(url, headers=headers)
    data = response.json()
    # Save to json
    file_path = f'rentcast_{zipcode}.json'
    with open(file_path, 'w') as file:
        json.dump(data, file, indent=4)
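
The part of the URL after the `?` is a query string: a set of `name=value` pairs joined by `&`. As a sketch of how that string is put together, the helper below builds the same URL from a dictionary using the standard library. `build_market_url` and `BASE_URL` are hypothetical names, and the parameter names are simply copied from the URL in the loop.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.rentcast.io/v1/markets'


def build_market_url(zipcode, data_type='All', history_range=6):
    """Build a markets URL with its query string from keyword values."""
    query = urlencode({
        'zipCode': zipcode,
        'dataType': data_type,
        'historyRange': history_range,
    })
    return f'{BASE_URL}?{query}'


url = build_market_url(20910)
print(url)
```

`requests.get` also accepts a `params` dictionary and does this encoding itself, which can be easier to read than a long f-string.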

In [23]:
# Preview the data
data.keys()

dict_keys(['id', 'zipCode', 'saleData', 'rentalData'])

In [22]:
data['saleData'].keys()

dict_keys(['lastUpdatedDate', 'averagePrice', 'medianPrice', 'minPrice', 'maxPrice', 'averagePricePerSquareFoot', 'medianPricePerSquareFoot', 'minPricePerSquareFoot', 'maxPricePerSquareFoot', 'averageSquareFootage', 'medianSquareFootage', 'minSquareFootage', 'maxSquareFootage', 'averageDaysOnMarket', 'medianDaysOnMarket', 'minDaysOnMarket', 'maxDaysOnMarket', 'newListings', 'totalListings', 'dataByPropertyType', 'dataByBedrooms', 'history'])

In [26]:
data['saleData']['dataByPropertyType'][0]

{'propertyType': 'Condo',
 'averagePrice': 186115,
 'medianPrice': 159995,
 'minPrice': 60000,
 'maxPrice': 395000,
 'averagePricePerSquareFoot': 156.86,
 'medianPricePerSquareFoot': 159.43,
 'minPricePerSquareFoot': 58.54,
 'maxPricePerSquareFoot': 235.56,
 'averageSquareFootage': 1148,
 'medianSquareFootage': 992,
 'minSquareFootage': 743,
 'maxSquareFootage': 2126,
 'averageDaysOnMarket': 59.58,
 'medianDaysOnMarket': 56,
 'minDaysOnMarket': 4,
 'maxDaysOnMarket': 167,
 'newListings': 4,
 'totalListings': 19}
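
Each item in `dataByPropertyType` is already a flat record, so a list of them is close to tabular. What the records lack is the ZIP code, which lives one level up. The sketch below copies it into each record so results for several ZIP codes could be stacked later. `tag_records` is a hypothetical helper, the sample keeps only a few fields from the Condo record, and the `zipCode` value is a placeholder.

```python
def tag_records(market, section='saleData', group='dataByPropertyType'):
    """Return the grouped records with the market's ZIP code added to each."""
    return [
        {'zipCode': market['zipCode'], **record}
        for record in market[section][group]
    ]


# Abbreviated stand-in for one API response
market = {
    'zipCode': '20910',  # placeholder; the real value comes from the API
    'saleData': {
        'dataByPropertyType': [
            {'propertyType': 'Condo', 'medianPrice': 159995,
             'totalListings': 19},
        ],
    },
}

rows = tag_records(market)
print(rows)
```

Passing `rows` to `pd.DataFrame` should then give one row per property type, with `zipCode` as the first column.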

---
We made it this far in class on Week 5

---

## Debugging

Errors are frustrating and inevitable. Even professional programmers probably spend most of their time debugging.

Luckily, there are good tools and techniques for making debugging a little easier.

An "interactive debugger" helps diagnose problems by stepping through the code one line at a time.

The debugger provides tools for setting "breakpoints" where the code will stop running temporarily, a table that shows the values of variables at that time, and buttons to start, stop, and step through the code.

See the [JupyterLab debugger documentation](https://jupyterlab.readthedocs.io/en/stable/user/debugger.html) for details.

In [None]:
def check_adult(age, cutoff=18):
    if age < cutoff:
        adult = False
    else:
        adult = True
    return adult

check_adult(20)
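
A complement to stepping through code interactively is to write down what you expect and let Python complain if it isn't so. This sketch restates `check_adult` as a one-line comparison, which should behave the same as the version above, and uses `assert` statements to check values right around the cutoff, where mistakes like `<` versus `<=` tend to hide.

```python
def check_adult(age, cutoff=18):
    """Return True if age is at or above the cutoff."""
    return age >= cutoff


# Boundary checks: just below, exactly at, and just above the cutoff
assert check_adult(17) is False
assert check_adult(18) is True
assert check_adult(19) is True

# A custom cutoff should shift the boundary
assert check_adult(20, cutoff=21) is False

print('all checks passed')
```

If any expectation is wrong, Python raises an `AssertionError` at that line, which is a good place to set a breakpoint.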

## Style guidelines for Python
- At the very least, do things consistently
- One statement per line
- Try to limit line length to 79 characters (72 for comments and docstrings, per PEP 8)
- Use four spaces to indent
- Put spaces around operators (e.g., `1 + 1` or `day = 'Monday'`) (except in keyword function arguments)
- Use blank lines intentionally and consistently
- Use meaningful names
- Name variables and functions with `lowercase_underscores`
- Constants are often named in `ALL_CAPS_WITH_UNDERSCORES` (e.g., `SPEED_OF_LIGHT = 2.99792458e+8`)
- Name custom classes with `CapWords`
- In general, avoid spaces in folder and filenames used for programming
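
Here is a short sketch that applies several of these guidelines at once. All of the names are made up for illustration.

```python
SECONDS_PER_MINUTE = 60  # constant: all caps with underscores


class TripRecord:  # custom class: CapWords
    """Store the duration of one trip."""

    def __init__(self, duration_seconds):
        self.duration_seconds = duration_seconds

    def duration_minutes(self):
        # Spaces around operators
        return self.duration_seconds / SECONDS_PER_MINUTE


def total_minutes(trip_records):  # function: lowercase_underscores
    """Add up the durations of several trips, in minutes."""
    return sum(trip.duration_minutes() for trip in trip_records)


# No spaces around "=" in keyword arguments
trips = [
    TripRecord(duration_seconds=300),
    TripRecord(duration_seconds=600),
]
print(total_minutes(trips))  # 15.0
```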

See [Code Readability](https://github.com/ncsg/ursp688y_sp2024/blob/main/README.md#code-readability) on the syllabus. [CS61A](https://cs61a.org/articles/composition/) has an excellent composition guide. [PEP 8](https://peps.python.org/pep-0008/) is a standard Python style guide. [Google](https://google.github.io/styleguide/pyguide.html) publishes their internal Python style guide.