Urban Data Science & Smart Cities <br>
URSP688Y Spring 2025<br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

# Demo 5 - Accessing and Wrangling Data from the Web

- GitHub Branches
- APIs and JSON data
- Debugging

## GitHub Branches

Branches allow you to organize work in a contained space. Their most important feature, for our purposes, is allowing you to make a pull request with only the changes (new files) related to a specific exercise and ***not*** all the other things you may be experimenting with in your repository.

[Here's a more extended overview](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-branches) of what branches are and how they work.

Here are a few key concepts we'll go over in class:
- Every repository or fork has a default branch ('main')
- You can make unlimited additional branches
- You always make commits to a branch (even if it's 'main')
- When making a new branch for the purpose of a pull request to an upstream repository, I recommend using the 'main' branch of that upstream repository as the source for your new branch.
    - To make your own version of an exercise notebook, make a copy of the template and leave the original file in place. That way the only change in your PR will be the addition of your new file, not the deletion of the template.
    - Don't include any other changes in your PR. If you accidentally have other changes, the easiest way to clean things up may be to make a new branch and copy only the notebook you want to submit into that branch.
    - Sync your branch before making a pull request.

In [1]:
# Import packages
import pandas as pd

In [2]:
import requests
import json
import yaml
import os

In [3]:
# Load configs file with API key
with open('configs.yaml', 'r') as file:
    CONFIGS = yaml.safe_load(file)

## Loading Data from a File

Let's get our hands on some real-world data by loading a table from a file.

Let's load data from the [Maryland Eviction Case Database](https://opendata.maryland.gov/Housing/District-Court-of-Maryland-Eviction-Case-Data/mvqb-b4hf/data).

A CSV file that is stored in the same directory as our notebook can be opened by entering just the file name as an argument to `pd.read_csv`.

In [4]:
df = pd.read_csv('District_Court_of_Maryland_Eviction_Case_Data_MG_PG.csv')

In [None]:
# # Get rentcast market data for the 10 zipcodes that are most represented in the eviction case data
# zipcodes = df['Tenant ZIP Code'].value_counts().head(10).index.astype('Int64')
# for zipcode in zipcodes:
#     # Make GET request to rentcast API
#     url = f'https://api.rentcast.io/v1/markets?zipCode={zipcode}&dataType=All&historyRange=6'
#     headers = {
#         'X-Api-Key': CONFIGS['rentcast_api_key'],
#         'Accept': 'application/json', 
#     }
#     response = requests.get(url, headers=headers)
#     data = response.json()
#     # Save to json
#     file_path = os.path.join(CONFIGS['rentcast_data_dir'], f'rentcast_{zipcode}.json')
#     with open(file_path, 'w') as file:
#         json.dump(data, file, indent=4)
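The commented-out cell above saves each API response with `json.dump`. As a minimal sketch of that save-and-reload round trip, the example below writes a small dictionary to a temporary folder and reads it back. The payload, its keys, and the file name are made up for illustration; they are not the actual RentCast response format.

```python
import json
import os
import tempfile

# Hypothetical payload standing in for a parsed API response
data = {'zipCode': '20740', 'history': [{'month': '2025-01', 'value': 1}]}

with tempfile.TemporaryDirectory() as data_dir:
    file_path = os.path.join(data_dir, 'example_20740.json')

    # Save the dictionary as indented, human-readable JSON
    with open(file_path, 'w') as file:
        json.dump(data, file, indent=4)

    # Load the file back into a Python dictionary
    with open(file_path, 'r') as file:
        reloaded = json.load(file)

# Nested JSON is navigated with a mix of dictionary keys and list positions
print(reloaded['history'][0]['month'])
```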

Let's practice navigating and doing some analysis with our DataFrame.

Preview the dataframe

In [8]:
df.head()

How many rows does it have?

In [None]:
len(df)

What columns does it have?

In [None]:
df.columns.tolist()

Which counties are represented?

In [None]:
df['County'].value_counts()

What is the earliest date?

Is the top row really the earliest event? Check how the dates are stored before trusting the sort.

In [None]:
df.sort_values('Event Date', ascending=True).head(2)

Convert the event date column to a `datetime` data type

In [None]:
df['Event Date'] = pd.to_datetime(df['Event Date'])

In [None]:
df.sort_values('Event Date', ascending=True).head(2)
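To see why the conversion matters, here is a small sketch with made-up dates. It assumes the dates are stored as month/day/year text, which may not match the actual file. Text is sorted character by character, so the "earliest" row can be wrong until the column is converted.

```python
import pandas as pd

# Made-up dates stored as text, like a freshly loaded CSV column
events = pd.DataFrame({'Event Date': ['10/01/2023', '02/15/2024', '12/31/2022']})

# Sorting text compares characters from left to right
earliest_as_text = events.sort_values('Event Date').iloc[0]['Event Date']

# Sorting datetimes compares actual points in time
events['Event Date'] = pd.to_datetime(events['Event Date'], format='%m/%d/%Y')
earliest_as_date = events.sort_values('Event Date').iloc[0]['Event Date']

print(earliest_as_text)  # the text sort puts the February 2024 date first
print(earliest_as_date)  # the datetime sort finds December 2022
```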

How many unique cases are there?

In [None]:
df_dedup = df.sort_values('Event Date', ascending=True).drop_duplicates(subset='Case Number', keep='first')

In [None]:
len(df_dedup)
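The sort-then-deduplicate pattern above can be checked on a toy table. This sketch uses invented case numbers and dates; it keeps only the earliest event for each case.

```python
import pandas as pd

# Invented events: case 'A' appears twice, case 'B' once
events = pd.DataFrame({
    'Case Number': ['A', 'A', 'B'],
    'Event Date': pd.to_datetime(['2024-03-01', '2024-01-15', '2024-02-01']),
})

# Sort so the earliest event per case comes first, then keep that first row
first_events = (
    events
    .sort_values('Event Date', ascending=True)
    .drop_duplicates(subset='Case Number', keep='first')
)

print(len(first_events))
print(first_events)
```

Because the rows are sorted before `drop_duplicates`, `keep='first'` means "keep the earliest event", not "keep whichever row happened to appear first in the file".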

How many unique cases per zip code?

In [None]:
# Count cases within each zip code
cases_per_zip = df_dedup.groupby('Tenant ZIP Code')['Case Number'].count().sort_values(ascending=False)

cases_per_zip

In [None]:
# Convert results to a dataframe with a column for zip and a column for cases
cases_per_zip = pd.DataFrame(cases_per_zip).reset_index()
cases_per_zip = cases_per_zip.rename(columns={'Tenant ZIP Code':'case_zip','Case Number':'case_count'})

cases_per_zip.head()
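As a compact variation on the two cells above, a grouped count can be turned into a table by calling `reset_index` directly on the result. This is a sketch on invented data, not a required change.

```python
import pandas as pd

# Invented cases: two in one zip code, one in another
cases = pd.DataFrame({
    'Tenant ZIP Code': [20740, 20740, 20783],
    'Case Number': ['A1', 'B2', 'C3'],
})

# Count cases per zip code; the result is a Series indexed by zip code
counts = (
    cases
    .groupby('Tenant ZIP Code')['Case Number']
    .count()
    .sort_values(ascending=False)
)

# Move the index into a regular column and give both columns readable names
counts_table = counts.reset_index().rename(
    columns={'Tenant ZIP Code': 'case_zip', 'Case Number': 'case_count'}
)

print(counts_table)
```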

Which zip codes have the most unique cases per person?

Let's join data from [CensusReporter](https://censusreporter.org/).

### Combining/Merging/Joining Tables

Combining information from multiple tables into a single table is one of the most useful data wrangling operations.

There are lots of different ways to join tables, but two basic types are:

1. Joining columns with a shared key, which outputs a table that is wider than either input.
2. Concatenating rows with shared column names, which outputs a table that is longer than either input.

#### Joining columns based on a key

![joining columns with a shared key](https://rforhr.com/horizontal_join.png)


#### Concatenating rows with the same column names
![joining rows with shared column names](https://rforhr.com/vertical_join.png)
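The two operations pictured above can be sketched with tiny invented tables. The zip codes and numbers here are placeholders, not real data.

```python
import pandas as pd

# Two tables with the same columns (placeholder values)
pop_a = pd.DataFrame({'zip': ['00001', '00002'], 'population': [100, 200]})
pop_b = pd.DataFrame({'zip': ['00003'], 'population': [300]})

# A table that shares the 'zip' key but has a different column
cases = pd.DataFrame({'zip': ['00001', '00003'], 'case_count': [4, 6]})

# Concatenating rows: the result is longer (3 rows, same 2 columns)
longer = pd.concat([pop_a, pop_b], axis=0, ignore_index=True)

# Joining columns on a key: the result is wider (same 3 rows, 3 columns)
wider = longer.merge(cases, on='zip', how='left')

print(longer.shape)
print(wider.shape)
```

Passing `ignore_index=True` gives the combined table a fresh 0, 1, 2, ... index instead of repeating the row labels from each input.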

First, let's concatenate census tables for Montgomery and Prince George's counties to make a single table with populations for each zip code.

Then, we'll merge counts of eviction cases onto each zip code.

Finally, we'll calculate the number of eviction cases per capita.

In [None]:
# Load census reporter data, ignoring the row with data for the whole county (first row under the header)
def load_census_reporter_csv(path, skiprows=[1]):
    return pd.read_csv(path, skiprows=skiprows)

df_pop_mg = load_census_reporter_csv('acs2023_5yr_B01003_mg.csv')
df_pop_pg = load_census_reporter_csv('acs2023_5yr_B01003_pg.csv') 

In [None]:
# Combine into a single dataframe
df_pop = pd.concat([df_pop_mg, df_pop_pg], axis=0)

In [None]:
# Rename columns with readable names
df_pop = df_pop.rename(columns={'name':'census_zip', 'B01003001':'population', 'B01003001, Error':'population_error'})

In [None]:
df_pop.head()

In [None]:
df_pop.census_zip.dtype

In [None]:
# Preview the case counts that will be merged onto the zip codes
cases_per_zip.head()

In [None]:
# Make sure zip codes are stored as strings in both dataframes
cases_per_zip['case_zip'] = cases_per_zip['case_zip'].astype('int64').astype('string')
df_pop['census_zip'] = df_pop['census_zip'].astype('string')
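The extra `astype('int64')` step matters when zip codes were loaded as decimal numbers. That is an assumption about how pandas read this column, and it often happens when a column contains missing values. The sketch below shows the difference on two example values.

```python
import pandas as pd

# Zip codes that were read in as floats
zips = pd.Series([20740.0, 20783.0])

# Converting floats straight to text keeps the trailing '.0'
direct = zips.astype(str).tolist()

# Converting to whole numbers first gives clean zip code text
via_int = zips.astype('int64').astype('string').tolist()

print(direct)
print(via_int)
```

Keys such as `'20740.0'` would not match `'20740'` in the other table, so a merge on them would silently find no matches.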

In [None]:
# Merge the case counts onto the zip codes, keeping every census zip code
df_pop = df_pop.merge(cases_per_zip, left_on='census_zip', right_on='case_zip', how='left')

In [None]:
df_pop.head()

In [None]:
# Cleanup
df_pop['case_count'] = df_pop['case_count'].fillna(0)
df_pop = df_pop.drop(columns=['population_error','case_zip'])
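Here is a small sketch of what the left merge and cleanup do, using placeholder zip codes. Rows from the left table with no match in the right table get missing values, which the cleanup step then replaces with zero.

```python
import pandas as pd

# Placeholder tables: the second zip code has no eviction cases
pop = pd.DataFrame({'census_zip': ['00001', '00002'], 'population': [100, 200]})
counts = pd.DataFrame({'case_zip': ['00001'], 'case_count': [4]})

# how='left' keeps every row of the left table, matched or not
merged = pop.merge(counts, left_on='census_zip', right_on='case_zip', how='left')

# Unmatched rows have missing (NaN) case counts
missing_before = int(merged['case_count'].isna().sum())

# Treat "no recorded cases" as zero and drop the duplicate key column
merged['case_count'] = merged['case_count'].fillna(0)
merged = merged.drop(columns=['case_zip'])

print(missing_before)
print(merged)
```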

In [None]:
df_pop.head()

In [None]:
df_pop['cases_per_pop'] = df_pop['case_count'] / df_pop['population']
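One thing to watch for in a per capita calculation is a population of zero. In pandas, dividing a nonzero count by zero gives `inf`, which then sorts to the top of the results. The sketch below is one possible guard, assuming some zip codes could have zero residents, which may not apply to this dataset.

```python
import pandas as pd

# Placeholder values; the last row has no residents
zip_table = pd.DataFrame({
    'case_count': [4, 0, 3],
    'population': [100, 200, 0],
})

# Replace zero populations with missing values before dividing
valid_population = zip_table['population'].where(zip_table['population'] > 0)
zip_table['cases_per_pop'] = zip_table['case_count'] / valid_population

print(zip_table)
```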

In [None]:
df_pop.sort_values('cases_per_pop', ascending=False)

## Errors and debugging

Errors are frustrating and inevitable. Even professional programmers probably spend most of their time debugging.

Luckily, there are good tools and techniques for making debugging a little easier.

Despite these, you will probably nearly tear your hair out with some frequency, especially as a beginner. It will get better with time.

There are two types of errors in programming: logic and syntax. They both result in your program not achieving its goal, but the first may not be as easily detectable because the code may still run.

### Logic errors
These are issues with how you have approached or executed your problem. If your code runs but produces nonsensical results, there is probably a logic error. However, your erroneous code might also produce logical but *wrong* results; you might never notice until the problem has rippled downstream. It's best to address this proactively by planning your code well so it's less likely to be illogical, and writing readable code that can be easily reviewed.
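One simple way to catch logic errors early is to check a function against a few inputs whose answers you already know. The function below is a hypothetical example, not part of the demo's analysis.

```python
def cases_per_thousand(case_count, population):
    """Return the number of cases per 1,000 residents."""
    return 1000 * case_count / population


# Each assert passes silently if the result is what we expect
# and raises an AssertionError if it is not
assert cases_per_thousand(5, 1000) == 5
assert cases_per_thousand(0, 250) == 0
assert cases_per_thousand(10, 500) == 20

print('All checks passed')
```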

Here's a logic error. Can you find it? (Hint: the issue is syntactical, but it's still a logic error because the code works without throwing an error.)

In [None]:
def check_adult(age, cutoff=18):
    if age > cutoff:
        adult = False
    else:
        adult = True
    return adult

check_adult(20)

### Syntax errors
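These are more obvious because your code will simply fail with an error message. (Strictly speaking, Python reports a `SyntaxError` when it cannot parse your code and other exception types when parsed code fails while running; both are treated together here.) There are lots of tools for figuring out where and why.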

Error messages are usually the starting place for debugging a syntax error.

In [None]:
def check_adult(age, cutoff=18):
    if age < cutoff:
        adult = False
    else:
        adult = True
    return adult

check_adult('20')

The error message tells us where the problem is located.

Sometimes, it can be helpful to turn on line numbers.
- In Colab: `Tools -> Settings -> Editor -> Show line numbers`
- In JupyterLab: `View -> Show Line Numbers`

The `TypeError` tells us that the issue is related to the data type of a value on this line, but it's still pretty vague.
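The pieces of an error message can also be inspected in code. This sketch catches whatever exception the call raises, then prints its type and message, which are good search terms.

```python
def check_adult(age, cutoff=18):
    if age < cutoff:
        adult = False
    else:
        adult = True
    return adult


try:
    # Passing text instead of a number triggers the error
    check_adult('20')
except Exception as error:
    error_type = type(error).__name__
    error_message = str(error)
    print(f'{error_type}: {error_message}')
```

Catching a broad `Exception` is used here only to examine the error; in analysis code it is usually better to let errors surface or to catch a specific type.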

Time to start [Googling](https://www.google.com/).


### Debugging
We can also use an "interactive debugger" to help diagnose our problem by stepping through the code one line at a time.

The debugger provides tools for setting "breakpoints" where the code will stop running temporarily, a table that shows the values of variables at that time, and buttons to start, stop, and step through the code.

See the [JupyterLab debugger documentation](https://jupyterlab.readthedocs.io/en/stable/user/debugger.html) for details.

In [None]:
def check_adult(age, cutoff=18):
    if age < cutoff:
        adult = False
    else:
        adult = True
    return adult

check_adult(20)
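Outside the visual debugger, Python's built-in `breakpoint()` function offers a text-based alternative: it pauses the code and opens a `pdb` prompt. In this sketch the call is left commented out so the cell runs without stopping.

```python
def check_adult(age, cutoff=18):
    # Uncomment the next line to pause here and open the pdb prompt
    # breakpoint()
    if age < cutoff:
        adult = False
    else:
        adult = True
    return adult


result = check_adult(20)
print(result)
```

At the `pdb` prompt, `p age` prints a variable, `n` runs the next line, and `c` continues to the end.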

## Style guidelines for Python
- At the very least, do things consistently
- One statement per line
- Try to limit line length (PEP 8 recommends at most 79 characters for code and 72 for comments and docstrings)
- Use four spaces to indent
- Put spaces around operators (e.g., `1 + 1` or `day = 'Monday'`) (except in keyword function arguments)
- Use blank lines intentionally and consistently
- Use meaningful names
- Name variables and functions with `lowercase_underscores`
- Constants are often named in `ALL_CAPS_WITH_UNDERSCORES` (e.g., `SPEED_OF_LIGHT = 2.99792458e+8`)
- Name custom classes with `CapWords`
- In general, avoid spaces in folder and filenames used for programming
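Several of these guidelines are easier to see in a short example. The names below are invented for illustration.

```python
# Constant: all caps with underscores (meters per second)
SPEED_OF_LIGHT = 2.99792458e+8


class LightTimer:
    """Custom class named with CapWords."""

    def __init__(self, distance_meters):
        self.distance_meters = distance_meters

    def travel_seconds(self):
        # Spaces around the operator, one statement per line
        return self.distance_meters / SPEED_OF_LIGHT


def describe_day(day='Monday'):
    # No spaces around '=' for the keyword argument default above
    return 'Today is ' + day


timer = LightTimer(distance_meters=SPEED_OF_LIGHT)
print(timer.travel_seconds())
print(describe_day(day='Friday'))
```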

See [Code Readability](https://github.com/ncsg/ursp688y_sp2024/blob/main/README.md#code-readability) on the syllabus. [CS61A](https://cs61a.org/articles/composition/) has an excellent composition guide. [PEP 8](https://peps.python.org/pep-0008/) is a standard Python style guide. [Google](https://google.github.io/styleguide/pyguide.html) publishes their internal Python style guide.