# Notebook Title

## Setup Python and R environment
You can ignore this section.

In [9]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [11]:
%%R

# My commonly used R imports

require('tidyverse')

## Load & Clean Data

👉 Load the data along with the census connectors below (the output of the `connect-to-census.ipynb` notebook) and do any cleanup you'd like to do.

In [12]:
%%R
df <- read_csv('2023_subway_censusgeo.csv')

Rows: 18208 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (26): Common Name, Equipment Description, Executive Comment, Outage Cod...
dbl  (10): Outage, Station MRN, Station ID, Complex ID, lat, long, ADA, GEOI...
dttm  (4): Out of Service Date, Estimated Return to Service Date, Actual Ret...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [13]:
df = pd.read_csv('data/2023_subway_censusgeo.csv')

In [14]:
df.shape

(18208, 40)

In [15]:

pd.set_option('display.max_columns', None)
df.head()


Unnamed: 0,Out of Service Date,Common Name,Outage,Equipment Description,Executive Comment,Outage Code,Status,External Source Note,Reason Shown to Public,Reason Shown to Public Description,Estimated Return to Service Date,Actual Return to Service Date,Reference,Source,Service Code,Date Created,Status Code,Outage Comments,Station MRN,Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,lat,long,North Direction Label,South Direction Label,ADA,ADA Notes,Georeference,GEOID,STATE,COUNTY,TRACT,BLOCK
0,2023-01-01 00:14:00,EL376,791238,ELE: EL376 - 068 - Bay Pkwy - Outside Area,,PM,Closed,,MAINTENANCE,Maintenance,2023-01-01 07:00:00,2023-01-01 05:30:00,,Phone,EE-MANOUT,2023-01-01 00:14:00,CL,***This Elevator is out of service for Mainten...,68,68,68,B21,BMT,West End,Bay Pkwy,Bk,D,Elevated,40.601875,-73.993728,Manhattan,Coney Island,1,,POINT (-73.993728 40.601875),360470296002002,36,47,29600,2002
1,2023-01-01 01:04:00,EL189,791257,ELE: EL189 - 212 - Kingsbridge Rd - Outside Area,,AP,Closed,,PLANNEDWORK,Planned Work,2023-01-01 09:00:00,2023-01-01 01:18:00,,Phone,EE-MANOUT,2023-01-01 01:05:00,CL,\n\n***This elevator is out of service for Acc...,212,212,212,D04,IND,Concourse,Kingsbridge Rd,Bx,B D,Subway,40.866978,-73.893509,Bedford Pk Blvd & 205 St,Manhattan,1,,POINT (-73.893509 40.866978),360050403021003,36,5,40302,1003
2,2023-01-01 02:01:00,EL293,791274,ELE: EL293 - 119 - 1 Av - Brooklyn Bound Platform,,EUF,Closed,UPS Battery Failure,REPAIR,Repair,2023-01-01 11:00:00,2023-01-01 04:50:00,ST230101.txt - 192,External Monitoring System,EE-LNOUT,2023-01-01 02:08:00,CL,,119,119,119,L06,BMT,Canarsie,1 Av,M,L,Subway,40.730953,-73.981628,8 Av,Brooklyn,1,,POINT (-73.981628 40.730953),360610034004000,36,61,3400,4000
3,2023-01-01 02:08:00,EL449X,791275,ELE: EL449X - 279 - Sutphin Blvd-Archer Av - Mezz,,TPE,Closed,,REPAIR,Repair,2023-01-05 14:00:00,2023-01-05 03:06:00,,Phone,EE-MANOUT,2023-01-01 02:09:00,CL,***This elevator is out of service as reported...,279,279,279,G06,IND,Queens - Archer,Sutphin Blvd-Archer Av-JFK Airport,Q,E J Z,Subway,40.700486,-73.807969,Jamaica Center,Manhattan,1,,POINT (-73.807969 40.700486),360810208001000,36,81,20800,1000
4,2023-01-01 05:43:00,EL428,791299,ELE: EL428 - 273 - Queens Plaza - Outside Area,,VAN,Closed,,REPAIR,Repair,2023-01-01 13:00:00,2023-01-01 09:10:00,,Phone,EE-MANOUT,2023-01-01 05:44:00,CL,,273,273,273,G21,IND,Queens Blvd,Queens Plaza,Q,E M R,Subway,40.748973,-73.937243,Forest Hills - Jamaica,Manhattan,1,,POINT (-73.937243 40.748973),360810033011019,36,81,3301,1019


## 👉 Grab Census Data

1. Load the Census API key

In [20]:
import dotenv

# Load the environment variables
# (loads CENSUS_API_KEY from .env)
dotenv.load_dotenv()


True
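Before handing the key to R, it can help to confirm that it actually loaded. The following is a minimal sketch, assuming the key is stored under `CENSUS_API_KEY` as the comment above says. The helper name `require_env` is hypothetical and is not part of `dotenv`.

```python
import os


def require_env(name, env=None):
    """Return an environment variable's value, or raise if it is missing or blank."""
    # Default to the real process environment; a plain dict can be passed for testing
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value.strip():
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demonstration with a stand-in mapping (the key value here is made up)
fake_env = {"CENSUS_API_KEY": "abc123"}
key = require_env("CENSUS_API_KEY", fake_env)
print(f"Key loaded ({len(key)} characters)")
```

In the notebook itself you would call `require_env("CENSUS_API_KEY")` right after `dotenv.load_dotenv()`, so a missing key fails early instead of surfacing later as an API error in the R cell.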

In [21]:
%%R 

require('tidycensus')

# because it is an environment variable, we don't have to 
# explicitly pass this string to R; it is readable here
# in this R cell.
census_api_key(Sys.getenv("CENSUS_API_KEY"))

Loading required package: tidycensus
To install your API key for use in future sessions, run this function with `install = TRUE`.


2. Decide which Census variables you want

    Use <https://censusreporter.org/> to figure out which tables you want. (If Census Reporter is down, use the code in the cell below.)

    -   Scroll to the bottom of the page to see the tables.
    -   If you already know the table ID, stick that in the "Explore" section to learn more about that table.

    By default this code loads (B01003_001) which we found in censusreporter here: https://censusreporter.org/tables/B01003/

    - Find some other variables that you're also interested in.
    - Don't forget to pick a geography like "tract", "county" or "block group". Here is the list of [all geographies](https://walker-data.com/tidycensus/articles/basic-usage.html#geography-in-tidycensus).


In [22]:
%%R 

# Finding the Census variables for the ACS 5-year survey
# Generally you'd do this in Census Reporter, but since it's sometimes down, here it is using tidycensus's load_variables function

# get every single variable in the ACS5
all_census_vars <- load_variables(2021, "acs5", cache = TRUE) 

filtered_census_vars <- all_census_vars %>% 
    filter(grepl("median income", label, ignore.case = TRUE))   # filter to those containing "median income"
    
# write to CSV so we can view it in python
filtered_census_vars %>% 
    write_csv("filtered_census_vars.csv")

# show the first few rows
filtered_census_vars %>%
    select(-geography) %>% # remove the geography column
    print(n = 20) # print the first 20 rows
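The cell above writes its matches to CSV "so we can view it in python". A minimal sketch of that Python side follows, using a small stand-in table rather than the real file. The `name`, `label` and `concept` columns follow the layout that `load_variables` returns, but the label and concept strings are shortened for illustration and the helper `search_vars` is hypothetical.

```python
import pandas as pd

# Stand-in for pd.read_csv("filtered_census_vars.csv"); text is abbreviated
census_vars = pd.DataFrame({
    "name": ["B01003_001", "B19013_001", "B08301_012"],
    "label": [
        "Estimate!!Total",
        "Estimate!!Median household income in the past 12 months",
        "Estimate!!Total:!!Public transportation:!!Subway or elevated rail",
    ],
    "concept": [
        "TOTAL POPULATION",
        "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS",
        "MEANS OF TRANSPORTATION TO WORK",
    ],
})


def search_vars(df, term):
    """Return rows whose label or concept contains `term`, ignoring case."""
    in_label = df["label"].str.contains(term, case=False, regex=False, na=False)
    in_concept = df["concept"].str.contains(term, case=False, regex=False, na=False)
    return df[in_label | in_concept]


print(search_vars(census_vars, "median household income")[["name", "label"]])
```

Searching the concept column as well as the label is a design choice here: labels such as `Estimate!!Total` say little on their own, while the concept names the table they belong to.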

In [23]:
%%R 
# the variable B01003_001E was selected from the census table 
# for population, which we found in censusreporter here:
# https://censusreporter.org/tables/B01003/

# in the call below, pick the geography, the variables, and the survey you want to pull from
# see the possible values here https://walker-data.com/tidycensus/articles/basic-usage.html

# Get variable from ACS
nyc_census_data <- get_acs(geography = "tract", 
                      state='NY',
                      county = c("New York", "Kings", "Queens", "Bronx", "Richmond"),
                      variables = c(
                        population="B01003_001", 
                        med_earn="B19013_001", # Median household income in the past 12 months
                        sub_pop='B08301_012', # Population using subway or elevated rail to work
                        amb_pop='B18105_001' # Universe total of the ambulatory difficulty table (B18105), not the count of people with a difficulty
                      ), 
                      year = 2021,
                      survey="acs5",
                      geometry = TRUE)
options(width = 1000)

nyc_census_data

Simple feature collection with 9308 features and 5 fields (with 4 geometries empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25609 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                       NAME   variable estimate   moe                       geometry
1  36081014700  Census Tract 147, Queens County, New York population     2863   513 MULTIPOLYGON (((-73.9137 40...
2  36081014700  Census Tract 147, Queens County, New York    sub_pop      824   210 MULTIPOLYGON (((-73.9137 40...
3  36081014700  Census Tract 147, Queens County, New York    amb_pop     2738   508 MULTIPOLYGON (((-73.9137 40...
4  36081014700  Census Tract 147, Queens County, New York   med_earn    71815 18034 MULTIPOLYGON (((-73.9137 40...
5  36047058400   Census Tract 584, Kings County, New York population     3655   529 MULTIPOLYGON (((-73.96103 4...
6  36047058400   Census Tract 584, Kings County, New York    sub

Getting data from the 2017-2021 5-year ACS
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
Using FIPS code '36' for state 'NY'
Using FIPS code '061' for 'New York County'
Using FIPS code '047' for 'Kings County'
Using FIPS code '081' for 'Queens County'
Using FIPS code '005' for 'Bronx County'
Using FIPS code '085' for 'Richmond County'


## 👉 Merge it with your data

Hint: `tidycensus` provides data in long format, so you may need to pivot the census data from long to wide format before merging it with your data.
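To make the long-versus-wide distinction concrete, here is a rough pandas equivalent of that pivot. It is a sketch on a single tract, using the four rows shown for Census Tract 147 in the output above; the notebook itself does this step in R.

```python
import pandas as pd

# Long format: one row per (tract, variable) pair, as tidycensus returns it
long_df = pd.DataFrame({
    "GEOID": ["36081014700"] * 4,
    "variable": ["population", "sub_pop", "amb_pop", "med_earn"],
    "estimate": [2863, 824, 2738, 71815],
    "moe": [513, 210, 508, 18034],
})

# Wide format: one row per tract, one column per variable/value combination
wide_df = long_df.pivot(index="GEOID", columns="variable", values=["estimate", "moe"])

# pivot() yields (value, variable) column tuples; flatten them to
# names like "population_estimate", matching the names_glue pattern
wide_df.columns = [f"{variable}_{value}" for value, variable in wide_df.columns]
wide_df = wide_df.reset_index()

print(wide_df.T)
```

After the pivot each tract appears exactly once, which is what makes a one-row-per-tract merge on `GEOID` possible.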

In [24]:
%%R

# pivot from long to wide
nyc_census_data <- nyc_census_data %>% 
  pivot_wider(
    names_from = variable, 
    values_from = c(estimate, moe),
    names_glue = "{variable}_{.value}"
  )
options(width = 1000)
nyc_census_data

Simple feature collection with 2327 features and 10 fields (with 1 geometry empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25609 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
Geodetic CRS:  NAD83


# A tibble: 2,327 × 11
   GEOID       NAME                                                                                                                       geometry population_estimate sub_pop_estimate amb_pop_estimate med_earn_estimate population_moe sub_pop_moe amb_pop_moe med_earn_moe
   <chr>       <chr>                                                                                                            <MULTIPOLYGON [°]>               <dbl>            <dbl>            <dbl>             <dbl>          <dbl>       <dbl>       <dbl>        <dbl>
 1 36081014700 Census Tract 147, Queens County, New York   (((-73.9137 40.76548, -73.9121 40.76473, -73.9113 40.76435, -73.9105 40.76398, -73.9...                2863              824             2738             71815            513         210         508        18034
 2 36047058400 Census Tract 584, Kings County, New York    (((-73.96103 40.59616, -73.95978 40.5963, -73.95878 40.59641, -73.95785 40.59651, -7...                36

In [25]:
%%R
df

# A tibble: 18,208 × 40
   `Out of Service Date` `Common Name` Outage `Equipment Description`                                  `Executive Comment` `Outage Code` Status `External Source Note`              `Reason Shown to Public` `Reason Shown to Public Description` `Estimated Return to Service Date` `Actual Return to Service Date` Reference          Source                     `Service Code` `Date Created`      `Status Code` `Outage Comments`                                                                                                                                                                                                                                                                                                                                                     `Station MRN` `Station ID` `Complex ID` `GTFS Stop ID` Division Line  `Stop Name` Borough `Daytime Routes` Structure   lat  long North Direction Labe…¹ South Direction Labe…²   ADA `ADA Notes` Georeference   GEOID 

In [27]:
%%R 

# keep the first 11 digits of df$GEOID (state + county + tract)
# and convert the result to numeric
df$GEOID <- as.numeric(substr(df$GEOID, 1, 11))

# convert the census GEOID to numeric so both join keys have the same type
nyc_census_data$GEOID <- as.numeric(nyc_census_data$GEOID)
    
# merge nyc_census_data with df on GEOID
df_census <- merge(df, nyc_census_data, by = "GEOID")
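Keeping the first 11 digits works because a 15-digit block GEOID is the concatenation of state (2 digits), county (3), tract (6) and block (4). The sketch below splits the first GEOID from the outage table shown earlier. The helper `split_block_geoid` is a hypothetical name, and padding with `zfill` assumes the only way a value can be short is a leading zero lost when the ID was stored as a number.

```python
def split_block_geoid(geoid):
    """Split a 15-digit census block GEOID into its component codes."""
    # Work on a string so leading zeros inside the codes are preserved
    geoid = str(geoid).zfill(15)
    return {
        "state": geoid[:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
        "block": geoid[11:15],
        "tract_geoid": geoid[:11],  # the key used for the tract-level merge
    }


parts = split_block_geoid(360470296002002)
print(parts)
```

The `STATE`, `COUNTY`, `TRACT` and `BLOCK` columns in the outage table hold the same codes with leading zeros dropped (47 rather than 047, for example), which is one reason to build the join key from the GEOID rather than from those columns.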

In [28]:
%%R 
write_csv(df_census, "data/2023_subway_censusvar.csv")
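R's `merge` performs an inner join by default, so any outage row whose tract has no match in the census table is dropped before this file is written. One way to check for that is a left join with an indicator column, sketched below in pandas on toy data. The first two GEOIDs come from the outage table above; the third GEOID, the elevator `EL000` and the population figures are made up for illustration.

```python
import pandas as pd

outages = pd.DataFrame({
    "GEOID": [36047029600, 36005040302, 99999999999],
    "Common Name": ["EL376", "EL189", "EL000"],
})
census = pd.DataFrame({
    "GEOID": [36047029600, 36005040302],
    "population_estimate": [100, 200],  # placeholder values
})

# A left join keeps every outage row; _merge records whether a tract matched
checked = outages.merge(census, on="GEOID", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(f"{len(unmatched)} of {len(checked)} outage rows have no census match")
print(unmatched[["GEOID", "Common Name"]])
```

If the unmatched set is empty, the inner join loses nothing and the two approaches give the same rows.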

## Create an aggregated dataset

In [29]:
df = pd.read_csv('data/2023_subway_censusvar.csv')

In [33]:
value_count = df['Common Name'].value_counts()
df['outage_count'] = df['Common Name'].map(value_count)
df.sample(5)

Unnamed: 0,GEOID,Out of Service Date,Common Name,Outage,Equipment Description,Executive Comment,Outage Code,Status,External Source Note,Reason Shown to Public,Reason Shown to Public Description,Estimated Return to Service Date,Actual Return to Service Date,Reference,Source,Service Code,Date Created,Status Code,Outage Comments,Station MRN,Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,lat,long,North Direction Label,South Direction Label,ADA,ADA Notes,Georeference,STATE,COUNTY,TRACT,BLOCK,NAME,geometry,population_estimate,sub_pop_estimate,amb_pop_estimate,med_earn_estimate,population_moe,sub_pop_moe,amb_pop_moe,med_earn_moe,outage_count
2172,36047000501,2023-04-15T19:26:00Z,EL310,833097,ELE: EL310 - 334 - Clark St - Mezz A,,UNI,Cancelled,No Comm/Power,UNDERINVESTG,Under Investigation,2023-04-16T04:00:00Z,,ST230415.txt - 1764,External Monitoring System,EE-LNOUT,2023-04-15T19:34:00Z,C,"As per SA Anderson (R600, #013122) EL310 is In...",334,334,334,231,IRT,Clark St,Clark St,Bk,2 3,Subway,40.697466,-73.993086,Manhattan,Flatbush - New Lots,0,,POINT (-73.993086 40.697466),36,47,501,1001,"Census Tract 5.01, Kings County, New York","list(list(c(-73.996197, -73.995855, -73.995547...",4604,1094,4385,141354.0,493,192,497,30955.0,94
11352,36061012602,2023-12-20T17:38:00Z,EL267,957993,ELE: EL267 - 477 - 72 St - Outside Area,this elevator out of service due to LO/TO trai...,AP,Closed,,PLANNEDWORK,Planned Work,2023-12-20T23:00:00Z,2023-12-20T17:55:00Z,,Phone,EE-MANOUT,2023-12-20T17:39:00Z,CL,\n\n\nE&EH Gurung taking machine out for LO/TO...,477,477,477,Q03,IND,Second Av,72 St,M,Q,Subway,40.768799,-73.958424,Uptown,Downtown & Brooklyn,1,,POINT (-73.958424 40.768799),36,61,12602,2000,"Census Tract 126.02, New York County, New York","list(list(c(-73.961559, -73.961102, -73.960602...",4847,808,4616,177882.0,1092,291,945,39112.0,43
2604,36047001100,2023-06-16T00:20:00Z,EL709,883831,ELE: EL709 - 025 - Jay St-MetroTech - Island Plat,,PM,Closed,,MAINTENANCE,Maintenance,2023-06-16T07:22:00Z,2023-06-16T05:30:00Z,,Phone,EE-MANOUT,2023-06-16T00:22:00Z,CL,***The elevator is out of service due to Reven...,25,25,636,R29,BMT,Broadway,Jay St-MetroTech,Bk,R,Subway,40.69218,-73.985942,Manhattan,Bay Ridge - 95 St,1,,POINT (-73.985942 40.69218),36,47,1100,1011,"Census Tract 11, Kings County, New York","list(list(c(-73.990447, -73.990663, -73.990746...",1508,670,1414,154167.0,254,111,197,34801.0,51
11929,36061014500,2023-07-26T00:03:00Z,EL280,899695,ELE: EL280 - 161 - 59 St-Columbus Circle - Mezz A,,CLN,Closed,,CLEANING,Cleaning,2023-07-26T07:00:00Z,2023-07-26T04:30:00Z,,Phone,EE-MANOUT,2023-07-26T00:04:00Z,CL,***This Elevator is out of service for Cleanin...,161,161,614,A24,IND,8th Av - Fulton St,59 St-Columbus Circle,M,A B C D,Subway,40.768296,-73.981736,Uptown & The Bronx,Downtown & Brooklyn,1,,POINT (-73.981736 40.768296),36,61,14500,4002,"Census Tract 145, New York County, New York","list(list(c(-73.987615, -73.987146, -73.986695...",6401,1240,5873,184231.0,817,315,735,25765.0,59
15484,36061028300,2023-07-17T23:25:00Z,EL178,896395,ELE: EL178 - 299 - Dyckman St,,PM,Closed,,MAINTENANCE,Maintenance,2023-07-18T05:25:00Z,2023-07-18T05:00:00Z,,Phone,EE-MANOUT,2023-07-17T23:27:00Z,CL,**Elevator OOS for Monthly PM ***\n\n\nWO 693...,299,299,299,109,IRT,Broadway - 7Av,Dyckman St,M,1,Elevated,40.860531,-73.925536,Uptown & The Bronx,Downtown,1,,POINT (-73.925536 40.860531),36,61,28300,2000,"Census Tract 283, New York County, New York","list(list(c(-73.931818, -73.931313, -73.92842,...",8463,2156,8295,64394.0,1556,438,1550,9896.0,45


In [37]:
df_agg = df.drop_duplicates(subset=['Common Name'])
# drop the columns that describe a single outage, so each remaining row describes one elevator
df_agg = df_agg.drop(columns=['Outage', 
                              'Out of Service Date', 
                              'Executive Comment', 
                              'Outage Code', 
                              'Status', 
                              'External Source Note', 
                              'Reason Shown to Public', 
                              'Reason Shown to Public Description', 
                              'Estimated Return to Service Date',
                              'Actual Return to Service Date',
                              'Reference',
                              'Source',
                              'Service Code',
                              'Date Created',
                              'Status Code',
                              'Outage Comments'])
df_agg.shape

(362, 35)
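The aggregation above counts outage rows per elevator, attaches the count to every row, then keeps one row per elevator. A small sketch of the same steps on toy data may make the result easier to verify; the `Outage` IDs below are invented, and the `groupby` line is shown only as a cross-check, not as the notebook's method.

```python
import pandas as pd

# Toy outage log: EL376 appears three times, EL189 twice
outages = pd.DataFrame({
    "Common Name": ["EL376", "EL189", "EL376", "EL376", "EL189"],
    "Stop Name": ["Bay Pkwy", "Kingsbridge Rd", "Bay Pkwy", "Bay Pkwy", "Kingsbridge Rd"],
    "Outage": [1, 2, 3, 4, 5],  # made-up outage IDs
})

# Count rows per elevator and attach the count to every row
counts = outages["Common Name"].value_counts()
outages["outage_count"] = outages["Common Name"].map(counts)

# Keep the first row per elevator and drop the per-outage column
per_elevator = (
    outages.drop_duplicates(subset=["Common Name"])
    .drop(columns=["Outage"])
    .reset_index(drop=True)
)

# Cross-check: group sizes should equal the attached counts
by_group = outages.groupby("Common Name").size()

print(per_elevator)
```

Note that `drop_duplicates` keeps the first row it sees for each elevator, so any column that varies between outages of the same elevator would carry that first row's value; the station and census columns kept here are constant per elevator.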

In [39]:
df_agg

Unnamed: 0,GEOID,Common Name,Equipment Description,Station MRN,Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,lat,long,North Direction Label,South Direction Label,ADA,ADA Notes,Georeference,STATE,COUNTY,TRACT,BLOCK,NAME,geometry,population_estimate,sub_pop_estimate,amb_pop_estimate,med_earn_estimate,population_moe,sub_pop_moe,amb_pop_moe,med_earn_moe,outage_count
0,36005006500,EL130,ELE: EL130 - 434 - 3 Av-149 St,434,434,434,221,IRT,Lenox - White Plains Rd,3 Av-149 St,Bx,2 5,Subway,40.816109,-73.917757,Wakefield - Eastchester,Manhattan,1,,POINT (-73.917757 40.816109),36,5,6500,1000,"Census Tract 65, Bronx County, New York","list(list(c(-73.925185, -73.924025, -73.922652...",5681,1004,4827,21962.0,908,270,587,5408.0,94
6,36005006500,EL129,ELE: EL129 - 434 - 3 Av-149 St,434,434,434,221,IRT,Lenox - White Plains Rd,3 Av-149 St,Bx,2 5,Subway,40.816109,-73.917757,Wakefield - Eastchester,Manhattan,1,,POINT (-73.917757 40.816109),36,5,6500,1000,"Census Tract 65, Bronx County, New York","list(list(c(-73.925185, -73.924025, -73.922652...",5681,1004,4827,21962.0,908,270,587,5408.0,41
135,36005008300,EL516,ELE: EL516 - 373 - E 149TH St,373,373,373,615,IRT,Pelham,E 149 St,Bx,6,Subway,40.812118,-73.904098,Pelham Bay Park,Manhattan,1,,POINT (-73.904098 40.812118),36,5,8300,5000,"Census Tract 83, Bronx County, New York","list(list(c(-73.904183, -73.903804, -73.90347,...",6121,864,5751,42386.0,913,285,847,4187.0,10
136,36005008300,EL515,ELE: EL515 - 373 - E 149TH St,373,373,373,615,IRT,Pelham,E 149 St,Bx,6,Subway,40.812118,-73.904098,Pelham Bay Park,Manhattan,1,,POINT (-73.904098 40.812118),36,5,8300,5000,"Census Tract 83, Bronx County, New York","list(list(c(-73.904183, -73.903804, -73.90347,...",6121,864,5751,42386.0,913,285,847,4187.0,8
153,36005011900,EL196,ELE: EL196 - 371 - Hunts Point Av,371,371,371,613,IRT,Pelham,Hunts Point Av,Bx,6,Subway,40.820948,-73.890549,Pelham Bay Park,Manhattan,1,,POINT (-73.890549 40.820948),36,5,11900,1002,"Census Tract 119, Bronx County, New York","list(list(c(-73.892005, -73.89139, -73.890586,...",6193,1312,5690,32007.0,1055,466,969,6069.0,76
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18024,36081093800,EL499,ELE: EL499 - 203 - Rockaway Park-Beach 116 St,203,203,203,H15,IND,Rockaway,Rockaway Park-Beach 116 St,Q,A S,At Grade,40.580903,-73.835592,Manhattan,,1,,POINT (-73.835592 40.580903),36,81,93800,3003,"Census Tract 938, Queens County, New York","list(list(c(-73.840704, -73.833507, -73.82733,...",5289,368,4352,60917.0,744,181,607,15288.0,14
18038,36081100803,EL498,ELE: EL498 - 209 - Far Rockaway-Mott Av - Isla...,209,209,209,H11,IND,Rockaway,Far Rockaway-Mott Av,Q,A,Viaduct,40.603995,-73.755405,Manhattan,,1,,POINT (-73.755405 40.603995),36,81,100803,1000,"Census Tract 1008.03, Queens County, New York","list(list(c(-73.763868, -73.761639, -73.760598...",4081,562,3899,44146.0,792,210,748,20424.0,43
18039,36081100803,EL497,ELE: EL497 - 209 - Far Rockaway-Mott Av - Isla...,209,209,209,H11,IND,Rockaway,Far Rockaway-Mott Av,Q,A,Viaduct,40.603995,-73.755405,Manhattan,,1,,POINT (-73.755405 40.603995),36,81,100803,1000,"Census Tract 1008.03, Queens County, New York","list(list(c(-73.763868, -73.761639, -73.760598...",4081,562,3899,44146.0,792,210,748,20424.0,100
18181,36085013400,EL788,ELE: EL788 - 510 - NEW DORP,510,510,510,S22,SIR,Staten Island,New Dorp,SI,SIR,Open Cut,40.573480,-74.117210,St George,Tottenville,1,,POINT (-74.11721 40.57348),36,85,13400,1005,"Census Tract 134, Richmond County, New York","list(list(c(-74.123427, -74.122169, -74.121286...",4225,76,4113,79412.0,704,48,692,28002.0,12


In [38]:
df_agg.to_csv('data/2023_subway_censusvar_agg.csv', index=False)