# Notebook Title

## Setup Python and R environment
You can ignore this section.

In [17]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning)  # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [18]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [19]:
%%R

# My commonly used R imports

require('tidyverse')

## Load & Clean Data

👉 Load the data along with the census connectors below (the output of the `connect-to-census.ipynb` notebook) and do any cleanup you'd like to do.

In [20]:
import pandas as pd

In [21]:
df_merge2 = pd.read_csv('output.csv')
df_merge2

Unnamed: 0,Borough,Borough/Citywide Office (B/CO),District,School,School Name_x,School Category,Program,Language,Language (Translated),General/Special Education,...,% White,# Missing Race/Ethnicity Data,% Missing Race/Ethnicity Data,# Students with Disabilities,% Students with Disabilities,# English Language Learners,% English Language Learners,# Poverty,% Poverty,Economic Need Index
0,Manhattan,Manhattan,1,01M020,P.S. 020 Anna Silver,Elementary,Dual Language,Chinese,中文,General Education,...,0.050,0,0.000,116,0.215,93,0.172,315,58.3%,67.7%
1,Manhattan,Manhattan,1,01M020,P.S. 020 Anna Silver,Elementary,Dual Language,Chinese,中文,General Education,...,0.032,1,0.002,118,0.237,86,0.173,366,73.6%,80.0%
2,Manhattan,Manhattan,1,01M020,P.S. 020 Anna Silver,Elementary,Dual Language,Chinese,中文,General Education,...,0.046,1,0.002,117,0.243,63,0.131,326,67.8%,75.4%
3,Manhattan,Manhattan,1,01M020,P.S. 020 Anna Silver,Elementary,Dual Language,Chinese,中文,General Education,...,0.067,1,0.002,106,0.228,61,0.131,354,76.3%,78.4%
4,Manhattan,Manhattan,1,01M020,P.S. 020 Anna Silver,Elementary,Dual Language,Chinese,中文,General Education,...,0.073,1,0.002,90,0.218,49,0.119,292,70.9%,76.2%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2757,Queens,ACCESS,79,79Q950,Pathways to Graduation,High school,Transitional Bilingual Education,Spanish,Español,General Education,...,0.054,0,0.000,410,0.107,542,0.141,1569,40.9%,72.0%
2758,Queens,ACCESS,79,79Q950,Pathways to Graduation,High school,Transitional Bilingual Education,Spanish,Español,General Education,...,0.050,0,0.000,400,0.113,473,0.134,2607,73.9%,88.3%
2759,Queens,ACCESS,79,79Q950,Pathways to Graduation,High school,Transitional Bilingual Education,Spanish,Español,General Education,...,0.052,0,0.000,469,0.132,547,0.154,2597,73.1%,87.8%
2760,Queens,ACCESS,79,79Q950,Pathways to Graduation,High school,Transitional Bilingual Education,Spanish,Español,General Education,...,0.048,0,0.000,449,0.128,565,0.161,2629,74.9%,91.2%
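
One cleanup step the preview suggests: `% Poverty` and `Economic Need Index` are stored as strings such as `58.3%`, while the other percentage columns are already fractions such as `0.215`. The sketch below shows one way to bring them onto the same scale. It uses a small toy frame built from values in the preview rather than the real `output.csv`, and the helper name `percent_to_fraction` is made up for illustration.

```python
import pandas as pd

# Toy rows mirroring three columns from the preview above (values copied from it).
schools = pd.DataFrame({
    "% Poverty": ["58.3%", "73.6%"],
    "Economic Need Index": ["67.7%", "80.0%"],
    "% Students with Disabilities": [0.215, 0.237],
})

def percent_to_fraction(series: pd.Series) -> pd.Series:
    """Convert strings like '58.3%' to fractions like 0.583.

    Values that cannot be parsed as numbers become NaN instead of raising.
    """
    cleaned = series.astype(str).str.strip().str.rstrip("%")
    return pd.to_numeric(cleaned, errors="coerce") / 100

# Convert only the columns that are stored as percent strings.
for col in ["% Poverty", "Economic Need Index"]:
    schools[col] = percent_to_fraction(schools[col])

print(schools)
```

Applied to `df_merge2`, the same loop would leave the already-numeric columns untouched. It is worth checking how many values turn into NaN afterwards, since that reveals entries that were not plain percentages.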


## 👉 Grab Census Data

1. Load the Census API key

In [22]:
import dotenv

# Load the environment variables
# (loads CENSUS_API_KEY from .env)
dotenv.load_dotenv()


True

In [23]:
!touch .env

In [24]:
%%R 

require('tidycensus')

# Because it is an environment variable, we don't have to
# explicitly pass this string to R; it is readable here
# in this R cell.
census_api_key(Sys.getenv("CENSUS_API_KEY"))

To install your API key for use in future sessions, run this function with `install = TRUE`.
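
If `.env` is empty or the variable is misspelled, `Sys.getenv("CENSUS_API_KEY")` returns an empty string and the failure only shows up later, when a request is made. A small check on the Python side can catch that earlier. This is a sketch only: `get_census_key` is a hypothetical helper, and the example call passes a stand-in dictionary instead of the real environment.

```python
import os

def get_census_key(env=os.environ, name="CENSUS_API_KEY"):
    """Return the API key from the given mapping, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to .env and reload.")
    return key

# Example with a stand-in mapping (not a real key).
print(get_census_key({"CENSUS_API_KEY": "example-key"}))
```

In the notebook, calling `get_census_key()` with no arguments right after `dotenv.load_dotenv()` would read the real environment.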


2. Decide which Census variables you want

    Use <https://censusreporter.org/> to figure out which tables you want. (If censusreporter is down, check out the code in the cell below.)

    -   Scroll to the bottom of the page to see the tables.
    -   If you already know the table ID, stick that in the "Explore" section to learn more about that table.

    By default, this code loads `B01003_001`, which we found in censusreporter here: https://censusreporter.org/tables/B01003/

    - Find some other variables that you're also interested in.
    - Don't forget to pick a geography like "tract", "county" or "block group". Here is the list of [all geographies](https://walker-data.com/tidycensus/articles/basic-usage.html#geography-in-tidycensus).


In [25]:
%%R 
B03002_vars <- load_variables(2021, "acs5", cache = TRUE) %>%
    # Check if name contains B03002
    filter(str_detect(name, "B03002"))

# Print all rows of the filtered data
print(B03002_vars, n = Inf)

# A tibble: 21 × 4
   name       label                                            concept geography
   <chr>      <chr>                                            <chr>   <chr>    
 1 B03002_001 Estimate!!Total:                                 HISPAN… block gr…
 2 B03002_002 Estimate!!Total:!!Not Hispanic or Latino:        HISPAN… block gr…
 3 B03002_003 Estimate!!Total:!!Not Hispanic or Latino:!!Whit… HISPAN… block gr…
 4 B03002_004 Estimate!!Total:!!Not Hispanic or Latino:!!Blac… HISPAN… block gr…
 5 B03002_005 Estimate!!Total:!!Not Hispanic or Latino:!!Amer… HISPAN… block gr…
 6 B03002_006 Estimate!!Total:!!Not Hispanic or Latino:!!Asia… HISPAN… block gr…
 7 B03002_007 Estimate!!Total:!!Not Hispanic or Latino:!!Nati… HISPAN… block gr…
 8 B03002_008 Estimate!!Total:!!Not Hispanic or Latino:!!Some… HISPAN… block gr…
 9 B03002_009 Estimate!!Total:!!Not Hispanic or Latino:!!Two … HISPAN… block gr…
10 B03002_010 Estimate!!Total:!!Not Hispanic or Latino:!!Two … HISPAN… block gr…
11 B03002
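
The `label` column encodes a hierarchy, with levels separated by `!!` and a trailing `:` on levels that have children. Splitting on that separator makes the labels easier to read or to turn into column names. The helper below is a minimal sketch; `split_label` is a hypothetical name, and the example string is taken from the output above.

```python
def split_label(label: str) -> list[str]:
    """Split an ACS variable label into its hierarchy levels."""
    parts = [part.rstrip(":").strip() for part in label.split("!!")]
    # Drop empty pieces that would appear if the label ended with the separator.
    return [part for part in parts if part]

print(split_label("Estimate!!Total:!!Not Hispanic or Latino:"))
# ['Estimate', 'Total', 'Not Hispanic or Latino']
```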

In [36]:
%%R 

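# Reload the full ACS variable list without the B03002 filter.
# Note: this overwrites the filtered B03002_vars from the previous cell.
B03002_vars <- load_variables(2021, "acs5", cache = TRUE)

# Print the first rows of the full variable table
B03002_vars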


# A tibble: 27,886 × 4
   name        label                                    concept        geography
   <chr>       <chr>                                    <chr>          <chr>    
 1 B01001A_001 Estimate!!Total:                         SEX BY AGE (W… tract    
 2 B01001A_002 Estimate!!Total:!!Male:                  SEX BY AGE (W… tract    
 3 B01001A_003 Estimate!!Total:!!Male:!!Under 5 years   SEX BY AGE (W… tract    
 4 B01001A_004 Estimate!!Total:!!Male:!!5 to 9 years    SEX BY AGE (W… tract    
 5 B01001A_005 Estimate!!Total:!!Male:!!10 to 14 years  SEX BY AGE (W… tract    
 6 B01001A_006 Estimate!!Total:!!Male:!!15 to 17 years  SEX BY AGE (W… tract    
 7 B01001A_007 Estimate!!Total:!!Male:!!18 and 19 years SEX BY AGE (W… tract    
 8 B01001A_008 Estimate!!Total:!!Male:!!20 to 24 years  SEX BY AGE (W… tract    
 9 B01001A_009 Estimate!!Total:!!Male:!!25 to 29 years  SEX BY AGE (W… tract    
10 B01001A_010 Estimate!!Total:!!Male:!!30 to 34 years  SEX BY AGE (W… tract    
# ℹ 2

In [39]:
%%R

library(dplyr)
library(tidyr)

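# Reshape the variable metadata table to long format.
# Caution: B03002_vars holds variable descriptions (label, concept, geography),
# not estimates, and str_extract() pulls the first run of digits from the
# variable name (e.g. "01001" from "B01001A_001"), which is not a tract ID.
reshaped_data <- B03002_vars %>%
  mutate(census_tract = stringr::str_extract(name, "\\d+")) %>%
  pivot_longer(cols = -c(name, census_tract), names_to = "race", values_to = "estimate")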

# Print the reshaped data
print(reshaped_data)

# A tibble: 83,658 × 4
   name        census_tract race      estimate                              
   <chr>       <chr>        <chr>     <chr>                                 
 1 B01001A_001 01001        label     Estimate!!Total:                      
 2 B01001A_001 01001        concept   SEX BY AGE (WHITE ALONE)              
 3 B01001A_001 01001        geography tract                                 
 4 B01001A_002 01001        label     Estimate!!Total:!!Male:               
 5 B01001A_002 01001        concept   SEX BY AGE (WHITE ALONE)              
 6 B01001A_002 01001        geography tract                                 
 7 B01001A_003 01001        label     Estimate!!Total:!!Male:!!Under 5 years
 8 B01001A_003 01001        concept   SEX BY AGE (WHITE ALONE)              
 9 B01001A_003 01001        geography tract                                 
10 B01001A_004 01001        label     Estimate!!Total:!!Male:!!5 to 9 years 
# ℹ 83,648 more rows
# ℹ Use `print(n = ...)` to see 

In [40]:
%%R 

nyc_census_data <- get_acs(geography = "tract", 
                           state = 'NY',
                           county = c("New York", "Kings", "Queens", "Bronx", "Richmond"),
                           variables = c(
                             race = "B01001A_001",    # total count from SEX BY AGE (WHITE ALONE)
                             med_inc = "B19013_001"   # median household income
                           ), 
                           year = 2021,
                           survey = "acs5",
                           geometry = TRUE)

# Display the obtained NYC census data
nyc_census_data

Simple feature collection with 4654 features and 5 fields (with 2 geometries empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25609 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                       NAME variable estimate
1  36081014700  Census Tract 147, Queens County, New York     race     1916
2  36081014700  Census Tract 147, Queens County, New York  med_inc    71815
3  36047058400   Census Tract 584, Kings County, New York     race     2614
4  36047058400   Census Tract 584, Kings County, New York  med_inc    67315
5  36061006900 Census Tract 69, New York County, New York     race     2096
6  36061006900 Census Tract 69, New York County, New York  med_inc   237500
7  36047073000   Census Tract 730, Kings County, New York     race      127
8  36047073000   Census Tract 730, Kings County, New York  med_inc   117857
9  36047100400  Census Tract 1004, Kings County, New York     race    

Getting data from the 2017-2021 5-year ACS
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
Using FIPS code '36' for state 'NY'
Using FIPS code '061' for 'New York County'
Using FIPS code '047' for 'Kings County'
Using FIPS code '081' for 'Queens County'
Using FIPS code '005' for 'Bronx County'
Using FIPS code '085' for 'Richmond County'
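
The `GEOID` values in the output are built from the FIPS codes listed in these messages: a tract GEOID has 11 digits, made of a 2-digit state code, a 3-digit county code and a 6-digit tract code. For example, `36081014700` is state `36` (NY), county `081` (Queens) and tract `014700` (Census Tract 147). A small parser, sketched here with the hypothetical name `parse_tract_geoid`, can help when checking that the join key has the right shape.

```python
def parse_tract_geoid(geoid: str) -> dict:
    """Split an 11-digit tract GEOID into state, county and tract codes."""
    geoid = str(geoid)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}

print(parse_tract_geoid("36081014700"))
# {'state': '36', 'county': '081', 'tract': '014700'}
```

Keeping GEOIDs as strings avoids losing leading zeros, which matters for states whose FIPS code starts with `0`.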


## 👉 Merge it with your data

Hint: `tidycensus` provides data in long format, so you may need to pivot the census data from long to wide format before merging it with your data.
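
As a rough illustration of that pivot-then-merge step in pandas: the long-format rows below are copied from the first features printed by `get_acs` above, with the geometry column left out. The school table is a placeholder. It assumes each school row already carries the `GEOID` of its census tract, which the preview of `output.csv` does not confirm, and the names `school_a` and `school_b` are invented.

```python
import pandas as pd

# Long format: one row per (tract, variable), as returned by get_acs above.
census_long = pd.DataFrame({
    "GEOID": ["36081014700", "36081014700", "36047058400", "36047058400"],
    "variable": ["race", "med_inc", "race", "med_inc"],
    "estimate": [1916, 71815, 2614, 67315],
})

# Wide format: one row per tract, one column per variable.
census_wide = (
    census_long
    .pivot(index="GEOID", columns="variable", values="estimate")
    .reset_index()
)
census_wide.columns.name = None  # drop the leftover "variable" axis label

# Placeholder school table; the GEOID column here is an assumption.
schools = pd.DataFrame({
    "School": ["school_a", "school_b"],
    "GEOID": ["36081014700", "36047058400"],
})

# Left merge keeps every school; validate= guards against duplicate tracts.
merged = schools.merge(census_wide, on="GEOID", how="left", validate="many_to_one")
print(merged)
```

A left merge keeps schools whose tract has no census match, so missing values in `race` or `med_inc` afterwards point to join-key problems rather than silently dropping rows.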