# Raster Aggregation

## Introduction

This script sources previously downloaded USGS NLCD data and aggregates it by the specified geometry for the specified years.

#### Overview
This notebook includes the following sections:

1. ...

## 1. Setup
Before running this script, you will need to install and load the following packages into your R environment:

* [**dplyr**](https://cran.r-project.org/web/packages/dplyr)  A package for data manipulation that provides a consistent set of functions to filter, arrange, summarize, and transform data. It makes it easy to work with data frames and perform operations efficiently.

* [**exactextractr**](https://cran.r-project.org/web/packages/exactextractr)
  
* [**ggplot2**](https://cran.r-project.org/web/packages/ggplot2)


* [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr)  A package specifically designed to interact with IPUMS datasets, including NHGIS. It allows users to define and submit data extraction requests, download data, and read it directly into R for analysis.

* [**sf**](https://cran.r-project.org/web/packages/sf)

* [**terra**](https://cran.r-project.org/web/packages/terra)

To install these packages, run:

In [2]:
#install.packages(c("dplyr", "exactextractr", "ggplot2", "ipumsr", "sf", "terra"))

Installing packages into ‘/home/jovyan/R/x86_64-conda-linux-gnu-library/4.3’
(as ‘lib’ is unspecified)

“installation of package ‘dplyr’ had non-zero exit status”
“installation of package ‘sf’ had non-zero exit status”
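Before installing, it can help to check which of the packages listed above are already available. The following is a minimal sketch using only base R; the names `required_pkgs`, `is_installed`, and `missing_pkgs` are illustrative and not part of the original notebook. Note that `requireNamespace()` loads a package's namespace as a side effect when the package is found.

```r
# Packages this notebook relies on (taken from the list above)
required_pkgs <- c("dplyr", "exactextractr", "ggplot2", "ipumsr", "sf", "terra")

# TRUE for each package whose namespace can be loaded, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If `missing_pkgs` is not empty, it could be passed to `install.packages()` so that only the missing packages are installed.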


Once installed, make sure to load them.

In [1]:
library(dplyr)
library(exactextractr)
library(ggplot2)
library(ipumsr)
library(sf)
library(terra)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, were retired in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.

Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

terra 1.7.46



You will also need to set up an IPUMS account and obtain an IPUMS API key. You can register for an account and get your API key from [the IPUMS website](https://account.ipums.org/api_keys).   The API key will allow you to programmatically interact with the IPUMS NHGIS datasets and extract data based on your specifications.

Once you have your IPUMS API key, run the following line of code and enter your key.

In [8]:
my_ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(my_ipums_api_key, save = TRUE, overwrite = TRUE)

Please enter your IPUMS API key:  [API key redacted]


Existing .Renviron file copied to /home/jovyan/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.
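The message above says the key is stored in the environment variable `IPUMS_API_KEY`. As a quick check that does not print the key itself, you could test whether that variable is non-empty in the current session. This is a minimal sketch using base R; `key_is_set` is an illustrative name.

```r
# TRUE if IPUMS_API_KEY is defined and non-empty in this session
key_is_set <- nzchar(Sys.getenv("IPUMS_API_KEY"))

if (key_is_set) {
  message("IPUMS_API_KEY is set.")
} else {
  message("IPUMS_API_KEY is not set; run set_ipums_api_key() first.")
}
```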



## 2. NHGIS Geography Shapefile Exploration and Selection


#### Steps:
1. Retrieve metadata for available shapefiles.
2. Filter and display shapefiles based on year and geographic level.
3. Select a shapefile for extraction.

### 2a. Retrieve Shapefile Metadata

Shapefiles are a file format that stores geographic boundaries, which makes them essential for spatial analysis. This section retrieves and filters shapefile metadata to identify the shapefiles that correspond to our selected year and geography. This filtering step ensures you have the correct geographic boundaries for your project.

In [11]:
shp_meta <- get_metadata_nhgis("shapefiles")

The following code lists all available geographic levels within the NHGIS repository of shapefiles.

In [None]:
unique(shp_meta$geographic_level)

### 2b. Filter the Shapefiles Based on Year and Geography

For this exercise, we will extract a set of shapefiles at your previously-selected geography as well as a specific year.  As we saw in our metadata exploration above, the available geography levels vary based on the dataset.

For this filtering step, you should also filter based on the year.  The shapefile year should correspond to the boundaries you want to aggregate to.  In this exercise, we aggregate the NLCD data to county boundaries, so we select a shapefile for 2022 at the "county" level.

In [13]:
selection_year <- 2022
selection_geog <- "county"

If you are unfamiliar with Census geographies, it might sound strange to include a year specification in this filtering step.  For large geographies, such as "nation" or "state", the year is relatively unimportant because the boundaries of these regions are not redrawn from year to year.  However, for smaller geographies, especially those related to the U.S. Decennial Census, such as "tract", "block" or "blck_grp" (block group), as well as those related to political districts, such as "cd" (congressional district), the boundary of the geography can change over time.  Census tract, block group, and block boundaries are redrawn for each Decennial Census based on population numbers, and Congressional Districts are often redrawn for new congressional elections.  For this reason, it is essential that your shapefile selection corresponds to the data you plan to join or aggregate to it.

Run the code below to list the available shapefiles based on your year and geography specifications.

In [14]:
shp_meta %>%
  filter(year == selection_year,
         grepl(selection_geog, geographic_level, ignore.case = TRUE)) %>%
  print(n = Inf)

# A tibble: 2 × 6
  name                   year  geographic_level   extent        basis   sequence
  <chr>                  <chr> <chr>              <chr>         <chr>      <int>
1 us_county_2022_tl2022  2022  County             United States 2022 T…     1870
2 us_cty_sub_2022_tl2022 2022  County Subdivision United States 2022 T…     1872


The filtering step provides us with a list of potential shapefiles we can use for our extraction based on the year and geography criteria.
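Note that the filter returned two rows because `grepl()` matches any geographic level that contains "county", including "County Subdivision". The sketch below illustrates the difference between this partial match and an exact match. It uses a small toy table in base R rather than the real `shp_meta`; the first two rows mirror the output above, while the third row (`example_state_2022`) is a hypothetical entry added only for contrast.

```r
# Toy stand-in for shp_meta (not the real NHGIS metadata)
toy_meta <- data.frame(
  name = c("us_county_2022_tl2022", "us_cty_sub_2022_tl2022", "example_state_2022"),
  year = c("2022", "2022", "2022"),
  geographic_level = c("County", "County Subdivision", "State"),
  stringsAsFactors = FALSE
)

selection_year <- 2022
selection_geog <- "county"

# Partial, case-insensitive match: keeps every level containing "county"
partial <- toy_meta[toy_meta$year == selection_year &
                      grepl(selection_geog, toy_meta$geographic_level,
                            ignore.case = TRUE), ]

# Exact, case-insensitive match: keeps only the level named "County"
exact <- toy_meta[toy_meta$year == selection_year &
                    tolower(toy_meta$geographic_level) == tolower(selection_geog), ]

partial$name
exact$name
```

The partial match is convenient for exploring what is available, whereas an exact match is a reasonable choice once you know the level you want.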

### 2c. Select a Shapefile

For this exercise, we will select the 2022 county shapefile based on the 2022 TIGER/Line files (file "us_county_2022_tl2022").  We will save this selection for use later in our data extraction step.

In [16]:
selection_shp <- "us_county_2022_tl2022"

## 3. NHGIS Shapefile Extraction Specification and Submission

Now that you've identified your shapefile, this section defines and submits an extraction request to the IPUMS NHGIS API. Extracting data from IPUMS NHGIS allows you to download specific datasets and geographic data directly from the IPUMS server. This method makes it easy to automate and reproduce data requests.  In this notebook, the extraction includes only the selected shapefile.

#### Steps:
1. Define and Run the Data Extraction
2. Review the Data Extraction

### 3a. Define the Extraction Parameters and Run the Extraction

Here we define the extraction request using our selected shapefile (`selection_shp`).

In [17]:
extraction <- define_extract_nhgis(description = "Geographic Boundaries for NLCD Aggregation",
                                   shapefiles = selection_shp)

Submit the extraction request and wait for it to complete, then download the resulting data.

In [18]:
# submit extraction  
extraction_submitted <- submit_extract(extraction)

# wait for completion
extraction_complete <- wait_for_extract(extraction_submitted)

# check completion
extraction_complete$status

# download the completed extract and store the path to the downloaded file
filepath <- download_extract(extraction_complete, overwrite = TRUE)

Successfully submitted IPUMS NHGIS extract number 81

Checking extract status...

Waiting 10 seconds...

Checking extract status...

Waiting 20 seconds...

Checking extract status...

IPUMS NHGIS extract 81 is ready to download.





Shapefile saved to /home/jovyan/ccdatamining/nhgis0081_shape.zip



### 3b. Review the Extracted Files
If you followed along with this exercise, your download should contain a single file: a zipped shapefile with 2022 county boundaries.  If you expanded your extraction to additional datasets and shapefiles, your extraction will contain additional files.

In [19]:
# read the downloaded shapefile into an sf object
shp_raw <- read_ipums_sf(filepath[1])

### 3c. Subset the Shapefile

In [20]:
# subset the polygon file to the identifier columns (GISJOIN, GEOID, state and county FIPS codes, and name); the geometry column is retained automatically
polygons <- shp_raw[c("GISJOIN", "GEOID", "STATEFP", "COUNTYFP", "NAME")]

# exclude Alaska (02), Hawaii (15), and Puerto Rico (72) (not covered by the version of the NLCD data we are working with here)
polygons <- polygons %>% filter(!STATEFP %in% c("02", "15", "72"))
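The exclusion above relies on `STATEFP` holding zero-padded character codes such as "02", not numbers. The following toy example sketches why that matters. It uses a plain data frame with a handful of illustrative rows instead of the real shapefile, and `toy_counties`, `excluded`, and `kept` are names introduced here only for illustration.

```r
# Toy table with zero-padded state FIPS codes stored as character
toy_counties <- data.frame(
  STATEFP = c("01", "02", "15", "27", "72"),
  NAME = c("Autauga", "Anchorage", "Honolulu", "Hennepin", "San Juan"),
  stringsAsFactors = FALSE
)

# Alaska (02), Hawaii (15), and Puerto Rico (72)
excluded <- c("02", "15", "72")

# Keep only the rows whose state code is not in the excluded set
kept <- toy_counties[!toy_counties$STATEFP %in% excluded, ]
kept$STATEFP

# A numeric 2 is compared as "2" and therefore does not match "02"
"02" %in% 2
```

Writing the codes as quoted strings, as the notebook does, avoids silently keeping Alaska in the data.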

## 4. Import the NLCD File

In [9]:
# imports NLCD file from local directory
nlcd <- rast("/home/jovyan/pipelines/Annual_NLCD_LndCov_2023_CU_C1V0.tif")

“GDAL Error 1: TIFFFetchDirectory:Sanity check on directory count failed, this is probably not a valid IFD offset”
“GDAL Error 1: TIFFReadDirectory:Failed to read directory at offset 962594904”


In [1]:
nlcd

ERROR: Error in eval(expr, envir, enclos): object 'nlcd' not found
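The GDAL errors above indicate that the file could not be read as a valid TIFF, which is why `nlcd` was never created. One possible cause, offered here only as an assumption, is a missing or incomplete download. A minimal base R sketch for checking the file before attempting the import might look like this; `nlcd_path` and `path_ok` are illustrative names.

```r
# Path used in the import cell above
nlcd_path <- "/home/jovyan/pipelines/Annual_NLCD_LndCov_2023_CU_C1V0.tif"

# TRUE only if the file exists and is not empty
path_ok <- file.exists(nlcd_path) && file.size(nlcd_path) > 0

if (path_ok) {
  # Report the size so it can be compared against the size listed by the data provider
  message(sprintf("Found NLCD file (%.2f GB).", file.size(nlcd_path) / 1024^3))
} else {
  message("NLCD file not found or empty; check the path or download the file again.")
}
```

A file that exists but is much smaller than expected would be consistent with a truncated download, although this check alone cannot confirm that the TIFF is valid.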
