# Spatial Data Management with IPUMS NHGIS
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

## Introduction

This notebook demonstrates the process of cleaning and preparing spatial data for analysis.  The notebook uses data extracted from the [IPUMS National Historical Geographic Information System (NHGIS)](https://www.nhgis.org) repository, which provides harmonized data from the U.S. Decennial Census, American Community Survey, and other sources.  Working with spatial data, like the data included in the IPUMS NHGIS repository, requires specialized management of the complex relationships between geographic boundaries and attribute data, as well as attention to the nuances of georeferenced data. This notebook will guide you through steps in a typical spatial data management process, including importing, joining, and preparing tabular and boundary data to create clean, analysis-ready spatial datasets.

### Notebook Goals
This notebook introduces a typical spatial data management workflow using [IPUMS NHGIS](https://www.nhgis.org) data previously downloaded through the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html).  This notebook is intended as a follow-up to [2.4 IPUMS NHGIS Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc).  By the end of this notebook, users will have the skills to create their own workflows for managing IPUMS NHGIS data or other spatial data for social and demographic research.

### ★ Prerequisites ★
* Complete [Chapter 2.4 IPUMS NHGIS Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc)
* Have a copy of the *ipums_nhgis_example.shp* shapefile available in your workspace.
  * If you worked through [Chapter 2.4](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc) you should have created and saved a copy of the *ipums_nhgis_example.shp* shapefile in the final section of the notebook.
  * You can also download the zipped shapefile *ipums_nhgis_example.zip* including the four *ipums_nhgis_example* shapefile files from [the I-GUIDE platform](https://platform.i-guide.io/datasets/b033e365-cb1f-41d6-ad99-e6a13c41127c) or [Kate's GitHub](https://github.com/vavramusser/r-spatial/blob/main/ipums_nhgis_example.zip).

#### About the Example Data Set
The [*ipums_nhgis_example*](https://github.com/vavramusser/r-spatial/blob/main/ipums_usa_example.csv) shapefile contains total population counts from the 1990, 2000, 2010, and 2020 [U.S. Decennial Censuses](https://www.census.gov/programs-surveys/decennial-census.html) for all counties in the state of California, standardized to 2010 geographic boundaries.  The United States [Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html) is a population and housing count conducted by the [U.S. Census Bureau](https://www.census.gov) every ten years. The Census aims to count every person living in the United States and its territories, collecting basic demographic information such as age, sex, race, ethnicity, and household relationships.  It is designed to provide a comprehensive snapshot of the nation's population and housing characteristics at a specific point in time.

#### Notebook Overview
1. Setup
2. Importing and Exploring Spatial Data
3. Spatial Data Preprocessing

## 1. Setup

This section will guide you through the process of installing essential packages.

[**geojsonio**](https://cran.r-project.org/web/packages/geojsonio/index.html) · Convert Data from and to *[GeoJSON](https://geojson.org)* or *[TopoJSON](https://github.com/topojson/topojson)*.

[**ggplot2**](https://cran.r-project.org/web/packages/ggplot2/index.html) · Create Elegant Data Visualisations Using the Grammar of Graphics.  A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.  This notebook uses the following functions from *ggplot2*.

* [*aes*](https://rdrr.io/cran/ggplot2/man/aes.html) · Construct aesthetic mappings
* *CoordSf* · Visualize sf objects
  * *geom_sf* · geometric objects (points, lines, or polygons)
* [*ggplot*](https://rdrr.io/cran/ggplot2/man/ggplot.html) · Create a new ggplot
* [*ggtheme*](https://rdrr.io/cran/ggplot2/man/ggtheme.html) · Complete themes
  * *theme_minimal* · Minimal theme
* [*labs*](https://rdrr.io/cran/ggplot2/man/labs.html) · Modify axis, legend, and plot labels

[**sf**](https://cran.r-project.org/web/packages/sf/index.html) · Support for simple features, a standardized way to encode spatial vector data. Binds to 'GDAL' for reading and writing data, to 'GEOS' for geometrical operations, and to 'PROJ' for projection conversions and datum transformations. Uses by default the 's2' package for spherical geometry operations on ellipsoidal (long/lat) coordinates.  This notebook uses the following functions from *sf*.

* [*st_crs*](https://rdrr.io/cran/sf/man/st_crs.html) · Retrieve coordinate reference system from object
* [*st_geometry*](https://rdrr.io/cran/sf/man/st_geometry.html) · Get, set, replace or rename geometry from an sf object
* [*st_read*](https://rdrr.io/cran/sf/man/st_read.html) · Read simple features or layers from file or database
* [*st_transform*](https://rdrr.io/cran/sf/man/st_transform.html) · Transform or convert coordinates of simple feature
* [*st_write*](https://rdrr.io/cran/sf/man/st_write.html) · Write simple features object to file or database
* [*valid*](https://rdrr.io/cran/sf/man/valid.html) · Check validity or make an invalid geometry valid
  * *st_make_valid* · Make an invalid geometry valid
  * *st_is_valid* · Check validity

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("geojsonio", "ggplot2", "sf"))

Load the packages into your workspace.

In [None]:
library(geojsonio)
library(ggplot2)
library(sf)

## 2. Importing and Exploring Spatial Data

First we will read the *ipums_nhgis_example.shp* shapefile into memory using the [*st_read*](https://rdrr.io/cran/sf/man/st_read.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package.  You may need to update the file path to reflect the file's location on your machine or in your working directory.

The [*st_read*](https://rdrr.io/cran/sf/man/st_read.html) function reads spatial data files (like shapefiles) into an [*sf object*](https://r-spatial.github.io/sf/articles/sf1.html), which includes both attribute and geometry data.  The *sf object* allows you to treat spatial data like a regular data frame while retaining its spatial attributes.

In [None]:
dat <- st_read("ipums_nhgis_example.shp")

The *ipums_nhgis_example.shp* file was saved as a shapefile, which includes spatial reference information and metadata, and we get a preview of some of this spatial metadata when we read in the file with [*st_read*](https://rdrr.io/cran/sf/man/st_read.html).

* The data includes **58 features** (the counties in the state of California) and has **32 attribute fields**.
* The bounding box is **(-2356114, -364426.6) (-1646660, 845925.2)**
* The coordinate reference system (CRS) is the **[USA Contiguous Albers Equal Area Conic](https://en.wikipedia.org/wiki/Albers_projection) projected CRS**
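
The summary details printed by *st_read* can also be retrieved programmatically. The sketch below is a minimal illustration using a small hypothetical dataset (two unit squares stored in an object called `toy`, with a helper `make_square` defined only for this example) rather than the NHGIS shapefile, so the values differ from those listed above.

```r
library(sf)

# Hypothetical helper: builds a unit square polygon with lower-left corner (x0, y0)
make_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Hypothetical toy data standing in for county polygons
toy <- st_sf(
  name = c("A", "B"),
  pop  = c(100, 250),
  geometry = st_sfc(make_square(0, 0), make_square(1, 0), crs = 4326)
)

n_features <- nrow(toy)                    # number of features (rows)
n_fields   <- ncol(st_drop_geometry(toy))  # attribute fields, excluding geometry
bbox       <- st_bbox(toy)                 # xmin, ymin, xmax, ymax
crs_name   <- st_crs(toy)$Name             # human-readable CRS name

n_features
n_fields
bbox
crs_name
```

Applied to the notebook's data, the same calls with `dat` in place of `toy` should report the feature count, field count, bounding box, and CRS name shown in the *st_read* output.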

Let's take a look at the first few lines of the data.

As a reminder, the [*ipums_nhgis_example.shp*](https://github.com/vavramusser/r-spatial/blob/main/ipums_usa_example.csv) file contains total population counts from the 1990, 2000, 2010, and 2020 [U.S. Decennial Censuses](https://www.census.gov/programs-surveys/decennial-census.html) for all counties in the state of California.

In [None]:
head(dat)

In [Chapter 2.4](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc) we set up our IPUMS API extraction to return total population data for the 1990, 2000, 2010, and 2020 Decennial Censuses.  However, IPUMS includes a set of preselected variables in data extractions including metadata and other supplemental information which account for the additional 12 variables.

Let's take a look at the list of column names.

In [None]:
colnames(dat)

Below is a reference list of the variables included in the data.  This list includes the 4 total population variables and 6 total population lower and upper bound variables for the four Censuses represented in the data as well as IPUMS preselected geographic variables from the time-series table and additional geographic variables included in the original shapefile from our extract.  After we merged the time-series tabular and spatial datasets at the end of [Chapter 2.4](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc) the combined dataset includes all variables from both original sources.  Recall that before saving the merged data we removed the Land Area in Square Meters (ALAND10), Water Area in Square Meters (AWATER10), and Shape Area (Shape_area) attributes due to space limitations.

**Population Variables**
* 1990 Total Population Estimate (CL8AA1990)
* 1990 Total Population Lower Bound (CL8AA1990L)
* 1990 Total Population Upper Bound (CL8AA1990U)
* 2000 Total Population Estimate (CL8AA2000)
* 2000 Total Population Lower Bound (CL8AA2000L)
* 2000 Total Population Upper Bound (CL8AA2000U)
* 2010 Total Population (CL8AA2010)
* 2020 Total Population Estimate (CL8AA2020)
* 2020 Total Population Lower Bound (CL8AA2020L)
* 2020 Total Population Upper Bound (CL8AA2020U)

**Geographic Variables (from the Time-Series Tables)**
* Geography Year (GEOGYEAR)
* State Name (STATE)
* State FIPS Code (STATEA)
* County Name (COUNTY)
* County FIPS Code (COUNTYA)

**Geographic Variables (from the original Shapefile)**
* State FIPS Code (STATEFP10)
* County FIPS Code (COUNTYFP10)
* COUNTYNS10
* Geographic Identifier (GEOID10)
* County Name (NAME10)
* Legal/Statistical Area Description (LSAD) Name (NAMELSAD10)
* FIPS Class Code (CLASSFP10)
* Combined Statistical Area (CSA) FIPS Code (CSAFP10)
* Core-Based Statistical Area (CBSA) FIPS Code (CBSAFP10)
* Metropolitan Division FIPS Code (METDIVFP10)
* Functional Status (FUNCSTAT10)
* Internal Point (Centroid) Latitude Coordinate (INTPTLAT10)
* Internal Point (Centroid) Longitude Coordinate (INTPTLON10)
* Shape Length (Shape_len)
* Spatial Geometry (geometry)

**GIS Join**
* GIS Join Match Code (GISJOIN)

### 2a. Exploring the sf Object Data Structure

After importing the data and reviewing the tabular attribute data, we’ll explore its structure and spatial attributes.  First we'll take a look at the structure of the data using the *str* command.

In [None]:
str(dat)

This view provides a different snapshot of the data than the typical tabular format.  We can see each of the attributes listed first with information on the data type for each attribute and a preview of some of the first few rows of data.

Below the attribute list we see the *geometry* column.  This special column differentiates an [*sf object*](https://r-spatial.github.io/sf/articles/sf1.html) from a typical *data.frame*.  The *geometry* column contains a multipolygon for each feature in the dataset, and each multipolygon is in turn made up of a series of coordinates corresponding to the points which make up the polygon shape.
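
To make the structure of the *geometry* column more concrete, the sketch below builds a hypothetical single-feature object (`toy`, one unit square stored as a multipolygon, not part of the NHGIS data) and inspects the geometry column name, the geometry type, and the underlying vertex coordinates.

```r
library(sf)

# Hypothetical ring of vertices; the first and last points match to close the polygon
ring <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))

# One feature whose geometry is a MULTIPOLYGON containing a single polygon
toy <- st_sf(
  name = "A",
  geometry = st_sfc(st_multipolygon(list(list(ring))), crs = 4326)
)

geom_col   <- attr(toy, "sf_column")  # name of the geometry column
geom_types <- st_geometry_type(toy)   # geometry type of each feature
coords     <- st_coordinates(toy)     # matrix of vertices (X, Y, plus index columns)

geom_col
geom_types
coords
```

Running `st_geometry_type(dat)` and `head(st_coordinates(dat))` on the notebook's data should give a comparable view of the county multipolygons.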

### 2b. Exploring the Coordinate Reference System
Next we will check out the coordinate reference system (CRS) for our data using the [*st_crs*](https://rdrr.io/cran/sf/man/st_crs.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package.

In [None]:
# returns the CRS
st_crs(dat)

This view provides us with a lot of metainformation about our data's CRS, the [USA Contiguous Albers Equal Area Conic](https://en.wikipedia.org/wiki/Albers_projection) projection, including:

* [datum](https://en.wikipedia.org/wiki/Geodetic_datum): **[North American Datum 1983 (NAD83)](https://en.wikipedia.org/wiki/North_American_Datum)**
* [ellipsoid](https://en.wikipedia.org/wiki/Earth_ellipsoid): **[Geodetic Reference System 1980](https://en.wikipedia.org/wiki/Geodetic_Reference_System_1980)**
* [(EPSG) code](https://en.wikipedia.org/wiki/EPSG_Geodetic_Parameter_Dataset): projection method **[9822](https://epsg.io/9822-method)** (Albers Equal Area); the CRS itself is cataloged as **[ESRI 102003](https://epsg.io/102003)**
* units: meters

We can also set the [*st_crs*](https://rdrr.io/cran/sf/man/st_crs.html) parameter *parameters* to *TRUE* to get a list of detailed CRS parameters.

In [None]:
# returns the list of CRS parameters
st_crs(dat, parameters = TRUE)

Using this we can also programmatically call individual components of the CRS parameter list.

In [None]:
# returns the CRS name
st_crs(dat, parameters = TRUE)$Name

# returns the proj4string
st_crs(dat, parameters = TRUE)$proj4string
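
Individual CRS components can also be read directly from a CRS object with `$`. The sketch below is an illustration only: it compares two CRS objects created from EPSG codes (4326 and 5070, chosen here as examples and not taken from the notebook's data) to show how a geographic CRS differs from a projected one.

```r
library(sf)

crs_geo  <- st_crs(4326)  # WGS84, a geographic (longitude/latitude) CRS
crs_proj <- st_crs(5070)  # NAD83 / Conus Albers, a projected CRS

crs_geo$Name           # human-readable CRS name
crs_geo$epsg           # EPSG code, NA when the CRS has none
crs_geo$IsGeographic   # TRUE when coordinates are longitude/latitude
crs_proj$IsGeographic  # FALSE for a projected CRS
crs_proj$units_gdal    # linear unit name as reported by GDAL
```

Checking `st_crs(dat)$IsGeographic` and `st_crs(dat)$units_gdal` in the same way is a quick test of whether distance and area calculations on the notebook's data will be in projected units.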

### 2c. Mapping Raw Geometry

Although our data includes multiple attributes, including the Census total population data we are interested in, it is useful to take a look at the raw geometry prior to further analysis.  We can access the raw geometry of our file using the [*st_geometry*](https://rdrr.io/cran/sf/man/st_geometry.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package and plot it with the basic *plot* function.

In [None]:
# plots the raw geometry without attributes
plot(st_geometry(dat))

### 2d. Mapping Raw Geometry with ggplot2

In addition to the simple map made with the base R *plot* function, we can also make more complicated maps with the [**ggplot2**](https://cran.r-project.org/web/packages/ggplot2/index.html) package.  We will use *ggplots* to create more elaborate maps in later notebooks, but for now we will create a simple map of the raw geometry for our data.

The code below sets up a *ggplot* and specifies style parameters.

* [*ggplot*](https://rdrr.io/cran/ggplot2/man/ggplot.html) sets up the *ggplot* plot object and specifies that we will use *dat* as the data for the plot
* [*geom_sf*](https://ggplot2.tidyverse.org/reference/ggsf.html) adds a layer to plot spatial features from an sf object
* [*labs*](https://rdrr.io/cran/ggplot2/man/labs.html) allows us to add customization including a map title and caption for this map
* [*theme_minimal*](https://rdrr.io/cran/ggplot2/man/ggtheme.html) provides a clean, minimalist theming for the overall map

In [None]:
ggplot(data = dat) +
  geom_sf() +
  labs(title = "California Counties 2010",
       caption = "Data Source: IPUMS NHGIS") +
  theme_minimal()

### 2e. Mapping Attribute Data with ggplot2

We can build on the *ggplot* we previously created to map the raw geometries and make a version that maps the 2010 total population attribute (CL8AA2010).  This version of the *ggplot* specifies the attribute to map by passing it as the *fill* aesthetic, using the [*aes*](https://rdrr.io/cran/ggplot2/man/aes.html) function, to the *geom_sf* function.  The *ggplot* map will automatically use a continuous blue color scheme to visualize the geometries based on the specified attribute value.

In [None]:
ggplot(data = dat) +
  geom_sf(aes(fill = CL8AA2010)) +
  labs(title = "California Counties 2010",
       subtitle = "2010 Total Population",
       caption = "Data Source: IPUMS NHGIS") +
  theme_minimal()

We can change the color scheme by adding a fill scale layer to the *ggplot* object.  The built-in continuous color scales (see [*scale_colour_continuous*](https://rdrr.io/cran/ggplot2/man/scale_colour_continuous.html)) include options such as *scale_fill_viridis_c*.

In [None]:
ggplot(data = dat) +
  geom_sf(aes(fill = CL8AA2010)) +
  labs(title = "California Counties 2010",
       subtitle = "2010 Total Population",
       caption = "Data Source: IPUMS NHGIS") +
  theme_minimal() +
  scale_fill_viridis_c()
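
The viridis scale accepts a few arguments for further customization. The sketch below is a minimal illustration on hypothetical toy data (three unit squares in an object called `toy`, with an invented `pop` column and a `make_square` helper defined only for this example); the `option`, `direction`, and `name` arguments select the palette, reverse its order, and set the legend title.

```r
library(ggplot2)
library(sf)

# Hypothetical helper: builds a unit square polygon with lower-left corner (x0, y0)
make_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Hypothetical toy data with an invented population column
toy <- st_sf(
  pop = c(100, 250, 900),
  geometry = st_sfc(make_square(0, 0), make_square(1, 0), make_square(2, 0))
)

p <- ggplot(data = toy) +
  geom_sf(aes(fill = pop)) +
  scale_fill_viridis_c(
    option = "magma",    # alternative viridis palette
    direction = -1,      # reverse the palette order
    name = "Population"  # legend title
  ) +
  theme_minimal()

p
```

The same *scale_fill_viridis_c* call could be swapped into the California map above, with `CL8AA2010` as the fill attribute.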

## 3. Spatial Data Preprocessing

Prior to carrying out spatial data analysis, there are a few essential spatial data preprocessing tasks which we should carry out to ensure the integrity of our data and its compatibility with any downstream analysis.

### 3a. Reprojecting Data

In addition to viewing our data's CRS, we can also reproject data using the [*st_transform*](https://rdrr.io/cran/sf/man/st_transform.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package.  This can help ensure compatibility between layers or to use a CRS suitable for your analysis.

We can reproject to a different CRS using the appropriate [(EPSG) code](https://en.wikipedia.org/wiki/EPSG_Geodetic_Parameter_Dataset) for our desired CRS.

In [None]:
# reproject to WGS84 (EPSG:4326)
dat_reproj <- st_transform(dat, crs = 4326)

If we plot the new version of the data, we can see that the data has been reprojected.

In [None]:
plot(st_geometry(dat_reproj))

You can also use the [*st_transform*](https://rdrr.io/cran/sf/man/st_transform.html) function to easily reproject a file to match the projection of another file by passing the file's CRS using [*st_crs*](https://rdrr.io/cran/sf/man/st_crs.html).  This is especially useful if you are bringing in multiple data sources to your project and need to ensure they are all projected in the same CRS.

In the example below we project the *dat_reproj* object back to the CRS of the *dat* object.

In [None]:
# reproject dat_reproj back to the dat CRS (USA Contiguous Albers Equal Area Conic)
dat_reproj <- st_transform(dat_reproj, st_crs(dat))

plot(st_geometry(dat_reproj))

*dat_reproj* is now back to the original CRS of *dat*.
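
When combining layers, it can help to confirm programmatically that two objects share a CRS rather than relying on a plot. The sketch below assumes two hypothetical point layers (`toy` and `target`, invented for this example) and uses the `==` comparison on CRS objects before and after reprojecting.

```r
library(sf)

# Hypothetical layer in WGS84 (longitude/latitude)
toy <- st_sf(
  id = 1:2,
  geometry = st_sfc(st_point(c(-120, 37)), st_point(c(-118, 34)), crs = 4326)
)

# Hypothetical second layer in a projected CRS (EPSG:5070)
target <- st_sf(
  id = 1,
  geometry = st_sfc(st_point(c(0, 0)), crs = 5070)
)

# compare the CRS of the two layers before reprojecting
same_before <- st_crs(toy) == st_crs(target)

# reproject only when the CRSs differ
toy_aligned <- if (same_before) toy else st_transform(toy, st_crs(target))

# compare again after reprojecting
same_after <- st_crs(toy_aligned) == st_crs(target)

same_before
same_after
```

In this notebook, `st_crs(dat_reproj) == st_crs(dat)` is the equivalent check.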

**★ Pro Tip:**  Here are a few common CRSs and their EPSG codes:

* WGS84 (World Geodetic System 1984) (EPSG:4326): great for global datasets, web mapping, and GPS systems
* WGS84 Pseudo-Mercator (EPSG:3857): great for web mapping applications
* NAD83 (North American Datum 1983) (EPSG:4269): great for national datasets covering the entire United States
* NAD83 Conus Albers Equal Area (EPSG:5070): great for national datasets covering the contiguous United States (CONUS)
* NAD83 Contiguous USA Lambert Conformal Conic (ESRI:102004): great for national datasets covering the contiguous United States (CONUS)
* NAD83 UTM (Universal Transverse Mercator) (UTM zone-specific EPSGs): great for analyses in specific UTM zones
* State Plane Coordinate System (SPCS) (region-specific EPSGs): great for analysis at the state and county level

### 3b. Validating Spatial Data

Before proceeding to any spatial analysis or significant spatial data manipulation, we should check to make sure that all the geometries in our dataset are valid.  An invalid geometry in spatial data refers to a geometry that does not conform to the rules of its geometry type or is otherwise problematic for spatial operations. These issues can prevent spatial functions like intersections, unions, or buffering from working correctly.  Invalid geometries can cause errors in spatial operations or produce incorrect map visualizations.

We can use the [*st_is_valid*](https://rdrr.io/cran/sf/man/valid.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package to check for invalid geometries in our data.  The following code stores the output of *st_is_valid*, a logical vector of *TRUE* and *FALSE* values indicating the validity of each geometry, in the *valid_geometries* object and then counts the number of invalid geometries.

In [None]:
# check the validity of each geometry
valid_geometries <- st_is_valid(dat)

# count invalid geometries
sum(!valid_geometries)

Our data has one invalid geometry!  Fortunately we can easily fix it using the [*st_make_valid*](https://rdrr.io/cran/sf/man/valid.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package.

In [None]:
# fix invalid geometries
dat <- st_make_valid(dat)

# check for invalid geometries
sum(!st_is_valid(dat))

Testing the file again using *st_is_valid* now returns 0 invalid geometries.  Our data is now validated and ready for analysis.
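
To see what an invalid geometry looks like in practice, the sketch below constructs a self-intersecting "bowtie" polygon, a common textbook example of invalidity. The `toy` object and its coordinates are hypothetical and unrelated to the NHGIS data. Setting `reason = TRUE` in *st_is_valid* returns a text description of the problem instead of a logical value.

```r
library(sf)

# A self-intersecting "bowtie" ring: its edges cross at (0.5, 0.5)
bowtie <- st_polygon(list(rbind(
  c(0, 0), c(1, 1), c(1, 0), c(0, 1), c(0, 0)
)))

# Hypothetical one-feature sf object holding the invalid polygon
toy <- st_sf(id = 1, geometry = st_sfc(bowtie))

valid_before <- st_is_valid(toy)                 # logical validity flag
reason       <- st_is_valid(toy, reason = TRUE)  # text description of the problem

# repair the geometry and check it again
fixed       <- st_make_valid(toy)
valid_after <- st_is_valid(fixed)

valid_before
reason
valid_after
```

Note that *st_make_valid* may change a feature's geometry type while repairing it (for example, splitting one polygon into a multipolygon), so it can be worth re-checking `st_geometry_type` afterward.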

### 3c. Save the Data

After the processing steps we should save our data again so we can import it to our next notebook.  We will again save the data as a **shapefile** (*.shp*) using the [*st_write*](https://rdrr.io/cran/sf/man/st_write.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package.

In [None]:
st_write(dat, "ipums_nhgis_analysis.shp", driver = "ESRI Shapefile", delete_dsn = TRUE)

At the end of this exercise we have a tested and validated version of our IPUMS NHGIS data saved in our workspace.
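
One caveat of the shapefile format is that attribute field names are limited to 10 characters, so longer column names are shortened on write. The sketch below illustrates this with a round trip through a temporary file; the `toy` object, its long column name, and the file name are hypothetical and chosen only for this example.

```r
library(sf)

# Hypothetical data with a column name longer than 10 characters
toy <- st_sf(
  total_population_2010 = c(100, 250),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), crs = 4326)
)

# write to a temporary location so the working directory is not touched
path <- file.path(tempdir(), "toy_example.shp")

# a warning about abbreviated field names may be raised here, so it is suppressed
suppressWarnings(
  st_write(toy, path, driver = "ESRI Shapefile", delete_dsn = TRUE, quiet = TRUE)
)

# read the file back and compare column names
toy_back <- st_read(path, quiet = TRUE)

names(toy)
names(toy_back)
```

If preserving full column names matters for a later workflow, a format without this limit, such as GeoPackage (*.gpkg*), is an alternative that *st_write* also supports.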

## Recommended Next Steps

* **Move on to Chapter 4: Exploratory Data Analysis (EDA)**
  * 4.2 Exploratory Spatial Data Analysis (ESDA) with IPUMS NHGIS

## Quick Code
A clean and simple version of the code included in this notebook.