# Spatial Data Management with IPUMS NHGIS

## Introduction

This notebook guides you through a spatial data management workflow using 2010 Decennial Census data previously downloaded from the [IPUMS National Historical Geographic Information System (NHGIS)](https://www.nhgis.org/) project using the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html).

### ★ Prerequisites ★
* Complete Chapter 1.1: Introduction to IPUMS and the IPUMS API
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)
* Complete Chapter 2.4: IPUMS NHGIS Data Extraction Using ipumsr

At the end of Chapter 2.4: IPUMS NHGIS Data Extraction Using ipumsr, you saved your data extract in two file formats: *ipums_nhgis_example.rds* and *ipums_nhgis_example.csv*. You will need these files to run this notebook. If you are working through this chapter without having completed Chapter 2.4, you will need to copy the *ipums_nhgis_example.rds* file into your working directory before running this notebook.

#### Overview
This notebook includes the following sections:

1. Setup
2. Coordinate Reference Systems (CRS)
3. Spatial Data Formats

## 1. Setup

Before running this script, you will need to install and load the following packages into your R environment:

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following functions from *dplyr*.

* [*case_when*](https://rdrr.io/cran/dplyr/man/case_when.html) · a general vectorized if-else
* [*mutate*](https://rdrr.io/cran/dplyr/man/mutate.html) · create, modify, and delete columns
* [*select*](https://rdrr.io/cran/dplyr/man/select.html) · keep or drop columns using their names and types
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).
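
As a rough illustration of how the pipe chains these functions together, the sketch below runs *select*, *mutate*, and *case_when* on a small made-up table. The data frame `toy` and its column names are hypothetical and are not part of the NHGIS extract.

```r
library(dplyr)

# Hypothetical toy table; the column names are illustrative only
toy <- data.frame(
  tract = c("A", "B", "C"),
  pop_2010 = c(1200, 4800, 2500),
  unused = c(TRUE, FALSE, TRUE)
)

toy_clean <- toy %>%
  select(tract, pop_2010) %>%       # keep only the columns of interest
  mutate(size_class = case_when(    # add a derived categorical column
    pop_2010 < 2000 ~ "small",
    pop_2010 < 4000 ~ "medium",
    TRUE            ~ "large"
  ))

toy_clean
```

Each step receives the result of the previous step as its first argument, so the workflow reads top to bottom without intermediate objects.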

[**haven**](https://cran.r-project.org/web/packages/haven/index.html) Import and Export 'SPSS', 'Stata' and 'SAS' Files.  This notebook uses the following functions from *haven*.

* [*as_factor()*](https://rdrr.io/cran/haven/man/as_factor.html) · for formatting categorical variables as factors.
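
A minimal sketch of what *as_factor()* does, assuming a labelled vector of the kind that labelled survey data often contains. The vector `area_type` and its labels are invented for illustration.

```r
library(haven)

# Hypothetical labelled vector: numeric codes with attached value labels
area_type <- labelled(
  c(1, 2, 1, 2),
  labels = c(Urban = 1, Rural = 2)
)

# Convert the codes to a factor whose levels are the value labels
area_factor <- as_factor(area_type)

levels(area_factor)
```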

[**sf**](https://cran.r-project.org/web/packages/sf/index.html) Support for simple features, a standardized way to encode spatial vector data. Binds to 'GDAL' for reading and writing data, to 'GEOS' for geometrical operations, and to 'PROJ' for projection conversions and datum transformations. Uses by default the 's2' package for spherical geometry operations on ellipsoidal (long/lat) coordinates.  This notebook uses the following functions from *sf*.

* [*st_as_sf*](https://rdrr.io/cran/sf/man/st_as_sf.html) · Convert foreign object to an sf object
* [*st_crs*](https://rdrr.io/cran/sf/man/st_crs.html) · Retrieve coordinate reference system from object
* [*st_transform*](https://rdrr.io/cran/sf/man/st_transform.html) · Transform or convert coordinates of simple feature

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [21]:
# install.packages(c("dplyr", "haven", "sf"))

Load the packages into your workspace.

In [22]:
library(dplyr)
library(haven)
library(sf)

Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE



### 1b. Read the Data File

Run the following line of code to read the *ipums_nhgis_example.rds* file into memory.  You may need to update the file path to reflect the file's location on your machine or in your working directory.

The *ipums_nhgis_example.rds* file contains information from the 2010 Decennial Census.

In [5]:
dat <- readRDS("ipums_nhgis_example.rds")

As a refresher, let's take a look at the number of observations and the variables in the data.

In [6]:
dim(dat)

The data includes information on 32 variables for 72,765 observations.

In [7]:
colnames(dat)

Below is a reference list of the variables included in this data extraction.

**Population Variables**
* Geography Year (GEOGYEAR)
* State Name (STATE)
* State Code (STATEA)
* County Name (COUNTY)
* County Code (COUNTYA)
* Census Tract Code (TRACTA)
* 1990 Total Persons (CL8AA1990)
* 1990 Total Persons (Lower Bound) (CL8AA1990L)
* 1990 Total Persons (Upper Bound) (CL8AA1990U)
* 2000 Total Persons (CL8AA2000)
* 2000 Total Persons (Lower Bound) (CL8AA2000L)
* 2000 Total Persons (Upper Bound) (CL8AA2000U)
* 2010 Total Persons (CL8AA2010)
* 2020 Total Persons (CL8AA2020)
* 2020 Total Persons (Lower Bound) (CL8AA2020L)
* 2020 Total Persons (Upper Bound) (CL8AA2020U)

**Geography Variables**
* State FIPS Code (STATEFP10)
* County FIPS Code (COUNTYFP10)
* Census Tract FIPS Code (TRACTCE10)
* Geographic Identifier (GEOID10)
* Census Tract Name (NAME10)
* Legal/Statistical Area Description (LSAD) Name (NAMELSAD10)
* MAF/TIGER Feature Class Code (MTFCC10)
* Functional Status (FUNCSTAT10)
* Land Area in Square Meters (ALAND10)
* Water Area in Square Meters (AWATER10)
* Internal Point (Centroid) Latitude Coordinate (INTPTLAT10)
* Internal Point (Centroid) Longitude Coordinate (INTPTLON10)
* Shape Area (Shape_area)
* Shape Length (Shape_len)
* Spatial Geometry (geometry)

**GIS Join**
* GIS Join Match Code (GISJOIN)

In [25]:
head(dat)

| | GISJOIN | GEOGYEAR | STATE | STATEA | COUNTY | COUNTYA | TRACTA | CL8AA1990 | CL8AA1990L | CL8AA1990U | ⋯ | NAMELSAD10 | MTFCC10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | Shape_area | Shape_len | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<MULTIPOLYGON [m]>` |
| 1 | G0100010020100 | 2010 | Alabama | 1 | Autauga County | 1 | 20100 | 1772.67 | 1750 | 1773 | ⋯ | Census Tract 201 | G5020 | S | 9809944 | 36312 | 32.4771112 | -86.4903033 | 9846259 | 16193.875 | MULTIPOLYGON (((888438 -515... |
| 2 | G0100010020200 | 2010 | Alabama | 1 | Autauga County | 1 | 20200 | 2031.0 | 2031 | 2031 | ⋯ | Census Tract 202 | G5020 | S | 3340505 | 5846 | 32.475758 | -86.4724678 | 3346347 | 9844.309 | MULTIPOLYGON (((889844.1 -5... |
| 3 | G0100010020300 | 2010 | Alabama | 1 | Autauga County | 1 | 20300 | 2952.0 | 2952 | 2952 | ⋯ | Census Tract 203 | G5020 | S | 5349274 | 9054 | 32.4740243 | -86.4597033 | 5358330 | 10519.641 | MULTIPOLYGON (((891383.8 -5... |
| 4 | G0100010020400 | 2010 | Alabama | 1 | Autauga County | 1 | 20400 | 4401.0 | 4401 | 4401 | ⋯ | Census Tract 204 | G5020 | S | 6382705 | 16244 | 32.4710782 | -86.4446805 | 6398946 | 12500.859 | MULTIPOLYGON (((892527.3 -5... |
| 5 | G0100010020500 | 2010 | Alabama | 1 | Autauga County | 1 | 20500 | 3120.68 | 3119 | 3433 | ⋯ | Census Tract 205 | G5020 | S | 11397725 | 48412 | 32.4589157 | -86.4218165 | 11446139 | 17113.378 | MULTIPOLYGON (((895451 -522... |
| 6 | G0100010020600 | 2010 | Alabama | 1 | Autauga County | 1 | 20600 | 3330.0 | 3330 | 3330 | ⋯ | Census Tract 206 | G5020 | S | 8020366 | 60048 | 32.447347 | -86.4768023 | 8080417 | 14306.062 | MULTIPOLYGON (((889098.5 -5... |


### 1c. Convert the Data to Simple Features (sf) Format

In [28]:
dat_sf <- st_as_sf(dat)
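
*st_as_sf()* turns a data frame that carries a geometry list-column into an sf object, which is what the other sf functions in this notebook expect. The sketch below shows one way to confirm that a conversion worked, using a small stand-in data frame named `toy` (hypothetical); the same checks could be run on `dat_sf`.

```r
library(sf)

# Toy stand-in for `dat`: a plain data frame with a geometry list-column
toy <- data.frame(id = 1:2)
toy$geometry <- st_sfc(st_point(c(0, 1)), st_point(c(1, 1)), crs = 4326)

toy_sf <- st_as_sf(toy)

inherits(toy_sf, "sf")             # is the result an sf object?
attr(toy_sf, "sf_column")          # name of the active geometry column
unique(st_geometry_type(toy_sf))   # geometry type(s) present in the layer
```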

## 2. Coordinate Reference Systems (CRS)

Coordinate Reference Systems (CRS) define how spatial data is mapped onto the Earth's surface. A CRS specifies the coordinate system and projection used to represent geographic locations. Ensuring that all spatial data layers in a project share the same CRS is critical for accurate spatial analyses.

This section covers the following:

* Understanding CRS: The difference between geographic and projected coordinate systems.
* Inspecting CRS: Checking the CRS of spatial data.
* Reprojecting Data: Converting spatial data to a desired CRS.


### 2a. Understanding CRS

Spatial data can use:

* **Geographic CRS:** Coordinates are angles of latitude and longitude, measured in degrees on a datum (e.g., WGS84, NAD83). Suitable for storing and sharing global data, but planar distance and area calculations on raw degrees are distorted.
* **Projected CRS:** Coordinates are represented in linear units (e.g., meters or feet) on a flat surface, minimizing distortion for a specific area (e.g., UTM zones, Albers Equal Area Conic).
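
To make the distinction concrete, the following sketch places one point in a geographic CRS and then expresses the same location in a projected CRS. The coordinates are an arbitrary illustrative point, and EPSG:5070 is chosen only as an example of a projected system.

```r
library(sf)

# A single point in a geographic CRS (WGS84, units are degrees)
pt_geo <- st_sfc(st_point(c(-93.27, 44.98)), crs = 4326)

# The same location in a projected CRS (EPSG:5070, units are meters)
pt_proj <- st_transform(pt_geo, 5070)

st_is_longlat(pt_geo)     # geographic: longitude/latitude coordinates
st_is_longlat(pt_proj)    # projected: linear coordinates

st_coordinates(pt_geo)    # values in degrees
st_coordinates(pt_proj)   # values in meters relative to the projection origin
```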

### 2b. Inspecting the CRS

Use the st_crs() function from the sf package to inspect the CRS of a spatial object.

In [29]:
# Check CRS of the spatial data
st_crs(dat_sf)

# Output includes:
# - User input: the CRS name or code the data was defined with
# - wkt: the full Well-Known Text (WKT) description of the CRS

Coordinate Reference System:
  User input: USA_Contiguous_Albers_Equal_Area_Conic 
  wkt:
PROJCRS["USA_Contiguous_Albers_Equal_Area_Conic",
    BASEGEOGCRS["NAD83",
        DATUM["North American Datum 1983",
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4269]],
    CONVERSION["USA_Contiguous_Albers_Equal_Area_Conic",
        METHOD["Albers Equal Area",
            ID["EPSG",9822]],
        PARAMETER["Latitude of false origin",37.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-96,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",29.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Latitude of 2nd standard parallel",4
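
The full printout is long, so it can help to pull out individual pieces of the CRS instead. The sketch below does this on a toy geometry named `toy` (hypothetical); in the notebook, `st_crs(dat_sf)` could be queried the same way. A CRS that was defined by name or WKT rather than by EPSG code may report `NA` for `epsg`.

```r
library(sf)

# Toy geometry with a known CRS; substitute dat_sf in the notebook
toy <- st_sfc(st_point(c(0, 0)), crs = 5070)
crs <- st_crs(toy)

crs$Name           # human-readable CRS name
crs$epsg           # EPSG code, or NA if the CRS carries no EPSG identifier
crs$units_gdal     # name of the coordinate unit
crs$IsGeographic   # TRUE for longitude/latitude systems, FALSE for projected
```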

### 2c. Reprojecting Data

To ensure compatibility between layers or to use a CRS suitable for your analysis, you can reproject spatial data using st_transform().

In [30]:
# Reproject data to WGS84 (EPSG:4326)
dat_wgs84 <- st_transform(dat_sf, crs = 4326)

# Reproject data to NAD83 / UTM Zone 15N (EPSG:26915)
dat_utm <- st_transform(dat_sf, crs = 26915)
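
When several layers are involved, one possible pattern is to compare their CRSs first and reproject only when they differ. The sketch below uses two toy point layers, `layer_a` and `layer_b` (both hypothetical), as stand-ins for real project layers.

```r
library(sf)

# Two toy layers in different CRSs (stand-ins for real project layers)
layer_a <- st_sfc(st_point(c(-93.27, 44.98)), crs = 4326)
layer_b <- st_transform(
  st_sfc(st_point(c(-90.20, 38.63)), crs = 4326),
  5070
)

# Reproject layer_b only when its CRS differs from layer_a's
if (st_crs(layer_a) != st_crs(layer_b)) {
  layer_b <- st_transform(layer_b, st_crs(layer_a))
}

st_crs(layer_a) == st_crs(layer_b)
```

Passing `st_crs(layer_a)` as the target, rather than a hard-coded EPSG code, keeps the two layers aligned even when the reference layer's CRS has no EPSG identifier.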

#### Commonly Used CRSs

Here are a few common CRSs and their EPSG codes:

* WGS84 (EPSG:4326): Default geographic CRS for global data.
* NAD83 / Conus Albers (EPSG:5070): An Albers Equal Area projection suitable for analyses across the contiguous U.S.
* NAD83 / UTM Zone 15N (EPSG:26915): Suited to regional studies within the zone (96°W to 90°W), which covers part of the central U.S.
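
If you are unsure what an EPSG code refers to, sf can look it up. The sketch below builds a small lookup table for the three codes listed above; the table name `crs_table` is just illustrative.

```r
library(sf)

codes <- c(4326, 5070, 26915)

# Look up the registered name and type of each EPSG code
crs_table <- data.frame(
  epsg = codes,
  name = sapply(codes, function(code) st_crs(code)$Name),
  geographic = sapply(codes, function(code) st_crs(code)$IsGeographic)
)

crs_table
```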

## 3. Spatial Data Formats

Spatial data comes in various formats, each suited for specific types of analyses or visualization tasks. Understanding these formats and how to work with them in R is essential for spatial data preparation and analysis. This section covers the following:

* Common Spatial Data Formats: Overview of popular formats and their typical use cases.
* Reading Spatial Data: Importing spatial data from different formats using the sf package.
* Writing Spatial Data: Exporting spatial data to commonly used formats.

### 3a. Common Spatial Data Formats
Spatial data can be broadly categorized into two types: vector and raster. Below are common formats:

**Vector Formats (Points, Lines, Polygons)**
* Shapefile (.shp): A widely used format, often with multiple accompanying files (.shx, .dbf, etc.).
* GeoJSON (.geojson): A lightweight, web-friendly format for vector data.
* GeoPackage (.gpkg): A modern format that supports multiple layers and data types in a single file.

**Raster Formats (Grids, Images)**
* GeoTIFF (.tif): A versatile format for raster data, including satellite imagery and elevation data.
* ASCII Grid (.asc): A simple text-based raster format.

**Other Formats**
* CSV with Coordinates: Tabular data containing latitude/longitude or other coordinates, convertible to spatial data.
* KML (.kml): Used for Google Earth visualizations.



### 3b. Reading Spatial Data

Use *st_read()* from the sf package to read vector data. Raster data is typically read with a separate package such as terra, which is not loaded in this notebook.

In [31]:
# Reading a Shapefile
#shapefile <- st_read("path/to/shapefile.shp")

# Reading a GeoJSON file
#geojson <- st_read("path/to/file.geojson")

# Reading a GeoPackage layer
#geopackage <- st_read("path/to/file.gpkg", layer = "layer_name")

# Viewing metadata
#st_crs(shapefile)  # Check CRS
#st_geometry_type(shapefile)  # Geometry type (POINT, LINESTRING, etc.)
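
For a version that runs without any files of your own, the sketch below reads the North Carolina example shapefile that ships with the sf package. The object name `nc` is arbitrary.

```r
library(sf)

# Example shapefile bundled with the sf package
nc_path <- system.file("shape/nc.shp", package = "sf")

# quiet = TRUE suppresses the driver summary that st_read() normally prints
nc <- st_read(nc_path, quiet = TRUE)

dim(nc)                        # features (rows) and columns, including geometry
st_crs(nc)$Name                # CRS recorded alongside the shapefile
unique(st_geometry_type(nc))   # geometry type(s) in the layer
```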

For tabular data with coordinates, convert it into a spatial object:

In [32]:
# Example CSV with latitude and longitude
#csv_data <- read.csv("path/to/coordinates.csv")

# Convert to spatial object
#csv_spatial <- st_as_sf(csv_data, coords = c("longitude", "latitude"), crs = 4326)
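
The same conversion can be tried end to end with a temporary file. In the sketch below, the CSV contents, the column names `longitude` and `latitude`, and the assumption that the coordinates are WGS84 are all made up for illustration.

```r
library(sf)

# Write a small hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    site = c("a", "b", "c"),
    longitude = c(-93.27, -90.20, -87.63),
    latitude = c(44.98, 38.63, 41.88)
  ),
  csv_path,
  row.names = FALSE
)

csv_data <- read.csv(csv_path)

# coords is given in x, y order: longitude first, then latitude
csv_spatial <- st_as_sf(
  csv_data,
  coords = c("longitude", "latitude"),
  crs = 4326,
  remove = FALSE   # keep the coordinate columns alongside the geometry
)

csv_spatial
```

Setting `crs` here only declares which system the coordinates are already in; it does not reproject them.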

### 3c. Writing Spatial Data

The sf package also supports exporting spatial data to various formats.

In [None]:
# Write to a Shapefile
#st_write(shapefile, "path/to/exported_shapefile.shp")

# Write to a GeoJSON file
#st_write(shapefile, "path/to/exported_file.geojson")

# Write to a GeoPackage (with layer name)
#st_write(shapefile, "path/to/exported_file.gpkg", layer = "exported_layer")

# Write attributes to CSV (the geometry column is dropped first)
#write.csv(st_drop_geometry(shapefile), "path/to/exported_data.csv")
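
A runnable round trip, sketched with the example shapefile bundled with sf and a temporary GeoPackage, might look like the following. The layer name `nc_counties` is hypothetical.

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Write to a temporary GeoPackage, list its layers, and read it back
gpkg_path <- tempfile(fileext = ".gpkg")
st_write(nc, gpkg_path, layer = "nc_counties", quiet = TRUE)

st_layers(gpkg_path)$name
nc_back <- st_read(gpkg_path, layer = "nc_counties", quiet = TRUE)

# Drop the geometry to get a plain data frame for tabular export
nc_table <- st_drop_geometry(nc_back)
class(nc_table)
```

Writing to a path or layer that already exists typically fails unless you ask for it explicitly, for example with `delete_dsn = TRUE` or the `append` argument of *st_write()*.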

### 3d. Summary of Format Advantages

| Format | Advantages | Used For |
|---|---|---|
| Shapefile | Widely supported, simple format | General-purpose GIS use |
| GeoJSON | Lightweight, web-friendly | Web-based mapping and visualizations |
| GeoPackage | Multi-layer, compact, supports raster and vector formats | Large projects with multiple datasets |
| GeoTIFF | High-resolution raster data | Satellite imagery, elevation models |
| CSV | Simple tabular format | Lightweight point data storage |

## Recommended Next Steps

* **Continue with Chapter 3: Data Cleaning and Preparation**
  * 3.1: Data Preparation and Transformation with IPUMS USA
* **Move on to Chapter 4: Exploratory Data Analysis (EDA)**
  * 4.2: Spatial Exploratory Data Analysis (SEDA) with IPUMS NHGIS