## 1. Setup

### 1a. Package Installation

Before running this script, you will need to install and load the following packages into your R environment:

[**corrplot**](https://cran.r-project.org/web/packages/corrplot/index.html) A package which provides a visualization of a correlation matrix.  This notebook uses the following function from *corrplot*.

* [*corrplot()*](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) for creating a correlation matrix visualization

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A package for data manipulation that provides a consistent set of functions to filter, arrange, summarize, and transform data. *dplyr* makes it easy to work with data frames and perform operations efficiently.  This notebook uses the following functions from *dplyr*.

* [*arrange()*](https://dplyr.tidyverse.org/reference/arrange.html) for ordering the rows of a data frame by selected columns
* [*case_when()*](https://dplyr.tidyverse.org/reference/case_when.html) for creating or modifying new dataframe columns based on multiple conditions
* [*filter()*](https://dplyr.tidyverse.org/reference/filter.html) for subsetting a dataframe based on specified conditions
* [*group_by()*](https://dplyr.tidyverse.org/reference/group_by.html) for grouping by one or more variables
* [*mutate()*](https://dplyr.tidyverse.org/reference/mutate.html) for modifying dataframes
* [*rename()*](https://dplyr.tidyverse.org/reference/rename.html) for changing the names of individual variables in a dataframe
* [*select()*](https://dplyr.tidyverse.org/reference/select.html) for selecting variables in a dataframe by name
* [*summarize()*](https://dplyr.tidyverse.org/reference/summarise.html) for creating a new data frame using combinations of grouping variables
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**ggmap**](https://cran.r-project.org/web/packages/ggmap/index.html) A collection of functions to visualize spatial data and models on top of static maps from various online sources (e.g., Google Maps and Stamen Maps). It also includes tools common to those tasks, such as functions for geolocation and routing.  This notebook uses the following functions from *ggmap*.

* [*get_stamenmap*](https://rdrr.io/cran/ggmap/man/get_stamenmap.html) for accessing a tile server from Stamen Maps
* [*register_stadiamaps*](https://search.r-project.org/CRAN/refmans/ggmap/html/register_stadiamaps.html) for registering a Stadia Maps API key

[**ggplot2**](https://cran.r-project.org/web/packages/ggplot2/index.html) A package for creating graphics based on the "Grammar of Graphics".  This notebook uses the following functions from *ggplot2*.

* [*geom_bar()*](https://ggplot2.tidyverse.org/reference/geom_bar.html) for creating bar charts
* [*geom_boxplot()*](https://ggplot2.tidyverse.org/reference/geom_boxplot.html) for creating boxplots, aka box-and-whisker plots
* [*geom_density()*](https://ggplot2.tidyverse.org/reference/geom_density.html) for creating density plots
* [*geom_histogram()*](https://ggplot2.tidyverse.org/reference/geom_histogram.html) for creating histograms
* [*geom_point()*](https://ggplot2.tidyverse.org/reference/geom_point.html) for creating scatterplots
* [*geom_sf()*](https://ggplot2.tidyverse.org/reference/ggsf.html) for visualizing (mapping) sf objects
* [*geom_vline()*](https://www.rdocumentation.org/packages/ggplot2/versions/0.9.0/topics/geom_vline) for annotating a plot with a vertical line

[**gridExtra**](https://cran.r-project.org/web/packages/gridExtra/index.html) A package which provides a number of functions to work with grid graphics such as arranging multiple plots.  This notebook uses the following functions from *gridExtra*.

* *grid.arrange()* for arranging multiple graphics on a page

[**haven**](https://cran.r-project.org/web/packages/haven/index.html) A package for importing and exporting SPSS, Stata, and SAS files.  This notebook uses the following functions from *haven*.

* [*as_factor()*](https://haven.tidyverse.org/reference/as_factor.html) for formatting categorical variables as factors

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) A package for interacting with IPUMS datasets and the IPUMS API. It allows users to define and submit data extraction requests, download data, and read it directly into R for analysis.  This notebook uses the following functions from *ipumsr*.

* *set_ipums_api_key()* for setting your IPUMS API key
* *get_sample_info()* for retrieving sample identification codes and descriptions for IPUMS microdata collections
* *get_metadata_nhgis()* for listing available data sources from IPUMS NHGIS
* *define_extract_micro()* for defining the parameters of an IPUMS microdata extract request to be submitted via the IPUMS API
* *define_extract_nhgis()* for defining an IPUMS NHGIS extract request
* *tst_spec()* for creating a tst_spec object containing a time-series table specification
* *submit_extract()* for submitting an extract request via the IPUMS API and returning an *ipums_extract* object
* *wait_for_extract()* for waiting until an extract has finished processing
* *download_extract()* for downloading an extract's data files
* *read_ipums_ddi()* for reading metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* *read_ipums_micro()* for reading data from an IPUMS microdata extract
* *read_nhgis()* for reading tabular data from an NHGIS extract
* *read_ipums_sf()* for reading spatial data from an IPUMS extract

[**purrr**](https://cran.r-project.org/web/packages/purrr/index.html) A functional programming toolkit that simplifies the process of working with lists and vectors. It is particularly useful for applying functions to multiple elements or data frames, making it easier to write clean, efficient code.  This notebook uses the following functions from *purrr*.

* [*map()*](https://www.rdocumentation.org/packages/purrr/versions/0.2.5/topics/map) and [*map_dfr()*](https://purrr.tidyverse.org/reference/map_dfr.html) for applying a function to each element in the given input

[**RAQSAPI**](https://cran.r-project.org/web/packages/RAQSAPI/index.html) A package designed to interface with the [United States Environmental Protection Agency's (EPA) Air Quality System (AQS) Data Mart API](https://aqs.epa.gov/aqsweb/documents/data_api.html) and retrieve air monitoring data and associated metadata from the EPA's air quality monitoring service.  This notebook uses the following functions from *RAQSAPI*.

* *aqs_credentials()* for setting the user credentials for the AQS API
* *aqs_annualsummary_by_state()* for returning annual data from the AQS API aggregated at the state level

[**rnaturalearth**](https://cran.r-project.org/web/packages/rnaturalearth/index.html) A package for extracting geographic data from the open-source online repository [Natural Earth](https://www.naturalearthdata.com).  This notebook uses the following functions from *rnaturalearth*.

* [*ne_countries()*](https://www.rdocumentation.org/packages/rnaturalearth/versions/1.0.1/topics/ne_countries) for downloading world country polygons from Natural Earth
* [*ne_download()*](https://www.rdocumentation.org/packages/rnaturalearth/versions/1.0.1/topics/ne_download) for downloading data from Natural Earth

[**rnaturalearthdata**](https://cran.r-project.org/web/packages/rnaturalearthdata/index.html) A companion data package that holds the Natural Earth map data used by *rnaturalearth*.

[**sf**](https://cran.r-project.org/web/packages/sf/index.html) A package providing support for simple features (sf) geometry objects, a standardized way to encode spatial vector data.  This notebook uses the following functions from *sf*.

* [*st_area()*](https://r-spatial.github.io/sf/reference/geos_measures.html) for computing the area or the length of a set of geometries
* [*st_as_sf()*](https://r-spatial.github.io/sf/reference/st_as_sf.html) for converting a foreign object to an sf object
* [*st_buffer()*](https://r-spatial.github.io/sf/reference/geos_unary.html) for carrying out a unary buffer operation on simple feature geometries
* [*st_crs()*](https://r-spatial.github.io/sf/reference/st_crs.html) for retrieving, setting, and replacing coordinate reference systems (CRS) for sf objects
* [*st_intersection()*](https://r-spatial.github.io/sf/reference/geos_binary_ops.html) for performing geometric set operations with simple feature geometry collections
* [*st_make_valid()*](https://r-spatial.github.io/sf/reference/valid.html) for checking whether geometry is valid or making an invalid geometry valid
* [*st_read()*](https://www.rdocumentation.org/packages/sf/versions/0.2-2/topics/st_read) for reading simple features from a file or database
* [*st_transform()*](https://r-spatial.github.io/sf/reference/st_transform.html) for transforming or converting the coordinates of simple features

If you are working in the I-GUIDE environment, these packages should already be installed.  If you are working on your local machine or in another environment, you may need to install them before continuing.

Load the packages into your workspace.
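
The loading step can be sketched as follows. This is a minimal example, not the notebook's original cell: the package vector is assembled from the list above, and `missing_packages()` is a hypothetical helper introduced here for illustration. The install call is left commented out so that nothing is installed without your say-so.

```r
# Packages described in section 1a
pkgs <- c("corrplot", "dplyr", "ggmap", "ggplot2", "gridExtra", "haven",
          "ipumsr", "purrr", "RAQSAPI", "rnaturalearth",
          "rnaturalearthdata", "sf")

# Hypothetical helper: return the subset of `pkgs` that is not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}

# Attach every package that is available
for (p in setdiff(pkgs, to_install)) {
  library(p, character.only = TRUE)
}
```

If any package is reported as not installed, install it and rerun the cell before continuing.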

### 1b. API Setup

#### Connect your IPUMS API Key

Run the following code to enter your [IPUMS API key](https://account.ipums.org/api_keys).  Refer to the **Introduction to IPUMS and the IPUMS API** notebook for background on the IPUMS data repository and for instructions on setting up your IPUMS account and API key.

In [None]:
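ipums_api_key <- readline("Please enter your IPUMS API key: ")
# save = TRUE stores the key in your .Renviron file; overwrite = TRUE replaces any key already saved there
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)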

#### Connect your EPA AQS API Key

Run the following code to enter your [EPA AQS API](https://aqs.epa.gov/aqsweb/documents/data_api.html) username (email) and key.  Refer to the **Introduction to the Environmental Protection Agency (EPA) Air Quality System (AQS) Data Repository and API** notebook for background on the EPA AQS data repository and for instructions on setting up your EPA AQS API key.

In [None]:
epaaqs_api_email <- readline("Please enter your EPA AQS API username (email): ")
epaaqs_api_key <- readline("Please enter your EPA AQS API key: ")
# this step is commented out until the RAQSAPI package is installed
# aqs_credentials(username = epaaqs_api_email, key = epaaqs_api_key)

#### Connect your Stadia Maps API Key

Run the following code to enter your [Stadia Maps API](https://client.stadiamaps.com/signup) key.  Refer to the **Introduction to Basemaps with ggmap** notebook for background on the Stadia Maps data repository and for instructions on setting up your Stadia Maps API key.

In [None]:
stadia_api_key <- readline("Please enter your Stadia API key: ")
register_stadiamaps(stadia_api_key)
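
If you would rather not type your keys every session, one possible alternative is to read them from environment variables and fall back to a prompt only when a variable is unset. The sketch below is an illustration under stated assumptions, not part of the original notebook: `get_api_key()` is a hypothetical helper, and the `EPA_AQS_*` and `STADIA_API_KEY` variable names are made up for this example. `IPUMS_API_KEY` is the variable that *ipumsr* itself uses.

```r
# Hypothetical helper: return the value of an environment variable,
# or prompt for it with readline() when the variable is unset or empty
get_api_key <- function(env_var, prompt) {
  key <- Sys.getenv(env_var, unset = "")
  if (nzchar(key)) {
    return(key)
  }
  readline(prompt)
}

ipums_api_key    <- get_api_key("IPUMS_API_KEY", "Please enter your IPUMS API key: ")
# The variable names below are assumptions chosen for illustration
epaaqs_api_email <- get_api_key("EPA_AQS_EMAIL", "Please enter your EPA AQS API username (email): ")
epaaqs_api_key   <- get_api_key("EPA_AQS_KEY", "Please enter your EPA AQS API key: ")
stadia_api_key   <- get_api_key("STADIA_API_KEY", "Please enter your Stadia API key: ")

# The keys can then be passed to the same registration functions used above:
# set_ipums_api_key(ipums_api_key)
# aqs_credentials(username = epaaqs_api_email, key = epaaqs_api_key)
# register_stadiamaps(stadia_api_key)
```

The variables can be set for a single session with `Sys.setenv()` or persistently in your `.Renviron` file.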

## Next Steps

From here, we recommend exploring the following notebooks:

* **IPUMS USA Data Extraction** Extract IPUMS USA data.
* **IPUMS NHGIS Data Extraction** Extract IPUMS NHGIS data and corresponding geographic boundary files.
* **IPUMS ACS Data Cleaning** Work through a data cleaning workflow using the IPUMS USA data extraction you just downloaded.

## 6. Bonus: A Quick Population Map Using ggplot

Now that we have merged the population data with the shapefiles, we can create a quick visualization to explore how the population is distributed across a specific state. This section provides code for generating a simple choropleth map, which uses color to represent a value (here, population) across different geographic areas.

The example map visualizes the population distribution within Rhode Island for 2020 using the merged dataset. The code selects the state of Rhode Island by referencing the [state FIPS code](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt).  Mapping the data lets you see spatial patterns in population and identify areas with higher or lower populations.
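
A note on the FIPS filter: state FIPS codes are two-character, zero-padded strings, so a numeric code has to be padded before it can be compared with a column such as `STATEFP10`. The toy example below illustrates this in base R. It assumes the column is stored as character, as the map code's comparison with `"44"` suggests; the `toy` data frame and the `to_fips()` helper are made up for illustration.

```r
# Toy table standing in for the merged data; column names mirror the map code
toy <- data.frame(
  STATEFP10 = c("09", "44", "44", "25"),  # Connecticut, Rhode Island (x2), Massachusetts
  pop2020   = c(1200, 3400, 560, 7800),   # made-up values
  stringsAsFactors = FALSE
)

# Hypothetical helper: zero-pad a numeric state code to the two-character FIPS form
to_fips <- function(code) sprintf("%02d", as.integer(code))

to_fips(9)   # "09"
to_fips(44)  # "44"

# Keep only the Rhode Island rows
ri <- toy[toy$STATEFP10 == to_fips(44), ]
nrow(ri)     # 2
```

Comparing against the unpadded number instead (for example `STATEFP10 == 9`) would match nothing for single-digit states, because `"09"` and `"9"` are different strings.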

In [1]:
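# install.packages(c("dplyr", "ggplot2", "sf"))  # uncomment if these packages are not installed
library(dplyr)    # provides filter() and the %>% pipe
library(ggplot2)
library(sf)

# `dat` is the merged population and shapefile data created earlier in this notebook
# filter the dataset for a single state (e.g., Rhode Island, FIPS code 44)
state_data <- dat %>% filter(STATEFP10 == "44")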

# quick map of population for 2020
ggplot(state_data) +
  geom_sf(aes(fill = pop2020), color = "white", linewidth = 0.2) +  # on ggplot2 < 3.4.0 use size = 0.2
  scale_fill_viridis_c(name = "Population", option = "plasma") +
  labs(
    title = "Population Distribution in Rhode Island (2020)",
    subtitle = "Data Source: IPUMS NHGIS",
    caption = "Created using ggplot2 and sf"
  ) +
  theme_minimal()

